[ceph-users] _committed_osd_maps shutdown OSD via async signal, bug or feature?

Gregory Farnum gfarnum at redhat.com
Thu Oct 5 10:41:33 PDT 2017

On Thu, Oct 5, 2017 at 6:48 AM Stefan Kooman <stefan at bit.nl> wrote:

> Hi,
> During testing (mimicking BGP / port flaps) on our cluster we are able
> to trigger a "_committed_osd_maps shutdown OSD via async signal" on the
> affected OSD servers in that datacenter (OSDs in that DC become
> intermittently isolated from their peers). The result is that all OSD
> processes stop. Is this a bug or a feature? I.e. is there a "flap"
> detection mechanism in Ceph OSD?
> If it's a bug it might be related to
> http://tracker.ceph.com/issues/20174. We get a similar error message on
> "12.2.0". Version "12.2.1" does not log
> "-1 Fail to open
> '/proc/0/cmdline' error = (2) No such file or directory
> -1 received  signal: Interrupt from  PID: 0 task name: <unknown> UID: 0
> -1 osd.21 1846 *** Got signal Interrupt ***
> 0 osd.21 1846 prepare_to_stop starting shutdown
> -1 osd.21 1846 shutdown"

That's a feature, but invoking it may indicate the presence of another
issue. The OSD shuts down if
1) it has been deleted from the cluster, or
2) it has been incorrectly marked down by the cluster too many times, and
gives up, or
3) it has been incorrectly marked down by the cluster, and encounters an
error when it rebinds to new network ports.

In your case, with the port flapping, OSDs are presumably getting marked
down by their peers (since they can't communicate), and eventually give up
on trying to stay alive. You can prevent/reduce that by setting
the osd_max_markdown_count config to a very large number, if you really
want to.
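
To make the "gives up" behaviour concrete, here is a rough Python sketch of a
sliding-window markdown counter. It is a toy model, not the Ceph source: the
class and method names are made up for illustration, and the window length
(modelled on the osd_max_markdown_period option) and the values 5 and 600
seconds are assumptions that should be checked against the documentation for
your release.

```python
from collections import deque


class MarkdownTracker:
    """Toy model of an OSD giving up after too many wrongful markdowns.

    Hypothetical illustration only; names and defaults are assumptions.
    """

    def __init__(self, max_count=5, period=600.0):
        self.max_count = max_count  # stands in for osd_max_markdown_count
        self.period = period        # stands in for osd_max_markdown_period (seconds)
        self.log = deque()          # timestamps of recent wrongful markdowns

    def record_markdown(self, now):
        """Record a markdown at time `now`; return True if the OSD should stop."""
        self.log.append(now)
        # Forget markdowns that fell out of the time window.
        cutoff = now - self.period
        while self.log and self.log[0] < cutoff:
            self.log.popleft()
        # Give up only once the count within the window exceeds the limit.
        return len(self.log) > self.max_count


if __name__ == "__main__":
    tracker = MarkdownTracker(max_count=5, period=600.0)
    # One flap per minute: the sixth markdown inside the window trips the limit.
    for i in range(6):
        print(i, tracker.record_markdown(60.0 * i))
```

In this model a very large max_count means the limit is never reached, which
is the effect of raising osd_max_markdown_count as suggested above. The OSDs
would then keep trying to rejoin rather than stopping.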
