[ceph-users] Libvirt hosts freeze after ceph osd+mon problem

Jan Pekař - Imatic jan.pekar at imatic.cz
Tue Nov 7 07:27:23 PST 2017

I migrated the virtual machine to my second node, which is running
qemu-kvm version 1:2.1+dfsg-12+deb8u6 (from Debian oldstable), and hit
the same situation - it froze after approx. 30-40 seconds, when
"libceph: osd6 down" appeared in syslog (not before).
My other virtual machine on the first node froze at the same time.
Both virtual machines are running Debian stretch, one with the
4.9.0-3-amd64 kernel, the other with the
4.10.17-2-pve kernel.

I cannot test Windows virtual machines right now.

One of my virtual machines is on a pool where I forced the primary OSD to 
nodes (OSDs) other than the ones I'm stopping, and the pool has min_size 1, 
so I assume that (while the primary OSD is still online and available) I 
shouldn't have issues with disk writes or reads. But that virtual machine 
is also affected and doesn't survive the MON+OSD stop.
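For completeness, this is roughly one way to steer the primary away and
lower min_size (a sketch only - the pool name, osd id, and object name
below are placeholders, not my actual ones):

```shell
# Steer primaries away from osd.6 (the one being stopped) by setting its
# primary affinity to 0, so PGs prefer another acting-set member as primary.
ceph osd primary-affinity osd.6 0

# Allow I/O to continue with a single replica while peers are down.
ceph osd pool set mypool min_size 1

# Check which OSD is the acting primary for a given object.
ceph osd map mypool someobject
```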

I tried to set

heartbeat interval = 5
osd heartbeat interval = 3
osd heartbeat grace = 10

in my ceph.conf
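If it helps anyone reproduce this: I believe the osd heartbeat values can
also be injected into running OSDs without a restart (injectargs takes
effect immediately but does not persist across daemon restarts):

```shell
# Apply the heartbeat settings from the ceph.conf snippet above at runtime.
ceph tell osd.* injectargs '--osd-heartbeat-interval 3 --osd-heartbeat-grace 10'

# Confirm the running value on one daemon via its admin socket
# (must be run on the host where osd.0 runs; osd.0 is just an example).
ceph daemon osd.0 config get osd_heartbeat_grace
```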

and after my test I got no "heartbeat_check: no reply from" messages in 
syslog, just "libceph: osd6 down", and the virtual machines survived it.
That can be a workaround for me, but it may also be a coincidence that 
another part of the mon code marked the osd down before my problem 
occurred. I also assume that everybody else is using the default 
heartbeat settings.
My cluster was installed on luminous (not migrated from previous 
versions) and the node OS is stretch (one node is lenny).

With regards
Jan Pekar

On 7.11.2017 14:16, Jason Dillaman wrote:
> If you are seeing this with both librbd and krbd, I would suggest trying a 
> different version of QEMU and/or a different host OS, since loss of a disk 
> shouldn't hang QEMU itself -- only potentially the guest OS.
> On Tue, Nov 7, 2017 at 5:17 AM, Jan Pekař - Imatic <jan.pekar at imatic.cz> wrote:
>     I'm calling kill -STOP to simulate behavior that occurred when one
>     ceph node ran out of memory. The processes were not killed, but were
>     somehow suspended/unresponsive (they couldn't create new threads
>     etc.), and that caused all virtual machines (on other nodes) to hang.
>     I decided to simulate it with kill -STOP MONPID OSDPID and I succeeded.
>     After I stop the MON together with the OSD, it takes a few seconds to
>     get osd-unresponsive messages, and exactly when I get the final
>     libceph: osd6 down
>     all my virtual machines stop responding (stop pinging, VNC unusable etc.)
>     Tried with a librbd disk definition and with an rbd-mapped device
>     attached inside QEMU/KVM virtual machines.
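
[The suspend effect above can be reproduced generically with any
long-running process - a plain sleep stands in for the ceph daemons
here, so no cluster is needed to see the SIGSTOP/SIGCONT state change
(Linux-only, since it reads /proc):]

```shell
#!/bin/sh
# Stand-in for the ceph-mon/ceph-osd processes: any long-running process
# shows the same SIGSTOP behavior.
sleep 300 &
PID=$!

# Suspend it, as in kill -STOP MONPID OSDPID. The process is not killed:
# its TCP connections stay open, so peers only notice via heartbeats.
kill -STOP "$PID"
sleep 0.2
state_stopped=$(awk '{print $3}' "/proc/$PID/stat")   # expect "T" (stopped)

# Resume it, restoring normal behavior.
kill -CONT "$PID"
sleep 0.2
state_running=$(awk '{print $3}' "/proc/$PID/stat")   # expect "S" (sleeping)

echo "$state_stopped $state_running"
kill "$PID" 2>/dev/null
```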
>     JP
>     On 7.11.2017 10:57, Piotr Dałek wrote:
>         On 17-11-07 12:02 AM, Jan Pekař - Imatic wrote:
>             Hi,
>             I'm using debian stretch with ceph 12.2.1-1~bpo80+1 and qemu
>             1:2.8+dfsg-6+deb9u3
>             I'm running 3 nodes with 3 monitors and 8 osds on my nodes,
>             all on IPV6.
>             When I tested the cluster, I detected a strange and severe
>             problem.
>             On the first node I'm running qemu guests with librados disk
>             connections to the cluster and all 3 monitors mentioned in
>             the connection.
>             On the second node I stopped a mon and an osd with the command
>             kill -STOP MONPID OSDPID
>             Within one minute all my qemu guests on the first node froze,
>             so they don't even respond to ping. [..]
>         Why would you want to *stop* (as in, freeze) a process instead
>         of killing it?
>         Anyway, with the processes still present, it may take a few
>         minutes before the cluster realizes that the daemons are stopped
>         and kicks them out of the cluster, restoring normal behavior
>         (assuming correctly set crush rules).
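
[Side note: instead of waiting out the heartbeat-grace/report timeouts
Piotr describes, a suspended osd can also be marked down by hand - a
sketch, with osd id 6 matching the syslog line above:]

```shell
# Mark the suspended OSD down immediately rather than waiting for the
# cluster's failure detection to time out.
ceph osd down 6

# Watch the cluster state converge afterwards.
ceph -s
```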
>     -- 
>     ============
>     Ing. Jan Pekař
>     jan.pekar at imatic.cz | +420603811737
>     ----
>     Imatic | Jagellonská 14 | Praha 3 | 130 00
>     http://www.imatic.cz
>     ============
> -- 
> Jason

Ing. Jan Pekař
jan.pekar at imatic.cz | +420603811737
Imatic | Jagellonská 14 | Praha 3 | 130 00
