[ceph-users] Hangs with qemu/libvirt/rbd when one host disappears

Alwin Antreich a.antreich at proxmox.com
Wed Dec 6 00:44:45 PST 2017

Hello Marcus,
On Tue, Dec 05, 2017 at 07:09:35PM +0100, Marcus Priesch wrote:
> Dear Ceph Users,
> first of all, big thanks to all the devs and people who made all this
> possible, ceph is amazing !!!
> ok, so let me get to the point where i need your help:
> i have a cluster of 6 hosts, mixed with ssd's and hdd's.
> on 4 of the 6 hosts there are 21 vm's running in total with little to no
> workload (web, mail, elasticsearch) for a couple of users.
> 4 nodes are running ubuntu server and 2 of them are running proxmox
> (because we are now in the process of migrating towards proxmox).
> i am running ceph luminous (have upgraded two weeks ago)
I guess you are running ceph 12.2.1 (12.2.2 is out)? What does 'ceph versions' say?
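For reference, 'ceph versions' prints a per-daemon-type summary; the output
looks roughly like this (the versions and counts here are only an example):

    $ ceph versions
    {
        "mon": {
            "ceph version 12.2.1 (...) luminous (stable)": 3
        },
        "osd": {
            "ceph version 12.2.1 (...) luminous (stable)": 18
        },
        "overall": {
            "ceph version 12.2.1 (...) luminous (stable)": 21
        }
    }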

> ceph communication is carried out on a separate 1Gbit network where we
> plan to upgrade to bonded 2x10Gbit during the next couple of weeks.
With 6 hosts you will want 10GbE, if only for the lower latency. A ceph
recovery/rebalance might also max out the bandwidth of your current link.
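If you go the LACP route for the 2x10Gbit bond, a minimal sketch for
/etc/network/interfaces could look like this (interface names and the
address are placeholders, and the switch side needs 802.3ad configured):

    auto bond0
    iface bond0 inet static
        address 10.10.10.1
        netmask 255.255.255.0
        bond-slaves ens1f0 ens1f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4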

> i have two pools defined where i only use disk images via libvirt/rbd.
> the hdd pool has two replicas and is for large (~4TB) backup images and
> the ssd pool has three replicas (two on ssd osd's and one on hdd osd's)
> for improved fail safety and faster access for "live data" and OS
> images.
Mixing spinners with SSDs is not recommended, as the spinners will slow
down the pools residing on that root.
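Since luminous you can keep them apart cleanly with device classes instead
of the host/host-hdd workaround. A sketch (rule and pool names are
placeholders):

    # replicated rule that only picks OSDs of class ssd,
    # with host as the failure domain
    ceph osd crush rule create-replicated ssd-rule default host ssd

    # point a pool at that rule
    ceph osd pool set <pool> crush_rule ssd-rule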

> in the crush map i have two different rules for the two pools so that
> replicas are always stored on different hosts - i have verified this and
> it works. it is done via the "host" attribute (host node1-hdd and host
> node1 are both actually on the same host)
> so, now comes the interesting part:
> when i turn off one of the hosts (let's say node7) that only does ceph,
> after some time the vm's stall and hang until the host comes up again.
A stall of I/O shouldn't happen. What is the min_size of your pools? And
what does your 'ceph osd tree' look like?
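You can check both per pool with:

    ceph osd pool get <pool> size
    ceph osd pool get <pool> min_size

If min_size equals size, I/O on that pool blocks as soon as one replica is
unavailable - which would match what you are seeing.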
> when i don't turn the host back on, after some time the cluster starts
> rebalancing ...

> yesterday i experienced that after a couple of hours of rebalancing the
> vm's continue working again - i think that's when the cluster has
> finished rebalancing ? haven't really dug into this.
See above.
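The rebalance kicks in because a down OSD gets marked out after
mon_osd_down_out_interval (default 600s). For planned maintenance on a
host you can suppress that:

    # before shutting the host down
    ceph osd set noout

    # once the host is back
    ceph osd unset noout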

> well, today we turned off the same host (node7) again and i got stuck
> pg's again.
> this time i did some investigation and to my surprise i found the
> following in the output of ceph health detail:
> REQUEST_SLOW 17 slow requests are blocked > 32 sec
>     3 ops are blocked > 2097.15 sec
>     14 ops are blocked > 1048.58 sec
>     osds 9,10 have blocked requests > 1048.58 sec
>     osd.5 has blocked requests > 2097.15 sec
> i think the blocked requests are my problem, aren't they?
That is a symptom of the problem; see above.
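To see what those requests are actually waiting for, you can ask the OSD's
admin socket on the node that hosts it, e.g. for osd.5:

    ceph daemon osd.5 dump_ops_in_flight
    ceph daemon osd.5 dump_historic_ops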

> but none of the osd's 9, 10 or 5 is located on host7 - so can any of you
> tell me why the requests to these nodes got stuck ?
Those OSDs are waiting on other OSDs on host7. You can see that in the
ceph logs, and 'ceph pg dump' shows you which pgs are located on which
OSDs.
> i have one pg in state "stuck unclean" which has its replicas on osd's
> 2, 3 and 15. 3 is on node7, but the first in the active set is 2 - i
> thought the "write op" should have gone there ... so why unclean ? the
> manual states "For stuck unclean placement groups, there is usually
> something preventing recovery from completing, like unfound objects" but
> there aren't any ...
unclean - The placement group has not been clean for too long (i.e., it
hasn't been able to completely recover from a previous failure).

How is your 1GbE utilized? I guess with 6 nodes (3-4 OSDs each) your link
might be maxed out. But you should see something about that in the ceph
logs.
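You can watch the link while recovery is running with standard tools
(the interface name is a placeholder):

    # per-interface throughput, 1-second samples (sysstat package)
    sar -n DEV 1

    # or interactively
    iftop -i eth0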

> do i have a configuration issue here (amount of replicas?) or is this
> behavior simply because my cluster network is too slow ?
> you can find detailed outputs here :
> 	https://owncloud.priesch.co.at/index.php/s/toYdGekchqpbydY
> i hope any of you can help me shed some light on this ...
> at least, the point of it all is that a single host should be allowed to
> fail while the vm's continue running ... ;)
To get a better look at your setup, a crush map, 'ceph osd dump', 'ceph -s'
and some log output would be nice.
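The decompiled crush map can be extracted with:

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt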

Also, since you are moving to Proxmox, you might want to have a look at the
docs & the forum.

Docs: https://pve.proxmox.com/pve-docs/
Forum: https://forum.proxmox.com
Some more useful information on PVE + Ceph: https://forum.proxmox.com/threads/ceph-raw-usage-grows-by-itself.38395/#post-189842

> regards and thanks in advance,
> marcus.
> --
> Marcus Priesch
> open source consultant - solution provider
> www.priesch.co.at / office at priesch.co.at
> A-2122 Riedenthal, In Prandnern 31 / +43 650 62 72 870
