[ceph-users] Hangs with qemu/libvirt/rbd when one host disappears

Marcus Priesch marcus at priesch.co.at
Thu Dec 7 01:24:13 PST 2017


Hello Alwin, Dear All,

yesterday we finished the cluster migration to proxmox and i had the same
problem again:

A couple of osd's were down and out, and a stuck request on a completely
different osd blocked the vm's.

i tried to put this specific osd out (ceph osd out xx) and voila, the
problem was gone. later on i put the osd back in and everything works as
expected.

in the meantime i read the post here:

	http://ceph.com/community/new-luminous-rados-improvements/

where network problems with switches are also mentioned ...

as the 1Gb network is completely busy in such a scenario, i assume the
problem may be that some network communication got stuck somewhere
...

however, all in all the transition from ubuntu / jewel to ubuntu /
luminous to proxmox / luminous went rather flawlessly - apart from the
problem stated above - but i am aware that i am using ceph outside its
requirements - so definitely *thumbs up* for ceph in general !!!!

to your comments :

>> i am running ceph luminous (have upgraded two weeks ago)
> I guess, you are running on ceph 12.2.1 (12.2.2 is out)? What does ceph versions say?

12.2.1

>> ceph communication is carried out on a separate 1Gbit Network where we
>> plan to upgrade to bonded 2x10Gbit during the next couple of weeks.
> With 6 hosts you will need 10GbE, alone for lower latency. Also a ceph
> recovery/rebalance might max out the bandwidth of your link.

yes, i think this is the problem ...
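
To put a rough number on this, here is a minimal back-of-the-envelope sketch. The 2 TB of data to re-replicate and the 80% usable link efficiency are assumptions chosen purely for illustration, not figures measured on this cluster, and the function name transfer_hours is made up for the example:

```python
def transfer_hours(data_tb, link_gbit, efficiency=0.8):
    """Hours needed to move data_tb terabytes over a link of link_gbit Gbit/s.

    efficiency is the assumed fraction of the nominal link rate that is
    actually usable for payload (an illustrative guess, not a measurement).
    """
    data_bytes = data_tb * 1e12                       # 1 TB = 10^12 bytes
    bytes_per_second = link_gbit * 1e9 / 8 * efficiency
    return data_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    # Hypothetical amount of data that has to be copied after a host is lost.
    data_tb = 2
    for link_gbit in (1, 10):
        hours = transfer_hours(data_tb, link_gbit)
        print(f"{link_gbit:>2} Gbit/s: about {hours:.1f} h to move {data_tb} TB")
```

The estimate ignores client traffic and any recovery throttling, so it only shows the order of magnitude: on a single 1Gbit link the recovery traffic alone can keep the network saturated for hours.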

> Mixing of spinners with SSDs is not recommended, as spinners will slow
> down the pools residing on that root.

why should this happen ? i would assume that osd's are separate parts
running on hosts - not influencing each other ?

otherwise i would need a different set of hosts for the ssd's and the
hdd's ?

>> when i turn off one of the hosts (lets say node7) that do only ceph,
>> after some time the vm's stall and hang until the host comes up again.
> A stall of I/O shouldn't happen, what is your min_size of the pools? How
> is your 'ceph osd tree' looking?

you can find it at the owncloud link ... at least the output of ceph osd df tree

>> but neither osd's 9, 10 or 5 are located on host7 - so can anyone of you
>> tell me why the requests to this nodes got stuck ?
> Those OSDs are waiting on other OSDs on host7, you can see that in the
> ceph logs and you see with 'ceph pg dump' which pgs are located on which
> OSDs.

ok, you mean that they are waiting for operations to finish on the
osd's that just went offline ?

this should be a normal scenario when hardware fails - so this shouldn't
lead to a stuck vm ... i assume ?

>> i have one pg in state "stuck unclean" which has its replicas on osd's
>> 2, 3 and 15. 3 is on node7, but the first in the active set is 2 - i
>> thought the "write op" should have gone there ... so why unclean ? the
>> manual states "For stuck unclean placement groups, there is usually
>> something preventing recovery from completing, like unfound objects" but
>> there arent ...
> unclean - The placement group has not been clean for too long (i.e., it
> hasn’t been able to completely recover from a previous failure).
> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#stuck-placement-groups

i know this ... there was no previous failure ... when i turn off some
osd's i always get this after some time ...
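
As a simplified sketch of why a pg can be unclean and still serve I/O: a replicated pg is clean only while all of its replicas are up, it keeps accepting I/O as long as at least min_size replicas are up, and it blocks below that. The values size=3 and min_size=2 are assumptions (the actual pool settings are not stated in this mail), the function pg_status and its return strings are made up for illustration, and the model ignores remapping and recovery:

```python
def pg_status(size, min_size, replicas_up):
    """Very simplified view of a replicated placement group.

    size        -- number of replicas the pool wants
    min_size    -- minimum number of replicas needed to accept I/O
    replicas_up -- replicas currently reachable
    """
    if not 0 <= replicas_up <= size:
        raise ValueError("replicas_up must be between 0 and size")
    if replicas_up == size:
        return "clean"
    if replicas_up >= min_size:
        return "degraded, still serving I/O"
    return "blocked"


if __name__ == "__main__":
    acting = {2, 3, 15}   # osd ids of the pg mentioned above
    down = {3}            # the osd on the host that was switched off
    up = len(acting - down)
    # Assumed pool settings, not taken from the real cluster.
    print(pg_status(size=3, min_size=2, replicas_up=up))
    print(pg_status(size=3, min_size=3, replicas_up=up))
```

Under these assumptions, losing one of three replicas leaves the pg unclean but usable with min_size=2, while with min_size=3 the same loss would block I/O until the osd returns or recovery completes.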

> How is your 1GbE utilized? I guess, with 6 nodes (3-4 OSDs) your link
> might be maxed out. But you should get something in the ceph
> logs.

yes, it is maxed out ... i suspect that maybe it's a problem with the
network hardware and some packets get lost/stuck somewhere ...

>> do i have a configuration issue here (amount of replicas?) or is this
>> behavior simply just because my cluster network is too slow ?
>>
>> you can find detailed outputs here :
>>
>> 	https://owncloud.priesch.co.at/index.php/s/toYdGekchqpbydY
>>
>> i hope any of you can help me shed any light on this ...
>>
>> at least the point of all is that a single host should be allowed to
>> fail and the vm's continue running ... ;)
> To get a better look at your setup, a crush map, ceph osd dump, ceph -s
> and some log output would be nice.

you should find all of that in ceph_report.txt at the link above ...

> Also you are moving to Proxmox, you might want to have look at the docs
> & the forum.
> 
> Docs: https://pve.proxmox.com/pve-docs/
> Forum: https://forum.proxmox.com

thanks, been there ...

> Some more useful information on PVE + Ceph: https://forum.proxmox.com/threads/ceph-raw-usage-grows-by-itself.38395/#post-189842

haven't read this ...

thanks a lot !
marcus.

