[ceph-users] Libvirt hosts freeze after ceph osd+mon problem

Jan Pekař - Imatic jan.pekar at imatic.cz
Tue Nov 7 07:44:30 PST 2017


I am using librbd.

rbd map was only a test to see whether the problem is librbd related.
Both librbd and rbd map gave the same frozen result.
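
For reference, the krbd test was roughly the following (the pool and image
names here are placeholders, not the real ones):

# map the image through the kernel client instead of librbd
rbd map rbd/vm-disk-1 --id admin
# check which /dev/rbdX device was assigned and attach that to the guest
rbd showmapped
# unmap again when done
rbd unmap /dev/rbd0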

The node running the virtuals has the 4.9.0-3-amd64 kernel.

The two tested virtuals have the 4.9.0-3-amd64 kernel and the
4.10.17-2-pve kernel, respectively.

JP

On 7.11.2017 10:42, Wido den Hollander wrote:
> 
>> On 7 November 2017 at 10:14, Jan Pekař - Imatic <jan.pekar at imatic.cz> wrote:
>>
>>
>> Additional info - it is not librbd related. I mapped the disk through
>> rbd map and it was the same - the virtuals were stuck/frozen.
>> It happened exactly when this appeared in my log:
>>
> 
> Why aren't you using librbd? Is there a specific reason for that? With Qemu/KVM/libvirt I always suggest using librbd.
> 
> And in addition, what kernel version are you running?
> 
> Wido
> 
>> Nov  7 10:01:27 imatic-hydra01 kernel: [2266883.493688] libceph: osd6 down
>>
>> I can attach strace to the qemu process and I see this running in a loop:
>>
>> root@imatic-hydra01:/usr/local/libvirt/bin# strace -p 31963
>> strace: Process 31963 attached
>> ppoll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7,
>> events=POLLIN}, {fd=8, events=POLLIN}, {fd=45, events=POLLIN}, {fd=46,
>> events=POLLIN}], 6, {tv_sec=0, tv_nsec=355313847}, NULL, 8) = 0 (Timeout)
>> poll([{fd=10, events=POLLOUT}], 1, 0)   = 1 ([{fd=10,
>> revents=POLLOUT|POLLHUP}])
>> ppoll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7,
>> events=POLLIN}, {fd=8, events=POLLIN}, {fd=45, events=POLLIN}, {fd=46,
>> events=POLLIN}], 6, {tv_sec=1, tv_nsec=0}, NULL, 8) = 0 (Timeout)
>> poll([{fd=10, events=POLLOUT}], 1, 0)   = 1 ([{fd=10,
>> revents=POLLOUT|POLLHUP}])
>> ppoll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7,
>> events=POLLIN}, {fd=8, events=POLLIN}, {fd=45, events=POLLIN}, {fd=46,
>> events=POLLIN}], 6, {tv_sec=0, tv_nsec=493273904}, NULL, 8) = 0 (Timeout)
>> Process 31963 detached
>>    <detached ...>
>>
>> Can you please give me brief info on what I should debug and how to do
>> that? I'm a newbie at gdb debugging.
>> It is not a problem inside the virtual machine (like the disk not
>> responding), because I can't even get to the VNC console and there is no
>> kernel panic visible on it. Also, I suppose the kernel should still
>> answer ping without the disk being available.
>>
>> Thank you
>>
>> With regards
>> Jan Pekar
>>
>>
>>
>> On 7.11.2017 00:30, Jason Dillaman wrote:
>>> If you could install the debug packages and get a gdb backtrace from all
>>> threads it would be helpful. librbd doesn't utilize any QEMU threads so
>>> even if librbd was deadlocked, the worst case that I would expect would
>>> be your guest OS complaining about hung kernel tasks related to disk IO
>>> (since the disk wouldn't be responding).
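>>>
>>> A rough sketch of how to grab that, assuming the debug symbols are
>>> installed and using the qemu PID from the strace above (package names
>>> differ per distro, so treat this only as an outline):
>>>
>>> # attach to the hung qemu process and dump a backtrace of every thread
>>> gdb -p 31963 -batch -ex "thread apply all bt" > qemu-threads.txt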
>>>
>>> On Mon, Nov 6, 2017 at 6:02 PM, Jan Pekař - Imatic <jan.pekar at imatic.cz> wrote:
>>>
>>>      Hi,
>>>
>>>      I'm using Debian stretch with ceph 12.2.1-1~bpo80+1 and qemu
>>>      1:2.8+dfsg-6+deb9u3.
>>>      I'm running 3 nodes with 3 monitors and 8 OSDs, all on IPv6.
>>>
>>>      When I tested the cluster, I detected a strange and severe problem.
>>>      On the first node I'm running qemu guests with a librados disk
>>>      connection to the cluster, with all 3 monitors listed in the
>>>      connection.
>>>      On the second node I stopped the mon and osd with the command
>>>
>>>      kill -STOP MONPID OSDPID
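>>>
>>>      (MONPID and OSDPID are the PIDs of the ceph-mon and the ceph-osd
>>>      daemon on that node; the "starting again" mentioned below is,
>>>      roughly, just resuming them:)
>>>
>>>      kill -CONT MONPID OSDPID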
>>>
>>>      Within one minute all my qemu guests on the first node freeze, so
>>>      they don't even respond to ping. On the VNC screen there is no error
>>>      (disk error or kernel panic); they just hang forever with no console
>>>      response. Even starting the MON and OSD on the stopped host again
>>>      doesn't bring them back. Destroying the qemu domain and starting it
>>>      again is the only solution.
>>>
>>>      This happens even if the virtual machine has all its primary OSDs on
>>>      OSDs other than the one I stopped - so it is not writing its primary
>>>      copies to the stopped OSD.
>>>
>>>      If I stop only the OSD and keep the MON running, or stop only the
>>>      MON and keep the OSD running, everything looks OK.
>>>
>>>      When I stop both the MON and the OSD, I can see "osd.0 1300
>>>      heartbeat_check: no reply from ..." in the log, as usual when an
>>>      OSD fails. During this the virtuals are still running, but after
>>>      that they all stop.
>>>
>>>      What should I send you to debug this problem? Without fixing it,
>>>      ceph is not reliable for me.
>>>
>>>      Thank you
>>>      With regards
>>>      Jan Pekar
>>>      Imatic
>>>
>>>
>>>
>>>
>>> -- 
>>> Jason
>>

-- 
============
Ing. Jan Pekař
jan.pekar at imatic.cz | +420603811737
----
Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz
============
--

