[ceph-users] Libvirt hosts freeze after ceph osd+mon problem

Jan Pekař - Imatic jan.pekar at imatic.cz
Tue Nov 7 01:14:46 PST 2017


Additional info: this is not librbd-related. I mapped the disk through
rbd map and the result was the same: the virtual machines were stuck/frozen.
It happened exactly when this line appeared in my log:

Nov  7 10:01:27 imatic-hydra01 kernel: [2266883.493688] libceph: osd6 down

I can attach strace to the qemu process, and I see this running in a loop:

root@imatic-hydra01:/usr/local/libvirt/bin# strace -p 31963
strace: Process 31963 attached
ppoll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, 
events=POLLIN}, {fd=8, events=POLLIN}, {fd=45, events=POLLIN}, {fd=46, 
events=POLLIN}], 6, {tv_sec=0, tv_nsec=355313847}, NULL, 8) = 0 (Timeout)
poll([{fd=10, events=POLLOUT}], 1, 0)   = 1 ([{fd=10, 
revents=POLLOUT|POLLHUP}])
ppoll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, 
events=POLLIN}, {fd=8, events=POLLIN}, {fd=45, events=POLLIN}, {fd=46, 
events=POLLIN}], 6, {tv_sec=1, tv_nsec=0}, NULL, 8) = 0 (Timeout)
poll([{fd=10, events=POLLOUT}], 1, 0)   = 1 ([{fd=10, 
revents=POLLOUT|POLLHUP}])
ppoll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, 
events=POLLIN}, {fd=8, events=POLLIN}, {fd=45, events=POLLIN}, {fd=46, 
events=POLLIN}], 6, {tv_sec=0, tv_nsec=493273904}, NULL, 8) = 0 (Timeout)
Process 31963 detached
  <detached ...>
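
In the trace above, fd 10 is the descriptor that keeps reporting
POLLOUT|POLLHUP. A minimal sketch for finding out what that descriptor
points to, assuming a Linux /proc filesystem; the function names and the
proc_root parameter are made up for illustration:

```python
import os


def describe_fd(pid, fd, proc_root="/proc"):
    """Return the target of <proc_root>/<pid>/fd/<fd>.

    On Linux this is a path for regular files, or a string such as
    'socket:[12345]' or 'pipe:[12345]' for sockets and pipes.
    """
    return os.readlink(os.path.join(proc_root, str(pid), "fd", str(fd)))


def list_fds(pid, proc_root="/proc"):
    """Map every open fd number of the process to its link target."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    targets = {}
    for name in os.listdir(fd_dir):
        try:
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The fd was closed between listdir() and readlink(); skip it.
            continue
    return targets
```

For the process above this would be called as describe_fd(31963, 10), run
as root or as the user that owns the qemu process. A result of the form
socket:[inode] can then be looked up with a tool such as lsof.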

Can you please give me brief instructions on what I should debug and how
to do it? I'm a newbie at gdb debugging.
The problem is not inside the virtual machine (such as a disk not
responding), because I can't even get to the VNC console, and no kernel
panic is visible on it. I would also expect the guest kernel to keep
answering ping even when the disk is unavailable.
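
The all-thread backtrace requested in the quoted reply below can be
collected without an interactive gdb session. A sketch, assuming gdb and
the debug symbols for qemu and librbd are installed (package names vary by
distribution); the helper names and the output path are hypothetical:

```python
import subprocess


def gdb_backtrace_argv(pid):
    """Build a non-interactive gdb command that prints a backtrace of every thread."""
    return [
        "gdb",
        "-p", str(pid),                # attach to the running process
        "-batch",                      # run the -ex commands, then exit (gdb detaches)
        "-ex", "set pagination off",   # never wait for a keypress
        "-ex", "thread apply all bt",  # backtrace of all threads
    ]


def dump_backtraces(pid, out_path):
    """Run gdb against a live process and save its output to out_path.

    Needs root or the same user as the target process.
    Returns the gdb exit code.
    """
    with open(out_path, "w") as out:
        result = subprocess.run(
            gdb_backtrace_argv(pid),
            stdout=out,
            stderr=subprocess.STDOUT,
            check=False,
        )
    return result.returncode
```

For the frozen guest above this would be dump_backtraces(31963,
"qemu-bt.txt"), and the resulting file can be attached to a reply.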

Thank you

With regards
Jan Pekar



On 7.11.2017 00:30, Jason Dillaman wrote:
> If you could install the debug packages and get a gdb backtrace from all 
> threads it would be helpful. librbd doesn't utilize any QEMU threads so 
> even if librbd was deadlocked, the worst case that I would expect would 
> be your guest OS complaining about hung kernel tasks related to disk IO 
> (since the disk wouldn't be responding).
> 
> On Mon, Nov 6, 2017 at 6:02 PM, Jan Pekař - Imatic <jan.pekar at imatic.cz>
> wrote:
> 
>     Hi,
> 
>     I'm using Debian stretch with ceph 12.2.1-1~bpo80+1 and qemu
>     1:2.8+dfsg-6+deb9u3.
>     I'm running 3 nodes with 3 monitors and 8 OSDs on my nodes, all on IPv6.
> 
>     When I tested the cluster, I found a strange and severe problem.
>     On the first node I'm running qemu hosts with a librados disk connection
>     to the cluster, with all 3 monitors listed in the connection.
>     On the second node I stopped the mon and osd with the command
> 
>     kill -STOP MONPID OSDPID
> 
>     Within one minute all my qemu hosts on the first node freeze, so they
>     don't even respond to ping. On the VNC screen there is no error (disk or
>     kernel panic); they just hang forever with no console response. Even
>     starting the MON and OSD on the stopped host doesn't get them running
>     again. Destroying the qemu domain and starting it again is the only
>     solution.
> 
>     This happens even if the virtual machine has all its primary OSDs on
>     OSDs other than the one I stopped, so it is not writing to the
>     stopped OSD as primary.
> 
>     If I stop only the OSD and the MON keeps running, or I stop only the
>     MON and the OSD keeps running, everything looks OK.
> 
>     When I stop the MON and OSD, I can see in the log  osd.0 1300
>     heartbeat_check: no reply from ... as usual when an OSD fails. During
>     this the virtual machines are still running, but after that they all
>     stop.
> 
>     What should I send you to debug this problem? Without a fix for it,
>     ceph is not reliable for me.
> 
>     Thank you
>     With regards
>     Jan Pekar
>     Imatic
> 
> 
> 
> 
> -- 
> Jason
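
The failure test from the original report (freezing one mon and one OSD
with kill -STOP) can be made repeatable with a small helper. This is only
a sketch: the function names are hypothetical, and the PIDs have to be
looked up on the node under test.

```python
import os
import signal
import time


def pause_processes(pids):
    """Freeze each process with SIGSTOP, like `kill -STOP PID ...`."""
    for pid in pids:
        os.kill(pid, signal.SIGSTOP)


def resume_processes(pids):
    """Let each frozen process continue with SIGCONT, like `kill -CONT PID ...`."""
    for pid in pids:
        os.kill(pid, signal.SIGCONT)


def pause_for(pids, seconds):
    """Freeze the processes for a fixed time, then always resume them."""
    pause_processes(pids)
    try:
        time.sleep(seconds)
    finally:
        resume_processes(pids)
```

Calling pause_for([mon_pid, osd_pid], 120) on the second node, with the
two PIDs filled in, would correspond to the "within one minute" window
described in the report, with the daemons resumed afterwards.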

-- 
============
Ing. Jan Pekař
jan.pekar at imatic.cz | +420603811737
----
Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz
============
--

