[ceph-users] MDS hangs in "heartbeat_map" deadlock

Stefan Kooman stefan at bit.nl
Tue Oct 23 07:34:06 PDT 2018

Quoting Patrick Donnelly (pdonnell at redhat.com):
> Thanks for the detailed notes. It looks like the MDS is stuck
> somewhere it's not even outputting any log messages. If possible, it'd
> be helpful to get a coredump (e.g. by sending SIGQUIT to the MDS) or,
> if you're comfortable with gdb, a backtrace of any threads that look
> suspicious (e.g. not waiting on a futex) including `info threads`.

It took a while before the same issue reappeared ... but we
managed to capture gdb backtraces and strace output. See the pastebin
links below. Note: we had difficulty getting the MDSs working again, so
we had to restart them a couple of times, capturing as much debug output
as we could. Hopefully you can squeeze some useful information out of
this data.
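
For anyone wanting to collect similar data, here is a minimal sketch of
how such captures can be scripted, assuming gdb and strace are installed
on the MDS host. The helper names, the PID, and the output file name are
hypothetical and not taken from this thread; the script only builds and
prints the command lines, it does not attach to anything.

```python
import shlex


def gdb_backtrace_cmd(pid):
    """Build a gdb batch command that lists all threads and dumps
    a backtrace for each of them, then detaches."""
    return [
        "gdb", "-p", str(pid), "-batch",
        "-ex", "info threads",
        "-ex", "thread apply all bt",
    ]


def strace_cmd(pid, out_path):
    """Build an strace command that attaches to a running process,
    follows its threads (-f) and writes the trace to out_path."""
    return ["strace", "-f", "-p", str(pid), "-o", out_path]


if __name__ == "__main__":
    pid = 12345  # hypothetical PID of the ceph-mds process
    print(shlex.join(gdb_backtrace_cmd(pid)))
    print(shlex.join(strace_cmd(pid, "mds-strace.out")))
```

Both commands normally need root (or ptrace permission) on the target
process, and only one tracer can be attached at a time, so they would
have to be run one after the other.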

https://8n1.org/13869/bc3b - A few minutes after it first started
acting up
https://8n1.org/13870/caf4 - Probably made when I tried to stop the
process and it took too long (process already received SIGKILL)
https://8n1.org/13871/2f22 - After restarting the same issue returned
https://8n1.org/13872/2246 - After restarting the same issue returned

https://8n1.org/13873/f861 - After it went craycray when it became
https://8n1.org/13874/c567 - After restarting the same issue returned
https://8n1.org/13875/133a - After restarting the same issue returned

MDS1: https://8n1.org/mds1-strace.zip
MDS2: https://8n1.org/mds2-strace.zip

Gr. Stefan

