[ceph-users] MDS hangs in "heartbeat_map" deadlock

Gregory Farnum gfarnum at redhat.com
Fri Oct 5 13:28:23 PDT 2018


On Thu, Oct 4, 2018 at 3:58 PM Stefan Kooman <stefan at bit.nl> wrote:

> Dear list,
>
> Today we hit our first Ceph MDS issue. Out of the blue the active MDS
> stopped working:
>
> mon.mon1 [WRN] daemon mds.mds1 is not responding, replacing it as rank 0
> with standby
> daemon mds.mds2.
>
> Logging of ceph-mds1:
>
> 2018-10-04 10:50:08.524745 7fdd516bf700 1 mds.mds1 asok_command: status
> (starting...)
> 2018-10-04 10:50:08.524782 7fdd516bf700 1 mds.mds1 asok_command: status
> (complete)
>
> ^^ one of our monitoring health checks performing a "ceph daemon mds.mds1
> version", business as usual.
>
> 2018-10-04 10:52:36.712525 7fdd51ec0700 1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-10-04 10:52:36.747577 7fdd4deb8700 1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-10-04 10:52:36.747584 7fdd4deb8700 1 mds.beacon.mds1 _send skipping
> beacon, heartbeat map not healthy
>
> ^^ the unresponsive mds1 consumes 100% CPU and keeps on logging the above
> heartbeat_map messages.
>
> In the meantime ceph-mds2 has transitioned from "standby-replay" to
> "active":
>
> mon.mon1 [INF] daemon mds.mds2 is now active in filesystem
> BITED-153874-cephfs as rank 0
>
> Logging:
>
> replays, final replay as standby, reopen log
>
> 2018-10-04 10:52:53.268470 7fdb231d9700 1 mds.0.141 reconnect_done
> 2018-10-04 10:52:53.759844 7fdb231d9700 1 mds.mds2 Updating MDS map to
> version 143 from mon.3
> 2018-10-04 10:52:53.759859 7fdb231d9700 1 mds.0.141 handle_mds_map i am
> now mds.0.141
> 2018-10-04 10:52:53.759862 7fdb231d9700 1 mds.0.141 handle_mds_map state
> change up:reconnect --> up:rejoin
> 2018-10-04 10:52:53.759868 7fdb231d9700 1 mds.0.141 rejoin_start
> 2018-10-04 10:52:53.759970 7fdb231d9700 1 mds.0.141 rejoin_joint_start
> 2018-10-04 10:52:53.760970 7fdb1d1cd700 0 mds.0.cache failed to open ino
> 0x10000cd95e9 err -5/0
> 2018-10-04 10:52:54.126658 7fdb1d1cd700 1 mds.0.141 rejoin_done
> 2018-10-04 10:52:54.770457 7fdb231d9700 1 mds.mds2 Updating MDS map to
> version 144 from mon.3
> 2018-10-04 10:52:54.770484 7fdb231d9700 1 mds.0.141 handle_mds_map i am
> now mds.0.141
> 2018-10-04 10:52:54.770487 7fdb231d9700 1 mds.0.141 handle_mds_map state
> change up:rejoin --> up:clientreplay
> 2018-10-04 10:52:54.770494 7fdb231d9700 1 mds.0.141 recovery_done --
> successful recovery!
> 2018-10-04 10:52:54.770617 7fdb231d9700 1 mds.0.141 clientreplay_start
> 2018-10-04 10:52:54.882995 7fdb1d1cd700 1 mds.0.141 clientreplay_done
> 2018-10-04 10:52:55.778598 7fdb231d9700 1 mds.mds2 Updating MDS map to
> version 145 from mon.3
> 2018-10-04 10:52:55.778622 7fdb231d9700 1 mds.0.141 handle_mds_map i am
> now mds.0.141
> 2018-10-04 10:52:55.778628 7fdb231d9700 1 mds.0.141 handle_mds_map state
> change up:clientreplay --> up:active
> 2018-10-04 10:52:55.778638 7fdb231d9700 1 mds.0.141 active_start
> 2018-10-04 10:52:55.805206 7fdb231d9700 1 mds.0.141 cluster recovered.
>
> And then it _also_ starts to log heartbeat_map messages (and consuming 100%
> CPU):
>
> (and then these messages keep repeating themselves at 100% CPU:)
> 2018-10-04 10:53:41.550793 7fdb241db700 1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-10-04 10:53:42.884018 7fdb201d3700 1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-10-04 10:53:42.884024 7fdb201d3700 1 mds.beacon.mds2 _send skipping
> beacon, heartbeat map not healthy
>
> At that point in time there is one active MDS according to ceph, but in
> reality it's
> not functioning correctly (not serving clients at least).
>
> ... we stopped both daemons. Restarted one ... recovery ...
> works for half a minute ... then starts logging heartbeat_map messages.
> Restart again ... works for a little while ... starts logging
> heartbeat_map messages again. We restart the mds with debug_mds=20 ....
> it keeps working fine. The other mds gets restarted and keeps on
> working. We do a couple of failover tests .... works flawlessly
> (failover in < 1 second, clients reconnect instantly).
>
> A couple of hours later we hit the same issue. We restarted with
> debug_mds=20 and debug_journaler=20 on the standby-replay node. Eight
> hours later (an hour ago) we hit the same issue. We captured ~ 4.7 GB of
> logging.... I skipped to the end of the log file just before the
> "heartbeat_map" messages start:
>
> 2018-10-04 23:23:53.144644 7f415ebf4700 20 mds.0.locker  client.17079146
> pending pAsLsXsFscr allowed pAsLsXsFscr wanted pFscr
> 2018-10-04 23:23:53.144645 7f415ebf4700 10 mds.0.locker eval done
> 2018-10-04 23:23:55.088542 7f415bbee700 10 mds.beacon.mds2 _send up:active
> seq 5021
> 2018-10-04 23:23:59.088602 7f415bbee700 10 mds.beacon.mds2 _send up:active
> seq 5022
> 2018-10-04 23:24:03.088688 7f415bbee700 10 mds.beacon.mds2 _send up:active
> seq 5023
> 2018-10-04 23:24:07.088775 7f415bbee700 10 mds.beacon.mds2 _send up:active
> seq 5024
> 2018-10-04 23:24:11.088867 7f415bbee700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2018-10-04 23:24:11.088871 7f415bbee700  1 mds.beacon.mds2 _send skipping
> beacon, heartbeat map not healthy
>
> As far as I can see just normal behaviour.
>
> The big question is: what is happening when the mds starts logging the
> heartbeat_map messages?
> Why does it log "heartbeat_map is_healthy", only to log 0.000004 seconds
> later that it's not healthy?
>

Ah, there's a misunderstanding here — the output isn't terribly clear.
"is_healthy" is the name of a *function* in the source code. The line

heartbeat_map is_healthy 'MDSRank' had timed out after 15

is telling you that the heartbeat_map's is_healthy function is running, and
it finds that "'MDSRank' had timed out after 15 [seconds]". So the MDSRank
thread is *not* healthy: it hasn't checked in for 15 seconds! Therefore the
MDS beacon code decides not to send a beacon, because it thinks the MDS
might be stuck.
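To make the mechanism concrete, here's a hypothetical shell analogue of that
check (the real implementation is C++ inside Ceph; the variable and function
names below are invented for illustration, not Ceph's actual code):

```shell
#!/bin/sh
# Hypothetical sketch of the heartbeat_map idea, NOT Ceph's real code.
# A worker thread records when it last "checked in"; is_healthy() compares
# that timestamp against a grace period (15 seconds, as in the log above).
GRACE=15
LAST_TOUCH=$(date +%s)   # the MDSRank thread just checked in

is_healthy() {
    now=$(date +%s)
    if [ $(( now - LAST_TOUCH )) -gt "$GRACE" ]; then
        echo "heartbeat_map is_healthy 'MDSRank' had timed out after $GRACE"
        return 1
    fi
}

# Healthy case: the thread checked in recently, so the beacon is sent.
is_healthy && echo "beacon sent"

# Simulate a stuck MDSRank thread whose last check-in was 20 seconds ago:
LAST_TOUCH=$(( $(date +%s) - 20 ))
is_healthy || echo "_send skipping beacon, heartbeat map not healthy"
```

The key point is that the messages are produced by the *checking* threads;
the stuck MDSRank thread itself stops touching its timestamp, which is why
the log keeps repeating while the daemon spins at 100% CPU.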

From what you've described here, it's most likely that the MDS is trying to
read something out of RADOS which is taking a long time, and which we
didn't expect to cause a slow down. You can check via the admin socket to
see if there are outstanding Objecter requests or ops_in_flight to get a
clue.
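For example (run on the node hosting the active MDS, substituting your own
daemon name for mds2; the exact output shape depends on your version):

```
# Outstanding RADOS requests the MDS's Objecter is waiting on:
ceph daemon mds.mds2 objecter_requests

# Operations currently in flight on the MDS:
ceph daemon mds.mds2 dump_ops_in_flight
```

If a request shows up in objecter_requests and sits there across repeated
invocations, that points at a slow or stuck RADOS read rather than a bug in
the MDS state machine itself.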
-Greg


>
> Ceph version: 12.2.8 on all nodes (mon, osd, mds)
> mds: one active / one standby-replay
>
> The system was not under any kind of resource pressure: plenty of CPU, RAM
> available. Metrics all look normal up to the moment things go into a
> deadlock
> (so it seems).
>
> Thanks for any hints / clue.
>
> Gr. Stefan
>
>
>
> --
> | BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
> | GPG: 0xD14839C6                   +31 318 648 688
> <+31%20318%20648%20688> / info at bit.nl
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>