[ceph-users] MDS failover, how to speed it up?

John Spray jspray at redhat.com
Mon Jun 20 04:20:03 PDT 2016

On Mon, Jun 20, 2016 at 12:04 PM, Brian Lagoni <brianl at unity3d.com> wrote:
> Is anyone here able to help us with a question about MDS failover?
> The case is that we are hitting a bug in Ceph which requires us to restart
> the MDS every week.
> There is a bug and PR for it here - https://github.com/ceph/ceph/pull/9456
> but until this has been resolved we need to do a restart. Unless there is
> a better workaround for this bug?
> The issue we are having is that when we do a failover, the time it takes for
> the CephFS kernel client to recover is long enough that the VM guests using
> this CephFS hit timeouts on their storage and therefore enter read-only
> mode.
> We have tried failing over to another MDS and restarting the MDS
> while it's the only MDS in the cluster, and in both cases our CephFS kernel
> clients take too long to recover.
> We have also tried setting the failover MDS into "MDS_STANDBY_REPLAY" mode,
> which didn't help in this matter.
> When doing a failover, all IOPS against Ceph are blocked for 2-5 min
> until the kernel CephFS clients recover after some timeout messages like
> these:

Sounds like we need to investigate why it's taking 2-5 minutes.

You should be seeing an initial 30s delay while the mons decide that
the dead MDS is dead (you can skip this by explicitly doing "ceph mds
fail <oldone>", which you might already be doing).
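
If the weekly restart is scripted, the explicit fail step could be wrapped
roughly like this. It is only a sketch: the MDS name passed in is whatever
your daemon is called, and the "run" parameter is a hypothetical hook I added
so the wrapper can be tested without a cluster.

```python
import subprocess


def fail_mds_command(mds_name):
    """Build the argv for `ceph mds fail <name>`, the command named above."""
    if not mds_name or mds_name.startswith("-"):
        raise ValueError("expected an MDS name or rank, got %r" % (mds_name,))
    return ["ceph", "mds", "fail", mds_name]


def fail_mds(mds_name, run=subprocess.run):
    """Mark the MDS as failed so a standby can take over right away,
    instead of waiting for the mons to time the old daemon out."""
    # check=True raises CalledProcessError if the ceph CLI exits non-zero.
    return run(fail_mds_command(mds_name), check=True)
```

Calling fail_mds("<oldone>") just before stopping the old daemon should take
the 30s detection delay out of the picture, so whatever remains is time spent
in the MDS states themselves.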

Then the new MDS will proceed through a series of states (replay,
clientreplay, etc).  Your cluster log should have messages showing the
MDS state changes (mdsmap updates), so hopefully you can identify
which phase is taking unexpectedly long.  Then, you can turn up the
MDS log level, and get some insight into what it's actually doing
during that phase.
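
To put numbers on each phase, something like the following could be run over
the relevant part of the cluster log. Treat it as a sketch under assumptions:
the timestamp layout is the one in the warning you quoted, but the rest of
each sample line is made up, and I am assuming every state change shows up as
an "up:<state>" token on a timestamped line. Adjust the regex to whatever your
log actually contains, and filter to a single MDS first if a standby is also
logging state changes.

```python
import re
from datetime import datetime

# Assumed line shape: "<date> <time.micros> ... up:<state> ...".
# Only the timestamp layout comes from the log excerpt in this thread.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+).*?\bup:([a-z_-]+)"
)


def phase_durations(lines):
    """Return [(state, seconds)] for each state that was followed by a
    different state. The final state has no end time and is omitted."""
    transitions = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a state-change line
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        state = m.group(2)
        # Record only real changes, not repeated reports of the same state.
        if not transitions or transitions[-1][1] != state:
            transitions.append((ts, state))
    return [
        (state, (next_ts - ts).total_seconds())
        for (ts, state), (next_ts, _) in zip(transitions, transitions[1:])
    ]


# Hypothetical sample lines, for illustration only.
SAMPLE = [
    "2016-06-19 19:08:40.000000 mon.0 [INF] mds.0 up:replay",
    "2016-06-19 19:08:55.500000 mon.0 [INF] mds.0 up:reconnect",
    "2016-06-19 19:09:40.500000 mon.0 [INF] mds.0 up:rejoin",
    "2016-06-19 19:09:42.000000 mon.0 [INF] mds.0 up:active",
]

if __name__ == "__main__":
    for state, secs in phase_durations(SAMPLE):
        print(f"{state:12s} {secs:8.1f}s")
```

Whichever state dominates the total is the one worth raising the MDS log
level for.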


> "2016-06-19 19:09:55.573739 7faaf8f48700  0 log_channel(cluster) log [WRN] :
> slow request 75.141028 seconds old, received at 2016-06-19 19:08:40.432655:
> client_request(client.4283066:4164703242 getattr pAsLsXsFs #100000000fe
> 2016-06-19 19:08:40.429496) currently failed to rdlock, waiting"
> After this there is a huge spike in IOPS and data starts being processed
> again.
> I'm not sure whether any of this is related to this warning, which is present
> 90% of the day:
> "mds0: Behind on trimming (94/30)"
> I have searched the mailing list for clues and answers on what to do about
> this but haven't found anything that has helped us.
> We have moved/isolated the MDS service to its own VM with the fastest
> processor we have, without any real change to this warning.
>  Our infrastructure is the following:
>  - We use CEPH/CEPHFS (10.2.1)
>  - We have 3 mons and 6 storage servers with a total of 36 OSDs (~4160 PGs).
>  - We have one main mds and one standby mds.
>  - The primary MDS is a virtual machine with 8 core E5-2643 v3 @
> 3.40GHz(steal time=0), 16G mem
>  - We are using ceph kernel client to mount cephfs.
>  - Ubuntu 16.04 (4.4.0-22-generic kernel)
>  - The OSDs are physical machines with 8 cores & 32GB memory
>  - All networking is 10Gb
> So in the end, is there anything we can do to make the failover and recovery
> go faster?
> Regards,
> Brian Lagoni
> System administrator, Engineering Tools
> Unity Technologies