[ceph-users] CephFS log jam prevention

Daniel Baumann daniel.baumann at bfh.ch
Tue Dec 5 09:48:20 PST 2017


Hi,

On 12/05/17 17:58, Dan Jakubiec wrote:
> Is this a configuration problem or a bug?

We had massive problems with both kraken (Feb-Sept 2017) and luminous
(12.2.0), seeing the same behaviour as you. Our ceph.conf contained only
defaults, except that we had to crank up mds_cache_size and
mds_bal_fragment_size_max.
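
For illustration, such a bump sits in the [mds] section of ceph.conf,
roughly like the snippet below. The values are placeholders, not the
ones we actually ran with; note that mds_cache_size counts inodes, not
bytes, and sensible numbers depend on your workload and available RAM:

[mds]
    # number of inodes the MDS may cache (placeholder value)
    mds_cache_size = 4000000
    # maximum number of entries in a single directory fragment (placeholder value)
    mds_bal_fragment_size_max = 200000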

Using dirfrags and multi-MDS did not change anything either. Even with
luminous (12.2.0), a single rsync over a large directory tree could
basically kill CephFS for all clients within seconds, and even a waiting
period of >8 hours did not help.
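
For context, enabling those features on a kraken-era cluster looked
roughly like the commands below. The filesystem name is a placeholder,
the exact flags differ between releases (some also require an extra
confirmation flag), and directory fragmentation is on by default from
luminous onward, so treat this as a sketch only:

# allow directory fragmentation (default since luminous)
ceph fs set <fsname> allow_dirfrags true
# allow more than one active MDS, then raise the active count
ceph fs set <fsname> allow_multimds true
ceph fs set <fsname> max_mds 2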

Since the cluster was semi-production, we couldn't afford the downtime,
so we resorted to unmounting CephFS on all clients, flushing the MDS
journal, and re-mounting.
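
In case it helps anybody, that workaround boils down to something like
the following; the mount point, monitor host and MDS name are
placeholders, and the flush goes through the MDS admin socket:

# on every client
umount /mnt/cephfs

# on the machine running the active MDS
ceph daemon mds.<name> flush journal

# on every client again
mount -t ceph <mon-host>:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/secret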

Interestingly, with 12.2.1 on kernel 4.13 this doesn't occur anymore
(the 'mds lagging behind' still happens, but it recovers within minutes
and the rsync does not need to be aborted).

I'm not sure if 12.2.1 fixed it by itself, or if it was your config
changes happening at the same time:

mds_session_autoclose = 10
mds_reconnect_timeout = 10

mds_blacklist_interval = 10
mds_session_blacklist_on_timeout = false
mds_session_blacklist_on_evict = false
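
If anyone wants to try these without editing ceph.conf, MDS options can
usually be injected at runtime along these lines, though some of them
may only take full effect after an MDS restart, so treat this as a
sketch as well:

ceph tell mds.* injectargs '--mds_session_autoclose 10 --mds_reconnect_timeout 10'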

Regards,
Daniel
