[ceph-users] CephFS very unstable with many small files
jspray at redhat.com
Sun Feb 25 12:50:15 PST 2018
On Sun, Feb 25, 2018 at 4:45 PM, Oliver Freyermuth
<freyermuth at physik.uni-bonn.de> wrote:
> Dear Cephalopodians,
> in preparation for production, we have run very successful tests with large sequential data,
> and just now a stress-test creating many small files on CephFS.
> We use a replicated metadata pool (4 SSDs, 4 replicas) and a data pool with 6 hosts with 32 OSDs each, running in EC k=4 m=2.
> Compression is activated (aggressive, snappy). All Bluestore, LVM, Luminous 12.2.3.
> There are (at the moment) only two MDS's, one is active, the other standby.
> For the test, we had 1120 client processes on 40 client machines (all cephfs-fuse!) extract a tarball with 150k small files
> ( http://distfiles.gentoo.org/snapshots/portage-latest.tar.xz ) each into a separate subdirectory.
Running these tests with numerous clients is valuable -- thanks for
doing it. The automated testing of Ceph that happens before releases
unfortunately does not include situations with more than one or two
> Things started out rather well (but expectedly slow), we had to increase
> mds_log_max_segments => 240
> mds_log_max_expiring => 160
> due to https://github.com/ceph/ceph/pull/18624
> and adjusted mds_cache_memory_limit to 4 GB.
> Even though the MDS machine has 32 GB, it is also running 2 OSDs (for metadata) and so we have been careful with the cache
> (e.g. due to http://tracker.ceph.com/issues/22599 ).
> After a while, we tested MDS failover and realized we entered a flip-flop situation between the two MDS nodes we have.
> Increasing mds_beacon_grace to 240 helped with that.
In general, if you're in a situation where you're having to increase
mds_beacon_grace, you already have pretty bad problems. It's a good
time to stop and dig into what is tying up the MDS so badly that it
can't even send a beacon to the monitor in a timely way. Perhaps at
this point your MDS daemons were already hitting swap and becoming
pathologically slow for that reason.
> Now, with about 100,000,000 objects written, we are in a disaster situation.
> First off, the MDS could not restart anymore - it required >40 GB of memory, which (together with the 2 OSDs on the MDS host) exceeded RAM and swap.
> So it tried to recover and OOMed quickly after. Replay was reasonably fast, but join took many minutes:
> 2018-02-25 04:16:02.299107 7fe20ce1f700 1 mds.0.17657 rejoin_start
> 2018-02-25 04:19:00.618514 7fe20ce1f700 1 mds.0.17657 rejoin_joint_start
> and finally, 5 minutes later, OOM.
> I stopped half of the stress-test tar's, which did not help - then I rebooted half of the clients, which did help and let the MDS recover just fine.
> So it seems the client caps have been too many for the MDS to handle. I'm unsure why "tar" would cause so many open file handles.
> Is there anything that can be configured to prevent this from happening?
Clients will generally hold onto capabilities for files they've
written out -- this is pretty sub-optimal for many workloads where
files are written out but not likely to be accessed again in the near
future. While clients hold these capabilities, the MDS cannot drop
things from its own cache.
The way this is *meant* to work is that the MDS hits its cache size
limit, and sends a message to clients asking them to drop some files
from their local cache, and consequently release those capabilities.
However, this has historically been a tricky area with ceph-fuse
clients (there are some hacks for detecting kernel version and using
different mechanisms for different versions of fuse), and it's
possible that on your clients this mechanism is simply not working,
leading to a severely oversized MDS cache.
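One way to check whether particular clients are sitting on a large number of capabilities is to look at the per-session counts from the MDS admin socket ("ceph daemon mds.<name> session ls", which prints JSON). Below is a minimal sketch for ranking sessions by cap count. It assumes each session entry has "id" and "num_caps" fields, so check that against the output of your version. The sample data is made up for illustration and is not from a real cluster.

```python
import json


def top_cap_holders(session_ls_json, limit=5):
    """Return (client_id, num_caps) pairs, largest cap count first.

    session_ls_json is the JSON text printed by the MDS "session ls"
    admin socket command (field names are an assumption, see above).
    """
    sessions = json.loads(session_ls_json)
    ranked = sorted(
        ((s.get("id"), s.get("num_caps", 0)) for s in sessions),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:limit]


# Hypothetical sample, not real cluster output.
SAMPLE = json.dumps([
    {"id": 4101, "num_caps": 120000},
    {"id": 4102, "num_caps": 350},
    {"id": 4103, "num_caps": 980000},
])

if __name__ == "__main__":
    for client_id, caps in top_cap_holders(SAMPLE):
        print(f"client.{client_id}: {caps} caps")
```

If a handful of sessions hold nearly all of the caps and the numbers never go down while the MDS is over its cache limit, that would point at the recall mechanism not working on those clients.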
The MDS should have been showing health alerts in "ceph status" about
this, but I suppose it's possible that it wasn't surviving long enough
to hit the timeout (60s) that we apply for warning about misbehaving
clients? It would be good to check the cluster log to see if you were
getting any health messages along the lines of "Client xyz failing to
respond to cache pressure".
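Checking the log for that message can be scripted. The following is a rough sketch that filters lines for the wording quoted above. The sample log lines are hypothetical and only approximate what a cluster log looks like; the exact format on your cluster may differ.

```python
import re

# Matches the health message wording quoted above, case-insensitively.
PATTERN = re.compile(r"failing to respond to cache pressure", re.IGNORECASE)


def cache_pressure_lines(lines):
    """Return the lines that mention clients ignoring cache pressure."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


# Hypothetical sample lines, not copied from a real cluster log.
SAMPLE_LOG = [
    "2018-02-25 04:10:00 mon.a [INF] osd.12 marked itself down\n",
    "2018-02-25 04:11:30 mon.a [WRN] 1 clients failing to respond to cache pressure\n",
    "2018-02-25 04:12:00 mon.a [INF] mds.0 is now active\n",
]

if __name__ == "__main__":
    for hit in cache_pressure_lines(SAMPLE_LOG):
        print(hit)
```

To run it against a real log, pass an open file object instead of SAMPLE_LOG; the cluster log typically lives in /var/log/ceph/ceph.log on the monitor hosts, but that path depends on your configuration.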
> Now, I only lost some "stress test data", but later, it might be user's data...
> In parallel, I had reinstalled one OSD host.
> It was backfilling well, but now, <24 hours later, before backfill has finished, several OSD hosts enter OOM condition.
> Our OSD-hosts have 64 GB of RAM for 32 OSDs, which should be fine with the default bluestore cache size of 1 GB. However, it seems the processes are using much more,
> up to several GBs until memory is exhausted. They then become sluggish, are kicked out of the cluster, come back, and finally at some point they are OOMed.
> Now, I have restarted some OSD processes and hosts which helped to reduce the memory usage - but now I have some OSDs crashing continuously,
> leading to PG unavailability, and preventing recovery from completion.
> I have reported a ticket about that, with stacktrace and log:
> This might well be a consequence of a previous OOM killer condition.
> However, my final question after these ugly experiences is:
> Did somebody ever stresstest CephFS for many small files?
> Are those issues known? Can special configuration help?
> Are the memory issues known? Are there solutions?
> We don't plan to use Ceph for many small files, but we don't have full control of our users, which is why we wanted to test this "worst case" scenario.
> It would be really bad if we lost a production filesystem due to such a situation, so the plan was to test now to know what happens before we enter production.
> As of now, this looks really bad, and I'm not sure the cluster will ever recover.
> I'll give it some more time, but we'll likely kill off all remaining clients next week and see what happens, and worst case recreate the Ceph cluster.