[ceph-users] CephFS: costly MDS cache misses?

Jens-U. Mozdzen jmozdzen at nde.ag
Wed Nov 29 10:08:51 PST 2017


Hi *,

while tracking down a different performance issue with CephFS
(creating tarballs from CephFS-based directories takes several times
as long as backing up the same data from local disks, i.e. 56 hours
instead of 7), we had a look at CephFS performance in relation to the
size of the MDS process.

Our Ceph cluster (Luminous 12.2.1) is using FileStore-based OSDs;
CephFS data is on SAS HDDs, metadata is on SAS SSDs.

It came to mind that MDS memory consumption might be causing the
delays with "tar". The results below don't confirm this (they actually
show that MDS memory size does not affect CephFS read speed when the
cache is sufficiently warm), but they do show an almost 30%
performance drop when the cache is filled with the wrong entries.

After a fresh process start, our MDS is at about 450 MB virtual size,
with 56 MB resident. I then start a tar run over 36 GB of small files
(which I had also run a few minutes before the MDS restart, to warm up
the disk caches):

--- cut here ---
    PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+ COMMAND
    1233 ceph      20   0  446584  56000  15908 S  3.960 0.085   0:01.08 ceph-mds

server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . |  
wc -c; date
Wed Nov 29 17:38:21 CET 2017
38245529600
Wed Nov 29 17:44:27 CET 2017
server01:~ #

    PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+ COMMAND
   1233 ceph      20   0  485760 109156  16148 S  0.331 0.166   0:10.76 ceph-mds
--- cut here ---

As you can see, there's only a small growth in the MDS's virtual size.

The job took 366 seconds, that's an average of about 100 MB/s.
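The MB/s figures here and below are simply the byte count from "wc -c"
divided by the wall-clock difference of the two "date" stamps; for the
first run, that arithmetic looks like this (a sketch, assuming GNU date):

```shell
# wall-clock duration of the first run
start=$(date -d "17:38:21" +%s)
end=$(date -d "17:44:27" +%s)
secs=$((end - start))          # 366 seconds

# bytes reported by "wc -c", converted to MiB/s
awk -v b=38245529600 -v s="$secs" \
    'BEGIN { printf "%.1f MiB/s\n", b / s / 1048576 }'
# -> 99.7 MiB/s
```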

I repeat that job a few minutes later, to get numbers with a  
previously active MDS (the MDS cache should be warmed up now):

--- cut here ---
    PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+ COMMAND
   1233 ceph      20   0  494976 118404  16148 S  2.961 0.180   0:16.21 ceph-mds

server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . |  
wc -c; date
Wed Nov 29 17:53:09 CET 2017
38245529600
Wed Nov 29 17:58:53 CET 2017
server01:~ #

    PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+ COMMAND
   1233 ceph      20   0  508288 131368  16148 S  1.980 0.200   0:25.45 ceph-mds
--- cut here ---
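For those who want to watch the cache itself rather than just the
process size: the MDS exposes counters through its admin socket on the
MDS host. The daemon name "mds.0" below is just a placeholder, and
whether "cache status" is available depends on your exact version:

```
# inode and capability counters of the running MDS
ceph daemon mds.0 perf dump

# Luminous can also report cache memory usage directly
ceph daemon mds.0 cache status
```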

The job took 344 seconds, that's an average of about 106 MB/s. With
only a single run per situation, these numbers are no more than rough
estimates, of course.

At 18:00:00, a file-based incremental backup job kicks in, which reads
through most of the files on the CephFS but backs up only those that
have changed since the last run. This has nothing to do with our
"tar" and runs on a different node, where CephFS is kernel-mounted as
well. That backup job makes the MDS cache grow drastically; you can
see the MDS at more than 8 GB now.

We then start another tar job (or rather two, to account for MDS  
caching), as before:

--- cut here ---
    PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+ COMMAND
   1233 ceph      20   0 8644776 7.750g  16184 S  0.990 12.39   6:45.24 ceph-mds

server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . |  
wc -c; date
Wed Nov 29 18:13:20 CET 2017
38245529600
Wed Nov 29 18:21:50 CET 2017
server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . |  
wc -c; date
Wed Nov 29 18:22:52 CET 2017
38245529600
Wed Nov 29 18:28:28 CET 2017
server01:~ #

    PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+ COMMAND
   1233 ceph      20   0 8761512 7.642g  16184 S  3.300 12.22   7:03.52 ceph-mds
--- cut here ---

The second run is even a bit quicker than the "warmed-up" run against
the only partially filled cache (336 seconds, that's 108.5 MB/s).

But the run against the filled-up MDS cache, where most (if not all)
entries are no match for our tar lookups, took 510 seconds - that's
71.5 MB/s, instead of the roughly 100 MB/s when the cache was empty.
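Putting all four runs side by side (38245529600 bytes divided by the
wall-clock seconds of each run) makes the drop obvious:

```shell
# seconds per run: cold MDS cache, warm cache,
# polluted cache (1st run), repopulated cache (2nd run)
for s in 366 344 510 336; do
    awk -v s="$s" \
        'BEGIN { printf "%3d s -> %5.1f MiB/s\n", s, 38245529600 / s / 1048576 }'
done
# -> 366 s ->  99.7 MiB/s
#    344 s -> 106.0 MiB/s
#    510 s ->  71.5 MiB/s
#    336 s -> 108.6 MiB/s
```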

This is by no means a precise benchmark, of course. But it at least
seems to be an indicator that MDS cache misses are costly. (During the
tests, only small amounts of change to the CephFS were likely,
especially compared to the amount of reads and metadata lookups.)

Regards,
Jens

PS: Why so much memory for the MDS in the first place? Because during
those (hourly) incremental backup runs, we got a large number of MDS
warnings about clients failing to respond to cache pressure.
Increasing the MDS cache size helped to avoid these.


