[ceph-users] CephFS log jam prevention

Dan Jakubiec dan.jakubiec at gmail.com
Tue Dec 5 08:58:51 PST 2017


To add a little color here... we started an rsync last night to copy about 4TB worth of files to CephFS.  Paused it this morning because CephFS was unresponsive on the machine (e.g. can't cat a file from the filesystem).

Been waiting about 3 hours for the log jam to clear.  Slow requests have steadily decreased but still can't cat a file.

Seems like there should be something throttling the rsync operation to prevent the queues from backing up so far.  Is this a configuration problem or a bug?
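
In the meantime the only stopgap I can think of is to throttle the copy on the client side.  A rough sketch (the paths and the ~50 MB/s figure are only placeholders, not what we actually ran):

    # cap the transfer rate so the MDS/OSDs have a chance to keep up
    # --bwlimit is in KB/s, so 50000 ~= 50 MB/s
    rsync -a --bwlimit=50000 /source/data/ /mnt/cephfs/data/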

From reading the Ceph docs, this seems to be the most telling:

mdsdb(mds.0): MDS cache is too large (23GB/8GB); 1018 inodes in use by clients, 1 stray files

[Ref: http://docs.ceph.com/docs/master/cephfs/cache-size-limits/]

"Be aware that the cache limit is not a hard limit. Potential bugs in the CephFS client or MDS or misbehaving applications might cause the MDS to exceed its cache size. The  mds_health_cache_threshold configures the cluster health warning message so that operators can investigate why the MDS cannot shrink its cache."

Any suggestions?

Thanks,

-- Dan



> On Dec 5, 2017, at 10:07, Reed Dier <reed.dier at focusvq.com> wrote:
> 
> Been trying to do a fairly large rsync onto a 3x-replicated, filestore, HDD-backed CephFS pool.
> 
> Luminous 12.2.1 for all daemons, kernel CephFS driver, Ubuntu 16.04 running mix of 4.8 and 4.10 kernels, 2x10GbE networking between all daemons and clients.
> 
>> $ ceph versions
>> {
>>     "mon": {
>>         "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 3
>>     },
>>     "mgr": {
>>         "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 3
>>     },
>>     "osd": {
>>         "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 74
>>     },
>>     "mds": {
>>         "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 2
>>     },
>>     "overall": {
>>         "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 82
>>     }
>> }
> 
>> HEALTH_ERR 1 MDSs report oversized cache; 1 MDSs have many clients failing to respond to cache pressure; 1 MDSs behind on trimming; noout,nodeep-scrub flag(s) set; application not enabled on 1 pool(s); 242 slow requests are blocked > 32 sec; 769378 stuck requests are blocked > 4096 sec
>> MDS_CACHE_OVERSIZED 1 MDSs report oversized cache
>>     mdsdb(mds.0): MDS cache is too large (23GB/8GB); 1018 inodes in use by clients, 1 stray files
>> MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to cache pressure
>>     mdsdb(mds.0): Many clients (37) failing to respond to cache pressure (client_count: 37)
>> MDS_TRIM 1 MDSs behind on trimming
>>     mdsdb(mds.0): Behind on trimming (36252/30); max_segments: 30, num_segments: 36252
>> OSDMAP_FLAGS noout,nodeep-scrub flag(s) set
>> REQUEST_SLOW 242 slow requests are blocked > 32 sec
>>     236 ops are blocked > 2097.15 sec
>>     3 ops are blocked > 1048.58 sec
>>     2 ops are blocked > 524.288 sec
>>     1 ops are blocked > 32.768 sec
>> REQUEST_STUCK 769378 stuck requests are blocked > 4096 sec
>>     91 ops are blocked > 67108.9 sec
>>     121258 ops are blocked > 33554.4 sec
>>     308189 ops are blocked > 16777.2 sec
>>     251586 ops are blocked > 8388.61 sec
>>     88254 ops are blocked > 4194.3 sec
>>     osds 0,1,3,6,8,12,15,16,17,21,22,23 have stuck requests > 16777.2 sec
>>     osds 4,7,9,10,11,14,18,20 have stuck requests > 33554.4 sec
>>     osd.13 has stuck requests > 67108.9 sec
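> 
> In case it is useful, per-op detail for the worst OSD (osd.13 above) can be pulled from its admin socket on the host that holds it, e.g.:
> 
>> # ops currently blocked in the OSD
>> ceph daemon osd.13 dump_ops_in_flight
>> # recently completed (slow) ops with their timing breakdown
>> ceph daemon osd.13 dump_historic_ops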
> 
> This is across 8 nodes, each holding 3x 8TB HDDs, all backed by Intel P3600 NVMe drives for journaling.
> Removed SSD OSDs for brevity.
> 
>> $ ceph osd tree
>> ID  CLASS WEIGHT    TYPE NAME                         STATUS REWEIGHT PRI-AFF
>> -13        87.28799 root ssd
>>  -1       174.51500 root default
>> -10       174.51500     rack default.rack2
>> -55        43.62000         chassis node2425
>>  -2        21.81000             host node24
>>   0   hdd   7.26999                 osd.0                 up  1.00000 1.00000
>>   8   hdd   7.26999                 osd.8                 up  1.00000 1.00000
>>  16   hdd   7.26999                 osd.16                up  1.00000 1.00000
>>  -3        21.81000             host node25
>>   1   hdd   7.26999                 osd.1                 up  1.00000 1.00000
>>   9   hdd   7.26999                 osd.9                 up  1.00000 1.00000
>>  17   hdd   7.26999                 osd.17                up  1.00000 1.00000
>> -56        43.63499         chassis node2627
>>  -4        21.81999             host node26
>>   2   hdd   7.27499                 osd.2                 up  1.00000 1.00000
>>  10   hdd   7.26999                 osd.10                up  1.00000 1.00000
>>  18   hdd   7.27499                 osd.18                up  1.00000 1.00000
>>  -5        21.81499             host node27
>>   3   hdd   7.26999                 osd.3                 up  1.00000 1.00000
>>  11   hdd   7.26999                 osd.11                up  1.00000 1.00000
>>  19   hdd   7.27499                 osd.19                up  1.00000 1.00000
>> -57        43.62999         chassis node2829
>>  -6        21.81499             host node28
>>   4   hdd   7.26999                 osd.4                 up  1.00000 1.00000
>>  12   hdd   7.26999                 osd.12                up  1.00000 1.00000
>>  20   hdd   7.27499                 osd.20                up  1.00000 1.00000
>>  -7        21.81499             host node29
>>   5   hdd   7.26999                 osd.5                 up  1.00000 1.00000
>>  13   hdd   7.26999                 osd.13                up  1.00000 1.00000
>>  21   hdd   7.27499                 osd.21                up  1.00000 1.00000
>> -58        43.62999         chassis node3031
>>  -8        21.81499             host node30
>>   6   hdd   7.26999                 osd.6                 up  1.00000 1.00000
>>  14   hdd   7.26999                 osd.14                up  1.00000 1.00000
>>  22   hdd   7.27499                 osd.22                up  1.00000 1.00000
>>  -9        21.81499             host node31
>>   7   hdd   7.26999                 osd.7                 up  1.00000 1.00000
>>  15   hdd   7.26999                 osd.15                up  1.00000 1.00000
>>  23   hdd   7.27499                 osd.23                up  1.00000 1.00000
> 
> Trying to figure out what in my configuration is off, because I am told that CephFS should be able to throttle the requests to match the underlying storage medium and not create such an extensive log jam. 
> 
>> [mds]
>> mds_cache_size = 0
>> mds_cache_memory_limit = 8589934592
>> 
>> [osd]
>> osd_op_threads = 4
>> filestore max sync interval = 30
>> osd_max_backfills = 10
>> osd_recovery_max_active = 16
>> osd_op_thread_suicide_timeout = 600
> 
> I originally had mds_cache_size set to 10000000 from Jewel, but read that it is better to zero that out and set the limit via mds_cache_memory_limit now. So I set that to 8GB to see if it helped any.
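> 
> For completeness, something like this should confirm whether the running MDS actually picked up the 8GB value ("mdsdb" is the MDS from the health output; injectargs should apply it without an MDS restart, as far as I know):
> 
>> # confirm what the running daemon currently has
>> ceph daemon mds.mdsdb config get mds_cache_memory_limit
>> ceph daemon mds.mdsdb config get mds_cache_size
>> 
>> # push the 8GB limit (8589934592 bytes) at runtime if it still shows an old value
>> ceph tell mds.mdsdb injectargs '--mds_cache_memory_limit 8589934592'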
> 
> Because I haven’t seen the Luminous feature bits in the CephFS kernel driver on anything older than (I believe) the 4.13 kernel, everything here is connecting with Jewel capabilities for CephFS.
> 
>> $ ceph features
>> {
>>     "mon": {
>>         "group": {
>>             "features": "0x1ffddff8eea4fffb",
>>             "release": "luminous",
>>             "num": 3
>>         }
>>     },
>>     "mds": {
>>         "group": {
>>             "features": "0x1ffddff8eea4fffb",
>>             "release": "luminous",
>>             "num": 2
>>         }
>>     },
>>     "osd": {
>>         "group": {
>>             "features": "0x1ffddff8eea4fffb",
>>             "release": "luminous",
>>             "num": 74
>>         }
>>     },
>>     "client": {
>>         "group": {
>>             "features": "0x107b84a842aca",
>>             "release": "hammer",
>>             "num": 2
>>         },
>>         "group": {
>>             "features": "0x40107b86a842ada",
>>             "release": "jewel",
>>             "num": 39
>>         },
>>         "group": {
>>             "features": "0x7010fb86aa42ada",
>>             "release": "jewel",
>>             "num": 1
>>         },
>>         "group": {
>>             "features": "0x1ffddff8eea4fffb",
>>             "release": "luminous",
>>             "num": 189
>>         }
>>     }
>> }
> 
> 
> Any help is appreciated.
> 
> Thanks,
> 
> Reed
> 
