[ceph-users] CephFS log jam prevention

Reed Dier reed.dier at focusvq.com
Tue Dec 5 08:07:19 PST 2017


I have been trying to run a fairly large rsync onto a 3x replicated, filestore, HDD-backed CephFS pool.

Luminous 12.2.1 for all daemons, kernel CephFS driver, Ubuntu 16.04 running a mix of 4.8 and 4.10 kernels, 2x10GbE networking between all daemons and clients.

> $ ceph versions
> {
>     "mon": {
>         "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 3
>     },
>     "mgr": {
>         "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 3
>     },
>     "osd": {
>         "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 74
>     },
>     "mds": {
>         "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 2
>     },
>     "overall": {
>         "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 82
>     }
> }

> HEALTH_ERR 1 MDSs report oversized cache; 1 MDSs have many clients failing to respond to cache pressure; 1 MDSs behind on trimming; noout,nodeep-scrub flag(s) set; application not enabled on 1 pool(s); 242 slow requests are blocked > 32 sec; 769378 stuck requests are blocked > 4096 sec
> MDS_CACHE_OVERSIZED 1 MDSs report oversized cache
>     mdsdb(mds.0): MDS cache is too large (23GB/8GB); 1018 inodes in use by clients, 1 stray files
> MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to cache pressure
>     mdsdb(mds.0): Many clients (37) failing to respond to cache pressureclient_count: 37
> MDS_TRIM 1 MDSs behind on trimming
>     mdsdb(mds.0): Behind on trimming (36252/30)max_segments: 30, num_segments: 36252
> OSDMAP_FLAGS noout,nodeep-scrub flag(s) set
> REQUEST_SLOW 242 slow requests are blocked > 32 sec
>     236 ops are blocked > 2097.15 sec
>     3 ops are blocked > 1048.58 sec
>     2 ops are blocked > 524.288 sec
>     1 ops are blocked > 32.768 sec
> REQUEST_STUCK 769378 stuck requests are blocked > 4096 sec
>     91 ops are blocked > 67108.9 sec
>     121258 ops are blocked > 33554.4 sec
>     308189 ops are blocked > 16777.2 sec
>     251586 ops are blocked > 8388.61 sec
>     88254 ops are blocked > 4194.3 sec
>     osds 0,1,3,6,8,12,15,16,17,21,22,23 have stuck requests > 16777.2 sec
>     osds 4,7,9,10,11,14,18,20 have stuck requests > 33554.4 sec
>     osd.13 has stuck requests > 67108.9 sec
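
As a sanity check on the health output above, the per-bucket "ops are blocked" counts can be summed and compared against the totals reported on the REQUEST_SLOW and REQUEST_STUCK lines. The following is only a minimal sketch: the sample text is the relevant lines copied from above with the leading "> " quoting removed, the regular expressions are tailored to exactly this Luminous-era formatting, and the function name is made up for illustration.

```python
import re

# REQUEST_SLOW / REQUEST_STUCK sections copied from the health output,
# with the "> " mail quoting stripped.
HEALTH_DETAIL = """\
REQUEST_SLOW 242 slow requests are blocked > 32 sec
    236 ops are blocked > 2097.15 sec
    3 ops are blocked > 1048.58 sec
    2 ops are blocked > 524.288 sec
    1 ops are blocked > 32.768 sec
REQUEST_STUCK 769378 stuck requests are blocked > 4096 sec
    91 ops are blocked > 67108.9 sec
    121258 ops are blocked > 33554.4 sec
    308189 ops are blocked > 16777.2 sec
    251586 ops are blocked > 8388.61 sec
    88254 ops are blocked > 4194.3 sec
    osds 0,1,3,6,8,12,15,16,17,21,22,23 have stuck requests > 16777.2 sec
"""

# Assumed line shapes; not a general parser for Ceph health output.
HEADER = re.compile(r"^(REQUEST_\w+) (\d+) ")
OPS_LINE = re.compile(r"^\s+(\d+) ops are blocked > [\d.]+ sec$")


def tally_blocked_ops(text):
    """Return {check_name: (reported_total, sum_of_bucket_counts)}."""
    totals = {}
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            totals[current] = [int(header.group(2)), 0]
            continue
        ops = OPS_LINE.match(line)
        if ops and current:
            totals[current][1] += int(ops.group(1))
    return {name: tuple(pair) for name, pair in totals.items()}


if __name__ == "__main__":
    for name, (reported, summed) in tally_blocked_ops(HEALTH_DETAIL).items():
        print(name, reported, summed)
```

For the output pasted here the bucket sums agree with the reported totals, so the histogram is internally consistent, and most of the stuck requests sit in the 8388.61 and 16777.2 second buckets.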

This is across 8 nodes, holding 3x 8TB HDDs each, all backed by Intel P3600 NVMe drives for journaling.
I removed the SSD OSDs from the tree below for brevity.

> $ ceph osd tree
> ID  CLASS WEIGHT    TYPE NAME                         STATUS REWEIGHT PRI-AFF
> -13        87.28799 root ssd
>  -1       174.51500 root default
> -10       174.51500     rack default.rack2
> -55        43.62000         chassis node2425
>  -2        21.81000             host node24
>   0   hdd   7.26999                 osd.0                 up  1.00000 1.00000
>   8   hdd   7.26999                 osd.8                 up  1.00000 1.00000
>  16   hdd   7.26999                 osd.16                up  1.00000 1.00000
>  -3        21.81000             host node25
>   1   hdd   7.26999                 osd.1                 up  1.00000 1.00000
>   9   hdd   7.26999                 osd.9                 up  1.00000 1.00000
>  17   hdd   7.26999                 osd.17                up  1.00000 1.00000
> -56        43.63499         chassis node2627
>  -4        21.81999             host node26
>   2   hdd   7.27499                 osd.2                 up  1.00000 1.00000
>  10   hdd   7.26999                 osd.10                up  1.00000 1.00000
>  18   hdd   7.27499                 osd.18                up  1.00000 1.00000
>  -5        21.81499             host node27
>   3   hdd   7.26999                 osd.3                 up  1.00000 1.00000
>  11   hdd   7.26999                 osd.11                up  1.00000 1.00000
>  19   hdd   7.27499                 osd.19                up  1.00000 1.00000
> -57        43.62999         chassis node2829
>  -6        21.81499             host node28
>   4   hdd   7.26999                 osd.4                 up  1.00000 1.00000
>  12   hdd   7.26999                 osd.12                up  1.00000 1.00000
>  20   hdd   7.27499                 osd.20                up  1.00000 1.00000
>  -7        21.81499             host node29
>   5   hdd   7.26999                 osd.5                 up  1.00000 1.00000
>  13   hdd   7.26999                 osd.13                up  1.00000 1.00000
>  21   hdd   7.27499                 osd.21                up  1.00000 1.00000
> -58        43.62999         chassis node3031
>  -8        21.81499             host node30
>   6   hdd   7.26999                 osd.6                 up  1.00000 1.00000
>  14   hdd   7.26999                 osd.14                up  1.00000 1.00000
>  22   hdd   7.27499                 osd.22                up  1.00000 1.00000
>  -9        21.81499             host node31
>   7   hdd   7.26999                 osd.7                 up  1.00000 1.00000
>  15   hdd   7.26999                 osd.15                up  1.00000 1.00000
>  23   hdd   7.27499                 osd.23                up  1.00000 1.00000

I am trying to figure out what in my configuration is off, because I am told that CephFS should be able to throttle requests to match the underlying storage medium and not create such an extensive log jam.

> [mds]
> mds_cache_size = 0
> mds_cache_memory_limit = 8589934592
> 
> [osd]
> osd_op_threads = 4
> filestore max sync interval = 30
> osd_max_backfills = 10
> osd_recovery_max_active = 16
> osd_op_thread_suicide_timeout = 600

I originally had mds_cache_size set to 10000000, carried over from Jewel, but read that it is now better to set that to 0 and limit the cache through mds_cache_memory_limit instead. So I set that to 8GB to see if it helped any.
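
For reference, the mds_cache_memory_limit value in the [mds] section is given in bytes, and 8589934592 is exactly 8 x 1024^3. A small sketch of that arithmetic, plus the ratio of the cache size to the limit as printed in the MDS_CACHE_OVERSIZED message (23GB/8GB), follows. The helper names are illustrative only, and the ratio simply divides the two figures as reported without assuming how the daemon rounds them.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def gib_to_bytes(gib):
    """Convert a whole number of GiB to bytes, as used in ceph.conf."""
    return gib * GIB


def overage_ratio(reported_size, reported_limit):
    """Ratio of reported cache size to its limit (same unit for both)."""
    return reported_size / reported_limit


# Value from the [mds] section of the config.
MDS_CACHE_MEMORY_LIMIT = 8589934592

if __name__ == "__main__":
    print(gib_to_bytes(8) == MDS_CACHE_MEMORY_LIMIT)
    # Figures taken from "MDS cache is too large (23GB/8GB)".
    print(overage_ratio(23, 8))
```

By the reported figures, the MDS cache is a little under three times the configured limit.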

Because I haven’t seen Luminous capabilities in the CephFS kernel driver on any kernel older than, I believe, 4.13, all of the CephFS kernel clients are using Jewel capabilities.

> $ ceph features
> {
>     "mon": {
>         "group": {
>             "features": "0x1ffddff8eea4fffb",
>             "release": "luminous",
>             "num": 3
>         }
>     },
>     "mds": {
>         "group": {
>             "features": "0x1ffddff8eea4fffb",
>             "release": "luminous",
>             "num": 2
>         }
>     },
>     "osd": {
>         "group": {
>             "features": "0x1ffddff8eea4fffb",
>             "release": "luminous",
>             "num": 74
>         }
>     },
>     "client": {
>         "group": {
>             "features": "0x107b84a842aca",
>             "release": "hammer",
>             "num": 2
>         },
>         "group": {
>             "features": "0x40107b86a842ada",
>             "release": "jewel",
>             "num": 39
>         },
>         "group": {
>             "features": "0x7010fb86aa42ada",
>             "release": "jewel",
>             "num": 1
>         },
>         "group": {
>             "features": "0x1ffddff8eea4fffb",
>             "release": "luminous",
>             "num": 189
>         }
>     }
> }
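
To total the clients per release from the output above, note that the "client" object repeats the key "group" several times. Python's json module keeps only the last value for a repeated key by default, so a plain json.loads would see only the luminous group. The sketch below works around that with object_pairs_hook. The sample is just the client section from above, compacted and with the "> " quoting removed, and the function names are hypothetical.

```python
import json
from collections import Counter

# Client section of the `ceph features` output, compacted by hand.
FEATURES_JSON = """
{
    "client": {
        "group": {"features": "0x107b84a842aca", "release": "hammer", "num": 2},
        "group": {"features": "0x40107b86a842ada", "release": "jewel", "num": 39},
        "group": {"features": "0x7010fb86aa42ada", "release": "jewel", "num": 1},
        "group": {"features": "0x1ffddff8eea4fffb", "release": "luminous", "num": 189}
    }
}
"""


def keep_duplicates(pairs):
    """object_pairs_hook: store every value of a key in a list.

    Every key maps to a list, even when it appears only once, so that
    repeated keys such as "group" are not silently overwritten.
    """
    merged = {}
    for key, value in pairs:
        merged.setdefault(key, []).append(value)
    return merged


def clients_per_release(text):
    """Sum the "num" field of each client group, keyed by release name."""
    doc = json.loads(text, object_pairs_hook=keep_duplicates)
    counts = Counter()
    for section in doc["client"]:
        for group in section["group"]:
            # Each scalar is wrapped in a one-element list by the hook.
            counts[group["release"][0]] += group["num"][0]
    return dict(counts)


if __name__ == "__main__":
    print(clients_per_release(FEATURES_JSON))
```

On the sample this gives 2 hammer, 40 jewel, and 189 luminous client connections.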


Any help is appreciated.

Thanks,

Reed
