[ceph-users] OSD is near full and slow in accessing storage from client

Sébastien VIGNERON sebastien.vigneron at criann.fr
Sun Nov 12 03:14:17 PST 2017


Hi,

Have you tried querying the PG state for some of the stuck or undersized PGs? Some OSD daemons may be misbehaving and blocking the recovery.

ceph pg 3.be query
ceph pg 4.d4 query
ceph pg 4.8c query

http://docs.ceph.com/docs/jewel/rados/troubleshooting/troubleshooting-pg/
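
To avoid typing the PG IDs by hand, something like the following might help. It is only a rough sketch: it assumes "ceph health detail" prints stuck PGs in the same "pg <id> is stuck ..." form as in your output, and the helper name list_undersized_pgs is made up for illustration.

```shell
# Print the IDs of PGs reported as stuck and undersized.
# Reads `ceph health detail` output on stdin, so it also works on saved output.
# Assumption: each stuck PG appears as "pg <id> is stuck ... <state> ...".
list_undersized_pgs() {
    awk '$1 == "pg" && /is stuck/ && /undersized/ { print $2 }'
}

# Hypothetical usage on a live cluster (not run here):
#   ceph health detail | list_undersized_pgs | while read -r pg; do
#       ceph pg "$pg" query > "pg-$pg.json"
#   done
```

The helper only filters text; the queries themselves are left in the commented usage so nothing touches the cluster until you decide to run it.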

Cordialement / Best regards,

Sébastien VIGNERON 
CRIANN, 
Ingénieur / Engineer
Technopôle du Madrillet 
745, avenue de l'Université 
76800 Saint-Etienne du Rouvray - France 
tél. +33 2 32 91 42 91 
fax. +33 2 32 91 42 92 
http://www.criann.fr 
mailto:sebastien.vigneron at criann.fr
support: support at criann.fr

> On 12 Nov 2017, at 10:59, gjprabu <gjprabu at zohocorp.com> wrote:
> 
> Hi Sebastien
> 
> Thanks for your reply. Yes, there are undersized PGs and a recovery in progress, because we added a new OSD after getting the "2 OSDs near full" warning. Yes, the newly added OSD is rebalancing the data.
> 
> 
> [root at intcfs-osd6 ~]# ceph osd df
> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL %USE  VAR  PGS
> 0 3.29749  1.00000  3376G  2875G  501G 85.15 1.26 165
> 1 3.26869  1.00000  3347G  1923G 1423G 57.46 0.85 152
> 2 3.27339  1.00000  3351G  1980G 1371G 59.08 0.88 161
> 3 3.24089  1.00000  3318G  2130G 1187G 64.21 0.95 168
> 4 3.24089  1.00000  3318G  2997G  320G 90.34 1.34 176
> 5 3.32669  1.00000  3406G  2466G  939G 72.42 1.07 165
> 6 3.27800  1.00000  3356G  1463G 1893G 43.60 0.65 166  
> 
> ceph osd crush rule dump
> 
> [
>     {
>         "rule_id": 0,
>         "rule_name": "replicated_ruleset",
>         "ruleset": 0,
>         "type": 1,
>         "min_size": 1,
>         "max_size": 10,
>         "steps": [
>             {
>                 "op": "take",
>                 "item": -1,
>                 "item_name": "default"
>             },
>             {
>                 "op": "chooseleaf_firstn",
>                 "num": 0,
>                 "type": "host"
>             },
>             {
>                 "op": "emit"
>             }
>         ]
>     }
> ]
> 
> 
> ceph version 10.2.2 and ceph version 10.2.9
> 
> 
> ceph osd pool ls detail
> 
> pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
> pool 3 'downloads_data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 250 pgp_num 250 last_change 39 flags hashpspool crash_replay_interval 45 stripe_width 0
> pool 4 'downloads_metadata' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 250 pgp_num 250 last_change 36 flags hashpspool stripe_width 0
> 
> 
> ---- On Sun, 12 Nov 2017 15:04:02 +0530 Sébastien VIGNERON <sebastien.vigneron at criann.fr> wrote ----
> 
> Hi,
> 
> Can you share:
>  - your placement rules: ceph osd crush rule dump
>  - your CEPH version: ceph versions
>  - your pools definitions: ceph osd pool ls detail
> 
> With these we can determine whether your PGs are stuck because of a misconfiguration or something else.
> 
> You seem to have some undersized PGs and a recovery in progress. Have your OSDs shown any rebalancing of your data? Does the usage percentage of your OSDs change over time (changes in "ceph osd df")?
> 
> 
> On 12 Nov 2017, at 10:04, gjprabu <gjprabu at zohocorp.com> wrote:
> 
> Hi Team,
> 
> We have a Ceph setup with 6 OSDs and received an alert that 2 OSDs are near full. Clients are also slow when accessing Ceph. I added a 7th OSD, but 2 OSDs (osd.0 and osd.4) still show as near full. I have restarted the Ceph service on osd.0 and osd.4. Kindly check the Ceph status output below and please suggest a solution.
> 
> 
> # ceph health detail
> HEALTH_WARN 46 pgs backfill_wait; 1 pgs backfilling; 32 pgs degraded; 50 pgs stuck unclean; 32 pgs undersized; recovery 1098780/40253637 objects degraded (2.730%); recovery 3401433/40253637 objects misplaced (8.450%); 2 near full osd(s); mds0: Client integ-hm3 failing to respond to cache pressure; mds0: Client integ-hm8 failing to respond to cache pressure; mds0: Client integ-hm2 failing to respond to cache pressure; mds0: Client integ-hm9 failing to respond to cache pressure; mds0: Client integ-hm5 failing to respond to cache pressure; mds0: Client integ-hm9-bkp failing to respond to cache pressure; mds0: Client me-build1-bkp failing to respond to cache pressure
> 
> pg 3.f6 is stuck unclean for 511223.069161, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 4.f6 is stuck unclean for 511232.770419, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 3.ec is stuck unclean for 510902.815668, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 3.eb is stuck unclean for 511285.576487, current state active+remapped+wait_backfill, last acting [3,0]
> pg 4.17 is stuck unclean for 511235.326709, current state active+undersized+degraded+remapped+wait_backfill, last acting [1]
> pg 4.2f is stuck unclean for 511232.356371, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 4.3d is stuck unclean for 511300.446982, current state active+remapped, last acting [3,0]
> pg 4.93 is stuck unclean for 511295.539229, current state active+undersized+degraded+remapped+wait_backfill, last acting [3]
> pg 3.47 is stuck unclean for 511288.104965, current state active+remapped+wait_backfill, last acting [3,0]
> pg 4.d5 is stuck unclean for 510916.509825, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 3.31 is stuck unclean for 511221.542878, current state active+remapped+wait_backfill, last acting [0,3]
> pg 3.62 is stuck unclean for 511221.551662, current state active+undersized+degraded+remapped+wait_backfill, last acting [4]
> pg 4.4d is stuck unclean for 511232.279602, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 4.48 is stuck unclean for 510911.095367, current state active+remapped+wait_backfill, last acting [5,4]
> pg 3.4f is stuck unclean for 511226.712285, current state active+undersized+degraded+remapped+wait_backfill, last acting [1]
> pg 3.78 is stuck unclean for 511221.531199, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 3.24 is stuck unclean for 510903.483324, current state active+remapped+backfilling, last acting [1,2]
> pg 4.8c is stuck unclean for 511231.668693, current state active+undersized+degraded+remapped+wait_backfill, last acting [1]
> pg 3.b4 is stuck unclean for 511222.612012, current state active+undersized+degraded+remapped+wait_backfill, last acting [0]
> pg 4.41 is stuck unclean for 511287.031264, current state active+remapped+wait_backfill, last acting [3,2]
> pg 3.d1 is stuck unclean for 510903.797329, current state active+remapped+wait_backfill, last acting [0,3]
> pg 3.7f is stuck unclean for 511222.929722, current state active+undersized+degraded+remapped+wait_backfill, last acting [1]
> pg 4.af is stuck unclean for 511262.494659, current state active+undersized+degraded+remapped, last acting [0]
> pg 3.66 is stuck unclean for 510903.296711, current state active+remapped+wait_backfill, last acting [3,0]
> pg 3.76 is stuck unclean for 511224.615144, current state active+undersized+degraded+remapped+wait_backfill, last acting [3]
> pg 4.57 is stuck unclean for 511234.514343, current state active+remapped, last acting [0,4]
> pg 3.69 is stuck unclean for 511224.672085, current state active+undersized+degraded+remapped+wait_backfill, last acting [4]
> pg 3.9a is stuck unclean for 510967.300000, current state active+remapped+wait_backfill, last acting [3,2]
> pg 4.50 is stuck unclean for 510903.825565, current state active+undersized+degraded+remapped+wait_backfill, last acting [1]
> pg 4.53 is stuck unclean for 510921.975268, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 3.e7 is stuck unclean for 511221.530592, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 4.6a is stuck unclean for 510911.284877, current state active+undersized+degraded+remapped+wait_backfill, last acting [0]
> pg 4.16 is stuck unclean for 511232.702762, current state active+undersized+degraded+remapped+wait_backfill, last acting [1]
> pg 3.2c is stuck unclean for 511222.443893, current state active+remapped+wait_backfill, last acting [2,3]
> pg 4.89 is stuck unclean for 511228.846614, current state active+undersized+degraded+remapped+wait_backfill, last acting [4]
> pg 4.39 is stuck unclean for 511239.544231, current state active+remapped+wait_backfill, last acting [3,2]
> pg 4.ce is stuck unclean for 511232.294586, current state active+undersized+degraded+remapped+wait_backfill, last acting [1]
> pg 3.91 is stuck unclean for 511232.341380, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 3.96 is stuck unclean for 510904.043900, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 4.c0 is stuck unclean for 510904.253281, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 4.9c is stuck unclean for 511237.612850, current state active+undersized+degraded+remapped+wait_backfill, last acting [1]
> pg 3.ab is stuck unclean for 510960.756324, current state active+remapped+wait_backfill, last acting [3,2]
> pg 4.aa is stuck unclean for 511229.307559, current state active+remapped+wait_backfill, last acting [0,3]
> pg 3.ad is stuck unclean for 510903.764157, current state active+remapped+wait_backfill, last acting [0,3]
> pg 3.b5 is stuck unclean for 511226.560774, current state active+undersized+degraded+remapped+wait_backfill, last acting [3]
> pg 4.58 is stuck unclean for 510919.273667, current state active+undersized+degraded+remapped+wait_backfill, last acting [1]
> pg 4.b9 is stuck unclean for 511232.760066, current state active+remapped+wait_backfill, last acting [5,4]
> pg 3.be is stuck unclean for 511224.422931, current state active+remapped+wait_backfill, last acting [0,4]
> pg 4.d4 is stuck unclean for 510962.810416, current state active+undersized+degraded+remapped+wait_backfill, last acting [3]
> pg 4.da is stuck unclean for 511259.506962, current state active+undersized+degraded+remapped+wait_backfill, last acting [2]
> pg 4.8c is active+undersized+degraded+remapped+wait_backfill, acting [1]
> pg 3.7f is active+undersized+degraded+remapped+wait_backfill, acting [1]
> pg 3.78 is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 3.76 is active+undersized+degraded+remapped+wait_backfill, acting [3]
> pg 4.6a is active+undersized+degraded+remapped+wait_backfill, acting [0]
> pg 3.69 is active+undersized+degraded+remapped+wait_backfill, acting [4]
> pg 3.66 is active+remapped+wait_backfill, acting [3,0]
> pg 3.62 is active+undersized+degraded+remapped+wait_backfill, acting [4]
> pg 4.58 is active+undersized+degraded+remapped+wait_backfill, acting [1]
> pg 4.50 is active+undersized+degraded+remapped+wait_backfill, acting [1]
> pg 4.53 is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 3.4f is active+undersized+degraded+remapped+wait_backfill, acting [1]
> pg 4.48 is active+remapped+wait_backfill, acting [5,4]
> pg 4.4d is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 3.47 is active+remapped+wait_backfill, acting [3,0]
> pg 4.41 is active+remapped+wait_backfill, acting [3,2]
> pg 3.31 is active+remapped+wait_backfill, acting [0,3]
> pg 4.2f is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 3.24 is active+remapped+backfilling, acting [1,2]
> pg 4.17 is active+undersized+degraded+remapped+wait_backfill, acting [1]
> pg 4.16 is active+undersized+degraded+remapped+wait_backfill, acting [1]
> pg 3.2c is active+remapped+wait_backfill, acting [2,3]
> pg 4.39 is active+remapped+wait_backfill, acting [3,2]
> pg 4.89 is active+undersized+degraded+remapped+wait_backfill, acting [4]
> pg 3.91 is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 4.93 is active+undersized+degraded+remapped+wait_backfill, acting [3]
> pg 3.96 is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 3.9a is active+remapped+wait_backfill, acting [3,2]
> pg 4.9c is active+undersized+degraded+remapped+wait_backfill, acting [1]
> pg 4.af is active+undersized+degraded+remapped, acting [0]
> pg 3.ab is active+remapped+wait_backfill, acting [3,2]
> pg 4.aa is active+remapped+wait_backfill, acting [0,3]
> pg 3.ad is active+remapped+wait_backfill, acting [0,3]
> pg 3.b4 is active+undersized+degraded+remapped+wait_backfill, acting [0]
> pg 3.b5 is active+undersized+degraded+remapped+wait_backfill, acting [3]
> pg 4.b9 is active+remapped+wait_backfill, acting [5,4]
> pg 3.be is active+remapped+wait_backfill, acting [0,4]
> pg 4.c0 is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 4.ce is active+undersized+degraded+remapped+wait_backfill, acting [1]
> pg 3.d1 is active+remapped+wait_backfill, acting [0,3]
> pg 4.d5 is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 4.d4 is active+undersized+degraded+remapped+wait_backfill, acting [3]
> pg 4.da is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 3.e7 is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 3.eb is active+remapped+wait_backfill, acting [3,0]
> pg 3.ec is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 4.f6 is active+undersized+degraded+remapped+wait_backfill, acting [2]
> pg 3.f6 is active+undersized+degraded+remapped+wait_backfill, acting [2]
> recovery 1098780/40253637 objects degraded (2.730%)
> recovery 3401433/40253637 objects misplaced (8.450%)
> osd.0 is near full at 85%
> osd.4 is near full at 90%
> mds0: Client integ-hm3 failing to respond to cache pressure(client_id: 733998)
> mds0: Client integ-hm8 failing to respond to cache pressure(client_id: 843866)
> mds0: Client integ-hm2 failing to respond to cache pressure(client_id: 844939)
> mds0: Client integ-hm9 failing to respond to cache pressure(client_id: 845065)
> mds0: Client integ-hm5 failing to respond to cache pressure(client_id: 845068)
> mds0: Client integ-hm9-bkp failing to respond to cache pressure(client_id: 895898)
> mds0: Client me-build1-bkp failing to respond to cache pressure(client_id: 888666)
> 
> 
> hm ~]# ceph osd tree
> ID WEIGHT   TYPE NAME            UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 22.92604 root default                                          
> -2  3.29749     host intcfs-osd1                                  
> 0  3.29749         osd.0             up  1.00000          1.00000
> -3  3.26869     host intcfs-osd2                                  
> 1  3.26869         osd.1             up  1.00000          1.00000
> -4  3.27339     host intcfs-osd3                                  
> 2  3.27339         osd.2             up  1.00000          1.00000
> -5  3.24089     host intcfs-osd4                                  
> 3  3.24089         osd.3             up  1.00000          1.00000
> -6  3.24089     host intcfs-osd5                                  
> 4  3.24089         osd.4             up  1.00000          1.00000
> -7  3.32669     host intcfs-osd6                                  
> 5  3.32669         osd.5             up  1.00000          1.00000
> -8  3.27800     host intcfs-osd7                                  
> 6  3.27800         osd.6             up  1.00000          1.00000
> 
> 
> hm5 ~]# ceph osd df
> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL %USE  VAR  PGS
> 0 3.29749  1.00000  3376G  2874G  502G 85.13 1.26 165
> 1 3.26869  1.00000  3347G  1922G 1424G 57.44 0.85 152
> 2 3.27339  1.00000  3351G  2009G 1342G 59.95 0.89 162
> 3 3.24089  1.00000  3318G  2130G 1188G 64.19 0.95 168
> 4 3.24089  1.00000  3318G  2996G  321G 90.30 1.34 176
> 5 3.32669  1.00000  3406G  2465G  940G 72.39 1.07 165
> 6 3.27800  1.00000  3356G  1435G 1921G 42.76 0.63 166
>               TOTAL 23476G 15834G 7641G 67.45         
> MIN/MAX VAR: 0.63/1.34  STDDEV: 15.29
> 
> 
> Regards
> Prabu GJ