[ceph-users] 12.2.8: 1 node comes up (noout set), from a 6 nodes cluster -> I/O stuck (rbd usage)

Eugen Block eblock at nde.ag
Fri Oct 19 02:36:29 PDT 2018


Hi Denny,

the recommendation for Ceph maintenance is to set three flags if you  
need to shut down a node (or the entire cluster) [1]:

ceph osd set noout
ceph osd set nobackfill
ceph osd set norecover
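And don't forget to clear them again once the node has rejoined, or the
cluster will stay in a degraded state indefinitely:

```shell
# after maintenance, unset the flags in reverse order
ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset noout
```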

Although the 'noout' flag alone seems to be enough for many maintenance  
tasks, it doesn't prevent the cluster from rebalancing, because some  
PGs become degraded and the cluster tries to recover them.
Of course there are also scenarios where something could go wrong  
during the maintenance and a recovery would be required, but there's  
always some kind of tradeoff, I guess.

Until now we had no problems with maintenance when these flags were  
set, and also no stuck I/O for the clients.
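Before taking the node down you can also double-check that the flags  
are actually active:

```shell
# both commands list the currently set cluster-wide flags
ceph osd dump | grep flags
ceph -s | grep flag
```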

> We never saw this in the past, with the same procedure ...

If this has always worked for you and you suddenly encounter these  
problems for the first time, there could of course be something else  
going on, but I just wanted to point out the importance of those OSD  
flags.

Regards,
Eugen

[1] https://ceph.com/planet/how-to-do-a-ceph-cluster-maintenance-shutdown/


Zitat von Denny Fuchs <linuxmail at 4lin.net>:

> Hi,
>
> today we had an issue with our 6-node Ceph cluster.
>
> We had to shut down one node (Ceph-03) to replace a disk (because  
> we did not know the slot). We set the noout flag and did a graceful  
> shutdown. All was O.K. After the disk was replaced, the node came  
> back up and our VMs saw high I/O latency.
> We never saw this in the past, with the same procedure ...
>
> * From our logs on Ceph-01:
>
> 2018-10-18 15:53:45.455743 mon.qh-a07-ceph-osd-03 mon.2  
> 10.3.0.3:6789/0 1 : cluster [INF] mon.qh-a07-ceph-osd-03 calling  
> monitor election
> ...
> 2018-10-18 15:53:45.503818 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663050 : cluster [INF] mon.qh-a07-ceph-osd-01 is  
> new leader, mons  
> qh-a07-ceph-osd-01,qh-a07-ceph-osd-02,qh-a07-ceph-osd-03,qh-a07-ceph-osd-04,qh-a07-ceph-osd-05,qh
>
> * First OSD comes up:
>
> 2018-10-18 15:53:55.207742 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663063 : cluster [WRN] Health check update: 10 osds  
> down (OSD_DOWN)
> 2018-10-18 15:53:55.207768 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663064 : cluster [INF] Health check cleared:  
> OSD_HOST_DOWN (was: 1 host (11 osds) down)
> 2018-10-18 15:53:55.240079 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663065 : cluster [INF] osd.43 10.3.0.3:6812/7554 boot
>
> * All OSDs were up:
>
> 2018-10-18 15:54:25.331692 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663134 : cluster [INF] Health check cleared:  
> OSD_DOWN (was: 1 osds down)
> 2018-10-18 15:54:25.360151 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663135 : cluster [INF] osd.12 10.3.0.3:6820/8537 boot
>
> * These OSDs are a mix of HDD and SSD across different nodes
>
> 2018-10-18 15:54:27.073266 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663138 : cluster [WRN] Health check update:  
> Degraded data redundancy: 84012/4293867 objects degraded (1.957%),  
> 1316 pgs degraded, 487 pgs undersized (PG_DEGRADED)
> 2018-10-18 15:54:32.073644 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663142 : cluster [WRN] Health check update:  
> Degraded data redundancy: 4611/4293867 objects degraded (0.107%),  
> 1219 pgs degraded (PG_DEGRADED)
> 2018-10-18 15:54:36.841189 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663144 : cluster [WRN] Health check failed: 1 slow  
> requests are blocked > 32 sec. Implicated osds 16 (REQUEST_SLOW)
> 2018-10-18 15:54:37.074098 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663145 : cluster [WRN] Health check update:  
> Degraded data redundancy: 4541/4293867 objects degraded (0.106%),  
> 1216 pgs degraded (PG_DEGRADED)
> 2018-10-18 15:54:42.074510 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663149 : cluster [WRN] Health check update:  
> Degraded data redundancy: 4364/4293867 objects degraded (0.102%),  
> 1176 pgs degraded (PG_DEGRADED)
> 2018-10-18 15:54:42.074561 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663150 : cluster [WRN] Health check update: 5 slow  
> requests are blocked > 32 sec. Implicated osds 15,25,30,34  
> (REQUEST_SLOW)
> 2018-10-18 15:54:47.074886 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663152 : cluster [WRN] Health check update:  
> Degraded data redundancy: 4193/4293867 objects degraded (0.098%),  
> 1140 pgs degraded (PG_DEGRADED)
> 2018-10-18 15:54:47.074934 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663153 : cluster [WRN] Health check update: 5 slow  
> requests are blocked > 32 sec. Implicated osds 9,15,23,30  
> (REQUEST_SLOW)
> 2018-10-18 15:54:52.075274 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663156 : cluster [WRN] Health check update:  
> Degraded data redundancy: 4087/4293867 objects degraded (0.095%),  
> 1120 pgs degraded (PG_DEGRADED)
> 2018-10-18 15:54:52.075313 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663157 : cluster [WRN] Health check update: 14 slow  
> requests are blocked > 32 sec. Implicated osds 2,13,14,15,16,23  
> (REQUEST_SLOW)
> 2018-10-18 15:54:57.075635 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663158 : cluster [WRN] Health check update:  
> Degraded data redundancy: 3932/4293867 objects degraded (0.092%),  
> 1074 pgs degraded (PG_DEGRADED)
> 2018-10-18 15:54:57.075683 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663159 : cluster [WRN] Health check update: 6 slow  
> requests are blocked > 32 sec. Implicated osds 14,15 (REQUEST_SLOW)
> 2018-10-18 15:55:02.076071 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663163 : cluster [WRN] Health check update:  
> Degraded data redundancy: 3805/4293867 objects degraded (0.089%),  
> 1036 pgs degraded (PG_DEGRADED)
> 2018-10-18 15:55:02.076138 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663164 : cluster [WRN] Health check update: 4 slow  
> requests are blocked > 32 sec. Implicated osds 1,15,19 (REQUEST_SLOW)
> 2018-10-18 15:55:07.076562 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663168 : cluster [WRN] Health check update:  
> Degraded data redundancy: 3656/4293867 objects degraded (0.085%),  
> 988 pgs degraded (PG_DEGRADED)
> 2018-10-18 15:55:07.076633 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663169 : cluster [WRN] Health check update: 4 slow  
> requests are blocked > 32 sec. Implicated osds 15,22 (REQUEST_SLOW)
> 2018-10-18 15:55:12.077091 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663174 : cluster [WRN] Health check update:  
> Degraded data redundancy: 3509/4293867 objects degraded (0.082%),  
> 934 pgs degraded (PG_DEGRADED)
> 2018-10-18 15:55:12.077130 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663175 : cluster [WRN] Health check update: 1 slow  
> requests are blocked > 32 sec. Implicated osds 15 (REQUEST_SLOW)
> 2018-10-18 15:55:13.317758 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663178 : cluster [INF] Health check cleared:  
> REQUEST_SLOW (was: 1 slow requests are blocked > 32 sec. Implicated  
> osds 15)
>
> ....
>
> 2018-10-18 15:55:17.077685 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663180 : cluster [WRN] Health check update:  
> Degraded data redundancy: 3351/4293867 objects degraded (0.078%),  
> 886 pgs degraded, 1 pg undersized (PG_DEGRADED)
> 2018-10-18 15:55:22.078077 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663184 : cluster [WRN] Health check update:  
> Degraded data redundancy: 3202/4293867 objects degraded (0.075%),  
> 831 pgs degraded, 3 pgs undersized (PG_DEGRADED)
>
> * And then, all went back to work:
>
> 2018-10-18 16:00:22.876849 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663322 : cluster [WRN] Health check failed: Reduced  
> data availability: 4 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
> 2018-10-18 16:00:24.898816 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663326 : cluster [WRN] Health check failed:  
> 102307/4289049 objects misplaced (2.385%) (OBJECT_MISPLACED)
> 2018-10-18 16:00:28.995821 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663327 : cluster [INF] Health check cleared:  
> PG_AVAILABILITY (was: Reduced data availability: 3 pgs peering)
> 2018-10-18 16:00:31.026616 mon.qh-a07-ceph-osd-01 mon.0  
> 10.3.0.1:6789/0 1663330 : cluster [WRN] Health check update:  
> 156337/4294191 objects misplaced (3.641%) (OBJECT_MISPLACED)
>
>
> A few mins later, we added the replaced disks again (on ceph-osd-06  
> and on ceph-osd-03) without issues. The stuck I/O started as soon as  
> the node was in the cluster again (15:53:45 ...); we have no clue  
> why, and it never happened in the past.
>
> We have per OSD node:
> * 2 x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
> * 64GB DDR4 ECC
> * HDD: WD Red 1TB 2.5" with Intel NVMe SSDPEDMD400G4 as journal device
> * SSD: Samsung 950 EVO 512GB
> * 2 x Symbios Logic SAS3224 PCI-Express Fusion-MPT SAS-3 (one for  
> SSD, one for HDD)
> * Mellanox 100GB Ceph network (CX-4)
> * 6 Mon (on Ceph nodes)
>
>
> Ceph.conf
>
> [global]
> 	 osd journal size = 5120
> 	 osd pool default min size = 2
> 	 osd pool default size = 3
>
> [osd]
>     keyring = /var/lib/ceph/osd/ceph-$id/keyring
>     osd max backfills = 20
>     osd recovery max active = 8
>
> no special CRUSH rule, just the HDD / SSD one.
>
>
> * ceph osd pool get ssd-pool pg_num
> pg_num: 2048
>
> * root@qh-a07-ceph-osd-01:[~]: ceph osd pool get sata-pool pg_num
> pg_num: 1024
>
> * root@qh-a07-ceph-osd-01:[~]: ceph osd pool get db-pool pg_num
> pg_num: 1024
>
> * ceph -s
>
> data:
>     pools:   3 pools, 4096 pgs
>     objects: 1.43M objects, 5.39TiB
>     usage:   15.7TiB used, 33.4TiB / 49.1TiB avail
>     pgs:     4096 active+clean
>
>
> Any suggestions on how to avoid this stuck I/O next time?
>
> cu denny




