[ceph-users] why sudden (and brief) HEALTH_ERR

Piotr Dałek piotr.dalek at corp.ovh.com
Wed Oct 4 00:08:35 PDT 2017


On 17-10-04 08:51 AM, lists wrote:
> Hi,
> 
> Yesterday I chowned our /var/lib/ceph to ceph, to completely finalize our 
> Jewel migration, and noticed something interesting.
> 
> After I brought back up the OSDs I just chowned, the system had some 
> recovery to do. During that recovery, the system went to HEALTH_ERR for a 
> short moment:
> 
> See below for consecutive ceph -s outputs:
> 
>> [..]
>> root at pm2:~# ceph -s
>>     cluster 1397f1dc-7d94-43ea-ab12-8f8792eee9c1
>>      health HEALTH_ERR
>>             2 pgs are stuck inactive for more than 300 seconds

^^ that.

>>             761 pgs degraded
>>             2 pgs recovering
>>             181 pgs recovery_wait
>>             2 pgs stuck inactive
>>             273 pgs stuck unclean
>>             543 pgs undersized
>>             recovery 1394085/8384166 objects degraded (16.628%)
>>             4/24 in osds are down
>>             noout flag(s) set
>>      monmap e3: 3 mons at 
>> {0=10.10.89.1:6789/0,1=10.10.89.2:6789/0,2=10.10.89.3:6789/0}
>>             election epoch 256, quorum 0,1,2 0,1,2
>>      osdmap e10230: 24 osds: 20 up, 24 in; 543 remapped pgs
>>             flags noout,sortbitwise,require_jewel_osds
>>       pgmap v36531146: 1088 pgs, 2 pools, 10703 GB data, 2729 kobjects
>>             32724 GB used, 56656 GB / 89380 GB avail
>>             1394085/8384166 objects degraded (16.628%)
>>                  543 active+undersized+degraded
>>                  310 active+clean
>>                  181 active+recovery_wait+degraded
>>                   26 active+degraded
>>                   13 active
>>                    9 activating+degraded
>>                    4 activating
>>                    2 active+recovering+degraded
>> recovery io 133 MB/s, 37 objects/s
>>   client io 64936 B/s rd, 9935 kB/s wr, 0 op/s rd, 942 op/s wr
>> [..]
> It was only very brief, and it did worry me a bit, but fortunately we went 
> back to the expected HEALTH_WARN very quickly, and everything finished fine, 
> so I guess there is nothing to worry about.
> 
> But I'm curious: can anyone explain WHY we got a brief HEALTH_ERR?
> 
> No SMART errors, and apply and commit latencies are all within the expected 
> ranges; the system basically is healthy.
> 
> Curious :-)

Since Jewel (AFAIR), when (re)starting OSDs, PG status is reset to "never 
contacted", resulting in "pgs are stuck inactive for more than 300 seconds" 
being reported until the OSDs re-establish connections between themselves.
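
If you want to confirm that this is all it was (i.e. no PGs were actually 
unavailable beyond that brief window), something along these lines should show 
it. This is only a rough sketch against the jewel-era CLI; osd.3 is purely an 
example id, and the restart part assumes a systemd host with the default 
/var/lib/ceph layout:

    # show exactly which PGs trip the ERR threshold, and why
    ceph health detail
    ceph pg dump_stuck inactive

    # to keep that window small next time: with noout already set,
    # chown and restart one OSD (or one host) at a time, e.g. for osd.3
    systemctl stop ceph-osd@3
    chown -R ceph:ceph /var/lib/ceph/osd/ceph-3
    systemctl start ceph-osd@3

Once the restarted OSDs have peered again, the "stuck inactive" message should 
clear on its own, just as you saw.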

-- 
Piotr Dałek
piotr.dalek at corp.ovh.com
https://www.ovh.com/us/

