[ceph-users] why sudden (and brief) HEALTH_ERR
Dan van der Ster
dan at vanderster.com
Wed Oct 4 00:38:35 PDT 2017
On Wed, Oct 4, 2017 at 9:08 AM, Piotr Dałek <piotr.dalek at corp.ovh.com> wrote:
> On 17-10-04 08:51 AM, lists wrote:
>> Yesterday I chowned our /var/lib/ceph ceph, to completely finalize our
>> jewel migration, and noticed something interesting.
>> After I brought back up the OSDs I just chowned, the system had some
>> recovery to do. During that recovery, the system went to HEALTH_ERR for a
>> short moment:
>> See below, for consecutive ceph -s outputs:
>>> root at pm2:~# ceph -s
>>> cluster 1397f1dc-7d94-43ea-ab12-8f8792eee9c1
>>> health HEALTH_ERR
>>> 2 pgs are stuck inactive for more than 300 seconds
> ^^ that.
>>> 761 pgs degraded
>>> 2 pgs recovering
>>> 181 pgs recovery_wait
>>> 2 pgs stuck inactive
>>> 273 pgs stuck unclean
>>> 543 pgs undersized
>>> recovery 1394085/8384166 objects degraded (16.628%)
>>> 4/24 in osds are down
>>> noout flag(s) set
>>> monmap e3: 3 mons at
>>> election epoch 256, quorum 0,1,2 0,1,2
>>> osdmap e10230: 24 osds: 20 up, 24 in; 543 remapped pgs
>>> flags noout,sortbitwise,require_jewel_osds
>>> pgmap v36531146: 1088 pgs, 2 pools, 10703 GB data, 2729 kobjects
>>> 32724 GB used, 56656 GB / 89380 GB avail
>>> 1394085/8384166 objects degraded (16.628%)
>>> 543 active+undersized+degraded
>>> 310 active+clean
>>> 181 active+recovery_wait+degraded
>>> 26 active+degraded
>>> 13 active
>>> 9 activating+degraded
>>> 4 activating
>>> 2 active+recovering+degraded
>>> recovery io 133 MB/s, 37 objects/s
>>> client io 64936 B/s rd, 9935 kB/s wr, 0 op/s rd, 942 op/s wr
>> It was only very brief, but it did worry me a bit. Fortunately, we went
>> back to the expected HEALTH_WARN very quickly and everything finished fine,
>> so I guess there's nothing to worry about.
>> But I'm curious: can anyone explain WHY we got a brief HEALTH_ERR?
>> No SMART errors, and apply and commit latencies are all within the expected
>> ranges; the system is basically healthy.
>> Curious :-)
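(For context, the per-host procedure described above is presumably the usual
Jewel ownership migration. A rough sketch, assuming a systemd-based install;
the unit names and the noout handling are assumptions, not taken from the
message:

    ceph osd set noout                 # avoid rebalancing while this host's OSDs are down
    systemctl stop ceph-osd.target     # stop the OSDs on this host
    chown -R ceph:ceph /var/lib/ceph   # hand ownership from root to the ceph user
    systemctl start ceph-osd.target    # bring the OSDs back; recovery kicks in
    ceph osd unset noout               # once the cluster settles again
)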
> Since Jewel (AFAIR), when (re)starting OSDs, pg status is reset to "never
> contacted", resulting in "pgs are stuck inactive for more than 300 seconds"
> being reported until the OSDs re-establish connections between themselves.
Also, the last_active state isn't updated very regularly, as far as I can tell.
On our cluster I have increased this timeout, which helps suppress these
bogus HEALTH_ERRs.
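The "more than 300 seconds" in that warning comes from the monitor's
mon_pg_stuck_threshold option (default 300 seconds). A sketch of how one might
raise it; the value 1800 is only an illustrative choice, not necessarily what
was used on the cluster above:

    # persistently, in ceph.conf on the monitors:
    [mon]
    mon_pg_stuck_threshold = 1800

    # or injected at runtime, without restarting the mons:
    ceph tell mon.* injectargs '--mon_pg_stuck_threshold 1800'

To see which PGs the monitor currently considers stuck, "ceph health detail"
or "ceph pg dump_stuck inactive" will list them.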