[ceph-users] Upgrade to Infernalis: OSDs crash all the time
kees at nefos.nl
Sun Nov 11 23:50:13 PST 2018
Between crashes we were able to allow the cluster to backfill as much as
possible (all monitors Infernalis, OSDs being Hammer again).
Leftover PGs wouldn't backfill until we removed files such as:
> 8.0M -rw-r--r-- 1 root root 8.0M Aug 24 23:56
> 8.0M -rw-r--r-- 1 root root 8.0M Aug 28 05:51
> 8.0M -rw-r--r-- 1 root root 8.0M Aug 30 03:40
> 8.0M -rw-r--r-- 1 root root 8.0M Aug 31 03:46
> 8.0M -rw-r--r-- 1 root root 8.0M Sep 5 19:44
> 8.0M -rw-r--r-- 1 root root 8.0M Sep 6 14:44
> 8.0M -rw-r--r-- 1 root root 8.0M Sep 7 10:21
Restarting the given OSD didn't seem necessary; backfilling started to
work and at some point enough replicas were available for each PG.
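The leftover files above all share the same uniform 8.0M size, which is what made them stand out. As a minimal sketch (not the author's actual procedure), candidates like these could be located by size; the filestore PG directory path in the comment is an assumption, not from the post:

```shell
# Hypothetical helper: list regular files of 8 MiB under a given directory,
# e.g. a filestore PG directory like /var/lib/ceph/osd/ceph-<id>/current
# (that path is an assumption; adjust for your deployment).
find_8m_files() {
    # GNU find: -size 8M matches files whose size, rounded up to 1 MiB
    # units, equals 8 -- i.e. anything from just over 7 MiB up to 8 MiB.
    find "$1" -type f -size 8M
}

# Example: find_8m_files /var/lib/ceph/osd/ceph-0/current/8.0_head
```

Review any matches by hand before deleting anything; removing the wrong objects from an OSD's data directory can destroy data.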
Finally deep scrubbing repaired the inconsistent PGs automagically and
we arrived at HEALTH_OK again!
Case closed: up to Jewel.
For everyone involved: a big, big and even bigger thank you for all the
pointers and support!
On 10-09-18 16:43, Kees Meijs wrote:
> A little update: meanwhile we added a new node consisting of Hammer OSDs
> to ensure sufficient cluster capacity.
> The upgraded node with Infernalis OSDs has been completely removed from the
> CRUSH map and its OSDs removed from the cluster (obviously we didn't wipe the disks yet).
> At the moment we're still running with the flags
> noout,nobackfill,noscrub,nodeep-scrub set. Although only Hammer OSDs
> remain, we still experience OSD crashes during backfilling, so we're unable
> to reach the HEALTH_OK state.
> Using debug level 20 we're (well, mostly my coworker Willem Jan is) figuring
> out exactly why the crashes happen. Hopefully we'll figure it out.
> To be continued...
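For completeness: once the crashes were resolved, the flags mentioned in the quoted update had to be cleared again so backfill and scrubbing could resume. A hedged sketch of that step (the flag names are from the post; clearing them one at a time in this order is a suggestion, not the author's documented procedure):

```shell
# Print the ceph commands that would clear each recovery/scrub flag in turn;
# drop the "echo" to actually run them against a live cluster.
for flag in nobackfill noscrub nodeep-scrub noout; do
    echo "ceph osd unset $flag"
done
```

Clearing `nobackfill` first lets recovery make progress before scrubbing load is added back; `noout` goes last so OSDs aren't marked out while the cluster is still settling.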