[ceph-users] Upgrade to Infernalis: OSDs crash all the time

Kees Meijs kees at nefos.nl
Sun Nov 11 23:50:13 PST 2018


Hi list,

Between crashes we were able to let the cluster backfill as much as 
possible (all monitors on Infernalis, OSDs back on Hammer).

Leftover PGs wouldn't backfill until we removed files such as:

> 8.0M -rw-r--r-- 1 root root 8.0M Aug 24 23:56 temp\u3.bd\u0\u16175417\u2718__head_000000BD__fffffffffffffffb
> 8.0M -rw-r--r-- 1 root root 8.0M Aug 28 05:51 temp\u3.bd\u0\u16175417\u3992__head_000000BD__fffffffffffffffb
> 8.0M -rw-r--r-- 1 root root 8.0M Aug 30 03:40 temp\u3.bd\u0\u16175417\u4521__head_000000BD__fffffffffffffffb
> 8.0M -rw-r--r-- 1 root root 8.0M Aug 31 03:46 temp\u3.bd\u0\u16175417\u4817__head_000000BD__fffffffffffffffb
> 8.0M -rw-r--r-- 1 root root 8.0M Sep  5 19:44 temp\u3.bd\u0\u16175417\u6252__head_000000BD__fffffffffffffffb
> 8.0M -rw-r--r-- 1 root root 8.0M Sep  6 14:44 temp\u3.bd\u0\u16175417\u6593__head_000000BD__fffffffffffffffb
> 8.0M -rw-r--r-- 1 root root 8.0M Sep  7 10:21 temp\u3.bd\u0\u16175417\u6870__head_000000BD__fffffffffffffffb

Restarting the given OSD didn't seem necessary; backfilling started to 
work and at some point enough replicas were available for each PG.
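
For anyone facing similar leftovers, here is a rough sketch of how files 
like the ones listed above could be found and reviewed before removing 
them. It is not what we actually ran. The directory argument is 
hypothetical (pass in whatever PG directory on the OSD holds the files), 
and the only thing the script relies on is the literal "temp\u" name 
prefix seen in the listing. It defaults to a dry run.

```python
import os
from pathlib import Path

# Literal backslash followed by "u", as in the file names listed above.
# Assumption: leftover temp objects can be recognised by this prefix alone.
TEMP_PREFIX = "temp\\u"


def find_temp_objects(pg_dir):
    """Return (path, size, mtime) tuples for matching files, oldest first."""
    hits = []
    for root, _dirs, files in os.walk(pg_dir):
        for name in files:
            if name.startswith(TEMP_PREFIX):
                path = Path(root) / name
                st = path.stat()
                hits.append((path, st.st_size, st.st_mtime))
    return sorted(hits, key=lambda hit: hit[2])


def remove_temp_objects(pg_dir, dry_run=True):
    """Return the matching paths; delete them only when dry_run is False."""
    selected = []
    for path, _size, _mtime in find_temp_objects(pg_dir):
        if not dry_run:
            path.unlink()
        selected.append(path)
    return selected


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python find_temp.py <pg directory>
    for p in remove_temp_objects(sys.argv[1], dry_run=True):
        print(p)
```

Review the dry-run output carefully and keep a copy of the files before 
calling remove_temp_objects() with dry_run=False.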

Finally, deep scrubbing repaired the inconsistent PGs automagically and 
we arrived at HEALTH_OK again!

Case closed: on to Jewel.

For everyone involved: a big, big and even bigger thank you for all 
pointers and support!

Regards,
Kees

On 10-09-18 16:43, Kees Meijs wrote:
> A little update: meanwhile we added a new node consisting of Hammer OSDs
> to ensure sufficient cluster capacity.
>
> The upgraded node with Infernalis OSDs is completely removed from the
> CRUSH map and the OSDs removed (obviously we didn't wipe the disks yet).
>
> At the moment we're still running with the flags
> noout,nobackfill,noscrub,nodeep-scrub set. Although only Hammer OSDs
> remain now, we still experience OSD crashes during backfilling, so
> we're unable to reach HEALTH_OK.
>
> With debug level 20 enabled, we (mostly my coworker Willem Jan) are
> figuring out exactly why the crashes happen. Hopefully we'll find out.
>
> To be continued...
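
The flags mentioned in the quoted update are toggled with "ceph osd set" 
and "ceph osd unset", one flag per call. A minimal sketch of a helper 
that builds those commands follows. The function names are made up for 
illustration, and by default nothing is executed; the commands are only 
returned so they can be reviewed first.

```python
import subprocess

# The four flags named in the quoted update.
FLAGS = ["noout", "nobackfill", "noscrub", "nodeep-scrub"]


def flag_commands(action, flags=FLAGS):
    """Build one 'ceph osd set|unset <flag>' command per flag."""
    if action not in ("set", "unset"):
        raise ValueError("action must be 'set' or 'unset'")
    return [["ceph", "osd", action, flag] for flag in flags]


def apply_flags(action, flags=FLAGS, run=False):
    """Return the commands; execute them only when run=True."""
    commands = flag_commands(action, flags)
    if run:
        for cmd in commands:
            # Requires the ceph CLI and a suitable keyring on this host.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    for cmd in apply_flags("unset"):
        print(" ".join(cmd))
```

Unsetting nobackfill before the cause of the crashes is understood would 
let backfilling (and the crashes) resume, so the order in which flags 
are lifted is worth thinking about.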


