[ceph-users] No recovery when "norebalance" flag set

Stefan Kooman stefan at bit.nl
Mon Nov 26 03:32:07 PST 2018


Quoting Dan van der Ster (dan at vanderster.com):
> Haven't seen that exact issue.
> 
> One thing to note though is that if osd_max_backfills is set to 1,
> then it can happen that PGs get into backfill state, taking that
> single reservation on a given OSD, and therefore the recovery_wait PGs
> can't get a slot.
> I suppose that backfill prioritization is supposed to prevent this,
> but in my experience luminous v12.2.8 doesn't always get it right.

That's also our experience. Even if the degraded PGs in backfill /
recovery state are given a higher priority (forced), normal backfilling
still takes place.

> So next time I'd try injecting osd_max_backfills = 2 or 3 to kickstart
> the recovering PGs.

It was still on "1" indeed. We tend to crank that (and max recovery) up
while keeping an eye on max read and write apply latency. In our setup we
can do 16 backfills concurrently, and/or 2 recovery / 4 backfills.
Recovery speeds are ~ 4 - 5 GB/s ... pushing it beyond that tends to
crash OSDs.

We'll try your suggestion next time.
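
For the archive, a minimal sketch of how the injection Dan suggests could
be scripted. It only wraps the standard "ceph tell ... injectargs" call;
the helper names (injectargs_cmd, apply), the "osd.*" target and the
dry-run default are assumptions for illustration, not something we run in
production as-is.

```python
import shlex
import subprocess


def injectargs_cmd(max_backfills, target="osd.*"):
    """Build the argv that injects osd_max_backfills at runtime.

    Hypothetical helper: 'target' defaults to all OSDs ("osd.*"),
    which is an assumption; a single OSD such as "osd.12" works too.
    """
    if max_backfills < 1:
        raise ValueError("osd_max_backfills must be >= 1")
    return [
        "ceph", "tell", target, "injectargs",
        f"--osd_max_backfills {max_backfills}",
    ]


def apply(cmd, dry_run=True):
    """Print the command (dry run) or execute it against the cluster."""
    if dry_run:
        # Show a copy-pasteable, shell-quoted version of the command.
        print(shlex.join(cmd))
        return None
    # Raises CalledProcessError if the ceph CLI returns non-zero.
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run only: raise the limit from 1 to 2 to free a slot
    # for the recovery_wait PGs.
    apply(injectargs_cmd(2))
```

Injected values do not survive an OSD restart, so this is only meant as a
temporary kick; watch apply latency while it is in effect and set it back
to the old value afterwards.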

Thanks,

Stefan

-- 
| BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info at bit.nl

