[ceph-users] fixing another remapped+incomplete EC 4+2 pg
gta at umn.edu
Tue Oct 9 11:14:25 PDT 2018
On 10/9/2018 12:19 PM, Gregory Farnum wrote:
> On Wed, Oct 3, 2018 at 10:18 AM Graham Allan <gta at umn.edu> wrote:
> However I have one pg which is stuck in state remapped+incomplete
> because it has only 4 out of 6 osds running, and I have been unable to
> bring the missing two back into service.
> > PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg incomplete
> >     pg 70.82d is remapped+incomplete, acting [2147483647,190,448,61,315]
> >     (reducing pool .rgw.buckets.ec42 min_size from 5 may help;
> >     search ceph.com/docs for 'incomplete')
> I don't think I want to do anything with min_size as that would make
> other pgs vulnerable to running dangerously undersized (unless there is
> any way to force that state for only a single pg). It seems to me that
> with 4/6 osds available, it should maybe be possible to force ceph to
> select one or two new osds to rebalance this pg to?
> I think unfortunately the easiest thing for you to fix this will be to
> set the min_size back to 4 until the PG is recovered (or at least has 5
> shards done). This will be fixed in a later version of Ceph and probably
> backported, but sadly it's not done yet.
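For reference, the workaround Greg describes amounts to something like the following (pool name taken from the health output above; worth checking the current value before and after):

```shell
# Check the pool's current min_size (5 for this EC 4+2 pool)
ceph osd pool get .rgw.buckets.ec42 min_size

# Temporarily allow I/O and recovery with only 4 shards present
ceph osd pool set .rgw.buckets.ec42 min_size 4

# ...wait for the pg to recover, or at least gain a 5th shard...

# Restore the safer setting afterwards
ceph osd pool set .rgw.buckets.ec42 min_size 5
```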
Thanks Greg, though sadly I've tried that; whatever I do, one of the 4
osds involved will simply crash (not just the ones I previously tried to
re-import via ceph-objectstore-tool). I just spent time chasing them
around, never succeeding in keeping a complete set running long enough to
make progress. They seem to crash when starting backfill on the next
object. There has to be something in the current set of shards which they
choke on.
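For anyone following along, the shard re-import I mentioned was along these lines (the osd ids and shard number here are illustrative, and the daemons must be stopped while their object stores are touched):

```shell
# Stop the source osd before touching its object store
systemctl stop ceph-osd@448

# Export this osd's shard of pg 70.82d; EC shards are addressed
# as <pgid>s<shard>, e.g. 70.82ds2 for shard 2
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-448 \
    --pgid 70.82ds2 --op export --file /root/70.82ds2.export

# Import it into another (also stopped) osd
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-315 \
    --op import --file /root/70.82ds2.export
```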
Since then I've been focusing on trying to get the pg to revert to an
earlier interval using osd_find_best_info_ignore_history_les, though the
information I find around it is minimal.
Most sources seem to suggest setting it for the primary osd and then
either marking it down or restarting it, but that simply seems to result
in the osd disappearing from the pg. After setting this flag for all of
the "acting" osds (most recent interval), the pg switched to having its
acting set == "up" set, but it is still "incomplete" (it has not reverted
to the set of osds from an earlier interval). Still stuck on
"peering_blocked_by_history_les_bound" at present.
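Concretely, what I've been doing is along these lines (per-osd, using the acting primary osd.190 from the set above as the example; injectargs may report that the option requires a restart to take effect):

```shell
# Set the option at runtime on one osd
ceph tell osd.190 injectargs '--osd_find_best_info_ignore_history_les=true'

# or persist it in ceph.conf on the osd's host, e.g.
#   [osd.190]
#       osd_find_best_info_ignore_history_les = true
# then restart the daemon so it takes effect
systemctl restart ceph-osd@190
```

Either way the option should be reverted once (if) the pg peers, since it deliberately weakens peering's consistency checks.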
I'm guessing that I actually need to set the flag
osd_find_best_info_ignore_history_les for *all* osds involved in the
historical record of this pg (the "probing osds" list?), and restart them.
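That probing list can be read out of a pg query, something like the following (the exact layout of the recovery_state section varies a little by release):

```shell
# Show the osds the pg is probing while stuck peering; the flag
# would then need setting (and a restart) on each of them
ceph pg 70.82d query | \
    jq '.recovery_state[]
        | select(.name == "Started/Primary/Peering")
        | .probing_osds'
```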
Still also trying to understand exactly how the flag works. I think I
see now that the "_les" bit must refer to "last epoch started"...
Minnesota Supercomputing Institute - gta at umn.edu