[ceph-users] fixing another remapped+incomplete EC 4+2 pg

Gregory Farnum gfarnum at redhat.com
Mon Oct 15 11:44:50 PDT 2018


On Thu, Oct 11, 2018 at 3:22 PM Graham Allan <gta at umn.edu> wrote:

> As the osd crash implies, setting "nobackfill" appears to let all the
> osds keep running and the pg stays active and can apparently serve data.
>
> If I track down the object referenced below in the object store, I can
> download it without error via s3... though as I can't generate a
> matching etag, it may well be corrupt.
>
> Still I do wonder if deleting this object - either via s3, or maybe more
> likely directly within filestore, might permit backfill to continue.
>

Yes, that is very likely! (...unless there are a bunch of other objects
with the same issue.)

I'm not immediately familiar with the crash asserts you're seeing, but it
certainly looks like somehow the object data didn't quite get stored
correctly as the metadata understands it. Perhaps a write got lost/missed
on m+1 of the PG shards, setting the osd_find_best_info_ignore_history_les
option caused it to try to recover from what it had rather than following
normal recovery procedures, and now it's not working.
-Greg
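(The etag comparison mentioned above can be reproduced offline. A minimal
sketch, not radosgw code, of how S3-style multipart ETags are typically
computed: md5-of-concatenated-part-md5s plus a "-N" part-count suffix. The
part size used by the original multipart upload is an assumption you would
need to know; it is usually the client's multipart chunk size, e.g. 16MB
for the `.16_` shadow objects seen here.)

```python
import hashlib

def multipart_etag(data: bytes, part_size: int) -> str:
    """Compute an S3-style ETag for `data` uploaded with parts of `part_size`.

    Single-part uploads get a plain md5 hex digest; multipart uploads get
    md5(concat(md5(part_i))) followed by "-<number of parts>".
    """
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    if len(parts) == 1:
        return hashlib.md5(parts[0]).hexdigest()
    combined = hashlib.md5(b"".join(hashlib.md5(p).digest() for p in parts))
    return f"{combined.hexdigest()}-{len(parts)}"
```

Comparing this against the ETag radosgw reports for the downloaded object
is one way to judge whether the recovered bytes are intact, provided the
original part size is known.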


>
> Of the other objects referenced in "last_backfill" for each osd in the pg,
> both also download via s3, but one I have seen implicated in similar
> crash logs for other OSDs and the etag again does not match; the other I
> have not seen in crash logs and does generate a matching etag.
>
> Opened a tracker issue for this: http://tracker.ceph.com/issues/36411
>
> Graham
>
> On 10/09/2018 06:55 PM, Graham Allan wrote:
> >
> > On 10/09/2018 01:14 PM, Graham Allan wrote:
> >>
> >> On 10/9/2018 12:19 PM, Gregory Farnum wrote:
> >>>
> >>> I think unfortunately the easiest thing for you to fix this will be
> >>> to set the min_size back to 4 until the PG is recovered (or at least
> >>> has 5 shards done). This will be fixed in a later version of Ceph and
> >>> probably backported, but sadly it's not done yet.
> >>> -Greg
> >
> >> Thanks Greg, though sadly I've tried that; whatever I do, one of the 4
> >> osds involved will simply crash (not just the ones I previously tried
> >> to re-import via ceph-objectstore-tool). I just spent time chasing
> >> them around but never succeeding in having a complete set run long
> >> enough to make progress. They seem to crash when starting backfill on
> >> the next object. There has to be something in the current set of
> >> shards which it can't handle.
> >>
> >> Since then I've been focusing on trying to get the pg to revert to an
> >> earlier interval using osd_find_best_info_ignore_history_les, though
> >> the information I find around it is minimal.
> >
> > Since my experiments with osd_find_best_info_ignore_history_les have not
> > borne any results, I'm looking again at the osd crashes when I get
> > enough of them running for backfill to start.
> >
> > They all crash in the same way; with "debug osd=10", the very last bit is:
> >
> >>     -2> 2018-10-09 16:27:25.425004 7faa866bd700 10 osd.190 pg_epoch:
> >> 710808 pg[70.82ds2( v 710799'704745 (586066'698574,710799'704745]
> >> local-lis/les=710807/710808 n=102929 ec=21494/21494 lis/c
> >> 710807/588565 les/c/f 710808/588566/0 710711/710807/710807)
> >> [820,761,105,789,562,485]/[2147483647,2147483647,190,448,61,315]p190(2) r=2
> >> lpr=710807 pi=[588565,710807)/39 rops=1
> >> bft=105(2),485(5),562(4),761(1),789(3),820(0) crt=710799'704745
> >> lcod 0'0 mlcod 0'0
> >> active+undersized+degraded+remapped+backfilling] continue_recovery_op:
> >> continuing
> >>
> RecoveryOp(hoid=70:b415ca12:::default.10341087.1__shadow_Subjects%2fsub-NDARINV6BVVAY29%2fses-baselineYear1Arm1%2ffunc%2frun-04_AROMA%2fdenoised_func_data_nonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16_3:head
>
> >> v=579167'695462 missing_on=105(2),485(5),562(4),761(1),789(3),820(0)
> >> missing_on_shards=0,1,2,3,4,5 recovery_inf
> >>
> o=ObjectRecoveryInfo(70:b415ca12:::default.10341087.1__shadow_Subjects%2fsub-NDARINV6BVVAY29%2fses-baselineYear1Arm1%2ffunc%2frun-04_AROMA%2fdenoised_func_data_nonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16_3:head at 579167'695462,
>
> >> size: 0, copy_subset: [], clone_subset: {}, snapset: 0=[]:[])
> >> recovery_progress=ObjectRecoveryProgress(first, data_recovered_to:0,
> >> data_complete:false, omap_recovered_to:, omap_complete:true, error
> >> :false) obc refcount=3 state=READING waiting_on_pushes=
> >> extent_requested=0,8388608)
> >>     -1> 2018-10-09 16:27:25.425105 7faa866bd700 10 osd.190 pg_epoch:
> >> 710808 pg[70.82ds2( v 710799'704745 (586066'698574,710799'704745]
> >> local-lis/les=710807/710808 n=102929 ec=21494/21494 lis/c
> >> 710807/588565 les/c/f 710808/588566/0 710711/710807/710807)
> >> [820,761,105,789,562,485]/[2147483647,2147483647,190,448,61,315]p190(2) r=2
> >> lpr=710807 pi=[588565,710807)/39 rops=1
> >> bft=105(2),485(5),562(4),761(1),789(3),820(0) crt=710799'704745
> >> lcod 0'0 mlcod 0'0
> >> active+undersized+degraded+remapped+backfilling] continue_recovery_op:
> >> before_progress=ObjectRecoveryProgress(first, data_recovered_to:0,
> >> data_complete:false, omap_recovered_to:, omap_complete:true,
> >> error:false), after_progress=ObjectRecoveryProgress(!first,
> >> data_recovered_to:0, data_complete:true, omap_recovered_to:,
> >> omap_complete:true, error:false), pop.data.length()=786432, size=0
> >>      0> 2018-10-09 16:27:25.430239 7faa866bd700 -1
> >> /build/ceph-12.2.8/src/osd/ECBackend.cc: In function 'void
> >> ECBackend::continue_recovery_op(ECBackend::RecoveryOp&,
> >> RecoveryMessages*)' thread 7faa866bd700 time 2018-10-09 16:27:25.425135
> >> /build/ceph-12.2.8/src/osd/ECBackend.cc: 619: FAILED
> >> assert(pop.data.length() ==
> >> sinfo.aligned_logical_offset_to_chunk_offset(
> >> after_progress.data_recovered_to -
> >> op.recovery_progress.data_recovered_to))
> >
> > It sounds like we're getting an unexpected data chunk size, presumably
> > due to some kind of data corruption.
> >
> > Though if I search on all acting OSDs (up or down) for the object on
> > which backfill crashed, I do find all shards and they do at least have a
> > consistent size... perhaps that size itself is different from expected.
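(For reference, the failed assert above checks that the bytes read from a
shard match the chunk-space length implied by the object's logical
recovery progress. A minimal sketch of that arithmetic, not Ceph source,
assuming k=4 data chunks as in this 4+2 profile and stripe-aligned
offsets; `aligned_logical_offset_to_chunk_offset` is the function named in
the assert, reimplemented here for illustration only:)

```python
K = 4  # data chunks in the 4+2 erasure-code profile

def aligned_logical_offset_to_chunk_offset(logical_offset):
    """Map a stripe-aligned logical offset/length to its per-shard equivalent:
    each of the K data shards holds 1/K of the object's logical bytes."""
    assert logical_offset % K == 0
    return logical_offset // K

# What the log shows: recovery_info records "size: 0", so the expected
# chunk-space read length is 0 -- yet the shards each returned 786432
# bytes (pop.data.length()=786432), which is exactly the mismatch the
# assert trips on.
expected = aligned_logical_offset_to_chunk_offset(0)  # "size: 0" in the log
actual = 786432                                       # pop.data.length()
assert expected != actual
```

This is consistent with the shards on disk being 786432 bytes each while
the object's metadata believes the object is empty.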
> >
> > osd 233:
> >> root at ceph01:/var/lib/ceph/osd/ceph-233/current# find 70.82ds1_head
> >> -name '*NDARINV6BVVAY29*' -type f -ls 23624833062  772 -rw-r--r--   1
> >> root     root       786432 Sep 26 15:32
> >>
> 70.82ds1_head/DIR_D/DIR_2/DIR_8/DIR_A/DIR_3/default.10341087.1\\u\\ushadow\\uSubjects\\ssub-NDARINV6BVVAY29\\sses-baselineYear1Arm1\\sfunc\\srun-04\\uAROMA\\sdenoised\\ufunc\\udata\\unonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16\\u3__head_4853A82D__46_ffffffffffffffff_1
>
> >>
> > osd 190:
> >> root at ceph07:/var/lib/ceph/osd/ceph-190/current# find 70.82ds2_head
> >> -name '*NDARINV6BVVAY29*' -type f -ls 23710021684  772 -rw-r--r--   1
> >> root     root       786432 Jun 27 07:13
> >>
> 70.82ds2_head/DIR_D/DIR_2/DIR_8/DIR_A/DIR_3/DIR_5/default.10341087.1\\u\\ushadow\\uSubjects\\ssub-NDARINV6BVVAY29\\sses-baselineYear1Arm1\\sfunc\\srun-04\\uAROMA\\sdenoised\\ufunc\\udata\\unonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16\\u3__head_4853A82D__46_ffffffffffffffff_2
>
> >>
> > osd 448:
> >> root at ceph09:/var/lib/ceph/osd/ceph-448/current# find 70.82ds3_head
> >> -name '*NDARINV6BVVAY29*' -type f -ls 3259920428  772 -rw-r--r--   1
> >> root     root       786432 Jun 27 07:13
> >>
> 70.82ds3_head/DIR_D/DIR_2/DIR_8/DIR_A/DIR_3/DIR_5/default.10341087.1\\u\\ushadow\\uSubjects\\ssub-NDARINV6BVVAY29\\sses-baselineYear1Arm1\\sfunc\\srun-04\\uAROMA\\sdenoised\\ufunc\\udata\\unonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16\\u3__head_4853A82D__46_ffffffffffffffff_3
>
> >>
> > osd 315:
> >> root at ceph03:/var/lib/ceph/osd/ceph-315/current# find 70.82ds5_head
> >> -name '*NDARINV6BVVAY29*' -type f -ls 64439371813  772 -rw-r--r--   1
> >> root     root       786432 Jun 27 07:13
> >>
> 70.82ds5_head/DIR_D/DIR_2/DIR_8/DIR_A/DIR_3/DIR_5/default.10341087.1\\u\\ushadow\\uSubjects\\ssub-NDARINV6BVVAY29\\sses-baselineYear1Arm1\\sfunc\\srun-04\\uAROMA\\sdenoised\\ufunc\\udata\\unonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16\\u3__head_4853A82D__46_ffffffffffffffff_5
>
> >>
> > osd 98:
> >> root at ceph05:/var/lib/ceph/osd/ceph-98/current# find 70.82ds0_head
> >> -name '*NDARINV6BVVAY29*' -type f -ls 2428799806  768 -rw-r--r--   1
> >> root     root       786432 Jun 27 07:13
> >>
> 70.82ds0_head/DIR_D/DIR_2/DIR_8/DIR_A/DIR_3/DIR_5/default.10341087.1\\u\\ushadow\\uSubjects\\ssub-NDARINV6BVVAY29\\sses-baselineYear1Arm1\\sfunc\\srun-04\\uAROMA\\sdenoised\\ufunc\\udata\\unonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16\\u3__head_4853A82D__46_ffffffffffffffff_0
>
> >>
> > osd 61:
> >> root at ceph04:~# ls -l
> >>
> /var/lib/ceph/osd/ceph-61/current/70.82ds4_head/DIR_D/DIR_2/DIR_8/DIR_A/DIR_3/DIR_5/*NDARINV6BVVAY29*
>
> >>
> >> -rw-r--r-- 1 root root 786432 Jun 27 07:13
> >>
> /var/lib/ceph/osd/ceph-61/current/70.82ds4_head/DIR_D/DIR_2/DIR_8/DIR_A/DIR_3/DIR_5/default.10341087.1\u\ushadow\uSubjects\ssub-NDARINV6BVVAY29\sses-baselineYear1Arm1\sfunc\srun-04\uAROMA\sdenoised\ufunc\udata\unonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16\u3__head_4853A82D__46_ffffffffffffffff_4
>
> >>
> >
> >
> > Highly motivated to keep looking for a solution... around 1PB of data in
> > this pool!
> >
> > Graham
>
> --
> Graham Allan
> Minnesota Supercomputing Institute - gta at umn.edu
>
