[ceph-users] How to just delete PGs stuck incomplete on EC pool

Maks Kowalik maks_kowalik at poczta.fm
Mon Mar 11 06:29:37 PDT 2019


Hello Daniel,

I don't think you can avoid the tedious job of manual cleanup.
The only other option is to delete the whole pool (ID 18).

Manual cleanup means taking every OSD listed in "probing_osds", stopping
them one at a time, and removing the shards of PGs 18.1e and 18.c from each
(using ceph-objectstore-tool).
Afterwards you need to restart these OSDs with
osd_find_best_info_ignore_history_les set to true.
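
A rough sketch of the per-OSD steps, written as a dry-run shell function that
only prints the commands so they can be reviewed before anything is deleted.
Everything beyond the tool name and the PG IDs is an assumption: the helper
name plan_shard_removal is made up, OSDs are assumed to be systemd-managed
with the default data path /var/lib/ceph/osd/ceph-<id>, and
osd_find_best_info_ignore_history_les = true is assumed to already be in the
[osd] section of ceph.conf before the OSD is started again. On Luminous,
"--op remove" is, as far as I know, refused without --force. The shard
suffixes in the example are guessed from the positions in the acting sets
quoted below; the list-pgs output shows what is really on the disk.

```shell
#!/bin/sh
# Dry run: prints the commands for one OSD instead of executing them.
# plan_shard_removal is a hypothetical helper, not part of Ceph.
plan_shard_removal() {
    osd_id="$1"   # numeric OSD id taken from "probing_osds"
    shift         # remaining arguments are shard names such as 18.cs2
    data_path="/var/lib/ceph/osd/ceph-${osd_id}"   # assumed default path

    echo "systemctl stop ceph-osd@${osd_id}"
    # List the PGs stored on this OSD first, to confirm the exact shard names.
    echo "ceph-objectstore-tool --data-path ${data_path} --op list-pgs"
    for shard in "$@"; do
        echo "ceph-objectstore-tool --data-path ${data_path} --pgid ${shard} --op remove --force"
    done
    echo "systemctl start ceph-osd@${osd_id}"
}

# Example: OSD 58 is in the acting set of both PGs; the shard suffixes are
# placeholders to be checked against the list-pgs output.
plan_shard_removal 58 18.cs2 18.1es3
```

Once the printed plan looks right, the commands can be run by hand, one OSD
at a time, waiting for each OSD to come back up before moving to the next.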

Kind regards,
Maks Kowalik





Mon, 4 Mar 2019 at 17:05 Daniel K <sathackr at gmail.com> wrote:

> Thanks for the suggestions.
>
> I've tried both -- setting osd_find_best_info_ignore_history_les = true and
> restarting all OSDs, as well as 'ceph osd force-create-pg' -- but both PGs
> still show incomplete:
>
> PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs incomplete
>     pg 18.c is incomplete, acting [32,48,58,40,13,44,61,59,30,27,43,37]
> (reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs
> for 'incomplete')
>     pg 18.1e is incomplete, acting [50,49,41,58,60,46,52,37,34,63,57,16]
> (reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs
> for 'incomplete')
>
>
> The OSDs in down_osds_we_would_probe have already been marked lost
>
> When I ran the force-create-pg command, they went to peering for a few
> seconds, but then went back to incomplete.
>
> Updated ceph pg 18.1e query https://pastebin.com/XgZHvJXu
> Updated ceph pg 18.c query https://pastebin.com/N7xdQnhX
>
> Any other suggestions?
>
>
>
> Thanks again,
>
> Daniel
>
>
>
> On Sat, Mar 2, 2019 at 3:44 PM Paul Emmerich <paul.emmerich at croit.io>
> wrote:
>
>> On Sat, Mar 2, 2019 at 5:49 PM Alexandre Marangone
>> <a.marangone at gmail.com> wrote:
>> >
>> > If you have no way to recover the drives, you can try to restart the
>> OSDs with `osd_find_best_info_ignore_history_les = true` (revert it
>> afterwards); you'll lose data. If the PGs are down after this, you can
>> mark the OSDs blocking the PGs from becoming active as lost.
>>
>> This should work for PG 18.1e, but not for 18.c. Try running "ceph osd
>> force-create-pg <pgid>" to reset the PGs instead.
>> Data will obviously be lost afterwards.
>>
>> Paul
>>
>> >
>> > On Sat, Mar 2, 2019 at 6:08 AM Daniel K <sathackr at gmail.com> wrote:
>> >>
>> >> They all just started having read errors. Bus resets. Slow reads.
>> Which is one of the reasons the cluster didn't recover fast enough to
>> compensate.
>> >>
>> >> I tried to be mindful of the drive type and specifically avoided the
>> larger capacity Seagates that are SMR. Used 1 SM863 for every 6 drives for
>> the WAL.
>> >>
>> >> Not sure why they failed. The data isn't critical at this point, just
>> need to get the cluster back to normal.
>> >>
>> >> On Sat, Mar 2, 2019, 9:00 AM <jesper at krogh.cc> wrote:
>> >>>
>> >>> Did they break, or did something go wrong trying to replace them?
>> >>>
>> >>> Jesper
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> Saturday, 2 March 2019, 14.34 +0100 from Daniel K <sathackr at gmail.com
>> >:
>> >>>
>> >>> I bought the wrong drives trying to be cheap. They were 2TB WD Blue
>> 5400rpm 2.5 inch laptop drives.
>> >>>
>> >>> They've been replaced now with HGST 10K 1.8TB SAS drives.
>> >>>
>> >>>
>> >>>
>> >>> On Sat, Mar 2, 2019, 12:04 AM <jesper at krogh.cc> wrote:
>> >>>
>> >>>
>> >>>
>> >>> Saturday, 2 March 2019, 04.20 +0100 from sathackr at gmail.com <
>> sathackr at gmail.com>:
>> >>>
>> >>> 56 OSD, 6-node 12.2.5 cluster on Proxmox
>> >>>
>> >>> We had multiple drives fail (about 30%) within a few days of each
>> other, likely faster than the cluster could recover.
>> >>>
>> >>>
>> >>> How did so many drives break?
>> >>>
>> >>> Jesper
>> >>
>> >> _______________________________________________
>> >> ceph-users mailing list
>> >> ceph-users at lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>>
>