[ceph-users] Help! OSDs across the cluster just crashed

Brett Chancellor bchancellor at salesforce.com
Wed Oct 3 09:44:16 PDT 2018


That turned out to be exactly the issue (And boy was it fun clearing pgs
out on 71 OSDs). I think it's caused by a combination of two factors.
1. This cluster has way too many placement groups per OSD (just north of
800). It was fine when we first created all the pools, but upgrades (most
recently to luminous 12.2.4) have cemented the fact that a high PG:OSD
ratio is a bad thing.
2. We had a host in a failed state for an extended period of time. That
host finally coming online is what triggered the event. The system dug
itself into a hole it couldn't get out of.
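For context, the ratio Brett mentions can be eyeballed from the PGS column of `ceph osd df`. A minimal sketch of the arithmetic, with three invented per-OSD counts standing in for the values a live cluster would print:

```shell
# Average the per-OSD PG counts. The three numbers here are invented
# stand-ins; on a real cluster you would feed in the PGS column of
# `ceph osd df` instead. Anything far above the usual ~100-200 guidance
# is worth reducing.
printf '812\n797\n805\n' | awk '{sum += $1; n++} END {printf "%.0f\n", sum / n}'
```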

-Brett

On Wed, Oct 3, 2018 at 11:49 AM Gregory Farnum <gfarnum at redhat.com> wrote:

> Yeah, don't run these commands blind. They are changing the local metadata
> of the PG in ways that may make it inconsistent with the overall cluster
> and result in lost data.
>
> Brett, it seems this issue has come up several times in the field but we
> haven't been able to reproduce it locally or get enough info to debug
> what's going on: https://tracker.ceph.com/issues/21142
> Maybe run through that ticket and see if you can contribute new logs or
> add detail about possible sources?
> -Greg
>
> On Tue, Oct 2, 2018 at 3:18 PM Goktug Yildirim <goktug.yildirim at gmail.com>
> wrote:
>
>> Hi,
>>
>> Sorry to hear that. I’ve been battling with mine for 2 weeks :/
>>
>> I corrected my OSDs with the following commands. My OSD logs
>> (/var/log/ceph/ceph-OSDx.log) have a line containing log(ERR) with the
>> affected PG number next to it, just before the crash dump.
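A sketch of pulling that PG id out of the log; the excerpt below is invented for illustration (the real file is /var/log/ceph/ceph-osd.<id>.log and the exact message wording varies by release):

```shell
# Hypothetical two-line log excerpt: an [ERR] line naming a PG,
# followed by the crash line.
cat > /tmp/osd.log <<'EOF'
2018-10-02 21:19:10.000000 7f57ab5b7d80 -1 log_channel(cluster) log [ERR] : 3.1f past_interval bound mismatch
2018-10-02 21:19:16.990369 7f57ab5b7d80 -1 *** Caught signal (Aborted) **
EOF
# Extract the PG id (pool.hex) from the [ERR] line preceding the dump.
sed -nE 's/.*\[ERR\][^0-9]*([0-9]+\.[0-9a-f]+).*/\1/p' /tmp/osd.log
```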
>>
>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op trim-pg-log --pgid $2
>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op fix-lost --pgid $2
>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op repair --pgid $2
>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op mark-complete --pgid $2
>> systemctl restart ceph-osd@$1
>>
>> I don't know if it will work for you, but it should do no harm to try it
>> on a single OSD.
>>
>> There is very little documentation on these tools, so this may be risky.
>> I hope someone more experienced can help further.
>>
>>
>> > On 2 Oct 2018, at 23:23, Brett Chancellor <bchancellor at salesforce.com>
>> wrote:
>> >
>> > Help. I have a 60-node cluster and most of the OSDs decided to crash
>> themselves at the same time. They won't restart; the messages look like...
>> >
>> > --- begin dump of recent events ---
>> >      0> 2018-10-02 21:19:16.990369 7f57ab5b7d80 -1 *** Caught signal
>> (Aborted) **
>> >  in thread 7f57ab5b7d80 thread_name:ceph-osd
>> >
>> >  ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b)
>> luminous (stable)
>> >  1: (()+0xa3c611) [0x556d618bb611]
>> >  2: (()+0xf6d0) [0x7f57a885e6d0]
>> >  3: (gsignal()+0x37) [0x7f57a787f277]
>> >  4: (abort()+0x148) [0x7f57a7880968]
>> >  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x284) [0x556d618fa6e4]
>> >  6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t
>> const&)+0x3b2) [0x556d615c74a2]
>> >  7: (PastIntervals::check_new_interval(int, int, std::vector<int,
>> std::allocator<int> > const&, std::vector<int, std::allocator<int> >
>> const&, int, int, std::vector<int, std::allocator<int> > const&,
>> std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int,
>> std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t,
>> IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380)
>> [0x556d615ae6c0]
>> >  8: (OSD::build_past_intervals_parallel()+0x9ff) [0x556d613707af]
>> >  9: (OSD::load_pgs()+0x545) [0x556d61373095]
>> >  10: (OSD::init()+0x2169) [0x556d613919d9]
>> >  11: (main()+0x2d07) [0x556d61295dd7]
>> >  12: (__libc_start_main()+0xf5) [0x7f57a786b445]
>> >  13: (()+0x4b53e3) [0x556d613343e3]
>> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> needed to interpret this.
>> >
>> >
>> > Some hosts have no working OSDs; others seem to have one working and two
>> dead. It's spread all across the cluster, across several different racks.
>> Any idea where to look next? The cluster is dead in the water right now.
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users at lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>


More information about the ceph-users mailing list