[ceph-users] OSDs crash after deleting unfound object in Luminous 12.2.8

Mike Lovell mike.lovell at endurance.com
Thu Oct 18 10:10:57 PDT 2018


re-adding the list.

i'm glad to hear you got things back to a working state. one thing you
might want to check is the hit_set_history in the pg data. if the missing
hit sets are no longer in the history, then it is probably safe to go back
to the normal builds. that is until you have to mark another hit set
missing. :)  i think the code that removes the hit set from the pg data is
before that assert so its possible it still removed it from the history.

mike

On Thu, Oct 18, 2018 at 9:11 AM Lawrence Smith <
lawrence.smith at uni-muenster.de> wrote:

> Hi Mike,
>
> Thanks a bunch for your writeup, that was exactly the problem and
> solution! All i did was comment out the assert and ad an if(obc){ } after
> to make sure i don't run into a segfault, and now the cluster is healthy
> once again. I am not sure if ceph will register a mismatch in a byte count
> while scrubbing due to the missing object, but I don't think so.
>
> Anyway, I just wanted to thank you for your help!
>
> Best wishes,
>
> Lawrence
>
> On 10/13/2018 02:00 AM, Mike Lovell wrote:
>
> what was the object name that you marked lost? was it one of the cache
> tier hit_sets?
>
> the trace you have does seem to be failing when the OSD is trying to
> remove a hit set that is no longer needed. i ran into a similar problem
> which might have been why that bug you listed was created. maybe providing
> what i have since discovered about hit sets might help.
>
> the hit sets are what the cache tier uses to know which objects have been
> accessed in a given period of time. these hit sets are then stored in the
> object store using an object name that is generated. for the version you're
> running, the code for that generation is at
> https://github.com/ceph/ceph/blob/v12.2.8/src/osd/PrimaryLogPG.cc#L12667.
> its bascially "hit_set_<pgid>_archive_<begin time>_<end time>" where the
> times are recorded in the hit set history. that hit set history is stored
> as part of the PG metadata. you can get a list of all of the hit sets the
> PG has by looking at 'ceph pg <pgid> query' and looking at the
> ['info']['hit_set_history']['history'] array. each entry in that array has
> the information on each hit set for the PG and the times are what is used
> in generation of the object name. there should be one ceph object for each
> hit set listed in that array.
>
> if you told the cluster to mark one of the hit set objects as lost, its
> possible the OSD cannot get that object and is hitting the assert(obc) near
> the end of PrimaryLogPG::hit_set_trim in the same source file referenced
> above. you can potentially verify this by a couple methods. i think if you
> set debug_osd at 20 that it should log a line saying something like
> "hit_set_trim removing hit_set_<pgid>_archive....." if that name matches
> one of the ones you marked lost, then is this almost certainly the cause.
> you can also do a find on the OSD directory, if you're using file store,
> and look for the right file name. something like 'find
> /var/lib/ceph/osd/ceph-<osd num>/current/<pgid>_head -name
> hit\*set\*<pgid>\*archive\*' should work. include the \ to escape the * so
> bash doesn't interpret it. if you're using bluestore, i think you can use
> the ceph-objectstore-tool while the osd is stopped to get a list of
> objects. you'll probably want to only look in the .ceph-internal namespace
> since the hit sets are stored in that namespace.
>
> there are a couple potential ways to get around this. what we did when we
> had the problem was run a custom build of the ceph-osd where we commented
> out the assert(obc); line in hit_set_trim. that build was only run for long
> enough to get the cluster back online and then to flush and evict the
> entire cache, remove the cache, restart using the normal ceph builds, and
> then recreate the cache.
>
> the other options are things that i don't know for sure if they'll work.
> if you're using file store, you might be able to just copy another hit set
> to the file name of the missing hit set object. this should be pretty
> benign and its just going to remove the object in a moment anyways. also,
> i'm not entirely sure how to come up with what directory to put the object
> in if the osd has done any directory splitting. maybe someone on the list
> will know how to do this. there might be a way with the
> ceph-objectstore-tool to write in the object but i couldn't find one in my
> testing on hammer.
>
> the last option i can think of, is that if you can completely stop any
> traffic to the pools in question, its possible the OSDs wont crash.
> hit_set_trim doesn't appear to get called if there is no client traffic
> reaching the osds and the hit sets aren't being updated. if you can stop
> anything from using the pools in question and guarantee nothing will come
> in, then it might be possible to keep the OSDs long enough to flush
> everything from the cache tier, remove it, and recreate it. this option
> seems like a long shot and i don't know for sure it'll work. it just seemed
> to me like the OSDs would stay up in a similar scenario on my hammer test
> cluster. its possible things have changed in luminous and hit_set_trim gets
> called more often. i also didn't test whether the process of flushing and
> evicting the objects in the cache caused hit_set_trim to get called.
>
> hopefully that gives you some more info on what might be going on and ways
> around it. i'm not entirely sure why there is still the assert(obj); in
> hit_set_trim still. there was a little bit of discussion about removing it
> since it means the object its trying to remove is gone anyways. i think
> that just happened for a little bit in irc. i guess it didn't happen cause
> no one followed up on it.
>
> good luck and hopefully you don't blame me if things get worse. :)
> mike
>
> On Fri, Oct 12, 2018 at 7:34 AM Lawrence Smith <
> lawrence.smith at uni-muenster.de> wrote:
>
>> Hi all,
>>
>> we are running a luminous 12.2.8 cluster with a 3 fold replicated cache
>> pool with a min_size of 2. We recently encountered an "object unfound"
>> error in one of our pgs in this pool. After marking this object lost,
>> two of the acting osds crashed and were unable to start up again, with
>> only the primary osd staying up. Hoping the cluster might remap the
>> copies of this pg, we marked the two crashed osds as out. Now the
>> primary osd of this pg has also gone down leaving again only one active
>> osd with the cluster reporting a degraded filesystem. All the affected
>> osds are running filestore, while about half the cluster has already
>> been upgraded to run bluestore osds.
>>
>> All three of the crashed osds fail to restart, reporting the following
>> error during startup:
>>
>> Oct 12 13:19:12 kaa-109 ceph-osd[166266]:      0> 2018-10-12
>> 13:19:12.782652 7f1f2d79b700 -1
>> /var/tmp/portage/sys-cluster/ceph-12.2.8/work/ceph-12.2.8/src/osd/PrimaryLogPG.cc:
>>
>> In function '
>> void PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned
>> int)' thread 7f1f2d79b700 time 2018-10-12 13:19:12.779813
>> /var/tmp/portage/sys-cluster/ceph-12.2.8/work/ceph-12.2.8/src/osd/PrimaryLogPG.cc:
>>
>> 12985: FAILED assert(obc)
>>
>>   ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0)
>> luminous (stable)
>>   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x109) [0x562265bfda9c]
>>   2:
>> (PrimaryLogPG::hit_set_trim(std::unique_ptr<PrimaryLogPG::OpContext,
>> std::default_delete<PrimaryLogPG::OpContext> >&, unsigned int)+0x942)
>> [0x5622657d6cea]
>>   3: (PrimaryLogPG::hit_set_persist()+0xa4b) [0x5622657e5fab]
>>   4: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x426a)
>> [0x562265800c64]
>>   5: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
>> ThreadPool::TPHandle&)+0xc1f) [0x5622657b94ed]
>>   6: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
>> boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x396)
>> [0x562265655cf8]
>>   7: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest>
>> const&)+0x5a) [0x5622658c09a6]
>>   8: (OSD::ShardedOpWQ::_process(unsigned int,
>> ceph::heartbeat_handle_d*)+0x1ab6) [0x562265657918]
>>   9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5cd)
>> [0x562265c026f5]
>>   10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
>> [0x562265c05e72]
>>   11: (()+0x751e) [0x7f1f4fd7f51e]
>>   12: (clone()+0x3f) [0x7f1f4ed7ef0f]
>>
>> A search in the bug tracker revealed that a similar error has been
>> resolved for jewel http://tracker.ceph.com/issues/19223, yet I don't
>> know if this is in any way relevant.
>>
>> We are currently at a loss how to get these osds back up. Any
>> suggestions how to approach this would be very welcome. If there is any
>> further information that is needed or additional context please let me
>> know.
>>
>> Thanks,
>>
>> Lawrence
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20181018/d632004a/attachment.html>


More information about the ceph-users mailing list