[ceph-users] OSDs crash after deleting unfound object in Luminous 12.2.8

Mike Lovell mike.lovell at endurance.com
Fri Oct 12 17:00:16 PDT 2018

what was the object name that you marked lost? was it one of the cache tier

the trace you have does seem to be failing when the OSD is trying to remove
a hit set that is no longer needed. i ran into a similar problem which
might have been why that bug you listed was created. maybe providing what i
have since discovered about hit sets might help.

the hit sets are what the cache tier uses to know which objects have been
accessed in a given period of time. these hit sets are then stored in the
object store using an object name that is generated. for the version you're
running, the code for that generation is at
its bascially "hit_set_<pgid>_archive_<begin time>_<end time>" where the
times are recorded in the hit set history. that hit set history is stored
as part of the PG metadata. you can get a list of all of the hit sets the
PG has by looking at 'ceph pg <pgid> query' and looking at the
['info']['hit_set_history']['history'] array. each entry in that array has
the information on each hit set for the PG and the times are what is used
in generation of the object name. there should be one ceph object for each
hit set listed in that array.

if you told the cluster to mark one of the hit set objects as lost, its
possible the OSD cannot get that object and is hitting the assert(obc) near
the end of PrimaryLogPG::hit_set_trim in the same source file referenced
above. you can potentially verify this by a couple methods. i think if you
set debug_osd at 20 that it should log a line saying something like
"hit_set_trim removing hit_set_<pgid>_archive....." if that name matches
one of the ones you marked lost, then is this almost certainly the cause.
you can also do a find on the OSD directory, if you're using file store,
and look for the right file name. something like 'find
/var/lib/ceph/osd/ceph-<osd num>/current/<pgid>_head -name
hit\*set\*<pgid>\*archive\*' should work. include the \ to escape the * so
bash doesn't interpret it. if you're using bluestore, i think you can use
the ceph-objectstore-tool while the osd is stopped to get a list of
objects. you'll probably want to only look in the .ceph-internal namespace
since the hit sets are stored in that namespace.

there are a couple potential ways to get around this. what we did when we
had the problem was run a custom build of the ceph-osd where we commented
out the assert(obc); line in hit_set_trim. that build was only run for long
enough to get the cluster back online and then to flush and evict the
entire cache, remove the cache, restart using the normal ceph builds, and
then recreate the cache.

the other options are things that i don't know for sure if they'll work. if
you're using file store, you might be able to just copy another hit set to
the file name of the missing hit set object. this should be pretty benign
and its just going to remove the object in a moment anyways. also, i'm not
entirely sure how to come up with what directory to put the object in if
the osd has done any directory splitting. maybe someone on the list will
know how to do this. there might be a way with the ceph-objectstore-tool to
write in the object but i couldn't find one in my testing on hammer.

the last option i can think of, is that if you can completely stop any
traffic to the pools in question, its possible the OSDs wont crash.
hit_set_trim doesn't appear to get called if there is no client traffic
reaching the osds and the hit sets aren't being updated. if you can stop
anything from using the pools in question and guarantee nothing will come
in, then it might be possible to keep the OSDs long enough to flush
everything from the cache tier, remove it, and recreate it. this option
seems like a long shot and i don't know for sure it'll work. it just seemed
to me like the OSDs would stay up in a similar scenario on my hammer test
cluster. its possible things have changed in luminous and hit_set_trim gets
called more often. i also didn't test whether the process of flushing and
evicting the objects in the cache caused hit_set_trim to get called.

hopefully that gives you some more info on what might be going on and ways
around it. i'm not entirely sure why there is still the assert(obj); in
hit_set_trim still. there was a little bit of discussion about removing it
since it means the object its trying to remove is gone anyways. i think
that just happened for a little bit in irc. i guess it didn't happen cause
no one followed up on it.

good luck and hopefully you don't blame me if things get worse. :)

On Fri, Oct 12, 2018 at 7:34 AM Lawrence Smith <
lawrence.smith at uni-muenster.de> wrote:

> Hi all,
> we are running a luminous 12.2.8 cluster with a 3 fold replicated cache
> pool with a min_size of 2. We recently encountered an "object unfound"
> error in one of our pgs in this pool. After marking this object lost,
> two of the acting osds crashed and were unable to start up again, with
> only the primary osd staying up. Hoping the cluster might remap the
> copies of this pg, we marked the two crashed osds as out. Now the
> primary osd of this pg has also gone down leaving again only one active
> osd with the cluster reporting a degraded filesystem. All the affected
> osds are running filestore, while about half the cluster has already
> been upgraded to run bluestore osds.
> All three of the crashed osds fail to restart, reporting the following
> error during startup:
> Oct 12 13:19:12 kaa-109 ceph-osd[166266]:      0> 2018-10-12
> 13:19:12.782652 7f1f2d79b700 -1
> /var/tmp/portage/sys-cluster/ceph-12.2.8/work/ceph-12.2.8/src/osd/PrimaryLogPG.cc:
> In function '
> void PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned
> int)' thread 7f1f2d79b700 time 2018-10-12 13:19:12.779813
> /var/tmp/portage/sys-cluster/ceph-12.2.8/work/ceph-12.2.8/src/osd/PrimaryLogPG.cc:
> 12985: FAILED assert(obc)
>   ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0)
> luminous (stable)
>   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x109) [0x562265bfda9c]
>   2:
> (PrimaryLogPG::hit_set_trim(std::unique_ptr<PrimaryLogPG::OpContext,
> std::default_delete<PrimaryLogPG::OpContext> >&, unsigned int)+0x942)
> [0x5622657d6cea]
>   3: (PrimaryLogPG::hit_set_persist()+0xa4b) [0x5622657e5fab]
>   4: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x426a)
> [0x562265800c64]
>   5: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
> ThreadPool::TPHandle&)+0xc1f) [0x5622657b94ed]
>   6: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x396)
> [0x562265655cf8]
>   7: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest>
> const&)+0x5a) [0x5622658c09a6]
>   8: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x1ab6) [0x562265657918]
>   9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5cd)
> [0x562265c026f5]
>   10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x562265c05e72]
>   11: (()+0x751e) [0x7f1f4fd7f51e]
>   12: (clone()+0x3f) [0x7f1f4ed7ef0f]
> A search in the bug tracker revealed that a similar error has been
> resolved for jewel http://tracker.ceph.com/issues/19223, yet I don't
> know if this is in any way relevant.
> We are currently at a loss how to get these osds back up. Any
> suggestions how to approach this would be very welcome. If there is any
> further information that is needed or additional context please let me
> know.
> Thanks,
> Lawrence
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20181012/aaf33f1d/attachment.html>

More information about the ceph-users mailing list