[ceph-users] OSD Random Failures - Latest Luminous

David Turner drakonstein at gmail.com
Sat Nov 18 06:18:52 PST 2017


Does letting the cluster run with noup for a while, until all of the down
disks are idle, and then letting them come up help at all?  I don't know your
specific issue and haven't touched BlueStore yet, but that is generally
sound advice when OSDs won't start.
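
For reference, a rough sketch of that sequence with the standard cluster flags
and systemd units (assuming a systemd-based install; substitute your own OSD
ids):

    ceph osd set noup                  # restarted OSDs stay marked down
    systemctl restart ceph-osd@37      # affected OSD boots but is not marked up
    # ...wait until the disks are idle and the daemons look stable...
    ceph osd unset noup                # allow them to be marked up and peer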

Also, is there any pattern to the OSDs that are down? Common PGs, common
hosts, common SSDs, etc.?
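
A quick, generic sketch of where I would look for such a pattern (standard
Luminous commands, nothing specific to your setup):

    ceph osd tree | grep -i down              # which hosts the down OSDs sit on
    ceph health detail                        # which PGs are down/incomplete and why
    ceph pg dump pgs_brief | grep -v active   # non-active PGs and their acting sets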

On Sat, Nov 18, 2017, 7:08 AM Ashley Merrick <ashley at amerrick.co.uk> wrote:

> Hello,
>
>
>
> Any further suggestions or workarounds from anyone?
>
>
>
> Cluster is hard down now with around 2% of PGs offline. On occasion I am able
> to get an OSD to start for a bit, but it will then seem to do some peering and
> crash again with “*** Caught signal (Aborted) ** in thread 7f3471c55700
> thread_name:tp_peering”
>
>
>
> ,Ashley
>
>
>
> *From:* Ashley Merrick
>
> *Sent:* 16 November 2017 17:27
> *To:* Eric Nelson <ericnelson at gmail.com>
>
> *Cc:* ceph-users at ceph.com
> *Subject:* Re: [ceph-users] OSD Random Failures - Latest Luminous
>
>
>
> Hello,
>
>
>
> Good to hear it's not just me; however, I have a cluster that is basically
> offline due to too many OSDs dropping because of this issue.
>
>
>
> Anybody have any suggestions?
>
>
>
> ,Ashley
> ------------------------------
>
> *From:* Eric Nelson <ericnelson at gmail.com>
> *Sent:* 16 November 2017 00:06:14
> *To:* Ashley Merrick
> *Cc:* ceph-users at ceph.com
> *Subject:* Re: [ceph-users] OSD Random Failures - Latest Luminous
>
>
>
> I've been seeing these as well on our SSD cache tier, which has been ravaged
> by disk failures as of late... Same tp_peering assert as above, even running
> the luminous branch from git.
>
>
>
> Let me know if you have a bug filed that I can +1, or if you have found a
> workaround.
>
>
>
> E
>
>
>
> On Wed, Nov 15, 2017 at 10:25 AM, Ashley Merrick <ashley at amerrick.co.uk>
> wrote:
>
> Hello,
>
>
>
> After replacing a single OSD disk due to a failed disk, I am now seeing 2-3
> OSDs randomly stop and fail to start: they boot loop, get to load_pgs, and
> then fail with the following. (I tried setting the OSD logs to 5/5 but didn't
> get any extra lines around the error, just more information pre-boot.)
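>
> (For anyone trying to reproduce: raising that level at runtime is the usual
> injectargs call; the OSD id here is just one of the affected ones:
>
>     ceph tell osd.37 injectargs '--debug-osd 5/5'
>
> The same can be set persistently under [osd] in ceph.conf as "debug osd = 5/5".)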
>
>
>
> Could this be a certain PG causing these OSDs to crash (6.2f2s10, for
> example)?
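>
> (For reference, the suspect PG can be inspected with the standard pg map/query
> commands, dropping the shard suffix from the EC pgid:
>
>     ceph pg map 6.2f2        # current up/acting sets for the PG
>     ceph pg 6.2f2 query      # detailed peering state
>
> though query will of course only answer while an acting primary is running.)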
>
>
>
>     -9> 2017-11-15 17:37:14.696229 7fa4ec50f700  1 osd.37 pg_epoch: 161571
> pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209]
> local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474
> les/c/f 161521/152523/159786 161517/161519/161519)
> [34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647]
> r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown
> NOTIFY m=21] state<Start>: transitioning to Stray
>
>     -8> 2017-11-15 17:37:14.696239 7fa4ec50f700  5 osd.37 pg_epoch: 161571
> pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209]
> local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474
> les/c/f 161521/152523/159786 161517/161519/161519)
> [34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647]
> r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown
> NOTIFY m=21] exit Start 0.000019 0 0.000000
>
>     -7> 2017-11-15 17:37:14.696250 7fa4ec50f700  5 osd.37 pg_epoch: 161571
> pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209]
> local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474
> les/c/f 161521/152523/159786 161517/161519/161519)
> [34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647]
> r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown
> NOTIFY m=21] enter Started/Stray
>
>     -6> 2017-11-15 17:37:14.696324 7fa4ec50f700  5 osd.37 pg_epoch: 161571
> pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712]
> local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962
> les/c/f 161519/160963/159786 161517/161517/108939)
> [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570
> pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] exit
> Reset 3.363755 2 0.000076
>
>     -5> 2017-11-15 17:37:14.696337 7fa4ec50f700  5 osd.37 pg_epoch: 161571
> pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712]
> local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962
> les/c/f 161519/160963/159786 161517/161517/108939)
> [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570
> pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] enter
> Started
>
>     -4> 2017-11-15 17:37:14.696346 7fa4ec50f700  5 osd.37 pg_epoch: 161571
> pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712]
> local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962
> les/c/f 161519/160963/159786 161517/161517/108939)
> [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570
> pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] enter
> Start
>
>     -3> 2017-11-15 17:37:14.696353 7fa4ec50f700  1 osd.37 pg_epoch: 161571
> pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712]
> local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962
> les/c/f 161519/160963/159786 161517/161517/108939)
> [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570
> pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5]
> state<Start>: transitioning to Stray
>
>     -2> 2017-11-15 17:37:14.696364 7fa4ec50f700  5 osd.37 pg_epoch: 161571
> pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712]
> local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962
> les/c/f 161519/160963/159786 161517/161517/108939)
> [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570
> pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] exit
> Start 0.000018 0 0.000000
>
>     -1> 2017-11-15 17:37:14.696372 7fa4ec50f700  5 osd.37 pg_epoch: 161571
> pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712]
> local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962
> les/c/f 161519/160963/159786 161517/161517/108939)
> [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570
> pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] enter
> Started/Stray
>
>      0> 2017-11-15 17:37:14.697245 7fa4ebd0e700 -1 *** Caught signal
> (Aborted) **
>
> in thread 7fa4ebd0e700 thread_name:tp_peering
>
>
>
> ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous
> (stable)
>
> 1: (()+0xa3acdc) [0x55dfb6ba3cdc]
>
> 2: (()+0xf890) [0x7fa510e2c890]
>
> 3: (gsignal()+0x37) [0x7fa50fe66067]
>
> 4: (abort()+0x148) [0x7fa50fe67448]
>
> 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x27f) [0x55dfb6be6f5f]
>
> 6: (PG::start_peering_interval(std::shared_ptr<OSDMap const>,
> std::vector<int, std::allocator<int> > const&, int, std::vector<int,
> std::allocator<int> > const&, int, ObjectStore::Transaction*)+0x14e3)
> [0x55dfb670f8a3]
>
> 7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x539)
> [0x55dfb670ff39]
>
> 8: (boost::statechart::simple_state<PG::RecoveryState::Reset,
> PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
> (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
> const&, void const*)+0x244) [0x55dfb67552a4]
>
> 9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
> PG::RecoveryState::Initial, std::allocator<void>,
> boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base
> const&)+0x6b) [0x55dfb6732c1b]
>
> 10: (PG::handle_advance_map(std::shared_ptr<OSDMap const>,
> std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&,
> int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x3e3)
> [0x55dfb6702ef3]
>
> 11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&,
> PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>,
> std::less<boost::intrusive_ptr<PG> >,
> std::allocator<boost::intrusive_ptr<PG> > >*)+0x20a) [0x55dfb664db2a]
>
> 12: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> >
> const&, ThreadPool::TPHandle&)+0x175) [0x55dfb664e6b5]
>
> 13: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*,
> ThreadPool::TPHandle&)+0x27) [0x55dfb66ae5a7]
>
> 14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa8f) [0x55dfb6bedb1f]
>
> 15: (ThreadPool::WorkThread::entry()+0x10) [0x55dfb6beea50]
>
> 16: (()+0x8064) [0x7fa510e25064]
>
> 17: (clone()+0x6d) [0x7fa50ff1962d]
>
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
>
>
> --- logging levels ---
>
>    0/ 5 none
>
>    0/ 1 lockdep
>
>    0/ 1 context
>
>    1/ 1 crush
>
>    1/ 5 mds
>
>    1/ 5 mds_balancer
>
>    1/ 5 mds_locker
>
>    1/ 5 mds_log
>
>    1/ 5 mds_log_expire
>
>    1/ 5 mds_migrator
>
>    0/ 1 buffer
>
>    0/ 1 timer
>
>    0/ 1 filer
>
>    0/ 1 striper
>
>    0/ 1 objecter
>
>    0/ 5 rados
>
>    0/ 5 rbd
>
>    0/ 5 rbd_mirror
>
>    0/ 5 rbd_replay
>
>    0/ 5 journaler
>
>    0/ 5 objectcacher
>
>    0/ 5 client
>
>    1/ 5 osd
>
>    0/ 5 optracker
>
>    0/ 5 objclass
>
>    1/ 3 filestore
>
>    1/ 3 journal
>
>    0/ 5 ms
>
>    1/ 5 mon
>
>    0/10 monc
>
>    1/ 5 paxos
>
>    0/ 5 tp
>
>    1/ 5 auth
>
>    1/ 5 crypto
>
>    1/ 1 finisher
>
>    1/ 5 heartbeatmap
>
>    1/ 5 perfcounter
>
>    1/ 5 rgw
>
>    1/10 civetweb
>
>    1/ 5 javaclient
>
>    1/ 5 asok
>
>    1/ 1 throttle
>
>    0/ 0 refs
>
>    1/ 5 xio
>
>    1/ 5 compressor
>
>    1/ 5 bluestore
>
>    1/ 5 bluefs
>
>    1/ 3 bdev
>
>    1/ 5 kstore
>
>    4/ 5 rocksdb
>
>    4/ 5 leveldb
>
>    4/ 5 memdb
>
>    1/ 5 kinetic
>
>    1/ 5 fuse
>
>    1/ 5 mgr
>
>    1/ 5 mgrc
>
>    1/ 5 dpdk
>
>    1/ 5 eventtrace
>
>   -2/-2 (syslog threshold)
>
>   -1/-1 (stderr threshold)
>
>   max_recent     10000
>
>   max_new         1000
>
>   log_file /var/log/ceph/ceph-osd.37.log
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>