[ceph-users] OSD Random Failures - Latest Luminous

Ashley Merrick ashley at amerrick.co.uk
Sat Nov 18 04:01:19 PST 2017


Hello,

Any further suggestions or workarounds from anyone?

The cluster is hard down now with around 2% of PGs offline. On occasion I am able to get an OSD to start for a bit, but it then seems to do some peering and crashes again with "*** Caught signal (Aborted) ** in thread 7f3471c55700 thread_name:tp_peering".
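Not a fix for the assert itself, but while OSDs are flapping it can help to stop the map churn that keeps re-triggering peering. A sketch of the usual stabilising flags (these are standard ceph CLI commands, not anything specific to this crash; unset them once things settle):

```shell
# Pause the state changes that keep re-triggering peering on the sick OSDs.
ceph osd set noout        # don't mark crashed OSDs out (avoids rebalancing)
ceph osd set nodown       # optional: stop flapping OSDs being marked down
ceph osd set norebalance  # hold off data movement while debugging

# ...investigate, then revert:
ceph osd unset norebalance
ceph osd unset nodown
ceph osd unset noout
```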

,Ashley

From: Ashley Merrick
Sent: 16 November 2017 17:27
To: Eric Nelson <ericnelson at gmail.com>
Cc: ceph-users at ceph.com
Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous


Hello,



Good to hear it's not just me; however, I have a cluster basically offline due to too many OSDs dropping from this issue.



Anybody have any suggestions?



,Ashley

________________________________
From: Eric Nelson <ericnelson at gmail.com>
Sent: 16 November 2017 00:06:14
To: Ashley Merrick
Cc: ceph-users at ceph.com
Subject: Re: [ceph-users] OSD Random Failures - Latest Luminous

I've been seeing these as well on our SSD cache tier, which has been ravaged by disk failures of late... Same tp_peering assert as above, even running the luminous branch from git.

Let me know if you have a bug filed I can +1 or have found a workaround.

E

On Wed, Nov 15, 2017 at 10:25 AM, Ashley Merrick <ashley at amerrick.co.uk> wrote:

Hello,



After replacing a single failed OSD disk, I am now seeing 2-3 OSDs randomly stop and fail to start: they go into a boot loop, get as far as load_pgs, and then fail with the following. (I tried setting the OSD logs to 5/5 but didn't get any extra lines around the error, just more information pre-boot.)
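For reference, 5/5 rarely surfaces peering detail; debug_osd at 20 usually does. Since the daemon dies during startup, one common approach is to set the level in ceph.conf or run the OSD once in the foreground (the OSD id 37 below matches the log; the rest is the generic technique, not anything verified against this cluster):

```shell
# In /etc/ceph/ceph.conf, under [osd] (or just [osd.37]):
#   debug osd = 20/20
#   debug ms  = 1/1

# Or run the failing OSD once in the foreground with verbose logging:
ceph-osd -f --cluster ceph --id 37 --debug-osd 20 --debug-ms 1
```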



Could this be a certain PG causing these OSDs to crash (6.2f2s10 for example)?
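On the PG naming, for what it's worth: in an erasure-coded pool the id has the form <pool>.<pg-hex>s<shard>, so 6.2f2s10 is shard 10 of PG 2f2 in pool 6 (and the 2147483647 in the acting sets below is the "no OSD" placeholder for a missing shard). A quick sketch of splitting the id with plain shell parameter expansion:

```shell
# Split an EC placement-group id of the form <pool>.<pghex>s<shard>.
pgid="6.2f2s10"
pool="${pgid%%.*}"   # pool id          -> 6
rest="${pgid#*.}"    # strip the pool   -> 2f2s10
pg="${rest%%s*}"     # hex PG number    -> 2f2
shard="${rest##*s}"  # EC shard index   -> 10
echo "pool=$pool pg=$pg shard=$shard"
# -> pool=6 pg=2f2 shard=10
```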



    -9> 2017-11-15 17:37:14.696229 7fa4ec50f700  1 osd.37 pg_epoch: 161571 pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209] local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474 les/c/f 161521/152523/159786 161517/161519/161519) [34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647] r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown NOTIFY m=21] state<Start>: transitioning to Stray

    -8> 2017-11-15 17:37:14.696239 7fa4ec50f700  5 osd.37 pg_epoch: 161571 pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209] local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474 les/c/f 161521/152523/159786 161517/161519/161519) [34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647] r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown NOTIFY m=21] exit Start 0.000019 0 0.000000

    -7> 2017-11-15 17:37:14.696250 7fa4ec50f700  5 osd.37 pg_epoch: 161571 pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209] local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474 les/c/f 161521/152523/159786 161517/161519/161519) [34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647] r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown NOTIFY m=21] enter Started/Stray

    -6> 2017-11-15 17:37:14.696324 7fa4ec50f700  5 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] exit Reset 3.363755 2 0.000076

    -5> 2017-11-15 17:37:14.696337 7fa4ec50f700  5 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] enter Started

    -4> 2017-11-15 17:37:14.696346 7fa4ec50f700  5 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] enter Start

    -3> 2017-11-15 17:37:14.696353 7fa4ec50f700  1 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] state<Start>: transitioning to Stray

    -2> 2017-11-15 17:37:14.696364 7fa4ec50f700  5 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] exit Start 0.000018 0 0.000000

    -1> 2017-11-15 17:37:14.696372 7fa4ec50f700  5 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] enter Started/Stray

     0> 2017-11-15 17:37:14.697245 7fa4ebd0e700 -1 *** Caught signal (Aborted) **

in thread 7fa4ebd0e700 thread_name:tp_peering



ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)

1: (()+0xa3acdc) [0x55dfb6ba3cdc]

2: (()+0xf890) [0x7fa510e2c890]

3: (gsignal()+0x37) [0x7fa50fe66067]

4: (abort()+0x148) [0x7fa50fe67448]

5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27f) [0x55dfb6be6f5f]

6: (PG::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ObjectStore::Transaction*)+0x14e3) [0x55dfb670f8a3]

7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x539) [0x55dfb670ff39]

8: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x244) [0x55dfb67552a4]

9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x55dfb6732c1b]

10: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x3e3) [0x55dfb6702ef3]

11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x20a) [0x55dfb664db2a]

12: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x175) [0x55dfb664e6b5]

13: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x27) [0x55dfb66ae5a7]

14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa8f) [0x55dfb6bedb1f]

15: (ThreadPool::WorkThread::entry()+0x10) [0x55dfb6beea50]

16: (()+0x8064) [0x7fa510e25064]

17: (clone()+0x6d) [0x7fa50ff1962d]

NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
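If a single PG such as 6.2f2s10 really is the trigger, the conventional last resort is to export that shard from the crashing OSD with ceph-objectstore-tool and then remove it, so the OSD can boot and EC recovery can rebuild the shard elsewhere. This is risky and version-sensitive (exact flags such as --force vary by release); a sketch only, assuming the default data path for osd.37:

```shell
# Stop the OSD first; the object store must not be in use.
systemctl stop ceph-osd@37

# Export the suspect PG shard and keep the file; it can be
# re-imported into another OSD later if it turns out to be needed.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-37 \
    --pgid 6.2f2s10 --op export --file /root/pg6.2f2s10.export

# Then remove the shard so the OSD can start without it.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-37 \
    --pgid 6.2f2s10 --op remove --force
```

Only do this on one replica/shard at a time, and verify the PG is recoverable from the remaining shards before touching another OSD.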



--- logging levels ---

   0/ 5 none

   0/ 1 lockdep

   0/ 1 context

   1/ 1 crush

   1/ 5 mds

   1/ 5 mds_balancer

   1/ 5 mds_locker

   1/ 5 mds_log

   1/ 5 mds_log_expire

   1/ 5 mds_migrator

   0/ 1 buffer

   0/ 1 timer

   0/ 1 filer

   0/ 1 striper

   0/ 1 objecter

   0/ 5 rados

   0/ 5 rbd

   0/ 5 rbd_mirror

   0/ 5 rbd_replay

   0/ 5 journaler

   0/ 5 objectcacher

   0/ 5 client

   1/ 5 osd

   0/ 5 optracker

   0/ 5 objclass

   1/ 3 filestore

   1/ 3 journal

   0/ 5 ms

   1/ 5 mon

   0/10 monc

   1/ 5 paxos

   0/ 5 tp

   1/ 5 auth

   1/ 5 crypto

   1/ 1 finisher

   1/ 5 heartbeatmap

   1/ 5 perfcounter

   1/ 5 rgw

   1/10 civetweb

   1/ 5 javaclient

   1/ 5 asok

   1/ 1 throttle

   0/ 0 refs

   1/ 5 xio

   1/ 5 compressor

   1/ 5 bluestore

   1/ 5 bluefs

   1/ 3 bdev

   1/ 5 kstore

   4/ 5 rocksdb

   4/ 5 leveldb

   4/ 5 memdb

   1/ 5 kinetic

   1/ 5 fuse

   1/ 5 mgr

   1/ 5 mgrc

   1/ 5 dpdk

   1/ 5 eventtrace

  -2/-2 (syslog threshold)

  -1/-1 (stderr threshold)

  max_recent     10000

  max_new         1000

  log_file /var/log/ceph/ceph-osd.37.log

_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


