[ceph-users] Another OSD broken today. How can I recover it?

Gonzalo Aguilar Delgado gaguilar at aguilardelgado.com
Sun Dec 3 04:31:11 PST 2017


Hi,

Yes, nice. Until all your OSDs fail and you don't know what else to try.
Looking at the failure rate, that will happen very soon.

I want to recover them. I'm describing what I tried in another mail. Let's
see if someone can help me.

I'm not doing anything unusual. I just look at my cluster from time to time
and find that something else has failed. I will work hard to recover from
this situation.
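
For reference, this is roughly how I check the cluster each time; these are
just the standard ceph CLI commands, nothing specific to my setup:

    # overall health and which PGs are degraded, down or stuck
    ceph -s
    ceph health detail

    # which OSDs are up/down/out and where they sit in the CRUSH tree
    ceph osd tree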

Thank you.


On 26/11/17 16:13, Marc Roos wrote:
>  
> If I am not mistaken, the whole idea with the 3 replicas is that you
> have enough copies to recover from a failed OSD. In my tests this seems
> to happen automatically. Are you doing something that is not advised?
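>
> A quick way to check that the replica count really is 3 and that recovery
> is allowed to proceed is something like this (the pool name "rbd" is only
> an example, substitute your own pools):
>
>     ceph osd pool get rbd size
>     ceph osd pool get rbd min_size
>     ceph -s        # should show recovery/backfill making progress
>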
>
>
>
>
> -----Original Message-----
> From: Gonzalo Aguilar Delgado [mailto:gaguilar at aguilardelgado.com] 
> Sent: zaterdag 25 november 2017 20:44
> To: 'ceph-users'
> Subject: [ceph-users] Another OSD broken today. How can I recover it?
>
> Hello, 
>
>
> I had another blackout with Ceph today. It seems that Ceph OSDs fail
> from time to time and are unable to recover. I have 3 OSDs down now:
> 1 removed from the cluster and 2 down because I'm unable to recover
> them.
>
>
> We really need a recovery tool. It's not normal that an OSD breaks and
> there's no way to recover it. Is there any way to do it?
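>
> The closest thing I have found is ceph-objectstore-tool, run with the OSD
> stopped. This is only a sketch of what I mean (the paths are for my osd.4,
> and the PG id is just one taken from the log below), not something I have
> managed to get working yet:
>
>     # list the PGs stored on the broken OSD
>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
>         --journal-path /var/lib/ceph/osd/ceph-4/journal --op list-pgs
>
>     # export one PG to a file...
>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
>         --journal-path /var/lib/ceph/osd/ceph-4/journal \
>         --op export --pgid 10.36 --file /root/pg-10.36.export
>
>     # ...and import it into a healthy (also stopped) OSD
>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
>         --journal-path /var/lib/ceph/osd/ceph-0/journal \
>         --op import --file /root/pg-10.36.export
>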
>
>
> The last one shows this:
>
>
>
>
> ] enter Reset
>    -12> 2017-11-25 20:34:19.548891 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 
> pg[0.34(unlocked)] enter Initial
>    -11> 2017-11-25 20:34:19.548983 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 
> pg[0.34( empty local-les=9685 n=0 ec=404 les/c/f 9685/9685/0 
> 9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive NIBBLEWISE] 
> exit Initial 0.000091 0 0.000000
>    -10> 2017-11-25 20:34:19.548994 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 
> pg[0.34( empty local-les=9685 n=0 ec=404 les/c/f 9685/9685/0 
> 9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive NIBBLEWISE] 
> enter Reset
>     -9> 2017-11-25 20:34:19.549166 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 
> pg[10.36(unlocked)] enter Initial
>     -8> 2017-11-25 20:34:19.566781 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 
> pg[10.36( v 9686'7301894 (9686'7298879,9686'7301894] local-les=9685 
> n=534 ec=419 les/c/f 9685/9686/0 9684/9684/9684) [4,0] r=0 lpr=0 
> crt=9686'7301894 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial 
> 0.017614 0 0.000000
>     -7> 2017-11-25 20:34:19.566811 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 
> pg[10.36( v 9686'7301894 (9686'7298879,9686'7301894] local-les=9685 
> n=534 ec=419 les/c/f 9685/9686/0 9684/9684/9684) [4,0] r=0 lpr=0 
> crt=9686'7301894 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset
>     -6> 2017-11-25 20:34:19.585411 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 
> pg[8.5c(unlocked)] enter Initial
>     -5> 2017-11-25 20:34:19.602888 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 
> pg[8.5c( empty local-les=9685 n=0 ec=348 les/c/f 9685/9685/0 
> 9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive NIBBLEWISE] 
> exit Initial 0.017478 0 0.000000
>     -4> 2017-11-25 20:34:19.602912 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 
> pg[8.5c( empty local-les=9685 n=0 ec=348 les/c/f 9685/9685/0 
> 9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive NIBBLEWISE] 
> enter Reset
>     -3> 2017-11-25 20:34:19.603082 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 
> pg[9.10(unlocked)] enter Initial
>     -2> 2017-11-25 20:34:19.615456 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 
> pg[9.10( v 9686'2322547 (9031'2319518,9686'2322547] local-les=9685 n=261 
> ec=417 les/c/f 9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 
> crt=9686'2322547 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial 
> 0.012373 0 0.000000
>     -1> 2017-11-25 20:34:19.615481 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 
> pg[9.10( v 9686'2322547 (9031'2319518,9686'2322547] local-les=9685 n=261 
> ec=417 les/c/f 9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 
> crt=9686'2322547 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset
>      0> 2017-11-25 20:34:19.617400 7f6e5dc158c0 -1 osd/PG.cc: In 
> function 'static int PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*, 
> ceph::bufferlist*)' thread 7f6e5dc158c0 time 2017-11-25 20:34:19.615633
> osd/PG.cc: 3025: FAILED assert(values.size() == 2)
>
>  ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x80) [0x5562d318d790]
>  2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, 
> ceph::buffer::list*)+0x661) [0x5562d2b4b601]
>  3: (OSD::load_pgs()+0x75a) [0x5562d2a9f8aa]
>  4: (OSD::init()+0x2026) [0x5562d2aaaca6]
>  5: (main()+0x2ef1) [0x5562d2a1c301]
>  6: (__libc_start_main()+0xf0) [0x7f6e5aa75830]
>  7: (_start()+0x29) [0x5562d2a5db09]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
> needed to interpret this.
>
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    1/ 5 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 rbd_mirror
>    0/ 5 rbd_replay
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>    0/ 5 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    1/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/10 civetweb
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>    0/ 0 refs
>    1/ 5 xio
>    1/ 5 compressor
>    1/ 5 newstore
>    1/ 5 bluestore
>    1/ 5 bluefs
>    1/ 3 bdev
>    1/ 5 kstore
>    4/ 5 rocksdb
>    4/ 5 leveldb
>    1/ 5 kinetic
>    1/ 5 fuse
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent     10000
>   max_new         1000
>   log_file /var/log/ceph/ceph-osd.4.log
> --- end dump of recent events ---
> 2017-11-25 20:34:19.622559 7f6e5dc158c0 -1 *** Caught signal (Aborted) 
> **  in thread 7f6e5dc158c0 thread_name:ceph-osd
>
>  ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
>  1: (()+0x98653e) [0x5562d308d53e]
>  2: (()+0x11390) [0x7f6e5caee390]
>  3: (gsignal()+0x38) [0x7f6e5aa8a428]
>  4: (abort()+0x16a) [0x7f6e5aa8c02a]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x26b) [0x5562d318d97b]
>  6: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, 
> ceph::buffer::list*)+0x661) [0x5562d2b4b601]
>  7: (OSD::load_pgs()+0x75a) [0x5562d2a9f8aa]
>  8: (OSD::init()+0x2026) [0x5562d2aaaca6]
>  9: (main()+0x2ef1) [0x5562d2a1c301]
>  10: (__libc_start_main()+0xf0) [0x7f6e5aa75830]
>  11: (_start()+0x29) [0x5562d2a5db09]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
> needed to interpret this.
>
> --- begin dump of recent events ---
>      0> 2017-11-25 20:34:19.622559 7f6e5dc158c0 -1 *** Caught signal 
> (Aborted) **  in thread 7f6e5dc158c0 thread_name:ceph-osd
>
>  ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
>  1: (()+0x98653e) [0x5562d308d53e]
>  2: (()+0x11390) [0x7f6e5caee390]
>  3: (gsignal()+0x38) [0x7f6e5aa8a428]
>  4: (abort()+0x16a) [0x7f6e5aa8c02a]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x26b) [0x5562d318d97b]
>  6: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, 
> ceph::buffer::list*)+0x661) [0x5562d2b4b601]
>  7: (OSD::load_pgs()+0x75a) [0x5562d2a9f8aa]
>  8: (OSD::init()+0x2026) [0x5562d2aaaca6]
>  9: (main()+0x2ef1) [0x5562d2a1c301]
>  10: (__libc_start_main()+0xf0) [0x7f6e5aa75830]
>  11: (_start()+0x29) [0x5562d2a5db09]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
> needed to interpret this.
>
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    1/ 5 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 rbd_mirror
>    0/ 5 rbd_replay
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>    0/ 5 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    1/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/10 civetweb
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>    0/ 0 refs
>    1/ 5 xio
>    1/ 5 compressor
>    1/ 5 newstore
>    1/ 5 bluestore
>    1/ 5 bluefs
>    1/ 3 bdev
>    1/ 5 kstore
>    4/ 5 rocksdb
>    4/ 5 leveldb
>    1/ 5 kinetic
>    1/ 5 fuse
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent     10000
>   max_new         1000
>   log_file /var/log/ceph/ceph-osd.4.log
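>
> If I read the assert correctly, PG::peek_map_epoch() reads the PG's on-disk
> metadata and expects exactly two values back, so at least one PG on this
> disk seems to have incomplete metadata. What I would like to try (again
> only a sketch; 9.10 below is just a placeholder, since the PG that actually
> trips the assert is the next one loaded after it and never gets printed)
> is to inspect and, if necessary, drop only the damaged PG instead of the
> whole OSD, and let the surviving replica backfill it:
>
>     # with the OSD stopped, look at the suspect PG's metadata
>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
>         --journal-path /var/lib/ceph/osd/ceph-4/journal \
>         --op info --pgid 9.10
>
>     # if it really is corrupt, export it first (as above) and then remove
>     # it so the OSD can start again
>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
>         --journal-path /var/lib/ceph/osd/ceph-4/journal \
>         --op remove --pgid 9.10
>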
>
>
>
>
>
>
>
>
>
