[ceph-users] Another OSD broken today. How can I recover it?

Gonzalo Aguilar Delgado gaguilar at aguilardelgado.com
Sat Nov 25 11:43:58 PST 2017


Hello,

I had another blackout with Ceph today. OSDs seem to fail from time to
time and are then unable to recover. I now have three OSDs down: one
removed from the cluster and two down because I am unable to recover
them.

We really need a recovery tool. It is not normal for an OSD to break
with no way to recover it. Is there any way to do it?

The last one shows this:


] enter Reset
   -12> 2017-11-25 20:34:19.548891 7f6e5dc158c0  5 osd.4 pg_epoch: 9686
pg[0.34(unlocked)] enter Initial
   -11> 2017-11-25 20:34:19.548983 7f6e5dc158c0  5 osd.4 pg_epoch: 9686
pg[0.34( empty local-les=9685 n=0 ec=404 les/c/f 9685/9685/0
9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive NIBBLEWISE]
exit Initial 0.000091 0 0.000000
   -10> 2017-11-25 20:34:19.548994 7f6e5dc158c0  5 osd.4 pg_epoch: 9686
pg[0.34( empty local-les=9685 n=0 ec=404 les/c/f 9685/9685/0
9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive NIBBLEWISE]
enter Reset
    -9> 2017-11-25 20:34:19.549166 7f6e5dc158c0  5 osd.4 pg_epoch: 9686
pg[10.36(unlocked)] enter Initial
    -8> 2017-11-25 20:34:19.566781 7f6e5dc158c0  5 osd.4 pg_epoch: 9686
pg[10.36( v 9686'7301894 (9686'7298879,9686'7301894] local-les=9685
n=534 ec=419 les/c/f 9685/9686/0 9684/9684/9684) [4,0] r=0 lpr=0
crt=9686'7301894 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial
0.017614 0 0.000000
    -7> 2017-11-25 20:34:19.566811 7f6e5dc158c0  5 osd.4 pg_epoch: 9686
pg[10.36( v 9686'7301894 (9686'7298879,9686'7301894] local-les=9685
n=534 ec=419 les/c/f 9685/9686/0 9684/9684/9684) [4,0] r=0 lpr=0
crt=9686'7301894 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset
    -6> 2017-11-25 20:34:19.585411 7f6e5dc158c0  5 osd.4 pg_epoch: 9686
pg[8.5c(unlocked)] enter Initial
    -5> 2017-11-25 20:34:19.602888 7f6e5dc158c0  5 osd.4 pg_epoch: 9686
pg[8.5c( empty local-les=9685 n=0 ec=348 les/c/f 9685/9685/0
9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive NIBBLEWISE]
exit Initial 0.017478 0 0.000000
    -4> 2017-11-25 20:34:19.602912 7f6e5dc158c0  5 osd.4 pg_epoch: 9686
pg[8.5c( empty local-les=9685 n=0 ec=348 les/c/f 9685/9685/0
9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive NIBBLEWISE]
enter Reset
    -3> 2017-11-25 20:34:19.603082 7f6e5dc158c0  5 osd.4 pg_epoch: 9686
pg[9.10(unlocked)] enter Initial
    -2> 2017-11-25 20:34:19.615456 7f6e5dc158c0  5 osd.4 pg_epoch: 9686
pg[9.10( v 9686'2322547 (9031'2319518,9686'2322547] local-les=9685 n=261
ec=417 les/c/f 9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0
crt=9686'2322547 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial
0.012373 0 0.000000
    -1> 2017-11-25 20:34:19.615481 7f6e5dc158c0  5 osd.4 pg_epoch: 9686
pg[9.10( v 9686'2322547 (9031'2319518,9686'2322547] local-les=9685 n=261
ec=417 les/c/f 9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0
crt=9686'2322547 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset
     0> 2017-11-25 20:34:19.617400 7f6e5dc158c0 -1 osd/PG.cc: In
function 'static int PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*,
ceph::bufferlist*)' thread 7f6e5dc158c0 time 2017-11-25 20:34:19.615633
osd/PG.cc: 3025: FAILED assert(values.size() == 2)

 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x80) [0x5562d318d790]
 2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*,
ceph::buffer::list*)+0x661) [0x5562d2b4b601]
 3: (OSD::load_pgs()+0x75a) [0x5562d2a9f8aa]
 4: (OSD::init()+0x2026) [0x5562d2aaaca6]
 5: (main()+0x2ef1) [0x5562d2a1c301]
 6: (__libc_start_main()+0xf0) [0x7f6e5aa75830]
 7: (_start()+0x29) [0x5562d2a5db09]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.4.log
--- end dump of recent events ---
2017-11-25 20:34:19.622559 7f6e5dc158c0 -1 *** Caught signal (Aborted) **
 in thread 7f6e5dc158c0 thread_name:ceph-osd

 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (()+0x98653e) [0x5562d308d53e]
 2: (()+0x11390) [0x7f6e5caee390]
 3: (gsignal()+0x38) [0x7f6e5aa8a428]
 4: (abort()+0x16a) [0x7f6e5aa8c02a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x26b) [0x5562d318d97b]
 6: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*,
ceph::buffer::list*)+0x661) [0x5562d2b4b601]
 7: (OSD::load_pgs()+0x75a) [0x5562d2a9f8aa]
 8: (OSD::init()+0x2026) [0x5562d2aaaca6]
 9: (main()+0x2ef1) [0x5562d2a1c301]
 10: (__libc_start_main()+0xf0) [0x7f6e5aa75830]
 11: (_start()+0x29) [0x5562d2a5db09]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
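For what it's worth, there is a (destructive) last-resort path with
ceph-objectstore-tool when a filestore OSD asserts in
PG::peek_map_epoch like this. The sketch below is only an outline: it
assumes the jewel-era tool, the default paths for osd.4 taken from the
log above, and a hypothetical BAD_PG variable; verify every flag
against your build, run everything with the OSD stopped, and keep the
export before removing anything.

```shell
# Last-resort sketch for a filestore OSD that asserts in
# PG::peek_map_epoch while loading PGs (jewel). Paths and pgids are
# assumptions for this cluster; the OSD daemon must be stopped first.
OSD_PATH=/var/lib/ceph/osd/ceph-4
JOURNAL=$OSD_PATH/journal

# Helper: name for the backup file of a given pg, e.g. pg-9.10.export
export_file() { printf 'pg-%s.export\n' "$1"; }

# 1) Find the PG whose on-disk metadata is broken: list every PG on the
#    OSD and probe each one; the bad PG should make the tool error out.
#      ceph-objectstore-tool --data-path "$OSD_PATH" \
#          --journal-path "$JOURNAL" --op list-pgs |
#      while read -r pg; do
#          ceph-objectstore-tool --data-path "$OSD_PATH" \
#              --journal-path "$JOURNAL" --pgid "$pg" --op info \
#              >/dev/null 2>&1 || echo "bad pg: $pg"
#      done
#
# 2) Keep a copy of the bad PG, then remove it so the OSD can start:
#      ceph-objectstore-tool --data-path "$OSD_PATH" \
#          --journal-path "$JOURNAL" --pgid "$BAD_PG" \
#          --op export --file "$(export_file "$BAD_PG")"
#      ceph-objectstore-tool --data-path "$OSD_PATH" \
#          --journal-path "$JOURNAL" --pgid "$BAD_PG" --op remove

export_file 9.10
```

Once the OSD starts without the PG, the surviving replica should
re-create and backfill it; if that was the last copy, the export can in
principle be brought back into a healthy OSD with --op import.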
