[ceph-users] I cannot make the OSD to work, Journal always breaks 100% time

Gonzalo Aguilar Delgado gaguilar at aguilardelgado.com
Wed Dec 6 01:01:10 PST 2017


Hi,

Another OSD has gone down, and it's pretty scary how easy it is to break
the cluster. This time it is something related to the journal.


/usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph
starting osd.6 at :/0 osd_data /var/lib/ceph/osd/ceph-6 
/var/lib/ceph/osd/ceph-6/journal
2017-12-05 13:19:03.473082 7f24515148c0 -1 osd.6 10538 log_to_monitors 
{default=true}
os/filestore/FileStore.cc: In function 'void 
FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, 
ThreadPool::TPHandle*)' thread 7f243d1a0700 time 2017-12-05 13:19:04.433036
os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error")
  ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x80) [0x55569c1ff790]
  2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned 
long, int, ThreadPool::TPHandle*)+0xb8e) [0x55569be9d58e]
  3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, 
std::allocator<ObjectStore::Transaction> >&, unsigned long, 
ThreadPool::TPHandle*)+0x3b) [0x55569bea3a1b]
  4: (FileStore::_do_op(FileStore::OpSequencer*, 
ThreadPool::TPHandle&)+0x39d) [0x55569bea3ded]
  5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x55569c1f1961]
  6: (ThreadPool::WorkThread::entry()+0x10) [0x55569c1f2a60]
  7: (()+0x76ba) [0x7f24503e36ba]
  8: (clone()+0x6d) [0x7f244e45b3dd]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.
2017-12-05 13:19:04.437968 7f243d1a0700 -1 os/filestore/FileStore.cc: In 
function 'void FileStore::_do_transaction(ObjectStore::Transaction&, 
uint64_t, int, ThreadPool::TPHandle*)' thread 7f243d1a0700 time 
2017-12-05 13:19:04.433036
os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error")

  ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x80) [0x55569c1ff790]
  2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned 
long, int, ThreadPool::TPHandle*)+0xb8e) [0x55569be9d58e]
  3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, 
std::allocator<ObjectStore::Transaction> >&, unsigned long, 
ThreadPool::TPHandle*)+0x3b) [0x55569bea3a1b]
  4: (FileStore::_do_op(FileStore::OpSequencer*, 
ThreadPool::TPHandle&)+0x39d) [0x55569bea3ded]
  5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x55569c1f1961]
  6: (ThreadPool::WorkThread::entry()+0x10) [0x55569c1f2a60]
  7: (()+0x76ba) [0x7f24503e36ba]
  8: (clone()+0x6d) [0x7f244e45b3dd]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.

os/filestore/FileStore.cc: In function 'void 
FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, 
ThreadPool::TPHandle*)' thread 7f243d9a1700 time 2017-12-05 13:19:04.435362
os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error")
  ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x80) [0x55569c1ff790]
  2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned 
long, int, ThreadPool::TPHandle*)+0xb8e) [0x55569be9d58e]
  3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, 
std::allocator<ObjectStore::Transaction> >&, unsigned long, 
ThreadPool::TPHandle*)+0x3b) [0x55569bea3a1b]
  4: (FileStore::_do_op(FileStore::OpSequencer*, 
ThreadPool::TPHandle&)+0x39d) [0x55569bea3ded]
  5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x55569c1f1961]
  6: (ThreadPool::WorkThread::entry()+0x10) [0x55569c1f2a60]
  7: (()+0x76ba) [0x7f24503e36ba]
  8: (clone()+0x6d) [0x7f244e45b3dd]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.
   -405> 2017-12-05 13:19:03.473082 7f24515148c0 -1 osd.6 10538 
log_to_monitors {default=true}
      0> 2017-12-05 13:19:04.437968 7f243d1a0700 -1 
os/filestore/FileStore.cc: In function 'void 
FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, 
ThreadPool::TPHandle*)' thread 7f243d1a0700 time 2017-12-05 13:19:04.433036
os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error")

  ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x80) [0x55569c1ff790]
  2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned 
long, int, ThreadPool::TPHandle*)+0xb8e) [0x55569be9d58e]
  3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, 
std::allocator<ObjectStore::Transaction> >&, unsigned long, 
ThreadPool::TPHandle*)+0x3b) [0x55569bea3a1b]
  4: (FileStore::_do_op(FileStore::OpSequencer*, 
ThreadPool::TPHandle&)+0x39d) [0x55569bea3ded]
  5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x55569c1f1961]
  6: (ThreadPool::WorkThread::entry()+0x10) [0x55569c1f2a60]
  7: (()+0x76ba) [0x7f24503e36ba]
  8: (clone()+0x6d) [0x7f244e45b3dd]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.

2017-12-05 13:19:04.442866 7f243d9a1700 -1 os/filestore/FileStore.cc: In 
function 'void FileStore::_do_transaction(ObjectStore::Transaction&, 
uint64_t, int, ThreadPool::TPHandle*)' thread 7f243d9a1700 time 
2017-12-05 13:19:04.435362
os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error")

  ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x80) [0x55569c1ff790]
  2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned 
long, int, ThreadPool::TPHandle*)+0xb8e) [0x55569be9d58e]
  3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, 
std::allocator<ObjectStore::Transaction> >&, unsigned long, 
ThreadPool::TPHandle*)+0x3b) [0x55569bea3a1b]
  4: (FileStore::_do_op(FileStore::OpSequencer*, 
ThreadPool::TPHandle&)+0x39d) [0x55569bea3ded]
  5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x55569c1f1961]
  6: (ThreadPool::WorkThread::entry()+0x10) [0x55569c1f2a60]
  7: (()+0x76ba) [0x7f24503e36ba]
  8: (clone()+0x6d) [0x7f244e45b3dd]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.

      0> 2017-12-05 13:19:04.442866 7f243d9a1700 -1 
os/filestore/FileStore.cc: In function 'void 
FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, 
ThreadPool::TPHandle*)' thread 7f243d9a1700 time 2017-12-05 13:19:04.435362
os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error")

  ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x80) [0x55569c1ff790]
  2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned 
long, int, ThreadPool::TPHandle*)+0xb8e) [0x55569be9d58e]
  3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, 
std::allocator<ObjectStore::Transaction> >&, unsigned long, 
ThreadPool::TPHandle*)+0x3b) [0x55569bea3a1b]
  4: (FileStore::_do_op(FileStore::OpSequencer*, 
ThreadPool::TPHandle&)+0x39d) [0x55569bea3ded]
  5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x55569c1f1961]
  6: (ThreadPool::WorkThread::entry()+0x10) [0x55569c1f2a60]
  7: (()+0x76ba) [0x7f24503e36ba]
  8: (clone()+0x6d) [0x7f244e45b3dd]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.

*** Caught signal (Aborted) **
  in thread 7f243d1a0700 thread_name:tp_fstore_op
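
The dump above repeats the same backtrace several times for two worker
threads. As a minimal sketch, assuming the ceph-osd output has been saved
to a file (the function name and the file path are hypothetical, not part
of Ceph), the distinct asserting threads can be counted like this:

```shell
#!/bin/sh
# Hypothetical helper: count how often each thread id appears in the
# "In function ... thread <id> time <timestamp>" headers of a saved log.
summarize_asserts() {
    # $1: path to a saved copy of the ceph-osd output (assumed location)
    grep -o 'thread [0-9a-f][0-9a-f]* time' "$1" \
        | awk '{ count[$2]++ } END { for (t in count) print t, count[t] }' \
        | sort
}
```

Called as `summarize_asserts /tmp/osd.6.log` (path assumed), it prints one
line per thread id followed by the number of assert headers seen for it.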


I tried to boot it several times.

I zeroed the journal:

dd if=/dev/zero of=/dev/sde2

created a new journal:

ceph-osd --mkjournal -i 6

flushed it (it is empty at that point, so that is fine):

/usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup 
ceph --flush-journal


and booted the OSD manually:


/usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph


Then it breaks. I pasted my whole configuration at
https://pastebin.com/QfrE71Dg.
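
The steps above can be collected into one sequence. The sketch below only
prints the commands and runs nothing. It reuses the exact commands from
this message but puts the flush first, on the assumption that the old
journal is still readable: a flush after zeroing has nothing left to
write. The function name and its arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print (do not execute) the journal rebuild steps
# from this message for one OSD, with the flush moved to the front.
journal_rebuild_plan() {
    id="$1"     # OSD id, e.g. 6
    dev="$2"    # journal partition, e.g. /dev/sde2
    # 1. Flush pending journal entries to the data disk (OSD stopped).
    echo "/usr/bin/ceph-osd -f --cluster ceph --id $id --setuser ceph --setgroup ceph --flush-journal"
    # 2. Wipe the old journal partition.
    echo "dd if=/dev/zero of=$dev"
    # 3. Create a fresh journal.
    echo "ceph-osd --mkjournal -i $id"
    # 4. Start the OSD in the foreground.
    echo "/usr/bin/ceph-osd -f --cluster ceph --id $id --setuser ceph --setgroup ceph"
}

journal_rebuild_plan 6 /dev/sde2
```

Printing instead of executing is deliberate, since the dd step destroys
whatever is on the partition it is pointed at.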

I also changed the journal partition from sde4 to sde2 to see whether
that had anything to do with it. sde is an SSD, so I wanted to rule out
a bad block corrupting everything.

No luck: it breaks 100% of the time after a while. I'm desperate to
find out why it breaks. I must say that this is another OSD that failed
and that I recovered. The long SMART self-test passes and xfs_repair
finds nothing wrong on the disk; everything seems correct. But it keeps
crashing.

Any advice?

Can I run the disk without a journal for a while, until all PGs are
backed up to the other disks? I just increased the size of the pools,
and min_size as well, and I need this disk in order to recover all the
information.


Best regards,



