[ceph-users] I cannot make the OSD to work, Journal always breaks 100% time

Ronny Aasen ronny+ceph-users at aasen.cx
Wed Dec 6 03:01:56 PST 2017


On 06. des. 2017 10:01, Gonzalo Aguilar Delgado wrote:
> Hi,
> 
> Another OSD went down, and it's pretty scary how easy it is to break the 
> cluster. This time it's something related to the journal.
> 
> 
> /usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph
> starting osd.6 at :/0 osd_data /var/lib/ceph/osd/ceph-6 
> /var/lib/ceph/osd/ceph-6/journal
> 2017-12-05 13:19:03.473082 7f24515148c0 -1 osd.6 10538 log_to_monitors 
> {default=true}
> os/filestore/FileStore.cc: In function 'void 
> FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, 
> ThreadPool::TPHandle*)' thread 7f243d1a0700 time 2017-12-05 13:19:04.433036
> os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error")
>   ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
>   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x80) [0x55569c1ff790]
>   2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned 
> long, int, ThreadPool::TPHandle*)+0xb8e) [0x55569be9d58e]
>   3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, 
> std::allocator<ObjectStore::Transaction> >&, unsigned long, 
> ThreadPool::TPHandle*)+0x3b) [0x55569bea3a1b]
>   4: (FileStore::_do_op(FileStore::OpSequencer*, 
> ThreadPool::TPHandle&)+0x39d) [0x55569bea3ded]
>   5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x55569c1f1961]
>   6: (ThreadPool::WorkThread::entry()+0x10) [0x55569c1f2a60]
>   7: (()+0x76ba) [0x7f24503e36ba]
>   8: (clone()+0x6d) [0x7f244e45b3dd]
>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
> needed to interpret this.
> [identical assert and backtrace repeated for threads 7f243d1a0700 and 
> 7f243d9a1700 -- trimmed]
> 
> *** Caught signal (Aborted) **
>   in thread 7f243d1a0700 thread_name:tp_fstore_op
> 
> 
> I tried to boot it several times.
> 
> I zeroed the journal:
> 
> dd if=/dev/zero of=/dev/sde2

This probably kills the OSD; at the very least it destroys objects that 
were written to the journal (and that the cluster assumed were safe), 
unless you had successfully flushed it beforehand.
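For the record, a safer sequence (just a sketch, assuming osd.6 with a 
systemd unit and a still-intact filestore) flushes the journal to the data 
disk before the journal device is touched:

```shell
# stop the daemon so nothing writes to the journal
systemctl stop ceph-osd@6

# flush pending journal entries into the filestore FIRST,
# while the old journal contents are still intact
ceph-osd -i 6 --flush-journal

# only now is it safe to wipe/replace the device and recreate the journal
ceph-osd -i 6 --mkjournal
```

Zeroing the journal before the flush, as was done here, discards any 
writes the cluster already acknowledged.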




> created a new journal:
> 
> ceph-osd --mkjournal -i 6
> 
> and flushed it. But it's empty, so that's OK:
> 
> /usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup 
> ceph --flush-journal
> 
> 
> and booted the OSD manually:
> 
> 
> /usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph
> 
> 
> Then it breaks. I pasted my whole configuration at 
> https://pastebin.com/QfrE71Dg.
> 
> I also changed the journal partition from sde4 to sde2 to see if that 
> had something to do with it. sde is an SSD disk, so I wanted to rule out 
> a bad block corrupting everything.
> 
> Nothing helped; it breaks 100% of the time after a while. I'm desperate 
> to understand why it breaks. I should say that this is another OSD that 
> failed and that I recovered. A long SMART scan passes and xfs_repair is 
> OK on the disk; everything seems correct. But it keeps crashing.
> 
> Any advice?
> 
> Can I run the disk without a journal for a while, until all PGs are 
> backed up to the other disks? I just increased the size and min_size of 
> the pools as well, and I need this disk in order to recover all the 
> information.


You need this disk to recover all the information? Do you not have 
replication, with the objects safe elsewhere? I cannot see from your 
pastebin that you have missing objects (objects that exist only on this 
one disk).

If you need the actual objects from this disk, then you need to do a 
recovery; that is a whole other job.

If you only need the space of the disk, then you should zap and wipe it 
and insert it as a new, fresh OSD.
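Roughly like this (a sketch for a Jewel-era cluster, assuming osd.6 lives 
on /dev/sde as in your log; double-check the device before zapping):

```shell
# take the OSD out so data rebalances off it, then stop the daemon
ceph osd out 6
systemctl stop ceph-osd@6

# remove it from the CRUSH map, auth database, and OSD map
ceph osd crush remove osd.6
ceph auth del osd.6
ceph osd rm 6

# wipe the disk and redeploy it as a brand-new OSD
ceph-disk zap /dev/sde
ceph-disk prepare /dev/sde
```

Only do this once you are sure no objects exist solely on this disk.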



But these two lines from your pastebin are a bit over the top; it is hard 
to see how you can have this many degraded objects when there are only 
289090 objects in total:

recovery 20266198323167232/289090 objects degraded (7010342219781.809%)
37154696925806625 scrub errors
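A quick arithmetic check (nothing Ceph-specific, just the ratio) suggests 
the printed percentage really is degraded/total * 100, so it is the 
degraded-object counter itself that is bogus, not the formatting:

```shell
# degraded percentage as ceph prints it: degraded / total * 100
awk 'BEGIN { printf "%.3f%%\n", 20266198323167232 / 289090 * 100 }'
# prints roughly 7010342219781.8%, matching the line above
```

A counter that large for a 289090-object cluster smells like corrupted or 
wrapped-around accounting rather than a real object count.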

I have not seen that before, so hopefully someone else can chime in.
Also, what exact OS, kernel, and Ceph versions are you running?


Kind regards
Ronny Aasen



