[ceph-users] Recovery from "FAILED assert(omap_num_objs <= MAX_OBJECTS)"

Zoë O'Connell zoe+ceph at prowler.io
Tue Aug 27 09:23:41 PDT 2019


We have run into what looks like bug 36094
(https://tracker.ceph.com/issues/36094) on our 13.2.6 cluster, and
unfortunately one of our ranks (rank 1) now won't start: it comes up
for a few seconds before the assigned MDS crashes again with the log
entries below. It appears that the OpenFileTable has somehow become
corrupted, but it's not clear from the documentation for any of the
Ceph tools whether there is a way to clear it.

Before we resort to deleting and recreating the cluster, are there any 
further recovery steps we can perform?

Thanks.

2019-08-27 16:10:50.775 7f2c94581700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/mds/OpenFileTable.cc: In function 'void OpenFileTable::commit(MDSInternalContextBase*, uint64_t, int)' thread 7f2c94581700 time 2019-08-27 16:10:50.774858
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/mds/OpenFileTable.cc: 473: FAILED assert(omap_num_objs <= MAX_OBJECTS)

  ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic 
(stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x14b) [0x7f2ca064636b]
  2: (()+0x26e4f7) [0x7f2ca06464f7]
  3: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, 
int)+0x1b35) [0x557afbe49265]
  4: (MDLog::trim(int)+0x5a6) [0x557afbe36a86]
  5: (MDSRankDispatcher::tick()+0x24b) [0x557afbbcd97b]
  6: (FunctionContext::finish(int)+0x2c) [0x557afbbb326c]
  7: (Context::complete(int)+0x9) [0x557afbbb0ef9]
  8: (SafeTimer::timer_thread()+0x18b) [0x7f2ca0642c3b]
  9: (SafeTimerThread::entry()+0xd) [0x7f2ca06441fd]
  10: (()+0x7dd5) [0x7f2c9e284dd5]
  11: (clone()+0x6d) [0x7f2c9d36202d]

2019-08-27 16:10:50.777 7f2c94581700 -1 *** Caught signal (Aborted) **
  in thread 7f2c94581700 thread_name:safe_timer

  ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic 
(stable)
  1: (()+0xf5d0) [0x7f2c9e28c5d0]
  2: (gsignal()+0x37) [0x7f2c9d29a2c7]
  3: (abort()+0x148) [0x7f2c9d29b9b8]
  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x248) [0x7f2ca0646468]
  5: (()+0x26e4f7) [0x7f2ca06464f7]
  6: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, 
int)+0x1b35) [0x557afbe49265]
  7: (MDLog::trim(int)+0x5a6) [0x557afbe36a86]
  8: (MDSRankDispatcher::tick()+0x24b) [0x557afbbcd97b]
  9: (FunctionContext::finish(int)+0x2c) [0x557afbbb326c]
  10: (Context::complete(int)+0x9) [0x557afbbb0ef9]
  11: (SafeTimer::timer_thread()+0x18b) [0x7f2ca0642c3b]
  12: (SafeTimerThread::entry()+0xd) [0x7f2ca06441fd]
  13: (()+0x7dd5) [0x7f2c9e284dd5]
  14: (clone()+0x6d) [0x7f2c9d36202d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.



