[ceph-users] ceph-mds failure replaying journal

Yan, Zheng ukernel at gmail.com
Mon Oct 29 04:12:32 PDT 2018


Please try again with debug_mds=10 and send the log to me.
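A minimal sketch of raising the MDS debug level (the daemon name mds04 is taken from the log_file path in the crash dump below; substitute your own MDS id):

```shell
# Raise MDS debug logging to 10 at runtime, without a restart.
ceph tell mds.mds04 injectargs '--debug_mds 10'

# Or persist it across restarts in /etc/ceph/ceph.conf:
#   [mds]
#   debug_mds = 10
# then restart the daemon and capture /var/log/ceph/ceph-mds.mds04.log
```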

Regards
Yan, Zheng

On Mon, Oct 29, 2018 at 6:30 PM Jon Morby (Fido) <jon at fido.net> wrote:

> fyi, downgrading to 13.2.1 doesn't seem to have fixed the issue either :(
>
> --- end dump of recent events ---
> 2018-10-29 10:27:50.440 7feb58b43700 -1 *** Caught signal (Aborted) **
>  in thread 7feb58b43700 thread_name:md_log_replay
>
>  ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic
> (stable)
>  1: (()+0x3ebf40) [0x55deff8e0f40]
>  2: (()+0x11390) [0x7feb68246390]
>  3: (gsignal()+0x38) [0x7feb67993428]
>  4: (abort()+0x16a) [0x7feb6799502a]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x250) [0x7feb689a5630]
>  6: (()+0x2e26a7) [0x7feb689a56a7]
>  7: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x5f4b)
> [0x55deff8ccc8b]
>  8: (EUpdate::replay(MDSRank*)+0x39) [0x55deff8ce1c9]
>  9: (MDLog::_replay_thread()+0x864) [0x55deff876974]
>  10: (MDLog::ReplayThread::entry()+0xd) [0x55deff61a95d]
>  11: (()+0x76ba) [0x7feb6823c6ba]
>  12: (clone()+0x6d) [0x7feb67a6541d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> --- begin dump of recent events ---
>      0> 2018-10-29 10:27:50.440 7feb58b43700 -1 *** Caught signal
> (Aborted) **
>  in thread 7feb58b43700 thread_name:md_log_replay
>
>  ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic
> (stable)
>  1: (()+0x3ebf40) [0x55deff8e0f40]
>  2: (()+0x11390) [0x7feb68246390]
>  3: (gsignal()+0x38) [0x7feb67993428]
>  4: (abort()+0x16a) [0x7feb6799502a]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x250) [0x7feb689a5630]
>  6: (()+0x2e26a7) [0x7feb689a56a7]
>  7: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x5f4b)
> [0x55deff8ccc8b]
>  8: (EUpdate::replay(MDSRank*)+0x39) [0x55deff8ce1c9]
>  9: (MDLog::_replay_thread()+0x864) [0x55deff876974]
>  10: (MDLog::ReplayThread::entry()+0xd) [0x55deff61a95d]
>  11: (()+0x76ba) [0x7feb6823c6ba]
>  12: (clone()+0x6d) [0x7feb67a6541d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> --- logging levels ---
>    0/ 5 none
>    0/ 0 lockdep
>    0/ 0 context
>    0/ 0 crush
>    3/ 3 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 0 buffer
>    0/ 0 timer
>    0/ 0 filer
>    0/ 1 striper
>    0/ 0 objecter
>    0/ 0 rados
>    0/ 0 rbd
>    0/ 5 rbd_mirror
>    0/ 5 rbd_replay
>    0/ 0 journaler
>    0/ 5 objectcacher
>    0/ 0 client
>    0/ 0 osd
>    0/ 0 optracker
>    0/ 0 objclass
>    0/ 0 filestore
>    0/ 0 journal
>    0/ 0 ms
>    0/ 0 mon
>    0/ 0 monc
>    0/ 0 paxos
>    0/ 0 tp
>    0/ 0 auth
>    1/ 5 crypto
>    0/ 0 finisher
>    1/ 1 reserver
>    0/ 0 heartbeatmap
>    0/ 0 perfcounter
>    0/ 0 rgw
>    1/ 5 rgw_sync
>    1/10 civetweb
>    1/ 5 javaclient
>    0/ 0 asok
>    0/ 0 throttle
>    0/ 0 refs
>    1/ 5 xio
>    1/ 5 compressor
>    1/ 5 bluestore
>    1/ 5 bluefs
>    1/ 3 bdev
>    1/ 5 kstore
>    4/ 5 rocksdb
>    4/ 5 leveldb
>    4/ 5 memdb
>    1/ 5 kinetic
>    1/ 5 fuse
>    1/ 5 mgr
>    1/ 5 mgrc
>    1/ 5 dpdk
>    1/ 5 eventtrace
>   99/99 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent     10000
>   max_new         1000
>   log_file /var/log/ceph/ceph-mds.mds04.log
> --- end dump of recent events ---
>
>
> ----- On 29 Oct, 2018, at 09:25, Jon Morby <jon at fido.net> wrote:
>
> Hi
>
> Ideally we'd like to undo the whole accidental upgrade to 13.x and ensure
> that ceph-deploy doesn't do another major release upgrade without a lot of
> warnings
>
> Either way, I'm currently getting errors that 13.2.1 isn't available /
> shaman is offline / etc
>
> What's the best / recommended way of doing this downgrade across our
> estate?
>
>
>
> ----- On 29 Oct, 2018, at 08:19, Yan, Zheng <ukernel at gmail.com> wrote:
>
>
> We backported a wrong patch to 13.2.2. Downgrade Ceph to 13.2.1, then run
> 'ceph mds repaired fido_fs:1'.
> Sorry for the trouble.
> Yan, Zheng
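Concretely, the suggested workaround might look like the following on each MDS host (a sketch only: the package version suffix and service names are distro-dependent assumptions, not from the thread):

```shell
# Downgrade the MDS package from 13.2.2 to 13.2.1 (Ubuntu/Debian sketch;
# "13.2.1-1xenial" is an assumed package string -- check
# `apt-cache madison ceph-mds` for the real one on your system).
apt-get install --allow-downgrades ceph-mds=13.2.1-1xenial
systemctl restart ceph-mds.target

# Tell the monitors that rank 1 of the filesystem has been repaired, so
# the MDS will attempt journal replay again.
ceph mds repaired fido_fs:1
```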
>
> On Mon, Oct 29, 2018 at 7:48 AM Jon Morby <jon at fido.net> wrote:
>
>>
>> We accidentally found ourselves upgraded from 12.2.8 to 13.2.2 after a
>> ceph-deploy install went awry (we were expecting it to upgrade to 12.2.9
>> and not jump a major release without warning)
>>
>> Anyway .. as a result, we ended up with an mds journal error and 1 daemon
>> reporting as damaged
>>
>> Having got nowhere asking for help on IRC, we followed various forum
>> posts and disaster recovery guides and ended up resetting the journal,
>> which left the daemon no longer “damaged”; however, we’re now seeing the
>> mds segfault whilst trying to replay
>>
>> https://pastebin.com/iSLdvu0b
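For context, the journal-recovery steps from the CephFS disaster-recovery guide that the poster alludes to look roughly like this (a sketch, not a recommendation; these tools discard metadata, so export a backup first):

```shell
# Back up the journal before touching anything.
cephfs-journal-tool journal export backup.bin

# Recover what dentries we can from the damaged journal into the metadata
# store, then erase the journal itself.
cephfs-journal-tool event recover_dentries summary
cephfs-journal-tool journal reset

# Reset the session table as well: stale sessions left behind by a journal
# reset are a common cause of the assert(g_conf->mds_wipe_sessions)
# failure shown below.
cephfs-table-tool all reset session
```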
>>
>>
>>
>> /build/ceph-13.2.2/src/mds/journal.cc: 1572: FAILED
>> assert(g_conf->mds_wipe_sessions)
>>
>>  ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic
>> (stable)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x102) [0x7fad637f70f2]
>>  2: (()+0x3162b7) [0x7fad637f72b7]
>>  3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x5f4b)
>> [0x7a7a6b]
>>  4: (EUpdate::replay(MDSRank*)+0x39) [0x7a8fa9]
>>  5: (MDLog::_replay_thread()+0x864) [0x752164]
>>  6: (MDLog::ReplayThread::entry()+0xd) [0x4f021d]
>>  7: (()+0x76ba) [0x7fad6305a6ba]
>>  8: (clone()+0x6d) [0x7fad6288341d]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> to interpret this.
>>
>>
>> full logs
>>
>> https://pastebin.com/X5UG9vT2
>>
>>
>> We’ve been unable to access the cephfs file system since all of this
>> started …. attempts to mount fail with reports that “mds probably not
>> available”
>>
>> Oct 28 23:47:02 mirrors kernel: [115602.911193] ceph: probably no mds
>> server is up
>>
>>
>> root at mds02:~# ceph -s
>>   cluster:
>>     id:     78d5bf7d-b074-47ab-8d73-bd4d99df98a5
>>     health: HEALTH_WARN
>>             1 filesystem is degraded
>>             insufficient standby MDS daemons available
>>             too many PGs per OSD (276 > max 250)
>>
>>   services:
>>     mon: 3 daemons, quorum mon01,mon02,mon03
>>     mgr: mon01(active), standbys: mon02, mon03
>>     mds: fido_fs-2/2/1 up  {0=mds01=up:resolve,1=mds02=up:replay(laggy or
>> crashed)}
>>     osd: 27 osds: 27 up, 27 in
>>
>>   data:
>>     pools:   15 pools, 3168 pgs
>>     objects: 16.97 M objects, 30 TiB
>>     usage:   71 TiB used, 27 TiB / 98 TiB avail
>>     pgs:     3168 active+clean
>>
>>   io:
>>     client:   680 B/s rd, 1.1 MiB/s wr, 0 op/s rd, 345 op/s wr
>>
>>
>> Before I just trash the entire fs and give up on ceph, does anyone have
>> any suggestions as to how we can fix this?
>>
>> root at mds02:~# ceph versions
>> {
>>     "mon": {
>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126)
>> mimic (stable)": 3
>>     },
>>     "mgr": {
>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126)
>> mimic (stable)": 3
>>     },
>>     "osd": {
>>         "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0)
>> luminous (stable)": 27
>>     },
>>     "mds": {
>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126)
>> mimic (stable)": 2
>>     },
>>     "overall": {
>>         "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0)
>> luminous (stable)": 27,
>>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126)
>> mimic (stable)": 8
>>     }
>> }
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
> --
> ------------------------------
> Jon Morby
> FidoNet - the internet made simple!
> 10 - 16 Tiller Road, London, E14 8PX
> tel: 0345 004 3050 / fax: 0345 004 3051
>
> Need more rack space?
> Check out our Co-Lo offerings at http://www.fido.net/services/colo/
> 32 amp racks in London and Brighton
> Linx ConneXions available at all Fido sites!
> https://www.fido.net/services/backbone/connexions/
> PGP Key <http://jonmorby.com/B3B5AD3A.asc>: 26DC B618 DE9E F9CB F8B7 1EFA
> 2A64 BA69 B3B5 AD3A - http://jonmorby.com/B3B5AD3A.asc
>
>
>
>


More information about the ceph-users mailing list