[ceph-users] ceph-mds failure replaying journal

Yan, Zheng ukernel at gmail.com
Mon Oct 29 01:19:39 PDT 2018


We backported a wrong patch to 13.2.2. Downgrade ceph to 13.2.1, then run
'ceph mds repaired fido_fs:1'.
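
For anyone following along, the order of operations above might be scripted roughly as below. This is only a sketch under assumptions that are not stated in this thread: Debian/Ubuntu packaging, the systemd unit name ceph-mds.target, and the package version string in PKG_VER (hypothetical; check what your repository actually offers). Dependent ceph packages will probably need to be downgraded to the same version as well. By default the script only prints the commands; set RUN=1 to execute them.

```shell
#!/bin/sh
# Sketch: downgrade the MDS to 13.2.1, then clear the damaged flag.
# Prints each command; executes it only when RUN=1.

FS_RANK="fido_fs:1"                      # filesystem:rank from the command above
PKG_VER="${PKG_VER:-13.2.1-1xenial}"     # ASSUMPTION: hypothetical version string

step() {
    # Show the command, and run it only when explicitly requested.
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

recover() {
    step systemctl stop ceph-mds.target                  # ASSUMPTION: systemd unit name
    step apt-get install --allow-downgrades "ceph-mds=${PKG_VER}"
    step systemctl start ceph-mds.target
    step ceph mds repaired "$FS_RANK"                    # command given above
    step ceph -s                                         # check the mds line afterwards
}

recover
```

Run it once without RUN=1 on the MDS host and read the printed commands before executing anything.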

Sorry for the trouble
Yan, Zheng

On Mon, Oct 29, 2018 at 7:48 AM Jon Morby <jon at fido.net> wrote:

>
> We accidentally found ourselves upgraded from 12.2.8 to 13.2.2 after a
> ceph-deploy install went awry (we were expecting it to upgrade to 12.2.9
> and not jump a major release without warning)
>
> Anyway .. as a result, we ended up with an mds journal error and 1 daemon
> reporting as damaged
>
> Having got nowhere asking for help on IRC, we followed various forum
> posts and disaster recovery guides and ended up resetting the journal.
> That left the daemon no longer “damaged”, but we’re now seeing the mds
> segfault whilst trying to replay:
>
> https://pastebin.com/iSLdvu0b
>
>
>
> /build/ceph-13.2.2/src/mds/journal.cc: 1572: FAILED assert(g_conf->mds_wipe_sessions)
>
>  ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7fad637f70f2]
>  2: (()+0x3162b7) [0x7fad637f72b7]
>  3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDSlaveUpdate*)+0x5f4b) [0x7a7a6b]
>  4: (EUpdate::replay(MDSRank*)+0x39) [0x7a8fa9]
>  5: (MDLog::_replay_thread()+0x864) [0x752164]
>  6: (MDLog::ReplayThread::entry()+0xd) [0x4f021d]
>  7: (()+0x76ba) [0x7fad6305a6ba]
>  8: (clone()+0x6d) [0x7fad6288341d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
>
> full logs
>
> https://pastebin.com/X5UG9vT2
>
>
> We’ve been unable to access the cephfs file system since all of this
> started. Attempts to mount fail with reports that “mds probably not
> available”:
>
> Oct 28 23:47:02 mirrors kernel: [115602.911193] ceph: probably no mds server is up
>
>
> root@mds02:~# ceph -s
>   cluster:
>     id:     78d5bf7d-b074-47ab-8d73-bd4d99df98a5
>     health: HEALTH_WARN
>             1 filesystem is degraded
>             insufficient standby MDS daemons available
>             too many PGs per OSD (276 > max 250)
>
>   services:
>     mon: 3 daemons, quorum mon01,mon02,mon03
>     mgr: mon01(active), standbys: mon02, mon03
>     mds: fido_fs-2/2/1 up  {0=mds01=up:resolve,1=mds02=up:replay(laggy or crashed)}
>     osd: 27 osds: 27 up, 27 in
>
>   data:
>     pools:   15 pools, 3168 pgs
>     objects: 16.97 M objects, 30 TiB
>     usage:   71 TiB used, 27 TiB / 98 TiB avail
>     pgs:     3168 active+clean
>
>   io:
>     client:   680 B/s rd, 1.1 MiB/s wr, 0 op/s rd, 345 op/s wr
>
>
> Before I just trash the entire fs and give up on ceph, does anyone have
> any suggestions as to how we can fix this?
>
> root@mds02:~# ceph versions
> {
>     "mon": {
>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 3
>     },
>     "mgr": {
>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 3
>     },
>     "osd": {
>         "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 27
>     },
>     "mds": {
>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 2
>     },
>     "overall": {
>         "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 27,
>         "ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)": 8
>     }
> }