[ceph-users] MDS stuck in 'rejoin' after network fragmentation caused OSD flapping
gfarnum at redhat.com
Thu Aug 16 07:15:21 PDT 2018
On Thu, Aug 16, 2018 at 8:58 AM, Jonathan Woytek <woytek at dryrose.com> wrote:
> This did the trick! THANK YOU!
> After starting with mds_wipe_sessions set and after removing the
> mds*_openfiles.0 entries in the metadata pool, the MDS started almost
> immediately and went active. I verified that the filesystem could mount
> again, shut down the MDS, removed the wipe-sessions setting, and restarted
> all four MDS daemons. The cluster is back to healthy again.
> I've got more stuff to write up on our end for recovery procedures now, and
> that's a good thing! Thanks again!
Do note that while this works and is unlikely to break anything, it's
not entirely ideal. The MDS was trying to probe the size and mtime of
any files which were opened by clients that have since disappeared. By
removing that list of open files, it can't do that any more, so you
may have some inaccurate metadata about individual file sizes or mtimes.
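If accurate sizes and mtimes matter for your workload, a forward scrub can
recheck the affected metadata afterwards. A hedged sketch, assuming you can
reach the active MDS's admin socket on its host (mds.<id> is whatever your
daemon is named):

    # recursively scrub from the filesystem root and repair what it finds
    ceph daemon mds.<id> scrub_path / recursive repair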
> On Wed, Aug 15, 2018 at 11:12 PM, Jonathan Woytek <woytek at dryrose.com> wrote:
>> On Wed, Aug 15, 2018 at 11:02 PM Yan, Zheng <ukernel at gmail.com> wrote:
>>> On Thu, Aug 16, 2018 at 10:55 AM Jonathan Woytek <woytek at dryrose.com> wrote:
>>> > ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic
>>> > (stable)
>>> Try deleting mds0_openfiles.0 (mds1_openfiles.0 and so on if you have
>>> multiple active mds) from the metadata pool of your filesystem. The
>>> records in these files are open-file hints; it's safe to delete them.
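Concretely, those hint files are ordinary objects in the metadata pool, so
they can be listed and removed with rados. A sketch, assuming the common
pool name cephfs_metadata (substitute your filesystem's metadata pool) and
that the MDS is stopped at this point, as in the recovery described above:

    # list the open-files hint objects
    rados -p cephfs_metadata ls | grep openfiles

    # remove one per active MDS rank, as suggested above
    rados -p cephfs_metadata rm mds0_openfiles.0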
>> I will try that in the morning. I had to bail for the night here (UTC-4).
>> Thank you!
>> Sent from my Commodore64
> Jonathan Woytek
> PGP: 462C 5F50 144D 6B09 3B65 FCE8 C1DC DEC4 E8B6 AABC