[ceph-users] Mimic offline problem
goktug.yildirim at gmail.com
Mon Oct 1 14:03:40 PDT 2018
I mistyped the user list mail address. I am correcting and sending again. Apologies for the noise.
My mail is below.
> From: Goktug Yildirim <goktug.yildirim at gmail.com>
> Date: 1 October 2018 21:54:31 GMT+2
> To: ceph-users-join at lists.ceph.com
> Cc: ceph-devel at vger.kernel.org
> Subject: Mimic offline problem
> Hi all,
> We recently upgraded from Luminous to Mimic, and the cluster has been offline for the 6 days since. The long story short is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> I've also CC'ed the developers since I believe this is a bug. If this is not the correct way to do it, I apologize; please let me know.
> Over these 6 days a lot has happened and we reached several conclusions about the problem. Some of them were misjudged and some were not investigated deeply enough.
> However, the most certain diagnosis is this: each OSD causes very high disk I/O on its BlueStore disk (WAL and DB are fine). After that, the OSDs become unresponsive or far less responsive. For example, "ceph tell osd.x version" hangs seemingly forever.
> Because the OSDs are unresponsive, the cluster does not settle. This is our problem!
> This is the one thing we are very sure of, but we are not sure of the cause.
> Here is the latest ceph status:
> This is the status after we started all of the OSDs 24 hours ago.
> Some of the OSDs are not started. However, it didn't make any difference when all of them were online.
> Here is the debug=20 log of an OSD, which is the same for all the others:
> As far as we can tell there is a loop pattern. I am sure it would not be caught by eye.
> This is the full log of the same OSD.
> Here is the strace of the same OSD process:
> Recently we have been hearing more about upgrading to Mimic. I hope no one gets hurt as we did. I am sure we made lots of mistakes that let this happen. This situation may serve as an example for other users and could point to a potential bug for the Ceph developers.
> Any help to figure out what is going on would be great.
> Best Regards,
> Goktug Yildirim
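
The symptom described in the message, where "ceph tell osd.x version" hangs, can be checked across several OSDs with a probe that gives up after a timeout. What follows is only a minimal sketch in Python, not something taken from the original mail: the helper names probe and probe_osds, the 10-second timeout, and passing the OSD ids on the command line are all assumptions made for illustration. The only Ceph command used is the "ceph tell osd.N version" call quoted in the message.

```python
import subprocess
import sys


def probe(cmd, timeout_s):
    """Run cmd and classify the outcome as 'ok', 'error', or 'timeout'.

    Hypothetical helper: it only wraps subprocess.run with a timeout.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed if it runs longer
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    except FileNotFoundError:
        # The binary (for example "ceph") is not installed on this host.
        return "error"
    return "ok" if result.returncode == 0 else "error"


def probe_osds(osd_ids, timeout_s=10.0):
    """Probe each OSD id with 'ceph tell osd.N version' (timeout is an assumption)."""
    return {
        osd_id: probe(["ceph", "tell", f"osd.{osd_id}", "version"], timeout_s)
        for osd_id in osd_ids
    }


if __name__ == "__main__":
    # OSD ids are taken from the command line, e.g.: python probe_osds.py 0 1 2
    ids = [int(arg) for arg in sys.argv[1:]]
    for osd_id, state in probe_osds(ids).items():
        print(f"osd.{osd_id}: {state}")
```

An OSD reported as "timeout" matches the unresponsive behaviour described in the message; "error" only means the command exited with a non-zero status or could not be run, which needs separate investigation.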