[ceph-users] Mimic offline problem

Göktuğ Yıldırım goktug.yildirim at gmail.com
Mon Oct 1 14:03:40 PDT 2018


I mistyped the users list address, so I am correcting it and sending this again. Apologies for the noise.

My mail is below.


Begin forwarded message:

> From: Goktug Yildirim <goktug.yildirim at gmail.com>
> Date: 1 October 2018 21:54:31 GMT+2
> To: ceph-users-join at lists.ceph.com
> Cc: ceph-devel at vger.kernel.org
> Subject: Mimic offline problem
> 
> Hi all,
> 
> We recently upgraded from Luminous to Mimic, and the cluster has been offline for the 6 days since. The long story short is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> 
> I've also CC'ed the developers since I believe this is a bug. If this is not the correct way to do it, I apologize; please let me know.
> 
> Over these 6 days a lot has happened and we reached some conclusions about the problem. Some of them were misjudged and some were not investigated deeply enough.
> However, the most certain diagnosis is this: each OSD causes very high disk I/O on its BlueStore disk (WAL and DB are fine). After that, the OSDs become unresponsive or respond only very slowly. For example, "ceph tell osd.x version" hangs seemingly forever.
> 
> So, because of the unresponsive OSDs, the cluster does not settle. This is our problem!
> 
> This is the one thing we are very sure of, but we are not sure of the cause.
> 
> Here is the latest ceph status: 
> https://paste.ubuntu.com/p/2DyZ5YqPjh/
> 
> This is the status after we started all of the OSDs 24 hours ago.
> Some of the OSDs did not start. However, it didn't make any difference when all of them were online.
> 
> Here is the debug=20 log of one OSD, which looks the same for all the others:
> https://paste.ubuntu.com/p/8n2kTvwnG6/
> As far as we can tell, there is a loop pattern. I am sure you won't miss it.
> 
> This is the full log of the same OSD:
> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
> 
> Here is the strace of the same OSD process:
> https://paste.ubuntu.com/p/8n2kTvwnG6/
> 
> Lately we hear more and more recommendations to upgrade to Mimic. I hope no one gets hurt the way we did. I am sure we made many mistakes to let this happen. This situation may serve as an example for other users and could point to a potential bug for the Ceph developers.
> 
> Any help to figure out what is going on would be great.
> 
> Best Regards,
> Goktug Yildirim
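
For anyone who wants to repeat the responsiveness check described in the message, here is a minimal sketch (not part of the original mail) that runs "ceph tell osd.<id> version" with a timeout, so a hung OSD does not block the shell indefinitely. Only the "ceph tell osd.x version" command comes from the report; the helper names `responds` and `unresponsive_osds`, the 10-second default timeout, and the choice of Python are assumptions made for illustration.

```python
import subprocess


def responds(cmd, timeout_s=10.0):
    """Return True if cmd exits with status 0 within timeout_s seconds.

    A timeout, a missing binary, or a non-zero exit status all count
    as "not responding".
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed if it runs longer
        )
    except subprocess.TimeoutExpired:
        return False
    except FileNotFoundError:
        return False
    return result.returncode == 0


def unresponsive_osds(osd_ids, timeout_s=10.0):
    """Return the OSD ids whose 'ceph tell osd.<id> version' did not answer."""
    return [
        osd_id
        for osd_id in osd_ids
        if not responds(["ceph", "tell", f"osd.{osd_id}", "version"], timeout_s)
    ]
```

Calling `unresponsive_osds([90, 91, 92])` on a node with a working ceph CLI and admin keyring would list which of those (hypothetical) ids failed to answer in time. Note that a missing `ceph` binary is reported the same way as a hung OSD, so this is only a rough first check.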