[ceph-users] ceph-mgr hangs on larger clusters in Luminous

Bryan Stillwell bstillwell at godaddy.com
Thu Oct 18 13:34:50 PDT 2018


Thanks Dan!

It does look like we're hitting ms_tcp_read_timeout.  I changed it to 79 seconds, and since then I've had a couple of dumps that hung for ~2m40s (2 * ms_tcp_read_timeout) and one that hung for 8 minutes (6 * ms_tcp_read_timeout).
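(For anyone following along, a quick arithmetic sanity check: the observed hang durations do line up with small integer multiples of the 79-second timeout. The hang values below are my rough observations, not exact measurements.)

```python
# Check that observed hang times are near integer multiples of ms_tcp_read_timeout.
timeout = 79  # seconds -- the value I set above

observed_hangs = {
    "~2m40s dump": 160,  # approximate observed hang, in seconds
    "8m dump": 480,
}

for label, seconds in observed_hangs.items():
    multiple = round(seconds / timeout)
    print(f"{label}: ~{multiple} * ms_tcp_read_timeout ({multiple * timeout}s)")
```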

I agree that 15 minutes (900s) is a long timeout.  Does anyone know the reasoning behind that default?

Bryan

From: Dan van der Ster <dan at vanderster.com>
Date: Thursday, October 18, 2018 at 2:03 PM
To: Bryan Stillwell <bstillwell at godaddy.com>
Cc: ceph-users <ceph-users at lists.ceph.com>
Subject: Re: [ceph-users] ceph-mgr hangs on larger clusters in Luminous

A 15-minute interval suggests the ms tcp read timeout is related.

Try shortening that and see if it works around the issue...

(We use ms tcp read timeout = 60 over here -- the 900s default seems
really long for keeping idle connections open.)
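If you want to try it, the option goes in ceph.conf (a sketch -- 60 is just the value we happen to use, tune to taste):

```ini
# ceph.conf -- set in [global] so all daemons pick it up
[global]
ms tcp read timeout = 60
```

It may also be possible to change it at runtime with injectargs, though I'd restart the daemons to be sure a messenger-level option actually takes effect.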

-- dan


On Thu, Oct 18, 2018 at 9:39 PM Bryan Stillwell <bstillwell at godaddy.com> wrote:

I left some of the 'ceph pg dump' commands running; twice they returned results after 30 minutes, and three times after 45 minutes.  Is there something that runs every 15 minutes that would let these commands finish?

Bryan

From: Bryan Stillwell <bstillwell at godaddy.com>
Date: Thursday, October 18, 2018 at 11:16 AM
To: "ceph-users at lists.ceph.com" <ceph-users at lists.ceph.com>
Subject: ceph-mgr hangs on larger clusters in Luminous

After we upgraded from Jewel (10.2.10) to Luminous (12.2.5), we started seeing a problem where the new ceph-mgr would sometimes hang indefinitely on commands like 'ceph pg dump' on our largest cluster (~1,300 OSDs).  The rest of our clusters (10+) aren't seeing the same issue, but they are all under 600 OSDs each.  Restarting ceph-mgr seems to fix the issue for 12 hours or so, but usually overnight the hang reappears.  At first I thought it was a hardware issue, but switching the primary ceph-mgr to another node didn't fix the problem.

I've increased the logging to 20/20 for debug_mgr, and while a working dump looks like this:

2018-10-18 09:26:16.256911 7f9dbf5e7700  4 mgr.server handle_command decoded 3
2018-10-18 09:26:16.256917 7f9dbf5e7700  4 mgr.server handle_command prefix=pg dump
2018-10-18 09:26:16.256937 7f9dbf5e7700 10 mgr.server _allowed_command  client.admin capable
2018-10-18 09:26:16.256951 7f9dbf5e7700  0 log_channel(audit) log [DBG] : from='client.1414554763 10.2.4.2:0/2175076978' entity='client.admin' cmd=[{"prefix": "pg dump", "target": ["mgr", ""], "format": "json-pretty"}]: dispatch
2018-10-18 09:26:22.567583 7f9dbf5e7700  1 mgr.server reply handle_command (0) Success dumped all

A failed dump call doesn't show up at all; the "mgr.server handle_command prefix=pg dump" entry never even makes it to the logs.

The problem also persisted after upgrading to 12.2.8.

Has anyone else seen this?

Thanks,
Bryan

_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


