[ceph-users] ceph-mgr hangs on larger clusters in Luminous

Gregory Farnum gfarnum at redhat.com
Thu Oct 18 14:20:44 PDT 2018

On Thu, Oct 18, 2018 at 1:35 PM Bryan Stillwell <bstillwell at godaddy.com> wrote:
> It does look like we're hitting the ms_tcp_read_timeout.  I changed it to 80 seconds and I've had a couple dumps that were hung for ~2m40s (2*ms_tcp_read_timeout) and one that was hung for 8 minutes (6*ms_tcp_read_timeout).
> I agree that 15 minutes (900s) is a long timeout.  Anyone know the reasoning for that decision?

I think we picked it because it was long enough to be very sure that a
connection wouldn't time out while it was waiting on some kind of slow
response, but short enough that it would actually go away.
In general, we don't expect it to be an "important" value.
Connections shouldn't dangle unless one Ceph entity stays alive that
whole time but stops needing to talk to an entity it was previously
using. And re-establishing a connection takes a few round-trips but
otherwise costs little.

So, e.g., it's not uncommon for an rbd client to hit these disconnects
if it stops using its disk for a while. But there's also very little
cost to keeping the session around.

I wouldn't worry much about turning it down quite a bit, but if it's
changing the behavior of ceph-mgr there's also a ceph-mgr bug that
needs to be resolved. I presume John's link is more useful for that.

