[ceph-users] cephfs kernel, hang with libceph: osdx X.X.X.X socket closed (con state OPEN)
aderumier at odiso.com
Thu Nov 8 16:12:25 PST 2018
To be more precise:
the log messages appear when the hang ends.
I have looked at stats on 10 different hangs, and the duration is always around 15 minutes.
Maybe related to:
ms tcp read timeout
Description: If a client or daemon makes a request to another Ceph daemon and does not drop an unused connection, the ms tcp read timeout defines the connection as idle after the specified number of seconds.
Type: Unsigned 64-bit Integer
Default: 900 (15 minutes).
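If that hypothesis is right, one way to test it would be to lower the timeout on the client side and see whether the hang duration shrinks to match. A minimal sketch of a ceph.conf fragment (the value 60 is just an example for testing, not a recommendation):

```ini
# ceph.conf on the client -- sketch only, assuming the 15-minute hang
# really does track "ms tcp read timeout" (default 900 seconds)
[global]
ms tcp read timeout = 60
```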
I found a similar bug report involving a firewall too:
----- Original Message -----
From: "aderumier" <aderumier at odiso.com>
To: "ceph-users" <ceph-users at lists.ceph.com>
Sent: Thursday, November 8, 2018 18:16:20
Subject: [ceph-users] cephfs kernel, hang with libceph: osdx X.X.X.X socket closed (con state OPEN)
We are currently testing cephfs with the kernel module (4.17 and 4.18) instead of fuse (which worked fine),
and we get hangs where iowait jumps like crazy for around 20 min.
The client is a qemu 2.12 VM with a virtio-net interface.
In the client logs, we are seeing this kind of message:
[Thu Nov 8 12:20:18 2018] libceph: osd14 x.x.x.x:6801 socket closed (con state OPEN)
[Thu Nov 8 12:42:03 2018] libceph: osd9 x.x.x.x:6821 socket closed (con state OPEN)
and in osd logs:
2018-11-08 12:20:25.247 7f31ffac8700 0 -- x.x.x.x:6801/1745 >> x.x.x.x:0/3678871522 conn(0x558c430ec300 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
2018-11-08 12:42:09.820 7f7ca970e700 0 -- x.x.x.x:6821/1739 >> x.x.x.x:0/3678871522 conn(0x564fcbec5100 :6821 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
cluster is ceph 13.2.1
Note that we have a physical firewall between the client and the servers; I'm not sure yet whether the session could be dropped by it (I haven't found any logs on the firewall).
Any idea? I would like to know whether it's a network bug or a ceph bug (I'm not sure how to interpret the osd logs).
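For what it's worth, while a hang is in progress the kernel client's outstanding OSD requests can usually be inspected through debugfs. A rough sketch (the paths are assumptions: they exist only when debugfs is mounted at /sys/kernel/debug and a cephfs kernel mount is active):

```shell
# Sketch: dump in-flight OSD requests of the cephfs kernel client.
# Run this on the client during the hang; a request stuck against a
# single osd would point at a dropped TCP session rather than a ceph bug.
for d in /sys/kernel/debug/ceph/*/; do
    echo "client instance: $d"
    cat "$d/osdc" 2>/dev/null || echo "  (no osdc file readable here)"
done
echo "done inspecting osdc"
```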
fuse_disable_pagecache = true
client_reconnect_stale = true
ceph-users mailing list
ceph-users at lists.ceph.com
More information about the ceph-users mailing list