[ceph-users] cephfs kernel, hang with libceph: osdx X.X.X.X socket closed (con state OPEN)

Alexandre DERUMIER aderumier at odiso.com
Thu Nov 8 22:08:55 PST 2018


>>If you're using kernel client for cephfs, I strongly advise to have the client on the same subnet as the ceph public one i.e all traffic should be on the same subnet/VLAN. Even if your firewall situation is good, if you have to cross subnets or VLANs, you will run into weird problems later.

Thanks. 

Currently the client is in a different VLAN for security (there are multiple customers, and I don't want a customer to have direct access to another customer or to ceph).
But as they are VMs, I can put them in the same VLAN and do the firewalling on the hypervisor (I'll need firewalling in any case).


>>Fuse has much better tolerance for that scenario. 

What's the difference?



----- Original Message -----
From: "Linh Vu" <vul at unimelb.edu.au>
To: "aderumier" <aderumier at odiso.com>, "ceph-users" <ceph-users at lists.ceph.com>
Sent: Friday, 9 November 2018 02:16:07
Subject: Re: cephfs kernel, hang with libceph: osdx X.X.X.X socket closed (con state OPEN)



If you're using kernel client for cephfs, I strongly advise to have the client on the same subnet as the ceph public one i.e all traffic should be on the same subnet/VLAN. Even if your firewall situation is good, if you have to cross subnets or VLANs, you will run into weird problems later. Fuse has much better tolerance for that scenario. 

From: ceph-users <ceph-users-bounces at lists.ceph.com> on behalf of Alexandre DERUMIER <aderumier at odiso.com> 
Sent: Friday, 9 November 2018 12:06:43 PM 
To: ceph-users 
Subject: Re: [ceph-users] cephfs kernel, hang with libceph: osdx X.X.X.X socket closed (con state OPEN) 
OK,
it seems to come from the firewall:
I'm seeing sessions dropped exactly 15 minutes before the log message.

The dropped sessions are the sessions to the OSDs; the sessions to the mon and mds are fine.


It seems that keepalive2 is used to monitor the mon session:
https://patchwork.kernel.org/patch/7105641/

but I'm not sure about the osd sessions?

----- Original Message -----
From: "aderumier" <aderumier at odiso.com>
To: "ceph-users" <ceph-users at lists.ceph.com>
Cc: "Alexandre Bruyelles" <abruyelles at odiso.com>
Sent: Friday, 9 November 2018 01:12:25
Subject: Re: [ceph-users] cephfs kernel, hang with libceph: osdx X.X.X.X socket closed (con state OPEN)

To be more precise,

the log messages appear when the hang ends.

I have looked at the stats for 10 different hangs, and the duration is always around 15 minutes.

Maybe related to: 

ms tcp read timeout 
Description: If a client or daemon makes a request to another Ceph daemon and does not drop an unused connection, the ms tcp read timeout defines the connection as idle after the specified number of seconds. 
Type: Unsigned 64-bit Integer 
Required: No 
Default: 900 (15 minutes).

? 
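If the 900-second hypothesis holds, each "socket closed" message should be preceded by a dropped session about 15 minutes earlier. The sketch below only does that subtraction, using the two client timestamps from the original report, to get the times to look for in the firewall session table or logs. The function name and the constant are illustrative, and it assumes the client and firewall clocks agree.

```python
from datetime import datetime, timedelta

# Default value of "ms tcp read timeout" quoted above, in seconds.
MS_TCP_READ_TIMEOUT = 900


def expected_drop_time(log_time, timeout_s=MS_TCP_READ_TIMEOUT):
    """Return the moment timeout_s seconds before a 'socket closed' message."""
    return log_time - timedelta(seconds=timeout_s)


# Timestamps of the two 'socket closed' messages in the original report.
socket_closed = {
    "osd14": datetime(2018, 11, 8, 12, 20, 18),
    "osd9": datetime(2018, 11, 8, 12, 42, 3),
}

for osd, seen_at in socket_closed.items():
    # Print when the idle session would have been dropped under the hypothesis.
    print(osd, seen_at.time(), "->", expected_drop_time(seen_at).time())
```

If the firewall shows nothing near those times, the 900-second default is probably not the explanation.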

I also found a similar bug report involving a firewall:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-October/013841.html


----- Original Message -----
From: "aderumier" <aderumier at odiso.com>
To: "ceph-users" <ceph-users at lists.ceph.com>
Sent: Thursday, 8 November 2018 18:16:20
Subject: [ceph-users] cephfs kernel, hang with libceph: osdx X.X.X.X socket closed (con state OPEN)

Hi, 

we are currently testing cephfs with the kernel module (4.17 and 4.18) instead of fuse (which worked fine),

and we get hangs: iowait jumps like crazy for around 20 minutes.

The client is a qemu 2.12 VM with a virtio-net interface.


In the client logs, we are seeing messages like this:

[jeu. nov. 8 12:20:18 2018] libceph: osd14 x.x.x.x:6801 socket closed (con state OPEN) 
[jeu. nov. 8 12:42:03 2018] libceph: osd9 x.x.x.x:6821 socket closed (con state OPEN) 


and in osd logs: 

osd14: 
2018-11-08 12:20:25.247 7f31ffac8700 0 -- x.x.x.x:6801/1745 >> x.x.x.x:0/3678871522 conn(0x558c430ec300 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1) 

osd9: 
2018-11-08 12:42:09.820 7f7ca970e700 0 -- x.x.x.x:6821/1739 >> x.x.x.x:0/3678871522 conn(0x564fcbec5100 :6821 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1) 
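To line up the two sides, a rough sketch like the following can pair each client "socket closed" line with the matching OSD "accept replacing existing" line and show how long the reconnect took. The regular expressions are written against the exact lines pasted above and are only assumptions about the log format. It also assumes both hosts have synchronized clocks, that both events fall on the same day, and it ignores fractions of a second.

```python
import re

# Client side: time of day, OSD id and connection state from a libceph line.
CLIENT_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}) \d{4}\] libceph: osd(\d+) \S+ "
    r"socket closed \(con state (\w+)\)"
)
# OSD side: time of day of the "accept replacing existing" message.
OSD_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} (\d{2}):(\d{2}):(\d{2})\.\d+ .*accept replacing existing"
)


def to_seconds(hours, minutes, seconds):
    """Convert an H, M, S triple of strings to seconds since midnight."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds)


def reconnect_delay(client_line, osd_line):
    """Seconds between the client's close and the OSD's accept, or None."""
    c = CLIENT_RE.search(client_line)
    o = OSD_RE.search(osd_line)
    if c is None or o is None:
        return None
    return to_seconds(*o.groups()) - to_seconds(*c.groups()[:3])


client_line = (
    "[jeu. nov. 8 12:20:18 2018] libceph: osd14 x.x.x.x:6801 "
    "socket closed (con state OPEN)"
)
osd_line = (
    "2018-11-08 12:20:25.247 7f31ffac8700 0 -- x.x.x.x:6801/1745 >> "
    "x.x.x.x:0/3678871522 conn(0x558c430ec300 :6801 "
    "s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1)."
    "handle_connect_msg accept replacing existing (lossy) channel "
    "(new one lossy=1)"
)

print(reconnect_delay(client_line, osd_line), "seconds")
```

For the osd14 pair this gives a gap of a few seconds, so the OSD accepts the new connection almost immediately after the client notices that the old socket is gone.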


cluster is ceph 13.2.1 

Note that we have a physical firewall between the client and the servers; I'm not sure yet whether the session could have been dropped there. (I haven't found any logs on the firewall.)

Any ideas? I would like to know whether it's a network bug or a ceph bug (I'm not sure how to interpret the osd logs).

Regards, 

Alexandre 



client ceph.conf 
---------------- 
[client] 
fuse_disable_pagecache = true 
client_reconnect_stale = true 


_______________________________________________ 
ceph-users mailing list 
ceph-users at lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
