[ceph-users] cephfs clients hanging multi mds to single mds

Jaime Ibar jaime at tchpc.tcd.ie
Tue Oct 2 05:53:12 PDT 2018


Hi Paul,

I tried mounting it with ceph-fuse at a different mount point and it worked.

The problem here is that we can't unmount the ceph kernel client as it is in
use by some virsh processes. We forced the unmount and mounted ceph-fuse, but
we got an I/O error; umount -l cleared all the processes, but after rebooting
the VMs they didn't come back and a server reboot was needed.

I'm not sure how I can restore the MDS session or remount cephfs while keeping
all the processes running.
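
For reference, this is roughly the sequence we went through; the mount point,
monitor address and client name below are placeholders for our setup:

    # see which processes are still holding the kernel mount (virsh/qemu here)
    fuser -vm /mnt/cephfs

    # forced, then lazy, unmount of the kernel client
    umount -f /mnt/cephfs
    umount -l /mnt/cephfs

    # mount ceph-fuse in its place
    ceph-fuse -m mon1:6789 --id admin /mnt/cephfs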

Thanks a lot for your help.

Jaime


On 02/10/18 11:02, Paul Emmerich wrote:
> Kernel 4.4 is not suitable for a multi MDS setup. In general, I
> wouldn't feel comfortable running 4.4 with kernel cephfs in
> production.
> I think at least 4.15 (not sure, but definitely > 4.9) is recommended
> for multi MDS setups.
>
> If you can't reboot: maybe try ceph-fuse instead, which usually works very
> well and is fast enough.
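>
> Something like this should tell you which kernels are actually talking to
> the cluster (the mds name is a placeholder and the exact output differs
> between versions):
>
>     uname -r                            # on each client: the running kernel
>     ceph features                       # releases/feature bits of connected clients
>     ceph daemon mds.<name> session ls   # per-session client metadata (incl. kernel_version)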
>
> Paul
>
> On Tue, 2 Oct 2018 at 10:45, Jaime Ibar <jaime at tchpc.tcd.ie> wrote:
>> Hi Paul,
>>
>> we're using the 4.4 kernel. I'm not sure whether more recent kernels are
>> stable enough for production services. In any case, as there are some
>> production services running on those servers, we'd rather avoid rebooting
>> if we can bring the ceph clients back without it.
>>
>> Thanks
>>
>> Jaime
>>
>>
>> On 01/10/18 21:10, Paul Emmerich wrote:
>>> Which kernel version are you using for the kernel cephfs clients?
>>> I've seen this problem with "older" kernels (where "old" can be as recent as 4.9).
>>>
>>> Paul
>>> On Mon, 1 Oct 2018 at 18:35, Jaime Ibar <jaime at tchpc.tcd.ie> wrote:
>>>> Hi all,
>>>>
>>>> we're running a Ceph 12.2.7 Luminous cluster. Two weeks ago we enabled
>>>> multi MDS, and after a few hours these errors started showing up:
>>>>
>>>> 2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds old, received at 2018-09-28 09:40:16.155841: client_request(client.31059144:8544450 getattr Xs #0x100002e1e73 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{}) currently failed to authpin local pins
>>>>
>>>> 2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients failing to respond to cache pressure (MDS_CLIENT_RECALL)
>>>> 2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included below; oldest blocked for > 4614.580689 secs
>>>> 2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 seconds old, received at 2018-09-28 10:53:03.203476: client_request(client.31059144:9080057 lookup #0x100000b7564/58 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{}) currently initiated
>>>> 2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients failing to respond to capability release; 5 clients failing to respond to cache pressure; 1 MDSs report slow requests
>>>>
>>>> Due to this, we decided to go back to a single MDS (as it worked before);
>>>> however, the clients pointing to mds.1 started hanging, while the ones
>>>> pointing to mds.0 kept working fine.
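>>>>
>>>> For context, this is roughly how we've been switching between one and two
>>>> active MDS on Luminous ("cephfs" stands for our filesystem name):
>>>>
>>>>     ceph fs set cephfs max_mds 2    # add a second active MDS
>>>>     ceph fs set cephfs max_mds 1    # go back towards a single active MDS
>>>>     ceph mds deactivate cephfs:1    # on Luminous, rank 1 also has to be deactivated
>>>>     ceph fs status                  # watch rank 1 go through the stopping state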
>>>>
>>>> Then we tried to enable multi MDS again and the clients pointing to mds.1
>>>> went back online; however, the ones pointing to mds.0 stopped working.
>>>>
>>>> Today we tried to go back to a single MDS again, but this error was
>>>> preventing ceph from disabling the second active MDS (mds.1):
>>>>
>>>> 2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client
>>>> XXXXX: (30108925), after 68213.084174 seconds
>>>>
>>>> After waiting for 3 hours, we restarted the mds.1 daemon (as it was stuck
>>>> in the stopping state forever due to the above error), waited for it to
>>>> become active again, unmounted the problematic clients, waited for the
>>>> cluster to be healthy and tried to go back to a single MDS again.
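>>>>
>>>> Would something along these lines be the right way to inspect and, if
>>>> needed, manually evict the stuck sessions? (The MDS name and client id
>>>> are placeholders.)
>>>>
>>>>     ceph daemon mds.<name> ops          # slow / in-flight requests on that MDS
>>>>     ceph daemon mds.<name> session ls   # connected client sessions
>>>>     ceph tell mds.1 client evict id=<client id>   # evict one stuck session by hand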
>>>>
>>>> Apparently this worked for some of the clients. We tried to enable multi
>>>> MDS again to bring the faulty clients back, but no luck this time: some of
>>>> them are still hanging and can't access the ceph fs.
>>>>
>>>> This is what we have in kern.log:
>>>>
>>>> Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
>>>> Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
>>>> Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed
>>>>
>>>> Not sure what else we can try to bring the hanging clients back without
>>>> rebooting, as they're in production and rebooting is not an option.
>>>>
>>>> Does anyone know how we can deal with this, please?
>>>>
>>>> Thanks
>>>>
>>>> Jaime
>>>>
>>>> --
>>>>
>>>> Jaime Ibar
>>>> High Performance & Research Computing, IS Services
>>>> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
>>>> http://www.tchpc.tcd.ie/ | jaime at tchpc.tcd.ie
>>>> Tel: +353-1-896-3725
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users at lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> --
>>
>> Jaime Ibar
>> High Performance & Research Computing, IS Services
>> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
>> http://www.tchpc.tcd.ie/ | jaime at tchpc.tcd.ie
>> Tel: +353-1-896-3725
>>
>

-- 

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | jaime at tchpc.tcd.ie
Tel: +353-1-896-3725


