[ceph-users] Can CephFS Kernel Client Not Read & Write at the Same Time?

Andrew Richards andrew.richards at keepertech.com
Tue Mar 19 12:20:22 PDT 2019


I don't think file locks are to blame. I tried to control for that in my tests: I was reading with fio from one set of files (multiple fio pids spawned from a single command) while writing with dd to an entirely different file from a different shell on the same host. So that is one CephFS kernel client, with different pids all acting on different files, and reads were still interrupted while writes were being synched.

Thanks,
Andrew Richards
Senior Systems Engineer
keepertechnology

> On Mar 8, 2019, at 2:14 AM, Yan, Zheng <ukernel at gmail.com> wrote:
> 
> The CephFS kernel mount blocks reads while another client has dirty data in
> its page cache. The cache coherency rules look like:
> 
> state 1 - only one client opens a file for read/write. the client can
> use page cache
> state 2 - multiple clients open a file for read, no client opens the
> file for write. clients can use page cache
> state 3 - multiple clients open a file for read/write. clients are not
> allowed to use page cache.
> 
> The behavior you saw is likely caused by a state transition from 1 to 3.
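As a rough illustration of the three rules above, here is a toy Python model. It is only a sketch: the names cache_state and page_cache_allowed and the "r"/"rw" mode strings are made up for this example, and it assumes that state 3 means at least one of several clients has the file open for write. It does not reflect how the MDS actually tracks or revokes capabilities.

```python
from enum import Enum


class CacheState(Enum):
    SINGLE_CLIENT = 1   # state 1: one client has the file open; page cache allowed
    SHARED_READERS = 2  # state 2: several clients, all read-only; page cache allowed
    SHARED_WRITERS = 3  # state 3: several clients, at least one writer; no page cache


def cache_state(open_modes):
    """Classify one file. open_modes maps a (hypothetical) client id to "r" or "rw"."""
    if not open_modes:
        raise ValueError("no client has the file open")
    if len(open_modes) == 1:
        return CacheState.SINGLE_CLIENT
    if all(mode == "r" for mode in open_modes.values()):
        return CacheState.SHARED_READERS
    # Assumption: any writer among multiple clients puts the file in state 3.
    return CacheState.SHARED_WRITERS


def page_cache_allowed(open_modes):
    """Page cache is usable in states 1 and 2 only."""
    return cache_state(open_modes) is not CacheState.SHARED_WRITERS
```

In this model, a second client opening a file that the first client already has open for read/write moves the file from state 1 to state 3, which is the transition described above.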
> 
> On Fri, Mar 8, 2019 at 8:15 AM Gregory Farnum <gfarnum at redhat.com> wrote:
>> 
>> In general, no, this is not an expected behavior.
>> 
>> My guess would be that something odd is happening with the other clients you have to the system, and there's a weird pattern with the way the file locks are being issued. Can you be more precise about exactly what workload you're running, and get the output of the session list on your MDS while doing so?
>> -Greg
>> 
>> On Wed, Mar 6, 2019 at 9:49 AM Andrew Richards <andrew.richards at keepertech.com> wrote:
>>> 
>>> We discovered recently that our CephFS mount appeared to be halting reads while writes were being synched to the Ceph cluster, to the point that it was affecting applications.
>>> 
>>> I also posted this as a Gist with embedded graph images to help illustrate: https://gist.github.com/keeperAndy/aa80d41618caa4394e028478f4ad1694
>>> 
>>> The following is the plain text from the Gist.
>>> 
>>> First, details about the host:
>>> 
>>> ````
>>>    $ uname -r
>>>    4.16.13-041613-generic
>>> 
>>>    $ egrep 'xfs|ceph' /proc/mounts
>>>    192.168.1.115:6789,192.168.1.116:6789,192.168.1.117:6789:/ /cephfs ceph rw,noatime,name=cephfs,secret=<hidden>,rbytes,acl,wsize=16777216 0 0
>>>    /dev/mapper/tst01-lvidmt01 /rbd_xfs xfs rw,relatime,attr2,inode64,logbsize=256k,sunit=512,swidth=1024,noquota 0 0
>>> 
>>>    $ ceph -v
>>>    ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
>>> 
>>>    $ cat /proc/net/bonding/bond1
>>>    Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
>>> 
>>>    Bonding Mode: adaptive load balancing
>>>    Primary Slave: None
>>>    Currently Active Slave: net6
>>>    MII Status: up
>>>    MII Polling Interval (ms): 100
>>>    Up Delay (ms): 200
>>>    Down Delay (ms): 200
>>> 
>>>    Slave Interface: net8
>>>    MII Status: up
>>>    Speed: 10000 Mbps
>>>    Duplex: full
>>>    Link Failure Count: 2
>>>    Permanent HW addr: e4:1d:2d:17:71:e1
>>>    Slave queue ID: 0
>>> 
>>>    Slave Interface: net6
>>>    MII Status: up
>>>    Speed: 10000 Mbps
>>>    Duplex: full
>>>    Link Failure Count: 1
>>>    Permanent HW addr: e4:1d:2d:17:71:e0
>>>    Slave queue ID: 0
>>> 
>>> ````
>>> 
>>> We had CephFS mounted alongside an XFS filesystem made up of 16 RBD images aggregated under LVM as our storage targets. The link to the Ceph cluster from the host is a mode 6 2x10GbE bond (bond1 above).
>>> 
>>> We started capturing network counters from the Ceph cluster connection (bond1) on the host using ifstat at its most granular setting of 0.1 (sampling every tenth of a second). We then ran various overlapping read and write operations in separate shells on the same host to obtain samples of how our different means of accessing Ceph handled this. We converted our ifstat output to CSV and inserted it into a spreadsheet to visualize the network activity.
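For anyone wanting to repeat the conversion step, a minimal Python sketch follows. It is not the script used for the graphs. It assumes a single-interface ifstat capture in which every data row ends with two numeric columns (KB/s in, KB/s out) and header rows contain non-numeric text; the function name ifstat_to_csv and the CSV column names are made up for this example.

```python
import csv
import io


def ifstat_to_csv(lines, interval=0.1):
    """Convert single-interface ifstat output lines to CSV text.

    Assumption: data rows end with two numbers (KB/s in, KB/s out);
    anything else (interface name, column headers, "n/a" rows) is skipped.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["elapsed_s", "kb_in_per_s", "kb_out_per_s"])
    sample = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank line or the interface-name header
        try:
            kb_in, kb_out = float(fields[-2]), float(fields[-1])
        except ValueError:
            continue  # header row such as "KB/s in  KB/s out"
        # Elapsed time is derived from the sample index and the sampling interval.
        writer.writerow([round(sample * interval, 3), kb_in, kb_out])
        sample += 1
    return out.getvalue()
```

Usage would be along the lines of reading the saved capture with open(...).readlines() and writing the returned string to a .csv file for the spreadsheet.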
>>> 
>>> We found that the CephFS kernel mount did indeed appear to pause ongoing reads when writes were being flushed from the page cache to the Ceph cluster.
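The pauses can also be picked out of the samples programmatically instead of by eye. The sketch below is one hypothetical way to do that: find_stalls, the threshold of 1.0 and the minimum run of 3 samples are illustrative choices, not values taken from these tests.

```python
def find_stalls(rates, interval=0.1, threshold=1.0, min_samples=3):
    """Return (start_s, duration_s) for each run of low-throughput samples.

    rates: per-sample receive rates, in the same unit as threshold.
    A run counts as a stall only if it lasts at least min_samples samples.
    """
    stalls = []
    run_start = None
    # The infinite sentinel closes a run that is still open at the end of the data.
    for i, rate in enumerate(list(rates) + [float("inf")]):
        if rate < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            length = i - run_start
            if length >= min_samples:
                stalls.append((round(run_start * interval, 3),
                               round(length * interval, 3)))
            run_start = None
    return stalls
```

Fed with the receive column of the capture, this lists when each read pause began and how long it lasted, which makes it easier to line the pauses up against the write flushes.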
>>> 
>>> We wanted to see if we could make this more pronounced, so we added a 6Gbit-limit tc filter to the interface and re-ran our tests. This yielded much lengthier delay periods in the reads while the writes were more slowly flushed from the page cache to the Ceph cluster.
>>> 
>>> A more restrictive 2Gbit-limit tc filter produced even lengthier delays in our reads as the writes were synched to the cluster.
>>> 
>>> When we tested the same I/O on the RBD-backed XFS file system on the same host, we found a very different pattern. The reads seemed to be given priority over the write activity, but the writes were only slowed, not halted.
>>> 
>>> Finally, we tested overlapping SMB client reads and writes to a Samba share that used the userspace libcephfs-based vfs_ceph module to produce the share. In this case, while raw throughput was lower than that of the kernel client, the reads and writes did not interrupt each other at all.
>>> 
>>> Is this expected behavior for the CephFS kernel drivers? Can a CephFS kernel client really not read and write to the file system simultaneously?
>>> 
>>> Thanks,
>>> Andrew Richards
>>> Senior Systems Engineer
>>> Keeper Technology, LLC
>>> 
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users at lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
