[ceph-users] CephFS performance.

Ronny Aasen ronny+ceph-users at aasen.cx
Thu Oct 4 02:09:58 PDT 2018

On 10/4/18 7:04 AM, jesper at krogh.cc wrote:
> Hi All.
> First, thanks for the good discussion and strong answers I've gotten so far.
> Current cluster setup is 4 x 10 x 12TB 7.2K RPM drives, all on
> 10GbitE, with metadata on rotating drives - 3x replication - 256GB memory in
> the OSD hosts and 32+ cores. Behind a PERC controller with each disk as RAID0 and BBWC.
> Planned changes:
> - is to get 1-2 more OSD-hosts
> - experiment with EC-pools for CephFS
> - MDS onto a separate host and metadata onto SSDs.
> I'm still struggling to get "non-cached" performance up to "hardware"
> speed - whatever that means. I do "fio" benchmarks using 10GB files, 16
> threads, and a 4M block size -- at which I can "almost" sustainably fill the
> 10GbitE NIC. In this configuration I would have expected it to be "way
> above" 10Gbit speed, and thus have the NIC fully filled rather than
> "almost" filled - could that be the metadata activity? But on big files
> and reads, that should not be much - right?
> Above is actually ok for production, thus .. not a big issue, just
> information.
> Single threaded performance is still struggling
> Cold HDD (read from disk in the NFS-server end) / NFS performance:
> jk at zebra01:~$ pipebench < /nfs/16GB.file > /dev/null
> Summary:
> Piped   15.86 GB in 00h00m27.53s:  589.88 MB/second
> Local page cache (just to say it isn't the profiling tool delivering
> limitations):
> jk at zebra03:~$ pipebench < /nfs/16GB.file > /dev/null
> Summary:
> Piped   29.24 GB in 00h00m09.15s:    3.19 GB/second
> jk at zebra03:~$
> Now from the Ceph system:
> jk at zebra01:~$ pipebench < /ceph/bigfile.file > /dev/null
> Summary:
> Piped   36.79 GB in 00h03m47.66s:  165.49 MB/second
> Can block/stripe-size be tuned? Does it make sense?
> Does read-ahead on the CephFS kernel-client need tuning?
> What performance are other people seeing?
> Other thoughts - recommendations?
> On some of the shares we're storing pretty large files (GB size) and
> need the backup to move them to tape - so it is preferred to be capable
> of filling an LTO6 drive's write speed to capacity with a single thread.
> 40'ish 7.2K RPM drives - should - add up to more than above.. right?
> This is the only current load being put on the cluster - plus ~100MB/s
> recovery traffic.

The problem with single-threaded performance in Ceph is that it reads
the spindles serially: you are effectively reading one drive at a time,
so you see a single disk's performance, minus all the overheads from
Ceph, the network, the MDS, etc.
You never get the combined performance of the drives, only one drive at
a time. So the trick for Ceph performance is to get more spindles
working for you at the same time.
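A rough back-of-envelope sketch of why this caps out at single-disk speed: with the default CephFS layout a file is striped into 4 MiB objects, and a single reader fetches them more or less one after another. All the numbers below (disk bandwidth, per-op latency) are illustrative assumptions, not measurements from this cluster:

```python
# Model of a single-threaded CephFS read. Illustrative numbers only.
OBJECT_SIZE = 4 * 2**20        # bytes; default CephFS object size
DISK_BW = 180 * 2**20          # ~180 MiB/s sequential from one 7.2K spindle (assumed)
PER_OP_LATENCY = 0.010         # ~10 ms seek + network + OSD overhead (assumed)

def effective_throughput_mib(objects_in_flight: int) -> float:
    """MiB/s when `objects_in_flight` objects are fetched in parallel
    from different OSDs per read-ahead window."""
    time_per_window = PER_OP_LATENCY + OBJECT_SIZE / DISK_BW
    bytes_per_window = OBJECT_SIZE * objects_in_flight
    return bytes_per_window / time_per_window / 2**20

print(f" 1 object in flight: {effective_throughput_mib(1):6.1f} MiB/s")
print(f" 8 objects in flight: {effective_throughput_mib(8):6.1f} MiB/s")
```

With one object in flight the model lands near the ~165 MB/s observed above; parallelism across OSDs is what multiplies it.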

There are ways to get more performance out of a single thread:
- faster components in the path, i.e. faster disk/network/CPU/memory.
- larger pre-fetching/read-ahead: with a large enough read-ahead window,
more OSDs will participate in the read simultaneously. [1] shows a table
of benchmarks with different read-ahead sizes.
- erasure coding: while erasure coding adds latency vs. replicated
pools, it gets more spindles involved in reading in parallel, so for
large sequential loads erasure coding can be a benefit.
- some sort of extra caching scheme; I have not looked at cachefiles,
but it may provide some benefit.
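For the read-ahead point, a hedged example of bumping the kernel client's window via the `rasize` mount option (monitor addresses, mount point, and the 128 MiB figure are illustrative; tune against your own benchmarks):

```shell
# Mount CephFS with a larger read-ahead window (rasize is in bytes).
# 128 MiB of read-ahead spans 32 default 4 MiB objects, so up to 32
# OSDs can serve a single sequential reader at once.
mount -t ceph mon1,mon2,mon3:/ /ceph \
    -o name=admin,rasize=$((128 * 1024 * 1024))
```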

You can also play with the different CephFS client implementations:
there is a FUSE client where you can experiment with different cache
settings. But generally the kernel client is faster.
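For comparison, the two clients mount like this (monitor address, path, and the read-ahead override are example values, not recommendations):

```shell
# Kernel client - usually faster:
mount -t ceph mon1:/ /ceph -o name=admin

# FUSE client - slower, but exposes client-side tunables as config
# overrides, e.g. a larger client read-ahead (example value):
ceph-fuse /ceph --client_readahead_max_bytes=$((64 * 1024 * 1024))
```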

In RBD there is a fancier striping scheme, via --stripe-unit and
--stripe-count, which gets more spindles running at once; perhaps
consider using RBD instead of CephFS if it fits the workload.
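A sketch of what that looks like (pool name, image name, and sizes are illustrative):

```shell
# Create an RBD image striped in 1 MiB units across 8 objects, so a
# large sequential read touches up to 8 OSDs concurrently instead of
# walking one 4 MiB object at a time. --size is in MB by default.
rbd create mypool/bigimage --size 102400 \
    --object-size 4M --stripe-unit 1M --stripe-count 8
```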


good luck
Ronny Aasen
