[ceph-users] CephFS performance.
ronny+ceph-users at aasen.cx
Thu Oct 4 02:09:58 PDT 2018
On 10/4/18 7:04 AM, jesper at krogh.cc wrote:
> Hi All.
> First, thanks for the good discussion and strong answers I've gotten so far.
> Current cluster setup is 4 x 10 x 12TB 7.2K RPM drives, all with
> 10GbitE and metadata on rotating drives - 3x replication - 256GB memory in
> OSD hosts and 32+ cores. Behind Perc with each disk as RAID0 and BBWC.
> Planned changes:
> - get 1-2 more OSD hosts
> - experiment with EC pools for CephFS
> - move the MDS onto a separate host and metadata onto SSDs.
> I'm still struggling to get "non-cached" performance up to "hardware"
> speed - whatever that means. I do a "fio" benchmark using 10GB files, 16
> threads, 4M block size -- at which I can "almost" fill the 10GbitE NIC
> on a sustained basis. In this configuration I would have expected it to
> be "way above" 10Gbit speed, and thus have the NIC not "almost" filled
> but fully filled - could that be the metadata activities? But on "big
> files" and reads that should not be much - right?
> Above is actually ok for production, thus .. not a big issue, just
> Single threaded performance is still struggling
> Cold HDD (read from disk in NFS-server end) / NFS performance:
> jk at zebra01:~$ pipebench < /nfs/16GB.file > /dev/null
> Piped 15.86 GB in 00h00m27.53s: 589.88 MB/second
> Local page cache (just to say it isn't the profiling tool delivering
> jk at zebra03:~$ pipebench < /nfs/16GB.file > /dev/null
> Piped 29.24 GB in 00h00m09.15s: 3.19 GB/second
> jk at zebra03:~$
> Now from the Ceph system:
> jk at zebra01:~$ pipebench < /ceph/bigfile.file> /dev/null
> Piped 36.79 GB in 00h03m47.66s: 165.49 MB/second
> Can block/stripe-size be tuned? Does it make sense?
> Does read-ahead on the CephFS kernel-client need tuning?
> What performance are other people seeing?
> Other thoughts - recommendations?
> On some of the shares we're storing pretty large files (GB size) and
> need the backup to move them to tape - so it is preferred to be capable
> of filling an LTO6 drive's write speed to capacity with a single thread.
> 40'ish 7.2K RPM drives - should - add up to more than above.. right?
> This is the only current load being put on the cluster - + 100MB/s
> recovery traffic.
The problem with single-threaded performance in Ceph is that it reads
the spindles serially. You are practically reading one drive at a
time, so you see a single disk's performance, minus all the overheads
from Ceph, the network, the MDS, etc.
So you do not get the combined performance of the drives, only one drive
at a time. The trick for Ceph performance is to get more spindles
working for you at the same time.
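
As a rough illustration of the point above, here is a back-of-envelope
model in Python. It is only a sketch: the 4 MB object size, the 180 MB/s
disk rate, the 2 ms per-request overhead and the 1250 MB/s NIC ceiling
are assumed example values, not measurements from this cluster, and the
parallel case optimistically assumes every in-flight object sits on a
different, otherwise idle drive.

```python
def serial_read_mb_s(object_mb: float, disk_mb_s: float, overhead_ms: float) -> float:
    """Throughput when each object is requested only after the previous one arrived."""
    # Time per object = fixed overhead (network, OSD, seek) + transfer time from one disk.
    seconds_per_object = overhead_ms / 1000.0 + object_mb / disk_mb_s
    return object_mb / seconds_per_object


def parallel_read_mb_s(in_flight: int, object_mb: float, disk_mb_s: float,
                       overhead_ms: float, nic_mb_s: float) -> float:
    """Upper bound when `in_flight` objects are fetched at once, capped by the NIC."""
    per_stream = serial_read_mb_s(object_mb, disk_mb_s, overhead_ms)
    return min(in_flight * per_stream, nic_mb_s)


if __name__ == "__main__":
    # Assumed example numbers, see the note above.
    one = serial_read_mb_s(object_mb=4.0, disk_mb_s=180.0, overhead_ms=2.0)
    many = parallel_read_mb_s(8, 4.0, 180.0, 2.0, nic_mb_s=1250.0)
    print(f"one object at a time: {one:.0f} MB/s")
    print(f"eight objects in flight: {many:.0f} MB/s (NIC-limited)")
```

With one request outstanding the result can never exceed the speed of a
single drive, no matter how many drives the cluster has.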
There are ways to get more performance out of a single thread:
- faster components in the path, i.e. faster disk/network/cpu/memory
- larger pre-fetching/read-ahead. With a large enough read-ahead, more
OSDs will participate in reading simultaneously.  shows a table of
benchmarks with different read-ahead sizes.
- erasure coding. While erasure coding does add latency compared with
replicated pools, you get more spindles involved in reading in parallel,
so for large sequential loads erasure coding can be a benefit.
- some sort of extra caching scheme. I have not looked at cachefiles,
but it may provide some benefit.
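
To make the read-ahead item in the list above concrete, this small
Python sketch counts how many objects a read-ahead window covers. It
assumes a simple layout where a file is cut into consecutive 4 MiB
objects (stripe count 1); the function name is made up for
illustration. Different objects can still land on the same OSD, so the
count is an upper bound on the number of drives involved.

```python
MIB = 1024 * 1024


def objects_in_window(offset: int, window: int, object_size: int = 4 * MIB) -> int:
    """Count distinct objects touched by the byte range [offset, offset + window)."""
    if window <= 0:
        return 0
    first = offset // object_size                 # object holding the first byte
    last = (offset + window - 1) // object_size   # object holding the last byte
    return last - first + 1


if __name__ == "__main__":
    for window in (128 * 1024, 8 * MIB, 64 * MIB):
        n = objects_in_window(0, window)
        print(f"read-ahead {window // 1024:>6} KiB -> {n} object(s) in flight")
```

A window smaller than one object keeps a single drive busy; only a
window spanning several objects can pull from several drives at once.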
You can also try the different CephFS client implementations. There is a
FUSE client, where you can play with different cache solutions, but
generally the kernel client is faster.
In RBD there is a fancy striping solution, using --stripe-unit and
--stripe-count. This would get more spindles running; perhaps consider
using RBD instead of CephFS if it fits the workload.
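
The effect of the stripe unit and stripe count can be sketched in
Python as follows. This follows my understanding of how Ceph's striping
maps byte offsets to objects; the function names and the example sizes
(64 KiB stripe unit, stripe count 16, 4 MiB objects) are chosen for
illustration only, so treat it as a sketch rather than the actual
implementation.

```python
KIB = 1024
MIB = 1024 * KIB


def object_for_offset(offset: int, stripe_unit: int, stripe_count: int,
                      object_size: int) -> int:
    """Return the object number that holds the byte at `offset`."""
    stripes_per_object = object_size // stripe_unit
    block = offset // stripe_unit            # index of the stripe unit
    stripe_no = block // stripe_count        # which row of stripe units
    stripe_pos = block % stripe_count        # which object within the object set
    object_set = stripe_no // stripes_per_object
    return object_set * stripe_count + stripe_pos


def objects_touched(start: int, length: int, stripe_unit: int,
                    stripe_count: int, object_size: int) -> set:
    """Return the set of object numbers covered by [start, start + length)."""
    if length <= 0:
        return set()
    first_block = start // stripe_unit
    last_block = (start + length - 1) // stripe_unit
    return {
        object_for_offset(b * stripe_unit, stripe_unit, stripe_count, object_size)
        for b in range(first_block, last_block + 1)
    }


if __name__ == "__main__":
    plain = objects_touched(0, 4 * MIB, 4 * MIB, 1, 4 * MIB)
    striped = objects_touched(0, 4 * MIB, 64 * KIB, 16, 4 * MIB)
    print(f"no striping: {len(plain)} object(s) for a 4 MiB read")
    print(f"64 KiB x 16: {len(striped)} object(s) for a 4 MiB read")
```

With striping, even a modest sequential read is spread over many
objects, which is what lets a single thread keep several drives busy.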