[ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

Victor Hooi victorhooi at yahoo.com
Sat Mar 9 12:03:34 PST 2019


Hi,

I have retested with 4K blocks - results are below.

I am currently using 4 OSDs per Optane 900P drive. This was based on some
posts I found on Proxmox Forums, and what seems to be "tribal knowledge"
there.

I also saw this presentation
<https://hubb.blob.core.windows.net/c2511cea-81c5-4386-8731-cc444ff806df-public/resources/M1205%20-Ceph%20BlueStore%20performance%20on%20latest%20Intel%20Server%20Platforms%20Distribution.pdf>,
which mentions on page 14:

2-4 OSDs/NVMe SSD and 4-6 NVMe SSDs per node are sweet spots


Has anybody done much testing with pure Optane drives for Ceph? (Paper
above seems to use them mixed with traditional SSDs).

Would increasing the number of OSDs help in this scenario? I am happy to
try that - I assume I will need to blow away all the existing OSDs/Ceph
setup and start again, of course.

Here are the rados bench results with 4K blocks - the average write IOPS
(~12,100) are still short of the 15,000 mentioned - is that the figure I
should be aiming for?

Write result:

# rados bench -p proxmox_vms 60 write -b 4K -t 16 --no-cleanup
Total time run:         60.001016
Total writes made:      726749
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     47.3136
Stddev Bandwidth:       2.16408
Max bandwidth (MB/sec): 48.7344
Min bandwidth (MB/sec): 38.5078
Average IOPS:           12112
Stddev IOPS:            554
Max IOPS:               12476
Min IOPS:               9858
Average Latency(s):     0.00132019
Stddev Latency(s):      0.000670617
Max latency(s):         0.065541
Min latency(s):         0.000689406


Sequential read result:

# rados bench -p proxmox_vms 60 seq -t 16
Total time run:       17.098593
Total reads made:     726749
Read size:            4096
Object size:          4096
Bandwidth (MB/sec):   166.029
Average IOPS:         42503
Stddev IOPS:          218
Max IOPS:             42978
Min IOPS:             42192
Average Latency(s):   0.000369021
Max latency(s):       0.00543175
Min latency(s):       0.000170024


Random read result:

# rados bench -p proxmox_vms 60 rand -t 16
Total time run:       60.000282
Total reads made:     2708799
Read size:            4096
Object size:          4096
Bandwidth (MB/sec):   176.353
Average IOPS:         45146
Stddev IOPS:          310
Max IOPS:             45754
Min IOPS:             44506
Average Latency(s):   0.000347637
Max latency(s):       0.00457886
Min latency(s):       0.000138381


I am happy to try fio -ioengine=rbd. (The reason I used rados bench is
that it is what the Proxmox Ceph benchmark paper
<https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark>
used.) However, is there a common community-suggested starting command
line that makes it easy to compare results? (fio seems quite complex in
terms of options.)
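
For reference, a commonly used starting point for 4K random-write testing
with fio's rbd engine looks like the sketch below. The image name
(fio-test) and client name (admin) are assumptions, not from the thread,
and the test image must exist before fio runs:

```shell
# Create a throwaway test image first (assumed name/size):
#   rbd create -p proxmox_vms --size 10G fio-test
#
# 4K random writes, queue depth 16, single job, 60 seconds:
fio --ioengine=rbd \
    --clientname=admin \
    --pool=proxmox_vms \
    --rbdname=fio-test \
    --name=rbd-4k-randwrite \
    --rw=randwrite \
    --bs=4k \
    --iodepth=16 \
    --numjobs=1 \
    --direct=1 \
    --time_based --runtime=60
```

Swapping --rw=randwrite for randread or read, and varying --iodepth,
gives results roughly comparable to the rados bench runs above.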

Thanks,
Victor

On Sun, Mar 10, 2019 at 6:15 AM Vitaliy Filippov <vitalif at yourcmc.ru> wrote:

> Welcome to our "slow ceph" party :)))
>
> However I have to note that:
>
> 1) 500000 iops is for 4 KB blocks. You're testing it with 4 MB ones.
> That's kind of an unfair comparison.
>
> 2) fio -ioengine=rbd is better than rados bench for testing.
>
> 3) You can't "compensate" for Ceph's overhead even by having infinitely
> fast disks.
>
> At its simplest, imagine that disk I/O takes X microseconds and Ceph's
> overhead is Y for a single operation.
>
> Suppose there is no parallelism. Then raw disk IOPS = 1000000/X and Ceph
> IOPS = 1000000/(X+Y). Y is currently quite long, something around 400-800
> microseconds or so. So the best IOPS number you can squeeze out of a
> single client thread (a DBMS, for example) is 1000000/400 = only ~2500
> iops.
>
> Parallel iops are of course better, but still you won't get anything
> close
> to 500000 iops from a single OSD. The expected number is around 15000.
> Create multiple OSDs on a single NVMe and sacrifice your CPU usage if you
> want better results.
>
> --
> With best regards,
>    Vitaliy Filippov
>
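
A minimal sketch of the serial-latency model Vitaliy describes, with
illustrative values for X and Y (not measured figures from this cluster):

```shell
# One operation takes X us of disk time plus Y us of Ceph overhead, so a
# single client thread completes at most 1000000/(X+Y) ops per second.
X=100   # assumed raw disk write latency in microseconds (illustrative)
Y=400   # assumed per-op Ceph overhead in microseconds (low end of 400-800)
echo "max single-threaded IOPS: $((1000000 / (X + Y)))"
# prints: max single-threaded IOPS: 2000
```

Even with X driven to near zero by an Optane drive, the result is bounded
by 1000000/Y, which is why faster disks alone cannot compensate for the
per-operation overhead.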