[ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

Victor Hooi victorhooi at yahoo.com
Sat Mar 9 04:42:57 PST 2019


Hi Ashley,

Right - so the 50% bandwidth is OK, I guess, but it was more the drop in
IOPS that was concerning (hence the subject line about 200 IOPS) *sad face*.

That, and the Optane drives weren't exactly cheap, and I was hoping they
would compensate for the overhead of Ceph.

Each Optane drive is rated for 550000 IOPS (random read) and 500000 IOPS
(random write). Yet we're seeing it drop to around 0.04% of that in testing
(200 IOPS). Is that sort of drop in IOPS normal for Ceph?
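
One thing worth checking before comparing those numbers: the rados bench runs quoted below use 4 MiB operations (-b 4M), so the "Average IOPS" they report counts 4 MiB operations, whereas drive spec sheets normally quote small-block IOPS (assumed here to be 4 KiB figures). A minimal sketch of that arithmetic, using only figures from this thread; the function name is made up for illustration:

```python
# Figures copied from the rados bench write run quoted in this message.
BANDWIDTH_MIB_S = 816.261   # "Bandwidth (MB/sec)" (rados bench computes this in MiB)
OP_SIZE_BYTES = 4194304     # "Write size" (4 MiB, from -b 4M)
SPEC_WRITE_IOPS = 500_000   # random-write figure cited above (assumed small-block)

MIB = 1024 * 1024


def ops_per_second(bandwidth_mib_s: float, op_size_bytes: int) -> float:
    """Operations per second implied by a bandwidth at a fixed operation size."""
    return bandwidth_mib_s * MIB / op_size_bytes


bench_iops = ops_per_second(BANDWIDTH_MIB_S, OP_SIZE_BYTES)
print(f"4 MiB ops/s implied by bandwidth: {bench_iops:.1f}")
print(f"Share of spec-sheet IOPS: {bench_iops / SPEC_WRITE_IOPS:.2%}")
```

The first figure matches the reported "Average IOPS: 204", which suggests the benchmark is bandwidth-bound at this object size rather than measuring small-block IOPS. A run with a small block size (for example -b 4096) would give a number that is more comparable to the spec sheet.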

Each node can take up to 8 x 2.5" drives. If I loaded up, say, 4 cheap SSDs
in each (e.g. Intel S3700), instead of one Optane drive per node, would
that perform better with 4 x 3 = 12 drives? (Would I still put 4 OSDs on
each physical drive?) Or is there some way to supplement the Optanes with
SSDs? (Although I would assume any SSD I get is going to be slower than an
Optane drive.)

Or are there tweaks I can do to either configuration, or our layout that
could eke out more IOPS?

(This is going to be used for VM hosting, so IOPS is definitely a concern).

Thanks,
Victor

On Sat, Mar 9, 2019 at 9:27 PM Ashley Merrick <singapore at amerrick.co.uk>
wrote:

> What kind of results are you expecting?
>
> Looking at the specs they are "up to" 2000 MB/s write and 2500 MB/s read,
> so you're at around 50-60% of the "up to" maximum, which I wouldn't say is
> too bad given that Ceph/BlueStore has an overhead, especially when using a
> single disk for DB & WAL & content.
>
> Remember Ceph scales with the number of physical disks you have. As you
> only have 3 disks, every piece of I/O is hitting all 3 disks. If you had 6
> disks, for example, and still used 3x replication, then only 50% of I/O
> would be hitting each disk, so I'd expect to see performance jump.
>
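
The scaling argument above can be put into numbers. This is a rough sketch only: it assumes writes are spread evenly across disks and that each replica lands on a different physical disk, and the function name is hypothetical.

```python
def io_share_per_disk(num_disks: int, replicas: int = 3) -> float:
    """Fraction of client writes that touch any one disk, assuming even placement."""
    if num_disks < replicas:
        raise ValueError("need at least as many disks as replicas")
    return replicas / num_disks


for disks in (3, 6, 12):
    share = io_share_per_disk(disks)
    print(f"{disks:2d} disks, 3 replicas: {share:.0%} of writes hit each disk")
```

With 3 disks every write lands on every disk; with 6 disks that drops to half, which is the 50% figure quoted above.
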
> On Sat, Mar 9, 2019 at 5:08 PM Victor Hooi <victorhooi at yahoo.com> wrote:
>
>> Hi,
>>
>> I'm setting up a 3-node Proxmox cluster with Ceph as the shared storage,
>> based around Intel Optane 900P drives (which are meant to be the bee's
>> knees), and I'm seeing pretty low IOPS/bandwidth.
>>
>>    - 3 nodes, each running a Ceph monitor daemon, and OSDs.
>>    - Node 1 has 48 GB of RAM and 10 cores (Intel 4114
>>    <https://ark.intel.com/content/www/us/en/ark/products/123550/intel-xeon-silver-4114-processor-13-75m-cache-2-20-ghz.html>),
>>    and Node 2 and 3 have 32 GB of RAM and 4 cores (Intel E3-1230V6
>>    <https://ark.intel.com/content/www/us/en/ark/products/97474/intel-xeon-processor-e3-1230-v6-8m-cache-3-50-ghz.html>
>>    )
>>    - Each node has an Intel Optane 900p (480GB) NVMe
>>    <https://www.intel.com.au/content/www/au/en/products/memory-storage/solid-state-drives/gaming-enthusiast-ssds/optane-900p-series/900p-480gb-aic-20nm.html> dedicated
>>    for Ceph.
>>    - 4 OSDs per node (total of 12 OSDs)
>>    - NICs are Intel X520-DA2
>>    <https://ark.intel.com/content/www/us/en/ark/products/39776/intel-ethernet-converged-network-adapter-x520-da2.html>,
>>    with 10GBASE-LR going to a Unifi US-XG-16
>>    <https://www.ui.com/unifi-switching/unifi-switch-16-xg/>.
>>    - First 10GbE port is for Proxmox VM traffic, second 10GbE port is for
>>    Ceph traffic.
>>
>> I created a new Ceph pool specifically for benchmarking with 128 PGs.
>>
>> Write results:
>>
>> root@vwnode1:~# rados bench -p benchmarking 60 write -b 4M -t 16 --no-cleanup
>> ....
>>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>>    60      16     12258     12242   816.055       788   0.0856726   0.0783458
>> Total time run:         60.069008
>> Total writes made:      12258
>> Write size:             4194304
>> Object size:            4194304
>> Bandwidth (MB/sec):     816.261
>> Stddev Bandwidth:       17.4584
>> Max bandwidth (MB/sec): 856
>> Min bandwidth (MB/sec): 780
>> Average IOPS:           204
>> Stddev IOPS:            4
>> Max IOPS:               214
>> Min IOPS:               195
>> Average Latency(s):     0.0783801
>> Stddev Latency(s):      0.0468404
>> Max latency(s):         0.437235
>> Min latency(s):         0.0177178
>>
>>
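
As a sanity check on the write run above, the reported IOPS can be reproduced from the queue depth and the average latency via Little's Law (throughput = concurrency / latency). A minimal sketch using only numbers from that output; the function name is made up for illustration:

```python
CONCURRENCY = 16            # from "-t 16"
AVG_LATENCY_S = 0.0783801   # "Average Latency(s)"
OP_SIZE_MIB = 4             # from "-b 4M"


def throughput_from_latency(concurrency: int, avg_latency_s: float) -> float:
    """Little's Law: completed operations per second at a fixed queue depth."""
    return concurrency / avg_latency_s


iops = throughput_from_latency(CONCURRENCY, AVG_LATENCY_S)
print(f"Predicted IOPS:      {iops:.1f}")
print(f"Predicted bandwidth: {iops * OP_SIZE_MIB:.1f} MiB/s")
```

The prediction lands very close to the reported 204 IOPS and 816 MB/sec, so at 16 operations in flight and roughly 78 ms per 4 MiB write, about 200 operations per second is all this test can show. Changing -t or -b would be expected to move the IOPS figure accordingly.
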
>> Sequential read results - I don't know why this only ran for 32 seconds?
>>
>> root@vwnode1:~# rados bench -p benchmarking 60 seq -t 16
>> ....
>> Total time run:       32.608549
>> Total reads made:     12258
>> Read size:            4194304
>> Object size:          4194304
>> Bandwidth (MB/sec):   1503.65
>> Average IOPS:         375
>> Stddev IOPS:          22
>> Max IOPS:             410
>> Min IOPS:             326
>> Average Latency(s):   0.0412777
>> Max latency(s):       0.498116
>> Min latency(s):       0.00447062
>>
>>
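
A likely explanation for the 32-second run: "Total reads made" (12258) equals "Total writes made" from the preceding write test, which is consistent with the seq test stopping once it has read every object left behind by --no-cleanup. A small sketch of that estimate, with a hypothetical helper name:

```python
OBJECTS_WRITTEN = 12258   # "Total writes made" from the write run
SEQ_READ_IOPS = 375       # "Average IOPS" from the seq run
REQUESTED_S = 60          # duration passed to rados bench


def expected_seq_runtime(objects: int, reads_per_s: float, limit_s: float) -> float:
    """Seconds until the objects run out or the time limit is reached, whichever is first."""
    return min(limit_s, objects / reads_per_s)


runtime = expected_seq_runtime(OBJECTS_WRITTEN, SEQ_READ_IOPS, REQUESTED_S)
print(f"Expected seq runtime: {runtime:.1f} s")
```

That estimate is within a fraction of a second of the reported "Total time run". Under the same assumption, a longer write phase would leave more objects and let the seq test run for the full 60 seconds.
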
>> Random read result:
>>
>> root@vwnode1:~# rados bench -p benchmarking 60 rand -t 16
>> ....
>> Total time run:       60.066384
>> Total reads made:     22819
>> Read size:            4194304
>> Object size:          4194304
>> Bandwidth (MB/sec):   1519.59
>> Average IOPS:         379
>> Stddev IOPS:          21
>> Max IOPS:             424
>> Min IOPS:             320
>> Average Latency(s):   0.0408697
>> Max latency(s):       0.662955
>> Min latency(s):       0.00172077
>>
>>
>> I then cleaned-up with:
>>
>> root@vwnode1:~# rados -p benchmarking cleanup
>> Removed 12258 objects
>>
>>
>> I then tested with another Ceph pool, with 512 PGs (originally created
>> for Proxmox VMs) - results seem quite similar:
>>
>> root@vwnode1:~# rados bench -p proxmox_vms 60 write -b 4M -t 16 --no-cleanup
>> ....
>> Total time run:         60.041712
>> Total writes made:      12132
>> Write size:             4194304
>> Object size:            4194304
>> Bandwidth (MB/sec):     808.238
>> Stddev Bandwidth:       20.7444
>> Max bandwidth (MB/sec): 860
>> Min bandwidth (MB/sec): 744
>> Average IOPS:           202
>> Stddev IOPS:            5
>> Max IOPS:               215
>> Min IOPS:               186
>> Average Latency(s):     0.0791746
>> Stddev Latency(s):      0.0432707
>> Max latency(s):         0.42535
>> Min latency(s):         0.0200791
>>
>>
>> Sequential read result - once again, only ran for 32 seconds:
>>
>> root@vwnode1:~# rados bench -p proxmox_vms 60 seq -t 16
>> ....
>> Total time run:       31.249274
>> Total reads made:     12132
>> Read size:            4194304
>> Object size:          4194304
>> Bandwidth (MB/sec):   1552.93
>> Average IOPS:         388
>> Stddev IOPS:          30
>> Max IOPS:             460
>> Min IOPS:             320
>> Average Latency(s):   0.0398702
>> Max latency(s):       0.481106
>> Min latency(s):       0.00461585
>>
>>
>> Random read result:
>>
>> root@vwnode1:~# rados bench -p proxmox_vms 60 rand -t 16
>> ...
>> Total time run:       60.088822
>> Total reads made:     23626
>> Read size:            4194304
>> Object size:          4194304
>> Bandwidth (MB/sec):   1572.74
>> Average IOPS:         393
>> Stddev IOPS:          25
>> Max IOPS:             432
>> Min IOPS:             322
>> Average Latency(s):   0.0392854
>> Max latency(s):       0.693123
>> Min latency(s):       0.00178545
>>
>>
>> Cleanup:
>>
>> root@vwnode1:~# rados -p proxmox_vms cleanup
>> Removed 12132 objects
>> root@vwnode1:~# rados df
>> POOL_NAME   USED   OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD     WR_OPS WR
>> proxmox_vms 169GiB   43396      0 130188                  0       0        0 909519 298GiB 619697 272GiB
>>
>> total_objects    43396
>> total_used       564GiB
>> total_avail      768GiB
>> total_space      1.30TiB
>>
>>
>> These results (800 MB/s writes, 1500 MB/s reads, and 200 write IOPS, 400
>> read IOPS) seem incredibly low - particularly considering what the Optane
>> 900p is meant to be capable of.
>>
>> Is this in line with what you might expect on this hardware with Ceph
>> though?
>>
>> Or is there some way to find out the source of bottleneck?
>>
>> Thanks,
>> Victor
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>