[ceph-users] how to improve performance

Christian Balzer chibi at gol.com
Mon Nov 20 03:44:51 PST 2017


On Mon, 20 Nov 2017 12:38:55 +0200 Rudi Ahlers wrote:

> Hi,
> 
> Can someone please help me: how do I improve performance on our Ceph cluster?
> 
> The hardware in use are as follows:
> 3x SuperMicro servers with the following configuration
> 12Core Dual XEON 2.2Ghz
Faster cores are better for Ceph, IMNSHO.
Though with main storage on HDDs, this will do.

> 128GB RAM
Overkill for Ceph but I see something else below...

> 2x 400GB Intel DC SSD drives
Exact model please.

> 4x 8TB Seagate 7200rpm 6Gbps SATA HDD's
One hopes that's a non SMR one.
Model please.

> 1x SuperMicro DOM for Proxmox / Debian OS
Ah, Proxmox. 
I'm not averse to converged, high-density, multi-role clusters myself,
but you:
a) need to know what you're doing and
b) will find a lot of people here who don't approve of it.

I've avoided DOMs so far (non-hotswappable SPOF), even though the SM ones
look good on paper with regards to endurance and IOPS.
The latter being rather important for your monitors.

> 4x Port 10Gbe NIC
> Cisco 10Gbe switch.
> 
Configuration details for those would be nice; LACP?
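If you are bonding two of those 10Gbe ports with LACP on Proxmox, the
/etc/network/interfaces stanza could look like this sketch (the interface
names eno1/eno2, the bridge name, and the address are assumptions; adjust
to your NICs and addressing):

```text
# LACP (802.3ad) bond feeding a Proxmox bridge -- a sketch, not your config
auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    bond-miimon 100

auto vmbr0
iface vmbr0 inet static
    address 10.10.10.82
    netmask 255.255.255.0
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
```

The matching port-channel has to be configured on the Cisco side as well,
otherwise the bond will fall back to a single link.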

> 
> root at virt2:~# rados bench -p Data 10 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size
> 4194304 for       up to 10 seconds or 0 objects

rados bench is a limited tool, and measuring bandwidth is pointless in
nearly all use cases.
Latency is where it's at, and testing from inside a VM is more relevant
than synthetic tests of the storage.
But it is a start.
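For a latency-oriented test from inside a VM, a small fio job is a better
starting point. This is a sketch; the filename, size, and runtime are
assumptions, not tuned values:

```text
; sync 4k random writes at queue depth 1 expose per-op latency
[global]
ioengine=libaio
direct=1
runtime=60
time_based=1

[randwrite-latency]
rw=randwrite
bs=4k
iodepth=1
sync=1
filename=/root/fio-testfile
size=4g
```

Run it with `fio <jobfile>` and look at the clat percentiles rather than
the bandwidth number.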

> Object prefix: benchmark_data_virt2_39099
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>     0       0         0         0         0         0           -           0
>     1      16        85        69   275.979       276    0.185576    0.204146
>     2      16       171       155   309.966       344   0.0625409    0.193558
>     3      16       243       227   302.633       288   0.0547129     0.19835
>     4      16       330       314   313.965       348   0.0959492    0.199825
>     5      16       413       397   317.565       332    0.124908    0.196191
>     6      16       494       478   318.633       324      0.1556    0.197014
>     7      15       591       576   329.109       392    0.136305    0.192192
>     8      16       670       654   326.965       312   0.0703808    0.190643
>     9      16       757       741   329.297       348    0.165211    0.192183
>    10      16       828       812   324.764       284   0.0935803    0.194041
> Total time run:         10.120215
> Total writes made:      829
> Write size:             4194304
> Object size:            4194304
> Bandwidth (MB/sec):     327.661
What part of this surprises you?

With a replication of 3, you effectively have the bandwidth of your 2 SSDs
(for small writes, not the case here) and the bandwidth of your 4 HDDs
available.
Given overhead, other inefficiencies, and the fact that this is not a
sequential write from the HDD's perspective, 320MB/s isn't all that bad.
With your setup I would have expected something faster, but NOT the
theoretical 600MB/s that 4 HDDs will do in sequential writes.
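As a back-of-the-envelope check, the observed number is close to what the
replication math predicts. The ~150 MB/s per-HDD figure and the efficiency
factor below are assumptions for illustration, not measurements from this
cluster:

```python
# Rough sketch of expected client-visible rados bench write bandwidth.
# Every client byte is written `replication` times across the cluster,
# then derated for seeks, metadata, and other overhead.

def expected_client_bw(num_osds, per_hdd_mbs, replication, efficiency):
    raw = num_osds * per_hdd_mbs        # aggregate raw HDD bandwidth
    return raw / replication * efficiency

# 10 OSDs up, ~150 MB/s each, 3x replication, ~65% efficiency (assumed)
print(round(expected_client_bw(10, 150.0, 3, 0.65)))  # roughly 325 MB/s
```

Which lines up with the ~327 MB/s the benchmark reported.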

> Stddev Bandwidth:       35.8664
> Max bandwidth (MB/sec): 392
> Min bandwidth (MB/sec): 276
> Average IOPS:           81
> Stddev IOPS:            8
> Max IOPS:               98
> Min IOPS:               69
> Average Latency(s):     0.195191
> Stddev Latency(s):      0.0830062
> Max latency(s):         0.481448
> Min latency(s):         0.0414858
> root at virt2:~# hdparm -I /dev/sda
> 
> 
> 
> root at virt2:~# ceph osd tree
> ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
> -1       72.78290 root default
> -3       29.11316     host virt1
>  1   hdd  7.27829         osd.1      up  1.00000 1.00000
>  2   hdd  7.27829         osd.2      up  1.00000 1.00000
>  3   hdd  7.27829         osd.3      up  1.00000 1.00000
>  4   hdd  7.27829         osd.4      up  1.00000 1.00000
> -5       21.83487     host virt2
>  5   hdd  7.27829         osd.5      up  1.00000 1.00000
>  6   hdd  7.27829         osd.6      up  1.00000 1.00000
>  7   hdd  7.27829         osd.7      up  1.00000 1.00000
> -7       21.83487     host virt3
>  8   hdd  7.27829         osd.8      up  1.00000 1.00000
>  9   hdd  7.27829         osd.9      up  1.00000 1.00000
> 10   hdd  7.27829         osd.10     up  1.00000 1.00000
>  0              0 osd.0            down        0 1.00000
> 
> 
> root at virt2:~# ceph -s
>   cluster:
>     id:     278a2e9c-0578-428f-bd5b-3bb348923c27
>     health: HEALTH_OK
> 
>   services:
>     mon: 3 daemons, quorum virt1,virt2,virt3
>     mgr: virt1(active)
>     osd: 11 osds: 10 up, 10 in
> 
>   data:
>     pools:   1 pools, 512 pgs
>     objects: 6084 objects, 24105 MB
>     usage:   92822 MB used, 74438 GB / 74529 GB avail
>     pgs:     512 active+clean
> 
> root at virt2:~# ceph -w
>   cluster:
>     id:     278a2e9c-0578-428f-bd5b-3bb348923c27
>     health: HEALTH_OK
> 
>   services:
>     mon: 3 daemons, quorum virt1,virt2,virt3
>     mgr: virt1(active)
>     osd: 11 osds: 10 up, 10 in
> 
>   data:
>     pools:   1 pools, 512 pgs
>     objects: 6084 objects, 24105 MB
>     usage:   92822 MB used, 74438 GB / 74529 GB avail
>     pgs:     512 active+clean
> 
> 
> 2017-11-20 12:32:08.199450 mon.virt1 [INF] mon.1 10.10.10.82:6789/0
> 
> 
> 
> The SSD drives are used as journal drives:
> 
Bluestore has no journals; those SSD partitions hold the WAL/DB. Don't
confuse things for yourself and the people you're asking for help.

> root at virt3:~# ceph-disk list | grep /dev/sde | grep osd
>  /dev/sdb1 ceph data, active, cluster ceph, osd.8, block /dev/sdb2,
> block.db /dev/sde1
> root at virt3:~# ceph-disk list | grep /dev/sdf | grep osd
>  /dev/sdc1 ceph data, active, cluster ceph, osd.9, block /dev/sdc2,
> block.db /dev/sdf1
>  /dev/sdd1 ceph data, active, cluster ceph, osd.10, block /dev/sdd2,
> block.db /dev/sdf2
> 
> 
> 
> I see now /dev/sda doesn't have a journal, though it should have. Not sure
> why.
If an OSD has no fast WAL/DB, it will drag the overall speed down.

Verify this, and if so, fix it and re-test.
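One way to verify on each host is to look for OSD data dirs without a
block.db symlink. A minimal sketch, assuming the standard ceph-disk layout
under /var/lib/ceph/osd (adjust the path if yours differs):

```python
# Flag OSDs whose RocksDB/WAL sit on the slow HDD (no block.db on SSD).
import glob
import os

def osds_without_db(osd_root="/var/lib/ceph/osd"):
    missing = []
    for d in sorted(glob.glob(os.path.join(osd_root, "ceph-*"))):
        if not os.path.exists(os.path.join(d, "block.db")):
            missing.append(os.path.basename(d))
    return missing
```

Any OSD it lists should be destroyed and recreated with its block.db on
the SSD.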

Christian 

> This is the command I used to create it:
> 
> 
>  pveceph createosd /dev/sda -bluestore 1  -journal_dev /dev/sde
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Rakuten Communications

