[ceph-users] how to improve performance

Rudi Ahlers rudiahlers at gmail.com
Mon Nov 20 12:55:09 PST 2017


Ok, so it seems an MTU of 9000 didn't improve anything.

On Mon, Nov 20, 2017 at 5:34 PM, Sébastien VIGNERON
<sebastien.vigneron at criann.fr> wrote:

> Your performance hit can come from here. When an OSD daemon tries to send
> a big frame, an MTU misconfiguration blocks it and the frame must be
> resent at a smaller size.
> On some switches, you have to set the global and the per-interface MTU
> sizes.
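>
> A minimal host-side sketch (assuming Linux with iproute2; the interface
> name is an example):
>
>   ip link set dev ens2f0 mtu 9000      # set the interface MTU
>   ip link show ens2f0 | grep mtu       # confirm it took effect
>
> On many Cisco switches the equivalent is a global "system mtu jumbo 9216"
> plus any per-port MTU setting; the exact syntax varies by model.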
>
> Cordialement / Best regards,
>
> Sébastien VIGNERON
> CRIANN,
> Ingénieur / Engineer
> Technopôle du Madrillet
> 745, avenue de l'Université
>
> 76800 Saint-Etienne du Rouvray - France
>
> tél. +33 2 32 91 42 91
> fax. +33 2 32 91 42 92
> http://www.criann.fr
> sebastien.vigneron at criann.fr
> support: support at criann.fr
>
> On 20 Nov 2017, at 16:21, Rudi Ahlers <rudiahlers at gmail.com> wrote:
>
> I am not sure why, but I cannot get Jumbo Frames to work properly:
>
>
> root at virt2:~# ping -M do -s 8972 -c 4 10.10.10.83
> PING 10.10.10.83 (10.10.10.83) 8972(9000) bytes of data.
> ping: local error: Message too long, mtu=1500
> ping: local error: Message too long, mtu=1500
> ping: local error: Message too long, mtu=1500
>
>
> Jumbo Frames are on, on the switch and on the NICs:
>
> ens2f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
>         inet 10.10.10.83  netmask 255.255.255.0  broadcast 10.10.10.255
>         inet6 fe80::ec4:7aff:feea:7b40  prefixlen 64  scopeid 0x20<link>
>         ether 0c:c4:7a:ea:7b:40  txqueuelen 1000  (Ethernet)
>         RX packets 166440655  bytes 229547410625 (213.7 GiB)
>         RX errors 0  dropped 223  overruns 0  frame 0
>         TX packets 142788790  bytes 188658602086 (175.7 GiB)
>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>
>
>
>
> root at virt2:~# ifconfig ens2f0
> ens2f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
>         inet 10.10.10.82  netmask 255.255.255.0  broadcast 10.10.10.255
>         inet6 fe80::ec4:7aff:feea:ff2c  prefixlen 64  scopeid 0x20<link>
>         ether 0c:c4:7a:ea:ff:2c  txqueuelen 1000  (Ethernet)
>         RX packets 466774  bytes 385578454 (367.7 MiB)
>         RX errors 4  dropped 223  overruns 0  frame 3
>         TX packets 594975  bytes 580053745 (553.1 MiB)
>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
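>
> Note that ping reports mtu=1500 even though ens2f0 shows 9000, which
> suggests the packets are leaving via another (1500-byte) interface, or a
> bridge/bond on top of it. A quick check, assuming iproute2 is available:
>
>   ip route get 10.10.10.83    # which interface the kernel actually uses
>   ip -d link show             # per-interface MTU, including bonds/bridges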
>
>
>
> On Mon, Nov 20, 2017 at 2:13 PM, Sébastien VIGNERON
> <sebastien.vigneron at criann.fr> wrote:
>
>> As a jumbo frame test, can you try the following?
>>
>> ping -M do -s 8972 -c 4 IP_of_other_node_within_cluster_network
>>
>> (8972 bytes = the 9000-byte MTU minus 20 bytes of IP header and 8 bytes
>> of ICMP header.)
>>
>> If you get "ping: sendto: Message too long", jumbo frames are not
>> active.
>>
>> Cordialement / Best regards,
>>
>> Sébastien VIGNERON
>> CRIANN,
>> Ingénieur / Engineer
>> Technopôle du Madrillet
>> 745, avenue de l'Université
>>
>> 76800 Saint-Etienne du Rouvray - France
>>
>> tél. +33 2 32 91 42 91
>> fax. +33 2 32 91 42 92
>> http://www.criann.fr
>> sebastien.vigneron at criann.fr
>> support: support at criann.fr
>>
>> On 20 Nov 2017, at 13:02, Rudi Ahlers <rudiahlers at gmail.com> wrote:
>>
>> We're planning on installing 12 virtual machines with some heavy loads.
>>
>> The SSD drives are INTEL SSDSC2BA400G4.
>>
>> The SATA drives are ST8000NM0055-1RM112
>>
>> Please explain your comment, "b) will find a lot of people here who
>> don't approve of it."
>>
>> I don't have access to the switches right now, but they're new, so
>> whatever default config ships from the factory would be active. iperf
>> shows 10.5 GBytes transferred / 9.02 Gbits/sec throughput, though.
>>
>> What speeds would you expect?
>> "Though with your setup I would have expected something faster, but NOT
>> the
>> theoretical 600MB/s 4 HDDs will do in sequential writes."
>>
>>
>>
>> On this, "If an OSD has no fast WAL/DB, it will drag the overall speed
>> down. Verify and if so fix this and re-test.": how?
>>
>>
>> On Mon, Nov 20, 2017 at 1:44 PM, Christian Balzer <chibi at gol.com> wrote:
>>
>>> On Mon, 20 Nov 2017 12:38:55 +0200 Rudi Ahlers wrote:
>>>
>>> > Hi,
>>> >
>>> > Can someone please help me: how do I improve performance on our Ceph
>>> > cluster?
>>> >
>>> > The hardware in use are as follows:
>>> > 3x SuperMicro servers with the following configuration
>>> > 12Core Dual XEON 2.2Ghz
>>> Faster cores are better for Ceph, IMNSHO.
>>> Though with main storage on HDDs, this will do.
>>>
>>> > 128GB RAM
>>> Overkill for Ceph but I see something else below...
>>>
>>> > 2x 400GB Intel DC SSD drives
>>> Exact model please.
>>>
>>> > 4x 8TB Seagate 7200rpm 6Gbps SATA HDD's
>>> One hopes that's a non-SMR one.
>>> Model please.
>>>
>>> > 1x SuperMicro DOM for Proxmox / Debian OS
>>> Ah, Proxmox.
>>> I'm personally not averse to converged, high-density, multi-role
>>> clusters, but you:
>>> a) need to know what you're doing and
>>> b) will find a lot of people here who don't approve of it.
>>>
>>> I've avoided DOMs so far (a non-hotswappable SPOF), even though the SM
>>> ones look good on paper with regard to endurance and IOPS. The latter is
>>> rather important for your monitors.
>>>
>>> > 4x port 10GbE NIC
>>> > Cisco 10GbE switch.
>>> >
>>> Configuration would be nice for those, LACP?
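>>>
>>> (If bonding is in use on the hosts, something like "cat
>>> /proc/net/bonding/bond0" would show the mode; bond0 is an example name.)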
>>>
>>> >
>>> > root at virt2:~# rados bench -p Data 10 write --no-cleanup
>>> > hints = 1
>>> > Maintaining 16 concurrent writes of 4194304 bytes to objects of size
>>> > 4194304 for up to 10 seconds or 0 objects
>>>
>>> rados bench is a limited tool, and measuring bandwidth is pointless in
>>> nearly all use cases.
>>> Latency is where it's at, and testing from inside a VM is more relevant
>>> than synthetic tests of the storage.
>>> But it is a start.
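>>>
>>> A latency-oriented test from inside a VM could look like this (a sketch;
>>> fio is assumed to be installed, and the file name/size are examples):
>>>
>>>   fio --name=lat-test --ioengine=libaio --direct=1 --rw=randwrite \
>>>       --bs=4k --iodepth=1 --size=1G --runtime=60 --time_based \
>>>       --filename=/mnt/fio-test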
>>>
>>> > Object prefix: benchmark_data_virt2_39099
>>> >   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>>> >     0       0         0         0         0         0            -          0
>>> >     1      16        85        69   275.979       276     0.185576   0.204146
>>> >     2      16       171       155   309.966       344    0.0625409   0.193558
>>> >     3      16       243       227   302.633       288    0.0547129    0.19835
>>> >     4      16       330       314   313.965       348    0.0959492   0.199825
>>> >     5      16       413       397   317.565       332     0.124908   0.196191
>>> >     6      16       494       478   318.633       324       0.1556   0.197014
>>> >     7      15       591       576   329.109       392     0.136305   0.192192
>>> >     8      16       670       654   326.965       312    0.0703808   0.190643
>>> >     9      16       757       741   329.297       348     0.165211   0.192183
>>> >    10      16       828       812   324.764       284    0.0935803   0.194041
>>> > Total time run:         10.120215
>>> > Total writes made:      829
>>> > Write size:             4194304
>>> > Object size:            4194304
>>> > Bandwidth (MB/sec):     327.661
>>> What part of this surprises you?
>>>
>>> With a replication of 3, you have effectively the bandwidth of your 2
>>> SSDs (for small writes, not the case here) and the bandwidth of your 4
>>> HDDs available.
>>> Given overhead, other inefficiencies, and the fact that this is not a
>>> sequential write from the HDD perspective, 320MB/s isn't all that bad.
>>> Though with your setup I would have expected something faster, but NOT
>>> the theoretical 600MB/s that 4 HDDs will do in sequential writes.
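>>>
>>> Rough arithmetic (assuming roughly 150MB/s sequential per HDD): with
>>> size=3 across 3 hosts, each 4MB object is written once per host, so the
>>> slowest host's 3-4 HDDs (~450-600MB/s raw, minus seeks, metadata and
>>> network overhead) bound the aggregate, which lines up with the ~320MB/s
>>> you measured.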
>>>
>>> > Stddev Bandwidth:       35.8664
>>> > Max bandwidth (MB/sec): 392
>>> > Min bandwidth (MB/sec): 276
>>> > Average IOPS:           81
>>> > Stddev IOPS:            8
>>> > Max IOPS:               98
>>> > Min IOPS:               69
>>> > Average Latency(s):     0.195191
>>> > Stddev Latency(s):      0.0830062
>>> > Max latency(s):         0.481448
>>> > Min latency(s):         0.0414858
>>> > root at virt2:~# hdparm -I /dev/sda
>>> >
>>> >
>>> >
>>> > root at virt2:~# ceph osd tree
>>> > ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
>>> > -1       72.78290 root default
>>> > -3       29.11316     host virt1
>>> >  1   hdd  7.27829         osd.1      up  1.00000 1.00000
>>> >  2   hdd  7.27829         osd.2      up  1.00000 1.00000
>>> >  3   hdd  7.27829         osd.3      up  1.00000 1.00000
>>> >  4   hdd  7.27829         osd.4      up  1.00000 1.00000
>>> > -5       21.83487     host virt2
>>> >  5   hdd  7.27829         osd.5      up  1.00000 1.00000
>>> >  6   hdd  7.27829         osd.6      up  1.00000 1.00000
>>> >  7   hdd  7.27829         osd.7      up  1.00000 1.00000
>>> > -7       21.83487     host virt3
>>> >  8   hdd  7.27829         osd.8      up  1.00000 1.00000
>>> >  9   hdd  7.27829         osd.9      up  1.00000 1.00000
>>> > 10   hdd  7.27829         osd.10     up  1.00000 1.00000
>>> >  0              0 osd.0            down        0 1.00000
>>> >
>>> >
>>> > root at virt2:~# ceph -s
>>> >   cluster:
>>> >     id:     278a2e9c-0578-428f-bd5b-3bb348923c27
>>> >     health: HEALTH_OK
>>> >
>>> >   services:
>>> >     mon: 3 daemons, quorum virt1,virt2,virt3
>>> >     mgr: virt1(active)
>>> >     osd: 11 osds: 10 up, 10 in
>>> >
>>> >   data:
>>> >     pools:   1 pools, 512 pgs
>>> >     objects: 6084 objects, 24105 MB
>>> >     usage:   92822 MB used, 74438 GB / 74529 GB avail
>>> >     pgs:     512 active+clean
>>> >
>>> > root at virt2:~# ceph -w
>>> >   cluster:
>>> >     id:     278a2e9c-0578-428f-bd5b-3bb348923c27
>>> >     health: HEALTH_OK
>>> >
>>> >   services:
>>> >     mon: 3 daemons, quorum virt1,virt2,virt3
>>> >     mgr: virt1(active)
>>> >     osd: 11 osds: 10 up, 10 in
>>> >
>>> >   data:
>>> >     pools:   1 pools, 512 pgs
>>> >     objects: 6084 objects, 24105 MB
>>> >     usage:   92822 MB used, 74438 GB / 74529 GB avail
>>> >     pgs:     512 active+clean
>>> >
>>> >
>>> > 2017-11-20 12:32:08.199450 mon.virt1 [INF] mon.1 10.10.10.82:6789/0
>>> >
>>> >
>>> >
>>> > The SSD drives are used as journal drives:
>>> >
>>> Bluestore has no journals; don't confuse the terminology, or the people
>>> you're asking for help.
>>>
>>> > root at virt3:~# ceph-disk list | grep /dev/sde | grep osd
>>> >  /dev/sdb1 ceph data, active, cluster ceph, osd.8, block /dev/sdb2,
>>> > block.db /dev/sde1
>>> > root at virt3:~# ceph-disk list | grep /dev/sdf | grep osd
>>> >  /dev/sdc1 ceph data, active, cluster ceph, osd.9, block /dev/sdc2,
>>> > block.db /dev/sdf1
>>> >  /dev/sdd1 ceph data, active, cluster ceph, osd.10, block /dev/sdd2,
>>> > block.db /dev/sdf2
>>> >
>>> >
>>> >
>>> > I see now /dev/sda doesn't have a journal, though it should have. Not
>>> > sure why.
>>> If an OSD has no fast WAL/DB, it will drag the overall speed down.
>>>
>>> Verify this, and if so, fix it and re-test.
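>>>
>>> One way to verify (a sketch; the osd ID is an example):
>>>
>>>   ceph-disk list | grep osd          # shows the block.db device per OSD
>>>   ceph osd metadata 1 | grep bluefs  # bluefs_db_partition_path etc.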
>>>
>>> Christian
>>>
>>> > This is the command I used to create it:
>>> >
>>> >
>>> >  pveceph createosd /dev/sda -bluestore 1 -journal_dev /dev/sde
>>> >
>>> >
>>>
>>>
>>> --
>>> Christian Balzer        Network/Systems Engineer
>>> chibi at gol.com           Rakuten Communications
>>>
>>
>>
>>
>> --
>> Kind Regards
>> Rudi Ahlers
>> Website: http://www.rudiahlers.co.za
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>
>
> --
> Kind Regards
> Rudi Ahlers
> Website: http://www.rudiahlers.co.za
>
>
>


-- 
Kind Regards
Rudi Ahlers
Website: http://www.rudiahlers.co.za