[ceph-users] how to improve performance

Rudi Ahlers rudiahlers at gmail.com
Mon Nov 20 23:21:58 PST 2017


On Mon, Nov 20, 2017 at 2:36 PM, Christian Balzer <chibi at gol.com> wrote:

> On Mon, 20 Nov 2017 14:02:30 +0200 Rudi Ahlers wrote:
>
> > We're planning on installing 12X Virtual Machines with some heavy loads.
> >
> > The SSD drives are INTEL SSDSC2BA400G4
> >
> Interesting, where did you find those?
> Or did you have them lying around?
>
> I've been unable to get DC S3710 SSDs for nearly a year now.
>

In South Africa, one of our suppliers had some in stock. They're still
fairly new, about 2 months old now.




> > The SATA drives are ST8000NM0055-1RM112
> >
> Note that these (while fast) have an internal flash cache, limiting them to
> something like 0.2 DWPD.
> Probably not an issue with the WAL/DB on the Intels, but something to keep
> in mind.
>


I don't quite understand what you mean; could you please explain?



> > Please explain your comment, "b) will find a lot of people here who don't
> > approve of it."
> >
> Read the archives.
> Converged clusters are complex and debugging Ceph when tons of other
> things are going on at the same time on the machine even more so.
>


OK, so I have 4 physical servers and need to set up a highly redundant
cluster. How else would you have done it? There is no budget for a SAN, let
alone a highly available one.



>
> > I don't have access to the switches right now, but they're new so
> > whatever default config ships from factory would be active. Though iperf
> > shows 10.5 GBytes / 9.02 Gbits/sec throughput.
> >
> Didn't think it was the switches, but for completeness' sake and all that.
>
> > What speeds would you expect?
> > "Though with your setup I would have expected something faster, but NOT
> the
> > theoretical 600MB/s 4 HDDs will do in sequential writes."
> >
> What I wrote.
> A 7200RPM HDD, even these, can not sustain writes much over 170MB/s, in
> the most optimal circumstances.
> So your cluster can NOT exceed about 600MB/s sustained writes with the
> effective bandwidth of 4 HDDs.
> Smaller writes/reads that can be cached by RAM, DB, onboard caches on the
> HDDs of course can and will be faster.
>
> But again, you're missing the point: even if you get 600MB/s writes out of
> your cluster, the number of 4k IOPS will be much more relevant to your VMs.
>
>
hdparm shows about 230MB/s for buffered disk reads:

root at virt2:~# hdparm -Tt /dev/sda

/dev/sda:
 Timing cached reads:   20250 MB in  2.00 seconds = 10134.81 MB/sec
 Timing buffered disk reads: 680 MB in  3.00 seconds = 226.50 MB/sec



600MB/s would be super nice, but in reality even 400MB/s would be acceptable.
Would that not be achievable?
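
To measure what you say actually matters to the VMs (small-block IOPS rather
than raw bandwidth), I could run a 4k random-write test from inside one of the
guests instead of rados bench. A rough sketch with fio; the test file path,
size and runtime here are just placeholders:

# run inside a VM whose disk lives on the Ceph pool; direct I/O bypasses the
# guest page cache so we measure the storage, not RAM
fio --name=ceph-4k-randwrite --filename=/root/fio-testfile --size=2G \
    --rw=randwrite --bs=4k --iodepth=32 --ioengine=libaio --direct=1 \
    --runtime=60 --time_based --group_reporting

The write IOPS and the completion latency percentiles in the fio output would
be the numbers to watch.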



> >
> >
> > On this, "If an OSD has no fast WAL/DB, it will drag the overall speed
> > down. Verify and if so fix this and re-test.": how?
> >
> No idea, I don't do bluestore.
> You noticed the lack of a WAL/DB for sda; go and fix it.
> If in doubt, by destroying and re-creating.
>
> And if you're looking for a less invasive procedure, check the docs and the
> ML archive, but AFAIK there is nothing but re-creation at this time.
>


I used Proxmox to create the OSDs, and it set up a DB device but not a
separate WAL device.
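
If I understand the BlueStore docs correctly, when only a block.db partition
is given the WAL lives on the same (DB) device, so a separate WAL partition
should not be strictly required; the real problem is the OSD whose DB ended up
on the HDD itself. A rough way to verify which OSDs have their DB/WAL on the
SSD, assuming the ceph-disk layout Proxmox uses here:

# each BlueStore OSD directory has a "block" symlink, plus "block.db" and
# "block.wal" symlinks only if those live on separate devices
ls -l /var/lib/ceph/osd/ceph-*/block*

And if sda really has no block.db, re-creating it is presumably something like
the following (OSD id and device names are examples for my layout, and this
wipes the disk):

ceph osd out <id>
systemctl stop ceph-osd@<id>
ceph osd purge <id> --yes-i-really-mean-it
ceph-disk zap /dev/sda
pveceph createosd /dev/sda -bluestore 1 -journal_dev /dev/sde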




> Christian
> >
> > On Mon, Nov 20, 2017 at 1:44 PM, Christian Balzer <chibi at gol.com> wrote:
> >
> > > On Mon, 20 Nov 2017 12:38:55 +0200 Rudi Ahlers wrote:
> > >
> > > > Hi,
> > > >
> > > > Can someone please help me: how do I improve performance on our Ceph
> > > > cluster?
> > > >
> > > > The hardware in use are as follows:
> > > > 3x SuperMicro servers with the following configuration
> > > > 12Core Dual XEON 2.2Ghz
> > > Faster cores are better for Ceph, IMNSHO.
> > > Though with main storage on HDDs, this will do.
> > >
> > > > 128GB RAM
> > > Overkill for Ceph but I see something else below...
> > >
> > > > 2x 400GB Intel DC SSD drives
> > > Exact model please.
> > >
> > > > 4x 8TB Seagate 7200rpm 6Gbps SATA HDD's
> > > One hopes that's a non-SMR one.
> > > Model please.
> > >
> > > > 1x SuperMicro DOM for Proxmox / Debian OS
> > > Ah, Proxmox.
> > > I'm personally not averse to converged, high-density, multi-role
> > > clusters myself, but you:
> > > a) need to know what you're doing and
> > > b) will find a lot of people here who don't approve of it.
> > >
> > > I've avoided DOMs so far (non-hotswappable SPOF), even though the SM
> > > ones look good on paper with regards to endurance and IOPS.
> > > The latter being rather important for your monitors.
> > >
> > > > 4x Port 10Gbe NIC
> > > > Cisco 10Gbe switch.
> > > >
> > > Configuration would be nice for those, LACP?
> > >
> > > >
> > > > root at virt2:~# rados bench -p Data 10 write --no-cleanup
> > > > hints = 1
> > > > Maintaining 16 concurrent writes of 4194304 bytes to objects of size
> > > > 4194304 for up to 10 seconds or 0 objects
> > >
> > > rados bench is a limited tool, and measuring bandwidth is pointless in
> > > nearly all use cases.
> > > Latency is where it is at and testing from inside a VM is more relevant
> > > than synthetic tests of the storage.
> > > But it is a start.
> > >
> > > > Object prefix: benchmark_data_virt2_39099
> > > >   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
> > > >     0       0         0         0         0         0           -          0
> > > >     1      16        85        69   275.979       276     0.185576   0.204146
> > > >     2      16       171       155   309.966       344    0.0625409   0.193558
> > > >     3      16       243       227   302.633       288    0.0547129    0.19835
> > > >     4      16       330       314   313.965       348    0.0959492   0.199825
> > > >     5      16       413       397   317.565       332     0.124908   0.196191
> > > >     6      16       494       478   318.633       324       0.1556   0.197014
> > > >     7      15       591       576   329.109       392     0.136305   0.192192
> > > >     8      16       670       654   326.965       312    0.0703808   0.190643
> > > >     9      16       757       741   329.297       348     0.165211   0.192183
> > > >    10      16       828       812   324.764       284    0.0935803   0.194041
> > > > Total time run:         10.120215
> > > > Total writes made:      829
> > > > Write size:             4194304
> > > > Object size:            4194304
> > > > Bandwidth (MB/sec):     327.661
> > > What part of this surprises you?
> > >
> > > With a replication of 3, you have effectively the bandwidth of your 2
> > > SSDs (for small writes, not the case here) and the bandwidth of your 4
> > > HDDs available.
> > > Given overhead, other inefficiencies and the fact that this is not a
> > > sequential write from the HDD perspective, 320MB/s isn't all that bad.
> > > Though with your setup I would have expected something faster, but NOT
> > > the theoretical 600MB/s 4 HDDs will do in sequential writes.
> > >
> > > > Stddev Bandwidth:       35.8664
> > > > Max bandwidth (MB/sec): 392
> > > > Min bandwidth (MB/sec): 276
> > > > Average IOPS:           81
> > > > Stddev IOPS:            8
> > > > Max IOPS:               98
> > > > Min IOPS:               69
> > > > Average Latency(s):     0.195191
> > > > Stddev Latency(s):      0.0830062
> > > > Max latency(s):         0.481448
> > > > Min latency(s):         0.0414858
> > > > root at virt2:~# hdparm -I /dev/sda
> > > >
> > > >
> > > >
> > > > root at virt2:~# ceph osd tree
> > > > ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
> > > > -1       72.78290 root default
> > > > -3       29.11316     host virt1
> > > >  1   hdd  7.27829         osd.1      up  1.00000 1.00000
> > > >  2   hdd  7.27829         osd.2      up  1.00000 1.00000
> > > >  3   hdd  7.27829         osd.3      up  1.00000 1.00000
> > > >  4   hdd  7.27829         osd.4      up  1.00000 1.00000
> > > > -5       21.83487     host virt2
> > > >  5   hdd  7.27829         osd.5      up  1.00000 1.00000
> > > >  6   hdd  7.27829         osd.6      up  1.00000 1.00000
> > > >  7   hdd  7.27829         osd.7      up  1.00000 1.00000
> > > > -7       21.83487     host virt3
> > > >  8   hdd  7.27829         osd.8      up  1.00000 1.00000
> > > >  9   hdd  7.27829         osd.9      up  1.00000 1.00000
> > > > 10   hdd  7.27829         osd.10     up  1.00000 1.00000
> > > >  0              0 osd.0            down        0 1.00000
> > > >
> > > >
> > > > root at virt2:~# ceph -s
> > > >   cluster:
> > > >     id:     278a2e9c-0578-428f-bd5b-3bb348923c27
> > > >     health: HEALTH_OK
> > > >
> > > >   services:
> > > >     mon: 3 daemons, quorum virt1,virt2,virt3
> > > >     mgr: virt1(active)
> > > >     osd: 11 osds: 10 up, 10 in
> > > >
> > > >   data:
> > > >     pools:   1 pools, 512 pgs
> > > >     objects: 6084 objects, 24105 MB
> > > >     usage:   92822 MB used, 74438 GB / 74529 GB avail
> > > >     pgs:     512 active+clean
> > > >
> > > > root at virt2:~# ceph -w
> > > >   cluster:
> > > >     id:     278a2e9c-0578-428f-bd5b-3bb348923c27
> > > >     health: HEALTH_OK
> > > >
> > > >   services:
> > > >     mon: 3 daemons, quorum virt1,virt2,virt3
> > > >     mgr: virt1(active)
> > > >     osd: 11 osds: 10 up, 10 in
> > > >
> > > >   data:
> > > >     pools:   1 pools, 512 pgs
> > > >     objects: 6084 objects, 24105 MB
> > > >     usage:   92822 MB used, 74438 GB / 74529 GB avail
> > > >     pgs:     512 active+clean
> > > >
> > > >
> > > > 2017-11-20 12:32:08.199450 mon.virt1 [INF] mon.1 10.10.10.82:6789/0
> > > >
> > > >
> > > >
> > > > The SSD drives are used as journal drives:
> > > >
> > > Bluestore has no journals; don't confuse it and the people you're
> > > asking for help.
> > >
> > > > root at virt3:~# ceph-disk list | grep /dev/sde | grep osd
> > > >  /dev/sdb1 ceph data, active, cluster ceph, osd.8, block /dev/sdb2,
> > > > block.db /dev/sde1
> > > > root at virt3:~# ceph-disk list | grep /dev/sdf | grep osd
> > > >  /dev/sdc1 ceph data, active, cluster ceph, osd.9, block /dev/sdc2,
> > > > block.db /dev/sdf1
> > > >  /dev/sdd1 ceph data, active, cluster ceph, osd.10, block /dev/sdd2,
> > > > block.db /dev/sdf2
> > > >
> > > >
> > > >
> > > > I see now /dev/sda doesn't have a journal, though it should have. Not
> > > > sure why.
> > > If an OSD has no fast WAL/DB, it will drag the overall speed down.
> > >
> > > Verify and if so fix this and re-test.
> > >
> > > Christian
> > >
> > > > This is the command I used to create it:
> > > >
> > > >
> > > >  pveceph createosd /dev/sda -bluestore 1  -journal_dev /dev/sde
> > > >
> > > >
> > >
> > >
> > > --
> > > Christian Balzer        Network/Systems Engineer
> > > chibi at gol.com           Rakuten Communications
> > >
> >
> >
> >
>
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi at gol.com           Rakuten Communications
>



-- 
Kind Regards
Rudi Ahlers
Website: http://www.rudiahlers.co.za