[ceph-users] how to improve performance

Christian Balzer chibi at gol.com
Tue Nov 21 00:46:45 PST 2017


On Tue, 21 Nov 2017 09:21:58 +0200 Rudi Ahlers wrote:

> On Mon, Nov 20, 2017 at 2:36 PM, Christian Balzer <chibi at gol.com> wrote:
> 
> > On Mon, 20 Nov 2017 14:02:30 +0200 Rudi Ahlers wrote:
> >  
> > > We're planning on installing 12X Virtual Machines with some heavy loads.
> > >
> > > the SSD drives are  INTEL SSDSC2BA400G4
> > >  
> > Interesting, where did you find those?
> > Or did you have them lying around?
> >
> > I've been unable to get DC S3710 SSDs for nearly a year now.
> >  
> 
> In South Africa, one of our suppliers had some in stock. They're still
> fairly new, about 2 months old now.
> 
> 
Odd, oh well.

> 
> 
> > > The SATA drives are ST8000NM0055-1RM112
> > >  
> > Note that these (while fast) have an internal flash cache, limiting them to
> > something like 0.2 DWPD.
> > Probably not an issue with the WAL/DB on the Intels, but something to keep
> > in mind.
> >  
> 
> 
> I don't quite understand what you want to say, please explain?
> 
See the other mails in this thread after the one above.
In short, probably nothing to worry about.
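For keeping an eye on that flash cache's endurance, a hedged sketch (assumes smartmontools is installed; SMART attribute names vary by vendor and model, so the patterns below are examples, not guaranteed matches for these exact drives):

```shell
# Sketch: monitoring drive wear via SMART attributes.
# Attribute names are vendor-specific; adjust the pattern for your model.
smartctl -A /dev/sda | grep -Ei 'wear|lbas_written|endurance'
# On the Intel DC SSDs, Media_Wearout_Indicator counts down from 100.
```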

> 
> 
> > > Please explain your comment, "b) will find a lot of people here who don't
> > > approve of it."
> > >  
> > Read the archives.
> > Converged clusters are complex and debugging Ceph when tons of other
> > things are going on at the same time on the machine even more so.
> >  
> 
> 
> Ok, so I have 4 physical servers and need to setup a highly redundant
> cluster. How else would you have done it? There is no budget for a SAN, let
> alone a highly available SAN.
>
As I said, I'd be fine doing it with Ceph, if that were a good match.
It's easy to starve resources in hyperconverged clusters.

Since you're using proxmox, DRBD would be an obvious alternative,
especially if you're not planning on growing this cluster. 
 
You only mentioned 3 servers so far; is the fourth one non-Ceph?

> 
> 
> >  
> > > I don't have access to the switches right now, but they're new so
> > > whatever default config ships from factory would be active. Though
> > > iperf shows 10.5 GBytes / 9.02 Gbits/sec throughput.
> > >  
> > Didn't think it was the switches, but completeness sake and all that.
> >  
> > > What speeds would you expect?
> > > "Though with your setup I would have expected something faster, but NOT
> > > the theoretical 600MB/s 4 HDDs will do in sequential writes."
> > >  
> > What I wrote.
> > A 7200RPM HDD, even these, cannot sustain writes much over 170MB/s, even
> > in the most optimal circumstances.
> > So your cluster can NOT exceed about 600MB/s of sustained writes, the
> > effective bandwidth of 4 HDDs.
> > Smaller writes/reads that can be cached by RAM, DB, onboard caches on the
> > HDDs of course can and will be faster.
> >
> > But again, you're missing the point, even if you get 600MB/s writes out of
> > your cluster, the number of 4k IOPS will be much more relevant to your VMs.
> >
> >  
> hdparm shows about 230MB/s:
> 
> root at virt2:~# hdparm -Tt /dev/sda
> 
> /dev/sda:
>  Timing cached reads:   20250 MB in  2.00 seconds = 10134.81 MB/sec
>  Timing buffered disk reads: 680 MB in  3.00 seconds = 226.50 MB/sec
>
That's a read, and a very optimized sequential one at that.
> 
> 
> 600MB/s would be super nice, but in reality even 400MB/s would be nice.
Do you really need to write that amount of data in a short time?
Typical VMs are IOPS bound, as pointed out several times.
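As an illustration (a sketch, not a prescription: fio must be installed, and the target below is a scratch-file placeholder), a 4k random-write run is a far better proxy for VM workloads than sequential bandwidth:

```shell
# Hedged sketch: 4k random-write IOPS/latency with fio.
# --filename is a placeholder scratch file; never point this at a
# device or file holding real data.
fio --name=randwrite4k --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 \
    --size=1G --runtime=30 --time_based \
    --filename=/tmp/fiotest
```

Run from inside a VM backed by an RBD image, this measures what the guests will actually see.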

> Would it not be achievable?
> 
Maybe, but you need to find out what, if anything, makes your cluster
slower than that.
iostat, atop, etc can help with that.
How busy are your CPUs, HDDs and SSDs when you run that benchmark?
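For example (assuming the sysstat package for iostat; the device names are this cluster's, as listed earlier):

```shell
# Sketch: extended per-device stats at 1-second intervals while the
# rados bench run is going.
iostat -x 1
# Rough reading: OSD data HDDs pinned near 100 %util suggest
# spindle-bound writes; high await on the DB SSDs, or maxed-out CPUs
# (check with atop/top), point elsewhere.
```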

> 
> 
> > >
> > >
> > > On this, "If an OSD has no fast WAL/DB, it will drag the overall speed
> > > down. Verify and if so fix this and re-test.": how?
> > >  
> > No idea, I don't do bluestore.
> > You noticed the lack of a WAL/DB for sda, go and fix it.
> > If in doubt, by destroying and re-creating the OSD.
> >
> > And if you're looking for a less invasive procedure, docs and the ML
> > archive, but AFAIK there is nothing but re-creation at this time.
> >  
> 
> 
> I use Proxmox, which set up a DB device but not a WAL device.
> 
Again, I don't do bluestore.
But AFAIK, WAL will live on the fastest device, which is the SSD you've
put the DB on, unless specified separately. 
So nothing to be done here.
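A quick way to confirm that (a sketch; the paths are the bluestore defaults):

```shell
# Sketch: with bluestore, each OSD data dir holds symlinks to its
# devices; a block.wal symlink only exists when the WAL was placed
# separately -- otherwise it shares the DB (or data) device.
ls -l /var/lib/ceph/osd/ceph-*/block /var/lib/ceph/osd/ceph-*/block.db
ls -l /var/lib/ceph/osd/ceph-*/block.wal 2>/dev/null \
    || echo "no separate WAL (it shares the DB device)"
```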

Christian
> 
> 
> 
> > Christian  
> > >
> > > On Mon, Nov 20, 2017 at 1:44 PM, Christian Balzer <chibi at gol.com> wrote:
> > >  
> > > > On Mon, 20 Nov 2017 12:38:55 +0200 Rudi Ahlers wrote:
> > > >  
> > > > > Hi,
> > > > >
> > > > > Can someone please help me: how do I improve performance on our
> > > > > Ceph cluster?
> > > > >
> > > > > The hardware in use are as follows:
> > > > > 3x SuperMicro servers with the following configuration
> > > > > 12Core Dual XEON 2.2Ghz  
> > > > Faster cores are better for Ceph, IMNSHO.
> > > > Though with main storage on HDDs, this will do.
> > > >  
> > > > > 128GB RAM  
> > > > Overkill for Ceph but I see something else below...
> > > >  
> > > > > 2x 400GB Intel DC SSD drives  
> > > > Exact model please.
> > > >  
> > > > > 4x 8TB Seagate 7200rpm 6Gbps SATA HDD's  
> > > > One hopes that's a non-SMR one.
> > > > Model please.
> > > >  
> > > > > 1x SuperMicro DOM for Proxmox / Debian OS  
> > > > Ah, Proxmox.
> > > > I'm personally not averse to converged, high-density, multi-role
> > > > clusters myself, but you:
> > > > a) need to know what you're doing and
> > > > b) will find a lot of people here who don't approve of it.
> > > >
> > > > I've avoided DOMs so far (non-hotswappable SPOF), even though the SM
> > > > ones look good on paper with regard to endurance and IOPS.
> > > > The latter being rather important for your monitors.
> > > >  
> > > > > 4x Port 10Gbe NIC
> > > > > Cisco 10Gbe switch.
> > > > >  
> > > > Configuration would be nice for those, LACP?
> > > >  
> > > > >
> > > > > root at virt2:~# rados bench -p Data 10 write --no-cleanup
> > > > > hints = 1
> > > > > Maintaining 16 concurrent writes of 4194304 bytes to objects of size
> > > > > 4194304 for up to 10 seconds or 0 objects
> > > >
> > > > rados bench is a limited tool, and measuring bandwidth is pointless
> > > > in nearly all use cases.
> > > > Latency is where it is at and testing from inside a VM is more relevant
> > > > than synthetic tests of the storage.
> > > > But it is a start.
> > > >  
> > > > > Object prefix: benchmark_data_virt2_39099
> > > > >   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
> > > > >     0       0         0         0         0         0            -           0
> > > > >     1      16        85        69   275.979       276     0.185576    0.204146
> > > > >     2      16       171       155   309.966       344    0.0625409    0.193558
> > > > >     3      16       243       227   302.633       288    0.0547129     0.19835
> > > > >     4      16       330       314   313.965       348    0.0959492    0.199825
> > > > >     5      16       413       397   317.565       332     0.124908    0.196191
> > > > >     6      16       494       478   318.633       324       0.1556    0.197014
> > > > >     7      15       591       576   329.109       392     0.136305    0.192192
> > > > >     8      16       670       654   326.965       312    0.0703808    0.190643
> > > > >     9      16       757       741   329.297       348     0.165211    0.192183
> > > > >    10      16       828       812   324.764       284    0.0935803    0.194041
> > > > > Total time run:         10.120215
> > > > > Total writes made:      829
> > > > > Write size:             4194304
> > > > > Object size:            4194304
> > > > > Bandwidth (MB/sec):     327.661
> > > > What part of this surprises you?
> > > >
> > > > With a replication of 3, you have effectively the bandwidth of your 2
> > > > SSDs (for small writes, not the case here) and the bandwidth of your 4
> > > > HDDs available.
> > > > Given overhead, other inefficiencies and the fact that this is not a
> > > > sequential write from the HDD perspective, 320MB/s isn't all that bad.
> > > > Though with your setup I would have expected something faster, but NOT
> > > > the theoretical 600MB/s 4 HDDs will do in sequential writes.
> > > >  
> > > > > Stddev Bandwidth:       35.8664
> > > > > Max bandwidth (MB/sec): 392
> > > > > Min bandwidth (MB/sec): 276
> > > > > Average IOPS:           81
> > > > > Stddev IOPS:            8
> > > > > Max IOPS:               98
> > > > > Min IOPS:               69
> > > > > Average Latency(s):     0.195191
> > > > > Stddev Latency(s):      0.0830062
> > > > > Max latency(s):         0.481448
> > > > > Min latency(s):         0.0414858
> > > > > root at virt2:~# hdparm -I /dev/sda
> > > > >
> > > > >
> > > > >
> > > > > root at virt2:~# ceph osd tree
> > > > > ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
> > > > > -1       72.78290 root default
> > > > > -3       29.11316     host virt1
> > > > >  1   hdd  7.27829         osd.1      up  1.00000 1.00000
> > > > >  2   hdd  7.27829         osd.2      up  1.00000 1.00000
> > > > >  3   hdd  7.27829         osd.3      up  1.00000 1.00000
> > > > >  4   hdd  7.27829         osd.4      up  1.00000 1.00000
> > > > > -5       21.83487     host virt2
> > > > >  5   hdd  7.27829         osd.5      up  1.00000 1.00000
> > > > >  6   hdd  7.27829         osd.6      up  1.00000 1.00000
> > > > >  7   hdd  7.27829         osd.7      up  1.00000 1.00000
> > > > > -7       21.83487     host virt3
> > > > >  8   hdd  7.27829         osd.8      up  1.00000 1.00000
> > > > >  9   hdd  7.27829         osd.9      up  1.00000 1.00000
> > > > > 10   hdd  7.27829         osd.10     up  1.00000 1.00000
> > > > >  0              0 osd.0            down        0 1.00000
> > > > >
> > > > >
> > > > > root at virt2:~# ceph -s
> > > > >   cluster:
> > > > >     id:     278a2e9c-0578-428f-bd5b-3bb348923c27
> > > > >     health: HEALTH_OK
> > > > >
> > > > >   services:
> > > > >     mon: 3 daemons, quorum virt1,virt2,virt3
> > > > >     mgr: virt1(active)
> > > > >     osd: 11 osds: 10 up, 10 in
> > > > >
> > > > >   data:
> > > > >     pools:   1 pools, 512 pgs
> > > > >     objects: 6084 objects, 24105 MB
> > > > >     usage:   92822 MB used, 74438 GB / 74529 GB avail
> > > > >     pgs:     512 active+clean
> > > > >
> > > > > root at virt2:~# ceph -w
> > > > >   cluster:
> > > > >     id:     278a2e9c-0578-428f-bd5b-3bb348923c27
> > > > >     health: HEALTH_OK
> > > > >
> > > > >   services:
> > > > >     mon: 3 daemons, quorum virt1,virt2,virt3
> > > > >     mgr: virt1(active)
> > > > >     osd: 11 osds: 10 up, 10 in
> > > > >
> > > > >   data:
> > > > >     pools:   1 pools, 512 pgs
> > > > >     objects: 6084 objects, 24105 MB
> > > > >     usage:   92822 MB used, 74438 GB / 74529 GB avail
> > > > >     pgs:     512 active+clean
> > > > >
> > > > >
> > > > > 2017-11-20 12:32:08.199450 mon.virt1 [INF] mon.1 10.10.10.82:6789/0
> > > > >
> > > > >
> > > > >
> > > > > The SSD drives are used as journal drives:
> > > > >  
> > > > Bluestore has no journals, don't confuse it and the people you're
> > > > asking for help.
> > > >  
> > > > > root at virt3:~# ceph-disk list | grep /dev/sde | grep osd
> > > > >  /dev/sdb1 ceph data, active, cluster ceph, osd.8, block /dev/sdb2,
> > > > > block.db /dev/sde1
> > > > > root at virt3:~# ceph-disk list | grep /dev/sdf | grep osd
> > > > >  /dev/sdc1 ceph data, active, cluster ceph, osd.9, block /dev/sdc2,
> > > > > block.db /dev/sdf1
> > > > >  /dev/sdd1 ceph data, active, cluster ceph, osd.10, block /dev/sdd2,
> > > > > block.db /dev/sdf2
> > > > >
> > > > >
> > > > >
> > > > > I see now /dev/sda doesn't have a journal, though it should have. Not
> > > > > sure why.
> > > > If an OSD has no fast WAL/DB, it will drag the overall speed down.
> > > >
> > > > Verify and if so fix this and re-test.
> > > >
> > > > Christian
> > > >  
> > > > > This is the command I used to create it:
> > > > >
> > > > >
> > > > >  pveceph createosd /dev/sda -bluestore 1  -journal_dev /dev/sde
> > > > >
> > > > >  
> > > >
> > > >
> > > > --
> > > > Christian Balzer        Network/Systems Engineer
> > > > chibi at gol.com           Rakuten Communications
> > > >  
> > >
> > >
> > >  
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi at gol.com           Rakuten Communications
> >  
> 
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Rakuten Communications

