[ceph-users] how to improve performance

Rudi Ahlers rudiahlers at gmail.com
Tue Nov 21 04:12:22 PST 2017


On Tue, Nov 21, 2017 at 10:46 AM, Christian Balzer <chibi at gol.com> wrote:

> On Tue, 21 Nov 2017 09:21:58 +0200 Rudi Ahlers wrote:
>
> > On Mon, Nov 20, 2017 at 2:36 PM, Christian Balzer <chibi at gol.com> wrote:
> >
> > > On Mon, 20 Nov 2017 14:02:30 +0200 Rudi Ahlers wrote:
> > >
> > > > We're planning on installing 12X Virtual Machines with some heavy
> > > > loads.
> > > >
> > > > the SSD drives are  INTEL SSDSC2BA400G4
> > > >
> > > Interesting, where did you find those?
> > > Or did you have them lying around?
> > >
> > > I've been unable to get DC S3710 SSDs for nearly a year now.
> > >
> >
> > In South Africa, one of our suppliers had some in stock. They're still
> > fairly new, about 2 months old now.
> >
> >
> Odd, oh well.
>
> >
> >
> > > > The SATA drives are ST8000NM0055-1RM112
> > > >
> > > Note that these (while fast) have an internal flash cache, limiting
> > > them to something like 0.2 DWPD.
> > > Probably not an issue with the WAL/DB on the Intels, but something to
> > > keep in mind.
> > >
> >
> >
> > I don't quite understand what you want to say, please explain?
> >
> See the other mails in this thread after the one above.
> In short, probably nothing to worry about.
>
> >
> >
> > > > Please explain your comment, "b) will find a lot of people here who
> > > > don't approve of it."
> > > >
> > > Read the archives.
> > > Converged clusters are complex and debugging Ceph when tons of other
> > > things are going on at the same time on the machine even more so.
> > >
> >
> >
> > Ok, so I have 4 physical servers and need to set up a highly redundant
> > cluster. How else would you have done it? There is no budget for a SAN,
> > let alone a highly available SAN.
> >
> As I said, I'd be fine doing it with Ceph, if that was a good match.
> It's easy to starve resources with hyperconverged clusters.
>
> Since you're using proxmox, DRBD would be an obvious alternative,
> especially if you're not planning on growing this cluster.
>
> You only mentioned 3 servers so far, is the fourth one non-Ceph?
>

From what I have read, DRBD isn't very stable?

The 4th one will be for backups.



>
> >
> >
> > >
> > > > I don't have access to the switches right now, but they're new so
> > > > whatever default config ships from factory would be active. Though
> > > > iperf shows 10.5 GBytes / 9.02 Gbits/sec throughput.
> > > >
> > > Didn't think it was the switches, but for completeness' sake and all that.
> > >
> > > > What speeds would you expect?
> > > > "Though with your setup I would have expected something faster, but
> > > > NOT the theoretical 600MB/s 4 HDDs will do in sequential writes."
> > > >
> > > What I wrote.
> > > A 7200RPM HDD, even these, can not sustain writes much over 170MB/s, in
> > > the most optimal circumstances.
> > > So your cluster can NOT exceed about 600MB/s sustained writes with the
> > > effective bandwidth of 4 HDDs.
> > > Smaller writes/reads that can be cached by RAM, DB, onboard caches on
> > > the HDDs of course can and will be faster.
> > >
> > > But again, you're missing the point, even if you get 600MB/s writes out
> > > of your cluster, the number of 4k IOPS will be much more relevant to
> > > your VMs.
> > >
> > >
> > hdparm shows about 230MB/s:
> >
> > root@virt2:~# hdparm -Tt /dev/sda
> >
> > /dev/sda:
> >  Timing cached reads:   20250 MB in  2.00 seconds = 10134.81 MB/sec
> >  Timing buffered disk reads: 680 MB in  3.00 seconds = 226.50 MB/sec
> >
> That's read and a very optimized sequential one at that.
> >
> >
> > 600MB/s would be super nice, but in reality even 400MB/s would be nice.
> Do you really need to write that amount of data in a short time?
> Typical VMs are IOPS bound, as pointed out several times.
>

We have 10 physical servers which are quite busy, and two of them are slow
in terms of disk speed, so I am looking at getting better performance.


>
> > Would it not be achievable?
> >
> Maybe, but you need to find out what, if anything makes your cluster
> slower than this.
> iostat, atop, etc can help with that.
> How busy are your CPUs, HDDs and SSDs when you run that benchmark?
>

The CPU and RAM are fairly "idle" during all of my tests.


>
> >
> >
> > > >
> > > >
> > > > On this, "If an OSD has no fast WAL/DB, it will drag the overall
> > > > speed down. Verify and if so fix this and re-test.": how?
> > > >
> > > No idea, I don't do bluestore.
> > > You noticed the lack of a WAL/DB for sda, go and fix it.
> > > If in doubt, by destroying and re-creating.
> > >
> > > And if you're looking for a less invasive procedure, docs and the ML
> > > archive, but AFAIK there is nothing but re-creation at this time.
> > >
> >
> >
> > I use Proxmox, which set up a DB device, but not a WAL device.
> >
> Again, I don't do bluestore.
> But AFAIK, WAL will live on the fastest device, which is the SSD you've
> put the DB on, unless specified separately.
> So nothing to be done here.
>


I have re-created the Ceph pool with a DB and WAL device this time and
performance is slightly better:

root@virt2:~# ceph-disk list | grep /dev/sdf | grep osd
 /dev/sdb1 ceph data, active, cluster ceph, osd.5, block /dev/sdb2, block.db /dev/sdf1, block.wal /dev/sdf2
 /dev/sdd1 ceph data, active, cluster ceph, osd.7, block /dev/sdd2, block.db /dev/sdf3, block.wal /dev/sdf4


root@virt2:~# ceph-disk list | grep /dev/sde | grep osd
 /dev/sda1 ceph data, active, cluster ceph, osd.4, block /dev/sda2, block.db /dev/sde1, block.wal /dev/sde2
 /dev/sdc1 ceph data, active, cluster ceph, osd.6, block /dev/sdc2, block.db /dev/sde3, block.wal /dev/sde4




root@virt2:~# rados bench -p Data 10 seq
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       311       295   1179.73      1180   0.0498938   0.0520793
    2      16       622       606   1211.78      1244      0.0358   0.0511329
    3      16       934       918    1223.8      1248   0.0587524   0.0506744
Total time run:       3.420127
Total reads made:     986
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1153.17
Average IOPS:         288
Stddev IOPS:          9
Max IOPS:             312
Min IOPS:             295
Average Latency(s):   0.053413
Max latency(s):       0.284069
Min latency(s):       0.0166523





root@virt2:~# rados bench -p Data 10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       381       365   1459.69      1460  0.00267135     0.04159
    2      15       715       700   1399.75      1340   0.0934119   0.0441607
    3      15      1079      1064   1418.44      1456  0.00258879   0.0435526
    4      16      1448      1432   1431.77      1472    0.134513   0.0435446
    5      16      1862      1846   1476.56      1656    0.017519    0.042301
    6      16      2192      2176   1450.44      1320  0.00885603   0.0427858
    7      16      2558      2542   1452.35      1464  0.00184139   0.0429065
    8      16      2996      2980   1489.78      1752   0.0103593     0.04178
    9      16      3385      3369   1497.12      1556  0.00866541    0.041612
   10      16      3744      3728   1490.99      1436  0.00246718   0.0420014
Total time run:       10.204271
Total reads made:     3744
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1467.62
Average IOPS:         366
Stddev IOPS:          33
Max IOPS:             438
Min IOPS:             330
Average Latency(s):   0.0427017
Max latency(s):       0.453643
Min latency(s):       0.00143035




root@virt2:~# rados bench -p Data 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_virt2_20816
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       106        90   359.981       360    0.211947    0.164055
    2      16       202       186   371.956       384    0.101829    0.161727
    3      16       312       296   394.616       440    0.142682    0.157926
    4      16       414       398   397.946       408     0.17893    0.157207
    5      16       515       499   399.147       404    0.138521    0.157384
    6      16       609       593   395.281       376    0.197496    0.159185
    7      16       703       687   392.521       376    0.148057    0.160965
    8      16       796       780   389.952       372    0.360846    0.161464
    9      16       907       891   395.951       444   0.0697599    0.160687
   10      16       989       973   389.153       328    0.164584    0.161334
Total time run:         10.125151
Total writes made:      990
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     391.105
Stddev Bandwidth:       35.6302
Max bandwidth (MB/sec): 444
Min bandwidth (MB/sec): 328
Average IOPS:           97
Stddev IOPS:            8
Max IOPS:               111
Min IOPS:               82
Average Latency(s):     0.163488
Stddev Latency(s):      0.0623322
Max latency(s):         0.451163
Min latency(s):         0.0416428



As noted, the IOPS are still very low. What could cause that?
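
A note on reading these figures: the write run above used 4194304-byte
operations ("Write size") with 16 of them in flight, so the IOPS that rados
bench prints is just the bandwidth divided by the op size, and also roughly
the number of concurrent operations divided by the average latency. The
sketch below only redoes that arithmetic from the numbers in the write
output. The function names are made up for illustration and are not part of
any Ceph tool. It says nothing about 4k IOPS, which would need a run with a
smaller op size.

```python
# Figures copied from the "rados bench ... write" output above.
OP_SIZE_BYTES = 4194304      # "Write size"
BANDWIDTH_MB_S = 391.105     # "Bandwidth (MB/sec)"; MB is 1024 * 1024 bytes here
CONCURRENT_OPS = 16          # "Maintaining 16 concurrent writes"
AVG_LATENCY_S = 0.163488     # "Average Latency(s)"


def iops_from_bandwidth(bandwidth_mb_s, op_size_bytes):
    """Operations per second implied by a bandwidth at a fixed op size."""
    return bandwidth_mb_s * 1024 * 1024 / op_size_bytes


def iops_from_latency(concurrent_ops, avg_latency_s):
    """Little's law: throughput = operations in flight / average latency."""
    return concurrent_ops / avg_latency_s


if __name__ == "__main__":
    print(round(iops_from_bandwidth(BANDWIDTH_MB_S, OP_SIZE_BYTES), 1))
    print(round(iops_from_latency(CONCURRENT_OPS, AVG_LATENCY_S), 1))
```

Both values come out at about 98, in line with the reported "Average IOPS:
97". So the low figure here mostly reflects the 4 MB op size and the
~0.16 s per-write latency, not a separate limit. For a number closer to what
VMs see, rados bench takes -b to set a smaller write size.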



-- 
Kind Regards
Rudi Ahlers
Website: http://www.rudiahlers.co.za