[ceph-users] bluestore - wal,db on faster devices?

Nick Fisk nick at fisk.me.uk
Wed Nov 8 14:04:24 PST 2017


> -----Original Message-----
> From: Mark Nelson [mailto:mnelson at redhat.com]
> Sent: 08 November 2017 21:42
> To: nick at fisk.me.uk; 'Wolfgang Lendl' <wolfgang.lendl at meduniwien.ac.at>
> Cc: ceph-users at lists.ceph.com
> Subject: Re: [ceph-users] bluestore - wal,db on faster devices?
> 
> 
> 
> On 11/08/2017 03:16 PM, Nick Fisk wrote:
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces at lists.ceph.com] On Behalf
> >> Of Mark Nelson
> >> Sent: 08 November 2017 19:46
> >> To: Wolfgang Lendl <wolfgang.lendl at meduniwien.ac.at>
> >> Cc: ceph-users at lists.ceph.com
> >> Subject: Re: [ceph-users] bluestore - wal,db on faster devices?
> >>
> >> Hi Wolfgang,
> >>
> >> You've got the right idea.  RBD is probably going to benefit less
> >> since you have a small number of large objects and little extra OMAP
> >> data.  Having the allocation and object metadata on flash certainly
> >> shouldn't hurt, and you should still have less overhead for small
> >> (<64k) writes.
> >> With RGW however you also have to worry about bucket index updates
> >> during writes and that's a big potential bottleneck that you don't
> >> need to worry about with RBD.
> >
> > If you are running anything which is sensitive to sync write latency,
> > like databases, you will see a big performance improvement from
> > putting the WAL on SSD.
> > As Mark says, small writes will get ack'd once written to SSD.
> > That is roughly 10-200us vs 10000-20000us per write. It will also
> > batch lots of these small writes together and write them to disk in
> > bigger chunks much more effectively. If you want to run active
> > workloads on RBD and want them to match the performance of an
> > enterprise storage array with BBWC, I would say DB and WAL on SSD is
> > a requirement.
> 
> Hi Nick,
> 
> You've done more investigation in this area than most I think.  Once you get
> to the point under continuous load where RocksDB is compacting, do you see
> better than a 2X gain?
> 
> Mark

Hi Mark,

I've not really been testing it in a way where all the OSDs would be under 100% load for a long period of time. It's been more of a real-world, user-facing test where IO comes and goes in short bursts and spikes. I've been busy in other areas for the last few months and so have sort of missed out on all the official Luminous/bluestore goodness. I hope to get round to doing some more testing towards the end of the year though. Once I do, I will look into the compaction and see what impact it might be having.
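
To put the latency figures quoted below into perspective, here is a rough back-of-the-envelope sketch. It assumes a single synchronous writer at queue depth 1, so the achievable write rate is simply 1 / ack latency. The function name and the labels are mine for illustration only, and the ranges are the approximate numbers from this thread, not new measurements.

```python
# Back-of-the-envelope only: with one synchronous writer (queue depth 1),
# the achievable write rate is 1 / ack latency.
# The latency ranges are the rough figures quoted in this thread,
# not new measurements. qd1_iops is a hypothetical helper name.

def qd1_iops(latency_us: float) -> float:
    """Writes per second one synchronous writer can issue at this ack latency."""
    return 1_000_000 / latency_us

SSD_ACK_US = (10, 200)         # assumed: small write ack'd once it is on SSD
HDD_ACK_US = (10_000, 20_000)  # assumed: small write ack'd only once it is on HDD

for name, (best, worst) in (("ack from SSD", SSD_ACK_US), ("ack from HDD", HDD_ACK_US)):
    print(f"{name}: {qd1_iops(worst):,.0f} to {qd1_iops(best):,.0f} writes/s")
```

So a single sync writer goes from the order of 50-100 writes/s to several thousand or more, which is why latency-sensitive things like databases notice it so much. Real clusters add network and replication latency on top, so treat these as upper bounds on the OSD side only.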

> 
> >
> >>
> >> Mark
> >>
> >> On 11/08/2017 01:01 PM, Wolfgang Lendl wrote:
> >>> Hi Mark,
> >>>
> >>> thanks for your reply!
> >>> I'm a big fan of keeping things simple - this means that there has
> >>> to be a very good reason to put the WAL and DB on a separate device
> >>> otherwise I'll keep it collocated (and simpler).
> >>>
> >>> as far as I understood - putting the WAL,DB on a faster (than hdd)
> >>> device makes more sense in cephfs and rgw environments (more
> >>> metadata) - and less sense in rbd environments - correct?
> >>>
> >>> br
> >>> wolfgang
> >>>
> >>> On 11/08/2017 02:21 PM, Mark Nelson wrote:
> >>>> Hi Wolfgang,
> >>>>
> >>>> In bluestore the WAL serves sort of a similar purpose to
> >>>> filestore's journal, but bluestore isn't dependent on it for
> >>>> guaranteeing durability of large writes.  With bluestore you can
> >>>> often get higher large-write throughput than with filestore when
> >>>> using HDD-only or flash-only OSDs.
> >>>>
> >>>> Bluestore also stores allocation, object, and cluster metadata in
> >>>> the DB.  That, in combination with the way bluestore stores
> >>>> objects, dramatically improves behavior during certain workloads.
> >>>> A big one is creating millions of small objects as quickly as
> >>>> possible.  In filestore, PG splitting has a huge impact on
> >>>> performance and tail latency.  Bluestore is much better just on
> >>>> HDD, and putting the DB and WAL on flash makes it better still
> >>>> since metadata no longer is a bottleneck.
> >>>>
> >>>> Bluestore does have a couple of shortcomings vs filestore currently.
> >>>> The allocator is not as good as XFS's and can fragment more over time.
> >>>> There is no server-side readahead so small sequential read
> >>>> performance is very dependent on client-side readahead.  There
> >>>> are still a number of possible optimizations, ranging from
> >>>> threading and locking in the shardedopwq to pglog and dup_ops,
> >>>> that could potentially improve performance.
> >>>>
> >>>> I have a blog post that we've been working on that explores some of
> >>>> these things but I'm still waiting on review before I publish it.
> >>>>
> >>>> Mark
> >>>>
> >>>> On 11/08/2017 05:53 AM, Wolfgang Lendl wrote:
> >>>>> Hello,
> >>>>>
> >>>>> the performance gain from putting the journal on a fast device
> >>>>> (ssd,nvme) is clear to me when using the filestore backend.
> >>>>> it's not when it comes to bluestore - are there any resources,
> >>>>> performance tests, etc. out there on how a fast wal,db device
> >>>>> impacts performance?
> >>>>>
> >>>>>
> >>>>> br
> >>>>> wolfgang
> >>>>>
> >>>> _______________________________________________
> >>>> ceph-users mailing list
> >>>> ceph-users at lists.ceph.com
> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >
