[ceph-users] bluestore - wal,db on faster devices?

Mark Nelson mnelson at redhat.com
Wed Nov 8 11:45:46 PST 2017

Hi Wolfgang,

You've got the right idea.  RBD is probably going to benefit less since 
you have a small number of large objects and little extra OMAP data. 
Having the allocation and object metadata on flash certainly shouldn't 
hurt, and you should still have less overhead for small (<64k) writes. 
With RGW, however, you also have to worry about bucket index updates 
during writes and that's a big potential bottleneck that you don't need 
to worry about with RBD.
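
To make the small-write point concrete, here is a rough Python sketch (not
BlueStore code). It assumes a deliberately simplified model: writes under a
size cutoff are committed through the WAL first, while larger writes go
straight to the data device. The 64 KiB cutoff is just the "<64k" figure
mentioned above; the real cutoff depends on OSD configuration, and the names
write_path and wal_byte_share are made up for illustration.

```python
# Toy model only: the cutoff and both function names are assumptions for
# illustration, not BlueStore internals or Ceph defaults.
SMALL_WRITE_CUTOFF = 64 * 1024  # bytes, from the "<64k" figure in the text


def write_path(size_bytes, cutoff=SMALL_WRITE_CUTOFF):
    """Return "wal" for writes below the cutoff, "direct" otherwise."""
    if size_bytes < 0:
        raise ValueError("write size must be non-negative")
    return "wal" if size_bytes < cutoff else "direct"


def wal_byte_share(write_sizes, cutoff=SMALL_WRITE_CUTOFF):
    """Fraction of all written bytes that pass through the WAL in this model."""
    total = sum(write_sizes)
    if total == 0:
        return 0.0
    via_wal = sum(s for s in write_sizes if write_path(s, cutoff) == "wal")
    return via_wal / total


if __name__ == "__main__":
    mixed = [32 * 1024, 96 * 1024]   # one small write, one large write
    large_only = [4 * 1024 * 1024]   # a single 4 MiB write
    print(wal_byte_share(mixed))
    print(wal_byte_share(large_only))
```

In this model a workload made up only of large writes never touches the WAL,
so a faster WAL device mostly helps when small writes are common. It says
nothing about DB (metadata) traffic, which is a separate consideration.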


On 11/08/2017 01:01 PM, Wolfgang Lendl wrote:
> Hi Mark,
> thanks for your reply!
> I'm a big fan of keeping things simple - this means that there has to be
> a very good reason to put the WAL and DB on a separate device otherwise
> I'll keep it collocated (and simpler).
> as far as I understand - putting the WAL,DB on a faster (than hdd)
> device makes more sense in cephfs and rgw environments (more metadata) -
> and less sense in rbd environments - correct?
> br
> wolfgang
> On 11/08/2017 02:21 PM, Mark Nelson wrote:
>> Hi Wolfgang,
>> In bluestore the WAL serves sort of a similar purpose to filestore's
>> journal, but bluestore isn't dependent on it for guaranteeing
>> durability of large writes.  With bluestore you can often get higher
>> large-write throughput than with filestore when using HDD-only or
>> flash-only OSDs.
>> Bluestore also stores allocation, object, and cluster metadata in the
>> DB.  That, in combination with the way bluestore stores objects,
>> dramatically improves behavior during certain workloads.  A big one is
>> creating millions of small objects as quickly as possible.  In
>> filestore, PG splitting has a huge impact on performance and tail
>> latency.  Bluestore is much better just on HDD, and putting the DB and
>> WAL on flash makes it better still since metadata is no longer a
>> bottleneck.
>> Bluestore does have a couple of shortcomings vs filestore currently.
>> The allocator is not as good as XFS's and can fragment more over time.
>> There is no server-side readahead so small sequential read performance
>> is very dependent on client-side readahead.  There are still a number
>> of possible optimizations, in areas ranging from threading and locking
>> in the shardedopwq to pglog and dup_ops, that could potentially improve
>> performance.
>> I have a blog post that we've been working on that explores some of
>> these things but I'm still waiting on review before I publish it.
>> Mark
>> On 11/08/2017 05:53 AM, Wolfgang Lendl wrote:
>>> Hello,
>>> it's clear to me that putting the journal on a fast device (ssd,nvme)
>>> gives a performance gain when using the filestore backend.
>>> it's less clear when it comes to bluestore - are there any resources,
>>> performance tests, etc. out there showing how a fast wal,db device
>>> impacts performance?
>>> br
>>> wolfgang
