[ceph-users] bluestore - wal,db on faster devices?

Mark Nelson mnelson at redhat.com
Wed Nov 8 13:41:37 PST 2017

On 11/08/2017 03:16 PM, Nick Fisk wrote:
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces at lists.ceph.com] On Behalf Of
>> Mark Nelson
>> Sent: 08 November 2017 19:46
>> To: Wolfgang Lendl <wolfgang.lendl at meduniwien.ac.at>
>> Cc: ceph-users at lists.ceph.com
>> Subject: Re: [ceph-users] bluestore - wal,db on faster devices?
>> Hi Wolfgang,
>> You've got the right idea.  RBD is probably going to benefit less since
>> you have a small number of large objects and little extra OMAP data.
>> Having the allocation and object metadata on flash certainly shouldn't
>> hurt, and you should still have less overhead for small (<64k) writes.
>> With RGW however you also have to worry about bucket index updates
>> during writes and that's a big potential bottleneck that you don't need to
>> worry about with RBD.
> If you are running anything that is sensitive to sync write latency, like
> databases, you will see a big performance improvement from putting the WAL
> on SSD. As Mark says, small writes will get ack'd once written to SSD:
> roughly 10-200us vs 10000-20000us. It will also batch lots of these small
> writes together and flush them to disk in bigger chunks much more
> efficiently. If you want to run active workloads on RBD and want them to
> match the performance of an enterprise storage array with BBWC, I would
> say DB and WAL on SSD is a requirement.
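For reference, a minimal sketch of how an OSD with its WAL on a separate SSD partition could be created with ceph-volume (the device paths below are placeholders for your own hardware):

```shell
# Sketch: create a bluestore OSD with data on an HDD and the WAL on an
# SSD/NVMe partition, so small sync writes are ack'd from flash.
# /dev/sdb and /dev/nvme0n1p1 are placeholder device paths.
ceph-volume lvm create --bluestore \
    --data /dev/sdb \
    --block.wal /dev/nvme0n1p1
```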

Hi Nick,

You've done more investigation in this area than most, I think.  Once you
get to the point under continuous load where RocksDB is compacting, do
you see better than a 2X gain?


>> Mark
>> On 11/08/2017 01:01 PM, Wolfgang Lendl wrote:
>>> Hi Mark,
>>> thanks for your reply!
>>> I'm a big fan of keeping things simple - this means that there has to
>>> be a very good reason to put the WAL and DB on a separate device;
>>> otherwise I'll keep it collocated (and simpler).
>>> as far as I understand, putting the WAL/DB on a faster (than HDD)
>>> device makes more sense in cephfs and rgw environments (more
>>> metadata) and less sense in rbd environments - correct?
>>> br
>>> wolfgang
>>> On 11/08/2017 02:21 PM, Mark Nelson wrote:
>>>> Hi Wolfgang,
>>>> In bluestore the WAL serves sort of a similar purpose to filestore's
>>>> journal, but bluestore isn't dependent on it for guaranteeing
>>>> durability of large writes.  With bluestore you can often get higher
>>>> large-write throughput than with filestore when using HDD-only or
>>>> flash-only OSDs.
>>>> Bluestore also stores allocation, object, and cluster metadata in the
>>>> DB.  That, in combination with the way bluestore stores objects,
>>>> dramatically improves behavior during certain workloads.  A big one
>>>> is creating millions of small objects as quickly as possible.  In
>>>> filestore, PG splitting has a huge impact on performance and tail
>>>> latency.  Bluestore is much better just on HDD, and putting the DB
>>>> and WAL on flash makes it better still since metadata no longer is a
>>>> bottleneck.
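A sketch of putting the DB (and, implicitly, the WAL) on flash with ceph-volume; device paths are placeholders:

```shell
# Sketch: place the RocksDB metadata (block.db) on a flash partition.
# When only --block.db is given, the WAL is stored on the DB device too,
# so one fast partition per OSD is usually sufficient.
# /dev/sdc and /dev/nvme0n1p2 are placeholder device paths.
ceph-volume lvm create --bluestore \
    --data /dev/sdc \
    --block.db /dev/nvme0n1p2
```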
>>>> Bluestore does have a couple of shortcomings vs filestore currently.
>>>> The allocator is not as good as XFS's and can fragment more over time.
>>>> There is no server-side readahead, so small sequential read
>>>> performance is very dependent on client-side readahead.  There are
>>>> still a number of optimizations to various things, ranging from
>>>> threading and locking in the shardedopwq to pglog and dup_ops, that
>>>> could potentially improve performance.
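Since readahead happens only on the client side, one way to compensate for small sequential reads is to raise readahead on the client; a sketch (the device name is a placeholder, and the exact values are deployment-specific):

```shell
# Sketch: bluestore does no server-side readahead, so sequential read
# performance depends on the client.  For a kernel-mapped RBD device,
# raise readahead on the block device itself (value in 512-byte sectors;
# /dev/rbd0 is a placeholder).
blockdev --setra 4096 /dev/rbd0

# librbd clients can instead tune rbd readahead settings in ceph.conf,
# e.g. rbd_readahead_max_bytes (value below is only illustrative).
#   rbd readahead max bytes = 4194304
```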
>>>> I have a blog post that we've been working on that explores some of
>>>> these things but I'm still waiting on review before I publish it.
>>>> Mark
>>>> On 11/08/2017 05:53 AM, Wolfgang Lendl wrote:
>>>>> Hello,
>>>>> it's clear to me that you get a performance gain from putting the
>>>>> journal on a fast device (ssd, nvme) when using the filestore
>>>>> backend.
>>>>> it's less clear when it comes to bluestore - are there any
>>>>> resources, performance tests, etc. out there showing how a fast
>>>>> wal/db device impacts performance?
>>>>> br
>>>>> wolfgang
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users at lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
