[ceph-users] Slow rbd reads (fast writes) with luminous + bluestore

Mark Nelson mnelson at redhat.com
Wed Nov 28 06:52:03 PST 2018


On 11/28/18 8:36 AM, Florian Haas wrote:
> On 14/08/2018 15:57, Emmanuel Lacour wrote:
>> On 13/08/2018 at 16:58, Jason Dillaman wrote:
>>> See [1] for ways to tweak the bluestore cache sizes. I believe that by
>>> default, bluestore will not cache any data but instead will only
>>> attempt to cache its key/value store and metadata.
>> I suppose so too, since the default ratio is to cache as much k/v data
>> as possible, up to 512M, and the hdd cache is 1G by default.
>>
>> I tried increasing the hdd cache to 4G and it seems to be used; the 4
>> osd processes use 20GB now.
>>
>>> In general, however, I would think that attempting to have bluestore
>>> cache data is just an attempt to optimize to the test instead of
>>> actual workloads. Personally, I think it would be more worthwhile to
>>> just run 'fio --ioengine=rbd' directly against a pre-initialized image
>>> after you have dropped the cache on the OSD nodes.
>> So with bluestore, I assume we need to rely more on the client page
>> cache (at least when using a VM), whereas with the old filestore both
>> the osd and client caches were used.
>>   
>> As for benchmarking, I ran a real benchmark here with the expected app
>> workload of this new cluster, and the results are OK for us :)
>>
>>
>> Thanks for your help Jason.
> Shifting over a discussion from IRC and taking the liberty to resurrect
> an old thread, as I just ran into the same (?) issue. I see
> *significantly* reduced performance on RBD reads, compared to writes
> with the same parameters. "rbd bench --io-type read" gives me 8K IOPS
> (with the default 4K I/O size), whereas "rbd bench --io-type write"
> produces more than twice that.
>
> I should probably add that while my end result of doing an "rbd bench
> --io-type read" is about half of what I get from a write benchmark, the
> intermediate ops/sec output fluctuates from > 30K IOPS (about twice the
> write IOPS) to about 3K IOPS (about 1/6 of what I get for writes). So
> really, my read IOPS are all over the map (and terrible on average),
> whereas my write IOPS are not stellar, but consistent.
>
> This is an all-bluestore cluster on spinning disks with Luminous, and
> I've tried the following things:
>
> - run rbd bench with --rbd_readahead_disable_after_bytes=0 and
> --rbd_readahead_max_bytes=4194304 (per
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008271.html)
>
> - configure OSDs with a larger bluestore_cache_size_hdd (4G; default is 1G)
>
> - configure OSDs with bluestore_cache_kv_ratio = .49, so that rather
> than using 1%/99%/0% for metadata/KV data/objects, the OSDs use 1%/49%/50%
>
> None of the above produced any tangible improvement. Benchmark results
> are at http://paste.openstack.org/show/736314/ if anyone wants to take a
> look.
>
> I'd be curious to see if anyone has a suggestion on what else to try.
> Thanks in advance!


Hi Florian,


By default bluestore will cache buffers on reads but not on writes 
(unless there are hints):


Option("bluestore_default_buffered_read", Option::TYPE_BOOL, Option::LEVEL_ADVANCED)
    .set_default(true)
    .set_flag(Option::FLAG_RUNTIME)
    .set_description("Cache read results by default (unless hinted NOCACHE or WONTNEED)"),

Option("bluestore_default_buffered_write", Option::TYPE_BOOL, Option::LEVEL_ADVANCED)
    .set_default(false)
    .set_flag(Option::FLAG_RUNTIME)
    .set_description("Cache writes by default (unless hinted NOCACHE or WONTNEED)"),
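
(To double-check what a running OSD actually has set, querying its admin 
socket should work; this assumes osd.0 is local to the node you run it on:

    ceph daemon osd.0 config get bluestore_default_buffered_read
    ceph daemon osd.0 config get bluestore_default_buffered_write
)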


This is one area where bluestore is a lot more confusing for users than 
filestore was.  There was a lot of concern about enabling buffer cache 
on writes by default because there's some associated overhead 
(potentially both during writes and in the mempool thread when trimming 
the cache).  It might be worth enabling bluestore_default_buffered_write 
and seeing if it helps reads.  You'll probably also want to pay attention 
to writes though.  I think we might want to consider enabling it by 
default but we should go through and do a lot of careful testing first. 
FWIW I did have it enabled when testing the new memory target code (and 
the not-yet-merged age-binned autotuning).  It was doing OK in my tests, 
but I didn't do an apples-to-apples comparison with it off.
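
If you want to experiment with it, a minimal sketch (option name as shown 
above; adjust for your setup) would be to set it in ceph.conf and restart 
the OSDs:

    # ceph.conf on the OSD nodes, then restart the OSDs
    [osd]
    bluestore_default_buffered_write = true

or, since the option is flagged FLAG_RUNTIME, inject it into running OSDs:

    ceph tell osd.* injectargs '--bluestore_default_buffered_write=true'

Either way, keep an eye on write performance while it's on.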


Mark


>
> Cheers,
> Florian

