[ceph-users] optimize bluestore for random write i/o

vitalif at yourcmc.ru vitalif at yourcmc.ru
Tue Mar 12 05:31:38 PDT 2019

> Decreasing the min_alloc size isn't always a win, but it can be in some
> cases.  Originally bluestore_min_alloc_size_ssd was set to 4096 but we
> increased it to 16384 because at the time our metadata path was slow
> and increasing it resulted in a pretty significant performance win
> (along with increasing the WAL buffers in rocksdb to reduce write
> amplification).  Since then we've improved the metadata path to the
> point where at least on our test nodes performance is pretty close
> between with min_alloc size = 16k and min_alloc size = 4k the last
> time I looked.  It might be a good idea to drop it down to 4k now but
> I think we need to be careful because there are tradeoffs.

I think it's all about your disks' latency. Deferred write is 1 IO+sync 
and redirect-write is 2 IOs+syncs. So if your IO or sync is slow (like 
it is on HDDs and bad SSDs) then the deferred write is better in terms 
of latency. If your IO is fast then you're only bottlenecked by the OSD 
code itself eating a lot of CPU and then direct write may be better. By 
the way, I think OSD itself is way TOO slow currently (see below).
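
To make the IO counting above concrete, here is a rough sketch of the device-side part of the commit latency. The only inputs taken from this message are the operation counts (1 IO+sync for a deferred write, 2 IOs+syncs otherwise). The device timings are hypothetical placeholders, not measurements, and the function names are made up for illustration.

```python
# Toy model: device-side commit latency only (no OSD CPU time).
# The operation counts come from the text; every timing is hypothetical.

DEFERRED_OPS = 1  # deferred write: 1 IO + sync
DIRECT_OPS = 2    # non-deferred write: 2 IOs + syncs


def device_commit_latency_ms(io_ms, sync_ms, n_ops):
    """Latency of n_ops serialized (write + sync) pairs, in milliseconds."""
    return n_ops * (io_ms + sync_ms)


def extra_cost_of_direct_ms(io_ms, sync_ms):
    """How much longer the 2-op path takes than the 1-op path on one device."""
    return (device_commit_latency_ms(io_ms, sync_ms, DIRECT_OPS)
            - device_commit_latency_ms(io_ms, sync_ms, DEFERRED_OPS))


# Hypothetical (write latency, sync latency) pairs in ms.
devices = {
    "slow (HDD-like)": (4.0, 4.0),
    "fast (NVMe-like)": (0.01, 0.01),
}

for name, (io_ms, sync_ms) in devices.items():
    extra = extra_cost_of_direct_ms(io_ms, sync_ms)
    print(f"{name}: direct write costs {extra:.3f} ms more than deferred")
```

With timings like these the extra IO+sync dominates on the slow device, while on the fast device it is small next to the roughly 0.333ms per-operation OSD overhead estimated from the memstore test later in this message.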

The idea I was talking about turned out to be only viable for HDD/slow 
SSDs and only for low iodepths. But the gain is huge - something between 
+50% iops to +100% iops (2x less latency). There is a stupid problem in 
current bluestore implementation which makes it do 2 journal writes and 
FSYNCs instead of one for every incoming transaction. The details are 
here: https://tracker.ceph.com/issues/38559

The unnecessary commit is the BlueFS WAL. All it does is record the 
increased size of a RocksDB WAL file, which obviously shouldn't be 
required with RocksDB because its default setting is 
"kTolerateCorruptedTailRecords". However, without this setting the WAL 
is not synced to the disk with every write because by some clever logic 
sync_file_range is called only with SYNC_FILE_RANGE_WRITE in the 
corresponding piece of code. Thus the OSD's database gets corrupted when 
you kill it with -9 and thus it's impossible to set 
`bluefs_preextend_wal_files` to true. And thus you get two writes and 
commits instead of one.

I don't know the exact idea behind doing only SYNC_FILE_RANGE_WRITE - as 
I understand there is currently no benefit in doing this. It could be a 
benefit if RocksDB was writing journal in small parts and then doing a 
single sync - but it's always flushing the newly written part of a 
journal to disk as a whole.

The simplest way to fix it is just to add SYNC_FILE_RANGE_WAIT_BEFORE 
and SYNC_FILE_RANGE_WAIT_AFTER to sync_file_range in KernelDevice.cc. My 
pull request is here: https://github.com/ceph/ceph/pull/26909 - I've 
tested this change with 13.2.4 Mimic and 14.1.0 Nautilus and yes, it 
does increase single-thread iops on HDDs two times (!). After this 
change BlueStore becomes actually better than FileStore at least on 

Another way of fixing it would be to add an explicit bdev->flush at the 
end of the kv_sync_thread, after db->submit_transaction_sync(), and 
possibly remove the redundant sync_file_range altogether. But then you must 
do the same in another place in _txc_state_proc, because it also 
sometimes calls submit_transaction_sync(). In the end I personally think 
that adding the flags to sync_file_range is better, because a function named 
"submit_transaction_sync" should in fact be SYNC! It shouldn't require 
additional steps from the caller to make the data durable.
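
The difference between the two flag combinations discussed above can be sketched outside of Ceph. This is not the code from KernelDevice.cc or from the pull request; it is a minimal Python illustration that calls the Linux sync_file_range(2) syscall through ctypes. The helper name `sync_range` and the constant names `WRITE_ONLY` and `WRITE_AND_WAIT` are made up for this example, and the sketch assumes 64-bit Linux with glibc.

```python
import ctypes
import ctypes.util
import os

# Flag values as defined in the Linux UAPI header <linux/fs.h>.
SYNC_FILE_RANGE_WAIT_BEFORE = 1
SYNC_FILE_RANGE_WRITE = 2
SYNC_FILE_RANGE_WAIT_AFTER = 4

# What the text says is passed today: start writeback, do not wait for it.
WRITE_ONLY = SYNC_FILE_RANGE_WRITE

# What the proposed fix passes: also wait until the range is written out.
WRITE_AND_WAIT = (SYNC_FILE_RANGE_WAIT_BEFORE
                  | SYNC_FILE_RANGE_WRITE
                  | SYNC_FILE_RANGE_WAIT_AFTER)


def sync_range(fd, offset, nbytes, flags):
    """Call sync_file_range(2) on fd; raise OSError if the syscall fails."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    libc.sync_file_range.argtypes = [
        ctypes.c_int,    # fd
        ctypes.c_int64,  # offset (off64_t)
        ctypes.c_int64,  # nbytes (off64_t)
        ctypes.c_uint,   # flags
    ]
    libc.sync_file_range.restype = ctypes.c_int
    if libc.sync_file_range(fd, offset, nbytes, flags) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

A caller would write to a file descriptor and then call, for example, `sync_range(fd, 0, 4096, WRITE_AND_WAIT)`. With `WRITE_ONLY` the call returns as soon as writeback has been started, which matches the behaviour described above.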

Also I have a small funny test result to share.

I've created one OSD on my laptop on a loop device in a tmpfs (i.e. 
RAM), created 1 RBD image inside it and tested it with `fio 
-ioengine=rbd -direct=1 -bs=4k -rw=randwrite`. Before doing the test 
I've turned off CPU power saving with `cpupower idle-set -D 0`.

The results are:
- filestore: 2200 iops with -iodepth=1 (0.454ms average latency). 8500 
iops with -iodepth=128.
- bluestore: 1800 iops with -iodepth=1 (0.555ms average latency). 9000 
iops with -iodepth=128.
- memstore: 3000 iops with -iodepth=1 (0.333ms average latency). 11000 
iops with -iodepth=128.

If we can think of memstore being a "minimal possible /dev/null" then:
- OSD overhead is 1/3000 = 0.333ms (maybe slightly less, but that doesn't 
- filestore overhead is 1/2200-1/3000 = 0.121ms
- bluestore overhead is 1/1800-1/3000 = 0.222ms
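
The subtraction above can be reproduced with a few lines of Python. The iops figures are the -iodepth=1 results from this message; the variable and function names are only illustrative, and memstore is treated as the baseline exactly as in the list above.

```python
# iops at -iodepth=1, copied from the results in this message.
iops_qd1 = {"filestore": 2200, "bluestore": 1800, "memstore": 3000}


def latency_ms(iops):
    """Average per-operation latency in ms at queue depth 1."""
    return 1000.0 / iops


# memstore is taken as the "minimal possible" store, so its latency
# is attributed to the OSD itself.
osd_overhead_ms = latency_ms(iops_qd1["memstore"])

# Store overhead = store latency minus the memstore baseline.
store_overhead_ms = {
    name: latency_ms(iops) - osd_overhead_ms
    for name, iops in iops_qd1.items()
    if name != "memstore"
}

ratio = store_overhead_ms["bluestore"] / store_overhead_ms["filestore"]

print(f"OSD overhead: {osd_overhead_ms:.3f} ms")
for name, ms in store_overhead_ms.items():
    print(f"{name} overhead: {ms:.3f} ms")
print(f"bluestore/filestore overhead ratio: {ratio:.2f}")
```

The ratio of the two store overheads comes out at about 1.8.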

The conclusion is that bluestore's own overhead is actually almost TWO 
TIMES higher than filestore's in terms of pure latency, and the throughput 
is only slightly better. How could it happen? How could a newly written 
store become two times slower than the old one? That's pretty annoying...

Could it be because bluestore is doing a lot of threading? I mean could 
it be because each write operation goes through 5 threads during its 
execution? (tp_osd_tp -> aio -> kv_sync_thread -> kv_finalize_thread -> 
finisher)? Maybe just remove aio and kv threads and process all 
operations directly in tp_osd_tp then?
