[ceph-users] SSD sizing for Bluestore
Brendan Moloney moloney at ohsu.edu
Tue Nov 13 15:32:30 PST 2018
Thank you for that information. So I would have to reduce "write_buffer_size" in order to shrink the L0 size, in addition to reducing "max_bytes_for_level_base" so that the L1 size matches.
Does anyone on the list have experience making these kinds of modifications? Or, better yet, some benchmarks?
I found a mailing list reference saying RBD workloads need about 24KB of RocksDB metadata per onode, and the average object size is ~2.8MB. Even taking the advertised best-case throughput for an HDD, we only get ~70 objects per second, which would generate about 1.6MB/s of writes to RocksDB. If the write_buffer_size were set to 75MB (25% of the default), it would take ~45 seconds to fill. With a more realistic number for sustained HDD write throughput, it would take well over a minute. That sounds like a rather large buffer to me...
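For anyone who wants to check the arithmetic, here it is as a small Python sketch. The inputs (24KB per onode, 2.8MB objects, and ~200MB/s best-case HDD throughput, chosen to match the ~70 objects/s above) are the assumptions from this thread, not measurements:

    # Assumed inputs from the figures above -- not measurements.
    onode_kb = 24          # KB of RocksDB metadata per onode/object
    obj_mb = 2.8           # average RBD object size (MB)
    hdd_mbps = 200         # optimistic sequential HDD throughput (MB/s)
    buffer_mb = 75         # proposed write_buffer_size (25% of default)

    objects_per_sec = hdd_mbps / obj_mb                # ~71 obj/s
    rocksdb_mbps = objects_per_sec * onode_kb / 1024   # ~1.7 MB/s
    fill_secs = buffer_mb / rocksdb_mbps               # ~45 s

    print("%.0f obj/s -> %.1f MB/s to RocksDB; %d MB buffer fills in ~%.0f s"
          % (objects_per_sec, rocksdb_mbps, buffer_mb, fill_secs))

At a realistic sustained HDD rate of around half the optimistic figure, the fill time roughly doubles, which is where the "well over a minute" above comes from.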
From: Igor Fedotov [ifedotov at suse.de]
Sent: Tuesday, November 13, 2018 3:44 AM
To: Brendan Moloney; ceph-users at lists.ceph.com
Subject: Re: [ceph-users] SSD sizing for Bluestore
In fact, you can alter RocksDB settings via the bluestore_rocksdb_options config parameter, and hence change "max_bytes_for_level_base" and others.
Not sure about dynamic level sizing though.
Current defaults are:
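(release-dependent; one commonly cited Luminous-era string looks like the sketch below, wrapped here for readability, so verify against your own build with "ceph daemon osd.0 config get bluestore_rocksdb_options" before relying on it):

    compression=kNoCompression,max_write_buffer_number=4,
    min_write_buffer_number_to_merge=1,recycle_log_file_num=4,
    write_buffer_size=268435456,writable_file_max_buffer_size=0,
    compaction_readahead_size=2097152

Note that overriding bluestore_rocksdb_options replaces this whole string rather than merging into it, so carry over any field you do not mean to change. A hedged ceph.conf sketch using the 75MB figures from this thread (78643200 bytes = 75MiB; illustrative values, not a recommendation):

    [osd]
    # Replaces the entire default string; only the two size fields
    # differ from the (assumed) defaults quoted above.
    bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=78643200,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_bytes_for_level_base=78643200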
On 11/13/2018 5:19 AM, Brendan Moloney wrote:
I have been reading up on this a bit, and found one particularly useful mailing list thread.
The fact that there is such a large jump when your DB fits into 3 levels (30GB) vs 4 levels (300GB) makes it hard to choose SSDs of an appropriate size. My workload is all RBD, so objects should be large, but I am also looking at purchasing rather large HDDs (12TB). It seems wasteful to spec out 300GB per OSD, but I am worried that I will barely cross the 30GB threshold when the disks get close to full.
It would be nice if we could either enable "dynamic level sizing" (done here for monitors, but not bluestore?), or allow changing "max_bytes_for_level_base" to something that better suits our use case. For example, if it were set to 25% of the default (75MB L0 and L1, 750MB L2, 7.5GB L3, 75GB L4), then I could allocate ~85GB per OSD and feel confident there wouldn't be any spill over onto the slow HDDs. I am far from an expert on RocksDB, so I might be overlooking something important here.
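To make the sizing math concrete, here is a quick Python sketch of cumulative level capacity. It assumes the usual 10x level multiplier and treats L0 as roughly the same size as L1 (a simplification, since actual L0 size depends on the write buffers), but it reproduces both the 30GB/300GB jump and the ~85GB figure above:

    def levels_mb(base_mb, top_level, multiplier=10):
        # L0 ~ base, then L1..Ln each grow by the multiplier.
        return [base_mb] + [base_mb * multiplier ** i
                            for i in range(top_level)]

    for base_mb, top in ((300, 3), (300, 4), (75, 4)):
        sizes = levels_mb(base_mb, top)
        print("base=%3d MB, L0-L%d: %s MB, total ~%d GB"
              % (base_mb, top, sizes, sum(sizes) / 1024))

With the default-ish 300MB base, the total jumps from ~33GB at three levels to ~326GB at four, which is exactly the awkward gap for SSD sizing; the 75MB base lands at ~81GB across five levels, matching the ~85GB per-OSD estimate.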