[ceph-users] slow_used_bytes - SlowDB being used despite lots of space free in BlockDB on SSD?

Nick Fisk nick at fisk.me.uk
Sat Oct 20 13:30:03 PDT 2018

> >> On 10/18/2018 7:49 PM, Nick Fisk wrote:
> >>> Hi,
> >>>
> >>> Ceph Version = 12.2.8
> >>> 8TB spinner with 20G SSD partition
> >>>
> >>> Perf dump shows the following:
> >>>
> >>> "bluefs": {
> >>>           "gift_bytes": 0,
> >>>           "reclaim_bytes": 0,
> >>>           "db_total_bytes": 21472731136,
> >>>           "db_used_bytes": 3467640832,
> >>>           "wal_total_bytes": 0,
> >>>           "wal_used_bytes": 0,
> >>>           "slow_total_bytes": 320063143936,
> >>>           "slow_used_bytes": 4546625536,
> >>>           "num_files": 124,
> >>>           "log_bytes": 11833344,
> >>>           "log_compactions": 4,
> >>>           "logged_bytes": 316227584,
> >>>           "files_written_wal": 2,
> >>>           "files_written_sst": 4375,
> >>>           "bytes_written_wal": 204427489105,
> >>>           "bytes_written_sst": 248223463173
> >>>
> >>> Am I reading that correctly, about 3.4GB used out of 20GB on the SSD, yet 4.5GB of DB is stored on the spinning disk?
> >> Correct. Most probably the reason is the layered scheme RocksDB
> >> uses to store its SST files. Each level has a maximum size
> >> (determined by the level number, a base value and a
> >> multiplier - see max_bytes_for_level_base &
> >> max_bytes_for_level_multiplier at
> >> https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide)
> >> If the next level (at its maximum size) doesn't fit into the space available on the DB volume, it is spilled over to the slow device in its entirety.
> >> IIRC level_base is about 250MB and the multiplier is 10, so the third level needs 25GB and hence doesn't fit into your DB volume.
> >>
> >> In fact a DB volume of 20GB is VERY small for an 8TB OSD - just 0.25% of the slow one. AFAIR the current recommendation is about 4%.
> >>
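
The ratio quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are made up for this example, decimal units (1TB = 10^12 bytes) are assumed, and the 4% figure is the recommendation as recalled in the thread rather than a verified constant.

```python
# Illustrative only: helper names are hypothetical, units are decimal.
GB = 10**9
TB = 10**12

def db_ratio_pct(db_bytes, slow_bytes):
    """DB volume size as a percentage of the slow (data) device."""
    return 100.0 * db_bytes / slow_bytes

def db_bytes_for_ratio(slow_bytes, pct=4.0):
    """DB volume size needed to reach a given percentage of the slow device.

    The 4% default is the figure recalled in the thread (an assumption here).
    """
    return slow_bytes * pct / 100.0

ratio = db_ratio_pct(20 * GB, 8 * TB)
target = db_bytes_for_ratio(8 * TB)
print(f"20GB on 8TB is {ratio:.2f}% of the slow device")
print(f"4% of 8TB would be {target / GB:.0f}GB")
```

With the numbers from this thread, the 20GB partition works out to 0.25% of the 8TB disk, while a 4% DB volume would be 320GB.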
> > Thanks Igor, these nodes were designed back in the filestore days when small 10DWPD SSDs were all the rage. I might be able to
> > shrink the OS/swap partition and get each DB partition up to 25/26GB; they are not going to get any bigger than that, as that’s the
> > NVMe completely filled. But I'm then going to have to effectively wipe all the disks I've done so far and re-backfill. ☹ Are there any
> > tunables to change this behaviour post OSD deployment to move data back onto the SSD?
> None that I'm aware of.
> However, I've just completed development of an offline BlueFS volume migration feature within ceph-bluestore-tool. It allows DB/WAL
> volume allocation and resizing as well as moving BlueFS data between volumes (with some limitations unrelated to your case). Hence
> one doesn't need slow backfilling to adjust the BlueFS volume configuration.
> Here is the PR (Nautilus only for now):
> https://github.com/ceph/ceph/pull/23103

That sounds awesome. I might leave the current OSDs as they are and look to "fix" them when Nautilus comes out.

> >
> > On a related note, does frequently accessed data move onto the SSD, or is the overspill a one-way ticket? I would assume writes
> > would cause data in RocksDB to be written back into L0 and work its way down, but I'm not sure about reads?
> AFAIK reads don't trigger any data layout changes.


> >
> > So I think the lesson from this is that, whatever DB usage you think you may end up with, you should always make sure your SSD
> > partition is bigger than 26GB (L0+L1)?
> In fact that's
> L0+L1 (2x250MB), L2 (2500MB), L3 (25000MB), which is about 28GB.
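
The level arithmetic above can be sketched in Python. This is a simplified model, not RocksDB's or BlueFS's actual placement logic: it assumes the round figures from this thread (a 250MB base, a multiplier of 10, and L0 counted at the same size as L1) and the rule described earlier that a level either fits on the DB volume at its maximum size or spills entirely. The function names are hypothetical.

```python
# Simplified model of the level sizing discussed in the thread.
# Sizes are in decimal MB; all names here are illustrative.

def level_sizes(level_base_mb=250, multiplier=10, max_level=4):
    """Return {level: maximum size in MB}. L0 is assumed to match L1."""
    sizes = {0: level_base_mb}
    for level in range(1, max_level + 1):
        sizes[level] = level_base_mb * multiplier ** (level - 1)
    return sizes

def levels_that_fit(db_mb, sizes):
    """Return (levels that fit, MB they occupy at maximum size).

    A level is counted only if its full maximum size still fits;
    the first level that does not fit, and all above it, are assumed to spill.
    """
    used = 0
    fitting = []
    for level in sorted(sizes):
        if used + sizes[level] > db_mb:
            break
        used += sizes[level]
        fitting.append(level)
    return fitting, used

sizes = level_sizes()
for db_gb in (20, 29, 40):
    fitting, used = levels_that_fit(db_gb * 1000, sizes)
    print(f"{db_gb}GB DB volume: levels {fitting} fit, {used / 1000:.1f}GB at maximum size")
```

Under these assumptions a 20GB volume holds only L0 to L2 (3GB), while anything from 28GB upwards holds L0 to L3. Real thresholds may be higher, since this model ignores compaction overhead, the WAL and the BlueFS log.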

Well, I upgraded a new node and, after shrinking the OS partition, I managed to assign 29GB to each DB partition. It has just finished backfilling and, disappointingly, it looks like the DB has spilled over onto the disks ☹ So the magic minimum number is going to be somewhere between 30GB and 40GB. I might be able to squeeze 30G partitions out if I go for a tiny OS disk and no swap. I will try that on the next one, hoping that 30G does it.

> One more observation from my side - RocksDB might additionally use up to 100% of the level maximum size during compaction -
> hence it might make sense to have up to 25GB of additional spare space. Surely this spare space wouldn't be fully used most of the
> time. And actually I don't have any instructions or a clear knowledge base for this aspect. Just a warning.
> To track such an excess I used additional perf counters, see commit
> 2763c4de41ea55a97ed7400f54a2b2d841894bf5 in
> https://github.com/ceph/ceph/pull/23208
> Perhaps it makes sense to have a separate PR for this stuff and even backport it...
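
To put a number on that warning, here is a small sketch that adds the suggested compaction headroom (up to 100% of the largest level's maximum size) to the steady-state total. It reuses the thread's round figures (250MB base, multiplier 10, L0 counted like L1) as assumptions, and the function name is made up for this illustration.

```python
# Rough estimate only: steady-state level sizes plus worst-case
# compaction headroom equal to the largest level's maximum size.

def db_size_with_headroom(level_base_mb=250, multiplier=10, top_level=3):
    """Return (steady_mb, headroom_mb, total_mb) for levels L0..top_level."""
    # L0 is assumed to be the same size as L1, as in the sum quoted above.
    levels = [level_base_mb]
    levels += [level_base_mb * multiplier ** (n - 1) for n in range(1, top_level + 1)]
    steady = sum(levels)
    headroom = max(levels)  # up to 100% of the largest level during compaction
    return steady, headroom, steady + headroom

steady, headroom, total = db_size_with_headroom()
print(f"steady state: {steady / 1000:.0f}GB, headroom: {headroom / 1000:.0f}GB, "
      f"total: {total / 1000:.0f}GB")
```

With these inputs, the 28GB for L0 to L3 plus 25GB of spare space comes to about 53GB, which would be a worst-case figure rather than typical usage.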

I think I'm starting to capture some of that data as I'm graphing all the "perf dump" values into graphite. The nodes with the 40GB DB partitions with all data on SSD currently have about 10GiB in the DB. During compactions the highest it has peaked over the last few days is up to 14GiB. In the nodes with the 20GB partitions, the SSD.DB sits at about 2.5GiB and peaks to just under 5GiB, the slow sits at 4.3GiB and peaks to about 6GiB.
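
For anyone wanting to track the same thing, below is a minimal sketch of how the "bluefs" counters could be summarised before being sent to a graphing system. It assumes the JSON has already been obtained (for example from the OSD admin socket with something like `ceph daemon osd.<id> perf dump`; that exact invocation is an assumption here) and uses only counter names that appear in the dump at the top of this thread. The function name and the sample wiring are illustrative.

```python
import json

GIB = 2**30

def bluefs_summary(perf_dump):
    """Summarise DB and slow-device usage from a parsed perf dump dict."""
    b = perf_dump["bluefs"]
    return {
        "db_used_gib": b["db_used_bytes"] / GIB,
        "db_used_pct": 100.0 * b["db_used_bytes"] / b["db_total_bytes"],
        "slow_used_gib": b["slow_used_bytes"] / GIB,
        # Any BlueFS data on the slow device means the DB has spilled over.
        "spilled": b["slow_used_bytes"] > 0,
    }

# Sample input using the values posted at the start of the thread.
sample = json.loads("""
{"bluefs": {"db_total_bytes": 21472731136,
            "db_used_bytes": 3467640832,
            "slow_total_bytes": 320063143936,
            "slow_used_bytes": 4546625536}}
""")

summary = bluefs_summary(sample)
for key, value in summary.items():
    print(key, round(value, 2) if isinstance(value, float) else value)
```

Sampling this periodically and keeping the maximum per interval would capture compaction peaks like the ones described above.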

> >
> >>> Am I also understanding correctly that BlueFS has reserved 300G of space on the spinning disk?
> >> Right.
> >>> Found a previous bug tracker for something which looks exactly the same case, but should be fixed now:
> >>> https://tracker.ceph.com/issues/22264
> >>>
> >>> Thanks,
> >>> Nick
> >>>
> >>> _______________________________________________
> >>> ceph-users mailing list
> >>> ceph-users at lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
