[ceph-users] Cache tier operation clarifications

Christian Balzer chibi at gol.com
Fri Mar 4 00:17:14 PST 2016


Unlike what the subject may suggest, I'm mostly going to try to explain how
things work with cache tiers, as far as I understand them.
Something of a reference to point to.
Of course if you spot something that's wrong or have additional
information, by all means please do comment.

While the documentation in master now correctly warns that you HAVE to set
target_max_bytes (the size of your cache pool) for any of the relative
sizing bits to work, let's repeat that here since it wasn't mentioned there
before.
Without that value being set, none of the flushing or eviction will
happen, resulting in blocked IOs when the cache pool gets full.
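
Since the value has to be given in plain bytes, a small helper can do the
conversion. This is only a sketch: the pool name "cache" and the helper
names are made up for illustration, and it assumes binary units
(1TB = 1024^4 bytes). It only builds the "ceph osd pool set" command line,
it does not run it.

```python
# Sketch: convert a human-friendly size into the plain byte count that
# target_max_bytes expects. Assumes binary units (1 GB = 1024**3 bytes).
UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def to_bytes(size: str) -> int:
    """Turn a string such as '1TB' or '512 MB' into a byte count."""
    text = size.strip().upper()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)  # no suffix: treat as plain bytes already


def target_max_bytes_cmd(pool: str, size: str) -> str:
    """Build (not run) the command line; 'pool' is a hypothetical name."""
    return f"ceph osd pool set {pool} target_max_bytes {to_bytes(size)}"


print(target_max_bytes_cmd("cache", "1TB"))
```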

The other thing to remember about target_max_bytes (documented nowhere)
is that this space calculation is done per PG.
So if you have a 1024GB cache pool with, say, 1024 PGs and target_max_bytes
set accordingly (one of the most annoying things about Ceph is having to
specify full bytes in most places instead of human friendly shortcuts like
"1TB"), Ceph (the cache tiering agent to be precise) will think that the
cache is 50% full when just one PG has reached 512MB.
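
The arithmetic behind this can be sketched as follows. It is a simplified
model, not the agent's actual code: it assumes the pool has 1024 PGs (a
number chosen so that the 512MB figure above works out) and that
target_max_bytes is simply split evenly among them.

```python
# Simplified model of per-PG fullness as described above. Assumptions:
# 1024 PGs and an even split of target_max_bytes across them.
GIB = 1024**3
MIB = 1024**2


def pg_fullness(pg_bytes: int, target_max_bytes: int, pg_num: int) -> float:
    """Fraction of its share of target_max_bytes that one PG is using."""
    per_pg_target = target_max_bytes / pg_num
    return pg_bytes / per_pg_target


target = 1024 * GIB   # target_max_bytes for a 1024GB cache pool
pg_num = 1024         # assumed PG count for this example

# A single PG holding 512MB already looks 50% full in this model,
# no matter how empty the other PGs are.
print(pg_fullness(512 * MIB, target, pg_num))
```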

In short, expect things to happen quite a bit before you reach the usage
that you think you specified in cache_target_dirty_ratio and
cache_target_full_ratio.
Annoying, but at least failing safe.

I'm ignoring target_max_objects for this, as it's the same for object
count instead of space.
min_read_recency_for_promote and min_write_recency_for_promote I shall
ignore for now as well, since I have no cluster to test them with.

Either way once Ceph thinks you've reached the cache_target_dirty_ratio
specified, it copies dirty objects to the backing storage. 
If they never existed there before, they will be created (so keep that in
mind if you see an increase in objects).
This (additional object) is similar to tier promotion, when an existing
object is copied from the base pool to the cache pool the first time it's
accessed.

In versions after Hammer there is also cache_target_dirty_high_ratio,
which specifies at which point more aggressive flushing starts.

Note that flushing keeps objects in the cache.
So that object you wrote to some days ago and kept reading frequently
ever since isn't just going away to the slower base pool.

Next is eviction. This is where things became a bit more muddled for me and
I had to do some testing and staring at objects in PGs.
So your cache pool is now hitting the cache_target_full_ratio (or so the
wonky space per PG algorithm thinks).
Remember that all IO will stop once the cache pool gets 100% full, so you
want this to happen at some safe, sane point before this. 
What that point is depends of course on the maximum write speed to your
pool, how fast your cache can flush to the base pool, etc.
Now here is the fun part: clean objects (ones that have not been modified
since they were promoted from the base pool or last flushed) are eligible
for eviction. 
When reading about this the first time I thought this involved more moving
of data from the cache pool to the base pool.
However what happens is that since the object is "clean" (copy exists on
the base pool), it is simply zero'd (after demotion), leaving an empty
rados object in the cache pool and consequently releasing space.

So as far as IO and network traffic are concerned, your enemy is flushing,
not eviction.
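
To make the difference concrete, here is a toy model of flushing versus
eviction as described above. It is in no way how Ceph implements it; the
class and method names are invented, and it only tracks object sizes and
how many bytes were copied to the base pool.

```python
# Toy model of the behaviour described above; not how Ceph is implemented.
from dataclasses import dataclass


@dataclass
class CachedObject:
    name: str
    size: int           # bytes held in the cache pool
    dirty: bool = True  # modified since promotion or last flush


class ToyCacheTier:
    def __init__(self):
        self.cache = {}                # object name -> CachedObject
        self.base = {}                 # object name -> size in base pool
        self.bytes_copied_to_base = 0  # traffic caused by flushing

    def write(self, name, size):
        self.cache[name] = CachedObject(name, size, dirty=True)

    def flush(self, name):
        """Copy a dirty object to the base pool; it stays in the cache."""
        obj = self.cache[name]
        if obj.dirty:
            self.base[name] = obj.size
            self.bytes_copied_to_base += obj.size
            obj.dirty = False

    def evict(self, name):
        """Release the space of a clean object; no data is moved."""
        obj = self.cache[name]
        if obj.dirty:
            return False   # only clean objects are eligible
        obj.size = 0       # an empty object is left behind
        return True

    def cache_bytes(self):
        return sum(o.size for o in self.cache.values())


tier = ToyCacheTier()
tier.write("obj1", 4 * 1024**2)
print(tier.evict("obj1"))     # still dirty, so not evictable
tier.flush("obj1")
print(tier.cache_bytes())     # flushing alone frees nothing in the cache
tier.evict("obj1")
print(tier.cache_bytes(), tier.bytes_copied_to_base)
```

All the bytes moved show up in flush(); evict() only drops what is already
safe on the base pool.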

In clusters that have a clear usage pattern and idle times, a command
to trigger flushes for a specified ratio and with settable IO limits would
be most welcome. (hint-hint)
Lacking this for now, I've been pondering a cron job that sets
cache_target_dirty_ratio from .7 (my current value) to .6 (or more
likely a smaller step, like .65) for a few hours during the night and then
back up again.
This is based on our cache typically not growing more than 2% per day.
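
A rough sketch of what such a cron job could look like follows. The pool
name "cache", the quiet-hours window and the function names are all
assumptions for illustration; the script only builds the "ceph osd pool
set" argument list and leaves running it to the caller.

```python
# Sketch of the cron-job idea: pick a lower dirty ratio during a nightly
# window. Pool name, window hours and function names are assumptions.
DAY_RATIO = 0.7     # the current value mentioned above
NIGHT_RATIO = 0.65  # the smaller step mentioned above
NIGHT_START, NIGHT_END = 1, 5  # hypothetical quiet hours, 01:00 to 04:59


def dirty_ratio_for_hour(hour):
    """Return the cache_target_dirty_ratio to use at the given hour."""
    return NIGHT_RATIO if NIGHT_START <= hour < NIGHT_END else DAY_RATIO


def dirty_ratio_cmd(pool, hour):
    """Build (not run) the argument list for the ceph CLI."""
    ratio = dirty_ratio_for_hour(hour)
    return ["ceph", "osd", "pool", "set", pool,
            "cache_target_dirty_ratio", str(ratio)]


for hour in (0, 2, 12):
    print(hour, " ".join(dirty_ratio_cmd("cache", hour)))
```

Run hourly from cron, the resulting list could be handed to
subprocess.run(cmd, check=True) so that a failing ceph call is noticed.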

Lastly we come to cache_min_flush_age and cache_min_evict_age.
It is my understanding that in Hammer and later a truly full cache pool
will cause these to be ignored to prevent IO deadlocks, correct?

The largest source of cache pollution for us is VM reboots (all those
objects holding the kernel and other things only read at startup, never to
be needed again for months) while on the other hand we have about 10k
truly hot objects that are constantly being read/written. 
Lacking min_write_recency_for_promote for now, I've been thinking of
setting cache_min_evict_age to several hours.
Truly cold objects will be subject to eviction, even lukewarm ones get to
Note that for objects that more or less belong in the cache we're using
less than 15% of its capacity.
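
As a way of stating the intended effect, here is a toy eligibility check.
It is a sketch of my reading above, not verified Ceph behaviour: the
function name is invented, "6 hours" is just an example for "several
hours", and the cache_full shortcut encodes the unconfirmed understanding
that a truly full cache ignores the age limit. The one solid fact used is
that cache_min_evict_age is specified in seconds.

```python
# Toy eligibility check. The cache_full shortcut encodes an unconfirmed
# understanding, not verified Ceph behaviour.
HOUR = 3600  # cache_min_evict_age is given in seconds


def may_evict(age_seconds, min_evict_age, dirty, cache_full=False):
    if dirty:
        return False   # dirty objects must be flushed first
    if cache_full:
        return True    # assumed: age limit ignored when truly full
    return age_seconds >= min_evict_age


min_evict_age = 6 * HOUR  # "several hours"; 6 is a made-up example
print(may_evict(10 * 60, min_evict_age, dirty=False))         # boot-time read
print(may_evict(30 * 24 * HOUR, min_evict_age, dirty=False))  # cold object
print(may_evict(10 * 60, min_evict_age, dirty=False, cache_full=True))
```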

Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Rakuten Communications
