[ceph-users] ceph osd commit latency increase over time, until restart

Igor Fedotov ifedotov at suse.de
Fri Mar 1 02:26:46 PST 2019

Resending, as I'm not sure the previous email reached the mailing list...

Hi Chen,

Thanks for the update. I will prepare a patch that periodically resets
StupidAllocator today.

And just to let you know, below is an e-mail from Adam Kupczyk (Red
Hat) which might explain the issue with the allocator.

Also please note that StupidAllocator might not perform full
defragmentation at run time. That's why we observed (as mentioned
somewhere in the thread) fragmentation growing while an OSD is running
and dropping after a restart. Such a restart rebuilds the internal tree
and eliminates the defragmentation flaws. Maybe that's the case here.
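
To illustrate why a rebuild can "defragment" the in-memory tree, here is
a toy model (made-up code, not the actual allocator): if run-time frees
are not always merged with their neighbors, the tree accumulates adjacent
fragments that a restart-time rebuild, inserting extents in order and
coalescing them, would not have:

#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>

using Extents = std::map<uint64_t, uint64_t>;  // offset -> length

// Rebuild-style insert: merge with the preceding extent when the two
// are contiguous (real code would also merge with the following one).
void insert_coalesced(Extents& tree, uint64_t off, uint64_t len) {
  auto it = tree.lower_bound(off);
  if (it != tree.begin()) {
    auto prev = std::prev(it);
    if (prev->first + prev->second == off) {   // contiguous with prev
      prev->second += len;
      return;
    }
  }
  tree[off] = len;
}

int main() {
  Extents runtime_tree, rebuilt_tree;
  // 1000 adjacent 4 KiB frees left unmerged at run time...
  for (uint64_t i = 0; i < 1000; ++i)
    runtime_tree[i * 4096] = 4096;             // stays 1000 fragments
  // ...versus a restart-style rebuild that coalesces as it inserts.
  for (uint64_t i = 0; i < 1000; ++i)
    insert_coalesced(rebuilt_tree, i * 4096, 4096);
  std::cout << runtime_tree.size() << " extents vs "
            << rebuilt_tree.size() << " extent\n";  // 1000 vs 1
}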



-------- Forwarded Message --------
Subject:     High CPU in StupidAllocator
Date:     Tue, 12 Feb 2019 10:24:37 +0100
From:     Adam Kupczyk <akupczyk at redhat.com>
To:     IGOR FEDOTOV <ifed75 at gmail.com>

Hi Igor,

I have observed that StupidAllocator can burn a lot of CPU during
allocation. This comes from loops like:

while (p != free[bin].end()) {
  if (_aligned_len(p, alloc_unit) >= want_size) {
    goto found;
  }
  ++p;
}

It happens when want_size is close to the upper limit of a bin's size
range. For example, free[5] contains sizes 8192..16383. When requesting
a size like 16000, it is quite likely that many chunks must be checked
before a large enough one is found.
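
A simplified sketch of that effect (toy code with assumed power-of-two
binning, not the real allocator; it bins raw byte lengths, so the
8192..16383 range lands in bin 13 here rather than free[5]):

#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

// Assumed binning: bin b holds chunks with length in [2^b, 2^(b+1)).
// Chunks inside a bin are ordered by offset, not length, so a lookup
// must scan linearly.
static int choose_bin(uint64_t len) {
  int b = 0;
  while ((uint64_t(1) << (b + 1)) <= len) ++b;
  return b;
}

int main() {
  // free_bins[b]: offset -> length, like the allocator's per-bin tree.
  std::vector<std::map<uint64_t, uint64_t>> free_bins(32);

  // Fill bin 13 (8192..16383) with 10000 chunks, all of them only
  // slightly above 8192 bytes -- a typical fragmented state.
  for (uint64_t i = 0; i < 10000; ++i)
    free_bins[13].emplace(i * 32768, 8192 + (i % 8) * 512);

  // A request for 16000 bytes maps to bin 13, but no chunk there is
  // big enough, so the loop checks every single one.
  uint64_t want_size = 16000, checked = 0;
  for (const auto& [off, len] : free_bins[choose_bin(want_size)]) {
    ++checked;
    if (len >= want_size) break;   // the "goto found" case above
  }
  std::cout << "chunks checked: " << checked << "\n";  // 10000
}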

I have made an attempt to improve this by increasing the number of
buckets. It is done in aclamk/wip-bs-stupid-allocator-2.
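
For intuition (again a toy model, not the actual scheme in that branch):
doubling the number of buckets halves each bin's size range, so a
near-limit request no longer shares a bin with chunks far too small
for it:

#include <cstdint>

// Coarse: bin b covers [2^b, 2^(b+1)) -- 8200 and 16000 share a bin.
static int coarse_bin(uint64_t len) {
  int b = 0;
  while ((uint64_t(1) << (b + 1)) <= len) ++b;
  return b;
}

// Finer: two buckets per power of two, splitting each range in half;
// 8192..12287 and 12288..16383 become different bins, so a 16000-byte
// request never scans the ~8 KiB chunks at all.
static int fine_bin(uint64_t len) {
  int b = coarse_bin(len);
  uint64_t half = (uint64_t(3) << b) >> 1;   // 1.5 * 2^b
  return 2 * b + (len >= half ? 1 : 0);
}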

Best regards,

Adam Kupczyk

On 3/1/2019 11:46 AM, Xiaoxi Chen wrote:
> Igor,
>    I can test the patch if we have a package.
>    My environment and workload can consistently reproduce the latency
> increase 2-3 days after restarting.
>    Sage told me to try the bitmap allocator to make sure the stupid
> allocator is the bad guy. I have some OSDs on Luminous + bitmap and
> some OSDs on 14.1.0 + bitmap. Both look positive so far, but I need
> more time to be sure.
>    The perf, log and admin socket analysis leads to the theory that
> in alloc_int the loop sometimes takes a long time with the allocator
> lock held, which blocks the release part called from _txc_finish in
> kv_finalize_thread. This thread is also the one that calculates
> state_kv_committing_lat and the overall commit_lat. You can see from
> the admin socket that state_done_latency shows a similar trend to
> commit_latency. (A sketch of this lock-contention theory follows at
> the end of this message.)
>    But we cannot find a theory to explain why a reboot helps: the
> allocator btree is rebuilt from the freelist manager and it should be
> exactly the same as it was prior to the reboot. Could it be anything
> related to PG recovery?
>    Anyway, as I have a live environment and workload, I am more than
> willing to work with you on further investigation.
> -Xiaoxi
> On Fri, Mar 1, 2019 at 6:21 AM, Igor Fedotov <ifedotov at suse.de> wrote:
>     Also I think it makes sense to create a ticket at this point. Any
>     volunteers?
>     On 3/1/2019 1:00 AM, Igor Fedotov wrote:
>     > Wondering if somebody would be able to apply a simple patch that
>     > periodically resets StupidAllocator?
>     >
>     > Just to verify/disprove the hypothesis that it's allocator-related.
>     >
>     > On 2/28/2019 11:57 PM, Stefan Kooman wrote:
>     >> Quoting Wido den Hollander (wido at 42on.com):
>     >>> Just wanted to chime in, I've seen this with
>     >>> Luminous+BlueStore+NVMe OSDs as well. Over time their latency
>     >>> increased until we started to notice I/O-wait inside VMs.
>     >> On a Luminous 12.2.8 cluster with only SSDs I guess we also hit
>     >> this issue. After restarting the OSD servers the latency would
>     >> drop to normal values again. See
>     >> https://owncloud.kooman.org/s/BpkUc7YM79vhcDj
>     >>
>     >> Reboots were finished at ~ 19:00.
>     >>
>     >> Gr. Stefan
>     >>
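
For reference, below is a minimal sketch of Xiaoxi's lock-contention
theory. This is a toy model, not BlueStore's actual code: the names
(ToyAllocator and friends) are made up, and it only illustrates how a
long allocation scan under the allocator lock can stall the release
path and thereby inflate commit latency.

#include <cstdint>
#include <map>
#include <mutex>

// Toy model only -- hypothetical names, not the real StupidAllocator.
struct ToyAllocator {
  std::mutex lock;                        // guards the free tree
  std::map<uint64_t, uint64_t> free_map;  // offset -> length

  // Allocation scans the tree while holding the lock. On a badly
  // fragmented tree the linear walk can be long, and the lock is
  // held for the whole scan.
  int64_t allocate(uint64_t want_size) {
    std::scoped_lock l(lock);
    for (const auto& [off, len] : free_map)
      if (len >= want_size)
        return off;                       // long walk under the lock
    return -1;                            // nothing big enough
  }

  // Release is called from the kv_finalize thread (via _txc_finish).
  // It needs the same lock, so it queues behind any long allocate()
  // scan -- and since that same thread computes state_kv_committing_lat
  // and commit_lat, the stall shows up directly in those counters.
  void release(uint64_t off, uint64_t len) {
    std::scoped_lock l(lock);
    free_map[off] = len;                  // neighbor merging omitted
  }
};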