[ceph-users] ceph osd commit latency increase over time, until restart

Igor Fedotov ifedotov at suse.de
Fri Mar 1 02:24:35 PST 2019

Hi Chen,

thanks for the update. I will prepare a patch to periodically reset 
StupidAllocator today.

Just to let you know, below is an e-mail from AdamK at RH which 
might explain the issue with the allocator.

Also please note that StupidAllocator might not perform full 
defragmentation at run time. That's why we observed (as mentioned somewhere 
in the thread) fragmentation growing while the OSD is running and dropping 
on restart. Such a restart rebuilds the internal tree and eliminates the 
defragmentation flaws. Maybe that's the case here.



-------- Forwarded Message --------

Subject: 	High CPU in StupidAllocator
Date: 	Tue, 12 Feb 2019 10:24:37 +0100
From: 	Adam Kupczyk <akupczyk at redhat.com>
To: 	IGOR FEDOTOV <ifed75 at gmail.com>

Hi Igor,

I have observed that StupidAllocator can burn a lot of CPU in 
This comes from loops:
while (p != free[bin].end()) {
     if (_aligned_len(p, alloc_unit) >= want_size) {
       goto found;

It happens when want_size is close to the upper limit of the bin's size range.
For example, free[5] contains sizes 8192..16383.
When requesting a size like 16000, it is quite likely that multiple chunks 
must be checked.
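
To illustrate the effect described above, here is a minimal sketch. It is 
not the Ceph code: the names toy_choose_bin, toy_make_bin5 and 
toy_scan_cost, the binning rule and the chunk lengths are all assumptions 
made up for the example. It only models a first-fit scan over one 
power-of-two bin and counts how many chunks get inspected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical binning rule: bin k holds lengths in [2^(k+8), 2^(k+9)),
// chosen only so that bin 5 covers 8192..16383 as in the example above.
unsigned toy_choose_bin(uint64_t len) {
  unsigned bits = 0;
  while (len >>= 1) ++bits;  // bits = floor(log2(len)) for len > 0
  return bits >= 8 ? bits - 8 : 0;
}

// Build a made-up bin 5: 1024 chunk lengths spread over 8192..16376 in a
// scrambled but deterministic order (37 is coprime with 1024, so every
// multiple of 8 in that range appears exactly once).
std::vector<uint64_t> toy_make_bin5() {
  std::vector<uint64_t> lens;
  for (uint64_t i = 0; i < 1024; ++i) {
    uint64_t len = 8192 + ((i * 37) % 1024) * 8;
    assert(toy_choose_bin(len) == 5);  // every chunk belongs to bin 5
    lens.push_back(len);
  }
  return lens;
}

// First-fit scan over one bin: returns how many chunks were inspected
// before one of at least `want` bytes was found. If no chunk fits, the
// whole bin was inspected and lens.size() is returned.
std::size_t toy_scan_cost(const std::vector<uint64_t>& lens, uint64_t want) {
  std::size_t inspected = 0;
  for (uint64_t len : lens) {
    ++inspected;
    if (len >= want) break;
  }
  return inspected;
}
```

With this toy bin, toy_scan_cost(toy_make_bin5(), 8192) stops at the first 
chunk, because every chunk in the bin is large enough. A want of 16000 has 
to skip every chunk shorter than that before it finds a fit, and in a real 
allocator that walk happens on each such allocation.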

I have made an attempt to improve it by increasing the number of buckets.
It is done in aclamk/wip-bs-stupid-allocator-2.

Best regards,

Adam Kupczyk

On 3/1/2019 11:46 AM, Xiaoxi Chen wrote:
> Igor,
>    I can test the patch if we have a package.
>    My environment and workload can consistently reproduce the latency
> 2-3 days after restarting.
>    Sage tells me to try the bitmap allocator to make sure the stupid
> allocator is the bad guy. I have some OSDs on Luminous + bitmap and
> some OSDs on 14.1.0 + bitmap. Both look positive so far, but I need
> more time to be sure.
>    The perf, log and admin socket analysis lead to the theory that
> in alloc_int the loop sometimes takes a long time with the allocator
> lock held. This blocks the release part called from _txc_finish in
> kv_finalize_thread; this thread is also the one that calculates
> state_kv_committing_lat and the overall commit_lat. You can see from the
> admin socket that state_done_latency has a similar trend to commit_latency.
>    But we cannot find a theory to explain why a reboot helps: the
> allocator btree will be rebuilt from the freelist manager and it should be
> exactly the same as it was prior to the reboot. Anything related to PG
> recovery?
>    Anyway, as I have a live env and workload, I am more than willing
> to work with you on further investigation.
> -Xiaoxi
> Igor Fedotov <ifedotov at suse.de> wrote on
> Fri, Mar 1, 2019 at 6:21 AM:
>     Also I think it makes sense to create a ticket at this point. Any
>     volunteers?
>     On 3/1/2019 1:00 AM, Igor Fedotov wrote:
>     > Wondering if somebody would be able to apply simple patch that
>     > periodically resets StupidAllocator?
>     >
>     > Just to verify/disprove the hypothesis that it's allocator related
>     >
>     > On 2/28/2019 11:57 PM, Stefan Kooman wrote:
>     >> Quoting Wido den Hollander (wido at 42on.com):
>     >>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe
>     >>> OSDs as well. Over time their latency increased until we started to
>     >>> notice I/O-wait inside VMs.
>     >> On a Luminous 12.2.8 cluster with only SSDs we also hit this issue I
>     >> guess. After restarting the OSD servers the latency would drop to normal
>     >> values again. See https://owncloud.kooman.org/s/BpkUc7YM79vhcDj
>     >>
>     >> Reboots were finished at ~ 19:00.
>     >>
>     >> Gr. Stefan
>     >>
