[ceph-users] OSD log being spammed with BlueStore stupidallocator dump

Wido den Hollander wido at 42on.com
Mon Oct 15 20:57:58 PDT 2018



On 10/16/2018 12:04 AM, Igor Fedotov wrote:
> 
> On 10/15/2018 11:47 PM, Wido den Hollander wrote:
>> Hi,
>>
>> On 10/15/2018 10:43 PM, Igor Fedotov wrote:
>>> Hi Wido,
>>>
>>> once you apply the PR you'll probably see the initial error in the log
>>> that triggers the dump. Which is most probably the lack of space
>>> reported by _balance_bluefs_freespace() function. If so this means that
>>> BlueFS rebalance is unable to allocate contiguous 1M chunk at main
>>> device to gift to BlueFS. I.e. your main device space is very
>>> fragmented.
>>>
>>> Unfortunately I don't know of any way to recover from this state other
>>> than OSD redeployment or data removal.
>>>
>> We are moving data away from these OSDs. Luckily we have HDD OSDs in
>> this cluster as well, so we are moving a lot of data there.
>>
>> How would re-deployment work? Just wipe the OSDs and bring them back
>> into the cluster again? That would be a very painful task.. :-(
> 
> There's a good chance you'll face the same issue again one day.
> Maybe allocate some SSDs to serve as DB devices?

Maybe, but running WAL+DB+DATA on a single SSD is a very common
use-case.

Now we are running into it, but isn't it likely that other people will
run into it as well?

>>
>>> An upcoming PR that adds the ability to manipulate BlueFS volumes
>>> offline (https://github.com/ceph/ceph/pull/23103) will probably help
>>> recover from this issue in the future by migrating BlueFS data to a
>>> new, larger DB volume (targeted for Nautilus; not sure about
>>> backporting to Mimic or Luminous).
>>>
>>> For now the only preventive measure I can suggest is to have enough
>>> space on a standalone DB volume, so that the main device isn't used
>>> for DB data at all, or as little as possible. Then no rebalance is
>>> needed and no fragmentation occurs.
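For anyone provisioning along these lines: a hypothetical example with placeholder device names, using the standard ceph-volume syntax for a standalone DB device (size block.db generously, especially with heavy OMAP workloads like RGW indexes):

```
# Placeholder devices: data on the SSD, RocksDB on a separate partition.
# A large enough block.db means BlueFS never needs gifts from the main
# device, so no rebalance and no fragmentation pressure there.
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1
```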
>>>
>> I see, but these are SSD-only OSDs.
>>
>>> BTW, I'm wondering whether you have one for your OSDs? If so, how large?
>>>
>> The cluster consists of 96 OSDs, all Samsung PM863a 1.92TB SSDs.
>>
>> The fullest OSD is currently 78% full, which leaves 348GiB free on
>> the 1.75TiB device.
>>
>> Does this information help?
> Yeah, thanks for sharing.

Let me know if you need more!

Wido

>>
>> Thanks!
>>
>> Wido
>>
>>> Everything above is just my opinion; there's some chance I missed
>>> something.
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>> On 10/15/2018 10:12 PM, Wido den Hollander wrote:
>>>> On 10/15/2018 08:23 PM, Gregory Farnum wrote:
>>>>> I don't know anything about the BlueStore code, but given the snippets
>>>>> you've posted this appears to be a debug thing that doesn't expect
>>>>> to be
>>>>> invoked (or perhaps only in an unexpected case that it's trying
>>>>> hard to
>>>>> recover from). Have you checked where the dump() function is invoked
>>>>> from? I'd imagine it's something about having to try extra-hard to
>>>>> allocate free space or something.
>>>> It seems it is BlueFS that is having a hard time finding free space.
>>>>
>>>> I'm trying this PR now: https://github.com/ceph/ceph/pull/24543
>>>>
>>>> It will stop the spamming, but that doesn't address the root cause.
>>>> The OSDs in this case are at most 80% full, and they do hold a lot
>>>> of OMAP (RGW indexes), but that's all.
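For others hitting this: one way to see how BlueFS is using space is the admin socket (counter names as exposed in Luminous; the OSD id is a placeholder):

```
# BlueFS perf counters: db_total_bytes/db_used_bytes for the DB space,
# slow_used_bytes for BlueFS data living on the main (slow) device.
# On single-device OSDs the DB shares the main device.
ceph daemon osd.0 perf dump bluefs
```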
>>>>
>>>> I'm however not sure why this is happening suddenly in this cluster.
>>>>
>>>> Wido
>>>>
>>>>> -Greg
>>>>>
>>>>> On Mon, Oct 15, 2018 at 10:02 AM Wido den Hollander <wido at 42on.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>       On 10/11/2018 12:08 AM, Wido den Hollander wrote:
>>>>>       > Hi,
>>>>>       >
>>>>>       > On a Luminous cluster running a mix of 12.2.4, 12.2.5 and
>>>>> 12.2.8 I'm
>>>>>       > seeing OSDs writing heavily to their logfiles spitting out
>>>>> these
>>>>>       lines:
>>>>>       >
>>>>>       >
>>>>>       > 2018-10-10 21:52:04.019037 7f90c2f0f700  0 stupidalloc
>>>>>       0x0x55828ae047d0
>>>>>       > dump  0x15cd2078000~34000
>>>>>       > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc
>>>>>       0x0x55828ae047d0
>>>>>       > dump  0x15cd22cc000~24000
>>>>>       > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc
>>>>>       0x0x55828ae047d0
>>>>>       > dump  0x15cd2300000~20000
>>>>>       > 2018-10-10 21:52:04.019039 7f90c2f0f700  0 stupidalloc
>>>>>       0x0x55828ae047d0
>>>>>       > dump  0x15cd2324000~24000
>>>>>       > 2018-10-10 21:52:04.019040 7f90c2f0f700  0 stupidalloc
>>>>>       0x0x55828ae047d0
>>>>>       > dump  0x15cd26c0000~24000
>>>>>       > 2018-10-10 21:52:04.019041 7f90c2f0f700  0 stupidalloc
>>>>>       0x0x55828ae047d0
>>>>>       > dump  0x15cd2704000~30000
>>>>>       >
>>>>>       > It goes so fast that the OS disk in this case can't keep
>>>>>       > up and becomes 100% utilized.
>>>>>       >
>>>>>       > This causes the OSD to slow down, produce slow requests,
>>>>>       > and start to flap.
>>>>>       >
>>>>>
>>>>>       I've set 'log_file' to /dev/null for now, but that doesn't
>>>>>       solve it either. OSDs randomly start spitting out slow
>>>>>       requests and showing these issues.
>>>>>
>>>>>       Any suggestions on how to fix this?
>>>>>
>>>>>       Wido
>>>>>
>>>>>       > It seems that this is *only* happening on the fullest OSDs
>>>>>       > (~85%) in this cluster, which have about 400 PGs each
>>>>>       > (yes, I know, that's high).
>>>>>       >
>>>>>       > Looking at StupidAllocator.cc I see this piece of code:
>>>>>       >
>>>>>       > void StupidAllocator::dump()
>>>>>       > {
>>>>>       >   std::lock_guard<std::mutex> l(lock);
>>>>>       >   for (unsigned bin = 0; bin < free.size(); ++bin) {
>>>>>       >     ldout(cct, 0) << __func__ << " free bin " << bin << ": "
>>>>>       >                   << free[bin].num_intervals() << " extents"
>>>>> << dendl;
>>>>>       >     for (auto p = free[bin].begin();
>>>>>       >          p != free[bin].end();
>>>>>       >          ++p) {
>>>>>       >       ldout(cct, 0) << __func__ << "  0x" << std::hex <<
>>>>> p.get_start()
>>>>>       > << "~"
>>>>>       >                     << p.get_len() << std::dec << dendl;
>>>>>       >     }
>>>>>       >   }
>>>>>       > }
>>>>>       >
>>>>>       > I'm just wondering why it would spit out these lines and
>>>>> what's
>>>>>       causing it.
>>>>>       >
>>>>>       > Has anybody seen this before?
>>>>>       >
>>>>>       > Wido
>>>>>       > _______________________________________________
>>>>>       > ceph-users mailing list
>>>>>       > ceph-users at lists.ceph.com
>>>>>       > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>       >
>>>>>
> 

