[ceph-users] OSD log being spammed with BlueStore stupidallocator dump

Igor Fedotov ifedotov at suse.de
Mon Oct 15 15:04:27 PDT 2018


On 10/15/2018 11:47 PM, Wido den Hollander wrote:
> Hi,
>
> On 10/15/2018 10:43 PM, Igor Fedotov wrote:
>> Hi Wido,
>>
>> once you apply the PR you'll probably see, in the log, the initial error
>> that triggers the dump. Most probably it is the lack of space reported by
>> the _balance_bluefs_freespace() function. If so, this means that the
>> BlueFS rebalance is unable to allocate a contiguous 1M chunk on the main
>> device to gift to BlueFS, i.e. your main device's free space is very
>> fragmented.
>>
>> Unfortunately I don't know of any way to recover from this state other
>> than OSD redeployment or data removal.
>>
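
A minimal, self-contained sketch of the failure path described above (this
is not the actual BlueStore/BlueFS code; ToyAllocator and everything in it
is invented purely to illustrate the pattern): the rebalance asks the
main-device allocator for one contiguous chunk to gift to BlueFS, no single
free extent is large enough, and the failure path dumps every free extent,
which is the flood of "dump 0x...~..." lines seen in the OSD log.

    // toy_alloc.cc -- illustration only, not Ceph code (g++ -std=c++11 toy_alloc.cc)
    #include <cstdint>
    #include <iostream>
    #include <map>

    struct ToyAllocator {
      std::map<uint64_t, uint64_t> free_extents;  // offset -> length

      bool allocate_contiguous(uint64_t want_len) {
        for (const auto& kv : free_extents)
          if (kv.second >= want_len)
            return true;       // real code would carve the extent out here
        dump();                // on failure: log every free extent we track
        return false;
      }

      void dump() const {
        for (const auto& kv : free_extents)
          std::cout << "dump  0x" << std::hex << kv.first << "~" << kv.second
                    << std::dec << "\n";
      }
    };

    int main() {
      ToyAllocator a;
      for (uint64_t i = 0; i < 8; ++i)           // many small free pieces,
        a.free_extents[i * 0x100000] = 0x20000;  // each only 128 KiB
      a.allocate_contiguous(0x100000);           // a 1 MiB "gift" fails
    }
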
> We are moving data away from these OSDs. Luckily we have HDD OSDs in
> there as well, so we are moving a lot of data there.
>
> How would re-deployment work? Just wipe the OSDs and bring them back
> into the cluster again? That would be a very painful task.. :-(

There is a good chance that you'll face the same issue again one day.
Maybe allocate some SSDs to serve as DB devices?
>
>> The upcoming PR that adds the ability to manipulate BlueFS volumes
>> offline (https://github.com/ceph/ceph/pull/23103) will probably help to
>> recover from this issue in the future by migrating BlueFS data to a new,
>> larger DB volume. (It is targeted for Nautilus; I'm not sure about
>> backporting to Mimic or Luminous.)
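
For what it's worth, offline BlueFS volume manipulation did eventually land
as ceph-bluestore-tool subcommands in later releases; a hedged sketch of
what that looks like, with the OSD path and target device as placeholders,
and with no claim that this exact interface is what the PR above ships:

    # attach a new, larger DB volume to an existing (stopped) OSD; depending
    # on the release, a separate bluefs-bdev-migrate step then moves the
    # existing BlueFS data over
    ceph-bluestore-tool bluefs-bdev-new-db \
        --path /var/lib/ceph/osd/ceph-12 \
        --dev-target /dev/nvme0n1p2
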
>>
>> For now the only preventive measure I can suggest is to have enough
>> space on your standalone DB volume so that the main device isn't used
>> for DB data at all, or as little as possible. Then no rebalancing is
>> needed and no fragmentation occurs.
>>
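
For reference, a standalone DB device is specified when the OSD is created
(or redeployed); a rough sketch with ceph-volume, where both device paths
are placeholders and the DB partition should be sized generously per the
advice above:

    # data on the SSD, RocksDB/BlueFS DB on a separate partition
    ceph-volume lvm create --bluestore \
        --data /dev/sdb \
        --block.db /dev/nvme0n1p1
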
> I see, but these are SSD-only OSDs.
>
>> BTW, I'm wondering: do you have one for your OSDs? If so, how large?
>>
> The cluster consists of 96 OSDs, all Samsung PM863a 1.92TB SSDs.
>
> The fullest OSD is currently 78% full, which leaves 348GiB free on the
> 1.75TiB device.
>
> Does this information help?
Yeah, thanks for sharing.
>
> Thanks!
>
> Wido
>
>> Everything above is "IMO"; there's some chance I've missed something.
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>> On 10/15/2018 10:12 PM, Wido den Hollander wrote:
>>> On 10/15/2018 08:23 PM, Gregory Farnum wrote:
>>>> I don't know anything about the BlueStore code, but given the snippets
>>>> you've posted this appears to be a debug thing that doesn't expect to be
>>>> invoked (or perhaps only in an unexpected case that it's trying hard to
>>>> recover from). Have you checked where the dump() function is invoked
>>>> from? I'd imagine it's something about having to try extra-hard to
>>>> allocate free space or something.
>>> It seems to be BlueFS that is having a hard time finding free space.
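
As an aside, the bluefs section of the OSD perf counters shows how much of
each device BlueFS currently owns and uses; osd.12 below is a placeholder,
and the interpretation of the counters is my reading, worth double-checking
against the running release:

    ceph daemon osd.12 perf dump | jq .bluefs
    # compare db_total_bytes (space currently gifted to BlueFS) with
    # db_used_bytes (space BlueFS actually uses)
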
>>>
>>> I'm trying this PR now: https://github.com/ceph/ceph/pull/24543
>>>
>>> It will stop the spamming, but that doesn't address the root cause. The
>>> OSDs in this case are at most 80% full, and they do hold a lot of OMAP
>>> (RGW indexes), but that's all.
>>>
>>> I'm however not sure why this is happening suddenly in this cluster.
>>>
>>> Wido
>>>
>>>> -Greg
>>>>
>>>> On Mon, Oct 15, 2018 at 10:02 AM Wido den Hollander <wido at 42on.com>
>>>> wrote:
>>>>
>>>>
>>>>
>>>>       On 10/11/2018 12:08 AM, Wido den Hollander wrote:
>>>>       > Hi,
>>>>       >
>>>>       > On a Luminous cluster running a mix of 12.2.4, 12.2.5 and 12.2.8
>>>>       > I'm seeing OSDs writing heavily to their logfiles, spitting out
>>>>       > these lines:
>>>>       >
>>>>       >
>>>>       > 2018-10-10 21:52:04.019037 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0 dump  0x15cd2078000~34000
>>>>       > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0 dump  0x15cd22cc000~24000
>>>>       > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0 dump  0x15cd2300000~20000
>>>>       > 2018-10-10 21:52:04.019039 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0 dump  0x15cd2324000~24000
>>>>       > 2018-10-10 21:52:04.019040 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0 dump  0x15cd26c0000~24000
>>>>       > 2018-10-10 21:52:04.019041 7f90c2f0f700  0 stupidalloc 0x0x55828ae047d0 dump  0x15cd2704000~30000
>>>>       >
>>>>       > It goes so fast that the OS disk in this case can't keep up and
>>>>       > becomes 100% utilized.
>>>>       >
>>>>       > This causes the OSD to slow down, generate slow requests, and
>>>>       > start to flap.
>>>>       >
>>>>
>>>>       I've set 'log_file' to /dev/null for now, but that doesn't solve it
>>>>       either. OSDs just randomly start spitting out slow requests and
>>>>       hitting these issues.
>>>>
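
As an aside, one way to do that is via ceph.conf, though it may have been
set differently here:

    [osd]
        log file = /dev/null
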
>>>>       Any suggestions on how to fix this?
>>>>
>>>>       Wido
>>>>
>>>>       > It seems that this is *only* happening on the fullest OSDs (~85%)
>>>>       > in this cluster, and they have about ~400 PGs each (yes, I know,
>>>>       > that's high).
>>>>       >
>>>>       > Looking at StupidAllocator.cc I see this piece of code:
>>>>       >
>>>>       > void StupidAllocator::dump()
>>>>       > {
>>>>       >   std::lock_guard<std::mutex> l(lock);
>>>>       >   for (unsigned bin = 0; bin < free.size(); ++bin) {
>>>>       >     ldout(cct, 0) << __func__ << " free bin " << bin << ": "
>>>>       >                   << free[bin].num_intervals() << " extents"
>>>>       >                   << dendl;
>>>>       >     for (auto p = free[bin].begin();
>>>>       >          p != free[bin].end();
>>>>       >          ++p) {
>>>>       >       ldout(cct, 0) << __func__ << "  0x" << std::hex
>>>>       >                     << p.get_start() << "~"
>>>>       >                     << p.get_len() << std::dec << dendl;
>>>>       >     }
>>>>       >   }
>>>>       > }
>>>>       >
>>>>       > I'm just wondering why it would spit out these lines and what's
>>>>       > causing it.
>>>>       >
>>>>       > Has anybody seen this before?
>>>>       >
>>>>       > Wido
>>>>       > _______________________________________________
>>>>       > ceph-users mailing list
>>>>       > ceph-users at lists.ceph.com
>>>>       > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>       >
>>>>       _______________________________________________
>>>>       ceph-users mailing list
>>>>       ceph-users at lists.ceph.com
>>>>       http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users at lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


