[ceph-users] OSD log being spammed with BlueStore stupidallocator dump

Wido den Hollander wido at 42on.com
Tue Oct 16 03:30:17 PDT 2018



On 10/16/18 11:32 AM, Igor Fedotov wrote:
> 
> 
> On 10/16/2018 6:57 AM, Wido den Hollander wrote:
>>
>> On 10/16/2018 12:04 AM, Igor Fedotov wrote:
>>> On 10/15/2018 11:47 PM, Wido den Hollander wrote:
>>>> Hi,
>>>>
>>>> On 10/15/2018 10:43 PM, Igor Fedotov wrote:
>>>>> Hi Wido,
>>>>>
>>>>> once you apply the PR you'll probably see the initial error in the
>>>>> log that triggers the dump, which is most probably the lack of space
>>>>> reported by the _balance_bluefs_freespace() function. If so, this
>>>>> means BlueFS rebalance is unable to allocate a contiguous 1M chunk
>>>>> on the main device to gift to BlueFS. I.e. your main device's space
>>>>> is very fragmented.
>>>>>
>>>>> Unfortunately I don't know of any way to recover from this state
>>>>> other than OSD redeployment or data removal.
>>>>>
>>>>>
>>>> We are moving data away from these OSDs. Luckily we have HDD OSDs in
>>>> there as well, so we are moving a lot of data there.
>>>>
>>>> How would re-deployment work? Just wipe the OSDs and bring them back
>>>> into the cluster again? That would be a very painful task.. :-(
>>> There's a good chance you'll face the same issue again one day.
>>> Maybe allocate some SSDs to serve as DB devices?
>> Maybe, but this is a very common use-case where people run WAL+DB+DATA
>> on a single SSD.
> Yeah, but I'd consider that a workaround until a better solution is
> provided.
> Before your case this high-fragmentation issue had been rather a
> theoretical concern with a very small probability.
> 

Understood. What I see now is that after offloading the RGW bucket
indexes from these OSDs there is free space.

The OSDs are now randomly going to 100% utilization on their disk,
reading heavily from it.

Cranking up the logging shows me:

2018-10-16 11:50:20.325674 7fa52b9a4700  5 bdev(0x5627cc185200
/var/lib/ceph/osd/ceph-118/block) read_random 0xe7547cd93c~f0b
2018-10-16 11:50:20.325836 7fa52b9a4700  5 bdev(0x5627cc185200
/var/lib/ceph/osd/ceph-118/block) read_random 0xe7547ce847~ef9
2018-10-16 11:50:20.325997 7fa52b9a4700  5 bdev(0x5627cc185200
/var/lib/ceph/osd/ceph-118/block) read_random 0xe7547cf740

BlueFS / bdev seem to be reading a lot. From what I can tell RocksDB is
discarding OMAP data, which causes a lot of reads through BlueFS.

>> Now we are running into it, but aren't the chances high that other
>> people will run into it as well?
>>
>>>>> An upcoming PR that brings the ability to manipulate BlueFS volumes
>>>>> offline (https://github.com/ceph/ceph/pull/23103) will probably help
>>>>> to recover from this issue in the future by migrating BlueFS data to
>>>>> a new, larger DB volume. (Targeted for Nautilus; not sure about
>>>>> backporting to Mimic or Luminous.)
>>>>>
>>>>> For now the only preventive measure I can suggest to avoid this
>>>>> case is to have enough space on your standalone DB volume, so that
>>>>> the main device isn't used for DB at all, or as little as possible.
>>>>> Hence no rebalance is needed and no fragmentation occurs.
>>>>>
>>>> I see, but these are SSD-only OSDs.
>>>>
>>>>> BTW wondering if you have one for your OSDs? How large if so?
>>>>>
>>>> The cluster consists of 96 OSDs, all Samsung PM863a 1.92TB SSDs.
>>>>
>>>> The fullest OSD currently is 78% full, which leaves 348GiB free on
>>>> the 1.75TiB device.
>>>>
>>>> Does this information help?
>>> Yeah, thanks for sharing.
>> Let me know if you need more!
> Additionally, I'm wondering if you know how much data (on average) has
> been written to these OSDs in total. I mean an aggregation over all
> writes (even ones that have already been removed), not the current
> usage.
> Usage patterns, object sizes etc. would be interesting as well.
>

These OSDs were migrated from FileStore to BlueStore in March 2018.

Using blkdiscard the SSDs (Samsung PM863a 1.92TB) were wiped and then
deployed with BlueStore.

They run:

- RBD (couple of pools)
- RGW indexes

The actual RGW data is on HDD OSDs in the same cluster.

The whole cluster does a steady 20k IOPS (R+W) during the day, I would
say. Roughly 200MB/sec write and 100MB/sec read.

Suddenly OSDs started to grind to a halt last week and have been
flapping and showing slow requests ever since.

The RGW indexes have been offloaded to the HDD nodes in the meantime.

Wido

>>>
>> Wido
>>
>>>> Thanks!
>>>>
>>>> Wido
>>>>
>>>>> Everything above is "IMO"; there's some chance that I missed
>>>>> something..
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Igor
>>>>>
>>>>>
>>>>> On 10/15/2018 10:12 PM, Wido den Hollander wrote:
>>>>>> On 10/15/2018 08:23 PM, Gregory Farnum wrote:
>>>>>>> I don't know anything about the BlueStore code, but given the
>>>>>>> snippets you've posted, this appears to be a debug thing that
>>>>>>> isn't expected to be invoked (or perhaps only in an unexpected
>>>>>>> case that it's trying hard to recover from). Have you checked
>>>>>>> where the dump() function is invoked from? I'd imagine it's
>>>>>>> something about having to try extra-hard to allocate free space
>>>>>>> or something.
>>>>>> It seems it is BlueFS that is having a hard time finding free space.
>>>>>>
>>>>>> I'm trying this PR now: https://github.com/ceph/ceph/pull/24543
>>>>>>
>>>>>> It will stop the spamming, but that's not the root cause. The OSDs in
>>>>>> this case are at max 80% full and they do have a lot of OMAP (RGW
>>>>>> indexes) in them, but that's all.
>>>>>>
>>>>>> I'm however not sure why this is happening suddenly in this cluster.
>>>>>>
>>>>>> Wido
>>>>>>
>>>>>>> -Greg
>>>>>>>
>>>>>>> On Mon, Oct 15, 2018 at 10:02 AM Wido den Hollander
>>>>>>> <wido at 42on.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>        On 10/11/2018 12:08 AM, Wido den Hollander wrote:
>>>>>>>        > Hi,
>>>>>>>        >
>>>>>>>        > On a Luminous cluster running a mix of 12.2.4, 12.2.5 and
>>>>>>>        > 12.2.8 I'm seeing OSDs writing heavily to their logfiles,
>>>>>>>        > spitting out these lines:
>>>>>>>        >
>>>>>>>        >
>>>>>>>        > 2018-10-10 21:52:04.019037 7f90c2f0f700  0 stupidalloc
>>>>>>>        0x0x55828ae047d0
>>>>>>>        > dump  0x15cd2078000~34000
>>>>>>>        > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc
>>>>>>>        0x0x55828ae047d0
>>>>>>>        > dump  0x15cd22cc000~24000
>>>>>>>        > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc
>>>>>>>        0x0x55828ae047d0
>>>>>>>        > dump  0x15cd2300000~20000
>>>>>>>        > 2018-10-10 21:52:04.019039 7f90c2f0f700  0 stupidalloc
>>>>>>>        0x0x55828ae047d0
>>>>>>>        > dump  0x15cd2324000~24000
>>>>>>>        > 2018-10-10 21:52:04.019040 7f90c2f0f700  0 stupidalloc
>>>>>>>        0x0x55828ae047d0
>>>>>>>        > dump  0x15cd26c0000~24000
>>>>>>>        > 2018-10-10 21:52:04.019041 7f90c2f0f700  0 stupidalloc
>>>>>>>        0x0x55828ae047d0
>>>>>>>        > dump  0x15cd2704000~30000
>>>>>>>        >
>>>>>>>        > It goes so fast that the OS disk in this case can't keep
>>>>>>>        > up and becomes 100% utilized.
>>>>>>>        >
>>>>>>>        > This causes the OSD to slow down, produce slow requests,
>>>>>>>        > and start to flap.
>>>>>>>        >
>>>>>>>
>>>>>>>        I've set 'log_file' to /dev/null for now, but that doesn't
>>>>>>>        solve it either. Randomly, OSDs just start spitting out slow
>>>>>>>        requests and have these issues.
>>>>>>>
>>>>>>>        Any suggestions on how to fix this?
>>>>>>>
>>>>>>>        Wido
>>>>>>>
>>>>>>>        > It seems that this is *only* happening on OSDs which are
>>>>>>>        > the fullest (~85%) on this cluster, and they have about
>>>>>>>        > ~400 PGs each (yes, I know, that's high).
>>>>>>>        >
>>>>>>>        > Looking at StupidAllocator.cc I see this piece of code:
>>>>>>>        >
>>>>>>>        > void StupidAllocator::dump()
>>>>>>>        > {
>>>>>>>        >   std::lock_guard<std::mutex> l(lock);
>>>>>>>        >   for (unsigned bin = 0; bin < free.size(); ++bin) {
>>>>>>>        >     ldout(cct, 0) << __func__ << " free bin " << bin << ": "
>>>>>>>        >                   << free[bin].num_intervals() << " extents" << dendl;
>>>>>>>        >     for (auto p = free[bin].begin();
>>>>>>>        >          p != free[bin].end();
>>>>>>>        >          ++p) {
>>>>>>>        >       ldout(cct, 0) << __func__ << "  0x" << std::hex << p.get_start() << "~"
>>>>>>>        >                     << p.get_len() << std::dec << dendl;
>>>>>>>        >     }
>>>>>>>        >   }
>>>>>>>        > }
>>>>>>>        >
>>>>>>>        > I'm just wondering why it would spit out these lines and
>>>>>>>        > what's causing it.
>>>>>>>        >
>>>>>>>        > Has anybody seen this before?
>>>>>>>        >
>>>>>>>        > Wido
>>>>>>>        > _______________________________________________
>>>>>>>        > ceph-users mailing list
>>>>>>>        > ceph-users at lists.ceph.com
>>>>>>>        > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>        >
>>>>>>>
> 

