[ceph-users] Sudden omap growth on some OSDs

Gregory Farnum gfarnum at redhat.com
Wed Dec 6 14:50:15 PST 2017

On Wed, Dec 6, 2017 at 2:35 PM David Turner <drakonstein at gmail.com> wrote:

> I have no proof or anything other than a hunch, but OSDs don't trim omaps
> unless all PGs are healthy.  If this PG is actually not healthy, but the
> cluster doesn't realize it while the 11 involved OSDs do, you would see
> this exact problem: the OSDs think the PG is unhealthy and so aren't
> trimming their omaps, while the cluster isn't aware of it and everything
> else trims properly.

I think you're confusing omaps and OSDMaps here. OSDMaps, like omap, are
stored in leveldb, but they have different trimming rules.
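One way to tell the two apart is to check whether OSDMap epochs are piling
up on the affected OSDs. A rough check (osd.12 is a placeholder ID; run on
the host that owns the OSD, and it assumes the admin socket is in the
default location and the Kraken-era filestore on-disk layout):

```shell
# Compare the oldest and newest OSDMap epochs this OSD still holds.
# A very large gap between the two suggests OSDMaps are not being trimmed.
ceph daemon osd.12 status

# Pull out just the two epochs with jq:
ceph daemon osd.12 status | jq '{oldest_map, newest_map}'

# Rough on-disk size of the leveldb omap directory (filestore layout):
du -sh /var/lib/ceph/osd/ceph-12/current/omap
```

If the gap is small but the leveldb directory keeps growing anyway, the
growth is more likely actual omap data than retained OSDMaps.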

> I don't know what to do about it, but I hope it helps get you (or someone
> else on the ML) towards a resolution.
> On Wed, Dec 6, 2017 at 1:59 PM <george.vasilakakos at stfc.ac.uk> wrote:
>> Hi ceph-users,
>> We have a Ceph cluster (running Kraken) that is exhibiting some odd
>> behaviour.
>> A couple of weeks ago, the LevelDBs on some of our OSDs started growing
>> large (they are now around 20G).
>> The one thing they have in common is that the 11 disks with inflating
>> LevelDBs are all in the set for one PG in one of our pools (EC 8+3).
>> This pool started to see use around the time the LevelDBs started
>> inflating.
>> Compactions are running and they do go down in size a bit but the overall
>> trend is one of rapid growth. The other 2000+ OSDs in the cluster have
>> LevelDBs between 650M and 1.2G.
>> This PG has nothing to separate it from the others in its pool: it is
>> within 5% of the average number of objects per PG, shows no hot-spotting
>> in terms of load, and no weird states are reported by ceph status.
>> The one odd thing about it is that the pg query output reports it as
>> active+clean, yet it has a recovery state, which it enters every morning
>> between 9 and 10am, mentioning a "might_have_unfound" situation and
>> having probed all other set members. A deep scrub of the PG didn't turn
>> up anything.
You need to be more specific here. What do you mean it "enters into" the
recovery state every morning?

How many PGs are in your 8+3 pool, and are all your OSDs hosting EC pools?
What are you using the cluster for?
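When you answer, it would help to capture exactly what the PG reports at
the time. A sketch, with 11.1ab standing in as a placeholder for your
actual pgid:

```shell
# Dump the recovery_state section of pg query; the enter_time field on
# each state shows when the PG last entered it, which should pin down
# what happens between 9 and 10am.
ceph pg 11.1ab query | jq '.recovery_state'

# Check for unfound/missing objects directly:
ceph pg 11.1ab list_missing
```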

>> The cluster is now starting to manifest slow requests on the OSDs with
>> the large LevelDBs, although not in the particular PG.
>> What can I do to diagnose and resolve this?
>> Thanks,
>> George
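As a starting point on the slow requests, the affected OSDs can be asked
directly what their in-flight ops are blocked on (osd.12 is again a
placeholder; these assume a local admin socket in the default location):

```shell
# Show current slow/in-flight ops and the event they are waiting on:
ceph daemon osd.12 dump_ops_in_flight

# Recently completed ops with per-stage timings, useful for spotting
# where the latency is going (e.g. waiting on leveldb compaction):
ceph daemon osd.12 dump_historic_ops
```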
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
