[ceph-users] Sudden omap growth on some OSDs
gfarnum at redhat.com
Wed Dec 6 14:50:15 PST 2017
On Wed, Dec 6, 2017 at 2:35 PM David Turner <drakonstein at gmail.com> wrote:
> I have no proof or anything other than a hunch, but OSDs don't trim omaps
> unless all PGs are healthy. If this PG is actually not healthy, but the
> cluster doesn't realize it while these 11 involved OSDs do realize that the
> PG is unhealthy... You would see this exact problem. The OSDs think a PG
> is unhealthy so they aren't trimming their omaps while the cluster doesn't
> seem to be aware of it and everything else is trimming their omaps properly.
I think you're confusing omaps and OSDMaps here. OSDMaps, like omap, are
stored in leveldb, but they have different trimming rules.
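One rough way to tell the two apart on a suspect OSD: as far as I know, `ceph daemon osd.<id> status` prints JSON that includes `oldest_map` and `newest_map`, the range of OSDMap epochs that OSD is holding. The sketch below only parses that output; the sample values are made up, and the helper name is hypothetical. A very wide span would hint at OSDMaps not being trimmed, while a narrow one would suggest looking at omap data instead.

```python
import json


def osdmap_epoch_span(status_json: str) -> int:
    """Return how many OSDMap epochs an OSD holds, given its status JSON.

    Assumes the JSON has integer "oldest_map" and "newest_map" fields,
    as printed by `ceph daemon osd.<id> status`.
    """
    status = json.loads(status_json)
    return status["newest_map"] - status["oldest_map"] + 1


# Made-up sample output, for illustration only (not from the cluster in this thread).
sample = json.dumps({
    "whoami": 12,
    "state": "active",
    "oldest_map": 1000,
    "newest_map": 1500,
    "num_pgs": 64,
})

print(osdmap_epoch_span(sample))
```

Running the same check against one of the 11 affected OSDs and one of the healthy ones would give a direct comparison.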
> I don't know what to do about it, but I hope it helps get you (or someone
> else on the ML) towards a resolution.
> On Wed, Dec 6, 2017 at 1:59 PM <george.vasilakakos at stfc.ac.uk> wrote:
>> Hi ceph-users,
>> We have a Ceph cluster (running Kraken) that is exhibiting some odd
>> behaviour.
>> A couple of weeks ago, the LevelDBs on some of our OSDs started growing
>> large (now at around 20G in size).
>> The one thing they have in common is that the 11 disks with inflating
>> LevelDBs are all in the set for one PG in one of our pools (EC 8+3). This
>> pool started to see use around the time the LevelDBs started inflating.
>> Compactions are running and the LevelDBs do shrink a bit afterwards, but
>> the overall trend is one of rapid growth. The other 2000+ OSDs in the
>> cluster have LevelDBs between 650M and 1.2G.
>> This PG has nothing to separate it from the others in its pool: it is
>> within 5% of the average number of objects per PG, there is no
>> hot-spotting in terms of load, and ceph status reports no weird states.
>> The one odd thing about it is the pg query output mentions it is
>> active+clean, but it has a recovery state, which it enters every morning
>> between 9 and 10am, where it mentions a "might_have_unfound" situation and
>> having probed all other set members. A deep scrub of the PG didn't turn up
You need to be more specific here. What do you mean it "enters into" the
recovery state every morning?
How many PGs are in your 8+3 pool, and are all your OSDs hosting EC pools?
What are you using the cluster for?
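To make the first question concrete: the output of `ceph pg <pgid> query` is JSON with a `recovery_state` list, and each entry normally carries a `name` and an `enter_time`. A minimal sketch for pulling those out, assuming that layout; the sample data is invented and the function name is hypothetical. Comparing `enter_time` from one day to the next would show whether the PG really re-enters the state each morning.

```python
import json


def summarize_recovery_state(pg_query_json: str):
    """Return (name, enter_time, has_might_have_unfound) per recovery_state entry.

    Assumes the layout printed by `ceph pg <pgid> query`: a top-level
    "recovery_state" list of objects with "name" and "enter_time" keys.
    """
    query = json.loads(pg_query_json)
    summary = []
    for entry in query.get("recovery_state", []):
        summary.append((
            entry.get("name"),
            entry.get("enter_time"),
            # True only when the key is present and non-empty.
            bool(entry.get("might_have_unfound")),
        ))
    return summary


# Invented sample, for illustration only (not real output from this cluster).
sample_query = json.dumps({
    "state": "active+clean",
    "recovery_state": [
        {
            "name": "Started/Primary/Active",
            "enter_time": "2017-12-06 09:12:00.000000",
            "might_have_unfound": [{"osd": "3(1)", "status": "already probed"}],
        },
        {"name": "Started", "enter_time": "2017-12-06 09:11:58.000000"},
    ],
})

for name, entered, unfound in summarize_recovery_state(sample_query):
    print(name, entered, unfound)
```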
>> The cluster is now starting to manifest slow requests on the OSDs with
>> the large LevelDBs, although not in the particular PG.
>> What can I do to diagnose and resolve this?
>> ceph-users mailing list
>> ceph-users at lists.ceph.com