[ceph-users] Sudden omap growth on some OSDs

george.vasilakakos at stfc.ac.uk
Wed Dec 6 10:59:02 PST 2017

Hi ceph-users,

We have a Ceph cluster (running Kraken) that is exhibiting some odd behaviour.
A couple of weeks ago, the LevelDBs on some of our OSDs started growing large (they are now around 20G in size).

What they have in common is that the 11 disks with inflating LevelDBs are all in the set for one PG in one of our pools (EC 8+3). This pool started to see use around the time the LevelDBs started inflating. Compactions are running and the LevelDBs do shrink a bit afterwards, but the overall trend is one of rapid growth. The other 2000+ OSDs in the cluster have LevelDBs between 650M and 1.2G.
This PG has nothing to separate it from the others in its pool: it is within 5% of the average number of objects per PG, shows no hot-spotting in terms of load, and ceph status reports no weird states.
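
A size comparison like the one above could be reproduced with a small script. The following is only a sketch, assuming FileStore OSDs whose LevelDB sits under /var/lib/ceph/osd/ceph-<id>/current/omap (that layout is an assumption, not something stated here). The helper names and the "4x the median" cut-off are hypothetical choices for illustration.

```python
import os
import statistics


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def find_inflated(sizes, factor=4.0):
    """Return the OSD ids whose omap size exceeds factor x the median size.

    sizes maps an OSD id to its LevelDB size in bytes. The factor is an
    arbitrary illustrative threshold, not a Ceph recommendation.
    """
    if not sizes:
        return []
    median = statistics.median(sizes.values())
    return sorted(osd for osd, size in sizes.items() if size > factor * median)


def collect_omap_sizes(osd_ids, base="/var/lib/ceph/osd"):
    """Measure each local OSD's omap directory (assumed FileStore layout)."""
    return {
        osd: dir_size_bytes(os.path.join(base, "ceph-%d" % osd, "current", "omap"))
        for osd in osd_ids
    }
```

On an OSD host, find_inflated(collect_omap_sizes([12, 13, 14])) would list the local OSDs whose LevelDB stands out from the rest. The OSD ids here are made up.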

The one odd thing about it is that the pg query output reports it as active+clean, yet it also shows a recovery state, which it enters every morning between 9 and 10am. That state mentions a "might_have_unfound" situation and reports that all other set members have been probed. A deep scrub of the PG didn't turn up anything.
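
To track when the PG enters that recovery state and which peers it lists, the JSON printed by `ceph pg <pgid> query` could be summarised with something like the sketch below. It assumes the output has a top-level "recovery_state" list whose entries carry "name", "enter_time" and an optional "might_have_unfound" list of "osd"/"status" pairs. Field names can differ between releases, so check them against real output. The function name is hypothetical.

```python
import json


def summarize_recovery_state(pg_query):
    """Condense the recovery_state section of a parsed `ceph pg <pgid> query`.

    Returns one dict per recovery_state entry with its name, enter_time and
    a list of (osd, status) pairs taken from might_have_unfound, if present.
    Missing keys are tolerated because the layout is assumed, not guaranteed.
    """
    summary = []
    for entry in pg_query.get("recovery_state", []):
        probes = [
            (peer.get("osd"), peer.get("status"))
            for peer in entry.get("might_have_unfound", [])
        ]
        summary.append(
            {
                "name": entry.get("name"),
                "enter_time": entry.get("enter_time"),
                "might_have_unfound": probes,
            }
        )
    return summary


def load_pg_query(path):
    """Load a saved `ceph pg <pgid> query` dump from a file."""
    with open(path) as handle:
        return json.load(handle)
```

Saving the query output to a file once a day and diffing the summaries would show whether enter_time really moves every morning and whether the probed set changes.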

The cluster is now starting to show slow requests on the OSDs with the large LevelDBs, although not in that particular PG.

What can I do to diagnose and resolve this?



