[ceph-users] Sudden omap growth on some OSDs

Gregory Farnum gfarnum at redhat.com
Thu Dec 7 13:57:07 PST 2017

On Thu, Dec 7, 2017 at 4:41 AM <george.vasilakakos at stfc.ac.uk> wrote:

> ________________________________
> From: Gregory Farnum [gfarnum at redhat.com]
> Sent: 06 December 2017 22:50
> To: David Turner
> Cc: Vasilakakos, George (STFC,RAL,SC); ceph-users at lists.ceph.com
> Subject: Re: [ceph-users] Sudden omap growth on some OSDs
> On Wed, Dec 6, 2017 at 2:35 PM David Turner <drakonstein at gmail.com> wrote:
> I have no proof, only a hunch, but OSDs don't trim omaps unless all PGs
> are healthy.  If this PG is actually not healthy, but the cluster doesn't
> realize it while the 11 involved OSDs do, you would see exactly this
> problem: those OSDs think a PG is unhealthy and so aren't trimming their
> omaps, while the rest of the cluster is unaware of it and everything else
> trims its omaps properly.
> I think you're confusing omaps and OSDMaps here. OSDMaps, like omap, are
> stored in leveldb, but they have different trimming rules.
> I don't know what to do about it, but I hope it helps get you (or someone
> else on the ML) towards a resolution.
> On Wed, Dec 6, 2017 at 1:59 PM <george.vasilakakos at stfc.ac.uk> wrote:
> Hi ceph-users,
> We have a Ceph cluster (running Kraken) that is exhibiting some odd
> behaviour.
> A couple of weeks ago, the LevelDBs on some of our OSDs started growing
> large (now at around 20 GB in size).
> The one thing they have in common is that the 11 disks with inflating
> LevelDBs are all in the set for one PG in one of our pools (EC 8+3). This
> pool started to see use around the time the LevelDBs started inflating.
> Compactions are running and they do go down in size a bit but the overall
> trend is one of rapid growth. The other 2000+ OSDs in the cluster have
> LevelDBs between 650M and 1.2G.
> This PG has nothing to separate it from the others in its pool, within 5%
> of average number of objects per PG, no hot-spotting in terms of load, no
> weird states reported by ceph status.
> The one odd thing about it is that the pg query output reports it as
> active+clean, yet it re-enters a recovery state every morning between 9
> and 10am, in which it lists a "might_have_unfound" set and reports having
> probed all other set members. A deep scrub of the PG didn't turn up
> anything.
> You need to be more specific here. What do you mean it "enters into" the
> recovery state every morning?
> Here's what PG query showed me yesterday:
>     "recovery_state": [
>         {
>             "name": "Started\/Primary\/Active",
>             "enter_time": "2017-12-05 09:48:57.730385",
>             "might_have_unfound": [
>                 {
>                     "osd": "79(1)",
>                     "status": "already probed"
>                 },
>                 {
>                     "osd": "337(9)",
>                     "status": "already probed"
>                 },... it goes on to list all peers of this OSD in that PG.

IIRC that's just a normal thing when there's any kind of recovery happening
— it builds up a set during peering of OSDs that might have data, in case
it discovers stuff missing.
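As a sketch of what that set looks like in practice, here's how you could pull the "might_have_unfound" peers out of the pg query JSON (field names match the output quoted in this thread; the sample data below is abridged from George's paste, and the helper function is just illustrative, not a Ceph API):

```python
import json

# Abridged `ceph pg <pgid> query --format=json` output from the thread.
sample = json.loads("""
{
  "recovery_state": [
    {
      "name": "Started/Primary/Active",
      "enter_time": "2017-12-05 09:48:57.730385",
      "might_have_unfound": [
        {"osd": "79(1)", "status": "already probed"},
        {"osd": "337(9)", "status": "already probed"}
      ]
    }
  ]
}
""")

def unfound_candidates(pg_query):
    """Yield (osd, status) pairs from each recovery state's
    might_have_unfound list, if any."""
    for state in pg_query.get("recovery_state", []):
        for peer in state.get("might_have_unfound", []):
            yield peer["osd"], peer["status"]

for osd, status in unfound_candidates(sample):
    print(osd, status)
```

A set full of "already probed" entries, as here, means the primary has already asked those peers and is not actually waiting on them.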

> How many PGs are in your 8+3 pool, and are all your OSDs hosting EC pools?
> What are you using the cluster for?
> 2048 PGs in this pool, also another 2048 PG EC pool (same profile) and two
> more 1024 PG EC pools (also same profile). Then a set of RGW auxiliary
> pools with 3-way replication.
> I'm not 100% sure but I think all of our OSDs should have a few PGs from
> one of the EC pools. Our rules don't make a distinction so it's
> probabilistic. We're using the cluster as an object store, minor RGW use
> and custom gateways using libradosstriper.
> It's also worth pointing out that an OSD in that PG was taken out of the
> cluster earlier today and pg query shows the following weirdness:
> The primary thinks it's active+clean but, in the peer_info section, all
> peers report it as "active+undersized+degraded+remapped+backfilling". It
> has shown this discrepancy before, between the primary thinking it's a+c
> and the rest of the set seeing it as a+c+degraded.

Again, exactly what output makes you say the primary thinks it's
active+clean but the others have more complex recovery states?

> In the query output we're showing the following for recovery state:
> "recovery_state": [
>         {
>             "name": "Started\/Primary\/Active",
>             "enter_time": "2017-12-07 08:41:57.850220",
>             "might_have_unfound": [],
>             "recovery_progress": {
>                 "backfill_targets": [],
>                 "waiting_on_backfill": [],
>                 "last_backfill_started": "MIN",
>                 "backfill_info": {
>                     "begin": "MIN",
>                     "end": "MIN",
>                     "objects": []
>                 },
>                 "peer_backfill_info": [],
>                 "backfills_in_flight": [],
>                 "recovering": [],
>                 "pg_backend": {
>                     "recovery_ops": [],
>                     "read_ops": []
>                 }
> The cluster is now starting to manifest slow requests on the OSDs with the
> large LevelDBs, although not in the particular PG.

Well, there have been a few causes of large LevelDBs, but given that you
have degraded PGs and a bunch of EC pools, my guess is that the PG logs are
getting extended thanks to the PG states. EC PG logs can be much larger
than replicated ones, since EC pools need to be able to reverse the IO in
those cases. So you need to get your PGs clean first and then see if the
LevelDB shrinks down or not.