[ceph-users] Resolving Large omap objects in RGW index pool

Chris Sarginson csargiso at gmail.com
Thu Oct 18 07:42:33 PDT 2018

Hi Tom,

I used a slightly modified version of your script (echoing out the bucket
name, id and actual_id) to generate a list to compare against mine. It
returned substantially more indexes, including a number that show no
indication of resharding having been run or versioning being enabled, and
some with only minor differences in bucket_ids:

  5813 buckets_with_multiple_reindexes2.txt (my script)
  7999 buckets_with_multiple_reindexes3.txt (modified Tomasz script)
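
For reference, the only real change was inside the inner loop of the script
quoted below, swapping the purge/rm commands for a report line along these
lines (a sketch, reusing Tomasz's variable names):

    if [ "$actual_id" != "$instance" ]
    then
      echo "${bucket} ${instance} ${actual_id}"
    fi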

For example, a bucket has 2 entries:


running "radosgw-admin bucket stats" against this bucket shows the current
id as default.23407.9

None of the indexes (including the active one) shows multiple shards or
any resharding activity.

Using the command:
rados -p .rgw.buckets.index listomapvals .dir.${id}

Shows the other (lower) index ids as being empty, and the current one
containing the index data.
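
As a quick cross-check, counting the omap keys per index object makes the
empty (stale) instances obvious - a minimal sketch, assuming the instance ids
for the bucket have been collected into a hypothetical instances.txt:

while read -r id; do
  printf '%s: ' "${id}"
  rados -p .rgw.buckets.index listomapkeys .dir.${id} | wc -l
done < instances.txt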

I'm wondering if it is possible some of these are remnants from upgrades
(this cluster started as giant and has been upgraded through the LTS
releases to Luminous)?  Using radosgw-admin metadata get bucket.instance on
my sample bucket shows different "ver" information between them all:

    "ver": {

        "tag": "__17wYsZGbXIhRKtx3goicMV",
        "ver": 1
    "mtime": "2014-03-24 15:45:03.000000Z"

    "ver": {
        "tag": "_x5RWprsckrL3Bj8h7Mbwklt",
        "ver": 1
    "mtime": "2014-03-24 15:43:31.000000Z"

    "ver": {
        "tag": "_6sTOABOHCGTSZ-EEIZ29VSN",
        "ver": 4
    "mtime": "2017-08-10 15:06:38.940464Z",

This obviously still leaves me with the original issue I noticed, which is
multiple instances of buckets that seem to have been repeatedly resharded
to the same number of shards as the currently active index.  From searching
around the tracker it seems this may be the one worth following -
"Aborted dynamic resharding should clean up created bucket index objs":


Again, any other suggestions or ideas are greatly welcomed on this :)


On Wed, 17 Oct 2018 at 12:29 Tomasz Płaza <tomasz.plaza at grupawp.pl> wrote:

> Hi,
> I have a similar issue, and created a simple bash file to delete old
> indexes (it is a PoC and has not been tested in production):
> for bucket in `radosgw-admin metadata list bucket | jq -r '.[]' | sort`
> do
>   # current (live) instance id for this bucket
>   actual_id=`radosgw-admin bucket stats --bucket=${bucket} | jq -r '.id'`
>   for instance in `radosgw-admin metadata list bucket.instance | jq -r '.[]' | grep ${bucket}: | cut -d ':' -f 2`
>   do
>     # purge the index and remove the metadata for every instance that is not the live one
>     if [ "$actual_id" != "$instance" ]
>     then
>       radosgw-admin bi purge --bucket=${bucket} --bucket-id=${instance}
>       radosgw-admin metadata rm bucket.instance:${bucket}:${instance}
>     fi
>   done
> done
> I find it more readable than the mentioned one-liner. Any suggestions on this
> topic are greatly appreciated.
> Tom
> Hi,
> Having spent some time on the below issue, here are the steps I took to
> resolve the "Large omap objects" warning.  Hopefully this will help others
> who find themselves in this situation.
> I got the object ID and OSD ID implicated from the ceph cluster logfile on
> the mon.  I then proceeded to the implicated host containing the OSD, and
> extracted the implicated PG by running the following, and looking at which
> PG had started and completed a deep-scrub around the warning being logged:
> grep -C 200 Large /var/log/ceph/ceph-osd.*.log | egrep '(Large
> omap|deep-scrub)'
> If the bucket had not been sharded sufficiently (i.e. the cluster log showed
> a "Key Count" or "Size" over the thresholds), I ran through the manual
> sharding procedure (shown here:
> https://tracker.ceph.com/issues/24457#note-5)
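> For reference, the core of that manual procedure is an offline reshard along
> the lines of the following (a sketch using a hypothetical bucket name and
> shard count; the tracker note above has the full, authoritative steps):
>
> radosgw-admin bucket reshard --bucket=${bucketname} --num-shards=${num_shards}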
> Once this was successfully sharded, or if the bucket had already been
> sufficiently sharded by Ceph prior to disabling the functionality, I was
> able to use the following command (seemingly undocumented for Luminous,
> but described for Mimic: http://docs.ceph.com/docs/mimic/man/8/radosgw-admin/#commands):
> radosgw-admin bi purge --bucket ${bucketname} --bucket-id ${old_bucket_id}
> I then issued a ceph pg deep-scrub against the PG that had contained the
> Large omap object.
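> (i.e. ceph pg deep-scrub ${pgid}, with ${pgid} being the placement group
> identified from the OSD log grep earlier)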
> Once I had completed this procedure, my Large omap object warnings went
> away and the cluster returned to HEALTH_OK.
> However our radosgw bucket indexes pool now seems to be using
> substantially more space than previously.  Having looked initially at this
> bug, and in particular the first comment:
> http://tracker.ceph.com/issues/34307#note-1
> I was able to extract a number of bucket indexes that had apparently been
> resharded, and removed the legacy index using the radosgw-admin bi purge
> --bucket ${bucket} ${marker}.  I am still able to perform a radosgw-admin
> metadata get bucket.instance:${bucket}:${marker} successfully, however now
> when I run rados -p .rgw.buckets.index ls | grep ${marker} nothing is
> returned.  Even after this, we were still seeing extremely high disk usage
> of our OSDs containing the bucket indexes (we have a dedicated pool for
> this).  I then modified the one liner referenced in the previous link as
> follows:
> grep -E '"bucket"|"id"|"marker"' bucket-stats.out | awk -F ":" '{print $2}' | tr -d '",' | while read -r bucket; do
>   read -r id
>   read -r marker
>   [ "$id" == "$marker" ] && true || NEWID=`radosgw-admin --id rgw.ceph-rgw-1 metadata get bucket.instance:${bucket}:${marker} | python -c 'import sys, json; print json.load(sys.stdin)["data"]["bucket_info"]["new_bucket_instance_id"]'`
>   while [ ${NEWID} ]; do
>     if [ "${NEWID}" != "${marker}" ] && [ ${NEWID} != ${bucket} ]; then
>       echo "$bucket $NEWID"
>     fi
>     NEWID=`radosgw-admin --id rgw.ceph-rgw-1 metadata get bucket.instance:${bucket}:${NEWID} | python -c 'import sys, json; print json.load(sys.stdin)["data"]["bucket_info"]["new_bucket_instance_id"]'`
>   done
> done > buckets_with_multiple_reindexes2.txt
> This loops through the buckets that have a different marker/bucket_id, and
> looks to see if a new_bucket_instance_id is there, and if so will loop
> through until there is no longer a "new_bucket_instance_id".  After letting
> this complete, this suggests that I have over 5000 indexes for 74 buckets,
> some of these buckets have > 100 indexes apparently.
> ~# awk '{print $1}' buckets_with_multiple_reindexes2.txt | uniq | wc -l
> 74
> ~# wc -l buckets_with_multiple_reindexes2.txt
> 5813 buckets_with_multiple_reindexes2.txt
> This is running a single realm, multiple zone configuration, and no multi
> site sync, but the closest I can find to this issue is this bug
> https://tracker.ceph.com/issues/24603
> Should I be OK to loop through these indexes and remove any with a
> reshard_status of 2 and a new_bucket_instance_id that does not match the
> bucket_instance_id returned by the command:
> radosgw-admin bucket stats --bucket ${bucket}
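> To frame the question more concretely, the cleanup I have in mind would look
> roughly like the following (a sketch only, reusing Tomasz's purge/rm approach
> and assuming jq; the purge/rm lines are left commented out until I'm confident
> it is safe):
>
> for bucket in `radosgw-admin metadata list bucket | jq -r '.[]'`; do
>   actual_id=`radosgw-admin bucket stats --bucket=${bucket} | jq -r '.id'`
>   for instance in `radosgw-admin metadata list bucket.instance | jq -r '.[]' | grep ${bucket}: | cut -d ':' -f 2`; do
>     [ "$instance" == "$actual_id" ] && continue
>     meta=`radosgw-admin metadata get bucket.instance:${bucket}:${instance}`
>     status=`echo "$meta" | jq -r '.data.bucket_info.reshard_status'`
>     new_id=`echo "$meta" | jq -r '.data.bucket_info.new_bucket_instance_id'`
>     # only report instances that finished resharding (status 2) into some other instance
>     if [ "$status" == "2" ] && [ "$new_id" != "$actual_id" ]; then
>       echo "would purge ${bucket} instance ${instance} (resharded into ${new_id})"
>       # radosgw-admin bi purge --bucket=${bucket} --bucket-id=${instance}
>       # radosgw-admin metadata rm bucket.instance:${bucket}:${instance}
>     fi
>   done
> done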
> I'd ideally like to get to a point where I can turn dynamic sharding back
> on safely for this cluster.
> Thanks for any assistance, let me know if there's any more information I
> should provide
> Chris
> On Thu, 4 Oct 2018 at 18:22 Chris Sarginson <csargiso at gmail.com> wrote:
>> Hi,
>> Thanks for the response - I am still unsure as to what will happen to the
>> "marker" reference in the bucket metadata, as this is the object that is
>> being detected as Large.  Will the bucket generate a new "marker" reference
>> in the bucket metadata?
>> I've been reading this page to try and get a better understanding of this
>> http://docs.ceph.com/docs/luminous/radosgw/layout/
>> However I'm no clearer on this (and what the "marker" is used for), or
>> why there are multiple separate "bucket_id" values (with different mtime
>> stamps) that all show as having the same number of shards.
>> If I were to remove the old bucket would I just be looking to execute
>> rados -p .rgw.buckets.index rm .dir.default.5689810.107
>> Is the differing marker/bucket_id in the other buckets that was found
>> also an indicator?  As I say, there's a good number of these, here's some
>> additional examples, though these aren't necessarily reporting as large
>> omap objects:
>> "BUCKET1", "default.281853840.479", "default.105206134.5",
>> "BUCKET2", "default.364663174.1", "default.349712129.3674",
>> Checking these other buckets, they are exhibiting the same sort of
>> symptoms as the first (multiple instances of radosgw-admin metadata get
>> showing what seem to be multiple resharding processes being run, with
>> different mtimes recorded).
>> Thanks
>> Chris
>> On Thu, 4 Oct 2018 at 16:21 Konstantin Shalygin <k0ste at k0ste.ru> wrote:
>>> Hi,
>>> Ceph version: Luminous 12.2.7
>>> Following upgrading to Luminous from Jewel we have been stuck with a
>>> cluster in HEALTH_WARN state that is complaining about large omap objects.
>>> These all seem to be located in our .rgw.buckets.index pool.  We've
>>> disabled auto resharding on bucket indexes due to apparent looping issues
>>> after our upgrade.  We've reduced the number of reported large
>>> omap objects by initially increasing the following value:
>>> ~# ceph daemon mon.ceph-mon-1 config get
>>> osd_deep_scrub_large_omap_object_value_sum_threshold
>>> {
>>>     "osd_deep_scrub_large_omap_object_value_sum_threshold": "2147483648"
>>> }
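>>> (That threshold can be raised at runtime with something along the lines of
>>> ceph tell osd.* injectargs '--osd_deep_scrub_large_omap_object_value_sum_threshold=2147483648'
>>> - a sketch of the general approach rather than the exact command used here.)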
>>> However, we're still getting a warning about a single large OMAP object,
>>> which I don't believe is related to an unsharded index - here's the
>>> log entry:
>>> 2018-10-01 13:46:24.427213 osd.477 osd.477 8482 :
>>> cluster [WRN] Large omap object found. Object:
>>> 15:333d5ad7:::.dir.default.5689810.107:head Key count: 17467251 Size
>>> (bytes): 4458647149
>>> The object in the logs is the "marker" object, rather than the bucket_id -
>>> I've put some details regarding the bucket here:
>>> https://pastebin.com/hW53kTxL
>>> The bucket limit check shows that the index is sharded, so I think this
>>> might be related to versioning, although I was unable to get confirmation
>>> that the bucket in question has versioning enabled through the aws
>>> cli (snipped debug output below):
>>> 2018-10-02 15:11:17,530 - MainThread - botocore.parsers - DEBUG - Response
>>> headers: {'date': 'Tue, 02 Oct 2018 14:11:17 GMT', 'content-length': '137',
>>> 'x-amz-request-id': 'tx0000000000000020e3b15-005bb37c85-15870fe0-default',
>>> 'content-type': 'application/xml'}
>>> 2018-10-02 15:11:17,530 - MainThread - botocore.parsers - DEBUG - Response
>>> body:
>>> <?xml version="1.0" encoding="UTF-8"?><VersioningConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/"></VersioningConfiguration>
>>> After dumping the contents of the large omap object mentioned above into a
>>> file, it does seem to be a simple listing of the bucket contents, potentially an
>>> old index:
>>> ~# wc -l omap_keys
>>> 17467251 omap_keys
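>>> (The dump itself came from something like rados -p .rgw.buckets.index
>>> listomapkeys .dir.default.5689810.107 > omap_keys - one key per line, hence
>>> the wc -l above.)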
>>> This is approximately 5 million below the currently reported number of
>>> objects in the bucket.
>>> When running the commands listed here: http://tracker.ceph.com/issues/34307#note-1
>>> The problematic bucket is listed in the output (along with 72 other
>>> buckets):
>>> "CLIENTBUCKET", "default.294495648.690", "default.5689810.107"
>>> As this tests for bucket_id and marker fields not matching to print out the
>>> information, is the implication here that both of these should match in
>>> order to fully migrate to the new sharded index?
>>> I was able to do a "metadata get" using what appears to be the old index
>>> object ID, which seems to support this (there's a "new_bucket_instance_id"
>>> field, containing a newer "bucket_id" and reshard_status is 2, which seems
>>> to suggest it has completed).
>>> I am able to take the "new_bucket_instance_id" and get additional metadata
>>> about the bucket, each time I do this I get a slightly newer
>>> "new_bucket_instance_id", until it stops suggesting updated indexes.
>>> It's probably worth pointing out that when going through this process the
>>> final "bucket_id" doesn't match the one that I currently get when running
>>> 'radosgw-admin bucket stats --bucket "CLIENTBUCKET"', even though it also
>>> suggests that no further resharding has been done as "reshard_status" = 0
>>> and "new_bucket_instance_id" is blank.  The output is available to view
>>> here:
>>> https://pastebin.com/g1TJfKLU
>>> It would be useful if anyone can offer some clarification on how to proceed
>>> from this situation, identifying and removing any old/stale indexes from
>>> the index pool (if that is the case), as I've not been able to spot
>>> anything in the archives.
>>> If there's any further information that is needed for additional context
>>> please let me know.
>>> Usually, when your bucket is automatically resharded, in some cases the old
>>> big index is not deleted - this is your large omap object.
>>> This index is safe to delete. Also look at [1].
>>> [1] https://tracker.ceph.com/issues/24457
>>> k
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
