[ceph-users] Resolving Large omap objects in RGW index pool

Chris Sarginson csargiso at gmail.com
Tue Oct 16 11:11:48 PDT 2018


Hi,

Having spent some time on the issue below, I wanted to share the steps I took
to resolve the "Large omap objects" warning, in the hope that it helps others
who find themselves in this situation.

I got the object name and the OSD ID from the ceph cluster logfile on the
mon.  I then went to the host containing that OSD and identified the affected
PG by running the following and looking for the PG that had started and
completed a deep-scrub around the time the warning was logged:

grep -C 200 Large /var/log/ceph/ceph-osd.*.log | egrep '(Large omap|deep-scrub)'

If the bucket had not been sharded sufficiently (i.e. the cluster log showed
a "Key Count" or "Size" over the thresholds), I ran through the manual
sharding procedure shown here:
https://tracker.ceph.com/issues/24457#note-5
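
For reference, the manual reshard itself boils down to something along the
lines of the command below (the full procedure is in the note above); the
shard count here is only illustrative and should be sized to keep each shard
well under the omap thresholds (the usual guidance is roughly 100k objects
per shard):

radosgw-admin bucket reshard --bucket=${bucketname} --num-shards=${num_shards}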

Once the bucket had been resharded successfully, or if it had already been
sharded sufficiently by Ceph before we disabled dynamic resharding, I was
able to use the following command (seemingly undocumented for Luminous,
though it appears in the Mimic man page:
http://docs.ceph.com/docs/mimic/man/8/radosgw-admin/#commands):

radosgw-admin bi purge --bucket ${bucketname} --bucket-id ${old_bucket_id}

I then issued a ceph pg deep-scrub against the PG that had contained the
Large omap object.
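
That is, with ${pgid} being the PG identified from the OSD log earlier:

ceph pg deep-scrub ${pgid}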

Once I had completed this procedure, my Large omap object warnings went
away and the cluster returned to HEALTH_OK.

However, our radosgw bucket index pool now seems to be using substantially
more space than before.  I initially looked at this bug, and in particular
the first comment:

http://tracker.ceph.com/issues/34307#note-1

I was able to extract a number of bucket indexes that had apparently been
resharded, and removed the legacy index for each using radosgw-admin bi purge
--bucket ${bucket} --bucket-id ${marker}.  I am still able to run
radosgw-admin metadata get bucket.instance:${bucket}:${marker} successfully,
but when I run rados -p .rgw.buckets.index ls | grep ${marker} nothing is
returned.  Even after this, we were still seeing extremely high disk usage
on the OSDs containing the bucket indexes (we have a dedicated pool for
this).  I then modified the one-liner referenced in the previous link as
follows:

grep -E '"bucket"|"id"|"marker"' bucket-stats.out | awk -F ":" '{print $2}' | tr -d '",' | \
while read -r bucket; do
  read -r id
  read -r marker
  # skip buckets whose bucket_id still matches their marker
  [ "$id" == "$marker" ] && continue
  NEWID=$(radosgw-admin --id rgw.ceph-rgw-1 metadata get bucket.instance:${bucket}:${marker} | \
    python -c 'import sys, json; print json.load(sys.stdin)["data"]["bucket_info"]["new_bucket_instance_id"]')
  # follow the chain of new_bucket_instance_id values until there are no more
  while [ "${NEWID}" ]; do
    if [ "${NEWID}" != "${marker}" ] && [ "${NEWID}" != "${bucket}" ]; then
      echo "$bucket $NEWID"
    fi
    NEWID=$(radosgw-admin --id rgw.ceph-rgw-1 metadata get bucket.instance:${bucket}:${NEWID} | \
      python -c 'import sys, json; print json.load(sys.stdin)["data"]["bucket_info"]["new_bucket_instance_id"]')
  done
done > buckets_with_multiple_reindexes2.txt

This loops through the buckets that have a different marker/bucket_id, checks
whether a new_bucket_instance_id is present, and if so follows the chain until
there is no longer a "new_bucket_instance_id".  After letting this complete,
it suggests that I have over 5000 indexes for 74 buckets; some of those
buckets apparently have more than 100 indexes.

~# awk '{print $1}' buckets_with_multiple_reindexes2.txt | uniq | wc -l
74
~# wc -l buckets_with_multiple_reindexes2.txt
5813 buckets_with_multiple_reindexes2.txt
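
For the per-bucket counts mentioned above, something along these lines shows
the worst offenders:

awk '{print $1}' buckets_with_multiple_reindexes2.txt | sort | uniq -c | sort -rn | head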

This is a single realm, multiple zone configuration with no multi-site sync;
the closest thing I can find to this issue is this bug:
https://tracker.ceph.com/issues/24603

Should I be OK to loop through these indexes and remove any that have a
reshard_status of 2 and a new_bucket_instance_id that does not match the
bucket_instance_id returned by the command:

radosgw-admin bucket stats --bucket ${bucket}
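
As a sanity check before deleting anything, I was planning on something along
these lines (untested, and assuming reshard_status sits alongside
new_bucket_instance_id in the bucket_info section of the metadata) to print
each candidate instance's reshard_status and new_bucket_instance_id for
comparison against the bucket_id that bucket stats currently reports:

while read -r bucket instance; do
  echo -n "${bucket} ${instance} "
  radosgw-admin --id rgw.ceph-rgw-1 metadata get bucket.instance:${bucket}:${instance} | \
    python -c 'import sys, json; bi = json.load(sys.stdin)["data"]["bucket_info"]; print bi["reshard_status"], bi["new_bucket_instance_id"]'
done < buckets_with_multiple_reindexes2.txt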

I'd ideally like to get to a point where I can turn dynamic sharding back
on safely for this cluster.

Thanks for any assistance; let me know if there's any more information I
should provide.
Chris

On Thu, 4 Oct 2018 at 18:22 Chris Sarginson <csargiso at gmail.com> wrote:

> Hi,
>
> Thanks for the response - I am still unsure as to what will happen to the
> "marker" reference in the bucket metadata, as this is the object that is
> being detected as Large.  Will the bucket generate a new "marker" reference
> in the bucket metadata?
>
> I've been reading this page to try and get a better understanding of this
> http://docs.ceph.com/docs/luminous/radosgw/layout/
>
> However, I'm no clearer on this (and what the "marker" is used for), or why
> there are multiple separate "bucket_id" values (with different mtime
> stamps) that all show as having the same number of shards.
>
> If I were to remove the old bucket would I just be looking to execute
>
> rados -p .rgw.buckets.index rm .dir.default.5689810.107
>
> Is the differing marker/bucket_id found in the other buckets also an
> indicator?  As I say, there's a good number of these; here are some
> additional examples, though these aren't necessarily reporting as large
> omap objects:
>
> "BUCKET1", "default.281853840.479", "default.105206134.5",
> "BUCKET2", "default.364663174.1", "default.349712129.3674",
>
> Checking these other buckets, they are exhibiting the same sort of
> symptoms as the first (multiple instances of radosgw-admin metadata get
> showing what seem to be multiple resharding processes being run, with
> different mtimes recorded).
>
> Thanks
> Chris
>
> On Thu, 4 Oct 2018 at 16:21 Konstantin Shalygin <k0ste at k0ste.ru> wrote:
>
>> Hi,
>>
>> Ceph version: Luminous 12.2.7
>>
>> Following our upgrade from Jewel to Luminous, we have been stuck with a
>> cluster in HEALTH_WARN state that is complaining about large omap objects.
>> These all seem to be located in our .rgw.buckets.index pool.  We've
>> disabled auto resharding on bucket indexes due to what seemed to be looping
>> issues after our upgrade.  We've reduced the number of reported large omap
>> objects by initially increasing the following value:
>>
>> ~# ceph daemon mon.ceph-mon-1 config get
>> osd_deep_scrub_large_omap_object_value_sum_threshold
>> {
>>     "osd_deep_scrub_large_omap_object_value_sum_threshold": "2147483648"
>> }
>>
>> However, we're still getting a warning about a single large omap object,
>> which I don't believe is related to an unsharded index - here's the
>> log entry:
>>
>> 2018-10-01 13:46:24.427213 osd.477 osd.477 172.26.216.6:6804/2311858 8482 :
>> cluster [WRN] Large omap object found. Object:
>> 15:333d5ad7:::.dir.default.5689810.107:head Key count: 17467251 Size
>> (bytes): 4458647149
>>
>> The object in the logs is the "marker" object, rather than the bucket_id -
>> I've put some details regarding the bucket here:
>> https://pastebin.com/hW53kTxL
>>
>> The bucket limit check shows that the index is sharded, so I think this
>> might be related to versioning, although I was unable to get confirmation
>> that the bucket in question has versioning enabled through the aws
>> cli (snipped debug output below):
>>
>> 2018-10-02 15:11:17,530 - MainThread - botocore.parsers - DEBUG - Response
>> headers: {'date': 'Tue, 02 Oct 2018 14:11:17 GMT', 'content-length': '137',
>> 'x-amz-request-id': 'tx0000000000000020e3b15-005bb37c85-15870fe0-default',
>> 'content-type': 'application/xml'}
>> 2018-10-02 15:11:17,530 - MainThread - botocore.parsers - DEBUG - Response
>> body:
>> <?xml version="1.0" encoding="UTF-8"?><VersioningConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/"></VersioningConfiguration>
>>
>> After dumping the contents of the large omap object mentioned above into a
>> file, it does seem to be a simple listing of the bucket contents, potentially an
>> old index:
>>
>> ~# wc -l omap_keys
>> 17467251 omap_keys
>>
>> This is approximately 5 million below the currently reported number of
>> objects in the bucket.
>>
>> When running the commands listed here: http://tracker.ceph.com/issues/34307#note-1
>>
>> The problematic bucket is listed in the output (along with 72 other
>> buckets):
>> "CLIENTBUCKET", "default.294495648.690", "default.5689810.107"
>>
>> As this prints the information for buckets whose bucket_id and marker fields
>> do not match, is the implication here that both of these should match once a
>> bucket has fully migrated to the new sharded index?
>>
>> I was able to do a "metadata get" using what appears to be the old index
>> object ID, which seems to support this (there's a "new_bucket_instance_id"
>> field, containing a newer "bucket_id" and reshard_status is 2, which seems
>> to suggest it has completed).
>>
>> I am able to take the "new_bucket_instance_id" and get additional metadata
>> about the bucket; each time I do this I get a slightly newer
>> "new_bucket_instance_id", until it stops suggesting updated indexes.
>>
>> It's probably worth pointing out that when going through this process the
>> final "bucket_id" doesn't match the one that I currently get when running
>> 'radosgw-admin bucket stats --bucket "CLIENTBUCKET"', even though it also
>> suggests that no further resharding has been done as "reshard_status" = 0
>> and "new_bucket_instance_id" is blank.  The output is available to view
>> here:
>> https://pastebin.com/g1TJfKLU
>>
>> It would be useful if anyone can offer some clarification on how to proceed
>> from this situation, identifying and removing any old/stale indexes from
>> the index pool (if that is the case), as I've not been able to spot
>> anything in the archives.
>>
>> If there's any further information that is needed for additional context
>> please let me know.
>>
>>
>> Usually, when your bucket is automatically resharded, in some cases the old
>> big index is not deleted - this is your large omap object.
>>
>> This index is safe to delete. Also look at [1].
>>
>>
>> [1] https://tracker.ceph.com/issues/24457
>>
>>
>>
>> k
>>
>