[ceph-users] Bluestore OSD_DATA, WAL & DB

Maged Mokhtar mmokhtar at petasan.org
Fri Nov 3 07:26:47 PDT 2017


On 2017-11-03 15:59, Wido den Hollander wrote:

> Op 3 november 2017 om 14:43 schreef Mark Nelson <mnelson at redhat.com>:
> 
> On 11/03/2017 08:25 AM, Wido den Hollander wrote: 
> Op 3 november 2017 om 13:33 schreef Mark Nelson <mnelson at redhat.com>:
> 
> On 11/03/2017 02:44 AM, Wido den Hollander wrote: 
> Op 3 november 2017 om 0:09 schreef Nigel Williams <nigel.williams at tpac.org.au>:
> 
> On 3 November 2017 at 07:45, Martin Overgaard Hansen <moh at multihouse.dk> wrote:
> I want to bring this subject back into the light and hope someone can provide
> insight regarding the issue, thanks.
> Thanks Martin, I was going to do the same.
> 
> Is it possible to make the DB partition (on the fastest device) too
> big? In other words, is there a point where, for a given set of OSDs
> (number + size), the DB partition is sized too large and is wasting
> resources? I recall a comment by someone proposing to split up a
> single large (fast) SSD into 100GB partitions for each OSD.

It depends on the size of your backing disk. The DB grows with the
number of objects you have on your OSD.

A 4TB drive will usually hold more objects than a 1TB drive; the same goes
for a 10TB vs a 6TB drive.

From what I've seen so far there is no such thing as a 'too big' DB.

The tests I've done so far seem to suggest that filling up a 50GB DB is
rather hard to do. But it's a different story if you have billions of
objects and thus tens of millions of objects per OSD. 
Are you doing RBD, RGW, or something else to test?  What size are the
objects and are you fragmenting them? 

> Let's say the avg overhead is 16k; then you would need a 150GB DB for 10M objects.
> 
> You could look into your current numbers and check how many objects you have per OSD.
> 
> I checked a couple of Ceph clusters I run and see about 1M objects per OSD, but others only have 250k objects per OSD.
> 
> In all those cases even with 32k you would need a 30GB DB with 1M objects in that OSD.
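To make those figures concrete, here is a small back-of-the-envelope calculation. This is only a sketch: the 16k and 32k per-object overheads are the estimates quoted in this thread, not official numbers.

```python
# Back-of-the-envelope BlueStore DB sizing.
# The per-object overheads (16 KiB / 32 KiB) are estimates from this
# thread, not official figures.

def db_size_gib(num_objects, overhead_bytes):
    """Estimated RocksDB (block.db) size in GiB."""
    return num_objects * overhead_bytes / 2**30

# 10M objects at ~16 KiB each: roughly 150 GiB, as mentioned above.
print(f"{db_size_gib(10_000_000, 16 * 1024):.0f} GiB")  # ~153 GiB
# 1M objects at ~32 KiB each: roughly 30 GiB.
print(f"{db_size_gib(1_000_000, 32 * 1024):.0f} GiB")   # ~31 GiB
```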
> 
>> The answer could be couched as some intersection of pool type (RBD /
>> RADOS / CephFS), object change(update?) intensity, size of OSD etc and
>> rule-of-thumb.
> 
> I would check your running Ceph clusters and calculate the amount of objects per OSD.
> 
> total objects / num osd * 3
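That rule of thumb (the factor 3 assumes 3x replication) can be sketched as follows; the cluster numbers are hypothetical:

```python
# Sketch of the rule of thumb above: estimate objects per OSD from the
# cluster-wide object count, assuming 3x replication.

def objects_per_osd(total_objects, num_osds, replicas=3):
    """Approximate number of object copies each OSD stores."""
    return total_objects * replicas // num_osds

# e.g. a hypothetical cluster with 100M logical objects on 300 OSDs:
print(objects_per_osd(100_000_000, 300))  # 1000000
```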

One nagging concern I have in the back of my mind is that the amount of
space amplification in RocksDB might grow with the number of levels (i.e.
the number of objects).  The space used per object might be different at
10M objects and 50M objects.

True. But how many systems do we have out there with 10M objects in ONE
OSD?

The systems I checked range from 250k to 1M objects per OSD. Of course,
statistics aren't a golden rule, but users will want some
guideline on how to size their DB. 
That's actually something I would really like better insight into.  I 
don't feel like I have a sufficient understanding of how many 
objects/OSD people are really deploying in the field.  I figure 10M/OSD 
is probably a reasonable "typical" upper limit for HDDs, but I could see
some use cases with flash-backed SSDs pushing far more. 
Would a poll on the ceph-users list work? I understand that you require
such feedback to make a proper judgement.

I know of one cluster which has 10M objects (heavy, heavy, heavy RGW
user) in about 400TB of data.

All other clusters I've seen aren't that high on the number of objects.
They are usually high on data since they have an RBD use case, which means
a lot of 4MB objects.

You could also ask users to use this tool:
https://github.com/42on/ceph-collect

That tarball would give you a lot of information about the cluster and
the amount of objects per OSD and PG.

Wido

>> WAL should be sufficient with 1GB~2GB, right?
> 
> Yep.  On the surface this appears to be a simple question, but a much 
> deeper question is what are we actually doing with the WAL?  How should 
> we be storing PG log and dup ops data?  How can we get away from the 
> large WAL buffers and memtables we have now?  These are questions we are 
> actively working on solving.  For the moment though, having multiple (4) 
> 256MB WAL buffers appears to give us the best performance despite 
> resulting in large memtables, so 1-2GB for the WAL is right.
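For reference, those numbers correspond to RocksDB's write-buffer settings, which BlueStore passes through via bluestore_rocksdb_options. The fragment below is only a sketch of what such a configuration looks like; the option names are RocksDB's, and the exact default string varies between releases, so check your own cluster before changing anything.

```ini
# ceph.conf -- illustrative fragment only; verify against your release with:
#   ceph daemon osd.0 config get bluestore_rocksdb_options
[osd]
# 4 write buffers (memtables) of 256 MiB each, matching the
# "multiple (4) 256MB WAL buffers" described above.
bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,write_buffer_size=268435456
```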
> 
> Mark
> 
> Wido
> 
> Wido
> 
> An idea occurred to me that by monitoring for the logged spill message
> (the event when the DB partition spills/overflows to the OSD), OSDs
> could be (lazily) destroyed and recreated with a new DB partition
> increased in size say by 10% each time.
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

I agree with Wido that RBD is a very common use case, so at least we can
start with recommendations for it. In this case: 

Number of objects per OSD = 750k per TB of OSD disk capacity

So for an avg of 16k per object: 12G per TB. For 32k per object: 24G per
TB. 
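As a quick check of that arithmetic (750k objects per TB is this thread's estimate for the RBD case, and the 16k/32k overheads are the estimates discussed above):

```python
# Sketch of the RBD rule of thumb above: ~750k objects per TB of OSD
# capacity (4MB RBD objects), times an assumed per-object DB overhead.

OBJECTS_PER_TB = 750_000  # estimate from this thread, RBD use case

def db_gb_per_tb(overhead_bytes):
    """Estimated DB space (decimal GB) needed per TB of OSD capacity."""
    return OBJECTS_PER_TB * overhead_bytes / 1e9

print(db_gb_per_tb(16_000))  # 12.0 -> ~12G per TB
print(db_gb_per_tb(32_000))  # 24.0 -> ~24G per TB
```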

Maged