[ceph-users] Using bluestore in Jewel 10.0.4

Mark Nelson mnelson at redhat.com
Mon Mar 14 09:52:26 PDT 2016


Hi Folks,

We are actually in the middle of doing some bluestore testing/tuning for 
the upstream jewel release as we speak. :)  These are (so far) pure HDD 
tests using 4 nodes with 4 spinning disks each and no SSDs.

Basically, the write side is looking fantastic, which is great since 
that's an area we really wanted to improve.  On the read side, we are 
working on getting sequential read performance up for certain IO sizes. 
We are more dependent on client-side readahead with bluestore, since 
there is no underlying filesystem below the OSDs helping us out.  This 
usually isn't a problem in practice since the VM should be doing its own 
readahead, but when testing with fio's RBD engine you should probably 
enable client-side RBD readahead:

rbd readahead disable after bytes = 0
rbd readahead max bytes = 4194304

Again, this probably only matters when directly using librbd.
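
For reference, a bare-bones fio job along these lines should exercise 
librbd directly (a rough sketch; the pool and image names below are just 
placeholders, and the two readahead options above go in the [global] or 
[client] section of ceph.conf on the client running fio):

# seqread.fio - sequential reads straight through librbd
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio-test
rw=read
bs=4M
iodepth=16
runtime=60
time_based=1

[seq-read]

Create the test image first (e.g. "rbd create --size 10240 fio-test") 
and kick it off with "fio seqread.fio".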

The other question is whether to use buffered reads by default in 
bluestore, i.e. setting:

"bluestore default buffered read = true"

That's what we are working on testing now.  I've included the ceph.conf 
used for these tests and also a link to some of our recent results. 
Please download the file and open it in LibreOffice, as Google's preview 
doesn't show the graphs.

Here's how the legend is setup:

Hammer-FS: Hammer + Filestore
6dba7fd-BS (No RBD RA): Master + Fixes + Bluestore
6dba7fd-BS (4M RBD RA): Master + Fixes + Bluestore + 4M RBD Read Ahead
c1e41afb-FS: Master + Filestore + new journal throttling + Sam's tuning

https://drive.google.com/file/d/0B2gTBZrkrnpZMl9OZ18yS3NuZEU/view?usp=sharing

Mark

On 03/14/2016 11:04 AM, Kenneth Waegeman wrote:
> Hi Stefan,
>
> We are also interested in bluestore, but have not yet looked into it.
>
> We tried keyvaluestore before, and that could be enabled by setting
> the osd objectstore value.
> In this ticket http://tracker.ceph.com/issues/13942 I see:
>
> [global]
>          enable experimental unrecoverable data corrupting features = *
>          bluestore fsck on mount = true
>          bluestore block db size = 67108864
>          bluestore block wal size = 134217728
>          bluestore block size = 5368709120
>          osd objectstore = bluestore
>
> So I guess this could work for bluestore too.
>
> Very curious to hear what you see stability- and performance-wise :)
>
> Cheers,
> Kenneth
>
> On 14/03/16 16:03, Stefan Lissmats wrote:
>> Hello everyone!
>>
>> I think the new bluestore sounds great and I wanted to try it out in
>> my test environment, but I didn't find any documentation on how to use
>> it.  I finally managed to test it, and it really looks promising
>> performance-wise.
>> If anyone has more information or guides for bluestore, please point
>> me to them.
>>
>> I thought I would share how I managed to get a new Jewel cluster with
>> bluestore-based OSDs to work.
>>
>>
>> What I found so far is that ceph-disk can create new bluestore OSDs
>> (but not ceph-deploy, please correct me if I'm wrong), and I need to
>> have "enable experimental unrecoverable data corrupting features =
>> bluestore rocksdb" in the [global] section of ceph.conf.
>> After that I can create new OSDs with ceph-disk prepare --bluestore
>> /dev/sdg
>>
>> So I created a cluster with ceph-deploy without any OSDs and then
>> used ceph-disk on the hosts to create the OSDs.
>>
>> Pretty simple in the end, but it took me a while to figure that out.
>>
-------------- next part --------------
[global]
        enable experimental unrecoverable data corrupting features = bluestore rocksdb
#        enable experimental unrecoverable data corrupting features = bluestore rocksdb ms-type-async
        osd objectstore = bluestore
#        bluestore sync wal apply = false
#        bluestore overlay max = 0
#        bluestore_wal_threads = 8
#        rocksdb_write_buffer_size = 536870912
#        rocksdb_write_buffer_num = 4
#        rocksdb_min_write_buffer_number_to_merge = 2
        rocksdb_log = /tmp/cbt/ceph/log/rocksdb.log
#        rocksdb_max_background_compactions = 4
#        rocksdb_compaction_threads = 4
#        rocksdb_level0_file_num_compaction_trigger = 4
#        rocksdb_max_bytes_for_level_base = 104857600  # 100MB
#        rocksdb_target_file_size_base = 10485760      # 10MB
#        rocksdb_num_levels = 3
#        rocksdb_compression = none
#        bluestore_min_alloc_size = 32768

#        ms_type = async

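        # client-side librbd readahead and bluestore buffered reads (see notes above)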
        rbd readahead disable after bytes = 0
        rbd readahead max bytes = 4194304
        bluestore default buffered read = true
        osd pool default size = 1
        osd crush chooseleaf type = 0

        keyring = /tmp/cbt/ceph/keyring
        osd pg bits = 8
        osd pgp bits = 8
        auth supported = none
        log to syslog = false
        log file = /tmp/cbt/ceph/log/$name.log
        filestore xattr use omap = true
        auth cluster required = none
        auth service required = none
        auth client required = none

        public network = 10.0.10.0/24
        cluster network = 10.0.10.0/24
        rbd cache = true
        rbd cache writethrough until flush = false
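        # scrub intervals are in seconds; these huge values effectively disable scheduled scrubbing during the benchmark runs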
        osd scrub load threshold = 0.01
        osd scrub min interval = 137438953472
        osd scrub max interval = 137438953472
        osd deep scrub interval = 137438953472
        osd max scrubs = 16

        filestore merge threshold = 40
        filestore split multiple = 8
        osd op threads = 8

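        # debug levels are "log level/in-memory level"; 0/0 silences subsystem logging during the tests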
        debug_bluefs = "0/0"
        debug_bluestore = "0/0"
        debug_bdev = "0/0"
#        debug_bluefs = "20"
#        debug_bluestore = "30"
#        debug_bdev = "20"
        debug_lockdep = "0/0" 
        debug_context = "0/0"
        debug_crush = "0/0"
        debug_mds = "0/0"
        debug_mds_balancer = "0/0"
        debug_mds_locker = "0/0"
        debug_mds_log = "0/0"
        debug_mds_log_expire = "0/0"
        debug_mds_migrator = "0/0"
        debug_buffer = "0/0"
        debug_timer = "0/0"
        debug_filer = "0/0"
        debug_objecter = "0/0"
        debug_rados = "0/0"
        debug_rbd = "0/0"
        debug_journaler = "0/0"
        debug_objectcacher = "0/0"
        debug_client = "0/0"
        debug_osd = "0/0"
#        debug_osd = "30"
        debug_optracker = "0/0"
        debug_objclass = "0/0"
        debug_filestore = "0/0"
        debug_journal = "0/0"
        debug_ms = "0/0"
#        debug_ms = 1
        debug_mon = "0/0"
        debug_monc = "0/0"
        debug_paxos = "0/0"
        debug_tp = "0/0"
        debug_auth = "0/0"
        debug_finisher = "0/0"
        debug_heartbeatmap = "0/0"
        debug_perfcounter = "0/0"
        debug_rgw = "0/0"
        debug_hadoop = "0/0"
        debug_asok = "0/0"
        debug_throttle = "0/0"

        mon pg warn max object skew = 100000
        mon pg warn min per osd = 0
        mon pg warn max per osd = 32768

[client]
        log_file = /var/log/ceph/ceph-rbd.log
        admin_socket = /var/run/ceph/ceph-rbd.asok

[mon]
        mon data = /tmp/cbt/ceph/mon.$id

[mon.a]
        host = incerta01.front.sepia.ceph.com
        mon addr = 10.0.10.101:6789

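# 16 OSDs total: one per spinning disk, four on each of incerta01-04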
[osd.0]
        host = incerta01.front.sepia.ceph.com
        osd data = /tmp/cbt/mnt/osd-device-0-data
        osd journal = /dev/disk/by-partlabel/osd-device-0-journal

[osd.1]
        host = incerta01.front.sepia.ceph.com
        osd data = /tmp/cbt/mnt/osd-device-1-data
        osd journal = /dev/disk/by-partlabel/osd-device-1-journal

[osd.2]
        host = incerta01.front.sepia.ceph.com
        osd data = /tmp/cbt/mnt/osd-device-2-data
        osd journal = /dev/disk/by-partlabel/osd-device-2-journal

[osd.3]
        host = incerta01.front.sepia.ceph.com
        osd data = /tmp/cbt/mnt/osd-device-3-data
        osd journal = /dev/disk/by-partlabel/osd-device-3-journal

[osd.4]
        host = incerta02.front.sepia.ceph.com
        osd data = /tmp/cbt/mnt/osd-device-0-data
        osd journal = /dev/disk/by-partlabel/osd-device-0-journal

[osd.5]
        host = incerta02.front.sepia.ceph.com
        osd data = /tmp/cbt/mnt/osd-device-1-data
        osd journal = /dev/disk/by-partlabel/osd-device-1-journal

[osd.6]
        host = incerta02.front.sepia.ceph.com
        osd data = /tmp/cbt/mnt/osd-device-2-data
        osd journal = /dev/disk/by-partlabel/osd-device-2-journal

[osd.7]
        host = incerta02.front.sepia.ceph.com
        osd data = /tmp/cbt/mnt/osd-device-3-data
        osd journal = /dev/disk/by-partlabel/osd-device-3-journal

[osd.8]
        host = incerta03.front.sepia.ceph.com
        osd data = /tmp/cbt/mnt/osd-device-0-data
        osd journal = /dev/disk/by-partlabel/osd-device-0-journal

[osd.9]
        host = incerta03.front.sepia.ceph.com
        osd data = /tmp/cbt/mnt/osd-device-1-data
        osd journal = /dev/disk/by-partlabel/osd-device-1-journal

[osd.10]
        host = incerta03.front.sepia.ceph.com
        osd data = /tmp/cbt/mnt/osd-device-2-data
        osd journal = /dev/disk/by-partlabel/osd-device-2-journal

[osd.11]
        host = incerta03.front.sepia.ceph.com
        osd data = /tmp/cbt/mnt/osd-device-3-data
        osd journal = /dev/disk/by-partlabel/osd-device-3-journal

[osd.12]
        host = incerta04.front.sepia.ceph.com
        osd data = /tmp/cbt/mnt/osd-device-0-data
        osd journal = /dev/disk/by-partlabel/osd-device-0-journal

[osd.13]
        host = incerta04.front.sepia.ceph.com
        osd data = /tmp/cbt/mnt/osd-device-1-data
        osd journal = /dev/disk/by-partlabel/osd-device-1-journal

[osd.14]
        host = incerta04.front.sepia.ceph.com
        osd data = /tmp/cbt/mnt/osd-device-2-data
        osd journal = /dev/disk/by-partlabel/osd-device-2-journal

[osd.15]
        host = incerta04.front.sepia.ceph.com
        osd data = /tmp/cbt/mnt/osd-device-3-data
        osd journal = /dev/disk/by-partlabel/osd-device-3-journal


