[ceph-users] ceph all-nvme mysql performance tuning

Donny Davis donny at fortnebula.com
Mon Nov 27 08:19:05 PST 2017


Also, which tuned profile are you using? There is something to be gained by
using a tuned profile that matches your workload.
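
A quick way to check this on each node, as a minimal sketch: tuned normally records the name of the active profile in /etc/tuned/active_profile (treat that path as an assumption for your distribution; `tuned-adm active` should report the same thing). The helper name below is made up for illustration.

```python
from pathlib import Path
from typing import Optional


def read_active_tuned_profile(path: str = "/etc/tuned/active_profile") -> Optional[str]:
    """Return the active tuned profile name, or None if it cannot be read."""
    try:
        text = Path(path).read_text().strip()
    except OSError:
        # File missing or unreadable: tuned is probably not installed or not running.
        return None
    return text or None


if __name__ == "__main__":
    print(read_active_tuned_profile() or "no active tuned profile found")
```

Running it on every OSD node and on the DB client makes it easy to spot a host that was left on a different profile from the rest.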

On Mon, Nov 27, 2017 at 11:16 AM, Donny Davis <donny at fortnebula.com> wrote:

> Why not ask Red Hat? All the rest of the storage vendors you are looking
> at are not free.
>
> Full disclosure, I am an employee at Red Hat.
>
> On Mon, Nov 27, 2017 at 10:16 AM, Nick Fisk <nick at fisk.me.uk> wrote:
>
>> *From:* ceph-users [mailto:ceph-users-bounces at lists.ceph.com] *On Behalf
>> Of *German Anders
>> *Sent:* 27 November 2017 14:44
>> *To:* Maged Mokhtar <mmokhtar at petasan.org>
>> *Cc:* ceph-users <ceph-users at lists.ceph.com>
>> *Subject:* Re: [ceph-users] ceph all-nvme mysql performance tuning
>>
>>
>>
>> Hi Maged,
>>
>>
>>
>> Thanks a lot for the response. We tried different numbers of threads
>> and we're getting almost the same kind of difference between the storage
>> types. We're going to try different rbd stripe size and object size values
>> and see if we get more competitive numbers. We'll get back with more tests
>> and parameter changes to see if things improve :)
>>
>>
>>
>>
>>
>> Just to echo a couple of comments: Ceph will always struggle to match the
>> performance of a traditional array, for two main reasons.
>>
>>
>>
>>    1. You are replacing some sort of dual ported SAS or internally RDMA
>>    connected device with a network for Ceph replication traffic. This will
>>    instantly have a large impact on write latency
>>    2. Ceph locks at the PG level and a PG will most likely cover at
>>    least one 4MB object, so lots of small accesses to the same blocks (on a
>>    block device) will wait on each other and go effectively at a single
>>    threaded rate.
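
To put a rough number on point 2, here is a back-of-the-envelope sketch. It assumes that writes hitting the same PG are fully serialized, so each one waits for the previous one to complete; the latencies in the loop are hypothetical, not measurements from this cluster.

```python
def serialized_iops_ceiling(latency_ms: float) -> float:
    """Upper bound on ops/s when each op must wait for the previous one to finish."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return 1000.0 / latency_ms


if __name__ == "__main__":
    # Hypothetical per-write round-trip latencies, in milliseconds.
    for latency_ms in (0.5, 1.0, 2.0):
        ceiling = serialized_iops_ceiling(latency_ms)
        print(f"{latency_ms:.1f} ms per op -> at most {ceiling:.0f} ops/s")
```

Under that assumption, adding OSDs or client threads does not raise the ceiling for a hot block; only lower per-operation latency does.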
>>
>>
>>
>> The best thing you can do to mitigate these is to run the fastest
>> journal/WAL devices you can, the fastest network connections (i.e. 25Gb/s) and
>> keep your CPUs in their highest-performance C and P states.
>>
>>
>>
>> You stated that you are running the performance profile on the CPUs.
>> Could you also just double check that the C-states are being held at C1(e)?
>> There are a few utilities that can show this in real time.
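
If no such utility is at hand, the kernel's cpuidle counters can be read directly. This is a minimal sketch, assuming the standard Linux sysfs layout (cpuidle/stateN/name and cpuidle/stateN/time, the latter in microseconds); the function name is made up for illustration.

```python
from pathlib import Path
from typing import Dict


def cstate_residency(cpu_dir: str = "/sys/devices/system/cpu/cpu0") -> Dict[str, int]:
    """Map each C-state name to its cumulative residency (microseconds) for one CPU."""
    residency: Dict[str, int] = {}
    cpuidle = Path(cpu_dir, "cpuidle")
    if not cpuidle.is_dir():
        # No cpuidle driver exposed for this CPU (or not running on Linux).
        return residency
    for state in sorted(cpuidle.glob("state[0-9]*")):
        name = (state / "name").read_text().strip()
        residency[name] = int((state / "time").read_text())
    return residency


if __name__ == "__main__":
    for name, usec in cstate_residency().items():
        print(f"{name}: {usec} us")
```

Sampling it twice a few seconds apart and comparing the counters shows which states the core actually entered; if anything deeper than C1/C1E keeps growing, the C-state limit is not being applied.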
>>
>>
>>
>> Other than that, although there could be some minor tweaks, you are
>> probably nearing the limit of what you can hope to achieve.
>>
>>
>>
>> Nick
>>
>>
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Best,
>>
>>
>> *German*
>>
>>
>>
>> 2017-11-27 11:36 GMT-03:00 Maged Mokhtar <mmokhtar at petasan.org>:
>>
>> On 2017-11-27 15:02, German Anders wrote:
>>
>> Hi All,
>>
>>
>>
>> I have a performance question. We recently installed a brand-new Ceph cluster
>> with all-NVMe disks, running Ceph version 12.2.0 with BlueStore configured.
>> The back end of the cluster uses an IPoIB bond (active/passive), and
>> the front end uses an active/active bond (20GbE)
>> to communicate with the clients.
>>
>>
>>
>> The cluster configuration is the following:
>>
>>
>>
>> *MON Nodes:*
>>
>> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>>
>> 3x 1U servers:
>>
>>   2x Intel Xeon E5-2630v4 @2.2Ghz
>>
>>   128G RAM
>>
>>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>>
>>   2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>>
>>
>>
>> *OSD Nodes:*
>>
>> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>>
>> 4x 2U servers:
>>
>>   2x Intel Xeon E5-2640v4 @2.4Ghz
>>
>>   128G RAM
>>
>>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>>
>>   1x Ethernet Controller 10G X550T
>>
>>   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>>
>>   12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
>>
>>   1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
>>
>>
>>
>>
>>
>> Here's the tree:
>>
>>
>>
>> ID CLASS WEIGHT   TYPE NAME          STATUS REWEIGHT PRI-AFF
>>
>> -7       48.00000 root root
>>
>> -5       24.00000     rack rack1
>>
>> -1       12.00000         node cpn01
>>
>>  0  nvme  1.00000             osd.0      up  1.00000 1.00000
>>
>>  1  nvme  1.00000             osd.1      up  1.00000 1.00000
>>
>>  2  nvme  1.00000             osd.2      up  1.00000 1.00000
>>
>>  3  nvme  1.00000             osd.3      up  1.00000 1.00000
>>
>>  4  nvme  1.00000             osd.4      up  1.00000 1.00000
>>
>>  5  nvme  1.00000             osd.5      up  1.00000 1.00000
>>
>>  6  nvme  1.00000             osd.6      up  1.00000 1.00000
>>
>>  7  nvme  1.00000             osd.7      up  1.00000 1.00000
>>
>>  8  nvme  1.00000             osd.8      up  1.00000 1.00000
>>
>>  9  nvme  1.00000             osd.9      up  1.00000 1.00000
>>
>> 10  nvme  1.00000             osd.10     up  1.00000 1.00000
>>
>> 11  nvme  1.00000             osd.11     up  1.00000 1.00000
>>
>> -3       12.00000         node cpn03
>>
>> 24  nvme  1.00000             osd.24     up  1.00000 1.00000
>>
>> 25  nvme  1.00000             osd.25     up  1.00000 1.00000
>>
>> 26  nvme  1.00000             osd.26     up  1.00000 1.00000
>>
>> 27  nvme  1.00000             osd.27     up  1.00000 1.00000
>>
>> 28  nvme  1.00000             osd.28     up  1.00000 1.00000
>>
>> 29  nvme  1.00000             osd.29     up  1.00000 1.00000
>>
>> 30  nvme  1.00000             osd.30     up  1.00000 1.00000
>>
>> 31  nvme  1.00000             osd.31     up  1.00000 1.00000
>>
>> 32  nvme  1.00000             osd.32     up  1.00000 1.00000
>>
>> 33  nvme  1.00000             osd.33     up  1.00000 1.00000
>>
>> 34  nvme  1.00000             osd.34     up  1.00000 1.00000
>>
>> 35  nvme  1.00000             osd.35     up  1.00000 1.00000
>>
>> -6       24.00000     rack rack2
>>
>> -2       12.00000         node cpn02
>>
>> 12  nvme  1.00000             osd.12     up  1.00000 1.00000
>>
>> 13  nvme  1.00000             osd.13     up  1.00000 1.00000
>>
>> 14  nvme  1.00000             osd.14     up  1.00000 1.00000
>>
>> 15  nvme  1.00000             osd.15     up  1.00000 1.00000
>>
>> 16  nvme  1.00000             osd.16     up  1.00000 1.00000
>>
>> 17  nvme  1.00000             osd.17     up  1.00000 1.00000
>>
>> 18  nvme  1.00000             osd.18     up  1.00000 1.00000
>>
>> 19  nvme  1.00000             osd.19     up  1.00000 1.00000
>>
>> 20  nvme  1.00000             osd.20     up  1.00000 1.00000
>>
>> 21  nvme  1.00000             osd.21     up  1.00000 1.00000
>>
>> 22  nvme  1.00000             osd.22     up  1.00000 1.00000
>>
>> 23  nvme  1.00000             osd.23     up  1.00000 1.00000
>>
>> -4       12.00000         node cpn04
>>
>> 36  nvme  1.00000             osd.36     up  1.00000 1.00000
>>
>> 37  nvme  1.00000             osd.37     up  1.00000 1.00000
>>
>> 38  nvme  1.00000             osd.38     up  1.00000 1.00000
>>
>> 39  nvme  1.00000             osd.39     up  1.00000 1.00000
>>
>> 40  nvme  1.00000             osd.40     up  1.00000 1.00000
>>
>> 41  nvme  1.00000             osd.41     up  1.00000 1.00000
>>
>> 42  nvme  1.00000             osd.42     up  1.00000 1.00000
>>
>> 43  nvme  1.00000             osd.43     up  1.00000 1.00000
>>
>> 44  nvme  1.00000             osd.44     up  1.00000 1.00000
>>
>> 45  nvme  1.00000             osd.45     up  1.00000 1.00000
>>
>> 46  nvme  1.00000             osd.46     up  1.00000 1.00000
>>
>> 47  nvme  1.00000             osd.47     up  1.00000 1.00000
>>
>>
>>
>> The disk partition of one of the OSD nodes:
>>
>>
>>
>> NAME                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
>> nvme6n1                259:1    0   1.1T  0 disk
>> ├─nvme6n1p2            259:15   0   1.1T  0 part
>> └─nvme6n1p1            259:13   0   100M  0 part  /var/lib/ceph/osd/ceph-6
>> nvme9n1                259:0    0   1.1T  0 disk
>> ├─nvme9n1p2            259:8    0   1.1T  0 part
>> └─nvme9n1p1            259:7    0   100M  0 part  /var/lib/ceph/osd/ceph-9
>> sdb                      8:16   0 139.8G  0 disk
>> └─sdb1                   8:17   0 139.8G  0 part
>>   └─md0                  9:0    0 139.6G  0 raid1
>>     ├─md0p2            259:31   0     1K  0 md
>>     ├─md0p5            259:32   0 139.1G  0 md
>>     │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP]
>>     │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   /
>>     └─md0p1            259:30   0 486.3M  0 md    /boot
>> nvme11n1               259:2    0   1.1T  0 disk
>> ├─nvme11n1p1           259:12   0   100M  0 part  /var/lib/ceph/osd/ceph-11
>> └─nvme11n1p2           259:14   0   1.1T  0 part
>> nvme2n1                259:6    0   1.1T  0 disk
>> ├─nvme2n1p2            259:21   0   1.1T  0 part
>> └─nvme2n1p1            259:20   0   100M  0 part  /var/lib/ceph/osd/ceph-2
>> nvme5n1                259:3    0   1.1T  0 disk
>> ├─nvme5n1p1            259:9    0   100M  0 part  /var/lib/ceph/osd/ceph-5
>> └─nvme5n1p2            259:10   0   1.1T  0 part
>> nvme8n1                259:24   0   1.1T  0 disk
>> ├─nvme8n1p1            259:26   0   100M  0 part  /var/lib/ceph/osd/ceph-8
>> └─nvme8n1p2            259:28   0   1.1T  0 part
>> nvme10n1               259:11   0   1.1T  0 disk
>> ├─nvme10n1p1           259:22   0   100M  0 part  /var/lib/ceph/osd/ceph-10
>> └─nvme10n1p2           259:23   0   1.1T  0 part
>> nvme1n1                259:33   0   1.1T  0 disk
>> ├─nvme1n1p1            259:34   0   100M  0 part  /var/lib/ceph/osd/ceph-1
>> └─nvme1n1p2            259:35   0   1.1T  0 part
>> nvme4n1                259:5    0   1.1T  0 disk
>> ├─nvme4n1p1            259:18   0   100M  0 part  /var/lib/ceph/osd/ceph-4
>> └─nvme4n1p2            259:19   0   1.1T  0 part
>> nvme7n1                259:25   0   1.1T  0 disk
>> ├─nvme7n1p1            259:27   0   100M  0 part  /var/lib/ceph/osd/ceph-7
>> └─nvme7n1p2            259:29   0   1.1T  0 part
>> sda                      8:0    0 139.8G  0 disk
>> └─sda1                   8:1    0 139.8G  0 part
>>   └─md0                  9:0    0 139.6G  0 raid1
>>     ├─md0p2            259:31   0     1K  0 md
>>     ├─md0p5            259:32   0 139.1G  0 md
>>     │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP]
>>     │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   /
>>     └─md0p1            259:30   0 486.3M  0 md    /boot
>> nvme0n1                259:36   0   1.1T  0 disk
>> ├─nvme0n1p1            259:37   0   100M  0 part  /var/lib/ceph/osd/ceph-0
>> └─nvme0n1p2            259:38   0   1.1T  0 part
>> nvme3n1                259:4    0   1.1T  0 disk
>> ├─nvme3n1p1            259:16   0   100M  0 part  /var/lib/ceph/osd/ceph-3
>> └─nvme3n1p2            259:17   0   1.1T  0 part
>>
>>
>>
>>
>>
>> For the disk scheduler we're using [kyber]; for read_ahead_kb we tried
>> different values (0, 128 and 2048); rq_affinity is set to 2 and the
>> rotational parameter is set to 0.
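
To confirm those block-queue settings actually took effect on every NVMe device, something like the following could be run on each OSD node. It is a sketch that assumes the usual /sys/block/<device>/queue/ layout, where the scheduler file lists all schedulers with the active one in square brackets; the function name is hypothetical.

```python
from pathlib import Path
from typing import Dict

QUEUE_KEYS = ("scheduler", "read_ahead_kb", "rq_affinity", "rotational")


def queue_settings(device: str, sys_block: str = "/sys/block") -> Dict[str, str]:
    """Read the queue tunables mentioned in the thread for one block device."""
    queue = Path(sys_block, device, "queue")
    settings: Dict[str, str] = {}
    for key in QUEUE_KEYS:
        value = (queue / key).read_text().strip()
        if key == "scheduler" and "[" in value and "]" in value:
            # e.g. "[kyber] mq-deadline none" -> "kyber"
            value = value[value.index("[") + 1 : value.index("]")]
        settings[key] = value
    return settings


if __name__ == "__main__":
    root = Path("/sys/block")
    if root.is_dir():
        for dev in sorted(p.name for p in root.glob("nvme*")):
            print(dev, queue_settings(dev))
```

Settings written by hand to sysfs do not survive a reboot unless a udev rule or similar reapplies them, so a check like this is worth repeating after maintenance.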
>>
>> We've also set the CPU governor to performance on all the cores, and tuned
>> some sysctl parameters:
>>
>>
>>
>> # for Ceph
>>
>> net.ipv4.ip_forward=0
>>
>> net.ipv4.conf.default.rp_filter=1
>>
>> kernel.sysrq=0
>>
>> kernel.core_uses_pid=1
>>
>> net.ipv4.tcp_syncookies=0
>>
>> #net.netfilter.nf_conntrack_max=2621440
>>
>> #net.netfilter.nf_conntrack_tcp_timeout_established = 1800
>>
>> # disable netfilter on bridges
>>
>> #net.bridge.bridge-nf-call-ip6tables = 0
>>
>> #net.bridge.bridge-nf-call-iptables = 0
>>
>> #net.bridge.bridge-nf-call-arptables = 0
>>
>> vm.min_free_kbytes=1000000
>>
>>
>>
>> # Controls the default maximum size of a message queue, in bytes
>>
>> kernel.msgmnb = 65536
>>
>>
>>
>> # Controls the maximum size of a message, in bytes
>>
>> kernel.msgmax = 65536
>>
>>
>>
>> # Controls the maximum shared segment size, in bytes
>>
>> kernel.shmmax = 68719476736
>>
>>
>>
>> # Controls the total amount of shared memory, in pages
>>
>> kernel.shmall = 4294967296
>>
>>
>>
>>
>>
>> The ceph.conf file is:
>>
>>
>>
>> ...
>>
>> osd_pool_default_size = 3
>>
>> osd_pool_default_min_size = 2
>>
>> osd_pool_default_pg_num = 1600
>>
>> osd_pool_default_pgp_num = 1600
>>
>>
>>
>> debug_crush = 1/1
>>
>> debug_buffer = 0/1
>>
>> debug_timer = 0/0
>>
>> debug_filer = 0/1
>>
>> debug_objecter = 0/1
>>
>> debug_rados = 0/5
>>
>> debug_rbd = 0/5
>>
>> debug_ms = 0/5
>>
>> debug_throttle = 1/1
>>
>>
>>
>> debug_journaler = 0/0
>>
>> debug_objectcatcher = 0/0
>>
>> debug_client = 0/0
>>
>> debug_osd = 0/0
>>
>> debug_optracker = 0/0
>>
>> debug_objclass = 0/0
>>
>> debug_journal = 0/0
>>
>> debug_filestore = 0/0
>>
>> debug_mon = 0/0
>>
>> debug_paxos = 0/0
>>
>>
>>
>> osd_crush_chooseleaf_type = 0
>>
>> filestore_xattr_use_omap = true
>>
>>
>>
>> rbd_cache = true
>>
>> mon_compact_on_trim = false
>>
>>
>>
>> [osd]
>>
>> osd_crush_update_on_start = false
>>
>>
>>
>> [client]
>>
>> rbd_cache = true
>>
>> rbd_cache_writethrough_until_flush = true
>>
>> rbd_default_features = 1
>>
>> admin_socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
>>
>> log_file = /var/log/ceph/
>>
>>
>>
>>
>>
>> The cluster has two production pools: one for OpenStack (volumes) with a
>> replication factor (RF) of 3, and another for databases (db) with an RF of 2.
>> The DBA team has performed several tests with a volume mounted on the DB
>> server (with RBD). The DB server has the following configuration:
>>
>>
>>
>> OS: CentOS 6.9 | kernel 4.14.1
>>
>> DB: MySQL
>>
>> ProLiant BL685c G7
>>
>> 4x AMD Opteron Processor 6376 (total of 64 cores)
>>
>> 128G RAM
>>
>> 1x OneConnect 10Gb NIC (quad-port) - in a bond configuration
>> (active/active) with 3 vlans
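
As a sanity check on the pool settings quoted above, the number of PG replicas each OSD carries can be estimated with simple arithmetic. This sketch assumes both pools were created with the default pg_num of 1600 from the ceph.conf shown (the thread does not state the actual per-pool pg_num) and that data is spread evenly over the 48 OSDs.

```python
def pg_replicas_per_osd(pg_num: int, pool_size: int, num_osds: int) -> float:
    """Average number of PG replicas one OSD holds for a single pool."""
    if num_osds <= 0:
        raise ValueError("num_osds must be positive")
    return pg_num * pool_size / num_osds


if __name__ == "__main__":
    num_osds = 48  # 4 OSD nodes x 12 NVMe OSDs, per the tree above
    # Pool name -> (pg_num, replication factor); pg_num values are assumed.
    pools = {"volumes": (1600, 3), "db": (1600, 2)}
    total = 0.0
    for name, (pg_num, size) in pools.items():
        per_osd = pg_replicas_per_osd(pg_num, size, num_osds)
        total += per_osd
        print(f"{name}: {per_osd:.1f} PG replicas per OSD")
    print(f"total: {total:.1f} PG replicas per OSD")
```

The real figures are available from `ceph osd df`, which reports a PGS column per OSD.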
>>
>>
>>
>>
>>
>>
>>
>> We also did some tests with *sysbench* on different storage types:
>>
>>
>>
>> disk           tps      qps        latency (ms), 95th percentile
>> Local SSD      261,28   5.225,61    5,18
>> Ceph NVMe       95,18   1.903,53   12,3
>> Pure Storage   196,49   3.929,71    6,32
>> NetApp FAS     189,83   3.796,59    6,67
>> EMC VMAX       196,14   3.922,82    6,32
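
To make the gap in the sysbench figures easier to compare, the tps values can be normalized against the Ceph result. The sketch below only re-uses the numbers posted in this thread; they are written with a decimal comma and a dot as thousands separator, so they are converted first. The function and variable names are made up for illustration.

```python
from typing import Dict


def parse_decimal_comma(text: str) -> float:
    """Convert a number such as '5.225,61' (dot = thousands, comma = decimal) to a float."""
    return float(text.replace(".", "").replace(",", "."))


# tps values exactly as posted in the thread.
SYSBENCH_TPS: Dict[str, str] = {
    "Local SSD": "261,28",
    "Ceph NVMe": "95,18",
    "Pure Storage": "196,49",
    "NetApp FAS": "189,83",
    "EMC VMAX": "196,14",
}


def relative_tps(baseline: str = "Ceph NVMe") -> Dict[str, float]:
    """Return each storage type's tps as a multiple of the baseline's tps."""
    base = parse_decimal_comma(SYSBENCH_TPS[baseline])
    return {name: parse_decimal_comma(raw) / base for name, raw in SYSBENCH_TPS.items()}


if __name__ == "__main__":
    for name, ratio in relative_tps().items():
        print(f"{name}: {ratio:.2f}x the Ceph NVMe tps")
```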
>>
>>
>>
>>
>>
>> Is there any specific tuning that I can apply to the ceph cluster, in
>> order to improve those numbers? Or are those numbers ok for the type and
>> size of the cluster that we have? Any advice would be really appreciated.
>>
>>
>>
>> Thanks,
>>
>>
>>
>>
>>
>>
>>
>> *German*
>>
>>
>>
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> Hi,
>>
>> What is the value of --num-threads (the default is 1)? Ceph will do better
>> with more threads: 32 or 64.
>> What are the values of --file-block-size (default 16k) and --file-test-mode? If
>> you are using sequential seqwr/seqrd you will be hitting the same OSD, so
>> maybe try random (rndrd/rndwr) or, better, use an rbd stripe size of 16 KB
>> (the default rbd stripe is 4M). rbd striping is ideal for the small-block
>> sequential I/O pattern typical in databases.
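
To illustrate why a small stripe unit spreads sequential small-block I/O over more objects (and so more PGs and OSDs), here is a sketch of the usual Ceph striping arithmetic that maps a byte offset in an image to an object number and an offset inside that object. The stripe count of 4 in the demo is an assumption made for the example; only the 16 KB stripe unit and the 4M default come from the advice above.

```python
from typing import Tuple

KIB = 1024
MIB = 1024 * KIB


def locate(offset: int, stripe_unit: int, stripe_count: int, object_size: int) -> Tuple[int, int]:
    """Map an image byte offset to (object number, byte offset within that object)."""
    if object_size % stripe_unit != 0:
        raise ValueError("object_size must be a multiple of stripe_unit")
    stripes_per_object = object_size // stripe_unit
    block_no = offset // stripe_unit                      # which stripe unit overall
    stripe_no, stripe_pos = divmod(block_no, stripe_count)
    object_set_no, block_in_object = divmod(stripe_no, stripes_per_object)
    object_no = object_set_no * stripe_count + stripe_pos
    object_offset = block_in_object * stripe_unit + offset % stripe_unit
    return object_no, object_offset


if __name__ == "__main__":
    # Eight consecutive 16 KiB writes, first with the default layout, then striped.
    for label, unit, count in (("default 4M", 4 * MIB, 1), ("16K x 4", 16 * KIB, 4)):
        objects = [locate(i * 16 * KIB, unit, count, 4 * MIB)[0] for i in range(8)]
        print(f"{label}: objects touched = {objects}")
```

With the default layout every one of those writes lands in the same object, while the striped layout rotates them across the objects of a stripe set.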
>>
>> /Maged
>>
>>
>>
>>
>>
>