[ceph-users] ceph all-nvme mysql performance tuning

Maged Mokhtar mmokhtar at petasan.org
Wed Nov 29 00:24:44 PST 2017


Hi German, 

I would personally prefer to benchmark the cluster first with rados bench
or fio, which are more commonly used, and only later run MySQL-specific
tests with sysbench. Another thing to try is running the client test
simultaneously on more than one machine and adding up the numbers from
each. The limitation can be caused by client-side resources, which may be
stressed differently depending on which storage backend you tried.
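
As a rough illustration of adding up the numbers from several clients, here is a minimal Python sketch. The client names and figures are made up for the example, and the choice of an IOPS-weighted mean for latency is an assumption, not something prescribed by rados bench or fio.

```python
from dataclasses import dataclass


@dataclass
class ClientResult:
    host: str              # hypothetical client name
    iops: float            # IOPS reported by that client's own run
    avg_latency_ms: float  # average latency reported by that client


def aggregate(results):
    """Sum IOPS across clients and compute an IOPS-weighted mean latency."""
    total_iops = sum(r.iops for r in results)
    weighted_latency = sum(r.iops * r.avg_latency_ms for r in results) / total_iops
    return total_iops, weighted_latency


# Made-up numbers from two clients that ran the same test at the same time.
results = [
    ClientResult("client-a", 12000.0, 2.0),
    ClientResult("client-b", 8000.0, 3.0),
]
total_iops, mean_latency = aggregate(results)
print(f"total IOPS: {total_iops:.0f}, weighted latency: {mean_latency:.2f} ms")
```

If the total keeps growing as clients are added, the earlier single-client figure was probably limited by the client rather than by the cluster.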

Maged 

On 2017-11-28 21:20, German Anders wrote:

> I don't know if there are any statistics available, really, but I'm running some sysbench tests with MySQL before the changes. The idea is to run those tests again after the 'tuning' and see if the numbers get any better. I'm also gathering numbers from some collectd and statsd collectors running on the OSD nodes, so I hope to get some info from that :) 
> 
> GERMAN 
> 2017-11-28 16:12 GMT-03:00 Marc Roos <M.Roos at f1-outsourcing.eu>:
> 
>> I was wondering if there are any statistics available that show the
>> performance increase of doing such things?
>> 
>> -----Original Message-----
>> From: German Anders [mailto:ganders at despegar.com]
>> Sent: Tuesday 28 November 2017 19:34
>> To: Luis Periquito
>> Cc: ceph-users
>> Subject: Re: [ceph-users] ceph all-nvme mysql performance tuning
>> 
>> Thanks a lot Luis, I agree with you regarding the CPUs, but
>> unfortunately that was the best CPU model we could afford :S
>> 
>> For the NUMA part, I managed to pin the OSDs by changing the
>> /usr/lib/systemd/system/ceph-osd@.service file and adding a
>> CPUAffinity list to it. But this applies to ALL the OSDs, pinning them
>> to specific nodes or a specific CPU list. I can't find a way to specify
>> a list for only a specific subset of OSDs.
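
One possible approach, sketched here rather than tested on this cluster: systemd reads drop-in files for a single instance of a template unit from a directory named after that instance, so a CPUAffinity line can be set per OSD without touching the shared template. The Python below only renders the path and contents of such a drop-in. The file name `cpuaffinity.conf`, the OSD id and the CPU list are illustrative assumptions.

```python
from pathlib import PurePosixPath


def osd_dropin(osd_id, cpus):
    """Return (path, contents) of a per-instance systemd drop-in for one OSD.

    The drop-in file name and the CPU list are hypothetical examples.
    """
    path = (
        PurePosixPath("/etc/systemd/system")
        / f"ceph-osd@{osd_id}.service.d"
        / "cpuaffinity.conf"
    )
    contents = "[Service]\nCPUAffinity=" + " ".join(str(c) for c in cpus) + "\n"
    return path, contents


path, contents = osd_dropin(3, [0, 1, 2, 3])
print(path)
print(contents, end="")
```

After writing such a file, a `systemctl daemon-reload` and a restart of that OSD would be needed for the setting to take effect.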
>> 
>> Also, I noticed that the NVMe disks are all on the same node (since I'm
>> using half of the shelf, the other half will be attached to the other
>> node), so the lanes of the NVMe disks are all on the same CPU (in this
>> case 0). I also found that the IB adapter that is mapped to the OSD
>> network (osd replication) is attached to CPU 1, so that traffic will
>> cross the QPI path.
>> 
>> And for the memory, from the other email, we are already using the
>> TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES parameter with a value of
>> 134217728 (128 MiB).
>> 
>> In this case I can pin all the current OSDs to CPU 0, but in the near
>> future, when I add more NVMe disks to the OSD nodes, I'll definitely need
>> to pin the other half of the OSDs to CPU 1. Has anyone already done this?
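
A minimal sketch of how the split between the two sockets could be planned. On Linux the CPUs of a NUMA node are usually listed in `/sys/devices/system/node/node<N>/cpulist`, and the node an NVMe controller hangs off is usually readable from `/sys/class/nvme/nvme<N>/device/numa_node`. The cpulist strings and the OSD-to-node mapping below are hypothetical values, not taken from these servers.

```python
def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-2,8' into [0, 1, 2, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def plan_affinity(osd_to_node, node_cpulists):
    """Map each OSD id to the CPU list of the NUMA node its NVMe device is on."""
    return {
        osd: parse_cpulist(node_cpulists[node])
        for osd, node in osd_to_node.items()
    }


# Hypothetical layout: two sockets with interleaved hyper-thread numbering.
node_cpulists = {0: "0-9,20-29", 1: "10-19,30-39"}
# Hypothetical mapping: OSDs 0 and 1 on node 0, a future OSD 48 on node 1.
osd_to_node = {0: 0, 1: 0, 48: 1}

plan = plan_affinity(osd_to_node, node_cpulists)
for osd, cpus in sorted(plan.items()):
    print(f"osd.{osd}: CPUAffinity={' '.join(map(str, cpus))}")
```

Each resulting CPU list could then go into a per-OSD CPUAffinity setting, so that OSDs on the second half of the shelf end up on CPU 1.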
>> 
>> Thanks a lot,
>> 
>> Best,
>> 
>> German
>> 
>> 2017-11-28 6:36 GMT-03:00 Luis Periquito <periquito at gmail.com>:
>> 
>> There are a few things I don't like about your machines... If you
>> want latency/IOPS (as you seemingly do) you really want the highest
>> frequency CPUs, even over number of cores. These are not too bad, but
>> not great either.
>> 
>> Also you have 2x CPU, meaning NUMA. Have you pinned OSDs to NUMA
>> nodes? Ideally each OSD is pinned to the same NUMA node the NVMe device
>> is connected to. Each NVMe device will be running on PCIe lanes provided
>> by one of the CPUs...
>> 
>> What versions of TCMalloc (or jemalloc) are you running? Have you
>> tuned them to have a bigger cache?
>> 
>> These are from what I've learned using filestore - I've yet to run
>> full tests on bluestore - but they should still apply...
>> 
>> On Mon, Nov 27, 2017 at 5:10 PM, German Anders
>> <ganders at despegar.com> wrote:
>> 
>> Hi Nick,
>> 
>> yeah, we are using the same NVMe disk with an additional
>> partition to use as journal/WAL. We double-checked the C-states and they
>> were not configured to use C1, so we changed that on all the OSD nodes
>> and MON nodes. We're going to run some new tests and see how it goes.
>> I'll get back as soon as we have those tests running.
>> 
>> Thanks a lot,
>> 
>> Best,
>> 
>> German
>> 
>> 2017-11-27 12:16 GMT-03:00 Nick Fisk <nick at fisk.me.uk>:
>> 
>> From: ceph-users [mailto:ceph-users-bounces at lists.ceph.com]
>> On Behalf Of German Anders
>> Sent: 27 November 2017 14:44
>> To: Maged Mokhtar <mmokhtar at petasan.org>
>> Cc: ceph-users <ceph-users at lists.ceph.com>
>> Subject: Re: [ceph-users] ceph all-nvme mysql performance tuning
>> 
>> Hi Maged,
>> 
>> Thanks a lot for the response. We tried different numbers of
>> threads and we're getting almost the same kind of difference
>> between the storage types. We're going to try different rbd stripe size
>> and object size values and see if we get more competitive numbers. Will
>> get back with more tests and param changes to see if we get better :)
>> 
>> Just to echo a couple of comments. Ceph will always
>> struggle to match the performance of a traditional array, for two main
>> reasons.
>> 
>> 1.      You are replacing some sort of dual ported SAS or
>> internally RDMA connected device with a network for Ceph replication
>> traffic. This will instantly have a large impact on write latency.
>> 2.      Ceph locks at the PG level and a PG will most
>> likely cover at least one 4MB object, so lots of small accesses to the
>> same blocks (on a block device) will wait on each other and go
>> effectively at a single threaded rate.
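
To make the second point concrete, here is a small Python sketch of how byte offsets on an RBD image fall into objects. It assumes the default layout without custom striping and a 4 MiB object size; the 16 KiB write size is only an example. The further object-to-PG mapping is not modelled.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed default RBD object size (4 MiB)
WRITE_SIZE = 16 * 1024         # example small write size (16 KiB)


def object_index(offset, object_size=OBJECT_SIZE):
    """Index of the object that holds the given byte offset of the image."""
    return offset // object_size


# 256 consecutive 16 KiB writes cover exactly the first 4 MiB of the image.
offsets = [i * WRITE_SIZE for i in range(256)]
objects_touched = {object_index(o) for o in offsets}
print(f"{len(offsets)} writes touched objects: {sorted(objects_touched)}")

# The very next write is the first one to land in a different object.
next_object = object_index(256 * WRITE_SIZE)
print(f"write #257 lands in object {next_object}")
```

All 256 writes end up in one object, hence in one PG, which is why they queue behind each other.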
>> 
>> The best thing you can do to mitigate these is to run
>> the fastest journal/WAL devices you can, the fastest network connections
>> (i.e. 25Gb/s), and run your CPUs at max C and P states.
>> 
>> You stated that you are running the performance profile
>> on the CPUs. Could you also just double check that the C-states are
>> being held at C1(e)? There are a few utilities that can show this in
>> realtime.
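
Besides the ready-made utilities, the kernel exposes per-CPU idle state counters under `/sys/devices/system/cpu/cpu<N>/cpuidle/state<M>/`. The following is a minimal sketch that reads them and reports which states were entered between two samples. State names differ between kernels and idle drivers, so the sketch makes no attempt to decide which names count as deeper than C1.

```python
from pathlib import Path


def read_cstates(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return [(name, usage_count, disabled)] for one CPU's idle states."""
    cpuidle = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    state_dirs = sorted(
        cpuidle.glob("state[0-9]*"),
        key=lambda p: int(p.name[len("state"):]),
    )
    states = []
    for d in state_dirs:
        name = (d / "name").read_text().strip()
        usage = int((d / "usage").read_text())
        disabled = (d / "disable").read_text().strip() == "1"
        states.append((name, usage, disabled))
    return states


def entered_since(before, after):
    """Names of idle states whose usage counter grew between two samples."""
    return [a[0] for b, a in zip(before, after) if a[1] > b[1]]
```

Calling `read_cstates()` twice a few seconds apart and passing both results to `entered_since` shows which states the core actually used in that window; if anything deeper than C1/C1E shows up, the C-state limit is not being applied.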
>> 
>> Other than that, although there could be some minor
>> tweaks, you are probably nearing the limit of what you can hope to
>> achieve.
>> 
>> Nick
>> 
>> Thanks,
>> 
>> Best,
>> 
>> German
>> 
>> 2017-11-27 11:36 GMT-03:00 Maged Mokhtar
>> <mmokhtar at petasan.org>:
>> 
>> On 2017-11-27 15:02, German Anders wrote:
>> 
>> Hi All,
>> 
>> I have a performance question. We recently
>> installed a brand new Ceph cluster with all-NVMe disks, using ceph version
>> 12.2.0 with bluestore configured. The back-end of the cluster uses a
>> bonded IPoIB link (active/passive), and for the front-end we are using a
>> bonding config with active/active (20GbE) to communicate with the
>> clients.
>> 
>> The cluster configuration is the following:
>> 
>> MON Nodes:
>> 
>> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>> 3x 1U servers:
>>   2x Intel Xeon E5-2630v4 @2.2Ghz
>>   128G RAM
>>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>>   2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>> 
>> OSD Nodes:
>> 
>> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>> 4x 2U servers:
>>   2x Intel Xeon E5-2640v4 @2.4Ghz
>>   128G RAM
>>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>>   1x Ethernet Controller 10G X550T
>>   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>>   12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
>>   1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
>> 
>> Here's the tree:
>> 
>> ID CLASS WEIGHT   TYPE NAME          STATUS REWEIGHT PRI-AFF
>> -7       48.00000 root root
>> -5       24.00000     rack rack1
>> -1       12.00000         node cpn01
>>  0  nvme  1.00000             osd.0      up  1.00000 1.00000
>>  1  nvme  1.00000             osd.1      up  1.00000 1.00000
>>  2  nvme  1.00000             osd.2      up  1.00000 1.00000
>>  3  nvme  1.00000             osd.3      up  1.00000 1.00000
>>  4  nvme  1.00000             osd.4      up  1.00000 1.00000
>>  5  nvme  1.00000             osd.5      up  1.00000 1.00000
>>  6  nvme  1.00000             osd.6      up  1.00000 1.00000
>>  7  nvme  1.00000             osd.7      up  1.00000 1.00000
>>  8  nvme  1.00000             osd.8      up  1.00000 1.00000
>>  9  nvme  1.00000             osd.9      up  1.00000 1.00000
>> 10  nvme  1.00000             osd.10     up  1.00000 1.00000
>> 11  nvme  1.00000             osd.11     up  1.00000 1.00000
>> -3       12.00000         node cpn03
>> 24  nvme  1.00000             osd.24     up  1.00000 1.00000
>> 25  nvme  1.00000             osd.25     up  1.00000 1.00000
>> 26  nvme  1.00000             osd.26     up  1.00000 1.00000
>> 27  nvme  1.00000             osd.27     up  1.00000 1.00000
>> 28  nvme  1.00000             osd.28     up  1.00000 1.00000
>> 29  nvme  1.00000             osd.29     up  1.00000 1.00000
>> 30  nvme  1.00000             osd.30     up  1.00000 1.00000
>> 31  nvme  1.00000             osd.31     up  1.00000 1.00000
>> 32  nvme  1.00000             osd.32     up  1.00000 1.00000
>> 33  nvme  1.00000             osd.33     up  1.00000 1.00000
>> 34  nvme  1.00000             osd.34     up  1.00000 1.00000
>> 35  nvme  1.00000             osd.35     up  1.00000 1.00000
>> -6       24.00000     rack rack2
>> -2       12.00000         node cpn02
>> 12  nvme  1.00000             osd.12     up  1.00000 1.00000
>> 13  nvme  1.00000             osd.13     up  1.00000 1.00000
>> 14  nvme  1.00000             osd.14     up  1.00000 1.00000
>> 15  nvme  1.00000             osd.15     up  1.00000 1.00000
>> 16  nvme  1.00000             osd.16     up  1.00000 1.00000
>> 17  nvme  1.00000             osd.17     up  1.00000 1.00000
>> 18  nvme  1.00000             osd.18     up  1.00000 1.00000
>> 19  nvme  1.00000             osd.19     up  1.00000 1.00000
>> 20  nvme  1.00000             osd.20     up  1.00000 1.00000
>> 21  nvme  1.00000             osd.21     up  1.00000 1.00000
>> 22  nvme  1.00000             osd.22     up  1.00000 1.00000
>> 23  nvme  1.00000             osd.23     up  1.00000 1.00000
>> -4       12.00000         node cpn04
>> 36  nvme  1.00000             osd.36     up  1.00000 1.00000
>> 37  nvme  1.00000             osd.37     up  1.00000 1.00000
>> 38  nvme  1.00000             osd.38     up  1.00000 1.00000
>> 39  nvme  1.00000             osd.39     up  1.00000 1.00000
>> 40  nvme  1.00000             osd.40     up  1.00000 1.00000
>> 41  nvme  1.00000             osd.41     up  1.00000 1.00000
>> 42  nvme  1.00000             osd.42     up  1.00000 1.00000
>> 43  nvme  1.00000             osd.43     up  1.00000 1.00000
>> 44  nvme  1.00000             osd.44     up  1.00000 1.00000
>> 45  nvme  1.00000             osd.45     up  1.00000 1.00000
>> 46  nvme  1.00000             osd.46     up  1.00000 1.00000
>> 47  nvme  1.00000             osd.47     up  1.00000 1.00000
>> 
>> The disk partition of one of the OSD nodes:
>> 
>> NAME                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
>> nvme6n1                259:1    0   1.1T  0 disk
>> ├─nvme6n1p2            259:15   0   1.1T  0 part
>> └─nvme6n1p1            259:13   0   100M  0 part  /var/lib/ceph/osd/ceph-6
>> nvme9n1                259:0    0   1.1T  0 disk
>> ├─nvme9n1p2            259:8    0   1.1T  0 part
>> └─nvme9n1p1            259:7    0   100M  0 part  /var/lib/ceph/osd/ceph-9
>> sdb                      8:16   0 139.8G  0 disk
>> └─sdb1                   8:17   0 139.8G  0 part
>>   └─md0                  9:0    0 139.6G  0 raid1
>>     ├─md0p2            259:31   0     1K  0 md
>>     ├─md0p5            259:32   0 139.1G  0 md
>>     │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP]
>>     │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   /
>>     └─md0p1            259:30   0 486.3M  0 md    /boot
>> nvme11n1               259:2    0   1.1T  0 disk
>> ├─nvme11n1p1           259:12   0   100M  0 part  /var/lib/ceph/osd/ceph-11
>> └─nvme11n1p2           259:14   0   1.1T  0 part
>> nvme2n1                259:6    0   1.1T  0 disk
>> ├─nvme2n1p2            259:21   0   1.1T  0 part
>> └─nvme2n1p1            259:20   0   100M  0 part  /var/lib/ceph/osd/ceph-2
>> nvme5n1                259:3    0   1.1T  0 disk
>> ├─nvme5n1p1            259:9    0   100M  0 part  /var/lib/ceph/osd/ceph-5
>> └─nvme5n1p2            259:10   0   1.1T  0 part
>> nvme8n1                259:24   0   1.1T  0 disk
>> ├─nvme8n1p1            259:26   0   100M  0 part  /var/lib/ceph/osd/ceph-8
>> └─nvme8n1p2            259:28   0   1.1T  0 part
>> nvme10n1               259:11   0   1.1T  0 disk
>> ├─nvme10n1p1           259:22   0   100M  0 part  /var/lib/ceph/osd/ceph-10
>> └─nvme10n1p2           259:23   0   1.1T  0 part
>> nvme1n1                259:33   0   1.1T  0 disk
>> ├─nvme1n1p1            259:34   0   100M  0 part  /var/lib/ceph/osd/ceph-1
>> └─nvme1n1p2            259:35   0   1.1T  0 part
>> nvme4n1                259:5    0   1.1T  0 disk
>> ├─nvme4n1p1            259:18   0   100M  0 part  /var/lib/ceph/osd/ceph-4
>> └─nvme4n1p2            259:19   0   1.1T  0 part
>> nvme7n1                259:25   0   1.1T  0 disk
>> ├─nvme7n1p1            259:27   0   100M  0 part  /var/lib/ceph/osd/ceph-7
>> └─nvme7n1p2            259:29   0   1.1T  0 part
>> sda                      8:0    0 139.8G  0 disk
>> └─sda1                   8:1    0 139.8G  0 part
>>   └─md0                  9:0    0 139.6G  0 raid1
>>     ├─md0p2            259:31   0     1K  0 md
>>     ├─md0p5            259:32   0 139.1G  0 md
>>     │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP]
>>     │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   /
>>     └─md0p1            259:30   0 486.3M  0 md    /boot
>> nvme0n1                259:36   0   1.1T  0 disk
>> ├─nvme0n1p1            259:37   0   100M  0 part  /var/lib/ceph/osd/ceph-0
>> └─nvme0n1p2            259:38   0   1.1T  0 part
>> nvme3n1                259:4    0   1.1T  0 disk
>> ├─nvme3n1p1            259:16   0   100M  0 part  /var/lib/ceph/osd/ceph-3
>> └─nvme3n1p2            259:17   0   1.1T  0 part
>> 
>> For the disk scheduler we're using [kyber]; for
>> read_ahead_kb we tried different values (0, 128 and 2048), with
>> rq_affinity set to 2 and the rotational parameter set to 0.
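
These block-layer settings live under `/sys/block/<device>/queue/`, so they can be checked on every NVMe device with a few lines of Python. The sketch below is only a sanity check under assumptions: the expected values are the ones quoted above, and the sample scheduler string in the usage note is made up.

```python
from pathlib import Path

# Expected values as quoted in the message above; read_ahead_kb is left out
# because several values were tried.
EXPECTED = {"scheduler": "kyber", "rq_affinity": "2", "rotational": "0"}


def active_scheduler(text):
    """Return the scheduler shown in [brackets] in the sysfs scheduler file."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_queue_settings(device, sysfs_root="/sys/block"):
    """Read the queue settings of one block device, e.g. 'nvme0n1'."""
    queue = Path(sysfs_root) / device / "queue"
    settings = {}
    for key in ("scheduler", "read_ahead_kb", "rq_affinity", "rotational"):
        raw = (queue / key).read_text().strip()
        settings[key] = active_scheduler(raw) if key == "scheduler" else raw
    return settings


def mismatches(settings, expected=EXPECTED):
    """Return {key: (actual, expected)} for every setting that differs."""
    return {
        key: (settings.get(key), want)
        for key, want in expected.items()
        if settings.get(key) != want
    }
```

For example, `active_scheduler("mq-deadline [kyber] none")` yields `kyber`, and `mismatches(read_queue_settings("nvme0n1"))` returns an empty dict when a device matches the intended settings.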
>> 
>> We've also set the CPU governor to performance
>> on all the cores, and tuned some sysctl parameters:
>> 
>> # for Ceph
>> net.ipv4.ip_forward=0
>> net.ipv4.conf.default.rp_filter=1
>> kernel.sysrq=0
>> kernel.core_uses_pid=1
>> net.ipv4.tcp_syncookies=0
>> #net.netfilter.nf_conntrack_max=2621440
>> #net.netfilter.nf_conntrack_tcp_timeout_established = 1800
>> # disable netfilter on bridges
>> #net.bridge.bridge-nf-call-ip6tables = 0
>> #net.bridge.bridge-nf-call-iptables = 0
>> #net.bridge.bridge-nf-call-arptables = 0
>> vm.min_free_kbytes=1000000
>> # Controls the default maximum size of a message queue, in bytes
>> kernel.msgmnb = 65536
>> # Controls the maximum size of a message, in bytes
>> kernel.msgmax = 65536
>> # Controls the maximum shared segment size, in bytes
>> kernel.shmmax = 68719476736
>> # Controls the maximum number of shared memory segments, in pages
>> kernel.shmall = 4294967296
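
For orientation, the raw sysctl values above convert to more readable units as follows. This is plain arithmetic; the 4 KiB page size is an assumption about these x86_64 hosts.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages

shmmax_bytes = 68719476736   # kernel.shmmax, in bytes
shmall_pages = 4294967296    # kernel.shmall, in pages
min_free_kbytes = 1000000    # vm.min_free_kbytes, in KiB

shmmax_gib = shmmax_bytes / 2**30
shmall_tib = shmall_pages * PAGE_SIZE / 2**40
min_free_gib = min_free_kbytes * 1024 / 2**30

print(f"kernel.shmmax      = {shmmax_gib:.0f} GiB")
print(f"kernel.shmall      = {shmall_tib:.0f} TiB")
print(f"vm.min_free_kbytes = {min_free_gib:.2f} GiB")
```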
>> 
>> The ceph.conf file is:
>> 
>> ...
>> osd_pool_default_size = 3
>> osd_pool_default_min_size = 2
>> osd_pool_default_pg_num = 1600
>> osd_pool_default_pgp_num = 1600
>> debug_crush = 1/1
>> debug_buffer = 0/1
>> debug_timer = 0/0
>> debug_filer = 0/1
>> debug_objecter = 0/1
>> debug_rados = 0/5
>> debug_rbd = 0/5
>> debug_ms = 0/5
>> debug_throttle = 1/1
>> debug_journaler = 0/0
>> debug_objectcatcher = 0/0
>> debug_client = 0/0
>> debug_osd = 0/0
>> debug_optracker = 0/0
>> debug_objclass = 0/0
>> debug_journal = 0/0
>> debug_filestore = 0/0
>> debug_mon = 0/0
>> debug_paxos = 0/0
>> osd_crush_chooseleaf_type = 0
>> filestore_xattr_use_omap = true
>> rbd_cache = true
>> mon_compact_on_trim = false
>> 
>> [osd]
>> osd_crush_update_on_start = false
>> 
>> [client]
>> rbd_cache = true
>> rbd_cache_writethrough_until_flush = true
>> rbd_default_features = 1
>> admin_socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
>> log_file = /var/log/ceph/
>> 
>> The cluster has two production pools, one for
>> openstack (volumes) with an RF of 3 and another for db (db) with an RF of
>> 2. The DBA team has performed several tests with a volume mounted on the
>> DB server (with RBD). The DB server has the following configuration:
>> 
>> OS: CentOS 6.9 | kernel 4.14.1
>> DB: MySQL
>> ProLiant BL685c G7
>> 4x AMD Opteron Processor 6376 (total of 64 cores)
>> 128G RAM
>> 1x OneConnect 10Gb NIC (quad-port) - in a bond configuration
>> (active/active) with 3 vlans
>> 
>> We also did some tests with sysbench on
>> different storage types:
>> 
>> sysbench
>> 
>> disk          tps      qps        latency (ms) 95th percentile
>> Local SSD     261,28   5.225,61   5,18
>> Ceph NVMe     95,18    1.903,53   12,3
>> Pure Storage  196,49   3.929,71   6,32
>> NetApp FAS    189,83   3.796,59   6,67
>> EMC VMAX      196,14   3.922,82   6,32
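
To put the sysbench figures side by side, the short Python sketch below expresses each backend relative to the local SSD. The numbers are the ones reported above, read with the comma as the decimal separator; the choice of local SSD as the baseline is an assumption made for the comparison.

```python
# (tps, qps, p95 latency in ms) per backend, as reported above.
results = {
    "Local SSD":    (261.28, 5225.61, 5.18),
    "Ceph NVMe":    (95.18, 1903.53, 12.3),
    "Pure Storage": (196.49, 3929.71, 6.32),
    "NetApp FAS":   (189.83, 3796.59, 6.67),
    "EMC VMAX":     (196.14, 3922.82, 6.32),
}

baseline_tps, _, baseline_latency = results["Local SSD"]


def relative(name):
    """Return (tps ratio, latency ratio) of a backend versus local SSD."""
    tps, _, latency = results[name]
    return tps / baseline_tps, latency / baseline_latency


for name in results:
    tps_ratio, latency_ratio = relative(name)
    print(f"{name:<13} tps {tps_ratio:6.1%} of local SSD, "
          f"p95 latency x{latency_ratio:.2f}")

ceph_tps_ratio, ceph_latency_ratio = relative("Ceph NVMe")
```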
>> 
>> Is there any specific tuning that I can apply
>> to the ceph cluster, in order to improve those numbers? Or are those
>> numbers ok for the type and size of the cluster that we have? Any advice
>> would be really appreciated.
>> 
>> Thanks,
>> 
>> German
>> 
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [1]
>> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [1]>
>> 
>> Hi,
>> 
>> What is the value of --num-threads (the default is 1)?
>> Ceph will do better with more threads: 32 or 64.
>> What is the value of --file-block-size (default 16k) and
>> --file-test-mode? If you are using sequential seqwr/seqrd you will be
>> hitting the same OSD, so maybe try random (rndrd/rndwr) or, better, use
>> an rbd stripe size of 16kb (the default rbd stripe is 4M). rbd striping
>> is ideal for the small-block sequential io pattern typical of databases.
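
A minimal sketch of why a small stripe unit spreads sequential I/O, following the striping scheme as I understand it from the Ceph documentation (stripe unit, stripe count, object size). The stripe count of 8 is an assumption picked for the example; only the 16 KiB stripe unit and the 4 MiB default come from the message above.

```python
KIB = 1024
MIB = 1024 * KIB


def object_for_offset(offset, stripe_unit, stripe_count, object_size):
    """Return the object number that holds the given byte offset."""
    stripes_per_object = object_size // stripe_unit
    block_no = offset // stripe_unit          # which stripe unit overall
    stripe_no = block_no // stripe_count      # which row of stripe units
    stripe_pos = block_no % stripe_count      # which object within the set
    object_set_no = stripe_no // stripes_per_object
    return object_set_no * stripe_count + stripe_pos


# Eight consecutive 16 KiB writes at the start of the image.
offsets = [i * 16 * KIB for i in range(8)]

# Default layout: stripe unit equals object size, stripe count 1.
default_objects = [object_for_offset(o, 4 * MIB, 1, 4 * MIB) for o in offsets]

# Striped layout: 16 KiB stripe unit, hypothetical stripe count of 8.
striped_objects = [object_for_offset(o, 16 * KIB, 8, 4 * MIB) for o in offsets]

print("default:", default_objects)
print("striped:", striped_objects)
```

With the default layout all eight writes land in the same object, while the striped layout sends each one to a different object, which can then be served by different PGs and OSDs.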
>> 
>> /Maged
>> 

 

Links:
------
[1] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com