[ceph-users] ceph all-nvme mysql performance tuning

Maged Mokhtar mmokhtar at petasan.org
Mon Nov 27 06:36:18 PST 2017


On 2017-11-27 15:02, German Anders wrote:

> Hi All, 
> 
> I have a performance question: we recently installed a brand-new Ceph cluster with all-NVMe disks, using ceph version 12.2.0 with BlueStore configured. The back-end of the cluster uses a bonded IPoIB link (active/passive), and the front-end uses an active/active bond (20GbE) to communicate with the clients. 
> 
> The cluster configuration is the following: 
> 
> MON NODES: 
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14  
> 3x 1U servers: 
> 2x Intel Xeon E5-2630v4 @2.2Ghz 
> 128G RAM 
> 2x Intel SSD DC S3520 150G (in RAID-1 for OS) 
> 2x 82599ES 10-Gigabit SFI/SFP+ Network Connection 
> 
> OSD NODES: 
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14 
> 4x 2U servers: 
> 2x Intel Xeon E5-2640v4 @2.4Ghz 
> 128G RAM 
> 2x Intel SSD DC S3520 150G (in RAID-1 for OS) 
> 1x Ethernet Controller 10G X550T 
> 1x 82599ES 10-Gigabit SFI/SFP+ Network Connection 
> 12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons 
> 1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port) 
> 
> Here's the tree: 
> 
> ID CLASS WEIGHT   TYPE NAME          STATUS REWEIGHT PRI-AFF 
> -7       48.00000 root root 
> -5       24.00000     rack rack1 
> -1       12.00000         node cpn01 
> 0  nvme  1.00000             osd.0      up  1.00000 1.00000 
> 1  nvme  1.00000             osd.1      up  1.00000 1.00000 
> 2  nvme  1.00000             osd.2      up  1.00000 1.00000 
> 3  nvme  1.00000             osd.3      up  1.00000 1.00000 
> 4  nvme  1.00000             osd.4      up  1.00000 1.00000 
> 5  nvme  1.00000             osd.5      up  1.00000 1.00000 
> 6  nvme  1.00000             osd.6      up  1.00000 1.00000 
> 7  nvme  1.00000             osd.7      up  1.00000 1.00000 
> 8  nvme  1.00000             osd.8      up  1.00000 1.00000 
> 9  nvme  1.00000             osd.9      up  1.00000 1.00000 
> 10  nvme  1.00000             osd.10     up  1.00000 1.00000 
> 11  nvme  1.00000             osd.11     up  1.00000 1.00000 
> -3       12.00000         node cpn03 
> 24  nvme  1.00000             osd.24     up  1.00000 1.00000 
> 25  nvme  1.00000             osd.25     up  1.00000 1.00000 
> 26  nvme  1.00000             osd.26     up  1.00000 1.00000 
> 27  nvme  1.00000             osd.27     up  1.00000 1.00000 
> 28  nvme  1.00000             osd.28     up  1.00000 1.00000 
> 29  nvme  1.00000             osd.29     up  1.00000 1.00000 
> 30  nvme  1.00000             osd.30     up  1.00000 1.00000 
> 31  nvme  1.00000             osd.31     up  1.00000 1.00000 
> 32  nvme  1.00000             osd.32     up  1.00000 1.00000 
> 33  nvme  1.00000             osd.33     up  1.00000 1.00000 
> 34  nvme  1.00000             osd.34     up  1.00000 1.00000 
> 35  nvme  1.00000             osd.35     up  1.00000 1.00000 
> -6       24.00000     rack rack2 
> -2       12.00000         node cpn02 
> 12  nvme  1.00000             osd.12     up  1.00000 1.00000 
> 13  nvme  1.00000             osd.13     up  1.00000 1.00000 
> 14  nvme  1.00000             osd.14     up  1.00000 1.00000 
> 15  nvme  1.00000             osd.15     up  1.00000 1.00000 
> 16  nvme  1.00000             osd.16     up  1.00000 1.00000 
> 17  nvme  1.00000             osd.17     up  1.00000 1.00000 
> 18  nvme  1.00000             osd.18     up  1.00000 1.00000 
> 19  nvme  1.00000             osd.19     up  1.00000 1.00000 
> 20  nvme  1.00000             osd.20     up  1.00000 1.00000 
> 21  nvme  1.00000             osd.21     up  1.00000 1.00000 
> 22  nvme  1.00000             osd.22     up  1.00000 1.00000 
> 23  nvme  1.00000             osd.23     up  1.00000 1.00000 
> -4       12.00000         node cpn04 
> 36  nvme  1.00000             osd.36     up  1.00000 1.00000 
> 37  nvme  1.00000             osd.37     up  1.00000 1.00000 
> 38  nvme  1.00000             osd.38     up  1.00000 1.00000 
> 39  nvme  1.00000             osd.39     up  1.00000 1.00000 
> 40  nvme  1.00000             osd.40     up  1.00000 1.00000 
> 41  nvme  1.00000             osd.41     up  1.00000 1.00000 
> 42  nvme  1.00000             osd.42     up  1.00000 1.00000 
> 43  nvme  1.00000             osd.43     up  1.00000 1.00000 
> 44  nvme  1.00000             osd.44     up  1.00000 1.00000 
> 45  nvme  1.00000             osd.45     up  1.00000 1.00000 
> 46  nvme  1.00000             osd.46     up  1.00000 1.00000 
> 47  nvme  1.00000             osd.47     up  1.00000 1.00000 
> 
> The disk partition of one of the OSD nodes: 
> 
> NAME                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT 
> nvme6n1                259:1    0   1.1T  0 disk 
> ├─nvme6n1p2            259:15   0   1.1T  0 part 
> └─nvme6n1p1            259:13   0   100M  0 part  /var/lib/ceph/osd/ceph-6 
> nvme9n1                259:0    0   1.1T  0 disk 
> ├─nvme9n1p2            259:8    0   1.1T  0 part 
> └─nvme9n1p1            259:7    0   100M  0 part  /var/lib/ceph/osd/ceph-9 
> sdb                      8:16   0 139.8G  0 disk 
> └─sdb1                   8:17   0 139.8G  0 part 
> └─md0                  9:0    0 139.6G  0 raid1 
> ├─md0p2            259:31   0     1K  0 md 
> ├─md0p5            259:32   0 139.1G  0 md 
> │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP] 
> │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   / 
> └─md0p1            259:30   0 486.3M  0 md    /boot 
> nvme11n1               259:2    0   1.1T  0 disk 
> ├─nvme11n1p1           259:12   0   100M  0 part  /var/lib/ceph/osd/ceph-11 
> └─nvme11n1p2           259:14   0   1.1T  0 part 
> nvme2n1                259:6    0   1.1T  0 disk 
> ├─nvme2n1p2            259:21   0   1.1T  0 part 
> └─nvme2n1p1            259:20   0   100M  0 part  /var/lib/ceph/osd/ceph-2 
> nvme5n1                259:3    0   1.1T  0 disk 
> ├─nvme5n1p1            259:9    0   100M  0 part  /var/lib/ceph/osd/ceph-5 
> └─nvme5n1p2            259:10   0   1.1T  0 part 
> nvme8n1                259:24   0   1.1T  0 disk 
> ├─nvme8n1p1            259:26   0   100M  0 part  /var/lib/ceph/osd/ceph-8 
> └─nvme8n1p2            259:28   0   1.1T  0 part 
> nvme10n1               259:11   0   1.1T  0 disk 
> ├─nvme10n1p1           259:22   0   100M  0 part  /var/lib/ceph/osd/ceph-10 
> └─nvme10n1p2           259:23   0   1.1T  0 part 
> nvme1n1                259:33   0   1.1T  0 disk 
> ├─nvme1n1p1            259:34   0   100M  0 part  /var/lib/ceph/osd/ceph-1 
> └─nvme1n1p2            259:35   0   1.1T  0 part 
> nvme4n1                259:5    0   1.1T  0 disk 
> ├─nvme4n1p1            259:18   0   100M  0 part  /var/lib/ceph/osd/ceph-4 
> └─nvme4n1p2            259:19   0   1.1T  0 part 
> nvme7n1                259:25   0   1.1T  0 disk 
> ├─nvme7n1p1            259:27   0   100M  0 part  /var/lib/ceph/osd/ceph-7 
> └─nvme7n1p2            259:29   0   1.1T  0 part 
> sda                      8:0    0 139.8G  0 disk 
> └─sda1                   8:1    0 139.8G  0 part 
> └─md0                  9:0    0 139.6G  0 raid1 
> ├─md0p2            259:31   0     1K  0 md 
> ├─md0p5            259:32   0 139.1G  0 md 
> │ ├─cpn01--vg-swap 253:1    0  27.4G  0 lvm   [SWAP] 
> │ └─cpn01--vg-root 253:0    0 111.8G  0 lvm   / 
> └─md0p1            259:30   0 486.3M  0 md    /boot 
> nvme0n1                259:36   0   1.1T  0 disk 
> ├─nvme0n1p1            259:37   0   100M  0 part  /var/lib/ceph/osd/ceph-0 
> └─nvme0n1p2            259:38   0   1.1T  0 part 
> nvme3n1                259:4    0   1.1T  0 disk 
> ├─nvme3n1p1            259:16   0   100M  0 part  /var/lib/ceph/osd/ceph-3 
> └─nvme3n1p2            259:17   0   1.1T  0 part 
> 
> For the disk scheduler we're using [kyber]; for read_ahead_kb we tried different values (0, 128 and 2048); rq_affinity is set to 2; and the rotational parameter is set to 0. 
> We've also set the CPU governor to performance on all cores and tuned some sysctl parameters (a rough sketch of how all of this gets applied follows the sysctl list below): 
> 
> # for Ceph 
> net.ipv4.ip_forward=0 
> net.ipv4.conf.default.rp_filter=1 
> kernel.sysrq=0 
> kernel.core_uses_pid=1 
> net.ipv4.tcp_syncookies=0 
> #net.netfilter.nf_conntrack_max=2621440 
> #net.netfilter.nf_conntrack_tcp_timeout_established = 1800 
> # disable netfilter on bridges 
> #net.bridge.bridge-nf-call-ip6tables = 0 
> #net.bridge.bridge-nf-call-iptables = 0 
> #net.bridge.bridge-nf-call-arptables = 0 
> vm.min_free_kbytes=1000000 
> 
> # Controls the default maximum size of a message queue, in bytes 
> kernel.msgmnb = 65536 
> 
> # Controls the maximum size of a single message, in bytes 
> kernel.msgmax = 65536 
> 
> # Controls the maximum shared segment size, in bytes 
> kernel.shmmax = 68719476736 
> 
> # Controls the total amount of shared memory, in pages 
> kernel.shmall = 4294967296 
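> 
> A minimal sketch of how the block-device settings and the sysctl file above get applied (the device name, the repetition over the other NVMe devices and the sysctl file path are just illustrative, not our exact scripts): 
> 
>     # per NVMe device (repeated for nvme0n1 .. nvme11n1) 
>     echo kyber | tee /sys/block/nvme0n1/queue/scheduler 
>     echo 128   | tee /sys/block/nvme0n1/queue/read_ahead_kb 
>     echo 2     | tee /sys/block/nvme0n1/queue/rq_affinity 
>     echo 0     | tee /sys/block/nvme0n1/queue/rotational 
> 
>     # CPU frequency governor on all cores 
>     cpupower frequency-set -g performance 
> 
>     # load the sysctl settings above (example path) 
>     sysctl -p /etc/sysctl.d/90-ceph.conf 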
> 
> The ceph.conf file is: 
> 
> ... 
> 
> osd_pool_default_size = 3 
> osd_pool_default_min_size = 2 
> osd_pool_default_pg_num = 1600 
> osd_pool_default_pgp_num = 1600 
> 
> debug_crush = 1/1 
> debug_buffer = 0/1 
> debug_timer = 0/0 
> debug_filer = 0/1 
> debug_objecter = 0/1 
> debug_rados = 0/5 
> debug_rbd = 0/5 
> debug_ms = 0/5 
> debug_throttle = 1/1 
> 
> debug_journaler = 0/0 
> debug_objectcacher = 0/0 
> debug_client = 0/0 
> debug_osd = 0/0 
> debug_optracker = 0/0 
> debug_objclass = 0/0 
> debug_journal = 0/0 
> debug_filestore = 0/0 
> debug_mon = 0/0 
> debug_paxos = 0/0 
> 
> osd_crush_chooseleaf_type = 0 
> filestore_xattr_use_omap = true 
> 
> rbd_cache = true 
> mon_compact_on_trim = false 
> 
> [osd] 
> osd_crush_update_on_start = false 
> 
> [client] 
> rbd_cache = true 
> rbd_cache_writethrough_until_flush = true 
> rbd_default_features = 1 
> admin_socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok 
> log_file = /var/log/ceph/ 
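> 
> With the admin_socket setting above, the effective client options can be double-checked at runtime, for example (the socket name here is just an example of the $pid/$cctid pattern): 
> 
>     ceph --admin-daemon /var/run/ceph/ceph-client.admin.4321.140160430537232.asok config show | grep rbd_cache 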
> 
> The cluster has two production pools: one for OpenStack (volumes) with a replication factor of 3, and another pool for databases (db) with a replication factor of 2. The DBA team has performed several tests with an RBD volume mounted on the DB server. The DB server has the following configuration: 
> 
> OS: CentOS 6.9 | kernel 4.14.1 
> DB: MySQL 
> ProLiant BL685c G7 
> 4x AMD Opteron Processor 6376 (total of 64 cores) 
> 128G RAM 
> 1x OneConnect 10Gb NIC (quad-port) - in a bond configuration (active/active) with 3 vlans 
> 
> We also did some tests with SYSBENCH on different storage types: 
> 
> disk           tps      qps        latency (ms, 95th percentile) 
> Local SSD      261,28   5.225,61    5,18 
> Ceph NVMe       95,18   1.903,53   12,3 
> Pure Storage   196,49   3.929,71    6,32 
> NetApp FAS     189,83   3.796,59    6,67 
> EMC VMAX       196,14   3.922,82    6,32 
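> 
> For reference, a typical sysbench OLTP invocation that reports tps, qps and 95th-percentile latency looks roughly like this (sysbench 1.0 syntax; host, credentials, table sizes and thread count are placeholders, not the exact parameters the DBA team used): 
> 
>     sysbench oltp_read_write \
>         --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=secret \
>         --mysql-db=sbtest --tables=10 --table-size=1000000 \
>         --threads=16 --time=300 --report-interval=10 \
>         run    # after an initial 'prepare' step 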
> 
> Is there any specific tuning I can apply to the Ceph cluster in order to improve those numbers? Or are these numbers OK for the type and size of cluster that we have? Any advice would be really appreciated. 
> 
> Thanks, 
> 
> German
> 

Hi, 

What is the value of --num-threads (the default is 1)? Ceph will do
better with more threads: try 32 or 64.
What are the values of --file-block-size (default 16k) and
--file-test-mode? If you are using the sequential modes (seqwr/seqrd)
you will keep hitting the same OSD, so try the random modes
(rndrd/rndwr) instead, or better, use an RBD stripe unit of 16kb (the
default RBD stripe is the 4M object size). RBD striping is well suited
to the small-block sequential I/O pattern typical of databases.
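Something along these lines, as a rough sketch (file sizes, runtime,
pool and image names are just placeholders):

    # sysbench fileio with random 16k I/O and more threads
    sysbench --test=fileio --file-total-size=8G --num-threads=32 prepare
    sysbench --test=fileio --file-total-size=8G --file-test-mode=rndrw \
        --file-block-size=16384 --num-threads=32 --max-time=300 run

    # RBD image with a 16k stripe unit (enables the striping feature)
    rbd create db/mysql-bench --size 100G --stripe-unit 16384 --stripe-count 16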

/Maged