[ceph-users] PGs get placed in the same datacenter (Trying to make a hybrid NVMe/HDD pool with 6 servers, 2 in each datacenter)

Peter Linder peter.linder at fiberdirekt.se
Sun Oct 8 08:38:02 PDT 2017


Oh, you mean monitor quorum is enforced? I never really considered that.
However, I think I found another solution:

I created a second tree called "ldc" and under it I made 3 "logical
datacenters" (still waiting for a better name), grouping the servers so
that each logical datacenter contains 3 servers: one NVMe server and
two HDD servers, drawn from different physical datacenters. I could
then rewrite my hybrid rule to simply select one logical datacenter and
then 3 hostgroups from it. I also made a new bucket type called
"hostgroup" that I put the physical servers under, so that it is easy
to add more servers in the future (just add them to the correct
hostgroup).

It should work; I will test it fully this coming week.
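As a quick offline sanity check of the idea, here is a minimal sketch
(plain Python, nothing Ceph actually runs; the host-to-datacenter
mapping is transcribed from the crushmap below) showing that picking
one host from each non-empty hostgroup of a logical datacenter never
lands two replicas in the same physical datacenter:

```python
# Sketch: verify that each "logical datacenter" in the ldc tree only
# groups hosts from distinct *physical* datacenters, which is what lets
# the hybrid rule pick its hostgroups from one logical DC without ever
# putting two replicas in the same physical DC.

# Physical placement of each host, taken from the crushmap.
physical_dc = {
    "storage11": "HORN79", "storage21": "HORN79",
    "storage12": "TEG4",
    "storage13": "WAR", "storage23": "WAR",
}

# Logical datacenters -> hostgroups -> member hosts (empty hostgroups
# are placeholders for future servers).
logical_dcs = {
    "ldc1": {"hg1-1": ["storage11"], "hg1-2": [], "hg1-3": ["storage23"]},
    "ldc2": {"hg2-1": ["storage12"], "hg2-2": ["storage21"], "hg2-3": ["storage23"]},
    "ldc3": {"hg3-1": ["storage13"], "hg3-2": ["storage21"], "hg3-3": []},
}

def replica_dcs(ldc):
    """Physical DCs hit if one host is chosen from each non-empty hostgroup."""
    hosts = [members[0] for members in logical_dcs[ldc].values() if members]
    return [physical_dc[h] for h in hosts]

for ldc in logical_dcs:
    dcs = replica_dcs(ldc)
    # No two replicas may share a physical datacenter.
    assert len(dcs) == len(set(dcs)), f"{ldc} places two replicas in one DC"
    print(ldc, "->", dcs)
```

Note that hg1-2 and hg3-3 are empty for now, so ldc1 and ldc3 can only
yield two replicas until more servers are added to them.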

The complete crushmap is below. The rules and related definitions for
the other two, more conventional rules are unchanged; the interesting
part starts about halfway down.

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class nvme
device 1 osd.1 class nvme
device 2 osd.2 class nvme
device 3 osd.3 class nvme
device 4 osd.4 class nvme
device 5 osd.5 class nvme
device 6 osd.6 class nvme
device 7 osd.7 class nvme
device 8 osd.8 class nvme
device 9 osd.9 class nvme
device 10 osd.10 class nvme
device 11 osd.11 class nvme
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 22 osd.22 class hdd
device 23 osd.23 class hdd
device 24 osd.24 class hdd
device 25 osd.25 class hdd
device 26 osd.26 class hdd
device 27 osd.27 class hdd
device 28 osd.28 class hdd
device 29 osd.29 class hdd
device 30 osd.30 class hdd
device 31 osd.31 class hdd
device 32 osd.32 class hdd
device 33 osd.33 class hdd
device 34 osd.34 class hdd
device 35 osd.35 class hdd

# types
type 0 osd
type 1 host
type 2 hostgroup
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host storage11 {
        id -5           # do not change unnecessarily
        id -6 class nvme                # do not change unnecessarily
        id -10 class hdd                # do not change unnecessarily
        # weight 2.913
        alg straw2
        hash 0  # rjenkins1
        item osd.0 weight 0.729
        item osd.3 weight 0.728
        item osd.6 weight 0.728
        item osd.9 weight 0.728
}
host storage21 {
        id -13          # do not change unnecessarily
        id -14 class nvme               # do not change unnecessarily
        id -15 class hdd                # do not change unnecessarily
        # weight 65.496
        alg straw2
        hash 0  # rjenkins1
        item osd.12 weight 5.458
        item osd.13 weight 5.458
        item osd.14 weight 5.458
        item osd.15 weight 5.458
        item osd.16 weight 5.458
        item osd.17 weight 5.458
        item osd.18 weight 5.458
        item osd.19 weight 5.458
        item osd.20 weight 5.458
        item osd.21 weight 5.458
        item osd.22 weight 5.458
        item osd.23 weight 5.458
}
datacenter HORN79 {
        id -19          # do not change unnecessarily
        id -26 class nvme               # do not change unnecessarily
        id -27 class hdd                # do not change unnecessarily
        # weight 68.406
        alg straw2
        hash 0  # rjenkins1
        item storage11 weight 2.911
        item storage21 weight 65.495
}
host storage13 {
        id -7           # do not change unnecessarily
        id -8 class nvme                # do not change unnecessarily
        id -11 class hdd                # do not change unnecessarily
        # weight 2.912
        alg straw2
        hash 0  # rjenkins1
        item osd.2 weight 0.728
        item osd.5 weight 0.728
        item osd.8 weight 0.728
        item osd.11 weight 0.728
}
host storage23 {
        id -16          # do not change unnecessarily
        id -17 class nvme               # do not change unnecessarily
        id -18 class hdd                # do not change unnecessarily
        # weight 65.496
        alg straw2
        hash 0  # rjenkins1
        item osd.24 weight 5.458
        item osd.25 weight 5.458
        item osd.26 weight 5.458
        item osd.27 weight 5.458
        item osd.28 weight 5.458
        item osd.29 weight 5.458
        item osd.30 weight 5.458
        item osd.31 weight 5.458
        item osd.32 weight 5.458
        item osd.33 weight 5.458
        item osd.34 weight 5.458
        item osd.35 weight 5.458
}
datacenter WAR {
        id -20          # do not change unnecessarily
        id -24 class nvme               # do not change unnecessarily
        id -25 class hdd                # do not change unnecessarily
        # weight 68.406
        alg straw2
        hash 0  # rjenkins1
        item storage13 weight 2.911
        item storage23 weight 65.495
}
host storage12 {
        id -3           # do not change unnecessarily
        id -4 class nvme                # do not change unnecessarily
        id -9 class hdd         # do not change unnecessarily
        # weight 2.912
        alg straw2
        hash 0  # rjenkins1
        item osd.1 weight 0.728
        item osd.4 weight 0.728
        item osd.7 weight 0.728
        item osd.10 weight 0.728
}
datacenter TEG4 {
        id -21          # do not change unnecessarily
        id -22 class nvme               # do not change unnecessarily
        id -23 class hdd                # do not change unnecessarily
        # weight 2.911
        alg straw2
        hash 0  # rjenkins1
        item storage12 weight 2.911
}
root default {
        id -1           # do not change unnecessarily
        id -2 class nvme                # do not change unnecessarily
        id -12 class hdd                # do not change unnecessarily
        # weight 139.721
        alg straw2
        hash 0  # rjenkins1
        item HORN79 weight 68.405
        item WAR weight 68.405
        item TEG4 weight 2.911
}
hostgroup hg1-1 {
        id -30          # do not change unnecessarily
        id -28 class nvme               # do not change unnecessarily
        id -54 class hdd                # do not change unnecessarily
        # weight 2.913
        alg straw2
        hash 0  # rjenkins1
        item storage11 weight 2.913
}
hostgroup hg1-2 {
        id -31          # do not change unnecessarily
        id -29 class nvme               # do not change unnecessarily
        id -55 class hdd                # do not change unnecessarily
        # weight 0.000
        alg straw2
        hash 0  # rjenkins1
}
hostgroup hg1-3 {
        id -32          # do not change unnecessarily
        id -43 class nvme               # do not change unnecessarily
        id -56 class hdd                # do not change unnecessarily
        # weight 65.496
        alg straw2
        hash 0  # rjenkins1
        item storage23 weight 65.496
}
hostgroup hg2-1 {
        id -33          # do not change unnecessarily
        id -45 class nvme               # do not change unnecessarily
        id -58 class hdd                # do not change unnecessarily
        # weight 2.912
        alg straw2
        hash 0  # rjenkins1
        item storage12 weight 2.912
}
hostgroup hg2-2 {
        id -34          # do not change unnecessarily
        id -46 class nvme               # do not change unnecessarily
        id -59 class hdd                # do not change unnecessarily
        # weight 65.496
        alg straw2
        hash 0  # rjenkins1
        item storage21 weight 65.496
}
hostgroup hg2-3 {
        id -35          # do not change unnecessarily
        id -47 class nvme               # do not change unnecessarily
        id -60 class hdd                # do not change unnecessarily
        # weight 65.496
        alg straw2
        hash 0  # rjenkins1
        item storage23 weight 65.496
}
hostgroup hg3-1 {
        id -36          # do not change unnecessarily
        id -49 class nvme               # do not change unnecessarily
        id -62 class hdd                # do not change unnecessarily
        # weight 2.912
        alg straw2
        hash 0  # rjenkins1
        item storage13 weight 2.912
}
hostgroup hg3-2 {
        id -37          # do not change unnecessarily
        id -50 class nvme               # do not change unnecessarily
        id -63 class hdd                # do not change unnecessarily
        # weight 65.496
        alg straw2
        hash 0  # rjenkins1
        item storage21 weight 65.496
}
hostgroup hg3-3 {
        id -38          # do not change unnecessarily
        id -51 class nvme               # do not change unnecessarily
        id -64 class hdd                # do not change unnecessarily
        # weight 0.000
        alg straw2
        hash 0  # rjenkins1
}
datacenter ldc1 {
        id -39          # do not change unnecessarily
        id -44 class nvme               # do not change unnecessarily
        id -57 class hdd                # do not change unnecessarily
        # weight 68.409
        alg straw2
        hash 0  # rjenkins1
        item hg1-1 weight 2.913
        item hg1-2 weight 0.000
        item hg1-3 weight 65.496
}
datacenter ldc2 {
        id -40          # do not change unnecessarily
        id -48 class nvme               # do not change unnecessarily
        id -61 class hdd                # do not change unnecessarily
        # weight 133.904
        alg straw2
        hash 0  # rjenkins1
        item hg2-1 weight 2.912
        item hg2-2 weight 65.496
        item hg2-3 weight 65.496
}
datacenter ldc3 {
        id -41          # do not change unnecessarily
        id -52 class nvme               # do not change unnecessarily
        id -65 class hdd                # do not change unnecessarily
        # weight 68.408
        alg straw2
        hash 0  # rjenkins1
        item hg3-1 weight 2.912
        item hg3-2 weight 65.496
        item hg3-3 weight 0.000
}
root ldc {
        id -42          # do not change unnecessarily
        id -53 class nvme               # do not change unnecessarily
        id -66 class hdd                # do not change unnecessarily
        # weight 270.721
        alg straw2
        hash 0  # rjenkins1
        item ldc1 weight 68.409
        item ldc2 weight 133.904
        item ldc3 weight 68.408
}

# rules
rule hybrid {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take ldc
        step choose firstn 1 type datacenter
        step chooseleaf firstn 0 type hostgroup
        step emit
}
rule hdd {
        id 2
        type replicated
        min_size 1
        max_size 3
        step take default class hdd
        step chooseleaf firstn 0 type datacenter
        step emit
}
rule nvme {
        id 3
        type replicated
        min_size 1
        max_size 3
        step take default class nvme
        step chooseleaf firstn 0 type datacenter
        step emit
}

# end crush map
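When I test this I will start by checking the map offline with
crushtool rather than injecting it straight away; something along these
lines (file names are just examples):

```shell
# Compile the edited text map into binary form.
crushtool -c crushmap.txt -o crushmap.bin

# Simulate the hybrid rule (rule id 1) for a 3-replica pool and print
# the OSDs chosen for a range of sample inputs.
crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-mappings

# Summarize how evenly the rule distributes data across OSDs.
crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-utilization
```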

On 10/8/2017 3:22 PM, David Turner wrote:
>
> That's correct. It doesn't matter how many copies of the data you have
> in each datacenter. The mons control the maps and you should be good
> as long as you have 1 mon per DC. You should test this to see how the
> recovery goes, but there shouldn't be a problem.
>
>
> On Sat, Oct 7, 2017, 6:10 PM Дробышевский, Владимир <vlad at itgorod.ru
> <mailto:vlad at itgorod.ru>> wrote:
>
>     2017-10-08 2:02 GMT+05:00 Peter Linder
>     <peter.linder at fiberdirekt.se <mailto:peter.linder at fiberdirekt.se>>:
>
>>
>>         Then, I believe, the next best configuration would be to set
>>         size for this pool to 4.  It would choose an NVMe as the
>>         primary OSD, and then choose an HDD from each DC for the
>>         secondary copies.  This will guarantee that a copy of the
>>         data goes into each DC and you will have 2 copies in other
>>         DCs away from the primary NVMe copy.  It wastes a copy of all
>>         of the data in the pool, but that's on the much cheaper HDD
>>         storage and can probably be considered acceptable losses for
>>         the sake of having the primary OSD on NVMe drives.
>         I have considered this, and it should of course work under
>         normal conditions, but what if one datacenter is isolated
>         while running? We would be left with two running copies on
>         each side for all PGs, with no way of knowing what gets
>         written where. In the end, data would be destroyed due to
>         the split brain. Even being able to enforce quorum where the
>         SSD is would mean a single point of failure.
>
>     If you have one mon per DC, all operations in the isolated DC
>     will be frozen, so I believe you would not lose data.
>      
>
>
>
>>
>>         On Sat, Oct 7, 2017 at 3:36 PM Peter Linder
>>         <peter.linder at fiberdirekt.se
>>         <mailto:peter.linder at fiberdirekt.se>> wrote:
>>
>>             On 10/7/2017 8:08 PM, David Turner wrote:
>>>
>>>             Just to make sure you understand that the reads will
>>>             happen on the primary osd for the PG and not the nearest
>>>             osd, meaning that reads will go between the datacenters.
>>>             Also that each write will not ack until all 3 writes
>>>             happen adding the latency to the writes and reads both.
>>>
>>>
>>
>>             Yes, I understand this. It is actually fine; the
>>             datacenters have been selected so that they are about
>>             10-20 km apart. This yields around a 0.1-0.2 ms round
>>             trip time, owing to the finite speed of light.
>>             Nevertheless, network latency shouldn't be a problem,
>>             and it's all dedicated 40G TRILL network for the
>>             moment.
>>
>>             I just want to be able to select 1 SSD and 2 HDDs, all
>>             spread out. I can do that, but one of the HDDs ends up
>>             in the same datacenter, probably because I'm using the
>>             "take" command twice (does it reset the bucket
>>             selection?).
>>
>>
>>
>>>             On Sat, Oct 7, 2017, 1:48 PM Peter Linder
>>>             <peter.linder at fiberdirekt.se
>>>             <mailto:peter.linder at fiberdirekt.se>> wrote:
>>>
>>>                 On 10/7/2017 7:36 PM, Дробышевский, Владимир wrote:
>>>>                 Hello!
>>>>
>>>>                 2017-10-07 19:12 GMT+05:00 Peter Linder
>>>>                 <peter.linder at fiberdirekt.se
>>>>                 <mailto:peter.linder at fiberdirekt.se>>:
>>>>
>>>>                     The idea is to select an nvme osd, and
>>>>                     then select the rest from hdd osds in different
>>>>                     datacenters (see crush
>>>>                     map below for hierarchy). 
>>>>
>>>>                 It's a little bit aside from the question, but
>>>>                 why do you want to mix SSDs and HDDs in the same
>>>>                 pool? Do you have a read-intensive workload and
>>>>                 plan to use primary-affinity to get all reads
>>>>                 from NVMe?
>>>>
>>>>
>>>                 Yes, this is pretty much the idea, getting the
>>>                 performance from NVMe reads, while still maintaining
>>>                 triple redundancy and a reasonable cost.
>>>
>>>
>>>>                 -- 
>>>>                 Regards,
>>>>                 Vladimir
>>>
>>>
>>>                 _______________________________________________
>>>                 ceph-users mailing list
>>>                 ceph-users at lists.ceph.com
>>>                 <mailto:ceph-users at lists.ceph.com>
>>>                 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>
>
>
>
>
>
>     -- 
>     Regards,
>     Vladimir
>
>
>

