[ceph-users] Needed help to setup a 3-way replication between 2 datacenters

Peter Linder peter.linder at fiberdirekt.se
Fri Nov 10 00:29:16 PST 2017


On 11/10/2017 7:17 AM, Sébastien VIGNERON wrote:
> Hi everyone,
>
> I’m a beginner with Ceph, and I’m looking for a way to do 3-way
> replication between 2 datacenters, as mentioned in the Ceph docs (but
> not described there).
>
> My goal is to keep access to the data (at least read-only access) even
> when the link between the 2 datacenters is cut and make sure at least
> one copy of the data exists in each datacenter.
If that is your goal, then why 3-way replication?

>
> I’m not sure how to implement such 3-way replication. With a rule?
> Based on the CEPH docs, I think of a rule:
> rule 3-way-replication_with_2_DC {
> ruleset 1
> type replicated
> min_size 2
> max_size 3

min_size and max_size here don't do what you expect them to. You need to
set min_size = 1 for a 2-way replicated cluster (beware of
inconsistencies if the link between the DCs goes down) and min_size = 2
for a 3-way replicated cluster, but that setting is on the pool, not in
the crush rule.
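
To illustrate the pool-level semantics, here is a minimal Python sketch.
It is a toy model, not Ceph code: a pg is assumed to accept I/O only
while at least min_size of its replicas are reachable. The function
names and the dict layout are made up for the example. On a real
cluster the values are set per pool, e.g. with
"ceph osd pool set <pool> min_size 2".

```python
def pg_accepts_io(reachable_replicas: int, min_size: int) -> bool:
    """Toy model: a pg serves I/O only with >= min_size reachable replicas."""
    return reachable_replicas >= min_size


# Pool-level settings discussed above (hypothetical dict layout).
two_way = {"size": 2, "min_size": 1}
three_way = {"size": 3, "min_size": 2}


def survives_losses(pool: dict, lost: int) -> bool:
    """Does a pg of this pool keep serving I/O after losing `lost` replicas?"""
    return pg_accepts_io(pool["size"] - lost, pool["min_size"])


print(survives_losses(two_way, 1))    # one of two replicas lost
print(survives_losses(three_way, 1))  # one of three replicas lost
print(survives_losses(three_way, 2))  # two of three replicas lost
```

With these settings, either pool tolerates the loss of one replica, but
the 3-way pool stops serving a pg once two of its replicas are
unreachable.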

> step take DC-1
> step choose firstn 1 type host
> step chooseleaf firstn 1 type osd
> step emit
> step take DC-2
> step choose firstn 1 type host
> step chooseleaf firstn 1 type osd
> step emit
> step take default
> step choose firstn 1 type host
> step chooseleaf firstn 1 type osd
> step emit
> }
> but what should happen if the link between the 2 datacenters is cut?
> If someone has a better solution, I am interested in any resources
> about it (examples, …).

This seems to, for each pg, take an osd on a host in DC-1, then an osd
on a host in DC-2, and then just a random osd on a random host anywhere.
About 50% of the extra osds selected will be in DC-1 and the rest in
DC-2. When the link is cut, about 50% of the pgs will not be able to
fulfil the min_size = 2 requirement (which 50% depends on whether the
observer is in DC-1 or DC-2, of course) and operations will stop on
these. This should in practice stop all operations, and I'm not even
considering monitor quorum yet.
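
A rough Python simulation of that reasoning, as a sketch under stated
assumptions rather than a model of real CRUSH behaviour: the third
replica is assumed to land in either datacenter with equal probability
(the two datacenters have equal weight in the osd tree), the pool is
assumed to have min_size = 2, and the monitor counts come from the
"Mons" line of the original mail (host-1 and host-2 in DC-1, host-4 in
DC-2). All function names are made up for the example.

```python
import random

MIN_SIZE = 2                    # assumed pool-level min_size for the 3-way pool
MONS = {"DC-1": 2, "DC-2": 1}   # host-1, host-2 in DC-1; host-4 in DC-2


def place_pg(rng: random.Random) -> list:
    """Datacenter of each replica of one pg under the proposed rule."""
    third = rng.choice(["DC-1", "DC-2"])  # assumption: 50/50 split
    return ["DC-1", "DC-2", third]


def active_fraction(pgs: list, side: str) -> float:
    """Fraction of pgs with at least MIN_SIZE replicas inside `side`."""
    ok = sum(1 for pg in pgs if pg.count(side) >= MIN_SIZE)
    return ok / len(pgs)


def has_mon_quorum(side: str) -> bool:
    """Monitors need a strict majority of all monitors to form a quorum."""
    return MONS[side] > sum(MONS.values()) / 2


rng = random.Random(0)
pgs = [place_pg(rng) for _ in range(10_000)]
for side in ("DC-1", "DC-2"):
    print(side, round(active_fraction(pgs, side), 2), has_mon_quorum(side))
```

Under these assumptions each side of the cut link can serve only
roughly half of the pgs, and only the DC-1 side keeps a monitor
majority, so the DC-2 side would not be usable at all.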

>
> The default rule (see below) keeps the pool working when we mark each
> node of DC-2 as down (typically for maintenance), but if we shut down
> the link between the 2 datacenters, the pool/rbd hangs (for example, a
> dd write freezes).
> Does anyone have some insight on how to set up 3-way replication
> between 2 datacenters?
I don't really know why there is a difference here. We opted for a 3-way
cluster in 3 separate datacenters though. Perhaps you can somehow
simulate 2 separate datacenters in one of yours, or at least make sure
they are on different power circuits etc. Also, consider redundancy for
your network so that it does not go down. Spanning tree is a little
slow, but TRILL or SPB should work in your case.


>
> Thanks in advance for any advice on the topic.
>
> Current situation:
>
> Mons : host-1, host-2, host-4
>
> Quick network topology:
>
> USERS NETWORK
>      |
>    2x10G
>      |
>   DC-1-SWITCH <——— 40G ——> DC-2-SWITCH
>         | | |                   | | |
> host-1 _| | |           host-4 _| | |
> host-2 ___| |           host-5 ___| |
> host-3 _____|           host-6 _____|
>
>
>
> crushmap :
> # ceph osd tree
> ID  CLASS WEIGHT    TYPE NAME                   STATUS REWEIGHT PRI-AFF
>  -1       147.33325 root default
> -20        73.66663     datacenter DC-1
> -15        73.66663         rack DC-1-RACK-1
>  -9        24.55554             host host-1
>  27   hdd   2.72839                 osd.27          up  1.00000 1.00000
>  28   hdd   2.72839                 osd.28          up  1.00000 1.00000
>  29   hdd   2.72839                 osd.29          up  1.00000 1.00000
>  30   hdd   2.72839                 osd.30          up  1.00000 1.00000
>  31   hdd   2.72839                 osd.31          up  1.00000 1.00000
>  32   hdd   2.72839                 osd.32          up  1.00000 1.00000
>  33   hdd   2.72839                 osd.33          up  1.00000 1.00000
>  34   hdd   2.72839                 osd.34          up  1.00000 1.00000
>  36   hdd   2.72839                 osd.36          up  1.00000 1.00000
> -11        24.55554             host host-2
>  35   hdd   2.72839                 osd.35          up  1.00000 1.00000
>  37   hdd   2.72839                 osd.37          up  1.00000 1.00000
>  38   hdd   2.72839                 osd.38          up  1.00000 1.00000
>  39   hdd   2.72839                 osd.39          up  1.00000 1.00000
>  40   hdd   2.72839                 osd.40          up  1.00000 1.00000
>  41   hdd   2.72839                 osd.41          up  1.00000 1.00000
>  42   hdd   2.72839                 osd.42          up  1.00000 1.00000
>  43   hdd   2.72839                 osd.43          up  1.00000 1.00000
>  46   hdd   2.72839                 osd.46          up  1.00000 1.00000
> -13        24.55554             host host-3
>  44   hdd   2.72839                 osd.44          up  1.00000 1.00000
>  45   hdd   2.72839                 osd.45          up  1.00000 1.00000
>  47   hdd   2.72839                 osd.47          up  1.00000 1.00000
>  48   hdd   2.72839                 osd.48          up  1.00000 1.00000
>  49   hdd   2.72839                 osd.49          up  1.00000 1.00000
>  50   hdd   2.72839                 osd.50          up  1.00000 1.00000
>  51   hdd   2.72839                 osd.51          up  1.00000 1.00000
>  52   hdd   2.72839                 osd.52          up  1.00000 1.00000
>  53   hdd   2.72839                 osd.53          up  1.00000 1.00000
> -19        73.66663     datacenter DC-2
> -16        73.66663         rack DC-2-RACK-1
>  -3        24.55554             host host-4
>   0   hdd   2.72839                 osd.0           up  1.00000 1.00000
>   1   hdd   2.72839                 osd.1           up  1.00000 1.00000
>   2   hdd   2.72839                 osd.2           up  1.00000 1.00000
>   3   hdd   2.72839                 osd.3           up  1.00000 1.00000
>   4   hdd   2.72839                 osd.4           up  1.00000 1.00000
>   5   hdd   2.72839                 osd.5           up  1.00000 1.00000
>   6   hdd   2.72839                 osd.6           up  1.00000 1.00000
>   7   hdd   2.72839                 osd.7           up  1.00000 1.00000
>   8   hdd   2.72839                 osd.8           up  1.00000 1.00000
>  -5        24.55554             host host-5
>   9   hdd   2.72839                 osd.9           up  1.00000 1.00000
>  10   hdd   2.72839                 osd.10          up  1.00000 1.00000
>  11   hdd   2.72839                 osd.11          up  1.00000 1.00000
>  12   hdd   2.72839                 osd.12          up  1.00000 1.00000
>  13   hdd   2.72839                 osd.13          up  1.00000 1.00000
>  14   hdd   2.72839                 osd.14          up  1.00000 1.00000
>  15   hdd   2.72839                 osd.15          up  1.00000 1.00000
>  16   hdd   2.72839                 osd.16          up  1.00000 1.00000
>  18   hdd   2.72839                 osd.18          up  1.00000 1.00000
>  -7        24.55554             host host-6
>  19   hdd   2.72839                 osd.19          up  1.00000 1.00000
>  20   hdd   2.72839                 osd.20          up  1.00000 1.00000
>  21   hdd   2.72839                 osd.21          up  1.00000 1.00000
>  22   hdd   2.72839                 osd.22          up  1.00000 1.00000
>  23   hdd   2.72839                 osd.23          up  1.00000 1.00000
>  24   hdd   2.72839                 osd.24          up  1.00000 1.00000
>  25   hdd   2.72839                 osd.25          up  1.00000 1.00000
>  26   hdd   2.72839                 osd.26          up  1.00000 1.00000
>  54   hdd   2.72839                 osd.54          up  1.00000 1.00000
>
> current rules (default one) :
> # ceph osd crush rule dump
> [
>     {
>         "rule_id": 0,
>         "rule_name": "replicated_rule",
>         "ruleset": 0,
>         "type": 1,
>         "min_size": 1,
>         "max_size": 10,
>         "steps": [
>             {
>                 "op": "take",
>                 "item": -1,
>                 "item_name": "default"
>             },
>             {
>                 "op": "chooseleaf_firstn",
>                 "num": 0,
>                 "type": "host"
>             },
>             {
>                 "op": "emit"
>             }
>         ]
>     }
> ]
>
> Cordialement / Best regards,
>
> Sébastien VIGNERON 
> CRIANN, 
> Ingénieur / Engineer
> Technopôle du Madrillet 
> 745, avenue de l'Université 
> 76800 Saint-Etienne du Rouvray - France 
> tél. +33 2 32 91 42 91 
> fax. +33 2 32 91 42 92 
> http://www.criann.fr 
> mailto:sebastien.vigneron at criann.fr
> support: support at criann.fr
>

