[ceph-users] PGs get placed in the same datacenter (Trying to make a hybrid NVMe/HDD pool with 6 servers, 2 in each datacenter)

Peter Linder peter.linder at fiberdirekt.se
Mon Oct 9 10:04:17 PDT 2017


I was able to get this working with the crushmap in my last post! I now
have the intended behavior, together with the change of primary affinity
on the slow HDDs. Very happy; performance is excellent.
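
For anyone following along, the primary-affinity part is just a few CLI
calls. This is a sketch only; the OSD ids are made up, and on older
releases you may also need mon_osd_allow_primary_affinity = true:

```shell
# Illustrative OSD ids: osd.0 through osd.5 are assumed to be the slow HDDs.
# Affinity 0 makes an OSD ineligible to act as primary, so reads land on
# the NVMe OSDs (which keep the default affinity of 1).
for id in 0 1 2 3 4 5; do
    ceph osd primary-affinity osd.$id 0
done
```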

One thing was a little weird though: I had to manually change the weight
of each hostgroup so that they were in the same ballpark. If they were
too far apart, Ceph couldn't properly allocate 3 buckets for each PG,
and some PGs ended up in state "remapped" or "degraded".
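
The manual weight change itself is the usual decompile/edit/recompile
round trip (file names here are arbitrary):

```shell
# Pull the current crushmap and decompile it to editable text:
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# ...edit the hostgroup bucket weights in crushmap.txt by hand...
# Recompile and inject the modified map:
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin
```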

When I changed the weights to similar values (the crush rule selects 3
out of 3 hostgroups anyway, so weight shouldn't even be a consideration
there), the problem went away.
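
For reference, the shape of the rule is roughly as below. This is a
simplified, illustrative sketch, not the actual map (that is in my last
post); "hostgroup" is a custom bucket type, and each hostgroup combines
hosts so that the 3 picked leaves always land in 3 different
datacenters:

```
rule hybrid {
        ruleset 1
        type replicated
        min_size 3
        max_size 3
        step take default
        # pick every hostgroup (firstn 0 = as many as the pool size needs)...
        step choose firstn 0 type hostgroup
        # ...then one OSD from a distinct host inside each hostgroup
        step chooseleaf firstn 1 type host
        step emit
}
```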

Perhaps that is a bug?

/Peter

On 10/8/2017 3:22 PM, David Turner wrote:
>
> That's correct. It doesn't matter how many copies of the data you have
> in each datacenter. The mons control the maps and you should be good
> as long as you have 1 mon per DC. You should test this to see how the
> recovery goes, but there shouldn't be a problem.
>
>
>     On Sat, Oct 7, 2017, 6:10 PM Дробышевский, Владимир <vlad at itgorod.ru> wrote:
>
>     2017-10-08 2:02 GMT+05:00 Peter Linder <peter.linder at fiberdirekt.se>:
>
>>
>>         Then, I believe, the next best configuration would be to set
>>         size for this pool to 4.  It would choose an NVMe as the
>>         primary OSD, and then choose an HDD from each DC for the
>>         secondary copies.  This will guarantee that a copy of the
>>         data goes into each DC and you will have 2 copies in other
>>         DCs away from the primary NVMe copy.  It wastes a copy of all
>>         of the data in the pool, but that's on the much cheaper HDD
>>         storage and can probably be considered acceptable losses for
>>         the sake of having the primary OSD on NVMe drives.
>         I have considered this, and it should of course work while
>         everything is up, so to speak. But what if one datacenter is
>         isolated while running? We would be left with 2 running copies
>         on each side for all PGs, with no way of knowing what gets
>         written where. In the end, data would be destroyed due to the
>         split brain. Even being able to enforce quorum where the SSD
>         is would mean a single point of failure.
>
>     If you have one mon per DC, all operations in the isolated DC
>     will be frozen, so I believe you would not lose data.
>      
>
>
>
>>
>>         On Sat, Oct 7, 2017 at 3:36 PM Peter Linder <peter.linder at fiberdirekt.se> wrote:
>>
>>             On 10/7/2017 8:08 PM, David Turner wrote:
>>>
>>>             Just to make sure you understand: reads will happen on
>>>             the primary OSD for the PG, not the nearest OSD, meaning
>>>             that reads will go between the datacenters. Also, each
>>>             write will not ack until all 3 writes happen, adding
>>>             that latency to both reads and writes.
>>>
>>>
>>
>>             Yes, I understand this. It is actually fine, the
>>             datacenters have been selected so that they are about
>>             10-20km apart. This yields around a 0.1 - 0.2ms round
>>             trip time due to speed of light being too low.
>>             Nevertheless, latency due to network shouldn't be a
>>             problem and it's all 40G (dedicated) TRILL network for
>>             the moment.
>>
>>             I just want to be able to select 1 SSD and 2 HDDs, all
>>             spread out. I can do that, but one of the HDDs ends up in
>>             the same datacenter, probably because I'm using the
>>             "take" command twice (which probably resets the bucket
>>             selection?).
>>
>>
>>
>>>             On Sat, Oct 7, 2017, 1:48 PM Peter Linder <peter.linder at fiberdirekt.se> wrote:
>>>
>>>                 On 10/7/2017 7:36 PM, Дробышевский, Владимир wrote:
>>>>                 Hello!
>>>>
>>>>                 2017-10-07 19:12 GMT+05:00 Peter Linder <peter.linder at fiberdirekt.se>:
>>>>
>>>>                     The idea is to select an nvme osd, and
>>>>                     then select the rest from hdd osds in different
>>>>                     datacenters (see crush
>>>>                     map below for hierarchy). 
>>>>
>>>>                 It's a little aside from the question, but why do
>>>>                 you want to mix SSDs and HDDs in the same pool? Do
>>>>                 you have a read-intensive workload and plan to use
>>>>                 primary-affinity to get all reads from NVMe?
>>>>                  
>>>>
>>>                 Yes, this is pretty much the idea, getting the
>>>                 performance from NVMe reads, while still maintaining
>>>                 triple redundancy and a reasonable cost.
>>>
>>>
>>>>                 -- 
>>>>                 Regards,
>>>>                 Vladimir
>>>
>>>
>>>                 _______________________________________________
>>>                 ceph-users mailing list
>>>                 ceph-users at lists.ceph.com
>>>                 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>
>
>
>
>
>
>     -- 
>     Regards,
>     Vladimir
>
>
>

