[ceph-users] PGs get placed in the same datacenter (Trying to make a hybrid NVMe/HDD pool with 6 servers, 2 in each datacenter)

David Turner drakonstein at gmail.com
Sun Oct 8 06:22:58 PDT 2017


That's correct. It doesn't matter how many copies of the data you have in
each datacenter. The mons control the maps and you should be good as long
as you have 1 mon per DC. You should test this to see how the recovery
goes, but there shouldn't be a problem.
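When you run that test, the stock CLI will show whether the surviving side
still holds mon quorum, for example (no cluster-specific names needed):

    ceph mon stat                             # one-line quorum summary
    ceph quorum_status --format json-pretty   # full view of who is in quorum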

On Sat, Oct 7, 2017, 6:10 PM Дробышевский, Владимир <vlad at itgorod.ru> wrote:

> 2017-10-08 2:02 GMT+05:00 Peter Linder <peter.linder at fiberdirekt.se>:
>
>>
>> Then, I believe, the next best configuration would be to set the size for
>> this pool to 4.  It would choose an NVMe as the primary OSD and then
>> choose an HDD from each DC for the secondary copies.  This guarantees
>> that a copy of the data goes into each DC and that you have 2 copies in
>> DCs away from the primary NVMe copy.  It wastes a copy of all of the
>> data in the pool, but that is on the much cheaper HDD storage and can
>> probably be considered an acceptable loss for the sake of having the
>> primary OSD on NVMe drives.
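>>
>> As a rough sketch (the pool name "hybrid" is just an example):
>>
>>     ceph osd pool set hybrid size 4
>>     ceph osd pool set hybrid min_size 2
>>
>> min_size 2 is an assumption on my part; it keeps the pool writable with
>> one DC down while still requiring more than a single surviving copy.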
>>
>> I have considered this, and it should of course work under normal
>> conditions, so to speak, but what if one datacenter is isolated while
>> running? We would be left with 2 running copies on each side for all PGs,
>> with no way of knowing what gets written where. In the end, data would be
>> destroyed due to the split brain. Even being able to enforce quorum on the
>> side where the SSD is would mean a single point of failure.
>>
> If you have one mon per DC, all operations in the isolated DC will be
> frozen, so I believe you would not lose data.
>
>
>>
>>
>>
>> On Sat, Oct 7, 2017 at 3:36 PM Peter Linder <peter.linder at fiberdirekt.se>
>> wrote:
>>
>>> On 10/7/2017 8:08 PM, David Turner wrote:
>>>
>>> Just to make sure you understand: reads will happen on the primary OSD
>>> for the PG and not on the nearest OSD, meaning that reads will go between
>>> the datacenters. Also, each write will not ack until all 3 copies are
>>> written, adding that latency to both writes and reads.
>>>
>>>
>>> Yes, I understand this. It is actually fine; the datacenters have been
>>> selected so that they are about 10-20 km apart. This yields around a
>>> 0.1-0.2 ms round-trip time due to the speed of light being too low.
>>> Nevertheless, network latency shouldn't be a problem, and it's all a
>>> dedicated 40G TRILL network for the moment.
>>>
>>> I just want to be able to select 1 SSD and 2 HDDs, all spread out. I can
>>> do that, but one of the HDDs ends up in the same datacenter as the SSD,
>>> probably because I'm using the "take" command twice (which resets the
>>> bucket selection?).
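>>>
>>> In outline the rule looks something like this (a sketch with illustrative
>>> names, assuming the Luminous device classes):
>>>
>>>     rule hybrid_nvme_primary {
>>>             id 1
>>>             type replicated
>>>             min_size 1
>>>             max_size 3
>>>             # first pass: one NVMe OSD; being emitted first, it becomes
>>>             # the primary
>>>             step take default class nvme
>>>             step chooseleaf firstn 1 type datacenter
>>>             step emit
>>>             # second pass: two HDD OSDs in two distinct datacenters. This
>>>             # second "take" starts over from the root, so it has no memory
>>>             # of which datacenter the NVMe pick landed in.
>>>             step take default class hdd
>>>             step chooseleaf firstn 2 type datacenter
>>>             step emit
>>>     }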
>>>
>>>
>>>
>>> On Sat, Oct 7, 2017, 1:48 PM Peter Linder <peter.linder at fiberdirekt.se>
>>> wrote:
>>>
>>>> On 10/7/2017 7:36 PM, Дробышевский, Владимир wrote:
>>>>
>>>> Hello!
>>>>
>>>> 2017-10-07 19:12 GMT+05:00 Peter Linder <peter.linder at fiberdirekt.se>:
>>>>
>>>>> The idea is to select an nvme osd, and
>>>>> then select the rest from hdd osds in different datacenters (see crush
>>>>> map below for hierarchy).
>>>>>
>>>> It's a little bit aside from the question, but why do you want to mix
>>>> SSDs and HDDs in the same pool? Do you have a read-intensive workload and
>>>> plan to use primary-affinity to get all reads from NVMe?
>>>>
>>>>
>>>> Yes, this is pretty much the idea: getting the performance of NVMe
>>>> reads while still maintaining triple redundancy at a reasonable cost.
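>>>>
>>>> For completeness, the primary-affinity knob would look roughly like this
>>>> (OSD ids are illustrative, and older releases may also need
>>>> "mon osd allow primary affinity = true"):
>>>>
>>>>     ceph osd primary-affinity osd.10 0      # HDD OSD: avoid as primary
>>>>     ceph osd primary-affinity osd.0 1.0     # NVMe OSD: preferred primary
>>>>
>>>> With a CRUSH rule that picks the NVMe OSD first, that OSD already ends up
>>>> as the primary, so primary-affinity is mostly a fallback knob here.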
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Vladimir
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>
> --
> Regards,
> Vladimir
>

