[ceph-users] PGs get placed in the same datacenter (Trying to make a hybrid NVMe/HDD pool with 6 servers, 2 in each datacenter)

Дробышевский, Владимир vlad at itgorod.ru
Sat Oct 7 15:09:44 PDT 2017


2017-10-08 2:02 GMT+05:00 Peter Linder <peter.linder at fiberdirekt.se>:

>
> Then, I believe, the next best configuration would be to set the size for
> this pool to 4.  It would choose an NVMe as the primary OSD, and then choose
> an HDD from each DC for the secondary copies.  This will guarantee that a
> copy of the data goes into each DC and that you will have 2 copies in other
> DCs away from the primary NVMe copy.  It wastes a copy of all of the data in
> the pool, but that's on the much cheaper HDD storage and can probably be
> considered an acceptable loss for the sake of having the primary OSD on NVMe
> drives.
>
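For reference, a rule along those lines might look something like the sketch
below. This is only an illustration: it assumes separate "nvme" and "hdd"
roots that each contain the three datacenter buckets, and the rule name and
id are made up.

    rule nvme_primary_hdd_rest {
            id 1
            type replicated
            min_size 3
            max_size 4
            # first pass: one NVMe OSD; the first OSD chosen becomes the primary
            step take nvme
            step chooseleaf firstn 1 type datacenter
            step emit
            # second pass: the remaining copies on HDD, one per datacenter
            step take hdd
            step chooseleaf firstn -1 type datacenter
            step emit
    }

With size 4, "firstn -1" in the second pass resolves to three datacenters, so
every DC gets one HDD copy, including the DC that already holds the NVMe
primary.
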
> I have considered this, and it should of course work under normal
> conditions, but what if one datacenter is isolated while it is running? We
> would be left with 2 running copies on each side for all PGs, with no way
> of knowing what gets written where. In the end, data would be destroyed by
> the split brain. Even being able to enforce quorum on the side where the
> SSD is would mean a single point of failure.
>
If you have one mon per DC, all operations in the isolated DC will be
frozen, so I believe you would not lose data.
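
Something like the following in ceph.conf would give one monitor per DC (a
sketch only; the names and addresses are made up):

    [global]
            mon initial members = mon-dc1, mon-dc2, mon-dc3
            mon host = 10.0.1.1, 10.0.2.1, 10.0.3.1

    [mon.mon-dc1]
            host = mon-dc1
            mon addr = 10.0.1.1:6789

    [mon.mon-dc2]
            host = mon-dc2
            mon addr = 10.0.2.1:6789

    [mon.mon-dc3]
            host = mon-dc3
            mon addr = 10.0.3.1:6789

The isolated DC then holds only one of the three monitors, cannot form a
majority, and its OSDs and clients block instead of accepting writes that
would later conflict.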


>
>
>
> On Sat, Oct 7, 2017 at 3:36 PM Peter Linder <peter.linder at fiberdirekt.se>
> wrote:
>
>> On 10/7/2017 8:08 PM, David Turner wrote:
>>
>> Just to make sure you understand: the reads will happen on the primary
>> OSD for the PG and not on the nearest OSD, meaning that reads will go
>> between the datacenters. Also, each write will not ack until all 3 copies
>> are written, which adds that latency to both writes and reads.
>>
>>
>> Yes, I understand this. It is actually fine; the datacenters have been
>> selected so that they are about 10-20 km apart. This yields around a 0.1 -
>> 0.2 ms round trip time due to the speed of light being too low. Nevertheless,
>> network latency shouldn't be a problem, and it's all a dedicated 40G TRILL
>> network for the moment.
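
(A rough sanity check of those figures, assuming ~200,000 km/s propagation in
fibre, i.e. roughly 5 us per km, and a fibre route close to the line-of-sight
distance:

    10 km one way -> ~50 us  -> ~0.1 ms round trip
    20 km one way -> ~100 us -> ~0.2 ms round trip

so the 0.1 - 0.2 ms estimate is about right.)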
>>
>> I just want to be able to select 1 SSD and 2 HDDs, all spread out. I can
>> do that, but one of the HDDs ends up in the same datacenter, probably
>> because I'm using the "take" command twice (does it reset the selected
>> buckets?).
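
Each "step take" does indeed start a new, independent selection pass, so the
second pass has no knowledge of which datacenter the first pass used. One way
to see where a given rule actually places replicas is to run the map through
crushtool (the rule id and replica count below are just examples):

    # compile the edited text crush map and simulate placements for rule 1
    crushtool -c crushmap.txt -o crushmap.bin
    crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-mappings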
>>
>>
>>
>> On Sat, Oct 7, 2017, 1:48 PM Peter Linder <peter.linder at fiberdirekt.se>
>> wrote:
>>
>>> On 10/7/2017 7:36 PM, Дробышевский, Владимир wrote:
>>>
>>> Hello!
>>>
>>> 2017-10-07 19:12 GMT+05:00 Peter Linder <peter.linder at fiberdirekt.se>:
>>>
>>>> The idea is to select an NVMe OSD, and
>>>> then select the rest from HDD OSDs in different datacenters (see crush
>>>> map below for hierarchy).
>>>>
>>> It's a little bit aside from the question, but why do you want to mix
>>> SSDs and HDDs in the same pool? Do you have a read-intensive workload, and
>>> are you going to use primary-affinity to get all reads from NVMe?
>>>
>>>
>>> Yes, this is pretty much the idea: getting the read performance of NVMe
>>> while still maintaining triple redundancy at a reasonable cost.
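
In case it helps anyone following along: primary affinity is set per OSD, so
the HDD OSDs can be steered away from the primary role with something like
the commands below (the osd ids are placeholders, and older releases may
additionally require "mon osd allow primary affinity = true"):

    # discourage an HDD OSD from being chosen as primary
    ceph osd primary-affinity osd.12 0
    # NVMe OSDs keep the default affinity of 1
    ceph osd primary-affinity osd.3 1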
>>>
>>>
>>> --
>>> Regards,
>>> Vladimir
>>>
>>>
>>
>>
>
>


-- 
Regards,
Vladimir