[ceph-users] PGs get placed in the same datacenter (Trying to make a hybrid NVMe/HDD pool with 6 servers, 2 in each datacenter)

Peter Linder peter.linder at fiberdirekt.se
Sat Oct 7 14:02:22 PDT 2017

On 10/7/2017 10:41 PM, David Turner wrote:
> Disclaimer, I have never attempted this configuration especially with
> Luminous. I doubt many have, but it's a curious configuration that I'd
> love to help see if it is possible.
Very generous of you :). (With that said, I suppose we are prepared to
pay for help to have this figured out. It makes me a little headachy and
there is budget space :)).
> There is 1 logical problem with your configuration (which you have
> most likely considered).  If you want all of your PGs to be primary on
> NVMe's across the 3 DC's, then you need to have 1/3 of your available
> storage (that you plan to use for this pool) be from NVMe's. 
> Otherwise they will fill up long before the HDDs and your cluster will
> be "full" while your HDDs are near empty.  I clarify "that you plan to
> use for this pool" because if you plan to put other stuff on just the
> HDDs, that is planning to utilize that extra space, then it's a part
> of the plan that your NVMe's don't total 1/3 of your storage.
We were going to use the left over HDD space for nearline archives,
intermediary backups etc.
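To put rough numbers on the 1/3 constraint: with size=3 and exactly one NVMe copy per PG, the pool can only grow until the NVMe tier fills. A quick sketch (the capacities below are made-up round figures, not our real hardware):

```python
# Rough capacity check for a 1x NVMe + 2x HDD replicated pool.
# Every object stores 1 copy on the NVMe tier and 2 copies on the
# HDD tier, so the pool is "full" as soon as either tier runs out.

def max_pool_size(nvme_total_tb, hdd_total_tb):
    """Usable pool capacity in TB, capped by whichever tier fills first."""
    return min(nvme_total_tb / 1, hdd_total_tb / 2)

# 60 TB of NVMe and 120 TB of HDD balance exactly: the NVMe tier
# is 1/3 of the 180 TB raw total.
print(max_pool_size(60, 120))   # -> 60.0
# With only 30 TB of NVMe, the HDDs sit half empty at pool-full:
print(max_pool_size(30, 120))   # -> 30.0
```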

> Second, I'm noticing that if a PG has a primary OSD in any datacenter
> other than TEG4, then it only has 1 other datacenter available to have
> its 2 HDD copies on.  If the rules were working properly, then I would
> expect the PG to be stuck undersized as opposed to choosing an OSD
> from a datacenter that it shouldn't be able to.  Potentially, you
> could test setting the size to 2 for this pool (while you're missing
> the third HDD node) to see if any PGs still end up on an HDD and NVMe
> in the same DC.  I think that likely you will find that PGs will still
> be able to use 2 copies in the same DC based on your current
> configuration.
I know. This server does not exist yet; it should be finished this
coming week (the hardware is busy with a task that needs migrating off
it first). And yes, that does not make testing this any easier. It was
an oversight not to have it ready from the start.

> Then, I believe, the next best configuration would be to set size for
> this pool to 4.  It would choose an NVMe as the primary OSD, and then
> choose an HDD from each DC for the secondary copies.  This will
> guarantee that a copy of the data goes into each DC and you will have
> 2 copies in other DCs away from the primary NVMe copy.  It wastes a
> copy of all of the data in the pool, but that's on the much cheaper
> HDD storage and can probably be considered acceptable losses for the
> sake of having the primary OSD on NVMe drives.
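(For reference, I think a size=4 rule along those lines would look roughly like the following with Luminous device classes. This is an untested sketch; the rule name is made up and it assumes the root is "default" with hosts grouped under "datacenter" buckets:)

```
rule nvme_primary_hdd_backup {
    id 1
    type replicated
    min_size 4
    max_size 4
    # first copy (the primary) on an NVMe, in whichever DC
    step take default class nvme
    step chooseleaf firstn 1 type datacenter
    step emit
    # remaining 3 copies on HDDs, one per datacenter
    step take default class hdd
    step chooseleaf firstn 3 type datacenter
    step emit
}
```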
I have considered this, and it should of course work under normal
conditions, so to speak, but what if one datacenter is isolated while
running? With size=4 we would be left with 2 live copies on each side
of the partition for every PG, with no way of knowing which side gets
written to. In the end, data would be destroyed by the split brain.
Even being able to enforce quorum on the side holding the SSD would
make that a single point of failure.


I was thinking instead that I could define a crush map with logical
datacenters: each one containing an SSD host (they are spread out
physically) together with the HDD hosts I explicitly want to mirror
that SSD set to, and then a crush rule that enforces 3 copies on 3
hosts within the one "datacenter" selected for each PG. I don't really
know how to write such a "depth first" rule yet, but I will try
tomorrow.
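Something like the following is what I have in mind (sketch only: "logical_dc" would be a custom bucket type, "hybrid_root" a custom root, and the hierarchy would need to be rebuilt around them; all names here are made up):

```
# Each logical_dc bucket contains one NVMe host plus the HDD hosts
# that should mirror it, drawn from different physical datacenters.
rule depth_first_hybrid {
    id 2
    type replicated
    min_size 3
    max_size 3
    step take hybrid_root
    # pick one logical datacenter per PG...
    step choose firstn 1 type logical_dc
    # ...then 3 distinct hosts inside it
    step chooseleaf firstn 3 type host
    step emit
}
```

Note that this alone does not guarantee the NVMe host ends up as the primary; primary-affinity on the HDD OSDs would presumably still be needed to keep reads on the NVMe.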

I was also considering making 3 rules mapping SSDs to HDDs and then 3
pools, but that would leave me balancing load manually. And if one node
went down, some RBDs would completely lose their SSD read capability
instead of just 1/3 of it...  perhaps acceptable, but not optimal :)


> On Sat, Oct 7, 2017 at 3:36 PM Peter Linder
> <peter.linder at fiberdirekt.se <mailto:peter.linder at fiberdirekt.se>> wrote:
>     On 10/7/2017 8:08 PM, David Turner wrote:
>>     Just to make sure you understand that the reads will happen on
>>     the primary osd for the PG and not the nearest osd, meaning that
>>     reads will go between the datacenters. Also that each write will
>>     not ack until all 3 writes happen adding the latency to the
>>     writes and reads both.
>     Yes, I understand this. It is actually fine, the datacenters have
>     been selected so that they are about 10-20km apart. This yields
>     around a 0.1 - 0.2ms round trip time due to speed of light being
>     too low. Nevertheless, latency due to network shouldn't be a
>     problem and it's all 40G (dedicated) TRILL network for the moment.
>     I just want to be able to select 1 SSD and 2 HDDs, all spread out.
>     I can do that, but one of the HDDs ends up in the same datacenter,
>     probably because I'm using the "take" command 2 times (which
>     resets the set of already-chosen buckets?).
>>     On Sat, Oct 7, 2017, 1:48 PM Peter Linder
>>     <peter.linder at fiberdirekt.se
>>     <mailto:peter.linder at fiberdirekt.se>> wrote:
>>         On 10/7/2017 7:36 PM, Дробышевский, Владимир wrote:
>>>         Hello!
>>>         2017-10-07 19:12 GMT+05:00 Peter Linder
>>>         <peter.linder at fiberdirekt.se
>>>         <mailto:peter.linder at fiberdirekt.se>>:
>>>             The idea is to select an nvme osd, and
>>>             then select the rest from hdd osds in different
>>>             datacenters (see crush
>>>             map below for hierarchy). 
>>>         It's a little bit aside of the question, but why do you want
>>>         to mix SSDs and HDDs in the same pool? Do you have
>>>         read-intensive workload and going to use primary-affinity to
>>>         get all reads from nvme?
>>         Yes, this is pretty much the idea, getting the performance
>>         from NVMe reads, while still maintaining triple redundancy
>>         and a reasonable cost.
>>>         -- 
>>>         Regards,
>>>         Vladimir
