[ceph-users] will crush rule be used during object relocation in OSD failure ?

Maged Mokhtar mmokhtar at petasan.org
Sat Nov 24 04:44:58 PST 2018



On 23/11/18 18:00, ST Wong (ITSC) wrote:
>
> Hi all,
>
>
> We have 8 OSD hosts, 4 in room 1 and 4 in room 2.
>
> A pool with size = 3 is created using the following CRUSH rule, to 
> cater for room failure.
>
>
>
> rule multiroom {
>         id 0
>         type replicated
>         min_size 2
>         max_size 4
>         step take default
>         step choose firstn 2 type room
>         step chooseleaf firstn 2 type host
>         step emit
> }
>
>
>
> We're expecting:
>
> 1. For each object, there are always 2 replicas in one room and 1 
> replica in the other room, making size=3.  But we can't control which 
> room has 1 or 2 replicas.
>
> 2. In case an OSD host fails, Ceph will assign remaining OSDs to the 
> same PG to hold the replicas that were on the failed OSD host. 
> Selection is based on the CRUSH rule of the pool, thus maintaining the 
> same failure domain - it won't place all replicas in the same room.
>
> 3. In case the entire room holding 1 replica fails, the pool will 
> remain degraded but won't do any replica relocation.
>
> 4. In case the entire room holding 2 replicas fails, Ceph will make 
> use of OSDs in the surviving room to make 2 replicas. The pool will 
> not be writeable before all objects have 2 copies (unless we make the 
> pool size=4?).  Then when recovery is complete, the pool will remain 
> in a degraded state until the failed room recovers.
>
>
> Is our understanding correct?  Thanks a lot.
> Will do some simulation later to verify.
>
> Regards,
> /stwong
>
>

I think this is correct. To re-phrase 2): all PGs on the failed node 
will be re-distributed across several other hosts within the same room.
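
If you want to check the mappings before your live simulation, here is 
a quick sketch (file, pool and object names below are just 
placeholders): you can test the compiled crush map offline with 
crushtool, or ask the cluster where a given object of an existing pool 
lands:

    # dump the crush map and simulate placements offline;
    # rule 0 is the "multiroom" rule above, with 3 replicas per PG
    ceph osd getcrushmap -o crushmap.bin
    crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-mappings

    # or check where one object of an existing pool maps
    ceph osd map <poolname> <objectname>

Each mapping line lists the OSDs for a PG, so you can see directly 
which room holds 2 copies and which holds 1.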

Since some PGs will have 2 replicas in one room whereas other PGs will 
have 2 replicas in the other room, I tend to dislike such a setup: it 
is not symmetric, so some PGs will suffer more than others in case of 
a room failure. More importantly, as you stated in 4), your cluster 
will be down while these unfortunate PGs recover (statistically that 
is half your data). I would prefer that in such a case you use a 
size=4, min_size=2 setup.
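
For example (a sketch only, assuming the pool is called "mypool"; your 
posted rule already picks 2 hosts in each of 2 rooms, so it works 
unchanged for 4 replicas):

    ceph osd pool set mypool size 4
    ceph osd pool set mypool min_size 2

With size=4 each room holds 2 replicas, so whichever room fails the 
PGs keep 2 copies (= min_size) and stay active, and recovery only has 
to catch up when the failed room comes back.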

/Maged