[ceph-users] will crush rule be used during object relocation in OSD failure ?

Gregory Farnum gfarnum at redhat.com
Mon Nov 26 05:27:17 PST 2018


On Fri, Nov 23, 2018 at 11:01 AM ST Wong (ITSC) <ST at itsc.cuhk.edu.hk> wrote:

> Hi all,
>
>
> We have 8 OSD hosts: 4 in room 1 and 4 in room 2.
>
> A pool with size = 3 was created using the following CRUSH rule, to cater for
> room failure.
>
>
> rule multiroom {
>         id 0
>         type replicated
>         min_size 2
>         max_size 4
>         step take default
>         step choose firstn 2 type room
>         step chooseleaf firstn 2 type host
>         step emit
> }
>
>
>
> We're expecting:
>
> 1. For each object, there are always 2 replicas in one room and 1 replica
> in the other room, making size=3.  But we can't control which room has 1 or 2
> replicas.
>

Right.
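
To make the 2+1 split concrete, here is a toy Python sketch of what the
rule's steps amount to. It is not CRUSH: the host names, the SHA-256 based
ordering, and the place() helper are all assumptions made up for
illustration. It only mirrors the shape of the rule: pick 2 rooms, pick 2
hosts in each, then keep the first 3 entries because the pool has size=3.

```python
import hashlib

# Hypothetical topology matching the thread: 2 rooms with 4 hosts each.
TOPOLOGY = {
    "room1": ["host1", "host2", "host3", "host4"],
    "room2": ["host5", "host6", "host7", "host8"],
}


def ranked(items, key):
    """Order items pseudo-randomly but deterministically for a given key.

    This stands in for CRUSH's hashing; it is NOT the real algorithm.
    """
    return sorted(
        items,
        key=lambda item: hashlib.sha256(f"{key}:{item}".encode()).hexdigest(),
    )


def place(pg, size=3, topology=TOPOLOGY):
    """Toy stand-in for the multiroom rule, returning (room, host) pairs."""
    chosen = []
    # step choose firstn 2 type room
    for room in ranked(topology, pg)[:2]:
        # step chooseleaf firstn 2 type host
        chosen += [(room, host) for host in ranked(topology[room], pg)[:2]]
    # The rule emits 4 candidates; a size=3 pool uses only the first 3.
    return chosen[:size]


for pg in range(4):
    print(pg, place(pg))
```

Whichever room is picked first ends up with 2 replicas and the other with
1, which is why the room holding the pair can't be controlled per object.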


>
> 2. In case an OSD host fails, Ceph will assign remaining OSDs to the same
> PG to hold the replicas that were on the failed host.  Selection is based on
> the CRUSH rule of the pool, thus maintaining the same failure domain; it
> won't put all replicas in the same room.
>

Yes, if a host fails the copies it held will be replaced by new copies in
the same room.


>
> 3. If the entire room holding 1 replica fails, the pool will remain
> degraded but won't do any replica relocation.
>

Right.


>
> 4. If the entire room holding 2 replicas fails, Ceph will use OSDs in the
> surviving room to make 2 replicas.  The pool will not be writeable until
> all objects have 2 copies (unless we make pool size=4?).  Then, when
> recovery is complete, the pool will remain in a degraded state until the
> failed room recovers.
>

Hmm, I'm actually not sure if this will work out. Because CRUSH is
hierarchical, it will keep trying to select hosts from the dead room and
will fill out the location vector's first two spots with -1. It could be
that Ceph will skip all those "nonexistent" entries and just pick the two
copies from slots 3 and 4, but it might not. You should test this carefully
and report back!
-Greg
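
The two outcomes described above can be sketched in Python. This is only an
illustration of the open question, not a statement of what Ceph does: both
helper functions are hypothetical, None stands in for the -1 entries, the
host names are made up, and the dead room is assumed to have been selected
first.

```python
def skip_holes_then_truncate(slots, size):
    """Drop the empty slots first, then keep up to `size` entries."""
    return [slot for slot in slots if slot is not None][:size]


def truncate_then_skip_holes(slots, size):
    """Keep the first `size` slots, then drop whichever of them are empty."""
    return [slot for slot in slots[:size] if slot is not None]


# Assumed raw result when the room chosen first is dead:
# slots 1 and 2 are holes, slots 3 and 4 come from the surviving room.
slots = [None, None, "host5", "host6"]

print(skip_holes_then_truncate(slots, 3))  # ['host5', 'host6']
print(truncate_then_skip_holes(slots, 3))  # ['host5']
```

In the first case the PG would keep 2 copies in the surviving room; in the
second it would be left with 1. Testing against the real cluster or CRUSH
map is the only way to know which one applies.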

>
> Is our understanding correct?  Thanks a lot.
> Will do some simulation later to verify.
>
> Regards,
> /stwong
>