[ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

Vlad Kopylov vladkopy at gmail.com
Sun Nov 11 10:47:11 PST 2018


Maybe it is possible if done via an NFS gateway export?
Do the gateway settings allow selecting which OSDs to read from?
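
Something like this is what I have in mind (just a sketch, assuming
nfs-ganesha with FSAL_CEPH; as far as I can tell the export itself cannot
pick read OSDs, it can only point at a directory whose layout uses a
site-local pool):

EXPORT {
    Export_ID = 1;
    Path = "/dc1";       # directory pinned to a dc1-primary pool via its layout
    Pseudo = "/dc1";
    Access_Type = RW;
    FSAL {
        Name = CEPH;     # reads still go to each PG's primary OSD
    }
}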

v

On Sun, Nov 11, 2018 at 1:01 AM Martin Verges <martin.verges at croit.io>
wrote:

> Hello Vlad,
>
> If you want to read from the same data, then it is not possible (as far as
> I know).
>
> --
> Martin Verges
> Managing director
>
> Mobile: +49 174 9335695
> E-Mail: martin.verges at croit.io
> Chat: https://t.me/MartinVerges
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
>
> Web: https://croit.io
> YouTube: https://goo.gl/PGE1Bx
>
> On Sat, Nov 10, 2018, 03:47 Vlad Kopylov <vladkopy at gmail.com>
> wrote:
>
>> Maybe I missed something, but CephFS explicitly selects the pools to put
>> files and metadata in, as I did below.
>> So if I create new pools, the data in them will be different. And if I apply
>> the dc1_primary rule to the cfs_data pool and a client from dc3 connects to
>> fs t01, it will start using the dc1 hosts.
>>
>>
>> ceph osd pool create cfs_data 100
>> ceph osd pool create cfs_meta 100
>> ceph fs new t01 cfs_data cfs_meta
>> sudo mount -t ceph ceph1:6789:/ /mnt/t01 -o
>> name=admin,secretfile=/home/mciadmin/admin.secret
>>
>> rule dc1_primary {
>>         id 1
>>         type replicated
>>         min_size 1
>>         max_size 10
>>         step take dc1
>>         step chooseleaf firstn 1 type host
>>         step emit
>>         step take dc2
>>         step chooseleaf firstn -2 type host
>>         step emit
>>         step take dc3
>>         step chooseleaf firstn -2 type host
>>         step emit
>> }
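>>
>> For illustration, this is the check I have in mind (the object name is an
>> arbitrary example; any name maps to some PG, and the first OSD in the
>> acting set shown is the one every client, including one in dc3, reads from):
>>
>> ceph osd pool set cfs_data crush_rule dc1_primary
>> ceph osd map cfs_data testobject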
>>
>> On Fri, Nov 9, 2018 at 9:32 PM Vlad Kopylov <vladkopy at gmail.com> wrote:
>>
>>> Just to confirm: it will still place 3 copies, one in each datacenter?
>>> I thought this map was for selecting where to write to; I guess it does the
>>> write replication on the back end.
>>>
>>> I also thought pools were completely separate and clients would not see each
>>> other's data?
>>>
>>> Thank you Martin!
>>>
>>>
>>>
>>>
>>> On Fri, Nov 9, 2018 at 2:10 PM Martin Verges <martin.verges at croit.io>
>>> wrote:
>>>
>>>> Hello Vlad,
>>>>
>>>> you can generate something like this:
>>>>
>>>> rule dc1_primary_dc2_secondary {
>>>>         id 1
>>>>         type replicated
>>>>         min_size 1
>>>>         max_size 10
>>>>         step take dc1
>>>>         step chooseleaf firstn 1 type host
>>>>         step emit
>>>>         step take dc2
>>>>         step chooseleaf firstn 1 type host
>>>>         step emit
>>>>         step take dc3
>>>>         step chooseleaf firstn -2 type host
>>>>         step emit
>>>> }
>>>>
>>>> rule dc2_primary_dc1_secondary {
>>>>         id 2
>>>>         type replicated
>>>>         min_size 1
>>>>         max_size 10
>>>>         step take dc2
>>>>         step chooseleaf firstn 1 type host
>>>>         step emit
>>>>         step take dc1
>>>>         step chooseleaf firstn 1 type host
>>>>         step emit
>>>>         step take dc3
>>>>         step chooseleaf firstn -2 type host
>>>>         step emit
>>>> }
>>>>
>>>> After you have added such crush rules, you can configure the pools:
>>>>
>>>> ~ $ ceph osd pool set <pool_for_dc1> crush_rule dc1_primary_dc2_secondary
>>>> ~ $ ceph osd pool set <pool_for_dc2> crush_rule dc2_primary_dc1_secondary
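>>>>
>>>> Or, as a sketch with the same placeholder pool names, the rule can be
>>>> assigned directly at pool creation:
>>>>
>>>> ~ $ ceph osd pool create <pool_for_dc1> 100 100 replicated dc1_primary_dc2_secondary
>>>> ~ $ ceph osd pool create <pool_for_dc2> 100 100 replicated dc2_primary_dc1_secondary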
>>>>
>>>> Now you place the workload from dc1 in the dc1 pool, and the workload
>>>> from dc2 in the dc2 pool. You could also use HDDs with SSD journals (if
>>>> your workload isn't that write intensive) and save some money in dc3,
>>>> as your clients would always read from an SSD and write to the hybrid
>>>> setup.
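>>>>
>>>> With device classes, such a hybrid rule could look roughly like this
>>>> (just a sketch, assuming the OSDs report ssd/hdd device classes; rule
>>>> name and id are made up):
>>>>
>>>> rule dc1_ssd_primary_hybrid {
>>>>         id 3
>>>>         type replicated
>>>>         min_size 1
>>>>         max_size 10
>>>>         step take dc1 class ssd
>>>>         step chooseleaf firstn 1 type host
>>>>         step emit
>>>>         step take dc2 class ssd
>>>>         step chooseleaf firstn 1 type host
>>>>         step emit
>>>>         step take dc3 class hdd
>>>>         step chooseleaf firstn -2 type host
>>>>         step emit
>>>> }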
>>>>
>>>> Btw., all of this can be done with a few simple clicks through our web
>>>> frontend. Even if you want to export it via CephFS / NFS / ..., it is
>>>> possible to set this on a per-folder level. Feel free to take a look at
>>>> https://www.youtube.com/watch?v=V33f7ipw9d4 to see how easy it can
>>>> be.
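>>>>
>>>> On plain CephFS, the per-folder part works via file layouts, roughly
>>>> like this (a sketch; the pool name cfs_data_dc2 and the directory are
>>>> placeholders, and the pool must be added to the filesystem first):
>>>>
>>>> ~ $ ceph fs add_data_pool t01 cfs_data_dc2
>>>> ~ $ setfattr -n ceph.dir.layout.pool -v cfs_data_dc2 /mnt/t01/dc2
>>>>
>>>> New files created under that directory then land in the dc2 pool and
>>>> follow its crush rule.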
>>>>
>>>> --
>>>> Martin Verges
>>>> Managing director
>>>>
>>>> Mobile: +49 174 9335695
>>>> E-Mail: martin.verges at croit.io
>>>> Chat: https://t.me/MartinVerges
>>>>
>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>> CEO: Martin Verges - VAT-ID: DE310638492
>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>
>>>> Web: https://croit.io
>>>> YouTube: https://goo.gl/PGE1Bx
>>>>
>>>>
>>>> 2018-11-09 17:35 GMT+01:00 Vlad Kopylov <vladkopy at gmail.com>:
>>>> > Please disregard the pg status; one of the test VMs was down for some
>>>> > time and it is healing.
>>>> > The question is only how to make it read from the proper datacenter.
>>>> >
>>>> > If you have an example.
>>>> >
>>>> > Thanks
>>>> >
>>>> >
>>>> > On Fri, Nov 9, 2018 at 11:28 AM Vlad Kopylov <vladkopy at gmail.com>
>>>> > wrote:
>>>> >>
>>>> >> Martin, thank you for the tip.
>>>> >> Googling ceph crush rule examples doesn't give much on rules, just
>>>> >> static placement of buckets.
>>>> >> This all seems to be about placing data, not about giving a client in a
>>>> >> specific datacenter the proper read OSD.
>>>> >>
>>>> >> Maybe something is wrong with the placement groups?
>>>> >>
>>>> >> I added datacenters dc1, dc2 and dc3.
>>>> >> The current replicated_rule is:
>>>> >>
>>>> >> rule replicated_rule {
>>>> >>         id 0
>>>> >>         type replicated
>>>> >>         min_size 1
>>>> >>         max_size 10
>>>> >>         step take default
>>>> >>         step chooseleaf firstn 0 type host
>>>> >>         step emit
>>>> >> }
>>>> >>
>>>> >> # buckets
>>>> >> host ceph1 {
>>>> >>         id -3            # do not change unnecessarily
>>>> >>         id -2 class ssd  # do not change unnecessarily
>>>> >>         # weight 1.000
>>>> >>         alg straw2
>>>> >>         hash 0           # rjenkins1
>>>> >>         item osd.0 weight 1.000
>>>> >> }
>>>> >> datacenter dc1 {
>>>> >>         id -9            # do not change unnecessarily
>>>> >>         id -4 class ssd  # do not change unnecessarily
>>>> >>         # weight 1.000
>>>> >>         alg straw2
>>>> >>         hash 0           # rjenkins1
>>>> >>         item ceph1 weight 1.000
>>>> >> }
>>>> >> host ceph2 {
>>>> >>         id -5            # do not change unnecessarily
>>>> >>         id -6 class ssd  # do not change unnecessarily
>>>> >>         # weight 1.000
>>>> >>         alg straw2
>>>> >>         hash 0           # rjenkins1
>>>> >>         item osd.1 weight 1.000
>>>> >> }
>>>> >> datacenter dc2 {
>>>> >>         id -10           # do not change unnecessarily
>>>> >>         id -8 class ssd  # do not change unnecessarily
>>>> >>         # weight 1.000
>>>> >>         alg straw2
>>>> >>         hash 0           # rjenkins1
>>>> >>         item ceph2 weight 1.000
>>>> >> }
>>>> >> host ceph3 {
>>>> >>         id -7            # do not change unnecessarily
>>>> >>         id -12 class ssd # do not change unnecessarily
>>>> >>         # weight 1.000
>>>> >>         alg straw2
>>>> >>         hash 0           # rjenkins1
>>>> >>         item osd.2 weight 1.000
>>>> >> }
>>>> >> datacenter dc3 {
>>>> >>         id -11           # do not change unnecessarily
>>>> >>         id -13 class ssd # do not change unnecessarily
>>>> >>         # weight 1.000
>>>> >>         alg straw2
>>>> >>         hash 0           # rjenkins1
>>>> >>         item ceph3 weight 1.000
>>>> >> }
>>>> >> root default {
>>>> >>         id -1            # do not change unnecessarily
>>>> >>         id -14 class ssd # do not change unnecessarily
>>>> >>         # weight 3.000
>>>> >>         alg straw2
>>>> >>         hash 0           # rjenkins1
>>>> >>         item dc1 weight 1.000
>>>> >>         item dc2 weight 1.000
>>>> >>         item dc3 weight 1.000
>>>> >> }
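>>>> >>
>>>> >> For reference, a map like the one above is normally edited with the
>>>> >> standard decompile/recompile round trip (file names are arbitrary):
>>>> >>
>>>> >> ceph osd getcrushmap -o crush.bin
>>>> >> crushtool -d crush.bin -o crush.txt
>>>> >> # edit crush.txt, then recompile and inject:
>>>> >> crushtool -c crush.txt -o crush.new
>>>> >> ceph osd setcrushmap -i crush.new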
>>>> >>
>>>> >>
>>>> >> #ceph pg dump
>>>> >> dumped all
>>>> >> version 29433
>>>> >> stamp 2018-11-09 11:23:44.510872
>>>> >> last_osdmap_epoch 0
>>>> >> last_pg_scan 0
>>>> >> PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES    LOG DISK_LOG STATE                      STATE_STAMP                VERSION REPORTED UP      UP_PRIMARY ACTING  ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP                LAST_DEEP_SCRUB DEEP_SCRUB_STAMP           SNAPTRIMQ_LEN
>>>> >> 1.5f          0                  0        0         0       0        0   0        0 active+clean               2018-11-09 04:35:32.320607     0'0 544:1317 [0,2,1]          0 [0,2,1]              0        0'0 2018-11-09 04:35:32.320561             0'0 2018-11-04 11:55:54.756115             0
>>>> >> 2.5c        143                  0      143         0       0 19490267 461      461 active+undersized+degraded 2018-11-08 19:02:03.873218 508'461 544:2100 [2,1]            2 [2,1]                2    290'380 2018-11-07 18:58:43.043719          64'120 2018-11-05 14:21:49.256324             0
>>>> >> .....
>>>> >> sum 15239 0 2053 2659 0 2157615019 58286 58286
>>>> >> OSD_STAT USED    AVAIL  TOTAL  HB_PEERS PG_SUM PRIMARY_PG_SUM
>>>> >> 2        3.7 GiB 28 GiB 32 GiB    [0,1]    200             73
>>>> >> 1        3.7 GiB 28 GiB 32 GiB    [0,2]    200             58
>>>> >> 0        3.7 GiB 28 GiB 32 GiB    [1,2]    173             69
>>>> >> sum       11 GiB 85 GiB 96 GiB
>>>> >>
>>>> >> #ceph pg map 2.5c
>>>> >> osdmap e545 pg 2.5c (2.5c) -> up [2,1] acting [2,1]
>>>> >>
>>>> >> #ceph pg map 1.5f
>>>> >> osdmap e547 pg 1.5f (1.5f) -> up [0,2,1] acting [0,2,1]
>>>> >>
>>>> >>
>>>> >> On Fri, Nov 9, 2018 at 2:21 AM Martin Verges <martin.verges at croit.io>
>>>> >> wrote:
>>>> >>>
>>>> >>> Hello Vlad,
>>>> >>>
>>>> >>> Ceph clients connect to the primary OSD of each PG. If you create one
>>>> >>> crush rule for building1 and one for building2, each taking an OSD
>>>> >>> from the local building as the first one, your reads from the pool will
>>>> >>> always stay in the same building (if the cluster is healthy), and only
>>>> >>> write requests get replicated to the other building.
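>>>> >>>
>>>> >>> To verify which OSD is the primary for each PG of a pool, you can for
>>>> >>> example run (pool name is a placeholder; the first OSD in the acting
>>>> >>> set is the primary that serves the reads):
>>>> >>>
>>>> >>> ceph pg ls-by-pool <pool>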
>>>> >>>
>>>> >>> --
>>>> >>> Martin Verges
>>>> >>> Managing director
>>>> >>>
>>>> >>> Mobile: +49 174 9335695
>>>> >>> E-Mail: martin.verges at croit.io
>>>> >>> Chat: https://t.me/MartinVerges
>>>> >>>
>>>> >>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>> >>> CEO: Martin Verges - VAT-ID: DE310638492
>>>> >>> Com. register: Amtsgericht Munich HRB 231263
>>>> >>>
>>>> >>> Web: https://croit.io
>>>> >>> YouTube: https://goo.gl/PGE1Bx
>>>> >>>
>>>> >>>
>>>> >>> 2018-11-09 4:54 GMT+01:00 Vlad Kopylov <vladkopy at gmail.com>:
>>>> >>> > I am trying to test replicated Ceph with servers in different
>>>> >>> > buildings, and I have a read problem.
>>>> >>> > Reads from one building go to an OSD in another building and vice
>>>> >>> > versa, making reads slower than writes! This makes a read as slow as
>>>> >>> > the slowest node.
>>>> >>> >
>>>> >>> > Is there a way to
>>>> >>> > - disable parallel reads (so it reads only from the same OSD node
>>>> >>> > where the mon is);
>>>> >>> > - or give each client a read restriction per OSD?
>>>> >>> > - or maybe strictly specify the read OSD on mount;
>>>> >>> > - or have a node read-delay cap (for example, if a node's timeout is
>>>> >>> > larger than 2 ms, do not use that node for reads while other replicas
>>>> >>> > are available);
>>>> >>> > - or the ability to place clients on the CRUSH map, so it understands
>>>> >>> > that an OSD in, for example, the same datacenter as the client has
>>>> >>> > preference, and pulls data from it/them.
>>>> >>> >
>>>> >>> > Mounting with the kernel client, latest Mimic.
>>>> >>> >
>>>> >>> > Thank you!
>>>> >>> >
>>>> >>> > Vlad
>>>> >>> >
>>>> >>> > _______________________________________________
>>>> >>> > ceph-users mailing list
>>>> >>> > ceph-users at lists.ceph.com
>>>> >>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>> >>> >
>>>>
>>>