[ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

Vlad Kopylov vladkopy at gmail.com
Tue Nov 13 16:53:52 PST 2018


Each of the 3 clients, in different buildings, picks the same primary
OSD (as chosen by primary-affinity), so everything is slow on at least
two of them. Instead of just reading from their local OSD, they mostly
read from that primary.

*What I need is something like primary-affinity for each client connection*

ID  CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
 -1       0.08189 root default
 -3       0.02730     host vm1
  0   hdd 0.02730         osd.0     up  1.00000 1.00000
-10       0.02730     host vm2
  1   hdd 0.02730         osd.1     up  1.00000 0.50000
 -5       0.02730     host vm3
  2   hdd 0.02730         osd.2     up  1.00000 0.50000
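
(The listing above is ceph osd tree output; the 0.50000 values in the
PRI-AFF column were presumably set per OSD, e.g.:

ceph osd primary-affinity osd.1 0.5
ceph osd primary-affinity osd.2 0.5

Note this setting is cluster-wide - it cannot differ per client, which
is exactly the problem.)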

v

On Tue, Nov 13, 2018 at 4:25 PM Jean-Charles Lopez <jelopez at redhat.com>
wrote:

> Hi Vlad,
>
> No need for a specific CRUSH map configuration. I’d suggest you use the
> primary-affinity setting on the OSDs so that only the OSDs that are close to
> your read point are selected as primary.
>
> See https://ceph.com/geen-categorie/ceph-primary-affinity/ for information
>
> Just set the primary affinity of all the OSDs in building 2 to 0.
>
> Only the OSDs in building 1 should then be used as primary OSDs.
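>
> For example, assuming osd.1 and osd.2 are the OSDs in building 2 (the
> OSD IDs here are only illustrative):
>
> ceph osd primary-affinity osd.1 0
> ceph osd primary-affinity osd.2 0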
>
> BR
> JC
>
> On Nov 13, 2018, at 12:19, Vlad Kopylov <vladkopy at gmail.com> wrote:
>
> Or is it possible to mount one OSD directly for file read access?
>
> v
>
> On Sun, Nov 11, 2018 at 1:47 PM Vlad Kopylov <vladkopy at gmail.com> wrote:
>
>> Maybe it is possible if done via an NFS gateway export?
>> Do the gateway settings allow selecting which OSD to read from?
>>
>> v
>>
>> On Sun, Nov 11, 2018 at 1:01 AM Martin Verges <martin.verges at croit.io>
>> wrote:
>>
>>> Hello Vlad,
>>>
>>> If you want to read the same data, then it is not possible (as far as
>>> I know).
>>>
>>> --
>>> Martin Verges
>>> Managing director
>>>
>>> Mobile: +49 174 9335695
>>> E-Mail: martin.verges at croit.io
>>> Chat: https://t.me/MartinVerges
>>>
>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>> CEO: Martin Verges - VAT-ID: DE310638492
>>> Com. register: Amtsgericht Munich HRB 231263
>>>
>>> Web: https://croit.io
>>> YouTube: https://goo.gl/PGE1Bx
>>>
>>> On Sat, Nov 10, 2018, 03:47 Vlad Kopylov <vladkopy at gmail.com>
>>> wrote:
>>>
>>>> Maybe I missed something, but CephFS explicitly selects the pools used
>>>> for file data and metadata, as I did below.
>>>> So if I create new pools, the data in them will be different. And if I
>>>> apply the dc1_primary rule to the cfs_data pool and a client from dc3
>>>> connects to fs t01, it will start using the dc1 hosts.
>>>>
>>>>
>>>> ceph osd pool create cfs_data 100
>>>> ceph osd pool create cfs_meta 100
>>>> ceph fs new t01 cfs_data cfs_meta
>>>> sudo mount -t ceph ceph1:6789:/ /mnt/t01 -o
>>>> name=admin,secretfile=/home/mciadmin/admin.secret
>>>>
>>>> rule dc1_primary {
>>>>         id 1
>>>>         type replicated
>>>>         min_size 1
>>>>         max_size 10
>>>>         step take dc1
>>>>         step chooseleaf firstn 1 type host
>>>>         step emit
>>>>         step take dc2
>>>>         step chooseleaf firstn -2 type host
>>>>         step emit
>>>>         step take dc3
>>>>         step chooseleaf firstn -2 type host
>>>>         step emit
>>>> }
>>>>
>>>> On Fri, Nov 9, 2018 at 9:32 PM Vlad Kopylov <vladkopy at gmail.com> wrote:
>>>>
>>>>> Just to confirm - it will still place 3 copies, one in each datacenter?
>>>>> I thought this map was to select where to write to; I guess it handles
>>>>> the write replication on the back end.
>>>>>
>>>>> I also thought pools are completely separate and clients would not see
>>>>> each other's data?
>>>>>
>>>>> Thank you Martin!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Nov 9, 2018 at 2:10 PM Martin Verges <martin.verges at croit.io>
>>>>> wrote:
>>>>>
>>>>>> Hello Vlad,
>>>>>>
>>>>>> you can generate something like this:
>>>>>>
>>>>>> rule dc1_primary_dc2_secondary {
>>>>>>         id 1
>>>>>>         type replicated
>>>>>>         min_size 1
>>>>>>         max_size 10
>>>>>>         step take dc1
>>>>>>         step chooseleaf firstn 1 type host
>>>>>>         step emit
>>>>>>         step take dc2
>>>>>>         step chooseleaf firstn 1 type host
>>>>>>         step emit
>>>>>>         step take dc3
>>>>>>         step chooseleaf firstn -2 type host
>>>>>>         step emit
>>>>>> }
>>>>>>
>>>>>> rule dc2_primary_dc1_secondary {
>>>>>>         id 2
>>>>>>         type replicated
>>>>>>         min_size 1
>>>>>>         max_size 10
>>>>>>         step take dc2
>>>>>>         step chooseleaf firstn 1 type host
>>>>>>         step emit
>>>>>>         step take dc1
>>>>>>         step chooseleaf firstn 1 type host
>>>>>>         step emit
>>>>>>         step take dc3
>>>>>>         step chooseleaf firstn -2 type host
>>>>>>         step emit
>>>>>> }
>>>>>>
>>>>>> After you have added these crush rules, you can assign them to the
>>>>>> pools:
>>>>>>
>>>>>> ~ $ ceph osd pool set <pool_for_dc1> crush_rule dc1_primary_dc2_secondary
>>>>>> ~ $ ceph osd pool set <pool_for_dc2> crush_rule dc2_primary_dc1_secondary
>>>>>>
>>>>>> Now you place your workload from dc1 in the dc1 pool, and the workload
>>>>>> from dc2 in the dc2 pool. You could also use HDDs with SSD journals (if
>>>>>> your workload isn't that write intensive) and save some money in dc3,
>>>>>> as your clients would always read from an SSD and write to the hybrid
>>>>>> set.
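>>>>>>
>>>>>> A rough sketch of what such a hybrid rule could look like, assuming
>>>>>> the OSDs carry ssd and hdd device classes (rule name and id are only
>>>>>> illustrative):
>>>>>>
>>>>>> rule dc1_ssd_primary_hybrid {
>>>>>>         id 3
>>>>>>         type replicated
>>>>>>         min_size 1
>>>>>>         max_size 10
>>>>>>         step take dc1 class ssd    # primary replica on a local SSD
>>>>>>         step chooseleaf firstn 1 type host
>>>>>>         step emit
>>>>>>         step take dc2 class hdd    # second replica on HDD in dc2
>>>>>>         step chooseleaf firstn 1 type host
>>>>>>         step emit
>>>>>>         step take dc3 class hdd    # remaining replicas on HDD in dc3
>>>>>>         step chooseleaf firstn -2 type host
>>>>>>         step emit
>>>>>> }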
>>>>>>
>>>>>> Btw. all of this could be done with a few simple clicks through our web
>>>>>> frontend. Even if you want to export it via CephFS / NFS / ... it is
>>>>>> possible to set this on a per-folder level. Feel free to take a look at
>>>>>> https://www.youtube.com/watch?v=V33f7ipw9d4 to see how easy it can
>>>>>> be.
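>>>>>>
>>>>>> With plain CephFS, the per-folder part would look roughly like this
>>>>>> (pool and path names are just examples): add the extra pool as a data
>>>>>> pool of the filesystem, then pin a directory to it via its layout:
>>>>>>
>>>>>> ceph fs add_data_pool t01 cfs_data_dc2
>>>>>> setfattr -n ceph.dir.layout.pool -v cfs_data_dc2 /mnt/t01/dc2
>>>>>>
>>>>>> New files created under that directory then go to cfs_data_dc2, while
>>>>>> existing files keep their old layout.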
>>>>>>
>>>>>> --
>>>>>> Martin Verges
>>>>>> Managing director
>>>>>>
>>>>>> Mobile: +49 174 9335695
>>>>>> E-Mail: martin.verges at croit.io
>>>>>> Chat: https://t.me/MartinVerges
>>>>>>
>>>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>>> CEO: Martin Verges - VAT-ID: DE310638492
>>>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>>>
>>>>>> Web: https://croit.io
>>>>>> YouTube: https://goo.gl/PGE1Bx
>>>>>>
>>>>>>
>>>>>> 2018-11-09 17:35 GMT+01:00 Vlad Kopylov <vladkopy at gmail.com>:
>>>>>> > Please disregard the pg status - one of the test VMs was down for
>>>>>> > some time and it is healing.
>>>>>> > The question is only how to make it read from the proper datacenter.
>>>>>> >
>>>>>> > If you have an example.
>>>>>> >
>>>>>> > Thanks
>>>>>> >
>>>>>> >
>>>>>> > On Fri, Nov 9, 2018 at 11:28 AM Vlad Kopylov <vladkopy at gmail.com>
>>>>>> wrote:
>>>>>> >>
>>>>>> >> Martin, thank you for the tip.
>>>>>> >> Googling ceph crush rule examples doesn't give much on rules, just
>>>>>> >> static placement of buckets.
>>>>>> >> All of this seems to be about placing data, not about giving a client
>>>>>> >> in a specific datacenter the proper OSD to read from.
>>>>>> >>
>>>>>> >> Maybe something is wrong with the placement groups?
>>>>>> >>
>>>>>> >> I added datacenters dc1, dc2, and dc3.
>>>>>> >> The current replicated_rule is:
>>>>>> >>
>>>>>> >> rule replicated_rule {
>>>>>> >>         id 0
>>>>>> >>         type replicated
>>>>>> >>         min_size 1
>>>>>> >>         max_size 10
>>>>>> >>         step take default
>>>>>> >>         step chooseleaf firstn 0 type host
>>>>>> >>         step emit
>>>>>> >> }
>>>>>> >>
>>>>>> >> # buckets
>>>>>> >> host ceph1 {
>>>>>> >>         id -3             # do not change unnecessarily
>>>>>> >>         id -2 class ssd   # do not change unnecessarily
>>>>>> >>         # weight 1.000
>>>>>> >>         alg straw2
>>>>>> >>         hash 0            # rjenkins1
>>>>>> >>         item osd.0 weight 1.000
>>>>>> >> }
>>>>>> >> datacenter dc1 {
>>>>>> >>         id -9             # do not change unnecessarily
>>>>>> >>         id -4 class ssd   # do not change unnecessarily
>>>>>> >>         # weight 1.000
>>>>>> >>         alg straw2
>>>>>> >>         hash 0            # rjenkins1
>>>>>> >>         item ceph1 weight 1.000
>>>>>> >> }
>>>>>> >> host ceph2 {
>>>>>> >>         id -5             # do not change unnecessarily
>>>>>> >>         id -6 class ssd   # do not change unnecessarily
>>>>>> >>         # weight 1.000
>>>>>> >>         alg straw2
>>>>>> >>         hash 0            # rjenkins1
>>>>>> >>         item osd.1 weight 1.000
>>>>>> >> }
>>>>>> >> datacenter dc2 {
>>>>>> >>         id -10            # do not change unnecessarily
>>>>>> >>         id -8 class ssd   # do not change unnecessarily
>>>>>> >>         # weight 1.000
>>>>>> >>         alg straw2
>>>>>> >>         hash 0            # rjenkins1
>>>>>> >>         item ceph2 weight 1.000
>>>>>> >> }
>>>>>> >> host ceph3 {
>>>>>> >>         id -7             # do not change unnecessarily
>>>>>> >>         id -12 class ssd  # do not change unnecessarily
>>>>>> >>         # weight 1.000
>>>>>> >>         alg straw2
>>>>>> >>         hash 0            # rjenkins1
>>>>>> >>         item osd.2 weight 1.000
>>>>>> >> }
>>>>>> >> datacenter dc3 {
>>>>>> >>         id -11            # do not change unnecessarily
>>>>>> >>         id -13 class ssd  # do not change unnecessarily
>>>>>> >>         # weight 1.000
>>>>>> >>         alg straw2
>>>>>> >>         hash 0            # rjenkins1
>>>>>> >>         item ceph3 weight 1.000
>>>>>> >> }
>>>>>> >> root default {
>>>>>> >>         id -1             # do not change unnecessarily
>>>>>> >>         id -14 class ssd  # do not change unnecessarily
>>>>>> >>         # weight 3.000
>>>>>> >>         alg straw2
>>>>>> >>         hash 0            # rjenkins1
>>>>>> >>         item dc1 weight 1.000
>>>>>> >>         item dc2 weight 1.000
>>>>>> >>         item dc3 weight 1.000
>>>>>> >> }
>>>>>> >>
>>>>>> >>
>>>>>> >> #ceph pg dump
>>>>>> >> dumped all
>>>>>> >> version 29433
>>>>>> >> stamp 2018-11-09 11:23:44.510872
>>>>>> >> last_osdmap_epoch 0
>>>>>> >> last_pg_scan 0
>>>>>> >> PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
>>>>>> >> 1.5f 0 0 0 0 0 0 0 0 active+clean 2018-11-09 04:35:32.320607 0'0 544:1317 [0,2,1] 0 [0,2,1] 0 0'0 2018-11-09 04:35:32.320561 0'0 2018-11-04 11:55:54.756115 0
>>>>>> >> 2.5c 143 0 143 0 0 19490267 461 461 active+undersized+degraded 2018-11-08 19:02:03.873218 508'461 544:2100 [2,1] 2 [2,1] 2 290'380 2018-11-07 18:58:43.043719 64'120 2018-11-05 14:21:49.256324 0
>>>>>> >> .....
>>>>>> >> sum 15239 0 2053 2659 0 2157615019 58286 58286
>>>>>> >> OSD_STAT USED    AVAIL  TOTAL  HB_PEERS PG_SUM PRIMARY_PG_SUM
>>>>>> >> 2        3.7 GiB 28 GiB 32 GiB    [0,1]    200             73
>>>>>> >> 1        3.7 GiB 28 GiB 32 GiB    [0,2]    200             58
>>>>>> >> 0        3.7 GiB 28 GiB 32 GiB    [1,2]    173             69
>>>>>> >> sum       11 GiB 85 GiB 96 GiB
>>>>>> >>
>>>>>> >> #ceph pg map 2.5c
>>>>>> >> osdmap e545 pg 2.5c (2.5c) -> up [2,1] acting [2,1]
>>>>>> >>
>>>>>> >> #ceph pg map 1.5f
>>>>>> >> osdmap e547 pg 1.5f (1.5f) -> up [0,2,1] acting [0,2,1]
>>>>>> >>
>>>>>> >>
>>>>>> >> On Fri, Nov 9, 2018 at 2:21 AM Martin Verges <martin.verges at croit.io>
>>>>>> >> wrote:
>>>>>> >>>
>>>>>> >>> Hello Vlad,
>>>>>> >>>
>>>>>> >>> Ceph clients connect to the primary OSD of each PG. If you create a
>>>>>> >>> crush rule for building1 and one for building2 that each take an OSD
>>>>>> >>> from the local building as the first replica, your reads from the
>>>>>> >>> pool will always stay in the same building (if the cluster is
>>>>>> >>> healthy), and only write requests get replicated to the other
>>>>>> >>> building.
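>>>>>> >>>
>>>>>> >>> You can check which OSD ends up primary for a given object with,
>>>>>> >>> e.g. (pool and object names are just examples):
>>>>>> >>>
>>>>>> >>> ceph osd map cfs_data someobject
>>>>>> >>>
>>>>>> >>> The 'p' entry in the acting set it prints is the primary the
>>>>>> >>> client will read from.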
>>>>>> >>>
>>>>>> >>> --
>>>>>> >>> Martin Verges
>>>>>> >>> Managing director
>>>>>> >>>
>>>>>> >>> Mobile: +49 174 9335695
>>>>>> >>> E-Mail: martin.verges at croit.io
>>>>>> >>> Chat: https://t.me/MartinVerges
>>>>>> >>>
>>>>>> >>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>>> >>> CEO: Martin Verges - VAT-ID: DE310638492
>>>>>> >>> Com. register: Amtsgericht Munich HRB 231263
>>>>>> >>>
>>>>>> >>> Web: https://croit.io
>>>>>> >>> YouTube: https://goo.gl/PGE1Bx
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> 2018-11-09 4:54 GMT+01:00 Vlad Kopylov <vladkopy at gmail.com>:
>>>>>> >>> > I am trying to test replicated ceph with servers in different
>>>>>> >>> > buildings, and I have a read problem.
>>>>>> >>> > Reads from one building go to an OSD in another building and vice
>>>>>> >>> > versa, making reads slower than writes! This makes reads as slow
>>>>>> >>> > as the slowest node.
>>>>>> >>> >
>>>>>> >>> > Is there a way to
>>>>>> >>> > - disable parallel reads (so a client reads only from the same OSD
>>>>>> >>> > node where its mon is);
>>>>>> >>> > - or give each client a per-OSD read restriction?
>>>>>> >>> > - or strictly specify the read OSD at mount time;
>>>>>> >>> > - or have a node read-delay cap (for example, if a node's latency
>>>>>> >>> > is larger than 2 ms, do not use that node for reads while other
>>>>>> >>> > replicas are available);
>>>>>> >>> > - or the ability to place clients on the CRUSH map, so it
>>>>>> >>> > understands that an OSD in - for example - the same datacenter as
>>>>>> >>> > the client has preference, and pulls data from it/them.
>>>>>> >>> >
>>>>>> >>> > Mounting with the kernel client, latest Mimic.
>>>>>> >>> >
>>>>>> >>> > Thank you!
>>>>>> >>> >
>>>>>> >>> > Vlad
>>>>>> >>> >
>>>>>> >>> >
>>>>>>
>
>
>