[ceph-users] Slow requests in cache tier with rep_size 2

David Turner drakonstein at gmail.com
Wed Nov 1 11:50:04 PDT 2017


Here's some good reading for you.
https://www.spinics.net/lists/ceph-users/msg32895.html

I really like how Wido puts it, "Losing two disks at the same time is
something which doesn't happen that much, but if it happens you don't want
to modify any data on the only copy which you still have left. Setting
min_size to 1 should be a manual action imho when size = 3 and you lose
two copies. In that case YOU decide at that moment if it is the right
course of action."
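
For reference, that manual action is just a per-pool setting change. A
minimal sketch, assuming an affected pool named "rbd" (substitute your
own pool name):

  # allow I/O with only one remaining copy -- only after YOU decide it's safe
  ceph osd pool set rbd min_size 1

  # once recovery has restored enough copies, put the safety back
  ceph osd pool set rbd min_size 2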


On Wed, Nov 1, 2017 at 11:36 AM Mazzystr <mazzystr at gmail.com> wrote:

> I disagree.
>
> We have the following setting...
>   osd pool default size = 3
>   osd pool default min size = 1
>
> There's math that needs to be done for 'osd pool default size'.  A
> setting of 3 and 1 allows for 2 disks to fail ... at the same time ...
> without loss of data.  This is standard storage admin practice.  NetApp
> advertises this as protection from double disk failure.
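>
> These defaults only apply to newly created pools; a minimal sketch of
> checking and changing an existing pool, where "rbd" is just an example
> pool name:
>
>   ceph osd pool get rbd size        # current replica count
>   ceph osd pool get rbd min_size    # copies required to keep serving I/O
>   ceph osd pool set rbd size 3      # change the replica count on an existing pool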
>
> Higher replication settings need to take into account desired usable
> capacity, power, data center space, and mean time to replication.  I was
> at a site that had 8000 disks in an HDFS cluster.  We were replacing a
> dozen disks weekly.
>
> On the flip side, shutting down client access because of a disk failure
> in the cluster is *unacceptable* to a product.
>
> On Wed, Nov 1, 2017 at 10:08 AM, David Turner <drakonstein at gmail.com>
> wrote:
>
>> PPS - or min_size 1 in production
>>
>> On Wed, Nov 1, 2017 at 10:08 AM David Turner <drakonstein at gmail.com>
>> wrote:
>>
>>> What is your min_size in the cache pool?  If your min_size is 2, then
>>> the cluster will block requests to any PG in that pool that has fewer
>>> than 2 copies available.
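>>>
>>> A quick way to check, as a minimal sketch ("cache-pool" is just a
>>> placeholder for your cache tier's pool name):
>>>
>>>   ceph osd pool ls detail | grep cache-pool   # shows size and min_size
>>>   ceph health detail                          # shows slow requests and degraded PGs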
>>>
>>> PS - Please don't consider using rep_size 2 in production.
>>>
>>> On Wed, Nov 1, 2017 at 5:14 AM Eugen Block <eblock at nde.ag> wrote:
>>>
>>>> Hi experts,
>>>>
>>>> we have upgraded our cluster to Luminous successfully, no big issues
>>>> so far. We are also testing a cache tier with only 2 SSDs (we know it's
>>>> not recommended), and there's one issue to be resolved:
>>>>
>>>> Every time we have to restart the cache OSDs we get slow requests with
>>>> impact on our client VMs. We never restart both SSDs at the same
>>>> time, of course, so we are wondering why the slow requests? What
>>>> exactly is happening with a replication size of 2? We assumed that if
>>>> one SSD is restarted the data is still available on the other disk, so
>>>> we didn't expect any problems. But this occurs every time we have to
>>>> restart them. Can anyone explain what is going on there? We could use
>>>> your expertise on this :-)
>>>>
>>>> Best regards,
>>>> Eugen
>>>>
>>>> --
>>>> Eugen Block                             voice   : +49-40-559 51 75
>>>> NDE Netzdesign und -entwicklung AG      fax     : +49-40-559 51 77
>>>> Postfach 61 03 15
>>>> D-22423 Hamburg                         e-mail  : eblock at nde.ag
>>>>
>>>>          Vorsitzende des Aufsichtsrates: Angelika Mozdzen
>>>>            Sitz und Registergericht: Hamburg, HRB 90934
>>>>                    Vorstand: Jens-U. Mozdzen
>>>>                     USt-IdNr. DE 814 013 983
>>>>
>>>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

