[ceph-users] REQUEST_SLOW across many OSDs at the same time

mart.v mart.v at seznam.cz
Mon Mar 11 01:13:36 PDT 2019


Well, the drive supports trim:

# hdparm -I /dev/sdd|grep TRIM

           *    Data Set Management TRIM supported (limit 8 blocks)

           *    Deterministic read ZEROs after TRIM

But neither fstrim nor discard is enabled (I have checked both the mount
options and the services/cron jobs). I'm using the Proxmox defaults; the OSDs
are created like this:
 ceph-volume lvm create --bluestore --data /dev/sdX
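
For the record, the checks boil down to roughly this (assuming the stock
Debian/systemd layout that Proxmox 5 uses), and none of them show discard or
fstrim being active here:

# grep discard /proc/mounts
# systemctl status fstrim.timer
# grep -r fstrim /etc/cron*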

Best,
Martin
---------- Original e-mail ----------
From: Matthew H <matthew.heler at hotmail.com>
To: Paul Emmerich <paul.emmerich at croit.io>, Massimo Sgaravatto <massimo.sgaravatto at gmail.com>
Date: 28. 2. 2019 10:51:36
Subject: Re: [ceph-users] REQUEST_SLOW across many OSDs at the same time
" 

Is fstrim or discard enabled for these SSDs? If so, how did you enable it?

I've seen similar issues with poor controllers on SSDs. They tend to block
I/O when TRIM kicks off.

Thanks,

----------------------------------------------------------------------------

From: ceph-users <ceph-users-bounces at lists.ceph.com> on behalf of Paul 
Emmerich <paul.emmerich at croit.io>
Sent: Friday, February 22, 2019 9:04 AM
To: Massimo Sgaravatto
Cc: Ceph Users
Subject: Re: [ceph-users] REQUEST_SLOW across many OSDs at the same time 
 
Bad SSDs can also cause this. Which SSD are you using?

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Feb 22, 2019 at 2:53 PM Massimo Sgaravatto
<massimo.sgaravatto at gmail.com> wrote:
>
> A couple of hints for debugging the issue (since I recently had to debug a
problem with the same symptoms):
>
> - As far as I understand, the reported 'implicated osds' are only the
primary ones. In the OSD logs you should also find the relevant PG number,
and with this information you can get all the involved OSDs. This might be
useful, e.g., to see whether a specific OSD node is always involved. This
was my case (the problem was with the patch cable connecting the node)
>
> - You can use the "ceph daemon osd.x dump_historic_ops" command to debug
some of these slow requests (to see which events take the most time)
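
For reference, the first hint in practice looks roughly like this: grep the
primary OSD's log for the slow request lines, then map the PG id to its full
acting set (osd.10 matches one of the implicated OSDs below; the PG id and
the default log path are only examples):

# grep 'slow request' /var/log/ceph/ceph-osd.10.log
# ceph pg map 3.1f

"ceph pg map" prints the up and acting sets for that PG, i.e. every OSD
involved in the request, not only the primary.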
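
For the second hint, note that the admin socket command has to be run on the
node hosting that OSD, e.g.:

# ceph daemon osd.10 dump_historic_ops

In the JSON output, look for large gaps between consecutive events (e.g.
queued_for_pg -> reached_pg, or the sub-op waits) to see where a slow request
spent its time.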
>
> Cheers, Massimo
>
> On Fri, Feb 22, 2019 at 10:28 AM mart.v <mart.v at seznam.cz> wrote:
>>
>> Hello everyone,
>>
>> I'm experiencing some strange behaviour. My cluster is relatively small (43
OSDs, 11 nodes), running Ceph 12.2.10 (and Proxmox 5). Nodes are connected
via a 10 Gbit network (Nexus 6000). The cluster is mixed (SSD and HDD), but
the pools are separate. The described error occurs only on the SSD part of
the cluster.
>>
>> I noticed that a few times a day the cluster slows down a bit, and I have
discovered this in the logs:
>>
>> 2019-02-22 08:21:20.064396 mon.node1 mon.0 172.16.254.101:6789/0 1794159 : cluster [WRN] Health check failed: 27 slow requests are blocked > 32 sec. Implicated osds 10,22,33 (REQUEST_SLOW)
>> 2019-02-22 08:21:26.589202 mon.node1 mon.0 172.16.254.101:6789/0 1794169 : cluster [WRN] Health check update: 199 slow requests are blocked > 32 sec. Implicated osds 0,4,5,6,7,8,9,10,12,16,17,19,20,21,22,25,26,33,41 (REQUEST_SLOW)
>> 2019-02-22 08:21:32.655671 mon.node1 mon.0 172.16.254.101:6789/0 1794183 : cluster [WRN] Health check update: 448 slow requests are blocked > 32 sec. Implicated osds 0,3,4,5,6,7,8,9,10,12,15,16,17,19,20,21,22,24,25,26,33,41 (REQUEST_SLOW)
>> 2019-02-22 08:21:38.744210 mon.node1 mon.0 172.16.254.101:6789/0 1794210 : cluster [WRN] Health check update: 388 slow requests are blocked > 32 sec. Implicated osds 4,8,10,16,24,33 (REQUEST_SLOW)
>> 2019-02-22 08:21:42.790346 mon.node1 mon.0 172.16.254.101:6789/0 1794214 : cluster [INF] Health check cleared: REQUEST_SLOW (was: 18 slow requests are blocked > 32 sec. Implicated osds 8,16)
>>
>> "ceph health detail" shows nothing more
>>
>> It happens throughout the day, and the times can't be linked to any read-
or write-intensive task (e.g. backup). I also tried disabling scrubbing, but
the slow requests kept coming. These errors were not there from the
beginning, but unfortunately I cannot track down the day they started (it is
beyond my logs).
>>
>> Any ideas?
>>
>> Thank you!
>> Martin
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________ 
ceph-users mailing list 
ceph-users at lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
"