[ceph-users] How you handle failing/slow disks?

Alex Litvak alexander.v.litvak at gmail.com
Thu Nov 22 13:13:19 PST 2018


Sorry for hijacking the thread, but do you have an idea of what to watch for?

I monitor the admin sockets of the OSDs and occasionally see a burst of both op_w_process_latency and op_w_latency to around 150 - 200 ms on 7200 RPM enterprise SAS drives.
For example, the load average on the node jumps while the CPU is still 97% idle, and I see that, out of 12 OSDs, some probably have op_w_latency of 170 - 180 ms, 3 more have ~ 120 - 130 ms, and the rest are at 100 ms or below.
Does this say anything about a possible drive failure? I am running the drives inside a Dell PowerVault MD3400, and the storage unit shows them all green (OK).
Unfortunately, smartmon outside of the box tells me nothing other than that health is OK.

High load usually coincides with elevated op_w_latency on multiple OSDs (4 or more) at the same time.
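
For reference, here is a rough sketch of how such bursts could be pulled out of the admin socket. It assumes that "ceph daemon osd.N perf dump" returns JSON with an "osd" section whose latency counters look like {"avgcount": ..., "sum": ...}, with "sum" in seconds; please check this against the output of your release. The function names and the 100 ms threshold are my own choices for illustration, not anything Ceph provides. Since the counters are cumulative, the latency of a burst has to be computed from the difference between two samples rather than from a single dump.

```python
import json
import subprocess


def perf_dump(osd_id):
    """Fetch perf counters for one OSD via its admin socket.

    Must run on the host where the OSD lives. Assumes the
    `ceph daemon osd.N perf dump` command prints JSON.
    """
    out = subprocess.check_output(
        ["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"]
    )
    return json.loads(out)


def interval_latency_ms(prev, curr, counter="op_w_latency"):
    """Mean latency (ms) of the ops completed between two perf dumps.

    Assumes counters shaped like {"avgcount": N, "sum": seconds}.
    Returns None if no ops completed, or if the counters went
    backwards (for example after an OSD restart).
    """
    p = prev["osd"][counter]
    c = curr["osd"][counter]
    ops = c["avgcount"] - p["avgcount"]
    if ops <= 0:
        return None
    return (c["sum"] - p["sum"]) / ops * 1000.0


def slow_osds(latencies_ms, threshold_ms=100.0):
    """Return the sorted ids of OSDs whose interval latency exceeds the threshold."""
    return sorted(
        osd for osd, lat in latencies_ms.items()
        if lat is not None and lat > threshold_ms
    )
```

Sampling perf_dump() for every OSD on the node, say, once a minute and feeding consecutive samples through interval_latency_ms() would give a per-OSD series; slow_osds() then shows how many OSDs are above the threshold in the same interval, which is what I would compare against the load average.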

On 11/21/2018 10:26 AM, Paul Emmerich wrote:
> Yeah, we also observed problems with HP raid controllers misbehaving
> when a single disk starts to fail. We would never recommend building a
> Ceph cluster on HP raid controllers until they can fix that issue.
> 
> There are several features in Ceph which detect dead disks: there are
> timeouts for OSDs checking each other and there's a timeout for OSDs
> checking in with the mons. But that's usually not enough in this
> scenario. The good news is that recent Ceph versions will show which
> OSDs are implicated in slow requests (check ceph health detail) which
> at least gives you some way to figure out which OSDs are becoming
> slow.
> 
> We have found it to be useful to monitor the op_*_latency values of
> all OSDs (especially subop latencies) from the admin daemon to detect
> such failures earlier.
> 
> 
> Paul
> 



