[ceph-users] One OSD misbehaving (spinning 100% CPU, delayed ops)

Denes Dolhay denke at denkesys.com
Wed Nov 29 14:48:28 PST 2017


You might consider checking iowait (while the problem is occurring) and
dmesg (after the OSD has recovered). Maybe an issue with the given SATA/SAS/NVMe
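
If it helps, here is a rough sketch of one way to measure iowait over a
short window by sampling /proc/stat twice. It assumes a Linux host and the
standard field order of the aggregate "cpu" line; the helper names and the
5 second interval are arbitrary choices of mine. Note that this figure is
host-wide, so on a box with many OSDs a single slow device may be diluted;
per-device numbers (e.g. from iostat -x) are worth comparing as well.

```python
import time


def parse_cpu_line(stat_text):
    """Return the counters of the aggregate 'cpu' line of /proc/stat as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(v) for v in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two /proc/stat samples."""
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted in user and nice, so only the
    # first eight fields are summed for the total.
    b = parse_cpu_line(before)[:8]
    a = parse_cpu_line(after)[:8]
    total = sum(a) - sum(b)
    if total <= 0:
        return 0.0
    return 100.0 * (a[4] - b[4]) / total


def read_stat(path="/proc/stat"):
    """Read the raw contents of /proc/stat."""
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    first = read_stat()
    time.sleep(5)  # sampling interval; an arbitrary choice
    second = read_stat()
    print("iowait: %.1f%%" % iowait_percent(first, second))
```

Running this in a loop while the OSD is spinning, and again while it is
behaving, should show whether the bad periods line up with I/O stalls.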



On 11/29/2017 06:24 PM, Matthew Vernon wrote:
> Hi,
> We have a 3,060 OSD ceph cluster (running Jewel
> 10.2.7-0ubuntu0.16.04.1), and one OSD on one host keeps misbehaving - by
> which I mean it keeps spinning ~100% CPU (cf ~5% for other OSDs on that
> host), and having ops blocking on it for some time. It will then behave
> for a bit, and then go back to doing this.
> It's always the same OSD, and we've tried replacing the underlying disk.
> The logs have lots of entries of the form:
>   2017-11-29 17:18:51.097230 7fcc06919700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fcc29fec700' had timed out after 15
> I've had a brief poke through the collectd metrics for this osd (and
> comparing them with other OSDs on the same host) but other than showing
> spikes in latency for that OSD (iostat et al show no issues with the
> underlying disk) there's nothing obviously explanatory.
> I tried ceph tell osd.2054 injectargs --osd-op-thread-timeout 90 (which
> is what googling for the above message suggests), but that just said
> "unchangeable", and didn't seem to make any difference.
> Any ideas? Other metrics to consider? ...
> Thanks,
> Matthew
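
To line the timeouts up against iowait or the collectd latency spikes, it
may help to count the heartbeat_map entries per minute. The sketch below is
only an illustration: the regular expression is modelled on the one log
entry quoted above, assumes each entry is a single line in the actual log
file, and may not match other Ceph releases. The function name and the
command-line handling are hypothetical.

```python
import re
import sys
from collections import Counter

# Modelled on the single entry quoted in the original mail (Jewel 10.2.7);
# the exact layout on other releases is an assumption.
TIMEOUT_RE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d+ \S+\s+\d+ "
    r"heartbeat_map is_healthy '(?P<pool>\S+) thread (?P<thread>0x[0-9a-f]+)' "
    r"had timed out after (?P<secs>\d+)"
)


def timeouts_per_minute(lines):
    """Count heartbeat timeout entries, keyed by 'YYYY-MM-DD HH:MM'."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.match(line)
        if match:
            counts[match.group("minute")] += 1
    return counts


if __name__ == "__main__":
    # Usage: python count_timeouts.py <path to the OSD log>
    with open(sys.argv[1]) as log:
        for minute, count in sorted(timeouts_per_minute(log).items()):
            print(minute, count)
```

Minutes with a high count can then be compared with the periods where the
OSD sits at ~100% CPU.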

