[ceph-users] One OSD misbehaving (spinning 100% CPU, delayed ops)
jelopez at redhat.com
Wed Nov 29 10:29:22 PST 2017
Is anything special happening on the NIC side that could cause a problem? Packet drops? Incorrect jumbo frame settings causing fragmentation?
Have you checked the C-state settings on the box?
Have you disabled energy-saving settings differently from the other boxes?
Any unexpected wait times on some devices on the box?
Have you compared the kernel parameters on this box against those on the other boxes?
Just in case -- a rough checklist for the points above is sketched below.
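(The interface name, the comparison host and the sampling interval in the sketch are placeholders, not anything specific to your cluster -- adjust for your environment.)

  # NIC errors / drops on the interface carrying OSD traffic
  ip -s link show eth0
  ethtool -S eth0 | grep -Ei 'drop|err|discard'

  # jumbo frame sanity check: 8972 bytes of payload + 28 bytes of headers = 9000,
  # and -M do forbids fragmentation, so failures point at an MTU mismatch on the path
  ping -M do -s 8972 -c 3 other-ceph-host

  # C-states and frequency scaling (deep idle states / powersave governors add latency)
  cpupower idle-info
  cpupower frequency-info

  # per-device wait times and utilisation
  iostat -x 1 5

  # kernel parameter diff against a known-good host
  diff <(sysctl -a 2>/dev/null | sort) <(ssh other-ceph-host 'sysctl -a 2>/dev/null | sort')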
> On Nov 29, 2017, at 09:24, Matthew Vernon <mv3 at sanger.ac.uk> wrote:
> We have a 3,060 OSD ceph cluster (running Jewel
> 10.2.7-0ubuntu0.16.04.1), and one OSD on one host keeps misbehaving - by
> which I mean it keeps spinning ~100% CPU (cf ~5% for other OSDs on that
> host), and having ops blocking on it for some time. It will then behave
> for a bit, and then go back to doing this.
> It's always the same OSD, and we've tried replacing the underlying disk.
> The logs have lots of entries of the form
> 2017-11-29 17:18:51.097230 7fcc06919700 1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7fcc29fec700' had timed out after 15
> I've had a brief poke through the collectd metrics for this OSD (and
> compared them with other OSDs on the same host), but other than showing
> spikes in latency for that OSD (iostat et al. show no issues with the
> underlying disk) there's nothing obviously explanatory.
> I tried ceph tell osd.2054 injectargs --osd-op-thread-timeout 90 (which
> is what googling for the above message suggests), but that just said
> "unchangeable", and didn't seem to make any difference.
> Any ideas? Other metrics to consider? ...
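Regarding the injectargs attempt above: if the option really is refused at
runtime, setting it under [osd] in ceph.conf and restarting that OSD should
apply it, and the value actually in effect can be verified over the admin
socket. The admin socket can also help with the "other metrics" question,
since it exposes slow/in-flight ops and per-OSD counters directly. A rough
sketch, to be run on the host carrying the problem OSD (osd.2054 here only
because that is the id from your example):

  # value currently in effect for this daemon
  ceph daemon osd.2054 config get osd_op_thread_timeout

  # operations currently in flight / recently completed slow ops on this OSD
  ceph daemon osd.2054 dump_ops_in_flight
  ceph daemon osd.2054 dump_historic_ops

  # internal performance counters (op latencies, queue lengths, ...)
  ceph daemon osd.2054 perf dump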