[ceph-users] One OSD misbehaving (spinning 100% CPU, delayed ops)

Matthew Vernon mv3 at sanger.ac.uk
Wed Nov 29 09:24:02 PST 2017


We have a 3,060-OSD Ceph cluster (running Jewel
10.2.7-0ubuntu0.16.04.1), and one OSD on one host keeps misbehaving: it
spins at ~100% CPU (cf. ~5% for the other OSDs on that host) and ops
block on it for some time. It then behaves for a bit before going back
to doing this.

It's always the same OSD, and we've tried replacing the underlying disk.

The logs have lots of entries of the form

2017-11-29 17:18:51.097230 7fcc06919700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7fcc29fec700' had timed out after 15
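
To see whether these timeouts cluster in time or on particular worker threads, the log entries can be tallied with a short script. This is a minimal sketch, not a Ceph tool: it assumes each entry sits on a single line in the OSD log (the wrap above comes from the email), and the function name summarize_timeouts and the log path in the usage comment are hypothetical.

```python
import re
from collections import Counter

# Matches entries such as (on one line in the real log):
# 2017-11-29 17:18:51.097230 7fcc06919700  1 heartbeat_map is_healthy
#   'OSD::osd_op_tp thread 0x7fcc29fec700' had timed out after 15
TIMEOUT_RE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d+\s+\S+\s+\d+\s+"
    r"heartbeat_map is_healthy '(?P<pool>\S+) thread (?P<thread>0x[0-9a-f]+)' "
    r"had timed out after (?P<secs>\d+)"
)


def summarize_timeouts(lines):
    """Count heartbeat_map timeouts per minute and per (pool, thread)."""
    per_minute = Counter()
    per_thread = Counter()
    for line in lines:
        match = TIMEOUT_RE.match(line)
        if match is None:
            continue  # not a heartbeat_map timeout entry
        per_minute[match.group("minute")] += 1
        per_thread[(match.group("pool"), match.group("thread"))] += 1
    return per_minute, per_thread


# Usage (the path is an assumption; adjust to the local log location):
#   with open("/var/log/ceph/ceph-osd.2054.log") as log:
#       per_minute, per_thread = summarize_timeouts(log)
#   for minute, count in sorted(per_minute.items()):
#       print(minute, count)
```

A burst confined to a few minutes, or to one thread, would point in a different direction than timeouts spread evenly over all op threads.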

I've had a brief poke through the collectd metrics for this OSD and
compared them with those of other OSDs on the same host. They show
spikes in latency for that OSD, but nothing obviously explanatory
(iostat et al show no issues with the underlying disk).
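
The comparison against the other OSDs on the host can also be done mechanically rather than by eye. The following is a rough sketch under stated assumptions: the metric names, the peer OSD ids, the factor of 3, and the function outlier_metrics are all hypothetical, and the input is taken to be one already-collected sample per OSD (for example exported from collectd), not anything read directly from Ceph.

```python
from statistics import median


def outlier_metrics(samples, suspect, factor=3.0):
    """Return metrics where `suspect` exceeds `factor` times the peer median.

    `samples` maps an OSD name to a dict of metric name -> numeric value.
    The result maps each flagged metric to (suspect value, peer median).
    """
    peers = [name for name in samples if name != suspect]
    flagged = {}
    for metric, value in samples[suspect].items():
        peer_values = [samples[p][metric] for p in peers if metric in samples[p]]
        if not peer_values:
            continue  # no peer reports this metric, nothing to compare with
        baseline = median(peer_values)
        if baseline > 0 and value > factor * baseline:
            flagged[metric] = (value, baseline)
    return flagged


# Illustrative values only; osd.2055 and osd.2056 are made-up peers.
example = {
    "osd.2054": {"cpu_pct": 100.0, "disk_util_pct": 20.0},
    "osd.2055": {"cpu_pct": 5.0, "disk_util_pct": 18.0},
    "osd.2056": {"cpu_pct": 6.0, "disk_util_pct": 22.0},
}
print(outlier_metrics(example, "osd.2054"))
```

Using the median of the peers rather than the mean keeps one other busy OSD from hiding the outlier.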

I tried ceph tell osd.2054 injectargs --osd-op-thread-timeout 90 (which
is what googling for the above message suggests), but the command just
reported "unchangeable", and it didn't seem to make any difference.

Any ideas? Other metrics to consider? ...



