[ceph-users] One OSD misbehaving (spinning 100% CPU, delayed ops)

Brad Hubbard bhubbard at redhat.com
Wed Nov 29 21:51:47 PST 2017

# ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan | grep ceph-osd

Run this to find the actual thread that is using 100% CPU.
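If you already know the PID of the OSD, a small helper can pick the busiest thread for you. This is only a sketch: the function name hottest_tid and the OSD_PID placeholder are made up for illustration, and it assumes the procps version of ps.

```shell
# hottest_tid: read "TID %CPU" pairs on stdin and print the TID with the
# highest %CPU. The function name is illustrative, not part of any tool.
hottest_tid() {
    awk 'NR == 1 || $2 + 0 > max { max = $2 + 0; tid = $1 }
         END { if (NR > 0) print tid }'
}

# Example (OSD_PID is a placeholder for the ceph-osd process id):
#   ps -L -o tid=,pcpu= -p "$OSD_PID" | hottest_tid
```

Selecting by PID means the result does not depend on what the thread's comm name is. Keep in mind that the %cpu ps reports is averaged over the thread's lifetime, so for an OSD that only spins intermittently it may be worth cross-checking with top -H -p [PID] while the problem is happening.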

# for x in `seq 1 5`; do gdb -batch -p [PID] -ex "thread apply all bt";
echo; done > /tmp/osd.stack.dump

Then look at the stacks for the thread that was using all the CPU and
see what it was doing at the time.
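gdb labels each thread with a header like "Thread 12 (Thread 0x7f... (LWP 12345)):", and the LWP number there is the same value ps shows in the tid column. Something along these lines can pull just that thread's backtraces out of the dump. It is a rough sketch: the function name stack_for_lwp is made up, and it assumes gdb separates threads with a blank line, which is the usual layout.

```shell
# stack_for_lwp: print every backtrace for one LWP from a gdb
# "thread apply all bt" dump. Arguments: $1 = LWP (the TID that ps
# reports), $2 = dump file. The function name is illustrative.
stack_for_lwp() {
    awk -v needle="(LWP $1)" '
        /^Thread [0-9]+ / { show = (index($0, needle) > 0) }  # new thread header
        /^$/              { if (show) print ""; show = 0; next }  # end of that thread
        show                                                   # print matching frames
    ' "$2"
}

# Example (12345 is a placeholder TID):
#   stack_for_lwp 12345 /tmp/osd.stack.dump
```

Because the dump holds five samples, you should get up to five backtraces for the thread. If the same frames show up in every sample, that is most likely where it is spinning.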

Note that you may need to install debuginfo for ceph to see meaningful
stack traces. How you go about this depends on the distro you are using.

On Thu, Nov 30, 2017 at 8:48 AM, Denes Dolhay wrote:
> Hello,
> You might consider checking the iowait (during the problem), and the dmesg
> (after it recovered). Maybe an issue with the given sata/sas/nvme port?
> Regards,
> Denes
> On 11/29/2017 06:24 PM, Matthew Vernon wrote:
>> Hi,
>> We have a 3,060 OSD ceph cluster (running Jewel
>> 10.2.7-0ubuntu0.16.04.1), and one OSD on one host keeps misbehaving - by
>> which I mean it keeps spinning ~100% CPU (cf ~5% for other OSDs on that
>> host), and having ops blocking on it for some time. It will then behave
>> for a bit, and then go back to doing this.
>> It's always the same OSD, and we've tried replacing the underlying disk.
>> The logs have lots of entries of the form
>> 2017-11-29 17:18:51.097230 7fcc06919700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7fcc29fec700' had timed out after 15
>> I've had a brief poke through the collectd metrics for this osd (and
>> comparing them with other OSDs on the same host) but other than showing
>> spikes in latency for that OSD (iostat et al show no issues with the
>> underlying disk) there's nothing obviously explanatory.
>> I tried ceph tell osd.2054 injectargs --osd-op-thread-timeout 90 (which
>> is what googling for the above message suggests), but that just said
>> "unchangeable", and didn't seem to make any difference.
>> Any ideas? Other metrics to consider? ...
>> Thanks,
>> Matthew
