[ceph-users] Another OSD broken today. How can I recover it?

Ronny Aasen ronny+ceph-users at aasen.cx
Mon Dec 4 03:21:47 PST 2017


On 04. des. 2017 10:22, Gonzalo Aguilar Delgado wrote:
> Hello,
> 
> Things are going worse every day.
> 
> 
> ceph -w
>      cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771
>       health HEALTH_ERR
>              1 pgs are stuck inactive for more than 300 seconds
>              8 pgs inconsistent
>              1 pgs repair
>              1 pgs stale
>              1 pgs stuck stale
>              recovery 20266198323167232/288980 objects degraded 
> (7013010700798.405%)
>              37154696925806624 scrub errors
>              no legacy OSD present but 'sortbitwise' flag is not set
> 
> 
> But I'm finally finding time to recover. The disk seems to be fine: 
> no SMART errors and everything looks OK, it's just ceph not starting. 
> Today I started to look at ceph-objectstore-tool, which I don't 
> really know much about.
> 
> It just works fine. There is no crash like the one I get from the OSD.
> 
> So I'm lost. Since both the OSD and ceph-objectstore-tool use the same 
> backend, how is this possible?
> 
> Can someone help me on fixing this, please?
> 
> 
> 
This line seems quite insane:
recovery 20266198323167232/288980 objects degraded (7013010700798.405%)

There is obviously something wrong in your cluster. Once the defective 
OSD is down/out, does the cluster eventually heal to HEALTH_OK?
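If the broken OSD is not out yet, something along these lines (osd.XX is 
a placeholder for your OSD id) takes it out of data placement so you can 
watch whether the cluster recovers:

ceph osd out osd.XX     # stop placing data on the broken OSD
ceph -w                 # watch recovery progress
ceph health detail      # list the remaining problem PGs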

You should start by reading and understanding this page:
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/

Also, in order to get assistance you need to provide a lot more detail: 
how many nodes, how many OSDs per node, what kind of nodes (CPU/RAM), 
and what kind of networking setup.

Show the output from:
ceph -s
ceph osd tree
ceph osd pool ls detail
ceph health detail
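
For example, something like this (the file name is just an example) 
collects it all into one file you can paste:

{ ceph -s; ceph osd tree; ceph osd pool ls detail; ceph health detail; } > /tmp/cluster-state.txt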




Since you are systematically losing OSDs, I would start by checking the 
timestamp in the defective OSD's log for when it died.
Double-check your clock sync settings so that all servers are time 
synchronized, and then check all logs for the time in question.
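A rough sketch of what I mean (the paths, the OSD id XX and the 
timestamp are just examples, adjust to your setup):

tail -n 50 /var/log/ceph/ceph-osd.XX.log   # when did the OSD die?
timedatectl status                         # or: ntpq -p / chronyc sources, to verify clock sync
grep '2017-12-04 09:' /var/log/ceph/ceph-mon.*.log /var/log/syslog   # what else happened around that time?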

Especially dmesg: did the OOM killer do something? Was networking flaky?
Mon logs? Did they complain about the OSD in some fashion?
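
For example (the exact wording of the OOM message varies between kernels):

dmesg -T | grep -iE 'out of memory|oom|killed process'   # did the kernel kill ceph-osd?
dmesg -T | grep -iE 'link is down|link is up'            # any NIC link flaps around that time?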


Also, since the OSD fails to start again, there is probably some 
corruption going on. Bump the log level for that OSD in the node's 
ceph.conf, something like

[osd.XX]
debug osd = 20

Rename the log for the OSD so you have a fresh file, then try to start 
the OSD once. Put the log on some pastebin and send the URL.
Read 
http://ceph.com/planet/how-to-increase-debug-levels-and-harvest-a-detailed-osd-log/ 
for details.
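
Putting it together, roughly (XX is a placeholder for the OSD id; the 
paths and the systemd unit name assume a default jewel-or-later install):

# in /etc/ceph/ceph.conf on the node hosting the OSD:
#   [osd.XX]
#   debug osd = 20
mv /var/log/ceph/ceph-osd.XX.log /var/log/ceph/ceph-osd.XX.log.old   # fresh log file
systemctl start ceph-osd@XX                                          # start it once, let it crash
# then upload /var/log/ceph/ceph-osd.XX.log to a pastebin and send the URL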



Generally: try to make it easy for people to help you without having to 
drag details out of you. If you can collect all of the above on a 
pastebin like http://paste.debian.net/ instead of piecing it together 
from 3-4 different email threads, you will find a lot more eyeballs 
willing to give it a look.



good luck and kind regards
Ronny Aasen




