[ceph-users] Disk Down Emergency

Georgios Dimitrakakis giorgis at acmac.uoc.gr
Thu Nov 16 05:05:18 PST 2017


 Dear cephers,

 I have an emergency on a rather small ceph cluster.

 My cluster consists of 2 OSD nodes (10 x 4TB disks each) and 3 
 monitor nodes.

 The version of ceph running is Firefly v.0.80.9 
 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)

 The cluster was originally built with "Replicated size=2" and "Min 
 size=1" and the attached crush map, which, as I understand it, 
 replicates data across hosts.
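
 (For reference, these settings and the crush rule can be confirmed 
 with the commands below; I am using the default "rbd" pool name here 
 only as an example:)

 $ ceph osd pool get rbd size
 $ ceph osd pool get rbd min_size
 $ ceph osd crush rule dump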

 The emergency comes from violating the golden rule: "Never use 2 
 replicas on a production cluster".

 Unfortunately the customers never really understood the risk, and now 
 that one disk is down I am caught in the middle and must do everything 
 in my power not to lose any data, so I am requesting your assistance.

 Here is the output of

 $ ceph osd tree
 # id	weight	type name	up/down	reweight
 -1	72.6	root default
 -2	36.3		host store1
 0	3.63			osd.0	down	0	---> DISK DOWN
 1	3.63			osd.1	up	1
 2	3.63			osd.2	up	1
 3	3.63			osd.3	up	1
 4	3.63			osd.4	up	1
 5	3.63			osd.5	up	1
 6	3.63			osd.6	up	1
 7	3.63			osd.7	up	1
 8	3.63			osd.8	up	1
 9	3.63			osd.9	up	1
 -3	36.3		host store2
 10	3.63			osd.10	up	1
 11	3.63			osd.11	up	1
 12	3.63			osd.12	up	1
 13	3.63			osd.13	up	1
 14	3.63			osd.14	up	1
 15	3.63			osd.15	up	1
 16	3.63			osd.16	up	1
 17	3.63			osd.17	up	1
 18	3.63			osd.18	up	1
 19	3.63			osd.19	up	1

 and here is the status of the cluster


 # ceph health
 HEALTH_WARN 497 pgs degraded; 549 pgs stuck unclean; recovery 
 51916/2552684 objects degraded (2.034%)
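
 If it helps, I can also provide the per-PG detail from the following 
 (as far as I know these only report state and change nothing):

 # ceph health detail
 # ceph pg dump_stuck unclean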


 Although OSD.0 is shown as mounted, it cannot be started (probably a 
 failed disk controller):

 # df -h
 Filesystem      Size  Used Avail Use% Mounted on
 /dev/sda3       251G  4.1G  235G   2% /
 tmpfs            24G     0   24G   0% /dev/shm
 /dev/sda1       239M  100M  127M  44% /boot
 /dev/sdj1       3.7T  223G  3.5T   6% /var/lib/ceph/osd/ceph-8
 /dev/sdh1       3.7T  205G  3.5T   6% /var/lib/ceph/osd/ceph-6
 /dev/sdg1       3.7T  199G  3.5T   6% /var/lib/ceph/osd/ceph-5
 /dev/sde1       3.7T  180G  3.5T   5% /var/lib/ceph/osd/ceph-3
 /dev/sdi1       3.7T  187G  3.5T   6% /var/lib/ceph/osd/ceph-7
 /dev/sdf1       3.7T  193G  3.5T   6% /var/lib/ceph/osd/ceph-4
 /dev/sdd1       3.7T  212G  3.5T   6% /var/lib/ceph/osd/ceph-2
 /dev/sdk1       3.7T  210G  3.5T   6% /var/lib/ceph/osd/ceph-9
 /dev/sdb1       3.7T  164G  3.5T   5% /var/lib/ceph/osd/ceph-0    ---> This is the problematic OSD
 /dev/sdc1       3.7T  183G  3.5T   5% /var/lib/ceph/osd/ceph-1



 # service ceph start osd.0
 find: `/var/lib/ceph/osd/ceph-0': Input/output error
 /etc/init.d/ceph: osd.0 not found (/etc/ceph/ceph.conf defines 
 mon.store1 osd.6 osd.9 osd.1 osd.4 osd.3 osd.2 osd.8 osd.5 osd.7 
 mds.store1 mon.store3, /var/lib/ceph defines mon.store1 osd.6 osd.9 
 osd.1 osd.4 osd.3 osd.2 osd.8 osd.5 osd.7 mds.store1)
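
 Since even listing the mount point returns an Input/output error, I 
 suspect the disk (or its controller path) is really gone; I assume I 
 can confirm this with something like the following (sdb is the device 
 behind osd.0):

 # dmesg | grep sdb
 # smartctl -a /dev/sdb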


 I have found this: 
 http://ceph.com/geen-categorie/admin-guide-replacing-a-failed-disk-in-a-ceph-cluster/

 and I am looking for your guidance on how to properly perform all the 
 necessary actions so that I do not lose any data and keep the 
 surviving second copy intact.
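
 If I understand the usual replacement procedure correctly, it would go 
 roughly as follows, but with only one copy left I would really like 
 confirmation before touching anything:

 # ceph osd out 0
 ... wait for recovery to finish (watch "ceph -w" / "ceph health") ...
 # ceph osd crush remove osd.0
 # ceph auth del osd.0
 # ceph osd rm 0
 ... replace the physical disk and re-create the OSD ...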


 Best regards,

 G.
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: crush_map.txt
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20171116/7ab0c8dd/attachment.txt>

