[ceph-users] Disk Down Emergency

Georgios Dimitrakakis giorgis at acmac.uoc.gr
Thu Nov 16 08:16:13 PST 2017


 I would like to thank all of you very much for your assistance, help, 
 support and time.

 I have to say that I totally agree with you regarding the number of
 replicas, and this is probably the best time to switch to 3 replicas,
 since all services have been stopped due to this emergency.

 After I've removed the OSD from CRUSH, the cluster started backfilling,
 which finished successfully.
 I should note that before removing the OSD from CRUSH I stopped
 scrubbing, and re-enabled it as soon as backfilling had finished
 successfully.
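
 For reference, the scrub flags I toggled were along these lines:

     ceph osd set noscrub
     ceph osd set nodeep-scrub
     # ... backfilling runs to completion ...
     ceph osd unset noscrub
     ceph osd unset nodeep-scrub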

 So the cluster is now scrubbing again, and I am wondering: is it
 necessary to let it finish scrubbing (or even issue a deep scrub)
 before changing the physical disk, or should I proceed to change the
 disk as soon as possible, before any further actions?
 What would you do?
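
 If deep scrubbing turns out to be the right call, I assume it could be
 forced per OSD beforehand with something like:

     ceph osd deep-scrub <id>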

 Best,

 G.


> The first step is to make sure that it is out of the cluster.  Does
> `ceph osd stat` show the same number of OSDs as in (it's the same as a
> line from `ceph status`)?  It should show 1 less for up, but if it's
> still registering the OSD as in then the backfilling won't start.
> `ceph osd out 0` should mark it out and let the backfilling start.
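>
> For example (the osdmap line is illustrative, not verbatim output):
>
>     ceph osd stat
>     # e.g. osdmap e1234: 20 osds: 19 up, 20 in
>     ceph osd out 0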
>
> If it's already out, then `ceph osd crush remove osd.0; ceph auth del
> osd.0; ceph osd rm 0` will finish removing it from the cluster and let
> you move forward.  It's a good idea to wait to run these commands until
> you have a full copy of your data again, so really try to let the
> cluster do its thing if you can.  What generally happens, and what
> everyone (including myself) is recommending for you, is to let the OSD
> get marked down (which it has done), and then marked out.  Once it is
> marked out, the cluster will rebalance and make sure that all of the
> data that is on the out OSD is replicated to have the full number of
> copies again.  My guess is that you just have a setting somewhere
> preventing the OSD from being marked out.
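>
> Put together, a minimal sketch of the safe order of operations:
>
>     ceph osd out 0               # mark out; backfilling starts
>     # ...wait until backfilling completes and data has full copies...
>     ceph osd crush remove osd.0  # remove it from the CRUSH map
>     ceph auth del osd.0          # delete its authentication key
>     ceph osd rm 0                # remove it from the OSD map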
>
> As far as your customer not understanding that 2 replicas is bad for
> production data, write up a contract that they have to sign
> indemnifying you of any responsibility if they lose data, because you
> have warned them to have 3 replicas.  If they don't sign it, then tell
> them you will no longer manage Ceph for them.  Hopefully they wake up
> and make everyone's job easier by purchasing a third server.
>
> On Thu, Nov 16, 2017 at 9:26 AM Georgios Dimitrakakis  wrote:
>
>>  Thank you all for your time and support.
>>
>>  I don't see any backfilling in the logs, and the number of
>>  "active+degraded" as well as "active+remapped" and "active+clean"
>>  objects has been the same for some time now. The only thing I see is
>>  "scrubbing".
>>
>>  Wido, I cannot do anything with the data in osd.0 since, although
>>  the failed disk seems mounted, I cannot see anything and I am
>>  getting an "Input/output" error.
>>
>>  So I guess the right action for now is to remove the OSD by issuing
>>  "ceph osd crush remove osd.0" as Sean suggested, correct?
>>
>>  G.
>>
>> >> On 16 November 2017 at 14:46, Caspar Smit wrote:
>> >>
>> >>
>> >> 2017-11-16 14:43 GMT+01:00 Wido den Hollander :
>> >>
>> >> >
>> >> > > On 16 November 2017 at 14:40, Georgios Dimitrakakis
>> >> > > <giorgis at acmac.uoc.gr> wrote:
>> >> > >
>> >> > >
>> >> > >  @Sean Redmond: No, I don't have any unfound objects. I only have
>> >> > >  "stuck unclean" with "active+degraded" status.
>> >> > >  @Caspar Smit: The cluster is scrubbing ...
>> >> > >
>> >> > >  @All: My concern is the single remaining copy of the data that
>> >> > >  was on the failed disk.
>> >> > >
>> >> >
>> >> > Let the Ceph recovery do its work. Don't do anything manually now.
>> >> >
>> >> >
>> >> @Wido, I think his cluster might have stopped recovering because of
>> >> non-optimal tunables in firefly.
>> >>
>> >
>> > Ah, darn. Yes, that was a long time ago. Could very well be the
>> > case.
>> >
>> > He could try to remove osd.0 from the CRUSH map and let recovery
>> > progress.
>> >
>> > I would however advise him not to fiddle with the data on osd.0. Do
>> > not try to copy the data somewhere else and try to fix the OSD.
>> >
>> > Wido
>> >
>> >>
>> >> > >  If I just remove osd.0 from the CRUSH map, does that copy all
>> >> > >  its data from the only available copy to the rest of the
>> >> > >  unaffected disks, so that we again end up with two copies on
>> >> > >  two different hosts?
>> >> > >
>> >> >
>> >> > Do NOT copy the data from osd.0 to another OSD. Let the Ceph
>> >> > recovery handle this.
>> >> >
>> >> > It is already marked as out, and within 24 hours or so recovery
>> >> > will have finished.
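>> >> >
>> >> > Recovery progress can be followed with, for example:
>> >> >
>> >> >     ceph -w
>> >> >     ceph health detail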
>> >> >
>> >> > But a few things:
>> >> >
>> >> > - Firefly 0.80.9 is old
>> >> > - Never, never, never run with size=2
>> >> >
>> >> > Not trying to scare you, but it's a reality.
>> >> >
>> >> > Now let Ceph handle the rebalance and wait.
>> >> >
>> >> > Wido
>> >> >
>> >> > >  Best,
>> >> > >
>> >> > >  G.
>> >> > >
>> >> > >
>> >> > > > 2017-11-16 14:05 GMT+01:00 Georgios Dimitrakakis :
>> >> > > >
>> >> > > >> Dear cephers,
>> >> > > >>
>> >> > > >> I have an emergency on a rather small ceph cluster.
>> >> > > >>
>> >> > > >> My cluster consists of 2 OSD nodes with 10 x 4TB disks each,
>> >> > > >> plus 3 monitor nodes.
>> >> > > >>
>> >> > > >> The version of ceph running is Firefly v.0.80.9
>> >> > > >> (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
>> >> > > >>
>> >> > > >> The cluster was originally built with "Replicated size=2" and
>> >> > > >> "Min size=1" with the attached crush map, which to my
>> >> > > >> understanding replicates data across hosts.
>> >> > > >>
>> >> > > >> The emergency comes from the violation of the golden rule:
>> >> > > >> "Never use 2 replicas on a production cluster".
>> >> > > >>
>> >> > > >> Unfortunately the customers never really understood the risk,
>> >> > > >> and now that one disk is down I am caught in the middle and
>> >> > > >> must do everything in my power not to lose any data, so I am
>> >> > > >> requesting your assistance.
>> >> > > >>
>> >> > > >> Here is the output of
>> >> > > >>
>> >> > > >> $ ceph osd tree
>> >> > > >> # id    weight  type name       up/down reweight
>> >> > > >> -1      72.6    root default
>> >> > > >> -2      36.3            host store1
>> >> > > >> 0       3.63                    osd.0   down    0    ---> DISK DOWN
>> >> > > >> 1       3.63                    osd.1   up      1
>> >> > > >> 2       3.63                    osd.2   up      1
>> >> > > >> 3       3.63                    osd.3   up      1
>> >> > > >> 4       3.63                    osd.4   up      1
>> >> > > >> 5       3.63                    osd.5   up      1
>> >> > > >> 6       3.63                    osd.6   up      1
>> >> > > >> 7       3.63                    osd.7   up      1
>> >> > > >> 8       3.63                    osd.8   up      1
>> >> > > >> 9       3.63                    osd.9   up      1
>> >> > > >> -3      36.3            host store2
>> >> > > >> 10      3.63                    osd.10  up      1
>> >> > > >> 11      3.63                    osd.11  up      1
>> >> > > >> 12      3.63                    osd.12  up      1
>> >> > > >> 13      3.63                    osd.13  up      1
>> >> > > >> 14      3.63                    osd.14  up      1
>> >> > > >> 15      3.63                    osd.15  up      1
>> >> > > >> 16      3.63                    osd.16  up      1
>> >> > > >> 17      3.63                    osd.17  up      1
>> >> > > >> 18      3.63                    osd.18  up      1
>> >> > > >> 19      3.63                    osd.19  up      1
>> >> > > >>
>> >> > > >> and here is the status of the cluster
>> >> > > >>
>> >> > > >> # ceph health
>> >> > > >> HEALTH_WARN 497 pgs degraded; 549 pgs stuck unclean; recovery
>> >> > > >> 51916/2552684 objects degraded (2.034%)
>> >> > > >>
>> >> > > >> Although OSD.0 is shown as mounted, it cannot be started
>> >> > > >> (probably a failed disk controller problem).
>> >> > > >>
>> >> > > >> # df -h
>> >> > > >> Filesystem      Size  Used Avail Use% Mounted on
>> >> > > >> /dev/sda3       251G  4.1G  235G   2% /
>> >> > > >> tmpfs            24G     0   24G   0% /dev/shm
>> >> > > >> /dev/sda1       239M  100M  127M  44% /boot
>> >> > > >> /dev/sdj1       3.7T  223G  3.5T   6% /var/lib/ceph/osd/ceph-8
>> >> > > >> /dev/sdh1       3.7T  205G  3.5T   6% /var/lib/ceph/osd/ceph-6
>> >> > > >> /dev/sdg1       3.7T  199G  3.5T   6% /var/lib/ceph/osd/ceph-5
>> >> > > >> /dev/sde1       3.7T  180G  3.5T   5% /var/lib/ceph/osd/ceph-3
>> >> > > >> /dev/sdi1       3.7T  187G  3.5T   6% /var/lib/ceph/osd/ceph-7
>> >> > > >> /dev/sdf1       3.7T  193G  3.5T   6% /var/lib/ceph/osd/ceph-4
>> >> > > >> /dev/sdd1       3.7T  212G  3.5T   6% /var/lib/ceph/osd/ceph-2
>> >> > > >> /dev/sdk1       3.7T  210G  3.5T   6% /var/lib/ceph/osd/ceph-9
>> >> > > >> /dev/sdb1       3.7T  164G  3.5T   5% /var/lib/ceph/osd/ceph-0    ---> This is the problematic OSD
>> >> > > >> /dev/sdc1       3.7T  183G  3.5T   5% /var/lib/ceph/osd/ceph-1
>> >> > > >>
>> >> > > >> # service ceph start osd.0
>> >> > > >> find: `/var/lib/ceph/osd/ceph-0': Input/output error
>> >> > > >> /etc/init.d/ceph: osd.0 not found (/etc/ceph/ceph.conf defines
>> >> > > >> mon.store1 osd.6 osd.9 osd.1 osd.4 osd.3 osd.2 osd.8 osd.5 osd.7
>> >> > > >> mds.store1 mon.store3, /var/lib/ceph defines mon.store1 osd.6 osd.9
>> >> > > >> osd.1 osd.4 osd.3 osd.2 osd.8 osd.5 osd.7 mds.store1)
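>> >> > > >>
>> >> > > >> The I/O errors point at the drive itself; assuming /dev/sdb from
>> >> > > >> the df output above, kernel logs and SMART data should confirm it:
>> >> > > >>
>> >> > > >>     dmesg | grep -i sdb
>> >> > > >>     smartctl -a /dev/sdb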
>> >> > > >>
>> >> > > >> I have found this:
>> >> > > >>
>> >> > > >> http://ceph.com/geen-categorie/admin-guide-replacing-a-failed-disk-in-a-ceph-cluster/
>> >> > > >>
>> >> > > >> and I am looking for your guidance on how to properly perform
>> >> > > >> all the necessary actions so as not to lose any data and to
>> >> > > >> keep the remaining second copy intact.
>> >> > > >
>> >> > > > What guidance are you looking for besides the steps to replace
>> >> > > > a failed disk (which you already found)?
>> >> > > > If I look at your situation, there is nothing down in terms of
>> >> > > > availability of PGs, just a failed drive which needs to be
>> >> > > > replaced.
>> >> > > >
>> >> > > > Is the cluster still recovering? It should reach HEALTH_OK
>> >> > > > again after rebalancing the cluster when an OSD goes down.
>> >> > > >
>> >> > > > If it stopped recovering, it might have to do with the Ceph
>> >> > > > tunables, which are not set to optimal by default on firefly and
>> >> > > > which prevent further rebalancing.
>> >> > > > WARNING: Don't just set tunables to optimal, because it will
>> >> > > > trigger a massive rebalance!
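>> >> > > >
>> >> > > > The current tunables can be inspected without changing anything:
>> >> > > >
>> >> > > >     ceph osd crush show-tunables
>> >> > > >
>> >> > > > (Switching would be `ceph osd crush tunables optimal`, but again,
>> >> > > > only do that deliberately, since it triggers that rebalance.)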
>> >> > > >
>> >> > > > Perhaps the second golden rule is to never run a Ceph
>> >> > > > production cluster without knowing (and testing) how to replace
>> >> > > > a failed drive. (I'm not trying to be harsh here.)
>> >> > > >
>> >> > > > Kind regards,
>> >> > > > Caspar
>> >> > > >
>> >> > > >
>> >> > > >> Best regards,
>> >> > > >>
>> >> > > >> G.
>>
>> --
>>  Dr. Dimitrakakis Georgios
>>
>>  Networks and Systems Administrator
>>
>>  Archimedes Center for Modeling, Analysis & Computation (ACMAC)
>>  School of Sciences and Engineering
>>  University of Crete
>>  P.O. Box 2208
>>  710 - 03 Heraklion
>>  Crete, Greece
>>
>>  Tel: +30 2810 393717
>>  Fax: +30 2810 393660
>>
>>  E-mail: giorgis at acmac.uoc.gr

