[ceph-users] HELP with some basics please

Denes Dolhay denke at denkesys.com
Mon Dec 4 09:37:54 PST 2017


Hi,

I would not rip out the discs. Instead, I would reweight the OSD to 0, 
wait for the cluster to rebalance, and once that is done you can remove 
the disc / RAID pair without ever going down to only 1 copy.
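
Roughly like this (from memory, so double check the ids / names against 
your own cluster first):

   ceph osd crush reweight osd.<id> 0   # drain the osd gracefully
   ceph -s                              # wait until all PGs are active+clean again
   ceph osd out <id>
   systemctl stop ceph-osd@<id>         # if you run the systemd units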

The journals can only be moved back by a complete rebuild of that OSD, 
as far as I know.

With size=2, losing any 2 discs on different hosts would probably cause 
data to be unavailable / lost, as the PG copies are randomly distributed 
across the OSDs. Chances are that there is some PG whose acting set is 
exactly the two failed OSDs (i.e. you lost all replicas of it).

Ceph by default will not put both acting set members of a PG onto the 
same host; this is determined by the failure domain setting of the 
CRUSH rule.
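
You can check which failure domain your rules use with e.g.:

   ceph osd crush rule dump
   # look at the "type" of the chooseleaf step ("host", "rack", ...)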

Denes.


On 12/04/2017 06:27 PM, tim taler wrote:
> thnx a lot again,
> makes sense to me.
>
> We have all journals of the HDD-OSDs on partitions on an extra
> SSD-raid1 (each OSD got its own journal partition on that raid1),
> but as I understand it they could be moved back to the OSDs, at least
> for the time of the restructuring.
>
> What makes my tummy turn, though, is the thought of ripping out a raid0
> pair and plugging it into another machine (it's HW raid, not ZFS!)
> in the hope of keeping the data on it, even if I can get the same sort
> of controller (which might be possible, although the machines are a
> couple of years old and machine C is not the same as A and B).
>
> And I'm still puzzled about the implication of the size setting on the
> number of OSD failures the cluster can survive.
> With size=2 min_size=1 one host could die and (if by chance there is
> NO read error on any bit on the surviving host) I could (theoretically)
> recover, is that right?
> OR is it that if any two disks in the cluster fail at the same time
> (or while one is still being rebuilt) all my data would be gone?
>
>
>
> On Mon, Dec 4, 2017 at 4:42 PM, David Turner <drakonstein at gmail.com> wrote:
>> Your current node configuration cannot do size=3 for any pools.  You only
>> have 2 hosts with HDDs and 2 hosts with SSDs in each root.  You cannot put 3
>> copies of data for an HDD pool on 3 separate nodes when you only have 2
>> nodes with HDDs...  In this configuration, size=2 is putting a copy of the
>> data on every available node.  That is why you need to have the space
>> available on the host with the failed OSD to be able to recover; there is no
>> other way for the cluster to keep 2 copies of the data on different nodes.
>> The same will be true if you only have 3 available nodes and size=3; any
>> failed disk can only backfill onto the same node.
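>>
>> (Purely as an illustration, you can verify that yourself with
>>
>>    ceph osd tree
>>    ceph osd pool get <poolname> size
>>    ceph osd pool get <poolname> min_size
>>
>> and count how many hosts of each device type sit under each root.)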
>>
>> I would start by recommending that you restructure your nodes quite heavily.
>> You want as close to the same number of disks in each node as you can get.
>> A balanced setup might look like...  This is of course assuming that the
>> CPU, RAM, and disk controllers are similar between the 3 nodes.
>>
>> machine A:
>> 2x 3.6TB
>> 2x 3.6TB RAID0
>> 1x 1.8TB
>> 2x .7TB SSD (1 each from machines B & C)
>>
>> machine B:
>> 2x 3.6TB
>> 2x 3.6TB RAID0
>> 1x 1.8TB
>> 2x .7TB SSD
>>
>> machine C:
>> 1x 3.6TB (from machine B)
>> 2x 3.6TB RAID0 (1 each from machines A & B)
>> 2x 1.8TB  (from machine A)
>> 2x .7TB SSD
>>
>> After all of that is configured and backfilled (a lot of backfilling),
>> the next step is to remove the RAID0 OSDs and add them back in as
>> individual 1.8TB OSDs.  You can also consider size=3 min_size=2 for some
>> of your pools in this configuration.  Both rebuilding the RAID0 OSDs and
>> increasing the size of a pool will require that you have enough space in
>> your cluster/nodes.  Depending on how you have your journals configured,
>> moving the OSDs between hosts is usually fairly trivial (except for the
>> backfilling).
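>>
>> (If/when you do switch a pool to size=3 min_size=2 and have the space
>> for the third copy, that is just something like
>>
>>    ceph osd pool set <poolname> size 3
>>    ceph osd pool set <poolname> min_size 2
>>
>> per pool.)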
>>
>> Your % used is going to be a problem throughout this.  Ceph is not
>> perfect at balancing data; that is an inherent trade-off for data
>> integrity in the CRUSH algorithm.  There are ways to change the weights
>> of the OSDs to help fix the balance issue, but it is not indicative of a
>> problem in your configuration... just something that you need to be
>> aware of so you can keep it from becoming a major problem.
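>>
>> (To watch and nudge the balance, something along the lines of
>>
>>    ceph osd df tree                        # %USE per osd and per host
>>    ceph osd test-reweight-by-utilization   # dry run of the reweight
>>    ceph osd reweight-by-utilization        # applies it, causes backfill
>>
>> can help; use the reweight commands carefully on a nearly full cluster.)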
>>
>> There is a lot of material on why size=2 min_size=1 is bad.  Read back
>> through the ML archives to find some.  My biggest take-away is... if you
>> lose all but 1 copy of your data... do you really want to make changes to
>> it?  I've also noticed that the majority of clusters on the ML that have
>> irreparably lost data were running with size=2 min_size=1.
>>
>> On Mon, Dec 4, 2017 at 6:12 AM tim taler <robur314 at gmail.com> wrote:
>>> Hi,
>>>
>>> thnx a lot for the quick response
>>> and for laying out some of the issues
>>>
>>>> I'm also new, but I'll try to help. IMHO most of the pros here would be
>>>> quite worried about this cluster if it is production:
>>> thought so ;-/
>>>
>>>> -A prod ceph cluster should not be run with size=2 min_size=1, because:
>>>> --In case of a down'ed osd / host the cluster could have problems
>>>> determining which data is correct when the osd / host comes back up
>>> Uhm, I thought at least THAT wouldn't be the case here since we have
>>> three mons??
>>> Don't THEY keep track of which osd has the latest data?
>>> And isn't the size set on the pool level, not on the cluster level??
>>>
>>>> --If an osd dies, the others get more io (they have to compensate for
>>>> the lost io capacity and for the rebuilding too), which can instantly
>>>> kill another close-to-death disc (not with ceph, but with raid, I have
>>>> been there)
>>>> --If an osd dies and ANY other osd serving that pool has a well placed
>>>> inconsistency, like bitrot, you'll lose data
>>> good point, with scrubbing the checksums of the objects are checked,
>>> right?
>>> Can I get a report somewhere of how many errors were found by the last
>>> scrub run (like in zpool status), to estimate how well a disk is doing?
>>> (right now the raid controller won't let me read the SMART data from
>>> the disks)
>>>
>>>
>>>> -There are not enough hosts in your setup, or rather the discs are not
>>>> distributed well:
>>>> --If an osd / host dies, the cluster tries to repair itself and relocate
>>>> the data onto another host. In your config there is no other host to
>>>> reallocate data to if ANY of the hosts fails (I guess that hdds and ssds
>>>> are separated)
>>> Yupp, HDD and SSD form separate pools.
>>> Good point, not in my list of arguments yet.
>>>
>>>> -The disks should not be placed in raid arrays if it can be avoided,
>>>> especially raid0:
>>>> --You multiply the probability of an unrecoverable disc error, and
>>>> since the data is striped, the other disk's data is unrecoverable too
>>>> --When an osd dies, the cluster should relocate the data onto another
>>>> osd. When this happens, there is now double the data that needs to be
>>>> moved, which causes 2 problems: recovery time / io, and free space. The
>>>> cluster should have enough free space to reallocate data to; in this
>>>> setup you cannot do that in case a host dies (see above), but in case an
>>>> osd dies, ceph would try to replicate the data onto other osds in the
>>>> same machine. So in this setup you have to have enough free space on
>>>> >>the same host<< to replicate data to.
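>>>>
>>>> (A quick way to keep an eye on that is e.g. "ceph osd df tree", which
>>>> shows the %USE per osd and aggregated per host / rack / root.)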
>>> ON THE SAME MACHINE?
>>> Is that so?
>>> So then there should, at the BARE MINIMUM, always be more free space
>>> on each machine than the biggest OSD it hosts, right?
>>>
>>>> In your case, I would recommend:
>>>> -Introducing (and activating) a fourth osd host
>>>> -setting size=3 min_size=2
>>> that will be difficult, can't I run size=3 min_size=2 with three hosts?
>>>
>>>> -After data migration is done, one-by-one separating the raid0 arrays:
>>>> (remove, split) -> (zap, init, add) separately, in such a manner that hdds
>>>> and ssds are evenly distributed across the servers
>>> From what I understand the sizes of OSDs can vary,
>>> and the weight setting in our setup seems plausible to me (it's
>>> directly derived from the size of the osd),
>>> so why then are they not filled to the same level, nor even tending
>>> towards being filled the same?
>>> Does ceph by itself include other measurements, like the latency of the
>>> OSD? That would explain why the raid0 OSDs have so much more data
>>> than the single disks, but I haven't seen anything about that in the
>>> docs (so far?)
>>>
>>>> -Always keep enough free space so that the cluster could lose a host
>>>> and still have space to repair (calculating with the max usage %
>>>> settings for repair).
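>>>>
>>>> (Depending on your ceph version, the relevant thresholds show up in
>>>> "ceph osd dump | grep full_ratio", i.e. nearfull / backfillfull / full.)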
>>> thnx again!
>>> yupp that was helpful
>>>
>>>> I hope this helps, and please keep in mind that I'm a noob too :)
>>>>
>>>> Denes.
>>>>
>>>>
>>>> On 12/04/2017 10:07 AM, tim taler wrote:
>>>>
>>>> Hi
>>>> I'm new to ceph but have the honor of looking after a cluster that I
>>>> didn't set up myself.
>>>> Rushing to the ceph docs and taking a first glimpse at our cluster, I
>>>> started worrying about our setup,
>>>> so I need some advice and guidance here.
>>>>
>>>> The set up is:
>>>> 3 machines, each running a ceph-monitor.
>>>> all of them are also hosting OSDs
>>>>
>>>> machine A:
>>>> 2 OSDs, each 3.6 TB - consisting of 1 disk each (spinning disk)
>>>> 3 OSDs, each 3.6 TB - consisting each of a 2 disk hardware-raid 0
>>>> (spinning disk)
>>>> 3 OSDs, each 1.8 TB - consisting each of a 2 disk hardware-raid 0
>>>> (spinning disk)
>>>>
>>>> machine B:
>>>> 3 OSDs, each 3.6 TB - consisting of 1 disk each (spinning disk)
>>>> 3 OSDs, each 3.6 TB - consisting each of a 2 disk hardware-raid 0
>>>> (spinning disk)
>>>> 1 OSD, 1.8 TB - consisting of a 2 disk hardware-raid 0
>>>> (spinning disk)
>>>>
>>>> 3 OSDs, each 0.7 TB - consisting of 1 disk each (SSD)
>>>>
>>>> machine C:
>>>> 3 OSDs, each 0.7 TB - consisting of 1 disk each (SSD)
>>>>
>>>> the spinning disks and the SSD disks form two separate pools.
>>>>
>>>> Now what worries me is that I read "don't use raid together with
>>>> ceph",
>>>> in combination with our pool size:
>>>> :~ ceph osd pool get <poolname> size
>>>> size: 2
>>>>
>>>> From what I understand from the ceph docs the size tells me "how many
>>>> disks may fail" without losing the data of the whole pool.
>>>> Is that right? Or can HALF the OSDs fail (since all objects are
>>>> duplicated)?
>>>>
>>>> Unfortunately I'm not very good at stochastics, but given a probability
>>>> of 1% disk failure per year
>>>> I'm not feeling very secure with this setup. (How do I calculate the
>>>> probability that two disks fail "at the same time"? - or has anybody a
>>>> rough number for that?)
>>>> Although, looking at our OSD tree, it seems we try to spread the objects
>>>> always between two peers:
>>>>
>>>> ID  CLASS WEIGHT   TYPE NAME                      STATUS REWEIGHT PRI-AFF
>>>> -19        4.76700 root here_ssd
>>>> -15        2.38350     room 2_ssd
>>>> -14        2.38350         rack 2_ssd
>>>>  -4        2.38350             host B_ssd
>>>>   4   hdd  0.79449                 osd.4              up  1.00000 1.00000
>>>>   5   hdd  0.79449                 osd.5              up  1.00000 1.00000
>>>>  13   hdd  0.79449                 osd.13             up  1.00000 1.00000
>>>> -18        2.38350     room 1_ssd
>>>> -17        2.38350         rack 1_ssd
>>>>  -5        2.38350             host C_ssd
>>>>   0   hdd  0.79449                 osd.0              up  1.00000 1.00000
>>>>   1   hdd  0.79449                 osd.1              up  1.00000 1.00000
>>>>   2   hdd  0.79449                 osd.2              up  1.00000 1.00000
>>>>  -1       51.96059 root here_spinning
>>>> -12       25.98090     room 2_spinning
>>>> -11       25.98090         rack 2_spinning
>>>>  -2       25.98090             host B_spinning
>>>>   3   hdd  3.99959                 osd.3              up  1.00000 1.00000
>>>>   8   hdd  3.99429                 osd.8              up  1.00000 1.00000
>>>>   9   hdd  3.99429                 osd.9              up  1.00000 1.00000
>>>>  10   hdd  3.99429                 osd.10             up  1.00000 1.00000
>>>>  11   hdd  1.99919                 osd.11             up  1.00000 1.00000
>>>>  12   hdd  3.99959                 osd.12             up  1.00000 1.00000
>>>>  20   hdd  3.99959                 osd.20             up  1.00000 1.00000
>>>> -10       25.97969     room 1_spinning
>>>>  -8       25.97969         rack l1_spinning
>>>>  -3       25.97969             host A_spinning
>>>>   6   hdd  3.99959                 osd.6              up  1.00000 1.00000
>>>>   7   hdd  3.99959                 osd.7              up  1.00000 1.00000
>>>>  14   hdd  3.99429                 osd.14             up  1.00000 1.00000
>>>>  15   hdd  3.99429                 osd.15             up  1.00000 1.00000
>>>>  16   hdd  3.99429                 osd.16             up  1.00000 1.00000
>>>>  17   hdd  1.99919                 osd.17             up  1.00000 1.00000
>>>>  18   hdd  1.99919                 osd.18             up  1.00000 1.00000
>>>>  19   hdd  1.99919                 osd.19             up  1.00000 1.00000
>>>>
>>>>
>>>>
>>>> And the second question:
>>>> I tracked the disk usage of our OSDs over the last two weeks and it
>>>> looks somewhat strange too:
>>>> While osd.14 and osd.20 are filled to well below 60%,
>>>> osd.9, 16 and 18 are well above 80%.
>>>> Graphing that shows pretty stable parallel lines, with no hint of
>>>> convergence.
>>>> That's true for both the HDD and the SSD pool.
>>>> How and why is that, and is it normal and okay, or is there a(nother)
>>>> glitch in our config?
>>>>
>>>> any hints and comments are welcome
>>>>
>>>> TIA
>>>>
>>>>
>>>>
>>>>
>>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users at lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



More information about the ceph-users mailing list