[ceph-users] HELP with some basics please

Denes Dolhay denke at denkesys.com
Tue Dec 5 12:00:31 PST 2017


Hello!

I can only answer some of your questions:

- The backfill process obeys a "nearfull_ratio" limit (I think the default 
is 85%); above that the cluster will stop repairing itself, so it won't 
fill up to 100%.

- Normal write ops obey a "full_ratio" too (I think the default is 95%); 
above that, no write I/O will be accepted to the pool.
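
If you want to double-check what your cluster is actually set to (on 
luminous the ratios live in the OSD map, so the standard dump should show 
them; adjust if your version differs):

    # show the configured full / backfillfull / nearfull ratios
    ceph osd dump | grep ratio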

- You have min_size=1 (as far as I recall), so if you lose a disc the 
other OSDs on the same host would fill up to 85%, and then the cluster 
would stop repairing and remain in a degraded (some pgs undersized) 
state until you solve the problem, or reach 95%, at which point the 
cluster would stop accepting write I/O.
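
To see when you get into that state, I would just watch the usual status 
commands, something like:

    # overall health, including degraded / undersized pgs and near-full OSDs
    ceph health detail
    # follow recovery / backfill progress live
    ceph -w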

Calculations:

Sum of pool used: 995 + 14986 + 1318 = 17299G ... 17299 * 2 (size) = 34598G 
(+ journal?) ~ 35349G (global raw used)

Size: 52806G = 35349G (raw used) + 17457G (raw avail) => 66.94% raw used, OK.
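
Just to make that arithmetic explicit, a quick sanity check in plain 
shell (bc assumed to be available):

    # replicated usage of the three pools, size=2
    echo "(995 + 14986 + 1318) * 2" | bc        # -> 34598
    # raw used percentage from the GLOBAL line
    echo "scale=2; 35349 * 100 / 52806" | bc    # -> 66.94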

The documentation says that a pool's MAX AVAIL is an estimate, calculated 
against the OSD that will run out of space first, so in your case this is 
the relevant figure.


I think you can access the per-OSD statistics with the "ceph pg dump" command.
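
If you just want per-OSD utilisation, these should be more convenient 
(both exist on luminous as far as I know):

    # per-OSD size, use, %use and pg count
    ceph osd df
    # per-OSD statistics from the pg map
    ceph pg dump osds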


However, I think you are quite correct:
Spinning usage: 14986+995 = 15981G
Sum spinning capacity: 15981+3232 = 19213G -> 83% full
(I used the values calculated by your ceph df; since it is based on the 
most full OSD, it is a good estimate for the worst case.)
Since the cluster stops self-healing at 85% full, you cannot lose any 
spinning disc in a way that lets the cluster auto-recover to a healthy 
state (no undersized pgs). I would consider adding at least 2 new discs 
to the host which only has SSDs in your setup, of course considering 
slots, memory, etc. That would also give you some breathing space to 
restructure your cluster.
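
A very rough back-of-the-envelope check of that, using the pool-level 
numbers above and my assumption that one 4T OSD is roughly 2000G at pool 
level with size=2:

    # spinning utilisation today
    echo "scale=2; 15981 * 100 / 19213" | bc           # -> 83.17
    # spinning utilisation after losing one 4T OSD (~2000G at pool level)
    echo "scale=2; 15981 * 100 / (19213 - 2000)" | bc  # -> 92.84

So even in the best case the remaining spinning OSDs would end up well 
above the 85% backfill limit, which is why the cluster cannot fully 
re-heal after a disc loss.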

Denes.

On 12/05/2017 03:07 PM, tim taler wrote:
> okay another day another nightmare ;-)
>
> So far we discussed pools as bundles of:
> - pool 1) 15 HDD-OSDs (consisting of 25 actual HDDs: 5 single HDDs and
> five raid0 pairs, as mentioned before)
> - pool 2) 6 SSD-OSDs
> unfortunately (well) on the "physical" pool 1 there are two "logical"
> pools (my wording is here maybe not cephish?)
>
> now I wonder about the real free space on "the pool"...
>
> ceph df tells me:
>
> GLOBAL:
>     SIZE       AVAIL      RAW USED     %RAW USED
>     52806G     17457G     35349G       66.94
> POOLS:
>     NAME           ID     USED       %USED     MAX AVAIL     OBJECTS
>     pool-1-HDD      9       995G     13.34         3232G      262134
>     pool-2-HDD     10     14986G     69.86         3232G     3892481
>     pool-3-SDD     12      1318G     55.94          519G      372618
>
> Now how do I read this?
> the sum of "MAX AVAIL" in the "POOLS" section is 7387
> okay 7387*2 (since all three pools have a size of 2) is 14774
>
> The GLOBAL section on the other hand tells me I still got 17457G available
> 17457-14774=2683
> where are the missing 2683 GB?
> or am I missing something (other than space and a sane setup, I mean :-)
>
> AND (!)
> if in the "physical" HDD pool the reported two times 3232G available
> space is true,
> then in this setup (two hosts) there would be only 3232G free on each host.
> Given that the HDD-OSDs are 4TB in size - if one dies and the host
> tries to restore the data
> (as I learned yesterday the data in this setup will ONLY be restored
> on that host on which the OSD died)
> then ...
> it doesn't work, right?
> Except I could hope that - due to too few placement groups and the resulting
> imbalance of space usage on the OSDs - the dead OSD was only filled
> to 60% and not 85%
> and only the real data would be rewritten (restored).
> But even that seems not possible - given the imbalanced OSDs - the
> fuller ones will hit total saturation
> and - at least as I understand it now - after that (again after the
> first OSD is filled 100%) I can't use the remaining
> space on the other OSDs.
> right?
>
> If all that is true (and PLEASE point out any mistake in my thinking)
> then I have here at the moment
> 25 hard disks of which NONE must fail, or the pool will at least stop
> accepting writes.
>
> Am I right? (feels like a reciprocal Russian roulette ... ONE chamber
> WITHOUT a bullet ;-)
>
> Now - sorry we are not finished yet (and yes this is true, I'm not
> trying to make fun of you)
>
> On top of all this I see a rapid decrease in the available space which
> is not consistent
> with growing data inside the rbds living in this cluster nor with growing
> numbers of rbds (we ONLY use rbds).
> BUT someone is running snapshots.
> How do I sum up the amount of space each snapshot is using?
>
> is it the sum of the USED column in the output of "rbd du --snap"?
>
> And what is the philosophy of snapshots in ceph?
> An object is 4MB in size; if a bit in that object changes, is the whole
> object replicated?
> (the cluster is luminous upgraded from jewel so we use filestore on
> xfs not bluestore)
>
> TIA
>
> On Tue, Dec 5, 2017 at 11:10 AM, Stefan Kooman <stefan at bit.nl> wrote:
>> Quoting tim taler (robur314 at gmail.com):
>>> And I'm still puzzled about the implication of the cluster size on the
>>> amount of OSD failures.
>>> With size=2 min_size=1 one host could die and (if by chance there is
>>> NO read error on any bit on the living host) I could (theoretically)
>>> recover, is that right?
>> True.
>>> OR is it that if any two disks in the cluster fail at the same time
>>> (or while one is still being rebuilt) all my data would be gone?
>> Only the objects that are located on those disks. So for example obj1 on
>> disk1,host1 and obj1 on disk2,host2 ... you will lose data, yes.
>>
>> Gr. Stefan
>>
>> --
>> | BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
>> | GPG: 0xD14839C6                   +31 318 648 688 / info at bit.nl


