[ceph-users] Degraded objects after: ceph osd in $osd
gfarnum at redhat.com
Mon Nov 26 05:10:12 PST 2018
On Mon, Nov 26, 2018 at 3:30 AM Janne Johansson <icepic.dz at gmail.com> wrote:
> On Sun, Nov 25, 2018 at 22:10, Stefan Kooman <stefan at bit.nl> wrote:
> > Hi List,
> > Another interesting and unexpected thing we observed during cluster
> > expansion is the following. After we added extra disks to the cluster,
> > while the "norebalance" flag was set, we put the new OSDs "IN". As soon as
> > we did that a couple of hundred objects would become degraded. During
> > that time no OSD crashed or restarted. Every "ceph osd crush add $osd
> > weight host=$storage-node" would cause extra degraded objects.
> > I don't expect objects to become degraded when extra OSDs are added.
> > Misplaced, yes. Degraded, no.
> > Does anyone have an explanation for this?
> Yes, when you add a drive (or 10), some PGs decide they should have one or
> more replicas on the new drives, a new empty PG is created there, and
> _then_ that replica will make that PG get into the "degraded" mode,
> meaning if it had 3 fine active+clean replicas before, it now has 2
> active+clean and one needing backfill to get into shape.
> It is a slight mistake to report this in the same way as an error, even
> if it looks to the cluster just as if it was in error and needs fixing.
> This gives new ceph admins a sense of urgency or danger, whereas it
> should be perfectly normal to add space to a cluster. Also, it could have
> chosen to add a fourth replica to a repl=3 PG and fill from the one going
> out into the new empty one and somehow keep itself with 3 replicas, but
> ceph chooses to first discard one replica, then backfill into the empty
> one, leading to this kind of "error" report.
See, that's the thing: Ceph is designed *not* to reduce data reliability
this way; it shouldn't do that, and as far as I've been able to establish
it doesn't actually do that. Which makes these degraded object reports a
bit perplexing.
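
To make the misplaced-versus-degraded distinction concrete, here is a
minimal Python sketch. It is a toy model and not Ceph's actual accounting:
the PGState class, its fields, and the classify function are made up for
illustration, and the OSD ids are arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PGState:
    """Hypothetical, simplified view of one placement group."""
    size: int            # replica count the pool asks for
    up: frozenset        # OSDs that CRUSH currently wants to hold the PG
    complete: frozenset  # OSDs that hold a full, current copy right now


def classify(pg: PGState) -> str:
    """Toy classification; real Ceph tracks this per object, not per PG."""
    if len(pg.complete) < pg.size:
        # Fewer full copies exist than the pool asks for.
        return "degraded"
    if not pg.up <= pg.complete:
        # Enough full copies exist, but not all where CRUSH wants them.
        return "misplaced"
    return "clean"


# Before expansion: three good copies, all in the desired place.
before = PGState(size=3, up=frozenset({1, 2, 3}), complete=frozenset({1, 2, 3}))

# New OSD 9 is mapped in while the old copy on OSD 3 is kept until backfill
# finishes: still three full copies, one of them in the "wrong" place.
keep_old = PGState(size=3, up=frozenset({1, 2, 9}), complete=frozenset({1, 2, 3}))

# The behaviour described in the quoted mail: the old copy is dropped first.
drop_first = PGState(size=3, up=frozenset({1, 2, 9}), complete=frozenset({1, 2}))

for name, pg in [("before", before), ("keep_old", keep_old), ("drop_first", drop_first)]:
    print(name, classify(pg))
```

In this model only the "drop_first" case comes out as degraded, which is
the case Ceph is designed to avoid.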
What we have worked out is that sometimes objects can be degraded because
the log-based recovery takes a while after the primary juggles around PG
set membership, and I suspect that's what is turning up here. The exact
cause still eludes me a bit, but I assume it's a consequence of the
backfill and recovery throttling we've added over the years.
If a whole PG was missing then you'd expect to see very large degraded
object counts (as opposed to the 2 that Marco reported).
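
As a rough back-of-the-envelope illustration of that size argument, here
is a short Python sketch. The function name and the figure of 10,000
objects per PG are hypothetical, chosen only for the example; the 2 is the
count mentioned above.

```python
def degraded_if_replicas_missing(objects_in_pg: int, missing_replicas: int = 1) -> int:
    """If whole replicas of a PG were gone, every object in that PG would be
    short by that many copies (toy arithmetic, not Ceph's real counters)."""
    return objects_in_pg * missing_replicas


objects_in_pg = 10_000   # hypothetical PG size, for illustration only
reported_degraded = 2    # the count mentioned in this thread

expected = degraded_if_replicas_missing(objects_in_pg)
print(f"whole replica missing: about {expected} degraded objects")
print(f"reported:              {reported_degraded} degraded objects")

# A reported count far below the PG's object count is hard to square with a
# whole missing replica; it fits a handful of objects waiting on recovery.
looks_like_whole_replica_missing = reported_degraded >= expected
print("consistent with a whole missing replica:", looks_like_whole_replica_missing)
```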
> May the most significant bit of your life be positive.