[ceph-users] Troubleshooting hanging storage backend whenever there is any cluster change
Stefan Priebe - Profihost AG
s.priebe at profihost.ag
Fri Oct 12 14:27:23 PDT 2018
On 12.10.2018 at 15:59, David Turner wrote:
> The number of PGs per OSD does not change unless OSDs are marked out.
> You have noout set, so that doesn't change at all during this test.
> All of your PGs peered quickly at the beginning and then were
> active+undersized for the rest of the time, you never had any blocked
> requests, and you always had 100MB/s+ of client IO. I didn't see
> anything wrong with your cluster to indicate that your clients had any
> problems whatsoever accessing data. Can you confirm that you saw the
> same problems while you were running those commands? The next
> possibility is that a client isn't getting an updated OSD map telling
> it that the host and its OSDs are down, and is stuck trying to
> communicate with host7. That could point to the client being unable to
> communicate with the mons, maybe?
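A rough way to check that client-side angle during the next reboot
(only a sketch; it assumes the qemu/librbd clients expose an admin
socket, and the socket path below is a placeholder):

  # osdmap epoch as the mons currently publish it
  ceph osd stat

  # on the hypervisor: requests a librbd client still has in flight,
  # including which OSD each request is waiting on
  ceph daemon /var/run/ceph/ceph-client.<id>.<pid>.asok objecter_requests

If a hung VM still shows requests targeting the OSDs of the rebooted
host long after they were marked down, that would support the stale
osdmap theory.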
Maybe, but what about this status:
'PG_AVAILABILITY Reduced data availability: pgs peering'
See the log here: https://pastebin.com/wxUKzhgB
PG_AVAILABILITY is noted at timestamps [2018-10-12 12:16:15.403394] and
And why do the Ceph docs say:

"Data availability is reduced, meaning that the cluster is unable to
service potential read or write requests for some data in the cluster.
Specifically, one or more PGs is in a state that does not allow IO
requests to be serviced. Problematic PG states include peering, stale,
incomplete, and the lack of active (if those conditions do not clear
quickly)."
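Next time I can also capture which PGs are actually stuck peering while
the host reboots, roughly like this (a sketch; 'peering' should be a
valid state filter for `ceph pg ls` on luminous):

  # repeat every few seconds while the host goes down
  while true; do
      date
      ceph health detail
      ceph pg ls peering
      sleep 5
  done

That should show whether the peering PGs are spread across the whole
cluster or concentrated on a few OSDs.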
> On Fri, Oct 12, 2018 at 8:35 AM Nils Fahldieck - Profihost AG
> <n.fahldieck at profihost.ag> wrote:
> Hi, in our `ceph.conf` we have:
> mon_max_pg_per_osd = 300
> While the host is offline (9 OSDs down):
> 4352 PGs * 3 / 62 OSDs ~ 210 PGs per OSD
> If all OSDs are online:
> 4352 PGs * 3 / 71 OSDs ~ 183 PGs per OSD
> ... so this doesn't seem to be the issue.
> If I understood you right, that's what you meant. If I got you wrong,
> would you mind pointing me to one of the threads you mentioned?
> Thanks :)
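(Side note: the figures above are only averages. The actual per-OSD PG
counts can be read from the PGS column of `ceph osd df`; a quick sketch,
assuming PGS is the last column as on luminous:)

  # highest per-OSD PG counts first
  ceph osd df | awk '/^ *[0-9]/ {print $NF, "osd."$1}' | sort -rn | head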
> On 12.10.2018 at 14:03, Burkhard Linke wrote:
> > Hi,
> > On 10/12/2018 01:55 PM, Nils Fahldieck - Profihost AG wrote:
> >> I rebooted a Ceph host and logged `ceph status` & `ceph health`
> >> every 5 seconds. During this I encountered 'PG_AVAILABILITY Reduced
> >> data availability: pgs peering'. At the same time some VMs hung as
> >> before.
> > Just a wild guess... you have 71 OSDs and about 4500 PGs with size=3.
> > That is 13500 PG instances overall, resulting in ~190 PGs per OSD
> > under normal circumstances.
> > If one host is down and the PGs have to re-peer, you might reach the
> > limit of 200 PGs per OSD on some of the OSDs, resulting in stuck
> > peering.
> > You can try to raise this limit. There are several threads on the
> > mailing list about this.
> > Regards,
> > Burkhard
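For completeness, this is roughly how the effective limits could be
checked on a running cluster (a sketch; as far as I understand, on
luminous peering/activation only gets refused once an OSD exceeds
mon_max_pg_per_osd * osd_max_pg_per_osd_hard_ratio):

  # soft limit as the mon sees it (we set 300 in ceph.conf);
  # adjust the mon name to match your setup
  ceph daemon mon.$(hostname -s) config get mon_max_pg_per_osd

  # hard ratio, queried on the host that runs osd.0 (default 2.0)
  ceph daemon osd.0 config get osd_max_pg_per_osd_hard_ratio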