[ceph-users] Cluster Down from reweight-by-utilization

Sage Weil sage at newdream.net
Sat Nov 4 21:05:05 PDT 2017

Hi Kevin,

On Sat, 4 Nov 2017, Kevin Hrpcek wrote:
> Hello,
> I've run into an issue and would appreciate any assistance anyone can provide as I haven't been able to solve this problem yet
> and am running out of ideas. I ran a reweight-by-utilization on my cluster using conservative values so that it wouldn't cause
> a large rebalancing. The reweight ran and changed some osd weights, but shortly after OSDs started getting marked down and out
> of the cluster. This continued until I set nodown and noout, but by that time the cluster had <200 OSDs up and in out of 540.
> The OSD processes started pegging the servers at 100% CPU usage. The problem seems similar to one or two others I've seen in
> the lists or bugs where the OSD maps get behind and spin at 100% CPU trying to catch up. The cluster has been down for about
> 30 hours now.
> After setting noout and nodown the cluster started to slowly mark OSDs up and in, averaging 8 OSDs/hour. At around 490/540 up
> I unset nodown to see if these OSDs were really up, and the cluster immediately marked most down again until I reset nodown
> at 161/540 up. Once again the cluster started marking OSDs up until it got to 245, but then it stopped increasing. The mon
> debug logs show that the cluster wants to mark the majority of OSDs down. Restarting OSD processes doesn't bring them up and
> in, mgr restarts didn't improve anything, and OSD node reboots seem to not do anything positive. Some logs seem to suggest
> there are authentication issues among the daemons and that the daemons are simply waiting for new maps. I can often see
> "newest_map" incrementing on OSD daemons, but it is slow and some are behind by thousands.
> Thanks,
> Kevin
> Cluster details:
> CentOS 7.4
> Kraken ceph-11.2.1-0.el7.x86_64
> 540 OSD, 3 mon/mgr/mds
> ~3.6PB, 72% raw used, ~40 million objects
> 24 OSD/server
> ~25k PGs, mostly ec k=4 m=1, 2 small replicated pools
> The command that broke everything; it should have resulted in < ~20 TB migrating:
> # ceph osd reweight-by-utilization 110 0.1 10
> moved 200 / 112640 (0.177557%)
> avg 208.593
> stddev 60.2522 -> 60.1992 (expected baseline 14.4294)
> min osd.351 with 98 -> 98 pgs (0.469815 -> 0.469815 * mean)
> max osd.503 with 310 -> 310 pgs (1.48615 -> 1.48615 * mean)
> oload 110
> max_change 0.1
> max_change_osds 10
> average_utilization 0.7176
> overload_utilization 0.7893
> osd.244 weight 1.0000 -> 0.9000
> osd.167 weight 1.0000 -> 0.9000
> osd.318 weight 1.0000 -> 0.9000
> osd.302 weight 0.8544 -> 0.7545
> osd.264 weight 1.0000 -> 0.9000
> osd.233 weight 1.0000 -> 0.9000
> osd.18 weight 1.0000 -> 0.9000
> osd.268 weight 0.8728 -> 0.7729
> osd.14 weight 1.0000 -> 0.9000
> osd.343 weight 1.0000 -> 0.9000
> A lot of OSDs are stuck in the preboot state and are marked down in the map but seem to be behind the osdmap reported on the
> monitors.
> From osd.1:
>     "whoami": 1,
>     "state": "preboot",
>     "oldest_map": 502767,
>     "newest_map": 516124,

This has been going on a while, it looks like!  :(

Do you know why the OSDs were going down?  Are there any crash dumps in the 
OSD logs, or is the OOM killer getting them?
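A quick way to check both possibilities, as a sketch; the log paths assume 
the default locations used by the CentOS packages:

```shell
# Look for crash signatures in the OSD logs (default log path assumed):
grep -lE 'FAILED assert|\*\*\* Caught signal|Segmentation fault' \
    /var/log/ceph/ceph-osd.*.log

# See whether the kernel OOM killer has been reaping ceph-osd processes:
dmesg -T | grep -iE 'out of memory|oom-killer|killed process'
```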

The usual strategy here is to set 'noup' and get all of the OSDs to catch 
up on osdmaps (you can check progress via the above status command).  Once 
they are all caught up, unset noup and let them all peer at once.
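Concretely, that sequence might look like the following sketch. The current 
cluster epoch shown here (516200) is purely illustrative; the newest_map 
value is the one osd.1 reported above:

```shell
# Keep down OSDs from being marked up while they chew through old osdmaps:
ceph osd set noup

# Find the current cluster epoch and compare it to what each OSD has:
ceph osd dump | head -1          # first line reports the current epoch
ceph daemon osd.1 status         # look at "newest_map"

# Illustrative gap arithmetic (516200 is a made-up current epoch;
# 516124 is osd.1's newest_map from the status output above):
echo $((516200 - 516124))        # osdmaps osd.1 still has to process

# Once every OSD reports the current epoch, let them all peer at once:
ceph osd unset noup
```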

The problem that has come up here in the past is when the cluster has been 
unhealthy for a long time and the past intervals use too much memory.  I 
don't see anything in your description about memory usage, though.  If 
that does rear its head there's a patch we can apply to kraken to work 
around it (this is fixed in luminous).
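For what it's worth, past-interval memory growth would show up as ceph-osd 
resident memory climbing during peering; a quick check on each OSD node 
might look like:

```shell
# Largest ceph-osd processes by resident set size (KiB):
ps -C ceph-osd -o pid=,rss=,args= --sort=rss | tail -5

# Overall memory pressure while OSDs peer:
free -m
```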


More information about the ceph-users mailing list