[ceph-users] Cluster Down from reweight-by-utilization

Kevin Hrpcek kevin.hrpcek at ssec.wisc.edu
Sat Nov 4 20:09:47 PDT 2017


Hello,

I've run into an issue and would appreciate any assistance, as I haven't been able to solve this problem yet and am running out of ideas. I ran a reweight-by-utilization on my cluster using conservative values so that it wouldn't cause a large rebalance. The reweight ran and changed some OSD weights, but shortly afterwards OSDs started getting marked down and out of the cluster. This continued until I set nodown and noout, but by that time the cluster had fewer than 200 of 540 OSDs up and in. The OSD processes started pegging the servers at 100% CPU usage. The problem seems similar to one or two others I've seen on the list or in the bug tracker, where the OSD maps fall behind and the daemons spin at 100% CPU trying to catch up. The cluster has been down for about 30 hours now.
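
For reference, the flags were set with the standard commands:

# ceph osd set nodown
# ceph osd set noout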

After setting noout and nodown the cluster started to slowly mark OSDs up and in, averaging about 8 OSDs/hour. At around 490/540 up I unset nodown to see if those OSDs were really up, and the cluster immediately marked most of them down again until I re-set nodown at 161/540 up. Once again the cluster marked OSDs up until it reached 245, but then it stopped increasing. The mon debug logs show that the cluster wants to mark the majority of OSDs down. Restarting OSD processes doesn't bring them up and in, mgr restarts didn't improve anything, and OSD node reboots don't seem to help either. Some logs suggest there are authentication issues among the daemons and that the daemons are simply waiting for new maps. I can often see "newest_map" incrementing on the OSD daemons, but it is slow and some are behind by thousands of epochs.

Thanks,
Kevin

Cluster details:
CentOS 7.4
Kraken ceph-11.2.1-0.el7.x86_64
540 OSD, 3 mon/mgr/mds
~3.6PB, 72% raw used, ~40 million objects
24 OSD/server
~25k PGs, mostly ec k=4 m=1, 2 small replicated pools


The command that broke everything; it should have resulted in less than ~20 TB migrating:
# ceph osd reweight-by-utilization 110 0.1 10
moved 200 / 112640 (0.177557%)
avg 208.593
stddev 60.2522 -> 60.1992 (expected baseline 14.4294)
min osd.351 with 98 -> 98 pgs (0.469815 -> 0.469815 * mean)
max osd.503 with 310 -> 310 pgs (1.48615 -> 1.48615 * mean)

oload 110
max_change 0.1
max_change_osds 10
average_utilization 0.7176
overload_utilization 0.7893
osd.244 weight 1.0000 -> 0.9000
osd.167 weight 1.0000 -> 0.9000
osd.318 weight 1.0000 -> 0.9000
osd.302 weight 0.8544 -> 0.7545
osd.264 weight 1.0000 -> 0.9000
osd.233 weight 1.0000 -> 0.9000
osd.18 weight 1.0000 -> 0.9000
osd.268 weight 0.8728 -> 0.7729
osd.14 weight 1.0000 -> 0.9000
osd.343 weight 1.0000 -> 0.9000
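
If it comes to it, I assume this part can simply be undone by setting the reweights back to their previous values one OSD at a time, e.g.:

# ceph osd reweight 302 0.8544
# ceph osd reweight 268 0.8728
# ceph osd reweight 244 1.0
(and 1.0 for the remaining OSDs listed above)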

A lot of OSDs are stuck in the preboot state; they are marked down in the map and appear to be behind the osdmap epoch reported by the monitors.
From osd.1:
     "whoami": 1,
     "state": "preboot",
     "oldest_map": 502767,
     "newest_map": 516124,

# ceph osd stat
      osdmap e518386: 540 osds: 245 up, 530 in; 14626 remapped pgs
             flags 
nodown,noout,sortbitwise,require_jewel_osds,require_kraken_osds

Some OSDs are caught up to the map and are in the cluster while nodown and noout are set:
# ceph daemon osd.0 status
{
     "cluster_fsid": "bc2a1488-74f8-4d87-b2f6-615ae26bf7c9",
     "osd_fsid": "7d88f4d2-cebd-477c-a344-ffbca8c471a8",
     "whoami": 0,
     "state": "active",
     "oldest_map": 502767,
     "newest_map": 518391,
     "num_pgs": 228

But the mon debug log suggests it wants to mark this OSD, and many others, down:
2017-11-05 02:16:31.598083 7f42382d9700  5 mon.sephmon1@0(leader).osd 
e518321 can_mark_down NODOWN flag set, will not mark osd.0 down
2017-11-05 02:16:31.598095 7f42382d9700  5 mon.sephmon1@0(leader).osd 
e518321 can_mark_down NODOWN flag set, will not mark osd.3 down
2017-11-05 02:16:31.598100 7f42382d9700  5 mon.sephmon1@0(leader).osd 
e518321 can_mark_down NODOWN flag set, will not mark osd.4 down
2017-11-05 02:16:31.598104 7f42382d9700  5 mon.sephmon1@0(leader).osd 
e518321 can_mark_down NODOWN flag set, will not mark osd.5 down
2017-11-05 02:16:31.598107 7f42382d9700  5 mon.sephmon1@0(leader).osd 
e518321 can_mark_down NODOWN flag set, will not mark osd.6 down
2017-11-05 02:16:31.598111 7f42382d9700  5 mon.sephmon1@0(leader).osd 
e518321 can_mark_down NODOWN flag set, will not mark osd.8 down

The mon log is full of messages like:
2017-11-05 02:55:55.512643 7f423b3d9700  0 -- 10.1.9.201:6789/0 >> - 
conn(0x55cfe1ae6800 :6789 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 
l=0).fault with nothing to send and in the half  accept state just closed

The mgr log has a lot of these:
2017-11-05 02:58:05.813420 7f76dcf90700  0 -- 10.1.9.201:6801/24033 >> - 
conn(0x55e93ed39000 :6801 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 
l=0).fault with nothing to send and in the half  accept state just closed
2017-11-05 02:58:05.998992 7f76dcf90700  0 -- 10.1.9.201:6801/24033 >> - 
conn(0x55e942ff9000 :6801 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 
l=0).fault with nothing to send and in the half  accept state just closed
2017-11-05 02:58:06.496972 7f76da816700 -1 mgr ms_dispatch fsmap(e 6691) v1
2017-11-05 02:58:08.042295 7f76d8812700  1 mgr send_beacon active
2017-11-05 02:58:08.429476 7f76da816700 -1 mgr ms_dispatch mgrdigest v1
2017-11-05 02:58:08.537725 7f76dcf90700  0 -- 10.1.9.201:6801/24033 >> 
10.1.9.24:6816/5624 conn(0x55e947cde000 :6801 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 
l=0).handle_connect_msg accept connect_seq 2 vs existing csq=2 
existing_state=STATE_CONNECTING
2017-11-05 02:58:08.538037 7f76dcf90700  0 -- 10.1.9.201:6801/24033 >> 
10.1.9.24:6816/5624 conn(0x55e947cde000 :6801 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=0 cs=0 l=0).fault with nothing to 
send and in the half  accept state just closed
2017-11-05 02:58:08.613374 7f76dcf90700  0 -- 10.1.9.201:6801/24033 >> - 
conn(0x55e941843800 :6801 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 
l=0).fault with nothing to send and in the half  accept state just closed

OSD logs for a 'down' osd:
2017-11-05 02:59:44.256472 7f68fc0ad700  0 -- 0.0.0.0:6835/35706 >> - 
conn(0x55acb8c11800 :6835 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 
l=0).fault with nothing to send and in the half  accept state just closed
2017-11-05 02:59:44.751890 7f68fc8ae700  0 -- 10.1.9.1:6833/35706 >> - 
conn(0x55acc93ee000 :6833 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 
l=0).fault with nothing to send and in the half  accept state just closed
2017-11-05 02:59:44.934979 7f68fc0ad700  0 -- 0.0.0.0:6835/35706 >> - 
conn(0x55acbc5ef800 :6835 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 
l=0).fault with nothing to send and in the half  accept state just closed
2017-11-05 02:59:44.935007 7f68fc8ae700  0 -- 0.0.0.0:6834/35706 >> - 
conn(0x55accb28c800 :6834 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 
l=0).fault with nothing to send and in the half  accept state just closed

OSD logs for an 'up' osd:
2017-11-05 03:00:24.266686 7f2459f41700 -1 osd.0 518417 heartbeat_check: 
no reply from 10.1.9.8:6851 osd.250 since back 2017-11-05 
03:00:21.586832 front 2017-11-05 02:59:35.268399 (cutoff 2017-11-05 
03:00:04.266664)
2017-11-05 03:00:24.922615 7f245de82700  0 -- 10.1.9.1:6813/1027111 >> 
10.1.9.20:6825/2005105 conn(0x5580adf00000 :-1 
s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 
l=0).handle_connect_reply connect got BADAUTHORIZER
2017-11-05 03:00:25.100982 7f245e683700  0 -- 10.1.9.1:6813/1027111 >> 
10.1.9.8:6893/1015456 conn(0x5580acba9000 :-1 
s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 
l=0).handle_connect_reply connect got BADAUTHORIZER
2017-11-05 03:00:25.266906 7f2459f41700 -1 osd.0 518417 heartbeat_check: 
no reply from 10.1.9.8:6851 osd.250 since back 2017-11-05 
03:00:25.087925 front 2017-11-05 02:59:35.268399 (cutoff 2017-11-05 
03:00:05.266894)

ceph status:

[root@sephmon1 ceph]# ceph status
     cluster bc2a1488-74f8-4d87-b2f6-615ae26bf7c9
      health HEALTH_ERR
             25701 pgs are stuck inactive for more than 300 seconds
             2 pgs backfill_wait
             2 pgs backfilling
             1151 pgs degraded
             19114 pgs down
             2496 pgs peering
             1 pgs recovering
             3973 pgs stale
             1150 pgs stuck degraded
             21728 pgs stuck inactive
             3973 pgs stuck stale
             22805 pgs stuck unclean
             1150 pgs stuck undersized
             1151 pgs undersized
             542105 requests are blocked > 32 sec
             recovery 3150732/243157671 objects degraded (1.296%)
             recovery 21562/243157671 objects misplaced (0.009%)
             recovery 1/51923123 unfound (0.000%)
             mds0: 4755 slow requests are blocked > 30 sec
             284/530 in osds are down
             nodown,noout flag(s) set
             mon.sephmon1 low disk space
             mon.sephmon2 low disk space
             mon.sephmon3 low disk space
      monmap e15: 3 mons at 
{sephmon1=10.1.9.201:6789/0,sephmon2=10.1.9.202:6789/0,sephmon3=10.1.9.203:6789/0}
             election epoch 916, quorum 0,1,2 sephmon1,sephmon2,sephmon3
       fsmap e6708: 1/1/1 up {0=sephmon1=up:active}, 1 up:standby
         mgr active: sephmon1 standbys: sephmon2, sephmon3
      osdmap e518431: 540 osds: 246 up, 530 in; 14672 remapped pgs
             flags 
nodown,noout,sortbitwise,require_jewel_osds,require_kraken_osds
       pgmap v52920051: 23552 pgs, 6 pools, 2030 TB data, 50706 kobjects
             2404 TB used, 942 TB / 3347 TB avail
             3150732/243157671 objects degraded (1.296%)
             21562/243157671 objects misplaced (0.009%)
             1/51923123 unfound (0.000%)
                10465 down+remapped
                 6972 down
                 1124 stale+remapped+peering
                 1029 active+undersized+degraded
                  935 stale+down+remapped
                  742 stale+down
                  691 stale+peering
                  494 remapped+peering
                  397 active+clean
                  347 stale+active+clean
                  187 peering
                   58 stale+activating+undersized+degraded
                   36 stale+active+undersized+degraded
                   14 stale+remapped
                   12 activating+undersized+degraded
                   11 stale+activating+undersized+degraded+remapped
                    7 remapped
                    6 stale
                    5 activating
                    4 active+remapped
                    3 active+undersized+degraded+remapped
                    2 stale+activating+remapped
                    2 stale+active+remapped+backfill_wait
                    2 stale+activating
                    1 stale+active+clean+scrubbing
                    1 active+recovering+undersized+degraded
                    1 stale+active+remapped+backfilling
                    1 inactive
                    1 active+clean+scrubbing
                    1 stale+active+clean+scrubbing+deep
                    1 active+undersized+degraded+remapped+backfilling


-- 
Kevin Hrpcek
Linux Systems Administrator
Space Science & Engineering Center
University of Wisconsin-Madison
