[ceph-users] OSD killed by OOM when many cache available

Sam Huracan nowitzki.sammy at gmail.com
Fri Nov 17 16:22:25 PST 2017


Today, one of our Ceph OSDs went down. I checked the syslog and saw that the OSD
process had been killed by the OOM killer:


Nov 17 10:01:06 ceph1 kernel: [2807926.762304] Out of memory: Kill process
3330 (ceph-osd) score 7 or sacrifice child
Nov 17 10:01:06 ceph1 kernel: [2807926.763745] Killed process 3330
(ceph-osd) total-vm:2372392kB, anon-rss:559084kB, file-rss:7268kB
Nov 17 10:01:06 ceph1 kernel: [2807926.985830] init: ceph-osd (ceph/6) main
process (3330) killed by KILL signal
Nov 17 10:01:06 ceph1 kernel: [2807926.985844] init: ceph-osd (ceph/6) main
process ended, respawning
Nov 17 10:03:39 ceph1 bash: root [15524]: sudo ceph health detail [0]
Nov 17 10:03:40 ceph1 bash: root [15524]: sudo ceph health detail [0]
Nov 17 10:17:01 ceph1 CRON[75167]: (root) CMD (   cd / && run-parts
--report /etc/cron.hourly)
Nov 17 10:47:17 ceph1 kernel: [2810698.553690] ceph-status.sh invoked
oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0
Nov 17 10:47:17 ceph1 kernel: [2810698.553693] ceph-status.sh cpuset=/
mems_allowed=0
Nov 17 10:47:17 ceph1 kernel: [2810698.553697] CPU: 5 PID: 194271 Comm:
ceph-status.sh Not tainted 4.4.0-62-generic #83~14.04.1-Ubuntu
Nov 17 10:47:17 ceph1 kernel: [2810698.553698] Hardware name: Dell Inc.
PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
Nov 17 10:47:17 ceph1 kernel: [2810698.553699]  0000000000000000
ffff88001a857b38 ffffffff813dc4ac ffff88001a857cf0
Nov 17 10:47:17 ceph1 kernel: [2810698.553701]  0000000000000000
ffff88001a857bc8 ffffffff811fb066 ffff88083d29e200
Nov 17 10:47:17 ceph1 kernel: [2810698.553703]  ffff88001a857cf0
ffff88001a857c00 ffff8808453cf000 0000000000000000
Nov 17 10:47:17 ceph1 kernel: [2810698.553704] Call Trace:
Nov 17 10:47:17 ceph1 kernel: [2810698.553709]  [<ffffffff813dc4ac>]
dump_stack+0x63/0x87
Nov 17 10:47:17 ceph1 kernel: [2810698.553713]  [<ffffffff811fb066>]
dump_header+0x5b/0x1d5
Nov 17 10:47:17 ceph1 kernel: [2810698.553717]  [<ffffffff81188b75>]
oom_kill_process+0x205/0x3d0
Nov 17 10:47:17 ceph1 kernel: [2810698.553718]  [<ffffffff8118857e>] ?
oom_unkillable_task+0x9e/0xc0
Nov 17 10:47:17 ceph1 kernel: [2810698.553720]  [<ffffffff811891ab>]
out_of_memory+0x40b/0x460
Nov 17 10:47:17 ceph1 kernel: [2810698.553722]  [<ffffffff811fbb1f>]
__alloc_pages_slowpath.constprop.87+0x742/0x7ad
Nov 17 10:47:17 ceph1 kernel: [2810698.553725]  [<ffffffff8118e1a7>]
__alloc_pages_nodemask+0x237/0x240
Nov 17 10:47:17 ceph1 kernel: [2810698.553727]  [<ffffffff8118e36d>]
alloc_kmem_pages_node+0x4d/0xd0
Nov 17 10:47:17 ceph1 kernel: [2810698.553730]  [<ffffffff8107c125>]
copy_process+0x185/0x1ce0
Nov 17 10:47:17 ceph1 kernel: [2810698.553732]  [<ffffffff813320f3>] ?
security_file_alloc+0x33/0x50
Nov 17 10:47:17 ceph1 kernel: [2810698.553734]  [<ffffffff8107de1a>]
_do_fork+0x8a/0x310
Nov 17 10:47:17 ceph1 kernel: [2810698.553737]  [<ffffffff8108dc01>] ?
sigprocmask+0x51/0x80
Nov 17 10:47:17 ceph1 kernel: [2810698.553738]  [<ffffffff8107e149>]
SyS_clone+0x19/0x20
Nov 17 10:47:17 ceph1 kernel: [2810698.553743]  [<ffffffff81802d36>]
entry_SYSCALL_64_fastpath+0x16/0x75
Nov 17 10:47:17 ceph1 kernel: [2810698.553744] Mem-Info:
Nov 17 10:47:17 ceph1 kernel: [2810698.553747] active_anon:798964
inactive_anon:187711 isolated_anon:0
Nov 17 10:47:17 ceph1 kernel: [2810698.553747]  active_file:3299435
inactive_file:3242497 isolated_file:9
Nov 17 10:47:17 ceph1 kernel: [2810698.553747]  unevictable:0 dirty:18678
writeback:949 unstable:0
Nov 17 10:47:17 ceph1 kernel: [2810698.553747]  slab_reclaimable:238396
slab_unreclaimable:142741
Nov 17 10:47:17 ceph1 kernel: [2810698.553747]  mapped:12429 shmem:319
pagetables:6616 bounce:0
Nov 17 10:47:17 ceph1 kernel: [2810698.553747]  free:96727 free_pcp:0
free_cma:0
Nov 17 10:47:17 ceph1 kernel: [2810698.553749] Node 0 DMA free:14828kB
min:32kB low:40kB high:48kB active_anon:0kB inactive_anon:0kB
active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB
isolated(file):0kB present:15996kB managed:15896kB mlocked:0kB dirty:0kB
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB
pages_scanned:0 all_unreclaimable? yes
Nov 17 10:47:17 ceph1 kernel: [2810698.553753] lowmem_reserve[]: 0 1830
32011 32011 32011
Nov 17 10:47:17 ceph1 kernel: [2810698.553755] Node 0 DMA32 free:134848kB
min:3860kB low:4824kB high:5788kB active_anon:120628kB
inactive_anon:121792kB active_file:653404kB inactive_file:399272kB
unevictable:0kB isolated(anon):0kB isolated(file):36kB present:1981184kB
managed:1900752kB mlocked:0kB dirty:344kB writeback:0kB mapped:1200kB
shmem:0kB slab_reclaimable:239900kB slab_unreclaimable:154560kB
kernel_stack:43376kB pagetables:1176kB unstable:0kB bounce:0kB free_pcp:0kB
local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? no


We checked free memory: 24,705 MB was available once buffers/cache are counted as reclaimable:

root@ceph1:/var/log# free -m
             total       used       free     shared    buffers     cached
Mem:         32052      31729        323          1        125      24256
-/+ buffers/cache:       7347      24705
Swap:        61033        120      60913
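One detail from the trace above that may matter: the oom-killer line reports order=2, i.e. the kernel needed 2^2 = 4 physically contiguous pages (16 KiB) and could not assemble them, even with lots of reclaimable page cache. That points at memory fragmentation rather than true exhaustion. A quick way to check, assuming a Linux host like this one (paths are standard procfs, not specific to our setup):

```shell
# Per-zone counts of free blocks by order; consistently low numbers in
# the order-2 and higher columns indicate fragmentation:
cat /proc/buddyinfo

# Headroom the kernel reserves for atomic/higher-order allocations;
# some operators raise vm.min_free_kbytes on busy OSD hosts (tune with care):
cat /proc/sys/vm/min_free_kbytes
```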



It's so weird. Could you help me figure out this problem? I'm afraid it'll happen
again.

Thanks in advance.
