OOM problems

* OOM problems
@ 2010-11-13  7:57 John Weekes
  2010-11-13  8:14 ` Ian Pratt
  2010-11-13 18:15 ` George Shuklin
  0 siblings, 2 replies; 23+ messages in thread
From: John Weekes @ 2010-11-13  7:57 UTC (permalink / raw)
  To: xen-devel

On machines running many HVM (stubdom-based) domains, I often see errors 
like this:

[77176.524094] qemu-dm invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
[77176.524102] Pid: 7478, comm: qemu-dm Not tainted 2.6.32.25-g80f7e08 #2
[77176.524109] Call Trace:
[77176.524123]  [<ffffffff810897fd>] ? T.413+0xcd/0x290
[77176.524129]  [<ffffffff81089ad3>] ? __out_of_memory+0x113/0x180
[77176.524133]  [<ffffffff81089b9e>] ? out_of_memory+0x5e/0xc0
[77176.524140]  [<ffffffff8108d1cb>] ? __alloc_pages_nodemask+0x69b/0x6b0
[77176.524144]  [<ffffffff8108d1f2>] ? __get_free_pages+0x12/0x60
[77176.524152]  [<ffffffff810c94e7>] ? __pollwait+0xb7/0x110
[77176.524161]  [<ffffffff81262b93>] ? n_tty_poll+0x183/0x1d0
[77176.524165]  [<ffffffff8125ea42>] ? tty_poll+0x92/0xa0
[77176.524169]  [<ffffffff810c8a92>] ? do_select+0x362/0x670
[77176.524173]  [<ffffffff810c9430>] ? __pollwait+0x0/0x110
[77176.524178]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524183]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524188]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524193]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524197]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524202]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524207]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524212]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524217]  [<ffffffff810c9540>] ? pollwake+0x0/0x60
[77176.524222]  [<ffffffff810c8fb5>] ? core_sys_select+0x215/0x350
[77176.524231]  [<ffffffff810100af>] ? xen_restore_fl_direct_end+0x0/0x1
[77176.524236]  [<ffffffff8100c48d>] ? xen_mc_flush+0x8d/0x1b0
[77176.524243]  [<ffffffff81014ffb>] ? xen_hypervisor_callback+0x1b/0x20
[77176.524251]  [<ffffffff814b0f5a>] ? error_exit+0x2a/0x60
[77176.524255]  [<ffffffff8101485d>] ? retint_restore_args+0x5/0x6
[77176.524263]  [<ffffffff8102fd3d>] ? pvclock_clocksource_read+0x4d/0xb0
[77176.524268]  [<ffffffff8102fd3d>] ? pvclock_clocksource_read+0x4d/0xb0
[77176.524276]  [<ffffffff810663d1>] ? ktime_get_ts+0x61/0xd0
[77176.524281]  [<ffffffff810c9354>] ? sys_select+0x44/0x120
[77176.524286]  [<ffffffff81013f02>] ? system_call_fastpath+0x16/0x1b
[77176.524290] Mem-Info:
[77176.524293] DMA per-cpu:
[77176.524296] CPU    0: hi:    0, btch:   1 usd:   0
[77176.524300] CPU    1: hi:    0, btch:   1 usd:   0
[77176.524303] CPU    2: hi:    0, btch:   1 usd:   0
[77176.524306] CPU    3: hi:    0, btch:   1 usd:   0
[77176.524310] CPU    4: hi:    0, btch:   1 usd:   0
[77176.524313] CPU    5: hi:    0, btch:   1 usd:   0
[77176.524316] CPU    6: hi:    0, btch:   1 usd:   0
[77176.524318] CPU    7: hi:    0, btch:   1 usd:   0
[77176.524322] CPU    8: hi:    0, btch:   1 usd:   0
[77176.524324] CPU    9: hi:    0, btch:   1 usd:   0
[77176.524327] CPU   10: hi:    0, btch:   1 usd:   0
[77176.524330] CPU   11: hi:    0, btch:   1 usd:   0
[77176.524333] CPU   12: hi:    0, btch:   1 usd:   0
[77176.524336] CPU   13: hi:    0, btch:   1 usd:   0
[77176.524339] CPU   14: hi:    0, btch:   1 usd:   0
[77176.524342] CPU   15: hi:    0, btch:   1 usd:   0
[77176.524345] CPU   16: hi:    0, btch:   1 usd:   0
[77176.524348] CPU   17: hi:    0, btch:   1 usd:   0
[77176.524351] CPU   18: hi:    0, btch:   1 usd:   0
[77176.524354] CPU   19: hi:    0, btch:   1 usd:   0
[77176.524358] CPU   20: hi:    0, btch:   1 usd:   0
[77176.524364] CPU   21: hi:    0, btch:   1 usd:   0
[77176.524367] CPU   22: hi:    0, btch:   1 usd:   0
[77176.524370] CPU   23: hi:    0, btch:   1 usd:   0
[77176.524372] DMA32 per-cpu:
[77176.524374] CPU    0: hi:  186, btch:  31 usd:  81
[77176.524377] CPU    1: hi:  186, btch:  31 usd:  66
[77176.524380] CPU    2: hi:  186, btch:  31 usd:  49
[77176.524385] CPU    3: hi:  186, btch:  31 usd:  67
[77176.524387] CPU    4: hi:  186, btch:  31 usd:  93
[77176.524390] CPU    5: hi:  186, btch:  31 usd:  73
[77176.524393] CPU    6: hi:  186, btch:  31 usd:  50
[77176.524396] CPU    7: hi:  186, btch:  31 usd:  79
[77176.524399] CPU    8: hi:  186, btch:  31 usd:  21
[77176.524402] CPU    9: hi:  186, btch:  31 usd:  38
[77176.524406] CPU   10: hi:  186, btch:  31 usd:   0
[77176.524409] CPU   11: hi:  186, btch:  31 usd:  75
[77176.524412] CPU   12: hi:  186, btch:  31 usd:   1
[77176.524414] CPU   13: hi:  186, btch:  31 usd:   4
[77176.524417] CPU   14: hi:  186, btch:  31 usd:   9
[77176.524420] CPU   15: hi:  186, btch:  31 usd:   0
[77176.524423] CPU   16: hi:  186, btch:  31 usd:  56
[77176.524426] CPU   17: hi:  186, btch:  31 usd:  35
[77176.524429] CPU   18: hi:  186, btch:  31 usd:  32
[77176.524432] CPU   19: hi:  186, btch:  31 usd:  39
[77176.524435] CPU   20: hi:  186, btch:  31 usd:  24
[77176.524438] CPU   21: hi:  186, btch:  31 usd:   0
[77176.524441] CPU   22: hi:  186, btch:  31 usd:  35
[77176.524444] CPU   23: hi:  186, btch:  31 usd:  51
[77176.524447] Normal per-cpu:
[77176.524449] CPU    0: hi:  186, btch:  31 usd:  29
[77176.524453] CPU    1: hi:  186, btch:  31 usd:   1
[77176.524456] CPU    2: hi:  186, btch:  31 usd:  30
[77176.524459] CPU    3: hi:  186, btch:  31 usd:  30
[77176.524463] CPU    4: hi:  186, btch:  31 usd:  30
[77176.524466] CPU    5: hi:  186, btch:  31 usd:  31
[77176.524469] CPU    6: hi:  186, btch:  31 usd:   0
[77176.524471] CPU    7: hi:  186, btch:  31 usd:   0
[77176.524474] CPU    8: hi:  186, btch:  31 usd:  30
[77176.524477] CPU    9: hi:  186, btch:  31 usd:  28
[77176.524480] CPU   10: hi:  186, btch:  31 usd:   0
[77176.524483] CPU   11: hi:  186, btch:  31 usd:  30
[77176.524486] CPU   12: hi:  186, btch:  31 usd:   0
[77176.524489] CPU   13: hi:  186, btch:  31 usd:   0
[77176.524492] CPU   14: hi:  186, btch:  31 usd:   0
[77176.524495] CPU   15: hi:  186, btch:  31 usd:   0
[77176.524498] CPU   16: hi:  186, btch:  31 usd:   0
[77176.524501] CPU   17: hi:  186, btch:  31 usd:   0
[77176.524504] CPU   18: hi:  186, btch:  31 usd:   0
[77176.524507] CPU   19: hi:  186, btch:  31 usd:   0
[77176.524510] CPU   20: hi:  186, btch:  31 usd:   0
[77176.524513] CPU   21: hi:  186, btch:  31 usd:   0
[77176.524516] CPU   22: hi:  186, btch:  31 usd:   0
[77176.524518] CPU   23: hi:  186, btch:  31 usd:   0
[77176.524524] active_anon:5675 inactive_anon:4676 isolated_anon:0
[77176.524526]  active_file:146373 inactive_file:153543 isolated_file:480
[77176.524527]  unevictable:0 dirty:167539 writeback:322 unstable:0
[77176.524528]  free:5017 slab_reclaimable:15640 slab_unreclaimable:8972
[77176.524529]  mapped:1114 shmem:7 pagetables:1908 bounce:0
[77176.524536] DMA free:9820kB min:32kB low:40kB high:48kB 
active_anon:4kB inactive_anon:0kB active_file:616kB inactive_file:2212kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB present:12740kB 
mlocked:0kB dirty:2292kB writeback:0kB mapped:0kB shmem:0kB 
slab_reclaimable:72kB slab_unreclaimable:108kB kernel_stack:0kB 
pagetables:12kB unstable:0kB bounce:0kB writeback_tmp:0kB 
pages_scanned:3040 all_unreclaimable? no
[77176.524541] lowmem_reserve[]: 0 1428 2452 2452
[77176.524551] DMA32 free:7768kB min:3680kB low:4600kB high:5520kB 
active_anon:22696kB inactive_anon:18704kB active_file:584580kB 
inactive_file:608508kB unevictable:0kB isolated(anon):0kB 
isolated(file):1920kB present:1462496kB mlocked:0kB dirty:664128kB 
writeback:1276kB mapped:4456kB shmem:28kB slab_reclaimable:62076kB 
slab_unreclaimable:32292kB kernel_stack:5120kB pagetables:7620kB 
unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1971808 
all_unreclaimable? yes
[77176.524556] lowmem_reserve[]: 0 0 1024 1024
[77176.524564] Normal free:2480kB min:2636kB low:3292kB high:3952kB 
active_anon:0kB inactive_anon:0kB active_file:296kB inactive_file:3452kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1048700kB 
mlocked:0kB dirty:3736kB writeback:12kB mapped:0kB shmem:0kB 
slab_reclaimable:412kB slab_unreclaimable:3488kB kernel_stack:80kB 
pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB 
pages_scanned:8192 all_unreclaimable? yes
[77176.524569] lowmem_reserve[]: 0 0 0 0
[77176.524574] DMA: 4*4kB 25*8kB 11*16kB 7*32kB 8*64kB 8*128kB 8*256kB 
3*512kB 0*1024kB 0*2048kB 1*4096kB = 9832kB
[77176.524587] DMA32: 742*4kB 118*8kB 3*16kB 3*32kB 2*64kB 0*128kB 
0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 7768kB
[77176.524600] Normal: 1*4kB 1*8kB 2*16kB 13*32kB 14*64kB 2*128kB 
0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1612kB
[77176.524613] 302308 total pagecache pages
[77176.524615] 1619 pages in swap cache
[77176.524617] Swap cache stats: add 40686, delete 39067, find 24687/26036
[77176.524619] Free swap  = 10141956kB
[77176.524621] Total swap = 10239992kB
[77176.577607] 793456 pages RAM
[77176.577611] 436254 pages reserved
[77176.577613] 308627 pages shared
[77176.577615] 49249 pages non-shared
[77176.577620] Out of memory: kill process 5755 (python2.6) score 110492 
or a child
[77176.577623] Killed process 5757 (python2.6)

Depending on what gets nuked by the OOM-killer, I am frequently left 
with an unusable system that needs to be rebooted.

The machine always has plenty of memory available (1.5 GB devoted to 
dom0, of which >1 GB is always just in "cached" state). For instance, 
right now, on this same machine:

# free
              total       used       free     shared    buffers     cached
Mem:       1536512    1493112      43400          0      10284    1144904
-/+ buffers/cache:     337924    1198588
Swap:     10239992      74444   10165548

I have seen this OOM problem on a wide range of Xen versions, stretching 
as far back as I can remember, including the most recent 4.1-unstable 
and 2.6.32 pvops kernel (from yesterday, tested in the hope that they 
would fix this).  I haven't found a way to reliably reproduce it yet, 
but I suspect that the problem relates to reasonably heavy disk or 
network activity -- during this last one, I see that a domain was 
briefly doing ~200 Mbps of downloads.

Anyone have any ideas on what this could be? Is RAM getting 
spontaneously filled because a buffer somewhere grows too quickly, or 
something like that? What can I try here?

-John

^ permalink raw reply	[flat|nested] 23+ messages in thread