* Re: 2.6.20-rc5: cp 18gb 18gb.2 = OOM killer, reproducible just like 2.6.19.2
@ 2007-01-22 19:57 ` Andrew Morton
0 siblings, 0 replies; 24+ messages in thread
From: Andrew Morton @ 2007-01-22 19:57 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-kernel, linux-raid, xfs
> On Sun, 21 Jan 2007 14:27:34 -0500 (EST) Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> Why does copying an 18GB file on a 74GB raptor raid1 cause the kernel to invoke
> the OOM killer and kill all of my processes?
What's that? Software raid or hardware raid? If the latter, which driver?
> Doing this on a single disk with 2.6.19.2 is OK, no issues. However, this
> happens every time!
>
> Anything to try? Any other output needed? Can someone shed some light on
> this situation?
>
> Thanks.
>
>
> The last lines of vmstat 1 (right before it kill -9'd my shell/ssh)
>
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd   free  buff   cache   si   so    bi    bo   in   cs us sy id wa
>  0  7    764  50348    12 1269988    0    0 53632   172 1902 4600  1  8 29 62
>  0  7    764  49420    12 1260004    0    0 53632 34368 1871 6357  2 11 48 40
The wordwrapping is painful :(
>
> The last lines of dmesg:
> [ 5947.199985] lowmem_reserve[]: 0 0 0
> [ 5947.199992] DMA: 0*4kB 1*8kB 1*16kB 0*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3544kB
> [ 5947.200010] Normal: 1*4kB 0*8kB 1*16kB 1*32kB 0*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 2740kB
> [ 5947.200035] HighMem: 98*4kB 35*8kB 9*16kB 69*32kB 4*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3664kB
> [ 5947.200052] Swap cache: add 789, delete 189, find 16/17, race 0+0
> [ 5947.200055] Free swap = 2197628kB
> [ 5947.200058] Total swap = 2200760kB
> [ 5947.200060] Free swap: 2197628kB
> [ 5947.205664] 517888 pages of RAM
> [ 5947.205671] 288512 pages of HIGHMEM
> [ 5947.205673] 5666 reserved pages
> [ 5947.205675] 257163 pages shared
> [ 5947.205678] 600 pages swap cached
> [ 5947.205680] 88876 pages dirty
> [ 5947.205682] 115111 pages writeback
> [ 5947.205684] 5608 pages mapped
> [ 5947.205686] 49367 pages slab
> [ 5947.205688] 541 pages pagetables
> [ 5947.205795] Out of memory: kill process 1853 (named) score 9937 or a child
> [ 5947.205801] Killed process 1853 (named)
> [ 5947.206616] bash invoked oom-killer: gfp_mask=0x84d0, order=0, oomkilladj=0
> [ 5947.206621] [<c013e33b>] out_of_memory+0x17b/0x1b0
> [ 5947.206631] [<c013fcac>] __alloc_pages+0x29c/0x2f0
> [ 5947.206636] [<c01479ad>] __pte_alloc+0x1d/0x90
> [ 5947.206643] [<c0148bf7>] copy_page_range+0x357/0x380
> [ 5947.206649] [<c0119d75>] copy_process+0x765/0xfc0
> [ 5947.206655] [<c012c3f9>] alloc_pid+0x1b9/0x280
> [ 5947.206662] [<c011a839>] do_fork+0x79/0x1e0
> [ 5947.206674] [<c015f91f>] do_pipe+0x5f/0xc0
> [ 5947.206680] [<c0101176>] sys_clone+0x36/0x40
> [ 5947.206686] [<c0103138>] syscall_call+0x7/0xb
> [ 5947.206691] [<c0420033>] __sched_text_start+0x853/0x950
> [ 5947.206698] =======================
Important information from the oom-killing event is missing. Please send
it all.
From your earlier reports we have several hundred MB of ZONE_NORMAL memory
which has gone AWOL.
Please include /proc/meminfo from after the oom-killing.
Please work out what is using all that slab memory, via /proc/slabinfo.
After the oom-killing, please see if you can free up the ZONE_NORMAL memory
via a few `echo 3 > /proc/sys/vm/drop_caches' commands. See if you can
work out what happened to the missing couple-of-hundred MB from
ZONE_NORMAL.
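For reference, the capture sequence requested above can be scripted roughly as
follows (a sketch, not a prescribed procedure; the output file names are
arbitrary, and the awk fields assume the 2.6-era /proc/slabinfo layout, where
column 3 is num_objs and column 4 is objsize in bytes):

  # snapshot memory state right after the oom-killing
  cat /proc/meminfo  > meminfo_after_oom.txt
  cat /proc/slabinfo > slabinfo_after_oom.txt

  # rank slab caches by approximate memory use (num_objs * objsize),
  # skipping the two slabinfo header lines
  awk 'NR > 2 { printf "%10d KB  %s\n", $3 * $4 / 1024, $1 }' \
      /proc/slabinfo | sort -rn | head

  # try to reclaim, then re-check how much ZONE_NORMAL came back
  echo 3 > /proc/sys/vm/drop_caches   # 1=pagecache, 2=dentries+inodes, 3=both
  cat /proc/meminfo > meminfo_after_drop_caches.txt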
* Re: 2.6.20-rc5: cp 18gb 18gb.2 = OOM killer, reproducible just like 2.6.19.2
2007-01-22 19:57 ` Andrew Morton
@ 2007-01-22 20:20 ` Justin Piszcz
-1 siblings, 0 replies; 24+ messages in thread
From: Justin Piszcz @ 2007-01-22 20:20 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-raid, xfs
> What's that? Software raid or hardware raid? If the latter, which driver?
Software RAID (md)
On Mon, 22 Jan 2007, Andrew Morton wrote:
> ...
>
> Important information from the oom-killing event is missing. Please send
> it all.
I believe this is the first part of it (hopefully):
2908kB active:86104kB inactive:1061904kB present:1145032kB pages_scanned:0 all_unreclaimable? no
[ 5947.199985] lowmem_reserve[]: 0 0 0
[ 5947.199992] DMA: 0*4kB 1*8kB 1*16kB 0*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3544kB
[ 5947.200010] Normal: 1*4kB 0*8kB 1*16kB 1*32kB 0*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 2740kB
[ 5947.200035] HighMem: 98*4kB 35*8kB 9*16kB 69*32kB 4*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3664kB
[ 5947.200052] Swap cache: add 789, delete 189, find 16/17, race 0+0
[ 5947.200055] Free swap = 2197628kB
[ 5947.200058] Total swap = 2200760kB
[ 5947.200060] Free swap: 2197628kB
[ 5947.205664] 517888 pages of RAM
[ 5947.205671] 288512 pages of HIGHMEM
[ 5947.205673] 5666 reserved pages
[ 5947.205675] 257163 pages shared
[ 5947.205678] 600 pages swap cached
[ 5947.205680] 88876 pages dirty
[ 5947.205682] 115111 pages writeback
[ 5947.205684] 5608 pages mapped
[ 5947.205686] 49367 pages slab
[ 5947.205688] 541 pages pagetables
[ 5947.205795] Out of memory: kill process 1853 (named) score 9937 or a child
[ 5947.205801] Killed process 1853 (named)
[ 5947.206616] bash invoked oom-killer: gfp_mask=0x84d0, order=0, oomkilladj=0
[ 5947.206621] [<c013e33b>] out_of_memory+0x17b/0x1b0
[ 5947.206631] [<c013fcac>] __alloc_pages+0x29c/0x2f0
[ 5947.206636] [<c01479ad>] __pte_alloc+0x1d/0x90
[ 5947.206643] [<c0148bf7>] copy_page_range+0x357/0x380
[ 5947.206649] [<c0119d75>] copy_process+0x765/0xfc0
[ 5947.206655] [<c012c3f9>] alloc_pid+0x1b9/0x280
[ 5947.206662] [<c011a839>] do_fork+0x79/0x1e0
[ 5947.206674] [<c015f91f>] do_pipe+0x5f/0xc0
[ 5947.206680] [<c0101176>] sys_clone+0x36/0x40
[ 5947.206686] [<c0103138>] syscall_call+0x7/0xb
[ 5947.206691] [<c0420033>] __sched_text_start+0x853/0x950
[ 5947.206698] =======================
I will have to include the other parts when I am near the machine and can
reboot it locally :)
Justin.
* Re: 2.6.20-rc5: cp 18gb 18gb.2 = OOM killer, reproducible just like 2.6.19.2
2007-01-22 19:57 ` Andrew Morton
@ 2007-01-23 0:37 ` Donald Douwsma
2007-01-23 1:12 ` Andrew Morton
-1 siblings, 1 reply; 24+ messages in thread
From: Donald Douwsma @ 2007-01-23 0:37 UTC (permalink / raw)
To: Andrew Morton; +Cc: Justin Piszcz, linux-kernel, linux-raid, xfs
Andrew Morton wrote:
>> On Sun, 21 Jan 2007 14:27:34 -0500 (EST) Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>> Why does copying an 18GB file on a 74GB raptor raid1 cause the kernel to invoke
>> the OOM killer and kill all of my processes?
>
> What's that? Software raid or hardware raid? If the latter, which driver?
I've hit this using local disk while testing xfs built against 2.6.20-rc4 (SMP x86_64)
dmesg follows; I'm not sure whether anything in it after the first event is
useful, as our automated tests continued on after the failure.
> Please include /proc/meminfo from after the oom-killing.
>
> Please work out what is using all that slab memory, via /proc/slabinfo.
Sorry, I didn't pick this up either.
I'll try to reproduce this and gather some more detailed info for a single event.
Donald
...
XFS mounting filesystem sdb5
Ending clean XFS mount for filesystem: sdb5
XFS mounting filesystem sdb5
Ending clean XFS mount for filesystem: sdb5
hald invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0
Call Trace:
[<ffffffff80257367>] out_of_memory+0x70/0x25d
[<ffffffff80258f6b>] __alloc_pages+0x22c/0x2b5
[<ffffffff8026d889>] alloc_page_vma+0x71/0x76
[<ffffffff8026937b>] read_swap_cache_async+0x45/0xd8
[<ffffffff8025f2e0>] swapin_readahead+0x60/0xd3
[<ffffffff80260ece>] __handle_mm_fault+0x703/0x9d8
[<ffffffff80532bf7>] do_page_fault+0x42b/0x7b3
[<ffffffff80278adf>] do_readv_writev+0x176/0x18b
[<ffffffff8052efde>] thread_return+0x0/0xed
[<ffffffff8034d7f5>] __const_udelay+0x2c/0x2d
[<ffffffff803f4e0b>] scsi_done+0x0/0x17
[<ffffffff8053109d>] error_exit+0x0/0x84
Mem-info:
Node 0 DMA per-cpu:
CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 2: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 3: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: Hot: hi: 186, btch: 31 usd: 31 Cold: hi: 62, btch: 15 usd: 53
CPU 1: Hot: hi: 186, btch: 31 usd: 2 Cold: hi: 62, btch: 15 usd: 60
CPU 2: Hot: hi: 186, btch: 31 usd: 20 Cold: hi: 62, btch: 15 usd: 47
CPU 3: Hot: hi: 186, btch: 31 usd: 25 Cold: hi: 62, btch: 15 usd: 56
Active:76 inactive:495856 dirty:0 writeback:0 unstable:0 free:3680 slab:9119 mapped:32 pagetables:637
Node 0 DMA free:8036kB min:24kB low:28kB high:36kB active:0kB inactive:1856kB present:9376kB pages_scanned:3296 all_unreclaimable? yes
lowmem_reserve[]: 0 2003 2003
Node 0 DMA32 free:6684kB min:5712kB low:7140kB high:8568kB active:304kB inactive:1981624kB present:2052068kB pages_scanned:4343329 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 8036kB
Node 0 DMA32: 273*4kB 29*8kB 1*16kB 1*32kB 1*64kB 1*128kB 2*256kB 1*512kB 0*1024kB 0*2048kB 1*4096kB = 6684kB
Swap cache: add 741048, delete 244661, find 84826/143198, race 680+239
Free swap = 1088524kB
Total swap = 3140668kB
Free swap: 1088524kB
524224 pages of RAM
9619 reserved pages
259 pages shared
496388 pages swap cached
No available memory (MPOL_BIND): kill process 3492 (hald) score 0 or a child
Killed process 3626 (hald-addon-acpi)
top invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
Call Trace:
[<ffffffff80257367>] out_of_memory+0x70/0x25d
[<ffffffff80258f6b>] __alloc_pages+0x22c/0x2b5
[<ffffffff8026e6a3>] alloc_pages_current+0x74/0x79
[<ffffffff802548c8>] __page_cache_alloc+0xb/0xe
[<ffffffff8025a65f>] __do_page_cache_readahead+0xa1/0x217
[<ffffffff8052f776>] io_schedule+0x28/0x33
[<ffffffff8052f9e7>] __wait_on_bit_lock+0x5b/0x66
[<ffffffff802546de>] __lock_page+0x72/0x78
[<ffffffff8025ab22>] do_page_cache_readahead+0x4e/0x5a
[<ffffffff80256714>] filemap_nopage+0x140/0x30c
[<ffffffff802609c6>] __handle_mm_fault+0x1fb/0x9d8
[<ffffffff80532bf7>] do_page_fault+0x42b/0x7b3
[<ffffffff80228273>] __wake_up+0x43/0x50
[<ffffffff80380bd5>] tty_ldisc_deref+0x71/0x76
[<ffffffff8053109d>] error_exit+0x0/0x84
Mem-info:
Node 0 DMA per-cpu:
CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 2: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 3: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: Hot: hi: 186, btch: 31 usd: 31 Cold: hi: 62, btch: 15 usd: 53
CPU 1: Hot: hi: 186, btch: 31 usd: 2 Cold: hi: 62, btch: 15 usd: 60
CPU 2: Hot: hi: 186, btch: 31 usd: 1 Cold: hi: 62, btch: 15 usd: 10
CPU 3: Hot: hi: 186, btch: 31 usd: 25 Cold: hi: 62, btch: 15 usd: 26
Active:90 inactive:496233 dirty:0 writeback:0 unstable:0 free:3485 slab:9119 mapped:32 pagetables:637
Node 0 DMA free:8036kB min:24kB low:28kB high:36kB active:0kB inactive:1856kB present:9376kB pages_scanned:3328 all_unreclaimable? yes
lowmem_reserve[]: 0 2003 2003
Node 0 DMA32 free:5904kB min:5712kB low:7140kB high:8568kB active:360kB inactive:1983092kB present:2052068kB pages_scanned:4587649 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 8036kB
Node 0 DMA32: 78*4kB 29*8kB 1*16kB 1*32kB 1*64kB 1*128kB 2*256kB 1*512kB 0*1024kB 0*2048kB 1*4096kB = 5904kB
Swap cache: add 741067, delete 244673, find 84826/143210, race 680+239
Free swap = 1088572kB
Total swap = 3140668kB
Free swap: 1088572kB
524224 pages of RAM
9619 reserved pages
290 pages shared
496396 pages swap cached
No available memory (MPOL_BIND): kill process 7914 (top) score 0 or a child
Killed process 7914 (top)
nscd invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0
Call Trace:
[<ffffffff80257367>] out_of_memory+0x70/0x25d
[<ffffffff80258f6b>] __alloc_pages+0x22c/0x2b5
[<ffffffff8026d889>] alloc_page_vma+0x71/0x76
[<ffffffff8026937b>] read_swap_cache_async+0x45/0xd8
[<ffffffff80260ede>] __handle_mm_fault+0x713/0x9d8
[<ffffffff80532bf7>] do_page_fault+0x42b/0x7b3
[<ffffffff80238e16>] try_to_del_timer_sync+0x51/0x5a
[<ffffffff80238e2b>] del_timer_sync+0xc/0x16
[<ffffffff8052f939>] schedule_timeout+0x92/0xad
[<ffffffff80238a40>] process_timeout+0x0/0xb
[<ffffffff8029e563>] sys_epoll_wait+0x3e0/0x421
[<ffffffff8022a96b>] default_wake_function+0x0/0xf
[<ffffffff8053109d>] error_exit+0x0/0x84
Mem-info:
Node 0 DMA per-cpu:
CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 2: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 3: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: Hot: hi: 186, btch: 31 usd: 30 Cold: hi: 62, btch: 15 usd: 53
CPU 1: Hot: hi: 186, btch: 31 usd: 2 Cold: hi: 62, btch: 15 usd: 60
CPU 2: Hot: hi: 186, btch: 31 usd: 0 Cold: hi: 62, btch: 15 usd: 14
CPU 3: Hot: hi: 186, btch: 31 usd: 25 Cold: hi: 62, btch: 15 usd: 26
Active:91 inactive:496325 dirty:0 writeback:0 unstable:0 free:3425 slab:9119 mapped:32 pagetables:637
Node 0 DMA free:8036kB min:24kB low:28kB high:36kB active:0kB inactive:1856kB present:9376kB pages_scanned:3328 all_unreclaimable? yes
lowmem_reserve[]: 0 2003 2003
Node 0 DMA32 free:5664kB min:5712kB low:7140kB high:8568kB active:364kB inactive:1983372kB present:2052068kB pages_scanned:4610273 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 8036kB
Node 0 DMA32: 18*4kB 29*8kB 1*16kB 1*32kB 1*64kB 1*128kB 2*256kB 1*512kB 0*1024kB 0*2048kB 1*4096kB = 5664kB
Swap cache: add 741069, delete 244674, find 84826/143212, race 680+239
Free swap = 1088576kB
Total swap = 3140668kB
Free swap: 1088576kB
524224 pages of RAM
9619 reserved pages
293 pages shared
496396 pages swap cached
No available memory (MPOL_BIND): kill process 4166 (nscd) score 0 or a child
Killed process 4166 (nscd)
xfs_repair invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
Call Trace:
[<ffffffff80257367>] out_of_memory+0x70/0x25d
[<ffffffff80258f6b>] __alloc_pages+0x22c/0x2b5
[<ffffffff8026e6a3>] alloc_pages_current+0x74/0x79
[<ffffffff802548c8>] __page_cache_alloc+0xb/0xe
[<ffffffff8025a65f>] __do_page_cache_readahead+0xa1/0x217
[<ffffffff8025ab22>] do_page_cache_readahead+0x4e/0x5a
[<ffffffff80256714>] filemap_nopage+0x140/0x30c
[<ffffffff802609c6>] __handle_mm_fault+0x1fb/0x9d8
[<ffffffff80532bf7>] do_page_fault+0x42b/0x7b3
[<ffffffff802426eb>] autoremove_wake_function+0x0/0x2e
[<ffffffff80244ea4>] up_write+0x9/0xb
[<ffffffff8026602d>] sys_mprotect+0x645/0x764
[<ffffffff8053109d>] error_exit+0x0/0x84
Mem-info:
Node 0 DMA per-cpu:
CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 2: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 3: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: Hot: hi: 186, btch: 31 usd: 30 Cold: hi: 62, btch: 15 usd: 53
CPU 1: Hot: hi: 186, btch: 31 usd: 2 Cold: hi: 62, btch: 15 usd: 60
CPU 2: Hot: hi: 186, btch: 31 usd: 30 Cold: hi: 62, btch: 15 usd: 14
CPU 3: Hot: hi: 186, btch: 31 usd: 25 Cold: hi: 62, btch: 15 usd: 26
Active:91 inactive:496247 dirty:0 writeback:0 unstable:0 free:3394 slab:9119 mapped:32 pagetables:637
Node 0 DMA free:8036kB min:24kB low:28kB high:36kB active:0kB inactive:1856kB present:9376kB pages_scanned:3328 all_unreclaimable? yes
lowmem_reserve[]: 0 2003 2003
Node 0 DMA32 free:5540kB min:5712kB low:7140kB high:8568kB active:364kB inactive:1983300kB present:2052068kB pages_scanned:4631841 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 8036kB
Node 0 DMA32: 1*4kB 22*8kB 1*16kB 1*32kB 1*64kB 1*128kB 2*256kB 1*512kB 0*1024kB 0*2048kB 1*4096kB = 5540kB
Swap cache: add 741070, delete 244674, find 84826/143212, race 680+239
Free swap = 1088576kB
Total swap = 3140668kB
Free swap: 1088576kB
524224 pages of RAM
9619 reserved pages
293 pages shared
496397 pages swap cached
No available memory (MPOL_BIND): kill process 17869 (xfs_repair) score 0 or a child
Killed process 17869 (xfs_repair)
klogd invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0
Call Trace:
[<ffffffff80257367>] out_of_memory+0x70/0x25d
[<ffffffff80258f6b>] __alloc_pages+0x22c/0x2b5
[<ffffffff8025afc2>] __pagevec_lru_add_active+0xce/0xde
[<ffffffff8026d889>] alloc_page_vma+0x71/0x76
[<ffffffff8026937b>] read_swap_cache_async+0x45/0xd8
[<ffffffff80260ede>] __handle_mm_fault+0x713/0x9d8
[<ffffffff80532bf7>] do_page_fault+0x42b/0x7b3
[<ffffffff802426eb>] autoremove_wake_function+0x0/0x2e
[<ffffffff8053109d>] error_exit+0x0/0x84
...
* Re: 2.6.20-rc5: cp 18gb 18gb.2 = OOM killer, reproducible just like 2.6.19.2
2007-01-23 0:37 ` Donald Douwsma
@ 2007-01-23 1:12 ` Andrew Morton
0 siblings, 0 replies; 24+ messages in thread
From: Andrew Morton @ 2007-01-23 1:12 UTC (permalink / raw)
To: Donald Douwsma; +Cc: Justin Piszcz, linux-kernel, linux-raid, xfs
On Tue, 23 Jan 2007 11:37:09 +1100
Donald Douwsma <donaldd@sgi.com> wrote:
> Andrew Morton wrote:
> >> On Sun, 21 Jan 2007 14:27:34 -0500 (EST) Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> >> Why does copying an 18GB file on a 74GB raptor raid1 cause the kernel to invoke
> >> the OOM killer and kill all of my processes?
> >
> > What's that? Software raid or hardware raid? If the latter, which driver?
>
> I've hit this using local disk while testing xfs built against 2.6.20-rc4 (SMP x86_64)
>
> dmesg follows; I'm not sure whether anything in it after the first event is
> useful, as our automated tests continued on after the failure.
This looks different.
> ...
>
> Mem-info:
> Node 0 DMA per-cpu:
> CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
> CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
> CPU 2: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
> CPU 3: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
> Node 0 DMA32 per-cpu:
> CPU 0: Hot: hi: 186, btch: 31 usd: 31 Cold: hi: 62, btch: 15 usd: 53
> CPU 1: Hot: hi: 186, btch: 31 usd: 2 Cold: hi: 62, btch: 15 usd: 60
> CPU 2: Hot: hi: 186, btch: 31 usd: 20 Cold: hi: 62, btch: 15 usd: 47
> CPU 3: Hot: hi: 186, btch: 31 usd: 25 Cold: hi: 62, btch: 15 usd: 56
> Active:76 inactive:495856 dirty:0 writeback:0 unstable:0 free:3680 slab:9119 mapped:32 pagetables:637
No dirty pages, no pages under writeback.
> Node 0 DMA free:8036kB min:24kB low:28kB high:36kB active:0kB inactive:1856kB present:9376kB pages_scanned:3296 all_unreclaimable? yes
> lowmem_reserve[]: 0 2003 2003
> Node 0 DMA32 free:6684kB min:5712kB low:7140kB high:8568kB active:304kB inactive:1981624kB present:2052068kB
Inactive list is filled.
> pages_scanned:4343329 all_unreclaimable? yes
We scanned our guts out and decided that nothing was reclaimable.
> No available memory (MPOL_BIND): kill process 3492 (hald) score 0 or a child
> No available memory (MPOL_BIND): kill process 7914 (top) score 0 or a child
> No available memory (MPOL_BIND): kill process 4166 (nscd) score 0 or a child
> No available memory (MPOL_BIND): kill process 17869 (xfs_repair) score 0 or a child
But in all cases a constrained memory policy was in use.
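For context: MPOL_BIND is a NUMA memory policy that restricts a task's
allocations to an explicit set of nodes, so when reclaim fails on those nodes
the allocation can OOM even if memory exists elsewhere. Purely as an
illustration of how such a constraint arises (nothing in the report shows how
the policy was actually set on these processes), a bind policy can be applied
from the shell like this:

  # hypothetical example: force all of xfs_repair's allocations onto node 0
  numactl --membind=0 xfs_repair /dev/sdb5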
* Re: 2.6.20-rc5: cp 18gb 18gb.2 = OOM killer, reproducible just like 2.6.19.2
2007-01-22 19:57 ` Andrew Morton
@ 2007-01-24 23:40 ` Justin Piszcz
-1 siblings, 0 replies; 24+ messages in thread
From: Justin Piszcz @ 2007-01-24 23:40 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-raid, xfs
On Mon, 22 Jan 2007, Andrew Morton wrote:
> ...
>
> Important information from the oom-killing event is missing. Please send
> it all.
>
> From your earlier reports we have several hundred MB of ZONE_NORMAL memory
> which has gone AWOL.
>
> Please include /proc/meminfo from after the oom-killing.
>
> Please work out what is using all that slab memory, via /proc/slabinfo.
>
> After the oom-killing, please see if you can free up the ZONE_NORMAL memory
> via a few `echo 3 > /proc/sys/vm/drop_caches' commands. See if you can
> work out what happened to the missing couple-of-hundred MB from
> ZONE_NORMAL.
Trying this now.
* Re: 2.6.20-rc5: cp 18gb 18gb.2 = OOM killer, reproducible just like 2.6.19.2
2007-01-22 19:57 ` Andrew Morton
@ 2007-01-25 0:10 ` Justin Piszcz
2007-01-25 0:36 ` Nick Piggin
2007-01-25 1:21 ` Bill Cizek
-1 siblings, 2 replies; 24+ messages in thread
From: Justin Piszcz @ 2007-01-25 0:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-raid, xfs
On Mon, 22 Jan 2007, Andrew Morton wrote:
> ...
>
> After the oom-killing, please see if you can free up the ZONE_NORMAL memory
> via a few `echo 3 > /proc/sys/vm/drop_caches' commands. See if you can
> work out what happened to the missing couple-of-hundred MB from
> ZONE_NORMAL.
Running with PREEMPT OFF lets me copy the file!! The machine LAGS
occasionally every 5-30-60 seconds or so VERY BADLY, we're talking 5-10
seconds of lag, but hey, it does not crash!! I will boot the older kernel
with preempt on and see if I can get you that information you requested.
Justin.
* Re: 2.6.20-rc5: cp 18gb 18gb.2 = OOM killer, reproducible just like 2.6.19.2
2007-01-25 0:10 ` Justin Piszcz
@ 2007-01-25 0:36 ` Nick Piggin
2007-01-25 11:11 ` Justin Piszcz
2007-01-25 1:21 ` Bill Cizek
1 sibling, 1 reply; 24+ messages in thread
From: Nick Piggin @ 2007-01-25 0:36 UTC (permalink / raw)
To: Justin Piszcz; +Cc: Andrew Morton, linux-kernel, linux-raid, xfs
Justin Piszcz wrote:
>
> On Mon, 22 Jan 2007, Andrew Morton wrote:
>>After the oom-killing, please see if you can free up the ZONE_NORMAL memory
>>via a few `echo 3 > /proc/sys/vm/drop_caches' commands. See if you can
>>work out what happened to the missing couple-of-hundred MB from
>>ZONE_NORMAL.
>>
>
> Running with PREEMPT OFF lets me copy the file!! The machine LAGS
> occasionally every 5-30-60 seconds or so VERY BADLY, we're talking 5-10
> seconds of lag, but hey, it does not crash!! I will boot the older kernel
> with preempt on and see if I can get you that information you requested.
It wouldn't be a bad idea to recompile the new kernel with preempt on
and get the info from there.
It is usually best to be working with the most recent kernels. We can
always backport any important fixes if we need to.
Thanks,
Nick
--
SUSE Labs, Novell Inc.
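For reference, the preemption model being discussed is a compile-time choice.
A quick way to check what a given kernel was built with (assuming the usual
config file locations; adjust the path to your distribution):

  grep CONFIG_PREEMPT /boot/config-$(uname -r)  # or .config in the build tree
  # CONFIG_PREEMPT_NONE=y       -> no forced preemption ("server")
  # CONFIG_PREEMPT_VOLUNTARY=y  -> voluntary preemption points ("desktop")
  # CONFIG_PREEMPT=y            -> full kernel preemption ("low-latency")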
* Re: 2.6.20-rc5: cp 18gb 18gb.2 = OOM killer, reproducible just like 2.6.19.2
2007-01-25 0:36 ` Nick Piggin
@ 2007-01-25 11:11 ` Justin Piszcz
0 siblings, 0 replies; 24+ messages in thread
From: Justin Piszcz @ 2007-01-25 11:11 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-raid, xfs
On Thu, 25 Jan 2007, Nick Piggin wrote:
> ...
>
> It wouldn't be a bad idea to recompile the new kernel with preempt on
> and get the info from there.
>
> It is usually best to be working with the most recent kernels. We can
> always backport any important fixes if we need to.
In my tests for the most part I am using the latest kernels.
Justin.
* Re: 2.6.20-rc5: cp 18gb 18gb.2 = OOM killer, reproducible just like 2.6.19.2
2007-01-25 0:10 ` Justin Piszcz
2007-01-25 0:36 ` Nick Piggin
@ 2007-01-25 1:21 ` Bill Cizek
2007-01-25 11:13 ` Justin Piszcz
1 sibling, 1 reply; 24+ messages in thread
From: Bill Cizek @ 2007-01-25 1:21 UTC (permalink / raw)
To: Justin Piszcz; +Cc: linux-kernel, linux-raid, xfs
Justin Piszcz wrote:
> ...
> Running with PREEMPT OFF lets me copy the file!! The machine LAGS
> occasionally every 5-30-60 seconds or so VERY BADLY, we're talking 5-10
> seconds of lag, but hey, it does not crash!! I will boot the older kernel
> with preempt on and see if I can get you that information you requested.
>
Justin,
According to your kernel_ring_buffer.txt (attached to another email),
you are using "anticipatory" as your io scheduler:
289 Jan 24 18:35:25 p34 kernel: [    0.142130] io scheduler noop registered
290 Jan 24 18:35:25 p34 kernel: [    0.142194] io scheduler anticipatory registered (default)
I had a problem with this scheduler where my system would occasionally
lock up during heavy I/O. Sometimes it would fix itself, sometimes I had
to reboot. I changed to the "CFQ" io scheduler and my system has worked
fine since then.
CFQ has to be built into the kernel (under Block layer -> IO Schedulers).
It can be selected as the default, or you can set it at runtime:
echo cfq > /sys/block/<disk>/queue/scheduler
...
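Reading the same file back shows the available schedulers with the active one
in brackets; if all four schedulers are built in, the output looks roughly
like this (sda here is just an example device):

  cat /sys/block/sda/queue/scheduler
  noop anticipatory deadline [cfq]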
Hope this helps,
Bill
* Re: 2.6.20-rc5: cp 18gb 18gb.2 = OOM killer, reproducible just like 2.6.19.2
2007-01-25 1:21 ` Bill Cizek
@ 2007-01-25 11:13 ` Justin Piszcz
0 siblings, 0 replies; 24+ messages in thread
From: Justin Piszcz @ 2007-01-25 11:13 UTC (permalink / raw)
To: Bill Cizek; +Cc: linux-kernel, linux-raid, xfs, Alan Piszcz
On Wed, 24 Jan 2007, Bill Cizek wrote:
> ...
>
> I had a problem with this scheduler where my system would occasionally
> lock up during heavy I/O. Sometimes it would fix itself, sometimes I had
> to reboot. I changed to the "CFQ" io scheduler and my system has worked
> fine since then.
>
> CFQ has to be built into the kernel (under Block layer -> IO Schedulers).
> It can be selected as the default, or you can set it at runtime:
>
> echo cfq > /sys/block/<disk>/queue/scheduler
> ...
>
> Hope this helps,
> Bill
I used to run CFQ a while back, but then I switched over to AS as it has
better performance for my workloads. Currently I am running with PREEMPT
off; if I see any additional issues, I will switch to the CFQ scheduler.
Right now, it's the OOM killer that is going crazy.
Justin.
* Re: 2.6.20-rc5: cp 18gb 18gb.2 = OOM killer, reproducible just like 2.6.19.2
2007-01-22 19:57 ` Andrew Morton
@ 2007-01-25 0:34 ` Justin Piszcz
-1 siblings, 0 replies; 24+ messages in thread
From: Justin Piszcz @ 2007-01-25 0:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-raid, xfs
[-- Attachment #1: Type: TEXT/PLAIN, Size: 5044 bytes --]
There is some XFS stuff in the dmesg too, which is why I am continuing to
include the XFS mailing list. Scroll down to read more.
On Mon, 22 Jan 2007, Andrew Morton wrote:
> ...
>
> Important information from the oom-killing event is missing. Please send
> it all.
>
> Please include /proc/meminfo from after the oom-killing.
>
> Please work out what is using all that slab memory, via /proc/slabinfo.
>
> After the oom-killing, please see if you can free up the ZONE_NORMAL memory
> via a few `echo 3 > /proc/sys/vm/drop_caches' commands. See if you can
> work out what happened to the missing couple-of-hundred MB from
> ZONE_NORMAL.
I have done all you said, and I ran a constant loop of vmstat & cat
/proc/slabinfo. Toward the end of the file(s) (_after_oom_killer) is when I
ran:
p34:~# echo 3 > /proc/sys/vm/drop_caches
p34:~# echo 3 > /proc/sys/vm/drop_caches
p34:~# echo 3 > /proc/sys/vm/drop_caches
p34:~#
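The "constant loop" mentioned above would look something like the following
(a reconstruction, not the actual script that was run; the one-second
interval and file names are guesses):

  # append a timestamped snapshot of vmstat + slabinfo once a second
  while true; do
      date >> _vmstat.txt;   vmstat 1 2 | tail -1 >> _vmstat.txt
      date >> _slabinfo.txt; cat /proc/slabinfo   >> _slabinfo.txt
      sleep 1
  done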
The tarball will yield the following files you requested:
4.0K _proc_meminfo_after_oom_killing.txt
976K _slabinfo_after_oom_killer.txt
460K _slabinfo.txt
8.0K _vmstat_after_oom_killer.txt
4.0K _vmstat.txt
I am going back to 2.6.20-rc5-6 w/ NO-PRE-EMPT, as this was about 10x more
stable in copying operations and everything else. If you need any more
diagnostics/crashes of this kind, let me know, because I can make it
happen every single time with pre-empt on.
And, not sure if it matters, but when I copy 18gb to 18gb.2 I have seen it
make it to various stages. The size 18gb.2 got to on each attempt:
1. 7.8G
2. 4.3G
3. 1.2G
Back to 2.6.20-rc5 w/ no pre-emption; until then, or if you request
anything else, let me know. Also, I ran echo t > /proc/sysrq-trigger.
I have attached THE ENTIRE syslog/kernel dmesg/ring buffer, so you'll see
my system booting up with no problems/errors up until now, at which point
I am going back to the old kernel. This is attached as
kernel_ring_buffer.txt.bz2.
Andrew,
Please let me know if any of this helps!
Justin.
[-- Attachment #2: Type: APPLICATION/octet-stream, Size: 33223 bytes --]
[-- Attachment #3: Type: APPLICATION/octet-stream, Size: 20488 bytes --]