linux-kernel.vger.kernel.org archive mirror
* 2.6.39-rc4+: oom-killer busy killing tasks
@ 2011-04-22  1:57 Christian Kujau
  2011-04-22  2:58 ` Minchan Kim
  2011-04-24 23:46 ` Dave Chinner
  0 siblings, 2 replies; 42+ messages in thread
From: Christian Kujau @ 2011-04-22  1:57 UTC (permalink / raw)
  To: LKML; +Cc: xfs

Hi,

after the block layer regression[0] seemed to be fixed, the machine 
appeared to be running fine. But after putting some disk I/O load on the 
system (PowerBook G4) it became unresponsive, I/O wait went up high and I 
could see that the OOM killer was killing processes. Logging in via SSH 
was sometimes possible, but each session was killed shortly after, so I 
could not do much.

The box finally rebooted itself; the logfile recorded something 
xfs-related in the first backtrace, hence I'm cc'ing the xfs list too:

du invoked oom-killer: gfp_mask=0x842d0, order=0, oom_adj=0, oom_score_adj=0
Call Trace:
[c0009ce4] show_stack+0x70/0x1bc (unreliable)
[c008f508] T.528+0x74/0x1cc
[c008f734] T.526+0xd4/0x2a0
[c008fb7c] out_of_memory+0x27c/0x360
[c0093b3c] __alloc_pages_nodemask+0x6f8/0x708
[c00c00b4] new_slab+0x244/0x27c
[c00c0620] T.879+0x1cc/0x37c
[c00c08d0] kmem_cache_alloc+0x100/0x108
[c01cb2b8] kmem_zone_alloc+0xa4/0x114
[c01a7d58] xfs_inode_alloc+0x40/0x13c
[c01a8218] xfs_iget+0x258/0x5a0
[c01c922c] xfs_lookup+0xf8/0x114
[c01d70b0] xfs_vn_lookup+0x5c/0xb0
[c00d14c8] d_alloc_and_lookup+0x54/0x90
[c00d1d4c] do_lookup+0x248/0x2bc
[c00d33cc] path_lookupat+0xfc/0x8f4
[c00d3bf8] do_path_lookup+0x34/0xac
[c00d53e0] user_path_at+0x64/0xb4
[c00ca638] vfs_fstatat+0x58/0xbc
[c00ca6c0] sys_fstatat64+0x24/0x50
[c00124f4] ret_from_syscall+0x0/0x38
 --- Exception: c01 at 0xff4b050
   LR = 0x10008cf8


This is with today's git (91e8549bde...); full log & .config on: 

  http://nerdbynature.de/bits/2.6.39-rc4/oom/

Thanks,
Christian.

[0] https://lkml.org/lkml/2011/4/14/702
-- 
BOFH excuse #117:

the printer thinks its a router.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-22  1:57 2.6.39-rc4+: oom-killer busy killing tasks Christian Kujau
@ 2011-04-22  2:58 ` Minchan Kim
  2011-04-22  3:03   ` Christian Kujau
  2011-04-22 17:41   ` Christian Kujau
  2011-04-24 23:46 ` Dave Chinner
  1 sibling, 2 replies; 42+ messages in thread
From: Minchan Kim @ 2011-04-22  2:58 UTC (permalink / raw)
  To: Christian Kujau; +Cc: LKML, xfs

On Fri, Apr 22, 2011 at 10:57 AM, Christian Kujau <lists@nerdbynature.de> wrote:
> Hi,
>
> after the block layer regression[0] seemed to be fixed, the machine
> appeared to be running fine. But after putting some disk I/O load on the
> system (PowerBook G4) it became unresponsive, I/O wait went up high and I
> could see that the OOM killer was killing processes. Logging in via SSH
> was sometimes possible, but each session was killed shortly after, so I
> could not do much.
>
> The box finally rebooted itself; the logfile recorded something
> xfs-related in the first backtrace, hence I'm cc'ing the xfs list too:
>
> du invoked oom-killer: gfp_mask=0x842d0, order=0, oom_adj=0, oom_score_adj=0
> Call Trace:
> [c0009ce4] show_stack+0x70/0x1bc (unreliable)
> [c008f508] T.528+0x74/0x1cc
> [c008f734] T.526+0xd4/0x2a0
> [c008fb7c] out_of_memory+0x27c/0x360
> [c0093b3c] __alloc_pages_nodemask+0x6f8/0x708
> [c00c00b4] new_slab+0x244/0x27c
> [c00c0620] T.879+0x1cc/0x37c
> [c00c08d0] kmem_cache_alloc+0x100/0x108
> [c01cb2b8] kmem_zone_alloc+0xa4/0x114
> [c01a7d58] xfs_inode_alloc+0x40/0x13c
> [c01a8218] xfs_iget+0x258/0x5a0
> [c01c922c] xfs_lookup+0xf8/0x114
> [c01d70b0] xfs_vn_lookup+0x5c/0xb0
> [c00d14c8] d_alloc_and_lookup+0x54/0x90
> [c00d1d4c] do_lookup+0x248/0x2bc
> [c00d33cc] path_lookupat+0xfc/0x8f4
> [c00d3bf8] do_path_lookup+0x34/0xac
> [c00d53e0] user_path_at+0x64/0xb4
> [c00ca638] vfs_fstatat+0x58/0xbc
> [c00ca6c0] sys_fstatat64+0x24/0x50
> [c00124f4] ret_from_syscall+0x0/0x38
>  --- Exception: c01 at 0xff4b050
>   LR = 0x10008cf8
>
>
> This is with today's git (91e8549bde...); full log & .config on:
>
>  http://nerdbynature.de/bits/2.6.39-rc4/oom/

You would be trying to allocate a page from the DMA zone, as you don't
have a normal zone.

Although the DMA zone has about 3M free, the zone's free pages are
below its min watermark, so zone_watermark_ok() would fail.
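That free-vs-min comparison can be read straight out of /proc/zoneinfo.
A hedged sketch (the parser and output format are mine; the field names
come from the zoneinfo layout, and values are in pages, not kB):

```shell
# Print each zone's free page count next to its min watermark; a zone
# failing zone_watermark_ok() shows up as free < min. Takes an optional
# file argument so it can also be run on a saved copy of zoneinfo.
zone_watermarks() {
    awk '$1 == "Node" { zone = $4 }
         $1 == "pages" && $2 == "free" { free = $3 }
         $1 == "min" { printf "%-8s free=%s min=%s\n", zone, free, $2 }' \
        "${1:-/proc/zoneinfo}"
}
```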

But I wonder why the VM can't reclaim the pages. As I see from the log,
there are lots of slab pages (710M) in the DMA zone while the LRU lists
are very small. Slab pages are something the VM has trouble reclaiming.
I am not sure 710M of slab is a reasonable size. Did you ever see the
same problem on older kernels?
If you first see the problem in 2.6.39-rc4, it may be a regression
(e.g. a slab memory leak).
Could you capture slabinfo (e.g. cat /proc/slabinfo) right before the
OOM happens? It could point to the culprit.
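Capturing /proc/slabinfo "right before" an unpredictable OOM is easiest
to automate; a sketch (the destination directory and the 30-second
interval below are illustrative choices, not from the thread):

```shell
# Copy /proc/slabinfo to a timestamped file; run in a loop (or from
# cron) so the snapshot closest to the OOM survives the crash/reboot.
snapshot_slabinfo() {
    src=${1:-/proc/slabinfo}
    dir=${2:-/var/tmp/slab}
    mkdir -p "$dir"
    cp "$src" "$dir/slabinfo.$(date +%Y%m%d-%H%M%S)"
}
# e.g.: while sleep 30; do snapshot_slabinfo; done
```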

-- 
Kind regards,
Minchan Kim


* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-22  2:58 ` Minchan Kim
@ 2011-04-22  3:03   ` Christian Kujau
  2011-04-22 17:41   ` Christian Kujau
  1 sibling, 0 replies; 42+ messages in thread
From: Christian Kujau @ 2011-04-22  3:03 UTC (permalink / raw)
  To: Minchan Kim; +Cc: LKML, xfs

On Fri, 22 Apr 2011 at 11:58, Minchan Kim wrote:
> But I wonder why the VM can't reclaim the pages. As I see from the log,
> there are lots of slab pages (710M) in the DMA zone while the LRU lists
> are very small. Slab pages are something the VM has trouble reclaiming.
> I am not sure 710M of slab is a reasonable size. Did you ever see the
> same problem on older kernels?

No; I forgot to add that 2.6.38 (and before) did not show this behaviour.

> If you first see the problem in 2.6.39-rc4, it may be a regression
> (e.g. a slab memory leak).

I would think so. I'll try to bisect, but with one compilation taking 
~20min on this box, this might take a while.
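For reference, the bisect mechanics look like this. The sketch below
runs in a throwaway repo so it is self-contained; in a real run the
tags would be v2.6.38 (good) and v2.6.39-rc4 (bad), and each step means
a ~20min build plus a boot and a du(1) run:

```shell
# Self-contained demo of the git-bisect workflow; "step1".."step5"
# stand in for real kernel tags.
repo=$(mktemp -d)
cd "$repo" && git init -q
for i in 1 2 3 4 5; do
    git -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "commit $i"
    git tag "step$i"
done
git bisect start
git bisect bad step5      # the kernel that OOMs
git bisect good step1     # the last known-good kernel
# git now checks out the midpoint; build, boot, run the workload, then
# mark it "git bisect good" or "git bisect bad" until the culprit shows.
git bisect reset
```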

> Could you capture slabinfo (e.g. cat /proc/slabinfo) right before the
> OOM happens?

When it happens again, I'll try to get this.

Thanks for replying,
Christian.
-- 
BOFH excuse #73:

Daemons did it


* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-22  2:58 ` Minchan Kim
  2011-04-22  3:03   ` Christian Kujau
@ 2011-04-22 17:41   ` Christian Kujau
  2011-04-22 18:46     ` Christian Kujau
  1 sibling, 1 reply; 42+ messages in thread
From: Christian Kujau @ 2011-04-22 17:41 UTC (permalink / raw)
  To: Minchan Kim; +Cc: LKML, xfs, rientjes

On Fri, 22 Apr 2011 at 11:58, Minchan Kim wrote:
> You would be trying to allocate a page from the DMA zone, as you don't
> have a normal zone.
> 
> Although the DMA zone has about 3M free, the zone's free pages are
> below its min watermark, so zone_watermark_ok() would fail.
> 
> But I wonder why the VM can't reclaim the pages. As I see from the log,
> there are lots of slab pages (710M) in the DMA zone while the LRU lists
> are very small. Slab pages are something the VM has trouble reclaiming.
> I am not sure 710M of slab is a reasonable size. Did you ever see the
> same problem on older kernels?
> If you first see the problem in 2.6.39-rc4, it may be a regression
> (e.g. a slab memory leak).
> Could you capture slabinfo (e.g. cat /proc/slabinfo) right before the
> OOM happens? It could point to the culprit.

It happened again, and again it's du(1) invoking the OOM killer (I'm 
running du(1) across a few directories every 6hrs). This time I got 
/proc/slabinfo as well:

  http://nerdbynature.de/bits/2.6.39-rc4/oom/
  * NOTE: slabinfo.txt.gz deflates to ~40MB!
  * messages-2.txt contains the messages of last night's OOM, which
   started at Apr 22 00:10:16 PDT

Thanks for looking into it.

Christian.
-- 
BOFH excuse #140:

LBNC (luser brain not connected)


* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-22 17:41   ` Christian Kujau
@ 2011-04-22 18:46     ` Christian Kujau
  2011-04-22 22:47       ` Minchan Kim
  0 siblings, 1 reply; 42+ messages in thread
From: Christian Kujau @ 2011-04-22 18:46 UTC (permalink / raw)
  To: Minchan Kim; +Cc: LKML, xfs, rientjes

On Fri, 22 Apr 2011 at 10:41, Christian Kujau wrote:
> It happened again, and again it's du(1) invoking the OOM killer (I'm 
> running du(1) across a few directories every 6hrs).

Btw, xfs* showed up again in the first backtrace. This might just be 
because du(1) came across an XFS filesystem (~1TB), or it might be something else:

[20546.774480] du invoked oom-killer: gfp_mask=0x842d0, order=0, oom_adj=0, oom_score_adj=0
[20546.776130] Call Trace:
[20546.777754] [ed53f9e0] [c0009ce4] show_stack+0x70/0x1bc (unreliable)
[20546.779363] [ed53fa20] [c008f508] T.528+0x74/0x1cc
[20546.780945] [ed53fa70] [c008f734] T.526+0xd4/0x2a0
[20546.782484] [ed53fab0] [c008fb7c] out_of_memory+0x27c/0x360
[20546.784028] [ed53fb00] [c0093b3c] __alloc_pages_nodemask+0x6f8/0x708
[20546.785594] [ed53fba0] [c00c00b4] new_slab+0x244/0x27c
[20546.787143] [ed53fbd0] [c00c0620] T.879+0x1cc/0x37c
[20546.788687] [ed53fbf0] [c00c08d0] kmem_cache_alloc+0x100/0x108
[20546.790244] [ed53fc10] [c01cb2b8] kmem_zone_alloc+0xa4/0x114
[20546.791780] [ed53fc40] [c01a7d58] xfs_inode_alloc+0x40/0x13c
[20546.793273] [ed53fc60] [c01a8218] xfs_iget+0x258/0x5a0
[20546.794701] [ed53fcc0] [c01c922c] xfs_lookup+0xf8/0x114
[20546.796115] [ed53fd00] [c01d70b0] xfs_vn_lookup+0x5c/0xb0
[20546.797531] [ed53fd20] [c00d14c8] d_alloc_and_lookup+0x54/0x90
[20546.798914] [ed53fd40] [c00d1d4c] do_lookup+0x248/0x2bc
[20546.800281] [ed53fd90] [c00d33cc] path_lookupat+0xfc/0x8f4
[20546.801656] [ed53fe10] [c00d3bf8] do_path_lookup+0x34/0xac
[20546.803004] [ed53fe30] [c00d53e0] user_path_at+0x64/0xb4
[20546.804336] [ed53feb0] [c00ca638] vfs_fstatat+0x58/0xbc
[20546.805682] [ed53fee0] [c00ca6c0] sys_fstatat64+0x24/0x50
[20546.807026] [ed53ff40] [c00124f4] ret_from_syscall+0x0/0x38
[20546.808380] --- Exception: c01 at 0xff4b050
[20546.808384]     LR = 0x10008cf8
[20546.811051] Mem-Info:
[20546.812324] DMA per-cpu:
[20546.813567] CPU    0: hi:  186, btch:  31 usd: 107
[20546.814819] HighMem per-cpu:
[20546.816065] CPU    0: hi:  186, btch:  31 usd:  30
[20546.817328] active_anon:8813 inactive_anon:9740 isolated_anon:0
[20546.817332]  active_file:22597 inactive_file:20784 isolated_file:0
[20546.817336]  unevictable:0 dirty:0 writeback:0 unstable:0
[20546.817340]  free:69629 slab_reclaimable:181633 slab_unreclaimable:4393
[20546.817345]  mapped:3810 shmem:169 pagetables:395 bounce:0
[20546.823525] DMA free:3536kB min:3532kB low:4412kB high:5296kB active_anon:0kB inactive_anon:0kB active_file:32kB inactive_file:256kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:780288kB mlocked:0kB dirty:0kB writeback:0kB mapped:8kB shmem:0kB slab_reclaimable:726532kB slab_unreclaimable:17572kB kernel_stack:1064kB pagetables:1580kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:449 all_unreclaimable? yes
[20546.828690] lowmem_reserve[]: 0 0 508 508
[20546.830057] HighMem free:274980kB min:508kB low:1096kB high:1684kB active_anon:35252kB inactive_anon:38960kB active_file:90356kB inactive_file:82880kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:520192kB mlocked:0kB dirty:0kB writeback:0kB mapped:15232kB shmem:676kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[20546.835656] lowmem_reserve[]: 0 0 0 0
[20546.837081] DMA: 70*4kB 295*8kB 0*16kB 0*32kB 0*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 3536kB
[20546.838629] HighMem: 4245*4kB 3604*8kB 2151*16kB 1238*32kB 736*64kB 296*128kB 96*256kB 47*512kB 17*1024kB 2*2048kB 0*4096kB = 274980kB
[20546.841678] 44154 total pagecache pages
[20546.843160] 604 pages in swap cache
[20546.844631] Swap cache stats: add 2932, delete 2328, find 417/545
[20546.846145] Free swap  = 1164320kB
[20546.847647] Total swap = 1171868kB
[20546.882384] 327680 pages RAM
[20546.883921] 131072 pages HighMem
[20546.885438] 6467 pages reserved
[20546.886903] 17653 pages shared
[20546.888362] 246912 pages non-shared

Again, full dmesg & config on: http://nerdbynature.de/bits/2.6.39-rc4/oom/

Thanks,
Christian.
-- 
BOFH excuse #158:

Defunct processes


* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-22 18:46     ` Christian Kujau
@ 2011-04-22 22:47       ` Minchan Kim
  0 siblings, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2011-04-22 22:47 UTC (permalink / raw)
  To: Christian Kujau; +Cc: LKML, xfs, rientjes, Dave Chinner

On Sat, Apr 23, 2011 at 3:46 AM, Christian Kujau <lists@nerdbynature.de> wrote:
> On Fri, 22 Apr 2011 at 10:41, Christian Kujau wrote:
>> It happened again, and again it's du(1) invoking the OOM killer (I'm
>> running du(1) across a few directories every 6hrs).
>
> Btw, xfs* showed up again in the first backtrace. This might just be
> because du(1) came across an XFS filesystem (~1TB), or it might be something else:
>
> [20546.774480] du invoked oom-killer: gfp_mask=0x842d0, order=0, oom_adj=0, oom_score_adj=0
> [20546.776130] Call Trace:
> [20546.777754] [ed53f9e0] [c0009ce4] show_stack+0x70/0x1bc (unreliable)
> [20546.779363] [ed53fa20] [c008f508] T.528+0x74/0x1cc
> [20546.780945] [ed53fa70] [c008f734] T.526+0xd4/0x2a0
> [20546.782484] [ed53fab0] [c008fb7c] out_of_memory+0x27c/0x360
> [20546.784028] [ed53fb00] [c0093b3c] __alloc_pages_nodemask+0x6f8/0x708
> [20546.785594] [ed53fba0] [c00c00b4] new_slab+0x244/0x27c
> [20546.787143] [ed53fbd0] [c00c0620] T.879+0x1cc/0x37c
> [20546.788687] [ed53fbf0] [c00c08d0] kmem_cache_alloc+0x100/0x108
> [20546.790244] [ed53fc10] [c01cb2b8] kmem_zone_alloc+0xa4/0x114
> [20546.791780] [ed53fc40] [c01a7d58] xfs_inode_alloc+0x40/0x13c
> [20546.793273] [ed53fc60] [c01a8218] xfs_iget+0x258/0x5a0
> [20546.794701] [ed53fcc0] [c01c922c] xfs_lookup+0xf8/0x114
> [20546.796115] [ed53fd00] [c01d70b0] xfs_vn_lookup+0x5c/0xb0
> [20546.797531] [ed53fd20] [c00d14c8] d_alloc_and_lookup+0x54/0x90
> [20546.798914] [ed53fd40] [c00d1d4c] do_lookup+0x248/0x2bc
> [20546.800281] [ed53fd90] [c00d33cc] path_lookupat+0xfc/0x8f4
> [20546.801656] [ed53fe10] [c00d3bf8] do_path_lookup+0x34/0xac
> [20546.803004] [ed53fe30] [c00d53e0] user_path_at+0x64/0xb4
> [20546.804336] [ed53feb0] [c00ca638] vfs_fstatat+0x58/0xbc
> [20546.805682] [ed53fee0] [c00ca6c0] sys_fstatat64+0x24/0x50
> [20546.807026] [ed53ff40] [c00124f4] ret_from_syscall+0x0/0x38
> [20546.808380] --- Exception: c01 at 0xff4b050
> [20546.808384]     LR = 0x10008cf8
> [20546.811051] Mem-Info:
> [20546.812324] DMA per-cpu:
> [20546.813567] CPU    0: hi:  186, btch:  31 usd: 107
> [20546.814819] HighMem per-cpu:
> [20546.816065] CPU    0: hi:  186, btch:  31 usd:  30
> [20546.817328] active_anon:8813 inactive_anon:9740 isolated_anon:0
> [20546.817332]  active_file:22597 inactive_file:20784 isolated_file:0
> [20546.817336]  unevictable:0 dirty:0 writeback:0 unstable:0
> [20546.817340]  free:69629 slab_reclaimable:181633 slab_unreclaimable:4393
> [20546.817345]  mapped:3810 shmem:169 pagetables:395 bounce:0
> [20546.823525] DMA free:3536kB min:3532kB low:4412kB high:5296kB active_anon:0kB inactive_anon:0kB active_file:32kB inactive_file:256kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:780288kB mlocked:0kB dirty:0kB writeback:0kB mapped:8kB shmem:0kB slab_reclaimable:726532kB slab_unreclaimable:17572kB kernel_stack:1064kB pagetables:1580kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:449 all_unreclaimable? yes
> [20546.828690] lowmem_reserve[]: 0 0 508 508
> [20546.830057] HighMem free:274980kB min:508kB low:1096kB high:1684kB active_anon:35252kB inactive_anon:38960kB active_file:90356kB inactive_file:82880kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:520192kB mlocked:0kB dirty:0kB writeback:0kB mapped:15232kB shmem:676kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [20546.835656] lowmem_reserve[]: 0 0 0 0
> [20546.837081] DMA: 70*4kB 295*8kB 0*16kB 0*32kB 0*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 3536kB
> [20546.838629] HighMem: 4245*4kB 3604*8kB 2151*16kB 1238*32kB 736*64kB 296*128kB 96*256kB 47*512kB 17*1024kB 2*2048kB 0*4096kB = 274980kB
> [20546.841678] 44154 total pagecache pages
> [20546.843160] 604 pages in swap cache
> [20546.844631] Swap cache stats: add 2932, delete 2328, find 417/545
> [20546.846145] Free swap  = 1164320kB
> [20546.847647] Total swap = 1171868kB
> [20546.882384] 327680 pages RAM
> [20546.883921] 131072 pages HighMem
> [20546.885438] 6467 pages reserved
> [20546.886903] 17653 pages shared
> [20546.888362] 246912 pages non-shared
>
> Again, full dmesg & config on: http://nerdbynature.de/bits/2.6.39-rc4/oom/
>

The xfs_inode slab is using about 200M, which seems rather large.
Let's Cc the XFS guys.
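That per-cache figure can be derived from any slabinfo snapshot as
num_objs * objsize. A sketch (column positions per the slabinfo 2.x
format; the awk/sort pipeline is mine):

```shell
# Print each cache's approximate memory footprint in MB, largest first.
# slabinfo columns: name <active_objs> <num_objs> <objsize> <objperslab>
# ...; the first two lines are headers, hence NR > 2.
slab_usage() {
    awk 'NR > 2 { printf "%-24s %8.1f MB\n", $1, $3 * $4 / 1048576 }' \
        "${1:-/proc/slabinfo}" | sort -k2 -rn
}
# e.g.: slab_usage slabinfo-4.txt | head
```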


> Thanks,
> Christian.
> --
> BOFH excuse #158:
>
> Defunct processes
>



-- 
Kind regards,
Minchan Kim


* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-22  1:57 2.6.39-rc4+: oom-killer busy killing tasks Christian Kujau
  2011-04-22  2:58 ` Minchan Kim
@ 2011-04-24 23:46 ` Dave Chinner
  2011-04-25  5:51   ` Christian Kujau
  2011-04-25  8:02   ` Christian Kujau
  1 sibling, 2 replies; 42+ messages in thread
From: Dave Chinner @ 2011-04-24 23:46 UTC (permalink / raw)
  To: Christian Kujau; +Cc: LKML, xfs

On Thu, Apr 21, 2011 at 06:57:16PM -0700, Christian Kujau wrote:
> Hi,
> 
> after the block layer regression[0] seemed to be fixed, the machine
> appeared to be running fine. But after putting some disk I/O load on the
> system (PowerBook G4) it became unresponsive, I/O wait went up high and I
> could see that the OOM killer was killing processes. Logging in via SSH
> was sometimes possible, but each session was killed shortly after, so I
> could not do much.
> 
> The box finally rebooted itself; the logfile recorded something
> xfs-related in the first backtrace, hence I'm cc'ing the xfs list too:
> 
> du invoked oom-killer: gfp_mask=0x842d0, order=0, oom_adj=0, oom_score_adj=0
> Call Trace:
> [c0009ce4] show_stack+0x70/0x1bc (unreliable)
> [c008f508] T.528+0x74/0x1cc
> [c008f734] T.526+0xd4/0x2a0
> [c008fb7c] out_of_memory+0x27c/0x360
> [c0093b3c] __alloc_pages_nodemask+0x6f8/0x708
> [c00c00b4] new_slab+0x244/0x27c
> [c00c0620] T.879+0x1cc/0x37c
> [c00c08d0] kmem_cache_alloc+0x100/0x108
> [c01cb2b8] kmem_zone_alloc+0xa4/0x114
> [c01a7d58] xfs_inode_alloc+0x40/0x13c
> [c01a8218] xfs_iget+0x258/0x5a0
> [c01c922c] xfs_lookup+0xf8/0x114
> [c01d70b0] xfs_vn_lookup+0x5c/0xb0
> [c00d14c8] d_alloc_and_lookup+0x54/0x90
> [c00d1d4c] do_lookup+0x248/0x2bc
> [c00d33cc] path_lookupat+0xfc/0x8f4
> [c00d3bf8] do_path_lookup+0x34/0xac
> [c00d53e0] user_path_at+0x64/0xb4
> [c00ca638] vfs_fstatat+0x58/0xbc
> [c00ca6c0] sys_fstatat64+0x24/0x50
> [c00124f4] ret_from_syscall+0x0/0x38
>  --- Exception: c01 at 0xff4b050
>    LR = 0x10008cf8
> 
> 
> This is with today's git (91e8549bde...); full log & .config on: 
> 
>   http://nerdbynature.de/bits/2.6.39-rc4/oom/

Your memory is full of xfs inodes, and it doesn't appear that memory
reclaim has kicked in at all to free any - the numbers just keep
growing at 1-2000 inodes/s.

I'd say they are not being reclaimed because the VFS hasn't let go
of them yet. Can you also dump /proc/sys/fs/{dentry,inode}-state so
we can see if the VFS has released the inodes such that they can be
reclaimed by XFS?
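Those counters live in procfs (fields documented in
Documentation/sysctl/fs.txt: for dentry-state, the first value is
nr_dentry and the second nr_unused). A sketch of the requested dump:

```shell
# Print the VFS dentry/inode counters; guarded so it degrades quietly
# on systems where a file is missing.
dump_vfs_state() {
    for f in /proc/sys/fs/dentry-state /proc/sys/fs/inode-state; do
        [ -r "$f" ] && printf '%s: %s\n' "${f##*/}" "$(cat "$f")"
    done
    return 0
}
```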

BTW, what are your mount options? If it is the problem I suspect it
is, then using noatime will stop it from occurring....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-24 23:46 ` Dave Chinner
@ 2011-04-25  5:51   ` Christian Kujau
  2011-04-25  7:19     ` Christian Kujau
  2011-04-25  8:02   ` Christian Kujau
  1 sibling, 1 reply; 42+ messages in thread
From: Christian Kujau @ 2011-04-25  5:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: LKML, xfs

On Mon, 25 Apr 2011 at 09:46, Dave Chinner wrote:
> I'd say they are not being reclaimed because the VFS hasn't let go
> of them yet. Can you also dump /proc/sys/fs/{dentry,inode}-state so
> we can see if the VFS has released the inodes such that they can be
> reclaimed by XFS?

Will do that.

> BTW, what are your mount options? If it is the problem I suspect it
> is, then using noatime will stop it from occurring....

# grep xfs /proc/mounts 
/dev/mapper/wdc1 /mnt/backup xfs rw,nosuid,nodev,noexec,relatime,attr2,delaylog,noquota 0 0
/dev/mapper/wdc1 /data xfs ro,relatime,attr2,delaylog,noquota 0 0

The "/dev/mapper/wdc1" is a dm-crypt device on a ~1TB external disk 
(via FireWire). The 2nd line in the output above is a bind-mount from 
somewhere below "/mnt/backup".

I'm currently trying to reproduce the problem again and see if the 
{dentry,inode}-state output reveals any more information. After that, I'll 
try with noatime.
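For the record, a remount is enough to flip atime off without touching
the backup job; the mountpoints below come from the /proc/mounts output
above. The sketch prints the commands instead of running them, since a
real remount needs root:

```shell
# Emit the remount commands for the two XFS mountpoints; pipe to
# "sudo sh" (or paste as root) to actually apply them.
for mp in /mnt/backup /data; do
    printf 'mount -o remount,noatime %s\n' "$mp"
done
```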

Thanks,
Christian.
-- 
BOFH excuse #115:

your keyboard's space bar is generating spurious keycodes.


* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-25  5:51   ` Christian Kujau
@ 2011-04-25  7:19     ` Christian Kujau
  2011-04-26 15:14       ` Christian Kujau
  2011-04-27  2:26       ` Dave Chinner
  0 siblings, 2 replies; 42+ messages in thread
From: Christian Kujau @ 2011-04-25  7:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: LKML, xfs

On Mon, 25 Apr 2011 at 09:46, Dave Chinner wrote:
> I'd say they are not being reclaimed because the VFS hasn't let go
> of them yet. Can you also dump /proc/sys/fs/{dentry,inode}-state so
> we can see if the VFS has released the inodes such that they can be
> reclaimed by XFS?

Please see http://nerdbynature.de/bits/2.6.39-rc4/oom/

 - slabinfo-4.txt.bz2, contains /proc/sys/fs/{dentry,inode}-state and
   /proc/fs/xfs/stat (is this interesting at all?) and a few other
   /proc outputs
 - messages-4.txt, contains the syslog from this run.

Should I enable a few more DEBUG_* options in the kernel? Which options 
would be helpful?

Thanks for looking into it,
Christian.
-- 
BOFH excuse #194:

We only support a 1200 bps connection.


* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-24 23:46 ` Dave Chinner
  2011-04-25  5:51   ` Christian Kujau
@ 2011-04-25  8:02   ` Christian Kujau
  2011-04-25  9:50     ` Christian Kujau
  1 sibling, 1 reply; 42+ messages in thread
From: Christian Kujau @ 2011-04-25  8:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: LKML, xfs

On Mon, 25 Apr 2011 at 09:46, Dave Chinner wrote:
> BTW, what are your mount options? If it is the problem I suspect it
> is, then using noatime will stop it from occurring....

When mounted with noatime, running du(1) across the XFS volume still 
triggered the OOM killer, but the backtrace looked somewhat different 
this time:

 date invoked oom-killer: gfp_mask=0x84d0, order=0, oom_adj=0, oom_score_adj=0
 Call Trace:
 [ecda5c70] [c0009ce4] show_stack+0x70/0x1bc (unreliable)
 [ecda5cb0] [c008f508] T.528+0x74/0x1cc
 [ecda5d00] [c008f734] T.526+0xd4/0x2a0
 [ecda5d40] [c008fb7c] out_of_memory+0x27c/0x360
 [ecda5d90] [c0093b3c] __alloc_pages_nodemask+0x6f8/0x708
 [ecda5e30] [c0017e9c] pte_alloc_one+0x30/0x90
 [ecda5e40] [c00a9464] __pte_alloc+0x2c/0x100
 [ecda5e60] [c00a95a8] handle_mm_fault+0x70/0x110
 [ecda5e90] [c0016d78] do_page_fault+0x358/0x504
 [ecda5f40] [c0012938] handle_page_fault+0xc/0x80
 --- Exception: 401 at 0xffdd5e0
     LR = 0xffd8a04

Full dmesg & slabinfo & more on: 

http://nerdbynature.de/bits/2.6.39-rc4/oom/

(slabinfo-5.txt.bz2, messages-5.txt)

Christian.
-- 
BOFH excuse #176:

vapors from evaporating sticky-note adhesives


* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-25  8:02   ` Christian Kujau
@ 2011-04-25  9:50     ` Christian Kujau
  0 siblings, 0 replies; 42+ messages in thread
From: Christian Kujau @ 2011-04-25  9:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: LKML, xfs

On Mon, 25 Apr 2011 at 01:02, Christian Kujau wrote:
> On Mon, 25 Apr 2011 at 09:46, Dave Chinner wrote:
> > BTW, what are your mount options? If it is the problem I suspect it
> > is, then using noatime will stop it from occurring....
> 
> When mounted with noatime, running du(1) across the XFS volume still 
> triggered the OOM killer, but the backtrace looked somewhat different 
> this time:

Tried again with noatime; this time no OOM message got logged to the disk 
(though it may have been logged to the console early on). As it's reproducible, 
I started du(1) across the XFS volume and after a while my SSH connection 
to the box was lost. That's all I got this time:

  http://nerdbynature.de/bits/2.6.39-rc4/oom/oom-6.JPG
  http://nerdbynature.de/bits/2.6.39-rc4/oom/slabinfo-6.txt.bz2

Christian.
-- 
BOFH excuse #272:

Netscape has crashed


* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-25  7:19     ` Christian Kujau
@ 2011-04-26 15:14       ` Christian Kujau
  2011-04-27  2:26       ` Dave Chinner
  1 sibling, 0 replies; 42+ messages in thread
From: Christian Kujau @ 2011-04-26 15:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: LKML, xfs

Dave,

On Mon, 25 Apr 2011 at 00:19, Christian Kujau wrote:
> Please see http://nerdbynature.de/bits/2.6.39-rc4/oom/
> 
>  - slabinfo-4.txt.bz2, contains /proc/sys/fs/{dentry,inode}-state and
>    /proc/fs/xfs/stat (is this interesting at all?) and a few other
>    /proc outputs
>  - messages-4.txt, contains the syslog from this run.

have you had time to look at those outputs yet? Do you need any more 
information? Right now 2.6.39-rc4 is not an option for me; 2.6.38 works 
just fine....

Thanks,
Christian.
-- 
BOFH excuse #170:

popper unable to process jumbo kernel


* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-25  7:19     ` Christian Kujau
  2011-04-26 15:14       ` Christian Kujau
@ 2011-04-27  2:26       ` Dave Chinner
  2011-04-27  7:46         ` Christian Kujau
  1 sibling, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2011-04-27  2:26 UTC (permalink / raw)
  To: Christian Kujau; +Cc: LKML, xfs

On Mon, Apr 25, 2011 at 12:19:24AM -0700, Christian Kujau wrote:
> On Mon, 25 Apr 2011 at 09:46, Dave Chinner wrote:
> > I'd say they are not being reclaimed because the VFS hasn't let go
> > of them yet. Can you also dump /proc/sys/fs/{dentry,inode}-state so
> > we can see if the VFS has released the inodes such that they can be
> > reclaimed by XFS?
> 
> Please see http://nerdbynature.de/bits/2.6.39-rc4/oom/
> 
>  - slabinfo-4.txt.bz2, contains /proc/sys/fs/{dentry,inode}-state and

Ok, so looking at slabinfo-5.txt.bz2, this is the pattern of dentry,
VFS inode and XFS inode usage:

http://userweb.kernel.org/~dgc/slab-usage.png

What this shows is that VFS inode cache memory usage increases until
about the 550 sample mark before the VM starts to reclaim it with
extreme prejudice. At that point, I'd expect the XFS inode cache to
then shrink, and it doesn't. I've got no idea why either the
shrinker or background reclaim is not reclaiming and freeing inodes,
but it is the reason why the system OOMs.

Can you check if there are any blocked tasks nearing OOM (i.e. "echo
w > /proc/sysrq-trigger") so we can see if XFS inode reclaim is
stuck somewhere?
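A sketch of capturing that dump in one go (assumes root, and that sysrq
is permitted; the output file path is illustrative):

```shell
# Trigger the blocked-task dump and save the tail of the kernel log;
# sysrq "w" dumps tasks in uninterruptible (D) state.
sysrq_blocked_tasks() {
    echo 1 > /proc/sys/kernel/sysrq      # enable all sysrq functions
    echo w > /proc/sysrq-trigger         # dump blocked tasks
    dmesg | tail -n 200 > /var/tmp/blocked-tasks.txt
}
```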

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-27  2:26       ` Dave Chinner
@ 2011-04-27  7:46         ` Christian Kujau
  2011-04-27 10:28           ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Christian Kujau @ 2011-04-27  7:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: LKML, xfs

On Wed, 27 Apr 2011 at 12:26, Dave Chinner wrote:
> What this shows is that VFS inode cache memory usage increases until
> about the 550 sample mark before the VM starts to reclaim it with
> extreme prejudice. At that point, I'd expect the XFS inode cache to
> then shrink, and it doesn't. I've got no idea why either the

Do you remember any XFS changes past 2.6.38 that could be related to 
something like this?

Bisecting is pretty slow on this machine. Could I somehow try to run 
2.6.39-rc4 but w/o the XFS changes merged after 2.6.38? (Does someone know 
how to do this via git?)
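Enumerating the XFS commits in question is at least straightforward,
even if cleanly reverting them is not; a sketch (tag names as they
exist in a mainline clone):

```shell
# List commits touching fs/xfs between two refs; in a kernel tree:
#   list_xfs_changes v2.6.38 v2.6.39-rc4
list_xfs_changes() {
    git log --oneline "$1..$2" -- fs/xfs
}
```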

> Can you check if there are any blocked tasks nearing OOM (i.e. "echo
> w > /proc/sysrq-trigger") so we can see if XFS inode reclaim is
> stuck somewhere?

Will do, tomorrow.

Should I open a regression bug, so we don't lose track of this thing?

Thanks,
Christian.
-- 
BOFH excuse #425:

stop bit received


* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-27  7:46         ` Christian Kujau
@ 2011-04-27 10:28           ` Dave Chinner
  2011-04-27 23:16             ` Minchan Kim
  2011-04-28 17:30             ` Christian Kujau
  0 siblings, 2 replies; 42+ messages in thread
From: Dave Chinner @ 2011-04-27 10:28 UTC (permalink / raw)
  To: Christian Kujau; +Cc: LKML, xfs

On Wed, Apr 27, 2011 at 12:46:51AM -0700, Christian Kujau wrote:
> On Wed, 27 Apr 2011 at 12:26, Dave Chinner wrote:
> > What this shows is that VFS inode cache memory usage increases until
> > about the 550 sample mark before the VM starts to reclaim it with
> > extreme prejudice. At that point, I'd expect the XFS inode cache to
> > then shrink, and it doesn't. I've got no idea why either the
> 
> Do you remember any XFS changes past 2.6.38 that could be related to 
> something like this?

There's plenty of changes that could be the cause - we've changed
the inode reclaim to run in the background out of a workqueue as
well as via the shrinker, so it could even be workqueue starvation
causing the problem...

hmmmm. Speaking of which - have you changed any of the XFS tunables
in /proc/sys/fs/xfs/ on your machine (specifically
xfssyncd_centisecs)?
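A quick way to answer that is to dump all the XFS sysctls at once; a
sketch (on an untouched system xfssyncd_centisecs should, to my
understanding, read its default of 3000, i.e. 30 seconds):

```shell
# Print every XFS sysctl with its current value, or a note if the
# directory is absent (XFS module not loaded).
dump_xfs_tunables() {
    if [ -d /proc/sys/fs/xfs ]; then
        grep -s . /proc/sys/fs/xfs/*
    else
        echo 'no /proc/sys/fs/xfs (XFS not loaded?)'
    fi
}
```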

> Bisecting is pretty slow on this machine. Could I somehow try to run 
> 2.6.39-rc4 but w/o the XFS changes merged after 2.6.38? (Does someone know 
> how to do this via git?)

Not easy because there are tree-wide changes that need to be
preserved (e.g. block layer plugging changes) while others around them
would need to be reverted....

> > Can you check if there are any blocked tasks nearing OOM (i.e. "echo
> > w > /proc/sysrq-trigger") so we can see if XFS inode reclaim is
> > stuck somewhere?
> 
> Will do, tomorrow.
> 
> Should I open a regression bug, so we don't lose track of this thing?

Whatever you want. 

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-27 10:28           ` Dave Chinner
@ 2011-04-27 23:16             ` Minchan Kim
  2011-04-27 23:56               ` Dave Chinner
  2011-04-28 17:30             ` Christian Kujau
  1 sibling, 1 reply; 42+ messages in thread
From: Minchan Kim @ 2011-04-27 23:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christian Kujau, LKML, xfs

On Wed, Apr 27, 2011 at 7:28 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Apr 27, 2011 at 12:46:51AM -0700, Christian Kujau wrote:
>> On Wed, 27 Apr 2011 at 12:26, Dave Chinner wrote:
>> > What this shows is that VFS inode cache memory usage increases until
>> > about the 550 sample mark before the VM starts to reclaim it with
>> > extreme prejudice. At that point, I'd expect the XFS inode cache to
>> > then shrink, and it doesn't. I've got no idea why either the
>>
>> Do you remember any XFS changes past 2.6.38 that could be related to
>> something like this?
>
> There's plenty of changes that could be the cause - we've changed
> the inode reclaim to run in the background out of a workqueue as
> well as via the shrinker, so it could even be workqueue starvation
> causing the problem...

RCU free starvation is another possibility?
https://lkml.org/lkml/2011/4/25/124

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-27 23:16             ` Minchan Kim
@ 2011-04-27 23:56               ` Dave Chinner
  0 siblings, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2011-04-27 23:56 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Christian Kujau, LKML, xfs

On Thu, Apr 28, 2011 at 08:16:29AM +0900, Minchan Kim wrote:
> On Wed, Apr 27, 2011 at 7:28 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Wed, Apr 27, 2011 at 12:46:51AM -0700, Christian Kujau wrote:
> >> On Wed, 27 Apr 2011 at 12:26, Dave Chinner wrote:
> >> > What this shows is that VFS inode cache memory usage increases until
> >> > about the 550 sample mark before the VM starts to reclaim it with
> >> > extreme prejudice. At that point, I'd expect the XFS inode cache to
> >> > then shrink, and it doesn't. I've got no idea why either the
> >>
> >> Do you remember any XFS changes past 2.6.38 that could be related to
> >> something like this?
> >
> > There's plenty of changes that could be the cause - we've changed
> > the inode reclaim to run in the background out of a workqueue as
> > well as via the shrinker, so it could even be workqueue starvation
> > causing the problem...
> 
> RCU free starvation is another possibility?
> https://lkml.org/lkml/2011/4/25/124

You know, I've been watching that thread with interest, but it didn't
seem to be related. However, now that I go look at the config file
provided, I see:

CONFIG_TINY_RCU=y
# CONFIG_SMP is not set
CONFIG_PREEMPT_NONE=y

which means it probably is the same rcu free starvation problem as
reported in that thread.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-27 10:28           ` Dave Chinner
  2011-04-27 23:16             ` Minchan Kim
@ 2011-04-28 17:30             ` Christian Kujau
  2011-04-28 23:37               ` Dave Chinner
  1 sibling, 1 reply; 42+ messages in thread
From: Christian Kujau @ 2011-04-28 17:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: LKML, xfs

Hi David,

sorry for the late response:

On Wed, 27 Apr 2011 at 20:28, Dave Chinner wrote:
> hmmmm. Speaking of which - have you changed any of the XFS tunables
> in /proc/sys/fs/xfs/ on your machine (specifically
> xfssyncd_centisecs)?

No, I did not change anything underneath /proc/sys/fs/xfs.
 
> Not easy because there are tree-wide changes that need to be
> preserved (e.g. block layer plugging changes) while others around it
> would need to be reverted....

Yeah, I noticed. I wanted to do something like this:

  $ for c in `git log --pretty=oneline --no-merges v2.6.38..v2.6.39-rc2 fs/xfs | awk '{print $1}'`; do
       git revert $c
    done

...but this broke on the 1st or 2nd commit because of unresolved conflicts.
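
[An alternative that sidesteps per-commit conflicts is to revert the whole
fs/xfs diff in one step - a sketch, untested against these exact trees,
and it may still fail where fs/xfs depends on the tree-wide changes Dave
mentioned:]

```shell
# Generate the combined fs/xfs diff between the two tags and apply it
# in reverse on the newer tree, instead of reverting commit by commit.
git diff v2.6.38..v2.6.39-rc2 -- fs/xfs | git apply -R
```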

Now I'm in the middle of another bisect attempt, trying to reproduce, but 
I'm getting (probably totally unrelated) warnings here:

BUG: sleeping function called from invalid context at /usr/local/src/linux-2.6-git/drivers/ide/ide-io.c:447
in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper
INFO: lockdep is turned off.
[...]
 
> Can you check if there are any blocked tasks nearing OOM (i.e. "echo
> w > /proc/sysrq-trigger") so we can see if XFS inode reclaim is
> stuck somewhere?

In one instance when I mounted with "noatime", no log messages made it to 
the disk, but I could see "hung tasks":

   http://nerdbynature.de/bits/2.6.39-rc4/oom/oom-6.JPG

I will try sysrq-w later on.

FWIW, a regression bug has been opened:

  https://bugzilla.kernel.org/show_bug.cgi?id=34012

Thanks,
Christian.
-- 
BOFH excuse #284:

Electrons on a bender

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-28 17:30             ` Christian Kujau
@ 2011-04-28 23:37               ` Dave Chinner
  2011-04-29 17:32                 ` Christian Kujau
  2011-04-29 19:58                 ` Christian Kujau
  0 siblings, 2 replies; 42+ messages in thread
From: Dave Chinner @ 2011-04-28 23:37 UTC (permalink / raw)
  To: Christian Kujau; +Cc: LKML, xfs

On Thu, Apr 28, 2011 at 10:30:47AM -0700, Christian Kujau wrote:
> Hi David,
> 
> sorry for the late response:
> 
> On Wed, 27 Apr 2011 at 20:28, Dave Chinner wrote:
> > hmmmm. Speaking of which - have you changed any of the XFS tunables
> > in /proc/sys/fs/xfs/ on your machine (specifically
> > xfssyncd_centisecs)?
> 
> No, I did not change anything underneath /proc/sys/fs/xfs.
>  
> > Not easy because there are tree-wide changes that need to be
> > preserved (e.g. block layer plugging changes) while others around it
> > would need to be reverted....
> 
> Yeah, I noticed. I wanted to do something like this:
> 
>   $ for c in `git log --pretty=oneline --no-merges v2.6.38..v2.6.39-rc2 fs/xfs | awk '{print $1}'`; do
>        git revert $c
>     done
> 
> ...but this broke on the 1st or 2nd commit because of unresolved conflicts.
> 
> Now I'm in the middle of another bisect attempt, trying to reproduce, but 
> I'm getting (probably totally unrelated) warnings here:

Did you read the email that pointed to CONFIG_TINY_RCU as the
potential source of the problem? Can you change to CONFIG_TREE_RCU
and see if the problem goes away?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-28 23:37               ` Dave Chinner
@ 2011-04-29 17:32                 ` Christian Kujau
  2011-04-29 19:58                 ` Christian Kujau
  1 sibling, 0 replies; 42+ messages in thread
From: Christian Kujau @ 2011-04-29 17:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: LKML, xfs

On Fri, 29 Apr 2011 at 09:37, Dave Chinner wrote:
> Did you read the email that pointed to CONFIG_TINY_RCU as the
> potential source of the problem? Can you change to CONFIG_TREE_RCU
> and see if the problem goes away?

For that I had to select SMP on this UP-PowerPC G4 machine. But compiling 
-rc5 (as the bug is still present in -rc5) fails with:

/usr/local/src/linux-2.6-git/arch/powerpc/kernel/crash.c: In function 
‘default_machine_crash_shutdown’:
/usr/local/src/linux-2.6-git/arch/powerpc/kernel/crash.c:349: error: 
implicit declaration of function ‘crash_kexec_wait_realmode’
make[2]: *** [arch/powerpc/kernel/crash.o] Error 1
make[1]: *** [arch/powerpc/kernel] Error 2
make: *** [sub-make] Error 2

Maybe I'll just disable kexec and see if TREE_RCU helps.

Meanwhile, I gave up on the git bisect, as I can hardly boot between 
commits: http://nerdbynature.de/bits/2.6.39-rc4/oom/git-bisect-log.txt

Thanks,
Christian.
-- 
BOFH excuse #446:

Mailer-daemon is busy burning your message in hell.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-28 23:37               ` Dave Chinner
  2011-04-29 17:32                 ` Christian Kujau
@ 2011-04-29 19:58                 ` Christian Kujau
  2011-04-29 20:17                   ` Markus Trippelsdorf
  1 sibling, 1 reply; 42+ messages in thread
From: Christian Kujau @ 2011-04-29 19:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: LKML, xfs, minchan.kim

On Fri, 29 Apr 2011 at 09:37, Dave Chinner wrote:
> Did you read the email that pointed to CONFIG_TINY_RCU as the
> potential source of the problem? Can you change to CONFIG_TREE_RCU
> and see if the problem goes away?

Tried with TREE_RCU, unfortunately the problem persists. However, the OOM 
messages did not make it to the disk this time. Syslog continued to work 
until 11:51 (local time), the slabinfo-8-TREE_RCU.txt file was written to 
until 11:57, the machine was already unusable at that point. At 12:07 I 
took a "screenshot" and then powercycled the machine:

  http://nerdbynature.de/bits/2.6.39-rc4/oom/
  (The files marked with -8 are the current ones)

Btw, the xfs filesystem I'm du'ing over is still mounted with "noatime".

Any ideas? Patches to try?

Thanks,
Christian.
-- 
BOFH excuse #70:

nesting roaches shorted out the ether cable

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-29 19:58                 ` Christian Kujau
@ 2011-04-29 20:17                   ` Markus Trippelsdorf
  2011-04-29 20:20                     ` Christian Kujau
  2011-04-30  0:17                     ` Christian Kujau
  0 siblings, 2 replies; 42+ messages in thread
From: Markus Trippelsdorf @ 2011-04-29 20:17 UTC (permalink / raw)
  To: Christian Kujau; +Cc: Dave Chinner, LKML, xfs, minchan.kim

On 2011.04.29 at 12:58 -0700, Christian Kujau wrote:
> On Fri, 29 Apr 2011 at 09:37, Dave Chinner wrote:
> > Did you read the email that pointed to CONFIG_TINY_RCU as the
> > potential source of the problem? Can you change to CONFIG_TREE_RCU
> > and see if the problem goes away?
> 
> Tried with TREE_RCU, unfortunately the problem persists. However, the OOM 
> messages did not make it to the disk this time. Syslog continued to work 
> until 11:51 (local time), the slabinfo-8-TREE_RCU.txt file was written to 
> until 11:57, the machine was already unusable at that point. At 12:07 I 
> took a "screenshot" and then powercycled the machine:
> 
>   http://nerdbynature.de/bits/2.6.39-rc4/oom/
>   (The files marked with -8 are the current ones)
> 
> Btw, the xfs filesystem I'm du'ing over is still mounted with "noatime".
> 
> Any ideas? Patches to try?

It could be the hrtimer bug again. Would you try to reproduce the issue
with this patch applied?
http://git.us.kernel.org/?p=linux/kernel/git/tip/linux-2.6-tip.git;a=commit;h=ce31332d3c77532d6ea97ddcb475a2b02dd358b4

-- 
Markus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-29 20:17                   ` Markus Trippelsdorf
@ 2011-04-29 20:20                     ` Christian Kujau
  2011-04-29 20:21                       ` Markus Trippelsdorf
  2011-04-30  0:17                     ` Christian Kujau
  1 sibling, 1 reply; 42+ messages in thread
From: Christian Kujau @ 2011-04-29 20:20 UTC (permalink / raw)
  To: Markus Trippelsdorf; +Cc: Dave Chinner, LKML, xfs, minchan.kim

On Fri, 29 Apr 2011 at 22:17, Markus Trippelsdorf wrote:
> It could be the hrtimer bug again. Would you try to reproduce the issue
> with this patch applied?
> http://git.us.kernel.org/?p=linux/kernel/git/tip/linux-2.6-tip.git;a=commit;h=ce31332d3c77532d6ea97ddcb475a2b02dd358b4

will do. Should I enable TREE_RCU again (SMP) or use TINY_RCU for my UP 
system?

Thanks,
C.
-- 
BOFH excuse #221:

The mainframe needs to rest.  It's getting old, you know.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-29 20:20                     ` Christian Kujau
@ 2011-04-29 20:21                       ` Markus Trippelsdorf
  0 siblings, 0 replies; 42+ messages in thread
From: Markus Trippelsdorf @ 2011-04-29 20:21 UTC (permalink / raw)
  To: Christian Kujau; +Cc: Dave Chinner, LKML, xfs, minchan.kim

On 2011.04.29 at 13:20 -0700, Christian Kujau wrote:
> On Fri, 29 Apr 2011 at 22:17, Markus Trippelsdorf wrote:
> > It could be the hrtimer bug again. Would you try to reproduce the issue
> > with this patch applied?
> > http://git.us.kernel.org/?p=linux/kernel/git/tip/linux-2.6-tip.git;a=commit;h=ce31332d3c77532d6ea97ddcb475a2b02dd358b4
> 
> will do. Should I enable TREE_RCU again (SMP) or use TINY_RCU for my UP 
> system?

It shouldn't matter. Both will do.
-- 
Markus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-29 20:17                   ` Markus Trippelsdorf
  2011-04-29 20:20                     ` Christian Kujau
@ 2011-04-30  0:17                     ` Christian Kujau
  2011-05-01  8:01                       ` Dave Chinner
  1 sibling, 1 reply; 42+ messages in thread
From: Christian Kujau @ 2011-04-30  0:17 UTC (permalink / raw)
  To: Markus Trippelsdorf; +Cc: Dave Chinner, LKML, xfs, minchan.kim

On Fri, 29 Apr 2011 at 22:17, Markus Trippelsdorf wrote:
> It could be the hrtimer bug again. Would you try to reproduce the issue
> with this patch applied?
> http://git.us.kernel.org/?p=linux/kernel/git/tip/linux-2.6-tip.git;a=commit;h=ce31332d3c77532d6ea97ddcb475a2b02dd358b4

With that patch applied, the OOM killer still kicks in; this time the OOM 
messages were written to the syslog again:

   http://nerdbynature.de/bits/2.6.39-rc4/oom/
   (The -9 files are the current ones)

Also, this time xfs did not show up in the backtrace:

ssh invoked oom-killer: gfp_mask=0x44d0, order=2, oom_adj=0, oom_score_adj=0
Call Trace:
[c22bfae0] [c0009d30] show_stack+0x70/0x1bc (unreliable)
[c22bfb20] [c009cd3c] T.545+0x74/0x1d0
[c22bfb70] [c009cf6c] T.543+0xd4/0x2a0
[c22bfbb0] [c009d3b4] out_of_memory+0x27c/0x360
[c22bfc00] [c00a199c] __alloc_pages_nodemask+0x6f8/0x708
[c22bfca0] [c00a19c8] __get_free_pages+0x1c/0x44
[c22bfcb0] [c00d283c] __kmalloc_track_caller+0x1c0/0x1dc
[c22bfcd0] [c036ff1c] __alloc_skb+0x74/0x140
[c22bfd00] [c0369b08] sock_alloc_send_pskb+0x23c/0x37c
[c22bfd70] [c03e8974] unix_stream_sendmsg+0x354/0x478
[c22bfde0] [c0364118] sock_aio_write+0x170/0x180
[c22bfe50] [c00d580c] do_sync_write+0xb8/0x144
[c22bfef0] [c00d68d0] vfs_write+0x1b8/0x1c0
[c22bff10] [c00d6a10] sys_write+0x58/0xc8
[c22bff40] [c00127d4] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0x2044cc14


Thanks,
Christian.
-- 
BOFH excuse #144:

Too few computrons available.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-04-30  0:17                     ` Christian Kujau
@ 2011-05-01  8:01                       ` Dave Chinner
  2011-05-02  4:59                         ` Christian Kujau
  2011-05-02  9:26                         ` Christian Kujau
  0 siblings, 2 replies; 42+ messages in thread
From: Dave Chinner @ 2011-05-01  8:01 UTC (permalink / raw)
  To: Christian Kujau; +Cc: Markus Trippelsdorf, LKML, xfs, minchan.kim

On Fri, Apr 29, 2011 at 05:17:53PM -0700, Christian Kujau wrote:
> On Fri, 29 Apr 2011 at 22:17, Markus Trippelsdorf wrote:
> > It could be the hrtimer bug again. Would you try to reproduce the issue
> > with this patch applied?
> > http://git.us.kernel.org/?p=linux/kernel/git/tip/linux-2.6-tip.git;a=commit;h=ce31332d3c77532d6ea97ddcb475a2b02dd358b4
> 
> With that patch applied, the OOM killer still kicks in; this time the OOM 
> messages were written to the syslog again:
> 
>    http://nerdbynature.de/bits/2.6.39-rc4/oom/
>    (The -9 files are the current ones)
> 
> Also, this time xfs did not show up in the backtrace:
> 
> ssh invoked oom-killer: gfp_mask=0x44d0, order=2, oom_adj=0, oom_score_adj=0
> Call Trace:
> [c22bfae0] [c0009d30] show_stack+0x70/0x1bc (unreliable)
> [c22bfb20] [c009cd3c] T.545+0x74/0x1d0
> [c22bfb70] [c009cf6c] T.543+0xd4/0x2a0
> [c22bfbb0] [c009d3b4] out_of_memory+0x27c/0x360
> [c22bfc00] [c00a199c] __alloc_pages_nodemask+0x6f8/0x708
> [c22bfca0] [c00a19c8] __get_free_pages+0x1c/0x44
> [c22bfcb0] [c00d283c] __kmalloc_track_caller+0x1c0/0x1dc
> [c22bfcd0] [c036ff1c] __alloc_skb+0x74/0x140
> [c22bfd00] [c0369b08] sock_alloc_send_pskb+0x23c/0x37c
> [c22bfd70] [c03e8974] unix_stream_sendmsg+0x354/0x478
> [c22bfde0] [c0364118] sock_aio_write+0x170/0x180
> [c22bfe50] [c00d580c] do_sync_write+0xb8/0x144
> [c22bfef0] [c00d68d0] vfs_write+0x1b8/0x1c0
> [c22bff10] [c00d6a10] sys_write+0x58/0xc8
> [c22bff40] [c00127d4] ret_from_syscall+0x0/0x38
> --- Exception: c01 at 0x2044cc14

Doesn't need to have XFS in the stack trace - the inode cache is
consuming all of low memory.

Indeed, I wonder if that is the problem - this is a highmem
configuration where there is 450MB of highmem free, and very little
lowmem free which is considered "all unreclaimable". The lowmem
zone:

Apr 29 15:59:10 alice kernel: [ 3834.754358] DMA free:64704kB
min:3532kB low:4412kB high:5296kB active_anon:0kB inactive_anon:0kB
active_file:132kB inactive_file:168kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB present:780288kB mlocked:0kB
dirty:0kB writeback:0kB mapped:8kB shmem:0kB
slab_reclaimable:639680kB slab_unreclaimable:41652kB
kernel_stack:1128kB pagetables:1788kB unstable:0kB bounce:0kB
writeback_tmp:0kB pages_scanned:516 all_unreclaimable? yes

I really don't know why the xfs inode cache is not being trimmed. I
really, really need to know if the XFS inode cache shrinker is
getting blocked or not running - do you have those sysrq-w traces
when near OOM I asked for a while back?

It may be that the zone reclaim is simply fubar because slab cache
reclaim is proportional to the number of pages scanned on the LRU.
With most of the cached pages in the highmem zone, the lowmem zone
scan only scanned 516 pages. I can't see it freeing many inodes
(there's >600,000 of them in memory) based on such a low page scan
number.

Maybe you should tweak /proc/sys/vm/vfs_cache_pressure to make it
reclaim vfs structures more rapidly. It might help, but I'm starting
to think that this problem is actually a VM zone reclaim balance
problem, not an XFS problem as such....
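
[For reference, the tunable can be bumped at runtime - a sketch; the value
is only a starting point, not a recommendation:]

```shell
# Raise vfs_cache_pressure so the VM reclaims dentry/inode caches more
# aggressively relative to page cache (the default is 100).
sysctl -w vm.vfs_cache_pressure=10000
# equivalently:
echo 10000 > /proc/sys/vm/vfs_cache_pressure
```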

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-05-01  8:01                       ` Dave Chinner
@ 2011-05-02  4:59                         ` Christian Kujau
  2011-05-02 12:19                           ` Dave Chinner
  2011-05-02  9:26                         ` Christian Kujau
  1 sibling, 1 reply; 42+ messages in thread
From: Christian Kujau @ 2011-05-02  4:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Markus Trippelsdorf, LKML, xfs, minchan.kim

On Sun, 1 May 2011 at 18:01, Dave Chinner wrote:
> I really don't know why the xfs inode cache is not being trimmed. I
> really, really need to know if the XFS inode cache shrinker is
> getting blocked or not running - do you have those sysrq-w traces
> when near OOM I asked for a while back?

I tried to generate those via /proc/sysrq-trigger (don't have a F13/Print 
Screen key), but the OOM killer kicks in pretty fast - so fast that my 
debug script, trying to generate sysrq-w every second, was too late and the 
machine was already dead:

   http://nerdbynature.de/bits/2.6.39-rc4/oom/
   * messages-10.txt.gz
   * slabinfo-10.txt.bz2

Timeline:
  - du(1) started at 12:25:16 (and immediately listed
    as "blocked" task)
  - the last sysrq-w succeeded at 12:38:05, listing kswapd0
  - du invoked oom-killer at 12:38:06

I'll keep trying...
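
[The debug script in question is roughly a loop like the following sketch -
the exact script wasn't posted, so this is an assumption; it needs root and
a working syslog to be useful:]

```shell
# Dump blocked-task traces (sysrq-w) once a second, so the last
# successful dump lands in syslog just before the OOM kill.
while true; do
    echo w > /proc/sysrq-trigger
    sleep 1
done
```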

> scan only scanned 516 pages. I can't see it freeing many inodes
> (there's >600,000 of them in memory) based on such a low page scan
> number.

Not sure if this is related... this XFS filesystem I'm running du(1) on is 
~1 TB in size, with 918K allocated inodes, if df(1) is correct:

# df -hi /mnt/backup/
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/mapper/wdc1         37M    918K     36M    3% /mnt/backup

> Maybe you should tweak /proc/sys/vm/vfs_cache_pressure to make it
> reclaim vfs structures more rapidly. It might help

/proc/sys/vm/vfs_cache_pressure is currently set to '100'. You mean I 
should increase it? To..150? 200? 1000?

Thanks,
Christian.
-- 
BOFH excuse #347:

The rubber band broke

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-05-01  8:01                       ` Dave Chinner
  2011-05-02  4:59                         ` Christian Kujau
@ 2011-05-02  9:26                         ` Christian Kujau
  2011-05-02 12:38                           ` Dave Chinner
  1 sibling, 1 reply; 42+ messages in thread
From: Christian Kujau @ 2011-05-02  9:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Markus Trippelsdorf, LKML, xfs, minchan.kim

On Sun, 1 May 2011 at 18:01, Dave Chinner wrote:
> I really don't know why the xfs inode cache is not being trimmed. I
> really, really need to know if the XFS inode cache shrinker is
> getting blocked or not running - do you have those sysrq-w traces
> when near OOM I asked for a while back?

Here's another attempt at getting those:

  http://nerdbynature.de/bits/2.6.39-rc4/oom/
  * messages-11.txt.gz & slabinfo-11.txt.bz2
    - oom-killer at 00:05:04
    - last sysrq-w to succeed at 00:05:03

  * messages-12.txt.gz & slabinfo-12.txt.bz2, along
    with meminfo-post-oom-12.txt & sysrq-w_post-oom-12.jpg could
    be more interesting:
    - last sysrq-w to succeed at 01:27:08
    - oom-killer at 01:27:11

   ...but after the OOM-killer was killing quite a few processes, MemFree
   showed 511236 kB free memory, yet ssh logins were still being killed.
   Finally I got a root shell on the box, issued sysrq-w again and even
   executed /bin/sync, which came back. But looking at the logs now 
   nothing went to the disk (/var/log resides on /, which is an ext4 fs).
   See sysrq-w_post-oom-12.jpg for a sysrq-w I took 2381s after boot time,
   or 01:32 - syslog stopped on 01:27.

I shall try again with netconsole logging or something...

HTH & thanks for looking into this,
Christian.
-- 
BOFH excuse #176:

vapors from evaporating sticky-note adhesives

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-05-02  4:59                         ` Christian Kujau
@ 2011-05-02 12:19                           ` Dave Chinner
  2011-05-02 19:59                             ` Christian Kujau
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2011-05-02 12:19 UTC (permalink / raw)
  To: Christian Kujau; +Cc: Markus Trippelsdorf, LKML, xfs, minchan.kim

On Sun, May 01, 2011 at 09:59:35PM -0700, Christian Kujau wrote:
> On Sun, 1 May 2011 at 18:01, Dave Chinner wrote:
> > I really don't know why the xfs inode cache is not being trimmed. I
> > really, really need to know if the XFS inode cache shrinker is
> > getting blocked or not running - do you have those sysrq-w traces
> > when near OOM I asked for a while back?
> 
> I tried to generate those via /proc/sysrq-trigger (don't have a F13/Print 
> Screen key), but the OOM killer kicks in pretty fast - so fast that my 
> debug script, trying to generate sysrq-w every second, was too late and the 
> machine was already dead:
> 
>    http://nerdbynature.de/bits/2.6.39-rc4/oom/
>    * messages-10.txt.gz
>    * slabinfo-10.txt.bz2
> 
> Timeline:
>   - du(1) started at 12:25:16 (and immediately listed
>     as "blocked" task)
>   - the last sysrq-w succeeded at 12:38:05, listing kswapd0
>   - du invoked oom-killer at 12:38:06
> 
> I'll keep trying...
> 
> > scan only scanned 516 pages. I can't see it freeing many inodes
> > (there's >600,000 of them in memory) based on such a low page scan
> > number.
> 
> Not sure if this is related... this XFS filesystem I'm running du(1) on is 
> ~1 TB in size, with 918K allocated inodes, if df(1) is correct:
> 
> # df -hi /mnt/backup/
> Filesystem            Inodes   IUsed   IFree IUse% Mounted on
> /dev/mapper/wdc1         37M    918K     36M    3% /mnt/backup
> 
> > Maybe you should tweak /proc/sys/vm/vfs_cache_pressure to make it
> > reclaim vfs structures more rapidly. It might help
> 
> /proc/sys/vm/vfs_cache_pressure is currently set to '100'. You mean I 
> should increase it? To..150? 200? 1000?

Yes. Try 2 orders of magnitude as a start. i.e change it to 10000...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-05-02  9:26                         ` Christian Kujau
@ 2011-05-02 12:38                           ` Dave Chinner
  0 siblings, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2011-05-02 12:38 UTC (permalink / raw)
  To: Christian Kujau; +Cc: Markus Trippelsdorf, LKML, xfs, minchan.kim

On Mon, May 02, 2011 at 02:26:17AM -0700, Christian Kujau wrote:
> On Sun, 1 May 2011 at 18:01, Dave Chinner wrote:
> > I really don't know why the xfs inode cache is not being trimmed. I
> > really, really need to know if the XFS inode cache shrinker is
> > getting blocked or not running - do you have those sysrq-w traces
> > when near OOM I asked for a while back?
> 
> Here's another attempt at getting those:
> 
>   http://nerdbynature.de/bits/2.6.39-rc4/oom/
>   * messages-11.txt.gz & slabinfo-11.txt.bz2
>     - oom-killer at 00:05:04
>     - last sysrq-w to succeed at 00:05:03
> 
>   * messages-12.txt.gz & slabinfo-12.txt.bz2, along
>     with meminfo-post-oom-12.txt & sysrq-w_post-oom-12.jpg could
>     be more interesting:
>     - last sysrq-w to succeed at 01:27:08
>     - oom-killer at 01:27:11
> 
>    ...but after the OOM-killer was killing quite a few processes, MemFree
>    showed 511236 kB free memory, yet ssh logins were still being killed.
>    Finally I got a root shell on the box, issued sysrq-w again and even
>    executed /bin/sync, which came back. But looking at the logs now 
>    nothing went to the disk (/var/log resides on /, which is an ext4 fs).
>    See sysrq-w_post-oom-12.jpg for a sysrq-w I took 2381s after boot time,
>    or 01:32 - syslog stopped on 01:27.

Same problem:

MemFree:          511236 kB
....
LowTotal:         759904 kB
LowFree:            3804 kB

i.e. that low memory is being exhausted by the slab cache, while
there is lots of free high memory, and the low memory zone is marked
as all unreclaimable....

The sysrq trace less than 1s before the first OOM shows this:

[c00770ec] __lock_acquire+0x43c/0x1818 (unreliable)
[c000a924] __switch_to+0x9c/0x128
[c0417580] schedule+0x274/0x8bc
[c0418128] schedule_timeout+0x16c/0x214
[c04172a0] io_schedule_timeout+0xb0/0x11c
[c00b153c] congestion_wait+0x8c/0xdc
[c00aa43c] kswapd+0x6d0/0x884
[c005e3d0] kthread+0x84/0x88
[c0010908] kernel_thread+0x4c/0x68

Background memory reclaim appears to be blocked by IO congestion....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-05-02 12:19                           ` Dave Chinner
@ 2011-05-02 19:59                             ` Christian Kujau
  2011-05-03  0:51                               ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Christian Kujau @ 2011-05-02 19:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Markus Trippelsdorf, LKML, xfs, minchan.kim

On Mon, 2 May 2011 at 22:19, Dave Chinner wrote:
> Yes. Try 2 orders of magnitude as a start. i.e change it to 10000...

I've run the -12 test with vfs_cache_pressure=200 and now the -13 test 
with vfs_cache_pressure=10000. The OOM killer still kicks in, but the 
machine seems to be more usable afterwards and does not get totally stuck:

  http://nerdbynature.de/bits/2.6.39-rc4/oom/
  - messages-12.txt.gz & slabinfo-12.txt.bz2
    * oom-debug.sh invoked oom-killer at 01:27:11
    * sysrq-w works until 01:27:08, but got killed by oom

  - messages-13.txt.gz & slabinfo-13.txt.bz2 
    * find invoked oom-killer at 08:44:07
    * sysrq-w works until 08:45:48 (listing jbd2/hda6-8), then
      my debug script got killed

Thanks,
Christian.
-- 
BOFH excuse #224:

Jan  9 16:41:27 huber su: 'su root' succeeded for .... on /dev/pts/1

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-05-02 19:59                             ` Christian Kujau
@ 2011-05-03  0:51                               ` Dave Chinner
  2011-05-03  4:04                                 ` Christian Kujau
  2011-05-03 20:53                                 ` Christian Kujau
  0 siblings, 2 replies; 42+ messages in thread
From: Dave Chinner @ 2011-05-03  0:51 UTC (permalink / raw)
  To: Christian Kujau; +Cc: Markus Trippelsdorf, LKML, xfs, minchan.kim

On Mon, May 02, 2011 at 12:59:50PM -0700, Christian Kujau wrote:
> On Mon, 2 May 2011 at 22:19, Dave Chinner wrote:
> > Yes. Try 2 orders of magnitude as a start. i.e change it to 10000...
> 
> I've run the -12 test with vfs_cache_pressure=200 and now the -13 test 
> with vfs_cache_pressure=10000. The OOM killer still kicks in, but the 
> machine seems to be more usable afterwards and does not get totally stuck:
> 
>   http://nerdbynature.de/bits/2.6.39-rc4/oom/
>   - messages-12.txt.gz & slabinfo-12.txt.bz2
>     * oom-debug.sh invoked oom-killer at 01:27:11
>     * sysrq-w works until 01:27:08, but got killed by oom
> 
>   - messages-13.txt.gz & slabinfo-13.txt.bz2 
>     * find invoked oom-killer at 08:44:07
>     * sysrq-w works until 08:45:48 (listing jbd2/hda6-8), then
>       my debug script got killed

So before the OOM killer kicks in, kswapd is stuck in
congestion_wait(), and after a number of oom-kills over a 5s period
it is still in congestion_wait(). 7s later it is still in
congestion_wait() and the oom-killer starts up again, with kswapd
still being in congestion_wait() when the oom-killer stops again 3s
later.

Ok, so kswapd being stuck in congestion_wait() means it can only be in
balance_pgdat(), and it thinks that it is getting into trouble.

Looking at the OOM output:

  active_anon:7992 inactive_anon:8714 isolated_anon:0
  active_file:5995 inactive_file:73780 isolated_file:0
  unevictable:0 dirty:0 writeback:0 unstable:0
  free:35263 slab_reclaimable:182652 slab_unreclaimable:3224
  mapped:6929 shmem:199 pagetables:396 bounce:0
 DMA free:3436kB min:3532kB low:4412kB high:5296kB active_anon:0kB inactive_anon:0kB active_file:236kB inactive_file:248kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:780288kB mlocked:0kB dirty:0kB writeback:0kB mapped:8kB shmem:0kB slab_reclaimable:730608kB slab_unreclaimable:12896kB kernel_stack:1032kB pagetables:1584kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:680 all_unreclaimable? yes
 lowmem_reserve[]: 0 0 508 508
 HighMem free:137616kB min:508kB low:1096kB high:1684kB active_anon:31968kB inactive_anon:34856kB active_file:23744kB inactive_file:294872kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:520192kB mlocked:0kB dirty:0kB writeback:0kB mapped:27708kB shmem:796kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
 lowmem_reserve[]: 0 0 0 0

There are no isolated pages, so that means we aren't in the
congestion_wait() call related to having too many isolated pages.

We see that the ZONE_DMA is all_unreclaimable and had 680 pages
scanned. ZONE_HIGHMEM had _zero_ pages scanned, which means it must
be over the high water marks for free memory and so no attempt is
made to reclaim from this zone. That means lru_pages is set to
zone_reclaimable_pages(ZONE_DMA), which at this point in time would
be:

	active_anon:0kB inactive_anon:0kB active_file:236kB inactive_file:248kB

about 484k or 121 pages. To get all_unreclaimable set, the
shrink_slab() call must have returned zero to indicate it didn't
free anything.

So the first pass through would have passed that to shrink_slab, and
assuming they are all mapped pages we'd end up with nr_scanned =
242. For the xfs inode cache with 600,000 reclaimable inodes, this
would have resulted in:

	max_pass = 600000
	delta = 4 * 242 / 2 = 484
	delta = 484 * 600,000 = 290,400,000
	delta = 290,400,000 / (121 + 1) = 2,380,327
	shrinker->nr += delta
	if (shrinker->nr > max_pass * 2)
		shrinker->nr = max_pass * 2; = 1,200,000
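
[The arithmetic above can be double-checked with a quick sketch; the
numbers are hardcoded from this report, and seeks = 2 (DEFAULT_SEEKS, as
used by the XFS inode shrinker) is an assumption:]

```python
# Sketch of shrink_slab()'s per-shrinker scan-count bookkeeping
# (mm/vmscan.c, circa 2.6.39), using the values quoted in this thread.

DEFAULT_SEEKS = 2

def shrinker_delta(nr_scanned, lru_pages, max_pass, seeks=DEFAULT_SEEKS):
    """How many objects the shrinker is asked to scan this pass."""
    delta = (4 * nr_scanned) // seeks   # basis scan count
    delta *= max_pass                   # scale by cache size
    delta //= lru_pages + 1             # normalise by LRU size
    return delta

nr_scanned = 242      # pages scanned, as estimated above
lru_pages = 121       # reclaimable pages in the lowmem zone
max_pass = 600000     # reclaimable XFS inodes

delta = shrinker_delta(nr_scanned, lru_pages, max_pass)
nr = min(delta, max_pass * 2)   # shrinker->nr is capped at 2 * max_pass

print(delta)  # 2380327
print(nr)     # 1200000
```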

So, the shrinker->nr should be well above zero, even in the worst
case. The question is now: how on earth is it returning zero?

There are two cases: the shrinker returned -1, or the cache is growing:

                        nr_before = (*shrinker->shrink)(shrinker, 0, gfp_mask);
                        shrink_ret = (*shrinker->shrink)(shrinker, this_scan,
                                                                gfp_mask);
                        if (shrink_ret == -1)
                                break;
                        if (shrink_ret < nr_before)
                                ret += nr_before - shrink_ret;

So, first case will happen for XFS when:

		if (!(gfp_mask & __GFP_FS))
                        return -1;

In most of the OOM-killer invocations, the stack trace is:

 out_of_memory+0x27c/0x360
 __alloc_pages_nodemask+0x6f8/0x708
 new_slab+0x1fc/0x234
 T.915+0x1f8/0x388
 kmem_cache_alloc+0x11c/0x124
 kmem_zone_alloc+0xa4/0x114
 xfs_inode_alloc+0x40/0x13c
 xfs_iget+0x2a8/0x620
 xfs_lookup+0xf8/0x114
 xfs_vn_lookup+0x5c/0xb0
 d_alloc_and_lookup+0x54/0x90
 do_lookup+0x248/0x2bc
 path_lookupat+0xfc/0x8f4
 do_path_lookup+0x34/0xac
 user_path_at+0x64/0xb4
 vfs_fstatat+0x58/0xbc
 sys_fstatat64+0x24/0x50
 ret_from_syscall+0x0/0x38

So we are not preventing reclaim via the gfp_mask. That leaves the
other case, where the number of reclaimable inodes is growing faster
than the shrinker is freeing them.  I can't really see how that is
possible with a single CPU machine without preempt enabled and,
apparently, no dirty inodes. Inode reclaim should not block
(shrinker or background), so there's something else going on here.

Can you run an event trace of all the XFS events during a find for
me? Don't do it over the entire filesystem - only
100,000 inodes is sufficient (i.e. kill the find once the xfs inode
cache slab reaches 100k inodes). While still running the event trace,
can you then drop the caches (echo 3 > /proc/sys/vm/drop_caches) and
check that the xfs inode cache is emptied? If it isn't emptied, drop
caches again to see if that empties it. If you could then post the
event trace, I might be able to see what is going strange with the
shrinker and/or reclaim.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-05-03  0:51                               ` Dave Chinner
@ 2011-05-03  4:04                                 ` Christian Kujau
  2011-05-03  6:36                                   ` Dave Chinner
  2011-05-03 20:53                                 ` Christian Kujau
  1 sibling, 1 reply; 42+ messages in thread
From: Christian Kujau @ 2011-05-03  4:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Markus Trippelsdorf, LKML, xfs, minchan.kim

On Tue, 3 May 2011 at 10:51, Dave Chinner wrote:
> Can you run an event trace of all the XFS events during a find for
> me? Don't do it over the entire filesystem - only

You mean "event tracing", as in Documentation/trace/events.txt. For 
that I will have to enable CONFIG_FTRACE and CONFIG_FUNCTION_TRACER and 
probably others, right?

Looking at http://lwn.net/Articles/341899, I see CONFIG_EVENT_TRACING 
and the way to enable event tracing for "all events in fs/xfs" would be:

  echo 1 > /sys/kernel/debug/tracing/events/xfs/enable

> 100,000 inodes is sufficient (i.e. kill the find once the xfs inode
> cache slab reaches 100k inodes). While still running the event trace,
> can you then drop the caches (echo 3 > /proc/sys/vm/drop_caches) and
> check that the xfs inode cache is emptied? If it isn't emptied, drop
> caches again to see if that empties it. If you could then post the
> event trace, I might be able to see what is going strange with the
> shrinker and/or reclaim.

Will try to do all that.

I wonder why nobody else is affected by this. Because nobody else runs 
powerpc or UP any more? I'm sure other people's filesystems are way bigger 
than mine, with many more inodes to cache...

Thanks for your time,
Christian. 
-- 
BOFH excuse #136:

Daemons loose in system.


* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-05-03  4:04                                 ` Christian Kujau
@ 2011-05-03  6:36                                   ` Dave Chinner
  0 siblings, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2011-05-03  6:36 UTC (permalink / raw)
  To: Christian Kujau; +Cc: Markus Trippelsdorf, LKML, xfs, minchan.kim

On Mon, May 02, 2011 at 09:04:40PM -0700, Christian Kujau wrote:
> On Tue, 3 May 2011 at 10:51, Dave Chinner wrote:
> > Can you run an event trace of all the XFS events during a find for
> > me? Don't do it over the entire filesystem - only
> 
> You mean "event tracing", as in Documentation/trace/events.txt. For 
> that I will have to enable CONFIG_FTRACE and CONFIG_FUNCTION_TRACER and 
> probably others, right?
> 
> Looking at http://lwn.net/Articles/341899, I see CONFIG_EVENT_TRACING 
> and the way to enable event tracing for "all events in fs/xfs" would be:
> 
>   echo 1 > /sys/kernel/debug/tracing/events/xfs/enable

Download trace-cmd and use that - more efficient and easier to
specify the events to record...

> > 100,000 inodes is sufficient (i.e. kill the find once the xfs inode
> > cache slab reaches 100k inodes). While still running the event trace,
> > can you then drop the caches (echo 3 > /proc/sys/vm/drop_caches) and
> > check that the xfs inode cache is emptied? If it isn't emptied, drop
> > caches again to see if that empties it. If you could then post the
> > event trace, I might be able to see what is going strange with the
> > shrinker and/or reclaim.
> 
> Will try to do all that.
> 
> I wonder why nobody else is affected by this. Because nobody else runs 
> powerpc or UP any more? I'm sure other people's filesystems are way bigger 
> than mine, with many more inodes to cache...

XFS on uniprocessor, 32 bit, highmem system is pretty rare. Let
alone on an old powerpc platform. It's so far out in left field that
I'd expect the only test coverage we get is your laptop....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-05-03  0:51                               ` Dave Chinner
  2011-05-03  4:04                                 ` Christian Kujau
@ 2011-05-03 20:53                                 ` Christian Kujau
  2011-05-04  0:46                                   ` Christian Kujau
  1 sibling, 1 reply; 42+ messages in thread
From: Christian Kujau @ 2011-05-03 20:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Markus Trippelsdorf, LKML, xfs, minchan.kim

On Tue, 3 May 2011 at 10:51, Dave Chinner wrote:
> Can you run an event trace of all the XFS events during a find for
> me? Don't do it over the entire filesystem - only
> 100,000 inodes is sufficient (i.e. kill the find once the xfs inode
> cache slab reaches 100k inodes). While still running the event trace,
> can you then drop the caches (echo 3 > /proc/sys/vm/drop_caches) and
> check that the xfs inode cache is emptied? If it isn't emptied, drop
> caches again to see if that empties it. If you could then post the
> event trace, I might be able to see what is going strange with the
> shrinker and/or reclaim.

OK, I've done something. Not sure if I got everything right:

  https://trent.utfs.org/p/bits/2.6.39-rc4/oom/trace/
  (new URL, the other one ran out of webspace. Omit the s in https
   if you don't have the CAcert.org root cert imported)

* I've started 'trace-cmd record -e xfs /usr/bin/find /mnt/backup'
  in one (screen-)window, which produced trace-14.dat.bz2

* I've started my oom-debug.sh script in another, which produced
  slabinfo-14.txt.bz2

* In another window, I was dropping the caches and looked at
  /proc/slabinfo again, see drop_caches-14.txt

Somehow "trace-cmd report" segfaults here, but I hope "trace-14.report" 
contains enough details already. If not, I can do this again.

Thanks,
Christian.
-- 
BOFH excuse #314:

You need to upgrade your VESA local bus to a MasterCard local bus.


* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-05-03 20:53                                 ` Christian Kujau
@ 2011-05-04  0:46                                   ` Christian Kujau
  2011-05-04  1:51                                     ` Christian Kujau
  2011-05-04  7:36                                     ` Dave Chinner
  0 siblings, 2 replies; 42+ messages in thread
From: Christian Kujau @ 2011-05-04  0:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Markus Trippelsdorf, LKML, xfs, minchan.kim

And another one, please see the files marked with 15- here:

   https://trent.utfs.org/p/bits/2.6.39-rc4/oom/trace/

I tried to have more concise timestamps in each of these files, hope that 
helps. Sadly though, trace-cmd report still segfaults on the tracefile.

Christian.
-- 
BOFH excuse #263:

It's stuck in the Web.


* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-05-04  0:46                                   ` Christian Kujau
@ 2011-05-04  1:51                                     ` Christian Kujau
  2011-05-04  7:36                                     ` Dave Chinner
  1 sibling, 0 replies; 42+ messages in thread
From: Christian Kujau @ 2011-05-04  1:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Markus Trippelsdorf, LKML, xfs, minchan.kim

On Tue, 3 May 2011 at 17:46, Christian Kujau wrote:
> And another one, please see the files marked with 15- here:
> 
>    https://trent.utfs.org/p/bits/2.6.39-rc4/oom/trace/
> 
> I tried to have more concise timestamps in each of these files, hope that 
> helps. Sadly though, trace-cmd report still segfaults on the tracefile.

Running "trace-cmd report" on an i386 machine with those trace.dat files 
did not segfault. I've uploaded/will upload the reports to the url above.

Christian.
-- 
BOFH excuse #186:

permission denied


* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-05-04  0:46                                   ` Christian Kujau
  2011-05-04  1:51                                     ` Christian Kujau
@ 2011-05-04  7:36                                     ` Dave Chinner
  2011-05-04 11:12                                       ` Dave Chinner
  1 sibling, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2011-05-04  7:36 UTC (permalink / raw)
  To: Christian Kujau; +Cc: Markus Trippelsdorf, LKML, xfs, minchan.kim

On Tue, May 03, 2011 at 05:46:14PM -0700, Christian Kujau wrote:
> And another one, please see the files marked with 15- here:
> 
>    https://trent.utfs.org/p/bits/2.6.39-rc4/oom/trace/
> 
> I tried to have more concise timestamps in each of these files, hope that 
> helps. Sadly though, trace-cmd report still segfaults on the tracefile.

Ok, that will be helpful. Also helpful is that I've (FINALLY!)
reproduced this myself, and i think i can now reproduce it at will
on a highmem i686 machine. I'll look into it more later tonight....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-05-04  7:36                                     ` Dave Chinner
@ 2011-05-04 11:12                                       ` Dave Chinner
  2011-05-04 19:10                                         ` Christian Kujau
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2011-05-04 11:12 UTC (permalink / raw)
  To: Christian Kujau; +Cc: Markus Trippelsdorf, LKML, xfs, minchan.kim

On Wed, May 04, 2011 at 05:36:15PM +1000, Dave Chinner wrote:
> On Tue, May 03, 2011 at 05:46:14PM -0700, Christian Kujau wrote:
> > And another one, please see the files marked with 15- here:
> > 
> >    https://trent.utfs.org/p/bits/2.6.39-rc4/oom/trace/
> > 
> > I tried to have more concise timestamps in each of these files, hope that 
> > helps. Sadly though, trace-cmd report still segfaults on the tracefile.
> 
> Ok, that will be helpful. Also helpful is that I've (FINALLY!)
> reproduced this myself, and i think i can now reproduce it at will
> on a highmem i686 machine. I'll look into it more later tonight....

And here's a patch for you to try. It fixes the problem on my test
machine.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: ensure reclaim cursor is reset correctly at end of AG

From: Dave Chinner <dchinner@redhat.com>

On a 32 bit highmem PowerPC machine, the XFS inode cache was growing
without bound and exhausting low memory causing the OOM killer to be
triggered. After some effort, the problem was reproduced on a 32 bit
x86 highmem machine.

The problem is that the per-ag inode reclaim index cursor was not
getting reset to the start of the AG if the radix tree tag lookup
found no more reclaimable inodes. Hence every further reclaim
attempt started at the same index beyond where any reclaimable
inodes lay, and no further background reclaim ever occurred from the
AG.

Without background inode reclaim the VM driven cache shrinker
simply cannot keep up with cache growth, and OOM is the result.

While the change that exposed the problem was the conversion of the
inode reclaim to use work queues for background reclaim, it was not
the cause of the bug. The bug was introduced when the cursor code
was added, just waiting for some weird configuration to strike....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_sync.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 3253572..4e1f23a 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -936,6 +936,7 @@ restart:
 					XFS_LOOKUP_BATCH,
 					XFS_ICI_RECLAIM_TAG);
 			if (!nr_found) {
+				done = 1;
 				rcu_read_unlock();
 				break;
 			}

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-05-04 11:12                                       ` Dave Chinner
@ 2011-05-04 19:10                                         ` Christian Kujau
  2011-05-04 23:15                                           ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Christian Kujau @ 2011-05-04 19:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Markus Trippelsdorf, LKML, xfs, minchan.kim

On Wed, 4 May 2011 at 21:12, Dave Chinner wrote:
>> Ok, that will be helpful. Also helpful is that I've (FINALLY!)
>> reproduced this myself, and i think i can now reproduce it at will
>> on a highmem i686 machine. I'll look into it more later tonight....

I've tried to reproduce it on an SMP i686 with 1024MB RAM, but there was 
no OOM there.

> And here's a patch for you to try. It fixes the problem on my test
> machine.....

Excellent! After one run with that patch, the machine does not go OOM any 
more when running du(1) over this XFS filesystem.

I've run du(1) along with my oom-debug script to gather sysrq-w and 
slabinfo, see https://trent.utfs.org/p/bits/2.6.39-rc4/oom/trace/16-* for 
the details.

I'll poke it a bit more over the day, but your patch (one line only! Wow!) 
seems to help.

Thanks so much,
Christian.
-- 
BOFH excuse #160:

non-redundant fan failure


* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-05-04 19:10                                         ` Christian Kujau
@ 2011-05-04 23:15                                           ` Dave Chinner
  2011-05-05  2:07                                             ` Christian Kujau
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2011-05-04 23:15 UTC (permalink / raw)
  To: Christian Kujau; +Cc: Markus Trippelsdorf, LKML, xfs, minchan.kim

On Wed, May 04, 2011 at 12:10:04PM -0700, Christian Kujau wrote:
> On Wed, 4 May 2011 at 21:12, Dave Chinner wrote:
> >> Ok, that will be helpful. Also helpful is that I've (FINALLY!)
> >> reproduced this myself, and i think i can now reproduce it at will
> >> on a highmem i686 machine. I'll look into it more later tonight....
> 
> I've tried to reproduce it on an SMP i686 with 1024MB RAM, but there was 
> no OOM there.
> 
> > And here's a patch for you to try. It fixes the problem on my test
> > machine.....
> 
> Excellent! After one run with that patch, the machine does not go OOM any 
> more when running du(1) over this XFS filesystem.

Ok, that is good to know that we are seeing the same problem ;)

I'll push the fix through the XFS tree to mainline. Thanks for all
your effort helping me understand and debug the problem, Christian.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: 2.6.39-rc4+: oom-killer busy killing tasks
  2011-05-04 23:15                                           ` Dave Chinner
@ 2011-05-05  2:07                                             ` Christian Kujau
  0 siblings, 0 replies; 42+ messages in thread
From: Christian Kujau @ 2011-05-05  2:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Markus Trippelsdorf, LKML, xfs, minchan.kim

On Thu, 5 May 2011 at 09:15, Dave Chinner wrote:
> I'll push the fix through the XFS tree to mainline. Thanks for all
> your effort helping me understand and debug the problem, Christian.

OK, I gave it some more testing and compiled the kernel with UP and 
TINY_RCU again and with your patch applied to -rc6 I could not reproduce 
the issue any more. Plain -rc6 still has the OOM problem.

Feel free to add:

  Tested-By: Christian Kujau <lists@nerdbynature.de>

Again, thanks so much for staying with me and guiding me through all the 
debug steps, and for all your effort fixing such a rare setup. I hope more 
people will share the joy :-)

Christian.
-- 
BOFH excuse #164:

root rot


end of thread, other threads:[~2011-05-05  2:07 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-04-22  1:57 2.6.39-rc4+: oom-killer busy killing tasks Christian Kujau
2011-04-22  2:58 ` Minchan Kim
2011-04-22  3:03   ` Christian Kujau
2011-04-22 17:41   ` Christian Kujau
2011-04-22 18:46     ` Christian Kujau
2011-04-22 22:47       ` Minchan Kim
2011-04-24 23:46 ` Dave Chinner
2011-04-25  5:51   ` Christian Kujau
2011-04-25  7:19     ` Christian Kujau
2011-04-26 15:14       ` Christian Kujau
2011-04-27  2:26       ` Dave Chinner
2011-04-27  7:46         ` Christian Kujau
2011-04-27 10:28           ` Dave Chinner
2011-04-27 23:16             ` Minchan Kim
2011-04-27 23:56               ` Dave Chinner
2011-04-28 17:30             ` Christian Kujau
2011-04-28 23:37               ` Dave Chinner
2011-04-29 17:32                 ` Christian Kujau
2011-04-29 19:58                 ` Christian Kujau
2011-04-29 20:17                   ` Markus Trippelsdorf
2011-04-29 20:20                     ` Christian Kujau
2011-04-29 20:21                       ` Markus Trippelsdorf
2011-04-30  0:17                     ` Christian Kujau
2011-05-01  8:01                       ` Dave Chinner
2011-05-02  4:59                         ` Christian Kujau
2011-05-02 12:19                           ` Dave Chinner
2011-05-02 19:59                             ` Christian Kujau
2011-05-03  0:51                               ` Dave Chinner
2011-05-03  4:04                                 ` Christian Kujau
2011-05-03  6:36                                   ` Dave Chinner
2011-05-03 20:53                                 ` Christian Kujau
2011-05-04  0:46                                   ` Christian Kujau
2011-05-04  1:51                                     ` Christian Kujau
2011-05-04  7:36                                     ` Dave Chinner
2011-05-04 11:12                                       ` Dave Chinner
2011-05-04 19:10                                         ` Christian Kujau
2011-05-04 23:15                                           ` Dave Chinner
2011-05-05  2:07                                             ` Christian Kujau
2011-05-02  9:26                         ` Christian Kujau
2011-05-02 12:38                           ` Dave Chinner
2011-04-25  8:02   ` Christian Kujau
2011-04-25  9:50     ` Christian Kujau
