* 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer
@ 2010-04-04 22:49 ` Hans-Peter Jansen
  0 siblings, 0 replies; 48+ messages in thread
From: Hans-Peter Jansen @ 2010-04-04 22:49 UTC (permalink / raw)
  To: linux-kernel; +Cc: opensuse-kernel, xfs

[Sorry for the cross-post, but I don't know where to start tackling this
 issue]

Hi,

While attempting to move to a current kernel, I hit an issue where a
simple du on a reasonably big XFS tree ends up invoking the OOM killer:

Apr  4 23:24:53 tyrex kernel: [  418.913223] XFS mounting filesystem sdd1
Apr  4 23:24:54 tyrex kernel: [  419.774606] Ending clean XFS mount for filesystem: sdd1
Apr  4 23:26:02 tyrex kernel: [  488.160795] du invoked oom-killer: gfp_mask=0x802d0, order=0, oom_adj=0
Apr  4 23:26:02 tyrex kernel: [  488.160798] du cpuset=/ mems_allowed=0
Apr  4 23:26:02 tyrex kernel: [  488.160800] Pid: 6397, comm: du Tainted: G        W  2.6.34-rc3-13-vanilla #1
Apr  4 23:26:02 tyrex kernel: [  488.160802] Call Trace:
Apr  4 23:26:02 tyrex kernel: [  488.160808]  [<c02becc7>] dump_header+0x67/0x1a0
Apr  4 23:26:02 tyrex kernel: [  488.160811]  [<c03cf1a7>] ? ___ratelimit+0x77/0xe0
Apr  4 23:26:02 tyrex kernel: [  488.160813]  [<c02bee59>] oom_kill_process+0x59/0x160
Apr  4 23:26:02 tyrex kernel: [  488.160815]  [<c02bf43e>] __out_of_memory+0x4e/0xc0
Apr  4 23:26:02 tyrex kernel: [  488.160817]  [<c02bf502>] out_of_memory+0x52/0xc0
Apr  4 23:26:02 tyrex kernel: [  488.160819]  [<c02c20f4>] __alloc_pages_slowpath+0x444/0x4c0
Apr  4 23:26:02 tyrex kernel: [  488.160822]  [<c02c22c2>] __alloc_pages_nodemask+0x152/0x160
Apr  4 23:26:02 tyrex kernel: [  488.160825]  [<c02ea4a9>] cache_grow+0x249/0x2e0
Apr  4 23:26:02 tyrex kernel: [  488.160838]  [<c02ea748>] cache_alloc_refill+0x208/0x240
Apr  4 23:26:02 tyrex kernel: [  488.160840]  [<c02eab19>] kmem_cache_alloc+0xb9/0xc0
Apr  4 23:26:02 tyrex kernel: [  488.160868]  [<f86375dd>] ? xfs_trans_brelse+0xfd/0x150 [xfs]
Apr  4 23:26:02 tyrex kernel: [  488.160888]  [<f863d547>] kmem_zone_alloc+0x77/0xb0 [xfs]
Apr  4 23:26:02 tyrex kernel: [  488.160905]  [<f860a043>] ? xfs_da_state_free+0x53/0x60 [xfs]
Apr  4 23:26:02 tyrex kernel: [  488.160923]  [<f861c796>] xfs_inode_alloc+0x26/0x110 [xfs]
Apr  4 23:26:02 tyrex kernel: [  488.160941]  [<f861c8b2>] xfs_iget_cache_miss+0x32/0x200 [xfs]
Apr  4 23:26:02 tyrex kernel: [  488.160960]  [<f8630fd9>] ? xfs_perag_get+0x29/0xa0 [xfs]
Apr  4 23:26:02 tyrex kernel: [  488.160977]  [<f861cbac>] xfs_iget+0x12c/0x1b0 [xfs]
Apr  4 23:26:02 tyrex kernel: [  488.160997]  [<f863c9c5>] xfs_lookup+0xc5/0x110 [xfs]
Apr  4 23:26:02 tyrex kernel: [  488.161017]  [<f86467d1>] xfs_vn_lookup+0x51/0xa0 [xfs]
Apr  4 23:26:02 tyrex kernel: [  488.161021]  [<c02ffef7>] do_lookup+0xa7/0x1c0
Apr  4 23:26:02 tyrex kernel: [  488.161023]  [<c030089b>] link_path_walk+0x1db/0x850
Apr  4 23:26:02 tyrex kernel: [  488.161026]  [<c024a870>] ? capable+0x10/0x40
Apr  4 23:26:02 tyrex kernel: [  488.161028]  [<c038969e>] ? security_inode_permission+0x1e/0x20
Apr  4 23:26:02 tyrex kernel: [  488.161030]  [<c0301294>] path_walk+0x54/0xc0
Apr  4 23:26:02 tyrex kernel: [  488.161032]  [<c0301431>] do_path_lookup+0x51/0x90
Apr  4 23:26:02 tyrex kernel: [  488.161034]  [<c0301f8a>] user_path_at+0x4a/0x80
Apr  4 23:26:02 tyrex kernel: [  488.161037]  [<c02f9748>] ? cp_new_stat64+0xe8/0x100
Apr  4 23:26:02 tyrex kernel: [  488.161039]  [<c02f9bb5>] vfs_fstatat+0x35/0x70
Apr  4 23:26:02 tyrex kernel: [  488.161041]  [<c02f9c0d>] sys_fstatat64+0x1d/0x40
Apr  4 23:26:02 tyrex kernel: [  488.161044]  [<c0202d8c>] sysenter_do_call+0x12/0x22
Apr  4 23:26:02 tyrex kernel: [  488.161045] Mem-Info:
Apr  4 23:26:02 tyrex kernel: [  488.161046] DMA per-cpu:
Apr  4 23:26:02 tyrex kernel: [  488.161047] CPU    0: hi:    0, btch:   1 usd:   0
Apr  4 23:26:02 tyrex kernel: [  488.161048] CPU    1: hi:    0, btch:   1 usd:   0
Apr  4 23:26:02 tyrex kernel: [  488.161050] CPU    2: hi:    0, btch:   1 usd:   0
Apr  4 23:26:02 tyrex kernel: [  488.161051] CPU    3: hi:    0, btch:   1 usd:   0
Apr  4 23:26:02 tyrex kernel: [  488.161052] CPU    4: hi:    0, btch:   1 usd:   0
Apr  4 23:26:02 tyrex kernel: [  488.161053] CPU    5: hi:    0, btch:   1 usd:   0
Apr  4 23:26:02 tyrex kernel: [  488.161054] CPU    6: hi:    0, btch:   1 usd:   0
Apr  4 23:26:02 tyrex kernel: [  488.161056] CPU    7: hi:    0, btch:   1 usd:   0
Apr  4 23:26:02 tyrex kernel: [  488.161057] Normal per-cpu:
Apr  4 23:26:02 tyrex kernel: [  488.161058] CPU    0: hi:  186, btch:  31 usd: 173
Apr  4 23:26:02 tyrex kernel: [  488.161059] CPU    1: hi:  186, btch:  31 usd:  18
Apr  4 23:26:02 tyrex kernel: [  488.161060] CPU    2: hi:  186, btch:  31 usd:  41
Apr  4 23:26:02 tyrex kernel: [  488.161061] CPU    3: hi:  186, btch:  31 usd: 197
Apr  4 23:26:02 tyrex kernel: [  488.161062] CPU    4: hi:  186, btch:  31 usd: 175
Apr  4 23:26:02 tyrex kernel: [  488.161063] CPU    5: hi:  186, btch:  31 usd: 157
Apr  4 23:26:02 tyrex kernel: [  488.161065] CPU    6: hi:  186, btch:  31 usd: 182
Apr  4 23:26:02 tyrex kernel: [  488.161066] CPU    7: hi:  186, btch:  31 usd: 173
Apr  4 23:26:02 tyrex kernel: [  488.161067] HighMem per-cpu:
Apr  4 23:26:02 tyrex kernel: [  488.161068] CPU    0: hi:  186, btch:  31 usd:  24
Apr  4 23:26:02 tyrex kernel: [  488.161069] CPU    1: hi:  186, btch:  31 usd:  33
Apr  4 23:26:02 tyrex kernel: [  488.161070] CPU    2: hi:  186, btch:  31 usd:  55
Apr  4 23:26:02 tyrex kernel: [  488.161071] CPU    3: hi:  186, btch:  31 usd: 172
Apr  4 23:26:02 tyrex kernel: [  488.161072] CPU    4: hi:  186, btch:  31 usd: 177
Apr  4 23:26:02 tyrex kernel: [  488.161073] CPU    5: hi:  186, btch:  31 usd: 100
Apr  4 23:26:02 tyrex kernel: [  488.161075] CPU    6: hi:  186, btch:  31 usd:  38
Apr  4 23:26:02 tyrex kernel: [  488.161076] CPU    7: hi:  186, btch:  31 usd: 115
Apr  4 23:26:02 tyrex kernel: [  488.161078] active_anon:98428 inactive_anon:24627 isolated_anon:0
Apr  4 23:26:02 tyrex kernel: [  488.161079]  active_file:4141 inactive_file:25571 isolated_file:0
Apr  4 23:26:02 tyrex kernel: [  488.161080]  unevictable:0 dirty:2 writeback:0 unstable:0
Apr  4 23:26:02 tyrex kernel: [  488.161080]  free:403317 slab_reclaimable:203863 slab_unreclaimable:4837
Apr  4 23:26:02 tyrex kernel: [  488.161081]  mapped:7418 shmem:137 pagetables:625 bounce:0
Apr  4 23:26:02 tyrex kernel: [  488.161085] 
DMA free:3520kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB 
active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:15804kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:6032kB 
slab_unreclaimable:36kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes
Apr  4 23:26:02 tyrex kernel: [  488.161089] lowmem_reserve[]: 0 865 3023 3023
Apr  4 23:26:02 tyrex kernel: [  488.161093] 
Normal free:3580kB min:3728kB low:4660kB high:5592kB active_anon:0kB inactive_anon:0kB 
active_file:360kB inactive_file:264kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:885944kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:809420kB 
slab_unreclaimable:19312kB kernel_stack:3304kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB 
pages_scanned:24 all_unreclaimable? no
Apr  4 23:26:02 tyrex kernel: [  488.161097] lowmem_reserve[]: 0 0 17267 17267
Apr  4 23:26:02 tyrex kernel: [  488.161101] 
HighMem free:1606168kB min:512kB low:2836kB high:5164kB active_anon:393712kB inactive_anon:98508kB 
active_file:16228kB inactive_file:102020kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:2210252kB mlocked:0kB dirty:8kB writeback:0kB mapped:29672kB shmem:548kB slab_reclaimable:0kB 
slab_unreclaimable:0kB kernel_stack:0kB pagetables:2500kB unstable:0kB bounce:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? no
Apr  4 23:26:02 tyrex kernel: [  488.161105] lowmem_reserve[]: 0 0 0 0
Apr  4 23:26:02 tyrex kernel: [  488.161107] DMA: 18*4kB 53*8kB 31*16kB 20*32kB 14*64kB 8*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3552kB
Apr  4 23:26:02 tyrex kernel: [  488.161112] Normal: 32*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3704kB
Apr  4 23:26:02 tyrex kernel: [  488.161117] HighMem: 17*4kB 29*8kB 47*16kB 16*32kB 6*64kB 30*128kB 53*256kB 27*512kB 14*1024kB 7*2048kB 377*4096kB = 1606044kB
Apr  4 23:26:02 tyrex kernel: [  488.161122] 29947 total pagecache pages
Apr  4 23:26:02 tyrex kernel: [  488.161123] 0 pages in swap cache
Apr  4 23:26:02 tyrex kernel: [  488.161124] Swap cache stats: add 0, delete 0, find 0/0
Apr  4 23:26:02 tyrex kernel: [  488.161125] Free swap  = 2104476kB
Apr  4 23:26:02 tyrex kernel: [  488.161126] Total swap = 2104476kB
Apr  4 23:26:02 tyrex kernel: [  488.165523] 784224 pages RAM
Apr  4 23:26:02 tyrex kernel: [  488.165524] 556914 pages HighMem
Apr  4 23:26:02 tyrex kernel: [  488.165525] 12060 pages reserved
Apr  4 23:26:02 tyrex kernel: [  488.165526] 82604 pages shared
Apr  4 23:26:02 tyrex kernel: [  488.165527] 328045 pages non-shared
Apr  4 23:26:02 tyrex kernel: [  488.165529] Out of memory: kill process 4788 (mysqld-max) score 326208 or a child
Apr  4 23:26:02 tyrex kernel: [  488.165531] Killed process 4788 (mysqld-max) vsz:1304832kB, anon-rss:121428kB, file-rss:4336kB
[...]

vmstat 1, when it occurs:
 1  0      0 1722516      4 926532    0    0 13800     0 2598 6479  2 14 84  0
 3  0      0 1735540      4 927392    0    0 10128     0 2144 5434  0  8 92  0
 1  0      0 1733944      4 927868    0    0 10416    92 4465 8264  1  7 93  0
 0  0      0 1733448      4 928024    0    0  9220     0 2081 3570  1  8 91  0
 0  0      0 1734160      4 927412    0    0  7404     0 1912 3014  0  6 90  4
 1  0      0 1733072      4 927208    0    0  9512     0 2227 3591  1  8 89  2
 1  1      0 1728356      4 928256    0    0  4620     0 1334 2087  0  2 92  5
 2  1      0 1926208      4 928832    0    0  3848   712 1869 3515  0  5 84 11
 1  0      0 2181520      4 849064    0    0  2736  3983 3154 308011  2 14 71 13
 0  0      0 2181028      4 847560    0    0   140   109  300  484  2  0 98  0
 4  0      0 2561212      4 467272    0    0     4     0  383  233  0  6 94  0
 0  0      0 2995016      4  33560    0    0    44     0  421  238  0  9 91  0
 0  0      0 2995024      4  33584    0    0   120     0  187  251  0  0 100  0
 0  0      0 2994932      4  33636    0    0     0     0   34   63  0  0 100  0
 1  0      0 2995212      4  33548    0    0     8     0   72   94  0  0 100  0
 0  0      0 2995368      4  33528    0    0     0     0   39   73  0  0 100  0
 0  0      0 2995384      4  33348    0    0     0     0   96  116  0  0 100  0
 0  0      0 2995556      4  33256    0    0    40     0  139  184  0  0 100  0
 0  0      0 2995800      4  33248    0    0    68     0  178  193  0  0 100  0

The OOM killer was triggered around the line showing 308011 context switches.
Needless to say, this event renders the system dysfunctional.
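
A minimal sketch of how such a capture can be taken, running vmstat in the
background while du walks the tree (the path is the one from this report;
adjust as needed):

```shell
#!/bin/sh
# Sketch: log vmstat once per second while du walks the tree, so the
# memory squeeze around the OOM event ends up in one file.
# /backup/daily.0 is the tree from this report; pass another path as $1.
TREE="${1:-/backup/daily.0}"
vmstat 1 > /tmp/du-vmstat.log 2>&1 &
VM_PID=$!
du -sh "$TREE" 2>/dev/null || echo "du failed on $TREE"
kill "$VM_PID" 2>/dev/null || true
```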

Sorry for the insanely long lines, but wrapping them to a sane width results
in an unparseable mess. This kernel is the closest to upstream that I was able
to produce; it was built in the Novell build service and is available here:
http://download.opensuse.org/repositories/home:/frispete:/kernel-head/openSUSE_11.1/
As you can see, userspace is openSUSE 11.1 and the arch is i586; the kernel is
a vanilla build of rc3 with Linus' git tree applied up to 5e11611a5d22252f3f9c169a3c9377eac0c32033.
I have also reproduced this issue with 2.6.33.1. Usually I run the openSUSE
customized kernels; e.g. kernel-pae-2.6.27.45-0.1.1 does not show this issue.

boot.msg: ftp://urpla.net/boot.msg
config: ftp://urpla.net/vanilla

I do use some unusual vm settings, as documented in boot.msg, but those
should not cause this issue, should they?
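
For comparison, the current vm tunables can be dumped straight from /proc (a
generic sketch, not the exact settings from boot.msg):

```shell
# Dump every vm.* tunable by reading /proc/sys/vm directly; no sysctl
# binary needed. Write-only entries (e.g. compact_memory) are skipped.
for f in /proc/sys/vm/*; do
    [ -r "$f" ] || continue
    printf '%s = %s\n' "vm.$(basename "$f")" "$(cat "$f" 2>/dev/null)"
done
```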

Hardware: Xeon X3460, Asus P7F-E motherboard, Intel 3420 PCH chipset, 12 GB Mem

In case you wonder why I ran this test with a 4 GB memory setup: that was
the easiest way to demonstrate the problem without resorting to tainted and
heavily customized kernels. And for compatibility reasons (customer support),
I have to stay on i586 for the moment.

Of course, the du command succeeds with the distribution kernel under exactly
the same (userspace and sysctl) settings:
1,1T    /backup/daily.0

Anybody have an idea what's going on here?

TIA,
Pete


* Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer
  2010-04-04 22:49 ` Hans-Peter Jansen
@ 2010-04-05  0:49   ` Dave Chinner
  -1 siblings, 0 replies; 48+ messages in thread
From: Dave Chinner @ 2010-04-05  0:49 UTC (permalink / raw)
  To: Hans-Peter Jansen; +Cc: linux-kernel, opensuse-kernel, xfs

On Mon, Apr 05, 2010 at 12:49:17AM +0200, Hans-Peter Jansen wrote:
> [Sorry for the cross post, but I don't know where to start to tackle this 
>  issue]
> 
> Hi,
> 
> on an attempt to get to a current kernel, I suffer from an issue, where a 
> simple du on a reasonably big xfs tree leads to invoking the oom killer: 

How big is the directory tree (how many inodes, etc)?
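
Two quick ways to answer that (a sketch; the count_tree helper is illustrative,
and the path is the one from the report):

```shell
# Sketch: size the tree Dave is asking about. df -i reports inode usage
# for the containing filesystem; find counts the entries under the tree.
count_tree() {
    df -i "$1" 2>/dev/null || true    # filesystem-wide inode usage
    find "$1" 2>/dev/null | wc -l     # entries (inodes) under the tree
}

count_tree /backup/daily.0
```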

> Apr  4 23:24:53 tyrex kernel: [  418.913223] XFS mounting filesystem sdd1
> Apr  4 23:24:54 tyrex kernel: [  419.774606] Ending clean XFS mount for filesystem: sdd1
> Apr  4 23:26:02 tyrex kernel: [  488.160795] du invoked oom-killer: gfp_mask=0x802d0, order=0, oom_adj=0
> Apr  4 23:26:02 tyrex kernel: [  488.160798] du cpuset=/ mems_allowed=0
> Apr  4 23:26:02 tyrex kernel: [  488.160800] Pid: 6397, comm: du Tainted: G        W  2.6.34-rc3-13-vanilla #1
> Apr  4 23:26:02 tyrex kernel: [  488.160802] Call Trace:
> Apr  4 23:26:02 tyrex kernel: [  488.160808]  [<c02becc7>] dump_header+0x67/0x1a0
> Apr  4 23:26:02 tyrex kernel: [  488.160811]  [<c03cf1a7>] ? ___ratelimit+0x77/0xe0
> Apr  4 23:26:02 tyrex kernel: [  488.160813]  [<c02bee59>] oom_kill_process+0x59/0x160
> Apr  4 23:26:02 tyrex kernel: [  488.160815]  [<c02bf43e>] __out_of_memory+0x4e/0xc0
> Apr  4 23:26:02 tyrex kernel: [  488.160817]  [<c02bf502>] out_of_memory+0x52/0xc0
> Apr  4 23:26:02 tyrex kernel: [  488.160819]  [<c02c20f4>] __alloc_pages_slowpath+0x444/0x4c0
> Apr  4 23:26:02 tyrex kernel: [  488.160822]  [<c02c22c2>] __alloc_pages_nodemask+0x152/0x160
> Apr  4 23:26:02 tyrex kernel: [  488.160825]  [<c02ea4a9>] cache_grow+0x249/0x2e0
> Apr  4 23:26:02 tyrex kernel: [  488.160838]  [<c02ea748>] cache_alloc_refill+0x208/0x240
> Apr  4 23:26:02 tyrex kernel: [  488.160840]  [<c02eab19>] kmem_cache_alloc+0xb9/0xc0
> Apr  4 23:26:02 tyrex kernel: [  488.160868]  [<f86375dd>] ? xfs_trans_brelse+0xfd/0x150 [xfs]
> Apr  4 23:26:02 tyrex kernel: [  488.160888]  [<f863d547>] kmem_zone_alloc+0x77/0xb0 [xfs]
> Apr  4 23:26:02 tyrex kernel: [  488.160905]  [<f860a043>] ? xfs_da_state_free+0x53/0x60 [xfs]
> Apr  4 23:26:02 tyrex kernel: [  488.160923]  [<f861c796>] xfs_inode_alloc+0x26/0x110 [xfs]
> Apr  4 23:26:02 tyrex kernel: [  488.160941]  [<f861c8b2>] xfs_iget_cache_miss+0x32/0x200 [xfs]
> Apr  4 23:26:02 tyrex kernel: [  488.160960]  [<f8630fd9>] ? xfs_perag_get+0x29/0xa0 [xfs]
> Apr  4 23:26:02 tyrex kernel: [  488.160977]  [<f861cbac>] xfs_iget+0x12c/0x1b0 [xfs]
> Apr  4 23:26:02 tyrex kernel: [  488.160997]  [<f863c9c5>] xfs_lookup+0xc5/0x110 [xfs]
> Apr  4 23:26:02 tyrex kernel: [  488.161017]  [<f86467d1>] xfs_vn_lookup+0x51/0xa0 [xfs]
> Apr  4 23:26:02 tyrex kernel: [  488.161021]  [<c02ffef7>] do_lookup+0xa7/0x1c0
> Apr  4 23:26:02 tyrex kernel: [  488.161023]  [<c030089b>] link_path_walk+0x1db/0x850
> Apr  4 23:26:02 tyrex kernel: [  488.161026]  [<c024a870>] ? capable+0x10/0x40
> Apr  4 23:26:02 tyrex kernel: [  488.161028]  [<c038969e>] ? security_inode_permission+0x1e/0x20
> Apr  4 23:26:02 tyrex kernel: [  488.161030]  [<c0301294>] path_walk+0x54/0xc0
> Apr  4 23:26:02 tyrex kernel: [  488.161032]  [<c0301431>] do_path_lookup+0x51/0x90
> Apr  4 23:26:02 tyrex kernel: [  488.161034]  [<c0301f8a>] user_path_at+0x4a/0x80
> Apr  4 23:26:02 tyrex kernel: [  488.161037]  [<c02f9748>] ? cp_new_stat64+0xe8/0x100
> Apr  4 23:26:02 tyrex kernel: [  488.161039]  [<c02f9bb5>] vfs_fstatat+0x35/0x70
> Apr  4 23:26:02 tyrex kernel: [  488.161041]  [<c02f9c0d>] sys_fstatat64+0x1d/0x40
> Apr  4 23:26:02 tyrex kernel: [  488.161044]  [<c0202d8c>] sysenter_do_call+0x12/0x22
> Apr  4 23:26:02 tyrex kernel: [  488.161045] Mem-Info:
> Apr  4 23:26:02 tyrex kernel: [  488.161046] DMA per-cpu:
> Apr  4 23:26:02 tyrex kernel: [  488.161047] CPU    0: hi:    0, btch:   1 usd:   0
> Apr  4 23:26:02 tyrex kernel: [  488.161048] CPU    1: hi:    0, btch:   1 usd:   0
> Apr  4 23:26:02 tyrex kernel: [  488.161050] CPU    2: hi:    0, btch:   1 usd:   0
> Apr  4 23:26:02 tyrex kernel: [  488.161051] CPU    3: hi:    0, btch:   1 usd:   0
> Apr  4 23:26:02 tyrex kernel: [  488.161052] CPU    4: hi:    0, btch:   1 usd:   0
> Apr  4 23:26:02 tyrex kernel: [  488.161053] CPU    5: hi:    0, btch:   1 usd:   0
> Apr  4 23:26:02 tyrex kernel: [  488.161054] CPU    6: hi:    0, btch:   1 usd:   0
> Apr  4 23:26:02 tyrex kernel: [  488.161056] CPU    7: hi:    0, btch:   1 usd:   0
> Apr  4 23:26:02 tyrex kernel: [  488.161057] Normal per-cpu:
> Apr  4 23:26:02 tyrex kernel: [  488.161058] CPU    0: hi:  186, btch:  31 usd: 173
> Apr  4 23:26:02 tyrex kernel: [  488.161059] CPU    1: hi:  186, btch:  31 usd:  18
> Apr  4 23:26:02 tyrex kernel: [  488.161060] CPU    2: hi:  186, btch:  31 usd:  41
> Apr  4 23:26:02 tyrex kernel: [  488.161061] CPU    3: hi:  186, btch:  31 usd: 197
> Apr  4 23:26:02 tyrex kernel: [  488.161062] CPU    4: hi:  186, btch:  31 usd: 175
> Apr  4 23:26:02 tyrex kernel: [  488.161063] CPU    5: hi:  186, btch:  31 usd: 157
> Apr  4 23:26:02 tyrex kernel: [  488.161065] CPU    6: hi:  186, btch:  31 usd: 182
> Apr  4 23:26:02 tyrex kernel: [  488.161066] CPU    7: hi:  186, btch:  31 usd: 173
> Apr  4 23:26:02 tyrex kernel: [  488.161067] HighMem per-cpu:
> Apr  4 23:26:02 tyrex kernel: [  488.161068] CPU    0: hi:  186, btch:  31 usd:  24
> Apr  4 23:26:02 tyrex kernel: [  488.161069] CPU    1: hi:  186, btch:  31 usd:  33
> Apr  4 23:26:02 tyrex kernel: [  488.161070] CPU    2: hi:  186, btch:  31 usd:  55
> Apr  4 23:26:02 tyrex kernel: [  488.161071] CPU    3: hi:  186, btch:  31 usd: 172
> Apr  4 23:26:02 tyrex kernel: [  488.161072] CPU    4: hi:  186, btch:  31 usd: 177
> Apr  4 23:26:02 tyrex kernel: [  488.161073] CPU    5: hi:  186, btch:  31 usd: 100
> Apr  4 23:26:02 tyrex kernel: [  488.161075] CPU    6: hi:  186, btch:  31 usd:  38
> Apr  4 23:26:02 tyrex kernel: [  488.161076] CPU    7: hi:  186, btch:  31 usd: 115
> Apr  4 23:26:02 tyrex kernel: [  488.161078] active_anon:98428 inactive_anon:24627 isolated_anon:0
> Apr  4 23:26:02 tyrex kernel: [  488.161079]  active_file:4141 inactive_file:25571 isolated_file:0
> Apr  4 23:26:02 tyrex kernel: [  488.161080]  unevictable:0 dirty:2 writeback:0 unstable:0
> Apr  4 23:26:02 tyrex kernel: [  488.161080]  free:403317 slab_reclaimable:203863 slab_unreclaimable:4837
> Apr  4 23:26:02 tyrex kernel: [  488.161081]  mapped:7418 shmem:137 pagetables:625 bounce:0
> Apr  4 23:26:02 tyrex kernel: [  488.161085] 
> DMA free:3520kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB 
> active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
> present:15804kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:6032kB 
> slab_unreclaimable:36kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB 
> pages_scanned:0 all_unreclaimable? yes
> Apr  4 23:26:02 tyrex kernel: [  488.161089] lowmem_reserve[]: 0 865 3023 3023
> Apr  4 23:26:02 tyrex kernel: [  488.161093] 
> Normal free:3580kB min:3728kB low:4660kB high:5592kB active_anon:0kB inactive_anon:0kB 
> active_file:360kB inactive_file:264kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
> present:885944kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:809420kB 
> slab_unreclaimable:19312kB kernel_stack:3304kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB 
> pages_scanned:24 all_unreclaimable? no
> Apr  4 23:26:02 tyrex kernel: [  488.161097] lowmem_reserve[]: 0 0 17267 17267
> Apr  4 23:26:02 tyrex kernel: [  488.161101] 
> HighMem free:1606168kB min:512kB low:2836kB high:5164kB active_anon:393712kB inactive_anon:98508kB 
> active_file:16228kB inactive_file:102020kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
> present:2210252kB mlocked:0kB dirty:8kB writeback:0kB mapped:29672kB shmem:548kB slab_reclaimable:0kB 
> slab_unreclaimable:0kB kernel_stack:0kB pagetables:2500kB unstable:0kB bounce:0kB writeback_tmp:0kB 
> pages_scanned:0 all_unreclaimable? no
> Apr  4 23:26:02 tyrex kernel: [  488.161105] lowmem_reserve[]: 0 0 0 0
> Apr  4 23:26:02 tyrex kernel: [  488.161107] DMA: 18*4kB 53*8kB 31*16kB 20*32kB 14*64kB 8*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3552kB
> Apr  4 23:26:02 tyrex kernel: [  488.161112] Normal: 32*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3704kB
> Apr  4 23:26:02 tyrex kernel: [  488.161117] HighMem: 17*4kB 29*8kB 47*16kB 16*32kB 6*64kB 30*128kB 53*256kB 27*512kB 14*1024kB 7*2048kB 377*4096kB = 1606044kB
> Apr  4 23:26:02 tyrex kernel: [  488.161122] 29947 total pagecache pages
> Apr  4 23:26:02 tyrex kernel: [  488.161123] 0 pages in swap cache
> Apr  4 23:26:02 tyrex kernel: [  488.161124] Swap cache stats: add 0, delete 0, find 0/0
> Apr  4 23:26:02 tyrex kernel: [  488.161125] Free swap  = 2104476kB
> Apr  4 23:26:02 tyrex kernel: [  488.161126] Total swap = 2104476kB
> Apr  4 23:26:02 tyrex kernel: [  488.165523] 784224 pages RAM
> Apr  4 23:26:02 tyrex kernel: [  488.165524] 556914 pages HighMem
> Apr  4 23:26:02 tyrex kernel: [  488.165525] 12060 pages reserved
> Apr  4 23:26:02 tyrex kernel: [  488.165526] 82604 pages shared
> Apr  4 23:26:02 tyrex kernel: [  488.165527] 328045 pages non-shared
> Apr  4 23:26:02 tyrex kernel: [  488.165529] Out of memory: kill process 4788 (mysqld-max) score 326208 or a child
> Apr  4 23:26:02 tyrex kernel: [  488.165531] Killed process 4788 (mysqld-max) vsz:1304832kB, anon-rss:121428kB, file-rss:4336kB
> [...]

Oh, this is a highmem box. You ran out of low memory, I think, which
is where all the inodes are cached. Seems like a VM problem or a
highmem/lowmem split config problem to me, not anything to do with
XFS...
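For anyone wanting to watch this happen, the lowmem/slab pressure Dave describes can be observed from standard procfs interfaces while the du runs (a sketch; the LowFree field only exists on 32-bit highmem kernels, and /proc/slabinfo may need root):

```shell
# Free memory and slab usage; on a 32-bit highmem kernel, LowFree is the
# number that matters, since inode/dentry caches live entirely in low memory.
grep -E '^(MemFree|LowFree|SReclaimable|SUnreclaim)' /proc/meminfo
# Slab caches with the most active objects -- expect xfs_inode and dentry
# near the top during the du (silently skipped if slabinfo is unreadable):
sort -rn -k2 /proc/slabinfo 2>/dev/null | head -5
```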

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer
  2010-04-05  0:49   ` Dave Chinner
@ 2010-04-05 11:35     ` Hans-Peter Jansen
  -1 siblings, 0 replies; 48+ messages in thread
From: Hans-Peter Jansen @ 2010-04-05 11:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: Dave Chinner, opensuse-kernel, xfs

On Monday 05 April 2010, 02:49:06 Dave Chinner wrote:
> On Mon, Apr 05, 2010 at 12:49:17AM +0200, Hans-Peter Jansen wrote:
> > [Sorry for the cross post, but I don't know where to start to tackle this 
> >  issue]
> > 
> > Hi,
> > 
> > on an attempt to get to a current kernel, I suffer from an issue, where a 
> > simple du on a reasonably big xfs tree leads to invoking the oom killer: 
> 
> How big is the directory tree (how many inodes, etc)?

It's a 1.1 TB system backup tree, let's say: many..
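(The inode count Dave asked about can be read cheaply from the filesystem itself instead of walking the tree; the mount point below is a placeholder for the backup tree:)

```shell
# The IUsed column is the number of allocated inodes on the filesystem.
# "/" stands in for the actual backup tree's mount point.
df -i /
```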

> > Apr  4 23:26:02 tyrex kernel: [  488.161105] lowmem_reserve[]: 0 0 0 0
> > Apr  4 23:26:02 tyrex kernel: [  488.161107] DMA: 18*4kB 53*8kB 31*16kB 20*32kB 14*64kB 8*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3552kB
> > Apr  4 23:26:02 tyrex kernel: [  488.161112] Normal: 32*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3704kB
> > Apr  4 23:26:02 tyrex kernel: [  488.161117] HighMem: 17*4kB 29*8kB 47*16kB 16*32kB 6*64kB 30*128kB 53*256kB 27*512kB 14*1024kB 7*2048kB 377*4096kB = 1606044kB
> > Apr  4 23:26:02 tyrex kernel: [  488.161122] 29947 total pagecache pages
> > Apr  4 23:26:02 tyrex kernel: [  488.161123] 0 pages in swap cache
> > Apr  4 23:26:02 tyrex kernel: [  488.161124] Swap cache stats: add 0, delete 0, find 0/0
> > Apr  4 23:26:02 tyrex kernel: [  488.161125] Free swap  = 2104476kB
> > Apr  4 23:26:02 tyrex kernel: [  488.161126] Total swap = 2104476kB
> > Apr  4 23:26:02 tyrex kernel: [  488.165523] 784224 pages RAM
> > Apr  4 23:26:02 tyrex kernel: [  488.165524] 556914 pages HighMem
> > Apr  4 23:26:02 tyrex kernel: [  488.165525] 12060 pages reserved
> > Apr  4 23:26:02 tyrex kernel: [  488.165526] 82604 pages shared
> > Apr  4 23:26:02 tyrex kernel: [  488.165527] 328045 pages non-shared
> > Apr  4 23:26:02 tyrex kernel: [  488.165529] Out of memory: kill process 4788 (mysqld-max) score 326208 or a child
> > Apr  4 23:26:02 tyrex kernel: [  488.165531] Killed process 4788 (mysqld-max) vsz:1304832kB, anon-rss:121428kB, file-rss:4336kB
> > [...]
> 
> Oh, this is a highmem box. You ran out of low memory, I think, which
> is where all the inodes are cached. Seems like a VM problem or a
> highmem/lowmem split config problem to me, not anything to do with
> XFS...

Might be, though I don't have a chance to test this on a different FS. Thanks
for the answer anyway, Dave. I hope you don't mind that I keep you
copied on this thread.

The trouble is, I cannot locate the problem in the syslog output. Might
be a "can't see the forest for the trees" syndrome.

Today I repeated the test with 2.6.34-rc3 as a PAE build with openSUSE
patches applied, and with vm.swappiness, vm.dirty_ratio and
vm.dirty_background_ratio reset to the kernel defaults.

It behaves exactly the same, so it looks like a generic problem. Again a
du -sh on the huge tree; this time gkrellmd triggered the oom killer,
while the du process kept going.

Apr  5 13:09:20 tyrex kernel: [ 1747.524375] XFS mounting filesystem sdd1
Apr  5 13:09:21 tyrex kernel: [ 1747.942048] Ending clean XFS mount for filesystem: sdd1
Apr  5 13:10:27 tyrex kernel: [ 1814.288944] oom_kill_process: 3 callbacks suppressed
Apr  5 13:10:27 tyrex kernel: [ 1814.288946] gkrellmd invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
Apr  5 13:10:27 tyrex kernel: [ 1814.288948] gkrellmd cpuset=/ mems_allowed=0
Apr  5 13:10:27 tyrex kernel: [ 1814.288950] Pid: 4019, comm: gkrellmd Not tainted 2.6.34-rc3-13-pae #1
Apr  5 13:10:27 tyrex kernel: [ 1814.288951] Call Trace:
Apr  5 13:10:27 tyrex kernel: [ 1814.288959]  [<c0206181>] try_stack_unwind+0x1b1/0x200
Apr  5 13:10:27 tyrex kernel: [ 1814.288962]  [<c020507f>] dump_trace+0x3f/0xe0
Apr  5 13:10:27 tyrex kernel: [ 1814.288965]  [<c0205cfb>] show_trace_log_lvl+0x4b/0x60
Apr  5 13:10:27 tyrex kernel: [ 1814.288967]  [<c0205d28>] show_trace+0x18/0x20
Apr  5 13:10:27 tyrex kernel: [ 1814.288970]  [<c05ec570>] dump_stack+0x6d/0x7d
Apr  5 13:10:27 tyrex kernel: [ 1814.288974]  [<c02c758a>] dump_header+0x6a/0x1b0
Apr  5 13:10:27 tyrex kernel: [ 1814.288976]  [<c02c772a>] oom_kill_process+0x5a/0x160
Apr  5 13:10:27 tyrex kernel: [ 1814.288979]  [<c02c7bc6>] __out_of_memory+0x56/0xc0
Apr  5 13:10:27 tyrex kernel: [ 1814.288981]  [<c02c7ca7>] out_of_memory+0x77/0x1b0
Apr  5 13:10:27 tyrex kernel: [ 1814.288984]  [<c02cb07c>] __alloc_pages_slowpath+0x47c/0x4c0
Apr  5 13:10:27 tyrex kernel: [ 1814.288986]  [<c02cb214>] __alloc_pages_nodemask+0x154/0x160
Apr  5 13:10:27 tyrex kernel: [ 1814.288989]  [<c02f67a5>] alloc_pages_current+0x65/0xc0
Apr  5 13:10:27 tyrex kernel: [ 1814.288993]  [<c055ca73>] tcp_sendmsg+0x893/0x990
Apr  5 13:10:27 tyrex kernel: [ 1814.288997]  [<c05175f6>] sock_sendmsg+0xc6/0x100
Apr  5 13:10:27 tyrex kernel: [ 1814.289000]  [<c0517794>] sys_sendto+0x114/0x160
Apr  5 13:10:27 tyrex kernel: [ 1814.289002]  [<c0517817>] sys_send+0x37/0x40
Apr  5 13:10:27 tyrex kernel: [ 1814.289005]  [<c05187de>] sys_socketcall+0x22e/0x2f0
Apr  5 13:10:27 tyrex kernel: [ 1814.289007]  [<c0202e4c>] sysenter_do_call+0x12/0x22
Apr  5 13:10:27 tyrex kernel: [ 1814.289022]  [<ffffe424>] 0xffffe424
Apr  5 13:10:27 tyrex kernel: [ 1814.289023] Mem-Info:
Apr  5 13:10:27 tyrex kernel: [ 1814.289024] Node 0 DMA per-cpu:
Apr  5 13:10:27 tyrex kernel: [ 1814.289025] CPU    0: hi:    0, btch:   1 usd:   0
Apr  5 13:10:27 tyrex kernel: [ 1814.289026] CPU    1: hi:    0, btch:   1 usd:   0
Apr  5 13:10:27 tyrex kernel: [ 1814.289027] CPU    2: hi:    0, btch:   1 usd:   0
Apr  5 13:10:27 tyrex kernel: [ 1814.289028] CPU    3: hi:    0, btch:   1 usd:   0
Apr  5 13:10:27 tyrex kernel: [ 1814.289030] CPU    4: hi:    0, btch:   1 usd:   0
Apr  5 13:10:27 tyrex kernel: [ 1814.289031] CPU    5: hi:    0, btch:   1 usd:   0
Apr  5 13:10:27 tyrex kernel: [ 1814.289032] CPU    6: hi:    0, btch:   1 usd:   0
Apr  5 13:10:27 tyrex kernel: [ 1814.289033] CPU    7: hi:    0, btch:   1 usd:   0
Apr  5 13:10:27 tyrex kernel: [ 1814.289033] Node 0 Normal per-cpu:
Apr  5 13:10:27 tyrex kernel: [ 1814.289035] CPU    0: hi:  186, btch:  31 usd: 183
Apr  5 13:10:27 tyrex kernel: [ 1814.289036] CPU    1: hi:  186, btch:  31 usd: 126
Apr  5 13:10:27 tyrex kernel: [ 1814.289037] CPU    2: hi:  186, btch:  31 usd:  50
Apr  5 13:10:27 tyrex kernel: [ 1814.289038] CPU    3: hi:  186, btch:  31 usd: 178
Apr  5 13:10:27 tyrex kernel: [ 1814.289039] CPU    4: hi:  186, btch:  31 usd: 158
Apr  5 13:10:27 tyrex kernel: [ 1814.289040] CPU    5: hi:  186, btch:  31 usd: 181
Apr  5 13:10:27 tyrex kernel: [ 1814.289041] CPU    6: hi:  186, btch:  31 usd:  32
Apr  5 13:10:27 tyrex kernel: [ 1814.289042] CPU    7: hi:  186, btch:  31 usd: 164
Apr  5 13:10:27 tyrex kernel: [ 1814.289043] Node 0 HighMem per-cpu:
Apr  5 13:10:27 tyrex kernel: [ 1814.289044] CPU    0: hi:  186, btch:  31 usd: 153
Apr  5 13:10:27 tyrex kernel: [ 1814.289045] CPU    1: hi:  186, btch:  31 usd: 148
Apr  5 13:10:27 tyrex kernel: [ 1814.289046] CPU    2: hi:  186, btch:  31 usd: 119
Apr  5 13:10:27 tyrex kernel: [ 1814.289047] CPU    3: hi:  186, btch:  31 usd: 173
Apr  5 13:10:27 tyrex kernel: [ 1814.289048] CPU    4: hi:  186, btch:  31 usd:  46
Apr  5 13:10:27 tyrex kernel: [ 1814.289049] CPU    5: hi:  186, btch:  31 usd: 167
Apr  5 13:10:27 tyrex kernel: [ 1814.289050] CPU    6: hi:  186, btch:  31 usd:  57
Apr  5 13:10:27 tyrex kernel: [ 1814.289051] CPU    7: hi:  186, btch:  31 usd:  52
Apr  5 13:10:27 tyrex kernel: [ 1814.289054] active_anon:145037 inactive_anon:14560 isolated_anon:0
Apr  5 13:10:27 tyrex kernel: [ 1814.289055]  active_file:4510 inactive_file:30778 isolated_file:0
Apr  5 13:10:27 tyrex kernel: [ 1814.289055]  unevictable:0 dirty:3 writeback:0 unstable:0
Apr  5 13:10:27 tyrex kernel: [ 1814.289056]  free:2694178 slab_reclaimable:183482 slab_unreclaimable:4574
Apr  5 13:10:27 tyrex kernel: [ 1814.289057]  mapped:9362 shmem:137 pagetables:994 bounce:0
Apr  5 13:10:27 tyrex kernel: [ 1814.289058] Node 0 DMA free:3488kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:20kB inactive_file:36
kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15804kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4416kB slab_unre
claimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr  5 13:10:27 tyrex kernel: [ 1814.289064] lowmem_reserve[]: 0 859 12054 12054
Apr  5 13:10:27 tyrex kernel: [ 1814.289066] Node 0 Normal free:3644kB min:3716kB low:4644kB high:5572kB active_anon:0kB inactive_anon:0kB active_file:848kB inacti
ve_file:928kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:879848kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:7295
12kB slab_unreclaimable:18280kB kernel_stack:3320kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Apr  5 13:10:27 tyrex kernel: [ 1814.289072] lowmem_reserve[]: 0 0 89561 89561
Apr  5 13:10:27 tyrex kernel: [ 1814.289074] Node 0 HighMem free:10769580kB min:512kB low:12620kB high:24728kB active_anon:580148kB inactive_anon:58240kB active_fi
le:17172kB inactive_file:122148kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:11463844kB mlocked:0kB dirty:12kB writeback:0kB mapped:37448kB shme
m:548kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:3976kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:96 all_unreclaimable
? no
Apr  5 13:10:27 tyrex kernel: [ 1814.289080] lowmem_reserve[]: 0 0 0 0
Apr  5 13:10:27 tyrex kernel: [ 1814.289081] Node 0 DMA: 0*4kB 2*8kB 1*16kB 0*32kB 0*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3488kB
Apr  5 13:10:27 tyrex kernel: [ 1814.289087] Node 0 Normal: 430*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3768kB
Apr  5 13:10:27 tyrex kernel: [ 1814.289092] Node 0 HighMem: 213*4kB 619*8kB 374*16kB 15*32kB 357*64kB 413*128kB 277*256kB 80*512kB 46*1024kB 22*2048kB 2558*4096kB
 = 10769580kB
Apr  5 13:10:27 tyrex kernel: [ 1814.289097] 35438 total pagecache pages
Apr  5 13:10:27 tyrex kernel: [ 1814.289098] 0 pages in swap cache
Apr  5 13:10:27 tyrex kernel: [ 1814.289099] Swap cache stats: add 0, delete 0, find 0/0
Apr  5 13:10:27 tyrex kernel: [ 1814.289100] Free swap  = 2104476kB
Apr  5 13:10:27 tyrex kernel: [ 1814.289101] Total swap = 2104476kB
Apr  5 13:10:27 tyrex kernel: [ 1814.317335] 3380720 pages RAM
Apr  5 13:10:27 tyrex kernel: [ 1814.317337] 3154946 pages HighMem
Apr  5 13:10:27 tyrex kernel: [ 1814.317338] 293990 pages reserved
Apr  5 13:10:27 tyrex kernel: [ 1814.317338] 65372 pages shared
Apr  5 13:10:27 tyrex kernel: [ 1814.317339] 359255 pages non-shared

 0  0      0 10942040      4 776696    0    0 17512     4 2897 5178  1  7 92  0
 0  0      0 10878436      4 833952    0    0 13396     0 2330 4329  1  5 94  0
 0  0      0 10834376      4 872328    0    0 15072    32 2703 5578  0  7 92  0
 2  0      0 10834000      4 874896    0    0 11228     0 2331 5601  0  4 95  0
 0  0      0 10818628      4 875192    0    0  3172    10  773 1588  0  2 98  0
 1  0      0 10819240      4 874452    0    0  5428    69 1395 3091  0  4 96  0
 0  0      0 10818608      4 873360    0    0 17728     0 3056 10587  1 12 82  5
 1  0      0 10831032      4 873860    0    0  7732     0 1954 5411  0  4 93  3
 0  0      0 10825456      4 874208    0    0  2884     0  844 1907  1  2 98  0
 0  0      0 10831656      4 873720    0    0 11968     0 2228 6897  0  5 95  0
 0  0      0 10830760      4 873744    0    0  6424     8 1549 4021  0  3 94  3
 2  0      0 10829524      4 873944    0    0 10160    81 2129 5325  0  4 96  0
 0  0      0 10798644      4 874248    0    0  8112    16 2161 4348  1  3 96  0
 0  0      0 10781788      4 874688    0    0  5340     0 1823 2924  1  2 97  0
 0  2      0 10776580      4 875980    0    0  5964     0 1754 2722  1  4 93  1
 0  1      0 10776848      4 875440    0    0  5152     8 1543 2313  0  3 87 11
 1  1      0 10776704      4 875596    0    0  2416     0 1011 1632  0  5 79 16
 3  3      0 10776712      4 875676    0    0     0     4  749  867  0 11 75 14
 2  1      0 10879116      4 867544    0    0  7608  2138 3043 154270  7 13 77  3
 2  1      0 10874356      4 870636    0    0  2476     0 1204 711732  0 14 79  7
 3  1      0 10874248      4 870668    0    0     4     0 1078 1066604  0 14 71 14
 2  1      0 10874040      4 870668    0    0     0     0 1055 1067483  0 11 79 11
 2  1      0 10873792      4 870660    0    0   104     0 1137 1063509  0 13 73 13
 2  1      0 10873608      4 870444    0    0   324     0 1335 1055994  0 12 76 12
 3  0      0 11323756      4 421084    0    0  1768     0 1709 424405  0 16 76  7
 1  0      0 11558484      4 186672    0    0 10936   121 2212 3308  1  6 93  0
 1  0      0 11501320      4 243908    0    0 13020     0 2281 4205  0  6 93  0
 1  0      0 11413776      4 330896    0    0 19872     0 3492 6304  1  6 93  0
 0  0      0 11328588      4 415920    0    0 19568     0 3482 6139  1  8 91  0
 1  0      0 11299456      4 475428    0    0 15056     0 3054 5242  1  5 94  0

The oom killer was invoked around the lines with > 1000000 cs (context
switches) in the vmstat output above.

It's hard to believe that a current kernel on a current system with 12 GB,
even one using the insane PAE mode on i586, is not able to cope with a du
on a 1.1 TB file tree. Since du is invokable by ordinary users, this
creates a pretty ugly DoS vector for local users.
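A possible stop-gap while the VM balance problem is tracked down is to make
the kernel reclaim inode/dentry caches more aggressively (illustrative
values, not tuned advice; the writes need root):

```shell
# Current cache-pressure setting (kernel default is 100):
cat /proc/sys/vm/vfs_cache_pressure
# Reclaimable slab (mostly inode/dentry caches) currently held:
grep SReclaimable /proc/meminfo
# Possible stop-gaps, root required:
#   echo 200 > /proc/sys/vm/vfs_cache_pressure   # reclaim caches harder
#   echo 2   > /proc/sys/vm/drop_caches          # one-shot slab reclaim
```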

Updated boot log: ftp://urpla.net/boot-pae.msg

Pete

^ permalink raw reply	[flat|nested] 48+ messages in thread

Apr  5 13:10:27 tyrex kernel: [ 1814.289041] CPU    6: hi:  186, btch:  31 usd:  32
Apr  5 13:10:27 tyrex kernel: [ 1814.289042] CPU    7: hi:  186, btch:  31 usd: 164
Apr  5 13:10:27 tyrex kernel: [ 1814.289043] Node 0 HighMem per-cpu:
Apr  5 13:10:27 tyrex kernel: [ 1814.289044] CPU    0: hi:  186, btch:  31 usd: 153
Apr  5 13:10:27 tyrex kernel: [ 1814.289045] CPU    1: hi:  186, btch:  31 usd: 148
Apr  5 13:10:27 tyrex kernel: [ 1814.289046] CPU    2: hi:  186, btch:  31 usd: 119
Apr  5 13:10:27 tyrex kernel: [ 1814.289047] CPU    3: hi:  186, btch:  31 usd: 173
Apr  5 13:10:27 tyrex kernel: [ 1814.289048] CPU    4: hi:  186, btch:  31 usd:  46
Apr  5 13:10:27 tyrex kernel: [ 1814.289049] CPU    5: hi:  186, btch:  31 usd: 167
Apr  5 13:10:27 tyrex kernel: [ 1814.289050] CPU    6: hi:  186, btch:  31 usd:  57
Apr  5 13:10:27 tyrex kernel: [ 1814.289051] CPU    7: hi:  186, btch:  31 usd:  52
Apr  5 13:10:27 tyrex kernel: [ 1814.289054] active_anon:145037 inactive_anon:14560 isolated_anon:0
Apr  5 13:10:27 tyrex kernel: [ 1814.289055]  active_file:4510 inactive_file:30778 isolated_file:0
Apr  5 13:10:27 tyrex kernel: [ 1814.289055]  unevictable:0 dirty:3 writeback:0 unstable:0
Apr  5 13:10:27 tyrex kernel: [ 1814.289056]  free:2694178 slab_reclaimable:183482 slab_unreclaimable:4574
Apr  5 13:10:27 tyrex kernel: [ 1814.289057]  mapped:9362 shmem:137 pagetables:994 bounce:0
Apr  5 13:10:27 tyrex kernel: [ 1814.289058] Node 0 DMA free:3488kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:20kB inactive_file:36
kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15804kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4416kB slab_unre
claimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr  5 13:10:27 tyrex kernel: [ 1814.289064] lowmem_reserve[]: 0 859 12054 12054
Apr  5 13:10:27 tyrex kernel: [ 1814.289066] Node 0 Normal free:3644kB min:3716kB low:4644kB high:5572kB active_anon:0kB inactive_anon:0kB active_file:848kB inacti
ve_file:928kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:879848kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:7295
12kB slab_unreclaimable:18280kB kernel_stack:3320kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Apr  5 13:10:27 tyrex kernel: [ 1814.289072] lowmem_reserve[]: 0 0 89561 89561
Apr  5 13:10:27 tyrex kernel: [ 1814.289074] Node 0 HighMem free:10769580kB min:512kB low:12620kB high:24728kB active_anon:580148kB inactive_anon:58240kB active_fi
le:17172kB inactive_file:122148kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:11463844kB mlocked:0kB dirty:12kB writeback:0kB mapped:37448kB shme
m:548kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:3976kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:96 all_unreclaimable
? no
Apr  5 13:10:27 tyrex kernel: [ 1814.289080] lowmem_reserve[]: 0 0 0 0
Apr  5 13:10:27 tyrex kernel: [ 1814.289081] Node 0 DMA: 0*4kB 2*8kB 1*16kB 0*32kB 0*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3488kB
Apr  5 13:10:27 tyrex kernel: [ 1814.289087] Node 0 Normal: 430*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3768kB
Apr  5 13:10:27 tyrex kernel: [ 1814.289092] Node 0 HighMem: 213*4kB 619*8kB 374*16kB 15*32kB 357*64kB 413*128kB 277*256kB 80*512kB 46*1024kB 22*2048kB 2558*4096kB
 = 10769580kB
Apr  5 13:10:27 tyrex kernel: [ 1814.289097] 35438 total pagecache pages
Apr  5 13:10:27 tyrex kernel: [ 1814.289098] 0 pages in swap cache
Apr  5 13:10:27 tyrex kernel: [ 1814.289099] Swap cache stats: add 0, delete 0, find 0/0
Apr  5 13:10:27 tyrex kernel: [ 1814.289100] Free swap  = 2104476kB
Apr  5 13:10:27 tyrex kernel: [ 1814.289101] Total swap = 2104476kB
Apr  5 13:10:27 tyrex kernel: [ 1814.317335] 3380720 pages RAM
Apr  5 13:10:27 tyrex kernel: [ 1814.317337] 3154946 pages HighMem
Apr  5 13:10:27 tyrex kernel: [ 1814.317338] 293990 pages reserved
Apr  5 13:10:27 tyrex kernel: [ 1814.317338] 65372 pages shared
Apr  5 13:10:27 tyrex kernel: [ 1814.317339] 359255 pages non-shared

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 10942040      4 776696    0    0 17512     4 2897 5178  1  7 92  0
 0  0      0 10878436      4 833952    0    0 13396     0 2330 4329  1  5 94  0
 0  0      0 10834376      4 872328    0    0 15072    32 2703 5578  0  7 92  0
 2  0      0 10834000      4 874896    0    0 11228     0 2331 5601  0  4 95  0
 0  0      0 10818628      4 875192    0    0  3172    10  773 1588  0  2 98  0
 1  0      0 10819240      4 874452    0    0  5428    69 1395 3091  0  4 96  0
 0  0      0 10818608      4 873360    0    0 17728     0 3056 10587  1 12 82  5
 1  0      0 10831032      4 873860    0    0  7732     0 1954 5411  0  4 93  3
 0  0      0 10825456      4 874208    0    0  2884     0  844 1907  1  2 98  0
 0  0      0 10831656      4 873720    0    0 11968     0 2228 6897  0  5 95  0
 0  0      0 10830760      4 873744    0    0  6424     8 1549 4021  0  3 94  3
 2  0      0 10829524      4 873944    0    0 10160    81 2129 5325  0  4 96  0
 0  0      0 10798644      4 874248    0    0  8112    16 2161 4348  1  3 96  0
 0  0      0 10781788      4 874688    0    0  5340     0 1823 2924  1  2 97  0
 0  2      0 10776580      4 875980    0    0  5964     0 1754 2722  1  4 93  1
 0  1      0 10776848      4 875440    0    0  5152     8 1543 2313  0  3 87 11
 1  1      0 10776704      4 875596    0    0  2416     0 1011 1632  0  5 79 16
 3  3      0 10776712      4 875676    0    0     0     4  749  867  0 11 75 14
 2  1      0 10879116      4 867544    0    0  7608  2138 3043 154270  7 13 77  3
 2  1      0 10874356      4 870636    0    0  2476     0 1204 711732  0 14 79  7
 3  1      0 10874248      4 870668    0    0     4     0 1078 1066604  0 14 71 14
 2  1      0 10874040      4 870668    0    0     0     0 1055 1067483  0 11 79 11
 2  1      0 10873792      4 870660    0    0   104     0 1137 1063509  0 13 73 13
 2  1      0 10873608      4 870444    0    0   324     0 1335 1055994  0 12 76 12
 3  0      0 11323756      4 421084    0    0  1768     0 1709 424405  0 16 76  7
 1  0      0 11558484      4 186672    0    0 10936   121 2212 3308  1  6 93  0
 1  0      0 11501320      4 243908    0    0 13020     0 2281 4205  0  6 93  0
 1  0      0 11413776      4 330896    0    0 19872     0 3492 6304  1  6 93  0
 0  0      0 11328588      4 415920    0    0 19568     0 3482 6139  1  8 91  0
 1  0      0 11299456      4 475428    0    0 15056     0 3054 5242  1  5 94  0

The oom killer was invoked around the vmstat lines showing > 1,000,000 context switches (cs column).

It's hard to believe that a current kernel on a current system with 12 GB,
even when using the insane PAE on i586, is not able to cope with a du on a
1.1 TB file tree. Since du is invokable by ordinary users, this amounts to a
pretty ugly DoS vector for local users.

Updated boot log: ftp://urpla.net/boot-pae.msg

Pete

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer
  2010-04-05 11:35     ` Hans-Peter Jansen
@ 2010-04-05 23:06       ` Dave Chinner
  -1 siblings, 0 replies; 48+ messages in thread
From: Dave Chinner @ 2010-04-05 23:06 UTC (permalink / raw)
  To: Hans-Peter Jansen; +Cc: linux-kernel, opensuse-kernel, xfs

On Mon, Apr 05, 2010 at 01:35:41PM +0200, Hans-Peter Jansen wrote:
> On Monday 05 April 2010, 02:49:06 Dave Chinner wrote:
> > On Mon, Apr 05, 2010 at 12:49:17AM +0200, Hans-Peter Jansen wrote:
> > > [Sorry for the cross post, but I don't know where to start to tackle this 
> > >  issue]
> > > 
> > > Hi,
> > > 
> > > on an attempt to get to a current kernel, I suffer from an issue, where a 
> > > simple du on a reasonably big xfs tree leads to invoking the oom killer: 
> > 
> > How big is the directory tree (how many inodes, etc)?
> 
> It's 1.1 TB system backup tree, let's say: many..

1.1TB isn't big anymore. ;)

> > > Apr  4 23:26:02 tyrex kernel: [  488.161105] lowmem_reserve[]: 0 0 0 0
> > > Apr  4 23:26:02 tyrex kernel: [  488.161107] DMA: 18*4kB 53*8kB 31*16kB 20*32kB 14*64kB 8*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3552kB
> > > Apr  4 23:26:02 tyrex kernel: [  488.161112] Normal: 32*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3704kB
> > > Apr  4 23:26:02 tyrex kernel: [  488.161117] HighMem: 17*4kB 29*8kB 47*16kB 16*32kB 6*64kB 30*128kB 53*256kB 27*512kB 14*1024kB 7*2048kB 377*4096kB = 1606044kB
> > > Apr  4 23:26:02 tyrex kernel: [  488.161122] 29947 total pagecache pages
> > > Apr  4 23:26:02 tyrex kernel: [  488.161123] 0 pages in swap cache
> > > Apr  4 23:26:02 tyrex kernel: [  488.161124] Swap cache stats: add 0, delete 0, find 0/0
> > > Apr  4 23:26:02 tyrex kernel: [  488.161125] Free swap  = 2104476kB
> > > Apr  4 23:26:02 tyrex kernel: [  488.161126] Total swap = 2104476kB
> > > Apr  4 23:26:02 tyrex kernel: [  488.165523] 784224 pages RAM
> > > Apr  4 23:26:02 tyrex kernel: [  488.165524] 556914 pages HighMem
> > > Apr  4 23:26:02 tyrex kernel: [  488.165525] 12060 pages reserved
> > > Apr  4 23:26:02 tyrex kernel: [  488.165526] 82604 pages shared
> > > Apr  4 23:26:02 tyrex kernel: [  488.165527] 328045 pages non-shared
> > > Apr  4 23:26:02 tyrex kernel: [  488.165529] Out of memory: kill process 4788 (mysqld-max) score 326208 or a child
> > > Apr  4 23:26:02 tyrex kernel: [  488.165531] Killed process 4788 (mysqld-max) vsz:1304832kB, anon-rss:121428kB, file-rss:4336kB
> > > [...]
> > 
> > Oh, this is a highmem box. You ran out of low memory, I think, which
> > is where all the inodes are cached. Seems like a VM problem or a
> > highmem/lowmem split config problem to me, not anything to do with
> > XFS...
> 
> Might be, I don't have a chance to test this on a different FS. Thanks
> for the answer anyway, Dave. I hope, you don't mind, that I keep you 
> copied on this thread.. 
> 
> This matter is, I cannot locate the problem from the syslog output. Might
> be a "can't see the forest because all the trees" syndrome.

Well, I have to ask why you are running a 32bit PAE kernel when your
CPU is:

<6>[    0.085062] CPU0: Intel(R) Xeon(R) CPU           X3460  @ 2.80GHz stepping 05

64bit capable.  Use a 64 bit kernel and this problem should go away.

> It's hard to believe, that a current kernel on a current system with 12 GB, 
> even if using the insane pae on i586 is not able to cope with an du on a 
> 1.1 TB file tree. Since du is invokable by users, this creates a pretty 
> ugly DOS attack for local users.

Agreed. And FWIW, don't let your filesystems get near ENOSPC on
2.6.34-rc, either....

(i.e. under sustained write load, 2.6.34-rc will hit the OOM killer
on page cache allocation before the filesystem can report ENOSPC to
the user application.  Test 224 in the xfsqa suite on a VM w/ 1GB
RAM will trigger this with > 90% reliability....)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]
  2010-04-05 23:06       ` Dave Chinner
@ 2010-04-06 14:52         ` Hans-Peter Jansen
  -1 siblings, 0 replies; 48+ messages in thread
From: Hans-Peter Jansen @ 2010-04-06 14:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, opensuse-kernel, xfs

Hi Dave,

On Tuesday 06 April 2010, 01:06:00 Dave Chinner wrote:
> On Mon, Apr 05, 2010 at 01:35:41PM +0200, Hans-Peter Jansen wrote:
> > >
> > > Oh, this is a highmem box. You ran out of low memory, I think, which
> > > is where all the inodes are cached. Seems like a VM problem or a
> > > highmem/lowmem split config problem to me, not anything to do with
> > > XFS...

With all due respect, I disagree. See below.

> > Might be, I don't have a chance to test this on a different FS. Thanks
> > for the answer anyway, Dave. I hope, you don't mind, that I keep you
> > copied on this thread..
> >
> > This matter is, I cannot locate the problem from the syslog output.
> > Might be a "can't see the forest because all the trees" syndrome.
>
> Well, I have to ask why you are running a 32bit PAE kernel when your
> CPU is:
>
> <6>[    0.085062] CPU0: Intel(R) Xeon(R) CPU           X3460  @ 2.80GHz
> stepping 05
>
> 64bit capable.  Use a 64 bit kernel and this problem should go away.

Sure, but for compatibility with a customer setup that I'm fully
responsible for and that we strongly depend on, it is still i586. (And it's a
system that I have full access to only for a few hours on Sundays, which
punishes my family...)

Dave, I really don't want to disappoint you, but a lengthy bisection session 
points to:

57817c68229984818fea9e614d6f95249c3fb098 is the first bad commit
commit 57817c68229984818fea9e614d6f95249c3fb098
Author: Dave Chinner <david@fromorbit.com>
Date:   Sun Jan 10 23:51:47 2010 +0000

    xfs: reclaim all inodes by background tree walks
    
    We cannot do direct inode reclaim without taking the flush lock to
    ensure that we do not reclaim an inode under IO. We check the inode
    is clean before doing direct reclaim, but this is not good enough
    because the inode flush code marks the inode clean once it has
    copied the in-core dirty state to the backing buffer.
    
    It is the flush lock that determines whether the inode is still
    under IO, even though it is marked clean, and the inode is still
    required at IO completion so we can't reclaim it even though it is
    clean in core. Hence the requirement that we need to take the flush
    lock even on clean inodes because this guarantees that the inode
    writeback IO has completed and it is safe to reclaim the inode.
    
    With delayed write inode flushing, we could end up waiting a long
    time on the flush lock even for a clean inode. The background
    reclaim already handles this efficiently, so avoid all the problems
    by killing the direct reclaim path altogether.
    
    Signed-off-by: Dave Chinner <david@fromorbit.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Alex Elder <aelder@sgi.com>

:040000 040000 9cada5739037ecd59afb358cf5ed6186b82d5236 
8e6b6febccba69bc4cdbfd1886d545c369d64c41 M      fs

I will try to prove this by reverting this commit on a 2.6.33.2 build, but
that's going to take another day, or so.

> > It's hard to believe, that a current kernel on a current system with 12
> > GB, even if using the insane pae on i586 is not able to cope with an du
> > on a 1.1 TB file tree. Since du is invokable by users, this creates a
> > pretty ugly DOS attack for local users.
>
> Agreed. And FWIW, don't let your filesystems get near ENOSPC on
> 2.6.34-rc, either....
>
> (i.e. under sustained write load, 2.6.34-rc will hit the OOM killer
> on page cache allocation before the filesystem can report ENOSPC to
> the user application.  Test 224 in the xfsqa suite on a VM w/ 1GB
> RAM will trigger this with > 90% reliability....)

Hmm, thanks for the warning. I will stay on 2.6.33.2 for now on my servers
and keep an eye on the xfs commit logs...

Cheers && greetings to the orbit ;-),
Pete

For the sake of completeness, here's the revert:

---
commit dfe0d292280ad21c9cf3f240bb415913715d8980
Author: Hans-Peter Jansen <hpj@urpla.net>
Date:   Tue Apr 6 16:05:47 2010 +0200

    Revert "xfs: reclaim all inodes by background tree walks"
    
    This reverts commit 57817c68229984818fea9e614d6f95249c3fb098.
    
    Avoid triggering the oom killer with a simple du on a big xfs tree on i586.
    
    Signed-off-by: Hans-Peter Jansen <hpj@urpla.net>

:100644 100644 52e06b4... a76fc01... M	fs/xfs/linux-2.6/xfs_super.c

diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 52e06b4..a76fc01 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -954,14 +954,16 @@ xfs_fs_destroy_inode(
 	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
 
 	/*
-	 * We always use background reclaim here because even if the
-	 * inode is clean, it still may be under IO and hence we have
-	 * to take the flush lock. The background reclaim path handles
-	 * this more efficiently than we can here, so simply let background
-	 * reclaim tear down all inodes.
+	 * If we have nothing to flush with this inode then complete the
+	 * teardown now, otherwise delay the flush operation.
 	 */
+	if (!xfs_inode_clean(ip)) {
+		xfs_inode_set_reclaim_tag(ip);
+		return;
+	}
+
 out_reclaim:
-	xfs_inode_set_reclaim_tag(ip);
+	xfs_ireclaim(ip);
 }
 
 /*




* Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]
  2010-04-06 14:52         ` Hans-Peter Jansen
@ 2010-04-06 23:11           ` Dave Chinner
  -1 siblings, 0 replies; 48+ messages in thread
From: Dave Chinner @ 2010-04-06 23:11 UTC (permalink / raw)
  To: Hans-Peter Jansen; +Cc: linux-kernel, opensuse-kernel, xfs

On Tue, Apr 06, 2010 at 04:52:57PM +0200, Hans-Peter Jansen wrote:
> Hi Dave,
> 
> On Tuesday 06 April 2010, 01:06:00 Dave Chinner wrote:
> > On Mon, Apr 05, 2010 at 01:35:41PM +0200, Hans-Peter Jansen wrote:
> > > >
> > > > Oh, this is a highmem box. You ran out of low memory, I think, which
> > > > is where all the inodes are cached. Seems like a VM problem or a
> > > > highmem/lowmem split config problem to me, not anything to do with
> > > > XFS...

[snip]

> Dave, I really don't want to disappoint you, but a lengthy bisection session 
> points to:
> 
> 57817c68229984818fea9e614d6f95249c3fb098 is the first bad commit
> commit 57817c68229984818fea9e614d6f95249c3fb098
> Author: Dave Chinner <david@fromorbit.com>
> Date:   Sun Jan 10 23:51:47 2010 +0000
> 
>     xfs: reclaim all inodes by background tree walks

Interesting. I did a fair bit of low memory testing when I made that
change (admittedly none on a highmem i386 box), and since then I've
done lots of "millions of files" tree creates, traversals and destroys on
limited-memory machines without triggering problems when memory is
completely full of inodes.

Let me try to reproduce this on a small VM and I'll get back to you.

> diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
> index 52e06b4..a76fc01 100644
> --- a/fs/xfs/linux-2.6/xfs_super.c
> +++ b/fs/xfs/linux-2.6/xfs_super.c
> @@ -954,14 +954,16 @@ xfs_fs_destroy_inode(
>  	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
>  
>  	/*
> -	 * We always use background reclaim here because even if the
> -	 * inode is clean, it still may be under IO and hence we have
> -	 * to take the flush lock. The background reclaim path handles
> -	 * this more efficiently than we can here, so simply let background
> -	 * reclaim tear down all inodes.
> +	 * If we have nothing to flush with this inode then complete the
> +	 * teardown now, otherwise delay the flush operation.
>  	 */
> +	if (!xfs_inode_clean(ip)) {
> +		xfs_inode_set_reclaim_tag(ip);
> +		return;
> +	}
> +
>  out_reclaim:
> -	xfs_inode_set_reclaim_tag(ip);
> +	xfs_ireclaim(ip);
>  }

I don't think that will work as expected in all situations - the
inode clean check there is not completely valid as the XFS inode
locks aren't held, so it can race with other operations that need
to complete before reclaim is done. This was one of the reasons for
pushing reclaim into the background....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]
  2010-04-06 23:11           ` Dave Chinner
@ 2010-04-07  1:45             ` Dave Chinner
  -1 siblings, 0 replies; 48+ messages in thread
From: Dave Chinner @ 2010-04-07  1:45 UTC (permalink / raw)
  To: Hans-Peter Jansen; +Cc: linux-kernel, opensuse-kernel, xfs

[-- Attachment #1: Type: text/plain, Size: 3024 bytes --]

On Wed, Apr 07, 2010 at 09:11:44AM +1000, Dave Chinner wrote:
> On Tue, Apr 06, 2010 at 04:52:57PM +0200, Hans-Peter Jansen wrote:
> > Hi Dave,
> > 
> > On Tuesday 06 April 2010, 01:06:00 Dave Chinner wrote:
> > > On Mon, Apr 05, 2010 at 01:35:41PM +0200, Hans-Peter Jansen wrote:
> > > > >
> > > > > Oh, this is a highmem box. You ran out of low memory, I think, which
> > > > > is where all the inodes are cached. Seems like a VM problem or a
> > > > > highmem/lowmem split config problem to me, not anything to do with
> > > > > XFS...
> 
> [snip]
> 
> > Dave, I really don't want to disappoint you, but a lengthy bisection session 
> > points to:
> > 
> > 57817c68229984818fea9e614d6f95249c3fb098 is the first bad commit
> > commit 57817c68229984818fea9e614d6f95249c3fb098
> > Author: Dave Chinner <david@fromorbit.com>
> > Date:   Sun Jan 10 23:51:47 2010 +0000
> > 
> >     xfs: reclaim all inodes by background tree walks
> 
> Interesting. I did a fair bit of low memory testing when I made that
> change (admittedly none on a highmem i386 box), and since then I've
> done lots of "millions of files" tree creates, traversals and destroys on
> limited memory machines without triggering problems when memory is
> completely full of inodes.
> 
> Let me try to reproduce this on a small VM and I'll get back to you.

OK, if there is page cache pressure (e.g. creating small files or
grepping the resultant tree) or the machine has significant amounts
of memory (e.g. >= 4GB) then I can't reproduce this.

However, if the memory pressure is purely inode cache (creating zero
length files or read-only traversal), then the OOM killer kicks in a
while after the slab cache fills memory.  This doesn't need highmem;
I used an x86_64 kernel on a VM w/ 1GB RAM to reliably reproduce
this.  I'll add zero length file tests and traversals to my low
memory testing.

The best way to fix this, I think, is to trigger a shrinker callback
when memory is low to run the background inode reclaim. The problem
is that these inode caches and the reclaim state are per-filesystem,
not global state, and the current shrinker interface only works with
global state.

Hence there are two patches to this fix - the first adds a context
to the shrinker callout, and the second adds the XFS infrastructure
to track the number of reclaimable inodes per filesystem and
register/unregister shrinkers for each filesystem.

With these patches, my reproducible test case, which locked the
machine up with an OOM panic in a couple of minutes, has been running
for over half an hour. I have much more confidence in this change
with limited testing than in reverting the background inode
reclaim, as the revert re-introduces the unlocked inode clean
check and the races that came with it.

The patches below apply to the xfs-dev tree, which is currently at
34-rc1. If they don't apply, let me know and I'll redo them against
a vanilla kernel tree. Can you test them to see if the problem goes
away? If the problem is fixed, I'll push them for a proper review
cycle...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

[-- Attachment #2: shrinker-context --]
[-- Type: text/plain, Size: 10312 bytes --]

mm: add context argument to shrinker callback

From: Dave Chinner <dchinner@redhat.com>

The current shrinker implementation requires the registered callback
to have global state to work from. This makes it difficult to shrink
caches that are not global (e.g. per-filesystem caches). Add a
context argument to the shrinker callback so that it can easily be
used in such situations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 arch/x86/kvm/mmu.c              |    2 +-
 drivers/gpu/drm/i915/i915_gem.c |    2 +-
 fs/dcache.c                     |    2 +-
 fs/gfs2/glock.c                 |    2 +-
 fs/gfs2/quota.c                 |    2 +-
 fs/gfs2/quota.h                 |    2 +-
 fs/inode.c                      |    2 +-
 fs/mbcache.c                    |    5 +++--
 fs/nfs/dir.c                    |    2 +-
 fs/nfs/internal.h               |    2 +-
 fs/quota/dquot.c                |    2 +-
 fs/ubifs/shrinker.c             |    2 +-
 fs/ubifs/ubifs.h                |    2 +-
 fs/xfs/linux-2.6/xfs_buf.c      |    5 +++--
 fs/xfs/quota/xfs_qm.c           |    3 ++-
 include/linux/mm.h              |    3 ++-
 mm/vmscan.c                     |    9 ++++++---
 17 files changed, 28 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 741373e..8977b3e 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2914,7 +2914,7 @@ static void kvm_mmu_remove_one_alloc_mmu_page(struct kvm *kvm)
 	kvm_mmu_zap_page(kvm, page);
 }
 
-static int mmu_shrink(int nr_to_scan, gfp_t gfp_mask)
+static int mmu_shrink(int nr_to_scan, void *ctx, gfp_t gfp_mask)
 {
 	struct kvm *kvm;
 	struct kvm *kvm_freed = NULL;
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index fba37e9..295f993 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -5054,7 +5054,7 @@ void i915_gem_release(struct drm_device * dev, struct drm_file *file_priv)
 }
 
 static int
-i915_gem_shrink(int nr_to_scan, gfp_t gfp_mask)
+i915_gem_shrink(int nr_to_scan, void *ctx, gfp_t gfp_mask)
 {
 	drm_i915_private_t *dev_priv, *next_dev;
 	struct drm_i915_gem_object *obj_priv, *next_obj;
diff --git a/fs/dcache.c b/fs/dcache.c
index f1358e5..1545d37 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -897,7 +897,7 @@ EXPORT_SYMBOL(shrink_dcache_parent);
  *
  * In this case we return -1 to tell the caller that we baled.
  */
-static int shrink_dcache_memory(int nr, gfp_t gfp_mask)
+static int shrink_dcache_memory(int nr, void *ctx, gfp_t gfp_mask)
 {
 	if (nr) {
 		if (!(gfp_mask & __GFP_FS))
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 454d4b4..fc4ced7 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1345,7 +1345,7 @@ void gfs2_glock_complete(struct gfs2_glock *gl, int ret)
 }
 
 
-static int gfs2_shrink_glock_memory(int nr, gfp_t gfp_mask)
+static int gfs2_shrink_glock_memory(int nr, void *ctx, gfp_t gfp_mask)
 {
 	struct gfs2_glock *gl;
 	int may_demote;
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 6dbcbad..a5a1836 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -77,7 +77,7 @@ static LIST_HEAD(qd_lru_list);
 static atomic_t qd_lru_count = ATOMIC_INIT(0);
 static DEFINE_SPINLOCK(qd_lru_lock);
 
-int gfs2_shrink_qd_memory(int nr, gfp_t gfp_mask)
+int gfs2_shrink_qd_memory(int nr, void *ctx, gfp_t gfp_mask)
 {
 	struct gfs2_quota_data *qd;
 	struct gfs2_sbd *sdp;
diff --git a/fs/gfs2/quota.h b/fs/gfs2/quota.h
index 195f60c..1dd6aa2 100644
--- a/fs/gfs2/quota.h
+++ b/fs/gfs2/quota.h
@@ -51,7 +51,7 @@ static inline int gfs2_quota_lock_check(struct gfs2_inode *ip)
 	return ret;
 }
 
-extern int gfs2_shrink_qd_memory(int nr, gfp_t gfp_mask);
+extern int gfs2_shrink_qd_memory(int nr, void *ctx, gfp_t gfp_mask);
 extern const struct quotactl_ops gfs2_quotactl_ops;
 
 #endif /* __QUOTA_DOT_H__ */
diff --git a/fs/inode.c b/fs/inode.c
index 407bf39..8423324 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -514,7 +514,7 @@ static void prune_icache(int nr_to_scan)
  * This function is passed the number of inodes to scan, and it returns the
  * total number of remaining possibly-reclaimable inodes.
  */
-static int shrink_icache_memory(int nr, gfp_t gfp_mask)
+static int shrink_icache_memory(int nr, void *ctx, gfp_t gfp_mask)
 {
 	if (nr) {
 		/*
diff --git a/fs/mbcache.c b/fs/mbcache.c
index ec88ff3..2c4f9bf 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -115,7 +115,7 @@ mb_cache_indexes(struct mb_cache *cache)
  * What the mbcache registers as to get shrunk dynamically.
  */
 
-static int mb_cache_shrink_fn(int nr_to_scan, gfp_t gfp_mask);
+static int mb_cache_shrink_fn(int nr_to_scan, void *ctx, gfp_t gfp_mask);
 
 static struct shrinker mb_cache_shrinker = {
 	.shrink = mb_cache_shrink_fn,
@@ -192,12 +192,13 @@ forget:
  * gets low.
  *
  * @nr_to_scan: Number of objects to scan
+ * @ctx: (ignored)
  * @gfp_mask: (ignored)
  *
  * Returns the number of objects which are present in the cache.
  */
 static int
-mb_cache_shrink_fn(int nr_to_scan, gfp_t gfp_mask)
+mb_cache_shrink_fn(int nr_to_scan, void *ctx, gfp_t gfp_mask)
 {
 	LIST_HEAD(free_list);
 	struct list_head *l, *ltmp;
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index a1f6b44..b92b17d 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1667,7 +1667,7 @@ static void nfs_access_free_entry(struct nfs_access_entry *entry)
 	smp_mb__after_atomic_dec();
 }
 
-int nfs_access_cache_shrinker(int nr_to_scan, gfp_t gfp_mask)
+int nfs_access_cache_shrinker(int nr_to_scan, void *ctx, gfp_t gfp_mask)
 {
 	LIST_HEAD(head);
 	struct nfs_inode *nfsi;
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 11f82f0..34ad436 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -205,7 +205,7 @@ extern struct rpc_procinfo nfs4_procedures[];
 void nfs_close_context(struct nfs_open_context *ctx, int is_sync);
 
 /* dir.c */
-extern int nfs_access_cache_shrinker(int nr_to_scan, gfp_t gfp_mask);
+extern int nfs_access_cache_shrinker(int nr_to_scan, void *ctx, gfp_t gfp_mask);
 
 /* inode.c */
 extern struct workqueue_struct *nfsiod_workqueue;
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index e0b870f..225b961 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -670,7 +670,7 @@ static void prune_dqcache(int count)
  * more memory
  */
 
-static int shrink_dqcache_memory(int nr, gfp_t gfp_mask)
+static int shrink_dqcache_memory(int nr, void *ctx, gfp_t gfp_mask)
 {
 	if (nr) {
 		spin_lock(&dq_list_lock);
diff --git a/fs/ubifs/shrinker.c b/fs/ubifs/shrinker.c
index 02feb59..a62dfe4 100644
--- a/fs/ubifs/shrinker.c
+++ b/fs/ubifs/shrinker.c
@@ -277,7 +277,7 @@ static int kick_a_thread(void)
 	return 0;
 }
 
-int ubifs_shrinker(int nr, gfp_t gfp_mask)
+int ubifs_shrinker(int nr, void *ctx, gfp_t gfp_mask)
 {
 	int freed, contention = 0;
 	long clean_zn_cnt = atomic_long_read(&ubifs_clean_zn_cnt);
diff --git a/fs/ubifs/ubifs.h b/fs/ubifs/ubifs.h
index b2d9763..a1fdb9c 100644
--- a/fs/ubifs/ubifs.h
+++ b/fs/ubifs/ubifs.h
@@ -1574,7 +1574,7 @@ int ubifs_tnc_start_commit(struct ubifs_info *c, struct ubifs_zbranch *zroot);
 int ubifs_tnc_end_commit(struct ubifs_info *c);
 
 /* shrinker.c */
-int ubifs_shrinker(int nr_to_scan, gfp_t gfp_mask);
+int ubifs_shrinker(int nr_to_scan, void *ctx, gfp_t gfp_mask);
 
 /* commit.c */
 int ubifs_bg_thread(void *info);
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 74ccc70..4f5351f 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -44,7 +44,7 @@
 
 static kmem_zone_t *xfs_buf_zone;
 STATIC int xfsbufd(void *);
-STATIC int xfsbufd_wakeup(int, gfp_t);
+STATIC int xfsbufd_wakeup(int, void *, gfp_t);
 STATIC void xfs_buf_delwri_queue(xfs_buf_t *, int);
 static struct shrinker xfs_buf_shake = {
 	.shrink = xfsbufd_wakeup,
@@ -339,7 +339,7 @@ _xfs_buf_lookup_pages(
 					__func__, gfp_mask);
 
 			XFS_STATS_INC(xb_page_retries);
-			xfsbufd_wakeup(0, gfp_mask);
+			xfsbufd_wakeup(0, NULL, gfp_mask);
 			congestion_wait(BLK_RW_ASYNC, HZ/50);
 			goto retry;
 		}
@@ -1759,6 +1759,7 @@ xfs_buf_runall_queues(
 STATIC int
 xfsbufd_wakeup(
 	int			priority,
+	void			*ctx,
 	gfp_t			mask)
 {
 	xfs_buftarg_t		*btp;
diff --git a/fs/xfs/quota/xfs_qm.c b/fs/xfs/quota/xfs_qm.c
index 456b72e..40cf5c9 100644
--- a/fs/xfs/quota/xfs_qm.c
+++ b/fs/xfs/quota/xfs_qm.c
@@ -69,7 +69,7 @@ STATIC void	xfs_qm_list_destroy(xfs_dqlist_t *);
 
 STATIC int	xfs_qm_init_quotainos(xfs_mount_t *);
 STATIC int	xfs_qm_init_quotainfo(xfs_mount_t *);
-STATIC int	xfs_qm_shake(int, gfp_t);
+STATIC int	xfs_qm_shake(int, void *, gfp_t);
 
 static struct shrinker xfs_qm_shaker = {
 	.shrink = xfs_qm_shake,
@@ -2120,6 +2120,7 @@ xfs_qm_shake_freelist(
 STATIC int
 xfs_qm_shake(
 	int	nr_to_scan,
+	void	*ctx,
 	gfp_t	gfp_mask)
 {
 	int	ndqused, nfree, n;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3899395..b3fd387 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -989,7 +989,8 @@ void sync_mm_rss(struct task_struct *task, struct mm_struct *mm);
  * querying the cache size, so a fastpath for that case is appropriate.
  */
 struct shrinker {
-	int (*shrink)(int nr_to_scan, gfp_t gfp_mask);
+	int (*shrink)(int nr_to_scan, void *ctx, gfp_t gfp_mask);
+	void *ctx;	/* user callback context */
 	int seeks;	/* seeks to recreate an obj */
 
 	/* These are for internal use */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79c8098..f293372 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -215,8 +215,9 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
 	list_for_each_entry(shrinker, &shrinker_list, list) {
 		unsigned long long delta;
 		unsigned long total_scan;
-		unsigned long max_pass = (*shrinker->shrink)(0, gfp_mask);
+		unsigned long max_pass;
 
+		max_pass = (*shrinker->shrink)(0, shrinker->ctx, gfp_mask);
 		delta = (4 * scanned) / shrinker->seeks;
 		delta *= max_pass;
 		do_div(delta, lru_pages + 1);
@@ -244,8 +245,10 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
 			int shrink_ret;
 			int nr_before;
 
-			nr_before = (*shrinker->shrink)(0, gfp_mask);
-			shrink_ret = (*shrinker->shrink)(this_scan, gfp_mask);
+			nr_before = (*shrinker->shrink)(0,
+						shrinker->ctx, gfp_mask);
+			shrink_ret = (*shrinker->shrink)(this_scan,
+						shrinker->ctx, gfp_mask);
 			if (shrink_ret == -1)
 				break;
 			if (shrink_ret < nr_before)

[-- Attachment #3: xfs-inode-reclaim-shrinker --]
[-- Type: text/plain, Size: 6498 bytes --]

xfs: add a shrinker to background inode reclaim

From: Dave Chinner <dchinner@redhat.com>

On low memory boxes or those with highmem, the kernel can OOM before
background reclaim of inodes via xfssyncd kicks in. Add a shrinker
to run inode reclaim so that inode reclaim is expedited when memory
is low.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_super.c   |    3 ++
 fs/xfs/linux-2.6/xfs_sync.c    |   71 ++++++++++++++++++++++++++++++++++++----
 fs/xfs/linux-2.6/xfs_sync.h    |    5 ++-
 fs/xfs/quota/xfs_qm_syscalls.c |    3 +-
 fs/xfs/xfs_ag.h                |    1 +
 fs/xfs/xfs_mount.h             |    1 +
 6 files changed, 75 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 15bea67..aa11204 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1209,6 +1209,7 @@ xfs_fs_put_super(
 
 	xfs_unmountfs(mp);
 	xfs_freesb(mp);
+	xfs_inode_shrinker_unregister(mp);
 	xfs_icsb_destroy_counters(mp);
 	xfs_close_devices(mp);
 	xfs_dmops_put(mp);
@@ -1622,6 +1623,8 @@ xfs_fs_fill_super(
 	if (error)
 		goto fail_vnrele;
 
+	xfs_inode_shrinker_register(mp);
+
 	kfree(mtpt);
 	return 0;
 
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 5a4731e..a376a0d 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -95,7 +95,8 @@ xfs_inode_ag_walk(
 					   struct xfs_perag *pag, int flags),
 	int			flags,
 	int			tag,
-	int			exclusive)
+	int			exclusive,
+	int			*nr_to_scan)
 {
 	uint32_t		first_index;
 	int			last_error = 0;
@@ -134,7 +135,7 @@ restart:
 		if (error == EFSCORRUPTED)
 			break;
 
-	} while (1);
+	} while ((*nr_to_scan)--);
 
 	if (skipped) {
 		delay(1);
@@ -150,11 +151,16 @@ xfs_inode_ag_iterator(
 					   struct xfs_perag *pag, int flags),
 	int			flags,
 	int			tag,
-	int			exclusive)
+	int			exclusive,
+	int			nr_to_scan)
 {
 	int			error = 0;
 	int			last_error = 0;
 	xfs_agnumber_t		ag;
+	int			nr = nr_to_scan;
+
+	if (!nr)
+		nr = INT_MAX;
 
 	for (ag = 0; ag < mp->m_sb.sb_agcount; ag++) {
 		struct xfs_perag	*pag;
@@ -165,7 +171,7 @@ xfs_inode_ag_iterator(
 			continue;
 		}
 		error = xfs_inode_ag_walk(mp, pag, execute, flags, tag,
-						exclusive);
+						exclusive, &nr);
 		xfs_perag_put(pag);
 		if (error) {
 			last_error = error;
@@ -291,7 +297,7 @@ xfs_sync_data(
 	ASSERT((flags & ~(SYNC_TRYLOCK|SYNC_WAIT)) == 0);
 
 	error = xfs_inode_ag_iterator(mp, xfs_sync_inode_data, flags,
-				      XFS_ICI_NO_TAG, 0);
+				      XFS_ICI_NO_TAG, 0, 0);
 	if (error)
 		return XFS_ERROR(error);
 
@@ -310,7 +316,7 @@ xfs_sync_attr(
 	ASSERT((flags & ~SYNC_WAIT) == 0);
 
 	return xfs_inode_ag_iterator(mp, xfs_sync_inode_attr, flags,
-				     XFS_ICI_NO_TAG, 0);
+				     XFS_ICI_NO_TAG, 0, 0);
 }
 
 STATIC int
@@ -673,6 +679,7 @@ __xfs_inode_set_reclaim_tag(
 	radix_tree_tag_set(&pag->pag_ici_root,
 			   XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino),
 			   XFS_ICI_RECLAIM_TAG);
+	pag->pag_ici_reclaimable++;
 }
 
 /*
@@ -705,6 +712,7 @@ __xfs_inode_clear_reclaim_tag(
 {
 	radix_tree_tag_clear(&pag->pag_ici_root,
 			XFS_INO_TO_AGINO(mp, ip->i_ino), XFS_ICI_RECLAIM_TAG);
+	pag->pag_ici_reclaimable--;
 }
 
 /*
@@ -854,5 +862,54 @@ xfs_reclaim_inodes(
 	int		mode)
 {
 	return xfs_inode_ag_iterator(mp, xfs_reclaim_inode, mode,
-					XFS_ICI_RECLAIM_TAG, 1);
+					XFS_ICI_RECLAIM_TAG, 1, 0);
+}
+
+static int
+xfs_reclaim_inode_shrink(
+	int	nr_to_scan,
+	void	*ctx,
+	gfp_t	gfp_mask)
+{
+	struct xfs_mount *mp = ctx;
+	xfs_agnumber_t	ag;
+	int		reclaimable = 0;
+
+	if (nr_to_scan) {
+		if (!(gfp_mask & __GFP_FS))
+			return -1;
+
+		xfs_inode_ag_iterator(mp, xfs_reclaim_inode, 0,
+					XFS_ICI_RECLAIM_TAG, 1, nr_to_scan);
+	}
+
+	for (ag = 0; ag < mp->m_sb.sb_agcount; ag++) {
+		struct xfs_perag	*pag;
+
+		pag = xfs_perag_get(mp, ag);
+		if (!pag->pag_ici_init) {
+			xfs_perag_put(pag);
+			continue;
+		}
+		reclaimable += pag->pag_ici_reclaimable;
+		xfs_perag_put(pag);
+	}
+	return reclaimable;
+}
+
+void
+xfs_inode_shrinker_register(
+	struct xfs_mount	*mp)
+{
+	mp->m_inode_shrink.shrink = xfs_reclaim_inode_shrink;
+	mp->m_inode_shrink.ctx = mp;
+	mp->m_inode_shrink.seeks = DEFAULT_SEEKS;
+	register_shrinker(&mp->m_inode_shrink);
+}
+
+void
+xfs_inode_shrinker_unregister(
+	struct xfs_mount	*mp)
+{
+	unregister_shrinker(&mp->m_inode_shrink);
 }
diff --git a/fs/xfs/linux-2.6/xfs_sync.h b/fs/xfs/linux-2.6/xfs_sync.h
index d480c34..e6c631b 100644
--- a/fs/xfs/linux-2.6/xfs_sync.h
+++ b/fs/xfs/linux-2.6/xfs_sync.h
@@ -53,6 +53,9 @@ void __xfs_inode_clear_reclaim_tag(struct xfs_mount *mp, struct xfs_perag *pag,
 int xfs_sync_inode_valid(struct xfs_inode *ip, struct xfs_perag *pag);
 int xfs_inode_ag_iterator(struct xfs_mount *mp,
 	int (*execute)(struct xfs_inode *ip, struct xfs_perag *pag, int flags),
-	int flags, int tag, int write_lock);
+	int flags, int tag, int write_lock, int nr_to_scan);
+
+void xfs_inode_shrinker_register(struct xfs_mount *mp);
+void xfs_inode_shrinker_unregister(struct xfs_mount *mp);
 
 #endif
diff --git a/fs/xfs/quota/xfs_qm_syscalls.c b/fs/xfs/quota/xfs_qm_syscalls.c
index cda8ef4..79dbd30 100644
--- a/fs/xfs/quota/xfs_qm_syscalls.c
+++ b/fs/xfs/quota/xfs_qm_syscalls.c
@@ -891,7 +891,8 @@ xfs_qm_dqrele_all_inodes(
 	uint		 flags)
 {
 	ASSERT(mp->m_quotainfo);
-	xfs_inode_ag_iterator(mp, xfs_dqrele_inode, flags, XFS_ICI_NO_TAG, 0);
+	xfs_inode_ag_iterator(mp, xfs_dqrele_inode, flags,
+				XFS_ICI_NO_TAG, 0, 0);
 }
 
 /*------------------------------------------------------------------------*/
diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
index b1a5a1f..abb8222 100644
--- a/fs/xfs/xfs_ag.h
+++ b/fs/xfs/xfs_ag.h
@@ -223,6 +223,7 @@ typedef struct xfs_perag {
 	int		pag_ici_init;	/* incore inode cache initialised */
 	rwlock_t	pag_ici_lock;	/* incore inode lock */
 	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
+	int		pag_ici_reclaimable;	/* reclaimable inodes */
 #endif
 	int		pagb_count;	/* pagb slots in use */
 	xfs_perag_busy_t pagb_list[XFS_PAGB_NUM_SLOTS];	/* unstable blocks */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 4fa0bc7..1fe7662 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -259,6 +259,7 @@ typedef struct xfs_mount {
 	wait_queue_head_t	m_wait_single_sync_task;
 	__int64_t		m_update_flags;	/* sb flags we need to update
 						   on the next remount,rw */
+	struct shrinker		m_inode_shrink;	/* inode reclaim shrinker */
 } xfs_mount_t;
 
 /*

 STATIC int xfsbufd(void *);
-STATIC int xfsbufd_wakeup(int, gfp_t);
+STATIC int xfsbufd_wakeup(int, void *, gfp_t);
 STATIC void xfs_buf_delwri_queue(xfs_buf_t *, int);
 static struct shrinker xfs_buf_shake = {
 	.shrink = xfsbufd_wakeup,
@@ -339,7 +339,7 @@ _xfs_buf_lookup_pages(
 					__func__, gfp_mask);
 
 			XFS_STATS_INC(xb_page_retries);
-			xfsbufd_wakeup(0, gfp_mask);
+			xfsbufd_wakeup(0, NULL, gfp_mask);
 			congestion_wait(BLK_RW_ASYNC, HZ/50);
 			goto retry;
 		}
@@ -1759,6 +1759,7 @@ xfs_buf_runall_queues(
 STATIC int
 xfsbufd_wakeup(
 	int			priority,
+	void			*ctx,
 	gfp_t			mask)
 {
 	xfs_buftarg_t		*btp;
diff --git a/fs/xfs/quota/xfs_qm.c b/fs/xfs/quota/xfs_qm.c
index 456b72e..40cf5c9 100644
--- a/fs/xfs/quota/xfs_qm.c
+++ b/fs/xfs/quota/xfs_qm.c
@@ -69,7 +69,7 @@ STATIC void	xfs_qm_list_destroy(xfs_dqlist_t *);
 
 STATIC int	xfs_qm_init_quotainos(xfs_mount_t *);
 STATIC int	xfs_qm_init_quotainfo(xfs_mount_t *);
-STATIC int	xfs_qm_shake(int, gfp_t);
+STATIC int	xfs_qm_shake(int, void *, gfp_t);
 
 static struct shrinker xfs_qm_shaker = {
 	.shrink = xfs_qm_shake,
@@ -2120,6 +2120,7 @@ xfs_qm_shake_freelist(
 STATIC int
 xfs_qm_shake(
 	int	nr_to_scan,
+	void	*ctx,
 	gfp_t	gfp_mask)
 {
 	int	ndqused, nfree, n;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3899395..b3fd387 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -989,7 +989,8 @@ void sync_mm_rss(struct task_struct *task, struct mm_struct *mm);
  * querying the cache size, so a fastpath for that case is appropriate.
  */
 struct shrinker {
-	int (*shrink)(int nr_to_scan, gfp_t gfp_mask);
+	int (*shrink)(int nr_to_scan, void *ctx, gfp_t gfp_mask);
+	void *ctx;	/* user callback context */
 	int seeks;	/* seeks to recreate an obj */
 
 	/* These are for internal use */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79c8098..f293372 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -215,8 +215,9 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
 	list_for_each_entry(shrinker, &shrinker_list, list) {
 		unsigned long long delta;
 		unsigned long total_scan;
-		unsigned long max_pass = (*shrinker->shrink)(0, gfp_mask);
+		unsigned long max_pass;
 
+		max_pass = (*shrinker->shrink)(0, shrinker->ctx, gfp_mask);
 		delta = (4 * scanned) / shrinker->seeks;
 		delta *= max_pass;
 		do_div(delta, lru_pages + 1);
@@ -244,8 +245,10 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
 			int shrink_ret;
 			int nr_before;
 
-			nr_before = (*shrinker->shrink)(0, gfp_mask);
-			shrink_ret = (*shrinker->shrink)(this_scan, gfp_mask);
+			nr_before = (*shrinker->shrink)(0,
+						shrinker->ctx, gfp_mask);
+			shrink_ret = (*shrinker->shrink)(this_scan,
+						shrinker->ctx, gfp_mask);
 			if (shrink_ret == -1)
 				break;
 			if (shrink_ret < nr_before)

[-- Attachment #3: xfs-inode-reclaim-shrinker --]
[-- Type: text/plain, Size: 6498 bytes --]

xfs: add a shrinker to background inode reclaim

From: Dave Chinner <dchinner@redhat.com>

On low-memory boxes or those with highmem, the kernel can OOM before the
background inode reclaim via xfssyncd has a chance to run. Add a shrinker
to run inode reclaim so that inode reclaim is expedited when memory is low.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_super.c   |    3 ++
 fs/xfs/linux-2.6/xfs_sync.c    |   71 ++++++++++++++++++++++++++++++++++++----
 fs/xfs/linux-2.6/xfs_sync.h    |    5 ++-
 fs/xfs/quota/xfs_qm_syscalls.c |    3 +-
 fs/xfs/xfs_ag.h                |    1 +
 fs/xfs/xfs_mount.h             |    1 +
 6 files changed, 75 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 15bea67..aa11204 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1209,6 +1209,7 @@ xfs_fs_put_super(
 
 	xfs_unmountfs(mp);
 	xfs_freesb(mp);
+	xfs_inode_shrinker_unregister(mp);
 	xfs_icsb_destroy_counters(mp);
 	xfs_close_devices(mp);
 	xfs_dmops_put(mp);
@@ -1622,6 +1623,8 @@ xfs_fs_fill_super(
 	if (error)
 		goto fail_vnrele;
 
+	xfs_inode_shrinker_register(mp);
+
 	kfree(mtpt);
 	return 0;
 
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 5a4731e..a376a0d 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -95,7 +95,8 @@ xfs_inode_ag_walk(
 					   struct xfs_perag *pag, int flags),
 	int			flags,
 	int			tag,
-	int			exclusive)
+	int			exclusive,
+	int			*nr_to_scan)
 {
 	uint32_t		first_index;
 	int			last_error = 0;
@@ -134,7 +135,7 @@ restart:
 		if (error == EFSCORRUPTED)
 			break;
 
-	} while (1);
+	} while ((*nr_to_scan)--);
 
 	if (skipped) {
 		delay(1);
@@ -150,11 +151,16 @@ xfs_inode_ag_iterator(
 					   struct xfs_perag *pag, int flags),
 	int			flags,
 	int			tag,
-	int			exclusive)
+	int			exclusive,
+	int			nr_to_scan)
 {
 	int			error = 0;
 	int			last_error = 0;
 	xfs_agnumber_t		ag;
+	int			nr = nr_to_scan;
+
+	if (!nr)
+		nr = INT_MAX;
 
 	for (ag = 0; ag < mp->m_sb.sb_agcount; ag++) {
 		struct xfs_perag	*pag;
@@ -165,7 +171,7 @@ xfs_inode_ag_iterator(
 			continue;
 		}
 		error = xfs_inode_ag_walk(mp, pag, execute, flags, tag,
-						exclusive);
+						exclusive, &nr);
 		xfs_perag_put(pag);
 		if (error) {
 			last_error = error;
@@ -291,7 +297,7 @@ xfs_sync_data(
 	ASSERT((flags & ~(SYNC_TRYLOCK|SYNC_WAIT)) == 0);
 
 	error = xfs_inode_ag_iterator(mp, xfs_sync_inode_data, flags,
-				      XFS_ICI_NO_TAG, 0);
+				      XFS_ICI_NO_TAG, 0, 0);
 	if (error)
 		return XFS_ERROR(error);
 
@@ -310,7 +316,7 @@ xfs_sync_attr(
 	ASSERT((flags & ~SYNC_WAIT) == 0);
 
 	return xfs_inode_ag_iterator(mp, xfs_sync_inode_attr, flags,
-				     XFS_ICI_NO_TAG, 0);
+				     XFS_ICI_NO_TAG, 0, 0);
 }
 
 STATIC int
@@ -673,6 +679,7 @@ __xfs_inode_set_reclaim_tag(
 	radix_tree_tag_set(&pag->pag_ici_root,
 			   XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino),
 			   XFS_ICI_RECLAIM_TAG);
+	pag->pag_ici_reclaimable++;
 }
 
 /*
@@ -705,6 +712,7 @@ __xfs_inode_clear_reclaim_tag(
 {
 	radix_tree_tag_clear(&pag->pag_ici_root,
 			XFS_INO_TO_AGINO(mp, ip->i_ino), XFS_ICI_RECLAIM_TAG);
+	pag->pag_ici_reclaimable--;
 }
 
 /*
@@ -854,5 +862,54 @@ xfs_reclaim_inodes(
 	int		mode)
 {
 	return xfs_inode_ag_iterator(mp, xfs_reclaim_inode, mode,
-					XFS_ICI_RECLAIM_TAG, 1);
+					XFS_ICI_RECLAIM_TAG, 1, 0);
+}
+
+static int
+xfs_reclaim_inode_shrink(
+	int	nr_to_scan,
+	void	*ctx,
+	gfp_t	gfp_mask)
+{
+	struct xfs_mount *mp = ctx;
+	xfs_agnumber_t	ag;
+	int		reclaimable = 0;
+
+	if (nr_to_scan) {
+		if (!(gfp_mask & __GFP_FS))
+			return -1;
+
+		xfs_inode_ag_iterator(mp, xfs_reclaim_inode, 0,
+					XFS_ICI_RECLAIM_TAG, 1, nr_to_scan);
+	}
+
+	for (ag = 0; ag < mp->m_sb.sb_agcount; ag++) {
+		struct xfs_perag	*pag;
+
+		pag = xfs_perag_get(mp, ag);
+		if (!pag->pag_ici_init) {
+			xfs_perag_put(pag);
+			continue;
+		}
+		reclaimable += pag->pag_ici_reclaimable;
+		xfs_perag_put(pag);
+	}
+	return reclaimable;
+}
+
+void
+xfs_inode_shrinker_register(
+	struct xfs_mount	*mp)
+{
+	mp->m_inode_shrink.shrink = xfs_reclaim_inode_shrink;
+	mp->m_inode_shrink.ctx = mp;
+	mp->m_inode_shrink.seeks = DEFAULT_SEEKS;
+	register_shrinker(&mp->m_inode_shrink);
+}
+
+void
+xfs_inode_shrinker_unregister(
+	struct xfs_mount	*mp)
+{
+	unregister_shrinker(&mp->m_inode_shrink);
 }
diff --git a/fs/xfs/linux-2.6/xfs_sync.h b/fs/xfs/linux-2.6/xfs_sync.h
index d480c34..e6c631b 100644
--- a/fs/xfs/linux-2.6/xfs_sync.h
+++ b/fs/xfs/linux-2.6/xfs_sync.h
@@ -53,6 +53,9 @@ void __xfs_inode_clear_reclaim_tag(struct xfs_mount *mp, struct xfs_perag *pag,
 int xfs_sync_inode_valid(struct xfs_inode *ip, struct xfs_perag *pag);
 int xfs_inode_ag_iterator(struct xfs_mount *mp,
 	int (*execute)(struct xfs_inode *ip, struct xfs_perag *pag, int flags),
-	int flags, int tag, int write_lock);
+	int flags, int tag, int write_lock, int nr_to_scan);
+
+void xfs_inode_shrinker_register(struct xfs_mount *mp);
+void xfs_inode_shrinker_unregister(struct xfs_mount *mp);
 
 #endif
diff --git a/fs/xfs/quota/xfs_qm_syscalls.c b/fs/xfs/quota/xfs_qm_syscalls.c
index cda8ef4..79dbd30 100644
--- a/fs/xfs/quota/xfs_qm_syscalls.c
+++ b/fs/xfs/quota/xfs_qm_syscalls.c
@@ -891,7 +891,8 @@ xfs_qm_dqrele_all_inodes(
 	uint		 flags)
 {
 	ASSERT(mp->m_quotainfo);
-	xfs_inode_ag_iterator(mp, xfs_dqrele_inode, flags, XFS_ICI_NO_TAG, 0);
+	xfs_inode_ag_iterator(mp, xfs_dqrele_inode, flags,
+				XFS_ICI_NO_TAG, 0, 0);
 }
 
 /*------------------------------------------------------------------------*/
diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
index b1a5a1f..abb8222 100644
--- a/fs/xfs/xfs_ag.h
+++ b/fs/xfs/xfs_ag.h
@@ -223,6 +223,7 @@ typedef struct xfs_perag {
 	int		pag_ici_init;	/* incore inode cache initialised */
 	rwlock_t	pag_ici_lock;	/* incore inode lock */
 	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
+	int		pag_ici_reclaimable;	/* reclaimable inodes */
 #endif
 	int		pagb_count;	/* pagb slots in use */
 	xfs_perag_busy_t pagb_list[XFS_PAGB_NUM_SLOTS];	/* unstable blocks */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 4fa0bc7..1fe7662 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -259,6 +259,7 @@ typedef struct xfs_mount {
 	wait_queue_head_t	m_wait_single_sync_task;
 	__int64_t		m_update_flags;	/* sb flags we need to update
 						   on the next remount,rw */
+	struct shrinker		m_inode_shrink;	/* inode reclaim shrinker */
 } xfs_mount_t;
 
 /*

[-- Attachment #4: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]
  2010-04-07  1:45             ` Dave Chinner
@ 2010-04-07 22:02               ` Hans-Peter Jansen
  -1 siblings, 0 replies; 48+ messages in thread
From: Hans-Peter Jansen @ 2010-04-07 22:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, opensuse-kernel, xfs

On Wednesday 07 April 2010, 03:45:33 Dave Chinner wrote:
>
> However, if the memory pressure is purely inode cache (creating zero
> length files or read-only traversal), then the OOM killer kicks in a
> while after the slab cache fills memory.  This doesn't need highmem;
> I used an x86_64 kernel on a VM w/ 1GB RAM to reliably reproduce
> this.  I'll add zero length file tests and traversals to my low
> memory testing.

I'm glad that you're able to reproduce it. My initial failure was during a 
disk-to-disk backup (a simple cp -al & rsync combination).

> The best way to fix this, I think, is to trigger a shrinker callback
> when memory is low to run the background inode reclaim. The problem
> is that these inode caches and the reclaim state are per-filesystem,
> not global state, and the current shrinker interface only works with
> global state.
>
> Hence there are two patches to this fix - the first adds a context
> to the shrinker callout, and the second adds the XFS infrastructure
> to track the number of reclaimable inodes per filesystem and
> register/unregister shrinkers for each filesystem.

I see; the first one will be interesting to get into mainline, given the 
number of projects that are involved. 

> With these patches, my reproducible test case which locked the
> machine up with a OOM panic in a couple of minutes has been running
> for over half an hour. I have much more confidence in this change
> with limited testing than the reverting of the background inode
> reclaim as the revert introduces
>
> The patches below apply to the xfs-dev tree, which is currently at
> 34-rc1. If they don't apply, let me know and I'll redo them against
> a vanilla kernel tree. Can you test them to see if the problem goes
> away? If the problem is fixed, I'll push them for a proper review
> cycle...

Of course, you did the original patch for a reason... Therefore I would love 
to test your patches. I've tried to apply them to 2.6.33.2, but after 
fixing the same reject as noted below, I'm stuck here:

/usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c: 
In function 'xfs_reclaim_inode_shrink':
/usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:805: 
error: implicit declaration of function 'xfs_perag_get'
/usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:805: 
warning: assignment makes pointer from integer without a cast
/usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:807: 
error: implicit declaration of function 'xfs_perag_put'

Now I see that the offending functions have been renamed, but they've also 
grown a radix_tree structure and locking. How do I handle that?

BTW, your patches do not apply to Linus' current git tree either:
patching file fs/xfs/quota/xfs_qm.c
Hunk #1 succeeded at 72 (offset 3 lines).
Hunk #2 FAILED at 2120.
1 out of 2 hunks FAILED -- saving rejects to file fs/xfs/quota/xfs_qm.c.rej
I'm able to resolve this, but 2.6.34-current gives me some other 
trouble that I need to get past first (the PS/2 keyboard eventually stops working).

Anyway, thanks for your great support, Dave. This is much appreciated.

Cheers,
Pete

^ permalink raw reply	[flat|nested] 48+ messages in thread


* Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]
  2010-04-07 22:02               ` Hans-Peter Jansen
@ 2010-04-12 22:32                 ` Dave Chinner
  -1 siblings, 0 replies; 48+ messages in thread
From: Dave Chinner @ 2010-04-12 22:32 UTC (permalink / raw)
  To: Hans-Peter Jansen; +Cc: linux-kernel, opensuse-kernel, xfs

On Thu, Apr 08, 2010 at 12:02:20AM +0200, Hans-Peter Jansen wrote:
> On Wednesday 07 April 2010, 03:45:33 Dave Chinner wrote:
> >
> > However, if the memory pressure is purely inode cache (creating zero
> > length files or read-only traversal), then the OOM killer kicks a
> > while after the slab cache fills memory.  This doesn't need highmem;
> > I used a x86_64 kernel on a VM w/ 1GB RAM to reliably reproduce
> > this.  I'll add zero length file tests and traversals to my low
> > memory testing.
> 
> I'm glad, that you're able to reproduce it. My initial failure was during 
> disk to disk backup (with a simple cp -al & rsync combination).
> 
> > The best way to fix this, I think, is to trigger a shrinker callback
> > when memory is low to run the background inode reclaim. The problem
> > is that these inode caches and the reclaim state are per-filesystem,
> > not global state, and the current shrinker interface only works with
> > global state.
> >
> > Hence there are two patches to this fix - the first adds a context
> > to the shrinker callout, and the second adds the XFS infrastructure
> > to track the number of reclaimable inodes per filesystem and
> > register/unregister shrinkers for each filesystem.
> 
> I see, the first one will be interesting to get into mainline, given the 
> number of projects, that are involved. 
> 
> > With these patches, my reproducable test case which locked the
> > machine up with a OOM panic in a couple of minutes has been running
> > for over half an hour. I have much more confidence in this change
> > with limited testing than the reverting of the background inode
> > reclaim as the revert introduces
> >
> > The patches below apply to the xfs-dev tree, which is currently at
> > 34-rc1. If they don't apply, let me know and I'll redo them against
> > a vanilla kernel tree. Can you test them to see if the problem goes
> > away? If the problem is fixed, I'll push them for a proper review
> > cycle...
> 
> Of course, you did the original patch for a reason... Therefor I would love 
> to test your patches. I've tried to apply them to 2.6.33.2, but after 
> fixing the same reject as noted below, I'm stuck here:
> 
> /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c: 
> In function 'xfs_reclaim_inode_shrink':
> /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:805: 
> error: implicit declaration of function 'xfs_perag_get'
> /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:805: 
> warning: assignment makes pointer from integer without a cast
> /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:807: 
> error: implicit declaration of function 'xfs_perag_put'
> 
> Now I see, that there happened a rename of the offending functions, but also 
> they've grown a radix_tree structure and locking. How do I handle that?

With difficulty. I'd need to backport it to match the .33 code,
which may or may not be trivial...

> BTW, your patches do not apply to Linus' current git tree either:
> patching file fs/xfs/quota/xfs_qm.c
> Hunk #1 succeeded at 72 (offset 3 lines).
> Hunk #2 FAILED at 2120.
> 1 out of 2 hunks FAILED -- saving rejects to file fs/xfs/quota/xfs_qm.c.rej
> I'm able to resolve this, but 2.6.34-current does give me some other 
> trouble, that I need to get by (PS2 keyboard stops working eventually)..

Yeah, there's another patch in my xfs-dev tree that changes that.
I'll rebase it on a clean linux tree before I post it again.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 48+ messages in thread


* Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]
  2010-04-12 22:32                 ` Dave Chinner
@ 2010-04-13  8:50                   ` Hans-Peter Jansen
  -1 siblings, 0 replies; 48+ messages in thread
From: Hans-Peter Jansen @ 2010-04-13  8:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, opensuse-kernel, xfs

On Tuesday 13 April 2010, 00:32:41 Dave Chinner wrote:
> On Thu, Apr 08, 2010 at 12:02:20AM +0200, Hans-Peter Jansen wrote:
> > On Wednesday 07 April 2010, 03:45:33 Dave Chinner wrote:
> >
> > > The patches below apply to the xfs-dev tree, which is currently at
> > > 34-rc1. If they don't apply, let me know and I'll redo them against
> > > a vanilla kernel tree. Can you test them to see if the problem goes
> > > away? If the problem is fixed, I'll push them for a proper review
> > > cycle...
> >
> > Of course, you did the original patch for a reason... Therefor I would
> > love to test your patches. I've tried to apply them to 2.6.33.2, but
> > after fixing the same reject as noted below, I'm stuck here:
> >
> > /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/lin
> >ux-2.6/xfs_sync.c: In function 'xfs_reclaim_inode_shrink':
> > /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/lin
> >ux-2.6/xfs_sync.c:805: error: implicit declaration of function
> > 'xfs_perag_get'
> > /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/lin
> >ux-2.6/xfs_sync.c:805: warning: assignment makes pointer from integer
> > without a cast
> > /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/lin
> >ux-2.6/xfs_sync.c:807: error: implicit declaration of function
> > 'xfs_perag_put'
> >
> > Now I see, that there happened a rename of the offending functions, but
> > also they've grown a radix_tree structure and locking. How do I handle
> > that?
>
> With difficulty. I'd need to backport it to match the .33 code,
> which may or may not be trivial...

Dave, may I kindly ask you to briefly elaborate on the worst consequences 
of just reverting this hunk, as I've done before?

> > BTW, your patches do not apply to Linus' current git tree either:
> > patching file fs/xfs/quota/xfs_qm.c
> > Hunk #1 succeeded at 72 (offset 3 lines).
> > Hunk #2 FAILED at 2120.
> > 1 out of 2 hunks FAILED -- saving rejects to file
> > fs/xfs/quota/xfs_qm.c.rej I'm able to resolve this, but 2.6.34-current
> > does give me some other trouble, that I need to get by (PS2 keyboard
> > stops working eventually)..
>
> Yeah, there's another patch in my xfs-dev tree that changes that.
> I'll rebase it on a clean linux tree before I post it again.

I've briefly tested this with a codebase somewhere between -rc3 and -rc4, 
and it survived the du test, but it suffered from some strange network 
dropouts, which aren't funny on an NFS server...

Will retest your patches after opensuse-current has caught up with -rc4. 
Hopefully the most blatant stability issues are fixed by then.

Cheers,
Pete

^ permalink raw reply	[flat|nested] 48+ messages in thread


* Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]
  2010-04-13  8:50                   ` Hans-Peter Jansen
@ 2010-04-13  9:18                     ` Dave Chinner
  -1 siblings, 0 replies; 48+ messages in thread
From: Dave Chinner @ 2010-04-13  9:18 UTC (permalink / raw)
  To: Hans-Peter Jansen; +Cc: linux-kernel, opensuse-kernel, xfs

On Tue, Apr 13, 2010 at 10:50:35AM +0200, Hans-Peter Jansen wrote:
> On Tuesday 13 April 2010, 00:32:41 Dave Chinner wrote:
> > On Thu, Apr 08, 2010 at 12:02:20AM +0200, Hans-Peter Jansen wrote:
> > > On Wednesday 07 April 2010, 03:45:33 Dave Chinner wrote:
> > >
> > > > The patches below apply to the xfs-dev tree, which is currently at
> > > > 34-rc1. If they don't apply, let me know and I'll redo them against
> > > > a vanilla kernel tree. Can you test them to see if the problem goes
> > > > away? If the problem is fixed, I'll push them for a proper review
> > > > cycle...
> > >
> > > Of course, you did the original patch for a reason... Therefore I would
> > > love to test your patches. I've tried to apply them to 2.6.33.2, but
> > > after fixing the same reject as noted below, I'm stuck here:
> > >
> > > /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c: In function 'xfs_reclaim_inode_shrink':
> > > /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:805: error: implicit declaration of function 'xfs_perag_get'
> > > /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:805: warning: assignment makes pointer from integer without a cast
> > > /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:807: error: implicit declaration of function 'xfs_perag_put'
> > >
> > > Now I see that the offending functions were renamed, but they've also
> > > grown a radix_tree structure and locking. How do I handle that?
> >
> > With difficulty. I'd need to backport it to match the .33 code,
> > which may or may not be trivial...
> 
> Dave, may I kindly ask you to briefly elaborate on the worst consequences 
> of just reverting this hunk, as I've done before?

Well, given that it is the new shrinker code generating the warnings,
reverting/removing that hunk will render the patch useless :0

I'll get you a working 2.6.33 patch tomorrow - it's dinner time
now....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]
  2010-04-13  9:18                     ` Dave Chinner
@ 2010-04-13  9:42                       ` Hans-Peter Jansen
  -1 siblings, 0 replies; 48+ messages in thread
From: Hans-Peter Jansen @ 2010-04-13  9:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, opensuse-kernel, xfs

On Tuesday 13 April 2010, 11:18:23 Dave Chinner wrote:
> On Tue, Apr 13, 2010 at 10:50:35AM +0200, Hans-Peter Jansen wrote:
> >
> > Dave, may I kindly ask you to briefly elaborate on the worst
> > consequences of just reverting this hunk, as I've done before?
>
> Well, given that it is the new shrinker code generating the warnings,
> reverting/removing that hunk will render the patch useless :0

Excuse me, I didn't express myself well. I'm after the consequences of 
applying the revert that I posted a few messages above. 

> I'll get you a working 2.6.33 patch tomorrow - it's dinner time
> now....

Cool, thanks. 

Have a nice dinner,
Pete

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]
  2010-04-13  9:42                       ` Hans-Peter Jansen
@ 2010-04-24 16:44                         ` Hans-Peter Jansen
  -1 siblings, 0 replies; 48+ messages in thread
From: Hans-Peter Jansen @ 2010-04-24 16:44 UTC (permalink / raw)
  To: xfs; +Cc: Dave Chinner, opensuse-kernel, linux-kernel, Greg KH, Nick Piggin

On Tuesday 13 April 2010, 11:42:33 Hans-Peter Jansen wrote:
> On Tuesday 13 April 2010, 11:18:23 Dave Chinner wrote:
> > On Tue, Apr 13, 2010 at 10:50:35AM +0200, Hans-Peter Jansen wrote:
> > > Dave, may I kindly ask you to briefly elaborate on the worst
> > > consequences of just reverting this hunk, as I've done before?
> >
> > Well, given that it is the new shrinker code generating the warnings,
> > reverting/removing that hunk will render the patch useless :0
>
> Excuse me, I didn't express myself well. I'm after the consequences of
> applying the revert, that I posted a few messages above.
>
> > I'll get you a working 2.6.33 patch tomorrow - it's dinner time
> > now....
>
> Cool, thanks.

Obviously, and not entirely unexpectedly, really fixing this is going to 
take more time.

FYI, 2.6.33.2 is still affected by this issue. 

Greg, you might search for a server using xfs filesystems and an i586 
kernel >= 2.6.33 (2.6.32.11 of SLE11-SP1 will serve as well), log in as an 
ordinary user, do a "du" on /usr, and wait for the other users to start 
screaming...
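
For reference, that reproduction recipe can be scripted roughly as below. This is only a sketch, not from the thread; the log path and the `oom-killer` marker string are assumptions that vary by distro (on many systems `dmesg` is the more portable source).

```shell
# Walk a big tree on the XFS mount while counting oom-killer hits in the
# kernel log before and after the run.  TREE and LOG defaults are
# assumptions; adjust them for the machine under test.
TREE=${TREE:-/usr}
LOG=${LOG:-/var/log/messages}
before=$(grep -c oom-killer "$LOG" 2>/dev/null); before=${before:-0}
du -sh "$TREE" >/dev/null 2>&1
after=$(grep -c oom-killer "$LOG" 2>/dev/null); after=${after:-0}
if [ "$after" -gt "$before" ]; then
    echo "oom-killer fired during du"
else
    echo "du completed without triggering the oom-killer"
fi
```

On an affected kernel the walk blows up the inode cache and the first branch fires; on a fixed one the second line is printed.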

BTW, all affected kernels available from 
http://download.opensuse.org/repositories/home:/frispete: have the 
offending patch reverted (see subject) and run fine for me (in this 
respect).

Will you guys let another round of stable fixes pass by without doing 
anything about this issue?

Dave, this is why I'm kindly asking you: what might be the worst 
consequences if we just do the revert for now (at least for 2.6.33), until 
you and Nick come to a final decision on how to solve this issue in the 
future?

Just a brief note would be fine.

Cheers,
Pete

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]
  2010-04-24 16:44                         ` Hans-Peter Jansen
@ 2010-04-24 21:23                           ` Emmanuel Florac
  -1 siblings, 0 replies; 48+ messages in thread
From: Emmanuel Florac @ 2010-04-24 21:23 UTC (permalink / raw)
  To: Hans-Peter Jansen
  Cc: xfs, Nick Piggin, opensuse-kernel, Greg KH, linux-kernel

On Sat, 24 Apr 2010 18:44:22 +0200, you wrote:

> Greg, you might search for a server using xfs filesystems and an
> i586 kernel >= 2.6.33 (2.6.32.11 of SLE11-SP1 will serve as well),
> log in as an ordinary user, do a "du" on /usr, and wait for the other
> users to start screaming...

I did precisely that and didn't notice anything special (du on a kernel
source tree): kernel 2.6.32.11, deadline scheduler, 7-drive RAID-6
array, 8 GB RAM.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]
  2010-04-24 21:23                           ` Emmanuel Florac
@ 2010-04-24 22:30                             ` Hans-Peter Jansen
  -1 siblings, 0 replies; 48+ messages in thread
From: Hans-Peter Jansen @ 2010-04-24 22:30 UTC (permalink / raw)
  To: Emmanuel Florac; +Cc: xfs, Nick Piggin, opensuse-kernel, Greg KH, linux-kernel

On Saturday 24 April 2010, 23:23:20 Emmanuel Florac wrote:
> On Sat, 24 Apr 2010 18:44:22 +0200, you wrote:
> > Greg, you might search for a server using xfs filesystems and an
> > i586 kernel >= 2.6.33 (2.6.32.11 of SLE11-SP1 will serve as well),
> > log in as an ordinary user, do a "du" on /usr, and wait for the other
> > users to start screaming...
>
> I did precisely that, and didn't notice anything special (du on kernel
> source tree) kernel 2.6.32.11, deadline scheduler, 7 drives RAID-6
> array, 8GB RAM.

I guess you're not on this specific openSUSE git version of 2.6.32.11 (i.e. 
the preparation for SP1 of SLE11), which, as usual, carries a lot of stuff 
from later kernels. The offending patch was included in linux-2.6.33 
between -rc4 and -rc5:

Committer:    Alex Elder <aelder@sgi.com>
Author:       Dave Chinner <david@fromorbit.com>
Author date:  11.01.10 00:51
Parent:       xfs: Avoid inodes in reclaim when flushing from inode cache
Child:        xfs: Remove inode iolock held check during allocation
Branch:       master origin (Merge branch 'for-linus' of git://git.kernel.org/pub/scm/...) 
Branch:       2.6.33.1 (Linux 2.6.33) 
Follows:      v2.6.33-rc4 (Linux 2.6.33-rc4)
Precedes:     v2.6.33-rc5 (Linux 2.6.33-rc5)
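
Anyone wanting to verify this placement can ask git directly. Below is a sketch against a throwaway repository (tag names mimic the rc tags; on an actual kernel clone you would pass the real commit id from the subject, 57817c68..., instead of the toy commit):

```shell
# Toy history demonstrating the query that places a commit between two -rc
# tags: tag an "-rc4", add the offending commit, tag an "-rc5", then ask
# git which tags contain that commit.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m base
git tag v2.6.33-rc4
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m "xfs: offending patch"
bad=$(git rev-parse HEAD)
git tag v2.6.33-rc5
git tag --contains "$bad"   # prints v2.6.33-rc5 but not v2.6.33-rc4
```

The same `git tag --contains` (or `git describe --contains`) query on a kernel tree is what pins the commit between -rc4 and -rc5.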

Cheers,
Pete

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [opensuse-kernel] Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]
  2010-04-24 22:30                             ` Hans-Peter Jansen
@ 2010-04-24 22:40                               ` Justin P. Mattock
  -1 siblings, 0 replies; 48+ messages in thread
From: Justin P. Mattock @ 2010-04-24 22:40 UTC (permalink / raw)
  To: Hans-Peter Jansen
  Cc: Emmanuel Florac, xfs, Nick Piggin, opensuse-kernel, Greg KH,
	linux-kernel

On 04/24/2010 03:30 PM, Hans-Peter Jansen wrote:
> On Saturday 24 April 2010, 23:23:20 Emmanuel Florac wrote:
>> On Sat, 24 Apr 2010 18:44:22 +0200, you wrote:
>>> Greg, you might search for a server using xfs filesystems and an
>>> i586 kernel >= 2.6.33 (2.6.32.11 of SLE11-SP1 will serve as well),
>>> log in as an ordinary user, do a "du" on /usr, and wait for the other
>>> users to start screaming...
>>
>> I did precisely that, and didn't notice anything special (du on kernel
>> source tree) kernel 2.6.32.11, deadline scheduler, 7 drives RAID-6
>> array, 8GB RAM.
>
> I guess, you're not on this specific openSUSE git version of 2.6.32.11 (e.g.
> the preparation for SP1 of SLE11), which, as usual, carries a lot of stuff
> from later kernels. The offending patch was included in linux-2.6.33
> between -rc4 and -rc5:
>
> Committer
> Alex Elder<aelder@sgi.com>
> Author
> Dave Chinner<david@fromorbit.com>
> Author date
> 11.01.10 00:51
> Parent
> xfs: Avoid inodes in reclaim when flushing from inode cache
> Child
> xfs: Remove inode iolock held check during allocation
> Branch
> master origin (Merge branch 'for-linus' of git://git.kernel.org/pub/scm/...)
> Branch
> 2.6.33.1 (Linux 2.6.33)
> Follows
> v2.6.33-rc4 (Linux 2.6.33-rc4)
> Precedes
> v2.6.33-rc5 (Linux 2.6.33-rc5)
>
> Cheers,
> Pete


Is this bisectable? From what I remember with 2.6.33, looking at the bug
reports, I don't recall any issue in regard to firmware-related stuff
for radeon (but I could be wrong).

Keep in mind, I don't have your card, but I do have the X1600, which has
had no issues so far (running the latest HEAD). Does changing the .config
work for you? (In regard to what the thread I posted had mentioned.)

As for the openSUSE SP1 of SLE11: glad there is an option to load the
latest kernel.

Justin P. Mattock

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [opensuse-kernel] Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]
  2010-04-24 22:30                             ` Hans-Peter Jansen
@ 2010-04-24 22:41                               ` Justin P. Mattock
  -1 siblings, 0 replies; 48+ messages in thread
From: Justin P. Mattock @ 2010-04-24 22:41 UTC (permalink / raw)
  To: Hans-Peter Jansen
  Cc: Emmanuel Florac, xfs, Nick Piggin, opensuse-kernel, Greg KH,
	linux-kernel

On 04/24/2010 03:30 PM, Hans-Peter Jansen wrote:
> On Saturday 24 April 2010, 23:23:20 Emmanuel Florac wrote:
>> On Sat, 24 Apr 2010 18:44:22 +0200, you wrote:
>>> Greg, you might search for a server using xfs filesystems and an
>>> i586 kernel >= 2.6.33 (2.6.32.11 of SLE11-SP1 will serve as well),
>>> log in as an ordinary user, do a "du" on /usr, and wait for the other
>>> users to start screaming...
>>
>> I did precisely that, and didn't notice anything special (du on kernel
>> source tree) kernel 2.6.32.11, deadline scheduler, 7 drives RAID-6
>> array, 8GB RAM.
>
> I guess, you're not on this specific openSUSE git version of 2.6.32.11 (e.g.
> the preparation for SP1 of SLE11), which, as usual, carries a lot of stuff
> from later kernels. The offending patch was included in linux-2.6.33
> between -rc4 and -rc5:
>
> Committer
> Alex Elder<aelder@sgi.com>
> Author
> Dave Chinner<david@fromorbit.com>
> Author date
> 11.01.10 00:51
> Parent
> xfs: Avoid inodes in reclaim when flushing from inode cache
> Child
> xfs: Remove inode iolock held check during allocation
> Branch
> master origin (Merge branch 'for-linus' of git://git.kernel.org/pub/scm/...)
> Branch
> 2.6.33.1 (Linux 2.6.33)
> Follows
> v2.6.33-rc4 (Linux 2.6.33-rc4)
> Precedes
> v2.6.33-rc5 (Linux 2.6.33-rc5)
>
> Cheers,
> Pete

Sh*t.. my bad, I just replied to the wrong thread..

I'm tired.. please ignore this.

Justin P. Mattock

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]
  2010-04-24 22:30                             ` Hans-Peter Jansen
@ 2010-04-25 13:04                               ` Emmanuel Florac
  -1 siblings, 0 replies; 48+ messages in thread
From: Emmanuel Florac @ 2010-04-25 13:04 UTC (permalink / raw)
  To: Hans-Peter Jansen; +Cc: xfs, Nick Piggin, Greg KH, linux-kernel

On Sun, 25 Apr 2010 00:30:40 +0200, you wrote:

> I guess, you're not on this specific openSUSE git version of
> 2.6.32.11 (e.g. the preparation for SP1 of SLE11), which, as usual,
> carries a lot of stuff from later kernels.

No, I always use pristine unpatched kernel.org releases, no SELinux, no
nothing. Just one more confirmation that I should keep it that way :)

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]
  2010-04-24 16:44                         ` Hans-Peter Jansen
@ 2010-04-25 16:27                           ` Greg KH
  -1 siblings, 0 replies; 48+ messages in thread
From: Greg KH @ 2010-04-25 16:27 UTC (permalink / raw)
  To: Hans-Peter Jansen
  Cc: xfs, Dave Chinner, opensuse-kernel, linux-kernel, Nick Piggin

On Sat, Apr 24, 2010 at 06:44:22PM +0200, Hans-Peter Jansen wrote:
> FYI, 2.6.33.2 is still affected from this issue. 

Is 2.6.33.3-rc2 affected?  A lot of xfs patches are in there (as there are
in 2.6.32.12-rc2).

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]
  2010-04-25 16:27                           ` Greg KH
@ 2010-04-25 16:57                             ` Christoph Hellwig
  -1 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2010-04-25 16:57 UTC (permalink / raw)
  To: Greg KH
  Cc: Hans-Peter Jansen, Nick Piggin, opensuse-kernel, linux-kernel, xfs

On Sun, Apr 25, 2010 at 09:27:40AM -0700, Greg KH wrote:
> On Sat, Apr 24, 2010 at 06:44:22PM +0200, Hans-Peter Jansen wrote:
> > FYI, 2.6.33.2 is still affected by this issue.
> 
> Is 2.6.33.3-rc2 affected?  A lot of xfs patches are in there (as are in
> 2.6.32.12-rc2.)

Yes.  It's not even in mainline yet as Nick doesn't like the trivial
core VM fix required to solve this in a clean way.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]
  2010-04-25 16:57                             ` Christoph Hellwig
@ 2010-04-25 18:18                               ` Greg KH
  -1 siblings, 0 replies; 48+ messages in thread
From: Greg KH @ 2010-04-25 18:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Hans-Peter Jansen, Nick Piggin, opensuse-kernel, linux-kernel, xfs

On Sun, Apr 25, 2010 at 12:57:35PM -0400, Christoph Hellwig wrote:
> On Sun, Apr 25, 2010 at 09:27:40AM -0700, Greg KH wrote:
> > On Sat, Apr 24, 2010 at 06:44:22PM +0200, Hans-Peter Jansen wrote:
> > > FYI, 2.6.33.2 is still affected by this issue.
> > 
> > Is 2.6.33.3-rc2 affected?  A lot of xfs patches are in there (as are in
> > 2.6.32.12-rc2.)
> 
> Yes.  It's not even in mainline yet as Nick doesn't like the trivial
> core VM fix required to solve this in a clean way.

Hm, Nick, why?  This seems like a real problem, easily reproduced.  Is
it solved some other way in Linus's tree that we could backport to the
-stable series?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]
  2010-04-24 16:44                         ` Hans-Peter Jansen
@ 2010-04-26  0:32                           ` Dave Chinner
  -1 siblings, 0 replies; 48+ messages in thread
From: Dave Chinner @ 2010-04-26  0:32 UTC (permalink / raw)
  To: Hans-Peter Jansen
  Cc: xfs, opensuse-kernel, linux-kernel, Greg KH, Nick Piggin

On Sat, Apr 24, 2010 at 06:44:22PM +0200, Hans-Peter Jansen wrote:
> On Tuesday 13 April 2010, 11:42:33 Hans-Peter Jansen wrote:
> > On Tuesday 13 April 2010, 11:18:23 Dave Chinner wrote:
> > > On Tue, Apr 13, 2010 at 10:50:35AM +0200, Hans-Peter Jansen wrote:
> > > > Dave, may I ask you kindly for briefly elaborating on the worst
> > > > consequences of just reverting this hunk, as I've done before?
> > >
> > > Well, given that is the new shrinker code generating the warnings,
> > > reverting/removing that hunk will render the patch useless :0
> >
> > Excuse me, I didn't express myself well. I'm after the consequences of
> > applying the revert, that I posted a few messages above.
> >
> > > I'll get you a working 2.6.33 patch tomorrow - it's dinner time
> > > now....
> >
> > Cool, thanks.
> 
> Obviously, and not totally unexpectedly, really fixing this is going to
> take more time.

The problem is that the fix I did has been rejected by the upstream
VM guys, and the stable rules are that fixes have to be in mainline
before they can be put in a stable release.  So, until we get a fix
in mainline, it can't be fixed in the -stable kernels.

> FYI, 2.6.33.2 is still affected by this issue.
> 
> Greg, you might search for a server using xfs filesystems and an i586
> kernel >= 2.6.33 (2.6.32.11 of SLE11-SP1 will serve as well), log in as an
> ordinary user, do a "du" on /usr, and wait for the other users screaming...

Yet there's only been one report of the problem. While that doesn't
make it any less serious, I don't think the problem you're reporting
is as widespread as you are making it out to be. We'll get the fix
done and upstream, and then it will go back to the stable kernel.

You could always apply the *tested* patches I posted that fix
the problem, as....

> BTW, all affected kernels, available from 
> http://download.opensuse.org/repositories/home:/frispete: have the 
> offending patch reverted (see subject), do run fine for me (on this 
> aspect).

... you seem to be capable of doing so.

> Will you guys let another round of stable fixes go by without doing
> anything about this issue?

If the process of getting the fix upstream takes longer than another
stable release cycle, then yes. I'm sorry, but I can't control the
process, and if someone takes a week to NACK a fix, then you're just
going to have to wait longer. Feel free to run the fix in the
meantime - testing it, even if it was NACKed will still help us
because if it fixes your problem we know that we are fixing the
_right problem_.

If you can't live with this, then you shouldn't be running the
latest and greatest kernels in your production environment....

> Dave, this is why I'm kindly asking you: what might be the worst 
> consequences, if we just do the revert for now (at least for 2.6.33), until 
> you and Nick came to a final decision on how to solve this issue in the 
> future.

I've already told you - you could be reintroducing all the really
hard to reproduce inode reclaim problems (oops, hangs, panics,
potentially even fs corruption) that the patch in question was part
of the fix for.  You're running code that changes reclaim in very
subtle ways and has not been tested upstream in any way - if it
breaks you get to keep all the broken pieces to yourself...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]
  2010-04-25 16:27                           ` Greg KH
@ 2010-04-26  0:36                             ` Dave Chinner
  -1 siblings, 0 replies; 48+ messages in thread
From: Dave Chinner @ 2010-04-26  0:36 UTC (permalink / raw)
  To: Greg KH
  Cc: Hans-Peter Jansen, xfs, opensuse-kernel, linux-kernel, Nick Piggin

On Sun, Apr 25, 2010 at 09:27:40AM -0700, Greg KH wrote:
> On Sat, Apr 24, 2010 at 06:44:22PM +0200, Hans-Peter Jansen wrote:
> > FYI, 2.6.33.2 is still affected by this issue.
> 
> Is 2.6.33.3-rc2 affected?  A lot of xfs patches are in there (as are in
> 2.6.32.12-rc2.)

The fix is not in Linus' kernel yet, Greg. So once that is done,
I'll have to backport the fix back to those stable kernels as well.
It's not a trivial fix, so it will miss this round of stable
releases....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]
  2010-04-26  0:36                             ` Dave Chinner
@ 2010-04-27  0:02                               ` Greg KH
  -1 siblings, 0 replies; 48+ messages in thread
From: Greg KH @ 2010-04-27  0:02 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Hans-Peter Jansen, xfs, opensuse-kernel, linux-kernel, Nick Piggin

On Mon, Apr 26, 2010 at 10:36:37AM +1000, Dave Chinner wrote:
> On Sun, Apr 25, 2010 at 09:27:40AM -0700, Greg KH wrote:
> > On Sat, Apr 24, 2010 at 06:44:22PM +0200, Hans-Peter Jansen wrote:
> > > FYI, 2.6.33.2 is still affected by this issue.
> > 
> > Is 2.6.33.3-rc2 affected?  A lot of xfs patches are in there (as are in
> > 2.6.32.12-rc2.)
> 
> The fix is not in Linus' kernel yet, Greg. So once that is done,
> I'll have to backport the fix back to those stable kernels as well.
> It's not a trivial fix, so it will miss this round of stable
> releases....

Ok, that's fine, just checking.  I'm in no rush, I have plenty of other
patches to queue up for the next stable releases :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2010-04-27  0:02 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-04 22:49 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer Hans-Peter Jansen
2010-04-05  0:49 ` Dave Chinner
2010-04-05 11:35   ` Hans-Peter Jansen
2010-04-05 23:06     ` Dave Chinner
2010-04-06 14:52       ` 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098] Hans-Peter Jansen
2010-04-06 23:11         ` Dave Chinner
2010-04-07  1:45           ` Dave Chinner
2010-04-07 22:02             ` Hans-Peter Jansen
2010-04-12 22:32               ` Dave Chinner
2010-04-13  8:50                 ` Hans-Peter Jansen
2010-04-13  9:18                   ` Dave Chinner
2010-04-13  9:42                     ` Hans-Peter Jansen
2010-04-24 16:44                       ` Hans-Peter Jansen
2010-04-24 21:23                         ` Emmanuel Florac
2010-04-24 22:30                           ` Hans-Peter Jansen
2010-04-24 22:40                             ` [opensuse-kernel] " Justin P. Mattock
2010-04-24 22:41                             ` Justin P. Mattock
2010-04-25 13:04                             ` Emmanuel Florac
2010-04-25 16:27                         ` Greg KH
2010-04-25 16:57                           ` Christoph Hellwig
2010-04-25 18:18                             ` Greg KH
2010-04-26  0:36                           ` Dave Chinner
2010-04-27  0:02                             ` Greg KH
2010-04-26  0:32                         ` Dave Chinner
