* PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

From: John Berthels @ 2010-04-07 11:06 UTC
To: linux-kernel; +Cc: Nick Gregory, Rob Sanderson

Hi folks,

[I'm afraid that I'm not subscribed to the list, please cc: me on any reply.]

Problem: kernel.org 2.6.33.2 x86_64 kernel locks up under write-heavy I/O
load. It is "fixed" by changing THREAD_ORDER to 2.

Is this an OK long-term solution, and should it be needed? As far as I can
see from searching, there is an expectation that xfs would generally work
with 8k stacks (THREAD_ORDER 1). We don't have xfs stacked over LVM or
anything else. If anyone can offer any advice on this, that would be great.

I understand larger kernel stacks may introduce problems in getting an
allocation of the appropriate size. So am I right in thinking the symptom
we need to look out for would be an error on fork() or clone()? Or will the
box panic in that case?

Details below.

regards,

jb

Background: We have a cluster of systems with roughly the following specs:
2GB RAM, 24 (twenty-four) 1TB+ disks, Intel Core2 Duo @ 2.2GHz.

Following the addition of three new servers to the cluster, we started
seeing a high incidence of intermittent lockups (up to several times per
day for some servers) across both the old and new servers. Prior to that,
we saw this problem only rarely (perhaps once per 3 months). Adding the new
servers will have changed the I/O patterns to all servers.

The servers receive a heavy write load, often with many slow writers (as
well as a read load). Servers would become unresponsive, with nothing
written to /var/log/messages. Setting sysctl kernel.panic=300 caused a
restart (which showed the kernel was panicking and unable to write at the
time).
netconsole showed a variety of stack traces, mostly related to xfs_write
activity (but then, that's what the box spends its time doing).

22/24 of the disks have 1 partition, formatted with xfs (over the
partition, not over LVM). The other 2 disks have 3 partitions: xfs data,
swap and a RAID1 partition contributing to an ext3 root filesystem mounted
on /dev/md0.

We have tried various solutions (different kernels from ubuntu server
2.6.28->2.6.32). Vanilla 2.6.33.2 from kernel.org + stack tracing still has
the problem, and logged:

    kernel: [58552.740032] flush-8:112 used greatest stack depth: 184 bytes left

a short while before dying. Vanilla 2.6.33.2 + stack tracing + THREAD_ORDER
2 is much more stable (no lockups so far, we would have expected 5-6 by
now) and has logged:

    kernel: [44798.183507] apache2 used greatest stack depth: 7208 bytes left

which I understand (possibly wrongly) as concrete evidence that we have
exceeded 8k of stack space.
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

From: Dave Chinner @ 2010-04-07 14:05 UTC
To: John Berthels; +Cc: linux-kernel, Nick Gregory, Rob Sanderson, xfs

On Wed, Apr 07, 2010 at 12:06:01PM +0100, John Berthels wrote:
> Problem: kernel.org 2.6.33.2 x86_64 kernel locks up under
> write-heavy I/O load. It is "fixed" by changing THREAD_ORDER to 2.
>
> Is this an OK long-term solution/should this be needed? As far as I
> can see from searching, there is an expectation that xfs would
> generally work with 8k stacks (THREAD_ORDER 1). We don't have xfs
> stacked over LVM or anything else.

I'm not seeing stacks deeper than about 5.6k on XFS under heavy write
loads. That's nowhere near blowing an 8k stack, so there must be
something special about what you are doing. Can you post the stack
traces that are being generated for the deepest stack generated -
/sys/kernel/debug/tracing/stack_trace should contain it.

> Background: We have a cluster of systems with roughly the following
> specs (2GB RAM, 24 (twenty-four) 1TB+ disks, Intel Core2 Duo @
> 2.2GHz).
>
> Following the addition of three new servers to the cluster, we
> started seeing a high incidence of intermittent lockups (up to
> several times per day for some servers) across both the old and new
> servers. Prior to that, we saw this problem only rarely (perhaps
> once per 3 months).

What is generating the write load?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

From: John Berthels @ 2010-04-07 15:57 UTC
To: Dave Chinner; +Cc: linux-kernel, Nick Gregory, Rob Sanderson, xfs

[-- Attachment #1: Type: text/plain, Size: 4514 bytes --]

Dave Chinner wrote:
> I'm not seeing stacks deeper than about 5.6k on XFS under heavy write
> loads. That's nowhere near blowing an 8k stack, so there must be
> something special about what you are doing. Can you post the stack
> traces that are being generated for the deepest stack generated -
> /sys/kernel/debug/tracing/stack_trace should contain it.

Appended below. That doesn't seem to reach 8192, but the box it's from has
logged:

    [74649.579386] apache2 used greatest stack depth: 7024 bytes left

full dmesg (gzipped) attached.

> What is generating the write load?

WebDAV PUTs in a modified mogilefs cluster, running apache-mpm-worker
(threaded) as the DAV server. The write load is a mix of
internet-upload-speed writers trickling files up and some local fast
replicators copying from elsewhere in the cluster. mpm worker cfg is:

    ServerLimit 20
    StartServers 5
    MaxClients 300
    MinSpareThreads 25
    MaxSpareThreads 75
    ThreadsPerChild 30
    MaxRequestsPerChild 0

File sizes are a mix of small to large (4GB+). Each disk is exported as a
mogile device, so it's possible for mogile to pound a single disk with lots
of write activity (if the random number generator decides to put lots of
files on that device at the same time).

We're also seeing occasional slowdowns + high load avg (up to ~300, i.e.
MaxClients) with a corresponding number of threads in D state. (This
slowdown + high load avg seems to correlate with what would have previously
caused a panic with THREAD_ORDER 1, but not 100% sure.) As you can see from
the dmesg, this trips the "task xxx blocked for more than 120 seconds."
warning on some of the threads. Don't know if that's related to the stack
issue or to be expected under the load.

jb

        Depth    Size   Location    (47 entries)
        -----    ----   --------
  0)     7568      16   mempool_alloc_slab+0x16/0x20
  1)     7552     144   mempool_alloc+0x65/0x140
  2)     7408      96   get_request+0x124/0x370
  3)     7312     144   get_request_wait+0x29/0x1b0
  4)     7168      96   __make_request+0x9b/0x490
  5)     7072     208   generic_make_request+0x3df/0x4d0
  6)     6864      80   submit_bio+0x7c/0x100
  7)     6784      96   _xfs_buf_ioapply+0x128/0x2c0 [xfs]
  8)     6688      48   xfs_buf_iorequest+0x75/0xd0 [xfs]
  9)     6640      32   _xfs_buf_read+0x36/0x70 [xfs]
 10)     6608      48   xfs_buf_read+0xda/0x110 [xfs]
 11)     6560      80   xfs_trans_read_buf+0x2a7/0x410 [xfs]
 12)     6480      80   xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
 13)     6400      80   xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
 14)     6320     176   xfs_btree_lookup+0xd7/0x490 [xfs]
 15)     6144      16   xfs_alloc_lookup_eq+0x19/0x20 [xfs]
 16)     6128      96   xfs_alloc_fixup_trees+0xee/0x350 [xfs]
 17)     6032     144   xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
 18)     5888      32   xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
 19)     5856      96   xfs_alloc_vextent+0x49f/0x630 [xfs]
 20)     5760     160   xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
 21)     5600     208   xfs_btree_split+0xb3/0x6a0 [xfs]
 22)     5392      96   xfs_btree_make_block_unfull+0x151/0x190 [xfs]
 23)     5296     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
 24)     5072     128   xfs_btree_insert+0x86/0x180 [xfs]
 25)     4944     352   xfs_bmap_add_extent_delay_real+0x41e/0x1660 [xfs]
 26)     4592     208   xfs_bmap_add_extent+0x41c/0x450 [xfs]
 27)     4384     448   xfs_bmapi+0x982/0x1200 [xfs]
 28)     3936     256   xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
 29)     3680     208   xfs_iomap+0x3d8/0x410 [xfs]
 30)     3472      32   xfs_map_blocks+0x2c/0x30 [xfs]
 31)     3440     256   xfs_page_state_convert+0x443/0x730 [xfs]
 32)     3184      64   xfs_vm_writepage+0xab/0x160 [xfs]
 33)     3120     384   shrink_page_list+0x65e/0x840
 34)     2736     528   shrink_zone+0x63f/0xe10
 35)     2208     112   do_try_to_free_pages+0xc2/0x3c0
 36)     2096     128   try_to_free_pages+0x77/0x80
 37)     1968     240   __alloc_pages_nodemask+0x3e4/0x710
 38)     1728      48   alloc_pages_current+0x8c/0xe0
 39)     1680      16   __get_free_pages+0xe/0x50
 40)     1664      48   __pollwait+0xca/0x110
 41)     1616      32   unix_poll+0x28/0xc0
 42)     1584      16   sock_poll+0x1d/0x20
 43)     1568     912   do_select+0x3d6/0x700
 44)      656     416   core_sys_select+0x18c/0x2c0
 45)      240     112   sys_select+0x4f/0x110
 46)      128     128   system_call_fastpath+0x16/0x1b

[-- Attachment #2: dmesg.txt.gz --]
[-- Type: application/x-gzip, Size: 18745 bytes --]
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

From: Eric Sandeen @ 2010-04-07 17:43 UTC
To: John Berthels; +Cc: Dave Chinner, Nick Gregory, xfs, linux-kernel, Rob Sanderson

John Berthels wrote:
> Dave Chinner wrote:
>> I'm not seeing stacks deeper than about 5.6k on XFS under heavy write
>> loads. That's nowhere near blowing an 8k stack, so there must be
>> something special about what you are doing. Can you post the stack
>> traces that are being generated for the deepest stack generated -
>> /sys/kernel/debug/tracing/stack_trace should contain it.
>
> Appended below. That doesn't seem to reach 8192 but the box it's from
> has logged:
>
> [74649.579386] apache2 used greatest stack depth: 7024 bytes left

but that's -left- (out of 8k or is that from a THREAD_ORDER=2 box?)
I guess it must be out of 16k...

>         Depth    Size   Location    (47 entries)
>         -----    ----   --------
>   0)     7568      16   mempool_alloc_slab+0x16/0x20
>   1)     7552     144   mempool_alloc+0x65/0x140
>   2)     7408      96   get_request+0x124/0x370
>   3)     7312     144   get_request_wait+0x29/0x1b0
>   4)     7168      96   __make_request+0x9b/0x490
>   5)     7072     208   generic_make_request+0x3df/0x4d0
>   6)     6864      80   submit_bio+0x7c/0x100
>   7)     6784      96   _xfs_buf_ioapply+0x128/0x2c0 [xfs]
>   8)     6688      48   xfs_buf_iorequest+0x75/0xd0 [xfs]
>   9)     6640      32   _xfs_buf_read+0x36/0x70 [xfs]
>  10)     6608      48   xfs_buf_read+0xda/0x110 [xfs]
>  11)     6560      80   xfs_trans_read_buf+0x2a7/0x410 [xfs]
>  12)     6480      80   xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
>  13)     6400      80   xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
>  14)     6320     176   xfs_btree_lookup+0xd7/0x490 [xfs]
>  15)     6144      16   xfs_alloc_lookup_eq+0x19/0x20 [xfs]
>  16)     6128      96   xfs_alloc_fixup_trees+0xee/0x350 [xfs]
>  17)     6032     144   xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
>  18)     5888      32   xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
>  19)     5856      96   xfs_alloc_vextent+0x49f/0x630 [xfs]
>  20)     5760     160   xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
>  21)     5600     208   xfs_btree_split+0xb3/0x6a0 [xfs]
>  22)     5392      96   xfs_btree_make_block_unfull+0x151/0x190 [xfs]
>  23)     5296     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
>  24)     5072     128   xfs_btree_insert+0x86/0x180 [xfs]
>  25)     4944     352   xfs_bmap_add_extent_delay_real+0x41e/0x1660 [xfs]
>  26)     4592     208   xfs_bmap_add_extent+0x41c/0x450 [xfs]
>  27)     4384     448   xfs_bmapi+0x982/0x1200 [xfs]

This one, I'm afraid, has always been big.

>  28)     3936     256   xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
>  29)     3680     208   xfs_iomap+0x3d8/0x410 [xfs]
>  30)     3472      32   xfs_map_blocks+0x2c/0x30 [xfs]
>  31)     3440     256   xfs_page_state_convert+0x443/0x730 [xfs]
>  32)     3184      64   xfs_vm_writepage+0xab/0x160 [xfs]
>  33)     3120     384   shrink_page_list+0x65e/0x840
>  34)     2736     528   shrink_zone+0x63f/0xe10

that's a nice one (actually the two together at > 900 bytes, ouch)

>  35)     2208     112   do_try_to_free_pages+0xc2/0x3c0
>  36)     2096     128   try_to_free_pages+0x77/0x80
>  37)     1968     240   __alloc_pages_nodemask+0x3e4/0x710
>  38)     1728      48   alloc_pages_current+0x8c/0xe0
>  39)     1680      16   __get_free_pages+0xe/0x50
>  40)     1664      48   __pollwait+0xca/0x110
>  41)     1616      32   unix_poll+0x28/0xc0
>  42)     1584      16   sock_poll+0x1d/0x20
>  43)     1568     912   do_select+0x3d6/0x700

912, ouch!

    int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
    {
            ktime_t expire, *to = NULL;
            struct poll_wqueues table;

    (gdb) p sizeof(struct poll_wqueues)
    $1 = 624

I guess that's been there forever, though.

>  44)      656     416   core_sys_select+0x18c/0x2c0

416 hurts too. The xfs callchain is deep, no doubt, but the combination of
the select path and the shrink calls is almost 2k in just a few calls, and
that doesn't help much.

-Eric

>  45)      240     112   sys_select+0x4f/0x110
>  46)      128     128   system_call_fastpath+0x16/0x1b
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

From: Dave Chinner @ 2010-04-07 23:43 UTC
To: John Berthels; +Cc: linux-kernel, Nick Gregory, Rob Sanderson, xfs

[added linux-mm]

On Wed, Apr 07, 2010 at 04:57:11PM +0100, John Berthels wrote:
> Dave Chinner wrote:
> >I'm not seeing stacks deeper than about 5.6k on XFS under heavy write
> >loads. That's nowhere near blowing an 8k stack, so there must be
> >something special about what you are doing. Can you post the stack
> >traces that are being generated for the deepest stack generated -
> >/sys/kernel/debug/tracing/stack_trace should contain it.
> Appended below. That doesn't seem to reach 8192 but the box it's
> from has logged:
>
> [74649.579386] apache2 used greatest stack depth: 7024 bytes left
>
> full dmesg (gzipped) attached.
> >What is generating the write load?
>
> WebDAV PUTs in a modified mogilefs cluster, running
> apache-mpm-worker (threaded) as the DAV server. The write load is a
> mix of internet-upload speed writers trickling files up and some
> local fast replicators copying from elsewhere in the cluster.
>
> File sizes are a mix of small to large (4GB+). Each disk is exported
> as a mogile device, so it's possible for mogile to pound a single
> disk with lots of write activity (if the random number generator
> decides to put lots of files on that device at the same time).
>
> We're also seeing occasional slowdowns + high load avg (up to ~300,
> i.e. MaxClients) with a corresponding number of threads in D state.
> (This slowdown + high load avg seems to correlate with what would
> have previously caused a panic on the THREAD_ORDER 1, but not 100%
> sure).
>
> As you can see from the dmesg, this trips the "task xxx blocked for
> more than 120 seconds." on some of the threads.
>
> Don't know if that's related to the stack issue or to be expected
> under the load.

It looks to be caused by direct memory reclaim trying to clean pages with a
significant amount of stack already in use. Basically, there is not enough
stack space left for the XFS ->writepage path to execute in. I can't see
any fast fix for this occurring, so you are probably best to run with a
larger stack for the moment.

As it is, I don't think direct memory reclaim should be cleaning dirty file
pages - it should be leaving that to the writeback threads (which are far
more efficient at it) or, as a last resort, kswapd. Direct memory reclaim
is invoked with an unknown amount of stack already in use, so there is
never any guarantee that there is enough stack space left to enter the
->writepage path of any filesystem.

MM-folk - have there been any changes recently to writeback of pages from
direct reclaim that may have caused this, or have we just been lucky for a
really long time?

Cheers,

Dave.

>         Depth    Size   Location    (47 entries)
>         -----    ----   --------
>   0)     7568      16   mempool_alloc_slab+0x16/0x20
>   1)     7552     144   mempool_alloc+0x65/0x140
>   2)     7408      96   get_request+0x124/0x370
>   3)     7312     144   get_request_wait+0x29/0x1b0
>   4)     7168      96   __make_request+0x9b/0x490
>   5)     7072     208   generic_make_request+0x3df/0x4d0
>   6)     6864      80   submit_bio+0x7c/0x100
>   7)     6784      96   _xfs_buf_ioapply+0x128/0x2c0 [xfs]
>   8)     6688      48   xfs_buf_iorequest+0x75/0xd0 [xfs]
>   9)     6640      32   _xfs_buf_read+0x36/0x70 [xfs]
>  10)     6608      48   xfs_buf_read+0xda/0x110 [xfs]
>  11)     6560      80   xfs_trans_read_buf+0x2a7/0x410 [xfs]
>  12)     6480      80   xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
>  13)     6400      80   xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
>  14)     6320     176   xfs_btree_lookup+0xd7/0x490 [xfs]
>  15)     6144      16   xfs_alloc_lookup_eq+0x19/0x20 [xfs]
>  16)     6128      96   xfs_alloc_fixup_trees+0xee/0x350 [xfs]
>  17)     6032     144   xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
>  18)     5888      32   xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
>  19)     5856      96   xfs_alloc_vextent+0x49f/0x630 [xfs]
>  20)     5760     160   xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
>  21)     5600     208   xfs_btree_split+0xb3/0x6a0 [xfs]
>  22)     5392      96   xfs_btree_make_block_unfull+0x151/0x190 [xfs]
>  23)     5296     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
>  24)     5072     128   xfs_btree_insert+0x86/0x180 [xfs]
>  25)     4944     352   xfs_bmap_add_extent_delay_real+0x41e/0x1660 [xfs]
>  26)     4592     208   xfs_bmap_add_extent+0x41c/0x450 [xfs]
>  27)     4384     448   xfs_bmapi+0x982/0x1200 [xfs]
>  28)     3936     256   xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
>  29)     3680     208   xfs_iomap+0x3d8/0x410 [xfs]
>  30)     3472      32   xfs_map_blocks+0x2c/0x30 [xfs]
>  31)     3440     256   xfs_page_state_convert+0x443/0x730 [xfs]
>  32)     3184      64   xfs_vm_writepage+0xab/0x160 [xfs]
>  33)     3120     384   shrink_page_list+0x65e/0x840
>  34)     2736     528   shrink_zone+0x63f/0xe10
>  35)     2208     112   do_try_to_free_pages+0xc2/0x3c0
>  36)     2096     128   try_to_free_pages+0x77/0x80
>  37)     1968     240   __alloc_pages_nodemask+0x3e4/0x710
>  38)     1728      48   alloc_pages_current+0x8c/0xe0
>  39)     1680      16   __get_free_pages+0xe/0x50
>  40)     1664      48   __pollwait+0xca/0x110
>  41)     1616      32   unix_poll+0x28/0xc0
>  42)     1584      16   sock_poll+0x1d/0x20
>  43)     1568     912   do_select+0x3d6/0x700
>  44)      656     416   core_sys_select+0x18c/0x2c0
>  45)      240     112   sys_select+0x4f/0x110
>  46)      128     128   system_call_fastpath+0x16/0x1b

--
Dave Chinner
david@fromorbit.com
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64 @ 2010-04-07 23:43 ` Dave Chinner 0 siblings, 0 replies; 43+ messages in thread From: Dave Chinner @ 2010-04-07 23:43 UTC (permalink / raw) To: John Berthels; +Cc: Nick Gregory, xfs, linux-kernel, Rob Sanderson [added linux-mm] On Wed, Apr 07, 2010 at 04:57:11PM +0100, John Berthels wrote: > Dave Chinner wrote: > >I'm not seeing stacks deeper than about 5.6k on XFS under heavy write > >loads. That's nowhere near blowing an 8k stack, so there must be > >something special about what you are doing. Can you post the stack > >traces that are being generated for the deepest stack generated - > >/sys/kernel/debug/tracing/stack_trace should contain it. > Appended below. That doesn't seem to reach 8192 but the box it's > from has logged: > > [74649.579386] apache2 used greatest stack depth: 7024 bytes left > > full dmesg (gzipped) attached. > >What is generating the write load? > > WebDAV PUTs in a modified mogilefs cluster, running > apache-mpm-worker (threaded) as the DAV server. The write load is a > mix of internet-upload speed writers trickling files up and some > local fast replicators copying from elsewhere in the cluster. mpm > worker cfg is: > > ServerLimit 20 > StartServers 5 > MaxClients 300 > MinSpareThreads 25 > MaxSpareThreads 75 > ThreadsPerChild 30 > MaxRequestsPerChild 0 > > File sizes are a mix of small to large (4GB+). Each disk is exported > as a mogile device, so it's possible for mogile to pound a single > disk with lots of write activity (if the random number generator > decides to put lots of files on that device at the same time). > > We're also seeing occasional slowdowns + high load avg (up to ~300, > i.e. MaxClients) with a corresponding number of threads in D state. > (This slowdown + high load avg seems to correlate with what would > have previously caused a panic on the THREAD_ORDER 1, but not 100% > sure). 
> > As you can see from the dmesg, this trips the "task xxx blocked for > more than 120 seconds." on some of the threads. > > Don't know if that's related to the stack issue or to be expected > under the load. It looks to be caused by direct memory reclaim trying to clean pages with a significant amount of stack already in use. basically there is not enough stack space left for the XFS ->writepage path to execute in. I can't see any fast fix for this occurring, so you are probably best to run with a larger stack for the moment. As it is, I don't think direct memory reclim should be cleaning dirty file pages - it should be leaving that to the writeback threads (which are far more efficient at it) or, as a last resort, kswapd. Direct memory reclaim is invoked with an unknown amount of stack already in use, so there is never any guarantee that there is enough stack space left to enter the ->writepage path of any filesystem. MM-folk - have there been any changes recently to writeback of pages from direct reclaim that may have caused this, or have we just been lucky for a really long time? Cheers, Dave. 
>        Depth    Size   Location    (47 entries)
>        -----    ----   --------
>   0)     7568      16   mempool_alloc_slab+0x16/0x20
>   1)     7552     144   mempool_alloc+0x65/0x140
>   2)     7408      96   get_request+0x124/0x370
>   3)     7312     144   get_request_wait+0x29/0x1b0
>   4)     7168      96   __make_request+0x9b/0x490
>   5)     7072     208   generic_make_request+0x3df/0x4d0
>   6)     6864      80   submit_bio+0x7c/0x100
>   7)     6784      96   _xfs_buf_ioapply+0x128/0x2c0 [xfs]
>   8)     6688      48   xfs_buf_iorequest+0x75/0xd0 [xfs]
>   9)     6640      32   _xfs_buf_read+0x36/0x70 [xfs]
>  10)     6608      48   xfs_buf_read+0xda/0x110 [xfs]
>  11)     6560      80   xfs_trans_read_buf+0x2a7/0x410 [xfs]
>  12)     6480      80   xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
>  13)     6400      80   xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
>  14)     6320     176   xfs_btree_lookup+0xd7/0x490 [xfs]
>  15)     6144      16   xfs_alloc_lookup_eq+0x19/0x20 [xfs]
>  16)     6128      96   xfs_alloc_fixup_trees+0xee/0x350 [xfs]
>  17)     6032     144   xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
>  18)     5888      32   xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
>  19)     5856      96   xfs_alloc_vextent+0x49f/0x630 [xfs]
>  20)     5760     160   xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
>  21)     5600     208   xfs_btree_split+0xb3/0x6a0 [xfs]
>  22)     5392      96   xfs_btree_make_block_unfull+0x151/0x190 [xfs]
>  23)     5296     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
>  24)     5072     128   xfs_btree_insert+0x86/0x180 [xfs]
>  25)     4944     352   xfs_bmap_add_extent_delay_real+0x41e/0x1660 [xfs]
>  26)     4592     208   xfs_bmap_add_extent+0x41c/0x450 [xfs]
>  27)     4384     448   xfs_bmapi+0x982/0x1200 [xfs]
>  28)     3936     256   xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
>  29)     3680     208   xfs_iomap+0x3d8/0x410 [xfs]
>  30)     3472      32   xfs_map_blocks+0x2c/0x30 [xfs]
>  31)     3440     256   xfs_page_state_convert+0x443/0x730 [xfs]
>  32)     3184      64   xfs_vm_writepage+0xab/0x160 [xfs]
>  33)     3120     384   shrink_page_list+0x65e/0x840
>  34)     2736     528   shrink_zone+0x63f/0xe10
>  35)     2208     112   do_try_to_free_pages+0xc2/0x3c0
>  36)     2096     128   try_to_free_pages+0x77/0x80
>  37)     1968     240   __alloc_pages_nodemask+0x3e4/0x710
>  38)     1728      48   alloc_pages_current+0x8c/0xe0
>  39)     1680      16   __get_free_pages+0xe/0x50
>  40)     1664      48   __pollwait+0xca/0x110
>  41)     1616      32   unix_poll+0x28/0xc0
>  42)     1584      16   sock_poll+0x1d/0x20
>  43)     1568     912   do_select+0x3d6/0x700
>  44)      656     416   core_sys_select+0x18c/0x2c0
>  45)      240     112   sys_select+0x4f/0x110
>  46)      128     128   system_call_fastpath+0x16/0x1b

--
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 43+ messages in thread
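[Editor's note: the trace above comes from the ftrace stack tracer that Dave points John at. As a minimal sketch of how to reproduce such a capture (assuming CONFIG_STACK_TRACER=y and debugfs mounted at /sys/kernel/debug, neither of which is stated in the thread), the tracer can be enabled and read like this:]

```shell
#!/bin/sh
# Sketch: enable the ftrace stack tracer and report the deepest kernel
# stack seen so far. Assumes CONFIG_STACK_TRACER=y and debugfs mounted
# at /sys/kernel/debug; degrades gracefully when either is missing.
TRACING=/sys/kernel/debug/tracing

# Turn sampling on (needs root; silently skipped otherwise).
if [ -w /proc/sys/kernel/stack_tracer_enabled ]; then
    echo 1 > /proc/sys/kernel/stack_tracer_enabled
fi

if [ -r "$TRACING/stack_max_size" ]; then
    result="deepest stack seen: $(cat "$TRACING/stack_max_size") bytes"
else
    result="stack tracer not available (need CONFIG_STACK_TRACER and debugfs)"
fi
echo "$result"

# Per-frame breakdown, in the same format as the trace quoted above.
if [ -r "$TRACING/stack_trace" ]; then
    cat "$TRACING/stack_trace"
fi
```

The `used greatest stack depth:` dmesg lines quoted in the thread come from a separate mechanism (CONFIG_DEBUG_STACK_USAGE); the stack tracer gives the per-frame attribution.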
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64
  2010-04-07 23:43 ` Dave Chinner
@ 2010-04-08  3:03 ` Dave Chinner
  0 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2010-04-08 3:03 UTC (permalink / raw)
To: John Berthels; +Cc: linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

On Thu, Apr 08, 2010 at 09:43:41AM +1000, Dave Chinner wrote:
> [added linux-mm]

Now really added linux-mm. And there's a patch attached that stops
direct reclaim from writing back dirty pages - it seems to work fine
from some rough testing I've done. Perhaps you might want to give it
a spin on a test box, John?

> On Wed, Apr 07, 2010 at 04:57:11PM +0100, John Berthels wrote:
> > Dave Chinner wrote:
> > > I'm not seeing stacks deeper than about 5.6k on XFS under heavy write
> > > loads. That's nowhere near blowing an 8k stack, so there must be
> > > something special about what you are doing. Can you post the stack
> > > traces that are being generated for the deepest stack generated -
> > > /sys/kernel/debug/tracing/stack_trace should contain it.
> >
> > Appended below. That doesn't seem to reach 8192 but the box it's
> > from has logged:
> >
> > [74649.579386] apache2 used greatest stack depth: 7024 bytes left
> >
> > full dmesg (gzipped) attached.
> >
> > > What is generating the write load?
> >
> > WebDAV PUTs in a modified mogilefs cluster, running
> > apache-mpm-worker (threaded) as the DAV server. The write load is a
> > mix of internet-upload speed writers trickling files up and some
> > local fast replicators copying from elsewhere in the cluster. mpm
> > worker cfg is:
> >
> >     ServerLimit 20
> >     StartServers 5
> >     MaxClients 300
> >     MinSpareThreads 25
> >     MaxSpareThreads 75
> >     ThreadsPerChild 30
> >     MaxRequestsPerChild 0
> >
> > File sizes are a mix of small to large (4GB+). Each disk is exported
> > as a mogile device, so it's possible for mogile to pound a single
> > disk with lots of write activity (if the random number generator
> > decides to put lots of files on that device at the same time).
> >
> > We're also seeing occasional slowdowns + high load avg (up to ~300,
> > i.e. MaxClients) with a corresponding number of threads in D state.
> > (This slowdown + high load avg seems to correlate with what would
> > have previously caused a panic on the THREAD_ORDER 1, but not 100%
> > sure).
> >
> > As you can see from the dmesg, this trips the "task xxx blocked for
> > more than 120 seconds." warning on some of the threads.
> >
> > Don't know if that's related to the stack issue or to be expected
> > under the load.
>
> It looks to be caused by direct memory reclaim trying to clean pages
> with a significant amount of stack already in use. Basically, there
> is not enough stack space left for the XFS ->writepage path to
> execute in. I can't see any fast fix for this occurring, so you are
> probably best to run with a larger stack for the moment.
>
> As it is, I don't think direct memory reclaim should be cleaning
> dirty file pages - it should be leaving that to the writeback
> threads (which are far more efficient at it) or, as a
> last resort, kswapd. Direct memory reclaim is invoked with an
> unknown amount of stack already in use, so there is never any
> guarantee that there is enough stack space left to enter the
> ->writepage path of any filesystem.
>
> MM-folk - have there been any changes recently to writeback of
> pages from direct reclaim that may have caused this,
> or have we just been lucky for a really long time?
>
> Cheers,
>
> Dave.
> >        Depth    Size   Location    (47 entries)
> >        -----    ----   --------
> >   0)     7568      16   mempool_alloc_slab+0x16/0x20
> >   1)     7552     144   mempool_alloc+0x65/0x140
> >   2)     7408      96   get_request+0x124/0x370
> >   3)     7312     144   get_request_wait+0x29/0x1b0
> >   4)     7168      96   __make_request+0x9b/0x490
> >   5)     7072     208   generic_make_request+0x3df/0x4d0
> >   6)     6864      80   submit_bio+0x7c/0x100
> >   7)     6784      96   _xfs_buf_ioapply+0x128/0x2c0 [xfs]
> >   8)     6688      48   xfs_buf_iorequest+0x75/0xd0 [xfs]
> >   9)     6640      32   _xfs_buf_read+0x36/0x70 [xfs]
> >  10)     6608      48   xfs_buf_read+0xda/0x110 [xfs]
> >  11)     6560      80   xfs_trans_read_buf+0x2a7/0x410 [xfs]
> >  12)     6480      80   xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
> >  13)     6400      80   xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
> >  14)     6320     176   xfs_btree_lookup+0xd7/0x490 [xfs]
> >  15)     6144      16   xfs_alloc_lookup_eq+0x19/0x20 [xfs]
> >  16)     6128      96   xfs_alloc_fixup_trees+0xee/0x350 [xfs]
> >  17)     6032     144   xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
> >  18)     5888      32   xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
> >  19)     5856      96   xfs_alloc_vextent+0x49f/0x630 [xfs]
> >  20)     5760     160   xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
> >  21)     5600     208   xfs_btree_split+0xb3/0x6a0 [xfs]
> >  22)     5392      96   xfs_btree_make_block_unfull+0x151/0x190 [xfs]
> >  23)     5296     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
> >  24)     5072     128   xfs_btree_insert+0x86/0x180 [xfs]
> >  25)     4944     352   xfs_bmap_add_extent_delay_real+0x41e/0x1660 [xfs]
> >  26)     4592     208   xfs_bmap_add_extent+0x41c/0x450 [xfs]
> >  27)     4384     448   xfs_bmapi+0x982/0x1200 [xfs]
> >  28)     3936     256   xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
> >  29)     3680     208   xfs_iomap+0x3d8/0x410 [xfs]
> >  30)     3472      32   xfs_map_blocks+0x2c/0x30 [xfs]
> >  31)     3440     256   xfs_page_state_convert+0x443/0x730 [xfs]
> >  32)     3184      64   xfs_vm_writepage+0xab/0x160 [xfs]
> >  33)     3120     384   shrink_page_list+0x65e/0x840
> >  34)     2736     528   shrink_zone+0x63f/0xe10
> >  35)     2208     112   do_try_to_free_pages+0xc2/0x3c0
> >  36)     2096     128   try_to_free_pages+0x77/0x80
> >  37)     1968     240   __alloc_pages_nodemask+0x3e4/0x710
> >  38)     1728      48   alloc_pages_current+0x8c/0xe0
> >  39)     1680      16   __get_free_pages+0xe/0x50
> >  40)     1664      48   __pollwait+0xca/0x110
> >  41)     1616      32   unix_poll+0x28/0xc0
> >  42)     1584      16   sock_poll+0x1d/0x20
> >  43)     1568     912   do_select+0x3d6/0x700
> >  44)      656     416   core_sys_select+0x18c/0x2c0
> >  45)      240     112   sys_select+0x4f/0x110
> >  46)      128     128   system_call_fastpath+0x16/0x1b

--
Dave Chinner
david@fromorbit.com

mm: disallow direct reclaim page writeback

From: Dave Chinner <dchinner@redhat.com>

When we enter direct reclaim we may have used an arbitrary amount of
stack space, and hence entering the filesystem to do writeback can
then lead to stack overruns.

Writeback from direct reclaim is a bad idea, anyway. The background
flusher threads should be taking care of cleaning dirty pages, and
direct reclaim will kick them if they aren't already doing work. If
direct reclaim is also calling ->writepage, it will cause the IO
patterns from the background flusher threads to be upset by LRU-order
writeback from pageout(). Having competing sources of IO trying to
clean pages on the same backing device reduces throughput by
increasing the amount of seeks that the backing device has to do to
write back the pages.

Hence for direct reclaim we should not allow ->writepages to be
entered at all. Set up the relevant scan_control structures to
enforce this, and prevent sc->may_writepage from being set in other
places in the direct reclaim path in response to other events.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 mm/vmscan.c |   13 ++++++-------
 1 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f293372..3c194f4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1829,10 +1829,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	 * writeout. So in laptop mode, write out the whole world.
 	 */
 	writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
-	if (total_scanned > writeback_threshold) {
+	if (total_scanned > writeback_threshold)
 		wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
-		sc->may_writepage = 1;
-	}
 
 	/* Take a nap, wait for some writeback to complete */
 	if (!sc->hibernation_mode && sc->nr_scanned &&
@@ -1874,7 +1872,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 {
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
-		.may_writepage = !laptop_mode,
+		.may_writepage = 0,
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.may_unmap = 1,
 		.may_swap = 1,
@@ -1896,7 +1894,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 						struct zone *zone, int nid)
 {
 	struct scan_control sc = {
-		.may_writepage = !laptop_mode,
+		.may_writepage = 0,
 		.may_unmap = 1,
 		.may_swap = !noswap,
 		.swappiness = swappiness,
@@ -1929,7 +1927,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 {
 	struct zonelist *zonelist;
 	struct scan_control sc = {
-		.may_writepage = !laptop_mode,
+		.may_writepage = 0,
 		.may_unmap = 1,
 		.may_swap = !noswap,
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
@@ -2570,7 +2568,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	struct reclaim_state reclaim_state;
 	int priority;
 	struct scan_control sc = {
-		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
+		.may_writepage = (current_is_kswapd() &&
+				  (zone_reclaim_mode & RECLAIM_WRITE)),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
 		.may_swap = 1,
 		.nr_to_reclaim = max_t(unsigned long, nr_pages,

^ permalink raw reply related	[flat|nested] 43+ messages in thread
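[Editor's note: the net effect of the patch on the scan_control setup can be summarised as a truth table: every direct-reclaim path now starts with may_writepage off, and only zone reclaim, running as kswapd with RECLAIM_WRITE set, may write pages. A small model of that decision (plain shell, not kernel code; the function name is ours):]

```shell
#!/bin/sh
# Model of sc.may_writepage after the patch (a sketch, not kernel
# code). With the patch, try_to_free_pages() and the memcg variants
# initialise may_writepage to 0 unconditionally; __zone_reclaim()
# allows it only for kswapd with RECLAIM_WRITE set in
# zone_reclaim_mode.
may_writepage() {
    caller=$1          # "direct" or "kswapd"
    reclaim_write=$2   # 1 if zone_reclaim_mode & RECLAIM_WRITE
    if [ "$caller" = "kswapd" ] && [ "$reclaim_write" = "1" ]; then
        echo 1
    else
        echo 0
    fi
}

echo "direct reclaim, RECLAIM_WRITE set:   $(may_writepage direct 1)"
echo "kswapd reclaim, RECLAIM_WRITE set:   $(may_writepage kswapd 1)"
echo "kswapd reclaim, RECLAIM_WRITE clear: $(may_writepage kswapd 0)"
```

In other words, a task that blows its stack allocating memory can no longer be pushed into a filesystem's ->writepage; only kswapd, which always starts from a shallow stack, may still write.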
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64
  2010-04-08  3:03 ` Dave Chinner
@ 2010-04-08 12:16 ` John Berthels
  0 siblings, 0 replies; 43+ messages in thread
From: John Berthels @ 2010-04-08 12:16 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

Dave Chinner wrote:
> On Thu, Apr 08, 2010 at 09:43:41AM +1000, Dave Chinner wrote:
>
> And there's a patch attached that stops direct reclaim from writing
> back dirty pages - it seems to work fine from some rough testing
> I've done. Perhaps you might want to give it a spin on a
> test box, John?

Thanks very much for this. The patch is in and soaking on a
THREAD_ORDER 1 kernel (2.6.33.2 + patch + stack instrumentation), so
far so good, but it's early days. After about 2hrs of uptime:

$ dmesg | grep stack | tail -1
[   60.350766] apache2 used greatest stack depth: 2544 bytes left

(which tallies well with your 5 1/2Kbytes usage figure).

I'll reply again after it's been running long enough to draw
conclusions.

jb

^ permalink raw reply	[flat|nested] 43+ messages in thread
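[Editor's note: John tracks the soak test with `dmesg | grep stack | tail -1`. A slightly more complete helper in the same spirit (a sketch; the default log path is an assumption, and the message format is taken from the lines quoted in this thread) reports the least headroom seen across all such messages:]

```shell
#!/bin/sh
# Sketch: report the worst (smallest) "bytes left" figure among the
# kernel's "used greatest stack depth" messages. The default log path
# is an assumption; point LOG at saved dmesg output in practice.
least_headroom() {
    grep -o 'used greatest stack depth: [0-9]* bytes left' "$1" 2>/dev/null |
        awk '{ if (min == "" || $5 + 0 < min + 0) min = $5 } END { print min }'
}

LOG=${LOG:-/var/log/dmesg}
worst=$(least_headroom "$LOG")
if [ -n "$worst" ]; then
    echo "least headroom seen: $worst bytes left"
else
    echo "no stack depth messages found in $LOG"
fi
```

A falling minimum over a multi-day soak is the signal to watch for: it shows how close the workload's worst-case path comes to exhausting the 8k stack.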
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64 @ 2010-04-08 12:16 ` John Berthels 0 siblings, 0 replies; 43+ messages in thread From: John Berthels @ 2010-04-08 12:16 UTC (permalink / raw) To: Dave Chinner; +Cc: Nick Gregory, linux-mm, xfs, linux-kernel, Rob Sanderson Dave Chinner wrote: > On Thu, Apr 08, 2010 at 09:43:41AM +1000, Dave Chinner wrote: > > And there's a patch attached that stops direct reclaim from writing > back dirty pages - it seems to work fine from some rough testing > I've done. Perhaps you might want to give it a spin on a > test box, John? > Thanks very much for this. The patch is in and soaking on a THREAD_ORDER 1 kernel (2.6.33.2 + patch + stack instrumentation), so far so good, but it's early days. After about 2hrs of uptime: $ dmesg | grep stack | tail -1 [ 60.350766] apache2 used greatest stack depth: 2544 bytes left (which tallies well with your 5 1/2Kbytes usage figure). I'll reply again after it's been running long enough to draw conclusions. jb _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64
  From: John Berthels @ 2010-04-08 14:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

John Berthels wrote:
> I'll reply again after it's been running long enough to draw conclusions.

We're getting pretty close to the 8k stack on this box now. It's running
2.6.33.2 + your patch, with THREAD_ORDER 1, stack tracing and
CONFIG_LOCKDEP=y. (Sorry that LOCKDEP is on; please advise if that's going
to skew the figures and we'll restart the test systems with new kernels.)

This is significantly more than 5.6K, so does it show a potential problem?
Or is 720 bytes enough headroom?

jb

[ 4005.541869] apache2 used greatest stack depth: 2480 bytes left
[ 4005.541973] apache2 used greatest stack depth: 2240 bytes left
[ 4005.542070] apache2 used greatest stack depth: 1936 bytes left
[ 4005.542614] apache2 used greatest stack depth: 1616 bytes left
[ 5531.406529] apache2 used greatest stack depth: 720 bytes left

$ cat /sys/kernel/debug/tracing/stack_trace
        Depth    Size   Location    (55 entries)
        -----    ----   --------
  0)     7440      48   add_partial+0x26/0x90
  1)     7392      64   __slab_free+0x1a9/0x380
  2)     7328      64   kmem_cache_free+0xb9/0x160
  3)     7264      16   free_buffer_head+0x25/0x50
  4)     7248      64   try_to_free_buffers+0x79/0xc0
  5)     7184     160   xfs_vm_releasepage+0xda/0x130 [xfs]
  6)     7024      16   try_to_release_page+0x33/0x60
  7)     7008     384   shrink_page_list+0x585/0x860
  8)     6624     528   shrink_zone+0x636/0xdc0
  9)     6096     112   do_try_to_free_pages+0xc2/0x3c0
 10)     5984     112   try_to_free_pages+0x64/0x70
 11)     5872     256   __alloc_pages_nodemask+0x3d2/0x710
 12)     5616      48   alloc_pages_current+0x8c/0xe0
 13)     5568      32   __page_cache_alloc+0x67/0x70
 14)     5536      80   find_or_create_page+0x50/0xb0
 15)     5456     160   _xfs_buf_lookup_pages+0x145/0x350 [xfs]
 16)     5296      64   xfs_buf_get+0x74/0x1d0 [xfs]
 17)     5232      48   xfs_buf_read+0x2f/0x110 [xfs]
 18)     5184      80   xfs_trans_read_buf+0x2bf/0x430 [xfs]
 19)     5104      80   xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
 20)     5024      80   xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
 21)     4944     176   xfs_btree_lookup+0xd7/0x490 [xfs]
 22)     4768      16   xfs_alloc_lookup_ge+0x1c/0x20 [xfs]
 23)     4752     144   xfs_alloc_ag_vextent_near+0x58/0xb30 [xfs]
 24)     4608      32   xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
 25)     4576      96   xfs_alloc_vextent+0x49f/0x630 [xfs]
 26)     4480     160   xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
 27)     4320     208   xfs_btree_split+0xb3/0x6a0 [xfs]
 28)     4112      96   xfs_btree_make_block_unfull+0x151/0x190 [xfs]
 29)     4016     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
 30)     3792     128   xfs_btree_insert+0x86/0x180 [xfs]
 31)     3664     352   xfs_bmap_add_extent_delay_real+0x564/0x1670 [xfs]
 32)     3312     208   xfs_bmap_add_extent+0x41c/0x450 [xfs]
 33)     3104     448   xfs_bmapi+0x982/0x1200 [xfs]
 34)     2656     256   xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
 35)     2400     208   xfs_iomap+0x3d8/0x410 [xfs]
 36)     2192      32   xfs_map_blocks+0x2c/0x30 [xfs]
 37)     2160     256   xfs_page_state_convert+0x443/0x730 [xfs]
 38)     1904      64   xfs_vm_writepage+0xab/0x160 [xfs]
 39)     1840      32   __writepage+0x1a/0x60
 40)     1808     288   write_cache_pages+0x1f7/0x400
 41)     1520      16   generic_writepages+0x27/0x30
 42)     1504      48   xfs_vm_writepages+0x5a/0x70 [xfs]
 43)     1456      16   do_writepages+0x24/0x40
 44)     1440      64   writeback_single_inode+0xf1/0x3e0
 45)     1376     128   writeback_inodes_wb+0x31e/0x510
 46)     1248      16   writeback_inodes_wbc+0x1e/0x20
 47)     1232     224   balance_dirty_pages_ratelimited_nr+0x277/0x410
 48)     1008     192   generic_file_buffered_write+0x19b/0x240
 49)      816     288   xfs_write+0x849/0x930 [xfs]
 50)      528      16   xfs_file_aio_write+0x5b/0x70 [xfs]
 51)      512     272   do_sync_write+0xd1/0x120
 52)      240      48   vfs_write+0xcb/0x1a0
 53)      192      64   sys_write+0x55/0x90
 54)      128     128   system_call_fastpath+0x16/0x1b
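In a stack_trace dump, each frame's Depth should equal the next frame's Depth plus its own Size, so a trace like the one above can be sanity-checked mechanically. A sketch under that assumption, with the top few frames embedded as sample input (the /tmp path is illustrative):

```shell
# Check that Depth decreases by exactly Size from one frame to the next.
cat > /tmp/stack_trace_sample.txt <<'EOF'
0) 7440 48 add_partial+0x26/0x90
1) 7392 64 __slab_free+0x1a9/0x380
2) 7328 64 kmem_cache_free+0xb9/0x160
3) 7264 16 free_buffer_head+0x25/0x50
4) 7248 64 try_to_free_buffers+0x79/0xc0
5) 7184 160 xfs_vm_releasepage+0xda/0x130
EOF
awk 'BEGIN { ok = 1 }
     { d[NR] = $2; s[NR] = $3; n = NR }
     END {
       for (i = 1; i < n; i++)
         if (d[i] - s[i] != d[i+1]) ok = 0
       if (ok) print "consistent"; else print "inconsistent"
     }' /tmp/stack_trace_sample.txt    # prints: consistent
```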
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64
  From: John Berthels @ 2010-04-08 16:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

John Berthels wrote:
> John Berthels wrote:
>> I'll reply again after it's been running long enough to draw
>> conclusions.

The box with patch + THREAD_ORDER 1 + LOCKDEP went down (with no further
logging retrievable via /var/log/messages or netconsole). We're loading up
a 2.6.33.2 + patch + THREAD_ORDER 2 (no LOCKDEP) kernel to get better
information on whether we are still blowing the 8k limit with the patch in
place.

jb
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64
  From: Dave Chinner @ 2010-04-08 23:38 UTC (permalink / raw)
  To: John Berthels; +Cc: linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

On Thu, Apr 08, 2010 at 03:47:54PM +0100, John Berthels wrote:
> John Berthels wrote:
> > I'll reply again after it's been running long enough to draw conclusions.
> We're getting pretty close to the 8k stack on this box now. It's
> running 2.6.33.2 + your patch, with THREAD_ORDER 1, stack tracing
> and CONFIG_LOCKDEP=y. (Sorry that LOCKDEP is on; please advise if
> that's going to skew the figures and we'll restart the test systems
> with new kernels.)
>
> This is significantly more than 5.6K, so does it show a potential
> problem? Or is 720 bytes enough headroom?
>
> jb
>
> [ 4005.541869] apache2 used greatest stack depth: 2480 bytes left
> [ 4005.541973] apache2 used greatest stack depth: 2240 bytes left
> [ 4005.542070] apache2 used greatest stack depth: 1936 bytes left
> [ 4005.542614] apache2 used greatest stack depth: 1616 bytes left
> [ 5531.406529] apache2 used greatest stack depth: 720 bytes left
>
> $ cat /sys/kernel/debug/tracing/stack_trace
>         Depth    Size   Location    (55 entries)
>         -----    ----   --------
>   0)     7440      48   add_partial+0x26/0x90
>   1)     7392      64   __slab_free+0x1a9/0x380
>   2)     7328      64   kmem_cache_free+0xb9/0x160
>   3)     7264      16   free_buffer_head+0x25/0x50
>   4)     7248      64   try_to_free_buffers+0x79/0xc0
>   5)     7184     160   xfs_vm_releasepage+0xda/0x130 [xfs]
>   6)     7024      16   try_to_release_page+0x33/0x60
>   7)     7008     384   shrink_page_list+0x585/0x860
>   8)     6624     528   shrink_zone+0x636/0xdc0
>   9)     6096     112   do_try_to_free_pages+0xc2/0x3c0
>  10)     5984     112   try_to_free_pages+0x64/0x70
>  11)     5872     256   __alloc_pages_nodemask+0x3d2/0x710
>  12)     5616      48   alloc_pages_current+0x8c/0xe0
>  13)     5568      32   __page_cache_alloc+0x67/0x70
>  14)     5536      80   find_or_create_page+0x50/0xb0
>  15)     5456     160   _xfs_buf_lookup_pages+0x145/0x350 [xfs]
>  16)     5296      64   xfs_buf_get+0x74/0x1d0 [xfs]
>  17)     5232      48   xfs_buf_read+0x2f/0x110 [xfs]
>  18)     5184      80   xfs_trans_read_buf+0x2bf/0x430 [xfs]

We're entering memory reclaim with almost 6k of stack already in use. If we
get down into the IO layer and then have to do a memory reclaim, we'll have
even less stack to work with. It looks like memory allocation needs at
least 2KB of stack to work with now, so if we enter reclaim anywhere near
the top of the stack we can blow it...

Basically this trace is telling us the stack we have to work with is:

	2KB	memory allocation
	4KB	page writeback
	2KB	write foreground throttling path

So effectively the storage subsystem (NFS, filesystem, DM, MD, device
drivers) has about 4K of stack to work in now. That seems to be a lot less
than the last time I looked at this, and we've been really careful not to
increase XFS's stack usage for quite some time.

Hence I'm not sure exactly what to do about this, John. I can't really do
much about the stack footprint of XFS, as all the low-hanging fruit has
already been trimmed. Even if I convert the foreground throttling not to
issue IO, the background flush threads still have roughly the same stack
usage, so a memory allocation and reclaim in the wrong place could still
blow the stack....

I'll have to have a bit of a think on this one - if you could provide
further stack traces as they get deeper (esp. if they go past 8k), that
would be really handy.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
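The budget in that analysis reduces to simple arithmetic: on an 8KB (THREAD_ORDER 1) x86-64 stack, subtracting the allocation/reclaim entry cost and the foreground throttling path leaves the quoted figure for the writeback/storage layers. A sketch of that arithmetic, with all figures approximate as in the analysis:

```shell
# Rough stack budget from the analysis above (all figures approximate).
STACK=$((8 * 1024))       # THREAD_ORDER 1 stack on x86-64
ALLOC=$((2 * 1024))       # memory allocation + direct reclaim entry
THROTTLE=$((2 * 1024))    # write() foreground throttling path
echo "left for writeback/storage: $((STACK - ALLOC - THROTTLE)) bytes"
# prints: left for writeback/storage: 4096 bytes
```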
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64
  From: Chris Mason @ 2010-04-09 11:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: John Berthels, linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

On Fri, Apr 09, 2010 at 09:38:37AM +1000, Dave Chinner wrote:
> On Thu, Apr 08, 2010 at 03:47:54PM +0100, John Berthels wrote:
> > John Berthels wrote:
> > > I'll reply again after it's been running long enough to draw conclusions.
> > We're getting pretty close to the 8k stack on this box now. It's
> > running 2.6.33.2 + your patch, with THREAD_ORDER 1, stack tracing
> > and CONFIG_LOCKDEP=y. (Sorry that LOCKDEP is on; please advise if
> > that's going to skew the figures and we'll restart the test systems
> > with new kernels.)
> >
> > This is significantly more than 5.6K, so does it show a potential
> > problem? Or is 720 bytes enough headroom?
> >
> > jb
> >
> > [ 4005.541869] apache2 used greatest stack depth: 2480 bytes left
> > [ 4005.541973] apache2 used greatest stack depth: 2240 bytes left
> > [ 4005.542070] apache2 used greatest stack depth: 1936 bytes left
> > [ 4005.542614] apache2 used greatest stack depth: 1616 bytes left
> > [ 5531.406529] apache2 used greatest stack depth: 720 bytes left
> >
> > $ cat /sys/kernel/debug/tracing/stack_trace
> >         Depth    Size   Location    (55 entries)
> >         -----    ----   --------
> >   0)     7440      48   add_partial+0x26/0x90
> >   1)     7392      64   __slab_free+0x1a9/0x380
> >   2)     7328      64   kmem_cache_free+0xb9/0x160
> >   3)     7264      16   free_buffer_head+0x25/0x50
> >   4)     7248      64   try_to_free_buffers+0x79/0xc0
> >   5)     7184     160   xfs_vm_releasepage+0xda/0x130 [xfs]
> >   6)     7024      16   try_to_release_page+0x33/0x60
> >   7)     7008     384   shrink_page_list+0x585/0x860
> >   8)     6624     528   shrink_zone+0x636/0xdc0
> >   9)     6096     112   do_try_to_free_pages+0xc2/0x3c0
> >  10)     5984     112   try_to_free_pages+0x64/0x70
> >  11)     5872     256   __alloc_pages_nodemask+0x3d2/0x710
> >  12)     5616      48   alloc_pages_current+0x8c/0xe0
> >  13)     5568      32   __page_cache_alloc+0x67/0x70
> >  14)     5536      80   find_or_create_page+0x50/0xb0
> >  15)     5456     160   _xfs_buf_lookup_pages+0x145/0x350 [xfs]
> >  16)     5296      64   xfs_buf_get+0x74/0x1d0 [xfs]
> >  17)     5232      48   xfs_buf_read+0x2f/0x110 [xfs]
> >  18)     5184      80   xfs_trans_read_buf+0x2bf/0x430 [xfs]
>
> We're entering memory reclaim with almost 6k of stack already in
> use. If we get down into the IO layer and then have to do a memory
> reclaim, we'll have even less stack to work with. It looks like
> memory allocation needs at least 2KB of stack to work with now,
> so if we enter reclaim anywhere near the top of the stack we can
> blow it...

shrink_zone on my box isn't 500 bytes, but let's try the easy stuff first.
This is against .34; if you have any trouble applying to .32, just add the
word noinline after the word static on the function definitions. This makes
shrink_zone disappear from my check_stack.pl output.

Basically I think the compiler is inlining the shrink_active_list and
shrink_inactive_list code into shrink_zone.

-chris

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79c8098..c70593e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -620,7 +620,7 @@ static enum page_references page_check_references(struct page *page,
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
-static unsigned long shrink_page_list(struct list_head *page_list,
+static noinline unsigned long shrink_page_list(struct list_head *page_list,
 					struct scan_control *sc,
 					enum pageout_io sync_writeback)
 {
@@ -1121,7 +1121,7 @@ static int too_many_isolated(struct zone *zone, int file,
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
  */
-static unsigned long shrink_inactive_list(unsigned long max_scan,
+static noinline unsigned long shrink_inactive_list(unsigned long max_scan,
 			struct zone *zone, struct scan_control *sc,
 			int priority, int file)
 {
@@ -1341,7 +1341,7 @@ static void move_active_pages_to_lru(struct zone *zone,
 	__count_vm_events(PGDEACTIVATE, pgmoved);
 }
 
-static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
+static noinline void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 			struct scan_control *sc, int priority, int file)
 {
 	unsigned long nr_taken;
@@ -1504,7 +1504,7 @@ static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
 		return inactive_anon_is_low(zone, sc);
 }
 
-static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
+static noinline unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 	struct zone *zone, struct scan_control *sc, int priority)
 {
 	int file = is_file_lru(lru);
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

From: Chris Mason @ 2010-04-09 11:38 UTC
To: Dave Chinner
Cc: John Berthels, linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

On Fri, Apr 09, 2010 at 09:38:37AM +1000, Dave Chinner wrote:
> On Thu, Apr 08, 2010 at 03:47:54PM +0100, John Berthels wrote:
> > John Berthels wrote:
> > > I'll reply again after it's been running long enough to draw conclusions.
> > We're getting pretty close on the 8k stack on this box now. It's
> > running 2.6.33.2 + your patch, with THREAD_ORDER 1, stack tracing
> > and CONFIG_LOCKDEP=y. (Sorry that LOCKDEP is on; please advise if
> > that's going to throw the figures and we'll restart the test systems
> > with new kernels.)
> >
> > This is significantly more than 5.6K, so it shows a potential
> > problem? Or is 720 bytes enough headroom?
> >
> > jb
> >
> > [ 4005.541869] apache2 used greatest stack depth: 2480 bytes left
> > [ 4005.541973] apache2 used greatest stack depth: 2240 bytes left
> > [ 4005.542070] apache2 used greatest stack depth: 1936 bytes left
> > [ 4005.542614] apache2 used greatest stack depth: 1616 bytes left
> > [ 5531.406529] apache2 used greatest stack depth: 720 bytes left
> >
> > $ cat /sys/kernel/debug/tracing/stack_trace
> >         Depth    Size   Location    (55 entries)
> >         -----    ----   --------
> >   0)     7440      48   add_partial+0x26/0x90
> >   1)     7392      64   __slab_free+0x1a9/0x380
> >   2)     7328      64   kmem_cache_free+0xb9/0x160
> >   3)     7264      16   free_buffer_head+0x25/0x50
> >   4)     7248      64   try_to_free_buffers+0x79/0xc0
> >   5)     7184     160   xfs_vm_releasepage+0xda/0x130 [xfs]
> >   6)     7024      16   try_to_release_page+0x33/0x60
> >   7)     7008     384   shrink_page_list+0x585/0x860
> >   8)     6624     528   shrink_zone+0x636/0xdc0
> >   9)     6096     112   do_try_to_free_pages+0xc2/0x3c0
> >  10)     5984     112   try_to_free_pages+0x64/0x70
> >  11)     5872     256   __alloc_pages_nodemask+0x3d2/0x710
> >  12)     5616      48   alloc_pages_current+0x8c/0xe0
> >  13)     5568      32   __page_cache_alloc+0x67/0x70
> >  14)     5536      80   find_or_create_page+0x50/0xb0
> >  15)     5456     160   _xfs_buf_lookup_pages+0x145/0x350 [xfs]
> >  16)     5296      64   xfs_buf_get+0x74/0x1d0 [xfs]
> >  17)     5232      48   xfs_buf_read+0x2f/0x110 [xfs]
> >  18)     5184      80   xfs_trans_read_buf+0x2bf/0x430 [xfs]
>
> We're entering memory reclaim with almost 6k of stack already in
> use. If we get down into the IO layer and then have to do a memory
> reclaim, then we'll have even less stack to work with. It looks like
> memory allocation needs at least 2KB of stack to work with now,
> so if we enter anywhere near the top of the stack we can blow it...

shrink_zone on my box isn't 500 bytes, but let's try the easy stuff
first. This is against .34; if you have any trouble applying to .32,
just add the word noinline after the word static on the function
definitions.

This makes shrink_zone disappear from my check_stack.pl output.
Basically I think the compiler is inlining the shrink_active_list and
shrink_inactive_list code into shrink_zone.

-chris

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79c8098..c70593e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -620,7 +620,7 @@ static enum page_references page_check_references(struct page *page,
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
-static unsigned long shrink_page_list(struct list_head *page_list,
+static noinline unsigned long shrink_page_list(struct list_head *page_list,
 					struct scan_control *sc,
 					enum pageout_io sync_writeback)
 {
@@ -1121,7 +1121,7 @@ static int too_many_isolated(struct zone *zone, int file,
 /*
  * shrink_inactive_list() is a helper for shrink_zone(). It returns the number
  * of reclaimed pages
  */
-static unsigned long shrink_inactive_list(unsigned long max_scan,
+static noinline unsigned long shrink_inactive_list(unsigned long max_scan,
 			struct zone *zone, struct scan_control *sc,
 			int priority, int file)
 {
@@ -1341,7 +1341,7 @@ static void move_active_pages_to_lru(struct zone *zone,
 	__count_vm_events(PGDEACTIVATE, pgmoved);
 }
 
-static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
+static noinline void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 			struct scan_control *sc, int priority, int file)
 {
 	unsigned long nr_taken;
@@ -1504,7 +1504,7 @@ static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
 	return inactive_anon_is_low(zone, sc);
 }
 
-static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
+static noinline unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 		struct zone *zone, struct scan_control *sc, int priority)
 {
 	int file = is_file_lru(lru);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

From: Eric Sandeen @ 2010-04-09 18:05 UTC
To: Chris Mason, Dave Chinner, John Berthels, linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

Chris Mason wrote:

> shrink_zone on my box isn't 500 bytes, but let's try the easy stuff
> first. This is against .34; if you have any trouble applying to .32,
> just add the word noinline after the word static on the function
> definitions.
>
> This makes shrink_zone disappear from my check_stack.pl output.
> Basically I think the compiler is inlining the shrink_active_list and
> shrink_inactive_list code into shrink_zone.
>
> -chris
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 79c8098..c70593e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -620,7 +620,7 @@ static enum page_references page_check_references(struct page *page,
>  /*
>   * shrink_page_list() returns the number of reclaimed pages
>   */
> -static unsigned long shrink_page_list(struct list_head *page_list,
> +static noinline unsigned long shrink_page_list(struct list_head *page_list,

FWIW akpm suggested that I add:

/*
 * Rather than using noinline to prevent stack consumption, use
 * noinline_for_stack instead.  For documentation reasons.
 */
#define noinline_for_stack noinline

so maybe for a formal submission that'd be good to use.

> 					struct scan_control *sc,
> 					enum pageout_io sync_writeback)
>  {
> @@ -1121,7 +1121,7 @@ static int too_many_isolated(struct zone *zone, int file,
>  /*
>   * shrink_inactive_list() is a helper for shrink_zone(). It returns the number
>   * of reclaimed pages
>   */
> -static unsigned long shrink_inactive_list(unsigned long max_scan,
> +static noinline unsigned long shrink_inactive_list(unsigned long max_scan,
> 			struct zone *zone, struct scan_control *sc,
> 			int priority, int file)
>  {
> @@ -1341,7 +1341,7 @@ static void move_active_pages_to_lru(struct zone *zone,
>  	__count_vm_events(PGDEACTIVATE, pgmoved);
>  }
>
> -static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
> +static noinline void shrink_active_list(unsigned long nr_pages, struct zone *zone,
> 			struct scan_control *sc, int priority, int file)
>  {
>  	unsigned long nr_taken;
> @@ -1504,7 +1504,7 @@ static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
>  	return inactive_anon_is_low(zone, sc);
>  }
>
> -static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
> +static noinline unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
> 		struct zone *zone, struct scan_control *sc, int priority)
>  {
>  	int file = is_file_lru(lru);
>
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

From: Chris Mason @ 2010-04-09 18:11 UTC
To: Eric Sandeen
Cc: Dave Chinner, John Berthels, linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

On Fri, Apr 09, 2010 at 01:05:05PM -0500, Eric Sandeen wrote:
> Chris Mason wrote:
>
> > shrink_zone on my box isn't 500 bytes, but let's try the easy stuff
> > first. This is against .34; if you have any trouble applying to .32,
> > just add the word noinline after the word static on the function
> > definitions.
> >
> > This makes shrink_zone disappear from my check_stack.pl output.
> > Basically I think the compiler is inlining the shrink_active_list and
> > shrink_inactive_list code into shrink_zone.
> >
> > -chris
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 79c8098..c70593e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -620,7 +620,7 @@ static enum page_references page_check_references(struct page *page,
> >  /*
> >   * shrink_page_list() returns the number of reclaimed pages
> >   */
> > -static unsigned long shrink_page_list(struct list_head *page_list,
> > +static noinline unsigned long shrink_page_list(struct list_head *page_list,
>
> FWIW akpm suggested that I add:
>
> /*
>  * Rather than using noinline to prevent stack consumption, use
>  * noinline_for_stack instead.  For documentation reasons.
>  */
> #define noinline_for_stack noinline
>
> so maybe for a formal submission that'd be good to use.

Oh yeah, I forgot about that one. If the patch actually helps we can
switch it.

-chris
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

From: Dave Chinner @ 2010-04-12 1:01 UTC
To: Chris Mason, Eric Sandeen, John Berthels, linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

On Fri, Apr 09, 2010 at 02:11:08PM -0400, Chris Mason wrote:
> On Fri, Apr 09, 2010 at 01:05:05PM -0500, Eric Sandeen wrote:
> > Chris Mason wrote:
> >
> > > shrink_zone on my box isn't 500 bytes, but let's try the easy stuff
> > > first. This is against .34; if you have any trouble applying to .32,
> > > just add the word noinline after the word static on the function
> > > definitions.
> > >
> > > This makes shrink_zone disappear from my check_stack.pl output.
> > > Basically I think the compiler is inlining the shrink_active_list and
> > > shrink_inactive_list code into shrink_zone.
> > >
> > > -chris
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 79c8098..c70593e 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -620,7 +620,7 @@ static enum page_references page_check_references(struct page *page,
> > >  /*
> > >   * shrink_page_list() returns the number of reclaimed pages
> > >   */
> > > -static unsigned long shrink_page_list(struct list_head *page_list,
> > > +static noinline unsigned long shrink_page_list(struct list_head *page_list,
> >
> > FWIW akpm suggested that I add:
> >
> > /*
> >  * Rather than using noinline to prevent stack consumption, use
> >  * noinline_for_stack instead.  For documentation reasons.
> >  */
> > #define noinline_for_stack noinline
> >
> > so maybe for a formal submission that'd be good to use.
>
> Oh yeah, I forgot about that one. If the patch actually helps we can
> switch it.

Well, given that the largest stack overflow reported was about 800
bytes, I don't think it's enough. All the fat has been trimmed from
XFS long ago, and there isn't that much in the generic code paths to
trim. And if we consider that this isn't including a significant
storage subsystem (i.e. NFS on top and stacked DM+MD+FC below), then
trimming a few hundred bytes is not enough to prevent an 8k stack
being blown sky high.

That is why I was saying I'm not sure what the best way to solve the
problem is - I've got a couple of ideas for fixing the problem in XFS
once and for all, but I'm not sure if they will fly or not yet, let
alone written any code....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
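[Editorial note: the per-function frame sizes being compared in the thread (e.g. whether shrink_zone really uses ~500 bytes) come from the kernel's scripts/checkstack.pl, the tool Chris refers to as check_stack.pl. A hedged sketch of how it is typically invoked — it assumes you are at the top of a built kernel tree with an uncompressed vmlinux, and falls back to a message otherwise:]

```shell
# scripts/checkstack.pl ranks functions by static stack usage; it reads
# disassembly on stdin.  Paths here are the conventional in-tree ones.
if [ -f vmlinux ] && [ -f scripts/checkstack.pl ]; then
    result=$(objdump -d vmlinux | perl scripts/checkstack.pl x86_64 | head -20)
else
    result="no built kernel tree here; run from the top of a built tree"
fi
printf '%s\n' "$result"
```

The output lists the worst stack offenders first, which is how one checks whether adding noinline moved a helper's frame out of shrink_zone's total.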
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

From: John Berthels @ 2010-04-13 9:51 UTC
To: Chris Mason, Dave Chinner, John Berthels, linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

[-- Attachment #1: Type: text/plain, Size: 5328 bytes --]

Chris Mason wrote:
> shrink_zone on my box isn't 500 bytes, but let's try the easy stuff
> first. This is against .34; if you have any trouble applying to .32,
> just add the word noinline after the word static on the function
> definitions.

Hi Chris,

Thanks for this. We've been soaking it for a while and get the stack
trace below (which is still >8k), and which still shows shrink_zone
using 528 bytes. I find it odd that the shrink_zone stack usage
differs between our systems.

This is a stock 2.6.33.2 kernel, x86_64 arch (plus your patch + Dave
Chinner's patch), built using Ubuntu make-kpkg with gcc (Ubuntu
4.3.3-5ubuntu4) 4.3.3 (.vmscan.o.cmd with the full build options is
below; gzipped .config attached). Can you see any difference between
your system and ours which might explain the discrepancy? I note -g
and -pg in there. (Does -pg have any stack overhead? It seems to be
enabled in Ubuntu release kernels.)

regards,

jb

mm/.vmscan.o.cmd:

cmd_mm/vmscan.o := gcc -Wp,-MD,mm/.vmscan.o.d -nostdinc -isystem /usr/lib/gcc/x86_64-linux-gnu/4.3.3/include -I/usr/local/src/kern/linux-2.6.33.2/arch/x86/include -Iinclude -include include/generated/autoconf.h -D__KERNEL__ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Wno-format-security -fno-delete-null-pointer-checks -O2 -m64 -mtune=generic -mno-red-zone -mcmodel=kernel -funit-at-a-time -maccumulate-outgoing-args -fstack-protector -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -fno-omit-frame-pointer -fno-optimize-sibling-calls -g -pg -Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow -D"KBUILD_STR(s)=\#s" -D"KBUILD_BASENAME=KBUILD_STR(vmscan)" -D"KBUILD_MODNAME=KBUILD_STR(vmscan)" -c -o mm/.tmp_vmscan.o mm/vmscan.c

Apr 12 22:06:35 nas17 kernel: [36346.599076] apache2 used greatest stack depth: 7904 bytes left

        Depth    Size   Location    (56 entries)
        -----    ----   --------
  0)     7904      48   __call_rcu+0x67/0x190
  1)     7856      16   call_rcu_sched+0x15/0x20
  2)     7840      16   call_rcu+0xe/0x10
  3)     7824     272   radix_tree_delete+0x159/0x2e0
  4)     7552      32   __remove_from_page_cache+0x21/0x110
  5)     7520      64   __remove_mapping+0xe8/0x130
  6)     7456     384   shrink_page_list+0x400/0x860
  7)     7072     528   shrink_zone+0x636/0xdc0
  8)     6544     112   do_try_to_free_pages+0xc2/0x3c0
  9)     6432     112   try_to_free_pages+0x64/0x70
 10)     6320     256   __alloc_pages_nodemask+0x3d2/0x710
 11)     6064      48   alloc_pages_current+0x8c/0xe0
 12)     6016      32   __page_cache_alloc+0x67/0x70
 13)     5984      80   find_or_create_page+0x50/0xb0
 14)     5904     160   _xfs_buf_lookup_pages+0x145/0x350 [xfs]
 15)     5744      64   xfs_buf_get+0x74/0x1d0 [xfs]
 16)     5680      48   xfs_buf_read+0x2f/0x110 [xfs]
 17)     5632      80   xfs_trans_read_buf+0x2bf/0x430 [xfs]
 18)     5552      80   xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
 19)     5472     176   xfs_btree_rshift+0xd7/0x530 [xfs]
 20)     5296      96   xfs_btree_make_block_unfull+0x5b/0x190 [xfs]
 21)     5200     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
 22)     4976     128   xfs_btree_insert+0x86/0x180 [xfs]
 23)     4848      96   xfs_alloc_fixup_trees+0x1fa/0x350 [xfs]
 24)     4752     144   xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
 25)     4608      32   xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
 26)     4576      96   xfs_alloc_vextent+0x49f/0x630 [xfs]
 27)     4480     160   xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
 28)     4320     208   xfs_btree_split+0xb3/0x6a0 [xfs]
 29)     4112      96   xfs_btree_make_block_unfull+0x151/0x190 [xfs]
 30)     4016     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
 31)     3792     128   xfs_btree_insert+0x86/0x180 [xfs]
 32)     3664     352   xfs_bmap_add_extent_delay_real+0x41e/0x1670 [xfs]
 33)     3312     208   xfs_bmap_add_extent+0x41c/0x450 [xfs]
 34)     3104     448   xfs_bmapi+0x982/0x1200 [xfs]
 35)     2656     256   xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
 36)     2400     208   xfs_iomap+0x3d8/0x410 [xfs]
 37)     2192      32   xfs_map_blocks+0x2c/0x30 [xfs]
 38)     2160     256   xfs_page_state_convert+0x443/0x730 [xfs]
 39)     1904      64   xfs_vm_writepage+0xab/0x160 [xfs]
 40)     1840      32   __writepage+0x1a/0x60
 41)     1808     288   write_cache_pages+0x1f7/0x400
 42)     1520      16   generic_writepages+0x27/0x30
 43)     1504      48   xfs_vm_writepages+0x5a/0x70 [xfs]
 44)     1456      16   do_writepages+0x24/0x40
 45)     1440      64   writeback_single_inode+0xf1/0x3e0
 46)     1376     128   writeback_inodes_wb+0x31e/0x510
 47)     1248      16   writeback_inodes_wbc+0x1e/0x20
 48)     1232     224   balance_dirty_pages_ratelimited_nr+0x277/0x410
 49)     1008     192   generic_file_buffered_write+0x19b/0x240
 50)      816     288   xfs_write+0x849/0x930 [xfs]
 51)      528      16   xfs_file_aio_write+0x5b/0x70 [xfs]
 52)      512     272   do_sync_write+0xd1/0x120
 53)      240      48   vfs_write+0xcb/0x1a0
 54)      192      64   sys_write+0x55/0x90
 55)      128     128   system_call_fastpath+0x16/0x1b

[-- Attachment #2: config.gz --]
[-- Type: application/x-gzip, Size: 28595 bytes --]
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64
  From: John Berthels @ 2010-04-13  9:51 UTC
  To: Chris Mason, Dave Chinner, John Berthels, linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

[-- Attachment #1: Type: text/plain, Size: 5328 bytes --]

Chris Mason wrote:
> shrink_zone on my box isn't 500 bytes, but lets try the easy stuff
> first. This is against .34, if you have any trouble applying to .32,
> just add the word noinline after the word static on the function
> definitions.

Hi Chris,

Thanks for this. We've been soaking it for a while and get the stack trace below (which is still >8k), and which still shows shrink_zone at 528 bytes. I find it odd that the shrink_zone stack usage is different on our systems.

This is a stock 2.6.33.2 kernel, x86_64 arch (plus your patch and Dave Chinner's patch), built using ubuntu make-kpkg with gcc (Ubuntu 4.3.3-5ubuntu4) 4.3.3 (.vmscan.o.cmd with the full build options is below; the gzipped .config is attached). Can you see any difference between your system and ours which might explain the discrepancy? I note -g and -pg in there. (Does -pg have any stack overhead? It seems to be enabled in ubuntu release kernels.)
regards,

jb

mm/.vmscan.o.cmd:

cmd_mm/vmscan.o := gcc -Wp,-MD,mm/.vmscan.o.d -nostdinc -isystem /usr/lib/gcc/x86_64-linux-gnu/4.3.3/include -I/usr/local/src/kern/linux-2.6.33.2/arch/x86/include -Iinclude -include include/generated/autoconf.h -D__KERNEL__ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Wno-format-security -fno-delete-null-pointer-checks -O2 -m64 -mtune=generic -mno-red-zone -mcmodel=kernel -funit-at-a-time -maccumulate-outgoing-args -fstack-protector -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -fno-omit-frame-pointer -fno-optimize-sibling-calls -g -pg -Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow -D"KBUILD_STR(s)=\#s" -D"KBUILD_BASENAME=KBUILD_STR(vmscan)" -D"KBUILD_MODNAME=KBUILD_STR(vmscan)" -c -o mm/.tmp_vmscan.o mm/vmscan.c

Apr 12 22:06:35 nas17 kernel: [36346.599076] apache2 used greatest stack depth: 7904 bytes left

        Depth    Size   Location    (56 entries)
        -----    ----   --------
  0)     7904      48   __call_rcu+0x67/0x190
  1)     7856      16   call_rcu_sched+0x15/0x20
  2)     7840      16   call_rcu+0xe/0x10
  3)     7824     272   radix_tree_delete+0x159/0x2e0
  4)     7552      32   __remove_from_page_cache+0x21/0x110
  5)     7520      64   __remove_mapping+0xe8/0x130
  6)     7456     384   shrink_page_list+0x400/0x860
  7)     7072     528   shrink_zone+0x636/0xdc0
  8)     6544     112   do_try_to_free_pages+0xc2/0x3c0
  9)     6432     112   try_to_free_pages+0x64/0x70
 10)     6320     256   __alloc_pages_nodemask+0x3d2/0x710
 11)     6064      48   alloc_pages_current+0x8c/0xe0
 12)     6016      32   __page_cache_alloc+0x67/0x70
 13)     5984      80   find_or_create_page+0x50/0xb0
 14)     5904     160   _xfs_buf_lookup_pages+0x145/0x350 [xfs]
 15)     5744      64   xfs_buf_get+0x74/0x1d0 [xfs]
 16)     5680      48   xfs_buf_read+0x2f/0x110 [xfs]
 17)     5632      80   xfs_trans_read_buf+0x2bf/0x430 [xfs]
 18)     5552      80   xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
 19)     5472     176   xfs_btree_rshift+0xd7/0x530 [xfs]
 20)     5296      96   xfs_btree_make_block_unfull+0x5b/0x190 [xfs]
 21)     5200     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
 22)     4976     128   xfs_btree_insert+0x86/0x180 [xfs]
 23)     4848      96   xfs_alloc_fixup_trees+0x1fa/0x350 [xfs]
 24)     4752     144   xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
 25)     4608      32   xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
 26)     4576      96   xfs_alloc_vextent+0x49f/0x630 [xfs]
 27)     4480     160   xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
 28)     4320     208   xfs_btree_split+0xb3/0x6a0 [xfs]
 29)     4112      96   xfs_btree_make_block_unfull+0x151/0x190 [xfs]
 30)     4016     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
 31)     3792     128   xfs_btree_insert+0x86/0x180 [xfs]
 32)     3664     352   xfs_bmap_add_extent_delay_real+0x41e/0x1670 [xfs]
 33)     3312     208   xfs_bmap_add_extent+0x41c/0x450 [xfs]
 34)     3104     448   xfs_bmapi+0x982/0x1200 [xfs]
 35)     2656     256   xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
 36)     2400     208   xfs_iomap+0x3d8/0x410 [xfs]
 37)     2192      32   xfs_map_blocks+0x2c/0x30 [xfs]
 38)     2160     256   xfs_page_state_convert+0x443/0x730 [xfs]
 39)     1904      64   xfs_vm_writepage+0xab/0x160 [xfs]
 40)     1840      32   __writepage+0x1a/0x60
 41)     1808     288   write_cache_pages+0x1f7/0x400
 42)     1520      16   generic_writepages+0x27/0x30
 43)     1504      48   xfs_vm_writepages+0x5a/0x70 [xfs]
 44)     1456      16   do_writepages+0x24/0x40
 45)     1440      64   writeback_single_inode+0xf1/0x3e0
 46)     1376     128   writeback_inodes_wb+0x31e/0x510
 47)     1248      16   writeback_inodes_wbc+0x1e/0x20
 48)     1232     224   balance_dirty_pages_ratelimited_nr+0x277/0x410
 49)     1008     192   generic_file_buffered_write+0x19b/0x240
 50)      816     288   xfs_write+0x849/0x930 [xfs]
 51)      528      16   xfs_file_aio_write+0x5b/0x70 [xfs]
 52)      512     272   do_sync_write+0xd1/0x120
 53)      240      48   vfs_write+0xcb/0x1a0
 54)      192      64   sys_write+0x55/0x90
 55)      128     128   system_call_fastpath+0x16/0x1b

[-- Attachment #2: config.gz --]
[-- Type: application/x-gzip, Size: 28595 bytes --]
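A quick way to see which frames dominate a trace like the one above is to sort on the Size column. A minimal self-contained sketch (a few sample frames from the trace are inlined here; in practice you would pipe in the whole "Depth Size Location" table, e.g. from /sys/kernel/debug/tracing/stack_trace):

```shell
# Sample frames copied from the trace above; feed the full table in on
# stdin for real use.
trace='  7)     7072     528   shrink_zone+0x636/0xdc0
  6)     7456     384   shrink_page_list+0x400/0x860
 34)     3104     448   xfs_bmapi+0x982/0x1200 [xfs]
 32)     3664     352   xfs_bmap_add_extent_delay_real+0x41e/0x1670 [xfs]'

# Sort numerically on field 3 (the per-frame Size), largest first.
biggest=$(printf '%s\n' "$trace" | sort -k3,3nr | head -n 1)
echo "$biggest"
```

Here the largest single frame is shrink_zone at 528 bytes, matching the figure discussed in the message above.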
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64
  From: John Berthels @ 2010-04-16 13:41 UTC
  To: Chris Mason, Dave Chinner, John Berthels, linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

Chris Mason wrote:
> shrink_zone on my box isn't 500 bytes, but lets try the easy stuff
> first. This is against .34, if you have any trouble applying to .32,
> just add the word noinline after the word static on the function
> definitions.
>
> This makes shrink_zone disappear from my check_stack.pl output.
> Basically I think the compiler is inlining the shrink_active_zone and
> shrink_inactive_zone code into shrink_zone.

Hi Chris,

I hadn't seen the followup discussion on lkml until today, but this message: http://marc.info/?l=linux-mm&m=127122143303771&w=2 allowed me to look at stack usage in our build environment.

If I've understood correctly, builds with gcc-4.4 and gcc-4.3 have very different stack usage for shrink_zone(): 0x88 versus 0x1d8 bytes (details below). The reason appears to be the -fconserve-stack compilation option specified when using 4.4: running the command line from mm/.vmscan.cmd with gcc-4.4 but *without* -fconserve-stack gives the same result as with 4.3.

According to the discussion when the flag was added, http://www.gossamer-threads.com/lists/linux/kernel/1131612 this flag primarily affects inlining, so I double-checked the noinline patch you sent to the list and discovered that it had been incorrectly applied to the build tree. Correctly applying that patch to mm/vmscan.c (and using gcc-4.3) gives a sub $0x78,%rsp line.

I'm very sorry that this test of ours wasn't correct, and I'm sorry for sending bad info to the list. We're currently building a kernel with gcc-4.4 and will let you know whether or not it blows the 8k limit.

Thanks for your help.
regards,

jb

$ gcc-4.3 --version
gcc-4.3 (Ubuntu 4.3.4-5ubuntu1) 4.3.4
$ gcc-4.4 --version
gcc-4.4 (Ubuntu 4.4.1-4ubuntu9) 4.4.1

$ make CC=gcc-4.4 mm/vmscan.o
$ objdump -d mm/vmscan.o | less +/shrink_zone
0000000000002830 <shrink_zone>:
    2830:  55                      push   %rbp
    2831:  48 89 e5                mov    %rsp,%rbp
    2834:  41 57                   push   %r15
    2836:  41 56                   push   %r14
    2838:  41 55                   push   %r13
    283a:  41 54                   push   %r12
    283c:  53                      push   %rbx
    283d:  48 81 ec 88 00 00 00    sub    $0x88,%rsp
    2844:  e8 00 00 00 00          callq  2849 <shrink_zone+0x19>

$ make clean
$ make CC=gcc-4.3 mm/vmscan.o
$ objdump -d mm/vmscan.o | less +/shrink_zone
0000000000001ca0 <shrink_zone>:
    1ca0:  55                      push   %rbp
    1ca1:  48 89 e5                mov    %rsp,%rbp
    1ca4:  41 57                   push   %r15
    1ca6:  41 56                   push   %r14
    1ca8:  41 55                   push   %r13
    1caa:  41 54                   push   %r12
    1cac:  53                      push   %rbx
    1cad:  48 81 ec d8 01 00 00    sub    $0x1d8,%rsp
    1cb4:  e8 00 00 00 00          callq  1cb9 <shrink_zone+0x19>
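The comparison above (reading the prologue's `sub $N,%rsp` out of objdump output) can be automated. A self-contained sketch, with two sample lines inlined so it runs standalone; the symbol names here are illustrative labels, and for real use you would pipe in `objdump -d mm/vmscan.o`:

```shell
# Sample objdump output: one symbol header plus its prologue stack
# reservation, for each of the two compilers compared above.
sample='0000000000002830 <shrink_zone_gcc44>:
    283d:  48 81 ec 88 00 00 00    sub    $0x88,%rsp
0000000000001ca0 <shrink_zone_gcc43>:
    1cad:  48 81 ec d8 01 00 00    sub    $0x1d8,%rsp'

result=$(printf '%s\n' "$sample" | while read -r line; do
    case $line in
        *" <"*">:")                   # symbol header, e.g. "... <shrink_zone>:"
            fn=${line#* <}; fn=${fn%">:"} ;;
        *"sub "*",%rsp")              # prologue stack reservation
            hex=${line##*\$}          # -> "0x88,%rsp"
            hex=${hex%",%rsp"}        # -> "0x88"
            printf '%s %d\n' "$fn" "$((hex))" ;;   # hex -> decimal bytes
    esac
done)
printf '%s\n' "$result"
```

On the sample input this prints the decimal frame reservation per symbol (0x88 = 136 bytes versus 0x1d8 = 472 bytes), making the gcc-4.4/-fconserve-stack difference easy to scan across a whole object file.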
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64
  From: John Berthels @ 2010-04-09 13:43 UTC
  To: Dave Chinner
  Cc: linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1933 bytes --]

Dave Chinner wrote:
> So effectively the storage subsystem (NFS, filesystem, DM, MD,
> device drivers) have about 4K of stack to work in now. That seems to
> be a lot less than last time I looked at this, and we've been really
> careful not to increase XFS's stack usage for quite some time now.

OK. I should note that we have what appears to be a similar problem on a 2.6.28 distro kernel, so I'm not sure this is a very recent change. (We see the lockups on that kernel; we haven't tried larger stacks + stack instrumentation on the earlier kernel.)

Do you know if there are any obvious knobs to twiddle to make these codepaths less likely? The cluster is resilient against occasional server death, but frequent death is more annoying. We're currently running with sysctls:

net.ipv4.ip_nonlocal_bind=1
kernel.panic=300
vm.dirty_background_ratio=3
vm.min_free_kbytes=16384

I'm not sure what circumstances force the memory reclaim (and why it doesn't come from discarding a cached page). Is the problem in the DMA/DMA32 zone, and should we try playing with lowmem_reserve_ratio? Is there anything else we could do to keep dirty pages out of the low zones? Before trying THREAD_ORDER 2, we tried doubling the RAM in a couple of boxes from 2GB to 4GB without any significant reduction in the problem.

Lastly - if we end up stuck with THREAD_ORDER 2, does anyone know what symptoms to look out for if we become unable to allocate thread stacks due to fragmentation?

> I'll have to have a bit of a think on this one - if you could
> provide further stack traces as they get deeper (esp. if they go
> past 8k) that would be really handy.

Two of the worst offenders are below. We have plenty to send if you would like more. Please let us know if you'd like us to try anything else or would like other info.

Thanks very much for your thoughts, suggestions and work so far, it's very much appreciated here.

regards,

jb

[-- Attachment #2: stack_traces.txt --]
[-- Type: text/plain, Size: 7831 bytes --]

=== server16 ===
apache2 used greatest stack depth: 7208 bytes left

        Depth    Size   Location    (72 entries)
        -----    ----   --------
  0)     8336     304   select_task_rq_fair+0x235/0xad0
  1)     8032      96   try_to_wake_up+0x189/0x3f0
  2)     7936      16   default_wake_function+0x12/0x20
  3)     7920      32   autoremove_wake_function+0x16/0x40
  4)     7888      64   __wake_up_common+0x5a/0x90
  5)     7824      64   __wake_up+0x48/0x70
  6)     7760      64   insert_work+0x9f/0xb0
  7)     7696      48   __queue_work+0x36/0x50
  8)     7648      16   queue_work_on+0x4d/0x60
  9)     7632      16   queue_work+0x1f/0x30
 10)     7616      16   queue_delayed_work+0x2d/0x40
 11)     7600      32   ata_pio_queue_task+0x35/0x40
 12)     7568      48   ata_sff_qc_issue+0x146/0x2f0
 13)     7520      96   mv_qc_issue+0x12d/0x540 [sata_mv]
 14)     7424      96   ata_qc_issue+0x1fe/0x320
 15)     7328      64   ata_scsi_translate+0xae/0x1a0
 16)     7264      64   ata_scsi_queuecmd+0xbf/0x2f0
 17)     7200      48   scsi_dispatch_cmd+0x114/0x2b0
 18)     7152      96   scsi_request_fn+0x419/0x590
 19)     7056      32   __blk_run_queue+0x82/0x150
 20)     7024      48   elv_insert+0x1aa/0x2d0
 21)     6976      48   __elv_add_request+0x83/0xd0
 22)     6928      96   __make_request+0x139/0x490
 23)     6832     208   generic_make_request+0x3df/0x4d0
 24)     6624      80   submit_bio+0x7c/0x100
 25)     6544      96   _xfs_buf_ioapply+0x128/0x2c0 [xfs]
 26)     6448      48   xfs_buf_iorequest+0x75/0xd0 [xfs]
 27)     6400      32   xlog_bdstrat_cb+0x4d/0x60 [xfs]
 28)     6368      80   xlog_sync+0x218/0x510 [xfs]
 29)     6288      64   xlog_state_release_iclog+0xbb/0x100 [xfs]
 30)     6224     160   xlog_state_sync+0x1ab/0x230 [xfs]
 31)     6064      32   _xfs_log_force+0x5a/0x80 [xfs]
 32)     6032      32   xfs_log_force+0x18/0x40 [xfs]
 33)     6000      64   xfs_alloc_search_busy+0x14b/0x160 [xfs]
 34)     5936     112   xfs_alloc_get_freelist+0x130/0x170 [xfs]
 35)     5824      48   xfs_allocbt_alloc_block+0x33/0x70 [xfs]
 36)     5776     208   xfs_btree_split+0xb3/0x6a0 [xfs]
 37)     5568      96   xfs_btree_make_block_unfull+0x151/0x190 [xfs]
 38)     5472     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
 39)     5248     128   xfs_btree_insert+0x86/0x180 [xfs]
 40)     5120     144   xfs_free_ag_extent+0x33b/0x7b0 [xfs]
 41)     4976     224   xfs_alloc_fix_freelist+0x120/0x490 [xfs]
 42)     4752      96   xfs_alloc_vextent+0x1f5/0x630 [xfs]
 43)     4656     272   xfs_bmap_btalloc+0x497/0xa70 [xfs]
 44)     4384      16   xfs_bmap_alloc+0x21/0x40 [xfs]
 45)     4368     448   xfs_bmapi+0x85e/0x1200 [xfs]
 46)     3920     256   xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
 47)     3664     208   xfs_iomap+0x3d8/0x410 [xfs]
 48)     3456      32   xfs_map_blocks+0x2c/0x30 [xfs]
 49)     3424     256   xfs_page_state_convert+0x443/0x730 [xfs]
 50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
 51)     3104     384   shrink_page_list+0x65e/0x840
 52)     2720     528   shrink_zone+0x63f/0xe10
 53)     2192     112   do_try_to_free_pages+0xc2/0x3c0
 54)     2080     128   try_to_free_pages+0x77/0x80
 55)     1952     240   __alloc_pages_nodemask+0x3e4/0x710
 56)     1712      48   alloc_pages_current+0x8c/0xe0
 57)     1664      32   __page_cache_alloc+0x67/0x70
 58)     1632     144   __do_page_cache_readahead+0xd3/0x220
 59)     1488      16   ra_submit+0x21/0x30
 60)     1472      80   ondemand_readahead+0x11d/0x250
 61)     1392      64   page_cache_async_readahead+0xa9/0xe0
 62)     1328     592   __generic_file_splice_read+0x48a/0x530
 63)      736      48   generic_file_splice_read+0x4f/0x90
 64)      688      96   xfs_splice_read+0xf2/0x130 [xfs]
 65)      592      32   xfs_file_splice_read+0x4b/0x50 [xfs]
 66)      560      64   do_splice_to+0x77/0xb0
 67)      496     112   splice_direct_to_actor+0xcc/0x1c0
 68)      384      80   do_splice_direct+0x57/0x80
 69)      304      96   do_sendfile+0x16c/0x1e0
 70)      208      80   sys_sendfile64+0x8d/0xb0
 71)      128     128   system_call_fastpath+0x16/0x1b

=== server9 ===
[223269.859411] apache2 used greatest stack depth: 7088 bytes left

        Depth    Size   Location    (62 entries)
        -----    ----   --------
  0)     8528      32   down_trylock+0x1e/0x50
  1)     8496      80   _xfs_buf_find+0x12f/0x290 [xfs]
  2)     8416      64   xfs_buf_get+0x61/0x1c0 [xfs]
  3)     8352      48   xfs_buf_read+0x2f/0x110 [xfs]
  4)     8304      48   xfs_buf_readahead+0x61/0x90 [xfs]
  5)     8256      48   xfs_btree_readahead_sblock+0xea/0xf0 [xfs]
  6)     8208      16   xfs_btree_readahead+0x5f/0x90 [xfs]
  7)     8192     112   xfs_btree_increment+0x2e/0x2b0 [xfs]
  8)     8080     176   xfs_btree_rshift+0x2f2/0x530 [xfs]
  9)     7904     272   xfs_btree_delrec+0x4a3/0x1020 [xfs]
 10)     7632      64   xfs_btree_delete+0x40/0xd0 [xfs]
 11)     7568      96   xfs_alloc_fixup_trees+0x7d/0x350 [xfs]
 12)     7472     144   xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
 13)     7328      32   xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
 14)     7296      96   xfs_alloc_vextent+0x49f/0x630 [xfs]
 15)     7200     160   xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
 16)     7040     208   xfs_btree_split+0xb3/0x6a0 [xfs]
 17)     6832      96   xfs_btree_make_block_unfull+0x151/0x190 [xfs]
 18)     6736     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
 19)     6512     128   xfs_btree_insert+0x86/0x180 [xfs]
 20)     6384     352   xfs_bmap_add_extent_delay_real+0x41e/0x1660 [xfs]
 21)     6032     208   xfs_bmap_add_extent+0x41c/0x450 [xfs]
 22)     5824     448   xfs_bmapi+0x982/0x1200 [xfs]
 23)     5376     256   xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
 24)     5120     208   xfs_iomap+0x3d8/0x410 [xfs]
 25)     4912      32   xfs_map_blocks+0x2c/0x30 [xfs]
 26)     4880     256   xfs_page_state_convert+0x443/0x730 [xfs]
 27)     4624      64   xfs_vm_writepage+0xab/0x160 [xfs]
 28)     4560     384   shrink_page_list+0x65e/0x840
 29)     4176     528   shrink_zone+0x63f/0xe10
 30)     3648     112   do_try_to_free_pages+0xc2/0x3c0
 31)     3536     128   try_to_free_pages+0x77/0x80
 32)     3408     240   __alloc_pages_nodemask+0x3e4/0x710
 33)     3168      48   alloc_pages_current+0x8c/0xe0
 34)     3120      80   new_slab+0x247/0x300
 35)     3040      96   __slab_alloc+0x137/0x490
 36)     2944      64   kmem_cache_alloc+0x110/0x120
 37)     2880      64   kmem_zone_alloc+0x9a/0xe0 [xfs]
 38)     2816      32   kmem_zone_zalloc+0x1e/0x50 [xfs]
 39)     2784      32   _xfs_trans_alloc+0x38/0x80 [xfs]
 40)     2752      96   xfs_trans_alloc+0x9f/0xb0 [xfs]
 41)     2656     256   xfs_iomap_write_allocate+0xf1/0x3c0 [xfs]
 42)     2400     208   xfs_iomap+0x3d8/0x410 [xfs]
 43)     2192      32   xfs_map_blocks+0x2c/0x30 [xfs]
 44)     2160     256   xfs_page_state_convert+0x443/0x730 [xfs]
 45)     1904      64   xfs_vm_writepage+0xab/0x160 [xfs]
 46)     1840      32   __writepage+0x17/0x50
 47)     1808     288   write_cache_pages+0x1f7/0x400
 48)     1520      16   generic_writepages+0x24/0x30
 49)     1504      48   xfs_vm_writepages+0x5c/0x80 [xfs]
 50)     1456      16   do_writepages+0x21/0x40
 51)     1440      64   writeback_single_inode+0xeb/0x3c0
 52)     1376     128   writeback_inodes_wb+0x318/0x510
 53)     1248      16   writeback_inodes_wbc+0x1e/0x20
 54)     1232     224   balance_dirty_pages_ratelimited_nr+0x269/0x3a0
 55)     1008     192   generic_file_buffered_write+0x19b/0x240
 56)      816     288   xfs_write+0x837/0x920 [xfs]
 57)      528      16   xfs_file_aio_write+0x5b/0x70 [xfs]
 58)      512     272   do_sync_write+0xd1/0x120
 59)      240      48   vfs_write+0xcb/0x1a0
 60)      192      64   sys_write+0x55/0x90
 61)      128     128   system_call_fastpath+0x16/0x1b
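The arithmetic behind reading these "bytes left" messages is worth making explicit. On a THREAD_ORDER 2 x86-64 kernel the stack is 4 pages (16 KiB), so a figure like "7208 bytes left" implies (roughly, ignoring the small reserved area at the end of the stack) more stack in use than an order-1 (8 KiB) stack could ever have supplied:

```shell
# Sketch of the 8k-overflow inference made in this thread.
thread_size=$((4096 << 2))   # PAGE_SIZE << THREAD_ORDER, with THREAD_ORDER=2
left=7208                    # "apache2 used greatest stack depth: 7208 bytes left"
used=$((thread_size - left))
echo "$used bytes used"      # 9176: past the 8192-byte order-1 stack limit
```

This is why the larger-stack kernel surviving while logging 7208 bytes left was taken as concrete evidence that the workload exceeds 8k of stack.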
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64 @ 2010-04-09 13:43 ` John Berthels 0 siblings, 0 replies; 43+ messages in thread From: John Berthels @ 2010-04-09 13:43 UTC (permalink / raw) To: Dave Chinner; +Cc: Nick Gregory, linux-mm, xfs, linux-kernel, Rob Sanderson [-- Attachment #1: Type: text/plain, Size: 1933 bytes --] Dave Chinner wrote: > So effectively the storage subsystem (NFS, filesystem, DM, MD, > device drivers) have about 4K of stack to work in now. That seems to > be a lot less than last time I looked at this, and we've been really > careful not to increase XFS's stack usage for quite some time now. OK. I should note that we have what appears to be a similar problem on a 2.6.28 distro kernel, so I'm not sure this is a very recent change. (We see the lockups on that kernel, we haven't tried larger stacks + stack instrumentation on the earlier kernel). Do you know if there are any obvious knobs to twiddle to make these codepaths less likely? The cluster is resilient against occasional server death, but frequent death is more annoying. We're currently running with sysctls: net.ipv4.ip_nonlocal_bind=1 kernel.panic=300 vm.dirty_background_ratio=3 vm.min_free_kbytes=16384 I'm not sure what circumstances force the memory reclaim (and why it doesn't come from discarding a cached page). Is the problem is the DMA/DMA32 zone and we should try playing with lowmem_reserve_ratio? Is there anything else we could do to keep dirty pages out of the low zones? Before trying THREAD_ORDER 2, we tried doubling the RAM in a couple of boxes from 2GB to 4GB without any significant reduction in the problem. Lastly - if we end up stuck with THREAD_ORDER 2, does anyone know what symptoms to look out for to know if unable to allocate thread stacks due to fragmentation? > I'll have to have a bit of a think on this one - if you could > provide further stack traces as they get deeper (esp. 
if they go > past 8k) that would be really handy. Two of the worst offenders below. We have plenty to send if you would like more. Please let us know if you'd like us to try anything else or would like other info. Thanks very much for your thoughts, suggestions and work so far, it's very much appreciated here. regards, jb [-- Attachment #2: stack_traces.txt --] [-- Type: text/plain, Size: 7831 bytes --] === server16 === apache2 used greatest stack depth: 7208 bytes left Depth Size Location (72 entries) ----- ---- -------- 0) 8336 304 select_task_rq_fair+0x235/0xad0 1) 8032 96 try_to_wake_up+0x189/0x3f0 2) 7936 16 default_wake_function+0x12/0x20 3) 7920 32 autoremove_wake_function+0x16/0x40 4) 7888 64 __wake_up_common+0x5a/0x90 5) 7824 64 __wake_up+0x48/0x70 6) 7760 64 insert_work+0x9f/0xb0 7) 7696 48 __queue_work+0x36/0x50 8) 7648 16 queue_work_on+0x4d/0x60 9) 7632 16 queue_work+0x1f/0x30 10) 7616 16 queue_delayed_work+0x2d/0x40 11) 7600 32 ata_pio_queue_task+0x35/0x40 12) 7568 48 ata_sff_qc_issue+0x146/0x2f0 13) 7520 96 mv_qc_issue+0x12d/0x540 [sata_mv] 14) 7424 96 ata_qc_issue+0x1fe/0x320 15) 7328 64 ata_scsi_translate+0xae/0x1a0 16) 7264 64 ata_scsi_queuecmd+0xbf/0x2f0 17) 7200 48 scsi_dispatch_cmd+0x114/0x2b0 18) 7152 96 scsi_request_fn+0x419/0x590 19) 7056 32 __blk_run_queue+0x82/0x150 20) 7024 48 elv_insert+0x1aa/0x2d0 21) 6976 48 __elv_add_request+0x83/0xd0 22) 6928 96 __make_request+0x139/0x490 23) 6832 208 generic_make_request+0x3df/0x4d0 24) 6624 80 submit_bio+0x7c/0x100 25) 6544 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs] 26) 6448 48 xfs_buf_iorequest+0x75/0xd0 [xfs] 27) 6400 32 xlog_bdstrat_cb+0x4d/0x60 [xfs] 28) 6368 80 xlog_sync+0x218/0x510 [xfs] 29) 6288 64 xlog_state_release_iclog+0xbb/0x100 [xfs] 30) 6224 160 xlog_state_sync+0x1ab/0x230 [xfs] 31) 6064 32 _xfs_log_force+0x5a/0x80 [xfs] 32) 6032 32 xfs_log_force+0x18/0x40 [xfs] 33) 6000 64 xfs_alloc_search_busy+0x14b/0x160 [xfs] 34) 5936 112 xfs_alloc_get_freelist+0x130/0x170 [xfs] 35) 5824 48 
xfs_allocbt_alloc_block+0x33/0x70 [xfs] 36) 5776 208 xfs_btree_split+0xb3/0x6a0 [xfs] 37) 5568 96 xfs_btree_make_block_unfull+0x151/0x190 [xfs] 38) 5472 224 xfs_btree_insrec+0x39c/0x5b0 [xfs] 39) 5248 128 xfs_btree_insert+0x86/0x180 [xfs] 40) 5120 144 xfs_free_ag_extent+0x33b/0x7b0 [xfs] 41) 4976 224 xfs_alloc_fix_freelist+0x120/0x490 [xfs] 42) 4752 96 xfs_alloc_vextent+0x1f5/0x630 [xfs] 43) 4656 272 xfs_bmap_btalloc+0x497/0xa70 [xfs] 44) 4384 16 xfs_bmap_alloc+0x21/0x40 [xfs] 45) 4368 448 xfs_bmapi+0x85e/0x1200 [xfs] 46) 3920 256 xfs_iomap_write_allocate+0x248/0x3c0 [xfs] 47) 3664 208 xfs_iomap+0x3d8/0x410 [xfs] 48) 3456 32 xfs_map_blocks+0x2c/0x30 [xfs] 49) 3424 256 xfs_page_state_convert+0x443/0x730 [xfs] 50) 3168 64 xfs_vm_writepage+0xab/0x160 [xfs] 51) 3104 384 shrink_page_list+0x65e/0x840 52) 2720 528 shrink_zone+0x63f/0xe10 53) 2192 112 do_try_to_free_pages+0xc2/0x3c0 54) 2080 128 try_to_free_pages+0x77/0x80 55) 1952 240 __alloc_pages_nodemask+0x3e4/0x710 56) 1712 48 alloc_pages_current+0x8c/0xe0 57) 1664 32 __page_cache_alloc+0x67/0x70 58) 1632 144 __do_page_cache_readahead+0xd3/0x220 59) 1488 16 ra_submit+0x21/0x30 60) 1472 80 ondemand_readahead+0x11d/0x250 61) 1392 64 page_cache_async_readahead+0xa9/0xe0 62) 1328 592 __generic_file_splice_read+0x48a/0x530 63) 736 48 generic_file_splice_read+0x4f/0x90 64) 688 96 xfs_splice_read+0xf2/0x130 [xfs] 65) 592 32 xfs_file_splice_read+0x4b/0x50 [xfs] 66) 560 64 do_splice_to+0x77/0xb0 67) 496 112 splice_direct_to_actor+0xcc/0x1c0 68) 384 80 do_splice_direct+0x57/0x80 69) 304 96 do_sendfile+0x16c/0x1e0 70) 208 80 sys_sendfile64+0x8d/0xb0 71) 128 128 system_call_fastpath+0x16/0x1b === server9 === [223269.859411] apache2 used greatest stack depth: 7088 bytes left Depth Size Location (62 entries) ----- ---- -------- 0) 8528 32 down_trylock+0x1e/0x50 1) 8496 80 _xfs_buf_find+0x12f/0x290 [xfs] 2) 8416 64 xfs_buf_get+0x61/0x1c0 [xfs] 3) 8352 48 xfs_buf_read+0x2f/0x110 [xfs] 4) 8304 48 xfs_buf_readahead+0x61/0x90 [xfs] 5) 
  5)     8256      48   xfs_btree_readahead_sblock+0xea/0xf0 [xfs]
  6)     8208      16   xfs_btree_readahead+0x5f/0x90 [xfs]
  7)     8192     112   xfs_btree_increment+0x2e/0x2b0 [xfs]
  8)     8080     176   xfs_btree_rshift+0x2f2/0x530 [xfs]
  9)     7904     272   xfs_btree_delrec+0x4a3/0x1020 [xfs]
 10)     7632      64   xfs_btree_delete+0x40/0xd0 [xfs]
 11)     7568      96   xfs_alloc_fixup_trees+0x7d/0x350 [xfs]
 12)     7472     144   xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
 13)     7328      32   xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
 14)     7296      96   xfs_alloc_vextent+0x49f/0x630 [xfs]
 15)     7200     160   xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
 16)     7040     208   xfs_btree_split+0xb3/0x6a0 [xfs]
 17)     6832      96   xfs_btree_make_block_unfull+0x151/0x190 [xfs]
 18)     6736     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
 19)     6512     128   xfs_btree_insert+0x86/0x180 [xfs]
 20)     6384     352   xfs_bmap_add_extent_delay_real+0x41e/0x1660 [xfs]
 21)     6032     208   xfs_bmap_add_extent+0x41c/0x450 [xfs]
 22)     5824     448   xfs_bmapi+0x982/0x1200 [xfs]
 23)     5376     256   xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
 24)     5120     208   xfs_iomap+0x3d8/0x410 [xfs]
 25)     4912      32   xfs_map_blocks+0x2c/0x30 [xfs]
 26)     4880     256   xfs_page_state_convert+0x443/0x730 [xfs]
 27)     4624      64   xfs_vm_writepage+0xab/0x160 [xfs]
 28)     4560     384   shrink_page_list+0x65e/0x840
 29)     4176     528   shrink_zone+0x63f/0xe10
 30)     3648     112   do_try_to_free_pages+0xc2/0x3c0
 31)     3536     128   try_to_free_pages+0x77/0x80
 32)     3408     240   __alloc_pages_nodemask+0x3e4/0x710
 33)     3168      48   alloc_pages_current+0x8c/0xe0
 34)     3120      80   new_slab+0x247/0x300
 35)     3040      96   __slab_alloc+0x137/0x490
 36)     2944      64   kmem_cache_alloc+0x110/0x120
 37)     2880      64   kmem_zone_alloc+0x9a/0xe0 [xfs]
 38)     2816      32   kmem_zone_zalloc+0x1e/0x50 [xfs]
 39)     2784      32   _xfs_trans_alloc+0x38/0x80 [xfs]
 40)     2752      96   xfs_trans_alloc+0x9f/0xb0 [xfs]
 41)     2656     256   xfs_iomap_write_allocate+0xf1/0x3c0 [xfs]
 42)     2400     208   xfs_iomap+0x3d8/0x410 [xfs]
 43)     2192      32   xfs_map_blocks+0x2c/0x30 [xfs]
 44)     2160     256   xfs_page_state_convert+0x443/0x730 [xfs]
 45)     1904      64   xfs_vm_writepage+0xab/0x160 [xfs]
 46)     1840      32   __writepage+0x17/0x50
 47)     1808     288   write_cache_pages+0x1f7/0x400
 48)     1520      16   generic_writepages+0x24/0x30
 49)     1504      48   xfs_vm_writepages+0x5c/0x80 [xfs]
 50)     1456      16   do_writepages+0x21/0x40
 51)     1440      64   writeback_single_inode+0xeb/0x3c0
 52)     1376     128   writeback_inodes_wb+0x318/0x510
 53)     1248      16   writeback_inodes_wbc+0x1e/0x20
 54)     1232     224   balance_dirty_pages_ratelimited_nr+0x269/0x3a0
 55)     1008     192   generic_file_buffered_write+0x19b/0x240
 56)      816     288   xfs_write+0x837/0x920 [xfs]
 57)      528      16   xfs_file_aio_write+0x5b/0x70 [xfs]
 58)      512     272   do_sync_write+0xd1/0x120
 59)      240      48   vfs_write+0xcb/0x1a0
 60)      192      64   sys_write+0x55/0x90
 61)      128     128   system_call_fastpath+0x16/0x1b
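In the stack tracer output above, Depth is the number of stack bytes in use from that frame upward, so the topmost frame's Depth is the total stack footprint of the whole call chain. A minimal sketch of the arithmetic (using a hypothetical subset of the frames shown, and assuming THREAD_ORDER 1, i.e. an 8 KiB stack on x86-64):

```python
# (depth, size, symbol) triples copied from a few frames of the trace above;
# depth = stack bytes consumed from this frame to the top of the stack.
frames = [
    (8256, 48, "xfs_btree_readahead_sblock"),
    (5824, 448, "xfs_bmapi"),
    (4176, 528, "shrink_zone"),
    (128, 128, "system_call_fastpath"),
]

THREAD_SIZE = 8192  # THREAD_ORDER 1 on x86-64: 2 pages = 8 KiB

# The largest depth is the total stack usage of the traced call chain.
max_depth = max(depth for depth, _, _ in frames)
print(max_depth, max_depth > THREAD_SIZE)
```

The topmost frame already reports 8256 bytes in use, 64 bytes more than an 8 KiB stack provides, which is consistent with the "184 bytes left" warning before the crash and with THREAD_ORDER 2 (16 KiB) papering over the problem.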
end of thread, other threads:[~2010-04-16 13:42 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-07 11:06 PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64 John Berthels
2010-04-07 14:05 ` Dave Chinner
2010-04-07 15:57 ` John Berthels
2010-04-07 17:43 ` Eric Sandeen
2010-04-07 23:43 ` Dave Chinner
2010-04-08  3:03 ` Dave Chinner
2010-04-08 12:16 ` John Berthels
2010-04-08 14:47 ` John Berthels
2010-04-08 16:18 ` John Berthels
2010-04-08 23:38 ` Dave Chinner
2010-04-09 11:38 ` Chris Mason
2010-04-09 18:05 ` Eric Sandeen
2010-04-09 18:11 ` Chris Mason
2010-04-12  1:01 ` Dave Chinner
2010-04-13  9:51 ` John Berthels
2010-04-16 13:41 ` John Berthels
2010-04-09 13:43 ` John Berthels