* PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

From: John Berthels @ 2010-04-07 11:06 UTC
To: linux-kernel; +Cc: Nick Gregory, Rob Sanderson

Hi folks,

[I'm afraid that I'm not subscribed to the list, please cc: me on any reply.]

Problem: kernel.org 2.6.33.2 x86_64 kernel locks up under write-heavy I/O
load. It is "fixed" by changing THREAD_ORDER to 2.

Is this an OK long-term solution, and should it be needed? As far as I can
see from searching, there is an expectation that xfs would generally work
with 8k stacks (THREAD_ORDER 1). We don't have xfs stacked over LVM or
anything else. If anyone can offer any advice on this, that would be great.

I understand larger kernel stacks may introduce problems in getting an
allocation of the appropriate size. So am I right in thinking the symptom
we need to look out for would be an error on fork() or clone()? Or will the
box panic in that case?

Details below.

regards,

jb

Background: We have a cluster of systems with roughly the following specs:
2GB RAM, 24 (twenty-four) 1TB+ disks, Intel Core2 Duo @ 2.2GHz.

Following the addition of three new servers to the cluster, we started
seeing a high incidence of intermittent lockups (up to several times per
day for some servers) across both the old and new servers. Prior to that,
we saw this problem only rarely (perhaps once per 3 months). Adding the new
servers will have changed the I/O patterns to all servers.

The servers receive a heavy write load, often with many slow writers (as
well as a read load). Servers would become unresponsive, with nothing
written to /var/log/messages. Setting sysctl kernel.panic=300 caused a
restart (which showed the kernel was panicking and unable to write at the
time).
netconsole showed a variety of stack traces, mostly related to xfs_write
activity (but then, that's what the box spends its time doing).

22/24 of the disks have 1 partition, formatted with xfs (over the
partition, not over LVM). The other 2 disks have 3 partitions: xfs data,
swap and a RAID1 partition contributing to an ext3 root filesystem mounted
on /dev/md0.

We have tried various solutions (different kernels from ubuntu server
2.6.28->2.6.32). Vanilla 2.6.33.2 from kernel.org + stack tracing still has
the problem, and logged:

    kernel: [58552.740032] flush-8:112 used greatest stack depth: 184 bytes left

a short while before dying. Vanilla 2.6.33.2 + stack tracing + THREAD_ORDER
2 is much more stable (no lockups so far, we would have expected 5-6 by
now) and has logged:

    kernel: [44798.183507] apache2 used greatest stack depth: 7208 bytes left

which I understand (possibly wrongly) as concrete evidence that we have
exceeded 8k of stack space.
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

From: Dave Chinner @ 2010-04-07 14:05 UTC
To: John Berthels; +Cc: linux-kernel, Nick Gregory, Rob Sanderson, xfs

On Wed, Apr 07, 2010 at 12:06:01PM +0100, John Berthels wrote:
> Problem: kernel.org 2.6.33.2 x86_64 kernel locks up under
> write-heavy I/O load. It is "fixed" by changing THREAD_ORDER to 2.
>
> Is this an OK long-term solution/should this be needed? As far as I
> can see from searching, there is an expectation that xfs would
> generally work with 8k stacks (THREAD_ORDER 1). We don't have xfs
> stacked over LVM or anything else.

I'm not seeing stacks deeper than about 5.6k on XFS under heavy write
loads. That's nowhere near blowing an 8k stack, so there must be
something special about what you are doing. Can you post the stack
traces that are being generated for the deepest stack generated -
/sys/kernel/debug/tracing/stack_trace should contain it.

> Background: We have a cluster of systems with roughly the following
> specs (2GB RAM, 24 (twenty-four) 1TB+ disks, Intel Core2 Duo @
> 2.2GHz).
>
> Following the addition of three new servers to the cluster, we
> started seeing a high incidence of intermittent lockups (up to
> several times per day for some servers) across both the old and new
> servers. Prior to that, we saw this problem only rarely (perhaps
> once per 3 months).

What is generating the write load?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

From: John Berthels @ 2010-04-07 15:57 UTC
To: Dave Chinner; +Cc: linux-kernel, Nick Gregory, Rob Sanderson, xfs

[-- Attachment #1: Type: text/plain, Size: 4514 bytes --]

Dave Chinner wrote:
> I'm not seeing stacks deeper than about 5.6k on XFS under heavy write
> loads. That's nowhere near blowing an 8k stack, so there must be
> something special about what you are doing. Can you post the stack
> traces that are being generated for the deepest stack generated -
> /sys/kernel/debug/tracing/stack_trace should contain it.

Appended below. That doesn't seem to reach 8192, but the box it's from has
logged:

    [74649.579386] apache2 used greatest stack depth: 7024 bytes left

full dmesg (gzipped) attached.

> What is generating the write load?

WebDAV PUTs in a modified mogilefs cluster, running apache-mpm-worker
(threaded) as the DAV server. The write load is a mix of
internet-upload-speed writers trickling files up and some local fast
replicators copying from elsewhere in the cluster. mpm worker cfg is:

    ServerLimit 20
    StartServers 5
    MaxClients 300
    MinSpareThreads 25
    MaxSpareThreads 75
    ThreadsPerChild 30
    MaxRequestsPerChild 0

File sizes are a mix of small to large (4GB+). Each disk is exported as a
mogile device, so it's possible for mogile to pound a single disk with lots
of write activity (if the random number generator decides to put lots of
files on that device at the same time).

We're also seeing occasional slowdowns + high load avg (up to ~300, i.e.
MaxClients) with a corresponding number of threads in D state. (This
slowdown + high load avg seems to correlate with what would have previously
caused a panic with THREAD_ORDER 1, but not 100% sure.) As you can see from
the dmesg, this trips the "task xxx blocked for more than 120 seconds."
warning on some of the threads. Don't know if that's related to the stack
issue or to be expected under the load.

jb

        Depth    Size   Location    (47 entries)
        -----    ----   --------
  0)     7568      16   mempool_alloc_slab+0x16/0x20
  1)     7552     144   mempool_alloc+0x65/0x140
  2)     7408      96   get_request+0x124/0x370
  3)     7312     144   get_request_wait+0x29/0x1b0
  4)     7168      96   __make_request+0x9b/0x490
  5)     7072     208   generic_make_request+0x3df/0x4d0
  6)     6864      80   submit_bio+0x7c/0x100
  7)     6784      96   _xfs_buf_ioapply+0x128/0x2c0 [xfs]
  8)     6688      48   xfs_buf_iorequest+0x75/0xd0 [xfs]
  9)     6640      32   _xfs_buf_read+0x36/0x70 [xfs]
 10)     6608      48   xfs_buf_read+0xda/0x110 [xfs]
 11)     6560      80   xfs_trans_read_buf+0x2a7/0x410 [xfs]
 12)     6480      80   xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
 13)     6400      80   xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
 14)     6320     176   xfs_btree_lookup+0xd7/0x490 [xfs]
 15)     6144      16   xfs_alloc_lookup_eq+0x19/0x20 [xfs]
 16)     6128      96   xfs_alloc_fixup_trees+0xee/0x350 [xfs]
 17)     6032     144   xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
 18)     5888      32   xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
 19)     5856      96   xfs_alloc_vextent+0x49f/0x630 [xfs]
 20)     5760     160   xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
 21)     5600     208   xfs_btree_split+0xb3/0x6a0 [xfs]
 22)     5392      96   xfs_btree_make_block_unfull+0x151/0x190 [xfs]
 23)     5296     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
 24)     5072     128   xfs_btree_insert+0x86/0x180 [xfs]
 25)     4944     352   xfs_bmap_add_extent_delay_real+0x41e/0x1660 [xfs]
 26)     4592     208   xfs_bmap_add_extent+0x41c/0x450 [xfs]
 27)     4384     448   xfs_bmapi+0x982/0x1200 [xfs]
 28)     3936     256   xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
 29)     3680     208   xfs_iomap+0x3d8/0x410 [xfs]
 30)     3472      32   xfs_map_blocks+0x2c/0x30 [xfs]
 31)     3440     256   xfs_page_state_convert+0x443/0x730 [xfs]
 32)     3184      64   xfs_vm_writepage+0xab/0x160 [xfs]
 33)     3120     384   shrink_page_list+0x65e/0x840
 34)     2736     528   shrink_zone+0x63f/0xe10
 35)     2208     112   do_try_to_free_pages+0xc2/0x3c0
 36)     2096     128   try_to_free_pages+0x77/0x80
 37)     1968     240   __alloc_pages_nodemask+0x3e4/0x710
 38)     1728      48   alloc_pages_current+0x8c/0xe0
 39)     1680      16   __get_free_pages+0xe/0x50
 40)     1664      48   __pollwait+0xca/0x110
 41)     1616      32   unix_poll+0x28/0xc0
 42)     1584      16   sock_poll+0x1d/0x20
 43)     1568     912   do_select+0x3d6/0x700
 44)      656     416   core_sys_select+0x18c/0x2c0
 45)      240     112   sys_select+0x4f/0x110
 46)      128     128   system_call_fastpath+0x16/0x1b

[-- Attachment #2: dmesg.txt.gz --]
[-- Type: application/x-gzip, Size: 18745 bytes --]
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

From: Eric Sandeen @ 2010-04-07 17:43 UTC
To: John Berthels; +Cc: Dave Chinner, Nick Gregory, xfs, linux-kernel, Rob Sanderson

John Berthels wrote:
> Dave Chinner wrote:
>> I'm not seeing stacks deeper than about 5.6k on XFS under heavy write
>> loads. That's nowhere near blowing an 8k stack, so there must be
>> something special about what you are doing. Can you post the stack
>> traces that are being generated for the deepest stack generated -
>> /sys/kernel/debug/tracing/stack_trace should contain it.
>
> Appended below. That doesn't seem to reach 8192 but the box it's from
> has logged:
>
> [74649.579386] apache2 used greatest stack depth: 7024 bytes left

but that's -left- (out of 8k or is that from a THREAD_ORDER=2 box?)
I guess it must be out of 16k...

>         Depth    Size   Location    (47 entries)
>         -----    ----   --------
>   0)     7568      16   mempool_alloc_slab+0x16/0x20
>   1)     7552     144   mempool_alloc+0x65/0x140
>   2)     7408      96   get_request+0x124/0x370
>   3)     7312     144   get_request_wait+0x29/0x1b0
>   4)     7168      96   __make_request+0x9b/0x490
>   5)     7072     208   generic_make_request+0x3df/0x4d0
>   6)     6864      80   submit_bio+0x7c/0x100
>   7)     6784      96   _xfs_buf_ioapply+0x128/0x2c0 [xfs]
>   8)     6688      48   xfs_buf_iorequest+0x75/0xd0 [xfs]
>   9)     6640      32   _xfs_buf_read+0x36/0x70 [xfs]
>  10)     6608      48   xfs_buf_read+0xda/0x110 [xfs]
>  11)     6560      80   xfs_trans_read_buf+0x2a7/0x410 [xfs]
>  12)     6480      80   xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
>  13)     6400      80   xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
>  14)     6320     176   xfs_btree_lookup+0xd7/0x490 [xfs]
>  15)     6144      16   xfs_alloc_lookup_eq+0x19/0x20 [xfs]
>  16)     6128      96   xfs_alloc_fixup_trees+0xee/0x350 [xfs]
>  17)     6032     144   xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
>  18)     5888      32   xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
>  19)     5856      96   xfs_alloc_vextent+0x49f/0x630 [xfs]
>  20)     5760     160   xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
>  21)     5600     208   xfs_btree_split+0xb3/0x6a0 [xfs]
>  22)     5392      96   xfs_btree_make_block_unfull+0x151/0x190 [xfs]
>  23)     5296     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
>  24)     5072     128   xfs_btree_insert+0x86/0x180 [xfs]
>  25)     4944     352   xfs_bmap_add_extent_delay_real+0x41e/0x1660 [xfs]
>  26)     4592     208   xfs_bmap_add_extent+0x41c/0x450 [xfs]
>  27)     4384     448   xfs_bmapi+0x982/0x1200 [xfs]

This one, I'm afraid, has always been big.

>  28)     3936     256   xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
>  29)     3680     208   xfs_iomap+0x3d8/0x410 [xfs]
>  30)     3472      32   xfs_map_blocks+0x2c/0x30 [xfs]
>  31)     3440     256   xfs_page_state_convert+0x443/0x730 [xfs]
>  32)     3184      64   xfs_vm_writepage+0xab/0x160 [xfs]
>  33)     3120     384   shrink_page_list+0x65e/0x840
>  34)     2736     528   shrink_zone+0x63f/0xe10

that's a nice one (actually the two together at > 900 bytes, ouch)

>  35)     2208     112   do_try_to_free_pages+0xc2/0x3c0
>  36)     2096     128   try_to_free_pages+0x77/0x80
>  37)     1968     240   __alloc_pages_nodemask+0x3e4/0x710
>  38)     1728      48   alloc_pages_current+0x8c/0xe0
>  39)     1680      16   __get_free_pages+0xe/0x50
>  40)     1664      48   __pollwait+0xca/0x110
>  41)     1616      32   unix_poll+0x28/0xc0
>  42)     1584      16   sock_poll+0x1d/0x20
>  43)     1568     912   do_select+0x3d6/0x700

912, ouch!

    int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
    {
            ktime_t expire, *to = NULL;
            struct poll_wqueues table;

    (gdb) p sizeof(struct poll_wqueues)
    $1 = 624

I guess that's been there forever, though.

>  44)      656     416   core_sys_select+0x18c/0x2c0

416 hurts too. The xfs callchain is deep, no doubt, but the combination of
the select path and the shrink calls is almost 2k in just a few calls, and
that doesn't help much.

-Eric

>  45)      240     112   sys_select+0x4f/0x110
>  46)      128     128   system_call_fastpath+0x16/0x1b
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

From: Dave Chinner @ 2010-04-07 23:43 UTC
To: John Berthels; +Cc: linux-kernel, Nick Gregory, Rob Sanderson, xfs

[added linux-mm]

On Wed, Apr 07, 2010 at 04:57:11PM +0100, John Berthels wrote:
> Dave Chinner wrote:
> >I'm not seeing stacks deeper than about 5.6k on XFS under heavy write
> >loads. That's nowhere near blowing an 8k stack, so there must be
> >something special about what you are doing. Can you post the stack
> >traces that are being generated for the deepest stack generated -
> >/sys/kernel/debug/tracing/stack_trace should contain it.
> Appended below. That doesn't seem to reach 8192 but the box it's
> from has logged:
>
> [74649.579386] apache2 used greatest stack depth: 7024 bytes left
>
> full dmesg (gzipped) attached.
> >What is generating the write load?
>
> WebDAV PUTs in a modified mogilefs cluster, running
> apache-mpm-worker (threaded) as the DAV server. The write load is a
> mix of internet-upload speed writers trickling files up and some
> local fast replicators copying from elsewhere in the cluster.
>
> File sizes are a mix of small to large (4GB+). Each disk is exported
> as a mogile device, so it's possible for mogile to pound a single
> disk with lots of write activity (if the random number generator
> decides to put lots of files on that device at the same time).
>
> We're also seeing occasional slowdowns + high load avg (up to ~300,
> i.e. MaxClients) with a corresponding number of threads in D state.
> (This slowdown + high load avg seems to correlate with what would
> have previously caused a panic on the THREAD_ORDER 1, but not 100%
> sure).
>
> As you can see from the dmesg, this trips the "task xxx blocked for
> more than 120 seconds." on some of the threads.
>
> Don't know if that's related to the stack issue or to be expected
> under the load.

It looks to be caused by direct memory reclaim trying to clean pages with a
significant amount of stack already in use. Basically, there is not enough
stack space left for the XFS ->writepage path to execute in. I can't see
any fast fix for this occurring, so you are probably best to run with a
larger stack for the moment.

As it is, I don't think direct memory reclaim should be cleaning dirty file
pages - it should be leaving that to the writeback threads (which are far
more efficient at it) or, as a last resort, kswapd. Direct memory reclaim
is invoked with an unknown amount of stack already in use, so there is
never any guarantee that there is enough stack space left to enter the
->writepage path of any filesystem.

MM-folk - have there been any changes recently to writeback of pages from
direct reclaim that may have caused this, or have we just been lucky for a
really long time?

Cheers,

Dave.

>         Depth    Size   Location    (47 entries)
>         -----    ----   --------
>   0)     7568      16   mempool_alloc_slab+0x16/0x20
>   1)     7552     144   mempool_alloc+0x65/0x140
>   2)     7408      96   get_request+0x124/0x370
>   3)     7312     144   get_request_wait+0x29/0x1b0
>   4)     7168      96   __make_request+0x9b/0x490
>   5)     7072     208   generic_make_request+0x3df/0x4d0
>   6)     6864      80   submit_bio+0x7c/0x100
>   7)     6784      96   _xfs_buf_ioapply+0x128/0x2c0 [xfs]
>   8)     6688      48   xfs_buf_iorequest+0x75/0xd0 [xfs]
>   9)     6640      32   _xfs_buf_read+0x36/0x70 [xfs]
>  10)     6608      48   xfs_buf_read+0xda/0x110 [xfs]
>  11)     6560      80   xfs_trans_read_buf+0x2a7/0x410 [xfs]
>  12)     6480      80   xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
>  13)     6400      80   xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
>  14)     6320     176   xfs_btree_lookup+0xd7/0x490 [xfs]
>  15)     6144      16   xfs_alloc_lookup_eq+0x19/0x20 [xfs]
>  16)     6128      96   xfs_alloc_fixup_trees+0xee/0x350 [xfs]
>  17)     6032     144   xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
>  18)     5888      32   xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
>  19)     5856      96   xfs_alloc_vextent+0x49f/0x630 [xfs]
>  20)     5760     160   xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
>  21)     5600     208   xfs_btree_split+0xb3/0x6a0 [xfs]
>  22)     5392      96   xfs_btree_make_block_unfull+0x151/0x190 [xfs]
>  23)     5296     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
>  24)     5072     128   xfs_btree_insert+0x86/0x180 [xfs]
>  25)     4944     352   xfs_bmap_add_extent_delay_real+0x41e/0x1660 [xfs]
>  26)     4592     208   xfs_bmap_add_extent+0x41c/0x450 [xfs]
>  27)     4384     448   xfs_bmapi+0x982/0x1200 [xfs]
>  28)     3936     256   xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
>  29)     3680     208   xfs_iomap+0x3d8/0x410 [xfs]
>  30)     3472      32   xfs_map_blocks+0x2c/0x30 [xfs]
>  31)     3440     256   xfs_page_state_convert+0x443/0x730 [xfs]
>  32)     3184      64   xfs_vm_writepage+0xab/0x160 [xfs]
>  33)     3120     384   shrink_page_list+0x65e/0x840
>  34)     2736     528   shrink_zone+0x63f/0xe10
>  35)     2208     112   do_try_to_free_pages+0xc2/0x3c0
>  36)     2096     128   try_to_free_pages+0x77/0x80
>  37)     1968     240   __alloc_pages_nodemask+0x3e4/0x710
>  38)     1728      48   alloc_pages_current+0x8c/0xe0
>  39)     1680      16   __get_free_pages+0xe/0x50
>  40)     1664      48   __pollwait+0xca/0x110
>  41)     1616      32   unix_poll+0x28/0xc0
>  42)     1584      16   sock_poll+0x1d/0x20
>  43)     1568     912   do_select+0x3d6/0x700
>  44)      656     416   core_sys_select+0x18c/0x2c0
>  45)      240     112   sys_select+0x4f/0x110
>  46)      128     128   system_call_fastpath+0x16/0x1b

--
Dave Chinner
david@fromorbit.com
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64 @ 2010-04-07 23:43 ` Dave Chinner 0 siblings, 0 replies; 43+ messages in thread From: Dave Chinner @ 2010-04-07 23:43 UTC (permalink / raw) To: John Berthels; +Cc: Nick Gregory, xfs, linux-kernel, Rob Sanderson [added linux-mm] On Wed, Apr 07, 2010 at 04:57:11PM +0100, John Berthels wrote: > Dave Chinner wrote: > >I'm not seeing stacks deeper than about 5.6k on XFS under heavy write > >loads. That's nowhere near blowing an 8k stack, so there must be > >something special about what you are doing. Can you post the stack > >traces that are being generated for the deepest stack generated - > >/sys/kernel/debug/tracing/stack_trace should contain it. > Appended below. That doesn't seem to reach 8192 but the box it's > from has logged: > > [74649.579386] apache2 used greatest stack depth: 7024 bytes left > > full dmesg (gzipped) attached. > >What is generating the write load? > > WebDAV PUTs in a modified mogilefs cluster, running > apache-mpm-worker (threaded) as the DAV server. The write load is a > mix of internet-upload speed writers trickling files up and some > local fast replicators copying from elsewhere in the cluster. mpm > worker cfg is: > > ServerLimit 20 > StartServers 5 > MaxClients 300 > MinSpareThreads 25 > MaxSpareThreads 75 > ThreadsPerChild 30 > MaxRequestsPerChild 0 > > File sizes are a mix of small to large (4GB+). Each disk is exported > as a mogile device, so it's possible for mogile to pound a single > disk with lots of write activity (if the random number generator > decides to put lots of files on that device at the same time). > > We're also seeing occasional slowdowns + high load avg (up to ~300, > i.e. MaxClients) with a corresponding number of threads in D state. > (This slowdown + high load avg seems to correlate with what would > have previously caused a panic on the THREAD_ORDER 1, but not 100% > sure). 
> > As you can see from the dmesg, this trips the "task xxx blocked for > more than 120 seconds." on some of the threads. > > Don't know if that's related to the stack issue or to be expected > under the load. It looks to be caused by direct memory reclaim trying to clean pages with a significant amount of stack already in use. basically there is not enough stack space left for the XFS ->writepage path to execute in. I can't see any fast fix for this occurring, so you are probably best to run with a larger stack for the moment. As it is, I don't think direct memory reclim should be cleaning dirty file pages - it should be leaving that to the writeback threads (which are far more efficient at it) or, as a last resort, kswapd. Direct memory reclaim is invoked with an unknown amount of stack already in use, so there is never any guarantee that there is enough stack space left to enter the ->writepage path of any filesystem. MM-folk - have there been any changes recently to writeback of pages from direct reclaim that may have caused this, or have we just been lucky for a really long time? Cheers, Dave. 
>        Depth    Size   Location    (47 entries)
>        -----    ----   --------
>   0)     7568      16   mempool_alloc_slab+0x16/0x20
>   1)     7552     144   mempool_alloc+0x65/0x140
>   2)     7408      96   get_request+0x124/0x370
>   3)     7312     144   get_request_wait+0x29/0x1b0
>   4)     7168      96   __make_request+0x9b/0x490
>   5)     7072     208   generic_make_request+0x3df/0x4d0
>   6)     6864      80   submit_bio+0x7c/0x100
>   7)     6784      96   _xfs_buf_ioapply+0x128/0x2c0 [xfs]
>   8)     6688      48   xfs_buf_iorequest+0x75/0xd0 [xfs]
>   9)     6640      32   _xfs_buf_read+0x36/0x70 [xfs]
>  10)     6608      48   xfs_buf_read+0xda/0x110 [xfs]
>  11)     6560      80   xfs_trans_read_buf+0x2a7/0x410 [xfs]
>  12)     6480      80   xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
>  13)     6400      80   xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
>  14)     6320     176   xfs_btree_lookup+0xd7/0x490 [xfs]
>  15)     6144      16   xfs_alloc_lookup_eq+0x19/0x20 [xfs]
>  16)     6128      96   xfs_alloc_fixup_trees+0xee/0x350 [xfs]
>  17)     6032     144   xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
>  18)     5888      32   xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
>  19)     5856      96   xfs_alloc_vextent+0x49f/0x630 [xfs]
>  20)     5760     160   xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
>  21)     5600     208   xfs_btree_split+0xb3/0x6a0 [xfs]
>  22)     5392      96   xfs_btree_make_block_unfull+0x151/0x190 [xfs]
>  23)     5296     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
>  24)     5072     128   xfs_btree_insert+0x86/0x180 [xfs]
>  25)     4944     352   xfs_bmap_add_extent_delay_real+0x41e/0x1660 [xfs]
>  26)     4592     208   xfs_bmap_add_extent+0x41c/0x450 [xfs]
>  27)     4384     448   xfs_bmapi+0x982/0x1200 [xfs]
>  28)     3936     256   xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
>  29)     3680     208   xfs_iomap+0x3d8/0x410 [xfs]
>  30)     3472      32   xfs_map_blocks+0x2c/0x30 [xfs]
>  31)     3440     256   xfs_page_state_convert+0x443/0x730 [xfs]
>  32)     3184      64   xfs_vm_writepage+0xab/0x160 [xfs]
>  33)     3120     384   shrink_page_list+0x65e/0x840
>  34)     2736     528   shrink_zone+0x63f/0xe10
>  35)     2208     112   do_try_to_free_pages+0xc2/0x3c0
>  36)     2096     128   try_to_free_pages+0x77/0x80
>  37)     1968     240   __alloc_pages_nodemask+0x3e4/0x710
>  38)     1728      48   alloc_pages_current+0x8c/0xe0
>  39)     1680      16   __get_free_pages+0xe/0x50
>  40)     1664      48   __pollwait+0xca/0x110
>  41)     1616      32   unix_poll+0x28/0xc0
>  42)     1584      16   sock_poll+0x1d/0x20
>  43)     1568     912   do_select+0x3d6/0x700
>  44)      656     416   core_sys_select+0x18c/0x2c0
>  45)      240     112   sys_select+0x4f/0x110
>  46)      128     128   system_call_fastpath+0x16/0x1b

--
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 43+ messages in thread
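[Editor's note: the trace above comes from the ftrace stack tracer that Dave points John at. As a minimal sketch of how to reproduce such a capture (assuming CONFIG_STACK_TRACER=y and debugfs mounted at /sys/kernel/debug, neither of which is stated in the thread), the tracer can be enabled and read like this:]

```shell
#!/bin/sh
# Sketch: enable the ftrace stack tracer and report the deepest kernel
# stack seen so far. Assumes CONFIG_STACK_TRACER=y and debugfs mounted
# at /sys/kernel/debug; degrades gracefully when either is missing.
TRACING=/sys/kernel/debug/tracing

# Turn sampling on (needs root; silently skipped otherwise).
if [ -w /proc/sys/kernel/stack_tracer_enabled ]; then
    echo 1 > /proc/sys/kernel/stack_tracer_enabled
fi

if [ -r "$TRACING/stack_max_size" ]; then
    result="deepest stack seen: $(cat "$TRACING/stack_max_size") bytes"
else
    result="stack tracer not available (need CONFIG_STACK_TRACER and debugfs)"
fi
echo "$result"

# Per-frame breakdown, in the same format as the trace quoted above.
if [ -r "$TRACING/stack_trace" ]; then
    cat "$TRACING/stack_trace"
fi
```

The `used greatest stack depth:` dmesg lines quoted in the thread come from a separate mechanism (CONFIG_DEBUG_STACK_USAGE); the stack tracer gives the per-frame attribution.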
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64
  2010-04-07 23:43 ` Dave Chinner
@ 2010-04-08  3:03 ` Dave Chinner
  0 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2010-04-08 3:03 UTC (permalink / raw)
To: John Berthels; +Cc: linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

On Thu, Apr 08, 2010 at 09:43:41AM +1000, Dave Chinner wrote:
> [added linux-mm]

Now really added linux-mm. And there's a patch attached that stops
direct reclaim from writing back dirty pages - it seems to work fine
from some rough testing I've done. Perhaps you might want to give it
a spin on a test box, John?

> On Wed, Apr 07, 2010 at 04:57:11PM +0100, John Berthels wrote:
> > Dave Chinner wrote:
> > > I'm not seeing stacks deeper than about 5.6k on XFS under heavy write
> > > loads. That's nowhere near blowing an 8k stack, so there must be
> > > something special about what you are doing. Can you post the stack
> > > traces that are being generated for the deepest stack generated -
> > > /sys/kernel/debug/tracing/stack_trace should contain it.
> >
> > Appended below. That doesn't seem to reach 8192 but the box it's
> > from has logged:
> >
> > [74649.579386] apache2 used greatest stack depth: 7024 bytes left
> >
> > full dmesg (gzipped) attached.
> >
> > > What is generating the write load?
> >
> > WebDAV PUTs in a modified mogilefs cluster, running
> > apache-mpm-worker (threaded) as the DAV server. The write load is a
> > mix of internet-upload speed writers trickling files up and some
> > local fast replicators copying from elsewhere in the cluster. mpm
> > worker cfg is:
> >
> >     ServerLimit 20
> >     StartServers 5
> >     MaxClients 300
> >     MinSpareThreads 25
> >     MaxSpareThreads 75
> >     ThreadsPerChild 30
> >     MaxRequestsPerChild 0
> >
> > File sizes are a mix of small to large (4GB+). Each disk is exported
> > as a mogile device, so it's possible for mogile to pound a single
> > disk with lots of write activity (if the random number generator
> > decides to put lots of files on that device at the same time).
> >
> > We're also seeing occasional slowdowns + high load avg (up to ~300,
> > i.e. MaxClients) with a corresponding number of threads in D state.
> > (This slowdown + high load avg seems to correlate with what would
> > have previously caused a panic on the THREAD_ORDER 1, but not 100%
> > sure).
> >
> > As you can see from the dmesg, this trips the "task xxx blocked for
> > more than 120 seconds." warning on some of the threads.
> >
> > Don't know if that's related to the stack issue or to be expected
> > under the load.
>
> It looks to be caused by direct memory reclaim trying to clean pages
> with a significant amount of stack already in use. Basically, there
> is not enough stack space left for the XFS ->writepage path to
> execute in. I can't see any fast fix for this occurring, so you are
> probably best to run with a larger stack for the moment.
>
> As it is, I don't think direct memory reclaim should be cleaning
> dirty file pages - it should be leaving that to the writeback
> threads (which are far more efficient at it) or, as a
> last resort, kswapd. Direct memory reclaim is invoked with an
> unknown amount of stack already in use, so there is never any
> guarantee that there is enough stack space left to enter the
> ->writepage path of any filesystem.
>
> MM-folk - have there been any changes recently to writeback of
> pages from direct reclaim that may have caused this,
> or have we just been lucky for a really long time?
>
> Cheers,
>
> Dave.
> >        Depth    Size   Location    (47 entries)
> >        -----    ----   --------
> >   0)     7568      16   mempool_alloc_slab+0x16/0x20
> >   1)     7552     144   mempool_alloc+0x65/0x140
> >   2)     7408      96   get_request+0x124/0x370
> >   3)     7312     144   get_request_wait+0x29/0x1b0
> >   4)     7168      96   __make_request+0x9b/0x490
> >   5)     7072     208   generic_make_request+0x3df/0x4d0
> >   6)     6864      80   submit_bio+0x7c/0x100
> >   7)     6784      96   _xfs_buf_ioapply+0x128/0x2c0 [xfs]
> >   8)     6688      48   xfs_buf_iorequest+0x75/0xd0 [xfs]
> >   9)     6640      32   _xfs_buf_read+0x36/0x70 [xfs]
> >  10)     6608      48   xfs_buf_read+0xda/0x110 [xfs]
> >  11)     6560      80   xfs_trans_read_buf+0x2a7/0x410 [xfs]
> >  12)     6480      80   xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
> >  13)     6400      80   xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
> >  14)     6320     176   xfs_btree_lookup+0xd7/0x490 [xfs]
> >  15)     6144      16   xfs_alloc_lookup_eq+0x19/0x20 [xfs]
> >  16)     6128      96   xfs_alloc_fixup_trees+0xee/0x350 [xfs]
> >  17)     6032     144   xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
> >  18)     5888      32   xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
> >  19)     5856      96   xfs_alloc_vextent+0x49f/0x630 [xfs]
> >  20)     5760     160   xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
> >  21)     5600     208   xfs_btree_split+0xb3/0x6a0 [xfs]
> >  22)     5392      96   xfs_btree_make_block_unfull+0x151/0x190 [xfs]
> >  23)     5296     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
> >  24)     5072     128   xfs_btree_insert+0x86/0x180 [xfs]
> >  25)     4944     352   xfs_bmap_add_extent_delay_real+0x41e/0x1660 [xfs]
> >  26)     4592     208   xfs_bmap_add_extent+0x41c/0x450 [xfs]
> >  27)     4384     448   xfs_bmapi+0x982/0x1200 [xfs]
> >  28)     3936     256   xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
> >  29)     3680     208   xfs_iomap+0x3d8/0x410 [xfs]
> >  30)     3472      32   xfs_map_blocks+0x2c/0x30 [xfs]
> >  31)     3440     256   xfs_page_state_convert+0x443/0x730 [xfs]
> >  32)     3184      64   xfs_vm_writepage+0xab/0x160 [xfs]
> >  33)     3120     384   shrink_page_list+0x65e/0x840
> >  34)     2736     528   shrink_zone+0x63f/0xe10
> >  35)     2208     112   do_try_to_free_pages+0xc2/0x3c0
> >  36)     2096     128   try_to_free_pages+0x77/0x80
> >  37)     1968     240   __alloc_pages_nodemask+0x3e4/0x710
> >  38)     1728      48   alloc_pages_current+0x8c/0xe0
> >  39)     1680      16   __get_free_pages+0xe/0x50
> >  40)     1664      48   __pollwait+0xca/0x110
> >  41)     1616      32   unix_poll+0x28/0xc0
> >  42)     1584      16   sock_poll+0x1d/0x20
> >  43)     1568     912   do_select+0x3d6/0x700
> >  44)      656     416   core_sys_select+0x18c/0x2c0
> >  45)      240     112   sys_select+0x4f/0x110
> >  46)      128     128   system_call_fastpath+0x16/0x1b

--
Dave Chinner
david@fromorbit.com

mm: disallow direct reclaim page writeback

From: Dave Chinner <dchinner@redhat.com>

When we enter direct reclaim we may have used an arbitrary amount of
stack space, and hence entering the filesystem to do writeback can
then lead to stack overruns.

Writeback from direct reclaim is a bad idea, anyway. The background
flusher threads should be taking care of cleaning dirty pages, and
direct reclaim will kick them if they aren't already doing work. If
direct reclaim is also calling ->writepage, it will cause the IO
patterns from the background flusher threads to be upset by LRU-order
writeback from pageout(). Having competing sources of IO trying to
clean pages on the same backing device reduces throughput by
increasing the amount of seeks that the backing device has to do to
write back the pages.

Hence for direct reclaim we should not allow ->writepages to be
entered at all. Set up the relevant scan_control structures to
enforce this, and prevent sc->may_writepage from being set in other
places in the direct reclaim path in response to other events.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 mm/vmscan.c |   13 ++++++-------
 1 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f293372..3c194f4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1829,10 +1829,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	 * writeout. So in laptop mode, write out the whole world.
 	 */
 	writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
-	if (total_scanned > writeback_threshold) {
+	if (total_scanned > writeback_threshold)
 		wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
-		sc->may_writepage = 1;
-	}
 
 	/* Take a nap, wait for some writeback to complete */
 	if (!sc->hibernation_mode && sc->nr_scanned &&
@@ -1874,7 +1872,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 {
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
-		.may_writepage = !laptop_mode,
+		.may_writepage = 0,
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.may_unmap = 1,
 		.may_swap = 1,
@@ -1896,7 +1894,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 						struct zone *zone, int nid)
 {
 	struct scan_control sc = {
-		.may_writepage = !laptop_mode,
+		.may_writepage = 0,
 		.may_unmap = 1,
 		.may_swap = !noswap,
 		.swappiness = swappiness,
@@ -1929,7 +1927,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 {
 	struct zonelist *zonelist;
 	struct scan_control sc = {
-		.may_writepage = !laptop_mode,
+		.may_writepage = 0,
 		.may_unmap = 1,
 		.may_swap = !noswap,
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
@@ -2570,7 +2568,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	struct reclaim_state reclaim_state;
 	int priority;
 	struct scan_control sc = {
-		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
+		.may_writepage = (current_is_kswapd() &&
+				  (zone_reclaim_mode & RECLAIM_WRITE)),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
 		.may_swap = 1,
 		.nr_to_reclaim = max_t(unsigned long, nr_pages,

^ permalink raw reply related	[flat|nested] 43+ messages in thread
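[Editor's note: the net effect of the patch on the scan_control setup can be summarised as a truth table: every direct-reclaim path now starts with may_writepage off, and only zone reclaim, running as kswapd with RECLAIM_WRITE set, may write pages. A small model of that decision (plain shell, not kernel code; the function name is ours):]

```shell
#!/bin/sh
# Model of sc.may_writepage after the patch (a sketch, not kernel
# code). With the patch, try_to_free_pages() and the memcg variants
# initialise may_writepage to 0 unconditionally; __zone_reclaim()
# allows it only for kswapd with RECLAIM_WRITE set in
# zone_reclaim_mode.
may_writepage() {
    caller=$1          # "direct" or "kswapd"
    reclaim_write=$2   # 1 if zone_reclaim_mode & RECLAIM_WRITE
    if [ "$caller" = "kswapd" ] && [ "$reclaim_write" = "1" ]; then
        echo 1
    else
        echo 0
    fi
}

echo "direct reclaim, RECLAIM_WRITE set:   $(may_writepage direct 1)"
echo "kswapd reclaim, RECLAIM_WRITE set:   $(may_writepage kswapd 1)"
echo "kswapd reclaim, RECLAIM_WRITE clear: $(may_writepage kswapd 0)"
```

In other words, a task that blows its stack allocating memory can no longer be pushed into a filesystem's ->writepage; only kswapd, which always starts from a shallow stack, may still write.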
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64
  2010-04-08  3:03 ` Dave Chinner
@ 2010-04-08 12:16 ` John Berthels
  0 siblings, 0 replies; 43+ messages in thread
From: John Berthels @ 2010-04-08 12:16 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

Dave Chinner wrote:
> On Thu, Apr 08, 2010 at 09:43:41AM +1000, Dave Chinner wrote:
>
> And there's a patch attached that stops direct reclaim from writing
> back dirty pages - it seems to work fine from some rough testing
> I've done. Perhaps you might want to give it a spin on a
> test box, John?

Thanks very much for this. The patch is in and soaking on a
THREAD_ORDER 1 kernel (2.6.33.2 + patch + stack instrumentation), so
far so good, but it's early days. After about 2hrs of uptime:

$ dmesg | grep stack | tail -1
[   60.350766] apache2 used greatest stack depth: 2544 bytes left

(which tallies well with your 5 1/2Kbytes usage figure).

I'll reply again after it's been running long enough to draw
conclusions.

jb

^ permalink raw reply	[flat|nested] 43+ messages in thread
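[Editor's note: John tracks the soak test with `dmesg | grep stack | tail -1`. A slightly more complete helper in the same spirit (a sketch; the default log path is an assumption, and the message format is taken from the lines quoted in this thread) reports the least headroom seen across all such messages:]

```shell
#!/bin/sh
# Sketch: report the worst (smallest) "bytes left" figure among the
# kernel's "used greatest stack depth" messages. The default log path
# is an assumption; point LOG at saved dmesg output in practice.
least_headroom() {
    grep -o 'used greatest stack depth: [0-9]* bytes left' "$1" 2>/dev/null |
        awk '{ if (min == "" || $5 + 0 < min + 0) min = $5 } END { print min }'
}

LOG=${LOG:-/var/log/dmesg}
worst=$(least_headroom "$LOG")
if [ -n "$worst" ]; then
    echo "least headroom seen: $worst bytes left"
else
    echo "no stack depth messages found in $LOG"
fi
```

A falling minimum over a multi-day soak is the signal to watch for: it shows how close the workload's worst-case path comes to exhausting the 8k stack.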
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64 @ 2010-04-08 12:16 ` John Berthels 0 siblings, 0 replies; 43+ messages in thread From: John Berthels @ 2010-04-08 12:16 UTC (permalink / raw) To: Dave Chinner; +Cc: Nick Gregory, linux-mm, xfs, linux-kernel, Rob Sanderson Dave Chinner wrote: > On Thu, Apr 08, 2010 at 09:43:41AM +1000, Dave Chinner wrote: > > And there's a patch attached that stops direct reclaim from writing > back dirty pages - it seems to work fine from some rough testing > I've done. Perhaps you might want to give it a spin on a > test box, John? > Thanks very much for this. The patch is in and soaking on a THREAD_ORDER 1 kernel (2.6.33.2 + patch + stack instrumentation), so far so good, but it's early days. After about 2hrs of uptime: $ dmesg | grep stack | tail -1 [ 60.350766] apache2 used greatest stack depth: 2544 bytes left (which tallies well with your 5 1/2Kbytes usage figure). I'll reply again after it's been running long enough to draw conclusions. jb _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64
  From: John Berthels @ 2010-04-08 14:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

John Berthels wrote:
> I'll reply again after it's been running long enough to draw conclusions.

We're getting pretty close to the 8k stack on this box now. It's running
2.6.33.2 + your patch, with THREAD_ORDER 1, stack tracing and
CONFIG_LOCKDEP=y. (Sorry that LOCKDEP is on; please advise if that's going
to skew the figures and we'll restart the test systems with new kernels.)

This is significantly more than 5.6K, so does it show a potential problem?
Or is 720 bytes enough headroom?

jb

[ 4005.541869] apache2 used greatest stack depth: 2480 bytes left
[ 4005.541973] apache2 used greatest stack depth: 2240 bytes left
[ 4005.542070] apache2 used greatest stack depth: 1936 bytes left
[ 4005.542614] apache2 used greatest stack depth: 1616 bytes left
[ 5531.406529] apache2 used greatest stack depth: 720 bytes left

$ cat /sys/kernel/debug/tracing/stack_trace
        Depth    Size   Location    (55 entries)
        -----    ----   --------
  0)     7440      48   add_partial+0x26/0x90
  1)     7392      64   __slab_free+0x1a9/0x380
  2)     7328      64   kmem_cache_free+0xb9/0x160
  3)     7264      16   free_buffer_head+0x25/0x50
  4)     7248      64   try_to_free_buffers+0x79/0xc0
  5)     7184     160   xfs_vm_releasepage+0xda/0x130 [xfs]
  6)     7024      16   try_to_release_page+0x33/0x60
  7)     7008     384   shrink_page_list+0x585/0x860
  8)     6624     528   shrink_zone+0x636/0xdc0
  9)     6096     112   do_try_to_free_pages+0xc2/0x3c0
 10)     5984     112   try_to_free_pages+0x64/0x70
 11)     5872     256   __alloc_pages_nodemask+0x3d2/0x710
 12)     5616      48   alloc_pages_current+0x8c/0xe0
 13)     5568      32   __page_cache_alloc+0x67/0x70
 14)     5536      80   find_or_create_page+0x50/0xb0
 15)     5456     160   _xfs_buf_lookup_pages+0x145/0x350 [xfs]
 16)     5296      64   xfs_buf_get+0x74/0x1d0 [xfs]
 17)     5232      48   xfs_buf_read+0x2f/0x110 [xfs]
 18)     5184      80   xfs_trans_read_buf+0x2bf/0x430 [xfs]
 19)     5104      80   xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
 20)     5024      80   xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
 21)     4944     176   xfs_btree_lookup+0xd7/0x490 [xfs]
 22)     4768      16   xfs_alloc_lookup_ge+0x1c/0x20 [xfs]
 23)     4752     144   xfs_alloc_ag_vextent_near+0x58/0xb30 [xfs]
 24)     4608      32   xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
 25)     4576      96   xfs_alloc_vextent+0x49f/0x630 [xfs]
 26)     4480     160   xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
 27)     4320     208   xfs_btree_split+0xb3/0x6a0 [xfs]
 28)     4112      96   xfs_btree_make_block_unfull+0x151/0x190 [xfs]
 29)     4016     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
 30)     3792     128   xfs_btree_insert+0x86/0x180 [xfs]
 31)     3664     352   xfs_bmap_add_extent_delay_real+0x564/0x1670 [xfs]
 32)     3312     208   xfs_bmap_add_extent+0x41c/0x450 [xfs]
 33)     3104     448   xfs_bmapi+0x982/0x1200 [xfs]
 34)     2656     256   xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
 35)     2400     208   xfs_iomap+0x3d8/0x410 [xfs]
 36)     2192      32   xfs_map_blocks+0x2c/0x30 [xfs]
 37)     2160     256   xfs_page_state_convert+0x443/0x730 [xfs]
 38)     1904      64   xfs_vm_writepage+0xab/0x160 [xfs]
 39)     1840      32   __writepage+0x1a/0x60
 40)     1808     288   write_cache_pages+0x1f7/0x400
 41)     1520      16   generic_writepages+0x27/0x30
 42)     1504      48   xfs_vm_writepages+0x5a/0x70 [xfs]
 43)     1456      16   do_writepages+0x24/0x40
 44)     1440      64   writeback_single_inode+0xf1/0x3e0
 45)     1376     128   writeback_inodes_wb+0x31e/0x510
 46)     1248      16   writeback_inodes_wbc+0x1e/0x20
 47)     1232     224   balance_dirty_pages_ratelimited_nr+0x277/0x410
 48)     1008     192   generic_file_buffered_write+0x19b/0x240
 49)      816     288   xfs_write+0x849/0x930 [xfs]
 50)      528      16   xfs_file_aio_write+0x5b/0x70 [xfs]
 51)      512     272   do_sync_write+0xd1/0x120
 52)      240      48   vfs_write+0xcb/0x1a0
 53)      192      64   sys_write+0x55/0x90
 54)      128     128   system_call_fastpath+0x16/0x1b
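In a stack_trace dump, each frame's Depth should equal the next frame's Depth plus its own Size, so a trace like the one above can be sanity-checked mechanically. A sketch under that assumption, with the top few frames embedded as sample input (the /tmp path is illustrative):

```shell
# Check that Depth decreases by exactly Size from one frame to the next.
cat > /tmp/stack_trace_sample.txt <<'EOF'
0) 7440 48 add_partial+0x26/0x90
1) 7392 64 __slab_free+0x1a9/0x380
2) 7328 64 kmem_cache_free+0xb9/0x160
3) 7264 16 free_buffer_head+0x25/0x50
4) 7248 64 try_to_free_buffers+0x79/0xc0
5) 7184 160 xfs_vm_releasepage+0xda/0x130
EOF
awk 'BEGIN { ok = 1 }
     { d[NR] = $2; s[NR] = $3; n = NR }
     END {
       for (i = 1; i < n; i++)
         if (d[i] - s[i] != d[i+1]) ok = 0
       if (ok) print "consistent"; else print "inconsistent"
     }' /tmp/stack_trace_sample.txt    # prints: consistent
```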
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64
  From: John Berthels @ 2010-04-08 16:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

John Berthels wrote:
> John Berthels wrote:
>> I'll reply again after it's been running long enough to draw
>> conclusions.

The box with patch + THREAD_ORDER 1 + LOCKDEP went down (with no further
logging retrievable via /var/log/messages or netconsole). We're loading up
a 2.6.33.2 + patch + THREAD_ORDER 2 (no LOCKDEP) kernel to get better
information on whether we are still blowing the 8k limit with the patch in
place.

jb
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64
  From: Dave Chinner @ 2010-04-08 23:38 UTC (permalink / raw)
  To: John Berthels; +Cc: linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

On Thu, Apr 08, 2010 at 03:47:54PM +0100, John Berthels wrote:
> John Berthels wrote:
> > I'll reply again after it's been running long enough to draw conclusions.
> We're getting pretty close to the 8k stack on this box now. It's
> running 2.6.33.2 + your patch, with THREAD_ORDER 1, stack tracing
> and CONFIG_LOCKDEP=y. (Sorry that LOCKDEP is on; please advise if
> that's going to skew the figures and we'll restart the test systems
> with new kernels.)
>
> This is significantly more than 5.6K, so does it show a potential
> problem? Or is 720 bytes enough headroom?
>
> jb
>
> [ 4005.541869] apache2 used greatest stack depth: 2480 bytes left
> [ 4005.541973] apache2 used greatest stack depth: 2240 bytes left
> [ 4005.542070] apache2 used greatest stack depth: 1936 bytes left
> [ 4005.542614] apache2 used greatest stack depth: 1616 bytes left
> [ 5531.406529] apache2 used greatest stack depth: 720 bytes left
>
> $ cat /sys/kernel/debug/tracing/stack_trace
>         Depth    Size   Location    (55 entries)
>         -----    ----   --------
>   0)     7440      48   add_partial+0x26/0x90
>   1)     7392      64   __slab_free+0x1a9/0x380
>   2)     7328      64   kmem_cache_free+0xb9/0x160
>   3)     7264      16   free_buffer_head+0x25/0x50
>   4)     7248      64   try_to_free_buffers+0x79/0xc0
>   5)     7184     160   xfs_vm_releasepage+0xda/0x130 [xfs]
>   6)     7024      16   try_to_release_page+0x33/0x60
>   7)     7008     384   shrink_page_list+0x585/0x860
>   8)     6624     528   shrink_zone+0x636/0xdc0
>   9)     6096     112   do_try_to_free_pages+0xc2/0x3c0
>  10)     5984     112   try_to_free_pages+0x64/0x70
>  11)     5872     256   __alloc_pages_nodemask+0x3d2/0x710
>  12)     5616      48   alloc_pages_current+0x8c/0xe0
>  13)     5568      32   __page_cache_alloc+0x67/0x70
>  14)     5536      80   find_or_create_page+0x50/0xb0
>  15)     5456     160   _xfs_buf_lookup_pages+0x145/0x350 [xfs]
>  16)     5296      64   xfs_buf_get+0x74/0x1d0 [xfs]
>  17)     5232      48   xfs_buf_read+0x2f/0x110 [xfs]
>  18)     5184      80   xfs_trans_read_buf+0x2bf/0x430 [xfs]

We're entering memory reclaim with almost 6k of stack already in use. If we
get down into the IO layer and then have to do a memory reclaim, we'll have
even less stack to work with. It looks like memory allocation needs at
least 2KB of stack to work with now, so if we enter reclaim anywhere near
the top of the stack we can blow it...

Basically this trace is telling us the stack we have to work with is:

	2KB	memory allocation
	4KB	page writeback
	2KB	write foreground throttling path

So effectively the storage subsystem (NFS, filesystem, DM, MD, device
drivers) has about 4K of stack to work in now. That seems to be a lot less
than the last time I looked at this, and we've been really careful not to
increase XFS's stack usage for quite some time.

Hence I'm not sure exactly what to do about this, John. I can't really do
much about the stack footprint of XFS, as all the low-hanging fruit has
already been trimmed. Even if I convert the foreground throttling not to
issue IO, the background flush threads still have roughly the same stack
usage, so a memory allocation and reclaim in the wrong place could still
blow the stack....

I'll have to have a bit of a think on this one - if you could provide
further stack traces as they get deeper (esp. if they go past 8k), that
would be really handy.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
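The budget in that analysis reduces to simple arithmetic: on an 8KB (THREAD_ORDER 1) x86-64 stack, subtracting the allocation/reclaim entry cost and the foreground throttling path leaves the quoted figure for the writeback/storage layers. A sketch of that arithmetic, with all figures approximate as in the analysis:

```shell
# Rough stack budget from the analysis above (all figures approximate).
STACK=$((8 * 1024))       # THREAD_ORDER 1 stack on x86-64
ALLOC=$((2 * 1024))       # memory allocation + direct reclaim entry
THROTTLE=$((2 * 1024))    # write() foreground throttling path
echo "left for writeback/storage: $((STACK - ALLOC - THROTTLE)) bytes"
# prints: left for writeback/storage: 4096 bytes
```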
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64
  From: Chris Mason @ 2010-04-09 11:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: John Berthels, linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

On Fri, Apr 09, 2010 at 09:38:37AM +1000, Dave Chinner wrote:
> On Thu, Apr 08, 2010 at 03:47:54PM +0100, John Berthels wrote:
> > John Berthels wrote:
> > > I'll reply again after it's been running long enough to draw conclusions.
> > We're getting pretty close to the 8k stack on this box now. It's
> > running 2.6.33.2 + your patch, with THREAD_ORDER 1, stack tracing
> > and CONFIG_LOCKDEP=y. (Sorry that LOCKDEP is on; please advise if
> > that's going to skew the figures and we'll restart the test systems
> > with new kernels.)
> >
> > This is significantly more than 5.6K, so does it show a potential
> > problem? Or is 720 bytes enough headroom?
> >
> > jb
> >
> > [ 4005.541869] apache2 used greatest stack depth: 2480 bytes left
> > [ 4005.541973] apache2 used greatest stack depth: 2240 bytes left
> > [ 4005.542070] apache2 used greatest stack depth: 1936 bytes left
> > [ 4005.542614] apache2 used greatest stack depth: 1616 bytes left
> > [ 5531.406529] apache2 used greatest stack depth: 720 bytes left
> >
> > $ cat /sys/kernel/debug/tracing/stack_trace
> >         Depth    Size   Location    (55 entries)
> >         -----    ----   --------
> >   0)     7440      48   add_partial+0x26/0x90
> >   1)     7392      64   __slab_free+0x1a9/0x380
> >   2)     7328      64   kmem_cache_free+0xb9/0x160
> >   3)     7264      16   free_buffer_head+0x25/0x50
> >   4)     7248      64   try_to_free_buffers+0x79/0xc0
> >   5)     7184     160   xfs_vm_releasepage+0xda/0x130 [xfs]
> >   6)     7024      16   try_to_release_page+0x33/0x60
> >   7)     7008     384   shrink_page_list+0x585/0x860
> >   8)     6624     528   shrink_zone+0x636/0xdc0
> >   9)     6096     112   do_try_to_free_pages+0xc2/0x3c0
> >  10)     5984     112   try_to_free_pages+0x64/0x70
> >  11)     5872     256   __alloc_pages_nodemask+0x3d2/0x710
> >  12)     5616      48   alloc_pages_current+0x8c/0xe0
> >  13)     5568      32   __page_cache_alloc+0x67/0x70
> >  14)     5536      80   find_or_create_page+0x50/0xb0
> >  15)     5456     160   _xfs_buf_lookup_pages+0x145/0x350 [xfs]
> >  16)     5296      64   xfs_buf_get+0x74/0x1d0 [xfs]
> >  17)     5232      48   xfs_buf_read+0x2f/0x110 [xfs]
> >  18)     5184      80   xfs_trans_read_buf+0x2bf/0x430 [xfs]
>
> We're entering memory reclaim with almost 6k of stack already in
> use. If we get down into the IO layer and then have to do a memory
> reclaim, we'll have even less stack to work with. It looks like
> memory allocation needs at least 2KB of stack to work with now,
> so if we enter reclaim anywhere near the top of the stack we can
> blow it...

shrink_zone on my box isn't 500 bytes, but let's try the easy stuff first.
This is against .34; if you have any trouble applying to .32, just add the
word noinline after the word static on the function definitions. This makes
shrink_zone disappear from my check_stack.pl output.

Basically I think the compiler is inlining the shrink_active_list and
shrink_inactive_list code into shrink_zone.

-chris

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79c8098..c70593e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -620,7 +620,7 @@ static enum page_references page_check_references(struct page *page,
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
-static unsigned long shrink_page_list(struct list_head *page_list,
+static noinline unsigned long shrink_page_list(struct list_head *page_list,
 					struct scan_control *sc,
 					enum pageout_io sync_writeback)
 {
@@ -1121,7 +1121,7 @@ static int too_many_isolated(struct zone *zone, int file,
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
  */
-static unsigned long shrink_inactive_list(unsigned long max_scan,
+static noinline unsigned long shrink_inactive_list(unsigned long max_scan,
 			struct zone *zone, struct scan_control *sc,
 			int priority, int file)
 {
@@ -1341,7 +1341,7 @@ static void move_active_pages_to_lru(struct zone *zone,
 	__count_vm_events(PGDEACTIVATE, pgmoved);
 }
 
-static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
+static noinline void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 			struct scan_control *sc, int priority, int file)
 {
 	unsigned long nr_taken;
@@ -1504,7 +1504,7 @@ static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
 		return inactive_anon_is_low(zone, sc);
 }
 
-static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
+static noinline unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 	struct zone *zone, struct scan_control *sc, int priority)
 {
 	int file = is_file_lru(lru);
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

From: Chris Mason @ 2010-04-09 11:38 UTC
To: Dave Chinner
Cc: John Berthels, linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

On Fri, Apr 09, 2010 at 09:38:37AM +1000, Dave Chinner wrote:
> On Thu, Apr 08, 2010 at 03:47:54PM +0100, John Berthels wrote:
> > John Berthels wrote:
> > > I'll reply again after it's been running long enough to draw conclusions.
> > We're getting pretty close on the 8k stack on this box now. It's
> > running 2.6.33.2 + your patch, with THREAD_ORDER 1, stack tracing
> > and CONFIG_LOCKDEP=y. (Sorry that LOCKDEP is on; please advise if
> > that's going to throw the figures and we'll restart the test systems
> > with new kernels.)
> >
> > This is significantly more than 5.6K, so it shows a potential
> > problem? Or is 720 bytes enough headroom?
> >
> > jb
> >
> > [ 4005.541869] apache2 used greatest stack depth: 2480 bytes left
> > [ 4005.541973] apache2 used greatest stack depth: 2240 bytes left
> > [ 4005.542070] apache2 used greatest stack depth: 1936 bytes left
> > [ 4005.542614] apache2 used greatest stack depth: 1616 bytes left
> > [ 5531.406529] apache2 used greatest stack depth: 720 bytes left
> >
> > $ cat /sys/kernel/debug/tracing/stack_trace
> >         Depth    Size   Location    (55 entries)
> >         -----    ----   --------
> >   0)     7440      48   add_partial+0x26/0x90
> >   1)     7392      64   __slab_free+0x1a9/0x380
> >   2)     7328      64   kmem_cache_free+0xb9/0x160
> >   3)     7264      16   free_buffer_head+0x25/0x50
> >   4)     7248      64   try_to_free_buffers+0x79/0xc0
> >   5)     7184     160   xfs_vm_releasepage+0xda/0x130 [xfs]
> >   6)     7024      16   try_to_release_page+0x33/0x60
> >   7)     7008     384   shrink_page_list+0x585/0x860
> >   8)     6624     528   shrink_zone+0x636/0xdc0
> >   9)     6096     112   do_try_to_free_pages+0xc2/0x3c0
> >  10)     5984     112   try_to_free_pages+0x64/0x70
> >  11)     5872     256   __alloc_pages_nodemask+0x3d2/0x710
> >  12)     5616      48   alloc_pages_current+0x8c/0xe0
> >  13)     5568      32   __page_cache_alloc+0x67/0x70
> >  14)     5536      80   find_or_create_page+0x50/0xb0
> >  15)     5456     160   _xfs_buf_lookup_pages+0x145/0x350 [xfs]
> >  16)     5296      64   xfs_buf_get+0x74/0x1d0 [xfs]
> >  17)     5232      48   xfs_buf_read+0x2f/0x110 [xfs]
> >  18)     5184      80   xfs_trans_read_buf+0x2bf/0x430 [xfs]
>
> We're entering memory reclaim with almost 6k of stack already in
> use. If we get down into the IO layer and then have to do a memory
> reclaim, then we'll have even less stack to work with. It looks like
> memory allocation needs at least 2KB of stack to work with now,
> so if we enter anywhere near the top of the stack we can blow it...

shrink_zone on my box isn't 500 bytes, but let's try the easy stuff
first. This is against .34; if you have any trouble applying to .32,
just add the word noinline after the word static on the function
definitions.

This makes shrink_zone disappear from my check_stack.pl output.
Basically I think the compiler is inlining the shrink_active_list and
shrink_inactive_list code into shrink_zone.

-chris

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79c8098..c70593e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -620,7 +620,7 @@ static enum page_references page_check_references(struct page *page,
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
-static unsigned long shrink_page_list(struct list_head *page_list,
+static noinline unsigned long shrink_page_list(struct list_head *page_list,
 					struct scan_control *sc,
 					enum pageout_io sync_writeback)
 {
@@ -1121,7 +1121,7 @@ static int too_many_isolated(struct zone *zone, int file,
 /*
  * shrink_inactive_list() is a helper for shrink_zone(). It returns the number
  * of reclaimed pages
  */
-static unsigned long shrink_inactive_list(unsigned long max_scan,
+static noinline unsigned long shrink_inactive_list(unsigned long max_scan,
 			struct zone *zone, struct scan_control *sc,
 			int priority, int file)
 {
@@ -1341,7 +1341,7 @@ static void move_active_pages_to_lru(struct zone *zone,
 	__count_vm_events(PGDEACTIVATE, pgmoved);
 }
 
-static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
+static noinline void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 			struct scan_control *sc, int priority, int file)
 {
 	unsigned long nr_taken;
@@ -1504,7 +1504,7 @@ static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
 	return inactive_anon_is_low(zone, sc);
 }
 
-static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
+static noinline unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 		struct zone *zone, struct scan_control *sc, int priority)
 {
 	int file = is_file_lru(lru);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

From: Eric Sandeen @ 2010-04-09 18:05 UTC
To: Chris Mason, Dave Chinner, John Berthels, linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

Chris Mason wrote:

> shrink_zone on my box isn't 500 bytes, but let's try the easy stuff
> first. This is against .34; if you have any trouble applying to .32,
> just add the word noinline after the word static on the function
> definitions.
>
> This makes shrink_zone disappear from my check_stack.pl output.
> Basically I think the compiler is inlining the shrink_active_list and
> shrink_inactive_list code into shrink_zone.
>
> -chris
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 79c8098..c70593e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -620,7 +620,7 @@ static enum page_references page_check_references(struct page *page,
>  /*
>   * shrink_page_list() returns the number of reclaimed pages
>   */
> -static unsigned long shrink_page_list(struct list_head *page_list,
> +static noinline unsigned long shrink_page_list(struct list_head *page_list,

FWIW akpm suggested that I add:

/*
 * Rather than using noinline to prevent stack consumption, use
 * noinline_for_stack instead.  For documentation reasons.
 */
#define noinline_for_stack noinline

so maybe for a formal submission that'd be good to use.

> 					struct scan_control *sc,
> 					enum pageout_io sync_writeback)
>  {
> @@ -1121,7 +1121,7 @@ static int too_many_isolated(struct zone *zone, int file,
>  /*
>   * shrink_inactive_list() is a helper for shrink_zone(). It returns the number
>   * of reclaimed pages
>   */
> -static unsigned long shrink_inactive_list(unsigned long max_scan,
> +static noinline unsigned long shrink_inactive_list(unsigned long max_scan,
> 			struct zone *zone, struct scan_control *sc,
> 			int priority, int file)
>  {
> @@ -1341,7 +1341,7 @@ static void move_active_pages_to_lru(struct zone *zone,
>  	__count_vm_events(PGDEACTIVATE, pgmoved);
>  }
>
> -static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
> +static noinline void shrink_active_list(unsigned long nr_pages, struct zone *zone,
> 			struct scan_control *sc, int priority, int file)
>  {
>  	unsigned long nr_taken;
> @@ -1504,7 +1504,7 @@ static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
>  	return inactive_anon_is_low(zone, sc);
>  }
>
> -static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
> +static noinline unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
> 		struct zone *zone, struct scan_control *sc, int priority)
>  {
>  	int file = is_file_lru(lru);
>
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

From: Chris Mason @ 2010-04-09 18:11 UTC
To: Eric Sandeen
Cc: Dave Chinner, John Berthels, linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

On Fri, Apr 09, 2010 at 01:05:05PM -0500, Eric Sandeen wrote:
> Chris Mason wrote:
>
> > shrink_zone on my box isn't 500 bytes, but let's try the easy stuff
> > first. This is against .34; if you have any trouble applying to .32,
> > just add the word noinline after the word static on the function
> > definitions.
> >
> > This makes shrink_zone disappear from my check_stack.pl output.
> > Basically I think the compiler is inlining the shrink_active_list and
> > shrink_inactive_list code into shrink_zone.
> >
> > -chris
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 79c8098..c70593e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -620,7 +620,7 @@ static enum page_references page_check_references(struct page *page,
> >  /*
> >   * shrink_page_list() returns the number of reclaimed pages
> >   */
> > -static unsigned long shrink_page_list(struct list_head *page_list,
> > +static noinline unsigned long shrink_page_list(struct list_head *page_list,
>
> FWIW akpm suggested that I add:
>
> /*
>  * Rather than using noinline to prevent stack consumption, use
>  * noinline_for_stack instead.  For documentation reasons.
>  */
> #define noinline_for_stack noinline
>
> so maybe for a formal submission that'd be good to use.

Oh yeah, I forgot about that one. If the patch actually helps we can
switch it.

-chris
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

From: Dave Chinner @ 2010-04-12 1:01 UTC
To: Chris Mason, Eric Sandeen, John Berthels, linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

On Fri, Apr 09, 2010 at 02:11:08PM -0400, Chris Mason wrote:
> On Fri, Apr 09, 2010 at 01:05:05PM -0500, Eric Sandeen wrote:
> > Chris Mason wrote:
> >
> > > shrink_zone on my box isn't 500 bytes, but let's try the easy stuff
> > > first. This is against .34; if you have any trouble applying to .32,
> > > just add the word noinline after the word static on the function
> > > definitions.
> > >
> > > This makes shrink_zone disappear from my check_stack.pl output.
> > > Basically I think the compiler is inlining the shrink_active_list and
> > > shrink_inactive_list code into shrink_zone.
> > >
> > > -chris
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 79c8098..c70593e 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -620,7 +620,7 @@ static enum page_references page_check_references(struct page *page,
> > >  /*
> > >   * shrink_page_list() returns the number of reclaimed pages
> > >   */
> > > -static unsigned long shrink_page_list(struct list_head *page_list,
> > > +static noinline unsigned long shrink_page_list(struct list_head *page_list,
> >
> > FWIW akpm suggested that I add:
> >
> > /*
> >  * Rather than using noinline to prevent stack consumption, use
> >  * noinline_for_stack instead.  For documentation reasons.
> >  */
> > #define noinline_for_stack noinline
> >
> > so maybe for a formal submission that'd be good to use.
>
> Oh yeah, I forgot about that one. If the patch actually helps we can
> switch it.

Well, given that the largest stack overflow reported was about 800
bytes, I don't think it's enough. All the fat has been trimmed from
XFS long ago, and there isn't that much in the generic code paths to
trim. And if we consider that this isn't including a significant
storage subsystem (i.e. NFS on top and stacked DM+MD+FC below), then
trimming a few hundred bytes is not enough to prevent an 8k stack
being blown sky high.

That is why I was saying I'm not sure what the best way to solve the
problem is - I've got a couple of ideas for fixing the problem in XFS
once and for all, but I'm not sure if they will fly or not yet, let
alone written any code....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
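[Editorial note: the per-function frame sizes being compared in the thread (e.g. whether shrink_zone really uses ~500 bytes) come from the kernel's scripts/checkstack.pl, the tool Chris refers to as check_stack.pl. A hedged sketch of how it is typically invoked — it assumes you are at the top of a built kernel tree with an uncompressed vmlinux, and falls back to a message otherwise:]

```shell
# scripts/checkstack.pl ranks functions by static stack usage; it reads
# disassembly on stdin.  Paths here are the conventional in-tree ones.
if [ -f vmlinux ] && [ -f scripts/checkstack.pl ]; then
    result=$(objdump -d vmlinux | perl scripts/checkstack.pl x86_64 | head -20)
else
    result="no built kernel tree here; run from the top of a built tree"
fi
printf '%s\n' "$result"
```

The output lists the worst stack offenders first, which is how one checks whether adding noinline moved a helper's frame out of shrink_zone's total.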
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64

From: John Berthels @ 2010-04-13 9:51 UTC
To: Chris Mason, Dave Chinner, John Berthels, linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

[-- Attachment #1: Type: text/plain, Size: 5328 bytes --]

Chris Mason wrote:
> shrink_zone on my box isn't 500 bytes, but let's try the easy stuff
> first. This is against .34; if you have any trouble applying to .32,
> just add the word noinline after the word static on the function
> definitions.

Hi Chris,

Thanks for this. We've been soaking it for a while and get the stack
trace below (which is still >8k), and which still shows shrink_zone
using 528 bytes. I find it odd that the shrink_zone stack usage
differs between our systems.

This is a stock 2.6.33.2 kernel, x86_64 arch (plus your patch + Dave
Chinner's patch), built using Ubuntu make-kpkg with gcc (Ubuntu
4.3.3-5ubuntu4) 4.3.3 (.vmscan.o.cmd with the full build options is
below; gzipped .config attached). Can you see any difference between
your system and ours which might explain the discrepancy? I note -g
and -pg in there. (Does -pg have any stack overhead? It seems to be
enabled in Ubuntu release kernels.)

regards,

jb

mm/.vmscan.o.cmd:

cmd_mm/vmscan.o := gcc -Wp,-MD,mm/.vmscan.o.d -nostdinc -isystem /usr/lib/gcc/x86_64-linux-gnu/4.3.3/include -I/usr/local/src/kern/linux-2.6.33.2/arch/x86/include -Iinclude -include include/generated/autoconf.h -D__KERNEL__ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Wno-format-security -fno-delete-null-pointer-checks -O2 -m64 -mtune=generic -mno-red-zone -mcmodel=kernel -funit-at-a-time -maccumulate-outgoing-args -fstack-protector -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -fno-omit-frame-pointer -fno-optimize-sibling-calls -g -pg -Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow -D"KBUILD_STR(s)=\#s" -D"KBUILD_BASENAME=KBUILD_STR(vmscan)" -D"KBUILD_MODNAME=KBUILD_STR(vmscan)" -c -o mm/.tmp_vmscan.o mm/vmscan.c

Apr 12 22:06:35 nas17 kernel: [36346.599076] apache2 used greatest stack depth: 7904 bytes left

        Depth    Size   Location    (56 entries)
        -----    ----   --------
  0)     7904      48   __call_rcu+0x67/0x190
  1)     7856      16   call_rcu_sched+0x15/0x20
  2)     7840      16   call_rcu+0xe/0x10
  3)     7824     272   radix_tree_delete+0x159/0x2e0
  4)     7552      32   __remove_from_page_cache+0x21/0x110
  5)     7520      64   __remove_mapping+0xe8/0x130
  6)     7456     384   shrink_page_list+0x400/0x860
  7)     7072     528   shrink_zone+0x636/0xdc0
  8)     6544     112   do_try_to_free_pages+0xc2/0x3c0
  9)     6432     112   try_to_free_pages+0x64/0x70
 10)     6320     256   __alloc_pages_nodemask+0x3d2/0x710
 11)     6064      48   alloc_pages_current+0x8c/0xe0
 12)     6016      32   __page_cache_alloc+0x67/0x70
 13)     5984      80   find_or_create_page+0x50/0xb0
 14)     5904     160   _xfs_buf_lookup_pages+0x145/0x350 [xfs]
 15)     5744      64   xfs_buf_get+0x74/0x1d0 [xfs]
 16)     5680      48   xfs_buf_read+0x2f/0x110 [xfs]
 17)     5632      80   xfs_trans_read_buf+0x2bf/0x430 [xfs]
 18)     5552      80   xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
 19)     5472     176   xfs_btree_rshift+0xd7/0x530 [xfs]
 20)     5296      96   xfs_btree_make_block_unfull+0x5b/0x190 [xfs]
 21)     5200     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
 22)     4976     128   xfs_btree_insert+0x86/0x180 [xfs]
 23)     4848      96   xfs_alloc_fixup_trees+0x1fa/0x350 [xfs]
 24)     4752     144   xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
 25)     4608      32   xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
 26)     4576      96   xfs_alloc_vextent+0x49f/0x630 [xfs]
 27)     4480     160   xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
 28)     4320     208   xfs_btree_split+0xb3/0x6a0 [xfs]
 29)     4112      96   xfs_btree_make_block_unfull+0x151/0x190 [xfs]
 30)     4016     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
 31)     3792     128   xfs_btree_insert+0x86/0x180 [xfs]
 32)     3664     352   xfs_bmap_add_extent_delay_real+0x41e/0x1670 [xfs]
 33)     3312     208   xfs_bmap_add_extent+0x41c/0x450 [xfs]
 34)     3104     448   xfs_bmapi+0x982/0x1200 [xfs]
 35)     2656     256   xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
 36)     2400     208   xfs_iomap+0x3d8/0x410 [xfs]
 37)     2192      32   xfs_map_blocks+0x2c/0x30 [xfs]
 38)     2160     256   xfs_page_state_convert+0x443/0x730 [xfs]
 39)     1904      64   xfs_vm_writepage+0xab/0x160 [xfs]
 40)     1840      32   __writepage+0x1a/0x60
 41)     1808     288   write_cache_pages+0x1f7/0x400
 42)     1520      16   generic_writepages+0x27/0x30
 43)     1504      48   xfs_vm_writepages+0x5a/0x70 [xfs]
 44)     1456      16   do_writepages+0x24/0x40
 45)     1440      64   writeback_single_inode+0xf1/0x3e0
 46)     1376     128   writeback_inodes_wb+0x31e/0x510
 47)     1248      16   writeback_inodes_wbc+0x1e/0x20
 48)     1232     224   balance_dirty_pages_ratelimited_nr+0x277/0x410
 49)     1008     192   generic_file_buffered_write+0x19b/0x240
 50)      816     288   xfs_write+0x849/0x930 [xfs]
 51)      528      16   xfs_file_aio_write+0x5b/0x70 [xfs]
 52)      512     272   do_sync_write+0xd1/0x120
 53)      240      48   vfs_write+0xcb/0x1a0
 54)      192      64   sys_write+0x55/0x90
 55)      128     128   system_call_fastpath+0x16/0x1b

[-- Attachment #2: config.gz --]
[-- Type: application/x-gzip, Size: 28595 bytes --]
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64
  From: John Berthels @ 2010-04-13  9:51 UTC
  To: Chris Mason, Dave Chinner, John Berthels, linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

[-- Attachment #1: Type: text/plain, Size: 5328 bytes --]

Chris Mason wrote:
> shrink_zone on my box isn't 500 bytes, but lets try the easy stuff
> first. This is against .34, if you have any trouble applying to .32,
> just add the word noinline after the word static on the function
> definitions.

Hi Chris,

Thanks for this. We've been soaking it for a while and get the stack trace below (which is still >8k), and which still shows shrink_zone at 528 bytes. I find it odd that the shrink_zone stack usage is different on our systems.

This is a stock 2.6.33.2 kernel, x86_64 arch (plus your patch and Dave Chinner's patch), built using ubuntu make-kpkg with gcc (Ubuntu 4.3.3-5ubuntu4) 4.3.3 (.vmscan.o.cmd with the full build options is below; the gzipped .config is attached). Can you see any difference between your system and ours which might explain the discrepancy? I note -g and -pg in there. (Does -pg have any stack overhead? It seems to be enabled in ubuntu release kernels.)
regards,

jb

mm/.vmscan.o.cmd:

cmd_mm/vmscan.o := gcc -Wp,-MD,mm/.vmscan.o.d -nostdinc -isystem /usr/lib/gcc/x86_64-linux-gnu/4.3.3/include -I/usr/local/src/kern/linux-2.6.33.2/arch/x86/include -Iinclude -include include/generated/autoconf.h -D__KERNEL__ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Wno-format-security -fno-delete-null-pointer-checks -O2 -m64 -mtune=generic -mno-red-zone -mcmodel=kernel -funit-at-a-time -maccumulate-outgoing-args -fstack-protector -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -fno-omit-frame-pointer -fno-optimize-sibling-calls -g -pg -Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow -D"KBUILD_STR(s)=\#s" -D"KBUILD_BASENAME=KBUILD_STR(vmscan)" -D"KBUILD_MODNAME=KBUILD_STR(vmscan)" -c -o mm/.tmp_vmscan.o mm/vmscan.c

Apr 12 22:06:35 nas17 kernel: [36346.599076] apache2 used greatest stack depth: 7904 bytes left

        Depth    Size   Location    (56 entries)
        -----    ----   --------
  0)     7904      48   __call_rcu+0x67/0x190
  1)     7856      16   call_rcu_sched+0x15/0x20
  2)     7840      16   call_rcu+0xe/0x10
  3)     7824     272   radix_tree_delete+0x159/0x2e0
  4)     7552      32   __remove_from_page_cache+0x21/0x110
  5)     7520      64   __remove_mapping+0xe8/0x130
  6)     7456     384   shrink_page_list+0x400/0x860
  7)     7072     528   shrink_zone+0x636/0xdc0
  8)     6544     112   do_try_to_free_pages+0xc2/0x3c0
  9)     6432     112   try_to_free_pages+0x64/0x70
 10)     6320     256   __alloc_pages_nodemask+0x3d2/0x710
 11)     6064      48   alloc_pages_current+0x8c/0xe0
 12)     6016      32   __page_cache_alloc+0x67/0x70
 13)     5984      80   find_or_create_page+0x50/0xb0
 14)     5904     160   _xfs_buf_lookup_pages+0x145/0x350 [xfs]
 15)     5744      64   xfs_buf_get+0x74/0x1d0 [xfs]
 16)     5680      48   xfs_buf_read+0x2f/0x110 [xfs]
 17)     5632      80   xfs_trans_read_buf+0x2bf/0x430 [xfs]
 18)     5552      80   xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
 19)     5472     176   xfs_btree_rshift+0xd7/0x530 [xfs]
 20)     5296      96   xfs_btree_make_block_unfull+0x5b/0x190 [xfs]
 21)     5200     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
 22)     4976     128   xfs_btree_insert+0x86/0x180 [xfs]
 23)     4848      96   xfs_alloc_fixup_trees+0x1fa/0x350 [xfs]
 24)     4752     144   xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
 25)     4608      32   xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
 26)     4576      96   xfs_alloc_vextent+0x49f/0x630 [xfs]
 27)     4480     160   xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
 28)     4320     208   xfs_btree_split+0xb3/0x6a0 [xfs]
 29)     4112      96   xfs_btree_make_block_unfull+0x151/0x190 [xfs]
 30)     4016     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
 31)     3792     128   xfs_btree_insert+0x86/0x180 [xfs]
 32)     3664     352   xfs_bmap_add_extent_delay_real+0x41e/0x1670 [xfs]
 33)     3312     208   xfs_bmap_add_extent+0x41c/0x450 [xfs]
 34)     3104     448   xfs_bmapi+0x982/0x1200 [xfs]
 35)     2656     256   xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
 36)     2400     208   xfs_iomap+0x3d8/0x410 [xfs]
 37)     2192      32   xfs_map_blocks+0x2c/0x30 [xfs]
 38)     2160     256   xfs_page_state_convert+0x443/0x730 [xfs]
 39)     1904      64   xfs_vm_writepage+0xab/0x160 [xfs]
 40)     1840      32   __writepage+0x1a/0x60
 41)     1808     288   write_cache_pages+0x1f7/0x400
 42)     1520      16   generic_writepages+0x27/0x30
 43)     1504      48   xfs_vm_writepages+0x5a/0x70 [xfs]
 44)     1456      16   do_writepages+0x24/0x40
 45)     1440      64   writeback_single_inode+0xf1/0x3e0
 46)     1376     128   writeback_inodes_wb+0x31e/0x510
 47)     1248      16   writeback_inodes_wbc+0x1e/0x20
 48)     1232     224   balance_dirty_pages_ratelimited_nr+0x277/0x410
 49)     1008     192   generic_file_buffered_write+0x19b/0x240
 50)      816     288   xfs_write+0x849/0x930 [xfs]
 51)      528      16   xfs_file_aio_write+0x5b/0x70 [xfs]
 52)      512     272   do_sync_write+0xd1/0x120
 53)      240      48   vfs_write+0xcb/0x1a0
 54)      192      64   sys_write+0x55/0x90
 55)      128     128   system_call_fastpath+0x16/0x1b

[-- Attachment #2: config.gz --]
[-- Type: application/x-gzip, Size: 28595 bytes --]
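A quick way to see which frames dominate a trace like the one above is to sort on the Size column. A minimal self-contained sketch (a few sample frames from the trace are inlined here; in practice you would pipe in the whole "Depth Size Location" table, e.g. from /sys/kernel/debug/tracing/stack_trace):

```shell
# Sample frames copied from the trace above; feed the full table in on
# stdin for real use.
trace='  7)     7072     528   shrink_zone+0x636/0xdc0
  6)     7456     384   shrink_page_list+0x400/0x860
 34)     3104     448   xfs_bmapi+0x982/0x1200 [xfs]
 32)     3664     352   xfs_bmap_add_extent_delay_real+0x41e/0x1670 [xfs]'

# Sort numerically on field 3 (the per-frame Size), largest first.
biggest=$(printf '%s\n' "$trace" | sort -k3,3nr | head -n 1)
echo "$biggest"
```

Here the largest single frame is shrink_zone at 528 bytes, matching the figure discussed in the message above.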
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64
  From: John Berthels @ 2010-04-16 13:41 UTC
  To: Chris Mason, Dave Chinner, John Berthels, linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

Chris Mason wrote:
> shrink_zone on my box isn't 500 bytes, but lets try the easy stuff
> first. This is against .34, if you have any trouble applying to .32,
> just add the word noinline after the word static on the function
> definitions.
>
> This makes shrink_zone disappear from my check_stack.pl output.
> Basically I think the compiler is inlining the shrink_active_zone and
> shrink_inactive_zone code into shrink_zone.

Hi Chris,

I hadn't seen the followup discussion on lkml until today, but this message: http://marc.info/?l=linux-mm&m=127122143303771&w=2 allowed me to look at stack usage in our build environment.

If I've understood correctly, builds with gcc-4.4 and gcc-4.3 have very different stack usage for shrink_zone(): 0x88 versus 0x1d8 bytes (details below). The reason appears to be the -fconserve-stack compilation option specified when using 4.4: running the command line from mm/.vmscan.cmd with gcc-4.4 but *without* -fconserve-stack gives the same result as with 4.3.

According to the discussion when the flag was added, http://www.gossamer-threads.com/lists/linux/kernel/1131612 this flag primarily affects inlining, so I double-checked the noinline patch you sent to the list and discovered that it had been incorrectly applied to the build tree. Correctly applying that patch to mm/vmscan.c (and using gcc-4.3) gives a sub $0x78,%rsp line.

I'm very sorry that this test of ours wasn't correct, and I'm sorry for sending bad info to the list. We're currently building a kernel with gcc-4.4 and will let you know whether or not it blows the 8k limit.

Thanks for your help.
regards,

jb

$ gcc-4.3 --version
gcc-4.3 (Ubuntu 4.3.4-5ubuntu1) 4.3.4
$ gcc-4.4 --version
gcc-4.4 (Ubuntu 4.4.1-4ubuntu9) 4.4.1

$ make CC=gcc-4.4 mm/vmscan.o
$ objdump -d mm/vmscan.o | less +/shrink_zone
0000000000002830 <shrink_zone>:
    2830:  55                      push   %rbp
    2831:  48 89 e5                mov    %rsp,%rbp
    2834:  41 57                   push   %r15
    2836:  41 56                   push   %r14
    2838:  41 55                   push   %r13
    283a:  41 54                   push   %r12
    283c:  53                      push   %rbx
    283d:  48 81 ec 88 00 00 00    sub    $0x88,%rsp
    2844:  e8 00 00 00 00          callq  2849 <shrink_zone+0x19>

$ make clean
$ make CC=gcc-4.3 mm/vmscan.o
$ objdump -d mm/vmscan.o | less +/shrink_zone
0000000000001ca0 <shrink_zone>:
    1ca0:  55                      push   %rbp
    1ca1:  48 89 e5                mov    %rsp,%rbp
    1ca4:  41 57                   push   %r15
    1ca6:  41 56                   push   %r14
    1ca8:  41 55                   push   %r13
    1caa:  41 54                   push   %r12
    1cac:  53                      push   %rbx
    1cad:  48 81 ec d8 01 00 00    sub    $0x1d8,%rsp
    1cb4:  e8 00 00 00 00          callq  1cb9 <shrink_zone+0x19>
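The comparison above (reading the prologue's `sub $N,%rsp` out of objdump output) can be automated. A self-contained sketch, with two sample lines inlined so it runs standalone; the symbol names here are illustrative labels, and for real use you would pipe in `objdump -d mm/vmscan.o`:

```shell
# Sample objdump output: one symbol header plus its prologue stack
# reservation, for each of the two compilers compared above.
sample='0000000000002830 <shrink_zone_gcc44>:
    283d:  48 81 ec 88 00 00 00    sub    $0x88,%rsp
0000000000001ca0 <shrink_zone_gcc43>:
    1cad:  48 81 ec d8 01 00 00    sub    $0x1d8,%rsp'

result=$(printf '%s\n' "$sample" | while read -r line; do
    case $line in
        *" <"*">:")                   # symbol header, e.g. "... <shrink_zone>:"
            fn=${line#* <}; fn=${fn%">:"} ;;
        *"sub "*",%rsp")              # prologue stack reservation
            hex=${line##*\$}          # -> "0x88,%rsp"
            hex=${hex%",%rsp"}        # -> "0x88"
            printf '%s %d\n' "$fn" "$((hex))" ;;   # hex -> decimal bytes
    esac
done)
printf '%s\n' "$result"
```

On the sample input this prints the decimal frame reservation per symbol (0x88 = 136 bytes versus 0x1d8 = 472 bytes), making the gcc-4.4/-fconserve-stack difference easy to scan across a whole object file.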
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64
  From: John Berthels @ 2010-04-09 13:43 UTC
  To: Dave Chinner
  Cc: linux-kernel, Nick Gregory, Rob Sanderson, xfs, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1933 bytes --]

Dave Chinner wrote:
> So effectively the storage subsystem (NFS, filesystem, DM, MD,
> device drivers) have about 4K of stack to work in now. That seems to
> be a lot less than last time I looked at this, and we've been really
> careful not to increase XFS's stack usage for quite some time now.

OK. I should note that we have what appears to be a similar problem on a 2.6.28 distro kernel, so I'm not sure this is a very recent change. (We see the lockups on that kernel; we haven't tried larger stacks + stack instrumentation on the earlier kernel.)

Do you know if there are any obvious knobs to twiddle to make these codepaths less likely? The cluster is resilient against occasional server death, but frequent death is more annoying. We're currently running with sysctls:

net.ipv4.ip_nonlocal_bind=1
kernel.panic=300
vm.dirty_background_ratio=3
vm.min_free_kbytes=16384

I'm not sure what circumstances force the memory reclaim (and why it doesn't come from discarding a cached page). Is the problem in the DMA/DMA32 zone, and should we try playing with lowmem_reserve_ratio? Is there anything else we could do to keep dirty pages out of the low zones? Before trying THREAD_ORDER 2, we tried doubling the RAM in a couple of boxes from 2GB to 4GB without any significant reduction in the problem.

Lastly - if we end up stuck with THREAD_ORDER 2, does anyone know what symptoms to look out for if we become unable to allocate thread stacks due to fragmentation?

> I'll have to have a bit of a think on this one - if you could
> provide further stack traces as they get deeper (esp. if they go
> past 8k) that would be really handy.

Two of the worst offenders are below. We have plenty to send if you would like more. Please let us know if you'd like us to try anything else or would like other info.

Thanks very much for your thoughts, suggestions and work so far, it's very much appreciated here.

regards,

jb

[-- Attachment #2: stack_traces.txt --]
[-- Type: text/plain, Size: 7831 bytes --]

=== server16 ===
apache2 used greatest stack depth: 7208 bytes left

        Depth    Size   Location    (72 entries)
        -----    ----   --------
  0)     8336     304   select_task_rq_fair+0x235/0xad0
  1)     8032      96   try_to_wake_up+0x189/0x3f0
  2)     7936      16   default_wake_function+0x12/0x20
  3)     7920      32   autoremove_wake_function+0x16/0x40
  4)     7888      64   __wake_up_common+0x5a/0x90
  5)     7824      64   __wake_up+0x48/0x70
  6)     7760      64   insert_work+0x9f/0xb0
  7)     7696      48   __queue_work+0x36/0x50
  8)     7648      16   queue_work_on+0x4d/0x60
  9)     7632      16   queue_work+0x1f/0x30
 10)     7616      16   queue_delayed_work+0x2d/0x40
 11)     7600      32   ata_pio_queue_task+0x35/0x40
 12)     7568      48   ata_sff_qc_issue+0x146/0x2f0
 13)     7520      96   mv_qc_issue+0x12d/0x540 [sata_mv]
 14)     7424      96   ata_qc_issue+0x1fe/0x320
 15)     7328      64   ata_scsi_translate+0xae/0x1a0
 16)     7264      64   ata_scsi_queuecmd+0xbf/0x2f0
 17)     7200      48   scsi_dispatch_cmd+0x114/0x2b0
 18)     7152      96   scsi_request_fn+0x419/0x590
 19)     7056      32   __blk_run_queue+0x82/0x150
 20)     7024      48   elv_insert+0x1aa/0x2d0
 21)     6976      48   __elv_add_request+0x83/0xd0
 22)     6928      96   __make_request+0x139/0x490
 23)     6832     208   generic_make_request+0x3df/0x4d0
 24)     6624      80   submit_bio+0x7c/0x100
 25)     6544      96   _xfs_buf_ioapply+0x128/0x2c0 [xfs]
 26)     6448      48   xfs_buf_iorequest+0x75/0xd0 [xfs]
 27)     6400      32   xlog_bdstrat_cb+0x4d/0x60 [xfs]
 28)     6368      80   xlog_sync+0x218/0x510 [xfs]
 29)     6288      64   xlog_state_release_iclog+0xbb/0x100 [xfs]
 30)     6224     160   xlog_state_sync+0x1ab/0x230 [xfs]
 31)     6064      32   _xfs_log_force+0x5a/0x80 [xfs]
 32)     6032      32   xfs_log_force+0x18/0x40 [xfs]
 33)     6000      64   xfs_alloc_search_busy+0x14b/0x160 [xfs]
 34)     5936     112   xfs_alloc_get_freelist+0x130/0x170 [xfs]
 35)     5824      48   xfs_allocbt_alloc_block+0x33/0x70 [xfs]
 36)     5776     208   xfs_btree_split+0xb3/0x6a0 [xfs]
 37)     5568      96   xfs_btree_make_block_unfull+0x151/0x190 [xfs]
 38)     5472     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
 39)     5248     128   xfs_btree_insert+0x86/0x180 [xfs]
 40)     5120     144   xfs_free_ag_extent+0x33b/0x7b0 [xfs]
 41)     4976     224   xfs_alloc_fix_freelist+0x120/0x490 [xfs]
 42)     4752      96   xfs_alloc_vextent+0x1f5/0x630 [xfs]
 43)     4656     272   xfs_bmap_btalloc+0x497/0xa70 [xfs]
 44)     4384      16   xfs_bmap_alloc+0x21/0x40 [xfs]
 45)     4368     448   xfs_bmapi+0x85e/0x1200 [xfs]
 46)     3920     256   xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
 47)     3664     208   xfs_iomap+0x3d8/0x410 [xfs]
 48)     3456      32   xfs_map_blocks+0x2c/0x30 [xfs]
 49)     3424     256   xfs_page_state_convert+0x443/0x730 [xfs]
 50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
 51)     3104     384   shrink_page_list+0x65e/0x840
 52)     2720     528   shrink_zone+0x63f/0xe10
 53)     2192     112   do_try_to_free_pages+0xc2/0x3c0
 54)     2080     128   try_to_free_pages+0x77/0x80
 55)     1952     240   __alloc_pages_nodemask+0x3e4/0x710
 56)     1712      48   alloc_pages_current+0x8c/0xe0
 57)     1664      32   __page_cache_alloc+0x67/0x70
 58)     1632     144   __do_page_cache_readahead+0xd3/0x220
 59)     1488      16   ra_submit+0x21/0x30
 60)     1472      80   ondemand_readahead+0x11d/0x250
 61)     1392      64   page_cache_async_readahead+0xa9/0xe0
 62)     1328     592   __generic_file_splice_read+0x48a/0x530
 63)      736      48   generic_file_splice_read+0x4f/0x90
 64)      688      96   xfs_splice_read+0xf2/0x130 [xfs]
 65)      592      32   xfs_file_splice_read+0x4b/0x50 [xfs]
 66)      560      64   do_splice_to+0x77/0xb0
 67)      496     112   splice_direct_to_actor+0xcc/0x1c0
 68)      384      80   do_splice_direct+0x57/0x80
 69)      304      96   do_sendfile+0x16c/0x1e0
 70)      208      80   sys_sendfile64+0x8d/0xb0
 71)      128     128   system_call_fastpath+0x16/0x1b

=== server9 ===
[223269.859411] apache2 used greatest stack depth: 7088 bytes left

        Depth    Size   Location    (62 entries)
        -----    ----   --------
  0)     8528      32   down_trylock+0x1e/0x50
  1)     8496      80   _xfs_buf_find+0x12f/0x290 [xfs]
  2)     8416      64   xfs_buf_get+0x61/0x1c0 [xfs]
  3)     8352      48   xfs_buf_read+0x2f/0x110 [xfs]
  4)     8304      48   xfs_buf_readahead+0x61/0x90 [xfs]
  5)     8256      48   xfs_btree_readahead_sblock+0xea/0xf0 [xfs]
  6)     8208      16   xfs_btree_readahead+0x5f/0x90 [xfs]
  7)     8192     112   xfs_btree_increment+0x2e/0x2b0 [xfs]
  8)     8080     176   xfs_btree_rshift+0x2f2/0x530 [xfs]
  9)     7904     272   xfs_btree_delrec+0x4a3/0x1020 [xfs]
 10)     7632      64   xfs_btree_delete+0x40/0xd0 [xfs]
 11)     7568      96   xfs_alloc_fixup_trees+0x7d/0x350 [xfs]
 12)     7472     144   xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
 13)     7328      32   xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
 14)     7296      96   xfs_alloc_vextent+0x49f/0x630 [xfs]
 15)     7200     160   xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
 16)     7040     208   xfs_btree_split+0xb3/0x6a0 [xfs]
 17)     6832      96   xfs_btree_make_block_unfull+0x151/0x190 [xfs]
 18)     6736     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
 19)     6512     128   xfs_btree_insert+0x86/0x180 [xfs]
 20)     6384     352   xfs_bmap_add_extent_delay_real+0x41e/0x1660 [xfs]
 21)     6032     208   xfs_bmap_add_extent+0x41c/0x450 [xfs]
 22)     5824     448   xfs_bmapi+0x982/0x1200 [xfs]
 23)     5376     256   xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
 24)     5120     208   xfs_iomap+0x3d8/0x410 [xfs]
 25)     4912      32   xfs_map_blocks+0x2c/0x30 [xfs]
 26)     4880     256   xfs_page_state_convert+0x443/0x730 [xfs]
 27)     4624      64   xfs_vm_writepage+0xab/0x160 [xfs]
 28)     4560     384   shrink_page_list+0x65e/0x840
 29)     4176     528   shrink_zone+0x63f/0xe10
 30)     3648     112   do_try_to_free_pages+0xc2/0x3c0
 31)     3536     128   try_to_free_pages+0x77/0x80
 32)     3408     240   __alloc_pages_nodemask+0x3e4/0x710
 33)     3168      48   alloc_pages_current+0x8c/0xe0
 34)     3120      80   new_slab+0x247/0x300
 35)     3040      96   __slab_alloc+0x137/0x490
 36)     2944      64   kmem_cache_alloc+0x110/0x120
 37)     2880      64   kmem_zone_alloc+0x9a/0xe0 [xfs]
 38)     2816      32   kmem_zone_zalloc+0x1e/0x50 [xfs]
 39)     2784      32   _xfs_trans_alloc+0x38/0x80 [xfs]
 40)     2752      96   xfs_trans_alloc+0x9f/0xb0 [xfs]
 41)     2656     256   xfs_iomap_write_allocate+0xf1/0x3c0 [xfs]
 42)     2400     208   xfs_iomap+0x3d8/0x410 [xfs]
 43)     2192      32   xfs_map_blocks+0x2c/0x30 [xfs]
 44)     2160     256   xfs_page_state_convert+0x443/0x730 [xfs]
 45)     1904      64   xfs_vm_writepage+0xab/0x160 [xfs]
 46)     1840      32   __writepage+0x17/0x50
 47)     1808     288   write_cache_pages+0x1f7/0x400
 48)     1520      16   generic_writepages+0x24/0x30
 49)     1504      48   xfs_vm_writepages+0x5c/0x80 [xfs]
 50)     1456      16   do_writepages+0x21/0x40
 51)     1440      64   writeback_single_inode+0xeb/0x3c0
 52)     1376     128   writeback_inodes_wb+0x318/0x510
 53)     1248      16   writeback_inodes_wbc+0x1e/0x20
 54)     1232     224   balance_dirty_pages_ratelimited_nr+0x269/0x3a0
 55)     1008     192   generic_file_buffered_write+0x19b/0x240
 56)      816     288   xfs_write+0x837/0x920 [xfs]
 57)      528      16   xfs_file_aio_write+0x5b/0x70 [xfs]
 58)      512     272   do_sync_write+0xd1/0x120
 59)      240      48   vfs_write+0xcb/0x1a0
 60)      192      64   sys_write+0x55/0x90
 61)      128     128   system_call_fastpath+0x16/0x1b
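The arithmetic behind reading these "bytes left" messages is worth making explicit. On a THREAD_ORDER 2 x86-64 kernel the stack is 4 pages (16 KiB), so a figure like "7208 bytes left" implies (roughly, ignoring the small reserved area at the end of the stack) more stack in use than an order-1 (8 KiB) stack could ever have supplied:

```shell
# Sketch of the 8k-overflow inference made in this thread.
thread_size=$((4096 << 2))   # PAGE_SIZE << THREAD_ORDER, with THREAD_ORDER=2
left=7208                    # "apache2 used greatest stack depth: 7208 bytes left"
used=$((thread_size - left))
echo "$used bytes used"      # 9176: past the 8192-byte order-1 stack limit
```

This is why the larger-stack kernel surviving while logging 7208 bytes left was taken as concrete evidence that the workload exceeds 8k of stack.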
* Re: PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64 @ 2010-04-09 13:43 ` John Berthels 0 siblings, 0 replies; 43+ messages in thread From: John Berthels @ 2010-04-09 13:43 UTC (permalink / raw) To: Dave Chinner; +Cc: Nick Gregory, linux-mm, xfs, linux-kernel, Rob Sanderson [-- Attachment #1: Type: text/plain, Size: 1933 bytes --] Dave Chinner wrote: > So effectively the storage subsystem (NFS, filesystem, DM, MD, > device drivers) have about 4K of stack to work in now. That seems to > be a lot less than last time I looked at this, and we've been really > careful not to increase XFS's stack usage for quite some time now. OK. I should note that we have what appears to be a similar problem on a 2.6.28 distro kernel, so I'm not sure this is a very recent change. (We see the lockups on that kernel, we haven't tried larger stacks + stack instrumentation on the earlier kernel). Do you know if there are any obvious knobs to twiddle to make these codepaths less likely? The cluster is resilient against occasional server death, but frequent death is more annoying. We're currently running with sysctls: net.ipv4.ip_nonlocal_bind=1 kernel.panic=300 vm.dirty_background_ratio=3 vm.min_free_kbytes=16384 I'm not sure what circumstances force the memory reclaim (and why it doesn't come from discarding a cached page). Is the problem is the DMA/DMA32 zone and we should try playing with lowmem_reserve_ratio? Is there anything else we could do to keep dirty pages out of the low zones? Before trying THREAD_ORDER 2, we tried doubling the RAM in a couple of boxes from 2GB to 4GB without any significant reduction in the problem. Lastly - if we end up stuck with THREAD_ORDER 2, does anyone know what symptoms to look out for to know if unable to allocate thread stacks due to fragmentation? > I'll have to have a bit of a think on this one - if you could > provide further stack traces as they get deeper (esp. 
if they go > past 8k) that would be really handy. Two of the worst offenders below. We have plenty to send if you would like more. Please let us know if you'd like us to try anything else or would like other info. Thanks very much for your thoughts, suggestions and work so far, it's very much appreciated here. regards, jb [-- Attachment #2: stack_traces.txt --] [-- Type: text/plain, Size: 7831 bytes --] === server16 === apache2 used greatest stack depth: 7208 bytes left Depth Size Location (72 entries) ----- ---- -------- 0) 8336 304 select_task_rq_fair+0x235/0xad0 1) 8032 96 try_to_wake_up+0x189/0x3f0 2) 7936 16 default_wake_function+0x12/0x20 3) 7920 32 autoremove_wake_function+0x16/0x40 4) 7888 64 __wake_up_common+0x5a/0x90 5) 7824 64 __wake_up+0x48/0x70 6) 7760 64 insert_work+0x9f/0xb0 7) 7696 48 __queue_work+0x36/0x50 8) 7648 16 queue_work_on+0x4d/0x60 9) 7632 16 queue_work+0x1f/0x30 10) 7616 16 queue_delayed_work+0x2d/0x40 11) 7600 32 ata_pio_queue_task+0x35/0x40 12) 7568 48 ata_sff_qc_issue+0x146/0x2f0 13) 7520 96 mv_qc_issue+0x12d/0x540 [sata_mv] 14) 7424 96 ata_qc_issue+0x1fe/0x320 15) 7328 64 ata_scsi_translate+0xae/0x1a0 16) 7264 64 ata_scsi_queuecmd+0xbf/0x2f0 17) 7200 48 scsi_dispatch_cmd+0x114/0x2b0 18) 7152 96 scsi_request_fn+0x419/0x590 19) 7056 32 __blk_run_queue+0x82/0x150 20) 7024 48 elv_insert+0x1aa/0x2d0 21) 6976 48 __elv_add_request+0x83/0xd0 22) 6928 96 __make_request+0x139/0x490 23) 6832 208 generic_make_request+0x3df/0x4d0 24) 6624 80 submit_bio+0x7c/0x100 25) 6544 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs] 26) 6448 48 xfs_buf_iorequest+0x75/0xd0 [xfs] 27) 6400 32 xlog_bdstrat_cb+0x4d/0x60 [xfs] 28) 6368 80 xlog_sync+0x218/0x510 [xfs] 29) 6288 64 xlog_state_release_iclog+0xbb/0x100 [xfs] 30) 6224 160 xlog_state_sync+0x1ab/0x230 [xfs] 31) 6064 32 _xfs_log_force+0x5a/0x80 [xfs] 32) 6032 32 xfs_log_force+0x18/0x40 [xfs] 33) 6000 64 xfs_alloc_search_busy+0x14b/0x160 [xfs] 34) 5936 112 xfs_alloc_get_freelist+0x130/0x170 [xfs] 35) 5824 48 
xfs_allocbt_alloc_block+0x33/0x70 [xfs] 36) 5776 208 xfs_btree_split+0xb3/0x6a0 [xfs] 37) 5568 96 xfs_btree_make_block_unfull+0x151/0x190 [xfs] 38) 5472 224 xfs_btree_insrec+0x39c/0x5b0 [xfs] 39) 5248 128 xfs_btree_insert+0x86/0x180 [xfs] 40) 5120 144 xfs_free_ag_extent+0x33b/0x7b0 [xfs] 41) 4976 224 xfs_alloc_fix_freelist+0x120/0x490 [xfs] 42) 4752 96 xfs_alloc_vextent+0x1f5/0x630 [xfs] 43) 4656 272 xfs_bmap_btalloc+0x497/0xa70 [xfs] 44) 4384 16 xfs_bmap_alloc+0x21/0x40 [xfs] 45) 4368 448 xfs_bmapi+0x85e/0x1200 [xfs] 46) 3920 256 xfs_iomap_write_allocate+0x248/0x3c0 [xfs] 47) 3664 208 xfs_iomap+0x3d8/0x410 [xfs] 48) 3456 32 xfs_map_blocks+0x2c/0x30 [xfs] 49) 3424 256 xfs_page_state_convert+0x443/0x730 [xfs] 50) 3168 64 xfs_vm_writepage+0xab/0x160 [xfs] 51) 3104 384 shrink_page_list+0x65e/0x840 52) 2720 528 shrink_zone+0x63f/0xe10 53) 2192 112 do_try_to_free_pages+0xc2/0x3c0 54) 2080 128 try_to_free_pages+0x77/0x80 55) 1952 240 __alloc_pages_nodemask+0x3e4/0x710 56) 1712 48 alloc_pages_current+0x8c/0xe0 57) 1664 32 __page_cache_alloc+0x67/0x70 58) 1632 144 __do_page_cache_readahead+0xd3/0x220 59) 1488 16 ra_submit+0x21/0x30 60) 1472 80 ondemand_readahead+0x11d/0x250 61) 1392 64 page_cache_async_readahead+0xa9/0xe0 62) 1328 592 __generic_file_splice_read+0x48a/0x530 63) 736 48 generic_file_splice_read+0x4f/0x90 64) 688 96 xfs_splice_read+0xf2/0x130 [xfs] 65) 592 32 xfs_file_splice_read+0x4b/0x50 [xfs] 66) 560 64 do_splice_to+0x77/0xb0 67) 496 112 splice_direct_to_actor+0xcc/0x1c0 68) 384 80 do_splice_direct+0x57/0x80 69) 304 96 do_sendfile+0x16c/0x1e0 70) 208 80 sys_sendfile64+0x8d/0xb0 71) 128 128 system_call_fastpath+0x16/0x1b === server9 === [223269.859411] apache2 used greatest stack depth: 7088 bytes left Depth Size Location (62 entries) ----- ---- -------- 0) 8528 32 down_trylock+0x1e/0x50 1) 8496 80 _xfs_buf_find+0x12f/0x290 [xfs] 2) 8416 64 xfs_buf_get+0x61/0x1c0 [xfs] 3) 8352 48 xfs_buf_read+0x2f/0x110 [xfs] 4) 8304 48 xfs_buf_readahead+0x61/0x90 [xfs] 5) 
  5)     8256      48   xfs_btree_readahead_sblock+0xea/0xf0 [xfs]
  6)     8208      16   xfs_btree_readahead+0x5f/0x90 [xfs]
  7)     8192     112   xfs_btree_increment+0x2e/0x2b0 [xfs]
  8)     8080     176   xfs_btree_rshift+0x2f2/0x530 [xfs]
  9)     7904     272   xfs_btree_delrec+0x4a3/0x1020 [xfs]
 10)     7632      64   xfs_btree_delete+0x40/0xd0 [xfs]
 11)     7568      96   xfs_alloc_fixup_trees+0x7d/0x350 [xfs]
 12)     7472     144   xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
 13)     7328      32   xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
 14)     7296      96   xfs_alloc_vextent+0x49f/0x630 [xfs]
 15)     7200     160   xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
 16)     7040     208   xfs_btree_split+0xb3/0x6a0 [xfs]
 17)     6832      96   xfs_btree_make_block_unfull+0x151/0x190 [xfs]
 18)     6736     224   xfs_btree_insrec+0x39c/0x5b0 [xfs]
 19)     6512     128   xfs_btree_insert+0x86/0x180 [xfs]
 20)     6384     352   xfs_bmap_add_extent_delay_real+0x41e/0x1660 [xfs]
 21)     6032     208   xfs_bmap_add_extent+0x41c/0x450 [xfs]
 22)     5824     448   xfs_bmapi+0x982/0x1200 [xfs]
 23)     5376     256   xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
 24)     5120     208   xfs_iomap+0x3d8/0x410 [xfs]
 25)     4912      32   xfs_map_blocks+0x2c/0x30 [xfs]
 26)     4880     256   xfs_page_state_convert+0x443/0x730 [xfs]
 27)     4624      64   xfs_vm_writepage+0xab/0x160 [xfs]
 28)     4560     384   shrink_page_list+0x65e/0x840
 29)     4176     528   shrink_zone+0x63f/0xe10
 30)     3648     112   do_try_to_free_pages+0xc2/0x3c0
 31)     3536     128   try_to_free_pages+0x77/0x80
 32)     3408     240   __alloc_pages_nodemask+0x3e4/0x710
 33)     3168      48   alloc_pages_current+0x8c/0xe0
 34)     3120      80   new_slab+0x247/0x300
 35)     3040      96   __slab_alloc+0x137/0x490
 36)     2944      64   kmem_cache_alloc+0x110/0x120
 37)     2880      64   kmem_zone_alloc+0x9a/0xe0 [xfs]
 38)     2816      32   kmem_zone_zalloc+0x1e/0x50 [xfs]
 39)     2784      32   _xfs_trans_alloc+0x38/0x80 [xfs]
 40)     2752      96   xfs_trans_alloc+0x9f/0xb0 [xfs]
 41)     2656     256   xfs_iomap_write_allocate+0xf1/0x3c0 [xfs]
 42)     2400     208   xfs_iomap+0x3d8/0x410 [xfs]
 43)     2192      32   xfs_map_blocks+0x2c/0x30 [xfs]
 44)     2160     256   xfs_page_state_convert+0x443/0x730 [xfs]
 45)     1904      64   xfs_vm_writepage+0xab/0x160 [xfs]
 46)     1840      32   __writepage+0x17/0x50
 47)     1808     288   write_cache_pages+0x1f7/0x400
 48)     1520      16   generic_writepages+0x24/0x30
 49)     1504      48   xfs_vm_writepages+0x5c/0x80 [xfs]
 50)     1456      16   do_writepages+0x21/0x40
 51)     1440      64   writeback_single_inode+0xeb/0x3c0
 52)     1376     128   writeback_inodes_wb+0x318/0x510
 53)     1248      16   writeback_inodes_wbc+0x1e/0x20
 54)     1232     224   balance_dirty_pages_ratelimited_nr+0x269/0x3a0
 55)     1008     192   generic_file_buffered_write+0x19b/0x240
 56)      816     288   xfs_write+0x837/0x920 [xfs]
 57)      528      16   xfs_file_aio_write+0x5b/0x70 [xfs]
 58)      512     272   do_sync_write+0xd1/0x120
 59)      240      48   vfs_write+0xcb/0x1a0
 60)      192      64   sys_write+0x55/0x90
 61)      128     128   system_call_fastpath+0x16/0x1b
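In the stack tracer output above, Depth is the number of stack bytes in use from that frame upward, so the topmost frame's Depth is the total stack footprint of the whole call chain. A minimal sketch of the arithmetic (using a hypothetical subset of the frames shown, and assuming THREAD_ORDER 1, i.e. an 8 KiB stack on x86-64):

```python
# (depth, size, symbol) triples copied from a few frames of the trace above;
# depth = stack bytes consumed from this frame to the top of the stack.
frames = [
    (8256, 48, "xfs_btree_readahead_sblock"),
    (5824, 448, "xfs_bmapi"),
    (4176, 528, "shrink_zone"),
    (128, 128, "system_call_fastpath"),
]

THREAD_SIZE = 8192  # THREAD_ORDER 1 on x86-64: 2 pages = 8 KiB

# The largest depth is the total stack usage of the traced call chain.
max_depth = max(depth for depth, _, _ in frames)
print(max_depth, max_depth > THREAD_SIZE)
```

The topmost frame already reports 8256 bytes in use, 64 bytes more than an 8 KiB stack provides, which is consistent with the "184 bytes left" warning before the crash and with THREAD_ORDER 2 (16 KiB) papering over the problem.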
end of thread, other threads:[~2010-04-16 13:42 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-07 11:06 PROBLEM + POSS FIX: kernel stack overflow, xfs, many disks, heavy write load, 8k stack, x86-64 John Berthels
2010-04-07 14:05 ` Dave Chinner
2010-04-07 15:57 ` John Berthels
2010-04-07 17:43 ` Eric Sandeen
2010-04-07 23:43 ` Dave Chinner
2010-04-08  3:03 ` Dave Chinner
2010-04-08 12:16 ` John Berthels
2010-04-08 14:47 ` John Berthels
2010-04-08 16:18 ` John Berthels
2010-04-08 23:38 ` Dave Chinner
2010-04-09 11:38 ` Chris Mason
2010-04-09 18:05 ` Eric Sandeen
2010-04-09 18:11 ` Chris Mason
2010-04-12  1:01 ` Dave Chinner
2010-04-13  9:51 ` John Berthels
2010-04-16 13:41 ` John Berthels
2010-04-09 13:43 ` John Berthels