From: Dave Chinner <david@fromorbit.com>
To: Chris Mason <chris.mason@oracle.com>, Mel Gorman <mel@csn.ul.ie>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH] mm: disallow direct reclaim page writeback
Date: Wed, 14 Apr 2010 11:40:41 +1000
Message-ID: <20100414014041.GD2493@dastard>
In-Reply-To: <20100413202021.GZ13327@think>

On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > This problem is not a filesystem recursion problem which is, as I
> > > understand it, what GFP_NOFS is used to prevent. It's _any_ kernel
> > > code that uses significant stack before trying to allocate memory
> > > that is the problem. e.g. a select() system call:
> > > 
> > >        Depth    Size   Location    (47 entries)
> > >        -----    ----   --------
> > >  0)     7568      16   mempool_alloc_slab+0x16/0x20
> > >  1)     7552     144   mempool_alloc+0x65/0x140
> > >  2)     7408      96   get_request+0x124/0x370
> > >  3)     7312     144   get_request_wait+0x29/0x1b0
> > >  4)     7168      96   __make_request+0x9b/0x490
> > >  5)     7072     208   generic_make_request+0x3df/0x4d0
> > >  6)     6864      80   submit_bio+0x7c/0x100
> > >  7)     6784      96   _xfs_buf_ioapply+0x128/0x2c0 [xfs]
> > > ....
> > > 32)     3184      64   xfs_vm_writepage+0xab/0x160 [xfs]
> > > 33)     3120     384   shrink_page_list+0x65e/0x840
> > > 34)     2736     528   shrink_zone+0x63f/0xe10
> > > 35)     2208     112   do_try_to_free_pages+0xc2/0x3c0
> > > 36)     2096     128   try_to_free_pages+0x77/0x80
> > > 37)     1968     240   __alloc_pages_nodemask+0x3e4/0x710
> > > 38)     1728      48   alloc_pages_current+0x8c/0xe0
> > > 39)     1680      16   __get_free_pages+0xe/0x50
> > > 40)     1664      48   __pollwait+0xca/0x110
> > > 41)     1616      32   unix_poll+0x28/0xc0
> > > 42)     1584      16   sock_poll+0x1d/0x20
> > > 43)     1568     912   do_select+0x3d6/0x700
> > > 44)      656     416   core_sys_select+0x18c/0x2c0
> > > 45)      240     112   sys_select+0x4f/0x110
> > > 46)      128     128   system_call_fastpath+0x16/0x1b
> > > 
> > > There's 1.6k of stack used before memory allocation is called, 3.1k
> > > used there before ->writepage is entered, XFS used 3.5k, and
> > > if the mempool needed to allocate a page it would have blown the
> > > stack. If there was any significant storage subsystem (add dm, md
> > > and/or scsi of some kind), it would have blown the stack.
> > > 
> > > Basically, there is not enough stack space available to allow direct
> > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > profiles we are seeing here....
> > > 
> > 
> > I'm not denying the evidence but how has it been gotten away with for years
> > then? Prevention of writeback isn't the answer without figuring out how
> > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > doing sync IO, then waiting on those pages.
> 
> So, I've been reading along, nodding my head to Dave's side of things
> because seeks are evil and direct reclaim makes seeks.  I'd really love
> for direct reclaim to somehow trigger writepages on large chunks instead
> of doing page by page spatters of IO to the drive.

Perhaps drop the lock on the page if it is held and call one of the
helpers that filesystems use to do this, like:

	filemap_write_and_wait(page->mapping);
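
Something like this, roughly - just a sketch, not tested; the helper
name is made up, and it assumes the caller holds the page lock and
can safely block waiting for writeback of the whole mapping:

	#include <linux/fs.h>
	#include <linux/pagemap.h>

	/*
	 * Sketch only: instead of pushing a single dirty page through
	 * ->writepage() from reclaim, drop the page lock and flush the
	 * whole mapping so the filesystem can build large, contiguous
	 * IOs instead of page-by-page writes.
	 */
	static void reclaim_flush_mapping(struct page *page)
	{
		struct address_space *mapping = page->mapping;

		if (!mapping)
			return;

		unlock_page(page);
		filemap_write_and_wait(mapping);
	}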

> But, somewhere along the line I overlooked the part of Dave's stack trace
> that said:
> 
> 43)     1568     912   do_select+0x3d6/0x700
> 
> Huh, 912 bytes...for select, really?  From poll.h:

Sure, it's bad, but focussing on that specific case misses the point
that even code using minimal stack can enter direct reclaim after
consuming 1.5k of stack. e.g.:

 50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
 51)     3104     384   shrink_page_list+0x65e/0x840
 52)     2720     528   shrink_zone+0x63f/0xe10
 53)     2192     112   do_try_to_free_pages+0xc2/0x3c0
 54)     2080     128   try_to_free_pages+0x77/0x80
 55)     1952     240   __alloc_pages_nodemask+0x3e4/0x710
 56)     1712      48   alloc_pages_current+0x8c/0xe0
 57)     1664      32   __page_cache_alloc+0x67/0x70
 58)     1632     144   __do_page_cache_readahead+0xd3/0x220
 59)     1488      16   ra_submit+0x21/0x30
 60)     1472      80   ondemand_readahead+0x11d/0x250
 61)     1392      64   page_cache_async_readahead+0xa9/0xe0
 62)     1328     592   __generic_file_splice_read+0x48a/0x530
 63)      736      48   generic_file_splice_read+0x4f/0x90
 64)      688      96   xfs_splice_read+0xf2/0x130 [xfs]
 65)      592      32   xfs_file_splice_read+0x4b/0x50 [xfs]
 66)      560      64   do_splice_to+0x77/0xb0
 67)      496     112   splice_direct_to_actor+0xcc/0x1c0
 68)      384      80   do_splice_direct+0x57/0x80
 69)      304      96   do_sendfile+0x16c/0x1e0
 70)      208      80   sys_sendfile64+0x8d/0xb0
 71)      128     128   system_call_fastpath+0x16/0x1b

Yes, __generic_file_splice_read() is a stack hog, but hogs like it
seem to be _everywhere_ today...

> So, select is intentionally trying to use that much stack.  It should be using
> GFP_NOFS if it really wants to suck down that much stack...

The code doing the allocation is called from many different
contexts - how is it supposed to know that in some of them it needs
to treat memory allocation differently?

This is my point - if you introduce a new memory allocation semantic
of "use GFP_NOFS when you are using too much stack", where "too much"
is anything more than 15% of the stack, then pretty much every code
path will need to set that flag...
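
To illustrate (purely hypothetical code - these helpers don't exist
anywhere): the only way a generic helper could honour such a rule is
if every caller worked out its own context and passed the gfp mask
all the way down the call chain:

	#include <linux/slab.h>

	/*
	 * A generic helper has no idea how much stack its callers have
	 * already consumed, so it cannot choose between GFP_KERNEL and
	 * GFP_NOFS itself - the context has to be passed in explicitly
	 * by every call chain that reaches it.
	 */
	static void *alloc_scratch(size_t size, gfp_t gfp)
	{
		return kmalloc(size, gfp);
	}

	/* shallow syscall path:   alloc_scratch(len, GFP_KERNEL) */
	/* deep fs/writeback path: alloc_scratch(len, GFP_NOFS)   */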

> if only the
> kernel had some sort of way to dynamically allocate ram, it could try
> that too.

Sure, but to play the devil's advocate: if memory allocation blows
the stack, then surely avoiding allocation by using stack variables
is safer? ;)

FWIW, even if we use GFP_NOFS, allocation+reclaim can still use 2k
of stack; stuff like the radix tree code appears to be a significant
user of stack now:

        Depth    Size   Location    (56 entries)
        -----    ----   --------
  0)     7904      48   __call_rcu+0x67/0x190
  1)     7856      16   call_rcu_sched+0x15/0x20
  2)     7840      16   call_rcu+0xe/0x10
  3)     7824     272   radix_tree_delete+0x159/0x2e0
  4)     7552      32   __remove_from_page_cache+0x21/0x110
  5)     7520      64   __remove_mapping+0xe8/0x130
  6)     7456     384   shrink_page_list+0x400/0x860
  7)     7072     528   shrink_zone+0x636/0xdc0
  8)     6544     112   do_try_to_free_pages+0xc2/0x3c0
  9)     6432     112   try_to_free_pages+0x64/0x70
 10)     6320     256   __alloc_pages_nodemask+0x3d2/0x710
 11)     6064      48   alloc_pages_current+0x8c/0xe0
 12)     6016      32   __page_cache_alloc+0x67/0x70
 13)     5984      80   find_or_create_page+0x50/0xb0
 14)     5904     160   _xfs_buf_lookup_pages+0x145/0x350 [xfs]

or even just calling ->releasepage and freeing bufferheads:

       Depth    Size   Location    (55 entries)
       -----    ----   --------
 0)     7440      48   add_partial+0x26/0x90
 1)     7392      64   __slab_free+0x1a9/0x380
 2)     7328      64   kmem_cache_free+0xb9/0x160
 3)     7264      16   free_buffer_head+0x25/0x50
 4)     7248      64   try_to_free_buffers+0x79/0xc0
 5)     7184     160   xfs_vm_releasepage+0xda/0x130 [xfs]
 6)     7024      16   try_to_release_page+0x33/0x60
 7)     7008     384   shrink_page_list+0x585/0x860
 8)     6624     528   shrink_zone+0x636/0xdc0
 9)     6096     112   do_try_to_free_pages+0xc2/0x3c0
10)     5984     112   try_to_free_pages+0x64/0x70
11)     5872     256   __alloc_pages_nodemask+0x3d2/0x710
12)     5616      48   alloc_pages_current+0x8c/0xe0
13)     5568      32   __page_cache_alloc+0x67/0x70
14)     5536      80   find_or_create_page+0x50/0xb0
15)     5456     160   _xfs_buf_lookup_pages+0x145/0x350 [xfs]

And another eye-opening example, this time deep in the SATA driver
layer:

        Depth    Size   Location    (72 entries)
        -----    ----   --------
  0)     8336     304   select_task_rq_fair+0x235/0xad0
  1)     8032      96   try_to_wake_up+0x189/0x3f0
  2)     7936      16   default_wake_function+0x12/0x20
  3)     7920      32   autoremove_wake_function+0x16/0x40
  4)     7888      64   __wake_up_common+0x5a/0x90
  5)     7824      64   __wake_up+0x48/0x70
  6)     7760      64   insert_work+0x9f/0xb0
  7)     7696      48   __queue_work+0x36/0x50
  8)     7648      16   queue_work_on+0x4d/0x60
  9)     7632      16   queue_work+0x1f/0x30
 10)     7616      16   queue_delayed_work+0x2d/0x40
 11)     7600      32   ata_pio_queue_task+0x35/0x40
 12)     7568      48   ata_sff_qc_issue+0x146/0x2f0
 13)     7520      96   mv_qc_issue+0x12d/0x540 [sata_mv]
 14)     7424      96   ata_qc_issue+0x1fe/0x320
 15)     7328      64   ata_scsi_translate+0xae/0x1a0
 16)     7264      64   ata_scsi_queuecmd+0xbf/0x2f0
 17)     7200      48   scsi_dispatch_cmd+0x114/0x2b0
 18)     7152      96   scsi_request_fn+0x419/0x590
 19)     7056      32   __blk_run_queue+0x82/0x150
 20)     7024      48   elv_insert+0x1aa/0x2d0
 21)     6976      48   __elv_add_request+0x83/0xd0
 22)     6928      96   __make_request+0x139/0x490
 23)     6832     208   generic_make_request+0x3df/0x4d0
 24)     6624      80   submit_bio+0x7c/0x100
 25)     6544      96   _xfs_buf_ioapply+0x128/0x2c0 [xfs]

We need at least _700_ bytes of stack free just to call queue_work(),
and that now happens deep in the guts of the driver subsystem below
XFS. This trace shows 1.8k of stack usage on a simple, single SATA
disk storage subsystem, so my estimate of 2k of stack for the storage
subsystem below XFS is too small - a worst case of 2.5-3k of stack
space is probably closer to the mark.
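
(To put numbers on that from the trace above: the wakeup path alone -
frames 0-9, select_task_rq_fair() down to queue_work() - adds up to
304 + 96 + 16 + 32 + 64 + 64 + 64 + 48 + 16 + 16 = 720 bytes, and the
block/SCSI/ATA frames from queue_delayed_work() down to submit_bio()
add roughly another 1070 bytes, which is where the ~1.8k consumed
below _xfs_buf_ioapply() comes from.)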

This is the sort of thing I'm pointing at when I say that stack
usage outside XFS has grown significantly over the past couple of
years. Given that XFS stack usage has stayed pretty much the same or
even shrunk slightly over the same period, blaming XFS or saying
"callers should use GFP_NOFS" seems like a cop-out to me. Regardless
of the IO pattern performance issues, writeback via direct reclaim
just uses too much stack to be safe these days...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

