linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm: disallow direct reclaim page writeback
@ 2010-04-13  0:17 Dave Chinner
  2010-04-13  8:31 ` KOSAKI Motohiro
                   ` (3 more replies)
  0 siblings, 4 replies; 115+ messages in thread
From: Dave Chinner @ 2010-04-13  0:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

When we enter direct reclaim we may have used an arbitrary amount of stack
space, and hence entering the filesystem to do writeback can then lead to
stack overruns. This problem was recently encountered on x86_64 systems with
8k stacks running XFS with simple storage configurations.

Writeback from direct reclaim also adversely affects background writeback. The
background flusher threads should already be taking care of cleaning dirty
pages, and direct reclaim will kick them if they aren't already doing work. If
direct reclaim is also calling ->writepage, it will cause the IO patterns from
the background flusher threads to be upset by LRU-order writeback from
pageout() which can be effectively random IO. Having competing sources of IO
trying to clean pages on the same backing device reduces throughput by
increasing the number of seeks that the backing device has to do to write back
the pages.

Hence for direct reclaim we should not allow ->writepage to be entered at all.
Set up the relevant scan_control structures to enforce this, and prevent
sc->may_writepage from being set in other places in the direct reclaim path in
response to other events.

Reported-by: John Berthels <john@humyo.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 mm/vmscan.c |   13 ++++++-------
 1 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e0e5f15..5321ac4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1826,10 +1826,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 * writeout.  So in laptop mode, write out the whole world.
 		 */
 		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
-		if (total_scanned > writeback_threshold) {
+		if (total_scanned > writeback_threshold)
 			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
-			sc->may_writepage = 1;
-		}
 
 		/* Take a nap, wait for some writeback to complete */
 		if (!sc->hibernation_mode && sc->nr_scanned &&
@@ -1871,7 +1869,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 {
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
-		.may_writepage = !laptop_mode,
+		.may_writepage = 0,
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.may_unmap = 1,
 		.may_swap = 1,
@@ -1893,7 +1891,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 						struct zone *zone, int nid)
 {
 	struct scan_control sc = {
-		.may_writepage = !laptop_mode,
+		.may_writepage = 0,
 		.may_unmap = 1,
 		.may_swap = !noswap,
 		.swappiness = swappiness,
@@ -1926,7 +1924,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 {
 	struct zonelist *zonelist;
 	struct scan_control sc = {
-		.may_writepage = !laptop_mode,
+		.may_writepage = 0,
 		.may_unmap = 1,
 		.may_swap = !noswap,
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
@@ -2567,7 +2565,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	struct reclaim_state reclaim_state;
 	int priority;
 	struct scan_control sc = {
-		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
+		.may_writepage = (current_is_kswapd() &&
+					(zone_reclaim_mode & RECLAIM_WRITE)),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
 		.may_swap = 1,
 		.nr_to_reclaim = max_t(unsigned long, nr_pages,
-- 
1.6.5



* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-13  0:17 [PATCH] mm: disallow direct reclaim page writeback Dave Chinner
@ 2010-04-13  8:31 ` KOSAKI Motohiro
  2010-04-13 10:29   ` Dave Chinner
  2010-04-13  9:58 ` Mel Gorman
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-13  8:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: kosaki.motohiro, linux-kernel, linux-mm, linux-fsdevel, Chris Mason

Hi

> From: Dave Chinner <dchinner@redhat.com>
> 
> When we enter direct reclaim we may have used an arbitrary amount of stack
> space, and hence entering the filesystem to do writeback can then lead to
> stack overruns. This problem was recently encountered on x86_64 systems with
> 8k stacks running XFS with simple storage configurations.
> 
> Writeback from direct reclaim also adversely affects background writeback. The
> background flusher threads should already be taking care of cleaning dirty
> pages, and direct reclaim will kick them if they aren't already doing work. If
> direct reclaim is also calling ->writepage, it will cause the IO patterns from
> the background flusher threads to be upset by LRU-order writeback from
> pageout() which can be effectively random IO. Having competing sources of IO
> trying to clean pages on the same backing device reduces throughput by
> increasing the number of seeks that the backing device has to do to write back
> the pages.
> 
> Hence for direct reclaim we should not allow ->writepage to be entered at all.
> Set up the relevant scan_control structures to enforce this, and prevent
> sc->may_writepage from being set in other places in the direct reclaim path in
> response to other events.

Ummm..
This patch is hard to ack. This patch's pros and cons seem to be:

Pros:
	1) prevents XFS stack overflows
	2) improves IO workload performance

Cons:
	3) TOTALLY kills lumpy reclaim (i.e. high-order allocation)

So, if we only needed to consider IO workloads there would be no
downside, but we can't.

I think (1) is an XFS issue. XFS should handle it itself. But (2) is
really a VM issue. Currently our VM calls pageout() too aggressively
and decreases IO throughput. I've heard about this issue from Chris
(cc'd). I'd like to fix this. But we can never kill pageout()
completely because we can't assume users don't run high-order
allocation workloads.
(Perhaps Mel's memory compaction code will improve things enough that
 we can kill lumpy reclaim in the future, but that's another story.)

Thanks.


> 
> Reported-by: John Berthels <john@humyo.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  mm/vmscan.c |   13 ++++++-------
>  1 files changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e0e5f15..5321ac4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1826,10 +1826,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  		 * writeout.  So in laptop mode, write out the whole world.
>  		 */
>  		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
> -		if (total_scanned > writeback_threshold) {
> +		if (total_scanned > writeback_threshold)
>  			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
> -			sc->may_writepage = 1;
> -		}
>  
>  		/* Take a nap, wait for some writeback to complete */
>  		if (!sc->hibernation_mode && sc->nr_scanned &&
> @@ -1871,7 +1869,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  {
>  	struct scan_control sc = {
>  		.gfp_mask = gfp_mask,
> -		.may_writepage = !laptop_mode,
> +		.may_writepage = 0,
>  		.nr_to_reclaim = SWAP_CLUSTER_MAX,
>  		.may_unmap = 1,
>  		.may_swap = 1,
> @@ -1893,7 +1891,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
>  						struct zone *zone, int nid)
>  {
>  	struct scan_control sc = {
> -		.may_writepage = !laptop_mode,
> +		.may_writepage = 0,
>  		.may_unmap = 1,
>  		.may_swap = !noswap,
>  		.swappiness = swappiness,
> @@ -1926,7 +1924,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>  {
>  	struct zonelist *zonelist;
>  	struct scan_control sc = {
> -		.may_writepage = !laptop_mode,
> +		.may_writepage = 0,
>  		.may_unmap = 1,
>  		.may_swap = !noswap,
>  		.nr_to_reclaim = SWAP_CLUSTER_MAX,
> @@ -2567,7 +2565,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>  	struct reclaim_state reclaim_state;
>  	int priority;
>  	struct scan_control sc = {
> -		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
> +		.may_writepage = (current_is_kswapd() &&
> +					(zone_reclaim_mode & RECLAIM_WRITE)),
>  		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
>  		.may_swap = 1,
>  		.nr_to_reclaim = max_t(unsigned long, nr_pages,
> -- 
> 1.6.5
> 

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-13  0:17 [PATCH] mm: disallow direct reclaim page writeback Dave Chinner
  2010-04-13  8:31 ` KOSAKI Motohiro
@ 2010-04-13  9:58 ` Mel Gorman
  2010-04-13 11:19   ` Dave Chinner
  2010-04-14  0:24 ` Minchan Kim
  2010-04-16  1:13 ` KAMEZAWA Hiroyuki
  3 siblings, 1 reply; 115+ messages in thread
From: Mel Gorman @ 2010-04-13  9:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-mm, linux-fsdevel

On Tue, Apr 13, 2010 at 10:17:58AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we enter direct reclaim we may have used an arbitrary amount of stack
> space, and hence entering the filesystem to do writeback can then lead to
> stack overruns. This problem was recently encountered on x86_64 systems with
> 8k stacks running XFS with simple storage configurations.
> 
> Writeback from direct reclaim also adversely affects background writeback. The
> background flusher threads should already be taking care of cleaning dirty
> pages, and direct reclaim will kick them if they aren't already doing work. If
> direct reclaim is also calling ->writepage, it will cause the IO patterns from
> the background flusher threads to be upset by LRU-order writeback from
> pageout() which can be effectively random IO. Having competing sources of IO
> trying to clean pages on the same backing device reduces throughput by
> increasing the number of seeks that the backing device has to do to write back
> the pages.
> 

It's already known that the VM requesting specific pages be cleaned and
reclaimed is a bad IO pattern but unfortunately it is still required by
lumpy reclaim. This change would appear to break that although I haven't
tested it to be 100% sure.

Even without high-order considerations, this patch would appear to make
fairly large changes to how direct reclaim behaves. It would no longer
wait on page writeback, for example, so direct reclaim will return sooner
than it did, potentially going OOM if there were a lot of dirty pages and
it made no progress during direct reclaim.

> Hence for direct reclaim we should not allow ->writepage to be entered at all.
> Set up the relevant scan_control structures to enforce this, and prevent
> sc->may_writepage from being set in other places in the direct reclaim path in
> response to other events.
> 

If an FS caller cannot re-enter the FS, it should be using GFP_NOFS
instead of GFP_KERNEL.

> Reported-by: John Berthels <john@humyo.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  mm/vmscan.c |   13 ++++++-------
>  1 files changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e0e5f15..5321ac4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1826,10 +1826,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  		 * writeout.  So in laptop mode, write out the whole world.
>  		 */
>  		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
> -		if (total_scanned > writeback_threshold) {
> +		if (total_scanned > writeback_threshold)
>  			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
> -			sc->may_writepage = 1;
> -		}
>  
>  		/* Take a nap, wait for some writeback to complete */
>  		if (!sc->hibernation_mode && sc->nr_scanned &&
> @@ -1871,7 +1869,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  {
>  	struct scan_control sc = {
>  		.gfp_mask = gfp_mask,
> -		.may_writepage = !laptop_mode,
> +		.may_writepage = 0,
>  		.nr_to_reclaim = SWAP_CLUSTER_MAX,
>  		.may_unmap = 1,
>  		.may_swap = 1,
> @@ -1893,7 +1891,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
>  						struct zone *zone, int nid)
>  {
>  	struct scan_control sc = {
> -		.may_writepage = !laptop_mode,
> +		.may_writepage = 0,
>  		.may_unmap = 1,
>  		.may_swap = !noswap,
>  		.swappiness = swappiness,
> @@ -1926,7 +1924,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>  {
>  	struct zonelist *zonelist;
>  	struct scan_control sc = {
> -		.may_writepage = !laptop_mode,
> +		.may_writepage = 0,
>  		.may_unmap = 1,
>  		.may_swap = !noswap,
>  		.nr_to_reclaim = SWAP_CLUSTER_MAX,
> @@ -2567,7 +2565,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>  	struct reclaim_state reclaim_state;
>  	int priority;
>  	struct scan_control sc = {
> -		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
> +		.may_writepage = (current_is_kswapd() &&
> +					(zone_reclaim_mode & RECLAIM_WRITE)),
>  		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
>  		.may_swap = 1,
>  		.nr_to_reclaim = max_t(unsigned long, nr_pages,
> -- 
> 1.6.5
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-13  8:31 ` KOSAKI Motohiro
@ 2010-04-13 10:29   ` Dave Chinner
  2010-04-13 11:39     ` KOSAKI Motohiro
  0 siblings, 1 reply; 115+ messages in thread
From: Dave Chinner @ 2010-04-13 10:29 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: linux-kernel, linux-mm, linux-fsdevel, Chris Mason

On Tue, Apr 13, 2010 at 05:31:25PM +0900, KOSAKI Motohiro wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > When we enter direct reclaim we may have used an arbitrary amount of stack
> > space, and hence entering the filesystem to do writeback can then lead to
> > stack overruns. This problem was recently encountered on x86_64 systems with
> > 8k stacks running XFS with simple storage configurations.
> > 
> > Writeback from direct reclaim also adversely affects background writeback. The
> > background flusher threads should already be taking care of cleaning dirty
> > pages, and direct reclaim will kick them if they aren't already doing work. If
> > direct reclaim is also calling ->writepage, it will cause the IO patterns from
> > the background flusher threads to be upset by LRU-order writeback from
> > pageout() which can be effectively random IO. Having competing sources of IO
> > trying to clean pages on the same backing device reduces throughput by
> > increasing the number of seeks that the backing device has to do to write back
> > the pages.
> > 
> > Hence for direct reclaim we should not allow ->writepage to be entered at all.
> > Set up the relevant scan_control structures to enforce this, and prevent
> > sc->may_writepage from being set in other places in the direct reclaim path in
> > response to other events.
> 
> Ummm..
> This patch is hard to ack. This patch's pros and cons seem to be:
> 
> Pros:
> 	1) prevents XFS stack overflows
> 	2) improves IO workload performance
> 
> Cons:
> 	3) TOTALLY kills lumpy reclaim (i.e. high-order allocation)
> 
> So, if we only needed to consider IO workloads there would be no
> downside, but we can't.
> 
> I think (1) is an XFS issue. XFS should handle it itself.

The filesystem is irrelevant, IMO.

The traces from the reporter showed that we've got close to a 2k
stack footprint for memory allocation to direct reclaim and then we
can put the entire writeback path on top of that. This is roughly
3.5k for XFS, and then depending on the storage subsystem
configuration and transport can be another 2k of stack needed below
XFS.

IOWs, if we completely ignore the filesystem stack usage, there's
still up to 4k of stack needed in the direct reclaim path. Given
that one of the stack traces supplied show direct reclaim being
entered with over 3k of stack already used, pretty much any
filesystem is capable of blowing an 8k stack.

So, this is not an XFS issue, even though XFS is the first to
uncover it. Don't shoot the messenger....

> But (2) is really a VM issue. Currently our VM calls pageout() too
> aggressively and decreases IO throughput. I've heard about this issue
> from Chris (cc'd). I'd like to fix this.

I didn't expect this to be easy. ;)

I had a good look at what the code was doing before I wrote the
patch, and IMO, there is no good reason for issuing IO from direct
reclaim.

My reasoning is as follows - consider a system with a typical
sata disk and the machine is low on memory and in direct reclaim.

Direct reclaim takes pages off the end of the LRU and writes
them one at a time from there. It scans thousands of pages
and triggers IO on the dirty ones it comes across.
This is done with no regard to the IO patterns it generates - it can
(and frequently does) result in completely random single-page IO
patterns hitting the disk, and as a result cleaning pages happens
really, really slowly. If we are in an OOM situation, the machine
will grind to a halt as it struggles to clean maybe 1MB of RAM per
second.
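
(Back of the envelope: at 150-250 random IOPS and 4k per page, that is
only 0.6-1MB/s of page cleaning - illustrative numbers only, but they
show the scale of the problem.)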

On the other hand, if the IO is well formed then the disk might be
capable of 100MB/s. The background flusher threads and filesystems
try very hard to issue well formed IOs, so the difference in the
rate that memory can be cleaned may be a couple of orders of
magnitude.

(Of course, the difference will typically be somewhere in between
these two extremes, but I'm simply trying to illustrate how big
the difference in performance can be.)

IOWs, the background flusher threads are there to clean memory by
issuing IO as efficiently as possible.  Direct reclaim is very
efficient at reclaiming clean memory, but it really, really sucks at
cleaning dirty memory in a predictable and deterministic manner. It
is also much more likely to hit worst case IO patterns than the
background flusher threads.

Hence I think that direct reclaim should be deferring to the
background flusher threads for cleaning memory and not trying to be
doing it itself.

> But we can never kill pageout() completely because we can't assume
> users don't run high-order allocation workloads.

I think that lumpy reclaim will still work just fine.

Lumpy reclaim appears to be using IO as a method of slowing
down the reclaim cycle - the congestion_wait() call will still
function as it does now if the background flusher threads are active
and causing congestion. I don't see why lumpy reclaim specifically
needs to be issuing IO to make it work - if the congestion_wait() is
not waiting long enough then wait longer - don't issue IO to extend
the wait time.
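
(For reference, the throttling in question looks roughly like this -
condensed from my reading of the 2.6.33-era shrink_inactive_list(), so
treat it as a sketch rather than a verbatim quote:

	if (nr_freed < nr_taken && !current_is_kswapd() && lumpy_reclaim) {
		/* throttle on IO congestion before retrying */
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/* second, synchronous pass over the same pages */
		nr_freed += shrink_page_list(&page_list, sc,
					     PAGEOUT_IO_SYNC);
	}

The wait itself is just the congestion_wait() call; the IO comes from
the PAGEOUT_IO_SYNC pass that follows it.)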

Also, there doesn't appear to be anything special about the chunks of
pages it's issuing IO on and waiting for, either. They are simply
the last N pages on the LRU that could be grabbed so they have no
guarantee of contiguity, so the IO it issues does nothing specific
to help higher order allocations to succeed.

Hence it really seems to me that the effectiveness of lumpy reclaim
is determined mostly by the effectiveness of the IO subsystem - the
faster the IO subsystem cleans pages, the less time lumpy reclaim
will block and the faster it will free pages. From this observation
and the fact that issuing IO only from the bdi flusher threads will
have the same effect (improves IO subsystem effectiveness), it seems
to me that lumpy reclaim should not be adversely affected by this
change.

Of course, the code is a maze of twisty passages, so I probably
missed something important. Hopefully someone can tell me what. ;)

FWIW, the biggest problem here is that I have absolutely no clue on
how to test what the impact on lumpy reclaim really is. Does anyone
have a relatively simple test that can be run to determine what the
impact is?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-13  9:58 ` Mel Gorman
@ 2010-04-13 11:19   ` Dave Chinner
  2010-04-13 19:34     ` Mel Gorman
  0 siblings, 1 reply; 115+ messages in thread
From: Dave Chinner @ 2010-04-13 11:19 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-kernel, linux-mm, linux-fsdevel

On Tue, Apr 13, 2010 at 10:58:15AM +0100, Mel Gorman wrote:
> On Tue, Apr 13, 2010 at 10:17:58AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > When we enter direct reclaim we may have used an arbitrary amount of stack
> > space, and hence entering the filesystem to do writeback can then lead to
> > stack overruns. This problem was recently encountered on x86_64 systems with
> > 8k stacks running XFS with simple storage configurations.
> > 
> > Writeback from direct reclaim also adversely affects background writeback. The
> > background flusher threads should already be taking care of cleaning dirty
> > pages, and direct reclaim will kick them if they aren't already doing work. If
> > direct reclaim is also calling ->writepage, it will cause the IO patterns from
> > the background flusher threads to be upset by LRU-order writeback from
> > pageout() which can be effectively random IO. Having competing sources of IO
> > trying to clean pages on the same backing device reduces throughput by
> > increasing the number of seeks that the backing device has to do to write back
> > the pages.
> > 
> 
> It's already known that the VM requesting specific pages be cleaned and
> reclaimed is a bad IO pattern but unfortunately it is still required by
> lumpy reclaim. This change would appear to break that although I haven't
> tested it to be 100% sure.

How do you test it? I'd really like to be able to test this myself....

> Even without high-order considerations, this patch would appear to make
> fairly large changes to how direct reclaim behaves. It would no longer
> wait on page writeback, for example, so direct reclaim will return sooner

AFAICT it still waits for pages under writeback in exactly the same manner
it does now. shrink_page_list() does the following completely
separately to the sc->may_writepage flag:

 666                 may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
 667                         (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
 668
 669                 if (PageWriteback(page)) {
 670                         /*
 671                          * Synchronous reclaim is performed in two passes,
 672                          * first an asynchronous pass over the list to
 673                          * start parallel writeback, and a second synchronous
 674                          * pass to wait for the IO to complete.  Wait here
 675                          * for any page for which writeback has already
 676                          * started.
 677                          */
 678                         if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
 679                                 wait_on_page_writeback(page);
 680                         else
 681                                 goto keep_locked;
 682                 }

So if the page is under writeback, PAGEOUT_IO_SYNC is set and
we can enter the fs, it will still wait for writeback to complete
just like it does now.

However, the current code only uses PAGEOUT_IO_SYNC in lumpy
reclaim, so for most typical workloads direct reclaim does not wait
on page writeback, either. Hence, this patch doesn't appear to
change the actions taken on a page under writeback in direct
reclaim....
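
(For contrast, the path this patch does gate is the dirty-page case
further down shrink_page_list() - again condensed, so a sketch rather
than a verbatim quote:

	if (PageDirty(page)) {
		...
		if (!may_enter_fs)
			goto keep_locked;
		if (!sc->may_writepage)
			goto keep_locked;

		/* Page is dirty, try to write it out here */
		switch (pageout(page, mapping, sync_writeback)) {
		...

With sc->may_writepage now always zero in direct reclaim, dirty pages
simply stay on the LRU for the flusher threads instead of going
through pageout().)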

> than it did, potentially going OOM if there were a lot of dirty pages and
> it made no progress during direct reclaim.

I did a fair bit of low/small memory testing. This is a subjective
observation, but I definitely seemed to get less severe OOM
situations and better overall responsiveness with this patch
compared to when direct reclaim was doing writeback.

> > Hence for direct reclaim we should not allow ->writepage to be entered at all.
> > Set up the relevant scan_control structures to enforce this, and prevent
> > sc->may_writepage from being set in other places in the direct reclaim path in
> > response to other events.
> > 
> 
> If an FS caller cannot re-enter the FS, it should be using GFP_NOFS
> instead of GFP_KERNEL.

This problem is not a filesystem recursion problem which is, as I
understand it, what GFP_NOFS is used to prevent. It's _any_ kernel
code that uses significant stack before trying to allocate memory
that is the problem, e.g. a select() system call:

       Depth    Size   Location    (47 entries)
       -----    ----   --------
 0)     7568      16   mempool_alloc_slab+0x16/0x20
 1)     7552     144   mempool_alloc+0x65/0x140
 2)     7408      96   get_request+0x124/0x370
 3)     7312     144   get_request_wait+0x29/0x1b0
 4)     7168      96   __make_request+0x9b/0x490
 5)     7072     208   generic_make_request+0x3df/0x4d0
 6)     6864      80   submit_bio+0x7c/0x100
 7)     6784      96   _xfs_buf_ioapply+0x128/0x2c0 [xfs]
....
32)     3184      64   xfs_vm_writepage+0xab/0x160 [xfs]
33)     3120     384   shrink_page_list+0x65e/0x840
34)     2736     528   shrink_zone+0x63f/0xe10
35)     2208     112   do_try_to_free_pages+0xc2/0x3c0
36)     2096     128   try_to_free_pages+0x77/0x80
37)     1968     240   __alloc_pages_nodemask+0x3e4/0x710
38)     1728      48   alloc_pages_current+0x8c/0xe0
39)     1680      16   __get_free_pages+0xe/0x50
40)     1664      48   __pollwait+0xca/0x110
41)     1616      32   unix_poll+0x28/0xc0
42)     1584      16   sock_poll+0x1d/0x20
43)     1568     912   do_select+0x3d6/0x700
44)      656     416   core_sys_select+0x18c/0x2c0
45)      240     112   sys_select+0x4f/0x110
46)      128     128   system_call_fastpath+0x16/0x1b

There's 1.6k of stack used before memory allocation is called, 3.1k
used there before ->writepage is entered, XFS used 3.5k, and
if the mempool needed to allocate a page it would have blown the
stack. If there was any significant storage subsystem (add dm, md
and/or scsi of some kind), it would have blown the stack.

Basically, there is not enough stack space available to allow direct
reclaim to enter ->writepage _anywhere_ according to the stack usage
profiles we are seeing here....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-13 10:29   ` Dave Chinner
@ 2010-04-13 11:39     ` KOSAKI Motohiro
  2010-04-13 14:36       ` Dave Chinner
  0 siblings, 1 reply; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-13 11:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: kosaki.motohiro, linux-kernel, linux-mm, linux-fsdevel, Chris Mason

Hi

> > Pros:
> > 	1) prevents XFS stack overflows
> > 	2) improves IO workload performance
> > 
> > Cons:
> > 	3) TOTALLY kills lumpy reclaim (i.e. high-order allocation)
> > 
> > So, if we only needed to consider IO workloads there would be no
> > downside, but we can't.
> > 
> > I think (1) is an XFS issue. XFS should handle it itself.
> 
> The filesystem is irrelevant, IMO.
> 
> The traces from the reporter showed that we've got close to a 2k
> stack footprint for memory allocation to direct reclaim and then we
> can put the entire writeback path on top of that. This is roughly
> 3.5k for XFS, and then depending on the storage subsystem
> configuration and transport can be another 2k of stack needed below
> XFS.
> 
> IOWs, if we completely ignore the filesystem stack usage, there's
> still up to 4k of stack needed in the direct reclaim path. Given
> that one of the stack traces supplied show direct reclaim being
> entered with over 3k of stack already used, pretty much any
> filesystem is capable of blowing an 8k stack.
> 
> So, this is not an XFS issue, even though XFS is the first to
> uncover it. Don't shoot the messenger....

Thanks for the explanation. I hadn't noticed that direct reclaim
consumes 2k of stack. I'll investigate it and try to put it on a diet.
But XFS's 3.5k stack consumption is too large too; please put it on a
diet as well.


> > But (2) is really a VM issue. Currently our VM calls pageout() too
> > aggressively and decreases IO throughput. I've heard about this issue
> > from Chris (cc'd). I'd like to fix this.
> 
> I didn't expect this to be easy. ;)
> 
> I had a good look at what the code was doing before I wrote the
> patch, and IMO, there is no good reason for issuing IO from direct
> reclaim.
> 
> My reasoning is as follows - consider a system with a typical
> sata disk and the machine is low on memory and in direct reclaim.
> 
> Direct reclaim takes pages off the end of the LRU and writes
> them one at a time from there. It scans thousands of pages
> and triggers IO on the dirty ones it comes across.
> This is done with no regard to the IO patterns it generates - it can
> (and frequently does) result in completely random single-page IO
> patterns hitting the disk, and as a result cleaning pages happens
> really, really slowly. If we are in an OOM situation, the machine
> will grind to a halt as it struggles to clean maybe 1MB of RAM per
> second.
> 
> On the other hand, if the IO is well formed then the disk might be
> capable of 100MB/s. The background flusher threads and filesystems
> try very hard to issue well formed IOs, so the difference in the
> rate that memory can be cleaned may be a couple of orders of
> magnitude.
> 
> (Of course, the difference will typically be somewhere in between
> these two extremes, but I'm simply trying to illustrate how big
> the difference in performance can be.)
> 
> IOWs, the background flusher threads are there to clean memory by
> issuing IO as efficiently as possible.  Direct reclaim is very
> efficient at reclaiming clean memory, but it really, really sucks at
> cleaning dirty memory in a predictable and deterministic manner. It
> is also much more likely to hit worst case IO patterns than the
> background flusher threads.
> 
> Hence I think that direct reclaim should be deferring to the
> background flusher threads for cleaning memory and not trying to be
> doing it itself.

Well, you seem to keep discussing the IO workload case. I don't
disagree on that point.

For example, if only order-0 reclaim skipped pageout(), we would get
the above benefit too.



> > But we can never kill pageout() completely because we can't assume
> > users don't run high-order allocation workloads.
> 
> I think that lumpy reclaim will still work just fine.
> 
> Lumpy reclaim appears to be using IO as a method of slowing
> down the reclaim cycle - the congestion_wait() call will still
> function as it does now if the background flusher threads are active
> and causing congestion. I don't see why lumpy reclaim specifically
> needs to be issuing IO to make it work - if the congestion_wait() is
> not waiting long enough then wait longer - don't issue IO to extend
> the wait time.

Lumpy reclaim is for allocating high-order pages. It reclaims not only
the page at the head of the LRU but also its PFN neighborhood. The PFN
neighborhood often contains newly allocated, still-dirty pages, so we
enforce pageout() cleaning and discard them.

When a high-order allocation occurs, we don't only need to free enough
memory, we also need to free a large enough contiguous memory block.

If we needed to consider _only_ IO throughput, waiting for the flusher
thread might be faster, but we also need to consider reclaim latency.
I worry about that point too.



> Also, there doesn't appear to be anything special about the chunks of
> pages it's issuing IO on and waiting for, either. They are simply
> the last N pages on the LRU that could be grabbed so they have no
> guarantee of contiguity, so the IO it issues does nothing specific
> to help higher order allocations to succeed.

It does. Lumpy reclaim doesn't grab the last N pages; instead it grabs
a contiguous memory chunk. Please see isolate_lru_pages().
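
The arithmetic is simple: round the LRU page's PFN down to an
order-aligned boundary and try to take every page in that block. A
small user-space sketch of the idea (illustrative only, not the kernel
code):

	#include <stdio.h>

	int main(void)
	{
		unsigned long pfn = 0x12345;	/* PFN of the page found on the LRU */
		unsigned int order = 4;		/* want a contiguous 2^4 = 16 page block */
		unsigned long start, end, p;

		/* round down to the order-aligned block containing the page */
		start = pfn & ~((1UL << order) - 1);
		end = start + (1UL << order);

		/* lumpy reclaim tries to isolate every page in [start, end),
		 * including dirty neighbours that then need pageout() */
		for (p = start; p < end; p++)
			printf("isolate pfn %#lx%s\n", p,
			       p == pfn ? "  <- the LRU page" : "");

		return 0;
	}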

> 
> Hence it really seems to me that the effectiveness of lumpy reclaim
> is determined mostly by the effectiveness of the IO subsystem - the
> faster the IO subsystem cleans pages, the less time lumpy reclaim
> will block and the faster it will free pages. From this observation
> and the fact that issuing IO only from the bdi flusher threads will
> have the same effect (improves IO subsystem effectiveness), it seems
> to me that lumpy reclaim should not be adversely affected by this
> change.
> 
> Of course, the code is a maze of twisty passages, so I probably
> missed something important. Hopefully someone can tell me what. ;)
> 
> FWIW, the biggest problem here is that I have absolutely no clue on
> how to test what the impact on lumpy reclaim really is. Does anyone
> have a relatively simple test that can be run to determine what the
> impact is?

So, can you please run two workloads concurrently?
 - a normal IO workload (fio, iozone, etc.)
 - echo $NUM > /proc/sys/vm/nr_hugepages

The most typical high-order allocations are caused by brutal wireless
LAN drivers (or some cheap LAN cards). But sadly, if the test depends
on specific hardware our discussion could easily turn into a mess, so
I hope to use the hugepage feature instead.


Thanks.





* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-13 11:39     ` KOSAKI Motohiro
@ 2010-04-13 14:36       ` Dave Chinner
  2010-04-14  3:12         ` Dave Chinner
  2010-04-14  6:52         ` KOSAKI Motohiro
  0 siblings, 2 replies; 115+ messages in thread
From: Dave Chinner @ 2010-04-13 14:36 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: linux-kernel, linux-mm, linux-fsdevel, Chris Mason

On Tue, Apr 13, 2010 at 08:39:29PM +0900, KOSAKI Motohiro wrote:
> Hi
> 
> > > Pros:
> > > 	1) prevents XFS stack overflows
> > > 	2) improves IO workload performance
> > > 
> > > Cons:
> > > 	3) TOTALLY kills lumpy reclaim (i.e. high-order allocation)
> > > 
> > > So, if we only needed to consider IO workloads there would be no
> > > downside, but we can't.
> > > 
> > > I think (1) is an XFS issue. XFS should handle it itself.
> > 
> > The filesystem is irrelevant, IMO.
> > 
> > The traces from the reporter showed that we've got close to a 2k
> > stack footprint for memory allocation to direct reclaim and then we
> > can put the entire writeback path on top of that. This is roughly
> > 3.5k for XFS, and then depending on the storage subsystem
> > configuration and transport can be another 2k of stack needed below
> > XFS.
> > 
> > IOWs, if we completely ignore the filesystem stack usage, there's
> > still up to 4k of stack needed in the direct reclaim path. Given
> > that one of the stack traces supplied show direct reclaim being
> > entered with over 3k of stack already used, pretty much any
> > filesystem is capable of blowing an 8k stack.
> > 
> > So, this is not an XFS issue, even though XFS is the first to
> > uncover it. Don't shoot the messenger....
> 
> Thanks for the explanation. I hadn't noticed that direct reclaim
> consumes 2k of stack. I'll investigate it and try to put it on a diet.
> But XFS's 3.5k stack consumption is too large too; please put it on a
> diet as well.

It hasn't grown in the last two years, since the last major diet when
all the fat was trimmed from it in the i386 4k stack vs XFS saga. It
seems that everything else around XFS has grown in that time, and now
we are blowing stacks again....

> > Hence I think that direct reclaim should be deferring to the
> > background flusher threads for cleaning memory and not trying to be
> > doing it itself.
> 
> Well, you seem to keep discussing the IO workload case. I don't
> disagree on that point.
> 
> For example, if only order-0 reclaim skipped pageout(), we would get
> the above benefit too.

But it won't prevent stack blowups...

> > > But we can never kill pageout() completely because we can't assume
> > > users don't run high-order allocation workloads.
> > 
> > I think that lumpy reclaim will still work just fine.
> > 
> > Lumpy reclaim appears to be using IO as a method of slowing
> > down the reclaim cycle - the congestion_wait() call will still
> > function as it does now if the background flusher threads are active
> > and causing congestion. I don't see why lumpy reclaim specifically
> > needs to be issuing IO to make it work - if the congestion_wait() is
> > not waiting long enough then wait longer - don't issue IO to extend
> > the wait time.
> 
> Lumpy reclaim is for allocating high-order pages. It reclaims not only
> the page at the head of the LRU but also its PFN neighborhood. The PFN
> neighborhood often contains newly allocated, still-dirty pages, so we
> enforce pageout() cleaning and discard them.

Ok, I see that now - I missed the second call to __isolate_lru_pages()
in isolate_lru_pages().

> When a high-order allocation occurs, we don't only need to free enough
> memory, we also need to free a large enough contiguous memory block.

Agreed, that was why I was kind of surprised not to find it was
doing that. But, as you have pointed out, that was my mistake.

> If we needed to consider _only_ IO throughput, waiting for the flusher
> thread might be faster, but we also need to consider reclaim latency.
> I worry about that point too.

True, but without knowing how to test and measure such things I can't
really comment...

> > Of course, the code is a maze of twisty passages, so I probably
> > missed something important. Hopefully someone can tell me what. ;)
> > 
> > FWIW, the biggest problem here is that I have absolutely no clue on
> > how to test what the impact on lumpy reclaim really is. Does anyone
> > have a relatively simple test that can be run to determine what the
> > impact is?
> 
> So, can you please run two workloads concurrently?
>  - a normal IO workload (fio, iozone, etc.)
>  - echo $NUM > /proc/sys/vm/nr_hugepages

What do I measure/observe/record that is meaningful?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-13 11:19   ` Dave Chinner
@ 2010-04-13 19:34     ` Mel Gorman
  2010-04-13 20:20       ` Chris Mason
  0 siblings, 1 reply; 115+ messages in thread
From: Mel Gorman @ 2010-04-13 19:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-mm, linux-fsdevel

On Tue, Apr 13, 2010 at 09:19:02PM +1000, Dave Chinner wrote:
> On Tue, Apr 13, 2010 at 10:58:15AM +0100, Mel Gorman wrote:
> > On Tue, Apr 13, 2010 at 10:17:58AM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > When we enter direct reclaim we may have used an arbitrary amount of stack
> > > space, and hence entering the filesystem to do writeback can then lead to
> > > stack overruns. This problem was recently encountered on x86_64 systems with
> > > 8k stacks running XFS with simple storage configurations.
> > > 
> > > Writeback from direct reclaim also adversely affects background writeback. The
> > > background flusher threads should already be taking care of cleaning dirty
> > > pages, and direct reclaim will kick them if they aren't already doing work. If
> > > direct reclaim is also calling ->writepage, it will cause the IO patterns from
> > > the background flusher threads to be upset by LRU-order writeback from
> > > pageout() which can be effectively random IO. Having competing sources of IO
> > > trying to clean pages on the same backing device reduces throughput by
> > > increasing the number of seeks that the backing device has to do to write back
> > > the pages.
> > > 
> > 
> > It's already known that the VM requesting specific pages be cleaned and
> > reclaimed is a bad IO pattern but unfortunately it is still required by
> > lumpy reclaim. This change would appear to break that although I haven't
> > tested it to be 100% sure.
> 
> How do you test it? I'd really like to be able to test this myself....
> 

Depends. For raw effectiveness, I run a series of performance-related
benchmarks with a final test that

o Starts a number of parallel compiles that in combination are 1.25 times
  of physical memory in total size
o Sleep three minutes
o Start allocating huge pages recording the latency required for each one
o Record overall success rate and graph latency over time

Lumpy reclaim both increases the success rate and reduces the latency.
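
The hugepage step can be approximated with something like this - a
simplified sketch of the idea, not my actual harness (needs root;
columns are requested count, achieved count and latency in ms):

	#include <stdio.h>
	#include <time.h>

	int main(void)
	{
		long n, got;
		struct timespec t0, t1;
		FILE *f;

		for (n = 1; n <= 500; n++) {
			clock_gettime(CLOCK_MONOTONIC, &t0);
			f = fopen("/proc/sys/vm/nr_hugepages", "w");
			if (!f)
				return 1;
			fprintf(f, "%ld\n", n);
			fclose(f);	/* the allocation happens on this write */
			clock_gettime(CLOCK_MONOTONIC, &t1);

			/* read the count back to track the success rate */
			got = 0;
			f = fopen("/proc/sys/vm/nr_hugepages", "r");
			if (f) {
				fscanf(f, "%ld", &got);
				fclose(f);
			}

			printf("%ld %ld %.3f\n", n, got,
			       (t1.tv_sec - t0.tv_sec) * 1e3 +
			       (t1.tv_nsec - t0.tv_nsec) / 1e6);
		}
		return 0;
	}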

> > Even without high-order considerations, this patch would appear to make
> > fairly large changes to how direct reclaim behaves. It would no longer
> > wait on page writeback, for example, so direct reclaim will return sooner
> 
> AFAICT it still waits for pages under writeback in exactly the same manner
> it does now. shrink_page_list() does the following completely
> separately to the sc->may_writepage flag:
> 
>  666                 may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
>  667                         (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
>  668
>  669                 if (PageWriteback(page)) {
>  670                         /*
>  671                          * Synchronous reclaim is performed in two passes,
>  672                          * first an asynchronous pass over the list to
>  673                          * start parallel writeback, and a second synchronous
>  674                          * pass to wait for the IO to complete.  Wait here
>  675                          * for any page for which writeback has already
>  676                          * started.
>  677                          */
>  678                         if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
>  679                                 wait_on_page_writeback(page);
>  680                         else
>  681                                 goto keep_locked;
>  682                 }
> 

Right, so it'll still wait on writeback but won't kick it off. That
would still be a fairly significant change in behaviour though. Think of
synchronous lumpy reclaim, for example, where it queues up a contiguous
batch of pages and then waits for them to write back.

> So if the page is under writeback, PAGEOUT_IO_SYNC is set and
> we can enter the fs, it will still wait for writeback to complete
> just like it does now.
> 

But it would no longer be queueing them for writeback, so it'd be
depending heavily on kswapd or a background cleaning daemon to clean
them.

> However, the current code only uses PAGEOUT_IO_SYNC in lumpy
> reclaim, so for most typical workloads direct reclaim does not wait
> on page writeback, either.

No, but it does queue them back on the LRU where they might be clean the
next time they are found on the list. How significant a problem this is
I couldn't tell you but it could show a corner case where a large number
of direct reclaimers are encountering dirty pages frequently and
recycling them around the LRU list instead of cleaning them.

> Hence, this patch doesn't appear to
> change the actions taken on a page under writeback in direct
> reclaim....
> 

It does, but indirectly. The impact is very direct for lumpy reclaim
obviously. For other direct reclaim, pages that were at the end of the
LRU list are no longer getting cleaned before doing another lap through
the LRU list.

The consequences of the latter are harder to predict.

> > than it did, potentially going OOM if there were a lot of dirty pages and
> > it made no progress during direct reclaim.
> 
> I did a fair bit of low/small memory testing. This is a subjective
> observation, but I definitely seemed to get less severe OOM
> situations and better overall responsiveness with this patch
> compared to when direct reclaim was doing writeback.
> 

And it is possible that it is best overall if only kswapd and the
background cleaner are queueing pages for IO. All I can say for sure is
that this does appear to hurt lumpy reclaim and does affect normal
direct reclaim where I have no predictions.

> > > Hence for direct reclaim we should not allow ->writepage to be entered at all.
> > > Set up the relevant scan_control structures to enforce this, and prevent
> > > sc->may_writepage from being set in other places in the direct reclaim path in
> > > response to other events.
> > > 
> > 
> > If an FS caller cannot re-enter the FS, it should be using GFP_NOFS
> > instead of GFP_KERNEL.
> 
> This problem is not a filesystem recursion problem which is, as I
> understand it, what GFP_NOFS is used to prevent. It's _any_ kernel
> code that uses significant stack before trying to allocate memory
> that is the problem, e.g. a select() system call:
> 
>        Depth    Size   Location    (47 entries)
>        -----    ----   --------
>  0)     7568      16   mempool_alloc_slab+0x16/0x20
>  1)     7552     144   mempool_alloc+0x65/0x140
>  2)     7408      96   get_request+0x124/0x370
>  3)     7312     144   get_request_wait+0x29/0x1b0
>  4)     7168      96   __make_request+0x9b/0x490
>  5)     7072     208   generic_make_request+0x3df/0x4d0
>  6)     6864      80   submit_bio+0x7c/0x100
>  7)     6784      96   _xfs_buf_ioapply+0x128/0x2c0 [xfs]
> ....
> 32)     3184      64   xfs_vm_writepage+0xab/0x160 [xfs]
> 33)     3120     384   shrink_page_list+0x65e/0x840
> 34)     2736     528   shrink_zone+0x63f/0xe10
> 35)     2208     112   do_try_to_free_pages+0xc2/0x3c0
> 36)     2096     128   try_to_free_pages+0x77/0x80
> 37)     1968     240   __alloc_pages_nodemask+0x3e4/0x710
> 38)     1728      48   alloc_pages_current+0x8c/0xe0
> 39)     1680      16   __get_free_pages+0xe/0x50
> 40)     1664      48   __pollwait+0xca/0x110
> 41)     1616      32   unix_poll+0x28/0xc0
> 42)     1584      16   sock_poll+0x1d/0x20
> 43)     1568     912   do_select+0x3d6/0x700
> 44)      656     416   core_sys_select+0x18c/0x2c0
> 45)      240     112   sys_select+0x4f/0x110
> 46)      128     128   system_call_fastpath+0x16/0x1b
> 
> There's 1.6k of stack used before memory allocation is called, 3.1k
> used there before ->writepage is entered, XFS used 3.5k, and
> if the mempool needed to allocate a page it would have blown the
> stack. If there was any significant storage subsystem (add dm, md
> and/or scsi of some kind), it would have blown the stack.
> 
> Basically, there is not enough stack space available to allow direct
> reclaim to enter ->writepage _anywhere_ according to the stack usage
> profiles we are seeing here....
> 

I'm not denying the evidence but how has it been gotten away with for years
then? Prevention of writeback isn't the answer without figuring out how
direct reclaimers can queue pages for IO and in the case of lumpy reclaim
doing sync IO, then waiting on those pages.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-13 19:34     ` Mel Gorman
@ 2010-04-13 20:20       ` Chris Mason
  2010-04-14  1:40         ` Dave Chinner
                           ` (2 more replies)
  0 siblings, 3 replies; 115+ messages in thread
From: Chris Mason @ 2010-04-13 20:20 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Dave Chinner, linux-kernel, linux-mm, linux-fsdevel

On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > This problem is not a filesystem recursion problem which is, as I
> > understand it, what GFP_NOFS is used to prevent. It's _any_ kernel
> > code that uses significant stack before trying to allocate memory
> > that is the problem, e.g. a select() system call:
> > 
> >        Depth    Size   Location    (47 entries)
> >        -----    ----   --------
> >  0)     7568      16   mempool_alloc_slab+0x16/0x20
> >  1)     7552     144   mempool_alloc+0x65/0x140
> >  2)     7408      96   get_request+0x124/0x370
> >  3)     7312     144   get_request_wait+0x29/0x1b0
> >  4)     7168      96   __make_request+0x9b/0x490
> >  5)     7072     208   generic_make_request+0x3df/0x4d0
> >  6)     6864      80   submit_bio+0x7c/0x100
> >  7)     6784      96   _xfs_buf_ioapply+0x128/0x2c0 [xfs]
> > ....
> > 32)     3184      64   xfs_vm_writepage+0xab/0x160 [xfs]
> > 33)     3120     384   shrink_page_list+0x65e/0x840
> > 34)     2736     528   shrink_zone+0x63f/0xe10
> > 35)     2208     112   do_try_to_free_pages+0xc2/0x3c0
> > 36)     2096     128   try_to_free_pages+0x77/0x80
> > 37)     1968     240   __alloc_pages_nodemask+0x3e4/0x710
> > 38)     1728      48   alloc_pages_current+0x8c/0xe0
> > 39)     1680      16   __get_free_pages+0xe/0x50
> > 40)     1664      48   __pollwait+0xca/0x110
> > 41)     1616      32   unix_poll+0x28/0xc0
> > 42)     1584      16   sock_poll+0x1d/0x20
> > 43)     1568     912   do_select+0x3d6/0x700
> > 44)      656     416   core_sys_select+0x18c/0x2c0
> > 45)      240     112   sys_select+0x4f/0x110
> > 46)      128     128   system_call_fastpath+0x16/0x1b
> > 
> > There's 1.6k of stack used before memory allocation is called, 3.1k
> > used there before ->writepage is entered, XFS used 3.5k, and
> > if the mempool needed to allocate a page it would have blown the
> > stack. If there was any significant storage subsystem (add dm, md
> > and/or scsi of some kind), it would have blown the stack.
> > 
> > Basically, there is not enough stack space available to allow direct
> > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > profiles we are seeing here....
> > 
> 
> I'm not denying the evidence but how has it been gotten away with for years
> then? Prevention of writeback isn't the answer without figuring out how
> direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> doing sync IO, then waiting on those pages.

So, I've been reading along, nodding my head to Dave's side of things
because seeks are evil and direct reclaim makes seeks.  I'd really love
for direct reclaim to somehow trigger writepages on large chunks instead
of doing page by page spatters of IO to the drive.

But, somewhere along the line I overlooked the part of Dave's stack trace
that said:

43)     1568     912   do_select+0x3d6/0x700

Huh, 912 bytes...for select, really?  From poll.h:

/* ~832 bytes of stack space used max in sys_select/sys_poll before allocating
   additional memory. */
#define MAX_STACK_ALLOC 832
#define FRONTEND_STACK_ALLOC    256
#define SELECT_STACK_ALLOC      FRONTEND_STACK_ALLOC
#define POLL_STACK_ALLOC        FRONTEND_STACK_ALLOC
#define WQUEUES_STACK_ALLOC     (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)
#define N_INLINE_POLL_ENTRIES   (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))
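
(If I'm reading poll.h right, the numbers add up: do_select()'s 912
bytes is mostly the on-stack struct poll_wqueues, whose inline entries
are sized by WQUEUES_STACK_ALLOC = 832 - 256 = 576 bytes, and the 416
bytes in core_sys_select() is mostly the 256-byte stack_fds buffer.)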

So, select is intentionally trying to use that much stack.  It should be using
GFP_NOFS if it really wants to suck down that much stack...if only the
kernel had some sort of way to dynamically allocate ram, it could try
that too.

-chris


* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-13  0:17 [PATCH] mm: disallow direct reclaim page writeback Dave Chinner
  2010-04-13  8:31 ` KOSAKI Motohiro
  2010-04-13  9:58 ` Mel Gorman
@ 2010-04-14  0:24 ` Minchan Kim
  2010-04-14  4:44   ` Dave Chinner
  2010-04-16  1:13 ` KAMEZAWA Hiroyuki
  3 siblings, 1 reply; 115+ messages in thread
From: Minchan Kim @ 2010-04-14  0:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-mm, linux-fsdevel

Hi, Dave.

On Tue, Apr 13, 2010 at 9:17 AM, Dave Chinner <david@fromorbit.com> wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> When we enter direct reclaim we may have used an arbitrary amount of stack
> space, and hence entering the filesystem to do writeback can then lead to
> stack overruns. This problem was recently encountered on x86_64 systems with
> 8k stacks running XFS with simple storage configurations.
>
> Writeback from direct reclaim also adversely affects background writeback. The
> background flusher threads should already be taking care of cleaning dirty
> pages, and direct reclaim will kick them if they aren't already doing work. If
> direct reclaim is also calling ->writepage, it will cause the IO patterns from
> the background flusher threads to be upset by LRU-order writeback from
> pageout() which can be effectively random IO. Having competing sources of IO
> trying to clean pages on the same backing device reduces throughput by
> increasing the number of seeks that the backing device has to do to write back
> the pages.
>
> Hence for direct reclaim we should not allow ->writepage to be entered at all.
> Set up the relevant scan_control structures to enforce this, and prevent
> sc->may_writepage from being set in other places in the direct reclaim path in
> response to other events.

I think your solution is a rather aggressive change, as Mel and KOSAKI
already pointed out.
Is the flusher thread aware of the system-level LRU recency of dirty
pages, rather than just the dirty pages' own recency?
Of course the flusher thread can clean dirty pages faster than a
direct reclaimer, but if it isn't aware of LRU ordering, hot page
thrashing can happen in corner cases, and it could lose write merging.

Also, non-rotating storage might not have a big seek cost.
I think we have to consider that case if we decide to change direct
reclaim IO.

How about separating the problem:

1. the stack hogging problem
2. direct reclaim random writes

and trying to solve them one by one instead of all at once?

-- 
Kind regards,
Minchan Kim


* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-13 20:20       ` Chris Mason
@ 2010-04-14  1:40         ` Dave Chinner
  2010-04-14  4:59           ` KAMEZAWA Hiroyuki
  2010-04-14  6:52           ` KOSAKI Motohiro
  2010-04-14  6:52         ` KOSAKI Motohiro
  2010-04-14 10:06         ` Andi Kleen
  2 siblings, 2 replies; 115+ messages in thread
From: Dave Chinner @ 2010-04-14  1:40 UTC (permalink / raw)
  To: Chris Mason, Mel Gorman, linux-kernel, linux-mm, linux-fsdevel

On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > This problem is not a filesystem recursion problem which is, as I
> > > understand it, what GFP_NOFS is used to prevent. It's _any_ kernel
> > > code that uses significant stack before trying to allocate memory
> > > that is the problem, e.g. a select() system call:
> > > 
> > >        Depth    Size   Location    (47 entries)
> > >        -----    ----   --------
> > >  0)     7568      16   mempool_alloc_slab+0x16/0x20
> > >  1)     7552     144   mempool_alloc+0x65/0x140
> > >  2)     7408      96   get_request+0x124/0x370
> > >  3)     7312     144   get_request_wait+0x29/0x1b0
> > >  4)     7168      96   __make_request+0x9b/0x490
> > >  5)     7072     208   generic_make_request+0x3df/0x4d0
> > >  6)     6864      80   submit_bio+0x7c/0x100
> > >  7)     6784      96   _xfs_buf_ioapply+0x128/0x2c0 [xfs]
> > > ....
> > > 32)     3184      64   xfs_vm_writepage+0xab/0x160 [xfs]
> > > 33)     3120     384   shrink_page_list+0x65e/0x840
> > > 34)     2736     528   shrink_zone+0x63f/0xe10
> > > 35)     2208     112   do_try_to_free_pages+0xc2/0x3c0
> > > 36)     2096     128   try_to_free_pages+0x77/0x80
> > > 37)     1968     240   __alloc_pages_nodemask+0x3e4/0x710
> > > 38)     1728      48   alloc_pages_current+0x8c/0xe0
> > > 39)     1680      16   __get_free_pages+0xe/0x50
> > > 40)     1664      48   __pollwait+0xca/0x110
> > > 41)     1616      32   unix_poll+0x28/0xc0
> > > 42)     1584      16   sock_poll+0x1d/0x20
> > > 43)     1568     912   do_select+0x3d6/0x700
> > > 44)      656     416   core_sys_select+0x18c/0x2c0
> > > 45)      240     112   sys_select+0x4f/0x110
> > > 46)      128     128   system_call_fastpath+0x16/0x1b
> > > 
> > > There's 1.6k of stack used before memory allocation is called, 3.1k
> > > used there before ->writepage is entered, XFS used 3.5k, and
> > > if the mempool needed to allocate a page it would have blown the
> > > stack. If there was any significant storage subsystem (add dm, md
> > > and/or scsi of some kind), it would have blown the stack.
> > > 
> > > Basically, there is not enough stack space available to allow direct
> > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > profiles we are seeing here....
> > > 
> > 
> > I'm not denying the evidence but how has it been gotten away with for years
> > then? Prevention of writeback isn't the answer without figuring out how
> > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > doing sync IO, then waiting on those pages.
> 
> So, I've been reading along, nodding my head to Dave's side of things
> because seeks are evil and direct reclaim makes seeks.  I'd really love
> for direct reclaim to somehow trigger writepages on large chunks instead
> of doing page by page spatters of IO to the drive.

Perhaps drop the lock on the page if it is held and call one of the
helpers that filesystems use to do this, like:

	filemap_write_and_wait(page->mapping);
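
Something like this completely untested sketch, maybe - the helper name
is made up, and it hand-waves the page reference and locking rules that
shrink_page_list() actually has to obey:

	/* hypothetical, untested - purely to illustrate the idea */
	static int reclaim_write_mapping(struct page *page)
	{
		struct address_space *mapping = page->mapping;

		if (!mapping)
			return -EINVAL;

		/* writeback needs to take the page lock itself */
		unlock_page(page);
		return filemap_write_and_wait(mapping);
	}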

> But, somewhere along the line I overlooked the part of Dave's stack trace
> that said:
> 
> 43)     1568     912   do_select+0x3d6/0x700
> 
> Huh, 912 bytes...for select, really?  From poll.h:

Sure, it's bad, but focussing on the specific case misses the
point that even code that is using minimal stack can enter direct
reclaim after consuming 1.5k of stack. e.g.:

 50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
 51)     3104     384   shrink_page_list+0x65e/0x840
 52)     2720     528   shrink_zone+0x63f/0xe10
 53)     2192     112   do_try_to_free_pages+0xc2/0x3c0
 54)     2080     128   try_to_free_pages+0x77/0x80
 55)     1952     240   __alloc_pages_nodemask+0x3e4/0x710
 56)     1712      48   alloc_pages_current+0x8c/0xe0
 57)     1664      32   __page_cache_alloc+0x67/0x70
 58)     1632     144   __do_page_cache_readahead+0xd3/0x220
 59)     1488      16   ra_submit+0x21/0x30
 60)     1472      80   ondemand_readahead+0x11d/0x250
 61)     1392      64   page_cache_async_readahead+0xa9/0xe0
 62)     1328     592   __generic_file_splice_read+0x48a/0x530
 63)      736      48   generic_file_splice_read+0x4f/0x90
 64)      688      96   xfs_splice_read+0xf2/0x130 [xfs]
 65)      592      32   xfs_file_splice_read+0x4b/0x50 [xfs]
 66)      560      64   do_splice_to+0x77/0xb0
 67)      496     112   splice_direct_to_actor+0xcc/0x1c0
 68)      384      80   do_splice_direct+0x57/0x80
 69)      304      96   do_sendfile+0x16c/0x1e0
 70)      208      80   sys_sendfile64+0x8d/0xb0
 71)      128     128   system_call_fastpath+0x16/0x1b

Yes, __generic_file_splice_read() is a hog, but they seem to be
_everywhere_ today...

> So, select is intentionally trying to use that much stack.  It should be using
> GFP_NOFS if it really wants to suck down that much stack...

The code that did the allocation is called from multiple different
contexts - how is it supposed to know that in some of those contexts
it is supposed to treat memory allocation differently?

This is my point - if you introduce a new semantic to memory allocation
that is "use GFP_NOFS when you are using too much stack" and too much
stack is more than 15% of the stack, then pretty much every code path
will need to set that flag...
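
To illustrate: every allocation site would effectively have to be
wrapped in something like this (hypothetical - no such helper exists,
and the stack-pointer approximation is crude):

	static inline gfp_t stack_limited_gfp(gfp_t gfp)
	{
		unsigned long sp = (unsigned long)&gfp;	/* ~ current SP */
		unsigned long used = (unsigned long)task_stack_page(current)
					+ THREAD_SIZE - sp;

		/* "too much stack" == more than 15% already consumed */
		if (used > THREAD_SIZE * 15 / 100)
			gfp &= ~__GFP_FS;
		return gfp;
	}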

> if only the
> kernel had some sort of way to dynamically allocate ram, it could try
> that too.

Sure, but to play the devil's advocate: if memory allocation blows
the stack, then surely avoiding allocation by using stack variables
is safer? ;)

FWIW, even if we use GFP_NOFS, allocation+reclaim can still use 2k
of stack; stuff like the radix tree code appears to be a significant
user of stack now:

        Depth    Size   Location    (56 entries)
        -----    ----   --------
  0)     7904      48   __call_rcu+0x67/0x190
  1)     7856      16   call_rcu_sched+0x15/0x20
  2)     7840      16   call_rcu+0xe/0x10
  3)     7824     272   radix_tree_delete+0x159/0x2e0
  4)     7552      32   __remove_from_page_cache+0x21/0x110
  5)     7520      64   __remove_mapping+0xe8/0x130
  6)     7456     384   shrink_page_list+0x400/0x860
  7)     7072     528   shrink_zone+0x636/0xdc0
  8)     6544     112   do_try_to_free_pages+0xc2/0x3c0
  9)     6432     112   try_to_free_pages+0x64/0x70
 10)     6320     256   __alloc_pages_nodemask+0x3d2/0x710
 11)     6064      48   alloc_pages_current+0x8c/0xe0
 12)     6016      32   __page_cache_alloc+0x67/0x70
 13)     5984      80   find_or_create_page+0x50/0xb0
 14)     5904     160   _xfs_buf_lookup_pages+0x145/0x350 [xfs]

or even just calling ->releasepage and freeing bufferheads:

       Depth    Size   Location    (55 entries)
       -----    ----   --------
 0)     7440      48   add_partial+0x26/0x90
 1)     7392      64   __slab_free+0x1a9/0x380
 2)     7328      64   kmem_cache_free+0xb9/0x160
 3)     7264      16   free_buffer_head+0x25/0x50
 4)     7248      64   try_to_free_buffers+0x79/0xc0
 5)     7184     160   xfs_vm_releasepage+0xda/0x130 [xfs]
 6)     7024      16   try_to_release_page+0x33/0x60
 7)     7008     384   shrink_page_list+0x585/0x860
 8)     6624     528   shrink_zone+0x636/0xdc0
 9)     6096     112   do_try_to_free_pages+0xc2/0x3c0
10)     5984     112   try_to_free_pages+0x64/0x70
11)     5872     256   __alloc_pages_nodemask+0x3d2/0x710
12)     5616      48   alloc_pages_current+0x8c/0xe0
13)     5568      32   __page_cache_alloc+0x67/0x70
14)     5536      80   find_or_create_page+0x50/0xb0
15)     5456     160   _xfs_buf_lookup_pages+0x145/0x350 [xfs]

And another eye-opening example, this time deep in the sata driver
layer:

        Depth    Size   Location    (72 entries)
        -----    ----   --------
  0)     8336     304   select_task_rq_fair+0x235/0xad0
  1)     8032      96   try_to_wake_up+0x189/0x3f0
  2)     7936      16   default_wake_function+0x12/0x20
  3)     7920      32   autoremove_wake_function+0x16/0x40
  4)     7888      64   __wake_up_common+0x5a/0x90
  5)     7824      64   __wake_up+0x48/0x70
  6)     7760      64   insert_work+0x9f/0xb0
  7)     7696      48   __queue_work+0x36/0x50
  8)     7648      16   queue_work_on+0x4d/0x60
  9)     7632      16   queue_work+0x1f/0x30
 10)     7616      16   queue_delayed_work+0x2d/0x40
 11)     7600      32   ata_pio_queue_task+0x35/0x40
 12)     7568      48   ata_sff_qc_issue+0x146/0x2f0
 13)     7520      96   mv_qc_issue+0x12d/0x540 [sata_mv]
 14)     7424      96   ata_qc_issue+0x1fe/0x320
 15)     7328      64   ata_scsi_translate+0xae/0x1a0
 16)     7264      64   ata_scsi_queuecmd+0xbf/0x2f0
 17)     7200      48   scsi_dispatch_cmd+0x114/0x2b0
 18)     7152      96   scsi_request_fn+0x419/0x590
 19)     7056      32   __blk_run_queue+0x82/0x150
 20)     7024      48   elv_insert+0x1aa/0x2d0
 21)     6976      48   __elv_add_request+0x83/0xd0
 22)     6928      96   __make_request+0x139/0x490
 23)     6832     208   generic_make_request+0x3df/0x4d0
 24)     6624      80   submit_bio+0x7c/0x100
 25)     6544      96   _xfs_buf_ioapply+0x128/0x2c0 [xfs]

We need at least _700_ bytes of stack free just to call queue_work(),
and that now happens deep in the guts of the driver subsystem below XFS.
This trace shows 1.8k of stack usage on a simple, single sata disk
storage subsystem, so my estimate of 2k of stack for the storage system
below XFS is too small - a worst case of 2.5-3k of stack space is probably
closer to the mark.
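
Do the sums on the numbers above: 3.1k consumed before ->writepage is
entered, plus 3.5k for the XFS writeback path, plus 2.5-3k for the
storage subsystem below it adds up to 9.1-9.6k - on a stack that is
only 8k to begin with.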

This is the sort of thing I'm pointing at when I say that stack
usage outside XFS has grown significantly over the past couple of
years. Given that XFS has remained pretty much the same or even
reduced its usage slightly over the same time period, blaming XFS or
saying "callers should use GFP_NOFS" seems like a cop-out to me.
Regardless of the IO pattern performance issues, writeback via
direct reclaim just uses too much stack to be safe these days...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-13 14:36       ` Dave Chinner
@ 2010-04-14  3:12         ` Dave Chinner
  2010-04-14  6:52           ` KOSAKI Motohiro
  2010-04-14  6:52         ` KOSAKI Motohiro
  1 sibling, 1 reply; 115+ messages in thread
From: Dave Chinner @ 2010-04-14  3:12 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: linux-kernel, linux-mm, linux-fsdevel, Chris Mason

On Wed, Apr 14, 2010 at 12:36:59AM +1000, Dave Chinner wrote:
> On Tue, Apr 13, 2010 at 08:39:29PM +0900, KOSAKI Motohiro wrote:
> > > FWIW, the biggest problem here is that I have absolutely no clue on
> > > how to test what the impact on lumpy reclaim really is. Does anyone
> > > have a relatively simple test that can be run to determine what the
> > > impact is?
> > 
> > So, can you please run two workloads concurrently?
> >  - Normal IO workload (fio, iozone, etc..)
> >  - echo $NUM > /proc/sys/vm/nr_hugepages
> 
> What do I measure/observe/record that is meaningful?

So, a rough-as-guts first pass - just run a large dd (8 times the
size of memory - an 8GB file vs 1GB RAM) and repeatedly try to allocate
all of memory in huge pages (500 of them) every 5 seconds. The IO
rate is roughly 100MB/s, so it takes 75-85s to complete the dd.

The script:

$ cat t.sh
#!/bin/bash

echo 0 > /proc/sys/vm/nr_hugepages
echo 3 > /proc/sys/vm/drop_caches

dd if=/dev/zero of=/mnt/scratch/test bs=1024k count=8000 > /dev/null 2>&1 &

(
for i in `seq 1 1 20`; do
        sleep 5
        /usr/bin/time --format="wall %e" sh -c "echo 500 > /proc/sys/vm/nr_hugepages" 2>&1
        grep HugePages_Total /proc/meminfo
done
) | awk '
        /wall/ { wall += $2; cnt += 1 }
        /Pages/ { pages[cnt] = $2 }
        END { printf "average wall time %f\nPages step: ", wall / cnt ;
                for (i = 1; i <= cnt; i++) {
                        printf "%d ", pages[i];
                }
        }'
----

And the output looks like:

$ sudo ./t.sh
average wall time 0.954500
Pages step: 97 101 101 121 173 173 173 173 173 173 175 194 195 195 202 220 226 419 423 426
$

Run 50 times in a loop with the outputs averaged, the existing lumpy
reclaim resulted in:

dave@test-1:~$ cat current.txt | awk -f av.awk
av. wall = 0.519385 secs
av Pages step: 192 228 242 255 265 272 279 284 289 294 298 303 307 322 342 366 383 401 412 420

And with my patch that disables ->writepage:

dave@test-1:~$ cat no-direct.txt | awk -f av.awk
av. wall = 0.554163 secs
av Pages step: 231 283 310 316 323 328 336 340 345 351 356 359 364 377 388 397 413 423 432 439

Basically, with my patch lumpy reclaim was *substantially* more
effective with only a slight increase in average allocation latency
with this test case.

I need to add a marker to the output that records when the dd
completes, but from monitoring the writeback rates via PCP, they
were in the ballpark of 85-100MB/s for the existing code, and
95-110MB/s with my patch.  Hence it improved both IO throughput and
the effectiveness of lumpy reclaim.

On the down side, I did have an OOM killer invocation with my patch
after about 150 iterations - dd failed an order zero allocation
because there were 455 huge pages allocated and there were only
_320_ available pages for IO, all of which were under IO. i.e. lumpy
reclaim worked so well that the machine got into order-0 page
starvation.
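
(For perspective: 455 huge pages on x86_64 is 455 * 2MB = 910MB of the
1GB of RAM, so less than 120MB was left over for the kernel, the page
cache and everything else.)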

I know this is a simple test case, but it shows much better results
than I think anyone (even me) is expecting...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14  0:24 ` Minchan Kim
@ 2010-04-14  4:44   ` Dave Chinner
  2010-04-14  7:54     ` Minchan Kim
  0 siblings, 1 reply; 115+ messages in thread
From: Dave Chinner @ 2010-04-14  4:44 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-kernel, linux-mm, linux-fsdevel

On Wed, Apr 14, 2010 at 09:24:33AM +0900, Minchan Kim wrote:
> Hi, Dave.
> 
> On Tue, Apr 13, 2010 at 9:17 AM, Dave Chinner <david@fromorbit.com> wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > When we enter direct reclaim we may have used an arbitrary amount of stack
> > space, and hence enterring the filesystem to do writeback can then lead to
> > stack overruns. This problem was recently encountered x86_64 systems with
> > 8k stacks running XFS with simple storage configurations.
> >
> > Writeback from direct reclaim also adversely affects background writeback. The
> > background flusher threads should already be taking care of cleaning dirty
> > pages, and direct reclaim will kick them if they aren't already doing work. If
> > direct reclaim is also calling ->writepage, it will cause the IO patterns from
> > the background flusher threads to be upset by LRU-order writeback from
> > pageout() which can be effectively random IO. Having competing sources of IO
> > trying to clean pages on the same backing device reduces throughput by
> > increasing the amount of seeks that the backing device has to do to write back
> > the pages.
> >
> > Hence for direct reclaim we should not allow ->writepages to be entered at all.
> > Set up the relevant scan_control structures to enforce this, and prevent
> > sc->may_writepage from being set in other places in the direct reclaim path in
> > response to other events.
> 
> I think your solution is a rather aggressive change, as Mel and Kosaki
> already pointed out.

It may be aggressive, but writeback from direct reclaim is, IMO, one
of the worst aspects of the current VM design because of its
adverse effect on the IO subsystem.

I'd prefer to remove it completely rather than continue to try and
patch around it, especially given that everyone seems to agree that
it does have an adverse effect on IO...

> Is the flusher thread aware of the LRU of dirty pages in terms of
> system-level recency, not just the pages' dirtying recency?

It writes back in the order inodes were dirtied. i.e. the LRU is a
coarser measure, but it is still definitely there. It also takes
into account fairness of IO between dirty inodes, so no single dirty
inode prevents IO being issued on the other dirty inodes on the
LRU...
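
Roughly speaking, the flusher loop has this shape - a paraphrase of
the idea, not the actual fs/fs-writeback.c code:

	struct inode *inode, *next;

	/* illustrative sketch only */
	list_for_each_entry_safe(inode, next, &wb->b_io, i_list) {
		struct writeback_control wbc = {
			.sync_mode	= WB_SYNC_NONE,
			.nr_to_write	= MAX_WRITEBACK_PAGES,
		};

		writeback_single_inode(inode, &wbc);
		if (wbc.nr_to_write <= 0)
			/* quota spent - requeue behind the others */
			list_move_tail(&inode->i_list, &wb->b_io);
	}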

> Of course the flusher thread can clean dirty pages faster than a direct
> reclaimer. But if it isn't aware of LRU ordering, hot page thrashing can
> happen in corner cases, and we could lose write merging.
> 
> And for non-rotating storage the seek cost might not be big.

Non-rotational storage still goes faster when it is fed large,
well-formed IOs.

> I think we have to consider that case if we decide to change direct reclaim I/O.
> 
> How about we separate the problem?
> 
> 1. the stack hogging problem.
> 2. direct reclaim doing random writes.

AFAICT, the only way to _reliably_ avoid the stack usage problem is
to avoid writeback in direct reclaim. That has the side effect of
fixing #2 as well, so do they really need separating?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14  1:40         ` Dave Chinner
@ 2010-04-14  4:59           ` KAMEZAWA Hiroyuki
  2010-04-14  5:41             ` Dave Chinner
  2010-04-14  6:52           ` KOSAKI Motohiro
  1 sibling, 1 reply; 115+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-14  4:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chris Mason, Mel Gorman, linux-kernel, linux-mm, linux-fsdevel

On Wed, 14 Apr 2010 11:40:41 +1000
Dave Chinner <david@fromorbit.com> wrote:

>  50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
>  51)     3104     384   shrink_page_list+0x65e/0x840
>  52)     2720     528   shrink_zone+0x63f/0xe10

A bit OFF TOPIC.

Could you share a disassembly of shrink_zone()?

In my environment:
00000000000115a0 <shrink_zone>:
   115a0:       55                      push   %rbp
   115a1:       48 89 e5                mov    %rsp,%rbp
   115a4:       41 57                   push   %r15
   115a6:       41 56                   push   %r14
   115a8:       41 55                   push   %r13
   115aa:       41 54                   push   %r12
   115ac:       53                      push   %rbx
   115ad:       48 83 ec 78             sub    $0x78,%rsp
   115b1:       e8 00 00 00 00          callq  115b6 <shrink_zone+0x16>
   115b6:       48 89 75 80             mov    %rsi,-0x80(%rbp)

The disassembly seems to show 0x78 bytes for the stack, and no changes
to %rsp until return.

I may be misunderstanding something...

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14  4:59           ` KAMEZAWA Hiroyuki
@ 2010-04-14  5:41             ` Dave Chinner
  2010-04-14  5:54               ` KOSAKI Motohiro
  0 siblings, 1 reply; 115+ messages in thread
From: Dave Chinner @ 2010-04-14  5:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Chris Mason, Mel Gorman, linux-kernel, linux-mm, linux-fsdevel

On Wed, Apr 14, 2010 at 01:59:45PM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 14 Apr 2010 11:40:41 +1000
> Dave Chinner <david@fromorbit.com> wrote:
> 
> >  50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
> >  51)     3104     384   shrink_page_list+0x65e/0x840
> >  52)     2720     528   shrink_zone+0x63f/0xe10
> 
> A bit OFF TOPIC.
> 
> Could you share a disassembly of shrink_zone()?
> 
> In my environment:
> 00000000000115a0 <shrink_zone>:
>    115a0:       55                      push   %rbp
>    115a1:       48 89 e5                mov    %rsp,%rbp
>    115a4:       41 57                   push   %r15
>    115a6:       41 56                   push   %r14
>    115a8:       41 55                   push   %r13
>    115aa:       41 54                   push   %r12
>    115ac:       53                      push   %rbx
>    115ad:       48 83 ec 78             sub    $0x78,%rsp
>    115b1:       e8 00 00 00 00          callq  115b6 <shrink_zone+0x16>
>    115b6:       48 89 75 80             mov    %rsi,-0x80(%rbp)
> 
> The disassembly seems to show 0x78 bytes for the stack, and no changes
> to %rsp until return.

I see the same. I didn't compile those kernels, though. IIUC,
they were built through the Ubuntu build infrastructure, so there is
something different in terms of compiler, compiler options or config
to what we are both using. Most likely it is the compiler inlining,
though Chris's patches to prevent that didn't seem to change the
stack usage.

I'm trying to get a stack trace from the kernel that has shrink_zone
in it, but I haven't succeeded yet....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14  5:41             ` Dave Chinner
@ 2010-04-14  5:54               ` KOSAKI Motohiro
  2010-04-14  6:13                 ` Minchan Kim
  2010-04-14  7:06                 ` Dave Chinner
  0 siblings, 2 replies; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-14  5:54 UTC (permalink / raw)
  To: Dave Chinner
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Chris Mason, Mel Gorman,
	linux-kernel, linux-mm, linux-fsdevel

> On Wed, Apr 14, 2010 at 01:59:45PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Wed, 14 Apr 2010 11:40:41 +1000
> > Dave Chinner <david@fromorbit.com> wrote:
> > 
> > >  50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
> > >  51)     3104     384   shrink_page_list+0x65e/0x840
> > >  52)     2720     528   shrink_zone+0x63f/0xe10
> > 
> > A bit OFF TOPIC.
> > 
> > Could you share a disassembly of shrink_zone()?
> > 
> > In my environment:
> > 00000000000115a0 <shrink_zone>:
> >    115a0:       55                      push   %rbp
> >    115a1:       48 89 e5                mov    %rsp,%rbp
> >    115a4:       41 57                   push   %r15
> >    115a6:       41 56                   push   %r14
> >    115a8:       41 55                   push   %r13
> >    115aa:       41 54                   push   %r12
> >    115ac:       53                      push   %rbx
> >    115ad:       48 83 ec 78             sub    $0x78,%rsp
> >    115b1:       e8 00 00 00 00          callq  115b6 <shrink_zone+0x16>
> >    115b6:       48 89 75 80             mov    %rsi,-0x80(%rbp)
> > 
> > The disassembly seems to show 0x78 bytes for the stack, and no changes
> > to %rsp until return.
> 
> I see the same. I didn't compile those kernels, though. IIUC,
> they were built through the Ubuntu build infrastructure, so there is
> something different in terms of compiler, compiler options or config
> to what we are both using. Most likely it is the compiler inlining,
> though Chris's patches to prevent that didn't seem to change the
> stack usage.
> 
> I'm trying to get a stack trace from the kernel that has shrink_zone
> in it, but I haven't succeeded yet....

I also got 0x78 bytes of stack usage. Umm.. are we discussing the real
issue now?

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14  5:54               ` KOSAKI Motohiro
@ 2010-04-14  6:13                 ` Minchan Kim
  2010-04-14  7:19                   ` Minchan Kim
  2010-04-14  7:06                 ` Dave Chinner
  1 sibling, 1 reply; 115+ messages in thread
From: Minchan Kim @ 2010-04-14  6:13 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Dave Chinner, KAMEZAWA Hiroyuki, Chris Mason, Mel Gorman,
	linux-kernel, linux-mm, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 3876 bytes --]

On Wed, Apr 14, 2010 at 2:54 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> On Wed, Apr 14, 2010 at 01:59:45PM +0900, KAMEZAWA Hiroyuki wrote:
>> > On Wed, 14 Apr 2010 11:40:41 +1000
>> > Dave Chinner <david@fromorbit.com> wrote:
>> >
>> > >  50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
>> > >  51)     3104     384   shrink_page_list+0x65e/0x840
>> > >  52)     2720     528   shrink_zone+0x63f/0xe10
>> >
>> > A bit OFF TOPIC.
>> >
>> > Could you share a disassembly of shrink_zone()?
>> >
>> > In my environment:
>> > 00000000000115a0 <shrink_zone>:
>> >    115a0:       55                      push   %rbp
>> >    115a1:       48 89 e5                mov    %rsp,%rbp
>> >    115a4:       41 57                   push   %r15
>> >    115a6:       41 56                   push   %r14
>> >    115a8:       41 55                   push   %r13
>> >    115aa:       41 54                   push   %r12
>> >    115ac:       53                      push   %rbx
>> >    115ad:       48 83 ec 78             sub    $0x78,%rsp
>> >    115b1:       e8 00 00 00 00          callq  115b6 <shrink_zone+0x16>
>> >    115b6:       48 89 75 80             mov    %rsi,-0x80(%rbp)
>> >
>> > The disassembly seems to show 0x78 bytes for the stack, and no changes
>> > to %rsp until return.
>>
>> I see the same. I didn't compile those kernels, though. IIUC,
>> they were built through the Ubuntu build infrastructure, so there is
>> something different in terms of compiler, compiler options or config
>> to what we are both using. Most likely it is the compiler inlining,
>> though Chris's patches to prevent that didn't seem to change the
>> stack usage.
>>
>> I'm trying to get a stack trace from the kernel that has shrink_zone
>> in it, but I haven't succeeded yet....
>
> I also got 0x78 bytes of stack usage. Umm.. are we discussing the real issue now?
>

In my case, it's 0x110 bytes on a 32-bit machine.
I think the same is possible on a 64-bit machine.

00001830 <shrink_zone>:
    1830:       55                      push   %ebp
    1831:       89 e5                   mov    %esp,%ebp
    1833:       57                      push   %edi
    1834:       56                      push   %esi
    1835:       53                      push   %ebx
    1836:       81 ec 10 01 00 00       sub    $0x110,%esp
    183c:       89 85 24 ff ff ff       mov    %eax,-0xdc(%ebp)
    1842:       89 95 20 ff ff ff       mov    %edx,-0xe0(%ebp)
    1848:       89 8d 1c ff ff ff       mov    %ecx,-0xe4(%ebp)
    184e:       8b 41 04                mov    0x4(%ecx)

My gcc is as follows:

barrios@barriostarget:~/mmotm$ gcc -v
Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
4.3.3-5ubuntu4'
--with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
--enable-shared --with-system-zlib --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix --enable-nls
--with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
--enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
--enable-mpfr --enable-targets=all --with-tune=generic
--enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu
--target=i486-linux-gnu
Thread model: posix
gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4)


Does it depend on the config?
I've attached my config.

-- 
Kind regards,
Minchan Kim

[-- Attachment #2: barrios_config --]
[-- Type: application/octet-stream, Size: 81325 bytes --]

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.34-rc3
# Thu Apr 15 00:09:03 2010
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
CONFIG_OUTPUT_FORMAT="elf32-i386"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/i386_defconfig"
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
# CONFIG_NEED_DMA_MAP_STATE is not set
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_GPIO=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
# CONFIG_GENERIC_TIME_VSYSCALL is not set
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
# CONFIG_HAVE_CPUMASK_OF_CPU_MAP is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_ZONE_DMA32 is not set
CONFIG_ARCH_POPULATES_NODE_MAP=y
# CONFIG_AUDIT_ARCH is not set
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_EARLY_RES=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_X86_32_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_TRAMPOLINE=y
CONFIG_X86_32_LAZY_GS=y
CONFIG_KTIME_SCALAR=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_CONSTRUCTORS=y

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_LZO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
# CONFIG_TASK_DELAY_ACCT is not set
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_TREE=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_TREE_PREEMPT_RCU is not set
# CONFIG_TINY_RCU is not set
# CONFIG_RCU_TRACE is not set
CONFIG_RCU_FANOUT=32
# CONFIG_RCU_FANOUT_EXACT is not set
# CONFIG_RCU_FAST_NO_HZ is not set
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_IKCONFIG=m
# CONFIG_IKCONFIG_PROC is not set
CONFIG_LOG_BUF_SHIFT=17
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_NS=y
CONFIG_CGROUP_FREEZER=y
# CONFIG_CGROUP_DEVICE is not set
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_RESOURCE_COUNTERS=y
CONFIG_CGROUP_MEM_RES_CTLR=y
# CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not set
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_MM_OWNER=y
# CONFIG_SYSFS_DEPRECATED_V2 is not set
CONFIG_RELAY=y
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
# CONFIG_USER_NS is not set
CONFIG_PID_NS=y
# CONFIG_NET_NS is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_LZO=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_HAVE_PERF_EVENTS=y

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
# CONFIG_PERF_COUNTERS is not set
# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_PCI_QUIRKS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_COMPAT_BRK is not set
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_PROFILING=y
# CONFIG_OPROFILE is not set
CONFIG_HAVE_OPROFILE=y
CONFIG_KPROBES=y
CONFIG_OPTPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_KRETPROBES=y
CONFIG_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_ATTRS=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y

#
# GCOV-based kernel profiling
#
# CONFIG_GCOV_KERNEL is not set
# CONFIG_SLOW_WORK is not set
CONFIG_HAVE_GENERIC_DMA_COHERENT=y
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
# CONFIG_MODULE_FORCE_LOAD is not set
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_MODVERSIONS=y
CONFIG_MODULE_SRCVERSION_ALL=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_LBDAF=y
# CONFIG_BLK_DEV_BSG is not set
CONFIG_BLK_DEV_INTEGRITY=y
# CONFIG_BLK_CGROUP is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_CFQ_GROUP_IOSCHED is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_INLINE_SPIN_TRYLOCK is not set
# CONFIG_INLINE_SPIN_TRYLOCK_BH is not set
# CONFIG_INLINE_SPIN_LOCK is not set
# CONFIG_INLINE_SPIN_LOCK_BH is not set
# CONFIG_INLINE_SPIN_LOCK_IRQ is not set
# CONFIG_INLINE_SPIN_LOCK_IRQSAVE is not set
# CONFIG_INLINE_SPIN_UNLOCK is not set
# CONFIG_INLINE_SPIN_UNLOCK_BH is not set
# CONFIG_INLINE_SPIN_UNLOCK_IRQ is not set
# CONFIG_INLINE_SPIN_UNLOCK_IRQRESTORE is not set
# CONFIG_INLINE_READ_TRYLOCK is not set
# CONFIG_INLINE_READ_LOCK is not set
# CONFIG_INLINE_READ_LOCK_BH is not set
# CONFIG_INLINE_READ_LOCK_IRQ is not set
# CONFIG_INLINE_READ_LOCK_IRQSAVE is not set
# CONFIG_INLINE_READ_UNLOCK is not set
# CONFIG_INLINE_READ_UNLOCK_BH is not set
# CONFIG_INLINE_READ_UNLOCK_IRQ is not set
# CONFIG_INLINE_READ_UNLOCK_IRQRESTORE is not set
# CONFIG_INLINE_WRITE_TRYLOCK is not set
# CONFIG_INLINE_WRITE_LOCK is not set
# CONFIG_INLINE_WRITE_LOCK_BH is not set
# CONFIG_INLINE_WRITE_LOCK_IRQ is not set
# CONFIG_INLINE_WRITE_LOCK_IRQSAVE is not set
# CONFIG_INLINE_WRITE_UNLOCK is not set
# CONFIG_INLINE_WRITE_UNLOCK_BH is not set
# CONFIG_INLINE_WRITE_UNLOCK_IRQ is not set
# CONFIG_INLINE_WRITE_UNLOCK_IRQRESTORE is not set
# CONFIG_MUTEX_SPIN_ON_OWNER is not set
CONFIG_FREEZER=y

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP=y
# CONFIG_SPARSE_IRQ is not set
CONFIG_X86_MPPARSE=y
# CONFIG_X86_BIGSMP is not set
CONFIG_X86_EXTENDED_PLATFORM=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_MRST is not set
# CONFIG_X86_RDC321X is not set
# CONFIG_X86_32_NON_STANDARD is not set
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_PARAVIRT_GUEST=y
CONFIG_VMI=y
CONFIG_KVM_CLOCK=y
CONFIG_KVM_GUEST=y
# CONFIG_LGUEST_GUEST is not set
CONFIG_PARAVIRT=y
# CONFIG_PARAVIRT_SPINLOCKS is not set
CONFIG_PARAVIRT_CLOCK=y
# CONFIG_PARAVIRT_DEBUG is not set
CONFIG_NO_BOOTMEM=y
# CONFIG_MEMTEST is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUMM is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
# CONFIG_MPSC is not set
CONFIG_MCORE2=y
# CONFIG_MATOM is not set
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_GENERIC=y
CONFIG_X86_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=6
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_XADD=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=5
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_CYRIX_32=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_CPU_SUP_TRANSMETA_32=y
CONFIG_CPU_SUP_UMC_32=y
# CONFIG_X86_DS is not set
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
# CONFIG_IOMMU_HELPER is not set
# CONFIG_IOMMU_API is not set
CONFIG_NR_CPUS=8
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
# CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS is not set
# CONFIG_X86_MCE is not set
CONFIG_VM86=y
# CONFIG_TOSHIBA is not set
# CONFIG_I8K is not set
CONFIG_X86_REBOOTFIXUPS=y
# CONFIG_MICROCODE is not set
# CONFIG_X86_MSR is not set
# CONFIG_X86_CPUID is not set
# CONFIG_NOHIGHMEM is not set
CONFIG_HIGHMEM4G=y
# CONFIG_HIGHMEM64G is not set
CONFIG_PAGE_OFFSET=0xC0000000
CONFIG_HIGHMEM=y
# CONFIG_ARCH_PHYS_ADDR_T_64BIT is not set
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ILLEGAL_POINTER_VALUE=0
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
CONFIG_SPARSEMEM_STATIC=y
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=999999
# CONFIG_PHYS_ADDR_T_64BIT is not set
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_MMU_NOTIFIER=y
# CONFIG_KSM is not set
CONFIG_DEFAULT_MMAP_MIN_ADDR=65536
CONFIG_HIGHPTE=y
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y
CONFIG_X86_RESERVE_LOW_64K=y
# CONFIG_MATH_EMULATION is not set
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=0
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_EFI=y
CONFIG_SECCOMP=y
# CONFIG_CC_STACKPROTECTOR is not set
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
CONFIG_KEXEC_JUMP=y
CONFIG_PHYSICAL_START=0x100000
CONFIG_RELOCATABLE=y
CONFIG_X86_NEED_RELOCS=y
CONFIG_PHYSICAL_ALIGN=0x100000
CONFIG_HOTPLUG_CPU=y
# CONFIG_COMPAT_VDSO is not set
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y

#
# Power management and ACPI options
#
CONFIG_PM=y
CONFIG_PM_DEBUG=y
# CONFIG_PM_ADVANCED_DEBUG is not set
# CONFIG_PM_VERBOSE is not set
CONFIG_CAN_PM_TRACE=y
CONFIG_PM_TRACE=y
CONFIG_PM_TRACE_RTC=y
CONFIG_PM_SLEEP_SMP=y
CONFIG_PM_SLEEP=y
CONFIG_SUSPEND=y
CONFIG_PM_TEST_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
CONFIG_HIBERNATION_NVS=y
CONFIG_HIBERNATION=y
CONFIG_PM_STD_PARTITION=""
# CONFIG_PM_RUNTIME is not set
CONFIG_PM_OPS=y
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_PROCFS=y
CONFIG_ACPI_PROCFS_POWER=y
# CONFIG_ACPI_POWER_METER is not set
CONFIG_ACPI_SYSFS_POWER=y
CONFIG_ACPI_PROC_EVENT=y
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_VIDEO=m
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
# CONFIG_ACPI_PROCESSOR_AGGREGATOR is not set
CONFIG_ACPI_THERMAL=y
CONFIG_ACPI_CUSTOM_DSDT_FILE=""
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=2000
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_PCI_SLOT=y
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
CONFIG_ACPI_SBS=y
# CONFIG_SFI is not set
# CONFIG_APM is not set

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
# CONFIG_CPU_FREQ_DEBUG is not set
CONFIG_CPU_FREQ_STAT=y
CONFIG_CPU_FREQ_STAT_DETAILS=y
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y

#
# CPUFreq processor drivers
#
# CONFIG_X86_PCC_CPUFREQ is not set
CONFIG_X86_ACPI_CPUFREQ=y
CONFIG_X86_POWERNOW_K6=y
CONFIG_X86_POWERNOW_K7=y
CONFIG_X86_POWERNOW_K7_ACPI=y
CONFIG_X86_POWERNOW_K8=y
CONFIG_X86_GX_SUSPMOD=y
CONFIG_X86_SPEEDSTEP_CENTRINO=y
CONFIG_X86_SPEEDSTEP_CENTRINO_TABLE=y
CONFIG_X86_SPEEDSTEP_ICH=y
CONFIG_X86_SPEEDSTEP_SMI=y
# CONFIG_X86_P4_CLOCKMOD is not set
CONFIG_X86_CPUFREQ_NFORCE2=y
CONFIG_X86_LONGRUN=y
CONFIG_X86_LONGHAUL=y
# CONFIG_X86_E_POWERSAVER is not set

#
# shared options
#
CONFIG_X86_SPEEDSTEP_LIB=y
CONFIG_X86_SPEEDSTEP_RELAXED_CAP_CHECK=y
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
# CONFIG_PCI_GOBIOS is not set
# CONFIG_PCI_GOMMCONFIG is not set
# CONFIG_PCI_GODIRECT is not set
# CONFIG_PCI_GOOLPC is not set
CONFIG_PCI_GOANY=y
CONFIG_PCI_BIOS=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_OLPC=y
CONFIG_PCI_DOMAINS=y
# CONFIG_DMAR is not set
CONFIG_PCIEPORTBUS=y
CONFIG_HOTPLUG_PCI_PCIE=y
CONFIG_PCIEAER=y
# CONFIG_PCIE_ECRC is not set
# CONFIG_PCIEAER_INJECT is not set
# CONFIG_PCIEASPM is not set
CONFIG_ARCH_SUPPORTS_MSI=y
CONFIG_PCI_MSI=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_STUB is not set
CONFIG_HT_IRQ=y
# CONFIG_PCI_IOV is not set
CONFIG_PCI_IOAPIC=y
CONFIG_ISA_DMA_API=y
CONFIG_ISA=y
CONFIG_EISA=y
CONFIG_EISA_VLB_PRIMING=y
CONFIG_EISA_PCI_EISA=y
CONFIG_EISA_VIRTUAL_ROOT=y
CONFIG_EISA_NAMES=y
CONFIG_MCA=y
CONFIG_MCA_LEGACY=y
# CONFIG_MCA_PROC_FS is not set
# CONFIG_SCx200 is not set
CONFIG_OLPC=y
CONFIG_K8_NB=y
# CONFIG_PCCARD is not set
CONFIG_HOTPLUG_PCI=y
# CONFIG_HOTPLUG_PCI_FAKE is not set
# CONFIG_HOTPLUG_PCI_COMPAQ is not set
# CONFIG_HOTPLUG_PCI_IBM is not set
# CONFIG_HOTPLUG_PCI_ACPI is not set
CONFIG_HOTPLUG_PCI_CPCI=y
# CONFIG_HOTPLUG_PCI_CPCI_ZT5550 is not set
# CONFIG_HOTPLUG_PCI_CPCI_GENERIC is not set
# CONFIG_HOTPLUG_PCI_SHPC is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
CONFIG_HAVE_AOUT=y
# CONFIG_BINFMT_AOUT is not set
CONFIG_BINFMT_MISC=m
CONFIG_HAVE_ATOMIC_IOMAP=y
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
CONFIG_UNIX=y
# CONFIG_NET_KEY is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_ASK_IP_FIB_HASH=y
# CONFIG_IP_FIB_TRIE is not set
CONFIG_IP_FIB_HASH=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
CONFIG_IP_MROUTE=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
# CONFIG_INET_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_BEET is not set
CONFIG_INET_LRO=y
CONFIG_INET_DIAG=y
CONFIG_INET_TCP_DIAG=y
CONFIG_TCP_CONG_ADVANCED=y
# CONFIG_TCP_CONG_BIC is not set
CONFIG_TCP_CONG_CUBIC=y
# CONFIG_TCP_CONG_WESTWOOD is not set
# CONFIG_TCP_CONG_HTCP is not set
# CONFIG_TCP_CONG_HSTCP is not set
# CONFIG_TCP_CONG_HYBLA is not set
# CONFIG_TCP_CONG_VEGAS is not set
# CONFIG_TCP_CONG_SCALABLE is not set
# CONFIG_TCP_CONG_LP is not set
# CONFIG_TCP_CONG_VENO is not set
# CONFIG_TCP_CONG_YEAH is not set
# CONFIG_TCP_CONG_ILLINOIS is not set
# CONFIG_DEFAULT_BIC is not set
CONFIG_DEFAULT_CUBIC=y
# CONFIG_DEFAULT_HTCP is not set
# CONFIG_DEFAULT_VEGAS is not set
# CONFIG_DEFAULT_WESTWOOD is not set
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_TCP_MD5SIG=y
CONFIG_IPV6=y
CONFIG_IPV6_PRIVACY=y
# CONFIG_IPV6_ROUTER_PREF is not set
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
# CONFIG_INET6_AH is not set
# CONFIG_INET6_ESP is not set
# CONFIG_INET6_IPCOMP is not set
# CONFIG_IPV6_MIP6 is not set
# CONFIG_INET6_XFRM_TUNNEL is not set
# CONFIG_INET6_TUNNEL is not set
# CONFIG_INET6_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET6_XFRM_MODE_TUNNEL is not set
# CONFIG_INET6_XFRM_MODE_BEET is not set
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
# CONFIG_IPV6_SIT is not set
# CONFIG_IPV6_TUNNEL is not set
CONFIG_IPV6_MULTIPLE_TABLES=y
# CONFIG_IPV6_SUBTREES is not set
# CONFIG_IPV6_MROUTE is not set
CONFIG_NETLABEL=y
CONFIG_NETWORK_SECMARK=y
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set
CONFIG_NETFILTER_ADVANCED=y
CONFIG_BRIDGE_NETFILTER=y

#
# Core Netfilter Configuration
#
# CONFIG_NETFILTER_NETLINK_QUEUE is not set
# CONFIG_NETFILTER_NETLINK_LOG is not set
# CONFIG_NF_CONNTRACK is not set
# CONFIG_NETFILTER_XTABLES is not set
# CONFIG_IP_VS is not set

#
# IP: Netfilter Configuration
#
# CONFIG_NF_DEFRAG_IPV4 is not set
# CONFIG_IP_NF_QUEUE is not set
# CONFIG_IP_NF_IPTABLES is not set
# CONFIG_IP_NF_ARPTABLES is not set

#
# IPv6: Netfilter Configuration
#
# CONFIG_IP6_NF_QUEUE is not set
# CONFIG_IP6_NF_IPTABLES is not set
# CONFIG_BRIDGE_NF_EBTABLES is not set
# CONFIG_IP_DCCP is not set
# CONFIG_IP_SCTP is not set
# CONFIG_RDS is not set
# CONFIG_TIPC is not set
CONFIG_ATM=y
CONFIG_ATM_CLIP=y
# CONFIG_ATM_CLIP_NO_ICMP is not set
# CONFIG_ATM_LANE is not set
# CONFIG_ATM_BR2684 is not set
CONFIG_STP=m
CONFIG_GARP=m
CONFIG_BRIDGE=m
CONFIG_BRIDGE_IGMP_SNOOPING=y
CONFIG_NET_DSA=y
CONFIG_NET_DSA_TAG_DSA=y
CONFIG_NET_DSA_TAG_EDSA=y
CONFIG_NET_DSA_TAG_TRAILER=y
CONFIG_NET_DSA_MV88E6XXX=y
CONFIG_NET_DSA_MV88E6060=y
CONFIG_NET_DSA_MV88E6XXX_NEED_PPU=y
CONFIG_NET_DSA_MV88E6131=y
CONFIG_NET_DSA_MV88E6123_61_65=y
CONFIG_VLAN_8021Q=m
CONFIG_VLAN_8021Q_GVRP=y
# CONFIG_DECNET is not set
CONFIG_LLC=y
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set
# CONFIG_PHONET is not set
# CONFIG_IEEE802154 is not set
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
# CONFIG_NET_SCH_CBQ is not set
# CONFIG_NET_SCH_HTB is not set
# CONFIG_NET_SCH_HFSC is not set
# CONFIG_NET_SCH_ATM is not set
# CONFIG_NET_SCH_PRIO is not set
# CONFIG_NET_SCH_MULTIQ is not set
# CONFIG_NET_SCH_RED is not set
# CONFIG_NET_SCH_SFQ is not set
# CONFIG_NET_SCH_TEQL is not set
# CONFIG_NET_SCH_TBF is not set
# CONFIG_NET_SCH_GRED is not set
# CONFIG_NET_SCH_DSMARK is not set
# CONFIG_NET_SCH_NETEM is not set
# CONFIG_NET_SCH_DRR is not set
# CONFIG_NET_SCH_INGRESS is not set

#
# Classification
#
CONFIG_NET_CLS=y
# CONFIG_NET_CLS_BASIC is not set
# CONFIG_NET_CLS_TCINDEX is not set
# CONFIG_NET_CLS_ROUTE4 is not set
# CONFIG_NET_CLS_FW is not set
# CONFIG_NET_CLS_U32 is not set
# CONFIG_NET_CLS_RSVP is not set
# CONFIG_NET_CLS_RSVP6 is not set
# CONFIG_NET_CLS_FLOW is not set
# CONFIG_NET_CLS_CGROUP is not set
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
# CONFIG_NET_EMATCH_CMP is not set
# CONFIG_NET_EMATCH_NBYTE is not set
# CONFIG_NET_EMATCH_U32 is not set
# CONFIG_NET_EMATCH_META is not set
# CONFIG_NET_EMATCH_TEXT is not set
CONFIG_NET_CLS_ACT=y
# CONFIG_NET_ACT_POLICE is not set
# CONFIG_NET_ACT_GACT is not set
# CONFIG_NET_ACT_MIRRED is not set
# CONFIG_NET_ACT_NAT is not set
# CONFIG_NET_ACT_PEDIT is not set
# CONFIG_NET_ACT_SIMP is not set
# CONFIG_NET_ACT_SKBEDIT is not set
CONFIG_NET_SCH_FIFO=y
# CONFIG_DCB is not set

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_TCPPROBE is not set
CONFIG_HAMRADIO=y

#
# Packet Radio protocols
#
# CONFIG_AX25 is not set
# CONFIG_CAN is not set
# CONFIG_IRDA is not set
CONFIG_BT=y
CONFIG_BT_L2CAP=y
CONFIG_BT_SCO=y
CONFIG_BT_RFCOMM=y
CONFIG_BT_RFCOMM_TTY=y
CONFIG_BT_BNEP=m
CONFIG_BT_BNEP_MC_FILTER=y
CONFIG_BT_BNEP_PROTO_FILTER=y
# CONFIG_BT_HIDP is not set

#
# Bluetooth device drivers
#
# CONFIG_BT_HCIBTUSB is not set
# CONFIG_BT_HCIBTSDIO is not set
# CONFIG_BT_HCIUART is not set
# CONFIG_BT_HCIBCM203X is not set
# CONFIG_BT_HCIBPA10X is not set
# CONFIG_BT_HCIBFUSB is not set
# CONFIG_BT_HCIVHCI is not set
# CONFIG_BT_MRVL is not set
# CONFIG_AF_RXRPC is not set
CONFIG_FIB_RULES=y
CONFIG_WIRELESS=y
# CONFIG_CFG80211 is not set
# CONFIG_LIB80211 is not set

#
# CFG80211 needs to be enabled for MAC80211
#
# CONFIG_WIMAX is not set
CONFIG_RFKILL=y
CONFIG_RFKILL_LEDS=y
CONFIG_RFKILL_INPUT=y
# CONFIG_NET_9P is not set

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH=""
CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
# CONFIG_STANDALONE is not set
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
# CONFIG_FIRMWARE_IN_KERNEL is not set
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
# CONFIG_SYS_HYPERVISOR is not set
CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y
# CONFIG_MTD is not set
CONFIG_PARPORT=m
CONFIG_PARPORT_PC=m
# CONFIG_PARPORT_SERIAL is not set
CONFIG_PARPORT_PC_FIFO=y
# CONFIG_PARPORT_PC_SUPERIO is not set
# CONFIG_PARPORT_GSC is not set
# CONFIG_PARPORT_AX88796 is not set
CONFIG_PARPORT_1284=y
CONFIG_PNP=y
CONFIG_PNP_DEBUG_MESSAGES=y

#
# Protocols
#
CONFIG_ISAPNP=y
CONFIG_PNPBIOS=y
CONFIG_PNPBIOS_PROC_FS=y
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_FD is not set
# CONFIG_BLK_DEV_XD is not set
# CONFIG_PARIDE is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
# CONFIG_BLK_DEV_DRBD is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_UB is not set
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=65536
# CONFIG_BLK_DEV_XIP is not set
CONFIG_CDROM_PKTCDVD=y
CONFIG_CDROM_PKTCDVD_BUFFERS=8
# CONFIG_CDROM_PKTCDVD_WCACHE is not set
# CONFIG_ATA_OVER_ETH is not set
# CONFIG_BLK_DEV_HD is not set
CONFIG_MISC_DEVICES=y
# CONFIG_AD525X_DPOT is not set
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_SGI_IOC4 is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ICS932S401 is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_CS5535_MFGPT is not set
# CONFIG_HP_ILO is not set
# CONFIG_ISL29003 is not set
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_DS1682 is not set
# CONFIG_TI_DAC7512 is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
# CONFIG_EEPROM_AT24 is not set
# CONFIG_EEPROM_AT25 is not set
# CONFIG_EEPROM_LEGACY is not set
# CONFIG_EEPROM_MAX6875 is not set
# CONFIG_EEPROM_93CX6 is not set
# CONFIG_CB710_CORE is not set
# CONFIG_IWMC3200TOP is not set
CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
CONFIG_SCSI_MOD=y
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
# CONFIG_SCSI_TGT is not set
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=y
# CONFIG_BLK_DEV_SR_VENDOR is not set
CONFIG_CHR_DEV_SG=y
# CONFIG_CHR_DEV_SCH is not set
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_SCAN_ASYNC=y
CONFIG_SCSI_WAIT_SCAN=m

#
# SCSI Transports
#
# CONFIG_SCSI_SPI_ATTRS is not set
CONFIG_SCSI_FC_ATTRS=m
# CONFIG_SCSI_ISCSI_ATTRS is not set
# CONFIG_SCSI_SAS_LIBSAS is not set
# CONFIG_SCSI_SRP_ATTRS is not set
CONFIG_SCSI_LOWLEVEL=y
# CONFIG_ISCSI_TCP is not set
# CONFIG_SCSI_CXGB3_ISCSI is not set
# CONFIG_SCSI_BNX2_ISCSI is not set
# CONFIG_BE2ISCSI is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_HPSA is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_3W_SAS is not set
# CONFIG_SCSI_7000FASST is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AHA152X is not set
# CONFIG_SCSI_AHA1542 is not set
# CONFIG_SCSI_AHA1740 is not set
# CONFIG_SCSI_AACRAID is not set
# CONFIG_SCSI_AIC7XXX is not set
# CONFIG_SCSI_AIC7XXX_OLD is not set
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_MVSAS is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_IN2000 is not set
# CONFIG_SCSI_ARCMSR is not set
CONFIG_MEGARAID_NEWGEN=y
# CONFIG_MEGARAID_MM is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_MEGARAID_SAS is not set
# CONFIG_SCSI_MPT2SAS is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_VMWARE_PVSCSI is not set
# CONFIG_LIBFC is not set
# CONFIG_LIBFCOE is not set
# CONFIG_FCOE is not set
# CONFIG_FCOE_FNIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_DTC3280 is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_FD_MCS is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_GENERIC_NCR5380 is not set
# CONFIG_SCSI_GENERIC_NCR5380_MMIO is not set
# CONFIG_SCSI_IBMMCA is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_PPA is not set
# CONFIG_SCSI_IMM is not set
# CONFIG_SCSI_NCR53C406A is not set
# CONFIG_SCSI_NCR_D700 is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_NCR_Q720 is not set
# CONFIG_SCSI_PAS16 is not set
# CONFIG_SCSI_QLOGIC_FAS is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
# CONFIG_SCSI_QLA_FC is not set
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_SIM710 is not set
# CONFIG_SCSI_SYM53C416 is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_T128 is not set
# CONFIG_SCSI_U14_34F is not set
# CONFIG_SCSI_ULTRASTOR is not set
# CONFIG_SCSI_NSP32 is not set
# CONFIG_SCSI_DEBUG is not set
# CONFIG_SCSI_PMCRAID is not set
# CONFIG_SCSI_PM8001 is not set
# CONFIG_SCSI_SRP is not set
# CONFIG_SCSI_BFA_FC is not set
CONFIG_SCSI_DH=y
# CONFIG_SCSI_DH_RDAC is not set
# CONFIG_SCSI_DH_HP_SW is not set
# CONFIG_SCSI_DH_EMC is not set
# CONFIG_SCSI_DH_ALUA is not set
# CONFIG_SCSI_OSD_INITIATOR is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_VERBOSE_ERROR=y
CONFIG_ATA_ACPI=y
CONFIG_SATA_PMP=y
CONFIG_SATA_AHCI=y
CONFIG_SATA_SIL24=y
CONFIG_ATA_SFF=y
CONFIG_SATA_SVW=y
CONFIG_ATA_PIIX=y
# CONFIG_SATA_MV is not set
CONFIG_SATA_NV=y
CONFIG_PDC_ADMA=y
CONFIG_SATA_QSTOR=y
CONFIG_SATA_PROMISE=y
# CONFIG_SATA_SX4 is not set
CONFIG_SATA_SIL=y
CONFIG_SATA_SIS=y
CONFIG_SATA_ULI=y
CONFIG_SATA_VIA=y
CONFIG_SATA_VITESSE=y
CONFIG_SATA_INIC162X=y
CONFIG_PATA_ACPI=y
CONFIG_PATA_ALI=y
CONFIG_PATA_AMD=y
CONFIG_PATA_ARTOP=y
# CONFIG_PATA_ATP867X is not set
CONFIG_PATA_ATIIXP=y
CONFIG_PATA_CMD640_PCI=y
CONFIG_PATA_CMD64X=y
CONFIG_PATA_CS5520=y
CONFIG_PATA_CS5530=y
# CONFIG_PATA_CS5535 is not set
CONFIG_PATA_CS5536=y
# CONFIG_PATA_CYPRESS is not set
CONFIG_PATA_EFAR=y
CONFIG_ATA_GENERIC=y
CONFIG_PATA_HPT366=y
# CONFIG_PATA_HPT37X is not set
CONFIG_PATA_HPT3X2N=y
CONFIG_PATA_HPT3X3=y
# CONFIG_PATA_HPT3X3_DMA is not set
# CONFIG_PATA_ISAPNP is not set
CONFIG_PATA_IT821X=y
# CONFIG_PATA_IT8213 is not set
CONFIG_PATA_JMICRON=y
# CONFIG_PATA_LEGACY is not set
CONFIG_PATA_TRIFLEX=y
CONFIG_PATA_MARVELL=y
CONFIG_PATA_MPIIX=y
# CONFIG_PATA_OLDPIIX is not set
CONFIG_PATA_NETCELL=y
# CONFIG_PATA_NINJA32 is not set
CONFIG_PATA_NS87410=y
CONFIG_PATA_NS87415=y
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_OPTIDMA is not set
CONFIG_PATA_PDC2027X=y
CONFIG_PATA_PDC_OLD=y
CONFIG_PATA_QDI=y
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RDC is not set
CONFIG_PATA_RZ1000=y
CONFIG_PATA_SC1200=y
CONFIG_PATA_SERVERWORKS=y
CONFIG_PATA_SIL680=y
CONFIG_PATA_SIS=y
# CONFIG_PATA_TOSHIBA is not set
CONFIG_PATA_VIA=y
CONFIG_PATA_WINBOND=y
# CONFIG_PATA_WINBOND_VLB is not set
CONFIG_PATA_SCH=y
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
# CONFIG_MD_LINEAR is not set
# CONFIG_MD_RAID0 is not set
# CONFIG_MD_RAID1 is not set
# CONFIG_MD_RAID10 is not set
# CONFIG_MD_RAID456 is not set
# CONFIG_MD_MULTIPATH is not set
# CONFIG_MD_FAULTY is not set
CONFIG_BLK_DEV_DM=y
# CONFIG_DM_DEBUG is not set
# CONFIG_DM_CRYPT is not set
CONFIG_DM_SNAPSHOT=y
CONFIG_DM_MIRROR=y
# CONFIG_DM_LOG_USERSPACE is not set
# CONFIG_DM_ZERO is not set
CONFIG_DM_MULTIPATH=y
# CONFIG_DM_MULTIPATH_QL is not set
# CONFIG_DM_MULTIPATH_ST is not set
# CONFIG_DM_DELAY is not set
CONFIG_DM_UEVENT=y
CONFIG_FUSION=y
# CONFIG_FUSION_SPI is not set
# CONFIG_FUSION_FC is not set
# CONFIG_FUSION_SAS is not set
CONFIG_FUSION_MAX_SGE=128
CONFIG_FUSION_LOGGING=y

#
# IEEE 1394 (FireWire) support
#

#
# You can enable one or both FireWire driver stacks.
#

#
# The newer stack is recommended.
#
# CONFIG_FIREWIRE is not set
# CONFIG_IEEE1394 is not set
# CONFIG_I2O is not set
CONFIG_MACINTOSH_DRIVERS=y
CONFIG_MAC_EMUMOUSEBTN=y
CONFIG_NETDEVICES=y
# CONFIG_IFB is not set
# CONFIG_DUMMY is not set
# CONFIG_BONDING is not set
# CONFIG_MACVLAN is not set
# CONFIG_EQUALIZER is not set
# CONFIG_TUN is not set
# CONFIG_VETH is not set
# CONFIG_NET_SB1000 is not set
# CONFIG_ARCNET is not set
CONFIG_PHYLIB=y

#
# MII PHY device drivers
#
# CONFIG_MARVELL_PHY is not set
# CONFIG_DAVICOM_PHY is not set
# CONFIG_QSEMI_PHY is not set
# CONFIG_LXT_PHY is not set
# CONFIG_CICADA_PHY is not set
# CONFIG_VITESSE_PHY is not set
# CONFIG_SMSC_PHY is not set
# CONFIG_BROADCOM_PHY is not set
# CONFIG_ICPLUS_PHY is not set
# CONFIG_REALTEK_PHY is not set
# CONFIG_NATIONAL_PHY is not set
# CONFIG_STE10XP is not set
# CONFIG_LSI_ET1011C_PHY is not set
CONFIG_FIXED_PHY=y
# CONFIG_MDIO_BITBANG is not set
CONFIG_NET_ETHERNET=y
CONFIG_MII=m
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
CONFIG_NET_VENDOR_3COM=y
# CONFIG_EL1 is not set
# CONFIG_EL2 is not set
# CONFIG_ELPLUS is not set
# CONFIG_EL16 is not set
# CONFIG_EL3 is not set
# CONFIG_3C515 is not set
# CONFIG_ELMC is not set
# CONFIG_ELMC_II is not set
# CONFIG_VORTEX is not set
# CONFIG_TYPHOON is not set
# CONFIG_LANCE is not set
CONFIG_NET_VENDOR_SMC=y
# CONFIG_WD80x3 is not set
# CONFIG_ULTRAMCA is not set
# CONFIG_ULTRA is not set
# CONFIG_ULTRA32 is not set
# CONFIG_SMC9194 is not set
# CONFIG_ENC28J60 is not set
# CONFIG_ETHOC is not set
CONFIG_NET_VENDOR_RACAL=y
# CONFIG_NI52 is not set
# CONFIG_NI65 is not set
# CONFIG_DNET is not set
CONFIG_NET_TULIP=y
# CONFIG_DE2104X is not set
# CONFIG_TULIP is not set
# CONFIG_DE4X5 is not set
# CONFIG_WINBOND_840 is not set
# CONFIG_DM9102 is not set
# CONFIG_ULI526X is not set
# CONFIG_AT1700 is not set
# CONFIG_DEPCA is not set
# CONFIG_HP100 is not set
CONFIG_NET_ISA=y
# CONFIG_E2100 is not set
# CONFIG_EWRK3 is not set
# CONFIG_EEXPRESS is not set
# CONFIG_EEXPRESS_PRO is not set
# CONFIG_HPLAN_PLUS is not set
# CONFIG_HPLAN is not set
# CONFIG_LP486E is not set
# CONFIG_ETH16I is not set
# CONFIG_NE2000 is not set
# CONFIG_ZNET is not set
# CONFIG_SEEQ8005 is not set
# CONFIG_NE2_MCA is not set
# CONFIG_IBMLANA is not set
# CONFIG_IBM_NEW_EMAC_ZMII is not set
# CONFIG_IBM_NEW_EMAC_RGMII is not set
# CONFIG_IBM_NEW_EMAC_TAH is not set
# CONFIG_IBM_NEW_EMAC_EMAC4 is not set
# CONFIG_IBM_NEW_EMAC_NO_FLOW_CTRL is not set
# CONFIG_IBM_NEW_EMAC_MAL_CLR_ICINTSTAT is not set
# CONFIG_IBM_NEW_EMAC_MAL_COMMON_ERR is not set
CONFIG_NET_PCI=y
# CONFIG_PCNET32 is not set
# CONFIG_AMD8111_ETH is not set
# CONFIG_ADAPTEC_STARFIRE is not set
# CONFIG_AC3200 is not set
# CONFIG_KSZ884X_PCI is not set
# CONFIG_APRICOT is not set
# CONFIG_B44 is not set
# CONFIG_FORCEDETH is not set
# CONFIG_CS89x0 is not set
# CONFIG_E100 is not set
# CONFIG_LNE390 is not set
# CONFIG_FEALNX is not set
# CONFIG_NATSEMI is not set
# CONFIG_NE2K_PCI is not set
# CONFIG_NE3210 is not set
# CONFIG_ES3210 is not set
# CONFIG_8139CP is not set
# CONFIG_8139TOO is not set
# CONFIG_R6040 is not set
# CONFIG_SIS900 is not set
# CONFIG_EPIC100 is not set
# CONFIG_SMSC9420 is not set
# CONFIG_SUNDANCE is not set
# CONFIG_TLAN is not set
# CONFIG_KS8842 is not set
# CONFIG_KS8851 is not set
# CONFIG_KS8851_MLL is not set
# CONFIG_VIA_RHINE is not set
# CONFIG_SC92031 is not set
CONFIG_NET_POCKET=y
# CONFIG_ATP is not set
# CONFIG_DE600 is not set
# CONFIG_DE620 is not set
# CONFIG_ATL2 is not set
CONFIG_NETDEV_1000=y
# CONFIG_ACENIC is not set
# CONFIG_DL2K is not set
# CONFIG_E1000 is not set
CONFIG_E1000E=m
# CONFIG_IP1000 is not set
# CONFIG_IGB is not set
# CONFIG_IGBVF is not set
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
CONFIG_R8169=m
CONFIG_R8169_VLAN=y
# CONFIG_SIS190 is not set
# CONFIG_SKGE is not set
# CONFIG_SKY2 is not set
# CONFIG_VIA_VELOCITY is not set
# CONFIG_TIGON3 is not set
# CONFIG_BNX2 is not set
# CONFIG_CNIC is not set
# CONFIG_QLA3XXX is not set
# CONFIG_ATL1 is not set
# CONFIG_ATL1E is not set
# CONFIG_ATL1C is not set
# CONFIG_JME is not set
CONFIG_NETDEV_10000=y
CONFIG_MDIO=m
# CONFIG_CHELSIO_T1 is not set
CONFIG_CHELSIO_T3_DEPENDS=y
# CONFIG_CHELSIO_T3 is not set
# CONFIG_ENIC is not set
# CONFIG_IXGBE is not set
# CONFIG_IXGBEVF is not set
# CONFIG_IXGB is not set
# CONFIG_S2IO is not set
# CONFIG_VXGE is not set
# CONFIG_MYRI10GE is not set
# CONFIG_NETXEN_NIC is not set
# CONFIG_NIU is not set
# CONFIG_MLX4_EN is not set
# CONFIG_MLX4_CORE is not set
# CONFIG_TEHUTI is not set
# CONFIG_BNX2X is not set
# CONFIG_QLCNIC is not set
# CONFIG_QLGE is not set
CONFIG_SFC=m
# CONFIG_BE2NET is not set
CONFIG_TR=y
# CONFIG_IBMTR is not set
# CONFIG_IBMOL is not set
# CONFIG_IBMLS is not set
# CONFIG_3C359 is not set
# CONFIG_TMS380TR is not set
# CONFIG_SMCTR is not set
CONFIG_WLAN=y
# CONFIG_AIRO is not set
# CONFIG_ATMEL is not set
# CONFIG_PRISM54 is not set
# CONFIG_USB_ZD1201 is not set
# CONFIG_HOSTAP is not set

#
# Enable WiMAX (Networking options) to see the WiMAX drivers
#

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET is not set
# CONFIG_USB_HSO is not set
CONFIG_WAN=y
# CONFIG_HDLC is not set
# CONFIG_DLCI is not set
# CONFIG_SBNI is not set
CONFIG_ATM_DRIVERS=y
# CONFIG_ATM_DUMMY is not set
# CONFIG_ATM_TCP is not set
# CONFIG_ATM_LANAI is not set
# CONFIG_ATM_ENI is not set
# CONFIG_ATM_FIRESTREAM is not set
# CONFIG_ATM_ZATM is not set
# CONFIG_ATM_NICSTAR is not set
# CONFIG_ATM_IDT77252 is not set
# CONFIG_ATM_AMBASSADOR is not set
# CONFIG_ATM_HORIZON is not set
# CONFIG_ATM_IA is not set
# CONFIG_ATM_FORE200E is not set
# CONFIG_ATM_HE is not set
# CONFIG_ATM_SOLOS is not set
CONFIG_FDDI=y
# CONFIG_DEFXX is not set
# CONFIG_SKFP is not set
CONFIG_HIPPI=y
# CONFIG_ROADRUNNER is not set
# CONFIG_PLIP is not set
CONFIG_PPP=y
CONFIG_PPP_MULTILINK=y
CONFIG_PPP_FILTER=y
# CONFIG_PPP_ASYNC is not set
# CONFIG_PPP_SYNC_TTY is not set
# CONFIG_PPP_DEFLATE is not set
# CONFIG_PPP_BSDCOMP is not set
# CONFIG_PPP_MPPE is not set
# CONFIG_PPPOE is not set
# CONFIG_PPPOATM is not set
# CONFIG_PPPOL2TP is not set
# CONFIG_SLIP is not set
CONFIG_SLHC=y
CONFIG_NET_FC=y
# CONFIG_NETCONSOLE is not set
# CONFIG_NETPOLL is not set
# CONFIG_NET_POLL_CONTROLLER is not set
# CONFIG_VMXNET3 is not set
CONFIG_ISDN=y
# CONFIG_ISDN_I4L is not set
# CONFIG_ISDN_CAPI is not set
# CONFIG_ISDN_DRV_GIGASET is not set
# CONFIG_HYSDN is not set
# CONFIG_MISDN is not set
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=m
CONFIG_INPUT_POLLDEV=m
# CONFIG_INPUT_SPARSEKMAP is not set

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
# CONFIG_KEYBOARD_ADP5588 is not set
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_QT2160 is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_GPIO is not set
# CONFIG_KEYBOARD_MATRIX is not set
# CONFIG_KEYBOARD_LM8323 is not set
# CONFIG_KEYBOARD_MAX7359 is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_OPENCORES is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
CONFIG_INPUT_MOUSE=y
# CONFIG_MOUSE_PS2 is not set
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_BCM5974 is not set
# CONFIG_MOUSE_INPORT is not set
# CONFIG_MOUSE_LOGIBM is not set
# CONFIG_MOUSE_PC110PAD is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_MOUSE_GPIO is not set
# CONFIG_MOUSE_SYNAPTICS_I2C is not set
CONFIG_INPUT_JOYSTICK=y
# CONFIG_JOYSTICK_ANALOG is not set
# CONFIG_JOYSTICK_A3D is not set
# CONFIG_JOYSTICK_ADI is not set
# CONFIG_JOYSTICK_COBRA is not set
# CONFIG_JOYSTICK_GF2K is not set
# CONFIG_JOYSTICK_GRIP is not set
# CONFIG_JOYSTICK_GRIP_MP is not set
# CONFIG_JOYSTICK_GUILLEMOT is not set
# CONFIG_JOYSTICK_INTERACT is not set
# CONFIG_JOYSTICK_SIDEWINDER is not set
# CONFIG_JOYSTICK_TMDC is not set
# CONFIG_JOYSTICK_IFORCE is not set
# CONFIG_JOYSTICK_WARRIOR is not set
# CONFIG_JOYSTICK_MAGELLAN is not set
# CONFIG_JOYSTICK_SPACEORB is not set
# CONFIG_JOYSTICK_SPACEBALL is not set
# CONFIG_JOYSTICK_STINGER is not set
# CONFIG_JOYSTICK_TWIDJOY is not set
# CONFIG_JOYSTICK_ZHENHUA is not set
# CONFIG_JOYSTICK_DB9 is not set
# CONFIG_JOYSTICK_GAMECON is not set
# CONFIG_JOYSTICK_TURBOGRAFX is not set
# CONFIG_JOYSTICK_JOYDUMP is not set
# CONFIG_JOYSTICK_XPAD is not set
# CONFIG_JOYSTICK_WALKERA0701 is not set
CONFIG_INPUT_TABLET=y
# CONFIG_TABLET_USB_ACECAD is not set
# CONFIG_TABLET_USB_AIPTEK is not set
# CONFIG_TABLET_USB_GTCO is not set
# CONFIG_TABLET_USB_KBTAB is not set
# CONFIG_TABLET_USB_WACOM is not set
CONFIG_INPUT_TOUCHSCREEN=y
# CONFIG_TOUCHSCREEN_ADS7846 is not set
# CONFIG_TOUCHSCREEN_AD7877 is not set
# CONFIG_TOUCHSCREEN_AD7879_I2C is not set
# CONFIG_TOUCHSCREEN_AD7879_SPI is not set
# CONFIG_TOUCHSCREEN_AD7879 is not set
CONFIG_TOUCHSCREEN_DA9034=y
# CONFIG_TOUCHSCREEN_DYNAPRO is not set
# CONFIG_TOUCHSCREEN_EETI is not set
# CONFIG_TOUCHSCREEN_FUJITSU is not set
# CONFIG_TOUCHSCREEN_GUNZE is not set
# CONFIG_TOUCHSCREEN_ELO is not set
# CONFIG_TOUCHSCREEN_WACOM_W8001 is not set
# CONFIG_TOUCHSCREEN_MCS5000 is not set
# CONFIG_TOUCHSCREEN_MTOUCH is not set
# CONFIG_TOUCHSCREEN_INEXIO is not set
# CONFIG_TOUCHSCREEN_MK712 is not set
# CONFIG_TOUCHSCREEN_HTCPEN is not set
# CONFIG_TOUCHSCREEN_PENMOUNT is not set
# CONFIG_TOUCHSCREEN_TOUCHRIGHT is not set
# CONFIG_TOUCHSCREEN_TOUCHWIN is not set
# CONFIG_TOUCHSCREEN_WM97XX is not set
# CONFIG_TOUCHSCREEN_USB_COMPOSITE is not set
# CONFIG_TOUCHSCREEN_TOUCHIT213 is not set
# CONFIG_TOUCHSCREEN_TSC2007 is not set
CONFIG_INPUT_MISC=y
CONFIG_INPUT_PCSPKR=m
# CONFIG_INPUT_APANEL is not set
# CONFIG_INPUT_WISTRON_BTNS is not set
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
# CONFIG_INPUT_UINPUT is not set
# CONFIG_INPUT_WINBOND_CIR is not set
# CONFIG_INPUT_GPIO_ROTARY_ENCODER is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
# CONFIG_SERIO_SERPORT is not set
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PARKBD is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
CONFIG_SERIO_RAW=m
# CONFIG_SERIO_ALTERA_PS2 is not set
# CONFIG_GAMEPORT is not set

#
# Character devices
#
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
# CONFIG_DEVKMEM is not set
CONFIG_SERIAL_NONSTANDARD=y
# CONFIG_COMPUTONE is not set
# CONFIG_ROCKETPORT is not set
# CONFIG_CYCLADES is not set
# CONFIG_DIGIEPCA is not set
# CONFIG_MOXA_INTELLIO is not set
# CONFIG_MOXA_SMARTIO is not set
# CONFIG_ISI is not set
# CONFIG_SYNCLINK is not set
# CONFIG_SYNCLINKMP is not set
# CONFIG_SYNCLINK_GT is not set
# CONFIG_N_HDLC is not set
# CONFIG_RISCOM8 is not set
# CONFIG_SPECIALIX is not set
CONFIG_STALDRV=y
# CONFIG_STALLION is not set
# CONFIG_ISTALLION is not set
# CONFIG_NOZOMI is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_PNP=y
CONFIG_SERIAL_8250_NR_UARTS=48
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
# CONFIG_SERIAL_8250_FOURPORT is not set
# CONFIG_SERIAL_8250_ACCENT is not set
# CONFIG_SERIAL_8250_BOCA is not set
# CONFIG_SERIAL_8250_EXAR_ST16C554 is not set
# CONFIG_SERIAL_8250_HUB6 is not set
CONFIG_SERIAL_8250_SHARE_IRQ=y
# CONFIG_SERIAL_8250_DETECT_IRQ is not set
CONFIG_SERIAL_8250_RSA=y
# CONFIG_SERIAL_8250_MCA is not set

#
# Non-8250 serial port support
#
# CONFIG_SERIAL_MAX3100 is not set
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_CONSOLE_POLL=y
# CONFIG_SERIAL_JSM is not set
# CONFIG_SERIAL_TIMBERDALE is not set
CONFIG_UNIX98_PTYS=y
# CONFIG_DEVPTS_MULTIPLE_INSTANCES is not set
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=0
CONFIG_PRINTER=m
# CONFIG_LP_CONSOLE is not set
CONFIG_PPDEV=m
# CONFIG_IPMI_HANDLER is not set
CONFIG_HW_RANDOM=y
# CONFIG_HW_RANDOM_TIMERIOMEM is not set
# CONFIG_HW_RANDOM_INTEL is not set
# CONFIG_HW_RANDOM_AMD is not set
# CONFIG_HW_RANDOM_GEODE is not set
# CONFIG_HW_RANDOM_VIA is not set
CONFIG_NVRAM=m
# CONFIG_DTLK is not set
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
# CONFIG_SONYPI is not set
# CONFIG_MWAVE is not set
# CONFIG_PC8736x_GPIO is not set
# CONFIG_NSC_GPIO is not set
# CONFIG_CS5535_GPIO is not set
# CONFIG_RAW_DRIVER is not set
CONFIG_HPET=y
CONFIG_HPET_MMAP=y
# CONFIG_HANGCHECK_TIMER is not set
CONFIG_TCG_TPM=m
# CONFIG_TCG_TIS is not set
# CONFIG_TCG_NSC is not set
# CONFIG_TCG_ATMEL is not set
CONFIG_TCG_INFINEON=m
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
CONFIG_I2C=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_COMPAT=y
# CONFIG_I2C_CHARDEV is not set
# CONFIG_I2C_HELPER_AUTO is not set
# CONFIG_I2C_SMBUS is not set

#
# I2C Algorithms
#
CONFIG_I2C_ALGOBIT=m
# CONFIG_I2C_ALGOPCF is not set
# CONFIG_I2C_ALGOPCA is not set

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
# CONFIG_I2C_I801 is not set
# CONFIG_I2C_ISCH is not set
# CONFIG_I2C_PIIX4 is not set
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set

#
# ACPI drivers
#
# CONFIG_I2C_SCMI is not set

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_GPIO is not set
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_SIMTEC is not set
# CONFIG_I2C_XILINX is not set

#
# External I2C/SMBus adapter drivers
#
# CONFIG_I2C_PARPORT is not set
# CONFIG_I2C_PARPORT_LIGHT is not set
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_PCA_ISA is not set
# CONFIG_I2C_PCA_PLATFORM is not set
# CONFIG_I2C_STUB is not set
# CONFIG_SCx200_ACB is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
CONFIG_SPI=y
# CONFIG_SPI_DEBUG is not set
CONFIG_SPI_MASTER=y

#
# SPI Master Controller Drivers
#
# CONFIG_SPI_BITBANG is not set
# CONFIG_SPI_BUTTERFLY is not set
# CONFIG_SPI_GPIO is not set
# CONFIG_SPI_LM70_LLP is not set
# CONFIG_SPI_XILINX is not set
# CONFIG_SPI_DESIGNWARE is not set

#
# SPI Protocol Masters
#
# CONFIG_SPI_SPIDEV is not set
# CONFIG_SPI_TLE62X0 is not set

#
# PPS support
#
# CONFIG_PPS is not set
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
CONFIG_GPIOLIB=y
# CONFIG_DEBUG_GPIO is not set
CONFIG_GPIO_SYSFS=y

#
# Memory mapped GPIO expanders:
#
# CONFIG_GPIO_IT8761E is not set
# CONFIG_GPIO_SCH is not set

#
# I2C GPIO expanders:
#
# CONFIG_GPIO_MAX7300 is not set
# CONFIG_GPIO_MAX732X is not set
# CONFIG_GPIO_PCA953X is not set
# CONFIG_GPIO_PCF857X is not set
# CONFIG_GPIO_ADP5588 is not set

#
# PCI GPIO expanders:
#
# CONFIG_GPIO_CS5535 is not set
# CONFIG_GPIO_BT8XX is not set
# CONFIG_GPIO_LANGWELL is not set

#
# SPI GPIO expanders:
#
# CONFIG_GPIO_MAX7301 is not set
# CONFIG_GPIO_MCP23S08 is not set
# CONFIG_GPIO_MC33880 is not set

#
# AC97 GPIO expanders:
#
# CONFIG_W1 is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
# CONFIG_PDA_POWER is not set
# CONFIG_BATTERY_DS2760 is not set
# CONFIG_BATTERY_DS2782 is not set
# CONFIG_BATTERY_OLPC is not set
# CONFIG_BATTERY_BQ27x00 is not set
# CONFIG_BATTERY_DA9030 is not set
# CONFIG_BATTERY_MAX17040 is not set
CONFIG_HWMON=y
# CONFIG_HWMON_VID is not set
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Native drivers
#
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7414 is not set
# CONFIG_SENSORS_AD7418 is not set
# CONFIG_SENSORS_ADCXX is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1029 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7411 is not set
# CONFIG_SENSORS_ADT7462 is not set
# CONFIG_SENSORS_ADT7470 is not set
# CONFIG_SENSORS_ADT7475 is not set
# CONFIG_SENSORS_ASC7621 is not set
# CONFIG_SENSORS_K8TEMP is not set
# CONFIG_SENSORS_K10TEMP is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
# CONFIG_SENSORS_FSCHMD is not set
# CONFIG_SENSORS_G760A is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
# CONFIG_SENSORS_CORETEMP is not set
# CONFIG_SENSORS_IT87 is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM70 is not set
# CONFIG_SENSORS_LM73 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_LTC4215 is not set
# CONFIG_SENSORS_LTC4245 is not set
# CONFIG_SENSORS_LM95241 is not set
# CONFIG_SENSORS_MAX1111 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_SENSORS_SHT15 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_DME1737 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_ADS7828 is not set
# CONFIG_SENSORS_AMC6821 is not set
# CONFIG_SENSORS_THMC50 is not set
# CONFIG_SENSORS_TMP401 is not set
# CONFIG_SENSORS_TMP421 is not set
# CONFIG_SENSORS_VIA_CPUTEMP is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83L786NG is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set
# CONFIG_SENSORS_HDAPS is not set
# CONFIG_SENSORS_LIS3_I2C is not set
# CONFIG_SENSORS_APPLESMC is not set

#
# ACPI drivers
#
# CONFIG_SENSORS_ATK0110 is not set
# CONFIG_SENSORS_LIS3LV02D is not set
CONFIG_THERMAL=y
CONFIG_THERMAL_HWMON=y
CONFIG_WATCHDOG=y
# CONFIG_WATCHDOG_NOWAYOUT is not set

#
# Watchdog Device Drivers
#
# CONFIG_SOFT_WATCHDOG is not set
# CONFIG_MAX63XX_WATCHDOG is not set
# CONFIG_ACQUIRE_WDT is not set
# CONFIG_ADVANTECH_WDT is not set
# CONFIG_ALIM1535_WDT is not set
# CONFIG_ALIM7101_WDT is not set
# CONFIG_SC520_WDT is not set
# CONFIG_SBC_FITPC2_WATCHDOG is not set
# CONFIG_EUROTECH_WDT is not set
# CONFIG_IB700_WDT is not set
# CONFIG_IBMASR is not set
# CONFIG_WAFER_WDT is not set
# CONFIG_I6300ESB_WDT is not set
CONFIG_ITCO_WDT=m
CONFIG_ITCO_VENDOR_SUPPORT=y
# CONFIG_IT8712F_WDT is not set
# CONFIG_IT87_WDT is not set
# CONFIG_HP_WATCHDOG is not set
# CONFIG_SC1200_WDT is not set
# CONFIG_PC87413_WDT is not set
# CONFIG_60XX_WDT is not set
# CONFIG_SBC8360_WDT is not set
# CONFIG_SBC7240_WDT is not set
# CONFIG_CPU5_WDT is not set
# CONFIG_SMSC_SCH311X_WDT is not set
# CONFIG_SMSC37B787_WDT is not set
# CONFIG_W83627HF_WDT is not set
# CONFIG_W83697HF_WDT is not set
# CONFIG_W83697UG_WDT is not set
# CONFIG_W83877F_WDT is not set
# CONFIG_W83977F_WDT is not set
# CONFIG_MACHZ_WDT is not set
# CONFIG_SBC_EPX_C3_WATCHDOG is not set

#
# ISA-based Watchdog Cards
#
# CONFIG_PCWATCHDOG is not set
# CONFIG_MIXCOMWD is not set
# CONFIG_WDT is not set

#
# PCI-based Watchdog Cards
#
# CONFIG_PCIPCWATCHDOG is not set
# CONFIG_WDTPCI is not set

#
# USB-based Watchdog Cards
#
# CONFIG_USBPCWATCHDOG is not set
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
# CONFIG_SSB is not set

#
# Multifunction device drivers
#
# CONFIG_MFD_CORE is not set
# CONFIG_MFD_88PM860X is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_HTC_I2CPLD is not set
# CONFIG_UCB1400_CORE is not set
# CONFIG_TPS65010 is not set
# CONFIG_TWL4030_CORE is not set
# CONFIG_MFD_TMIO is not set
CONFIG_PMIC_DA903X=y
# CONFIG_PMIC_ADP5520 is not set
# CONFIG_MFD_MAX8925 is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_WM831X is not set
# CONFIG_MFD_WM8350_I2C is not set
# CONFIG_MFD_WM8994 is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_MFD_MC13783 is not set
# CONFIG_AB3100_CORE is not set
# CONFIG_EZX_PCAP is not set
# CONFIG_AB4500_CORE is not set
# CONFIG_MFD_TIMBERDALE is not set
# CONFIG_LPC_SCH is not set
CONFIG_REGULATOR=y
CONFIG_REGULATOR_DEBUG=y
# CONFIG_REGULATOR_DUMMY is not set
# CONFIG_REGULATOR_FIXED_VOLTAGE is not set
# CONFIG_REGULATOR_VIRTUAL_CONSUMER is not set
# CONFIG_REGULATOR_USERSPACE_CONSUMER is not set
# CONFIG_REGULATOR_BQ24022 is not set
# CONFIG_REGULATOR_MAX1586 is not set
# CONFIG_REGULATOR_MAX8649 is not set
# CONFIG_REGULATOR_MAX8660 is not set
# CONFIG_REGULATOR_DA903X is not set
# CONFIG_REGULATOR_LP3971 is not set
# CONFIG_REGULATOR_TPS65023 is not set
# CONFIG_REGULATOR_TPS6507X is not set
# CONFIG_MEDIA_SUPPORT is not set

#
# Graphics support
#
CONFIG_AGP=m
# CONFIG_AGP_ALI is not set
# CONFIG_AGP_ATI is not set
# CONFIG_AGP_AMD is not set
# CONFIG_AGP_AMD64 is not set
CONFIG_AGP_INTEL=m
# CONFIG_AGP_NVIDIA is not set
# CONFIG_AGP_SIS is not set
# CONFIG_AGP_SWORKS is not set
# CONFIG_AGP_VIA is not set
# CONFIG_AGP_EFFICEON is not set
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=16
# CONFIG_VGA_SWITCHEROO is not set
CONFIG_DRM=m
CONFIG_DRM_KMS_HELPER=m
CONFIG_DRM_TTM=m
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_R128 is not set
CONFIG_DRM_RADEON=m
# CONFIG_DRM_RADEON_KMS is not set
# CONFIG_DRM_I810 is not set
# CONFIG_DRM_I830 is not set
CONFIG_DRM_I915=m
# CONFIG_DRM_I915_KMS is not set
# CONFIG_DRM_MGA is not set
# CONFIG_DRM_SIS is not set
# CONFIG_DRM_VIA is not set
# CONFIG_DRM_SAVAGE is not set
CONFIG_VGASTATE=m
CONFIG_VIDEO_OUTPUT_CONTROL=m
CONFIG_FB=y
CONFIG_FIRMWARE_EDID=y
CONFIG_FB_DDC=m
# CONFIG_FB_BOOT_VESA_SUPPORT is not set
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
# CONFIG_FB_SYS_FILLRECT is not set
# CONFIG_FB_SYS_COPYAREA is not set
# CONFIG_FB_SYS_IMAGEBLIT is not set
# CONFIG_FB_FOREIGN_ENDIAN is not set
# CONFIG_FB_SYS_FOPS is not set
# CONFIG_FB_SVGALIB is not set
# CONFIG_FB_MACMODES is not set
CONFIG_FB_BACKLIGHT=y
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
CONFIG_FB_ASILIANT=y
CONFIG_FB_IMSTT=y
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_UVESA is not set
# CONFIG_FB_VESA is not set
CONFIG_FB_EFI=y
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
CONFIG_FB_NVIDIA=m
CONFIG_FB_NVIDIA_I2C=y
# CONFIG_FB_NVIDIA_DEBUG is not set
CONFIG_FB_NVIDIA_BACKLIGHT=y
CONFIG_FB_RIVA=m
CONFIG_FB_RIVA_I2C=y
# CONFIG_FB_RIVA_DEBUG is not set
CONFIG_FB_RIVA_BACKLIGHT=y
CONFIG_FB_I810=m
# CONFIG_FB_I810_GTF is not set
# CONFIG_FB_LE80578 is not set
CONFIG_FB_MATROX=m
CONFIG_FB_MATROX_MILLENIUM=y
CONFIG_FB_MATROX_MYSTIQUE=y
CONFIG_FB_MATROX_G=y
CONFIG_FB_MATROX_I2C=m
# CONFIG_FB_MATROX_MAVEN is not set
CONFIG_FB_RADEON=m
CONFIG_FB_RADEON_I2C=y
CONFIG_FB_RADEON_BACKLIGHT=y
# CONFIG_FB_RADEON_DEBUG is not set
CONFIG_FB_ATY128=m
CONFIG_FB_ATY128_BACKLIGHT=y
CONFIG_FB_ATY=m
CONFIG_FB_ATY_CT=y
CONFIG_FB_ATY_GENERIC_LCD=y
CONFIG_FB_ATY_GX=y
CONFIG_FB_ATY_BACKLIGHT=y
# CONFIG_FB_S3 is not set
CONFIG_FB_SAVAGE=m
CONFIG_FB_SAVAGE_I2C=y
CONFIG_FB_SAVAGE_ACCEL=y
# CONFIG_FB_SIS is not set
CONFIG_FB_VIA=m
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
CONFIG_FB_3DFX=m
# CONFIG_FB_3DFX_ACCEL is not set
CONFIG_FB_3DFX_I2C=y
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
CONFIG_FB_GEODE=y
# CONFIG_FB_GEODE_LX is not set
# CONFIG_FB_GEODE_GX is not set
# CONFIG_FB_GEODE_GX1 is not set
# CONFIG_FB_VIRTUAL is not set
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
# CONFIG_FB_BROADSHEET is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
# CONFIG_LCD_CLASS_DEVICE is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
CONFIG_BACKLIGHT_GENERIC=y
# CONFIG_BACKLIGHT_PROGEAR is not set
# CONFIG_BACKLIGHT_DA903X is not set
# CONFIG_BACKLIGHT_MBP_NVIDIA is not set
# CONFIG_BACKLIGHT_SAHARA is not set

#
# Display device support
#
# CONFIG_DISPLAY_SUPPORT is not set

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
# CONFIG_VGACON_SOFT_SCROLLBACK is not set
# CONFIG_MDA_CONSOLE is not set
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=m
# CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY is not set
# CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
# CONFIG_LOGO is not set
CONFIG_SOUND=m
CONFIG_SOUND_OSS_CORE=y
CONFIG_SOUND_OSS_CORE_PRECLAIM=y
CONFIG_SND=m
CONFIG_SND_TIMER=m
CONFIG_SND_PCM=m
CONFIG_SND_HWDEP=m
CONFIG_SND_RAWMIDI=m
CONFIG_SND_SEQUENCER=m
CONFIG_SND_SEQ_DUMMY=m
CONFIG_SND_OSSEMUL=y
CONFIG_SND_MIXER_OSS=m
CONFIG_SND_PCM_OSS=m
CONFIG_SND_PCM_OSS_PLUGINS=y
CONFIG_SND_SEQUENCER_OSS=y
# CONFIG_SND_HRTIMER is not set
CONFIG_SND_DYNAMIC_MINORS=y
CONFIG_SND_SUPPORT_OLD_API=y
CONFIG_SND_VERBOSE_PROCFS=y
# CONFIG_SND_VERBOSE_PRINTK is not set
# CONFIG_SND_DEBUG is not set
CONFIG_SND_VMASTER=y
CONFIG_SND_DMA_SGBUF=y
CONFIG_SND_RAWMIDI_SEQ=m
CONFIG_SND_OPL3_LIB_SEQ=m
CONFIG_SND_OPL4_LIB_SEQ=m
CONFIG_SND_SBAWE_SEQ=m
CONFIG_SND_EMU10K1_SEQ=m
CONFIG_SND_MPU401_UART=m
CONFIG_SND_OPL3_LIB=m
CONFIG_SND_OPL4_LIB=m
CONFIG_SND_VX_LIB=m
CONFIG_SND_AC97_CODEC=m
CONFIG_SND_DRIVERS=y
CONFIG_SND_PCSP=m
CONFIG_SND_DUMMY=m
CONFIG_SND_VIRMIDI=m
CONFIG_SND_MTPAV=m
CONFIG_SND_MTS64=m
CONFIG_SND_SERIAL_U16550=m
CONFIG_SND_MPU401=m
CONFIG_SND_PORTMAN2X4=m
CONFIG_SND_AC97_POWER_SAVE=y
CONFIG_SND_AC97_POWER_SAVE_DEFAULT=0
CONFIG_SND_WSS_LIB=m
CONFIG_SND_SB_COMMON=m
CONFIG_SND_SB8_DSP=m
CONFIG_SND_SB16_DSP=m
CONFIG_SND_ISA=y
CONFIG_SND_ADLIB=m
CONFIG_SND_AD1816A=m
CONFIG_SND_AD1848=m
CONFIG_SND_ALS100=m
CONFIG_SND_AZT2320=m
CONFIG_SND_CMI8330=m
CONFIG_SND_CS4231=m
CONFIG_SND_CS4236=m
CONFIG_SND_ES968=m
CONFIG_SND_ES1688=m
CONFIG_SND_ES18XX=m
CONFIG_SND_SC6000=m
CONFIG_SND_GUSCLASSIC=m
CONFIG_SND_GUSEXTREME=m
CONFIG_SND_GUSMAX=m
CONFIG_SND_INTERWAVE=m
CONFIG_SND_INTERWAVE_STB=m
# CONFIG_SND_JAZZ16 is not set
CONFIG_SND_OPL3SA2=m
CONFIG_SND_OPTI92X_AD1848=m
CONFIG_SND_OPTI92X_CS4231=m
CONFIG_SND_OPTI93X=m
CONFIG_SND_MIRO=m
CONFIG_SND_SB8=m
CONFIG_SND_SB16=m
CONFIG_SND_SBAWE=m
CONFIG_SND_SB16_CSP=y
CONFIG_SND_SGALAXY=m
CONFIG_SND_SSCAPE=m
CONFIG_SND_WAVEFRONT=m
# CONFIG_SND_MSND_PINNACLE is not set
# CONFIG_SND_MSND_CLASSIC is not set
CONFIG_SND_PCI=y
CONFIG_SND_AD1889=m
CONFIG_SND_ALS300=m
CONFIG_SND_ALS4000=m
CONFIG_SND_ALI5451=m
CONFIG_SND_ATIIXP=m
CONFIG_SND_ATIIXP_MODEM=m
CONFIG_SND_AU8810=m
CONFIG_SND_AU8820=m
CONFIG_SND_AU8830=m
# CONFIG_SND_AW2 is not set
CONFIG_SND_AZT3328=m
CONFIG_SND_BT87X=m
# CONFIG_SND_BT87X_OVERCLOCK is not set
CONFIG_SND_CA0106=m
CONFIG_SND_CMIPCI=m
CONFIG_SND_OXYGEN_LIB=m
CONFIG_SND_OXYGEN=m
CONFIG_SND_CS4281=m
CONFIG_SND_CS46XX=m
CONFIG_SND_CS46XX_NEW_DSP=y
CONFIG_SND_CS5530=m
CONFIG_SND_CS5535AUDIO=m
# CONFIG_SND_CTXFI is not set
CONFIG_SND_DARLA20=m
CONFIG_SND_GINA20=m
CONFIG_SND_LAYLA20=m
CONFIG_SND_DARLA24=m
CONFIG_SND_GINA24=m
CONFIG_SND_LAYLA24=m
CONFIG_SND_MONA=m
CONFIG_SND_MIA=m
CONFIG_SND_ECHO3G=m
CONFIG_SND_INDIGO=m
CONFIG_SND_INDIGOIO=m
CONFIG_SND_INDIGODJ=m
# CONFIG_SND_INDIGOIOX is not set
# CONFIG_SND_INDIGODJX is not set
CONFIG_SND_EMU10K1=m
CONFIG_SND_EMU10K1X=m
CONFIG_SND_ENS1370=m
CONFIG_SND_ENS1371=m
CONFIG_SND_ES1938=m
CONFIG_SND_ES1968=m
CONFIG_SND_FM801=m
CONFIG_SND_HDA_INTEL=m
# CONFIG_SND_HDA_HWDEP is not set
# CONFIG_SND_HDA_INPUT_BEEP is not set
# CONFIG_SND_HDA_INPUT_JACK is not set
# CONFIG_SND_HDA_PATCH_LOADER is not set
CONFIG_SND_HDA_CODEC_REALTEK=y
CONFIG_SND_HDA_CODEC_ANALOG=y
CONFIG_SND_HDA_CODEC_SIGMATEL=y
CONFIG_SND_HDA_CODEC_VIA=y
CONFIG_SND_HDA_CODEC_ATIHDMI=y
CONFIG_SND_HDA_CODEC_NVHDMI=y
CONFIG_SND_HDA_CODEC_INTELHDMI=y
CONFIG_SND_HDA_ELD=y
CONFIG_SND_HDA_CODEC_CIRRUS=y
CONFIG_SND_HDA_CODEC_CONEXANT=y
CONFIG_SND_HDA_CODEC_CA0110=y
CONFIG_SND_HDA_CODEC_CMEDIA=y
CONFIG_SND_HDA_CODEC_SI3054=y
CONFIG_SND_HDA_GENERIC=y
CONFIG_SND_HDA_POWER_SAVE=y
CONFIG_SND_HDA_POWER_SAVE_DEFAULT=0
CONFIG_SND_HDSP=m
CONFIG_SND_HDSPM=m
CONFIG_SND_HIFIER=m
CONFIG_SND_ICE1712=m
CONFIG_SND_ICE1724=m
CONFIG_SND_INTEL8X0=m
CONFIG_SND_INTEL8X0M=m
CONFIG_SND_KORG1212=m
# CONFIG_SND_LX6464ES is not set
CONFIG_SND_MAESTRO3=m
CONFIG_SND_MIXART=m
CONFIG_SND_NM256=m
CONFIG_SND_PCXHR=m
CONFIG_SND_RIPTIDE=m
CONFIG_SND_RME32=m
CONFIG_SND_RME96=m
CONFIG_SND_RME9652=m
CONFIG_SND_SIS7019=m
CONFIG_SND_SONICVIBES=m
CONFIG_SND_TRIDENT=m
CONFIG_SND_VIA82XX=m
CONFIG_SND_VIA82XX_MODEM=m
CONFIG_SND_VIRTUOSO=m
CONFIG_SND_VX222=m
CONFIG_SND_YMFPCI=m
CONFIG_SND_SPI=y
CONFIG_SND_USB=y
CONFIG_SND_USB_AUDIO=m
# CONFIG_SND_USB_UA101 is not set
CONFIG_SND_USB_USX2Y=m
CONFIG_SND_USB_CAIAQ=m
CONFIG_SND_USB_CAIAQ_INPUT=y
CONFIG_SND_USB_US122L=m
# CONFIG_SND_SOC is not set
# CONFIG_SOUND_PRIME is not set
CONFIG_AC97_BUS=m
CONFIG_HID_SUPPORT=y
CONFIG_HID=y
CONFIG_HIDRAW=y

#
# USB Input Devices
#
CONFIG_USB_HID=m
CONFIG_HID_PID=y
CONFIG_USB_HIDDEV=y

#
# Special HID drivers
#
# CONFIG_HID_3M_PCT is not set
CONFIG_HID_A4TECH=m
CONFIG_HID_APPLE=m
CONFIG_HID_BELKIN=m
CONFIG_HID_CHERRY=m
CONFIG_HID_CHICONY=m
CONFIG_HID_CYPRESS=m
CONFIG_HID_DRAGONRISE=m
# CONFIG_DRAGONRISE_FF is not set
CONFIG_HID_EZKEY=m
CONFIG_HID_KYE=m
CONFIG_HID_GYRATION=m
CONFIG_HID_TWINHAN=m
CONFIG_HID_KENSINGTON=m
CONFIG_HID_LOGITECH=m
CONFIG_LOGITECH_FF=y
CONFIG_LOGIRUMBLEPAD2_FF=y
# CONFIG_LOGIG940_FF is not set
CONFIG_HID_MICROSOFT=m
# CONFIG_HID_MOSART is not set
CONFIG_HID_MONTEREY=m
CONFIG_HID_NTRIG=m
CONFIG_HID_ORTEK=m
CONFIG_HID_PANTHERLORD=m
CONFIG_PANTHERLORD_FF=y
CONFIG_HID_PETALYNX=m
# CONFIG_HID_QUANTA is not set
CONFIG_HID_SAMSUNG=m
CONFIG_HID_SONY=m
# CONFIG_HID_STANTUM is not set
CONFIG_HID_SUNPLUS=m
CONFIG_HID_GREENASIA=m
# CONFIG_GREENASIA_FF is not set
CONFIG_HID_SMARTJOYPLUS=m
# CONFIG_SMARTJOYPLUS_FF is not set
CONFIG_HID_TOPSEED=m
CONFIG_HID_THRUSTMASTER=m
# CONFIG_THRUSTMASTER_FF is not set
CONFIG_HID_ZEROPLUS=m
# CONFIG_ZEROPLUS_FF is not set
CONFIG_USB_SUPPORT=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set
# CONFIG_USB_ANNOUNCE_NEW_DEVICES is not set

#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
# CONFIG_USB_DEVICE_CLASS is not set
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_OTG is not set
CONFIG_USB_MON=y
# CONFIG_USB_WUSB is not set
# CONFIG_USB_WUSB_CBAF is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
# CONFIG_USB_XHCI_HCD is not set
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_ISP1760_HCD is not set
# CONFIG_USB_ISP1362_HCD is not set
CONFIG_USB_OHCI_HCD=y
# CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=y
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_WHCI_HCD is not set
# CONFIG_USB_HWA_HCD is not set
# CONFIG_USB_GADGET_MUSB_HDRC is not set

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
# CONFIG_USB_PRINTER is not set
# CONFIG_USB_WDM is not set
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
#

#
# also be needed; see USB_STORAGE Help for more info
#
CONFIG_USB_STORAGE=y
# CONFIG_USB_STORAGE_DEBUG is not set
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
# CONFIG_USB_STORAGE_ISD200 is not set
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_ONETOUCH is not set
# CONFIG_USB_STORAGE_KARMA is not set
# CONFIG_USB_STORAGE_CYPRESS_ATACB is not set
# CONFIG_USB_LIBUSUAL is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set

#
# USB port drivers
#
# CONFIG_USB_USS720 is not set
# CONFIG_USB_SERIAL is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
CONFIG_USB_APPLEDISPLAY=m
CONFIG_USB_SISUSBVGA=m
# CONFIG_USB_SISUSBVGA_CON is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_ATM is not set
CONFIG_USB_GADGET=m
# CONFIG_USB_GADGET_DEBUG is not set
# CONFIG_USB_GADGET_DEBUG_FILES is not set
# CONFIG_USB_GADGET_DEBUG_FS is not set
CONFIG_USB_GADGET_VBUS_DRAW=2
CONFIG_USB_GADGET_SELECTED=y
# CONFIG_USB_GADGET_AT91 is not set
# CONFIG_USB_GADGET_ATMEL_USBA is not set
# CONFIG_USB_GADGET_FSL_USB2 is not set
# CONFIG_USB_GADGET_LH7A40X is not set
# CONFIG_USB_GADGET_OMAP is not set
# CONFIG_USB_GADGET_PXA25X is not set
# CONFIG_USB_GADGET_R8A66597 is not set
# CONFIG_USB_GADGET_PXA27X is not set
# CONFIG_USB_GADGET_S3C_HSOTG is not set
# CONFIG_USB_GADGET_IMX is not set
# CONFIG_USB_GADGET_S3C2410 is not set
# CONFIG_USB_GADGET_M66592 is not set
# CONFIG_USB_GADGET_AMD5536UDC is not set
# CONFIG_USB_GADGET_FSL_QE is not set
# CONFIG_USB_GADGET_CI13XXX is not set
# CONFIG_USB_GADGET_NET2280 is not set
# CONFIG_USB_GADGET_GOKU is not set
# CONFIG_USB_GADGET_LANGWELL is not set
CONFIG_USB_GADGET_DUMMY_HCD=y
CONFIG_USB_DUMMY_HCD=m
CONFIG_USB_GADGET_DUALSPEED=y
# CONFIG_USB_ZERO is not set
# CONFIG_USB_AUDIO is not set
# CONFIG_USB_ETH is not set
# CONFIG_USB_GADGETFS is not set
# CONFIG_USB_FILE_STORAGE is not set
# CONFIG_USB_MASS_STORAGE is not set
# CONFIG_USB_G_SERIAL is not set
# CONFIG_USB_MIDI_GADGET is not set
# CONFIG_USB_G_PRINTER is not set
# CONFIG_USB_CDC_COMPOSITE is not set
# CONFIG_USB_G_NOKIA is not set
# CONFIG_USB_G_MULTI is not set

#
# OTG and related infrastructure
#
# CONFIG_USB_GPIO_VBUS is not set
# CONFIG_NOP_USB_XCEIV is not set
# CONFIG_UWB is not set
CONFIG_MMC=y
# CONFIG_MMC_DEBUG is not set
# CONFIG_MMC_UNSAFE_RESUME is not set

#
# MMC/SD/SDIO Card Drivers
#
# CONFIG_MMC_BLOCK is not set
# CONFIG_SDIO_UART is not set
# CONFIG_MMC_TEST is not set

#
# MMC/SD/SDIO Host Controller Drivers
#
# CONFIG_MMC_SDHCI is not set
# CONFIG_MMC_WBSD is not set
# CONFIG_MMC_TIFM_SD is not set
# CONFIG_MMC_CB710 is not set
# CONFIG_MMC_VIA_SDMMC is not set
# CONFIG_MEMSTICK is not set
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=m

#
# LED drivers
#
# CONFIG_LEDS_ALIX2 is not set
# CONFIG_LEDS_PCA9532 is not set
# CONFIG_LEDS_GPIO is not set
# CONFIG_LEDS_LP3944 is not set
# CONFIG_LEDS_CLEVO_MAIL is not set
# CONFIG_LEDS_PCA955X is not set
# CONFIG_LEDS_DA903X is not set
# CONFIG_LEDS_DAC124S085 is not set
# CONFIG_LEDS_REGULATOR is not set
# CONFIG_LEDS_BD2802 is not set
# CONFIG_LEDS_INTEL_SS4200 is not set
# CONFIG_LEDS_LT3593 is not set
# CONFIG_LEDS_DELL_NETBOOKS is not set
CONFIG_LEDS_TRIGGERS=y

#
# LED Triggers
#
# CONFIG_LEDS_TRIGGER_TIMER is not set
# CONFIG_LEDS_TRIGGER_HEARTBEAT is not set
# CONFIG_LEDS_TRIGGER_BACKLIGHT is not set
# CONFIG_LEDS_TRIGGER_GPIO is not set
# CONFIG_LEDS_TRIGGER_DEFAULT_ON is not set

#
# iptables trigger is under Netfilter config (LED target)
#
# CONFIG_ACCESSIBILITY is not set
# CONFIG_INFINIBAND is not set
CONFIG_EDAC=y

#
# Reporting subsystems
#
# CONFIG_EDAC_DEBUG is not set
# CONFIG_EDAC_MM_EDAC is not set
CONFIG_RTC_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
# CONFIG_RTC_DEBUG is not set

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
CONFIG_RTC_INTF_DEV_UIE_EMUL=y
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
# CONFIG_RTC_DRV_DS1307 is not set
# CONFIG_RTC_DRV_DS1374 is not set
# CONFIG_RTC_DRV_DS1672 is not set
# CONFIG_RTC_DRV_MAX6900 is not set
# CONFIG_RTC_DRV_RS5C372 is not set
# CONFIG_RTC_DRV_ISL1208 is not set
# CONFIG_RTC_DRV_X1205 is not set
# CONFIG_RTC_DRV_PCF8563 is not set
# CONFIG_RTC_DRV_PCF8583 is not set
# CONFIG_RTC_DRV_M41T80 is not set
# CONFIG_RTC_DRV_BQ32K is not set
# CONFIG_RTC_DRV_S35390A is not set
# CONFIG_RTC_DRV_FM3130 is not set
# CONFIG_RTC_DRV_RX8581 is not set
# CONFIG_RTC_DRV_RX8025 is not set

#
# SPI RTC drivers
#
# CONFIG_RTC_DRV_M41T94 is not set
# CONFIG_RTC_DRV_DS1305 is not set
# CONFIG_RTC_DRV_DS1390 is not set
# CONFIG_RTC_DRV_MAX6902 is not set
# CONFIG_RTC_DRV_R9701 is not set
# CONFIG_RTC_DRV_RS5C348 is not set
# CONFIG_RTC_DRV_DS3234 is not set
# CONFIG_RTC_DRV_PCF2123 is not set

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=y
# CONFIG_RTC_DRV_DS1286 is not set
# CONFIG_RTC_DRV_DS1511 is not set
# CONFIG_RTC_DRV_DS1553 is not set
# CONFIG_RTC_DRV_DS1742 is not set
# CONFIG_RTC_DRV_STK17TA8 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_M48T35 is not set
# CONFIG_RTC_DRV_M48T59 is not set
# CONFIG_RTC_DRV_MSM6242 is not set
# CONFIG_RTC_DRV_BQ4802 is not set
# CONFIG_RTC_DRV_RP5C01 is not set
# CONFIG_RTC_DRV_V3020 is not set

#
# on-CPU RTC drivers
#
CONFIG_DMADEVICES=y
# CONFIG_DMADEVICES_DEBUG is not set

#
# DMA Devices
#
# CONFIG_INTEL_IOATDMA is not set
CONFIG_AUXDISPLAY=y
# CONFIG_KS0108 is not set
# CONFIG_UIO is not set

#
# TI VLYNQ
#
# CONFIG_STAGING is not set
CONFIG_X86_PLATFORM_DEVICES=y
# CONFIG_ACER_WMI is not set
# CONFIG_ACERHDF is not set
# CONFIG_ASUS_LAPTOP is not set
# CONFIG_DELL_WMI is not set
# CONFIG_FUJITSU_LAPTOP is not set
# CONFIG_TC1100_WMI is not set
# CONFIG_HP_WMI is not set
# CONFIG_MSI_LAPTOP is not set
# CONFIG_PANASONIC_LAPTOP is not set
# CONFIG_COMPAL_LAPTOP is not set
# CONFIG_SONY_LAPTOP is not set
CONFIG_THINKPAD_ACPI=m
CONFIG_THINKPAD_ACPI_ALSA_SUPPORT=y
# CONFIG_THINKPAD_ACPI_DEBUGFACILITIES is not set
# CONFIG_THINKPAD_ACPI_DEBUG is not set
# CONFIG_THINKPAD_ACPI_UNSAFE_LEDS is not set
CONFIG_THINKPAD_ACPI_VIDEO=y
CONFIG_THINKPAD_ACPI_HOTKEY_POLL=y
# CONFIG_INTEL_MENLOW is not set
# CONFIG_EEEPC_LAPTOP is not set
CONFIG_ACPI_WMI=y
# CONFIG_MSI_WMI is not set
# CONFIG_ACPI_ASUS is not set
# CONFIG_TOPSTAR_LAPTOP is not set
# CONFIG_ACPI_TOSHIBA is not set
# CONFIG_TOSHIBA_BT_RFKILL is not set
# CONFIG_ACPI_CMPC is not set

#
# Firmware Drivers
#
CONFIG_EDD=y
CONFIG_EDD_OFF=y
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_EFI_VARS=y
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
CONFIG_DMIID=y
CONFIG_ISCSI_IBFT_FIND=y
# CONFIG_ISCSI_IBFT is not set

#
# File systems
#
CONFIG_EXT2_FS=y
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
CONFIG_EXT2_FS_SECURITY=y
# CONFIG_EXT2_FS_XIP is not set
CONFIG_EXT3_FS=y
# CONFIG_EXT3_DEFAULTS_TO_ORDERED is not set
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
CONFIG_EXT4_FS=y
CONFIG_EXT4_FS_XATTR=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
# CONFIG_EXT4_DEBUG is not set
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_JBD2=y
# CONFIG_JBD2_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
CONFIG_FS_POSIX_ACL=y
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
# CONFIG_BTRFS_FS is not set
# CONFIG_NILFS2_FS is not set
CONFIG_FILE_LOCKING=y
CONFIG_FSNOTIFY=y
CONFIG_DNOTIFY=y
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_PRINT_QUOTA_WARNING=y
# CONFIG_QFMT_V1 is not set
# CONFIG_QFMT_V2 is not set
CONFIG_QUOTACTL=y
# CONFIG_AUTOFS_FS is not set
# CONFIG_AUTOFS4_FS is not set
CONFIG_FUSE_FS=y
# CONFIG_CUSE is not set
CONFIG_GENERIC_ACL=y

#
# Caches
#
# CONFIG_FSCACHE is not set

#
# CD-ROM/DVD Filesystems
#
# CONFIG_ISO9660_FS is not set
# CONFIG_UDF_FS is not set

#
# DOS/FAT/NT Filesystems
#
# CONFIG_MSDOS_FS is not set
# CONFIG_VFAT_FS is not set
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
# CONFIG_CONFIGFS_FS is not set
CONFIG_MISC_FILESYSTEMS=y
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
CONFIG_ECRYPT_FS=y
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_LOGFS is not set
CONFIG_CRAMFS=y
# CONFIG_SQUASHFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_ROMFS_FS is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
# CONFIG_NFS_FS is not set
# CONFIG_NFSD is not set
# CONFIG_SMB_FS is not set
# CONFIG_CEPH_FS is not set
# CONFIG_CIFS is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
CONFIG_ACORN_PARTITION=y
# CONFIG_ACORN_PARTITION_CUMANA is not set
# CONFIG_ACORN_PARTITION_EESOX is not set
CONFIG_ACORN_PARTITION_ICS=y
# CONFIG_ACORN_PARTITION_ADFS is not set
# CONFIG_ACORN_PARTITION_POWERTEC is not set
CONFIG_ACORN_PARTITION_RISCIX=y
CONFIG_OSF_PARTITION=y
CONFIG_AMIGA_PARTITION=y
CONFIG_ATARI_PARTITION=y
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
CONFIG_LDM_PARTITION=y
# CONFIG_LDM_DEBUG is not set
CONFIG_SGI_PARTITION=y
CONFIG_ULTRIX_PARTITION=y
CONFIG_SUN_PARTITION=y
CONFIG_KARMA_PARTITION=y
CONFIG_EFI_PARTITION=y
CONFIG_SYSV68_PARTITION=y
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="cp437"
# CONFIG_NLS_CODEPAGE_437 is not set
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
# CONFIG_NLS_ASCII is not set
# CONFIG_NLS_ISO8859_1 is not set
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_UTF8 is not set
# CONFIG_DLM is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
# CONFIG_PRINTK_TIME is not set
# CONFIG_ENABLE_WARN_DEPRECATED is not set
# CONFIG_ENABLE_MUST_CHECK is not set
CONFIG_FRAME_WARN=1024
CONFIG_MAGIC_SYSRQ=y
# CONFIG_STRIP_ASM_SYMS is not set
CONFIG_UNUSED_SYMBOLS=y
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_KERNEL=y
# CONFIG_DEBUG_SHIRQ is not set
CONFIG_DETECT_SOFTLOCKUP=y
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
CONFIG_DETECT_HUNG_TASK=y
# CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set
CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=0
CONFIG_SCHED_DEBUG=y
CONFIG_SCHEDSTATS=y
CONFIG_TIMER_STATS=y
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
# CONFIG_DEBUG_KMEMLEAK is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_PROVE_RCU=y
CONFIG_LOCKDEP=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_TRACE_IRQFLAGS=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
CONFIG_DEBUG_LOCKING_API_SELFTESTS=y
CONFIG_STACKTRACE=y
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_HIGHMEM=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_VM=y
# CONFIG_DEBUG_VIRTUAL is not set
# CONFIG_DEBUG_WRITECOUNT is not set
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_DEBUG_LIST=y
# CONFIG_DEBUG_SG is not set
CONFIG_DEBUG_NOTIFIERS=y
# CONFIG_DEBUG_CREDENTIALS is not set
CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_DETECTOR=y
# CONFIG_KPROBES_SANITY_TEST is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
# CONFIG_LKDTM is not set
# CONFIG_FAULT_INJECTION is not set
CONFIG_LATENCYTOP=y
CONFIG_SYSCTL_SYSCALL_CHECK=y
# CONFIG_DEBUG_PAGEALLOC is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_TRACING_SUPPORT=y
# CONFIG_FTRACE is not set
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
CONFIG_DYNAMIC_DEBUG=y
# CONFIG_DMA_API_DEBUG is not set
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KGDB=y
CONFIG_KGDB=y
CONFIG_KGDB_SERIAL_CONSOLE=y
# CONFIG_KGDB_TESTS is not set
CONFIG_HAVE_ARCH_KMEMCHECK=y
# CONFIG_KMEMCHECK is not set
CONFIG_STRICT_DEVMEM=y
# CONFIG_X86_VERBOSE_BOOTUP is not set
CONFIG_EARLY_PRINTK=y
# CONFIG_EARLY_PRINTK_DBGP is not set
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_PER_CPU_MAPS is not set
# CONFIG_X86_PTDUMP is not set
CONFIG_DEBUG_RODATA=y
# CONFIG_DEBUG_RODATA_TEST is not set
# CONFIG_DEBUG_NX_TEST is not set
# CONFIG_4KSTACKS is not set
CONFIG_DOUBLEFAULT=y
# CONFIG_IOMMU_STRESS is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
# CONFIG_X86_DECODER_SELFTEST is not set
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
# CONFIG_IO_DELAY_0X80 is not set
CONFIG_IO_DELAY_0XED=y
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=1
# CONFIG_DEBUG_BOOT_PARAMS is not set
# CONFIG_CPA_DEBUG is not set
CONFIG_OPTIMIZE_INLINING=y
CONFIG_DEBUG_STRICT_USER_COPY_CHECKS=y

#
# Security options
#
CONFIG_KEYS=y
# CONFIG_KEYS_DEBUG_PROC_KEYS is not set
CONFIG_SECURITY=y
CONFIG_SECURITYFS=y
CONFIG_SECURITY_NETWORK=y
# CONFIG_SECURITY_PATH is not set
CONFIG_LSM_MMAP_MIN_ADDR=0
CONFIG_SECURITY_SELINUX=y
CONFIG_SECURITY_SELINUX_BOOTPARAM=y
CONFIG_SECURITY_SELINUX_BOOTPARAM_VALUE=0
CONFIG_SECURITY_SELINUX_DISABLE=y
CONFIG_SECURITY_SELINUX_DEVELOP=y
CONFIG_SECURITY_SELINUX_AVC_STATS=y
CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=1
# CONFIG_SECURITY_SELINUX_POLICYDB_VERSION_MAX is not set
CONFIG_SECURITY_SMACK=y
# CONFIG_SECURITY_TOMOYO is not set
# CONFIG_IMA is not set
CONFIG_DEFAULT_SECURITY_SELINUX=y
# CONFIG_DEFAULT_SECURITY_SMACK is not set
# CONFIG_DEFAULT_SECURITY_TOMOYO is not set
# CONFIG_DEFAULT_SECURITY_DAC is not set
CONFIG_DEFAULT_SECURITY="selinux"
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_PCOMP=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_GF128MUL is not set
# CONFIG_CRYPTO_NULL is not set
# CONFIG_CRYPTO_PCRYPT is not set
CONFIG_CRYPTO_WORKQUEUE=y
# CONFIG_CRYPTO_CRYPTD is not set
# CONFIG_CRYPTO_AUTHENC is not set
# CONFIG_CRYPTO_TEST is not set

#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
# CONFIG_CRYPTO_SEQIV is not set

#
# Block modes
#
CONFIG_CRYPTO_CBC=y
# CONFIG_CRYPTO_CTR is not set
# CONFIG_CRYPTO_CTS is not set
CONFIG_CRYPTO_ECB=y
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_PCBC is not set
# CONFIG_CRYPTO_XTS is not set

#
# Hash modes
#
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_VMAC is not set

#
# Digest
#
# CONFIG_CRYPTO_CRC32C is not set
# CONFIG_CRYPTO_CRC32C_INTEL is not set
# CONFIG_CRYPTO_GHASH is not set
# CONFIG_CRYPTO_MD4 is not set
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_MICHAEL_MIC is not set
# CONFIG_CRYPTO_RMD128 is not set
# CONFIG_CRYPTO_RMD160 is not set
# CONFIG_CRYPTO_RMD256 is not set
# CONFIG_CRYPTO_RMD320 is not set
# CONFIG_CRYPTO_SHA1 is not set
# CONFIG_CRYPTO_SHA256 is not set
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_WP512 is not set

#
# Ciphers
#
# CONFIG_CRYPTO_AES is not set
# CONFIG_CRYPTO_AES_586 is not set
# CONFIG_CRYPTO_ANUBIS is not set
# CONFIG_CRYPTO_ARC4 is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_DES is not set
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SALSA20 is not set
# CONFIG_CRYPTO_SALSA20_586 is not set
# CONFIG_CRYPTO_SEED is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_TEA is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_586 is not set

#
# Compression
#
# CONFIG_CRYPTO_DEFLATE is not set
# CONFIG_CRYPTO_ZLIB is not set
# CONFIG_CRYPTO_LZO is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
CONFIG_CRYPTO_HW=y
CONFIG_CRYPTO_DEV_PADLOCK=y
# CONFIG_CRYPTO_DEV_PADLOCK_AES is not set
# CONFIG_CRYPTO_DEV_PADLOCK_SHA is not set
# CONFIG_CRYPTO_DEV_GEODE is not set
# CONFIG_CRYPTO_DEV_HIFN_795X is not set
CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_IRQCHIP=y
CONFIG_HAVE_KVM_EVENTFD=y
CONFIG_KVM_APIC_ARCHITECTURE=y
CONFIG_KVM_MMIO=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=m
CONFIG_KVM_INTEL=m
# CONFIG_KVM_AMD is not set
# CONFIG_VHOST_NET is not set
# CONFIG_LGUEST is not set
# CONFIG_VIRTIO_PCI is not set
# CONFIG_VIRTIO_BALLOON is not set
# CONFIG_BINARY_PRINTF is not set

#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_GENERIC_FIND_LAST_BIT=y
# CONFIG_CRC_CCITT is not set
CONFIG_CRC16=y
CONFIG_CRC_T10DIF=y
# CONFIG_CRC_ITU_T is not set
CONFIG_CRC32=y
# CONFIG_CRC7 is not set
# CONFIG_LIBCRC32C is not set
CONFIG_AUDIT_GENERIC=y
CONFIG_ZLIB_INFLATE=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_BZIP2=y
CONFIG_DECOMPRESS_LZMA=y
CONFIG_DECOMPRESS_LZO=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y
CONFIG_NLATTR=y

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-13 14:36       ` Dave Chinner
  2010-04-14  3:12         ` Dave Chinner
@ 2010-04-14  6:52         ` KOSAKI Motohiro
  2010-04-14  7:36           ` Dave Chinner
  1 sibling, 1 reply; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-14  6:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: kosaki.motohiro, linux-kernel, linux-mm, linux-fsdevel, Chris Mason

> On Tue, Apr 13, 2010 at 08:39:29PM +0900, KOSAKI Motohiro wrote:
> > Hi
> > 
> > > > Pros:
> > > > 	1) prevent XFS stack overflow
> > > > 	2) improve io workload performance
> > > > 
> > > > Cons:
> > > > 	3) TOTALLY kill lumpy reclaim (i.e. high order allocation)
> > > > 
> > > > So, if we only needed to consider IO workloads, there would be no
> > > > downside. But we can't consider only that.
> > > > 
> > > > I think (1) is an XFS issue. XFS should take care of it itself.
> > > 
> > > The filesystem is irrelevant, IMO.
> > > 
> > > The traces from the reporter showed that we've got close to a 2k
> > > stack footprint for memory allocation to direct reclaim, and then we
> > > can put the entire writeback path on top of that. This is roughly
> > > 3.5k for XFS, and then, depending on the storage subsystem
> > > configuration and transport, another 2k of stack can be needed below
> > > XFS.
> > > 
> > > IOWs, if we completely ignore the filesystem stack usage, there's
> > > still up to 4k of stack needed in the direct reclaim path. Given
> > > that one of the stack traces supplied shows direct reclaim being
> > > entered with over 3k of stack already used, pretty much any
> > > filesystem is capable of blowing an 8k stack.
> > > 
> > > So, this is not an XFS issue, even though XFS is the first to
> > > uncover it. Don't shoot the messenger....
> > 
> > Thanks for the explanation. I hadn't noticed that direct reclaim
> > consumes 2k of stack. I'll investigate it and try to put it on a diet.
> > But XFS's 3.5k stack consumption is too large too; please put it on a
> > diet as well.
> 
> It hasn't grown in the last 2 years, since the last major diet where
> all the fat was trimmed from it in the last round of the i386 4k-stack
> vs XFS saga. It seems that everything else around XFS has grown in
> that time, and now we are blowing stacks again....

I have a dumb question: if XFS hasn't bloated its stack usage, why did
3.5k of stack usage work fine on a 4k-stack kernel? It seems impossible.

Please don't think I'm blaming you. I don't know what the "4k stack vs
XFS saga" is; I merely want to understand what you said.


> > > Hence I think that direct reclaim should be deferring to the
> > > background flusher threads for cleaning memory and not trying to
> > > do it itself.
> > 
> > Well, you seem to be continuing to discuss IO workloads. I don't
> > disagree on that point.
> > 
> > For example, if only order-0 reclaim skips pageout(), we will get the
> > above benefit too.
> 
> But it won't prevent stack blowups...
> 
> > > > but we can never kill pageout() completely, because we can't
> > > > assume users don't run high-order allocation workloads.
> > > 
> > > I think that lumpy reclaim will still work just fine.
> > > 
> > > Lumpy reclaim appears to be using IO as a method of slowing
> > > down the reclaim cycle - the congestion_wait() call will still
> > > function as it does now if the background flusher threads are active
> > > and causing congestion. I don't see why lumpy reclaim specifically
> > > needs to be issuing IO to make it work - if the congestion_wait() is
> > > not waiting long enough then wait longer - don't issue IO to extend
> > > the wait time.
> > 
> > Lumpy reclaim is for allocating high-order pages. It not only
> > reclaims the page at the head of the LRU, but also its PFN
> > neighborhood. The PFN neighborhood often consists of newly allocated,
> > still-dirty pages, so we force pageout() to clean them and then
> > discard them.
> 
> Ok, I see that now - I missed the second call to __isolate_lru_pages()
> in isolate_lru_pages().

No problem. It's one of the VM's messier corners; most developers don't
know about it :-)
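
For reference, here is a rough sketch of what that second call is doing.
This is a condensed approximation of the lumpy path in
isolate_lru_pages() (error handling, zone checks and pfn_valid_within()
are omitted, and the surrounding variables - page, order, mode, file,
dst, nr_taken - are the function's parameters and locals), not the
exact kernel code:

	/*
	 * After isolating the page at the head of the LRU, lumpy
	 * reclaim also scans the naturally aligned, order-sized PFN
	 * block around it and tries to isolate those neighbours too --
	 * including dirty ones, which then need pageout().
	 */
	unsigned long page_pfn = page_to_pfn(page);
	unsigned long pfn = page_pfn & ~((1UL << order) - 1);
	unsigned long end_pfn = pfn + (1UL << order);

	for (; pfn < end_pfn; pfn++) {
		struct page *cursor_page;

		if (pfn == page_pfn)	/* already isolated above */
			continue;
		cursor_page = pfn_to_page(pfn);

		/* the second isolation call for the PFN neighbours */
		if (__isolate_lru_page(cursor_page, mode, file) == 0) {
			list_move(&cursor_page->lru, dst);
			nr_taken++;
		}
	}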



> > When a high-order allocation occurs, we not only need to free enough
> > memory, but also need to free a large enough contiguous memory block.
> 
> Agreed, that was why I was kind of surprised not to find it was
> doing that. But, as you have pointed out, that was my mistake.
> 
> > If we needed to consider _only_ IO throughput, waiting for the
> > flusher thread might perhaps be faster, but we also need to consider
> > reclaim latency. I'm worried about that point too.
> 
> True, but without knowing how to test and measure such things I can't
> really comment...

Agreed. I know that making a VM measurement benchmark is very
difficult, but it is probably necessary....
I'm sorry, I can't give you a good, convenient benchmark right now.

> 
> > > Of course, the code is a maze of twisty passages, so I probably
> > > missed something important. Hopefully someone can tell me what. ;)
> > > 
> > > FWIW, the biggest problem here is that I have absolutely no clue on
> > > how to test what the impact on lumpy reclaim really is. Does anyone
> > > have a relatively simple test that can be run to determine what the
> > > impact is?
> > 
> > So, can you please run two workloads concurrently?
> >  - Normal IO workload (fio, iozone, etc..)
> >  - echo $NUM > /proc/sys/vm/nr_hugepages
> 
> What do I measure/observe/record that is meaningful?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 




* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-13 20:20       ` Chris Mason
  2010-04-14  1:40         ` Dave Chinner
@ 2010-04-14  6:52         ` KOSAKI Motohiro
  2010-04-14 10:06         ` Andi Kleen
  2 siblings, 0 replies; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-14  6:52 UTC (permalink / raw)
  To: Chris Mason, Mel Gorman, Dave Chinner, linux-kernel, linux-mm,
	linux-fsdevel
  Cc: kosaki.motohiro

> On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > This problem is not a filesystem recursion problem which is, as I
> > > understand it, what GFP_NOFS is used to prevent. It's _any_ kernel
> > > code that uses significant stack before trying to allocate memory
> > > that is the problem, e.g. a select() system call:
> > > 
> > >        Depth    Size   Location    (47 entries)
> > >        -----    ----   --------
> > >  0)     7568      16   mempool_alloc_slab+0x16/0x20
> > >  1)     7552     144   mempool_alloc+0x65/0x140
> > >  2)     7408      96   get_request+0x124/0x370
> > >  3)     7312     144   get_request_wait+0x29/0x1b0
> > >  4)     7168      96   __make_request+0x9b/0x490
> > >  5)     7072     208   generic_make_request+0x3df/0x4d0
> > >  6)     6864      80   submit_bio+0x7c/0x100
> > >  7)     6784      96   _xfs_buf_ioapply+0x128/0x2c0 [xfs]
> > > ....
> > > 32)     3184      64   xfs_vm_writepage+0xab/0x160 [xfs]
> > > 33)     3120     384   shrink_page_list+0x65e/0x840
> > > 34)     2736     528   shrink_zone+0x63f/0xe10
> > > 35)     2208     112   do_try_to_free_pages+0xc2/0x3c0
> > > 36)     2096     128   try_to_free_pages+0x77/0x80
> > > 37)     1968     240   __alloc_pages_nodemask+0x3e4/0x710
> > > 38)     1728      48   alloc_pages_current+0x8c/0xe0
> > > 39)     1680      16   __get_free_pages+0xe/0x50
> > > 40)     1664      48   __pollwait+0xca/0x110
> > > 41)     1616      32   unix_poll+0x28/0xc0
> > > 42)     1584      16   sock_poll+0x1d/0x20
> > > 43)     1568     912   do_select+0x3d6/0x700
> > > 44)      656     416   core_sys_select+0x18c/0x2c0
> > > 45)      240     112   sys_select+0x4f/0x110
> > > 46)      128     128   system_call_fastpath+0x16/0x1b
> > > 
> > > There's 1.6k of stack used before memory allocation is called, 3.1k
> > > used there before ->writepage is entered, XFS used 3.5k, and
> > > if the mempool needed to allocate a page it would have blown the
> > > stack. If there was any significant storage subsystem (add dm, md
> > > and/or scsi of some kind), it would have blown the stack.
> > > 
> > > Basically, there is not enough stack space available to allow direct
> > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > profiles we are seeing here....
> > > 
> > 
> > I'm not denying the evidence but how has it been gotten away with for years
> > then? Prevention of writeback isn't the answer without figuring out how
> > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > doing sync IO, then waiting on those pages.
> 
> So, I've been reading along, nodding my head to Dave's side of things
> because seeks are evil and direct reclaim makes seeks.  I'd really love
> for direct reclaim to somehow trigger writepages on large chunks instead
> of doing page by page spatters of IO to the drive.
> 
> But, somewhere along the line I overlooked the part of Dave's stack trace
> that said:
> 
> 43)     1568     912   do_select+0x3d6/0x700
> 
> Huh, 912 bytes...for select, really?  From poll.h:
> 
> /* ~832 bytes of stack space used max in sys_select/sys_poll before allocating
>    additional memory. */
> #define MAX_STACK_ALLOC 832
> #define FRONTEND_STACK_ALLOC    256
> #define SELECT_STACK_ALLOC      FRONTEND_STACK_ALLOC
> #define POLL_STACK_ALLOC        FRONTEND_STACK_ALLOC
> #define WQUEUES_STACK_ALLOC     (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)
> #define N_INLINE_POLL_ENTRIES   (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))
> 
> So, select is intentionally trying to use that much stack.  It should be using
> GFP_NOFS if it really wants to suck down that much stack...if only the
> kernel had some sort of way to dynamically allocate ram, it could try
> that too.

Yeah, of course it's too much. I would propose reverting 70674f95c0.
But I doubt GFP_NOFS solves our issue.




^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14  3:12         ` Dave Chinner
@ 2010-04-14  6:52           ` KOSAKI Motohiro
  2010-04-15  1:56             ` Dave Chinner
  0 siblings, 1 reply; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-14  6:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: kosaki.motohiro, linux-kernel, linux-mm, linux-fsdevel, Chris Mason

> On Wed, Apr 14, 2010 at 12:36:59AM +1000, Dave Chinner wrote:
> > On Tue, Apr 13, 2010 at 08:39:29PM +0900, KOSAKI Motohiro wrote:
> > > > FWIW, the biggest problem here is that I have absolutely no clue on
> > > > how to test what the impact on lumpy reclaim really is. Does anyone
> > > > have a relatively simple test that can be run to determine what the
> > > > impact is?
> > > 
> > > So, can you please run two workloads concurrently?
> > >  - Normal IO workload (fio, iozone, etc..)
> > >  - echo $NUM > /proc/sys/vm/nr_hugepages
> > 
> > What do I measure/observe/record that is meaningful?
> 
> So, a rough-as-guts first pass - just run a large dd (8 times the
> size of memory - 8GB file vs 1GB RAM) and repeatedly try to allocate
> the whole of memory in huge pages (500) every 5 seconds. The IO
> rate is roughly 100MB/s, so it takes 75-85s to complete the dd.
> 
> The script:
> 
> $ cat t.sh
> #!/bin/bash
> 
> echo 0 > /proc/sys/vm/nr_hugepages
> echo 3 > /proc/sys/vm/drop_caches
> 
> dd if=/dev/zero of=/mnt/scratch/test bs=1024k count=8000 > /dev/null 2>&1 &
> 
> (
> for i in `seq 1 1 20`; do
>         sleep 5
>         /usr/bin/time --format="wall %e" sh -c "echo 500 > /proc/sys/vm/nr_hugepages" 2>&1
>         grep HugePages_Total /proc/meminfo
> done
> ) | awk '
>         /wall/ { wall += $2; cnt += 1 }
>         /Pages/ { pages[cnt] = $2 }
>         END { printf "average wall time %f\nPages step: ", wall / cnt ;
>                 for (i = 1; i <= cnt; i++) {
>                         printf "%d ", pages[i];
>                 }
>         }'
> ----
> 
> And the output looks like:
> 
> $ sudo ./t.sh
> average wall time 0.954500
> Pages step: 97 101 101 121 173 173 173 173 173 173 175 194 195 195 202 220 226 419 423 426
> $
> 
> Run 50 times in a loop, and the outputs averaged, the existing lumpy
> reclaim resulted in:
> 
> dave@test-1:~$ cat current.txt | awk -f av.awk
> av. wall = 0.519385 secs
> av Pages step: 192 228 242 255 265 272 279 284 289 294 298 303 307 322 342 366 383 401 412 420
> 
> And with my patch that disables ->writepage:
> 
> dave@test-1:~$ cat no-direct.txt | awk -f av.awk
> av. wall = 0.554163 secs
> av Pages step: 231 283 310 316 323 328 336 340 345 351 356 359 364 377 388 397 413 423 432 439
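> 
> (For reference, av.awk just averages those per-run outputs - roughly the
> following, though treat it as a sketch rather than the exact script:
> 
> /average wall time/ { wall += $4; runs++ }
> /Pages step:/ { for (i = 3; i <= NF; i++) steps[i - 2] += $i; n = NF - 2 }
> END {
>         printf "av. wall = %f secs\nav Pages step: ", wall / runs
>         for (i = 1; i <= n; i++)
>                 printf "%d ", steps[i] / runs
>         printf "\n"
> }
> )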
> 
> Basically, with my patch lumpy reclaim was *substantially* more
> effective with only a slight increase in average allocation latency
> with this test case.
> 
> I need to add a marker to the output that records when the dd
> completes, but from monitoring the writeback rates via PCP, they
> were in the ballpark of 85-100MB/s for the existing code, and
> 95-110MB/s with my patch.  Hence it improved both IO throughput and
> the effectiveness of lumpy reclaim.
> 
> On the down side, I did have an OOM killer invocation with my patch
> after about 150 iterations - dd failed an order zero allocation
> because there were 455 huge pages allocated and there were only
> _320_ available pages for IO, all of which were under IO. i.e. lumpy
> reclaim worked so well that the machine got into order-0 page
> starvation.
> 
> I know this is a simple test case, but it shows much better results
> than I think anyone (even me) is expecting...

Ummm...

Probably I have to say I'm sorry - I guess my last mail gave you a
misunderstanding. To be honest, I'm not interested in this artificial
non-fragmented case. The above test case 1) discards all cache and 2) fills
pages with streaming IO. That creates an artificial "file offset neighbor ==
block neighbor == PFN neighbor" situation, so file-offset-order writeout by
the flusher thread can effectively produce PFN-contiguous pages.

Why am I not interested in it? Because lumpy reclaim is a technique for
avoiding the external fragmentation mess. IOW, it is for avoiding the worst
case, but your test case seems to measure the best one.




^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14  1:40         ` Dave Chinner
  2010-04-14  4:59           ` KAMEZAWA Hiroyuki
@ 2010-04-14  6:52           ` KOSAKI Motohiro
  2010-04-14  7:28             ` Dave Chinner
  1 sibling, 1 reply; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-14  6:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: kosaki.motohiro, Chris Mason, Mel Gorman, linux-kernel, linux-mm,
	linux-fsdevel

> On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > This problem is not a filesystem recursion problem which is, as I
> > > > understand it, what GFP_NOFS is used to prevent. It's _any_ kernel
> > > > code that uses signficant stack before trying to allocate memory
> > > > that is the problem. e.g a select() system call:
> > > > 
> > > >        Depth    Size   Location    (47 entries)
> > > >        -----    ----   --------
> > > >  0)     7568      16   mempool_alloc_slab+0x16/0x20
> > > >  1)     7552     144   mempool_alloc+0x65/0x140
> > > >  2)     7408      96   get_request+0x124/0x370
> > > >  3)     7312     144   get_request_wait+0x29/0x1b0
> > > >  4)     7168      96   __make_request+0x9b/0x490
> > > >  5)     7072     208   generic_make_request+0x3df/0x4d0
> > > >  6)     6864      80   submit_bio+0x7c/0x100
> > > >  7)     6784      96   _xfs_buf_ioapply+0x128/0x2c0 [xfs]
> > > > ....
> > > > 32)     3184      64   xfs_vm_writepage+0xab/0x160 [xfs]
> > > > 33)     3120     384   shrink_page_list+0x65e/0x840
> > > > 34)     2736     528   shrink_zone+0x63f/0xe10
> > > > 35)     2208     112   do_try_to_free_pages+0xc2/0x3c0
> > > > 36)     2096     128   try_to_free_pages+0x77/0x80
> > > > 37)     1968     240   __alloc_pages_nodemask+0x3e4/0x710
> > > > 38)     1728      48   alloc_pages_current+0x8c/0xe0
> > > > 39)     1680      16   __get_free_pages+0xe/0x50
> > > > 40)     1664      48   __pollwait+0xca/0x110
> > > > 41)     1616      32   unix_poll+0x28/0xc0
> > > > 42)     1584      16   sock_poll+0x1d/0x20
> > > > 43)     1568     912   do_select+0x3d6/0x700
> > > > 44)      656     416   core_sys_select+0x18c/0x2c0
> > > > 45)      240     112   sys_select+0x4f/0x110
> > > > 46)      128     128   system_call_fastpath+0x16/0x1b
> > > > 
> > > > There's 1.6k of stack used before memory allocation is called, 3.1k
> > > > used there before ->writepage is entered, XFS used 3.5k, and
> > > > if the mempool needed to allocate a page it would have blown the
> > > > stack. If there was any significant storage subsystem (add dm, md
> > > > and/or scsi of some kind), it would have blown the stack.
> > > > 
> > > > Basically, there is not enough stack space available to allow direct
> > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > profiles we are seeing here....
> > > > 
> > > 
> > > I'm not denying the evidence but how has it been gotten away with for years
> > > then? Prevention of writeback isn't the answer without figuring out how
> > > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > > doing sync IO, then waiting on those pages.
> > 
> > So, I've been reading along, nodding my head to Dave's side of things
> > because seeks are evil and direct reclaim makes seeks.  I'd really love
> > for direct reclaim to somehow trigger writepages on large chunks instead
> > of doing page by page spatters of IO to the drive.

I agree that "seeks are evil and direct reclaim makes seeks". Actually,
making 4k io is not must for pageout. So, probably we can improve it.


> Perhaps drop the lock on the page if it is held and call one of the
> helpers that filesystems use to do this, like:
> 
> 	filemap_write_and_wait(page->mapping);

Sorry, I'm lost as to what you're talking about. Why do we need per-file waiting?
If the file is a 1GB file, do we need to wait for 1GB of writeout?


> 
> > But, somewhere along the line I overlooked the part of Dave's stack trace
> > that said:
> > 
> > 43)     1568     912   do_select+0x3d6/0x700
> > 
> > Huh, 912 bytes...for select, really?  From poll.h:
> 
> Sure, it's bad, but focussing on the specific case misses the
> point that even code that is using minimal stack can enter direct
> reclaim after consuming 1.5k of stack. e.g.:

checkstack.pl says do_select() and __generic_file_splice_read() are among
the worst stack consumers. Both should be fixed.

Also, checkstack.pl says there aren't that many such stack eaters.


> 
>  50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
>  51)     3104     384   shrink_page_list+0x65e/0x840
>  52)     2720     528   shrink_zone+0x63f/0xe10
>  53)     2192     112   do_try_to_free_pages+0xc2/0x3c0
>  54)     2080     128   try_to_free_pages+0x77/0x80
>  55)     1952     240   __alloc_pages_nodemask+0x3e4/0x710
>  56)     1712      48   alloc_pages_current+0x8c/0xe0
>  57)     1664      32   __page_cache_alloc+0x67/0x70
>  58)     1632     144   __do_page_cache_readahead+0xd3/0x220
>  59)     1488      16   ra_submit+0x21/0x30
>  60)     1472      80   ondemand_readahead+0x11d/0x250
>  61)     1392      64   page_cache_async_readahead+0xa9/0xe0
>  62)     1328     592   __generic_file_splice_read+0x48a/0x530
>  63)      736      48   generic_file_splice_read+0x4f/0x90
>  64)      688      96   xfs_splice_read+0xf2/0x130 [xfs]
>  65)      592      32   xfs_file_splice_read+0x4b/0x50 [xfs]
>  66)      560      64   do_splice_to+0x77/0xb0
>  67)      496     112   splice_direct_to_actor+0xcc/0x1c0
>  68)      384      80   do_splice_direct+0x57/0x80
>  69)      304      96   do_sendfile+0x16c/0x1e0
>  70)      208      80   sys_sendfile64+0x8d/0xb0
>  71)      128     128   system_call_fastpath+0x16/0x1b
> 
> Yes, __generic_file_splice_read() is a hog, but they seem to be
> _everywhere_ today...
> 
> > So, select is intentionally trying to use that much stack.  It should be using
> > GFP_NOFS if it really wants to suck down that much stack...
> 
> The code that did the allocation is called from multiple different
> contexts - how is it supposed to know that in some of those contexts
> it is supposed to treat memory allocation differently?
> 
> This is my point - if you introduce a new semantic to memory allocation
> that is "use GFP_NOFS when you are using too much stack" and too much
> stack is more than 15% of the stack, then pretty much every code path
> will need to set that flag...

Nodding my head to Dave's side. Changing caller arguments seems like a poor
solution. I mean:
 - do_select() should use GFP_KERNEL allocations instead of the stack (i.e.
   revert 70674f95c0)
 - reclaim and xfs (and some other things) need to go on a diet.

Also, I believe functions that eat stack should generate a warning. Patch
attached.


> > if only the
> > kernel had some sort of way to dynamically allocate ram, it could try
> > that too.
> 
> Sure, but to play the devil's advocate: if memory allocation blows
> the stack, then surely avoiding allocation by using stack variables
> is safer? ;)
> 
> FWIW, even if we use GFP_NOFS, allocation+reclaim can still use 2k
> of stack; stuff like the radix tree code appears to be a significant
> user of stack now:
> 
>         Depth    Size   Location    (56 entries)
>         -----    ----   --------
>   0)     7904      48   __call_rcu+0x67/0x190
>   1)     7856      16   call_rcu_sched+0x15/0x20
>   2)     7840      16   call_rcu+0xe/0x10
>   3)     7824     272   radix_tree_delete+0x159/0x2e0
>   4)     7552      32   __remove_from_page_cache+0x21/0x110
>   5)     7520      64   __remove_mapping+0xe8/0x130
>   6)     7456     384   shrink_page_list+0x400/0x860
>   7)     7072     528   shrink_zone+0x636/0xdc0
>   8)     6544     112   do_try_to_free_pages+0xc2/0x3c0
>   9)     6432     112   try_to_free_pages+0x64/0x70
>  10)     6320     256   __alloc_pages_nodemask+0x3d2/0x710
>  11)     6064      48   alloc_pages_current+0x8c/0xe0
>  12)     6016      32   __page_cache_alloc+0x67/0x70
>  13)     5984      80   find_or_create_page+0x50/0xb0
>  14)     5904     160   _xfs_buf_lookup_pages+0x145/0x350 [xfs]
> 
> or even just calling ->releasepage and freeing bufferheads:
> 
>        Depth    Size   Location    (55 entries)
>        -----    ----   --------
>  0)     7440      48   add_partial+0x26/0x90
>  1)     7392      64   __slab_free+0x1a9/0x380
>  2)     7328      64   kmem_cache_free+0xb9/0x160
>  3)     7264      16   free_buffer_head+0x25/0x50
>  4)     7248      64   try_to_free_buffers+0x79/0xc0
>  5)     7184     160   xfs_vm_releasepage+0xda/0x130 [xfs]
>  6)     7024      16   try_to_release_page+0x33/0x60
>  7)     7008     384   shrink_page_list+0x585/0x860
>  8)     6624     528   shrink_zone+0x636/0xdc0
>  9)     6096     112   do_try_to_free_pages+0xc2/0x3c0
> 10)     5984     112   try_to_free_pages+0x64/0x70
> 11)     5872     256   __alloc_pages_nodemask+0x3d2/0x710
> 12)     5616      48   alloc_pages_current+0x8c/0xe0
> 13)     5568      32   __page_cache_alloc+0x67/0x70
> 14)     5536      80   find_or_create_page+0x50/0xb0
> 15)     5456     160   _xfs_buf_lookup_pages+0x145/0x350 [xfs]
> 
> And another eye-opening example, this time deep in the sata driver
> layer:
> 
>         Depth    Size   Location    (72 entries)
>         -----    ----   --------
>   0)     8336     304   select_task_rq_fair+0x235/0xad0
>   1)     8032      96   try_to_wake_up+0x189/0x3f0
>   2)     7936      16   default_wake_function+0x12/0x20
>   3)     7920      32   autoremove_wake_function+0x16/0x40
>   4)     7888      64   __wake_up_common+0x5a/0x90
>   5)     7824      64   __wake_up+0x48/0x70
>   6)     7760      64   insert_work+0x9f/0xb0
>   7)     7696      48   __queue_work+0x36/0x50
>   8)     7648      16   queue_work_on+0x4d/0x60
>   9)     7632      16   queue_work+0x1f/0x30
>  10)     7616      16   queue_delayed_work+0x2d/0x40
>  11)     7600      32   ata_pio_queue_task+0x35/0x40
>  12)     7568      48   ata_sff_qc_issue+0x146/0x2f0
>  13)     7520      96   mv_qc_issue+0x12d/0x540 [sata_mv]
>  14)     7424      96   ata_qc_issue+0x1fe/0x320
>  15)     7328      64   ata_scsi_translate+0xae/0x1a0
>  16)     7264      64   ata_scsi_queuecmd+0xbf/0x2f0
>  17)     7200      48   scsi_dispatch_cmd+0x114/0x2b0
>  18)     7152      96   scsi_request_fn+0x419/0x590
>  19)     7056      32   __blk_run_queue+0x82/0x150
>  20)     7024      48   elv_insert+0x1aa/0x2d0
>  21)     6976      48   __elv_add_request+0x83/0xd0
>  22)     6928      96   __make_request+0x139/0x490
>  23)     6832     208   generic_make_request+0x3df/0x4d0
>  24)     6624      80   submit_bio+0x7c/0x100
>  25)     6544      96   _xfs_buf_ioapply+0x128/0x2c0 [xfs]
> 
> We need at least _700_ bytes of stack free just to call queue_work(),
> and that now happens deep in the guts of the driver subsystem below XFS.
> This trace shows 1.8k of stack usage on a simple, single sata disk
> storage subsystem, so my estimate of 2k of stack for the storage system
> below XFS is too small - a worst case of 2.5-3k of stack space is probably
> closer to the mark.

Your explanation is very interesting. I have a (probably dumb) question:
why has nobody faced stack overflow issues in the past? Now I think every
user could easily hit a stack overflow if your explanation is correct.


> 
> This is the sort of thing I'm pointing at when I say that stack
> usage outside XFS has grown significantly over the
> past couple of years. Given XFS has remained pretty much the same or
> even reduced slightly over the same time period, blaming XFS or
> saying "callers should use GFP_NOFS" seems like a cop-out to me.
> Regardless of the IO pattern performance issues, writeback via
> direct reclaim just uses too much stack to be safe these days...

Yeah, my answer is simple: all stack eaters should be fixed.
But XFS doesn't seem innocent either. 3.5K is quite big, although
XFS has used that amount since long ago.


===========================================================
Subject: [PATCH] kconfig: reduce FRAME_WARN default value to 512

Surprisingly, several functions now use a lot of stack.

% objdump -d vmlinux | ./scripts/checkstack.pl

0xffffffff81e3db07 get_next_block [vmlinux]:            1976
0xffffffff8130b9bd node_read_meminfo [vmlinux]:         1240
0xffffffff811553fd do_sys_poll [vmlinux]:               1000
0xffffffff8122b49d test_aead [vmlinux]:                 904
0xffffffff81154c9d do_select [vmlinux]:                 888
0xffffffff81168d9d default_file_splice_read [vmlinux]:  760

Oh well, every developer has to pay attention to stack usage!
Thus, this patch reduces the FRAME_WARN default value to 512.
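
With FRAME_WARN=512, gcc (>= 4.4, via -Wframe-larger-than=) flags such
functions at build time with a warning along these lines (exact file, line
number and wording will vary by config and compiler; the 888 bytes is the
do_select figure from the checkstack output above):

  fs/select.c: In function 'do_select':
  fs/select.c:NNN: warning: the frame size of 888 bytes is larger than 512 bytes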

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 lib/Kconfig.debug |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ff01710..44ebba6 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -28,8 +28,7 @@ config ENABLE_MUST_CHECK
 config FRAME_WARN
 	int "Warn for stack frames larger than (needs gcc 4.4)"
 	range 0 8192
-	default 1024 if !64BIT
-	default 2048 if 64BIT
+	default 512
 	help
 	  Tell gcc to warn at build time for stack frames larger than this.
 	  Setting this too low will cause a lot of warnings.
-- 
1.6.5.2





^ permalink raw reply related	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14  5:54               ` KOSAKI Motohiro
  2010-04-14  6:13                 ` Minchan Kim
@ 2010-04-14  7:06                 ` Dave Chinner
  1 sibling, 0 replies; 115+ messages in thread
From: Dave Chinner @ 2010-04-14  7:06 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Chris Mason, Mel Gorman, linux-kernel,
	linux-mm, linux-fsdevel

On Wed, Apr 14, 2010 at 02:54:14PM +0900, KOSAKI Motohiro wrote:
> > On Wed, Apr 14, 2010 at 01:59:45PM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Wed, 14 Apr 2010 11:40:41 +1000
> > > Dave Chinner <david@fromorbit.com> wrote:
> > > 
> > > >  50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
> > > >  51)     3104     384   shrink_page_list+0x65e/0x840
> > > >  52)     2720     528   shrink_zone+0x63f/0xe10
> > > 
> > > A bit OFF TOPIC.
> > > 
> > > Could you share the disassembly of shrink_zone()?
> > > 
> > > In my environ.
> > > 00000000000115a0 <shrink_zone>:
> > >    115a0:       55                      push   %rbp
> > >    115a1:       48 89 e5                mov    %rsp,%rbp
> > >    115a4:       41 57                   push   %r15
> > >    115a6:       41 56                   push   %r14
> > >    115a8:       41 55                   push   %r13
> > >    115aa:       41 54                   push   %r12
> > >    115ac:       53                      push   %rbx
> > >    115ad:       48 83 ec 78             sub    $0x78,%rsp
> > >    115b1:       e8 00 00 00 00          callq  115b6 <shrink_zone+0x16>
> > >    115b6:       48 89 75 80             mov    %rsi,-0x80(%rbp)
> > > 
> > > The disassembly seems to show 0x78 bytes for the stack, and no changes to %rsp
> > > until return.
> > 
> > I see the same. I didn't compile those kernels, though. IIUC,
> > they were built through the Ubuntu build infrastructure, so there is
> > something different in terms of compiler, compiler options or config
> > to what we are both using. Most likely it is the compiler inlining,
> > though Chris's patches to prevent that didn't seem to change the
> > stack usage.
> > 
> > I'm trying to get a stack trace from the kernel that has shrink_zone
> > in it, but I haven't succeeded yet....
> 
> I also got 0x78 bytes of stack usage. Umm.. are we discussing the real issue now?

Ok, so here's a trace at the top of the stack from a kernel with
the above shrink_zone disassembly:
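
(This needs CONFIG_STACK_TRACER=y and the stack tracer enabled first, e.g.:

$ echo 1 > /proc/sys/kernel/stack_tracer_enabled

after which stack_trace reports the deepest stack seen since enabling.)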

$ cat /sys/kernel/debug/tracing/stack_trace
        Depth    Size   Location    (49 entries)
        -----    ----   --------
  0)     6152     112   force_qs_rnp+0x58/0x150
  1)     6040      48   force_quiescent_state+0x1a7/0x1f0
  2)     5992      48   __call_rcu+0x13d/0x190
  3)     5944      16   call_rcu_sched+0x15/0x20
  4)     5928      16   call_rcu+0xe/0x10
  5)     5912     240   radix_tree_delete+0x14a/0x2d0
  6)     5672      32   __remove_from_page_cache+0x21/0x110
  7)     5640      64   __remove_mapping+0x86/0x100
  8)     5576     272   shrink_page_list+0x2fd/0x5a0
  9)     5304     400   shrink_inactive_list+0x313/0x730
 10)     4904     176   shrink_zone+0x3d1/0x490
 11)     4728     128   do_try_to_free_pages+0x2b6/0x380
 12)     4600     112   try_to_free_pages+0x5e/0x60
 13)     4488     272   __alloc_pages_nodemask+0x3fb/0x730
 14)     4216      48   alloc_pages_current+0x87/0xd0
 15)     4168      32   __page_cache_alloc+0x67/0x70
 16)     4136      80   find_or_create_page+0x4f/0xb0
 17)     4056     160   _xfs_buf_lookup_pages+0x150/0x390
.....

So the differences are most likely from the compiler doing
automatic inlining of static functions...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14  6:13                 ` Minchan Kim
@ 2010-04-14  7:19                   ` Minchan Kim
  2010-04-14  9:42                     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 115+ messages in thread
From: Minchan Kim @ 2010-04-14  7:19 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Dave Chinner, KAMEZAWA Hiroyuki, Chris Mason, Mel Gorman,
	linux-kernel, linux-mm, linux-fsdevel

On Wed, Apr 14, 2010 at 3:13 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
> On Wed, Apr 14, 2010 at 2:54 PM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
>>> On Wed, Apr 14, 2010 at 01:59:45PM +0900, KAMEZAWA Hiroyuki wrote:
>>> > On Wed, 14 Apr 2010 11:40:41 +1000
>>> > Dave Chinner <david@fromorbit.com> wrote:
>>> >
>>> > >  50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
>>> > >  51)     3104     384   shrink_page_list+0x65e/0x840
>>> > >  52)     2720     528   shrink_zone+0x63f/0xe10
>>> >
>>> > A bit OFF TOPIC.
>>> >
>>> > Could you share the disassembly of shrink_zone()?
>>> >
>>> > In my environ.
>>> > 00000000000115a0 <shrink_zone>:
>>> >    115a0:       55                      push   %rbp
>>> >    115a1:       48 89 e5                mov    %rsp,%rbp
>>> >    115a4:       41 57                   push   %r15
>>> >    115a6:       41 56                   push   %r14
>>> >    115a8:       41 55                   push   %r13
>>> >    115aa:       41 54                   push   %r12
>>> >    115ac:       53                      push   %rbx
>>> >    115ad:       48 83 ec 78             sub    $0x78,%rsp
>>> >    115b1:       e8 00 00 00 00          callq  115b6 <shrink_zone+0x16>
>>> >    115b6:       48 89 75 80             mov    %rsi,-0x80(%rbp)
>>> >
>>> > The disassembly seems to show 0x78 bytes for the stack, and no changes to %rsp
>>> > until return.
>>>
>>> I see the same. I didn't compile those kernels, though. IIUC,
>>> they were built through the Ubuntu build infrastructure, so there is
>>> something different in terms of compiler, compiler options or config
>>> to what we are both using. Most likely it is the compiler inlining,
>>> though Chris's patches to prevent that didn't seem to change the
>>> stack usage.
>>>
>>> I'm trying to get a stack trace from the kernel that has shrink_zone
>>> in it, but I haven't succeeded yet....
>>
>> I also got 0x78 bytes of stack usage. Umm.. are we discussing the real issue now?
>>
>
> In my case, it's 0x110 bytes on a 32-bit machine.
> So I think it's possible on a 64-bit machine too.
>
> 00001830 <shrink_zone>:
>    1830:       55                      push   %ebp
>    1831:       89 e5                   mov    %esp,%ebp
>    1833:       57                      push   %edi
>    1834:       56                      push   %esi
>    1835:       53                      push   %ebx
>    1836:       81 ec 10 01 00 00       sub    $0x110,%esp
>    183c:       89 85 24 ff ff ff       mov    %eax,-0xdc(%ebp)
>    1842:       89 95 20 ff ff ff       mov    %edx,-0xe0(%ebp)
>    1848:       89 8d 1c ff ff ff       mov    %ecx,-0xe4(%ebp)
>    184e:       8b 41 04                mov    0x4(%ecx)
>
> my gcc is following as.
>
> barrios@barriostarget:~/mmotm$ gcc -v
> Using built-in specs.
> Target: i486-linux-gnu
> Configured with: ../src/configure -v --with-pkgversion='Ubuntu
> 4.3.3-5ubuntu4'
> --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
> --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
> --enable-shared --with-system-zlib --libexecdir=/usr/lib
> --without-included-gettext --enable-threads=posix --enable-nls
> --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
> --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
> --enable-mpfr --enable-targets=all --with-tune=generic
> --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu
> --target=i486-linux-gnu
> Thread model: posix
> gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4)
>
>
> Is it depends on config?
> I attach my config.

I changed shrink_list() to be noinline_for_stack.
The result is as follows.


00001fe0 <shrink_zone>:
    1fe0:       55                      push   %ebp
    1fe1:       89 e5                   mov    %esp,%ebp
    1fe3:       57                      push   %edi
    1fe4:       56                      push   %esi
    1fe5:       53                      push   %ebx
    1fe6:       83 ec 4c                sub    $0x4c,%esp
    1fe9:       89 45 c0                mov    %eax,-0x40(%ebp)
    1fec:       89 55 bc                mov    %edx,-0x44(%ebp)
    1fef:       89 4d b8                mov    %ecx,-0x48(%ebp)

0x110 -> 0x4c.

Should we add noinline_for_stack to shrink_list()?
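
i.e. something along these lines (a sketch - the exact shrink_list()
signature depends on the tree):

-static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
-	struct zone *zone, struct scan_control *sc, int priority)
+static noinline_for_stack unsigned long shrink_list(enum lru_list lru,
+	unsigned long nr_to_scan, struct zone *zone,
+	struct scan_control *sc, int priority)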


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14  6:52           ` KOSAKI Motohiro
@ 2010-04-14  7:28             ` Dave Chinner
  2010-04-14  8:51               ` Mel Gorman
  0 siblings, 1 reply; 115+ messages in thread
From: Dave Chinner @ 2010-04-14  7:28 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Chris Mason, Mel Gorman, linux-kernel, linux-mm, linux-fsdevel

On Wed, Apr 14, 2010 at 03:52:44PM +0900, KOSAKI Motohiro wrote:
> > On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > > Basically, there is not enough stack space available to allow direct
> > > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > > profiles we are seeing here....
> > > > > 
> > > > 
> > > > I'm not denying the evidence but how has it been gotten away with for years
> > > > then? Prevention of writeback isn't the answer without figuring out how
> > > > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > > > doing sync IO, then waiting on those pages.
> > > 
> > > So, I've been reading along, nodding my head to Dave's side of things
> > > because seeks are evil and direct reclaim makes seeks.  I'd really love
> > > for direct reclaim to somehow trigger writepages on large chunks instead
> > > of doing page by page spatters of IO to the drive.
> 
> I agree that "seeks are evil and direct reclaim makes seeks". Actually,
> making 4k io is not must for pageout. So, probably we can improve it.
> 
> 
> > Perhaps drop the lock on the page if it is held and call one of the
> > helpers that filesystems use to do this, like:
> > 
> > 	filemap_write_and_wait(page->mapping);
> 
> Sorry, I'm lost as to what you're talking about. Why do we need per-file
> waiting? If the file is a 1GB file, do we need to wait for 1GB of writeout?

So use filemap_fdatawrite(page->mapping), or if it's better only
to start IO on a segment of the file, use
filemap_fdatawrite_range(page->mapping, start, end)....
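
e.g. from somewhere like pageout(), instead of issuing IO on the single
page, kick writeback for a chunk of the file around it. An untested
sketch (the 1MB window is an arbitrary illustration, not a tuned value):

	/* write back an aligned 1MB window around the target page */
	loff_t chunk = 1024 * 1024;
	loff_t start = ((loff_t)page->index << PAGE_CACHE_SHIFT)
			& ~(chunk - 1);

	filemap_fdatawrite_range(page->mapping, start, start + chunk - 1);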

> > > But, somewhere along the line I overlooked the part of Dave's stack trace
> > > that said:
> > > 
> > > 43)     1568     912   do_select+0x3d6/0x700
> > > 
> > > Huh, 912 bytes...for select, really?  From poll.h:
> > 
> > Sure, it's bad, but focussing on the specific case misses the
> > point that even code that is using minimal stack can enter direct
> > reclaim after consuming 1.5k of stack. e.g.:
> 
> checkstack.pl says do_select() and __generic_file_splice_read() are among
> the worst stack consumers. Both should be fixed.

the deepest call chain in queue_work() needs 700 bytes of stack
to complete, wait_for_completion() requires almost 2k of stack space
at its deepest, the scheduler has some heavy stack users, etc.,
and these are all functions that appear at the top of the stack.

> Also, checkstack.pl says there aren't that many such stack eaters.

Yeah, but when we have a callchain 70 or more functions deep,
even 100 bytes of stack is a lot....

> > > So, select is intentionally trying to use that much stack.  It should be using
> > > GFP_NOFS if it really wants to suck down that much stack...
> > 
> > The code that did the allocation is called from multiple different
> > contexts - how is it supposed to know that in some of those contexts
> > it is supposed to treat memory allocation differently?
> > 
> > This is my point - if you introduce a new semantic to memory allocation
> > that is "use GFP_NOFS when you are using too much stack" and too much
> > stack is more than 15% of the stack, then pretty much every code path
> > will need to set that flag...
> 
> Nodding my head to Dave's side. Changing caller arguments seems like a poor
> solution. I mean:
>  - do_select() should use GFP_KERNEL allocations instead of the stack (i.e.
>    revert 70674f95c0)
>  - reclaim and xfs (and some other things) need to go on a diet.

The list I'm seeing so far includes:
	- scheduler
	- completion interfaces
	- radix tree
	- memory allocation, memory reclaim
	- anything that implements ->writepage
	- select
	- splice read

> Also, I believe functions that eat stack should generate a warning. Patch attached.

Good start, but 512 bytes will only catch select and splice read,
and there are 300-400 byte functions in the above list that sit near
the top of the stack....

> > We need at least _700_ bytes of stack free just to call queue_work(),
> > and that now happens deep in the guts of the driver subsystem below XFS.
> > This trace shows 1.8k of stack usage on a simple, single sata disk
> > storage subsystem, so my estimate of 2k of stack for the storage system
> > below XFS is too small - a worst case of 2.5-3k of stack space is probably
> > closer to the mark.
> 
> Your explanation is very interesting. I have a (probably dumb) question:
> why has nobody faced stack overflow issues in the past? Now I think every
> user could easily hit a stack overflow if your explanation is correct.

It's always a problem, but the focus on minimising stack usage has
gone away since i386 has mostly disappeared from server rooms.

XFS has always been the thing that triggered stack usage problems
first - the first reports of problems on x86_64 with 8k stacks in low
memory situations have only just come in, and this is the first time
in a couple of years I've paid close attention to stack usage
outside XFS. What I'm seeing is not pretty....

> > This is the sort of thing I'm pointing at when I say that stack
> > usage outside XFS has grown significantly over the
> > past couple of years. Given XFS has remained pretty much the same or
> > even reduced slightly over the same time period, blaming XFS or
> > saying "callers should use GFP_NOFS" seems like a cop-out to me.
> > Regardless of the IO pattern performance issues, writeback via
> > direct reclaim just uses too much stack to be safe these days...
> 
> Yeah, my answer is simple: all stack eaters should be fixed.
> But XFS doesn't seem innocent either. 3.5K is quite big, although
> XFS has used that amount since long ago.

XFS used to use much more than that - significant effort has been
put into reducing the stack footprint over many years. There's not
much left to trim without rewriting half the filesystem...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14  6:52         ` KOSAKI Motohiro
@ 2010-04-14  7:36           ` Dave Chinner
  0 siblings, 0 replies; 115+ messages in thread
From: Dave Chinner @ 2010-04-14  7:36 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: linux-kernel, linux-mm, linux-fsdevel, Chris Mason

On Wed, Apr 14, 2010 at 03:52:10PM +0900, KOSAKI Motohiro wrote:
> > On Tue, Apr 13, 2010 at 08:39:29PM +0900, KOSAKI Motohiro wrote:
> > > Thanks for the explanation. I hadn't noticed direct reclaim consumes
> > > 2k of stack. I'll investigate it and try to put it on a diet.
> > > But XFS's 3.5K stack consumption is too large too. Please diet too.
> > 
> > It hasn't grown in the last 2 years after the last major diet where
> > all the fat was trimmed from it in the last round of the i386 4k
> > stack vs XFS saga. it seems that everything else around XFS has
> > grown in that time, and now we are blowing stacks again....
> 
> I have a dumb question: if XFS hasn't bloated its stack usage, why did
> 3.5K of stack usage work fine on 4k-stack kernels? It seems impossible.

Because on a 32 bit kernel it's somewhere between 2-2.5k of stack
space. That being said, XFS _will_ blow a 4k stack on anything other
than the most basic storage configurations, and if you run out of
memory it is almost guaranteed to do so.

> Please don't think I blame you. I don't know what the "4k stack vs XFS saga" is.
> I merely want to understand what you said.

Over a period of years there were repeated attempts to make the
default stack size on i386 4k, despite it being known to cause
problems on relatively common configurations. Every time it was
brought up it was rejected, but every few months somebody else made
an attempt to make it the default. There was a lot of flamage
directed at XFS because it was seen as the reason that 4k stacks
were not made the default....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14  4:44   ` Dave Chinner
@ 2010-04-14  7:54     ` Minchan Kim
  0 siblings, 0 replies; 115+ messages in thread
From: Minchan Kim @ 2010-04-14  7:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-mm, linux-fsdevel

On Wed, Apr 14, 2010 at 1:44 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Apr 14, 2010 at 09:24:33AM +0900, Minchan Kim wrote:
>> Hi, Dave.
>>
>> On Tue, Apr 13, 2010 at 9:17 AM, Dave Chinner <david@fromorbit.com> wrote:
>> > From: Dave Chinner <dchinner@redhat.com>
>> >
>> > When we enter direct reclaim we may have used an arbitrary amount of stack
>> > space, and hence enterring the filesystem to do writeback can then lead to
>> > stack overruns. This problem was recently encountered x86_64 systems with
>> > 8k stacks running XFS with simple storage configurations.
>> >
>> > Writeback from direct reclaim also adversely affects background writeback. The
>> > background flusher threads should already be taking care of cleaning dirty
>> > pages, and direct reclaim will kick them if they aren't already doing work. If
>> > direct reclaim is also calling ->writepage, it will cause the IO patterns from
>> > the background flusher threads to be upset by LRU-order writeback from
>> > pageout() which can be effectively random IO. Having competing sources of IO
>> > trying to clean pages on the same backing device reduces throughput by
>> > increasing the amount of seeks that the backing device has to do to write back
>> > the pages.
>> >
>> > Hence for direct reclaim we should not allow ->writepages to be entered at all.
>> > Set up the relevant scan_control structures to enforce this, and prevent
>> > sc->may_writepage from being set in other places in the direct reclaim path in
>> > response to other events.
>>
>> I think your solution is a rather aggressive change, as Mel and Kosaki
>> already pointed out.
>
> It may be aggressive, but writeback from direct reclaim is, IMO, one
> of the worst aspects of the current VM design because of its
> adverse effect on the IO subsystem.

I tend to agree. But do we need it as a last resort if the flusher thread
can't catch up with the write stream?
Or, in my opinion, could the I/O layer have better throttling logic than it
does now?

>
> I'd prefer to remove it completely than continue to try and patch
> around it, especially given that everyone seems to agree that it
> does have an adverse effect on IO...

Of course, if everybody agrees, we can do it.
For that, we need many benchmark results, which is very hard.
Maybe I will help with it on embedded systems.

>
>> Is the flusher thread aware of the LRU order of dirty pages in terms of
>> system-level recency, not per-file dirty page recency?
>
> It writes back in the order inodes were dirtied. i.e. the LRU is a
> coarser measure, but it is still definitely there. It also takes
> into account fairness of IO between dirty inodes, so no one dirty
> inode prevents IO being issued on the other dirty inodes on the
> LRU...

Thanks.
It seems recency is lost, then.
I am not sure how much that affects system performance.

>
>> Of course the flusher thread can clean dirty pages faster than a direct reclaimer.
>> But if it isn't aware of LRU-ness, hot page thrashing can happen in
>> corner cases.
>> It could lose write merging.
>>
>> And non-rotating storage might not have a big seek cost.
>
> Non-rotational storage still goes faster when it is fed large, well
> formed IOs.

Agreed, I missed that. NAND devices are stronger than HDDs at random reads,
but random writes are very weak in both performance and wear-levelling.

>
>> I think we have to consider that case if we decide to change direct reclaim I/O.
>>
>> How do we separate the problem?
>>
>> 1. stack hogging problem.
>> 2. direct reclaim random write.
>
> AFAICT, the only way to _reliably_ avoid the stack usage problem is
> to avoid writeback in direct reclaim. That has the side effect of
> fixing #2 as well, so do they really need separating?

If we can do it, that's good,
but problem 2 is not easy to fix, I think.
Compared to 2, 1 is rather easy.
So I thought we could solve 1 first and then focus on 2.
If your suggestion is right, then we can apply your idea.
Then we don't need to revert the patch for 1, since small stack usage is
always good if we don't lose much performance.

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14  7:28             ` Dave Chinner
@ 2010-04-14  8:51               ` Mel Gorman
  2010-04-15  1:34                 ` Dave Chinner
  2010-04-15  2:37                 ` Johannes Weiner
  0 siblings, 2 replies; 115+ messages in thread
From: Mel Gorman @ 2010-04-14  8:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: KOSAKI Motohiro, Chris Mason, linux-kernel, linux-mm, linux-fsdevel

On Wed, Apr 14, 2010 at 05:28:30PM +1000, Dave Chinner wrote:
> On Wed, Apr 14, 2010 at 03:52:44PM +0900, KOSAKI Motohiro wrote:
> > > On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > > > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > > > Basically, there is not enough stack space available to allow direct
> > > > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > > > profiles we are seeing here....
> > > > > > 
> > > > > 
> > > > > I'm not denying the evidence but how has it been gotten away with for years
> > > > > then? Prevention of writeback isn't the answer without figuring out how
> > > > > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > > > > doing sync IO, then waiting on those pages.
> > > > 
> > > > So, I've been reading along, nodding my head to Dave's side of things
> > > > because seeks are evil and direct reclaim makes seeks.  I'd really love
> > > > for direct reclaim to somehow trigger writepages on large chunks instead
> > > > of doing page by page spatters of IO to the drive.
> > 
> > I agree that "seeks are evil and direct reclaim makes seeks". Actually,
> > making 4k IO is not a must for pageout, so we can probably improve it.
> > 
> > 
> > > Perhaps drop the lock on the page if it is held and call one of the
> > > helpers that filesystems use to do this, like:
> > > 
> > > 	filemap_write_and_wait(page->mapping);
> > 
> > Sorry, I'm lost as to what you're talking about. Why do we need per-file
> > waiting? If the file is a 1GB file, do we need to wait for 1GB of writeout?
> 
> So use filemap_fdatawrite(page->mapping), or if it's better only
> to start IO on a segment of the file, use
> filemap_fdatawrite_range(page->mapping, start, end)....
> 

That does not help the stack usage issue; the caller ends up in
->writepages. From an IO perspective it'll be better from a seek point of
view, but from a VM perspective it may or may not be cleaning the right pages.
So I think this is a red herring.

> > > > But, somewhere along the line I overlooked the part of Dave's stack trace
> > > > that said:
> > > > 
> > > > 43)     1568     912   do_select+0x3d6/0x700
> > > > 
> > > > Huh, 912 bytes...for select, really?  From poll.h:
> > > 
> > > Sure, it's bad, but focussing on the specific case misses the
> > > point that even code that is using minimal stack can enter direct
> > > reclaim after consuming 1.5k of stack. e.g.:
> > 
> > checkstack.pl says do_select() and __generic_file_splice_read() are among
> > the worst stack consumers. Both should be fixed.
> 
> the deepest call chain in queue_work() needs 700 bytes of stack
> to complete, wait_for_completion() requires almost 2k of stack space
> at its deepest, the scheduler has some heavy stack users, etc.,
> and these are all functions that appear at the top of the stack.
> 

The real issue here then is that stack usage has gone out of control.
Disabling ->writepage in direct reclaim does not guarantee that stack
usage will not be a problem again. From your traces, page reclaim itself
seems to be a big dirty hog.

Differences in what people see on their machines may be down to architecture
or compiler, but most likely inlining. Changing inlining will not fix the
problem, it'll just move the stack usage around.

> > Also, checkstack.pl says there aren't that many such stack eaters.
> 
> Yeah, but when we have a callchain 70 or more functions deep,
> even 100 bytes of stack is a lot....
> 
> > > > So, select is intentionally trying to use that much stack.  It should be using
> > > > GFP_NOFS if it really wants to suck down that much stack...
> > > 
> > > The code that did the allocation is called from multiple different
> > > contexts - how is it supposed to know that in some of those contexts
> > > it is supposed to treat memory allocation differently?
> > > 
> > > This is my point - if you introduce a new semantic to memory allocation
> > > that is "use GFP_NOFS when you are using too much stack" and too much
> > > stack is more than 15% of the stack, then pretty much every code path
> > > will need to set that flag...
> > 
> > Nodding my head to Dave's side. Changing caller arguments seems like a poor
> > solution. I mean:
> >  - do_select() should use GFP_KERNEL allocations instead of the stack (i.e.
> >    revert 70674f95c0)
> >  - reclaim and xfs (and some other things) need to go on a diet.
> 
> The list I'm seeing so far includes:
> 	- scheduler
> 	- completion interfaces
> 	- radix tree
> 	- memory allocation, memory reclaim
> 	- anything that implements ->writepage
> 	- select
> 	- splice read
> 
> > Also, I believe functions that eat stack should generate a warning. Patch attached.
> 
> Good start, but 512 bytes will only catch select and splice read,
> and there are 300-400 byte functions in the above list that sit near
> the top of the stack....
> 

They will need to be tackled in turn then but obviously there should be
a focus on the common paths. The reclaim paths do seem particularly
heavy and it's down to a lot of temporary variables. I might not get the
time today, but what I'm going to try to do some time this week is:

o Look at what temporary variables are copies of other pieces of information
o See what variables live for the duration of reclaim but are not needed
  for all of it (i.e. uninline parts of it so variables do not persist)
o See if it's possible to dynamically allocate scan_control

The last one is the trickiest. Basically, the idea would be to move as much
into scan_control as possible. Then, instead of allocating it on the stack,
allocate a fixed number of them at boot-time (NR_CPU probably) protected by
a semaphore. Limit the number of direct reclaimers that can be active at a
time to the number of scan_control variables. kswapd could still allocate
its own on the stack or with kmalloc.
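
Roughly, the allocation side might look something like this (an untested
sketch with invented names, just to illustrate the throttling idea - none
of these identifiers exist in mainline):

/* fixed pool of scan_control structures for direct reclaimers */
static struct scan_control sc_pool[NR_CPUS];
static DECLARE_BITMAP(sc_pool_used, NR_CPUS);
static DEFINE_SPINLOCK(sc_pool_lock);
static struct semaphore sc_pool_sem;	/* sema_init(&sc_pool_sem, NR_CPUS) at boot */

static struct scan_control *get_scan_control(void)
{
	int slot;

	/* the counting semaphore throttles direct reclaimers and
	 * guarantees a free slot once we get past it */
	down(&sc_pool_sem);
	spin_lock(&sc_pool_lock);
	slot = find_first_zero_bit(sc_pool_used, NR_CPUS);
	__set_bit(slot, sc_pool_used);
	spin_unlock(&sc_pool_lock);
	memset(&sc_pool[slot], 0, sizeof(struct scan_control));
	return &sc_pool[slot];
}

static void put_scan_control(struct scan_control *sc)
{
	spin_lock(&sc_pool_lock);
	__clear_bit(sc - sc_pool, sc_pool_used);
	spin_unlock(&sc_pool_lock);
	up(&sc_pool_sem);
}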

If it works out, it would have two main benefits. Limits the number of
processes in direct reclaim - if there is NR_CPU-worth of processes in direct
reclaim, there is too much going on. It would also shrink the stack usage
particularly if some of the stack variables are moved into scan_control.

Maybe someone will beat me to looking at the feasibility of this.

> > > We need at least _700_ bytes of stack free just to call queue_work(),
> > > and that now happens deep in the guts of the driver subsystem below XFS.
> > > This trace shows 1.8k of stack usage on a simple, single sata disk
> > > storage subsystem, so my estimate of 2k of stack for the storage system
> > > below XFS is too small - a worst case of 2.5-3k of stack space is probably
> > > closer to the mark.
> > 
> > Your explanation is very interesting. I have a (probably dumb) question:
> > why has nobody faced stack overflow issues in the past? Now I think every
> > user could easily hit a stack overflow if your explanation is correct.
> 
> It's always a problem, but the focus on minimising stack usage has
> gone away since i386 has mostly disappeared from server rooms.
> 
> XFS has always been the thing that triggered stack usage problems
> first - the first reports of problems on x86_64 with 8k stacks in low
> memory situations have only just come in, and this is the first time
> in a couple of years I've paid close attention to stack usage
> outside XFS. What I'm seeing is not pretty....
> 
> > > This is the sort of thing I'm pointing at when I say that stack
> > > usage outside XFS has grown significantly over the
> > > past couple of years. Given XFS has remained pretty much the same or
> > > even reduced slightly over the same time period, blaming XFS or
> > > saying "callers should use GFP_NOFS" seems like a cop-out to me.
> > > Regardless of the IO pattern performance issues, writeback via
> > > direct reclaim just uses too much stack to be safe these days...
> > 
> > Yeah, my answer is simple: all stack eaters should be fixed.
> > But XFS doesn't seem innocent either. 3.5K is quite big, although
> > XFS has used that amount since long ago.
> 
> XFS used to use much more than that - significant effort has been
> put into reducing the stack footprint over many years. There's not
> much left to trim without rewriting half the filesystem...
> 

I don't think he is levelling a complaint at XFS in particular - just pointing
out that it's heavy too. Still, we should be grateful that XFS is sort of
a "Stack Canary". If it dies, everyone else could be in trouble soon :)

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14  7:19                   ` Minchan Kim
@ 2010-04-14  9:42                     ` KAMEZAWA Hiroyuki
  2010-04-14 10:01                       ` Minchan Kim
  0 siblings, 1 reply; 115+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-14  9:42 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KOSAKI Motohiro, Dave Chinner, Chris Mason, Mel Gorman,
	linux-kernel, linux-mm, linux-fsdevel

On Wed, 14 Apr 2010 16:19:02 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> On Wed, Apr 14, 2010 at 3:13 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
> > On Wed, Apr 14, 2010 at 2:54 PM, KOSAKI Motohiro
> > <kosaki.motohiro@jp.fujitsu.com> wrote:
> >>> On Wed, Apr 14, 2010 at 01:59:45PM +0900, KAMEZAWA Hiroyuki wrote:
> >>> > On Wed, 14 Apr 2010 11:40:41 +1000
> >>> > Dave Chinner <david@fromorbit.com> wrote:
> >>> >
> >>> > >  50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
> >>> > >  51)     3104     384   shrink_page_list+0x65e/0x840
> >>> > >  52)     2720     528   shrink_zone+0x63f/0xe10
> >>> >
> >>> > A bit OFF TOPIC.
> >>> >
> >>> > Could you share the disassembly of shrink_zone()?
> >>> >
> >>> > In my environ.
> >>> > 00000000000115a0 <shrink_zone>:
> >>> >    115a0:       55                      push   %rbp
> >>> >    115a1:       48 89 e5                mov    %rsp,%rbp
> >>> >    115a4:       41 57                   push   %r15
> >>> >    115a6:       41 56                   push   %r14
> >>> >    115a8:       41 55                   push   %r13
> >>> >    115aa:       41 54                   push   %r12
> >>> >    115ac:       53                      push   %rbx
> >>> >    115ad:       48 83 ec 78             sub    $0x78,%rsp
> >>> >    115b1:       e8 00 00 00 00          callq  115b6 <shrink_zone+0x16>
> >>> >    115b6:       48 89 75 80             mov    %rsi,-0x80(%rbp)
> >>> >
> >>> > The disassembly seems to show 0x78 bytes for the stack, and no changes to %rsp
> >>> > until return.
> >>>
> >>> I see the same. I didn't compile those kernels, though. IIUC,
> >>> they were built through the Ubuntu build infrastructure, so there is
> >>> something different in terms of compiler, compiler options or config
> >>> to what we are both using. Most likely it is the compiler inlining,
> >>> though Chris's patches to prevent that didn't seem to change the
> >>> stack usage.
> >>>
> >>> I'm trying to get a stack trace from the kernel that has shrink_zone
> >>> in it, but I haven't succeeded yet....
> >>
> >> I also got 0x78 bytes of stack usage. Umm.. are we discussing the real issue now?
> >>
> >
> > In my case, it's 0x110 bytes on a 32-bit machine.
> > So I think it's possible on a 64-bit machine too.
> >
> > 00001830 <shrink_zone>:
> >    1830:       55                      push   %ebp
> >    1831:       89 e5                   mov    %esp,%ebp
> >    1833:       57                      push   %edi
> >    1834:       56                      push   %esi
> >    1835:       53                      push   %ebx
> >    1836:       81 ec 10 01 00 00       sub    $0x110,%esp
> >    183c:       89 85 24 ff ff ff       mov    %eax,-0xdc(%ebp)
> >    1842:       89 95 20 ff ff ff       mov    %edx,-0xe0(%ebp)
> >    1848:       89 8d 1c ff ff ff       mov    %ecx,-0xe4(%ebp)
> >    184e:       8b 41 04                mov    0x4(%ecx)
> >
> > my gcc is following as.
> >
> > barrios@barriostarget:~/mmotm$ gcc -v
> > Using built-in specs.
> > Target: i486-linux-gnu
> > Configured with: ../src/configure -v --with-pkgversion='Ubuntu
> > 4.3.3-5ubuntu4'
> > --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
> > --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
> > --enable-shared --with-system-zlib --libexecdir=/usr/lib
> > --without-included-gettext --enable-threads=posix --enable-nls
> > --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
> > --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
> > --enable-mpfr --enable-targets=all --with-tune=generic
> > --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu
> > --target=i486-linux-gnu
> > Thread model: posix
> > gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4)
> >
> >
> > Does it depend on the config?
> > I've attached my config.
> 
> I changed shrink_list() to be noinline_for_stack.
> The result is as follows.
> 
> 
> 00001fe0 <shrink_zone>:
>     1fe0:       55                      push   %ebp
>     1fe1:       89 e5                   mov    %esp,%ebp
>     1fe3:       57                      push   %edi
>     1fe4:       56                      push   %esi
>     1fe5:       53                      push   %ebx
>     1fe6:       83 ec 4c                sub    $0x4c,%esp
>     1fe9:       89 45 c0                mov    %eax,-0x40(%ebp)
>     1fec:       89 55 bc                mov    %edx,-0x44(%ebp)
>     1fef:       89 4d b8                mov    %ecx,-0x48(%ebp)
> 
> 0x110 -> 0x4c.
> 
> Should we add noinline_for_stack to shrink_list()?
> 

Hmm, about shrink_zone(): I don't think uninlining functions directly called
by shrink_zone() can be much help.
The total stack size of the call chain will still be big.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14  9:42                     ` KAMEZAWA Hiroyuki
@ 2010-04-14 10:01                       ` Minchan Kim
  2010-04-14 10:07                         ` Mel Gorman
  0 siblings, 1 reply; 115+ messages in thread
From: Minchan Kim @ 2010-04-14 10:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Dave Chinner, Chris Mason, Mel Gorman,
	linux-kernel, linux-mm, linux-fsdevel

On Wed, Apr 14, 2010 at 6:42 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Wed, 14 Apr 2010 16:19:02 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> On Wed, Apr 14, 2010 at 3:13 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
>> > On Wed, Apr 14, 2010 at 2:54 PM, KOSAKI Motohiro
>> > <kosaki.motohiro@jp.fujitsu.com> wrote:
>> >>> On Wed, Apr 14, 2010 at 01:59:45PM +0900, KAMEZAWA Hiroyuki wrote:
>> >>> > On Wed, 14 Apr 2010 11:40:41 +1000
>> >>> > Dave Chinner <david@fromorbit.com> wrote:
>> >>> >
>> >>> > >  50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
>> >>> > >  51)     3104     384   shrink_page_list+0x65e/0x840
>> >>> > >  52)     2720     528   shrink_zone+0x63f/0xe10
>> >>> >
>> >>> > A bit OFF TOPIC.
>> >>> >
>> >>> > Could you share the disassembly of shrink_zone()?
>> >>> >
>> >>> > In my environ.
>> >>> > 00000000000115a0 <shrink_zone>:
>> >>> >    115a0:       55                      push   %rbp
>> >>> >    115a1:       48 89 e5                mov    %rsp,%rbp
>> >>> >    115a4:       41 57                   push   %r15
>> >>> >    115a6:       41 56                   push   %r14
>> >>> >    115a8:       41 55                   push   %r13
>> >>> >    115aa:       41 54                   push   %r12
>> >>> >    115ac:       53                      push   %rbx
>> >>> >    115ad:       48 83 ec 78             sub    $0x78,%rsp
>> >>> >    115b1:       e8 00 00 00 00          callq  115b6 <shrink_zone+0x16>
>> >>> >    115b6:       48 89 75 80             mov    %rsi,-0x80(%rbp)
>> >>> >
>> >>> > The disassembly seems to show 0x78 bytes for the stack, and no changes to %rsp
>> >>> > until return.
>> >>>
>> >>> I see the same. I didn't compile those kernels, though. IIUC,
>> >>> they were built through the Ubuntu build infrastructure, so there is
>> >>> something different in terms of compiler, compiler options or config
>> >>> to what we are both using. Most likely it is the compiler inlining,
>> >>> though Chris's patches to prevent that didn't seem to change the
>> >>> stack usage.
>> >>>
>> >>> I'm trying to get a stack trace from the kernel that has shrink_zone
>> >>> in it, but I haven't succeeded yet....
>> >>
>> >> I also got 0x78 bytes of stack usage. Umm... are we discussing the real issue now?
>> >>
>> >
>> > In my case, it's 0x110 bytes on a 32-bit machine.
>> > I think it's possible on a 64-bit machine too.
>> >
>> > 00001830 <shrink_zone>:
>> >    1830:       55                      push   %ebp
>> >    1831:       89 e5                   mov    %esp,%ebp
>> >    1833:       57                      push   %edi
>> >    1834:       56                      push   %esi
>> >    1835:       53                      push   %ebx
>> >    1836:       81 ec 10 01 00 00       sub    $0x110,%esp
>> >    183c:       89 85 24 ff ff ff       mov    %eax,-0xdc(%ebp)
>> >    1842:       89 95 20 ff ff ff       mov    %edx,-0xe0(%ebp)
>> >    1848:       89 8d 1c ff ff ff       mov    %ecx,-0xe4(%ebp)
>> >    184e:       8b 41 04                mov    0x4(%ecx),%eax
>> >
>> > my gcc is as follows.
>> >
>> > barrios@barriostarget:~/mmotm$ gcc -v
>> > Using built-in specs.
>> > Target: i486-linux-gnu
>> > Configured with: ../src/configure -v --with-pkgversion='Ubuntu
>> > 4.3.3-5ubuntu4'
>> > --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
>> > --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
>> > --enable-shared --with-system-zlib --libexecdir=/usr/lib
>> > --without-included-gettext --enable-threads=posix --enable-nls
>> > --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
>> > --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
>> > --enable-mpfr --enable-targets=all --with-tune=generic
>> > --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu
>> > --target=i486-linux-gnu
>> > Thread model: posix
>> > gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4)
>> >
>> >
>> > Does it depend on the config?
>> > I attach my config.
>>
>> I changed shrink_list() to noinline_for_stack.
>> The result is as follows.
>>
>>
>> 00001fe0 <shrink_zone>:
>>     1fe0:       55                      push   %ebp
>>     1fe1:       89 e5                   mov    %esp,%ebp
>>     1fe3:       57                      push   %edi
>>     1fe4:       56                      push   %esi
>>     1fe5:       53                      push   %ebx
>>     1fe6:       83 ec 4c                sub    $0x4c,%esp
>>     1fe9:       89 45 c0                mov    %eax,-0x40(%ebp)
>>     1fec:       89 55 bc                mov    %edx,-0x44(%ebp)
>>     1fef:       89 4d b8                mov    %ecx,-0x48(%ebp)
>>
>> 0x110 -> 0x4c.
>>
>> Should we add noinline_for_stack to shrink_list()?
>>
>
> Hmm, about shrink_zone(): I don't think uninlining the functions directly called
> by shrink_zone() can help much.
> The total stack size of the call-chain will still be big.

Absolutely.
But the above 500-byte usage is one of the hogs, and uninlining is not
critical for reclaim performance. So I think we lose less than we gain.

But I'm not in a hurry; an ad-hoc approach is not good.
I hope that when Mel tackles stack consumption in the reclaim path, he
modifies this part, too.

Thanks.

> Thanks,
> -Kame
>
>
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-13 20:20       ` Chris Mason
  2010-04-14  1:40         ` Dave Chinner
  2010-04-14  6:52         ` KOSAKI Motohiro
@ 2010-04-14 10:06         ` Andi Kleen
  2010-04-14 11:20           ` Chris Mason
  2 siblings, 1 reply; 115+ messages in thread
From: Andi Kleen @ 2010-04-14 10:06 UTC (permalink / raw)
  To: Chris Mason
  Cc: Mel Gorman, Dave Chinner, linux-kernel, linux-mm, linux-fsdevel

Chris Mason <chris.mason@oracle.com> writes:
>
> Huh, 912 bytes...for select, really?  From poll.h:
>
> /* ~832 bytes of stack space used max in sys_select/sys_poll before allocating
>    additional memory. */
> #define MAX_STACK_ALLOC 832
> #define FRONTEND_STACK_ALLOC    256
> #define SELECT_STACK_ALLOC      FRONTEND_STACK_ALLOC
> #define POLL_STACK_ALLOC        FRONTEND_STACK_ALLOC
> #define WQUEUES_STACK_ALLOC     (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)
> #define N_INLINE_POLL_ENTRIES   (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))
>
> So, select is intentionally trying to use that much stack.  It should be using
> GFP_NOFS if it really wants to suck down that much stack...

There are lots of other call chains which use multiple KB by themselves,
so why not give select() that measly 832 bytes?

You think only file systems are allowed to use stack? :)

Basically, if you cannot tolerate 1K (or, more likely, more) of stack
used before your fs is called, you're toast in lots of other situations
anyway.

> kernel had some sort of way to dynamically allocate ram, it could try
> that too.

It does this for large inputs, but the whole point of the stack fast
path is to avoid it for the common cases, when only a small number of
fds is needed.

It's significantly slower to go to any external allocator.
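
The fast path looks roughly like this in fs/select.c (paraphrasing from
memory, so don't quote me on the exact lines):

	/* core_sys_select(): small fd sets live on the stack,
	 * large ones fall back to kmalloc() */
	long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
	void *bits = stack_fds;
	unsigned size = FDS_BYTES(n);	/* bytes per fd bitmap */

	if (size > sizeof(stack_fds) / 6) {
		/* six bitmaps: in/out/ex plus their three result sets */
		bits = kmalloc(6 * size, GFP_KERNEL);
		if (!bits)
			goto out_nofds;
	}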

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14 10:01                       ` Minchan Kim
@ 2010-04-14 10:07                         ` Mel Gorman
  2010-04-14 10:16                           ` Minchan Kim
  0 siblings, 1 reply; 115+ messages in thread
From: Mel Gorman @ 2010-04-14 10:07 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel

On Wed, Apr 14, 2010 at 07:01:47PM +0900, Minchan Kim wrote:
> >> >>> > Dave Chinner <david@fromorbit.com> wrote:
> >> >>> >
> >> >>> > >  50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
> >> >>> > >  51)     3104     384   shrink_page_list+0x65e/0x840
> >> >>> > >  52)     2720     528   shrink_zone+0x63f/0xe10
> >> >>> >
> >> >>> > A bit OFF TOPIC.
> >> >>> >
> >> >>> > Could you share a disassembly of shrink_zone()?
> >> >>> >
> >> >>> > In my environment:
> >> >>> > 00000000000115a0 <shrink_zone>:
> >> >>> >    115a0:       55                      push   %rbp
> >> >>> >    115a1:       48 89 e5                mov    %rsp,%rbp
> >> >>> >    115a4:       41 57                   push   %r15
> >> >>> >    115a6:       41 56                   push   %r14
> >> >>> >    115a8:       41 55                   push   %r13
> >> >>> >    115aa:       41 54                   push   %r12
> >> >>> >    115ac:       53                      push   %rbx
> >> >>> >    115ad:       48 83 ec 78             sub    $0x78,%rsp
> >> >>> >    115b1:       e8 00 00 00 00          callq  115b6 <shrink_zone+0x16>
> >> >>> >    115b6:       48 89 75 80             mov    %rsi,-0x80(%rbp)
> >> >>> >
> >> >>> > The disassembly seems to show 0x78 bytes for the stack, and no changes to %rsp
> >> >>> > until return.
> >> >>>
> >> >>> I see the same. I didn't compile those kernels, though. IIUC,
> >> >>> they were built through the Ubuntu build infrastructure, so there is
> >> >>> something different in terms of compiler, compiler options or config
> >> >>> to what we are both using. Most likely it is the compiler inlining,
> >> >>> though Chris's patches to prevent that didn't seem to change the
> >> >>> stack usage.
> >> >>>
> >> >>> I'm trying to get a stack trace from the kernel that has shrink_zone
> >> >>> in it, but I haven't succeeded yet....
> >> >>
> >> >> I also got 0x78 bytes of stack usage. Umm... are we discussing the real issue now?
> >> >>
> >> >
> >> > In my case, it's 0x110 bytes on a 32-bit machine.
> >> > I think it's possible on a 64-bit machine too.
> >> >
> >> > 00001830 <shrink_zone>:
> >> >    1830:       55                      push   %ebp
> >> >    1831:       89 e5                   mov    %esp,%ebp
> >> >    1833:       57                      push   %edi
> >> >    1834:       56                      push   %esi
> >> >    1835:       53                      push   %ebx
> >> >    1836:       81 ec 10 01 00 00       sub    $0x110,%esp
> >> >    183c:       89 85 24 ff ff ff       mov    %eax,-0xdc(%ebp)
> >> >    1842:       89 95 20 ff ff ff       mov    %edx,-0xe0(%ebp)
> >> >    1848:       89 8d 1c ff ff ff       mov    %ecx,-0xe4(%ebp)
> >> >    184e:       8b 41 04                mov    0x4(%ecx),%eax
> >> >
> >> > my gcc is as follows.
> >> >
> >> > barrios@barriostarget:~/mmotm$ gcc -v
> >> > Using built-in specs.
> >> > Target: i486-linux-gnu
> >> > Configured with: ../src/configure -v --with-pkgversion='Ubuntu
> >> > 4.3.3-5ubuntu4'
> >> > --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
> >> > --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
> >> > --enable-shared --with-system-zlib --libexecdir=/usr/lib
> >> > --without-included-gettext --enable-threads=posix --enable-nls
> >> > --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
> >> > --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
> >> > --enable-mpfr --enable-targets=all --with-tune=generic
> >> > --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu
> >> > --target=i486-linux-gnu
> >> > Thread model: posix
> >> > gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4)
> >> >
> >> >
> >> > Does it depend on the config?
> >> > I attach my config.
> >>
> >> I changed shrink_list() to noinline_for_stack.
> >> The result is as follows.
> >>
> >>
> >> 00001fe0 <shrink_zone>:
> >>     1fe0:       55                      push   %ebp
> >>     1fe1:       89 e5                   mov    %esp,%ebp
> >>     1fe3:       57                      push   %edi
> >>     1fe4:       56                      push   %esi
> >>     1fe5:       53                      push   %ebx
> >>     1fe6:       83 ec 4c                sub    $0x4c,%esp
> >>     1fe9:       89 45 c0                mov    %eax,-0x40(%ebp)
> >>     1fec:       89 55 bc                mov    %edx,-0x44(%ebp)
> >>     1fef:       89 4d b8                mov    %ecx,-0x48(%ebp)
> >>
> >> 0x110 -> 0x4c.
> >>
> >> Should we add noinline_for_stack to shrink_list()?
> >>
> >
> > Hmm, about shrink_zone(): I don't think uninlining the functions directly called
> > by shrink_zone() can help much.
> > The total stack size of the call-chain will still be big.
> 
> Absolutely.
> But the above 500-byte usage is one of the hogs, and uninlining is not
> critical for reclaim performance. So I think we lose less than we gain.
> 

Bear in mind that uninlining can slightly increase the stack usage in some
cases, because arguments, return addresses and the like have to be pushed
onto the stack. Inlining or uninlining is only the answer when it reduces the
number of stack variables that exist at any given time.
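
A toy example of the distinction (userspace, and gcc's stack slot sharing
varies by version, so treat it as illustrative only):

#include <string.h>

static void use(char *p) { memset(p, 1, 512); }

/*
 * Inlined into walk(), both 512-byte buffers can end up live in one
 * big frame. Uninlined, each frame exists only while its helper runs,
 * so the peak drops to ~512 bytes plus call overhead. Had the buffers
 * been passed in as arguments instead, uninlining would buy nothing
 * and cost a little.
 */
static __attribute__((noinline)) void step_a(void) { char buf[512]; use(buf); }
static __attribute__((noinline)) void step_b(void) { char buf[512]; use(buf); }

static void walk(void)
{
	step_a();
	step_b();
}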

> But I'm not in a hurry; an ad-hoc approach is not good.
> I hope that when Mel tackles stack consumption in the reclaim path, he
> modifies this part, too.
> 

It'll be at least two days before I get the chance to try. A lot of the
temporary variables used in the reclaim path have existed for some time so
it will take a while.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14 10:07                         ` Mel Gorman
@ 2010-04-14 10:16                           ` Minchan Kim
  0 siblings, 0 replies; 115+ messages in thread
From: Minchan Kim @ 2010-04-14 10:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel

On Wed, Apr 14, 2010 at 7:07 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Wed, Apr 14, 2010 at 07:01:47PM +0900, Minchan Kim wrote:
>> >> >>> > Dave Chinner <david@fromorbit.com> wrote:
>> >> >>> >
>> >> >>> > >  50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
>> >> >>> > >  51)     3104     384   shrink_page_list+0x65e/0x840
>> >> >>> > >  52)     2720     528   shrink_zone+0x63f/0xe10
>> >> >>> >
>> >> >>> > A bit OFF TOPIC.
>> >> >>> >
>> >> >>> > Could you share a disassembly of shrink_zone()?
>> >> >>> >
>> >> >>> > In my environment:
>> >> >>> > 00000000000115a0 <shrink_zone>:
>> >> >>> >    115a0:       55                      push   %rbp
>> >> >>> >    115a1:       48 89 e5                mov    %rsp,%rbp
>> >> >>> >    115a4:       41 57                   push   %r15
>> >> >>> >    115a6:       41 56                   push   %r14
>> >> >>> >    115a8:       41 55                   push   %r13
>> >> >>> >    115aa:       41 54                   push   %r12
>> >> >>> >    115ac:       53                      push   %rbx
>> >> >>> >    115ad:       48 83 ec 78             sub    $0x78,%rsp
>> >> >>> >    115b1:       e8 00 00 00 00          callq  115b6 <shrink_zone+0x16>
>> >> >>> >    115b6:       48 89 75 80             mov    %rsi,-0x80(%rbp)
>> >> >>> >
>> >> >>> > The disassembly seems to show 0x78 bytes for the stack, and no changes to %rsp
>> >> >>> > until return.
>> >> >>>
>> >> >>> I see the same. I didn't compile those kernels, though. IIUC,
>> >> >>> they were built through the Ubuntu build infrastructure, so there is
>> >> >>> something different in terms of compiler, compiler options or config
>> >> >>> to what we are both using. Most likely it is the compiler inlining,
>> >> >>> though Chris's patches to prevent that didn't seem to change the
>> >> >>> stack usage.
>> >> >>>
>> >> >>> I'm trying to get a stack trace from the kernel that has shrink_zone
>> >> >>> in it, but I haven't succeeded yet....
>> >> >>
>> >> >> I also got 0x78 bytes of stack usage. Umm... are we discussing the real issue now?
>> >> >>
>> >> >
>> >> > In my case, it's 0x110 bytes on a 32-bit machine.
>> >> > I think it's possible on a 64-bit machine too.
>> >> >
>> >> > 00001830 <shrink_zone>:
>> >> >    1830:       55                      push   %ebp
>> >> >    1831:       89 e5                   mov    %esp,%ebp
>> >> >    1833:       57                      push   %edi
>> >> >    1834:       56                      push   %esi
>> >> >    1835:       53                      push   %ebx
>> >> >    1836:       81 ec 10 01 00 00       sub    $0x110,%esp
>> >> >    183c:       89 85 24 ff ff ff       mov    %eax,-0xdc(%ebp)
>> >> >    1842:       89 95 20 ff ff ff       mov    %edx,-0xe0(%ebp)
>> >> >    1848:       89 8d 1c ff ff ff       mov    %ecx,-0xe4(%ebp)
>> >> >    184e:       8b 41 04                mov    0x4(%ecx),%eax
>> >> >
>> >> > my gcc is as follows.
>> >> >
>> >> > barrios@barriostarget:~/mmotm$ gcc -v
>> >> > Using built-in specs.
>> >> > Target: i486-linux-gnu
>> >> > Configured with: ../src/configure -v --with-pkgversion='Ubuntu
>> >> > 4.3.3-5ubuntu4'
>> >> > --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
>> >> > --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
>> >> > --enable-shared --with-system-zlib --libexecdir=/usr/lib
>> >> > --without-included-gettext --enable-threads=posix --enable-nls
>> >> > --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
>> >> > --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
>> >> > --enable-mpfr --enable-targets=all --with-tune=generic
>> >> > --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu
>> >> > --target=i486-linux-gnu
>> >> > Thread model: posix
>> >> > gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4)
>> >> >
>> >> >
>> >> > Does it depend on the config?
>> >> > I attach my config.
>> >>
>> >> I changed shrink_list() to noinline_for_stack.
>> >> The result is as follows.
>> >>
>> >>
>> >> 00001fe0 <shrink_zone>:
>> >>     1fe0:       55                      push   %ebp
>> >>     1fe1:       89 e5                   mov    %esp,%ebp
>> >>     1fe3:       57                      push   %edi
>> >>     1fe4:       56                      push   %esi
>> >>     1fe5:       53                      push   %ebx
>> >>     1fe6:       83 ec 4c                sub    $0x4c,%esp
>> >>     1fe9:       89 45 c0                mov    %eax,-0x40(%ebp)
>> >>     1fec:       89 55 bc                mov    %edx,-0x44(%ebp)
>> >>     1fef:       89 4d b8                mov    %ecx,-0x48(%ebp)
>> >>
>> >> 0x110 -> 0x4c.
>> >>
>> >> Should we add noinline_for_stack to shrink_list()?
>> >>
>> >
>> > Hmm, about shrink_zone(): I don't think uninlining the functions directly called
>> > by shrink_zone() can help much.
>> > The total stack size of the call-chain will still be big.
>>
>> Absolutely.
>> But the above 500-byte usage is one of the hogs, and uninlining is not
>> critical for reclaim performance. So I think we lose less than we gain.
>>
>
> Bear in mind that uninlining can slightly increase the stack usage in some
> cases, because arguments, return addresses and the like have to be pushed
> onto the stack. Inlining or uninlining is only the answer when it reduces the
> number of stack variables that exist at any given time.

Yes. I had totally missed that.
Thanks, Mel.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14 10:06         ` Andi Kleen
@ 2010-04-14 11:20           ` Chris Mason
  2010-04-14 12:15             ` Andi Kleen
  2010-04-14 13:23             ` Mel Gorman
  0 siblings, 2 replies; 115+ messages in thread
From: Chris Mason @ 2010-04-14 11:20 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mel Gorman, Dave Chinner, linux-kernel, linux-mm, linux-fsdevel

On Wed, Apr 14, 2010 at 12:06:36PM +0200, Andi Kleen wrote:
> Chris Mason <chris.mason@oracle.com> writes:
> >
> > Huh, 912 bytes...for select, really?  From poll.h:
> >
> > /* ~832 bytes of stack space used max in sys_select/sys_poll before allocating
> >    additional memory. */
> > #define MAX_STACK_ALLOC 832
> > #define FRONTEND_STACK_ALLOC    256
> > #define SELECT_STACK_ALLOC      FRONTEND_STACK_ALLOC
> > #define POLL_STACK_ALLOC        FRONTEND_STACK_ALLOC
> > #define WQUEUES_STACK_ALLOC     (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)
> > #define N_INLINE_POLL_ENTRIES   (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))
> >
> > So, select is intentionally trying to use that much stack.  It should be using
> > GFP_NOFS if it really wants to suck down that much stack...
> 
> There are lots of other call chains which use multiple KB by themselves,
> so why not give select() that measly 832 bytes?
> 
> You think only file systems are allowed to use stack? :)

Grin, most definitely.

> 
> Basically, if you cannot tolerate 1K (or, more likely, more) of stack
> used before your fs is called, you're toast in lots of other situations
> anyway.

Well, on a 4K stack kernel, 832 bytes is a very large percentage for
just one function.

Direct reclaim is a problem because it splices parts of the kernel that
normally aren't connected together.  The people that code in select see
832 bytes and say "that's teeny, I should have taken 3832 bytes".

But they don't realize their function can dive down into ecryptfs then
the filesystem then maybe loop and then perhaps raid6 on top of a
network block device.

> 
> > kernel had some sort of way to dynamically allocate ram, it could try
> > that too.
> 
> It does this for large inputs, but the whole point of the stack fast
> path is to avoid it for the common cases, when only a small number of
> fds is needed.
> 
> It's significantly slower to go to any external allocator.

Yeah, but since the call chain does eventually go into the allocator,
this function needs to be more stack friendly.

I do agree that we can't really solve this with noinline_for_stack pixie
dust; the long call chains are going to be a problem no matter what.

Reading through all the comments so far, I think the short summary is:

Cleaning pages in direct reclaim helps the VM because it is able to make
sure that lumpy reclaim finds adjacent pages.  This isn't a fast
operation; it has to wait for IO (infinitely slow compared to the CPU).

Will it be good enough for the VM if we add a hint to the bdi writeback
threads to work on a general area of the file?  The filesystem will get
writepages(), and the VM will get the IO it needs started.
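
Something like this, maybe (an entirely hypothetical interface; the names
and fields below are made up to illustrate the shape of it):

/* hypothetical hint from the VM to the per-bdi flusher thread */
struct wb_area_hint {
	struct address_space	*mapping;
	pgoff_t			index;		/* page the VM wants clean */
	unsigned long		nr_pages;	/* size of the area around it */
};

/*
 * Direct reclaim would queue this and back off instead of calling
 * ->writepage() itself; the flusher thread turns it into a ranged
 * ->writepages() call.
 */
int bdi_writeback_area_hint(struct backing_dev_info *bdi,
			    struct wb_area_hint *hint);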

I know Mel mentioned before he wasn't interested in waiting for helper
threads, but I don't see how we can work without it.

-chris

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14 11:20           ` Chris Mason
@ 2010-04-14 12:15             ` Andi Kleen
  2010-04-14 12:32               ` Alan Cox
  2010-04-14 13:23             ` Mel Gorman
  1 sibling, 1 reply; 115+ messages in thread
From: Andi Kleen @ 2010-04-14 12:15 UTC (permalink / raw)
  To: Chris Mason
  Cc: Mel Gorman, Dave Chinner, linux-kernel, linux-mm, linux-fsdevel

Chris Mason <chris.mason@oracle.com> writes:
>> 
>> Basically, if you cannot tolerate 1K (or, more likely, more) of stack
>> used before your fs is called, you're toast in lots of other situations
>> anyway.
>
> Well, on a 4K stack kernel, 832 bytes is a very large percentage for
> just one function.

To be honest, I think the 4K stack simply has to go. I tend to call
it "russian roulette" mode. 

It was just an old workaround for a very old buggy VM that couldn't free 8K
pages, and the VM is a lot better at that now. And the general trend is
toward more complex code everywhere, so 4K stacks become more and more hazardous.

It was a bad idea back then and is still a bad idea, getting
worse and worse with each MLOC being added to the kernel each year.

We don't have any good ways to verify that obscure paths through
more and more subsystems won't exceed it (in fact, I'm pretty
sure there are plenty of problems in exotic configurations).

And even if you can make a specific load work, there's basically
no safety net.

The only part of the 4K stack code that's good is the separate
interrupt stack, but that one should be just combined with a sane 8K 
process stack.

But yes, on a 4K kernel you probably don't want to do any direct reclaim.
Maybe behave as if GFP_NOFS everywhere except for user allocations when it's set?
Or simply drop it?
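
The writeback half of that is a one-liner in scan_control terms (sketch,
untested; CONFIG_4KSTACKS being the 32-bit option in question):

#ifdef CONFIG_4KSTACKS
	sc.may_writepage = 0;	/* never enter ->writepage from direct reclaim */
#else
	sc.may_writepage = !laptop_mode;
#endif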

> But they don't realize their function can dive down into ecryptfs then
> the filesystem then maybe loop and then perhaps raid6 on top of a
> network block device.

Those stackings need to use separate threads anyway. A lot of them
do, in fact. Block avoided this problem by iterating instead of
recursing.  Those that still recurse on the same stack simply
need to be fixed.

> Yeah, but since the call chain does eventually go into the allocator,
> this function needs to be more stack friendly.

For common fast paths it doesn't go into the allocator.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14 12:15             ` Andi Kleen
@ 2010-04-14 12:32               ` Alan Cox
  2010-04-14 12:34                 ` Andi Kleen
  0 siblings, 1 reply; 115+ messages in thread
From: Alan Cox @ 2010-04-14 12:32 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Chris Mason, Mel Gorman, Dave Chinner, linux-kernel, linux-mm,
	linux-fsdevel

> The only part of the 4K stack code that's good is the separate
> interrupt stack, but that one should be just combined with a sane 8K 
> process stack.

The reality is that if you are blowing a 4K process stack, you are
probably playing russian roulette on the current 8K x86-32 stack as well,
because of the non-IRQ split. So it needs fixing either way.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14 12:32               ` Alan Cox
@ 2010-04-14 12:34                 ` Andi Kleen
  0 siblings, 0 replies; 115+ messages in thread
From: Andi Kleen @ 2010-04-14 12:34 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Chris Mason, Mel Gorman, Dave Chinner, linux-kernel,
	linux-mm, linux-fsdevel

On Wed, Apr 14, 2010 at 01:32:29PM +0100, Alan Cox wrote:
> > The only part of the 4K stack code that's good is the separate
> > interrupt stack, but that one should be just combined with a sane 8K 
> > process stack.
> 
> > The reality is that if you are blowing a 4K process stack, you are
> > probably playing russian roulette on the current 8K x86-32 stack as well,
> > because of the non-IRQ split. So it needs fixing either way.

Yes, I think the 8K stack on 32-bit should be combined with an interrupt
stack too. There's no reason not to have an interrupt stack, ever.

Again, the problem with fixing it is that you won't have any safety net
for a slightly different stacking or similar path that you didn't cover.

That said, extreme examples (like some of those Chris listed) definitely
need fixing by moving them to different threads. But even after that
you still want a safety net. 4K is just too near the edge.

Maybe it would work if we never used any indirect calls, but that's
clearly not the case.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14 11:20           ` Chris Mason
  2010-04-14 12:15             ` Andi Kleen
@ 2010-04-14 13:23             ` Mel Gorman
  2010-04-14 14:07               ` Chris Mason
  1 sibling, 1 reply; 115+ messages in thread
From: Mel Gorman @ 2010-04-14 13:23 UTC (permalink / raw)
  To: Chris Mason, Andi Kleen, Dave Chinner, linux-kernel, linux-mm,
	linux-fsdevel

On Wed, Apr 14, 2010 at 07:20:15AM -0400, Chris Mason wrote:
> On Wed, Apr 14, 2010 at 12:06:36PM +0200, Andi Kleen wrote:
> > Chris Mason <chris.mason@oracle.com> writes:
> > >
> > > Huh, 912 bytes...for select, really?  From poll.h:
> > >
> > > /* ~832 bytes of stack space used max in sys_select/sys_poll before allocating
> > >    additional memory. */
> > > #define MAX_STACK_ALLOC 832
> > > #define FRONTEND_STACK_ALLOC    256
> > > #define SELECT_STACK_ALLOC      FRONTEND_STACK_ALLOC
> > > #define POLL_STACK_ALLOC        FRONTEND_STACK_ALLOC
> > > #define WQUEUES_STACK_ALLOC     (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)
> > > #define N_INLINE_POLL_ENTRIES   (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))
> > >
> > > So, select is intentionally trying to use that much stack.  It should be using
> > > GFP_NOFS if it really wants to suck down that much stack...
> > 
> > There are lots of other call chains which use multiple KB by themselves,
> > so why not give select() that measly 832 bytes?
> > 
> > You think only file systems are allowed to use stack? :)
> 
> Grin, most definitely.
> 
> > 
> > Basically, if you cannot tolerate 1K (or, more likely, more) of stack
> > used before your fs is called, you're toast in lots of other situations
> > anyway.
> 
> Well, on a 4K stack kernel, 832 bytes is a very large percentage for
> just one function.
> 
> Direct reclaim is a problem because it splices parts of the kernel that
> normally aren't connected together.  The people that code in select see
> 832 bytes and say "that's teeny, I should have taken 3832 bytes".
> 

Even without direct reclaim, I doubt stack usage is often at the top of
people's minds except for truly criminally large usages of it. Direct
reclaim splicing is somewhat of a problem, but it's separate from stack
consumption overall.

> But they don't realize their function can dive down into ecryptfs then
> the filesystem then maybe loop and then perhaps raid6 on top of a
> network block device.
> 
> > 
> > > kernel had some sort of way to dynamically allocate ram, it could try
> > > that too.
> > 
> > It does this for large inputs, but the whole point of the stack fast
> > path is to avoid it for the common cases, when only a small number of
> > fds is needed.
> > 
> > It's significantly slower to go to any external allocator.
> 
> Yeah, but since the call chain does eventually go into the allocator,
> this function needs to be more stack friendly.
> 
> I do agree that we can't really solve this with noinline_for_stack pixie
> dust; the long call chains are going to be a problem no matter what.
> 
> Reading through all the comments so far, I think the short summary is:
> 
> Cleaning pages in direct reclaim helps the VM because it is able to make
> sure that lumpy reclaim finds adjacent pages.  This isn't a fast
> operation; it has to wait for IO (infinitely slow compared to the CPU).
> 
> Will it be good enough for the VM if we add a hint to the bdi writeback
> threads to work on a general area of the file?  The filesystem will get
> writepages(), the VM will get the IO it needs started.
> 

Bear in mind that in the context of lumpy reclaim, the VM doesn't care
where the data is in the file or filesystem. It's only concerned
with where the data is located in memory. There *may* be a correlation
between location-of-data-in-file and location-of-data-in-memory, but only
if readahead was a factor and readahead happened to hit at a time the page
allocator broke up a contiguous block of memory.

> I know Mel mentioned before he wasn't interested in waiting for helper
> threads, but I don't see how we can work without it.
> 

I'm not against the idea as such. It would have advantages in that the
thread could reorder the IO for better seeks, for example, and lumpy
reclaim is already potentially waiting a long time, so another delay
won't hurt. I would worry that it's just hiding the stack usage by
moving it to another thread and that there would be a communication cost
between a direct reclaimer and this writeback thread. The main gain
would be in hiding the "splicing" effect between subsystems that direct
reclaim can have.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14 13:23             ` Mel Gorman
@ 2010-04-14 14:07               ` Chris Mason
  0 siblings, 0 replies; 115+ messages in thread
From: Chris Mason @ 2010-04-14 14:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andi Kleen, Dave Chinner, linux-kernel, linux-mm, linux-fsdevel

On Wed, Apr 14, 2010 at 02:23:50PM +0100, Mel Gorman wrote:
> On Wed, Apr 14, 2010 at 07:20:15AM -0400, Chris Mason wrote:

[ nods ]

> 
> Bear in mind that in the context of lumpy reclaim, the VM doesn't care
> where the data is in the file or filesystem. It's only concerned
> with where the data is located in memory. There *may* be a correlation
> between location-of-data-in-file and location-of-data-in-memory, but only
> if readahead was a factor and readahead happened to hit at a time the page
> allocator broke up a contiguous block of memory.
> 
> > I know Mel mentioned before he wasn't interested in waiting for helper
> > threads, but I don't see how we can work without it.
> > 
> 
> I'm not against the idea as such. It would have advantages in that the
> thread could reorder the IO for better seeks, for example, and lumpy
> reclaim is already potentially waiting a long time, so another delay
> won't hurt. I would worry that it's just hiding the stack usage by
> moving it to another thread and that there would be a communication cost
> between a direct reclaimer and this writeback thread. The main gain
> would be in hiding the "splicing" effect between subsystems that direct
> reclaim can have.

The big gain from the helper threads is that storage operates at a
roughly fixed iop rate.  This is true for ssd as well, it's just a much
higher rate.  So the threads can send down 4K ios, and they will complete at
exactly the same rate as 64KB ios would; the larger ios just clean more pages per seek.
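
To put rough numbers on it: at a nominal 100 IOs/sec, 4K writes clean
about 400KB/s worth of pages, while 64KB writes clean about 6.4MB/s,
sixteen times as many pages for the same seek budget.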

I know that for lumpy purposes it might not be the best 64KB, but the
other side of it is that we have to write those pages eventually anyway.
We might as well write them when it is more or less free.

The per-bdi writeback threads are a pretty good base for changing the
ordering for writeback; they seem like a good place to integrate requests
from the VM about which files (and which offsets in those files) to
write back first.

-chris


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14  8:51               ` Mel Gorman
@ 2010-04-15  1:34                 ` Dave Chinner
  2010-04-15  4:09                   ` KOSAKI Motohiro
                                     ` (2 more replies)
  2010-04-15  2:37                 ` Johannes Weiner
  1 sibling, 3 replies; 115+ messages in thread
From: Dave Chinner @ 2010-04-15  1:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Chris Mason, linux-kernel, linux-mm, linux-fsdevel

On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> On Wed, Apr 14, 2010 at 05:28:30PM +1000, Dave Chinner wrote:
> > On Wed, Apr 14, 2010 at 03:52:44PM +0900, KOSAKI Motohiro wrote:
> > > > On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > > > > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > > > > Basically, there is not enough stack space available to allow direct
> > > > > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > > > > profiles we are seeing here....
> > > > > > > 
> > > > > > 
> > > > > > I'm not denying the evidence, but how have we gotten away with it for years
> > > > > > then? Prevention of writeback isn't the answer without figuring out how
> > > > > > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > > > > > doing sync IO, then waiting on those pages.
> > > > > 
> > > > > So, I've been reading along, nodding my head to Dave's side of things
> > > > > because seeks are evil and direct reclaim makes seeks.  I'd really love
> > > > > for direct reclaim to somehow trigger writepages on large chunks instead
> > > > > of doing page by page spatters of IO to the drive.
> > > 
> > > I agree that "seeks are evil and direct reclaim makes seeks". Actually,
> > > making 4k IO is not a must for pageout. So we can probably improve it.
> > > 
> > > 
> > > > Perhaps drop the lock on the page if it is held and call one of the
> > > > helpers that filesystems use to do this, like:
> > > > 
> > > > 	filemap_write_and_wait(page->mapping);
> > > 
> > > Sorry, I'm lost as to what you're talking about. Why do we need per-file
> > > waiting? If the file is a 1GB file, do we need to wait for 1GB of writeout?
> > 
> > So use filemap_fdatawrite(page->mapping), or if it's better only
> > to start IO on a segment of the file, use
> > filemap_fdatawrite_range(page->mapping, start, end)....
> 
> That does not help the stack usage issue; the caller ends up in
> ->writepages. From an IO perspective, it'll be better from a seek point of
> view but from a VM perspective, it may or may not be cleaning the right pages.
> So I think this is a red herring.

If you ask it to clean a bunch of pages around the one you want to
reclaim on the LRU, there is a good chance it will also be cleaning
pages that are near the end of the LRU or physically close by as
well. It's not a guarantee, but for the additional IO cost of about
10% wall time on that IO to clean the page you need, you also get
1-2 orders of magnitude more pages cleaned. That sounds like a
win any way you look at it...
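
e.g. something like this sketch (the window size is pulled out of the
air, and it is obviously untested):

static void writeback_cluster(struct page *page)
{
	struct address_space *mapping = page->mapping;
	pgoff_t index = page->index & ~255UL;	/* 1MB window on 4k pages */
	loff_t start = (loff_t)index << PAGE_CACHE_SHIFT;
	loff_t end = start + (256 << PAGE_CACHE_SHIFT) - 1;

	/* async writeback of the window around the target page */
	__filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE);
}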

I agree that it doesn't solve the stack problem (Chris' suggestion
that we enable the bdi flusher interface would fix this); what I'm
pointing out is that the arguments that it is too hard or there are
no interfaces available to issue larger IO from reclaim are not at
all valid.

> > the deepest call chain in queue_work() needs 700 bytes of stack
> > to complete, wait_for_completion() requires almost 2k of stack space
> > at its deepest, the scheduler has some heavy stack users, etc.,
> > and these are all functions that appear at the top of the stack.
> > 
> 
> The real issue here then is that stack usage has gone out of control.

That's definitely true, but it shouldn't cloud the fact that most
people want to kill writeback from direct reclaim, too, so killing two
birds with one stone seems like a good idea.

How about this? For now, we stop direct reclaim from doing writeback
only on order zero allocations, but allow it for higher order
allocations. That will prevent the majority of situations where
direct reclaim blows the stack and interferes with background
writeout, but won't cause lumpy reclaim to change behaviour.
This reduces the scope of the impact, and hence the testing and validation
that needs to be done.
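
In code the rule is about this simple (sketch, untested):

	/* in shrink_page_list(), before we consider pageout(): */
	if (PageDirty(page) && !current_is_kswapd() && sc->order == 0)
		goto keep_locked;	/* leave it for the flusher threads */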

Then we can work towards allowing lumpy reclaim to use background
threads as Chris suggested for doing specific writeback operations
to solve the remaining problems being seen. Does this seem like a
reasonable compromise and approach to dealing with the problem?

> Disabling ->writepage in direct reclaim does not guarantee that stack
> usage will not be a problem again. From your traces, page reclaim itself
> seems to be a big dirty hog.

I couldn't agree more - the kernel still needs to be put on a stack
usage diet, but the above would give us some breathing space to attack the
problem before more people start to hit these problems.

> > Good start, but 512 bytes will only catch select and splice read,
> > and there are 300-400 byte functions in the above list that sit near
> > the top of the stack....
> > 
> 
> They will need to be tackled in turn then but obviously there should be
> a focus on the common paths. The reclaim paths do seem particularly
> heavy and it's down to a lot of temporary variables. I might not get the
> time today but what I'm going to try do some time this week is
> 
> o Look at what temporary variables are copies of other pieces of information
> o See what variables live for the duration of reclaim but are not needed
>   for all of it (i.e. uninline parts of it so variables do not persist)
> o See if it's possible to dynamically allocate scan_control

Welcome to my world ;)

> The last one is the trickiest. Basically, the idea would be to move as much
> into scan_control as possible. Then, instead of allocating it on the stack,
> allocate a fixed number of them at boot-time (NR_CPU probably) protected by
> a semaphore. Limit the number of direct reclaimers that can be active at a
> time to the number of scan_control variables. kswapd could still allocate
> its own on the stack or with kmalloc.
> 
> If it works out, it would have two main benefits. Limits the number of
> processes in direct reclaim - if there is NR_CPU-worth of processes in direct
> reclaim, there is too much going on. It would also shrink the stack usage
> particularly if some of the stack variables are moved into scan_control.
> 
> Maybe someone will beat me to looking at the feasibility of this.

I like the idea - it really sounds like you want a fixed size,
preallocated mempool that can't be enlarged. In fact, I can probably
use something like this in XFS to save a couple of hundred bytes of
stack space in the worst hogs....
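
Something along these lines, perhaps (sketch only, none of it tested,
though all the primitives exist):

static struct scan_control sc_pool[NR_CPUS];
static unsigned long sc_pool_map[BITS_TO_LONGS(NR_CPUS)];
static DEFINE_SPINLOCK(sc_pool_lock);
static struct semaphore sc_pool_sem;	/* sema_init(&sc_pool_sem, NR_CPUS) at boot */

static struct scan_control *get_scan_control(void)
{
	int idx;

	down(&sc_pool_sem);		/* throttles direct reclaimers */
	spin_lock(&sc_pool_lock);
	idx = find_first_zero_bit(sc_pool_map, NR_CPUS);
	__set_bit(idx, sc_pool_map);
	spin_unlock(&sc_pool_lock);
	memset(&sc_pool[idx], 0, sizeof(struct scan_control));
	return &sc_pool[idx];
}

static void put_scan_control(struct scan_control *sc)
{
	spin_lock(&sc_pool_lock);
	__clear_bit(sc - sc_pool, sc_pool_map);
	spin_unlock(&sc_pool_lock);
	up(&sc_pool_sem);
}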

> > > > This is the sort of thing I'm pointing at when I say that stack
> > > > usage outside XFS has grown significantly significantly over the
> > > > past couple of years. Given XFS has remained pretty much the same or
> > > > even reduced slightly over the same time period, blaming XFS or
> > > > saying "callers should use GFP_NOFS" seems like a cop-out to me.
> > > > Regardless of the IO pattern performance issues, writeback via
> > > > direct reclaim just uses too much stack to be safe these days...
> > > 
> > > Yeah, my answer is simple: all stack eaters should be fixed.
> > > But XFS doesn't seem innocent either. 3.5K is quite big, although
> > > xfs has used that much since long ago.
> > 
> > XFS used to use much more than that - significant effort has been
> > put into reducing the stack footprint over many years. There's not
> > much left to trim without rewriting half the filesystem...
> 
> I don't think he is levelling a complaint at XFS in particular - just pointing
> out that it's heavy too. Still, we should be grateful that XFS is sort of
> a "Stack Canary". If it dies, everyone else could be in trouble soon :)

Yeah, true. Sorry if I'm being a bit too defensive here - the scars
from previous discussions like this are showing through....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14  6:52           ` KOSAKI Motohiro
@ 2010-04-15  1:56             ` Dave Chinner
  0 siblings, 0 replies; 115+ messages in thread
From: Dave Chinner @ 2010-04-15  1:56 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: linux-kernel, linux-mm, linux-fsdevel, Chris Mason

On Wed, Apr 14, 2010 at 03:52:32PM +0900, KOSAKI Motohiro wrote:
> > On Wed, Apr 14, 2010 at 12:36:59AM +1000, Dave Chinner wrote:
> > > On Tue, Apr 13, 2010 at 08:39:29PM +0900, KOSAKI Motohiro wrote:
> > > > > FWIW, the biggest problem here is that I have absolutely no clue on
> > > > > how to test what the impact on lumpy reclaim really is. Does anyone
> > > > > have a relatively simple test that can be run to determine what the
> > > > > impact is?
> > > > 
> > > > So, can you please run two workloads concurrently?
> > > >  - Normal IO workload (fio, iozone, etc..)
> > > >  - echo $NUM > /proc/sys/vm/nr_hugepages
> > > 
> > > What do I measure/observe/record that is meaningful?
> > 
> > So, a rough as guts first pass - just run a large dd (8 times the
> > size of memory - 8GB file vs 1GB RAM) and repeatedly try to allocate
> > the entirety of memory in huge pages (500) every 5 seconds. The IO
> > rate is roughly 100MB/s, so it takes 75-85s to complete the dd.
.....
> > Basically, with my patch lumpy reclaim was *substantially* more
> > effective with only a slight increase in average allocation latency
> > with this test case.
....
> > I know this is a simple test case, but it shows much better results
> > than I think anyone (even me) is expecting...
> 
> Ummm...
> 
> Probably, I have to say I'm sorry. I guess my last mail gave you
> a misunderstanding.
> To be honest, I'm not interested in this artificial non-fragmentation case.

And to be brutally honest, I'm not interested in wasting my time
trying to come up with a test case that you are interested in.

Instead, can you please provide me with the test cases
(scripts, preferably) that you use to measure the effectiveness of
reclaim changes, and I'll run them.

> The above test-case 1) discards all cache and 2) fills pages by streaming
> IO. That makes an artificial "file offset neighbor == block neighbor == PFN neighbor"
> situation, and then file-offset-order writeout by the flusher thread can produce
> PFN-contiguous pages effectively.

Yes, that's true, but it does indicate that in that situation, it is
more effective than the current code. FWIW, in the case of HPC
applications (which often use huge pages and clear the cache before
starting a new job), large streaming IO is a pretty common IO
pattern, so I don't think this situation is as artificial as you are
indicating.

> Why am I not interested in it? Because lumpy reclaim is a technique for
> avoiding the external fragmentation mess. IOW, it is for avoiding the
> worst case, but your test case seems to measure the best one.

Then please provide test cases that you consider valid.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-14  8:51               ` Mel Gorman
  2010-04-15  1:34                 ` Dave Chinner
@ 2010-04-15  2:37                 ` Johannes Weiner
  2010-04-15  2:43                   ` KOSAKI Motohiro
  1 sibling, 1 reply; 115+ messages in thread
From: Johannes Weiner @ 2010-04-15  2:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dave Chinner, KOSAKI Motohiro, Chris Mason, linux-kernel,
	linux-mm, linux-fsdevel

On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> They will need to be tackled in turn then but obviously there should be
> a focus on the common paths. The reclaim paths do seem particularly
> heavy and it's down to a lot of temporary variables. I might not get the
> time today but what I'm going to try do some time this week is
> 
> o Look at what temporary variables are copies of other pieces of information
> o See what variables live for the duration of reclaim but are not needed
>   for all of it (i.e. uninline parts of it so variables do not persist)
> o See if it's possible to dynamically allocate scan_control
> 
> The last one is the trickiest. Basically, the idea would be to move as much
> into scan_control as possible. Then, instead of allocating it on the stack,
> allocate a fixed number of them at boot-time (NR_CPU probably) protected by
> a semaphore. Limit the number of direct reclaimers that can be active at a
> time to the number of scan_control variables. kswapd could still allocate
> its own on the stack or with kmalloc.
> 
> If it works out, it would have two main benefits. Limits the number of
> processes in direct reclaim - if there is NR_CPU-worth of processes in direct
> reclaim, there is too much going on. It would also shrink the stack usage
> particularly if some of the stack variables are moved into scan_control.
> 
> Maybe someone will beat me to looking at the feasibility of this.

I already have some patches to remove trivial parts of struct scan_control,
namely may_unmap, may_swap, all_unreclaimable and isolate_pages.  The rest
needs a deeper look.

A rather big offender in there is the combination of shrink_active_list (360
bytes here) and shrink_page_list (200 bytes).  I am currently looking at
breaking out all the accounting stuff from shrink_active_list into a separate
leaf function so that the stack footprint does not add up.
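
Roughly this shape (sketch; the function name is invented, and the real
split needs care with what shrink_active_list() still uses afterwards):

static noinline_for_stack void
shrink_active_accounting(struct zone *zone, struct scan_control *sc,
			 int file, unsigned long nr_scanned,
			 unsigned long nr_rotated)
{
	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);

	__count_zone_vm_events(PGREFILL, zone, nr_scanned);
	reclaim_stat->recent_rotated[file] += nr_rotated;
}

That way the accounting locals live in a leaf frame instead of piling
onto shrink_active_list()'s.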

Your idea of per-cpu allocated scan controls reminds me of an idea I have
had for some time now: moving reclaim into its own threads (per cpu?).

Not only would it separate the allocator's stack from the writeback stack,
we could also get rid of that too_many_isolated() workaround and coordinate
reclaim work better to prevent overreclaim.

But that is not a quick fix either...

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-15  2:37                 ` Johannes Weiner
@ 2010-04-15  2:43                   ` KOSAKI Motohiro
  2010-04-16 23:56                     ` Johannes Weiner
  0 siblings, 1 reply; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-15  2:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: kosaki.motohiro, Mel Gorman, Dave Chinner, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel

Hi

> On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> > They will need to be tackled in turn then but obviously there should be
> > a focus on the common paths. The reclaim paths do seem particularly
> > heavy and it's down to a lot of temporary variables. I might not get the
> > time today but what I'm going to try do some time this week is
> > 
> > o Look at what temporary variables are copies of other pieces of information
> > o See what variables live for the duration of reclaim but are not needed
> >   for all of it (i.e. uninline parts of it so variables do not persist)
> > o See if it's possible to dynamically allocate scan_control
> > 
> > The last one is the trickiest. Basically, the idea would be to move as much
> > into scan_control as possible. Then, instead of allocating it on the stack,
> > allocate a fixed number of them at boot-time (NR_CPU probably) protected by
> > a semaphore. Limit the number of direct reclaimers that can be active at a
> > time to the number of scan_control variables. kswapd could still allocate
> > its own on the stack or with kmalloc.
> > 
> > If it works out, it would have two main benefits. Limits the number of
> > processes in direct reclaim - if there is NR_CPU-worth of processes in direct
> > reclaim, there is too much going on. It would also shrink the stack usage
> > particularly if some of the stack variables are moved into scan_control.
> > 
> > Maybe someone will beat me to looking at the feasibility of this.
> 
> I already have some patches to remove trivial parts of struct scan_control,
> namely may_unmap, may_swap, all_unreclaimable and isolate_pages.  The rest
> needs a deeper look.

Seems interesting, but a scan_control diet is not so effective. How many
bytes can we save with it?


> A rather big offender in there is the combination of shrink_active_list (360
> bytes here) and shrink_page_list (200 bytes).  I am currently looking at
> breaking out all the accounting stuff from shrink_active_list into a separate
> leaf function so that the stack footprint does not add up.

pagevec: it consumes 128 bytes per struct. I have a patch removing it.
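
(That 128 bytes is just the struct itself; from include/linux/pagevec.h:

#define PAGEVEC_SIZE	14

struct pagevec {
	unsigned long nr;
	unsigned long cold;
	struct page *pages[PAGEVEC_SIZE];
};

i.e. 2 longs + 14 pointers = 16 + 112 = 128 bytes on 64-bit.)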


> Your idea of per-cpu allocated scan controls reminds me of an idea I have
> had for some time now: moving reclaim into its own threads (per cpu?).
> 
> Not only would it separate the allocator's stack from the writeback stack,
> we could also get rid of that too_many_isolated() workaround and coordinate
> reclaim work better to prevent overreclaim.
> 
> But that is not a quick fix either...

So, I hadn't thought of it this way. It probably seems good, but I'd like
to do the simple diet first.




^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-15  1:34                 ` Dave Chinner
@ 2010-04-15  4:09                   ` KOSAKI Motohiro
  2010-04-15  4:11                     ` [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd KOSAKI Motohiro
                                       ` (5 more replies)
  2010-04-15 10:28                   ` [PATCH] mm: disallow direct reclaim page writeback Mel Gorman
  2010-04-15 14:57                   ` Andi Kleen
  2 siblings, 6 replies; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-15  4:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: kosaki.motohiro, Mel Gorman, Chris Mason, linux-kernel, linux-mm,
	linux-fsdevel

Hi

> How about this? For now, we stop direct reclaim from doing writeback
> only on order zero allocations, but allow it for higher order
> allocations. That will prevent the majority of situations where
> direct reclaim blows the stack and interferes with background
> writeout, but won't cause lumpy reclaim to change behaviour.
> This reduces the scope of impact and hence testing and validation
> the needs to be done.

Tend to agree, but I would propose a slightly different algorithm for
avoiding incorrect OOMs.

for high order allocations
	allow lumpy reclaim and pageout() for both kswapd and direct reclaim

for low order allocations
	- kswapd:          always delegate IO to the flusher threads
	- direct reclaim:  delegate IO to the flusher threads only if vm pressure is low

This seems safer. I mean, who wants to see an incorrect OOM regression?
I've made some patches for this; I'll post them in another mail.
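
In page_check_references() terms, it is roughly this (sketch; the pressure
test is illustrative, not what the patches finally use):

	if (sc->order > 0)
		return PAGEREF_RECLAIM;		/* lumpy: writeout allowed */
	if (current_is_kswapd())
		return PAGEREF_RECLAIM_CLEAN;	/* kswapd: always delegate */
	if (priority > DEF_PRIORITY - 2)	/* vm pressure still low */
		return PAGEREF_RECLAIM_CLEAN;	/* delegate to flusher */
	return PAGEREF_RECLAIM;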

> Then we can work towards allowing lumpy reclaim to use background
> threads as Chris suggested for doing specific writeback operations
> to solve the remaining problems being seen. Does this seem like a
> reasonable compromise and approach to dealing with the problem?

Tend to agree. Probably we are now discussing the right approach, but
this definitely needs deep thinking, so I can't give an exact
answer yet.





^ permalink raw reply	[flat|nested] 115+ messages in thread

* [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd
  2010-04-15  4:09                   ` KOSAKI Motohiro
@ 2010-04-15  4:11                     ` KOSAKI Motohiro
  2010-04-15  8:05                       ` Suleiman Souhlal
                                         ` (2 more replies)
  2010-04-15  4:13                     ` [PATCH 2/4] vmscan: kill prev_priority completely KOSAKI Motohiro
                                       ` (4 subsequent siblings)
  5 siblings, 3 replies; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-15  4:11 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, Dave Chinner, Mel Gorman, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel

Now, vmscan pageout() is one of the sources of IO throughput degradation.
Some IO workloads do very large amounts of order-0 allocation and reclaim,
and pageout's 4K IOs cause lots of annoying seeks.

At least kswapd can avoid such pageout() calls, because kswapd doesn't
need to consider the OOM-killer situation; there's no risk.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3ff3311..d392a50 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -614,6 +614,13 @@ static enum page_references page_check_references(struct page *page,
 	if (referenced_page)
 		return PAGEREF_RECLAIM_CLEAN;
 
+	/*
+	 * Delegate pageout IO to the flusher threads. They can generate
+	 * a more effective IO pattern.
+	 */
+	if (current_is_kswapd())
+		return PAGEREF_RECLAIM_CLEAN;
+
 	return PAGEREF_RECLAIM;
 }
 
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH 2/4] vmscan: kill prev_priority completely
  2010-04-15  4:09                   ` KOSAKI Motohiro
  2010-04-15  4:11                     ` [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd KOSAKI Motohiro
@ 2010-04-15  4:13                     ` KOSAKI Motohiro
  2010-04-15  4:14                     ` [PATCH 3/4] vmscan: move priority variable into scan_control KOSAKI Motohiro
                                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-15  4:13 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, Dave Chinner, Mel Gorman, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel

This patch is not directly related to the patch series,
but [4/4] depends on scan_control having a `priority' member,
so I include it here.

=============================================
Since 2.6.28, zone->prev_priority has been unused, so it can be removed
safely. It reduces stack usage slightly.

Now I have to say that I'm sorry. Two years ago, I thought prev_priority
could be integrated again and would be useful, but four (or more) attempts
haven't produced good performance numbers, so I gave up on that approach.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/mmzone.h |   15 -------------
 mm/page_alloc.c        |    2 -
 mm/vmscan.c            |   54 ++---------------------------------------------
 mm/vmstat.c            |    2 -
 4 files changed, 3 insertions(+), 70 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cf9e458..ad76962 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -339,21 +339,6 @@ struct zone {
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
 
 	/*
-	 * prev_priority holds the scanning priority for this zone.  It is
-	 * defined as the scanning priority at which we achieved our reclaim
-	 * target at the previous try_to_free_pages() or balance_pgdat()
-	 * invocation.
-	 *
-	 * We use prev_priority as a measure of how much stress page reclaim is
-	 * under - it drives the swappiness decision: whether to unmap mapped
-	 * pages.
-	 *
-	 * Access to both this field is quite racy even on uniprocessor.  But
-	 * it is expected to average out OK.
-	 */
-	int prev_priority;
-
-	/*
 	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
 	 * this zone's LRU.  Maintained by the pageout code.
 	 */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d03c946..88513c0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3862,8 +3862,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		zone_seqlock_init(zone);
 		zone->zone_pgdat = pgdat;
 
-		zone->prev_priority = DEF_PRIORITY;
-
 		zone_pcp_init(zone);
 		for_each_lru(l) {
 			INIT_LIST_HEAD(&zone->lru[l].list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d392a50..dadb461 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1284,20 +1284,6 @@ done:
 }
 
 /*
- * We are about to scan this zone at a certain priority level.  If that priority
- * level is smaller (ie: more urgent) than the previous priority, then note
- * that priority level within the zone.  This is done so that when the next
- * process comes in to scan this zone, it will immediately start out at this
- * priority level rather than having to build up its own scanning priority.
- * Here, this priority affects only the reclaim-mapped threshold.
- */
-static inline void note_zone_scanning_priority(struct zone *zone, int priority)
-{
-	if (priority < zone->prev_priority)
-		zone->prev_priority = priority;
-}
-
-/*
  * This moves pages from the active list to the inactive list.
  *
  * We move them the other way if the page is referenced by one or more
@@ -1733,20 +1719,15 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 		if (scanning_global_lru(sc)) {
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
-			note_zone_scanning_priority(zone, priority);
-
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;	/* Let kswapd poll it */
 			sc->all_unreclaimable = 0;
-		} else {
+		} else
 			/*
 			 * Ignore cpuset limitation here. We just want to reduce
 			 * # of used pages by us regardless of memory shortage.
 			 */
 			sc->all_unreclaimable = 0;
-			mem_cgroup_note_reclaim_priority(sc->mem_cgroup,
-							priority);
-		}
 
 		shrink_zone(priority, zone, sc);
 	}
@@ -1852,17 +1833,11 @@ out:
 	if (priority < 0)
 		priority = 0;
 
-	if (scanning_global_lru(sc)) {
-		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-
+	if (scanning_global_lru(sc))
+		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
 
-			zone->prev_priority = priority;
-		}
-	} else
-		mem_cgroup_record_reclaim_priority(sc->mem_cgroup, priority);
-
 	delayacct_freepages_end();
 
 	return ret;
@@ -2015,22 +1990,12 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 		.mem_cgroup = NULL,
 		.isolate_pages = isolate_pages_global,
 	};
-	/*
-	 * temp_priority is used to remember the scanning priority at which
-	 * this zone was successfully refilled to
-	 * free_pages == high_wmark_pages(zone).
-	 */
-	int temp_priority[MAX_NR_ZONES];
-
 loop_again:
 	total_scanned = 0;
 	sc.nr_reclaimed = 0;
 	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);
 
-	for (i = 0; i < pgdat->nr_zones; i++)
-		temp_priority[i] = DEF_PRIORITY;
-
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 		unsigned long lru_pages = 0;
@@ -2098,9 +2063,7 @@ loop_again:
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;
 
-			temp_priority[i] = priority;
 			sc.nr_scanned = 0;
-			note_zone_scanning_priority(zone, priority);
 
 			nid = pgdat->node_id;
 			zid = zone_idx(zone);
@@ -2173,16 +2136,6 @@ loop_again:
 			break;
 	}
 out:
-	/*
-	 * Note within each zone the priority level at which this zone was
-	 * brought into a happy state.  So that the next thread which scans this
-	 * zone will start out at that priority level.
-	 */
-	for (i = 0; i < pgdat->nr_zones; i++) {
-		struct zone *zone = pgdat->node_zones + i;
-
-		zone->prev_priority = temp_priority[i];
-	}
 	if (!all_zones_ok) {
 		cond_resched();
 
@@ -2600,7 +2553,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		 */
 		priority = ZONE_RECLAIM_PRIORITY;
 		do {
-			note_zone_scanning_priority(zone, priority);
 			shrink_zone(priority, zone, &sc);
 			priority--;
 		} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index fa12ea3..2db0a0f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -761,11 +761,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 	}
 	seq_printf(m,
 		   "\n  all_unreclaimable: %u"
-		   "\n  prev_priority:     %i"
 		   "\n  start_pfn:         %lu"
 		   "\n  inactive_ratio:    %u",
 		   zone->all_unreclaimable,
-		   zone->prev_priority,
 		   zone->zone_start_pfn,
 		   zone->inactive_ratio);
 	seq_putc(m, '\n');
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH 3/4] vmscan: move priority variable into scan_control
  2010-04-15  4:09                   ` KOSAKI Motohiro
  2010-04-15  4:11                     ` [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd KOSAKI Motohiro
  2010-04-15  4:13                     ` [PATCH 2/4] vmscan: kill prev_priority completely KOSAKI Motohiro
@ 2010-04-15  4:14                     ` KOSAKI Motohiro
  2010-04-15  4:15                     ` [PATCH 4/4] vmscan: delegate page cleaning io to flusher thread if VM pressure is low KOSAKI Motohiro
                                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-15  4:14 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, Dave Chinner, Mel Gorman, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel

ditto

This patch is not directly related to the patch series, but [4/4]
depends on scan_control having a `priority' member, so I'm including
it here.
=========================================

Quite a lot of functions in vmscan now take a `priority' argument,
which consumes some stack. Moving it into struct scan_control reduces
the stack usage.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |   83 ++++++++++++++++++++++++++--------------------------------
 1 files changed, 37 insertions(+), 46 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index dadb461..8b78b49 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -77,6 +77,8 @@ struct scan_control {
 
 	int order;
 
+	int priority;
+
 	/* Which cgroup do we reclaim from */
 	struct mem_cgroup *mem_cgroup;
 
@@ -1130,7 +1132,7 @@ static int too_many_isolated(struct zone *zone, int file,
  */
 static unsigned long shrink_inactive_list(unsigned long max_scan,
 			struct zone *zone, struct scan_control *sc,
-			int priority, int file)
+			int file)
 {
 	LIST_HEAD(page_list);
 	struct pagevec pvec;
@@ -1156,7 +1158,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 	 */
 	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
 		lumpy_reclaim = 1;
-	else if (sc->order && priority < DEF_PRIORITY - 2)
+	else if (sc->order && sc->priority < DEF_PRIORITY - 2)
 		lumpy_reclaim = 1;
 
 	pagevec_init(&pvec, 1);
@@ -1335,7 +1337,7 @@ static void move_active_pages_to_lru(struct zone *zone,
 }
 
 static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
-			struct scan_control *sc, int priority, int file)
+			struct scan_control *sc, int file)
 {
 	unsigned long nr_taken;
 	unsigned long pgscanned;
@@ -1498,17 +1500,17 @@ static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
 }
 
 static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
-	struct zone *zone, struct scan_control *sc, int priority)
+	struct zone *zone, struct scan_control *sc)
 {
 	int file = is_file_lru(lru);
 
 	if (is_active_lru(lru)) {
 		if (inactive_list_is_low(zone, sc, file))
-		    shrink_active_list(nr_to_scan, zone, sc, priority, file);
+		    shrink_active_list(nr_to_scan, zone, sc, file);
 		return 0;
 	}
 
-	return shrink_inactive_list(nr_to_scan, zone, sc, priority, file);
+	return shrink_inactive_list(nr_to_scan, zone, sc, file);
 }
 
 /*
@@ -1615,8 +1617,7 @@ static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
 /*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
-static void shrink_zone(int priority, struct zone *zone,
-				struct scan_control *sc)
+static void shrink_zone(struct zone *zone, struct scan_control *sc)
 {
 	unsigned long nr[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
@@ -1640,8 +1641,8 @@ static void shrink_zone(int priority, struct zone *zone,
 		unsigned long scan;
 
 		scan = zone_nr_lru_pages(zone, sc, l);
-		if (priority || noswap) {
-			scan >>= priority;
+		if (sc->priority || noswap) {
+			scan >>= sc->priority;
 			scan = (scan * percent[file]) / 100;
 		}
 		nr[l] = nr_scan_try_batch(scan,
@@ -1657,7 +1658,7 @@ static void shrink_zone(int priority, struct zone *zone,
 				nr[l] -= nr_to_scan;
 
 				nr_reclaimed += shrink_list(l, nr_to_scan,
-							    zone, sc, priority);
+							    zone, sc);
 			}
 		}
 		/*
@@ -1668,7 +1669,8 @@ static void shrink_zone(int priority, struct zone *zone,
 		 * with multiple processes reclaiming pages, the total
 		 * freeing target can get unreasonably large.
 		 */
-		if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY)
+		if (nr_reclaimed >= nr_to_reclaim &&
+		    sc->priority < DEF_PRIORITY)
 			break;
 	}
 
@@ -1679,7 +1681,7 @@ static void shrink_zone(int priority, struct zone *zone,
 	 * rebalance the anon lru active/inactive ratio.
 	 */
 	if (inactive_anon_is_low(zone, sc) && nr_swap_pages > 0)
-		shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
+		shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, 0);
 
 	throttle_vm_writeout(sc->gfp_mask);
 }
@@ -1700,8 +1702,7 @@ static void shrink_zone(int priority, struct zone *zone,
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
  */
-static void shrink_zones(int priority, struct zonelist *zonelist,
-					struct scan_control *sc)
+static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 {
 	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
 	struct zoneref *z;
@@ -1719,7 +1720,8 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 		if (scanning_global_lru(sc)) {
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
-			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+			if (zone->all_unreclaimable &&
+			    sc->priority != DEF_PRIORITY)
 				continue;	/* Let kswapd poll it */
 			sc->all_unreclaimable = 0;
 		} else
@@ -1729,7 +1731,7 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 			 */
 			sc->all_unreclaimable = 0;
 
-		shrink_zone(priority, zone, sc);
+		shrink_zone(zone, sc);
 	}
 }
 
@@ -1752,7 +1754,6 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 					struct scan_control *sc)
 {
-	int priority;
 	unsigned long ret = 0;
 	unsigned long total_scanned = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
@@ -1779,11 +1780,11 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		}
 	}
 
-	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+	for (sc->priority = DEF_PRIORITY; sc->priority >= 0; sc->priority--) {
 		sc->nr_scanned = 0;
-		if (!priority)
+		if (!sc->priority)
 			disable_swap_token();
-		shrink_zones(priority, zonelist, sc);
+		shrink_zones(zonelist, sc);
 		/*
 		 * Don't shrink slabs when reclaiming memory from
 		 * over limit cgroups
@@ -1816,23 +1817,14 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 		/* Take a nap, wait for some writeback to complete */
 		if (!sc->hibernation_mode && sc->nr_scanned &&
-		    priority < DEF_PRIORITY - 2)
+		    sc->priority < DEF_PRIORITY - 2)
 			congestion_wait(BLK_RW_ASYNC, HZ/10);
 	}
 	/* top priority shrink_zones still had more to do? don't OOM, then */
 	if (!sc->all_unreclaimable && scanning_global_lru(sc))
 		ret = sc->nr_reclaimed;
-out:
-	/*
-	 * Now that we've scanned all the zones at this priority level, note
-	 * that level within the zone so that the next thread which performs
-	 * scanning of this zone will immediately start out at this priority
-	 * level.  This affects only the decision whether or not to bring
-	 * mapped pages onto the inactive list.
-	 */
-	if (priority < 0)
-		priority = 0;
 
+out:
 	if (scanning_global_lru(sc))
 		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
@@ -1892,7 +1884,8 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 	 * will pick up pages from other mem cgroup's as well. We hack
 	 * the priority and make it zero.
 	 */
-	shrink_zone(0, zone, &sc);
+	sc.priority = 0;
+	shrink_zone(zone, &sc);
 	return sc.nr_reclaimed;
 }
 
@@ -1972,7 +1965,6 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 {
 	int all_zones_ok;
-	int priority;
 	int i;
 	unsigned long total_scanned;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
@@ -1996,13 +1988,13 @@ loop_again:
 	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);
 
-	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+	for (sc.priority = DEF_PRIORITY; sc.priority >= 0; sc.priority--) {
 		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 		unsigned long lru_pages = 0;
 		int has_under_min_watermark_zone = 0;
 
 		/* The swap token gets in the way of swapout... */
-		if (!priority)
+		if (!sc.priority)
 			disable_swap_token();
 
 		all_zones_ok = 1;
@@ -2017,7 +2009,7 @@ loop_again:
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+			if (zone->all_unreclaimable && sc.priority != DEF_PRIORITY)
 				continue;
 
 			/*
@@ -2026,7 +2018,7 @@ loop_again:
 			 */
 			if (inactive_anon_is_low(zone, &sc))
 				shrink_active_list(SWAP_CLUSTER_MAX, zone,
-							&sc, priority, 0);
+							&sc, 0);
 
 			if (!zone_watermark_ok(zone, order,
 					high_wmark_pages(zone), 0, 0)) {
@@ -2060,7 +2052,7 @@ loop_again:
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+			if (zone->all_unreclaimable && sc.priority != DEF_PRIORITY)
 				continue;
 
 			sc.nr_scanned = 0;
@@ -2079,7 +2071,7 @@ loop_again:
 			 */
 			if (!zone_watermark_ok(zone, order,
 					8*high_wmark_pages(zone), end_zone, 0))
-				shrink_zone(priority, zone, &sc);
+				shrink_zone(zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
 						lru_pages);
@@ -2119,7 +2111,7 @@ loop_again:
 		 * OK, kswapd is getting into trouble.  Take a nap, then take
 		 * another pass across the zones.
 		 */
-		if (total_scanned && (priority < DEF_PRIORITY - 2)) {
+		if (total_scanned && (sc.priority < DEF_PRIORITY - 2)) {
 			if (has_under_min_watermark_zone)
 				count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
 			else
@@ -2520,7 +2512,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	const unsigned long nr_pages = 1 << order;
 	struct task_struct *p = current;
 	struct reclaim_state reclaim_state;
-	int priority;
 	struct scan_control sc = {
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2551,11 +2542,11 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		 * Free memory by calling shrink zone with increasing
 		 * priorities until we have enough memory freed.
 		 */
-		priority = ZONE_RECLAIM_PRIORITY;
+		sc.priority = ZONE_RECLAIM_PRIORITY;
 		do {
-			shrink_zone(priority, zone, &sc);
-			priority--;
-		} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
+			shrink_zone(zone, &sc);
+			sc.priority--;
+		} while (sc.priority >= 0 && sc.nr_reclaimed < nr_pages);
 	}
 
 	slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
-- 
1.6.5.2






^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH 4/4] vmscan: delegate page cleaning io to flusher thread if VM pressure is low
  2010-04-15  4:09                   ` KOSAKI Motohiro
                                       ` (2 preceding siblings ...)
  2010-04-15  4:14                     ` [PATCH 3/4] vmscan: move priority variable into scan_control KOSAKI Motohiro
@ 2010-04-15  4:15                     ` KOSAKI Motohiro
  2010-04-15  4:35                     ` [PATCH] mm: disallow direct reclaim page writeback KOSAKI Motohiro
  2010-04-15  6:20                     ` Dave Chinner
  5 siblings, 0 replies; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-15  4:15 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, Dave Chinner, Mel Gorman, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel

Even if pageout() is called from direct reclaim, we can delegate the IO
to the flusher threads when VM pressure is low.
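
For reference, sc->priority starts at DEF_PRIORITY (which is 12) and
falls as reclaim grows more desperate, so the check added below only
fires during the first half of the scan passes. Annotated restatement
of the hunk:

	/* DEF_PRIORITY is 12, so this fires only while sc->priority
	 * is still 7..12, i.e. while VM pressure is low. */
	if (!sc->order && sc->priority > DEF_PRIORITY/2)
		return PAGEREF_RECLAIM_CLEAN;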

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8b78b49..eab6028 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -623,6 +623,13 @@ static enum page_references page_check_references(struct page *page,
 	if (current_is_kswapd())
 		return PAGEREF_RECLAIM_CLEAN;
 
+	/*
+	 * Now VM pressure is not so high. then we can delegate
+	 * page cleaning to flusher thread safely.
+	 */
+	if (!sc->order && sc->priority > DEF_PRIORITY/2)
+		return PAGEREF_RECLAIM_CLEAN;
+
 	return PAGEREF_RECLAIM;
 }
 
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-15  4:09                   ` KOSAKI Motohiro
                                       ` (3 preceding siblings ...)
  2010-04-15  4:15                     ` [PATCH 4/4] vmscan: delegate page cleaning io to flusher thread if VM pressure is low KOSAKI Motohiro
@ 2010-04-15  4:35                     ` KOSAKI Motohiro
  2010-04-15  6:32                       ` Dave Chinner
  2010-04-15  6:20                     ` Dave Chinner
  5 siblings, 1 reply; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-15  4:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: kosaki.motohiro, Mel Gorman, Chris Mason, linux-kernel, linux-mm,
	linux-fsdevel

> Hi
> 
> > How about this? For now, we stop direct reclaim from doing writeback
> > only on order zero allocations, but allow it for higher order
> > allocations. That will prevent the majority of situations where
> > direct reclaim blows the stack and interferes with background
> > writeout, but won't cause lumpy reclaim to change behaviour.
> > This reduces the scope of impact and hence testing and validation
> > the needs to be done.
> 
> Tend to agree. but I would proposed slightly different algorithm for
> avoind incorrect oom.
> 
> for high order allocation
> 	allow to use lumpy reclaim and pageout() for both kswapd and direct reclaim
> 
> for low order allocation
> 	- kswapd:          always delegate io to flusher thread
> 	- direct reclaim:  delegate io to flusher thread only if vm pressure is low
> 
> This seems more safely. I mean Who want see incorrect oom regression?
> I've made some pathes for this. I'll post it as another mail.

Right now, a kernel compile and/or a backup operation seems to keep
nr_vmscan_write == 0. Dave, can you please try running your
pageout-annoying workload?
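
(nr_vmscan_write here is the counter from /proc/vmstat, e.g.:

# grep nr_vmscan_write /proc/vmstat

before and after the run.)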




^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-15  4:09                   ` KOSAKI Motohiro
                                       ` (4 preceding siblings ...)
  2010-04-15  4:35                     ` [PATCH] mm: disallow direct reclaim page writeback KOSAKI Motohiro
@ 2010-04-15  6:20                     ` Dave Chinner
  2010-04-15  6:35                       ` KOSAKI Motohiro
  5 siblings, 1 reply; 115+ messages in thread
From: Dave Chinner @ 2010-04-15  6:20 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Chris Mason, linux-kernel, linux-mm, linux-fsdevel

On Thu, Apr 15, 2010 at 01:09:01PM +0900, KOSAKI Motohiro wrote:
> Hi
> 
> > How about this? For now, we stop direct reclaim from doing writeback
> > only on order zero allocations, but allow it for higher order
> > allocations. That will prevent the majority of situations where
> > direct reclaim blows the stack and interferes with background
> > writeout, but won't cause lumpy reclaim to change behaviour.
> > This reduces the scope of impact and hence testing and validation
> > the needs to be done.
> 
> Tend to agree. but I would proposed slightly different algorithm for
> avoind incorrect oom.
> 
> for high order allocation
> 	allow to use lumpy reclaim and pageout() for both kswapd and direct reclaim

So, same as current.

> for low order allocation
> 	- kswapd:          always delegate io to flusher thread
> 	- direct reclaim:  delegate io to flusher thread only if vm pressure is low

IMO, this really doesn't fix either of the problems - the bad IO
patterns nor the stack usage. All it will take is a bit more memory
pressure to trigger stack and IO problems, and the user reporting the
problems is generating an awful lot of memory pressure...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-15  4:35                     ` [PATCH] mm: disallow direct reclaim page writeback KOSAKI Motohiro
@ 2010-04-15  6:32                       ` Dave Chinner
  2010-04-15  6:44                         ` KOSAKI Motohiro
  0 siblings, 1 reply; 115+ messages in thread
From: Dave Chinner @ 2010-04-15  6:32 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Chris Mason, linux-kernel, linux-mm, linux-fsdevel

On Thu, Apr 15, 2010 at 01:35:17PM +0900, KOSAKI Motohiro wrote:
> > Hi
> > 
> > > How about this? For now, we stop direct reclaim from doing writeback
> > > only on order zero allocations, but allow it for higher order
> > > allocations. That will prevent the majority of situations where
> > > direct reclaim blows the stack and interferes with background
> > > writeout, but won't cause lumpy reclaim to change behaviour.
> > > This reduces the scope of impact and hence testing and validation
> > > the needs to be done.
> > 
> > Tend to agree. but I would proposed slightly different algorithm for
> > avoind incorrect oom.
> > 
> > for high order allocation
> > 	allow to use lumpy reclaim and pageout() for both kswapd and direct reclaim
> > 
> > for low order allocation
> > 	- kswapd:          always delegate io to flusher thread
> > 	- direct reclaim:  delegate io to flusher thread only if vm pressure is low
> > 
> > This seems more safely. I mean Who want see incorrect oom regression?
> > I've made some pathes for this. I'll post it as another mail.
> 
> Now, kernel compile and/or backup operation seems keep nr_vmscan_write==0.
> Dave, can you please try to run your pageout annoying workload?

It's just as easy for you to run and observe the effects. Start with a VM
with 1GB RAM and a 10GB scratch block device:

# mkfs.xfs -f /dev/<blah>
# mount -o logbsize=262144,nobarrier /dev/<blah> /mnt/scratch

in one shell:

# while [ 1 ]; do dd if=/dev/zero of=/mnt/scratch/foo bs=1024k ; done

in another shell, if you have fs_mark installed, run:

# ./fs_mark -S0 -n 100000 -F -s 0 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/3 -d /mnt/scratch/2 &

otherwise run a couple of these in parallel on different directories:

# for i in `seq 1 1 100000`; do echo > /mnt/scratch/0/foo.$i ; done

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-15  6:20                     ` Dave Chinner
@ 2010-04-15  6:35                       ` KOSAKI Motohiro
  2010-04-15  8:54                         ` Dave Chinner
  0 siblings, 1 reply; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-15  6:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: kosaki.motohiro, Mel Gorman, Chris Mason, linux-kernel, linux-mm,
	linux-fsdevel

> On Thu, Apr 15, 2010 at 01:09:01PM +0900, KOSAKI Motohiro wrote:
> > Hi
> > 
> > > How about this? For now, we stop direct reclaim from doing writeback
> > > only on order zero allocations, but allow it for higher order
> > > allocations. That will prevent the majority of situations where
> > > direct reclaim blows the stack and interferes with background
> > > writeout, but won't cause lumpy reclaim to change behaviour.
> > > This reduces the scope of impact and hence testing and validation
> > > the needs to be done.
> > 
> > Tend to agree. but I would proposed slightly different algorithm for
> > avoind incorrect oom.
> > 
> > for high order allocation
> > 	allow to use lumpy reclaim and pageout() for both kswapd and direct reclaim
> 
> SO same as current.

Yes, the same as you proposed.

> 
> > for low order allocation
> > 	- kswapd:          always delegate io to flusher thread
> > 	- direct reclaim:  delegate io to flusher thread only if vm pressure is low
> 
> IMO, this really doesn't fix either of the problems - the bad IO
> patterns nor the stack usage. All it will take is a bit more memory
> pressure to trigger stack and IO problems, and the user reporting the
> problems is generating an awful lot of memory pressure...

This patch doesn't address stack usage, because
  - again, I think every stack eater should go on a diet.
  - in a world that allows lumpy reclaim, denying only low order
    reclaim doesn't solve anything.

Please don't forget that a priority=0 reclaim failure invokes the
OOM-killer. I can't imagine anyone wants that.

And which IO workload triggers vmscan at priority < 6?




^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-15  6:32                       ` Dave Chinner
@ 2010-04-15  6:44                         ` KOSAKI Motohiro
  2010-04-15  6:58                           ` Dave Chinner
  0 siblings, 1 reply; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-15  6:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: kosaki.motohiro, Mel Gorman, Chris Mason, linux-kernel, linux-mm,
	linux-fsdevel

> On Thu, Apr 15, 2010 at 01:35:17PM +0900, KOSAKI Motohiro wrote:
> > > Hi
> > > 
> > > > How about this? For now, we stop direct reclaim from doing writeback
> > > > only on order zero allocations, but allow it for higher order
> > > > allocations. That will prevent the majority of situations where
> > > > direct reclaim blows the stack and interferes with background
> > > > writeout, but won't cause lumpy reclaim to change behaviour.
> > > > This reduces the scope of impact and hence testing and validation
> > > > the needs to be done.
> > > 
> > > Tend to agree. but I would proposed slightly different algorithm for
> > > avoind incorrect oom.
> > > 
> > > for high order allocation
> > > 	allow to use lumpy reclaim and pageout() for both kswapd and direct reclaim
> > > 
> > > for low order allocation
> > > 	- kswapd:          always delegate io to flusher thread
> > > 	- direct reclaim:  delegate io to flusher thread only if vm pressure is low
> > > 
> > > This seems more safely. I mean Who want see incorrect oom regression?
> > > I've made some pathes for this. I'll post it as another mail.
> > 
> > Now, kernel compile and/or backup operation seems keep nr_vmscan_write==0.
> > Dave, can you please try to run your pageout annoying workload?
> 
> It's just as easy for you to run and observe the effects. Start with a VM
> with 1GB RAM and a 10GB scratch block device:
> 
> # mkfs.xfs -f /dev/<blah>
> # mount -o logbsize=262144,nobarrier /dev/<blah> /mnt/scratch
> 
> in one shell:
> 
> # while [ 1 ]; do dd if=/dev/zero of=/mnt/scratch/foo bs=1024k ; done
> 
> in another shell, if you have fs_mark installed, run:
> 
> # ./fs_mark -S0 -n 100000 -F -s 0 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/3 -d /mnt/scratch/2 &
> 
> otherwise run a couple of these in parallel on different directories:
> 
> # for i in `seq 1 1 100000`; do echo > /mnt/scratch/0/foo.$i ; done

Thanks.

Unfortunately, I don't have any unused disks, so I'll (probably) try it
next week.





^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-15  6:44                         ` KOSAKI Motohiro
@ 2010-04-15  6:58                           ` Dave Chinner
  0 siblings, 0 replies; 115+ messages in thread
From: Dave Chinner @ 2010-04-15  6:58 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Chris Mason, linux-kernel, linux-mm, linux-fsdevel

On Thu, Apr 15, 2010 at 03:44:50PM +0900, KOSAKI Motohiro wrote:
> > > Now, kernel compile and/or backup operation seems keep nr_vmscan_write==0.
> > > Dave, can you please try to run your pageout annoying workload?
> > 
> > It's just as easy for you to run and observe the effects. Start with a VM
> > with 1GB RAM and a 10GB scratch block device:
> > 
> > # mkfs.xfs -f /dev/<blah>
> > # mount -o logbsize=262144,nobarrier /dev/<blah> /mnt/scratch
> > 
> > in one shell:
> > 
> > # while [ 1 ]; do dd if=/dev/zero of=/mnt/scratch/foo bs=1024k ; done
> > 
> > in another shell, if you have fs_mark installed, run:
> > 
> > # ./fs_mark -S0 -n 100000 -F -s 0 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/3 -d /mnt/scratch/2 &
> > 
> > otherwise run a couple of these in parallel on different directories:
> > 
> > # for i in `seq 1 1 100000`; do echo > /mnt/scratch/0/foo.$i ; done
> 
> Thanks.
> 
> Unfortunately, I don't have unused disks. So, I'll try it at (probably)
> next week.

A filesystem on a loopback device will work just as well ;)
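
For example (paths and sizes are just illustrative):

# dd if=/dev/zero of=/tmp/scratch.img bs=1024k count=1 seek=10239
# losetup /dev/loop0 /tmp/scratch.img
# mkfs.xfs -f /dev/loop0
# mount -o logbsize=262144,nobarrier /dev/loop0 /mnt/scratch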

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd
  2010-04-15  4:11                     ` [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd KOSAKI Motohiro
@ 2010-04-15  8:05                       ` Suleiman Souhlal
  2010-04-15  8:17                         ` KOSAKI Motohiro
  2010-04-15  9:32                         ` Dave Chinner
  2010-04-15  8:18                       ` KOSAKI Motohiro
  2010-04-15 10:31                       ` Mel Gorman
  2 siblings, 2 replies; 115+ messages in thread
From: Suleiman Souhlal @ 2010-04-15  8:05 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Dave Chinner, Mel Gorman, Chris Mason, linux-kernel, linux-mm,
	linux-fsdevel, suleiman


On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:

> Now, vmscan pageout() is one of IO throuput degression source.
> Some IO workload makes very much order-0 allocation and reclaim
> and pageout's 4K IOs are making annoying lots seeks.
>
> At least, kswapd can avoid such pageout() because kswapd don't
> need to consider OOM-Killer situation. that's no risk.
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

What's your opinion on trying to cluster the writes done by pageout,  
instead of not doing any paging out in kswapd?
Something along these lines:

     Cluster writes to disk due to memory pressure.

     Write out pages logically adjacent to the one we're paging out,
     so that we may get better IOs in these situations: such pages are
     likely to be contiguous on disk with the one we're writing out,
     so they should get merged into a single disk IO.

     Signed-off-by: Suleiman Souhlal <suleiman@google.com>

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c26986c..4e5a613 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -48,6 +48,8 @@

  #include "internal.h"

+#define PAGEOUT_CLUSTER_PAGES	16
+
  struct scan_control {
  	/* Incremented by the number of inactive pages that were scanned */
  	unsigned long nr_scanned;
@@ -350,6 +352,8 @@ typedef enum {
  static pageout_t pageout(struct page *page, struct address_space *mapping,
  						enum pageout_io sync_writeback)
  {
+	int i;
+
  	/*
  	 * If the page is dirty, only perform writeback if that write
  	 * will be non-blocking.  To prevent this allocation from being
@@ -408,6 +412,37 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
  		}

  		/*
+		 * Try to write out logically adjacent dirty pages too, if
+		 * possible, to get better IOs, as the IO scheduler should
+		 * merge them with the original one, if the file is not too
+		 * fragmented.
+		 */
+		for (i = 1; i < PAGEOUT_CLUSTER_PAGES; i++) {
+			struct page *p2;
+			int err;
+
+			p2 = find_get_page(mapping, page->index + i);
+			if (p2) {
+				if (trylock_page(p2) == 0) {
+					page_cache_release(p2);
+					break;
+				}
+				if (page_mapped(p2))
+					try_to_unmap(p2, 0);
+				if (PageDirty(p2)) {
+					err = write_one_page(p2, 0);
+					page_cache_release(p2);
+					if (err)
+						break;
+				} else {
+					unlock_page(p2);
+					page_cache_release(p2);
+					break;
+				}
+			}
+		}
+
+		/*
  		 * Wait on writeback if requested to. This happens when
  		 * direct reclaiming a large contiguous area and the
  		 * first attempt to free a range of pages fails.


^ permalink raw reply related	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd
  2010-04-15  8:05                       ` Suleiman Souhlal
@ 2010-04-15  8:17                         ` KOSAKI Motohiro
  2010-04-15  8:26                           ` KOSAKI Motohiro
  2010-04-15  9:32                         ` Dave Chinner
  1 sibling, 1 reply; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-15  8:17 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: kosaki.motohiro, Dave Chinner, Mel Gorman, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel, suleiman

> 
> On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
> 
> > Now, vmscan pageout() is one of IO throuput degression source.
> > Some IO workload makes very much order-0 allocation and reclaim
> > and pageout's 4K IOs are making annoying lots seeks.
> >
> > At least, kswapd can avoid such pageout() because kswapd don't
> > need to consider OOM-Killer situation. that's no risk.
> >
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> What's your opinion on trying to cluster the writes done by pageout,  
> instead of not doing any paging out in kswapd?
> Something along these lines:

Interesting.
I'd like to review your patch carefully; can you please give me one
day? :)


> 
>      Cluster writes to disk due to memory pressure.
> 
>      Write out logically adjacent pages to the one we're paging out
>      so that we may get better IOs in these situations:
>      These pages are likely to be contiguous on disk to the one we're
>      writing out, so they should get merged into a single disk IO.
> 
>      Signed-off-by: Suleiman Souhlal <suleiman@google.com>





^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd
  2010-04-15  4:11                     ` [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd KOSAKI Motohiro
  2010-04-15  8:05                       ` Suleiman Souhlal
@ 2010-04-15  8:18                       ` KOSAKI Motohiro
  2010-04-15 10:31                       ` Mel Gorman
  2 siblings, 0 replies; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-15  8:18 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, Dave Chinner, Mel Gorman, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel

> Now, vmscan pageout() is one of IO throuput degression source.
> Some IO workload makes very much order-0 allocation and reclaim
> and pageout's 4K IOs are making annoying lots seeks.
> 
> At least, kswapd can avoid such pageout() because kswapd don't
> need to consider OOM-Killer situation. that's no risk.

I've found one bug in this patch myself: the flusher threads don't
page out anon pages, so we need a PageAnon() check ;)
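
A minimal sketch of that fix on top of this patch (untested; assumes
the check stays in page_check_references()):

	/*
	 * Delegate pageout IO to the flusher threads; they can generate
	 * a more effective IO pattern. But the flusher threads only
	 * write file pages, so anon pages must still be written here,
	 * or they could never be pushed to swap.
	 */
	if (!PageAnon(page) && current_is_kswapd())
		return PAGEREF_RECLAIM_CLEAN;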



> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  mm/vmscan.c |    7 +++++++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3ff3311..d392a50 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -614,6 +614,13 @@ static enum page_references page_check_references(struct page *page,
>  	if (referenced_page)
>  		return PAGEREF_RECLAIM_CLEAN;
>  
> +	/*
> +	 * Delegate pageout IO to flusher thread. They can make more
> +	 * effective IO pattern.
> +	 */
> +	if (current_is_kswapd())
> +		return PAGEREF_RECLAIM_CLEAN;
> +
>  	return PAGEREF_RECLAIM;
>  }
>  
> -- 
> 1.6.5.2
> 
> 
> 




^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd
  2010-04-15  8:17                         ` KOSAKI Motohiro
@ 2010-04-15  8:26                           ` KOSAKI Motohiro
  2010-04-15 10:30                             ` Johannes Weiner
  0 siblings, 1 reply; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-15  8:26 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: kosaki.motohiro, Suleiman Souhlal, Dave Chinner, Mel Gorman,
	Chris Mason, linux-kernel, linux-mm, linux-fsdevel, suleiman

Cc to Johannes

> > 
> > On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
> > 
> > > Now, vmscan pageout() is one of IO throuput degression source.
> > > Some IO workload makes very much order-0 allocation and reclaim
> > > and pageout's 4K IOs are making annoying lots seeks.
> > >
> > > At least, kswapd can avoid such pageout() because kswapd don't
> > > need to consider OOM-Killer situation. that's no risk.
> > >
> > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > 
> > What's your opinion on trying to cluster the writes done by pageout,  
> > instead of not doing any paging out in kswapd?
> > Something along these lines:
> 
> Interesting. 
> So, I'd like to review your patch carefully. can you please give me one
> day? :)

Hannes, if I remember correctly, you tried similar swap-cluster IO a
long time ago, but now I can't remember why we didn't merge such a patch.
Do you remember anything?


> 
> 
> > 
> >      Cluster writes to disk due to memory pressure.
> > 
> >      Write out logically adjacent pages to the one we're paging out
> >      so that we may get better IOs in these situations:
> >      These pages are likely to be contiguous on disk to the one we're
> >      writing out, so they should get merged into a single disk IO.
> > 
> >      Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> 
> 
> 
> 




^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-15  6:35                       ` KOSAKI Motohiro
@ 2010-04-15  8:54                         ` Dave Chinner
  2010-04-15 10:21                           ` KOSAKI Motohiro
  0 siblings, 1 reply; 115+ messages in thread
From: Dave Chinner @ 2010-04-15  8:54 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Chris Mason, linux-kernel, linux-mm, linux-fsdevel

On Thu, Apr 15, 2010 at 03:35:14PM +0900, KOSAKI Motohiro wrote:
> > On Thu, Apr 15, 2010 at 01:09:01PM +0900, KOSAKI Motohiro wrote:
> > > Hi
> > > 
> > > > How about this? For now, we stop direct reclaim from doing writeback
> > > > only on order zero allocations, but allow it for higher order
> > > > allocations. That will prevent the majority of situations where
> > > > direct reclaim blows the stack and interferes with background
> > > > writeout, but won't cause lumpy reclaim to change behaviour.
> > > > This reduces the scope of impact and hence testing and validation
> > > > the needs to be done.
> > > 
> > > Tend to agree. but I would proposed slightly different algorithm for
> > > avoind incorrect oom.
> > > 
> > > for high order allocation
> > > 	allow to use lumpy reclaim and pageout() for both kswapd and direct reclaim
> > 
> > SO same as current.
> 
> Yes. as same as you propsed.
> 
> > 
> > > for low order allocation
> > > 	- kswapd:          always delegate io to flusher thread
> > > 	- direct reclaim:  delegate io to flusher thread only if vm pressure is low
> > 
> > IMO, this really doesn't fix either of the problems - the bad IO
> > patterns nor the stack usage. All it will take is a bit more memory
> > pressure to trigger stack and IO problems, and the user reporting the
> > problems is generating an awful lot of memory pressure...
> 
> This patch doesn't care stack usage. because
>   - again, I think all stack eater shold be diet.

Agreed (again), but we've already come to the conclusion that a
stack diet is not enough.

>   - under allowing lumpy reclaim world, only deny low order reclaim
>     doesn't solve anything.

Yes, I suggested it *as a first step*, not as the end goal. Your
patches don't reach the first step, which is fixing the reported
stack problem for order-0 allocations...

> Please don't forget priority=0 recliam failure incvoke OOM-killer.
> I don't imagine anyone want it.

Given that I haven't been able to trigger OOM without writeback from
direct reclaim so far (*) I'm not finding any evidence that it is a
problem or that there are regressions.  I want to be able to say
that this change has no known regressions. I want to find the
regressions and work to fix them, but without test cases there's no
way I can do this.

This is what I'm getting frustrated about - I want to fix this
problem once and for all, but I can't find out what I need to do to
robustly test such a change so we can have a high degree of
confidence that it doesn't introduce major regressions. Can anyone
help here?

(*) except in one case I've already described where it managed to
allocate enough huge pages to starve the system of order zero pages,
which is what I asked it to do.

> And, Which IO workload trigger <6 priority vmscan?

You're asking me? I've been asking you for workloads that wind up
reclaim priority.... :/

All I can say is that the most common trigger I see for OOM is
copying a large file on a busy system that is running off a single
spindle.  When that happens on my laptop I walk away and get a cup
of coffee when that happens and when I come back I pick up all the
broken bits the OOM killer left behind.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd
  2010-04-15  8:05                       ` Suleiman Souhlal
  2010-04-15  8:17                         ` KOSAKI Motohiro
@ 2010-04-15  9:32                         ` Dave Chinner
  2010-04-15  9:41                           ` KOSAKI Motohiro
  2010-04-15 17:27                           ` Suleiman Souhlal
  1 sibling, 2 replies; 115+ messages in thread
From: Dave Chinner @ 2010-04-15  9:32 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: KOSAKI Motohiro, Mel Gorman, Chris Mason, linux-kernel, linux-mm,
	linux-fsdevel, suleiman

On Thu, Apr 15, 2010 at 01:05:57AM -0700, Suleiman Souhlal wrote:
> 
> On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
> 
> >Now, vmscan pageout() is one of IO throuput degression source.
> >Some IO workload makes very much order-0 allocation and reclaim
> >and pageout's 4K IOs are making annoying lots seeks.
> >
> >At least, kswapd can avoid such pageout() because kswapd don't
> >need to consider OOM-Killer situation. that's no risk.
> >
> >Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> What's your opinion on trying to cluster the writes done by pageout,
> instead of not doing any paging out in kswapd?

XFS already does this in ->writepage to try to minimise the impact
of the way pageout issues IO. It helps, but it is still not as good
as having all the writeback come from the flusher threads because
it's still pretty much random IO.

And, FWIW, it doesn't solve the stack usage problems, either. In
fact, it will make them worse as write_one_page() puts another
struct writeback_control on the stack...
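
For reference, write_one_page() looks roughly like this (quoted from
memory and trimmed; note the second on-stack writeback_control):

int write_one_page(struct page *page, int wait)
{
	struct address_space *mapping = page->mapping;
	int ret = 0;
	struct writeback_control wbc = {
		.sync_mode = WB_SYNC_ALL,
		.nr_to_write = 1,
	};

	BUG_ON(!PageLocked(page));

	if (wait)
		wait_on_page_writeback(page);

	if (clear_page_dirty_for_io(page)) {
		page_cache_get(page);
		ret = mapping->a_ops->writepage(page, &wbc);
		if (ret == 0 && wait) {
			wait_on_page_writeback(page);
			if (PageError(page))
				ret = -EIO;
		}
		page_cache_release(page);
	} else {
		unlock_page(page);
	}
	return ret;
}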

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd
  2010-04-15  9:32                         ` Dave Chinner
@ 2010-04-15  9:41                           ` KOSAKI Motohiro
  2010-04-15 17:27                           ` Suleiman Souhlal
  1 sibling, 0 replies; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-15  9:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: kosaki.motohiro, Suleiman Souhlal, Mel Gorman, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel, suleiman

> On Thu, Apr 15, 2010 at 01:05:57AM -0700, Suleiman Souhlal wrote:
> > 
> > On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
> > 
> > >Now, vmscan pageout() is one of IO throuput degression source.
> > >Some IO workload makes very much order-0 allocation and reclaim
> > >and pageout's 4K IOs are making annoying lots seeks.
> > >
> > >At least, kswapd can avoid such pageout() because kswapd don't
> > >need to consider OOM-Killer situation. that's no risk.
> > >
> > >Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > 
> > What's your opinion on trying to cluster the writes done by pageout,
> > instead of not doing any paging out in kswapd?
> 
> XFS already does this in ->writepage to try to minimise the impact
> of the way pageout issues IO. It helps, but it is still not as good
> as having all the writeback come from the flusher threads because
> it's still pretty much random IO.

I haven't reviewed such a patch yet, so I'm talking about the generic
case. pageout() doesn't only write out file backed pages, it also writes
swap backed pages (which go through the swap cache's swap_writepage(),
something no flusher thread will ever do for us). So neither a
filesystem optimization nor the flusher threads erase the worth of
pageout clustering.


> And, FWIW, it doesn't solve the stack usage problems, either. In
> fact, it will make them worse as write_one_page() puts another
> struct writeback_control on the stack...

Correct. We need to avoid the double writeback_control on the stack;
probably we need to split pageout() into pieces.
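
For example (purely a sketch of the shape, untested): pull the
->writepage call out into a helper that takes the caller's
writeback_control, so a clustering loop could reuse the wbc that
pageout() already has on its stack instead of letting write_one_page()
build a second one:

	/* hypothetical helper, name made up for illustration */
	static int pageout_one_page(struct page *page,
				    struct address_space *mapping,
				    struct writeback_control *wbc)
	{
		if (!clear_page_dirty_for_io(page))
			return -EAGAIN;	/* not dirty; caller keeps the lock */
		/* ->writepage unlocks the page for us */
		return mapping->a_ops->writepage(page, wbc);
	}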




^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-15  8:54                         ` Dave Chinner
@ 2010-04-15 10:21                           ` KOSAKI Motohiro
  2010-04-15 10:23                             ` [PATCH 1/4] vmscan: simplify shrink_inactive_list() KOSAKI Motohiro
                                               ` (3 more replies)
  0 siblings, 4 replies; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-15 10:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: kosaki.motohiro, Mel Gorman, Chris Mason, linux-kernel, linux-mm,
	linux-fsdevel

> On Thu, Apr 15, 2010 at 03:35:14PM +0900, KOSAKI Motohiro wrote:
> > > On Thu, Apr 15, 2010 at 01:09:01PM +0900, KOSAKI Motohiro wrote:
> > > > Hi
> > > > 
> > > > > How about this? For now, we stop direct reclaim from doing writeback
> > > > > only on order zero allocations, but allow it for higher order
> > > > > allocations. That will prevent the majority of situations where
> > > > > direct reclaim blows the stack and interferes with background
> > > > > writeout, but won't cause lumpy reclaim to change behaviour.
> > > > > This reduces the scope of impact and hence testing and validation
> > > > > the needs to be done.
> > > > 
> > > > Tend to agree. but I would proposed slightly different algorithm for
> > > > avoind incorrect oom.
> > > > 
> > > > for high order allocation
> > > > 	allow to use lumpy reclaim and pageout() for both kswapd and direct reclaim
> > > 
> > > SO same as current.
> > 
> > Yes. as same as you propsed.
> > 
> > > 
> > > > for low order allocation
> > > > 	- kswapd:          always delegate io to flusher thread
> > > > 	- direct reclaim:  delegate io to flusher thread only if vm pressure is low
> > > 
> > > IMO, this really doesn't fix either of the problems - the bad IO
> > > patterns nor the stack usage. All it will take is a bit more memory
> > > pressure to trigger stack and IO problems, and the user reporting the
> > > problems is generating an awful lot of memory pressure...
> > 
> > This patch doesn't care stack usage. because
> >   - again, I think all stack eater shold be diet.
> 
> Agreed (again), but we've already come to the conclusion that a
> stack diet is not enough.

ok.


> >   - under allowing lumpy reclaim world, only deny low order reclaim
> >     doesn't solve anything.
> 
> Yes, I suggested it *as a first step*, not as the end goal. Your
> patches don't reach the first step which is fixing the reported
> stack problem for order-0 allocations...

I have some diet patches as a separate series. I'll post today's diet
patches in another mail; I didn't want to mix in perfectly unrelated
patches.


> > Please don't forget priority=0 recliam failure incvoke OOM-killer.
> > I don't imagine anyone want it.
> 
> Given that I haven't been able to trigger OOM without writeback from
> direct reclaim so far (*) I'm not finding any evidence that it is a
> problem or that there are regressions.  I want to be able to say
> that this change has no known regressions. I want to find the
> regression and  work to fix them, but without test cases there's no
> way I can do this.
> 
> This is what I'm getting frustrated about - I want to fix this
> problem once and for all, but I can't find out what I need to do to
> robustly test such a change so we can have a high degree of
> confidence that it doesn't introduce major regressions. Can anyone
> help here?
> 
> (*) except in one case I've already described where it mananged to
> allocate enough huge pages to starve the system of order zero pages,
> which is what I asked it to do.

Agreed, and I'm sorry about that. Probably nobody in the world has a
sufficient set of VM test cases, Linux people included. A modern general
purpose OS is used for really various purposes on really various
machines, so I've never seen a VM change with perfectly zero
regressions. I get the same frustration every time.

That's because much of the VM's messiness is there to avoid extreme
starvation cases; if such a case can be reproduced easily, it's a
VM bug ;)



> > And, Which IO workload trigger <6 priority vmscan?
> 
> You're asking me? I've been asking you for workloads that wind up
> reclaim priority.... :/

??? Did I misunderstand your last mail?
You wrote

> IMO, this really doesn't fix either of the problems - the bad IO
> patterns nor the stack usage. All it will take is a bit more memory
> pressure to trigger stack and IO problems, and the user reporting the
> problems is generating an awful lot of memory pressure...

and I asked which workload produces "the bad IO patterns". If that's
not what you meant, which IO pattern were you talking about?

If my understanding is correct, you asked me about a case where vmscan
hurts, and I asked you about your bad IO pattern.

Now I'm guessing your intention was "bad IO patterns", not "the IO
patterns"??



> All I can say is that the most common trigger I see for OOM is
> copying a large file on a busy system that is running off a single
> spindle.  When that happens on my laptop I walk away and get a cup
> of coffee when that happens and when I come back I pick up all the
> broken bits the OOM killer left behind.....

As far as I understand, you are talking about a general thing, not a
specific one, so I'm also talking about the general case. In general, I
think a slowdown is better than the OOM-killer. So, even though we need
more and more improvement, we always have to care about avoiding an
incorrect OOM. IOW, I'd prefer step by step development.





^ permalink raw reply	[flat|nested] 115+ messages in thread

* [PATCH 1/4] vmscan: simplify shrink_inactive_list()
  2010-04-15 10:21                           ` KOSAKI Motohiro
@ 2010-04-15 10:23                             ` KOSAKI Motohiro
  2010-04-15 13:15                               ` Mel Gorman
  2010-04-15 10:24                             ` [PATCH 2/4] [cleanup] mm: introduce free_pages_prepare KOSAKI Motohiro
                                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-15 10:23 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, Dave Chinner, Mel Gorman, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel

Now, the max_scan passed to shrink_inactive_list() is never more than
SWAP_CLUSTER_MAX, so we can remove the page scanning loop in it.
This patch also helps the stack diet.

details
 - remove the "while (nr_scanned < max_scan)" loop
 - remove nr_freed (now we use nr_reclaimed directly)
 - remove nr_scan (now we use nr_scanned directly)
 - rename max_scan to nr_to_scan
 - pass nr_to_scan into isolate_pages() directly instead of
   using SWAP_CLUSTER_MAX

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |  190 ++++++++++++++++++++++++++++-------------------------------
 1 files changed, 89 insertions(+), 101 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index eab6028..4de4029 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1137,16 +1137,22 @@ static int too_many_isolated(struct zone *zone, int file,
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
  */
-static unsigned long shrink_inactive_list(unsigned long max_scan,
+static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 			struct zone *zone, struct scan_control *sc,
 			int file)
 {
 	LIST_HEAD(page_list);
 	struct pagevec pvec;
-	unsigned long nr_scanned = 0;
+	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 	int lumpy_reclaim = 0;
+	struct page *page;
+	unsigned long nr_taken;
+	unsigned long nr_active;
+	unsigned int count[NR_LRU_LISTS] = { 0, };
+	unsigned long nr_anon;
+	unsigned long nr_file;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1172,119 +1178,101 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
-	do {
-		struct page *page;
-		unsigned long nr_taken;
-		unsigned long nr_scan;
-		unsigned long nr_freed;
-		unsigned long nr_active;
-		unsigned int count[NR_LRU_LISTS] = { 0, };
-		int mode = lumpy_reclaim ? ISOLATE_BOTH : ISOLATE_INACTIVE;
-		unsigned long nr_anon;
-		unsigned long nr_file;
-
-		nr_taken = sc->isolate_pages(SWAP_CLUSTER_MAX,
-			     &page_list, &nr_scan, sc->order, mode,
-				zone, sc->mem_cgroup, 0, file);
+	nr_taken = sc->isolate_pages(nr_to_scan,
+				     &page_list, &nr_scanned, sc->order,
+				     lumpy_reclaim ? ISOLATE_BOTH : ISOLATE_INACTIVE,
+				     zone, sc->mem_cgroup, 0, file);
 
-		if (scanning_global_lru(sc)) {
-			zone->pages_scanned += nr_scan;
-			if (current_is_kswapd())
-				__count_zone_vm_events(PGSCAN_KSWAPD, zone,
-						       nr_scan);
-			else
-				__count_zone_vm_events(PGSCAN_DIRECT, zone,
-						       nr_scan);
-		}
+	if (scanning_global_lru(sc)) {
+		zone->pages_scanned += nr_scanned;
+		if (current_is_kswapd())
+			__count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
+		else
+			__count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
+	}
 
-		if (nr_taken == 0)
-			goto done;
+	if (nr_taken == 0)
+		goto done;
 
-		nr_active = clear_active_flags(&page_list, count);
-		__count_vm_events(PGDEACTIVATE, nr_active);
+	nr_active = clear_active_flags(&page_list, count);
+	__count_vm_events(PGDEACTIVATE, nr_active);
 
-		__mod_zone_page_state(zone, NR_ACTIVE_FILE,
-						-count[LRU_ACTIVE_FILE]);
-		__mod_zone_page_state(zone, NR_INACTIVE_FILE,
-						-count[LRU_INACTIVE_FILE]);
-		__mod_zone_page_state(zone, NR_ACTIVE_ANON,
-						-count[LRU_ACTIVE_ANON]);
-		__mod_zone_page_state(zone, NR_INACTIVE_ANON,
-						-count[LRU_INACTIVE_ANON]);
+	__mod_zone_page_state(zone, NR_ACTIVE_FILE,
+			      -count[LRU_ACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_INACTIVE_FILE,
+			      -count[LRU_INACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_ACTIVE_ANON,
+			      -count[LRU_ACTIVE_ANON]);
+	__mod_zone_page_state(zone, NR_INACTIVE_ANON,
+			      -count[LRU_INACTIVE_ANON]);
 
-		nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
-		nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
-		__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
-		__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
+	nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+	nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
 
-		reclaim_stat->recent_scanned[0] += nr_anon;
-		reclaim_stat->recent_scanned[1] += nr_file;
+	reclaim_stat->recent_scanned[0] += nr_anon;
+	reclaim_stat->recent_scanned[1] += nr_file;
 
-		spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(&zone->lru_lock);
 
-		nr_scanned += nr_scan;
-		nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+
+	/*
+	 * If we are direct reclaiming for contiguous pages and we do
+	 * not reclaim everything in the list, try again and wait
+	 * for IO to complete. This will stall high-order allocations
+	 * but that should be acceptable to the caller
+	 */
+	if (nr_reclaimed < nr_taken && !current_is_kswapd() && lumpy_reclaim) {
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/*
-		 * If we are direct reclaiming for contiguous pages and we do
-		 * not reclaim everything in the list, try again and wait
-		 * for IO to complete. This will stall high-order allocations
-		 * but that should be acceptable to the caller
+		 * The attempt at page out may have made some
+		 * of the pages active, mark them inactive again.
 		 */
-		if (nr_freed < nr_taken && !current_is_kswapd() &&
-		    lumpy_reclaim) {
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
-
-			/*
-			 * The attempt at page out may have made some
-			 * of the pages active, mark them inactive again.
-			 */
-			nr_active = clear_active_flags(&page_list, count);
-			count_vm_events(PGDEACTIVATE, nr_active);
-
-			nr_freed += shrink_page_list(&page_list, sc,
-							PAGEOUT_IO_SYNC);
-		}
+		nr_active = clear_active_flags(&page_list, count);
+		count_vm_events(PGDEACTIVATE, nr_active);
 
-		nr_reclaimed += nr_freed;
+		nr_reclaimed += shrink_page_list(&page_list, sc,
+						 PAGEOUT_IO_SYNC);
+	}
 
-		local_irq_disable();
-		if (current_is_kswapd())
-			__count_vm_events(KSWAPD_STEAL, nr_freed);
-		__count_zone_vm_events(PGSTEAL, zone, nr_freed);
+	local_irq_disable();
+	if (current_is_kswapd())
+		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
+	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-		spin_lock(&zone->lru_lock);
-		/*
-		 * Put back any unfreeable pages.
-		 */
-		while (!list_empty(&page_list)) {
-			int lru;
-			page = lru_to_page(&page_list);
-			VM_BUG_ON(PageLRU(page));
-			list_del(&page->lru);
-			if (unlikely(!page_evictable(page, NULL))) {
-				spin_unlock_irq(&zone->lru_lock);
-				putback_lru_page(page);
-				spin_lock_irq(&zone->lru_lock);
-				continue;
-			}
-			SetPageLRU(page);
-			lru = page_lru(page);
-			add_page_to_lru_list(zone, page, lru);
-			if (is_active_lru(lru)) {
-				int file = is_file_lru(lru);
-				reclaim_stat->recent_rotated[file]++;
-			}
-			if (!pagevec_add(&pvec, page)) {
-				spin_unlock_irq(&zone->lru_lock);
-				__pagevec_release(&pvec);
-				spin_lock_irq(&zone->lru_lock);
-			}
+	spin_lock(&zone->lru_lock);
+	/*
+	 * Put back any unfreeable pages.
+	 */
+	while (!list_empty(&page_list)) {
+		int lru;
+		page = lru_to_page(&page_list);
+		VM_BUG_ON(PageLRU(page));
+		list_del(&page->lru);
+		if (unlikely(!page_evictable(page, NULL))) {
+			spin_unlock_irq(&zone->lru_lock);
+			putback_lru_page(page);
+			spin_lock_irq(&zone->lru_lock);
+			continue;
 		}
-		__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
-		__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
-
-  	} while (nr_scanned < max_scan);
+		SetPageLRU(page);
+		lru = page_lru(page);
+		add_page_to_lru_list(zone, page, lru);
+		if (is_active_lru(lru)) {
+			int file = is_file_lru(lru);
+			reclaim_stat->recent_rotated[file]++;
+		}
+		if (!pagevec_add(&pvec, page)) {
+			spin_unlock_irq(&zone->lru_lock);
+			__pagevec_release(&pvec);
+			spin_lock_irq(&zone->lru_lock);
+		}
+	}
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
 
 done:
 	spin_unlock_irq(&zone->lru_lock);
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH 2/4] [cleanup] mm: introduce free_pages_prepare
  2010-04-15 10:21                           ` KOSAKI Motohiro
  2010-04-15 10:23                             ` [PATCH 1/4] vmscan: simplify shrink_inactive_list() KOSAKI Motohiro
@ 2010-04-15 10:24                             ` KOSAKI Motohiro
  2010-04-15 13:33                               ` Mel Gorman
  2010-04-15 10:24                             ` [PATCH 3/4] mm: introduce free_pages_bulk KOSAKI Motohiro
  2010-04-15 10:26                             ` [PATCH 4/4] vmscan: replace the pagevec in shrink_inactive_list() with list KOSAKI Motohiro
  3 siblings, 1 reply; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-15 10:24 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, Dave Chinner, Mel Gorman, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel

This patch is used by [3/4]

===================================
free_hot_cold_page() and __free_pages_ok() do very similar
freeing preparation. This patch consolidates the common code.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/page_alloc.c |   40 +++++++++++++++++++++-------------------
 1 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 88513c0..ba9aea7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -599,20 +599,23 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
 	spin_unlock(&zone->lock);
 }
 
-static void __free_pages_ok(struct page *page, unsigned int order)
+static int free_pages_prepare(struct page *page, unsigned int order)
 {
-	unsigned long flags;
 	int i;
 	int bad = 0;
-	int wasMlocked = __TestClearPageMlocked(page);
 
 	trace_mm_page_free_direct(page, order);
 	kmemcheck_free_shadow(page, order);
 
-	for (i = 0 ; i < (1 << order) ; ++i)
-		bad += free_pages_check(page + i);
+	for (i = 0 ; i < (1 << order) ; ++i) {
+		struct page *pg = page + i;
+
+		if (PageAnon(pg))
+			pg->mapping = NULL;
+		bad += free_pages_check(pg);
+	}
 	if (bad)
-		return;
+		return -EINVAL;
 
 	if (!PageHighMem(page)) {
 		debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order);
@@ -622,6 +625,17 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	arch_free_page(page, order);
 	kernel_map_pages(page, 1 << order, 0);
 
+	return 0;
+}
+
+static void __free_pages_ok(struct page *page, unsigned int order)
+{
+	unsigned long flags;
+	int wasMlocked = __TestClearPageMlocked(page);
+
+	if (free_pages_prepare(page, order))
+		return;
+
 	local_irq_save(flags);
 	if (unlikely(wasMlocked))
 		free_page_mlock(page);
@@ -1107,21 +1121,9 @@ void free_hot_cold_page(struct page *page, int cold)
 	int migratetype;
 	int wasMlocked = __TestClearPageMlocked(page);
 
-	trace_mm_page_free_direct(page, 0);
-	kmemcheck_free_shadow(page, 0);
-
-	if (PageAnon(page))
-		page->mapping = NULL;
-	if (free_pages_check(page))
+	if (free_pages_prepare(page, 0))
 		return;
 
-	if (!PageHighMem(page)) {
-		debug_check_no_locks_freed(page_address(page), PAGE_SIZE);
-		debug_check_no_obj_freed(page_address(page), PAGE_SIZE);
-	}
-	arch_free_page(page, 0);
-	kernel_map_pages(page, 1, 0);
-
 	migratetype = get_pageblock_migratetype(page);
 	set_page_private(page, migratetype);
 	local_irq_save(flags);
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH 3/4] mm: introduce free_pages_bulk
  2010-04-15 10:21                           ` KOSAKI Motohiro
  2010-04-15 10:23                             ` [PATCH 1/4] vmscan: simplify shrink_inactive_list() KOSAKI Motohiro
  2010-04-15 10:24                             ` [PATCH 2/4] [cleanup] mm: introduce free_pages_prepare KOSAKI Motohiro
@ 2010-04-15 10:24                             ` KOSAKI Motohiro
  2010-04-15 13:46                               ` Mel Gorman
  2010-04-15 10:26                             ` [PATCH 4/4] vmscan: replace the pagevec in shrink_inactive_list() with list KOSAKI Motohiro
  3 siblings, 1 reply; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-15 10:24 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, Dave Chinner, Mel Gorman, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel

Currently, vmscan uses __pagevec_free() for batch freeing, but a
pagevec consumes a fair amount of stack (sizeof(struct pagevec) is
128 bytes on x86_64), and the x86_64 stack is very strictly limited.

So we are planning to use a page->lru list instead of a pagevec to
reduce stack usage, and this patch introduces a new helper function
for that.

It is similar to __pagevec_free(), but it receives a list instead of
a pagevec, and it does not use the pcp cache - a good characteristic
for vmscan.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/gfp.h |    1 +
 mm/page_alloc.c     |   44 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+), 0 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4c6d413..dbcac56 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -332,6 +332,7 @@ extern void free_hot_cold_page(struct page *page, int cold);
 #define __free_page(page) __free_pages((page), 0)
 #define free_page(addr) free_pages((addr),0)
 
+void free_pages_bulk(struct zone *zone, struct list_head *list);
 void page_alloc_init(void);
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_all_pages(void);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ba9aea7..1f68832 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2049,6 +2049,50 @@ void free_pages(unsigned long addr, unsigned int order)
 
 EXPORT_SYMBOL(free_pages);
 
+/*
+ * Frees a number of pages from the list.
+ * Assumes all pages on the list are in the same zone and of order 0.
+ *
+ * This is similar to __pagevec_free(), but it receives a list instead of
+ * a pagevec, and it does not use the pcp cache - good for vmscan.
+ */
+void free_pages_bulk(struct zone *zone, struct list_head *list)
+{
+	unsigned long flags;
+	struct page *page;
+	struct page *page2;
+	int nr_pages = 0;
+
+	list_for_each_entry_safe(page, page2, list, lru) {
+		int wasMlocked = __TestClearPageMlocked(page);
+
+		if (free_pages_prepare(page, 0)) {
+			/* Orphan the corrupted page. */
+			list_del(&page->lru);
+			continue;
+		}
+		if (unlikely(wasMlocked)) {
+			local_irq_save(flags);
+			free_page_mlock(page);
+			local_irq_restore(flags);
+		}
+		nr_pages++;
+	}
+
+	spin_lock_irqsave(&zone->lock, flags);
+	__count_vm_events(PGFREE, nr_pages);
+	zone->all_unreclaimable = 0;
+	zone->pages_scanned = 0;
+	__mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
+
+	list_for_each_entry_safe(page, page2, list, lru) {
+		/* must delete it as __free_one_page() manipulates the list */
+		list_del(&page->lru);
+		__free_one_page(page, zone, 0, page_private(page));
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
+
 /**
  * alloc_pages_exact - allocate an exact number physically-contiguous pages.
  * @size: the number of bytes to allocate
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 115+ messages in thread

* [PATCH 4/4] vmscan: replace the pagevec in shrink_inactive_list() with list
  2010-04-15 10:21                           ` KOSAKI Motohiro
                                               ` (2 preceding siblings ...)
  2010-04-15 10:24                             ` [PATCH 3/4] mm: introduce free_pages_bulk KOSAKI Motohiro
@ 2010-04-15 10:26                             ` KOSAKI Motohiro
  3 siblings, 0 replies; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-15 10:26 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, Dave Chinner, Mel Gorman, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel

On x86_64, sizeof(struct pagevec) is 8*16=128, but
sizeof(struct list_head) is 8*2=16. So replacing the pagevec with a
list reduces stack usage by 112 bytes.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |   22 ++++++++++++++--------
 1 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4de4029..fbc26d8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -93,6 +93,8 @@ struct scan_control {
 			unsigned long *scanned, int order, int mode,
 			struct zone *z, struct mem_cgroup *mem_cont,
 			int active, int file);
+
+	struct list_head free_batch_list;
 };
 
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -641,13 +643,11 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 					enum pageout_io sync_writeback)
 {
 	LIST_HEAD(ret_pages);
-	struct pagevec freed_pvec;
 	int pgactivate = 0;
 	unsigned long nr_reclaimed = 0;
 
 	cond_resched();
 
-	pagevec_init(&freed_pvec, 1);
 	while (!list_empty(page_list)) {
 		enum page_references references;
 		struct address_space *mapping;
@@ -822,10 +822,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		__clear_page_locked(page);
 free_it:
 		nr_reclaimed++;
-		if (!pagevec_add(&freed_pvec, page)) {
-			__pagevec_free(&freed_pvec);
-			pagevec_reinit(&freed_pvec);
-		}
+		list_add(&page->lru, &sc->free_batch_list);
 		continue;
 
 cull_mlocked:
@@ -849,8 +846,6 @@ keep:
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
 	list_splice(&ret_pages, page_list);
-	if (pagevec_count(&freed_pvec))
-		__pagevec_free(&freed_pvec);
 	count_vm_events(PGACTIVATE, pgactivate);
 	return nr_reclaimed;
 }
@@ -1238,6 +1233,11 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 						 PAGEOUT_IO_SYNC);
 	}
 
+	/*
+	 * Free unused pages.
+	 */
+	free_pages_bulk(zone, &sc->free_batch_list);
+
 	local_irq_disable();
 	if (current_is_kswapd())
 		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
@@ -1844,6 +1844,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.mem_cgroup = NULL,
 		.isolate_pages = isolate_pages_global,
 		.nodemask = nodemask,
+		.free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
 	};
 
 	return do_try_to_free_pages(zonelist, &sc);
@@ -1864,6 +1865,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 		.order = 0,
 		.mem_cgroup = mem,
 		.isolate_pages = mem_cgroup_isolate_pages,
+		.free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
 	};
 	nodemask_t nm  = nodemask_of_node(nid);
 
@@ -1900,6 +1902,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 		.mem_cgroup = mem_cont,
 		.isolate_pages = mem_cgroup_isolate_pages,
 		.nodemask = NULL, /* we don't care the placement */
+		.free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
 	};
 
 	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
@@ -1976,6 +1979,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 		.order = order,
 		.mem_cgroup = NULL,
 		.isolate_pages = isolate_pages_global,
+		.free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
 	};
 loop_again:
 	total_scanned = 0;
@@ -2333,6 +2337,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
 		.swappiness = vm_swappiness,
 		.order = 0,
 		.isolate_pages = isolate_pages_global,
+		.free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
 	};
 	struct zonelist * zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
 	struct task_struct *p = current;
@@ -2517,6 +2522,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		.swappiness = vm_swappiness,
 		.order = order,
 		.isolate_pages = isolate_pages_global,
+		.free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
 	};
 	unsigned long slab_reclaimable;
 
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-15  1:34                 ` Dave Chinner
  2010-04-15  4:09                   ` KOSAKI Motohiro
@ 2010-04-15 10:28                   ` Mel Gorman
  2010-04-15 13:42                     ` Chris Mason
  2010-04-16  4:14                     ` Dave Chinner
  2010-04-15 14:57                   ` Andi Kleen
  2 siblings, 2 replies; 115+ messages in thread
From: Mel Gorman @ 2010-04-15 10:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: KOSAKI Motohiro, Chris Mason, linux-kernel, linux-mm, linux-fsdevel

On Thu, Apr 15, 2010 at 11:34:36AM +1000, Dave Chinner wrote:
> On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> > On Wed, Apr 14, 2010 at 05:28:30PM +1000, Dave Chinner wrote:
> > > On Wed, Apr 14, 2010 at 03:52:44PM +0900, KOSAKI Motohiro wrote:
> > > > > On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > > > > > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > > > > > Basically, there is not enough stack space available to allow direct
> > > > > > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > > > > > profiles we are seeing here....
> > > > > > > > 
> > > > > > > 
> > > > > > > I'm not denying the evidence but how has it been gotten away with for years
> > > > > > > then? Prevention of writeback isn't the answer without figuring out how
> > > > > > > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > > > > > > doing sync IO, then waiting on those pages.
> > > > > > 
> > > > > > So, I've been reading along, nodding my head to Dave's side of things
> > > > > > because seeks are evil and direct reclaim makes seeks.  I'd really love
> > > > > > for direct reclaim to somehow trigger writepages on large chunks instead
> > > > > > of doing page by page spatters of IO to the drive.
> > > > 
> > > > I agree that "seeks are evil and direct reclaim makes seeks". Actually,
> > > > making 4k IO is not a must for pageout, so we can probably improve it.
> > > > 
> > > > 
> > > > > Perhaps drop the lock on the page if it is held and call one of the
> > > > > helpers that filesystems use to do this, like:
> > > > > 
> > > > > 	filemap_write_and_wait(page->mapping);
> > > > 
> > > > Sorry, I'm lost as to what you're talking about. Why do we need per-file
> > > > waiting? If the file is a 1GB file, do we need to wait for 1GB of writeout?
> > > 
> > > So use filemap_fdatawrite(page->mapping), or if it's better only
> > > to start IO on a segment of the file, use
> > > filemap_fdatawrite_range(page->mapping, start, end)....
> > 
> > That does not help the stack usage issue, the caller ends up in
> > ->writepages. From an IO perspective, it'll be better from a seek point of
> > view but from a VM perspective, it may or may not be cleaning the right pages.
> > So I think this is a red herring.
> 
> If you ask it to clean a bunch of pages around the one you want to
> reclaim on the LRU, there is a good chance it will also be cleaning
> pages that are near the end of the LRU or physically close by as
> well. It's not a guarantee, but for the additional IO cost of about
> 10% wall time on that IO to clean the page you need, you also get
> 1-2 orders of magnitude other pages cleaned. That sounds like a
> win any way you look at it...
> 

At worst, it'll distort the LRU ordering slightly. Let's say the
file-adjacent page you clean was near the end of the LRU. Before such a
patch, it may have gotten cleaned and done another lap of the LRU.
After, it would be reclaimed sooner. I don't know if we depend on such
behaviour (very doubtful) but it's a subtle enough change. I can't
predict what it'll do for IO congestion. Simplistically, there is more
IO so it's bad but if the write pattern is less seeky and we needed to
write the pages anyway, it might be improved.
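
For reference, the sort of call being discussed would look something like
this (CHUNK_PAGES is invented here purely for illustration and assumed to
be a power of two):

	/* sketch: ask the filesystem to clean a chunk around the target */
	pgoff_t index = page->index & ~(CHUNK_PAGES - 1);
	loff_t start = (loff_t)index << PAGE_CACHE_SHIFT;

	filemap_fdatawrite_range(page->mapping, start,
			start + ((loff_t)CHUNK_PAGES << PAGE_CACHE_SHIFT) - 1);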

> I agree that it doesn't solve the stack problem (Chris' suggestion
> that we enable the bdi flusher interface would fix this);

I'm afraid I'm not familiar with this interface. Can you point me at
some previous discussion so that I am sure I am looking at the right
thing?

> what I'm
> pointing out is that the arguments that it is too hard or there are
> no interfaces available to issue larger IO from reclaim are not at
> all valid.
> 

Sure, I'm not resisting fixing this, just your first patch :) There are four
goals here

1. Reduce stack usage
2. Avoid the splicing of subsystem stack usage with direct reclaim
3. Preserve lumpy reclaims cleaning of contiguous pages
4. Try and not drastically alter LRU aging

1 and 2 are important for you, 3 is important for me and 4 will have to
be dealt with on a case-by-case basis.

Your patch fixes 2, avoids 1, breaks 3, and I haven't thought about 4, but
I guess dirty pages can cycle around more so it'd need to be cared for.

> > > the deepest call chain in queue_work() needs 700 bytes of stack
> > > to complete, wait_for_completion() requires almost 2k of stack space
> > > at it's deepest, the scheduler has some heavy stack users, etc,
> > > and these are all functions that appear at the top of the stack.
> > > 
> > 
> > The real issue here then is that stack usage has gone out of control.
> 
> That's definitely true, but it shouldn't cloud the fact that most
> ppl want to kill writeback from direct reclaim, too, so killing two
> birds with one stone seems like a good idea.
> 

Ah yes, but I at least will resist killing of writeback from direct
reclaim because of lumpy reclaim. Again, I recognise the seek pattern
sucks but sometimes there are specific pages we need cleaned.

> How about this? For now, we stop direct reclaim from doing writeback
> only on order zero allocations, but allow it for higher order
> allocations. That will prevent the majority of situations where
> direct reclaim blows the stack and interferes with background
> writeout, but won't cause lumpy reclaim to change behaviour.
> This reduces the scope of impact and hence the testing and validation
> that needs to be done.
> 
> Then we can work towards allowing lumpy reclaim to use background
> threads as Chris suggested for doing specific writeback operations
> to solve the remaining problems being seen. Does this seem like a
> reasonable compromise and approach to dealing with the problem?
> 

I'd like this to be plan b (or maybe c or d) if we cannot reduce stack usage
enough or come up with an alternative fix. From the goals above it mitigates
1, mitigates 2, addresses 3 but potentially allows dirty pages to remain on
the LRU with 4 until the background cleaner or kswapd comes along.
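
For concreteness, as far as I can tell the compromise would boil down to
a check along these lines early in shrink_page_list() (sketch only,
untested):

	/* dirty pages: leave writeback to kswapd and lumpy reclaim */
	if (PageDirty(page) && !current_is_kswapd() && sc->order == 0)
		goto keep_locked;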

One reason why I am edgy about this is that lumpy reclaim can kick in
for low-enough orders too like order-1 pages for stacks in some cases or
order-2 pages for network cards using jumbo frames or some wireless
cards. The network cards in particular could still cause the stack
overflow but be much harder to reproduce and detect.

> > Disabling ->writepage in direct reclaim does not guarantee that stack
> > usage will not be a problem again. From your traces, page reclaim itself
> > seems to be a big dirty hog.
> 
> I couldn't agree more - the kernel still needs to be put on a stack
> usage diet, but the above would give use some breathing space to attack the
> problem before more people start to hit these problems.
> 

I'd like stack reduction to be plan a because it buys time without making
the problem exclusive to lumpy reclaim, where it can still hit but is
harder to reproduce.

> > > Good start, but 512 bytes will only catch select and splice read,
> > > and there are 300-400 byte functions in the above list that sit near
> > > the top of the stack....
> > > 
> > 
> > They will need to be tackled in turn then but obviously there should be
> > a focus on the common paths. The reclaim paths do seem particularly
> > heavy and it's down to a lot of temporary variables. I might not get the
> > time today but what I'm going to try do some time this week is
> > 
> > o Look at what temporary variables are copies of other pieces of information
> > o See what variables live for the duration of reclaim but are not needed
> >   for all of it (i.e. uninline parts of it so variables do not persist)
> > o See if it's possible to dynamically allocate scan_control
> 
> Welcome to my world ;)
> 

It's not like the brochure at all :)

> > The last one is the trickiest. Basically, the idea would be to move as much
> > into scan_control as possible. Then, instead of allocating it on the stack,
> > allocate a fixed number of them at boot-time (NR_CPU probably) protected by
> > a semaphore. Limit the number of direct reclaimers that can be active at a
> > time to the number of scan_control variables. kswapd could still allocate
> > its on the stack or with kmalloc.
> > 
> > If it works out, it would have two main benefits. Limits the number of
> > processes in direct reclaim - if there is NR_CPU-worth of proceses in direct
> > reclaim, there is too much going on. It would also shrink the stack usage
> > particularly if some of the stack variables are moved into scan_control.
> > 
> > Maybe someone will beat me to looking at the feasibility of this.
> 
> I like the idea - it really sounds like you want a fixed size,
> preallocated mempool that can't be enlarged.

Yep. It would cut down around 1K of stack usage when direct reclaim gets
involved. The "downside" would be a limit on the number of direct
reclaimers that can exist at any given time, but that could be a positive
in some cases.
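
Roughly what I have in mind, with all of the names invented and the whole
thing completely untested - just to illustrate the shape of it:

	static struct scan_control reclaim_sc_pool[NR_CPUS];
	static DECLARE_BITMAP(reclaim_sc_used, NR_CPUS);
	static DEFINE_SPINLOCK(reclaim_sc_lock);
	static struct semaphore reclaim_sc_sem =
		__SEMAPHORE_INITIALIZER(reclaim_sc_sem, NR_CPUS);

	static struct scan_control *get_scan_control(void)
	{
		int i;

		/* blocks once NR_CPUS direct reclaimers are active */
		down(&reclaim_sc_sem);
		spin_lock(&reclaim_sc_lock);
		i = find_first_zero_bit(reclaim_sc_used, NR_CPUS);
		__set_bit(i, reclaim_sc_used);
		spin_unlock(&reclaim_sc_lock);
		return &reclaim_sc_pool[i];
	}

	static void put_scan_control(struct scan_control *sc)
	{
		spin_lock(&reclaim_sc_lock);
		__clear_bit(sc - reclaim_sc_pool, reclaim_sc_used);
		spin_unlock(&reclaim_sc_lock);
		up(&reclaim_sc_sem);
	}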

> In fact, I can probably
> use something like this in XFS to save a couple of hundred bytes of
> stack space in the worst hogs....
> 
> > > > > This is the sort of thing I'm pointing at when I say that stack
> > > > > usage outside XFS has grown significantly significantly over the
> > > > > past couple of years. Given XFS has remained pretty much the same or
> > > > > even reduced slightly over the same time period, blaming XFS or
> > > > > saying "callers should use GFP_NOFS" seems like a cop-out to me.
> > > > > Regardless of the IO pattern performance issues, writeback via
> > > > > direct reclaim just uses too much stack to be safe these days...
> > > > 
> > > > Yeah, My answer is simple, All stack eater should be fixed.
> > > > but XFS seems not innocence too. 3.5K is enough big although
> > > > xfs have use such amount since very ago.
> > > 
> > > XFS used to use much more than that - significant effort has been
> > > put into reduce the stack footprint over many years. There's not
> > > much left to trim without rewriting half the filesystem...
> > 
> > I don't think he is levelling a complain at XFS in particular - just pointing
> > out that it's heavy too. Still, we should be gratful that XFS is sort of
> > a "Stack Canary". If it dies, everyone else could be in trouble soon :)
> 
> Yeah, true. Sorry if I'm being a bit too defensive here - the scars
> from previous discussions like this are showing through....
> 

I guessed :)

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd
  2010-04-15  8:26                           ` KOSAKI Motohiro
@ 2010-04-15 10:30                             ` Johannes Weiner
  2010-04-15 17:24                               ` Suleiman Souhlal
  2010-04-20  2:56                               ` Ying Han
  0 siblings, 2 replies; 115+ messages in thread
From: Johannes Weiner @ 2010-04-15 10:30 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Suleiman Souhlal, Dave Chinner, Mel Gorman, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel, suleiman

On Thu, Apr 15, 2010 at 05:26:27PM +0900, KOSAKI Motohiro wrote:
> Cc to Johannes
> 
> > > 
> > > On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
> > > 
> > > > Now, vmscan pageout() is one source of IO throughput degradation.
> > > > Some IO workloads make very many order-0 allocations and reclaims,
> > > > and pageout's 4K IOs cause a lot of annoying seeks.
> > > >
> > > > At least kswapd can avoid such pageout() because kswapd doesn't
> > > > need to consider the OOM-killer situation, so there's no risk.
> > > >
> > > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > 
> > > What's your opinion on trying to cluster the writes done by pageout,  
> > > instead of not doing any paging out in kswapd?
> > > Something along these lines:
> > 
> > Interesting. 
> > So, I'd like to review your patch carefully. Can you please give me one
> > day? :)
> 
> Hannes, if my memory is correct, you tried similar swap-cluster IO a
> long time ago. Now I can't remember why we didn't merge that patch.
> Do you remember anything?

Oh, quite vividly in fact :)  For a lot of swap loads the LRU order
diverged heavily from swap slot order and readaround was a waste of
time.

Of course, the patch looked good, too, but it did not match reality
that well.

I guess 'how about this patch?' won't get us as far as 'how about
those numbers/graphs of several real-life workloads?  oh and here
is the patch...'.

> > >      Cluster writes to disk due to memory pressure.
> > > 
> > >      Write out logically adjacent pages to the one we're paging out
> > >      so that we may get better IOs in these situations:
> > >      These pages are likely to be contiguous on disk to the one we're
> > >      writing out, so they should get merged into a single disk IO.
> > > 
> > >      Signed-off-by: Suleiman Souhlal <suleiman@google.com>

For random IO, LRU order will have nothing to do with mapping/disk order.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd
  2010-04-15  4:11                     ` [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd KOSAKI Motohiro
  2010-04-15  8:05                       ` Suleiman Souhlal
  2010-04-15  8:18                       ` KOSAKI Motohiro
@ 2010-04-15 10:31                       ` Mel Gorman
  2010-04-15 11:26                         ` KOSAKI Motohiro
  2 siblings, 1 reply; 115+ messages in thread
From: Mel Gorman @ 2010-04-15 10:31 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Dave Chinner, Chris Mason, linux-kernel, linux-mm, linux-fsdevel

On Thu, Apr 15, 2010 at 01:11:37PM +0900, KOSAKI Motohiro wrote:
> Now, vmscan pageout() is one source of IO throughput degradation.
> Some IO workloads make very many order-0 allocations and reclaims,
> and pageout's 4K IOs cause a lot of annoying seeks.
> 
> At least kswapd can avoid such pageout() because kswapd doesn't
> need to consider the OOM-killer situation, so there's no risk.
> 

Well, there is some risk here. Direct reclaimers may now have to clean
more pages than they did previously, and it still splices subsystems
together, increasing stack usage and causing further problems.

It might not cause OOM-killer issues but it could increase the time
dirty pages spend on the LRU.

Am I missing something?

> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  mm/vmscan.c |    7 +++++++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3ff3311..d392a50 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -614,6 +614,13 @@ static enum page_references page_check_references(struct page *page,
>  	if (referenced_page)
>  		return PAGEREF_RECLAIM_CLEAN;
>  
> +	/*
> +	 * Delegate pageout IO to flusher thread. They can make more
> +	 * effective IO pattern.
> +	 */
> +	if (current_is_kswapd())
> +		return PAGEREF_RECLAIM_CLEAN;
> +
>  	return PAGEREF_RECLAIM;
>  }
>  
> -- 
> 1.6.5.2
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd
  2010-04-15 10:31                       ` Mel Gorman
@ 2010-04-15 11:26                         ` KOSAKI Motohiro
  0 siblings, 0 replies; 115+ messages in thread
From: KOSAKI Motohiro @ 2010-04-15 11:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, Dave Chinner, Chris Mason, linux-kernel,
	linux-mm, linux-fsdevel

> On Thu, Apr 15, 2010 at 01:11:37PM +0900, KOSAKI Motohiro wrote:
> > Now, vmscan pageout() is one source of IO throughput degradation.
> > Some IO workloads make very many order-0 allocations and reclaims,
> > and pageout's 4K IOs cause a lot of annoying seeks.
> > 
> > At least kswapd can avoid such pageout() because kswapd doesn't
> > need to consider the OOM-killer situation, so there's no risk.
> > 
> 
> Well, there is some risk here. Direct reclaimers may now have to clean
> more pages than they did previously, and it still splices subsystems
> together, increasing stack usage and causing further problems.
> 
> It might not cause OOM-killer issues but it could increase the time
> dirty pages spend on the LRU.
> 
> Am I missing something?

No, you are right. I fully agree with your previous mail. So, I need to cool down a bit ;)








^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: simplify shrink_inactive_list()
  2010-04-15 10:23                             ` [PATCH 1/4] vmscan: simplify shrink_inactive_list() KOSAKI Motohiro
@ 2010-04-15 13:15                               ` Mel Gorman
  2010-04-15 15:01                                 ` Andi Kleen
  2010-04-15 18:22                                 ` Valdis.Kletnieks
  0 siblings, 2 replies; 115+ messages in thread
From: Mel Gorman @ 2010-04-15 13:15 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Dave Chinner, Chris Mason, linux-kernel, linux-mm, linux-fsdevel

On Thu, Apr 15, 2010 at 07:23:04PM +0900, KOSAKI Motohiro wrote:
> Now, the max_scan passed to shrink_inactive_list() is never more than
> SWAP_CLUSTER_MAX, so we can remove the page scanning loop from it.
> This patch also helps the stack diet.
> 

Yep. I modified bloat-o-meter to work with stacks (imaginatively calling it
stack-o-meter) and got the following. The prereq patches are from
earlier in the thread with the subjects

vmscan: kill prev_priority completely
vmscan: move priority variable into scan_control

It gets

$ stack-o-meter vmlinux-vanilla vmlinux-1-2patchprereq 
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-72 (-72)
function                                     old     new   delta
kswapd                                       748     676     -72

and with this patch on top

$ stack-o-meter vmlinux-vanilla vmlinux-2-simplfy-shrink 
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-144 (-144)
function                                     old     new   delta
shrink_zone                                 1232    1160     -72
kswapd                                       748     676     -72

X86-32 based config.

> detail
>  - remove "while (nr_scanned < max_scan)" loop
>  - remove nr_freed (now, we use nr_reclaimed directly)
>  - remove nr_scan (now, we use nr_scanned directly)
>  - rename max_scan to nr_to_scan
>  - pass nr_to_scan into isolate_pages() directly instead of
>    using SWAP_CLUSTER_MAX
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

I couldn't spot any problems. I'd consider throwing a

WARN_ON(nr_to_scan > SWAP_CLUSTER_MAX)

in case some future change breaks the assumption, but otherwise it looks
fine.

Acked-by: Mel Gorman <mel@csn.ul.ie>

> ---
>  mm/vmscan.c |  190 ++++++++++++++++++++++++++++-------------------------------
>  1 files changed, 89 insertions(+), 101 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eab6028..4de4029 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1137,16 +1137,22 @@ static int too_many_isolated(struct zone *zone, int file,
>   * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
>   * of reclaimed pages
>   */
> -static unsigned long shrink_inactive_list(unsigned long max_scan,
> +static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>  			struct zone *zone, struct scan_control *sc,
>  			int file)
>  {
>  	LIST_HEAD(page_list);
>  	struct pagevec pvec;
> -	unsigned long nr_scanned = 0;
> +	unsigned long nr_scanned;
>  	unsigned long nr_reclaimed = 0;
>  	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
>  	int lumpy_reclaim = 0;
> +	struct page *page;
> +	unsigned long nr_taken;
> +	unsigned long nr_active;
> +	unsigned int count[NR_LRU_LISTS] = { 0, };
> +	unsigned long nr_anon;
> +	unsigned long nr_file;
>  
>  	while (unlikely(too_many_isolated(zone, file, sc))) {
>  		congestion_wait(BLK_RW_ASYNC, HZ/10);
> @@ -1172,119 +1178,101 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
>  
>  	lru_add_drain();
>  	spin_lock_irq(&zone->lru_lock);
> -	do {
> -		struct page *page;
> -		unsigned long nr_taken;
> -		unsigned long nr_scan;
> -		unsigned long nr_freed;
> -		unsigned long nr_active;
> -		unsigned int count[NR_LRU_LISTS] = { 0, };
> -		int mode = lumpy_reclaim ? ISOLATE_BOTH : ISOLATE_INACTIVE;
> -		unsigned long nr_anon;
> -		unsigned long nr_file;
> -
> -		nr_taken = sc->isolate_pages(SWAP_CLUSTER_MAX,
> -			     &page_list, &nr_scan, sc->order, mode,
> -				zone, sc->mem_cgroup, 0, file);
> +	nr_taken = sc->isolate_pages(nr_to_scan,
> +				     &page_list, &nr_scanned, sc->order,
> +				     lumpy_reclaim ? ISOLATE_BOTH : ISOLATE_INACTIVE,
> +				     zone, sc->mem_cgroup, 0, file);
>  
> -		if (scanning_global_lru(sc)) {
> -			zone->pages_scanned += nr_scan;
> -			if (current_is_kswapd())
> -				__count_zone_vm_events(PGSCAN_KSWAPD, zone,
> -						       nr_scan);
> -			else
> -				__count_zone_vm_events(PGSCAN_DIRECT, zone,
> -						       nr_scan);
> -		}
> +	if (scanning_global_lru(sc)) {
> +		zone->pages_scanned += nr_scanned;
> +		if (current_is_kswapd())
> +			__count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
> +		else
> +			__count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
> +	}
>  
> -		if (nr_taken == 0)
> -			goto done;
> +	if (nr_taken == 0)
> +		goto done;
>  
> -		nr_active = clear_active_flags(&page_list, count);
> -		__count_vm_events(PGDEACTIVATE, nr_active);
> +	nr_active = clear_active_flags(&page_list, count);
> +	__count_vm_events(PGDEACTIVATE, nr_active);
>  
> -		__mod_zone_page_state(zone, NR_ACTIVE_FILE,
> -						-count[LRU_ACTIVE_FILE]);
> -		__mod_zone_page_state(zone, NR_INACTIVE_FILE,
> -						-count[LRU_INACTIVE_FILE]);
> -		__mod_zone_page_state(zone, NR_ACTIVE_ANON,
> -						-count[LRU_ACTIVE_ANON]);
> -		__mod_zone_page_state(zone, NR_INACTIVE_ANON,
> -						-count[LRU_INACTIVE_ANON]);
> +	__mod_zone_page_state(zone, NR_ACTIVE_FILE,
> +			      -count[LRU_ACTIVE_FILE]);
> +	__mod_zone_page_state(zone, NR_INACTIVE_FILE,
> +			      -count[LRU_INACTIVE_FILE]);
> +	__mod_zone_page_state(zone, NR_ACTIVE_ANON,
> +			      -count[LRU_ACTIVE_ANON]);
> +	__mod_zone_page_state(zone, NR_INACTIVE_ANON,
> +			      -count[LRU_INACTIVE_ANON]);
>  
> -		nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
> -		nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
> -		__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
> -		__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
> +	nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
> +	nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
> +	__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
> +	__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
>  
> -		reclaim_stat->recent_scanned[0] += nr_anon;
> -		reclaim_stat->recent_scanned[1] += nr_file;
> +	reclaim_stat->recent_scanned[0] += nr_anon;
> +	reclaim_stat->recent_scanned[1] += nr_file;
>  
> -		spin_unlock_irq(&zone->lru_lock);
> +	spin_unlock_irq(&zone->lru_lock);
>  
> -		nr_scanned += nr_scan;
> -		nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> +	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> +
> +	/*
> +	 * If we are direct reclaiming for contiguous pages and we do
> +	 * not reclaim everything in the list, try again and wait
> +	 * for IO to complete. This will stall high-order allocations
> +	 * but that should be acceptable to the caller
> +	 */
> +	if (nr_reclaimed < nr_taken && !current_is_kswapd() && lumpy_reclaim) {
> +		congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
>  		/*
> -		 * If we are direct reclaiming for contiguous pages and we do
> -		 * not reclaim everything in the list, try again and wait
> -		 * for IO to complete. This will stall high-order allocations
> -		 * but that should be acceptable to the caller
> +		 * The attempt at page out may have made some
> +		 * of the pages active, mark them inactive again.
>  		 */
> -		if (nr_freed < nr_taken && !current_is_kswapd() &&
> -		    lumpy_reclaim) {
> -			congestion_wait(BLK_RW_ASYNC, HZ/10);
> -
> -			/*
> -			 * The attempt at page out may have made some
> -			 * of the pages active, mark them inactive again.
> -			 */
> -			nr_active = clear_active_flags(&page_list, count);
> -			count_vm_events(PGDEACTIVATE, nr_active);
> -
> -			nr_freed += shrink_page_list(&page_list, sc,
> -							PAGEOUT_IO_SYNC);
> -		}
> +		nr_active = clear_active_flags(&page_list, count);
> +		count_vm_events(PGDEACTIVATE, nr_active);
>  
> -		nr_reclaimed += nr_freed;
> +		nr_reclaimed += shrink_page_list(&page_list, sc,
> +						 PAGEOUT_IO_SYNC);
> +	}
>  
> -		local_irq_disable();
> -		if (current_is_kswapd())
> -			__count_vm_events(KSWAPD_STEAL, nr_freed);
> -		__count_zone_vm_events(PGSTEAL, zone, nr_freed);
> +	local_irq_disable();
> +	if (current_is_kswapd())
> +		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
> +	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
>  
> -		spin_lock(&zone->lru_lock);
> -		/*
> -		 * Put back any unfreeable pages.
> -		 */
> -		while (!list_empty(&page_list)) {
> -			int lru;
> -			page = lru_to_page(&page_list);
> -			VM_BUG_ON(PageLRU(page));
> -			list_del(&page->lru);
> -			if (unlikely(!page_evictable(page, NULL))) {
> -				spin_unlock_irq(&zone->lru_lock);
> -				putback_lru_page(page);
> -				spin_lock_irq(&zone->lru_lock);
> -				continue;
> -			}
> -			SetPageLRU(page);
> -			lru = page_lru(page);
> -			add_page_to_lru_list(zone, page, lru);
> -			if (is_active_lru(lru)) {
> -				int file = is_file_lru(lru);
> -				reclaim_stat->recent_rotated[file]++;
> -			}
> -			if (!pagevec_add(&pvec, page)) {
> -				spin_unlock_irq(&zone->lru_lock);
> -				__pagevec_release(&pvec);
> -				spin_lock_irq(&zone->lru_lock);
> -			}
> +	spin_lock(&zone->lru_lock);
> +	/*
> +	 * Put back any unfreeable pages.
> +	 */
> +	while (!list_empty(&page_list)) {
> +		int lru;
> +		page = lru_to_page(&page_list);
> +		VM_BUG_ON(PageLRU(page));
> +		list_del(&page->lru);
> +		if (unlikely(!page_evictable(page, NULL))) {
> +			spin_unlock_irq(&zone->lru_lock);
> +			putback_lru_page(page);
> +			spin_lock_irq(&zone->lru_lock);
> +			continue;
>  		}
> -		__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
> -		__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
> -
> -  	} while (nr_scanned < max_scan);
> +		SetPageLRU(page);
> +		lru = page_lru(page);
> +		add_page_to_lru_list(zone, page, lru);
> +		if (is_active_lru(lru)) {
> +			int file = is_file_lru(lru);
> +			reclaim_stat->recent_rotated[file]++;
> +		}
> +		if (!pagevec_add(&pvec, page)) {
> +			spin_unlock_irq(&zone->lru_lock);
> +			__pagevec_release(&pvec);
> +			spin_lock_irq(&zone->lru_lock);
> +		}
> +	}
> +	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
> +	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
>  
>  done:
>  	spin_unlock_irq(&zone->lru_lock);
> -- 
> 1.6.5.2
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 2/4] [cleanup] mm: introduce free_pages_prepare
  2010-04-15 10:24                             ` [PATCH 2/4] [cleanup] mm: introduce free_pages_prepare KOSAKI Motohiro
@ 2010-04-15 13:33                               ` Mel Gorman
  0 siblings, 0 replies; 115+ messages in thread
From: Mel Gorman @ 2010-04-15 13:33 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Dave Chinner, Chris Mason, linux-kernel, linux-mm, linux-fsdevel

On Thu, Apr 15, 2010 at 07:24:05PM +0900, KOSAKI Motohiro wrote:
> This patch is used by [3/4]
> 
> ===================================
> free_hot_cold_page() and __free_pages_ok() do very similar
> freeing preparation. This patch consolidates the common code.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  mm/page_alloc.c |   40 +++++++++++++++++++++-------------------
>  1 files changed, 21 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 88513c0..ba9aea7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -599,20 +599,23 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
>  	spin_unlock(&zone->lock);
>  }
>  
> -static void __free_pages_ok(struct page *page, unsigned int order)
> +static int free_pages_prepare(struct page *page, unsigned int order)
>  {

You don't appear to do anything with the return value. bool? Otherwise I
see no problems

Acked-by: Mel Gorman <mel@csn.ul.ie>

> -	unsigned long flags;
>  	int i;
>  	int bad = 0;
> -	int wasMlocked = __TestClearPageMlocked(page);
>  
>  	trace_mm_page_free_direct(page, order);
>  	kmemcheck_free_shadow(page, order);
>  
> -	for (i = 0 ; i < (1 << order) ; ++i)
> -		bad += free_pages_check(page + i);
> +	for (i = 0 ; i < (1 << order) ; ++i) {
> +		struct page *pg = page + i;
> +
> +		if (PageAnon(pg))
> +			pg->mapping = NULL;
> +		bad += free_pages_check(pg);
> +	}
>  	if (bad)
> -		return;
> +		return -EINVAL;
>  
>  	if (!PageHighMem(page)) {
>  		debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order);
> @@ -622,6 +625,17 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>  	arch_free_page(page, order);
>  	kernel_map_pages(page, 1 << order, 0);
>  
> +	return 0;
> +}
> +
> +static void __free_pages_ok(struct page *page, unsigned int order)
> +{
> +	unsigned long flags;
> +	int wasMlocked = __TestClearPageMlocked(page);
> +
> +	if (free_pages_prepare(page, order))
> +		return;
> +
>  	local_irq_save(flags);
>  	if (unlikely(wasMlocked))
>  		free_page_mlock(page);
> @@ -1107,21 +1121,9 @@ void free_hot_cold_page(struct page *page, int cold)
>  	int migratetype;
>  	int wasMlocked = __TestClearPageMlocked(page);
>  
> -	trace_mm_page_free_direct(page, 0);
> -	kmemcheck_free_shadow(page, 0);
> -
> -	if (PageAnon(page))
> -		page->mapping = NULL;
> -	if (free_pages_check(page))
> +	if (free_pages_prepare(page, 0))
>  		return;
>  
> -	if (!PageHighMem(page)) {
> -		debug_check_no_locks_freed(page_address(page), PAGE_SIZE);
> -		debug_check_no_obj_freed(page_address(page), PAGE_SIZE);
> -	}
> -	arch_free_page(page, 0);
> -	kernel_map_pages(page, 1, 0);
> -
>  	migratetype = get_pageblock_migratetype(page);
>  	set_page_private(page, migratetype);
>  	local_irq_save(flags);
> -- 
> 1.6.5.2
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-15 10:28                   ` [PATCH] mm: disallow direct reclaim page writeback Mel Gorman
@ 2010-04-15 13:42                     ` Chris Mason
  2010-04-15 17:50                       ` tytso
  2010-04-16 15:05                       ` Mel Gorman
  2010-04-16  4:14                     ` Dave Chinner
  1 sibling, 2 replies; 115+ messages in thread
From: Chris Mason @ 2010-04-15 13:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dave Chinner, KOSAKI Motohiro, linux-kernel, linux-mm, linux-fsdevel

On Thu, Apr 15, 2010 at 11:28:37AM +0100, Mel Gorman wrote:
> On Thu, Apr 15, 2010 at 11:34:36AM +1000, Dave Chinner wrote:
> > On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> > > On Wed, Apr 14, 2010 at 05:28:30PM +1000, Dave Chinner wrote:
> > > > On Wed, Apr 14, 2010 at 03:52:44PM +0900, KOSAKI Motohiro wrote:
> > > > > > On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > > > > > > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > > > > > > Basically, there is not enough stack space available to allow direct
> > > > > > > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > > > > > > profiles we are seeing here....
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > I'm not denying the evidence but how has it been gotten away with for years
> > > > > > > > then? Prevention of writeback isn't the answer without figuring out how
> > > > > > > > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > > > > > > > doing sync IO, then waiting on those pages.
> > > > > > > 
> > > > > > > So, I've been reading along, nodding my head to Dave's side of things
> > > > > > > because seeks are evil and direct reclaim makes seeks.  I'd really love
> > > > > > > for direct reclaim to somehow trigger writepages on large chunks instead
> > > > > > > of doing page by page spatters of IO to the drive.
> > > > > 
> > > > > I agree that "seeks are evil and direct reclaim makes seeks". Actually,
> > > > > making 4k IO is not a must for pageout, so we can probably improve it.
> > > > > 
> > > > > 
> > > > > > Perhaps drop the lock on the page if it is held and call one of the
> > > > > > helpers that filesystems use to do this, like:
> > > > > > 
> > > > > > 	filemap_write_and_wait(page->mapping);
> > > > > 
> > > > > Sorry, I'm lost as to what you're talking about. Why do we need per-file
> > > > > waiting? If the file is a 1GB file, do we need to wait for 1GB of writeout?
> > > > 
> > > > So use filemap_fdatawrite(page->mapping), or if it's better only
> > > > to start IO on a segment of the file, use
> > > > filemap_fdatawrite_range(page->mapping, start, end)....
> > > 
> > > That does not help the stack usage issue, the caller ends up in
> > > ->writepages. From an IO perspective, it'll be better from a seek point of
> > > view but from a VM perspective, it may or may not be cleaning the right pages.
> > > So I think this is a red herring.
> > 
> > If you ask it to clean a bunch of pages around the one you want to
> > reclaim on the LRU, there is a good chance it will also be cleaning
> > pages that are near the end of the LRU or physically close by as
> > well. It's not a guarantee, but for the additional IO cost of about
> > 10% wall time on that IO to clean the page you need, you also get
> > 1-2 orders of magnitude other pages cleaned. That sounds like a
> > win any way you look at it...
> > 
> 
> At worst, it'll distort the LRU ordering slightly. Let's say the
> file-adjacent page you clean was near the end of the LRU. Before such a
> patch, it may have gotten cleaned and done another lap of the LRU.
> After, it would be reclaimed sooner. I don't know if we depend on such
> behaviour (very doubtful) but it's a subtle enough change. I can't
> predict what it'll do for IO congestion. Simplistically, there is more
> IO so it's bad but if the write pattern is less seeky and we needed to
> write the pages anyway, it might be improved.
> 
> > I agree that it doesn't solve the stack problem (Chris' suggestion
> > that we enable the bdi flusher interface would fix this);
> 
> I'm afraid I'm not familiar with this interface. Can you point me at
> some previous discussion so that I am sure I am looking at the right
> thing?

vi fs/direct-reclaim-helper.c, it has a few placeholders for where the
real code needs to go....just look for the ~ marks.

I mostly meant that the bdi helper threads were the best place to add
knowledge about which pages we want to write for reclaim.  We might need
to add a thread dedicated to just doing the VM's dirty work, but that's
where I would start discussing fancy new interfaces.

> 
> > what I'm
> > pointing out is that the arguments that it is too hard or there are
> > no interfaces available to issue larger IO from reclaim are not at
> > all valid.
> > 
> 
> Sure, I'm not resisting fixing this, just your first patch :) There are four
> goals here
> 
> 1. Reduce stack usage
> 2. Avoid the splicing of subsystem stack usage with direct reclaim
> 3. Preserve lumpy reclaims cleaning of contiguous pages
> 4. Try and not drastically alter LRU aging
> 
> 1 and 2 are important for you, 3 is important for me and 4 will have to
> be dealt with on a case-by-case basis.
> 
> Your patch fixes 2, avoids 1, breaks 3, and I haven't thought about 4, but
> I guess dirty pages can cycle around more so it'd need to be cared for.

I'd like to add one more:

5. Don't dive into filesystem locks during reclaim.

This is different from splicing code paths together, but
the filesystem writepage code has become the center of our attempts at
doing big fat contiguous writes on disk.  We push off work as late as we
can until just before the pages go down to disk.

I'll pick on ext4 and btrfs for a minute, just to broaden the scope
outside of XFS.  Writepage comes along and the filesystem needs to
actually find blocks on disk for all the dirty pages it has promised to
write.

So, we start a transaction, we take various allocator locks, modify
different metadata, log changed blocks, take a break (logging is hard
work you know, need_resched() has triggered by now), stuff it
all into the file's metadata, log that, and finally return.

Each of the steps above can block for a long time.  Ext4 solves
this by not doing them.  ext4_writepage only writes pages that
are already fully allocated on disk.

Btrfs is much more efficient at not doing them: it just returns right
away for PF_MEMALLOC.

This is a long way of saying the filesystem writepage code is the
opposite of what direct reclaim wants.  Direct reclaim wants to
find free ram now, and if it does end up in the mess describe above,
it'll just get stuck for a long time on work entirely unrelated to
finding free pages.
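
In other words, the reclaim-friendly shape is roughly the following.
This is hand-waving, not the actual btrfs code:

	static int example_writepage(struct page *page,
				     struct writeback_control *wbc)
	{
		/*
		 * Called from direct reclaim: don't start transactions
		 * or take allocator locks, leave the page dirty for the
		 * flusher threads to deal with.
		 */
		if (current->flags & PF_MEMALLOC) {
			redirty_page_for_writepage(wbc, page);
			unlock_page(page);
			return 0;
		}

		/* ... the normal allocate-and-write path goes here ... */
		return 0;
	}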

-chris


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 3/4] mm: introduce free_pages_bulk
  2010-04-15 10:24                             ` [PATCH 3/4] mm: introduce free_pages_bulk KOSAKI Motohiro
@ 2010-04-15 13:46                               ` Mel Gorman
  0 siblings, 0 replies; 115+ messages in thread
From: Mel Gorman @ 2010-04-15 13:46 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Dave Chinner, Chris Mason, linux-kernel, linux-mm, linux-fsdevel

On Thu, Apr 15, 2010 at 07:24:53PM +0900, KOSAKI Motohiro wrote:
> Currently, vmscan uses __pagevec_free() for batch freeing, but a
> pagevec consumes a fair amount of stack (sizeof(struct pagevec) is
> 128 bytes on x86_64), and the x86_64 stack is very strictly limited.
> 
> So we are planning to use a page->lru list instead of a pagevec to
> reduce stack usage, and this patch introduces a new helper function
> for that.
> 
> It is similar to __pagevec_free(), but it receives a list instead of
> a pagevec, and it does not use the pcp cache - a good characteristic
> for vmscan.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  include/linux/gfp.h |    1 +
>  mm/page_alloc.c     |   44 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 45 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 4c6d413..dbcac56 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -332,6 +332,7 @@ extern void free_hot_cold_page(struct page *page, int cold);
>  #define __free_page(page) __free_pages((page), 0)
>  #define free_page(addr) free_pages((addr),0)
>  
> +void free_pages_bulk(struct zone *zone, struct list_head *list);
>  void page_alloc_init(void);
>  void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
>  void drain_all_pages(void);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ba9aea7..1f68832 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2049,6 +2049,50 @@ void free_pages(unsigned long addr, unsigned int order)
>  
>  EXPORT_SYMBOL(free_pages);
>  
> +/*
> + * Frees a number of pages from the list.
> + * Assumes all pages on the list are in the same zone and of order 0.
> + *
> + * This is similar to __pagevec_free(), but it receives a list instead of
> + * a pagevec, and it does not use the pcp cache - good for vmscan.
> + */
> +void free_pages_bulk(struct zone *zone, struct list_head *list)
> +{
> +	unsigned long flags;
> +	struct page *page;
> +	struct page *page2;
> +	int nr_pages = 0;
> +
> +	list_for_each_entry_safe(page, page2, list, lru) {
> +		int wasMlocked = __TestClearPageMlocked(page);
> +
> +		if (free_pages_prepare(page, 0)) {
> +			/* Orphan the corrupted page. */
> +			list_del(&page->lru);
> +			continue;
> +		}
> +		if (unlikely(wasMlocked)) {
> +			local_irq_save(flags);
> +			free_page_mlock(page);
> +			local_irq_restore(flags);
> +		}

You could clear this under the zone->lock below before calling
__free_one_page. It'd avoid a large number of IRQ enables and disables which
are a problem on some CPUs (P4 and Itanium both blow in this regard according
to PeterZ).
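
Something like this for the second loop, with the mlock handling dropped
from the first one (completely untested):

	spin_lock_irqsave(&zone->lock, flags);
	/* ... the stats updates stay as they are in your patch ... */
	list_for_each_entry_safe(page, page2, list, lru) {
		if (unlikely(__TestClearPageMlocked(page)))
			free_page_mlock(page);	/* irqs already disabled */
		list_del(&page->lru);
		__free_one_page(page, zone, 0, page_private(page));
	}
	spin_unlock_irqrestore(&zone->lock, flags);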

> +		nr_pages++;
> +	}
> +
> +	spin_lock_irqsave(&zone->lock, flags);
> +	__count_vm_events(PGFREE, nr_pages);
> +	zone->all_unreclaimable = 0;
> +	zone->pages_scanned = 0;
> +	__mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
> +
> +	list_for_each_entry_safe(page, page2, list, lru) {
> +		/* must delete it as __free_one_page() manipulates the list */
> +		list_del(&page->lru);
> +		__free_one_page(page, zone, 0, page_private(page));
> +	}

This has the effect of bypassing the per-cpu lists as well as making the
zone lock hotter. The cache hotness of the data within the page is
probably not a factor but the cache hotness of the struct page is.

The zone lock getting hotter is a greater problem. Large amounts of page
reclaim or dumping of page cache will now contend on the zone lock whereas
previously it would have dumped into the per-cpu lists (potentially
but not necessarily avoiding the zone lock).
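
For contrast, a sketch of a per-page path through the pcp lists, which
would only take zone->lock when a pcp batch drains (this uses the
existing free_hot_cold_page() helper and is not the actual pagevec code):

	list_for_each_entry_safe(page, page2, list, lru) {
		list_del(&page->lru);
		/* cold == 1: reclaimed pages go to the tail of the pcp list */
		free_hot_cold_page(page, 1);
	}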

While there might be a stack saving in the next patch, there would appear
to be definite performance implications in taking this patch.

Functionally, I see no problem but I'd put this sort of patch on the
very long finger until the performance aspects of it could be examined.

> +	spin_unlock_irqrestore(&zone->lock, flags);
> +}
> +
>  /**
>   * alloc_pages_exact - allocate an exact number physically-contiguous pages.
>   * @size: the number of bytes to allocate
> -- 
> 1.6.5.2
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-15  1:34                 ` Dave Chinner
  2010-04-15  4:09                   ` KOSAKI Motohiro
  2010-04-15 10:28                   ` [PATCH] mm: disallow direct reclaim page writeback Mel Gorman
@ 2010-04-15 14:57                   ` Andi Kleen
  2 siblings, 0 replies; 115+ messages in thread
From: Andi Kleen @ 2010-04-15 14:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Mel Gorman, KOSAKI Motohiro, Chris Mason, linux-kernel, linux-mm,
	linux-fsdevel

Dave Chinner <david@fromorbit.com> writes:
>
> How about this? For now, we stop direct reclaim from doing writeback
> only on order zero allocations, but allow it for higher order
> allocations. That will prevent the majority of situations where

And also always stop it with 4K stacks.

> direct reclaim blows the stack and interferes with background
> writeout, but won't cause lumpy reclaim to change behaviour.
> This reduces the scope of impact and hence the testing and validation
> that needs to be done.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: simplify shrink_inactive_list()
  2010-04-15 13:15                               ` Mel Gorman
@ 2010-04-15 15:01                                 ` Andi Kleen
  2010-04-15 15:44                                   ` Mel Gorman
  2010-04-15 18:22                                 ` Valdis.Kletnieks
  1 sibling, 1 reply; 115+ messages in thread
From: Andi Kleen @ 2010-04-15 15:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Dave Chinner, Chris Mason, linux-kernel,
	linux-mm, linux-fsdevel

Mel Gorman <mel@csn.ul.ie> writes:
>
> $ stack-o-meter vmlinux-vanilla vmlinux-2-simplfy-shrink 
> add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-144 (-144)
> function                                     old     new   delta
> shrink_zone                                 1232    1160     -72
> kswapd                                       748     676     -72

And the next time someone adds a new feature to these code paths or
the compiler inlines differently, these 72 bytes are easily there
again. It's not really a long-term solution. Code tends to get
more complicated all the time. I consider it unlikely this trend will
stop any time soon.

So just doing some stack micro-optimizations doesn't really help 
all that much.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: simplify shrink_inactive_list()
  2010-04-15 15:01                                 ` Andi Kleen
@ 2010-04-15 15:44                                   ` Mel Gorman
  2010-04-15 16:54                                     ` Andi Kleen
  0 siblings, 1 reply; 115+ messages in thread
From: Mel Gorman @ 2010-04-15 15:44 UTC (permalink / raw)
  To: Andi Kleen
  Cc: KOSAKI Motohiro, Dave Chinner, Chris Mason, linux-kernel,
	linux-mm, linux-fsdevel

On Thu, Apr 15, 2010 at 05:01:36PM +0200, Andi Kleen wrote:
> Mel Gorman <mel@csn.ul.ie> writes:
> >
> > $ stack-o-meter vmlinux-vanilla vmlinux-2-simplfy-shrink 
> > add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-144 (-144)
> > function                                     old     new   delta
> > shrink_zone                                 1232    1160     -72
> > kswapd                                       748     676     -72
> 
> And the next time someone adds a new feature to these code paths or
> the compiler inlines differently, these 72 bytes are easily there
> again. It's not really a long-term solution. Code tends to get
> more complicated all the time. I consider it unlikely this trend will
> stop any time soon.
> 

The same logic applies when/if page writeback is split so that it is
handled by a separate thread.

> So just doing some stack micro-optimizations doesn't really help 
> all that much.
> 

It's a buying-time venture, I'll agree but as both approaches are only
about reducing stack usage they wouldn't be long-term solutions by your
criteria. What do you suggest?


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: simplify shrink_inactive_list()
  2010-04-15 15:44                                   ` Mel Gorman
@ 2010-04-15 16:54                                     ` Andi Kleen
  2010-04-15 23:40                                       ` Dave Chinner
  2010-04-16 14:55                                       ` Mel Gorman
  0 siblings, 2 replies; 115+ messages in thread
From: Andi Kleen @ 2010-04-15 16:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andi Kleen, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel

> It's a buying-time venture, I'll agree but as both approaches are only
> about reducing stack usage they wouldn't be long-term solutions by your
> criteria. What do you suggest?

(from easy to more complicated):

- Disable direct reclaim with 4K stacks
- Do direct reclaim only on separate stacks
- Add interrupt stacks to any 8K stack architectures.
- Get rid of 4K stacks completely
- Think about any other stackings that could give large scale recursion
and find ways to run them on separate stacks too.
- Long term: maybe we need 16K stacks at some point, depending on how
good the VM gets. Alternative would be to stop making Linux more complicated,
but that's unlikely to happen.


-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd
  2010-04-15 10:30                             ` Johannes Weiner
@ 2010-04-15 17:24                               ` Suleiman Souhlal
  2010-04-20  2:56                               ` Ying Han
  1 sibling, 0 replies; 115+ messages in thread
From: Suleiman Souhlal @ 2010-04-15 17:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KOSAKI Motohiro, Dave Chinner, Mel Gorman, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel, suleiman


On Apr 15, 2010, at 3:30 AM, Johannes Weiner wrote:

> On Thu, Apr 15, 2010 at 05:26:27PM +0900, KOSAKI Motohiro wrote:
>>
>> Hannes, if my memory is correct, you tried similar swap-cluster IO a
>> long time ago. Now I can't remember why we didn't merge such a patch.
>> Do you remember anything?
>
> Oh, quite vividly in fact :)  For a lot of swap loads the LRU order
> diverged heavily from swap slot order and readaround was a waste of
> time.
>
> Of course, the patch looked good, too, but it did not match reality
> that well.
>
> I guess 'how about this patch?' won't get us as far as 'how about
> those numbers/graphs of several real-life workloads?  oh and here
> is the patch...'.
>
>>>>     Cluster writes to disk due to memory pressure.
>>>>
>>>>     Write out logically adjacent pages to the one we're paging out
>>>>     so that we may get better IOs in these situations:
>>>>     These pages are likely to be contiguous on disk to the one we're
>>>>     writing out, so they should get merged into a single disk IO.
>>>>
>>>>     Signed-off-by: Suleiman Souhlal <suleiman@google.com>
>
> For random IO, LRU order will have nothing to do with mapping/disk order.

Right, that's why the patch writes out contiguous pages in mapping
order.

If they are contiguous on disk with the original page, then writing
them out as well should be essentially free (when it comes to disk
time). There is almost no waste of memory regardless of the access
patterns, as far as I can tell.

This patch is just a proof of concept and could be improved by getting
help from the filesystem/swap code to ensure that the additional pages
we're writing out really are contiguous with the original one.
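
As an illustration, for file-backed pages that help could be as simple
as kicking async writeback on a window of pages around the target with
filemap_fdatawrite_range(); the helper name and window size below are
made up for the sketch:

	#define CLUSTER_PAGES	32	/* arbitrary illustrative window */

	/* Start async writeback on mapping-adjacent pages around 'page'. */
	static void cluster_pageout(struct page *page)
	{
		struct address_space *mapping = page->mapping;
		pgoff_t index = page->index & ~(CLUSTER_PAGES - 1);
		loff_t start = (loff_t)index << PAGE_CACHE_SHIFT;

		filemap_fdatawrite_range(mapping, start,
			start + ((loff_t)CLUSTER_PAGES << PAGE_CACHE_SHIFT) - 1);
	}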

-- Suleiman

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd
  2010-04-15  9:32                         ` Dave Chinner
  2010-04-15  9:41                           ` KOSAKI Motohiro
@ 2010-04-15 17:27                           ` Suleiman Souhlal
  2010-04-15 23:33                             ` Dave Chinner
  1 sibling, 1 reply; 115+ messages in thread
From: Suleiman Souhlal @ 2010-04-15 17:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: KOSAKI Motohiro, Mel Gorman, Chris Mason, linux-kernel, linux-mm,
	linux-fsdevel, suleiman


On Apr 15, 2010, at 2:32 AM, Dave Chinner wrote:

> On Thu, Apr 15, 2010 at 01:05:57AM -0700, Suleiman Souhlal wrote:
>>
>> On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
>>
>>> Now, vmscan's pageout() is one source of IO throughput degradation.
>>> Some IO workloads make very many order-0 allocations and reclaims,
>>> and pageout's 4K IOs cause an annoying number of seeks.
>>>
>>> At least kswapd can avoid such pageout() calls, because kswapd
>>> doesn't need to consider the OOM-killer situation; there's no risk.
>>>
>>> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>
>> What's your opinion on trying to cluster the writes done by pageout,
>> instead of not doing any paging out in kswapd?
>
> XFS already does this in ->writepage to try to minimise the impact
> of the way pageout issues IO. It helps, but it is still not as good
> as having all the writeback come from the flusher threads because
> it's still pretty much random IO.

Doesn't the randomness become irrelevant if you can cluster enough
pages?

> And, FWIW, it doesn't solve the stack usage problems, either. In
> fact, it will make them worse as write_one_page() puts another
> struct writeback_control on the stack...

Sorry, this patch was not meant to solve the stack usage problems.

-- Suleiman

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-15 13:42                     ` Chris Mason
@ 2010-04-15 17:50                       ` tytso
  2010-04-16 15:05                       ` Mel Gorman
  1 sibling, 0 replies; 115+ messages in thread
From: tytso @ 2010-04-15 17:50 UTC (permalink / raw)
  To: Chris Mason, Mel Gorman, Dave Chinner, KOSAKI Motohiro,
	linux-kernel, linux-mm, linux-fsdevel

On Thu, Apr 15, 2010 at 09:42:17AM -0400, Chris Mason wrote:
> I'd like to add one more:
> 
> 5. Don't dive into filesystem locks during reclaim.
> 
> This is different from splicing code paths together, but
> the filesystem writepage code has become the center of our attempts at
> doing big fat contiguous writes on disk.  We push off work as late as we
> can until just before the pages go down to disk.
> 
> I'll pick on ext4 and btrfs for a minute, just to broaden the scope
> outside of XFS.  Writepage comes along and the filesystem needs to
> actually find blocks on disk for all the dirty pages it has promised to
> write.
> 
> So, we start a transaction, we take various allocator locks, modify
> different metadata, log changed blocks, take a break (logging is hard
> work you know, need_resched() has triggered by now), stuff it
> all into the file's metadata, log that, and finally return.
> 
> Each of the steps above can block for a long time.  Ext4 solves
> this by not doing them.  ext4_writepage only writes pages that
> are already fully allocated on disk.
> 
> Btrfs is much more efficient at not doing them, it just returns right
> away for PF_MEMALLOC.
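
In code terms, that bail-out is roughly the following pattern in
->writepage (a generic sketch of the PF_MEMALLOC check, not the actual
btrfs code; foo_writepage is a made-up name):

	static int foo_writepage(struct page *page, struct writeback_control *wbc)
	{
		/* Called from reclaim: don't start a transaction or take
		 * allocator locks; put the page back and let the flusher
		 * threads deal with it. */
		if (current->flags & PF_MEMALLOC) {
			redirty_page_for_writepage(wbc, page);
			unlock_page(page);
			return 0;
		}
		/* ... normal block allocation and IO submission ... */
		return 0;
	}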

This is a real problem, BTW.  One of the problems we've been fighting
inside Google is that because ext4_writepage() refuses to write pages
that are subject to delayed allocation, it can cause the OOM killer to
get invoked.

I had thought this was because of some evil games we're playing for
container support that make zones small, but just last night at the
LF Collaboration Summit reception, a technologist from a major
financial-industry customer reported to me that they had run into the
exact same problem. They were running Oracle, which was pinning down 3
gigs of memory, and when they then tried writing a very big file using
ext4, writepage() could not reclaim enough pages, so the kernel fell
back to invoking the OOM killer, and things got ugly in a hurry...

One of the things I was proposing internally to try as a long-term
we-gotta-fix writeback is that we need some kind of signal so that we
can do the lumpy reclaim (a) in a separate process, to avoid a lock
inversion problem and the gee-its-going-to-take-a-long-time problem
which Chris mentioned, and (b) to try to cluster I/O so that we're not
dribbling out writes to the disk in small, seeky, 4k writes, which is
really a disaster from a performance standpoint.  Maybe the VM guys
don't care about this, but this sort of thing tends to get us
filesystem guys all up in a lather not just because of the really
sucky performance, but also because it tends to mean that the system
can thrash itself to death in low memory situations.

    	       	      	     	      	 - Ted

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: simplify shrink_inactive_list()
  2010-04-15 13:15                               ` Mel Gorman
  2010-04-15 15:01                                 ` Andi Kleen
@ 2010-04-15 18:22                                 ` Valdis.Kletnieks
  2010-04-16  9:39                                   ` Mel Gorman
  1 sibling, 1 reply; 115+ messages in thread
From: Valdis.Kletnieks @ 2010-04-15 18:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Dave Chinner, Chris Mason, linux-kernel,
	linux-mm, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 285 bytes --]

On Thu, 15 Apr 2010 14:15:33 BST, Mel Gorman said:

> Yep. I modified bloat-o-meter to work with stacks (imaginatively calling it
> stack-o-meter) and got the following. The prereq patches are from
> earlier in the thread with the subjects

Think that's a script worth having in-tree?

[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd
  2010-04-15 17:27                           ` Suleiman Souhlal
@ 2010-04-15 23:33                             ` Dave Chinner
  2010-04-15 23:41                               ` Suleiman Souhlal
  2010-04-16  9:50                               ` Alan Cox
  0 siblings, 2 replies; 115+ messages in thread
From: Dave Chinner @ 2010-04-15 23:33 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: KOSAKI Motohiro, Mel Gorman, Chris Mason, linux-kernel, linux-mm,
	linux-fsdevel, suleiman

On Thu, Apr 15, 2010 at 10:27:09AM -0700, Suleiman Souhlal wrote:
> 
> On Apr 15, 2010, at 2:32 AM, Dave Chinner wrote:
> 
> >On Thu, Apr 15, 2010 at 01:05:57AM -0700, Suleiman Souhlal wrote:
> >>
> >>On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
> >>
> >>>Now, vmscan's pageout() is one source of IO throughput degradation.
> >>>Some IO workloads make very many order-0 allocations and reclaims,
> >>>and pageout's 4K IOs cause an annoying number of seeks.
> >>>
> >>>At least kswapd can avoid such pageout() calls, because kswapd
> >>>doesn't need to consider the OOM-killer situation; there's no risk.
> >>>
> >>>Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> >>
> >>What's your opinion on trying to cluster the writes done by pageout,
> >>instead of not doing any paging out in kswapd?
> >
> >XFS already does this in ->writepage to try to minimise the impact
> >of the way pageout issues IO. It helps, but it is still not as good
> >as having all the writeback come from the flusher threads because
> >it's still pretty much random IO.
> 
> Doesn't the randomness become irrelevant if you can cluster enough
> pages?

No. If you are doing full disk seeks between random chunks, then you
still lose a large amount of throughput. e.g. if the seek time is
10ms and your IO time is 10ms for each 4k page, then increasing the
size to 64k makes it a 10ms seek and 12ms for the IO. We might increase
throughput but we are still limited to roughly 100 IOs per second. We've
gone from 400kB/s to 6MB/s, but that's still an order of magnitude
short of the 100MB/s that full size IOs with little in the way of seeks
between them will achieve on the same spindle...
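
Spelled out as a back-of-the-envelope calculation (a sketch; the 10ms
seek-bound 4k IO and ~2ms 64k transfer figures are the assumptions above):

	#include <stdio.h>

	int main(void)
	{
		double io_4k_ms = 10.0;		/* seek-dominated 4k IO */
		double io_64k_ms = 12.0;	/* 10ms seek + ~2ms transfer */

		/* ~100 IOs/s * 4kB = ~400kB/s */
		printf("4k  random: %.0f kB/s\n", 1000.0 / io_4k_ms * 4);
		/* ~83 IOs/s * 64kB = ~5.2MB/s, vs ~100MB/s streaming */
		printf("64k random: %.1f MB/s\n", 1000.0 / io_64k_ms * 64 / 1024);
		return 0;
	}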

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: simplify shrink_inactive_list()
  2010-04-15 16:54                                     ` Andi Kleen
@ 2010-04-15 23:40                                       ` Dave Chinner
  2010-04-16  7:13                                         ` Andi Kleen
  2010-04-16 14:57                                         ` Mel Gorman
  2010-04-16 14:55                                       ` Mel Gorman
  1 sibling, 2 replies; 115+ messages in thread
From: Dave Chinner @ 2010-04-15 23:40 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mel Gorman, KOSAKI Motohiro, Chris Mason, linux-kernel, linux-mm,
	linux-fsdevel

On Thu, Apr 15, 2010 at 06:54:16PM +0200, Andi Kleen wrote:
> > It's a buying-time venture, I'll agree but as both approaches are only
> > about reducing stack usage they wouldn't be long-term solutions by your
> > criteria. What do you suggest?
> 
> (from easy to more complicated):
> 
> - Disable direct reclaim with 4K stacks

Just to re-iterate: we're blowing the stack with direct reclaim on
x86_64  w/ 8k stacks.  The old i386/4k stack problem is a red
herring.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if  current is kswapd
  2010-04-15 23:33                             ` Dave Chinner
@ 2010-04-15 23:41                               ` Suleiman Souhlal
  2010-04-16  9:50                               ` Alan Cox
  1 sibling, 0 replies; 115+ messages in thread
From: Suleiman Souhlal @ 2010-04-15 23:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Suleiman Souhlal, KOSAKI Motohiro, Mel Gorman, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel

On Thu, Apr 15, 2010 at 4:33 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Apr 15, 2010 at 10:27:09AM -0700, Suleiman Souhlal wrote:
>>
>> On Apr 15, 2010, at 2:32 AM, Dave Chinner wrote:
>>
>> >On Thu, Apr 15, 2010 at 01:05:57AM -0700, Suleiman Souhlal wrote:
>> >>
>> >>On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
>> >>
>> >>>Now, vmscan's pageout() is one source of IO throughput degradation.
>> >>>Some IO workloads make very many order-0 allocations and reclaims,
>> >>>and pageout's 4K IOs cause an annoying number of seeks.
>> >>>
>> >>>At least kswapd can avoid such pageout() calls, because kswapd
>> >>>doesn't need to consider the OOM-killer situation; there's no risk.
>> >>>
>> >>>Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>> >>
>> >>What's your opinion on trying to cluster the writes done by pageout,
>> >>instead of not doing any paging out in kswapd?
>> >
>> >XFS already does this in ->writepage to try to minimise the impact
>> >of the way pageout issues IO. It helps, but it is still not as good
>> >as having all the writeback come from the flusher threads because
>> >it's still pretty much random IO.
>>
>> Doesn't the randomness become irrelevant if you can cluster enough
>> pages?
>
> No. If you are doing full disk seeks between random chunks, then you
> still lose a large amount of throughput. e.g. if the seek time is
> 10ms and your IO time is 10ms for each 4k page, then increasing the
> size to 64k makes it a 10ms seek and 12ms for the IO. We might increase
> throughput but we are still limited to roughly 100 IOs per second. We've
> gone from 400kB/s to 6MB/s, but that's still an order of magnitude
> short of the 100MB/s that full size IOs with little in the way of seeks
> between them will achieve on the same spindle...

What I meant was that, theoretically speaking, you could increase the
maximum number of pages that get clustered so that you could get
100MB/s, although it most likely wouldn't be a good idea with the
current patch.

-- Suleiman

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-13  0:17 [PATCH] mm: disallow direct reclaim page writeback Dave Chinner
                   ` (2 preceding siblings ...)
  2010-04-14  0:24 ` Minchan Kim
@ 2010-04-16  1:13 ` KAMEZAWA Hiroyuki
  2010-04-16  4:18   ` KAMEZAWA Hiroyuki
  3 siblings, 1 reply; 115+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-16  1:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-mm, linux-fsdevel

On Tue, 13 Apr 2010 10:17:58 +1000
Dave Chinner <david@fromorbit.com> wrote:

> From: Dave Chinner <dchinner@redhat.com>
> 
> When we enter direct reclaim we may have used an arbitrary amount of stack
> space, and hence entering the filesystem to do writeback can then lead to
> stack overruns. This problem was recently encountered on x86_64 systems with
> 8k stacks running XFS with simple storage configurations.
> 
> Writeback from direct reclaim also adversely affects background writeback. The
> background flusher threads should already be taking care of cleaning dirty
> pages, and direct reclaim will kick them if they aren't already doing work. If
> direct reclaim is also calling ->writepage, it will cause the IO patterns from
> the background flusher threads to be upset by LRU-order writeback from
> pageout() which can be effectively random IO. Having competing sources of IO
> trying to clean pages on the same backing device reduces throughput by
> increasing the amount of seeks that the backing device has to do to write back
> the pages.
> 
> Hence for direct reclaim we should not allow ->writepages to be entered at all.
> Set up the relevant scan_control structures to enforce this, and prevent
> sc->may_writepage from being set in other places in the direct reclaim path in
> response to other events.
> 
> Reported-by: John Berthels <john@humyo.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Hmm. Then, if a memory cgroup is filled with dirty pages, it can't kick
writeback and has to wait for someone else's writeback?

How long will this take?
# mount -t cgroup none /cgroup -o memory
# mkdir /cgroup/A
# echo 20M > /cgroup/A/memory.limit_in_bytes
# echo $$ > /cgroup/A/tasks
# dd if=/dev/zero of=./tmpfile bs=4096 count=1000000

Can memcg ask the writeback thread to "Wake Up Now! and Write this out!" effectively?

Thanks,
-Kame

> ---
>  mm/vmscan.c |   13 ++++++-------
>  1 files changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e0e5f15..5321ac4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1826,10 +1826,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  		 * writeout.  So in laptop mode, write out the whole world.
>  		 */
>  		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
> -		if (total_scanned > writeback_threshold) {
> +		if (total_scanned > writeback_threshold)
>  			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
> -			sc->may_writepage = 1;
> -		}
>  
>  		/* Take a nap, wait for some writeback to complete */
>  		if (!sc->hibernation_mode && sc->nr_scanned &&
> @@ -1871,7 +1869,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  {
>  	struct scan_control sc = {
>  		.gfp_mask = gfp_mask,
> -		.may_writepage = !laptop_mode,
> +		.may_writepage = 0,
>  		.nr_to_reclaim = SWAP_CLUSTER_MAX,
>  		.may_unmap = 1,
>  		.may_swap = 1,
> @@ -1893,7 +1891,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
>  						struct zone *zone, int nid)
>  {
>  	struct scan_control sc = {
> -		.may_writepage = !laptop_mode,
> +		.may_writepage = 0,
>  		.may_unmap = 1,
>  		.may_swap = !noswap,
>  		.swappiness = swappiness,
> @@ -1926,7 +1924,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>  {
>  	struct zonelist *zonelist;
>  	struct scan_control sc = {
> -		.may_writepage = !laptop_mode,
> +		.may_writepage = 0,
>  		.may_unmap = 1,
>  		.may_swap = !noswap,
>  		.nr_to_reclaim = SWAP_CLUSTER_MAX,
> @@ -2567,7 +2565,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>  	struct reclaim_state reclaim_state;
>  	int priority;
>  	struct scan_control sc = {
> -		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
> +		.may_writepage = (current_is_kswapd() &&
> +					(zone_reclaim_mode & RECLAIM_WRITE)),
>  		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
>  		.may_swap = 1,
>  		.nr_to_reclaim = max_t(unsigned long, nr_pages,
> -- 
> 1.6.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-15 10:28                   ` [PATCH] mm: disallow direct reclaim page writeback Mel Gorman
  2010-04-15 13:42                     ` Chris Mason
@ 2010-04-16  4:14                     ` Dave Chinner
  2010-04-16 15:14                       ` Mel Gorman
  1 sibling, 1 reply; 115+ messages in thread
From: Dave Chinner @ 2010-04-16  4:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Chris Mason, linux-kernel, linux-mm, linux-fsdevel

On Thu, Apr 15, 2010 at 11:28:37AM +0100, Mel Gorman wrote:
> On Thu, Apr 15, 2010 at 11:34:36AM +1000, Dave Chinner wrote:
> > On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> > If you ask it to clean a bunch of pages around the one you want to
> > reclaim on the LRU, there is a good chance it will also be cleaning
> > pages that are near the end of the LRU or physically close by as
> > well. It's not a guarantee, but for the additional IO cost of about
> > 10% wall time on that IO to clean the page you need, you also get
> > 1-2 orders of magnitude other pages cleaned. That sounds like a
> > win any way you look at it...
> 
> At worst, it'll distort the LRU ordering slightly. Let's say the
> file-adjacent page you clean was near the end of the LRU. Before such a
> patch, it may have gotten cleaned and done another lap of the LRU.
> After, it would be reclaimed sooner. I don't know if we depend on such
> behaviour (very doubtful) but it's a subtle enough change. I can't
> predict what it'll do for IO congestion. Simplistically, there is more
> IO so it's bad but if the write pattern is less seeky and we needed to
> write the pages anyway, it might be improved.

Fundamentally, we have so many pages on the LRU, getting a few out
of order at the back end of it is going to be in the noise. If we
trade off "perfect" LRU behaviour for cleaning pages an order of
magnitude faster, reclaim will find candidate pages a whole lot
faster. And if we have more clean pages available sooner, overall
system throughput is going to improve and be much less likely to
fall into deep, dark holes where the OOM-killer is the light at the
end.....

[ snip questions Chris answered ]

> > what I'm
> > pointing out is that the arguments that it is too hard or there are
> > no interfaces available to issue larger IO from reclaim are not at
> > all valid.
> > 
> 
> Sure, I'm not resisting fixing this, just your first patch :) There are four
> goals here
> 
> 1. Reduce stack usage
> 2. Avoid the splicing of subsystem stack usage with direct reclaim
> 3. Preserve lumpy reclaims cleaning of contiguous pages
> 4. Try and not drastically alter LRU aging
> 
> 1 and 2 are important for you, 3 is important for me and 4 will have to
> be dealt with on a case-by-case basis.

#4 is important to me, too, because that has direct impact on large
file IO workloads. However, it is gross changes in behaviour that
concern me, not subtle, probably-in-the-noise changes that you're
concerned about. :)

Your patch fixes 2, avoids 1, breaks 3 and I haven't thought about 4 but I
> guess dirty pages can cycle around more so it'd need to be cared for.

Well, you keep saying that they break #3, but I haven't seen any
test cases or results showing that. I've been unable to confirm that
lumpy reclaim is broken by disallowing writeback in my testing, so
I'm interested to know what tests you are running that show it is
broken...

> > How about this? For now, we stop direct reclaim from doing writeback
> > only on order zero allocations, but allow it for higher order
> > allocations. That will prevent the majority of situations where
> > direct reclaim blows the stack and interferes with background
> > writeout, but won't cause lumpy reclaim to change behaviour.
> > > This reduces the scope of impact and hence the testing and validation
> > > that needs to be done.
> > 
> > Then we can work towards allowing lumpy reclaim to use background
> > threads as Chris suggested for doing specific writeback operations
> > to solve the remaining problems being seen. Does this seem like a
> > reasonable compromise and approach to dealing with the problem?
> > 
> 
> I'd like this to be plan b (or maybe c or d) if we cannot reduce stack usage
> enough or come up with an alternative fix. From the goals above it mitigates
> 1, mitigates 2, addresses 3 but potentially allows dirty pages to remain on
> the LRU with 4 until the background cleaner or kswapd comes along.

We've been through this already, but I'll repeat it again in the
hope it sinks in: reducing stack usage is not sufficient to stay
within an 8k stack if we can enter writeback with an arbitrary
amount of stack already consumed.

We've already got a report of 9k of stack usage (7200 bytes left on
an order-2 stack) and this is without a complex storage stack - it's
just a partition on a SATA drive. We can easily add another 1k,
possibly 2k to that stack depth with a complex storage subsystem.
Trimming this much (3-4k) is simply not feasible in a callchain that
is 50-70 functions deep...

> One reason why I am edgy about this is that lumpy reclaim can kick in
> for low-enough orders too like order-1 pages for stacks in some cases or
> order-2 pages for network cards using jumbo frames or some wireless
> cards. The network cards in particular could still cause the stack
> overflow but be much harder to reproduce and detect.

So push lumpy reclaim into a separate thread. It already blocks, so
waiting for some other thread to do the work won't change anything.
Separating high-order reclaim from LRU reclaim is probably a good
idea, anyway - they use different algorithms and while the two are
intertwined it's hard to optimise/improve either....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-16  1:13 ` KAMEZAWA Hiroyuki
@ 2010-04-16  4:18   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 115+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-16  4:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Dave Chinner, linux-kernel, linux-mm, linux-fsdevel

On Fri, 16 Apr 2010 10:13:39 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
 
> Hmm. Then, if a memory cgroup is filled with dirty pages, it can't kick
> writeback and has to wait for someone else's writeback?
> 
> How long will this take?
> # mount -t cgroup none /cgroup -o memory
> # mkdir /cgroup/A
> # echo 20M > /cgroup/A/memory.limit_in_bytes
> # echo $$ > /cgroup/A/tasks
> # dd if=/dev/zero of=./tmpfile bs=4096 count=1000000
> 
> Can memcg ask the writeback thread to "Wake Up Now! and Write this out!" effectively?
> 

Hmm.. I saw an oom-kill while testing several cases, but performance itself
does not seem to be much different with or without the patch.
But I'm unhappy with the oom-kill, so some tweaking for memcg will be
necessary if we go with this.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: simplify shrink_inactive_list()
  2010-04-15 23:40                                       ` Dave Chinner
@ 2010-04-16  7:13                                         ` Andi Kleen
  2010-04-16 14:57                                         ` Mel Gorman
  1 sibling, 0 replies; 115+ messages in thread
From: Andi Kleen @ 2010-04-16  7:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andi Kleen, Mel Gorman, KOSAKI Motohiro, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel

On Fri, Apr 16, 2010 at 09:40:13AM +1000, Dave Chinner wrote:
> On Thu, Apr 15, 2010 at 06:54:16PM +0200, Andi Kleen wrote:
> > > It's a buying-time venture, I'll agree but as both approaches are only
> > > about reducing stack usage they wouldn't be long-term solutions by your
> > > criteria. What do you suggest?
> > 
> > (from easy to more complicated):
> > 
> > - Disable direct reclaim with 4K stacks
> 
> Just to re-iterate: we're blowing the stack with direct reclaim on
> x86_64  w/ 8k stacks.  The old i386/4k stack problem is a red
> herring.

Yes, that's known, but with 4K stacks it will definitely not work at all.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: simplify shrink_inactive_list()
  2010-04-15 18:22                                 ` Valdis.Kletnieks
@ 2010-04-16  9:39                                   ` Mel Gorman
  0 siblings, 0 replies; 115+ messages in thread
From: Mel Gorman @ 2010-04-16  9:39 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: KOSAKI Motohiro, Dave Chinner, Chris Mason, linux-kernel,
	linux-mm, linux-fsdevel

On Thu, Apr 15, 2010 at 02:22:01PM -0400, Valdis.Kletnieks@vt.edu wrote:
> On Thu, 15 Apr 2010 14:15:33 BST, Mel Gorman said:
> 
> > Yep. I modified bloat-o-meter to work with stacks (imaginatively calling it
> > stack-o-meter) and got the following. The prereq patches are from
> > earlier in the thread with the subjects
> 
> Think that's a script worth having in-tree?

Ahh, it's a hatchet-job at the moment. I copied bloat-o-meter and
altered one function. I made a TODO note to extend bloat-o-meter
properly and that would be worth merging.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd
  2010-04-15 23:33                             ` Dave Chinner
  2010-04-15 23:41                               ` Suleiman Souhlal
@ 2010-04-16  9:50                               ` Alan Cox
  2010-04-17  3:06                                 ` Dave Chinner
  1 sibling, 1 reply; 115+ messages in thread
From: Alan Cox @ 2010-04-16  9:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Suleiman Souhlal, KOSAKI Motohiro, Mel Gorman, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel, suleiman

> No. If you are doing full disk seeks between random chunks, then you
> still lose a large amount of throughput. e.g. if the seek time is
> 10ms and your IO time is 10ms for each 4k page, then increasing the
> size to 64k makes it a 10ms seek and 12ms for the IO. We might increase
> throughput but we are still limited to roughly 100 IOs per second. We've
> gone from 400kB/s to 6MB/s, but that's still an order of magnitude
> short of the 100MB/s that full size IOs with little in the way of seeks
> between them will achieve on the same spindle...

The usual armwaving numbers for ops/sec for an ATA disk are in the 200
ops/sec range, so that seems horribly credible.

But then I've never quite understood why our anonymous paging isn't
sorting stuff as best it can and then using the drive as a log structure
with in-memory metadata so it can stream the pages onto disk. Read
performance is going to be similar (maybe better if you have a log tidy
when idle); writes ought to be far better.

Alan

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: simplify shrink_inactive_list()
  2010-04-15 16:54                                     ` Andi Kleen
  2010-04-15 23:40                                       ` Dave Chinner
@ 2010-04-16 14:55                                       ` Mel Gorman
  1 sibling, 0 replies; 115+ messages in thread
From: Mel Gorman @ 2010-04-16 14:55 UTC (permalink / raw)
  To: Andi Kleen
  Cc: KOSAKI Motohiro, Dave Chinner, Chris Mason, linux-kernel,
	linux-mm, linux-fsdevel

On Thu, Apr 15, 2010 at 06:54:16PM +0200, Andi Kleen wrote:
> > It's a buying-time venture, I'll agree but as both approaches are only
> > about reducing stack usage they wouldn't be long-term solutions by your
> > criteria. What do you suggest?
> 
> (from easy to more complicated):
> 
> - Disable direct reclaim with 4K stacks

Do not like. While I can see why 4K stacks are a serious problem, I'd
sooner see 4K stacks disabled than have the kernel behave so differently
for direct reclaim. It'd be tricky to spot regressions in reclaim that
were due to this .config option.

> - Do direct reclaim only on separate stacks

This is looking more and more attractive.

> - Add interrupt stacks to any 8K stack architectures.

This is a similar but separate problem. It's similar in that interrupt
stacks can splice subsystems together in terms of stack usage.

> - Get rid of 4K stacks completely

Why would we *not* do this? I can't remember the original reasoning
behind 4K stacks, but am guessing it helped startup times for fork-oriented
workloads in the days before lumpy reclaim and better fragmentation
control.

Who typically enables this option?

> - Think about any other stackings that could give large scale recursion
> and find ways to run them on separate stacks too.

The patch series I threw up about reducing stack was a cut-down
approach. Instead of using separate stacks, keep the stack usage out of
the main caller path where possible.

> - Long term: maybe we need 16K stacks at some point, depending on how
> good the VM gets. Alternative would be to stop making Linux more complicated,
> but that's unlikely to happen.
> 

Make this Plan D if nothing else works out and we still hit a wall?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: simplify shrink_inactive_list()
  2010-04-15 23:40                                       ` Dave Chinner
  2010-04-16  7:13                                         ` Andi Kleen
@ 2010-04-16 14:57                                         ` Mel Gorman
  2010-04-17  2:37                                           ` Dave Chinner
  1 sibling, 1 reply; 115+ messages in thread
From: Mel Gorman @ 2010-04-16 14:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andi Kleen, KOSAKI Motohiro, Chris Mason, linux-kernel, linux-mm,
	linux-fsdevel

On Fri, Apr 16, 2010 at 09:40:13AM +1000, Dave Chinner wrote:
> On Thu, Apr 15, 2010 at 06:54:16PM +0200, Andi Kleen wrote:
> > > It's a buying-time venture, I'll agree but as both approaches are only
> > > about reducing stack usage they wouldn't be long-term solutions by your
> > > criteria. What do you suggest?
> > 
> > (from easy to more complicated):
> > 
> > - Disable direct reclaim with 4K stacks
> 
> Just to re-iterate: we're blowing the stack with direct reclaim on
> x86_64  w/ 8k stacks. 

Yep, that is not being disputed. By the way, what did you use to
generate your report? Was it CONFIG_DEBUG_STACK_USAGE or something else?
I used a modified bloat-o-meter to gather my data but it'd be nice to
be sure I'm seeing the same things as you (minus XFS unless I
specifically set it up).

> The old i386/4k stack problem is a red
> herring.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-15 13:42                     ` Chris Mason
  2010-04-15 17:50                       ` tytso
@ 2010-04-16 15:05                       ` Mel Gorman
  2010-04-19 15:15                         ` Mel Gorman
  1 sibling, 1 reply; 115+ messages in thread
From: Mel Gorman @ 2010-04-16 15:05 UTC (permalink / raw)
  To: Chris Mason, Dave Chinner, KOSAKI Motohiro, linux-kernel,
	linux-mm, linux-fsdevel

On Thu, Apr 15, 2010 at 09:42:17AM -0400, Chris Mason wrote:
> On Thu, Apr 15, 2010 at 11:28:37AM +0100, Mel Gorman wrote:
> > On Thu, Apr 15, 2010 at 11:34:36AM +1000, Dave Chinner wrote:
> > > On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> > > > On Wed, Apr 14, 2010 at 05:28:30PM +1000, Dave Chinner wrote:
> > > > > On Wed, Apr 14, 2010 at 03:52:44PM +0900, KOSAKI Motohiro wrote:
> > > > > > > On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > > > > > > > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > > > > > > > Basically, there is not enough stack space available to allow direct
> > > > > > > > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > > > > > > > profiles we are seeing here....
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > I'm not denying the evidence, but how have we gotten away with it for years
> > > > > > > > > then? Prevention of writeback isn't the answer without figuring out how
> > > > > > > > > direct reclaimers can queue pages for IO and, in the case of lumpy reclaim
> > > > > > > > > doing sync IO, then wait on those pages.
> > > > > > > > 
> > > > > > > > So, I've been reading along, nodding my head to Dave's side of things
> > > > > > > > because seeks are evil and direct reclaim makes seeks.  I'd really love
> > > > > > > > for direct reclaim to somehow trigger writepages on large chunks instead
> > > > > > > > of doing page by page spatters of IO to the drive.
> > > > > > 
> > > > > > I agree that "seeks are evil and direct reclaim makes seeks". Actually,
> > > > > > making 4k IO is not a must for pageout. So, probably we can improve it.
> > > > > > 
> > > > > > 
> > > > > > > Perhaps drop the lock on the page if it is held and call one of the
> > > > > > > helpers that filesystems use to do this, like:
> > > > > > > 
> > > > > > > 	filemap_write_and_wait(page->mapping);
> > > > > > 
> > > > > > Sorry, I'm lost as to what you're talking about. Why do we need per-file
> > > > > > waiting? If the file is a 1GB file, do we need to wait for 1GB of writeout?
> > > > > 
> > > > > So use filemap_fdatawrite(page->mapping), or if it's better only
> > > > > to start IO on a segment of the file, use
> > > > > filemap_fdatawrite_range(page->mapping, start, end)....
> > > > 
> > > > That does not help the stack usage issue; the caller ends up in
> > > > ->writepages. From an IO perspective, it'll be better from a seek point of
> > > > view but from a VM perspective, it may or may not be cleaning the right pages.
> > > > So I think this is a red herring.
> > > 
> > > If you ask it to clean a bunch of pages around the one you want to
> > > reclaim on the LRU, there is a good chance it will also be cleaning
> > > pages that are near the end of the LRU or physically close by as
> > > well. It's not a guarantee, but for the additional IO cost of about
> > > 10% wall time on that IO to clean the page you need, you also get
> > > 1-2 orders of magnitude other pages cleaned. That sounds like a
> > > win any way you look at it...
> > > 
> > 
> > At worst, it'll distort the LRU ordering slightly. Let's say the
> > file-adjacent page you clean was near the end of the LRU. Before such a
> > patch, it may have gotten cleaned and done another lap of the LRU.
> > After, it would be reclaimed sooner. I don't know if we depend on such
> > behaviour (very doubtful) but it's a subtle enough change. I can't
> > predict what it'll do for IO congestion. Simplistically, there is more
> > IO so it's bad but if the write pattern is less seeky and we needed to
> > write the pages anyway, it might be improved.
> > 
> > > I agree that it doesn't solve the stack problem (Chris' suggestion
> > > that we enable the bdi flusher interface would fix this);
> > 
> > I'm afraid I'm not familiar with this interface. Can you point me at
> > some previous discussion so that I am sure I am looking at the right
> > thing?
> 
> vi fs/direct-reclaim-helper.c, it has a few placeholders for where the
> real code needs to go....just look for the ~ marks.
> 

I must be blind. What tree is this in? I can't see it in v2.6.34-rc4,
mmotm or google.

> I mostly meant that the bdi helper threads were the best place to add
> knowledge about which pages we want to write for reclaim.  We might need
> to add a thread dedicated to just doing the VM's dirty work, but that's
> where I would start discussing fancy new interfaces.
> 
> > 
> > > what I'm
> > > pointing out is that the arguments that it is too hard or there are
> > > no interfaces available to issue larger IO from reclaim are not at
> > > all valid.
> > > 
> > 
> > Sure, I'm not resisting fixing this, just your first patch :) There are four
> > goals here
> > 
> > 1. Reduce stack usage
> > 2. Avoid the splicing of subsystem stack usage with direct reclaim
> > 3. Preserve lumpy reclaims cleaning of contiguous pages
> > 4. Try and not drastically alter LRU aging
> > 
> > 1 and 2 are important for you, 3 is important for me and 4 will have to
> > be dealt with on a case-by-case basis.
> > 
> > Your patch fixes 2, avoids 1, breaks 3 and I haven't thought about 4 but I
> > guess dirty pages can cycle around more so it'd need to be cared for.
> 
> I'd like to add one more:
> 
> 5. Don't dive into filesystem locks during reclaim.
> 

Good add. It's not a new problem either. This came up at least two years
ago at around the first VM/FS summit and the response was along the lines
of shuffling uncomfortably :/

> This is different from splicing code paths together, but
> the filesystem writepage code has become the center of our attempts at
> doing big fat contiguous writes on disk.  We push off work as late as we
> can until just before the pages go down to disk.
> 
> I'll pick on ext4 and btrfs for a minute, just to broaden the scope
> outside of XFS.  Writepage comes along and the filesystem needs to
> actually find blocks on disk for all the dirty pages it has promised to
> write.
> 
> So, we start a transaction, we take various allocator locks, modify
> different metadata, log changed blocks, take a break (logging is hard
> work you know, need_resched() has triggered by now), stuff it
> all into the file's metadata, log that, and finally return.
> 
> Each of the steps above can block for a long time.  Ext4 solves
> this by not doing them.  ext4_writepage only writes pages that
> are already fully allocated on disk.
> 
> Btrfs is much more efficient at not doing them, it just returns right
> away for PF_MEMALLOC.
> 
> This is a long way of saying the filesystem writepage code is the
> opposite of what direct reclaim wants.  Direct reclaim wants to
> find free RAM now, and if it does end up in the mess described above,
> it'll just get stuck for a long time on work entirely unrelated to
> finding free pages.
> 

Ok, good summary, thanks. I was only partially aware of some of these.
i.e. I knew it was a problem but was not sensitive to how bad it was.
Your last point is interesting because lumpy reclaim for large orders under
heavy pressure can make the system stutter badly (e.g. during a huge
page pool resize). I had blamed just plain IO but messing around with
locks and transactions could have been a large factor and I didn't go
looking for it.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-16  4:14                     ` Dave Chinner
@ 2010-04-16 15:14                       ` Mel Gorman
  2010-04-18  0:32                         ` Andrew Morton
  2010-04-19 15:20                         ` Mel Gorman
  0 siblings, 2 replies; 115+ messages in thread
From: Mel Gorman @ 2010-04-16 15:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: KOSAKI Motohiro, Chris Mason, linux-kernel, linux-mm, linux-fsdevel

On Fri, Apr 16, 2010 at 02:14:12PM +1000, Dave Chinner wrote:
> On Thu, Apr 15, 2010 at 11:28:37AM +0100, Mel Gorman wrote:
> > On Thu, Apr 15, 2010 at 11:34:36AM +1000, Dave Chinner wrote:
> > > On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> > > If you ask it to clean a bunch of pages around the one you want to
> > > reclaim on the LRU, there is a good chance it will also be cleaning
> > > pages that are near the end of the LRU or physically close by as
> > > well. It's not a guarantee, but for the additional IO cost of about
> > > 10% wall time on that IO to clean the page you need, you also get
> > > 1-2 orders of magnitude other pages cleaned. That sounds like a
> > > win any way you look at it...
> > 
> > At worst, it'll distort the LRU ordering slightly. Let's say the
> > file-adjacent page you clean was near the end of the LRU. Before such a
> > patch, it may have gotten cleaned and done another lap of the LRU.
> > After, it would be reclaimed sooner. I don't know if we depend on such
> > behaviour (very doubtful) but it's a subtle enough change. I can't
> > predict what it'll do for IO congestion. Simplistically, there is more
> > IO so it's bad but if the write pattern is less seeky and we needed to
> > write the pages anyway, it might be improved.
> 
> Fundamentally, we have so many pages on the LRU, getting a few out
> of order at the back end of it is going to be in the noise. If we
> trade off "perfect" LRU behaviour for cleaning pages an order of

haha, I don't think anyone pretends the LRU behaviour is perfect.
Altering its existing behaviour tends to be done with great care but
from what I gather that is often a case of "better the devil you know".

> magnitude faster, reclaim will find candidate pages a whole lot
> faster. And if we have more clean pages available sooner, overall
> system throughput is going to improve and be much less likely to
> fall into deep, dark holes where the OOM-killer is the light at the
> end.....
> 
> [ snip questions Chris answered ]
> 
> > > what I'm
> > > pointing out is that the arguments that it is too hard or there are
> > > no interfaces available to issue larger IO from reclaim are not at
> > > all valid.
> > > 
> > 
> > Sure, I'm not resisting fixing this, just your first patch :) There are four
> > goals here
> > 
> > 1. Reduce stack usage
> > 2. Avoid the splicing of subsystem stack usage with direct reclaim
> > 3. Preserve lumpy reclaims cleaning of contiguous pages
> > 4. Try and not drastically alter LRU aging
> > 
> > 1 and 2 are important for you, 3 is important for me and 4 will have to
> > be dealt with on a case-by-case basis.
> 
> #4 is important to me, too, because that has direct impact on large
> file IO workloads. however, it is gross changes in behaviour that
> concern me, not subtle, probably-in-the-noise changes that you're
> concerned about. :)
> 

I'm also less concerned with this aspect. I brought it up because it was
a factor. I don't think it'll cause us problems but if problems do
arise, it's nice to have a few potential candidates to examine in
advance.

> > Your patch fixes 2, avoids 1, breaks 3 and I haven't thought about 4 but I
> > guess dirty pages can cycle around more so it'd need to be cared for.
> 
> Well, you keep saying that they break #3, but I haven't seen any
> test cases or results showing that. I've been unable to confirm that
> lumpy reclaim is broken by disallowing writeback in my testing, so
> I'm interested to know what tests you are running that show it is
> broken...
> 

Ok, I haven't actually tested this. The machines I use are tied up
retesting the compaction patches at the moment. The reason why I reckon
it'll be a problem is that when these sync-writeback changes were
introduced, it significantly helped lumpy reclaim for huge pages. I am
making an assumption that backing out those changes will hurt it.

I'll test for real on Monday and see what falls out.

> > > How about this? For now, we stop direct reclaim from doing writeback
> > > only on order zero allocations, but allow it for higher order
> > > allocations. That will prevent the majority of situations where
> > > direct reclaim blows the stack and interferes with background
> > > writeout, but won't cause lumpy reclaim to change behaviour.
> > > This reduces the scope of impact and hence the testing and validation
> > > that needs to be done.
> > > 
> > > Then we can work towards allowing lumpy reclaim to use background
> > > threads as Chris suggested for doing specific writeback operations
> > > to solve the remaining problems being seen. Does this seem like a
> > > reasonable compromise and approach to dealing with the problem?
> > > 
> > 
> > I'd like this to be plan b (or maybe c or d) if we cannot reduce stack usage
> > enough or come up with an alternative fix. From the goals above it mitigates
> > 1, mitigates 2, addresses 3 but potentially allows dirty pages to remain on
> > the LRU with 4 until the background cleaner or kswapd comes along.
> 
> We've been through this already, but I'll repeat it again in the
> hope it sinks in: reducing stack usage is not sufficient to stay
> within an 8k stack if we can enter writeback with an arbitrary
> amount of stack already consumed.
> 
> We've already got a report of 9k of stack usage (7200 bytes left on
> an order-2 stack) and this is without a complex storage stack - it's
> just a partition on a SATA drive. We can easily add another 1k,
> possibly 2k to that stack depth with a complex storage subsystem.
> Trimming this much (3-4k) is simply not feasible in a callchain that
> is 50-70 functions deep...
> 

Ok, based on this, I'll stop working on the stack-reduction patches.
I'll test what I have and push it but I won't bring it further for the
moment and instead look at putting writeback into its own thread. If
someone else works on it in the meantime, I'll review and test from the
perspective of lumpy reclaim.

> > One reason why I am edgy about this is that lumpy reclaim can kick in
> > for low-enough orders too like order-1 pages for stacks in some cases or
> > order-2 pages for network cards using jumbo frames or some wireless
> > cards. The network cards in particular could still cause the stack
> > overflow but be much harder to reproduce and detect.
> 
> So push lumpy reclaim into a separate thread. It already blocks, so
> waiting for some other thread to do the work won't change anything.

No, it wouldn't. As long as it can wait on the right pages, it doesn't
really matter who does the work.

> Separating high-order reclaim from LRU reclaim is probably a good
> idea, anyway - they use different algorithms and while the two are
> intertwined it's hard to optimise/improve either....
> 

They are not a million miles apart either. Lumpy reclaim uses the LRU to
select a cursor page and then reclaims around it. Improvements on LRU tend
to help lumpy reclaim as well. It's why during the tests I run I can often
allocate 80-95% of memory as huge pages on x86-64, as opposed to when anti-frag
was first being developed, when getting 30% was a cause for celebration :)

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-15  2:43                   ` KOSAKI Motohiro
@ 2010-04-16 23:56                     ` Johannes Weiner
  0 siblings, 0 replies; 115+ messages in thread
From: Johannes Weiner @ 2010-04-16 23:56 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Dave Chinner, Chris Mason, linux-kernel, linux-mm,
	linux-fsdevel

On Thu, Apr 15, 2010 at 11:43:48AM +0900, KOSAKI Motohiro wrote:
> > I already have some patches to remove trivial parts of struct scan_control,
> > namely may_unmap, may_swap, all_unreclaimable and isolate_pages.  The rest
> > needs a deeper look.
> 
> Seems interesting, but putting scan_control on a diet is not so effective.
> How many bytes can we save by it?

Not much, it cuts 16 bytes on x86 32 bit.  The bigger gain is the code
clarification it comes with.  There is too much state to keep track of
in reclaim.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: simplify shrink_inactive_list()
  2010-04-16 14:57                                         ` Mel Gorman
@ 2010-04-17  2:37                                           ` Dave Chinner
  0 siblings, 0 replies; 115+ messages in thread
From: Dave Chinner @ 2010-04-17  2:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andi Kleen, KOSAKI Motohiro, Chris Mason, linux-kernel, linux-mm,
	linux-fsdevel

On Fri, Apr 16, 2010 at 03:57:07PM +0100, Mel Gorman wrote:
> On Fri, Apr 16, 2010 at 09:40:13AM +1000, Dave Chinner wrote:
> > On Thu, Apr 15, 2010 at 06:54:16PM +0200, Andi Kleen wrote:
> > > > It's a buying-time venture, I'll agree but as both approaches are only
> > > > about reducing stack usage they wouldn't be long-term solutions by your
> > > > criteria. What do you suggest?
> > > 
> > > (from easy to more complicated):
> > > 
> > > - Disable direct reclaim with 4K stacks
> > 
> > Just to re-iterate: we're blowing the stack with direct reclaim on
> > x86_64  w/ 8k stacks. 
> 
> Yep, that is not being disputed. By the way, what did you use to
> generate your report? Was it CONFIG_DEBUG_STACK_USAGE or something else?
> I used a modified bloat-o-meter to gather my data but it'd be nice to
> be sure I'm seeing the same things as you (minus XFS unless I
> specifically set it up).

I'm using the tracing subsystem to get them. Doesn't everyone use
that now? ;)

$ grep STACK .config
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
# CONFIG_CC_STACKPROTECTOR is not set
CONFIG_STACKTRACE=y
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_STACK_TRACER=y
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set

Then:

# echo 1 > /proc/sys/kernel/stack_tracer_enabled

<run workloads>

Monitor the worst recorded stack usage as it changes via:

# cat /sys/kernel/debug/tracing/stack_trace
        Depth    Size   Location    (44 entries)
        -----    ----   --------
  0)     5584     288   get_page_from_freelist+0x5c0/0x830
  1)     5296     272   __alloc_pages_nodemask+0x102/0x730
  2)     5024      48   kmem_getpages+0x62/0x160
  3)     4976      96   cache_grow+0x308/0x330
  4)     4880      96   cache_alloc_refill+0x27f/0x2c0
  5)     4784      96   __kmalloc+0x241/0x250
  6)     4688     112   vring_add_buf+0x233/0x420
......


Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd
  2010-04-16  9:50                               ` Alan Cox
@ 2010-04-17  3:06                                 ` Dave Chinner
  0 siblings, 0 replies; 115+ messages in thread
From: Dave Chinner @ 2010-04-17  3:06 UTC (permalink / raw)
  To: Alan Cox
  Cc: Suleiman Souhlal, KOSAKI Motohiro, Mel Gorman, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel, suleiman

On Fri, Apr 16, 2010 at 10:50:02AM +0100, Alan Cox wrote:
> > No. If you are doing full disk seeks between random chunks, then you
> > still lose a large amount of throughput. e.g. if the seek time is
> > 10ms and your IO time is 10ms for each 4k page, then increasing the
> > size to 64k makes it 10ms seek and 12ms for the IO. We might increase
> > throughput but we are still limited to 100 IOs per second. We've
> > gone from 400kB/s to 6MB/s, but that's still an order of magnitude
> > short of the 100MB/s that full-size IOs with little in the way of seeks
> > between them will achieve on the same spindle...
> 
> The usual armwaving numbers for ops/sec for an ATA disk are in the 200
> ops/sec range so that seems horribly credible.

Yeah, in my experience 7200rpm SATA will get you 200 ops/s when you
are doing really small seeks as the typical minimum seek time is
around 4-5ms. Average seek time, however, is usually in the range of
10ms, because full head sweep + spindle rotation seeks take in the
order of 15ms.

Hence small random IO tends to result in seek times nearer the
average seek time than the minimum, so that's what I tend to use for
determining the number of ops/s a disk will sustain.
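
For reference, a quick back-of-envelope recap of the arithmetic quoted
above - a sketch with assumed values (~10ms effective seek, transfer
time treated as negligible for small IOs):

	#include <stdio.h>

	int main(void)
	{
		double seek_ms = 10.0;		/* assumed near-average seek */
		double ops = 1000.0 / seek_ms;	/* ~100 ops/s */

		printf("4k  IOs: %6.0f kB/s\n", ops * 4);	/* ~400 kB/s */
		printf("64k IOs: %6.0f kB/s\n", ops * 64);	/* ~6.4 MB/s */
		return 0;
	}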

> But then I've never quite understood why our anonymous paging isn't
> sorting stuff as best it can and then using the drive as a log structure
> with in memory metadata so it can stream the pages onto disk. Read
> performance is going to be similar (maybe better if you have a log tidy
> when idle), write ought to be far better.

Sounds like a worthy project for someone to sink their teeth into.
Lots of people would like to have a system that can page out at
hundreds of megabytes a second....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-16 15:14                       ` Mel Gorman
@ 2010-04-18  0:32                         ` Andrew Morton
  2010-04-18 19:05                           ` Christoph Hellwig
                                             ` (2 more replies)
  2010-04-19 15:20                         ` Mel Gorman
  1 sibling, 3 replies; 115+ messages in thread
From: Andrew Morton @ 2010-04-18  0:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dave Chinner, KOSAKI Motohiro, Chris Mason, linux-kernel,
	linux-mm, linux-fsdevel


There are two issues here: stack utilisation and poor IO patterns in
direct reclaim.  They are different.

The poor IO patterns thing is a regression.  Some time several years
ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
dirty-page writeback than it used to.  AFAIK nobody attempted to work
out why, nor attempted to try to fix it.


Doing writearound in pageout() might help.  The kernel was in fact
doing that around 2.5.10, but I took it out again because it wasn't
obviously beneficial.

Writearound is hard to do, because direct-reclaim doesn't have an easy
way of pinning the address_space: it can disappear and get freed under
your feet.  I was able to make this happen under intense MM loads.  The
current page-at-a-time pageout code pins the address_space by taking a
lock on one of its pages.  Once that lock is released, we cannot touch
*mapping.

And lo, the pageout() code is presently buggy:

		res = mapping->a_ops->writepage(page, &wbc);
		if (res < 0)
			handle_write_error(mapping, page, res);

The ->writepage can/will unlock the page, and we're passing a hand
grenade into handle_write_error().
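
A minimal sketch of a safe error path (illustrative only, not the
actual vmscan fix) would re-take the page lock to pin the mapping
before touching it:

	static void handle_write_error(struct address_space *mapping,
				       struct page *page, int error)
	{
		lock_page(page);	/* re-pins the mapping, if still ours */
		if (page_mapping(page) == mapping)
			mapping_set_error(mapping, error);
		unlock_page(page);
	}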

Any attempt to implement writearound in pageout will need to find a way
to safely pin that address_space.  One way is to take a temporary ref
on mapping->host, but IIRC that introduced nasties with inode_lock. 
Certainly it'll put more load on that worrisomely-singleton lock.


Regarding simply not doing any writeout in direct reclaim (Dave's
initial proposal): the problem is that pageout() will clean a page in
the target zone.  Normal writeout won't do that, so we could get into a
situation where vast amounts of writeout is happening, but none of it
is cleaning pages in the zone which we're trying to allocate from. 
It's quite possibly livelockable, too.

Doing writearound (if we can get it going) will solve that adequately
(assuming that the target page gets reliably written), but it won't
help the stack usage problem.


To solve the IO-pattern thing I really do think we should first work
out ytf we started doing much more IO off the LRU.  What caused it?  Is
it really unavoidable?


To solve the stack-usage thing: dunno, really.  One could envisage code
which skips pageout() if we're using more than X amount of stack, but
that sucks.  Another possibility might be to hand the target page over
to another thread (I suppose kswapd will do) and then synchronise with
that thread - get_page()+wait_on_page_locked() is one way.  The helper
thread could of course do writearound.
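
As a sketch only (queue_pageout_work() is a hypothetical helper), the
hand-off might look like:

	/* Page arrives here locked, as it does in pageout(). */
	static int hand_off_pageout(struct page *page)
	{
		get_page(page);			/* page can't vanish under us */
		queue_pageout_work(page);	/* hypothetical: wake the helper */
		wait_on_page_locked(page);	/* helper unlocks after writeout */
		put_page(page);
		return 0;
	}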


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-18 19:05                           ` Christoph Hellwig
@ 2010-04-18 16:31                             ` Andrew Morton
  2010-04-18 19:35                               ` Christoph Hellwig
  2010-04-18 19:11                             ` Sorin Faibish
  1 sibling, 1 reply; 115+ messages in thread
From: Andrew Morton @ 2010-04-18 16:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mel Gorman, Dave Chinner, KOSAKI Motohiro, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel

On Sun, 18 Apr 2010 15:05:26 -0400 Christoph Hellwig <hch@infradead.org> wrote:

> On Sat, Apr 17, 2010 at 08:32:39PM -0400, Andrew Morton wrote:
> > The poor IO patterns thing is a regression.  Some time several years
> > ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
> > dirty-page writeback than it used to.  AFAIK nobody attempted to work
> > out why, nor attempted to try to fix it.
> 
> I just know that we XFS guys have been complaining about it a lot..
> 
> But that was mostly a tuning issue - before writeout mostly happened
> from pdflush.  If we got into kswapd or direct reclaim we already
> did get horrible I/O patterns - it just happened far less often.

Right.  It's intended that the great majority of writeout be performed
by the fs flusher threads and by the write()r in balance_dirty_pages().
Writeout off the LRU is supposed to be a rare emergency case.

This got broken.

> > Regarding simply not doing any writeout in direct reclaim (Dave's
> > initial proposal): the problem is that pageout() will clean a page in
> > the target zone.  Normal writeout won't do that, so we could get into a
> > situation where vast amounts of writeout is happening, but none of it
> > is cleaning pages in the zone which we're trying to allocate from. 
> > It's quite possibly livelockable, too.
> 
> As Chris mentioned currently btrfs and ext4 do not actually do delalloc
> conversions from this path, so for typical workloads the amount of
> writeout that can happen from this path is extremely limited.  And unless
> we get things fixed we will have to do the same for XFS.  I'd be much
> more happy if we could just sort it out at the VM level, because this
> means we have one sane place for this kind of policy instead of three
> or more hacks down inside the filesystems.  It's rather interesting
> that all people on the modern fs side completely agree here on what the
> problem is, but it seems rather hard to convince the VM side to do
> anything about it.
> 
> > To solve the stack-usage thing: dunno, really.  One could envisage code
> > which skips pageout() if we're using more than X amount of stack, but
> > that sucks.
> 
> And it doesn't solve other issues, like the whole lock taking problem.
> 
> > Another possibility might be to hand the target page over
> > to another thread (I suppose kswapd will do) and then synchronise with
> > that thread - get_page()+wait_on_page_locked() is one way.  The helper
> > thread could of course do writearound.
> 
> Allowing the flusher threads to do targeted writeout would be the
> best from the FS POV.  We'll still have one source of the I/O, just
> with another knob on how to select the exact region to write out.
> We can still synchronously wait for the I/O for lumpy reclaim if really
> necessary.

Yeah, but it's all bandaids.  The first thing we should do is work out
why writeout-off-the-LRU increased so much and fix that.

Handing writeout off to separate threads might be used to solve the
stack consumption problem but we shouldn't use it to "solve" the
excess-writeout-from-page-reclaim problem.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-18  0:32                         ` Andrew Morton
@ 2010-04-18 19:05                           ` Christoph Hellwig
  2010-04-18 16:31                             ` Andrew Morton
  2010-04-18 19:11                             ` Sorin Faibish
  2010-04-18 19:10                           ` Sorin Faibish
  2010-04-19  0:35                           ` Dave Chinner
  2 siblings, 2 replies; 115+ messages in thread
From: Christoph Hellwig @ 2010-04-18 19:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Dave Chinner, KOSAKI Motohiro, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel

On Sat, Apr 17, 2010 at 08:32:39PM -0400, Andrew Morton wrote:
> The poor IO patterns thing is a regression.  Some time several years
> ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
> dirty-page writeback than it used to.  AFAIK nobody attempted to work
> out why, nor attempted to try to fix it.

I just know that we XFS guys have been complaining about it a lot..

But that was mostly a tuning issue - before writeout mostly happened
from pdflush.  If we got into kswapd or direct reclaim we already
did get horrible I/O patterns - it just happened far less often.

> Regarding simply not doing any writeout in direct reclaim (Dave's
> initial proposal): the problem is that pageout() will clean a page in
> the target zone.  Normal writeout won't do that, so we could get into a
> situation where vast amounts of writeout is happening, but none of it
> is cleaning pages in the zone which we're trying to allocate from. 
> It's quite possibly livelockable, too.

As Chris mentioned currently btrfs and ext4 do not actually do delalloc
conversions from this path, so for typical workloads the amount of
writeout that can happen from this path is extremely limited.  And unless
we get things fixed we will have to do the same for XFS.  I'd be much
more happy if we could just sort it out at the VM level, because this
means we have one sane place for this kind of policy instead of three
or more hacks down inside the filesystems.  It's rather interesting
that all people on the modern fs side completely agree here on what the
problem is, but it seems rather hard to convince the VM side to do
anything about it.

> To solve the stack-usage thing: dunno, really.  One could envisage code
> which skips pageout() if we're using more than X amount of stack, but
> that sucks.

And it doesn't solve other issues, like the whole lock taking problem.

> Another possibility might be to hand the target page over
> to another thread (I suppose kswapd will do) and then synchronise with
> that thread - get_page()+wait_on_page_locked() is one way.  The helper
> thread could of course do writearound.

Allowing the flusher threads to do targeted writeout would be the
best from the FS POV.  We'll still have one source of the I/O, just
with another knob on how to select the exact region to write out.
We can still synchronously wait for the I/O for lumpy reclaim if really
necessary.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-18  0:32                         ` Andrew Morton
  2010-04-18 19:05                           ` Christoph Hellwig
@ 2010-04-18 19:10                           ` Sorin Faibish
  2010-04-18 21:30                             ` James Bottomley
  2010-04-19  0:35                           ` Dave Chinner
  2 siblings, 1 reply; 115+ messages in thread
From: Sorin Faibish @ 2010-04-18 19:10 UTC (permalink / raw)
  To: Andrew Morton, Mel Gorman
  Cc: Dave Chinner, KOSAKI Motohiro, Chris Mason, linux-kernel,
	linux-mm, linux-fsdevel

On Sat, 17 Apr 2010 20:32:39 -0400, Andrew Morton
<akpm@linux-foundation.org> wrote:

>
> There are two issues here: stack utilisation and poor IO patterns in
> direct reclaim.  They are different.
>
> The poor IO patterns thing is a regression.  Some time several years
> ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
> dirty-page writeback than it used to.  AFAIK nobody attempted to work
> out why, nor attempted to try to fix it.
I for one am looking very seriously at this problem together with Bruce.
We plan to have a discussion on this topic at the next LSF meeting
in Boston.





-- 
Best Regards
Sorin Faibish
Corporate Distinguished Engineer
Network Storage Group

         EMC²
where information lives

Phone: 508-435-1000 x 48545
Cellphone: 617-510-0422
Email : sfaibish@emc.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-18 19:05                           ` Christoph Hellwig
  2010-04-18 16:31                             ` Andrew Morton
@ 2010-04-18 19:11                             ` Sorin Faibish
  1 sibling, 0 replies; 115+ messages in thread
From: Sorin Faibish @ 2010-04-18 19:11 UTC (permalink / raw)
  To: Christoph Hellwig, Andrew Morton
  Cc: Mel Gorman, Dave Chinner, KOSAKI Motohiro, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel

On Sun, 18 Apr 2010 15:05:26 -0400, Christoph Hellwig <hch@infradead.org>  
wrote:

> On Sat, Apr 17, 2010 at 08:32:39PM -0400, Andrew Morton wrote:
>> The poor IO patterns thing is a regression.  Some time several years
>> ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
>> dirty-page writeback than it used to.  AFAIK nobody attempted to work
>> out why, nor attempted to try to fix it.
>
> I just know that we XFS guys have been complaining about it a lot..
I know that the ext3 and reiserfs guys complained about this issue
as well.




-- 
Best Regards
Sorin Faibish
Corporate Distinguished Engineer
Network Storage Group

        EMC²
where information lives

Phone: 508-435-1000 x 48545
Cellphone: 617-510-0422
Email : sfaibish@emc.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-18 16:31                             ` Andrew Morton
@ 2010-04-18 19:35                               ` Christoph Hellwig
  0 siblings, 0 replies; 115+ messages in thread
From: Christoph Hellwig @ 2010-04-18 19:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Mel Gorman, Dave Chinner, KOSAKI Motohiro,
	Chris Mason, linux-kernel, linux-mm, linux-fsdevel

On Sun, Apr 18, 2010 at 12:31:09PM -0400, Andrew Morton wrote:
> Yeah, but it's all bandaids.  The first thing we should do is work out
> why writeout-off-the-LRU increased so much and fix that.
> 
> Handing writeout off to separate threads might be used to solve the
> stack consumption problem but we shouldn't use it to "solve" the
> excess-writeout-from-page-reclaim problem.

I think both of them are really serious issues.  Exposing the whole
stack and lock problems with direct reclaim is a bit of a positive
side-effect of the writeout tuning mess-up.  Without it the problems
would still be just as harmful, just happening even less often and
thus getting even less attention.


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-18 19:10                           ` Sorin Faibish
@ 2010-04-18 21:30                             ` James Bottomley
  2010-04-18 23:34                               ` Sorin Faibish
  2010-04-19  3:08                               ` tytso
  0 siblings, 2 replies; 115+ messages in thread
From: James Bottomley @ 2010-04-18 21:30 UTC (permalink / raw)
  To: Sorin Faibish
  Cc: Andrew Morton, Mel Gorman, Dave Chinner, KOSAKI Motohiro,
	Chris Mason, linux-kernel, linux-mm, linux-fsdevel

On Sun, 2010-04-18 at 15:10 -0400, Sorin Faibish wrote:
> On Sat, 17 Apr 2010 20:32:39 -0400, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> 
> >
> > There are two issues here: stack utilisation and poor IO patterns in
> > direct reclaim.  They are different.
> >
> > The poor IO patterns thing is a regression.  Some time several years
> > ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
> > dirty-page writeback than it used to.  AFAIK nobody attempted to work
> > out why, nor attempted to try to fix it.

> I for one am looking very seriously at this problem together with Bruce.
> We plan to have a discussion on this topic at the next LSF meeting
> in Boston.

As luck would have it, the Memory Management summit is co-located with
the Storage and Filesystem workshop ... how about just planning to lock
all the protagonists in a room if it's not solved by August.  The less
extreme might even like to propose topics for the plenary sessions ...

James



^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-18 21:30                             ` James Bottomley
@ 2010-04-18 23:34                               ` Sorin Faibish
  2010-04-19  3:08                               ` tytso
  1 sibling, 0 replies; 115+ messages in thread
From: Sorin Faibish @ 2010-04-18 23:34 UTC (permalink / raw)
  To: James Bottomley
  Cc: Andrew Morton, Mel Gorman, Dave Chinner, KOSAKI Motohiro,
	Chris Mason, linux-kernel, linux-mm, linux-fsdevel

On Sun, 18 Apr 2010 17:30:36 -0400, James Bottomley  
<James.Bottomley@suse.de> wrote:

> On Sun, 2010-04-18 at 15:10 -0400, Sorin Faibish wrote:
>> On Sat, 17 Apr 2010 20:32:39 -0400, Andrew Morton
>> <akpm@linux-foundation.org> wrote:
>>
>> >
>> > There are two issues here: stack utilisation and poor IO patterns in
>> > direct reclaim.  They are different.
>> >
>> > The poor IO patterns thing is a regression.  Some time several years
>> > ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
>> > dirty-page writeback than it used to.  AFAIK nobody attempted to work
>> > out why, nor attempted to try to fix it.
>
>> I for one am looking very seriously at this problem together with Bruce.
>> We plan to have a discussion on this topic at the next LSF meeting
>> in Boston.
>
> As luck would have it, the Memory Management summit is co-located with
> the Storage and Filesystem workshop ... how about just planning to lock
> all the protagonists in a room if it's not solved by August.  The less
> extreme might even like to propose topics for the plenary sessions ...
Let's work together to get this done. This is a very good idea. I will try
to bring some facts about the current state by instrumenting the kernel
to sample the dirty page dynamics at a higher time granularity. This will
allow us to expose the problem, or the lack of one, better. :)

/Sorin


>
> James
>
>



-- 
Best Regards
Sorin Faibish
Corporate Distinguished Engineer
Network Storage Group

        EMC²
where information lives

Phone: 508-435-1000 x 48545
Cellphone: 617-510-0422
Email : sfaibish@emc.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-18  0:32                         ` Andrew Morton
  2010-04-18 19:05                           ` Christoph Hellwig
  2010-04-18 19:10                           ` Sorin Faibish
@ 2010-04-19  0:35                           ` Dave Chinner
  2010-04-19  0:49                             ` Arjan van de Ven
  2 siblings, 1 reply; 115+ messages in thread
From: Dave Chinner @ 2010-04-19  0:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, KOSAKI Motohiro, Chris Mason, linux-kernel, linux-mm,
	linux-fsdevel

On Sat, Apr 17, 2010 at 08:32:39PM -0400, Andrew Morton wrote:
> 
> There are two issues here: stack utilisation and poor IO patterns in
> direct reclaim.  They are different.
> 
> The poor IO patterns thing is a regression.  Some time several years
> ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
> dirty-page writeback than it used to.  AFAIK nobody attempted to work
> out why, nor attempted to try to fix it.

I think that part of the problem is that at roughly the same time
writeback started on a long downhill slide as well, and we've
really only fixed that in the last couple of kernel releases. Also,
it tends to take more than just writing a few large files to invoke
the LRU-based writeback code, as it is generally not invoked in
filesystem "performance" testing. Hence my bet is on the fact that
the effects of LRU-based writeback are rarely noticed in common
testing.

IOWs, low memory testing is not something a lot of people do. Add to
that the fact that most fs people, including me, have been treating
the VM as a black box that a bunch of other people have been taking
care of and hence really just been hoping it does the right thing,
and we've got a recipe for an unnoticed descent into a Bad Place.

[snip]

> Any attempt to implement writearound in pageout will need to find a way
> to safely pin that address_space.  One way is to take a temporary ref
> on mapping->host, but IIRC that introduced nasties with inode_lock. 
> Certainly it'll put more load on that worrisomely-singleton lock.

A problem already solved in the background flusher threads....

> Regarding simply not doing any writeout in direct reclaim (Dave's
> initial proposal): the problem is that pageout() will clean a page in
> the target zone.  Normal writeout won't do that, so we could get into a
> situation where vast amounts of writeout is happening, but none of it
> is cleaning pages in the zone which we're trying to allocate from. 
> It's quite possibly livelockable, too.

That's true, but seeing as we can't safely do writeback from
reclaim, we need some method of telling the background threads to
write a certain region of an inode. Perhaps some extension of a
struct writeback_control?
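
One possible shape for such a request - purely illustrative, nothing
like this exists in the tree - might be:

	struct wb_range_work {
		struct inode	*inode;		/* pinned by the flusher thread */
		loff_t		range_start;	/* byte range to clean */
		loff_t		range_end;
		struct completion done;		/* for lumpy reclaim to wait on */
	};

The flusher would turn that into a ranged writeback_control (it
already has range_start/range_end fields) against the right inode.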

> Doing writearound (if we can get it going) will solve that adequately
> (assuming that the target page gets reliably written), but it won't
> help the stack usage problem.
> 
> 
> To solve the IO-pattern thing I really do think we should first work
> out ytf we started doing much more IO off the LRU.  What caused it?  Is
> it really unavoidable?

/me wonders who has the time and expertise to do that archeology

> To solve the stack-usage thing: dunno, really.  One could envisage code
> which skips pageout() if we're using more than X amount of stack, but

Which, if we have to set it as low as 1.5k of stack used, may as
well just skip pageout()....

> that sucks.  Another possibility might be to hand the target page over
> to another thread (I suppose kswapd will do) and then synchronise with
> that thread - get_page()+wait_on_page_locked() is one way.  The helper
> thread could of course do writearound.

I'm fundamentally opposed to pushing IO to another place in the VM
when it could be just as easily handed to the flusher threads.
Also, consider that there's only one kswapd thread in a given
context (e.g. per node), but we can scale the number of flusher
threads as need be....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-19  0:35                           ` Dave Chinner
@ 2010-04-19  0:49                             ` Arjan van de Ven
  2010-04-19  1:08                               ` Dave Chinner
  0 siblings, 1 reply; 115+ messages in thread
From: Arjan van de Ven @ 2010-04-19  0:49 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Mel Gorman, KOSAKI Motohiro, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel

On Mon, 19 Apr 2010 10:35:56 +1000
Dave Chinner <david@fromorbit.com> wrote:

> On Sat, Apr 17, 2010 at 08:32:39PM -0400, Andrew Morton wrote:
> > 
> > There are two issues here: stack utilisation and poor IO patterns in
> > direct reclaim.  They are different.
> > 
> > The poor IO patterns thing is a regression.  Some time several years
> > ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
> > dirty-page writeback than it used to.  AFAIK nobody attempted to
> > work out why, nor attempted to try to fix it.
> 
> I think that part of the problem is that at roughly the same time
> writeback started on a long downhill slide as well, and we've
> really only fixed that in the last couple of kernel releases. Also,
> it tends to take more than just writing a few large files to invoke
> the LRU-based writeback code, as it is generally not invoked in
> filesystem "performance" testing. Hence my bet is on the fact that
> the effects of LRU-based writeback are rarely noticed in common
> testing.
> 


Would this also be the time where we started real dirty accounting, and
started playing with the dirty page thresholds?

Background writeback is that interesting tradeoff between writing out
to make the VM easier (and the data safe) and the chance of someone
either rewriting the same data (as benchmarks do regularly... not sure
about real workloads) or deleting the temporary file.


Maybe we need to do the background dirty writes a bit more aggressively...
or play with heuristics where we get an adaptive timeout (say, if the
file got closed by the last opener, then do a shorter timeout)


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-19  0:49                             ` Arjan van de Ven
@ 2010-04-19  1:08                               ` Dave Chinner
  2010-04-19  4:32                                 ` Arjan van de Ven
  0 siblings, 1 reply; 115+ messages in thread
From: Dave Chinner @ 2010-04-19  1:08 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, Mel Gorman, KOSAKI Motohiro, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel

On Sun, Apr 18, 2010 at 05:49:44PM -0700, Arjan van de Ven wrote:
> On Mon, 19 Apr 2010 10:35:56 +1000
> Dave Chinner <david@fromorbit.com> wrote:
> 
> > On Sat, Apr 17, 2010 at 08:32:39PM -0400, Andrew Morton wrote:
> > > 
> > > There are two issues here: stack utilisation and poor IO patterns in
> > > direct reclaim.  They are different.
> > > 
> > > The poor IO patterns thing is a regression.  Some time several years
> > > ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
> > > dirty-page writeback than it used to.  AFAIK nobody attempted to
> > > work out why, nor attempted to try to fix it.
> > 
> > I think that part of the problem is that at roughly the same time
> > writeback started on a long downhill slide as well, and we've
> > really only fixed that in the last couple of kernel releases. Also,
> > it tends to take more than just writing a few large files to invoke
> > the LRU-based writeback code, as it is generally not invoked in
> > filesystem "performance" testing. Hence my bet is on the fact that
> > the effects of LRU-based writeback are rarely noticed in common
> > testing.
> 
> Would this also be the time where we started real dirty accounting, and
> started playing with the dirty page thresholds?

Yes, I think that was introduced in 2.6.16/17, so it's definitely in
the ballpark.

> Background writeback is that interesting tradeoff between writing out
> to make the VM easier (and the data safe) and the chance of someone
> either rewriting the same data (as benchmarks do regularly... not sure
> about real workloads) or deleting the temporary file.
> 
> Maybe we need to do the background dirty writes a bit more aggressively...
> or play with heuristics where we get an adaptive timeout (say, if the
> file got closed by the last opener, then do a shorter timeout)

Realistically, I'm concerned about preventing the worst case
behaviour from occurring - making the background writes more
aggressive without preventing writeback in LRU order simply means it
will be harder to test the VM corner case that triggers these
writeout patterns...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-18 21:30                             ` James Bottomley
  2010-04-18 23:34                               ` Sorin Faibish
@ 2010-04-19  3:08                               ` tytso
  1 sibling, 0 replies; 115+ messages in thread
From: tytso @ 2010-04-19  3:08 UTC (permalink / raw)
  To: James Bottomley
  Cc: Sorin Faibish, Andrew Morton, Mel Gorman, Dave Chinner,
	KOSAKI Motohiro, Chris Mason, linux-kernel, linux-mm,
	linux-fsdevel

On Sun, Apr 18, 2010 at 04:30:36PM -0500, James Bottomley wrote:
> > I for one am looking very seriously at this problem together with Bruce.
> > We plan to have a discussion on this topic at the next LSF meeting
> > in Boston.
> 
> As luck would have it, the Memory Management summit is co-located with
> the Storage and Filesystem workshop ... how about just planning to lock
> all the protagonists in a room if it's not solved by August.  The less
> extreme might even like to propose topics for the plenary sessions ...

I'd personally hope that this is solved long before the LSF/VM
workshops.... but if not, yes, we should definitely tackle it then.

      	     	       	  	 	    	   - Ted

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-19  1:08                               ` Dave Chinner
@ 2010-04-19  4:32                                 ` Arjan van de Ven
  0 siblings, 0 replies; 115+ messages in thread
From: Arjan van de Ven @ 2010-04-19  4:32 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Mel Gorman, KOSAKI Motohiro, Chris Mason,
	linux-kernel, linux-mm, linux-fsdevel

On Mon, 19 Apr 2010 11:08:05 +1000
Dave Chinner <david@fromorbit.com> wrote:

> > Maybe we need to do the background dirty writes a bit more
> > aggressively... or play with heuristics where we get an adaptive
> > timeout (say, if the file got closed by the last opener, then do a
> > shorter timeout)
> 
> Realistically, I'm concerned about preventing the worst case
> behaviour from occurring - making the background writes more
> aggressive without preventing writeback in LRU order simply means it
> will be harder to test the VM corner case that triggers these
> writeout patterns...


while I appreciate that the worst case should not be uber horrific...
I care a LOT about getting the normal case right... and am willing to
sacrifice the worst case for that.. (obviously not to infinity, it
needs to be bounded)

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-16 15:05                       ` Mel Gorman
@ 2010-04-19 15:15                         ` Mel Gorman
  0 siblings, 0 replies; 115+ messages in thread
From: Mel Gorman @ 2010-04-19 15:15 UTC (permalink / raw)
  To: Chris Mason, Dave Chinner, KOSAKI Motohiro, linux-kernel,
	linux-mm, linux-fsdevel

On Fri, Apr 16, 2010 at 04:05:10PM +0100, Mel Gorman wrote:
> > vi fs/direct-reclaim-helper.c, it has a few placeholders for where the
> > real code needs to go....just look for the ~ marks.
> > 
> 
I must be blind. What tree is this in? I can't see it in v2.6.34-rc4,
> mmotm or google.
> 

Bah, Johannes corrected my literal mind. har de har har :)

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-16 15:14                       ` Mel Gorman
  2010-04-18  0:32                         ` Andrew Morton
@ 2010-04-19 15:20                         ` Mel Gorman
  2010-04-23  1:06                           ` Dave Chinner
  1 sibling, 1 reply; 115+ messages in thread
From: Mel Gorman @ 2010-04-19 15:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: KOSAKI Motohiro, Chris Mason, linux-kernel, linux-mm, linux-fsdevel

On Fri, Apr 16, 2010 at 04:14:03PM +0100, Mel Gorman wrote:
> > > Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
> > > guess dirty pages can cycle around more so it'd need to be cared for.
> > 
> > Well, you keep saying that they break #3, but I haven't seen any
> > test cases or results showing that. I've been unable to confirm that
> > lumpy reclaim is broken by disallowing writeback in my testing, so
> > I'm interested to know what tests you are running that show it is
> > broken...
> > 
> 
> Ok, I haven't actually tested this. The machines I use are tied up
> retesting the compaction patches at the moment. The reason why I reckon
> it'll be a problem is that when these sync-writeback changes were
> introduced, it significantly helped lumpy reclaim for huge pages. I am
> making an assumption that backing out those changes will hurt it.
> 
> I'll test for real on Monday and see what falls out.
> 

One machine has completed the test and the results are as expected. When
allocating huge pages under stress, your patch drops the success rates
significantly. On X86-64, it showed

STRESS-HIGHALLOC
              stress-highalloc   stress-highalloc
            enable-directreclaim disable-directreclaim
Under Load 1    89.00 ( 0.00)    73.00 (-16.00)
Under Load 2    90.00 ( 0.00)    85.00 ( -5.00)
At Rest         90.00 ( 0.00)    90.00 (  0.00)

So with direct reclaim, it gets 89% of memory as huge pages at the first
attempt but 73% with your patch applied. The "Under Load 2" test happens
immediately after. With the stock kernel, the first and second attempts
are usually the same or very close together. With your patch applied,
there are big differences as it was no longer trying to clean pages.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if  current is kswapd
  2010-04-15 10:30                             ` Johannes Weiner
  2010-04-15 17:24                               ` Suleiman Souhlal
@ 2010-04-20  2:56                               ` Ying Han
  1 sibling, 0 replies; 115+ messages in thread
From: Ying Han @ 2010-04-20  2:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KOSAKI Motohiro, Suleiman Souhlal, Dave Chinner, Mel Gorman,
	Chris Mason, linux-kernel, linux-mm, linux-fsdevel, suleiman

On Thu, Apr 15, 2010 at 3:30 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Thu, Apr 15, 2010 at 05:26:27PM +0900, KOSAKI Motohiro wrote:
>> Cc to Johannes
>>
>> > >
>> > > On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
>> > >
>> > > > Now, vmscan pageout() is one source of IO throughput degradation.
>> > > > Some IO workloads make very many order-0 allocations and reclaims,
>> > > > and pageout's 4K IOs are causing annoyingly many seeks.
>> > > >
>> > > > At least, kswapd can avoid such pageout() because kswapd doesn't
>> > > > need to consider the OOM-killer situation; there's no risk there.
>> > > >
>> > > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>> > >
>> > > What's your opinion on trying to cluster the writes done by pageout,
>> > > instead of not doing any paging out in kswapd?
>> > > Something along these lines:
>> >
>> > Interesting.
>> > So, I'd like to review your patch carefully. can you please give me one
>> > day? :)
>>
>> Hannes, if my memory is correct, you tried similar swap-cluster IO
>> a long time ago. Now I can't remember why we didn't merge such a patch.
>> Do you remember anything?
>
> Oh, quite vividly in fact :)  For a lot of swap loads the LRU order
> diverged heavily from swap slot order and readaround was a waste of
> time.
>
> Of course, the patch looked good, too, but it did not match reality
> that well.
>
> I guess 'how about this patch?' won't get us as far as 'how about
> those numbers/graphs of several real-life workloads?  oh and here
> is the patch...'.

Hannes,

We recently ran into this problem while running some experiments on
the ext4 filesystem. We hit a scenario where we were writing a large
file, or just opening one, with a limited memory allocation (using
containers), and the process got OOMed. The memory assigned to the
container is reasonably large, and the OOM cannot be reproduced on
ext2 with the same configuration.

Later we figured out this might be due to the delayed block allocation
in ext4. Vmscan sends a single page to ext4->writepage(), then ext4
punts if the block is DA'ed and re-dirties the page. On the other
hand, the flusher threads use ext4->writepages(), which does include the
block allocation.

We looked at the OOM log under ext4: all pages within the container
were on the inactive list and either Dirty or WriteBack. Also, the zones
were all marked "all_unreclaimable", which indicates the reclaim path
had scanned the LRU many times without making progress. If the
delayed block allocation is the reason pageout() cannot flush dirty
pages, and that then triggers OOMs, should we signal the fs to force
write-out of dirty pages under memory pressure?

--Ying

>
>> > >      Cluster writes to disk due to memory pressure.
>> > >
>> > >      Write out logically adjacent pages to the one we're paging out
>> > >      so that we may get better IOs in these situations:
>> > >      These pages are likely to be contiguous on disk to the one we're
>> > >      writing out, so they should get merged into a single disk IO.
>> > >
>> > >      Signed-off-by: Suleiman Souhlal <suleiman@google.com>
>
> For random IO, LRU order will have nothing to do with mapping/disk order.
>

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-19 15:20                         ` Mel Gorman
@ 2010-04-23  1:06                           ` Dave Chinner
  2010-04-23 10:50                             ` Mel Gorman
  0 siblings, 1 reply; 115+ messages in thread
From: Dave Chinner @ 2010-04-23  1:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Chris Mason, linux-kernel, linux-mm, linux-fsdevel

On Mon, Apr 19, 2010 at 04:20:34PM +0100, Mel Gorman wrote:
> On Fri, Apr 16, 2010 at 04:14:03PM +0100, Mel Gorman wrote:
> > > > Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
> > > > guess dirty pages can cycle around more so it'd need to be cared for.
> > > 
> > > Well, you keep saying that they break #3, but I haven't seen any
> > > test cases or results showing that. I've been unable to confirm that
> > > lumpy reclaim is broken by disallowing writeback in my testing, so
> > > I'm interested to know what tests you are running that show it is
> > > broken...
> > > 
> > 
> > Ok, I haven't actually tested this. The machines I use are tied up
> > retesting the compaction patches at the moment. The reason why I reckon
> > it'll be a problem is that when these sync-writeback changes were
> > introduced, it significantly helped lumpy reclaim for huge pages. I am
> > making an assumption that backing out those changes will hurt it.
> > 
> > I'll test for real on Monday and see what falls out.
> > 
> 
> One machine has completed the test and the results are as expected. When
> allocating huge pages under stress, your patch drops the success rates
> significantly. On X86-64, it showed
> 
> STRESS-HIGHALLOC
>               stress-highalloc   stress-highalloc
>             enable-directreclaim disable-directreclaim
> Under Load 1    89.00 ( 0.00)    73.00 (-16.00)
> Under Load 2    90.00 ( 0.00)    85.00 ( -5.00)
> At Rest         90.00 ( 0.00)    90.00 (  0.00)
> 
> So with direct reclaim, it gets 89% of memory as huge pages at the first
> attempt but 73% with your patch applied. The "Under Load 2" test happens
> immediately after. With the stock kernel, the first and second attempts
> are usually the same or very close together. With your patch applied,
> there are big differences as it was no longer trying to clean pages.

What was the machine config you were testing on (RAM, CPUs, etc)?
And what are these loads? Do you have a script that generates
them? If so, can you share them, please?

OOC, what was the effect on the background load - did it go faster
or slower when writeback was disabled? i.e. did we trade off more
large pages for better overall throughput?

Also, I'm curious as to the repeatability of the tests you are
doing. I found that from run to run I could see a *massive*
variance in the results. e.g. one run might only get ~80 huge
pages at the first attempt, the test run from the same initial
conditions next might get 440 huge pages at the first attempt. I saw
the same variance with or without writeback from direct reclaim
enabled. Hence only after averaging over tens of runs could I see
any sort of trend emerge, and it makes me wonder if your testing is
also seeing this sort of variance....

FWIW, if we look at the results of the test I did, it showed a 20%
improvement in large page allocation with a 15% increase in load
throughput, while you're showing a 16% degradation in large page
allocation.  Effectively we've got two workloads that show results
at either end of the spectrum (perhaps they are best case vs worst
case) but there's no real in-between. What other tests can we run to
get a better picture of the effect?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH] mm: disallow direct reclaim page writeback
  2010-04-23  1:06                           ` Dave Chinner
@ 2010-04-23 10:50                             ` Mel Gorman
  0 siblings, 0 replies; 115+ messages in thread
From: Mel Gorman @ 2010-04-23 10:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: KOSAKI Motohiro, Chris Mason, linux-kernel, linux-mm, linux-fsdevel

On Fri, Apr 23, 2010 at 11:06:32AM +1000, Dave Chinner wrote:
> On Mon, Apr 19, 2010 at 04:20:34PM +0100, Mel Gorman wrote:
> > On Fri, Apr 16, 2010 at 04:14:03PM +0100, Mel Gorman wrote:
> > > > > Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
> > > > > guess dirty pages can cycle around more so it'd need to be cared for.
> > > > 
> > > > Well, you keep saying that they break #3, but I haven't seen any
> > > > test cases or results showing that. I've been unable to confirm that
> > > > lumpy reclaim is broken by disallowing writeback in my testing, so
> > > > I'm interested to know what tests you are running that show it is
> > > > broken...
> > > > 
> > > 
> > > Ok, I haven't actually tested this. The machines I use are tied up
> > > retesting the compaction patches at the moment. The reason why I reckon
> > > it'll be a problem is that when these sync-writeback changes were
> > > introduced, it significantly helped lumpy reclaim for huge pages. I am
> > > making an assumption that backing out those changes will hurt it.
> > > 
> > > I'll test for real on Monday and see what falls out.
> > > 
> > 
> > One machine has completed the test and the results are as expected. When
> > allocating huge pages under stress, your patch drops the success rates
> > significantly. On X86-64, it showed
> > 
> > STRESS-HIGHALLOC
> >               stress-highalloc   stress-highalloc
> >             enable-directreclaim disable-directreclaim
> > Under Load 1    89.00 ( 0.00)    73.00 (-16.00)
> > Under Load 2    90.00 ( 0.00)    85.00 (-5.00)
> > At Rest         90.00 ( 0.00)    90.00 ( 0.00)
> > 
> > So with direct reclaim, it gets 89% of memory as huge pages at the first
> > attempt but 73% with your patch applied. The "Under Load 2" test happens
> > immediately after. With the start kernel, the first and second attempts
> > are usually the same or very close together. With your patch applied,
> > there are big differences as it was no longer trying to clean pages.
> 
> What was the machine config you were testing on (RAM, CPUs, etc)?

2G RAM, AMD Phenom with 4 cores.

> And what are these loads?

Compile-based loads that fill up memory and put it under heavy memory
pressure, and that also dirty memory. While they are running, a kernel module
is loaded that starts allocating huge pages one at a time so that accurate
timing and the state of the system can be gathered at allocation time. The
number of allocation attempts is 90% of the number of huge pages that exist
in the system.

> Do you have a script that generates
> them? If so, can you share them, please?
> 

Yes, but unfortunately they are not in a publishable state. Parts of
them depend on an automation harness that I don't hold the copyright to.

> OOC, what was the effect on the background load - did it go faster
> or slower when writeback was disabled?

Unfortunately, I don't know what the effect on the underlying load is
as it takes longer than the huge page allocation attempts do. The test's
objective is to check how well lumpy reclaim works under memory pressure.

However, the time it takes to allocate a huge page increases with direct
reclaim disabled (i.e. your patch) early in the test up until about 40%
of memory was allocated as huge pages. After that, the latencies with
disable-directreclaim are lower until it gives up, while the latencies with
enable-directreclaim increase.

In other words, with direct reclaim writing back pages, lumpy reclaim is a
lot more determined to get the pages cleaned and wait on them if necessary. A
compromise patch might be to wait on the page's dirty bit being cleared instead
of queueing the IO and calling wait_on_page_writeback()? How long it stalled
would depend heavily on the rate at which pages were getting cleaned in the
background.
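
A sketch of that compromise (note that nothing issues wakeups when
PG_dirty clears today, so the flusher would have to be taught to do
so before anything like this could work):

	/* Hypothetical: stall lumpy reclaim until someone else
	 * cleans the page, instead of issuing the IO ourselves. */
	if (PageDirty(page))
		wait_on_page_bit(page, PG_dirty);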

> i.e. did we trade off more
> large pages for better overall throughput?
> 
> Also, I'm curious as to the repeatability of the tests you are
> doing. I found that from run to run I could see a *massive*
> variance in the results. e.g. one run might only get ~80 huge
> pages at the first attempt, the test run from the same initial
> conditions next might get 440 huge pages at the first attempt.

You are using the nr_hugepages interface and writing a large number to it
so you are also triggering the hugetlbfs retry-logic and have little control
over how many times the allocator gets called on each attempt. How many huge
pages it allocates depends on how much progress it is able to make during
lumpy reclaim.

It's why the tests I run allocate huge pages one at a time and measure
the latencies as they go. The results tend to be quite reproducible.
Success figures would be the same between runs, and the rate of
allocation success would generally be comparable as well.

Your test could do something similar by only ever requesting one additional
page, along the lines of the sketch below. That would be good enough to
measure allocation latency. The gathering of other system state at the
time of failure is not very important here (whereas it was important during
anti-frag development, hence the use of a kernel module).
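
For example, something as dumb as this userspace sketch would do (the
/proc/sys/vm/nr_hugepages path is the standard interface; the attempt
count is a made-up parameter, error handling is minimal, and it needs
-lrt on older glibc):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Read the current size of the huge page pool. */
static int get_nr(void)
{
	FILE *f = fopen("/proc/sys/vm/nr_hugepages", "r");
	int nr = -1;

	if (f) {
		fscanf(f, "%d", &nr);
		fclose(f);
	}
	return nr;
}

/* Ask the kernel to resize the pool. */
static void set_nr(int nr)
{
	FILE *f = fopen("/proc/sys/vm/nr_hugepages", "w");

	if (f) {
		fprintf(f, "%d", nr);
		fclose(f);
	}
}

int main(int argc, char **argv)
{
	int attempts = argc > 1 ? atoi(argv[1]) : 100;
	int i;

	for (i = 0; i < attempts; i++) {
		int before = get_nr();
		struct timespec t1, t2;

		clock_gettime(CLOCK_MONOTONIC, &t1);
		set_nr(before + 1);	/* request ONE additional page */
		clock_gettime(CLOCK_MONOTONIC, &t2);

		printf("attempt %3d: %s %8.2f ms\n", i,
		       get_nr() > before ? "ok  " : "fail",
		       (t2.tv_sec - t1.tv_sec) * 1000.0 +
		       (t2.tv_nsec - t1.tv_nsec) / 1e6);
	}
	return 0;
}

Run it as root against a loaded system and the per-attempt latencies
and success rates fall straight out.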

> I saw
> the same variance with or without writeback from direct reclaim
> enabled. Hence only after averaging over tens of runs could I see
> any sort of trend emerge, and it makes me wonder if your testing is
> also seeing this sort of variance....
> 

Typically, there is not much variance between tests. Maybe 1-2% in allocation
success rates.

> FWIW, if we look at the results of the test I did, it showed a 20%
> improvement in large page allocation with a 15% increase in load
> throughput, while you're showing a 16% degradation in large page
> allocation.

With writeback, lumpy reclaim takes a range of pages, cleans them and waits
for the IO to complete before moving on. This causes a seeky IO pattern and
takes time. It also causes a nice amount of thrashing.

With your patch, lumpy reclaim would just skip over ranges with dirty pages
until it found clean pages in a suitable range. When there is plenty of
usable memory early in the test, it probably scans more but causes less
IO so would appear faster. Later in the test, it scans more but eventually
encounters too many dirty pages and gives up. Hence, its success rates will
be more random because they depend on where exactly the dirty pages were.

If this is accurate, your patch will always cause less disruption in the
system and appear faster due to the lack of IO, but it will be less
predictable and give up more easily, so it will have lower success rates
when there are dirty pages in the system.
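
In pseudo-code, the difference boils down to something like this (a
caricature of the lumpy path through shrink_page_list(), not the real
code; the range iterator is an invented name):

for_each_page_in_range(page, range) {		/* invented iterator */
	if (PageDirty(page)) {
		if (sc->may_writepage) {
			/* stock kernel: clean the page ourselves,
			 * then stall until the IO completes */
			pageout(page, mapping, PAGEOUT_IO_SYNC);
			wait_on_page_writeback(page);
		} else {
			/* your patch: abandon the whole
			 * contiguous range */
			goto next_range;
		}
	}
	/* clean pages count towards the order-N block */
}

The first branch trades seeks and stalls for determinism; the second
trades success rate for speed.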

> Effectively we've got two workloads that show results
> at either end of the spectrum (perhaps they are best case vs worst
> case) but there's no real in-between. What other tests can we run to
> get a better picture of the effect?
> 

The underlying workload is only important in how many pages it is
dirtying at any given time. Heck, at one point my test workload was a
single process that created a mapping the size of physical memory; in
test a) it would constantly read it and in test b) it would constantly write
it. Lumpy reclaim with dirty-page-writeback was always more predictable
and had higher success rates.
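
Roughly along these lines (a reconstructed sketch; whether it used a
file-backed or anonymous mapping isn't stated, file-backed is shown
since it generates dirty page cache, and the 2G size is hardcoded to
match the test box):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SIZE	(2UL << 30)	/* ~physical RAM on the test box */

int main(int argc, char **argv)
{
	int fd = open("bigfile", O_RDWR | O_CREAT, 0644);
	volatile char sink;
	unsigned long off;
	char *p;

	if (fd < 0 || ftruncate(fd, MAP_SIZE) < 0)
		return 1;

	p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		 MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	for (;;) {
		for (off = 0; off < MAP_SIZE; off += 4096) {
			if (argc > 1)		/* test b) dirty the page */
				p[off] = 1;
			else			/* test a) clean read */
				sink = p[off];
		}
	}
	return 0;
}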

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
