* [PATCH 0/2] Eliminate hangs when using frequent high-order allocations V3
@ 2011-05-16 15:06 ` Mel Gorman
  0 siblings, 0 replies; 29+ messages in thread
From: Mel Gorman @ 2011-05-16 15:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
	Johannes Weiner, Minchan Kim, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4, stable, Mel Gorman

Changelog since V2
  o Drop all SLUB latency-reducing patches.

Changelog since V1
  o kswapd should sleep if need_resched
  o Remove __GFP_REPEAT from GFP flags when speculatively using high
    orders so direct/compaction exits earlier
  o Remove __GFP_NORETRY for correctness
  o Correct logic in sleeping_prematurely
  o Leave SLUB using the default slub_max_order

There are a few reports of people experiencing hangs when copying
large amounts of data, with kswapd using a large amount of CPU, which
appear to be due to recent reclaim changes. SLUB using high orders
is the trigger but not the root cause, as SLUB has been using high
orders for a while. The root cause was bugs introduced into reclaim
which are addressed by the following two patches.

Patch 1 corrects logic introduced by commit [1741c877: mm:
	kswapd: keep kswapd awake for high-order allocations until
	a percentage of the node is balanced] to allow kswapd to
	go to sleep when balanced for high orders.

Patch 2 notes that even when kswapd is failing to keep up with
	allocation requests, it should still go to sleep when its
	quota has expired to prevent it spinning.

This version drops the patches whereby SLUB avoids expensive steps in
the page allocator, reclaim and compaction, due to a lack of agreement
on whether that was an appropriate step and because it is not critical
to resolving the hang. Chris Wood reports that these two patches in
isolation are sufficient to prevent the system hanging.

These should also be considered for -stable for 2.6.38.

-- 
1.7.3.4



* [PATCH 1/2] mm: vmscan: Correct use of pgdat_balanced in sleeping_prematurely
  2011-05-16 15:06 ` Mel Gorman
@ 2011-05-16 15:06   ` Mel Gorman
  -1 siblings, 0 replies; 29+ messages in thread
From: Mel Gorman @ 2011-05-16 15:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
	Johannes Weiner, Minchan Kim, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4, stable, Mel Gorman

From: Johannes Weiner <hannes@cmpxchg.org>

Johannes Weiner pointed out that the logic in commit [1741c877: mm:
kswapd: keep kswapd awake for high-order allocations until a percentage
of the node is balanced] is backwards. Instead of allowing kswapd to go
to sleep when balanced for high-order allocations, it keeps kswapd
running uselessly.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f6b435c..af24d1e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2286,7 +2286,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 	 * must be balanced
 	 */
 	if (order)
-		return pgdat_balanced(pgdat, balanced, classzone_idx);
+		return !pgdat_balanced(pgdat, balanced, classzone_idx);
 	else
 		return !all_zones_ok;
 }
-- 
1.7.3.4
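
As an aside on the return-value convention: sleeping_prematurely()
returns true when kswapd was woken prematurely and must keep running,
so for a high-order request it has to return true while the node is
NOT yet balanced, hence the negation above. A minimal userspace model
(hypothetical helper names, not the kernel code) illustrates this:

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for pgdat_balanced(): true when enough of the node is
 * balanced for the requested order. Hypothetical simplification. */
static bool pgdat_balanced(bool node_balanced)
{
        return node_balanced;
}

/* Convention: true = woken prematurely, keep running; false = may sleep */
static bool sleeping_prematurely(int order, bool node_balanced)
{
        if (order)
                /* the fix: premature only while NOT yet balanced */
                return !pgdat_balanced(node_balanced);
        return !node_balanced;
}

int main(void)
{
        /* With the negation, a balanced node lets kswapd sleep (prints 0)
         * and an unbalanced one keeps it running (prints 1). */
        printf("order-2, balanced:   %d\n", sleeping_prematurely(2, true));
        printf("order-2, unbalanced: %d\n", sleeping_prematurely(2, false));
        return 0;
}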


* [PATCH 2/2] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-16 15:06 ` Mel Gorman
@ 2011-05-16 15:06   ` Mel Gorman
  -1 siblings, 0 replies; 29+ messages in thread
From: Mel Gorman @ 2011-05-16 15:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
	Johannes Weiner, Minchan Kim, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4, stable, Mel Gorman

Under constant allocation pressure, kswapd can be in the situation where
sleeping_prematurely() will always return true even if kswapd has been
running a long time. Check if kswapd needs to be scheduled.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index af24d1e..4d24828 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 	unsigned long balanced = 0;
 	bool all_zones_ok = true;
 
+	/* If kswapd has been running too long, just sleep */
+	if (need_resched())
+		return false;
+
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return true;
-- 
1.7.3.4
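
To see where this hunk takes effect, here is a condensed sketch of the
caller (based on the 2.6.38-era kswapd_try_to_sleep() in mm/vmscan.c;
details elided, not verbatim): only when sleeping_prematurely() returns
false does kswapd take the short HZ/10 nap and, if nothing changes, the
long sleep, which is why returning false on need_resched() is what
finally gets a spinning kswapd off the CPU.

static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
{
        long remaining = 0;
        DEFINE_WAIT(wait);

        prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);

        /* First chance: a short nap of up to HZ/10 jiffies */
        if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
                remaining = schedule_timeout(HZ / 10);
                finish_wait(&pgdat->kswapd_wait, &wait);
                prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
        }

        /* Second chance: still nothing premature, sleep until woken */
        if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx))
                schedule();

        finish_wait(&pgdat->kswapd_wait, &wait);
}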


* Re: [PATCH 1/2] mm: vmscan: Correct use of pgdat_balanced in sleeping_prematurely
  2011-05-16 15:06   ` Mel Gorman
@ 2011-05-16 15:26     ` Johannes Weiner
  -1 siblings, 0 replies; 29+ messages in thread
From: Johannes Weiner @ 2011-05-16 15:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Rik van Riel, Minchan Kim, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4, stable

On Mon, May 16, 2011 at 04:06:56PM +0100, Mel Gorman wrote:
> From: Johannes Weiner <hannes@cmpxchg.org>
> 
> Johannes Weiner pointed out that the logic in commit [1741c877: mm:
> kswapd: keep kswapd awake for high-order allocations until a percentage
> of the node is balanced] is backwards. Instead of allowing kswapd to go
> to sleep when balanced for high-order allocations, it keeps kswapd
> running uselessly.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Reviewed-by: Rik van Riel <riel@redhat.com>

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

* Re: [PATCH 2/2] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-16 15:06   ` Mel Gorman
@ 2011-05-16 15:26     ` Johannes Weiner
  -1 siblings, 0 replies; 29+ messages in thread
From: Johannes Weiner @ 2011-05-16 15:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Rik van Riel, Minchan Kim, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4, stable

On Mon, May 16, 2011 at 04:06:57PM +0100, Mel Gorman wrote:
> Under constant allocation pressure, kswapd can be in the situation where
> sleeping_prematurely() will always return true even if kswapd has been
> running a long time. Check if kswapd needs to be scheduled.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Rik van Riel <riel@redhat.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

* Re: [PATCH 2/2] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-16 15:06   ` Mel Gorman
@ 2011-05-16 21:16     ` Andrew Morton
  -1 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2011-05-16 21:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
	Johannes Weiner, Minchan Kim, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4, stable

On Mon, 16 May 2011 16:06:57 +0100
Mel Gorman <mgorman@suse.de> wrote:

> Under constant allocation pressure, kswapd can be in the situation where
> sleeping_prematurely() will always return true even if kswapd has been
> running a long time. Check if kswapd needs to be scheduled.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
>  mm/vmscan.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index af24d1e..4d24828 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>  	unsigned long balanced = 0;
>  	bool all_zones_ok = true;
>  
> +	/* If kswapd has been running too long, just sleep */
> +	if (need_resched())
> +		return false;
> +
>  	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>  	if (remaining)
>  		return true;

I'm a bit worried by this one.

Do we really fully understand why kswapd is continuously running like
this?  The changelog makes me think "no" ;)

Given that the page-allocating process is madly reclaiming pages in
direct reclaim (yes?) and that kswapd is madly reclaiming pages on a
different CPU, we should pretty promptly get into a situation where
kswapd can suspend itself.  But that obviously isn't happening.  So
what *is* going on?

Secondly, taking an up-to-100ms sleep in response to a need_resched()
seems pretty savage and I suspect it risks undesirable side-effects.  A
plain old cond_resched() would be more cautious.  But presumably
kswapd() is already running cond_resched() pretty frequently, so why
didn't that work?


* Re: [PATCH 1/2] mm: vmscan: Correct use of pgdat_balanced in sleeping_prematurely
  2011-05-16 15:06   ` Mel Gorman
@ 2011-05-16 23:05     ` Minchan Kim
  -1 siblings, 0 replies; 29+ messages in thread
From: Minchan Kim @ 2011-05-16 23:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4, stable

On Tue, May 17, 2011 at 12:06 AM, Mel Gorman <mgorman@suse.de> wrote:
> From: Johannes Weiner <hannes@cmpxchg.org>
>
> Johannes Weiner pointed out that the logic in commit [1741c877: mm:
> kswapd: keep kswapd awake for high-order allocations until a percentage
> of the node is balanced] is backwards. Instead of allowing kswapd to go
> to sleep when balanced for high-order allocations, it keeps kswapd
> running uselessly.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim

* Re: [PATCH 1/2] mm: vmscan: Correct use of pgdat_balanced in sleeping_prematurely
  2011-05-16 15:26     ` Johannes Weiner
@ 2011-05-17  5:26       ` Wu Fengguang
  -1 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-05-17  5:26 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, Andrew Morton, James Bottomley, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter,
	Pekka Enberg, Rik van Riel, Minchan Kim, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4, stable

On Mon, May 16, 2011 at 05:26:08PM +0200, Johannes Weiner wrote:
> On Mon, May 16, 2011 at 04:06:56PM +0100, Mel Gorman wrote:
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > Johannes Weiner pointed out that the logic in commit [1741c877: mm:
> > kswapd: keep kswapd awake for high-order allocations until a percentage
> > of the node is balanced] is backwards. Instead of allowing kswapd to go
> > to sleep when balanced for high-order allocations, it keeps kswapd
> > running uselessly.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > Reviewed-by: Rik van Riel <riel@redhat.com>
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>

* Re: [PATCH 2/2] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-16 21:16     ` Andrew Morton
@ 2011-05-17  6:37       ` James Bottomley
  -1 siblings, 0 replies; 29+ messages in thread
From: James Bottomley @ 2011-05-17  6:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
	Johannes Weiner, Minchan Kim, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4, stable

On Mon, 2011-05-16 at 14:16 -0700, Andrew Morton wrote:
> On Mon, 16 May 2011 16:06:57 +0100
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > Under constant allocation pressure, kswapd can be in the situation where
> > sleeping_prematurely() will always return true even if kswapd has been
> > running a long time. Check if kswapd needs to be scheduled.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > ---
> >  mm/vmscan.c |    4 ++++
> >  1 files changed, 4 insertions(+), 0 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index af24d1e..4d24828 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> >  	unsigned long balanced = 0;
> >  	bool all_zones_ok = true;
> >  
> > +	/* If kswapd has been running too long, just sleep */
> > +	if (need_resched())
> > +		return false;
> > +
> >  	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
> >  	if (remaining)
> >  		return true;
> 
> I'm a bit worried by this one.
> 
> Do we really fully understand why kswapd is continuously running like
> this?  The changelog makes me think "no" ;)
> 
> Given that the page-allocating process is madly reclaiming pages in
> direct reclaim (yes?) and that kswapd is madly reclaiming pages on a
> different CPU, we should pretty promptly get into a situation where
> kswapd can suspend itself.  But that obviously isn't happening.  So
> what *is* going on?

The triggering workload is a massive untar using a file on the same
filesystem, so that's a continuous stream of pages read into the cache
for the input and a stream of dirty pages out for the writes.  We
thought it might have been out of control shrinkers, so we already
debugged that and found it wasn't.  It just seems to be an imbalance in
the zones that the shrinkers can't fix which causes
sleeping_prematurely() to return true almost indefinitely.

> Secondly, taking an up-to-100ms sleep in response to a need_resched()
> seems pretty savage and I suspect it risks undesirable side-effects.  A
> plain old cond_resched() would be more cautious.  But presumably
> kswapd() is already running cond_resched() pretty frequently, so why
> didn't that work?

So the specific problem with cond_resched() is that kswapd is still
runnable, so even if there's other work the system can be getting on
with, it quickly comes back to looping madly in kswapd.  If we return
false from sleeping_prematurely(), we stop kswapd until it's woken up to
do more work.  This manifests, even on non-sandybridge systems that
don't hang, as a lot of time burned in kswapd.

I think the sandybridge bug I see on the laptop is that cond_resched()
is somehow ineffective:  kswapd is usually hogging one CPU and there are
runnable processes but they seem to cluster on other CPUs, leaving
kswapd to spin at close to 100% system time.

When the problem was first described, we tried sprinkling more
cond_rescheds() in the shrinker loop and it didn't work.

James



* Re: [PATCH 2/2] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-17  6:37       ` James Bottomley
@ 2011-05-17 23:22         ` Andrew Morton
  -1 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2011-05-17 23:22 UTC (permalink / raw)
  To: James Bottomley
  Cc: Mel Gorman, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
	Johannes Weiner, Minchan Kim, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4, stable

On Tue, 17 May 2011 10:37:04 +0400
James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> On Mon, 2011-05-16 at 14:16 -0700, Andrew Morton wrote:
> > On Mon, 16 May 2011 16:06:57 +0100
> > Mel Gorman <mgorman@suse.de> wrote:
> > 
> > > Under constant allocation pressure, kswapd can be in the situation where
> > > sleeping_prematurely() will always return true even if kswapd has been
> > > running a long time. Check if kswapd needs to be scheduled.
> > > 
> > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > Acked-by: Rik van Riel <riel@redhat.com>
> > > ---
> > >  mm/vmscan.c |    4 ++++
> > >  1 files changed, 4 insertions(+), 0 deletions(-)
> > > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index af24d1e..4d24828 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> > >  	unsigned long balanced = 0;
> > >  	bool all_zones_ok = true;
> > >  
> > > +	/* If kswapd has been running too long, just sleep */
> > > +	if (need_resched())
> > > +		return false;
> > > +
> > >  	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
> > >  	if (remaining)
> > >  		return true;
> > 
> > I'm a bit worried by this one.
> > 
> > Do we really fully understand why kswapd is continuously running like
> > this?  The changelog makes me think "no" ;)
> > 
> > Given that the page-allocating process is madly reclaiming pages in
> > direct reclaim (yes?) and that kswapd is madly reclaiming pages on a
> > different CPU, we should pretty promptly get into a situation where
> > kswapd can suspend itself.  But that obviously isn't happening.  So
> > what *is* going on?
> 
> The triggering workload is a massive untar using a file on the same
> filesystem, so that's a continuous stream of pages read into the cache
> for the input and a stream of dirty pages out for the writes.  We
> thought it might have been out of control shrinkers, so we already
> debugged that and found it wasn't.  It just seems to be an imbalance in
> the zones that the shrinkers can't fix which causes
> sleeping_prematurely() to return true almost indefinitely.

Is the untar disk-bound?  The untar has presumably hit the writeback
dirty_ratio?  So its rate of page allocation is approximately equal to
the write speed of the disks?

If so, the VM is consuming 100% of a CPU to reclaim pages at a mere
tens-of-megabytes-per-second.  If so, there's something seriously wrong
here - under favorable conditions one would expect reclaim to free up
100,000 pages/sec, maybe more.

If the untar is not disk-bound and the required page reclaim rate is
equal to the rate at which a CPU can read, decompress and write to
pagecache then, err, maybe possible.  But it still smells of
inefficient reclaim.

> > Secondly, taking an up-to-100ms sleep in response to a need_resched()
> > seems pretty savage and I suspect it risks undesirable side-effects.  A
> > plain old cond_resched() would be more cautious.  But presumably
> > kswapd() is already running cond_resched() pretty frequently, so why
> > didn't that work?
> 
> So the specific problem with cond_resched() is that kswapd is still
> runnable, so even if there's other work the system can be getting on
> with, it quickly comes back to looping madly in kswapd.  If we return
> false from sleeping_prematurely(), we stop kswapd until it's woken up to
> do more work.  This manifests, even on non-sandybridge systems that
> don't hang, as a lot of time burned in kswapd.
> 
> I think the sandybridge bug I see on the laptop is that cond_resched()
> is somehow ineffective:  kswapd is usually hogging one CPU and there are
> runnable processes but they seem to cluster on other CPUs, leaving
> kswapd to spin at close to 100% system time.
> 
> When the problem was first described, we tried sprinkling more
> cond_rescheds() in the shrinker loop and it didn't work.

Seems to me that kswapd for some reason is doing too much work.  Or,
more specifically is doing its work very inefficiently.  Making kswapd
take arbitrary naps when it's misbehaving didn't fix that misbehaviour!

It would be interesting to watch kswapd's page reclaim inefficiency
when this is happening: /proc/vmstat:pgscan_kswapd_* versus
/proc/vmstat:kswapd_steal.  If that ratio is high then kswapd is
scanning many pages and not reclaiming them.

But given the prominence of shrink_slab in the traces, perhaps that
isn't happening.
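
A minimal userspace sketch of that check, assuming the 2.6.38-era
counter names (pgscan_kswapd_<zone> and kswapd_steal) in /proc/vmstat:

#include <stdio.h>
#include <string.h>

int main(void)
{
        char name[64];
        unsigned long long val, scanned = 0, stolen = 0;
        FILE *f = fopen("/proc/vmstat", "r");

        if (!f) {
                perror("/proc/vmstat");
                return 1;
        }
        /* Sum the pgscan_kswapd_* counters and pick out kswapd_steal */
        while (fscanf(f, "%63s %llu", name, &val) == 2) {
                if (!strncmp(name, "pgscan_kswapd_", 14))
                        scanned += val;
                else if (!strcmp(name, "kswapd_steal"))
                        stolen = val;
        }
        fclose(f);

        /* A high scanned/stolen ratio means kswapd is scanning many
         * pages without reclaiming them. */
        printf("scanned %llu  stolen %llu  ratio %.2f\n",
               scanned, stolen, stolen ? (double)scanned / stolen : 0.0);
        return 0;
}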


* Re: [PATCH 2/2] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-17 23:22         ` Andrew Morton
@ 2011-05-18  9:47           ` Mel Gorman
  -1 siblings, 0 replies; 29+ messages in thread
From: Mel Gorman @ 2011-05-18  9:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
	Johannes Weiner, Minchan Kim, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4, stable

On Tue, May 17, 2011 at 04:22:26PM -0700, Andrew Morton wrote:
> On Tue, 17 May 2011 10:37:04 +0400
> James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> 
> > On Mon, 2011-05-16 at 14:16 -0700, Andrew Morton wrote:
> > > On Mon, 16 May 2011 16:06:57 +0100
> > > Mel Gorman <mgorman@suse.de> wrote:
> > > 
> > > > Under constant allocation pressure, kswapd can be in the situation where
> > > > sleeping_prematurely() will always return true even if kswapd has been
> > > > running a long time. Check if kswapd needs to be scheduled.
> > > > 
> > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > > Acked-by: Rik van Riel <riel@redhat.com>
> > > > ---
> > > >  mm/vmscan.c |    4 ++++
> > > >  1 files changed, 4 insertions(+), 0 deletions(-)
> > > > 
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index af24d1e..4d24828 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> > > >  	unsigned long balanced = 0;
> > > >  	bool all_zones_ok = true;
> > > >  
> > > > +	/* If kswapd has been running too long, just sleep */
> > > > +	if (need_resched())
> > > > +		return false;
> > > > +
> > > >  	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
> > > >  	if (remaining)
> > > >  		return true;
> > > 
> > > I'm a bit worried by this one.
> > > 
> > > Do we really fully understand why kswapd is continuously running like
> > > this?  The changelog makes me think "no" ;)
> > > 
> > > Given that the page-allocating process is madly reclaiming pages in
> > > direct reclaim (yes?) and that kswapd is madly reclaiming pages on a
> > > different CPU, we should pretty promptly get into a situation where
> > > kswapd can suspend itself.  But that obviously isn't happening.  So
> > > what *is* going on?
> > 
> > The triggering workload is a massive untar using a file on the same
> > filesystem, so that's a continuous stream of pages read into the cache
> > for the input and a stream of dirty pages out for the writes.  We
> > thought it might have been out of control shrinkers, so we already
> > debugged that and found it wasn't.  It just seems to be an imbalance in
> > the zones that the shrinkers can't fix which causes
> > sleeping_prematurely() to return true almost indefinitely.
> 
> Is the untar disk-bound?  The untar has presumably hit the writeback
> dirty_ratio?  So its rate of page allocation is approximately equal to
> the write speed of the disks?
> 

A reasonable assumption but it gets messy.

> If so, the VM is consuming 100% of a CPU to reclaim pages at a mere
> tens-of-megabytes-per-second.  If so, there's something seriously wrong
> here - under favorable conditions one would expect reclaim to free up
> 100,000 pages/sec, maybe more.
> 
> If the untar is not disk-bound and the required page reclaim rate is
> equal to the rate at which a CPU can read, decompress and write to
> pagecache then, err, maybe possible.  But it still smells of
> inefficient reclaim.
> 

I think it's higher than just the rate of data but couldn't guess by
how much exactly. Reproducing this locally would have been nice but
the following conditions are likely happening on the problem machine.

   SLUB is using high orders for its slabs, and kswapd and reclaimers
   are reclaiming at a faster rate than required for just the data.
   SLUB is using order-2 allocs for inodes, so for every 18 files
   created by untar we need an order-2 page. For ext4_io_end, we need
   order-3 allocs and we are allocating these due to delayed block
   allocation.

   So for example: 50 files, each less than 1 page in size, need 50
   order-0 pages, 3 order-2 pages and 2 order-3 pages

   To satisfy the high order pages, we are reclaiming at least 28
   pages. For compaction, we are migrating these, so we are allocating
   a further 28 pages and then copying, putting further pressure on
   the system. We may do this multiple times as order-0 allocations
   could be breaking up the pages again. Without compaction, we are
   only reclaiming, but we can get stalled for significant periods of
   time if dirty or writeback pages are encountered in the contiguous
   blocks, and we can reclaim too many pages quite easily.

So the rate of allocation required to write out data is higher than
just the data rate. The reclaim rate could be just fine but the number
of pages we need to reclaim to allocate slab objects can be screwy.
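
A trivial sketch of the arithmetic above, assuming 4KiB base pages:

#include <stdio.h>

int main(void)
{
        /* from the 50-file example: 3 order-2 and 2 order-3 allocations */
        int base = 3 * (1 << 2) + 2 * (1 << 3);

        /* 3*4 + 2*8 = 28 base pages for the slabs alone, on top of
         * the 50 order-0 pages for the file data itself */
        printf("%d base pages for the high-order slabs\n", base);
        return 0;
}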

> > > Secondly, taking an up-to-100ms sleep in response to a need_resched()
> > > seems pretty savage and I suspect it risks undesirable side-effects.  A
> > > plain old cond_resched() would be more cautious.  But presumably
> > > kswapd() is already running cond_resched() pretty frequently, so why
> > > didn't that work?
> > 
> > So the specific problem with cond_resched() is that kswapd is still
> > runnable, so even if there's other work the system can be getting on
> > with, it quickly comes back to looping madly in kswapd.  If we return
> > false from sleeping_prematurely(), we stop kswapd until it's woken up to
> > do more work.  This manifests, even on non-sandybridge systems that
> > don't hang, as a lot of time burned in kswapd.
> > 
> > I think the sandybridge bug I see on the laptop is that cond_resched()
> > is somehow ineffective:  kswapd is usually hogging one CPU and there are
> > runnable processes but they seem to cluster on other CPUs, leaving
> > kswapd to spin at close to 100% system time.
> > 
> > When the problem was first described, we tried sprinkling more
> > cond_rescheds() in the shrinker loop and it didn't work.
> 
> Seems to me that kswapd for some reason is doing too much work.  Or,
> more specifically is doing its work very inefficiently.  Making kswapd
> take arbitrary naps when it's misbehaving didn't fix that misbehaviour!
> 

It is likely to be doing work inefficiently in one of two ways

  1. We are reclaiming far more pages than required by the data
     for slab objects

  2. The rate we are reclaiming is fast enough that dirty pages are
     reaching the end of the LRU quickly

The latter part is also important. I doubt we are getting stalled in
writepage, as this is new data being written to disk whose blocks aren't
allocated yet, but kswapd is encountering dirty_ratio's worth of pages
on the LRU, churning them through the LRU and reclaiming the clean
pages in between.

In effect, this "sorts" the LRU lists so the dirty pages get grouped
together. At worst on a 2G system such as James', we have 104857
pages (20% of memory) together on the LRU, all dirty and
all being skipped over by kswapd and direct reclaimers. This is at
least 3276 takings of the zone LRU lock, assuming we isolate pages in
groups of SWAP_CLUSTER_MAX, which is a lot of list walking and CPU usage
for no pages reclaimed.
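
The figures work out as follows (a sketch assuming 4KiB pages, 2GB of
RAM, a dirty_ratio of 20% and SWAP_CLUSTER_MAX of 32):

#include <stdio.h>

int main(void)
{
        unsigned long total_pages = (2UL << 30) / 4096;     /* 524288 */
        unsigned long dirty_pages = total_pages * 20 / 100; /* 104857 */
        unsigned long lock_takings = dirty_pages / 32;      /* 3276   */

        printf("%lu dirty pages -> at least %lu zone LRU lock takings\n",
               dirty_pages, lock_takings);
        return 0;
}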

In this case, kswapd might as well take a brief nap as it can't clean
the pages so the flusher threads can get some work done.

> It would be interesting to watch kswapd's page reclaim inefficiency
> when this is happening: /proc/vmstat:pgscan_kswapd_* versus
> /proc/vmstat:kswapd_steal.  If that ratio is high then kswapd is
> scanning many pages and not reclaiming them.
> 
> But given the prominence of shrink_slab in the traces, perhaps that
> isn't happening.
> 

As we are aggressively shrinking slab, we can reach the stage where
we scan the requested number of objects and reclaim none of them,
potentially setting zone->all_unreclaimable to 1 if a lot of scanning
has also taken place recently without pages being freed. Once this
happens, kswapd isn't even trying to reclaim pages and is instead stuck
in shrink_slab until a page is freed, clearing zone->all_unreclaimable
and zone->pages_scanned.

The ratio during that window would not change but slabs_scanned would
continue to increase.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/2] mm: vmscan: If kswapd has been running too long, allow it to sleep
@ 2011-05-18  9:47           ` Mel Gorman
  0 siblings, 0 replies; 29+ messages in thread
From: Mel Gorman @ 2011-05-18  9:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
	Johannes Weiner, Minchan Kim, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4, stable

On Tue, May 17, 2011 at 04:22:26PM -0700, Andrew Morton wrote:
> On Tue, 17 May 2011 10:37:04 +0400
> James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> 
> > On Mon, 2011-05-16 at 14:16 -0700, Andrew Morton wrote:
> > > On Mon, 16 May 2011 16:06:57 +0100
> > > Mel Gorman <mgorman@suse.de> wrote:
> > > 
> > > > Under constant allocation pressure, kswapd can be in the situation where
> > > > sleeping_prematurely() will always return true even if kswapd has been
> > > > running a long time. Check if kswapd needs to be scheduled.
> > > > 
> > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > > Acked-by: Rik van Riel <riel@redhat.com>
> > > > ---
> > > >  mm/vmscan.c |    4 ++++
> > > >  1 files changed, 4 insertions(+), 0 deletions(-)
> > > > 
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index af24d1e..4d24828 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> > > >  	unsigned long balanced = 0;
> > > >  	bool all_zones_ok = true;
> > > >  
> > > > +	/* If kswapd has been running too long, just sleep */
> > > > +	if (need_resched())
> > > > +		return false;
> > > > +
> > > >  	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
> > > >  	if (remaining)
> > > >  		return true;
> > > 
> > > I'm a bit worried by this one.
> > > 
> > > Do we really fully understand why kswapd is continuously running like
> > > this?  The changelog makes me think "no" ;)
> > > 
> > > Given that the page-allocating process is madly reclaiming pages in
> > > direct reclaim (yes?) and that kswapd is madly reclaiming pages on a
> > > different CPU, we should pretty promptly get into a situation where
> > > kswapd can suspend itself.  But that obviously isn't happening.  So
> > > what *is* going on?
> > 
> > The triggering workload is a massive untar using a file on the same
> > filesystem, so that's a continuous stream of pages read into the cache
> > for the input and a stream of dirty pages out for the writes.  We
> > thought it might have been out of control shrinkers, so we already
> > debugged that and found it wasn't.  It just seems to be an imbalance in
> > the zones that the shrinkers can't fix which causes
> > sleeping_prematurely() to return true almost indefinitely.
> 
> Is the untar disk-bound?  The untar has presumably hit the writeback
> dirty_ratio?  So its rate of page allocation is approximately equal to
> the write speed of the disks?
> 

A reasonable assumption but it gets messy.

> If so, the VM is consuming 100% of a CPU to reclaim pages at a mere
> tens-of-megabytes-per-second.  If so, there's something seriously wrong
> here - under favorable conditions one would expect reclaim to free up
> 100,000 pages/sec, maybe more.
> 
> If the untar is not disk-bound and the required page reclaim rate is
> equal to the rate at which a CPU can read, decompress and write to
> pagecache then, err, maybe possible.  But it still smells of
> inefficient reclaim.
> 

I think it's higher than just the rate of data but couldn't guess by
how much exactly. Reproducing this locally would have been nice but
the following conditions are likely happening on the problem machine.

   SLUB is using high-orders for its slabs, kswapd and reclaimers are
   reclaiming at a faster rate than required for just the data. SLUB
   is using order-2 allocs for inodes so every 18 files created by
   untar, we need an order-2 page. For ext4_io_end, we need order-3
   allocs and we are allocating these due to delayed block allocation.

   So for example: 50 files, each less than 1 page in size needs 50
   order-0 pages, 3 order-2 page and 2 order-3 pages

   To satisfy the high order pages, we are reclaiming at least 28
   pages. For compaction, we are migrating these so we are allocating
   a further 28 pages and then copying putting further pressure on
   the system. We may do this multiple times as order-0 allocations
   could be breaking up the pages again. Without compaction, we are
   only reclaiming but can get stalled for significant periods of
   time if dirty or writeback pages are encountered in the contiguous
   blocks and can reclaim too many pages quite easily.

So the rate of allocation required to write out data is higher than
just the data rate. The reclaim rate could be just fine but the number
of pages we need to reclaim to allocate slab objects can be screwy.
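
To spell out the arithmetic (illustrative, assuming 4K pages and the
slab layouts above): an order-2 page is 4 pages and an order-3 page
is 8 pages, so

	3 * 4 + 2 * 8 = 28 pages

reclaimed just for the high-order allocations, on top of the 50
order-0 pages for the data itself, and compaction migrating those
contiguous blocks allocates a further 28 pages again.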

> > > Secondly, taking an up-to-100ms sleep in response to a need_resched()
> > > seems pretty savage and I suspect it risks undesirable side-effects.  A
> > > plain old cond_resched() would be more cautious.  But presumably
> > > kswapd() is already running cond_resched() pretty frequently, so why
> > > didn't that work?
> > 
> > So the specific problem with cond_resched() is that kswapd is still
> > runnable, so even if there's other work the system can be getting on
> > with, it quickly comes back to looping madly in kswapd.  If we return
> > false from sleeping_prematurely(), we stop kswapd until it's woken up to
> > do more work.  This manifests, even on non sandybridge systems that
> > don't hang, as a lot of time burned in kswapd.
> > 
> > I think the sandybridge bug I see on the laptop is that cond_resched()
> > is somehow ineffective:  kswapd is usually hogging one CPU and there are
> > runnable processes but they seem to cluster on other CPUs, leaving
> > kswapd to spin at close to 100% system time.
> > 
> > When the problem was first described, we tried sprinkling more
> > cond_rescheds() in the shrinker loop and it didn't work.
> 
> Seems to me that kswapd for some reason is doing too much work.  Or,
> more specifically is doing its work very inefficiently.  Making kswapd
> take arbitrary naps when it's misbehaving didn't fix that misbehaviour!
> 

It is likely to be doing work inefficiently in one of two ways

  1. We are reclaiming far more pages than required by the data
     for slab objects

  2. The rate we are reclaiming is fast enough that dirty pages are
     reaching the end of the LRU quickly

The latter part is also important. I doubt we are getting stalled in
writepage as this is new data being written to disk, to blocks that aren't
allocated yet, but kswapd is encountering the dirty_ratio of pages
on the LRU, churning them through the LRU and reclaiming the clean
pages in between.

In effect, this "sorts" the LRU lists so the dirty pages get grouped
together. At worst on a 2G system such as James', we have 104857
(20% of memory in pages) pages together on the LRU, all dirty and
all being skipped over by kswapd and direct reclaimers. This is at
least 3276 takings of the zone LRU lock assuming we isolate pages in
groups of SWAP_CLUSTER_MAX, which is a lot of list walking and CPU usage
for no pages reclaimed.

In this case, kswapd might as well take a brief nap as it can't clean
the pages so the flusher threads can get some work done.
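
(Spelling those numbers out as an illustration: 2G is 524288 4K pages,
20% of that is 104857 pages, and 104857 / SWAP_CLUSTER_MAX (32) is
roughly 3276 isolations under the zone LRU lock with nothing to show
for them.)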

> It would be interesting to watch kswapd's page reclaim inefficiency
> when this is happening: /proc/vmstat:pgscan_kswapd_* versus
> /proc/vmstat:kswapd_steal.  If that ratio is high then kswapd is
> scanning many pages and not reclaiming them.
> 
> But given the prominence of shrink_slab in the traces, perhaps that
> isn't happening.
> 

As we are aggressively shrinking slab, we can reach the stage where
we scan the requested number of objects and reclaim none of them,
potentially setting zone->all_unreclaimable to 1 if a lot of scanning
has also taken place recently without pages being freed. Once this
happens, kswapd isn't even trying to reclaim pages and is instead stuck
in shrink_slab until a page is freed, clearing zone->all_unreclaimable
and zone->pages_scanned.

The ratio during that window would not change but slabs_scanned would
continue to increase.
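
(For reference, the heuristic that drives all_unreclaimable is,
roughly, from memory of the 2.6.38-era mm/vmscan.c, a sketch rather
than the exact source:

	/* a zone stops counting as reclaimable once we have scanned
	 * six times its reclaimable pages without freeing any */
	static bool zone_reclaimable(struct zone *zone)
	{
		return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
	}

so heavy scanning with no frees flips the zone over.)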

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/2] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-18  9:47           ` Mel Gorman
@ 2011-05-18 22:42             ` Minchan Kim
  -1 siblings, 0 replies; 29+ messages in thread
From: Minchan Kim @ 2011-05-18 22:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4, stable

On Wed, May 18, 2011 at 6:47 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Tue, May 17, 2011 at 04:22:26PM -0700, Andrew Morton wrote:
>> On Tue, 17 May 2011 10:37:04 +0400
>> James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
>>
>> > On Mon, 2011-05-16 at 14:16 -0700, Andrew Morton wrote:
>> > > On Mon, 16 May 2011 16:06:57 +0100
>> > > Mel Gorman <mgorman@suse.de> wrote:
>> > >
>> > > > Under constant allocation pressure, kswapd can be in the situation where
>> > > > sleeping_prematurely() will always return true even if kswapd has been
>> > > > running a long time. Check if kswapd needs to be scheduled.
>> > > >
>> > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
>> > > > Acked-by: Rik van Riel <riel@redhat.com>
>> > > > ---
>> > > >  mm/vmscan.c |    4 ++++
>> > > >  1 files changed, 4 insertions(+), 0 deletions(-)
>> > > >
>> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > > > index af24d1e..4d24828 100644
>> > > > --- a/mm/vmscan.c
>> > > > +++ b/mm/vmscan.c
>> > > > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>> > > >         unsigned long balanced = 0;
>> > > >         bool all_zones_ok = true;
>> > > >
>> > > > +       /* If kswapd has been running too long, just sleep */
>> > > > +       if (need_resched())
>> > > > +               return false;
>> > > > +
>> > > >         /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>> > > >         if (remaining)
>> > > >                 return true;
>> > >
>> > > I'm a bit worried by this one.
>> > >
>> > > Do we really fully understand why kswapd is continuously running like
>> > > this?  The changelog makes me think "no" ;)
>> > >
>> > > Given that the page-allocating process is madly reclaiming pages in
>> > > direct reclaim (yes?) and that kswapd is madly reclaiming pages on a
>> > > different CPU, we should pretty promptly get into a situation where
>> > > kswapd can suspend itself.  But that obviously isn't happening.  So
>> > > what *is* going on?
>> >
>> > The triggering workload is a massive untar using a file on the same
>> > filesystem, so that's a continuous stream of pages read into the cache
>> > for the input and a stream of dirty pages out for the writes.  We
>> > thought it might have been out of control shrinkers, so we already
>> > debugged that and found it wasn't.  It just seems to be an imbalance in
>> > the zones that the shrinkers can't fix which causes
>> > sleeping_prematurely() to return true almost indefinitely.
>>
>> Is the untar disk-bound?  The untar has presumably hit the writeback
>> dirty_ratio?  So its rate of page allocation is approximately equal to
>> the write speed of the disks?
>>
>
> A reasonable assumption but it gets messy.
>
>> If so, the VM is consuming 100% of a CPU to reclaim pages at a mere
>> tens-of-megabytes-per-second.  If so, there's something seriously wrong
>> here - under favorable conditions one would expect reclaim to free up
>> 100,000 pages/sec, maybe more.
>>
>> If the untar is not disk-bound and the required page reclaim rate is
>> equal to the rate at which a CPU can read, decompress and write to
>> pagecache then, err, maybe possible.  But it still smells of
>> inefficient reclaim.
>>
>
> I think it's higher than just the rate of data but couldn't guess by
> how much exactly. Reproducing this locally would have been nice but
> the following conditions are likely happening on the problem machine.
>
>   SLUB is using high-orders for its slabs, kswapd and reclaimers are
>   reclaiming at a faster rate than required for just the data. SLUB
>   is using order-2 allocs for inodes, so for every 18 files created by
>   untar, we need an order-2 page. For ext4_io_end, we need order-3
>   allocs and we are allocating these due to delayed block allocation.
>
>   So for example: 50 files, each less than 1 page in size, need 50
>   order-0 pages, 3 order-2 pages and 2 order-3 pages.
>
>   To satisfy the high order pages, we are reclaiming at least 28
>   pages. For compaction, we are migrating these so we are allocating
>   a further 28 pages and then copying, putting further pressure on
>   the system. We may do this multiple times as order-0 allocations
>   could be breaking up the pages again. Without compaction, we are
>   only reclaiming but can get stalled for significant periods of
>   time if dirty or writeback pages are encountered in the contiguous
>   blocks and can reclaim too many pages quite easily.
>
> So the rate of allocation required to write out data is higher than
> just the data rate. The reclaim rate could be just fine but the number
> of pages we need to reclaim to allocate slab objects can be screwy.
>
>> > > Secondly, taking an up-to-100ms sleep in response to a need_resched()
>> > > seems pretty savage and I suspect it risks undesirable side-effects.  A
>> > > plain old cond_resched() would be more cautious.  But presumably
>> > > kswapd() is already running cond_resched() pretty frequently, so why
>> > > didn't that work?
>> >
>> > So the specific problem with cond_resched() is that kswapd is still
>> > runnable, so even if there's other work the system can be getting on
>> > with, it quickly comes back to looping madly in kswapd.  If we return
>> > false from sleeping_prematurely(), we stop kswapd until it's woken up to
>> > do more work.  This manifests, even on non sandybridge systems that
>> > don't hang, as a lot of time burned in kswapd.
>> >
>> > I think the sandybridge bug I see on the laptop is that cond_resched()
>> > is somehow ineffective:  kswapd is usually hogging one CPU and there are
>> > runnable processes but they seem to cluster on other CPUs, leaving
>> > kswapd to spin at close to 100% system time.
>> >
>> > When the problem was first described, we tried sprinkling more
>> > cond_rescheds() in the shrinker loop and it didn't work.
>>
>> Seems to me that kswapd for some reason is doing too much work.  Or,
>> more specifically is doing its work very inefficiently.  Making kswapd
>> take arbitrary naps when it's misbehaving didn't fix that misbehaviour!
>>
>
> It is likely to be doing work inefficiently in one of two ways
>
>  1. We are reclaiming far more pages than required by the data
>     for slab objects
>
>  2. The rate we are reclaiming is fast enough that dirty pages are
>     reaching the end of the LRU quickly
>
> The latter part is also important. I doubt we are getting stalled in
> writepage as this is new data being written to disk, to blocks that aren't
> allocated yet, but kswapd is encountering the dirty_ratio of pages
> on the LRU, churning them through the LRU and reclaiming the clean
> pages in between.
>
> In effect, this "sorts" the LRU lists so the dirty pages get grouped
> together. At worst on a 2G system such as James', we have 104857
> (20% of memory in pages) pages together on the LRU, all dirty and
> all being skipped over by kswapd and direct reclaimers. This is at
> least 3276 takings of the zone LRU lock assuming we isolate pages in
> groups of SWAP_CLUSTER_MAX, which is a lot of list walking and CPU usage
> for no pages reclaimed.
>
> In this case, kswapd might as well take a brief nap as it can't clean
> the pages so the flusher threads can get some work done.
>
>> It would be interesting to watch kswapd's page reclaim inefficiency
>> when this is happening: /proc/vmstat:pgscan_kswapd_* versus
>> /proc/vmstat:kswapd_steal.  If that ratio is high then kswapd is
>> scanning many pages and not reclaiming them.
>>
>> But given the prominence of shrink_slab in the traces, perhaps that
>> isn't happening.
>>
>
> As we are aggressively shrinking slab, we can reach the stage where
> we scan the requested number of objects and reclaim none of them,
> potentially setting zone->all_unreclaimable to 1 if a lot of scanning
> has also taken place recently without pages being freed. Once this
> happens, kswapd isn't even trying to reclaim pages and is instead stuck
> in shrink_slab until a page is freed, clearing zone->all_unreclaimable
> and zone->pages_scanned.

Why does it get stuck in shrink_slab?
If the zone is having trouble reclaiming (i.e., all_unreclaimable is set),
kswapd will poll the zone only at DEF_PRIORITY (i.e., a small
window) to see whether the problem goes away. At the higher priorities
(0..11), the zone will be skipped and we never get a chance to call
shrink_[zone|slab].
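
The check I mean in balance_pgdat() is, roughly (from memory of the
2.6.38-era mm/vmscan.c, a sketch rather than the exact source):

	if (zone->all_unreclaimable && priority != DEF_PRIORITY)
		continue;	/* only re-examine the zone at DEF_PRIORITY */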


>
> The ratio during that window would not change but slabs_scanned would
> continue to increase.
>
> --
> Mel Gorman
> SUSE Labs
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/2] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-18  9:47           ` Mel Gorman
@ 2011-05-19  0:28             ` Dave Chinner
  -1 siblings, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2011-05-19  0:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Rik van Riel, Johannes Weiner, Minchan Kim, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4, stable

On Wed, May 18, 2011 at 10:47:18AM +0100, Mel Gorman wrote:
> As we are aggressively shrinking slab, we can reach the stage where
> we scan the requested number of objects and reclaim none of them,
> potentially setting zone->all_unreclaimable to 1 if a lot of scanning
> has also taken place recently without pages being freed. Once this
> happens, kswapd isn't even trying to reclaim pages and is instead stuck
> in shrink_slab until a page is freed, clearing zone->all_unreclaimable
> and zone->pages_scanned.

Isn't this completely broken then? We can have slabs with lots of
objects but none are reclaimable - e.g. dirty inodes are not even on
the inode LRU and require IO to get there, so repeatedly scanning
the slab trying to free inodes is completely pointless.

If the shrinkers are not freeing anything, then it should be backing
off; giving them some time to clean objects is a much more
efficient use of CPU time than spinning madly. Indeed, if you back
off, you can do another pass over the LRU and see if there are more
pages that can be reclaimed, too, so you're not dependent on the
shrinkers actually making progress to break the livelock....
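
(The kind of backoff I mean is nothing fancy, an illustrative sketch
only, not a patch:

	/* if a shrink_slab() pass freed nothing, don't spin on it;
	 * let IO clean some objects, then take another pass at the LRU */
	if (reclaim_state->reclaimed_slab == 0)
		congestion_wait(BLK_RW_ASYNC, HZ/10);

i.e. sleep briefly instead of hammering the shrinkers.)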

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/2] mm: vmscan: If kswapd has been running too long, allow it to sleep
  2011-05-18 22:42             ` Minchan Kim
  (?)
@ 2011-05-19  9:19               ` Mel Gorman
  -1 siblings, 0 replies; 29+ messages in thread
From: Mel Gorman @ 2011-05-19  9:19 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4, stable

On Thu, May 19, 2011 at 07:42:29AM +0900, Minchan Kim wrote:
> On Wed, May 18, 2011 at 6:47 PM, Mel Gorman <mgorman@suse.de> wrote:
> > On Tue, May 17, 2011 at 04:22:26PM -0700, Andrew Morton wrote:
> >> On Tue, 17 May 2011 10:37:04 +0400
> >> James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> >>
> >> > On Mon, 2011-05-16 at 14:16 -0700, Andrew Morton wrote:
> >> > > On Mon, 16 May 2011 16:06:57 +0100
> >> > > Mel Gorman <mgorman@suse.de> wrote:
> >> > >
> >> > > > Under constant allocation pressure, kswapd can be in the situation where
> >> > > > sleeping_prematurely() will always return true even if kswapd has been
> >> > > > running a long time. Check if kswapd needs to be scheduled.
> >> > > >
> >> > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> >> > > > Acked-by: Rik van Riel <riel@redhat.com>
> >> > > > ---
> >> > > >  mm/vmscan.c |    4 ++++
> >> > > >  1 files changed, 4 insertions(+), 0 deletions(-)
> >> > > >
> >> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> > > > index af24d1e..4d24828 100644
> >> > > > --- a/mm/vmscan.c
> >> > > > +++ b/mm/vmscan.c
> >> > > > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> >> > > >         unsigned long balanced = 0;
> >> > > >         bool all_zones_ok = true;
> >> > > >
> >> > > > +       /* If kswapd has been running too long, just sleep */
> >> > > > +       if (need_resched())
> >> > > > +               return false;
> >> > > > +
> >> > > >         /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
> >> > > >         if (remaining)
> >> > > >                 return true;
> >> > >
> >> > > I'm a bit worried by this one.
> >> > >
> >> > > Do we really fully understand why kswapd is continuously running like
> >> > > this?  The changelog makes me think "no" ;)
> >> > >
> >> > > Given that the page-allocating process is madly reclaiming pages in
> >> > > direct reclaim (yes?) and that kswapd is madly reclaiming pages on a
> >> > > different CPU, we should pretty promptly get into a situation where
> >> > > kswapd can suspend itself.  But that obviously isn't happening.  So
> >> > > what *is* going on?
> >> >
> >> > The triggering workload is a massive untar using a file on the same
> >> > filesystem, so that's a continuous stream of pages read into the cache
> >> > for the input and a stream of dirty pages out for the writes.  We
> >> > thought it might have been out of control shrinkers, so we already
> >> > debugged that and found it wasn't.  It just seems to be an imbalance in
> >> > the zones that the shrinkers can't fix which causes
> >> > sleeping_prematurely() to return true almost indefinitely.
> >>
> >> Is the untar disk-bound?  The untar has presumably hit the writeback
> >> dirty_ratio?  So its rate of page allocation is approximately equal to
> >> the write speed of the disks?
> >>
> >
> > A reasonable assumption but it gets messy.
> >
> >> If so, the VM is consuming 100% of a CPU to reclaim pages at a mere
> >> tens-of-megabytes-per-second.  If so, there's something seriously wrong
> >> here - under favorable conditions one would expect reclaim to free up
> >> 100,000 pages/sec, maybe more.
> >>
> >> If the untar is not disk-bound and the required page reclaim rate is
> >> equal to the rate at which a CPU can read, decompress and write to
> >> pagecache then, err, maybe possible.  But it still smells of
> >> inefficient reclaim.
> >>
> >
> > I think it's higher than just the rate of data but couldn't guess by
> > how much exactly. Reproducing this locally would have been nice but
> > the following conditions are likely happening on the problem machine.
> >
> >   SLUB is using high-orders for its slabs, kswapd and reclaimers are
> >   reclaiming at a faster rate than required for just the data. SLUB
> >   is using order-2 allocs for inodes, so for every 18 files created by
> >   untar, we need an order-2 page. For ext4_io_end, we need order-3
> >   allocs and we are allocating these due to delayed block allocation.
> >
> >   So for example: 50 files, each less than 1 page in size, need 50
> >   order-0 pages, 3 order-2 pages and 2 order-3 pages.
> >
> >   To satisfy the high order pages, we are reclaiming at least 28
> >   pages. For compaction, we are migrating these so we are allocating
> >   a further 28 pages and then copying, putting further pressure on
> >   the system. We may do this multiple times as order-0 allocations
> >   could be breaking up the pages again. Without compaction, we are
> >   only reclaiming but can get stalled for significant periods of
> >   time if dirty or writeback pages are encountered in the contiguous
> >   blocks and can reclaim too many pages quite easily.
> >
> > So the rate of allocation required to write out data is higher than
> > just the data rate. The reclaim rate could be just fine but the number
> > of pages we need to reclaim to allocate slab objects can be screwy.
> >
> >> > > Secondly, taking an up-to-100ms sleep in response to a need_resched()
> >> > > seems pretty savage and I suspect it risks undesirable side-effects.  A
> >> > > plain old cond_resched() would be more cautious.  But presumably
> >> > > kswapd() is already running cond_resched() pretty frequently, so why
> >> > > didn't that work?
> >> >
> >> > So the specific problem with cond_resched() is that kswapd is still
> >> > runnable, so even if there's other work the system can be getting on
> >> > with, it quickly comes back to looping madly in kswapd.  If we return
> >> > false from sleeping_prematurely(), we stop kswapd until it's woken up to
> >> > do more work.  This manifests, even on non sandybridge systems that
> >> > don't hang, as a lot of time burned in kswapd.
> >> >
> >> > I think the sandybridge bug I see on the laptop is that cond_resched()
> >> > is somehow ineffective:  kswapd is usually hogging one CPU and there are
> >> > runnable processes but they seem to cluster on other CPUs, leaving
> >> > kswapd to spin at close to 100% system time.
> >> >
> >> > When the problem was first described, we tried sprinkling more
> >> > cond_rescheds() in the shrinker loop and it didn't work.
> >>
> >> Seems to me that kswapd for some reason is doing too much work.  Or,
> >> more specifically is doing its work very inefficiently.  Making kswapd
> >> take arbitrary naps when it's misbehaving didn't fix that misbehaviour!
> >>
> >
> > It is likely to be doing work inefficiently in one of two ways
> >
> >  1. We are reclaiming far more pages than required by the data
> >     for slab objects
> >
> >  2. The rate we are reclaiming is fast enough that dirty pages are
> >     reaching the end of the LRU quickly
> >
> > The latter part is also important. I doubt we are getting stalled in
> > writepage as this is new data being written to disk, to blocks that aren't
> > allocated yet, but kswapd is encountering the dirty_ratio of pages
> > on the LRU, churning them through the LRU and reclaiming the clean
> > pages in between.
> >
> > In effect, this "sorts" the LRU lists so the dirty pages get grouped
> > together. At worst on a 2G system such as James', we have 104857
> > (20% of memory in pages) pages together on the LRU, all dirty and
> > all being skipped over by kswapd and direct reclaimers. This is at
> > least 3276 takings of the zone LRU lock assuming we isolate pages in
> > groups of SWAP_CLUSTER_MAX, which is a lot of list walking and CPU usage
> > for no pages reclaimed.
> >
> > In this case, kswapd might as well take a brief nap as it can't clean
> > the pages so the flusher threads can get some work done.
> >
> >> It would be interesting to watch kswapd's page reclaim inefficiency
> >> when this is happening: /proc/vmstat:pgscan_kswapd_* versus
> >> /proc/vmstat:kswapd_steal.  If that ratio is high then kswapd is
> >> scanning many pages and not reclaiming them.
> >>
> >> But given the prominence of shrink_slab in the traces, perhaps that
> >> isn't happening.
> >>
> >
> > As we are aggressively shrinking slab, we can reach the stage where
> > we scan the requested number of objects and reclaim none of them,
> > potentially setting zone->all_unreclaimable to 1 if a lot of scanning
> > has also taken place recently without pages being freed. Once this
> > happens, kswapd isn't even trying to reclaim pages and is instead stuck
> > in shrink_slab until a page is freed, clearing zone->all_unreclaimable
> > and zone->pages_scanned.
> 
> Why does it get stuck in shrink_slab?
> If the zone is having trouble reclaiming (i.e., all_unreclaimable is set),
> kswapd will poll the zone only at DEF_PRIORITY (i.e., a small
> window) to see whether the problem goes away.

"stuck in shrink" was a poor choice of words. I should have said we
can spend a lot of time in there.

True, kswapd will only poll the zones while all_unreclaimable is
set but it only takes one page to be freed to the per-cpu list to
clear all_unreclaimable again. Once any zone has all_unreclaimable
cleared, the watermarks are checked but with enough direct
reclaimers, it's possible watermarks are met so shrink_zone is not
called but shrink_slab is called anyway. Depending on the result,
all_unreclaimable can get set again (possibly incorrectly, as there
are simply no reclaimable slab objects rather than the zone being truly
unreclaimable). Another scenario is that all zones except ZONE_DMA have
all_unreclaimable set when kswapd runs. kswapd finds the watermarks
to be ok as those zones are only lightly used, so it skips shrink_zone()
but calls shrink_slab() anyway.

Both of these situations would allow kswapd to use a lot of CPU while
spending a significant percentage of it in shrink_slab().
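
As a rough sketch of the balance_pgdat() flow I mean (the shape of
the 2.6.38-era code from memory, not the exact source):

	/* per-zone loop in balance_pgdat(), simplified */
	if (!zone_watermark_ok_safe(zone, order,
				high_wmark_pages(zone), end_zone, 0))
		shrink_zone(priority, zone, &sc);

	/* shrink_slab() runs even when shrink_zone() was skipped */
	reclaim_state->reclaimed_slab = 0;
	nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, lru_pages);

	/* no slab freed and too much recent scanning: give up on zone */
	if (nr_slab == 0 && !zone_reclaimable(zone))
		zone->all_unreclaimable = 1;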

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2011-05-19  9:19 UTC | newest]

Thread overview: 29+ messages
2011-05-16 15:06 [PATCH 0/2] Eliminate hangs when using frequent high-order allocations V3 Mel Gorman
2011-05-16 15:06 ` [PATCH 1/2] mm: vmscan: Correct use of pgdat_balanced in sleeping_prematurely Mel Gorman
2011-05-16 15:26   ` Johannes Weiner
2011-05-17  5:26     ` Wu Fengguang
2011-05-16 23:05   ` Minchan Kim
2011-05-16 15:06 ` [PATCH 2/2] mm: vmscan: If kswapd has been running too long, allow it to sleep Mel Gorman
2011-05-16 15:26   ` Johannes Weiner
2011-05-16 21:16   ` Andrew Morton
2011-05-17  6:37     ` James Bottomley
2011-05-17 23:22       ` Andrew Morton
2011-05-18  9:47         ` Mel Gorman
2011-05-18 22:42           ` Minchan Kim
2011-05-19  9:19             ` Mel Gorman
2011-05-19  0:28           ` Dave Chinner
