* [patch] vmscan: fix zone shrinking exit when scan work is done
@ 2011-02-09 15:46 ` Johannes Weiner
  0 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2011-02-09 15:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Mel Gorman, Rik van Riel, Michal Hocko,
	Kent Overstreet, linux-mm, linux-kernel

Hi,

I think this should fix the problem of processes getting stuck in
reclaim that has been reported several times.  Kent actually
single-stepped through this code and noted that it was never exiting
shrink_zone(), which really narrowed it down a lot, considering the
tons of nested loops from the allocator down to the list shrinking.

	Hannes

---
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: vmscan: fix zone shrinking exit when scan work is done

'3e7d344 mm: vmscan: reclaim order-0 and use compaction instead of
lumpy reclaim' introduced an indefinite loop in shrink_zone().

It meant to break out of this loop when no pages had been reclaimed
and not a single page was even scanned.  The way it would detect the
latter is by taking a snapshot of sc->nr_scanned at the beginning of
the function and comparing it against the new sc->nr_scanned after the
scan loop.  But it would re-iterate without updating that snapshot,
looping forever if sc->nr_scanned changed at least once since
shrink_zone() was invoked.
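
Abridged, the affected control flow looks like this (a sketch only; the
scan loop body is elided and only the snapshot handling is shown):

static void shrink_zone(int priority, struct zone *zone,
			struct scan_control *sc)
{
	unsigned long nr_reclaimed;
	unsigned long nr_scanned = sc->nr_scanned;	/* snapshot, taken once */

restart:
	nr_reclaimed = 0;
	/* ... scan the LRU lists; every page looked at bumps sc->nr_scanned ... */

	/*
	 * Meant to stop once a pass reclaims nothing and scans nothing.
	 * But the snapshot is never refreshed on "goto restart", so
	 * sc->nr_scanned - nr_scanned stays positive forever once a
	 * single page has been scanned after entering shrink_zone().
	 */
	if (should_continue_reclaim(zone, nr_reclaimed,
					sc->nr_scanned - nr_scanned, sc))
		goto restart;
}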

This is not the sole condition that would exit that loop, but hitting
the other exit conditions requires other processes to change the zone
state, as the reclaimer that is stuck obviously cannot do so anymore.

This is only happening for higher-order allocations, where reclaim is
run back to back with compaction.

Reported-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 148c6e6..17497d0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1882,12 +1882,12 @@ static void shrink_zone(int priority, struct zone *zone,
 	unsigned long nr[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
 	enum lru_list l;
-	unsigned long nr_reclaimed;
+	unsigned long nr_reclaimed, nr_scanned;
 	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
-	unsigned long nr_scanned = sc->nr_scanned;
 
 restart:
 	nr_reclaimed = 0;
+	nr_scanned = sc->nr_scanned;
 	get_scan_count(zone, sc, nr, priority);
 
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
-- 
1.7.4



* Re: [patch] vmscan: fix zone shrinking exit when scan work is done
  2011-02-09 15:46 ` Johannes Weiner
@ 2011-02-09 15:54   ` Kent Overstreet
  -1 siblings, 0 replies; 44+ messages in thread
From: Kent Overstreet @ 2011-02-09 15:54 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andrea Arcangeli, Mel Gorman, Rik van Riel,
	Michal Hocko, linux-mm, linux-kernel

On 02/09/2011 07:46 AM, Johannes Weiner wrote:
> Hi,
>
> I think this should fix the problem of processes getting stuck in
> reclaim that has been reported several times.  Kent actually
> single-stepped through this code and noted that it was never exiting
> shrink_zone(), which really narrowed it down a lot, considering the
> tons of nested loops from the allocator down to the list shrinking.
>
> 	Hannes

I was able to trigger this in just a few minutes stress testing bcache, 
and now it's been going for half an hour working beautifully. Thanks!

>
> ---
> From: Johannes Weiner<hannes@cmpxchg.org>
> Subject: vmscan: fix zone shrinking exit when scan work is done
>
> '3e7d344 mm: vmscan: reclaim order-0 and use compaction instead of
> lumpy reclaim' introduced an indefinite loop in shrink_zone().
>
> It meant to break out of this loop when no pages had been reclaimed
> and not a single page was even scanned.  The way it would detect the
> latter is by taking a snapshot of sc->nr_scanned at the beginning of
> the function and comparing it against the new sc->nr_scanned after the
> scan loop.  But it would re-iterate without updating that snapshot,
> looping forever if sc->nr_scanned changed at least once since
> shrink_zone() was invoked.
>
> This is not the sole condition that would exit that loop, but it
> requires other processes to change the zone state, as the reclaimer
> that is stuck obviously can not anymore.
>
> This is only happening for higher-order allocations, where reclaim is
> run back to back with compaction.
>
> Reported-by: Michal Hocko<mhocko@suse.cz>
> Reported-by: Kent Overstreet<kent.overstreet@gmail.com>
> Signed-off-by: Johannes Weiner<hannes@cmpxchg.org>

Tested-by: Kent Overstreet<kent.overstreet@gmail.com>

> ---
>   mm/vmscan.c |    4 ++--
>   1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 148c6e6..17497d0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1882,12 +1882,12 @@ static void shrink_zone(int priority, struct zone *zone,
>   	unsigned long nr[NR_LRU_LISTS];
>   	unsigned long nr_to_scan;
>   	enum lru_list l;
> -	unsigned long nr_reclaimed;
> +	unsigned long nr_reclaimed, nr_scanned;
>   	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> -	unsigned long nr_scanned = sc->nr_scanned;
>
>   restart:
>   	nr_reclaimed = 0;
> +	nr_scanned = sc->nr_scanned;
>   	get_scan_count(zone, sc, nr, priority);
>
>   	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||



* Re: [patch] vmscan: fix zone shrinking exit when scan work is done
  2011-02-09 15:46 ` Johannes Weiner
@ 2011-02-09 16:46   ` Mel Gorman
  -1 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2011-02-09 16:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andrea Arcangeli, Rik van Riel, Michal Hocko,
	Kent Overstreet, linux-mm, linux-kernel

On Wed, Feb 09, 2011 at 04:46:06PM +0100, Johannes Weiner wrote:
> Hi,
> 
> I think this should fix the problem of processes getting stuck in
> reclaim that has been reported several times.

I don't think it's the only source, but I'm basing that on having seen
constant looping in balance_pgdat() with calls to congestion_wait() a few
weeks ago, which I haven't rechecked since. However, this looks like a
real fix for a real problem.

> Kent actually
> single-stepped through this code and noted that it was never exiting
> shrink_zone(), which really narrowed it down a lot, considering the
> tons of nested loops from the allocator down to the list shrinking.
> 
> 	Hannes
> 
> ---
> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: vmscan: fix zone shrinking exit when scan work is done
> 
> '3e7d344 mm: vmscan: reclaim order-0 and use compaction instead of
> lumpy reclaim' introduced an indefinite loop in shrink_zone().
> 
> It meant to break out of this loop when no pages had been reclaimed
> and not a single page was even scanned.  The way it would detect the
> latter is by taking a snapshot of sc->nr_scanned at the beginning of
> the function and comparing it against the new sc->nr_scanned after the
> scan loop.  But it would re-iterate without updating that snapshot,
> looping forever if sc->nr_scanned changed at least once since
> shrink_zone() was invoked.
> 
> This is not the sole condition that would exit that loop, but it
> requires other processes to change the zone state, as the reclaimer
> that is stuck obviously can not anymore.
> 
> This is only happening for higher-order allocations, where reclaim is
> run back to back with compaction.
> 
> Reported-by: Michal Hocko <mhocko@suse.cz>
> Reported-by: Kent Overstreet <kent.overstreet@gmail.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Well spotted.

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
SUSE Labs


* Re: [patch] vmscan: fix zone shrinking exit when scan work is done
  2011-02-09 16:46   ` Mel Gorman
@ 2011-02-09 18:28     ` Andrea Arcangeli
  -1 siblings, 0 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2011-02-09 18:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Andrew Morton, Rik van Riel, Michal Hocko,
	Kent Overstreet, linux-mm, linux-kernel

On Wed, Feb 09, 2011 at 04:46:56PM +0000, Mel Gorman wrote:
> On Wed, Feb 09, 2011 at 04:46:06PM +0100, Johannes Weiner wrote:
> > Hi,
> > 
> > I think this should fix the problem of processes getting stuck in
> > reclaim that has been reported several times.
> 
> I don't think it's the only source but I'm basing this on seeing
> constant looping in balance_pgdat() and calling congestion_wait() a few
> weeks ago that I haven't rechecked since. However, this looks like a
> real fix for a real problem.

Agreed. Just yesterday I spent some time on the lumpy compaction
changes after wondering about Michal's khugepaged 100% report, and I
expected some fix was needed in this area (as I couldn't find any bug
in khugepaged yet, so the lumpy compaction looked like the next candidate
for bugs).

I've also been wondering about the !nr_scanned check in
should_continue_reclaim too but I didn't look too much into the caller
(I was tempted to remove it altogether). I don't see how checking
nr_scanned can be safe even after we fix the caller to avoid passing
non-zero values if "goto restart".

nr_scanned is incremented even for !page_evictable... so it's not
really useful to insist, just because we scanned something, in my
view. It looks bogus... So my proposal would be below.

====
Subject: mm: stop checking nr_scanned in should_continue_reclaim

From: Andrea Arcangeli <aarcange@redhat.com>

nr_scanned is incremented even for !page_evictable... so it's not
really useful to insist, just because we scanned something.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 148c6e6..9741884 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1831,7 +1831,6 @@ out:
  */
 static inline bool should_continue_reclaim(struct zone *zone,
 					unsigned long nr_reclaimed,
-					unsigned long nr_scanned,
 					struct scan_control *sc)
 {
 	unsigned long pages_for_compaction;
@@ -1841,15 +1840,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
 		return false;
 
-	/*
-	 * If we failed to reclaim and have scanned the full list, stop.
-	 * NOTE: Checking just nr_reclaimed would exit reclaim/compaction far
-	 *       faster but obviously would be less likely to succeed
-	 *       allocation. If this is desirable, use GFP_REPEAT to decide
-	 *       if both reclaimed and scanned should be checked or just
-	 *       reclaimed
-	 */
-	if (!nr_reclaimed && !nr_scanned)
+	/* If we failed to reclaim stop. */
+	if (!nr_reclaimed)
 		return false;
 
 	/*
@@ -1884,7 +1876,6 @@ static void shrink_zone(int priority, struct zone *zone,
 	enum lru_list l;
 	unsigned long nr_reclaimed;
 	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
-	unsigned long nr_scanned = sc->nr_scanned;
 
 restart:
 	nr_reclaimed = 0;
@@ -1923,8 +1914,7 @@ restart:
 		shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
 
 	/* reclaim/compaction might need reclaim to continue */
-	if (should_continue_reclaim(zone, nr_reclaimed,
-					sc->nr_scanned - nr_scanned, sc))
+	if (should_continue_reclaim(zone, nr_reclaimed, sc))
 		goto restart;
 
 	throttle_vm_writeout(sc->gfp_mask);




* Re: [patch] vmscan: fix zone shrinking exit when scan work is done
  2011-02-09 18:28     ` Andrea Arcangeli
@ 2011-02-09 20:05       ` Andrew Morton
  -1 siblings, 0 replies; 44+ messages in thread
From: Andrew Morton @ 2011-02-09 20:05 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Johannes Weiner, Rik van Riel, Michal Hocko,
	Kent Overstreet, linux-mm, linux-kernel

On Wed, 9 Feb 2011 19:28:46 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Wed, Feb 09, 2011 at 04:46:56PM +0000, Mel Gorman wrote:
> > On Wed, Feb 09, 2011 at 04:46:06PM +0100, Johannes Weiner wrote:
> > > Hi,
> > > 
> > > I think this should fix the problem of processes getting stuck in
> > > reclaim that has been reported several times.
> > 
> > I don't think it's the only source but I'm basing this on seeing
> > constant looping in balance_pgdat() and calling congestion_wait() a few
> > weeks ago that I haven't rechecked since. However, this looks like a
> > real fix for a real problem.
> 
> Agreed. Just yesterday I spent some time on the lumpy compaction
> changes after wondering about Michal's khugepaged 100% report, and I
> expected some fix was needed in this area (as I couldn't find any bug
> in khugepaged yet, so the lumpy compaction looked the next candidate
> for bugs).
> 
> I've also been wondering about the !nr_scanned check in
> should_continue_reclaim too but I didn't look too much into the caller
> (I was tempted to remove it all together). I don't see how checking
> nr_scanned can be safe even after we fix the caller to avoid passing
> non-zero values if "goto restart".
> 
> nr_scanned is incremented even for !page_evictable... so it's not
> really useful to insist, just because we scanned something, in my
> view. It looks bogus... So my proposal would be below.
> 
> ====
> Subject: mm: stop checking nr_scanned in should_continue_reclaim
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> nr_scanned is incremented even for !page_evictable... so it's not
> really useful to insist, just because we scanned something.

So if reclaim has scanned 100% !page_evictable pages,
should_continue_reclaim() can return true and we keep on scanning?

That sounds like it's both good and bad :( Is this actually a problem? 
What sort of behaviour could it cause and under what circumstances?

Johannes's patch is an obvious bugfix and I'll run with it for now, but
please let's have a further think about the impact of the
!page_evictable pages.



* Re: [patch] vmscan: fix zone shrinking exit when scan work is done
  2011-02-09 15:46 ` Johannes Weiner
@ 2011-02-10  4:04   ` Minchan Kim
  -1 siblings, 0 replies; 44+ messages in thread
From: Minchan Kim @ 2011-02-10  4:04 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andrea Arcangeli, Mel Gorman, Rik van Riel,
	Michal Hocko, Kent Overstreet, linux-mm, linux-kernel

On Thu, Feb 10, 2011 at 12:46 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Hi,
>
> I think this should fix the problem of processes getting stuck in
> reclaim that has been reported several times.  Kent actually
> single-stepped through this code and noted that it was never exiting
> shrink_zone(), which really narrowed it down a lot, considering the
> tons of nested loops from the allocator down to the list shrinking.
>
>        Hannes
>
> ---
> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: vmscan: fix zone shrinking exit when scan work is done
>
> '3e7d344 mm: vmscan: reclaim order-0 and use compaction instead of
> lumpy reclaim' introduced an indefinite loop in shrink_zone().
>
> It meant to break out of this loop when no pages had been reclaimed
> and not a single page was even scanned.  The way it would detect the
> latter is by taking a snapshot of sc->nr_scanned at the beginning of
> the function and comparing it against the new sc->nr_scanned after the
> scan loop.  But it would re-iterate without updating that snapshot,
> looping forever if sc->nr_scanned changed at least once since
> shrink_zone() was invoked.
>
> This is not the sole condition that would exit that loop, but it
> requires other processes to change the zone state, as the reclaimer
> that is stuck obviously can not anymore.
>
> This is only happening for higher-order allocations, where reclaim is
> run back to back with compaction.
>
> Reported-by: Michal Hocko <mhocko@suse.cz>
> Reported-by: Kent Overstreet <kent.overstreet@gmail.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim


* Re: [patch] vmscan: fix zone shrinking exit when scan work is done
  2011-02-09 18:28     ` Andrea Arcangeli
@ 2011-02-10 10:21       ` Mel Gorman
  -1 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2011-02-10 10:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Johannes Weiner, Andrew Morton, Rik van Riel, Michal Hocko,
	Kent Overstreet, linux-mm, linux-kernel

On Wed, Feb 09, 2011 at 07:28:46PM +0100, Andrea Arcangeli wrote:
> On Wed, Feb 09, 2011 at 04:46:56PM +0000, Mel Gorman wrote:
> > On Wed, Feb 09, 2011 at 04:46:06PM +0100, Johannes Weiner wrote:
> > > Hi,
> > > 
> > > I think this should fix the problem of processes getting stuck in
> > > reclaim that has been reported several times.
> > 
> > I don't think it's the only source but I'm basing this on seeing
> > constant looping in balance_pgdat() and calling congestion_wait() a few
> > weeks ago that I haven't rechecked since. However, this looks like a
> > real fix for a real problem.
> 
> Agreed. Just yesterday I spent some time on the lumpy compaction
> changes after wondering about Michal's khugepaged 100% report, and I
> expected some fix was needed in this area (as I couldn't find any bug
> in khugepaged yet, so the lumpy compaction looked the next candidate
> for bugs).
> 

Michal did report that disabling defrag did not help but the stack trace
also showed that it was stuck in shrink_zone() which is what Johannes'
patch targets. It's not unreasonable to test if Johannes' patch solves
Michal's problem. Michal, I know that your workload is a bit random and
may not be reproducible but do you think it'd be possible to determine
if Johannes' patch helps?

> I've also been wondering about the !nr_scanned check in
> should_continue_reclaim too but I didn't look too much into the caller
> (I was tempted to remove it all together). I don't see how checking
> nr_scanned can be safe even after we fix the caller to avoid passing
> non-zero values if "goto restart".
> 
> nr_scanned is incremented even for !page_evictable... so it's not
> really useful to insist, just because we scanned something, in my
> view. It looks bogus... So my proposal would be below.
> 

We should not be ending up in a situation where the LRU list consists only
of unevictable (!page_evictable) pages and that situation persists, causing
excessive (or infinite) looping. As unevictable pages are encountered on the
LRU list, they should be moved to the unevictable lists by
putback_lru_page().  Are you aware of a situation where this becomes broken?
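
For reference, a heavily simplified sketch of that path (the real
putback_lru_page() also deals with the active bit, mlocked pages, races
and stats, all omitted here):

void putback_lru_page(struct page *page)
{
	if (page_evictable(page, NULL))
		/* back onto a scannable (in)active list */
		lru_cache_add_lru(page, page_lru(page));
	else
		/* parked on the unevictable list, not rescanned */
		add_page_to_unevictable_list(page);
	put_page(page);
}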

I recognise that SWAP_CLUSTER_MAX pages could all be unevictable and that
they all get moved. In this case, nr_scanned is positive and we continue
to scan but this is expected and desirable: Reclaim/compaction needs more
pages to be freed before it starts compaction. If it stops scanning early,
then it would just fail the allocation later. This is what the "NOTE" is about.

I prefer Johannes' fix for the observed problem.

> ====
> Subject: mm: stop checking nr_scanned in should_continue_reclaim
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> nr_scanned is incremented even for !page_evictable... so it's not
> really useful to insist, just because we scanned something.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 148c6e6..9741884 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1831,7 +1831,6 @@ out:
>   */
>  static inline bool should_continue_reclaim(struct zone *zone,
>  					unsigned long nr_reclaimed,
> -					unsigned long nr_scanned,
>  					struct scan_control *sc)
>  {
>  	unsigned long pages_for_compaction;
> @@ -1841,15 +1840,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
>  	if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
>  		return false;
>  
> -	/*
> -	 * If we failed to reclaim and have scanned the full list, stop.
> -	 * NOTE: Checking just nr_reclaimed would exit reclaim/compaction far
> -	 *       faster but obviously would be less likely to succeed
> -	 *       allocation. If this is desirable, use GFP_REPEAT to decide
> -	 *       if both reclaimed and scanned should be checked or just
> -	 *       reclaimed
> -	 */
> -	if (!nr_reclaimed && !nr_scanned)
> +	/* If we failed to reclaim stop. */
> +	if (!nr_reclaimed)
>  		return false;
>  
>  	/*
> @@ -1884,7 +1876,6 @@ static void shrink_zone(int priority, struct zone *zone,
>  	enum lru_list l;
>  	unsigned long nr_reclaimed;
>  	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> -	unsigned long nr_scanned = sc->nr_scanned;
>  
>  restart:
>  	nr_reclaimed = 0;
> @@ -1923,8 +1914,7 @@ restart:
>  		shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
>  
>  	/* reclaim/compaction might need reclaim to continue */
> -	if (should_continue_reclaim(zone, nr_reclaimed,
> -					sc->nr_scanned - nr_scanned, sc))
> +	if (should_continue_reclaim(zone, nr_reclaimed, sc))
>  		goto restart;
>  
>  	throttle_vm_writeout(sc->gfp_mask);
> 
> 

-- 
Mel Gorman


* Re: [patch] vmscan: fix zone shrinking exit when scan work is done
  2011-02-10 10:21       ` Mel Gorman
@ 2011-02-10 10:41         ` Michal Hocko
  -1 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2011-02-10 10:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Johannes Weiner, Andrew Morton, Rik van Riel,
	Kent Overstreet, linux-mm, linux-kernel

On Thu 10-02-11 10:21:10, Mel Gorman wrote:
> On Wed, Feb 09, 2011 at 07:28:46PM +0100, Andrea Arcangeli wrote:
> > On Wed, Feb 09, 2011 at 04:46:56PM +0000, Mel Gorman wrote:
> > > On Wed, Feb 09, 2011 at 04:46:06PM +0100, Johannes Weiner wrote:
> > > > Hi,
> > > > 
> > > > I think this should fix the problem of processes getting stuck in
> > > > reclaim that has been reported several times.
> > > 
> > > I don't think it's the only source but I'm basing this on seeing
> > > constant looping in balance_pgdat() and calling congestion_wait() a few
> > > weeks ago that I haven't rechecked since. However, this looks like a
> > > real fix for a real problem.
> > 
> > Agreed. Just yesterday I spent some time on the lumpy compaction
> > changes after wondering about Michal's khugepaged 100% report, and I
> > expected some fix was needed in this area (as I couldn't find any bug
> > in khugepaged yet, so the lumpy compaction looked the next candidate
> > for bugs).
> > 
> 
> Michal did report that disabling defrag did not help but the stack trace
> also showed that it was stuck in shrink_zone() which is what Johannes'
> patch targets. It's not unreasonable to test if Johannes' patch solves
> Michal's problem. Michal, I know that your workload is a bit random and
> may not be reproducible but do you think it'd be possible to determine
> if Johannes' patch helps?

Sure, I can test it. Nevertheless, I haven't seen the problem again. I
have tried to generate some memory pressure on the machine, but no "luck".

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* Re: [patch] vmscan: fix zone shrinking exit when scan work is done
  2011-02-10 10:21       ` Mel Gorman
@ 2011-02-10 12:48         ` Andrea Arcangeli
  -1 siblings, 0 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2011-02-10 12:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Andrew Morton, Rik van Riel, Michal Hocko,
	Kent Overstreet, linux-mm, linux-kernel

On Thu, Feb 10, 2011 at 10:21:10AM +0000, Mel Gorman wrote:
> We should not be ending up in a situation with the LRU list of only
> page_evictable pages and that situation persisting causing excessive (or
> infinite) looping. As unevictable pages are encountered on the LRU list,
> they should be moved to the unevictable lists by putback_lru_page().  Are you
> aware of a situation where this becomes broken?
> 
> I recognise that SWAP_CLUSTER_MAX pages could all be unevictable and they
> are all get moved. In this case, nr_scanned is positive and we continue
> to scan but this is expected and desirable: Reclaim/compaction needs more
> pages to be freed before it starts compaction. If it stops scanning early,
> then it would just fail the allocation later. This is what the "NOTE" is about.
> 
> I prefer Johannes' fix for the observed problem.

should_continue_reclaim is only needed for compaction. It tries to
free enough pages so that compaction can succeed in its defrag
attempt, so breaking the loop faster isn't going to cause failures for
order-0 pages. My worry is that we loop too much in shrink_zone just
for compaction even when we don't make any progress. Before this
change, shrink_zone would never scan more than SWAP_CLUSTER_MAX pages;
now it can loop over the whole lru as long as we're scanning stuff.
It's ok to overboost shrink_zone when we're making progress, to allow
compaction at the next round, but if we don't visibly make progress,
I'm concerned that scanning the whole list may be too aggressive. The
performance benefit of having a hugepage isn't as huge as the cost of
scanning all pages in the lru, when before we would have broken the
loop and declared failure after only SWAP_CLUSTER_MAX pages and then
fallen back to an order-0 allocation. The fix may help of course, and
maybe it's enough for his case, I don't know, but I don't see it
making a whole lot of difference, except that now it will stop when
the lru is practically empty, which clearly is an improvement. I think
we shouldn't be so worried about compaction succeeding; the important
thing is that we don't waste time in compaction if there's not enough
free memory, but compaction_suitable, used by both logics, should be
enough for that.

I'd prefer that, if hugetlbfs has special needs, it use a __GFP_ flag
or similar that makes compaction stricter about succeeding, up to
scanning the whole lru in one go in order to free some memory for
compaction to succeed.

Going ahead with the scan until compaction_suitable is true makes
sense, instead, when there's no memory pressure and nr_reclaimed is
never zero.

Maybe we should try a few more times than just nr_reclaim, but going
over the whole lru sounds a bit extreme.

The issue isn't just with unevictable pages, which will be refiled
during the scan; it will also happen in the presence of lots of
referenced pages. For example, if we don't apply my fix, the current
code can take down all the young bits in all the ptes in the whole
system in one go before returning from shrink_zone. That is too much
in my view, and losing all that information in one go (not to mention
the cost associated with losing it) can hardly be offset by the
improvement given by one more hugepage.

But please let me know if I've misread something...

Thanks,
Andrea


* Re: [patch] vmscan: fix zone shrinking exit when scan work is done
  2011-02-10 12:48         ` Andrea Arcangeli
@ 2011-02-10 13:33           ` Mel Gorman
  -1 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2011-02-10 13:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Johannes Weiner, Andrew Morton, Rik van Riel, Michal Hocko,
	Kent Overstreet, linux-mm, linux-kernel

On Thu, Feb 10, 2011 at 01:48:38PM +0100, Andrea Arcangeli wrote:
> On Thu, Feb 10, 2011 at 10:21:10AM +0000, Mel Gorman wrote:
> > We should not be ending up in a situation with the LRU list of only
> > page_evictable pages and that situation persisting causing excessive (or
> > infinite) looping. As unevictable pages are encountered on the LRU list,
> > they should be moved to the unevictable lists by putback_lru_page().  Are you
> > aware of a situation where this becomes broken?
> > 
> > I recognise that SWAP_CLUSTER_MAX pages could all be unevictable and they
> > all get moved. In this case, nr_scanned is positive and we continue
> > to scan but this is expected and desirable: Reclaim/compaction needs more
> > pages to be freed before it starts compaction. If it stops scanning early,
> > then it would just fail the allocation later. This is what the "NOTE" is about.
> > 
> > I prefer Johannes' fix for the observed problem.
> 
> should_continue_reclaim is only needed for compaction. It tries to
> free enough pages so that compaction can succeed in its defrag
> attempt.

Correct.

> So breaking the loop faster isn't going to cause failures for
> 0 order pages.

Also true; I commented on this in the "NOTE" your patch deletes, with a
suggestion that an alternative would be to break early unless GFP_REPEAT is set.

> My worry is that we loop too much in shrink_zone just
> for compaction even when we don't make any progress. shrink_zone would
> never scan more than SWAP_CLUSTER_MAX pages, before this change.

Sortof. Lumpy reclaim would have scanned more than SWAP_CLUSTER_MAX so
scanning was still pretty high. The other costs of lumpy reclaim would hide
it of course.

> Now
> it can loop over the whole lru as long as we're scanning stuff.

True, the alternative being failing the allocation. Returning sooner is of
course an option, but it would be preferable to see a case where the logic
after Johannes' patch is failing.

> Ok to
> overboost shrink_zone if we're making progress to allow compaction at
> the next round, but if we don't visibly make progress, I'm concerned
> that it may be too aggressive to scan the whole list. The performance
> benefit of having a hugepage isn't as huge as scanning all pages in
> the lru when before we would have broken the loop and declared failure
> after only SWAP_CLUSTER_MAX pages, and then we would have fallen back
> to an order-0 allocation.

What about other cases such as order-1 allocations for stack or order-3
allocations for those network cards using jumbo frames without
scatter/gather?

Don't get me wrong, I see your point but I'm wondering if there really are
cases where we routinely scan an entire LRU list of unevictable pages that
are somehow not being migrated properly to the unevictable lists. If
this is happening, we are also in trouble for reclaiming for order-0
pages, right?

> The fix may help of course, maybe it's enough
> for his case I don't know, but I don't see it making a whole lot of
> difference, except now it will stop when the lru is practically empty
> which clearly is an improvement. I think we shouldn't be so worried
> about succeeding compaction, the important thing is we don't waste
> time in compaction if there's not enough free memory but
> compaction_suitable used by both logics should be enough for that.
> 
> I'd rather prefer that if hugetlbfs has special needs it uses a __GFP_

It uses GFP_REPEAT. That is why I specifically mentioned it in the "NOTE"
as an alternative way we could break early while still being aggressive
when required. The only reason it's not that way now is because a) I didn't
consider an LRU mostly full of unevictable pages to be the normal case and
b) allocations such as order-3 are preferable not to fail.
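
(For reference, the hugetlbfs path referred to here passes __GFP_REPEAT
when allocating huge pages -- roughly, from memory of the 2.6.38-era
alloc_fresh_huge_page_node() in mm/hugetlb.c, so the exact mask may
differ slightly:)

	page = alloc_pages_exact_node(nid,
		htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
					__GFP_REPEAT|__GFP_NOWARN,
		huge_page_order(h));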

> flag or similar that increases how compaction is strict in succeeding,
> up to scanning the whole lru in one go in order to make some free
> memory for compaction to succeed.
> 
> Going ahead with the scan until compaction_suitable is true instead
> makes sense when there's absence of memory pressure and nr_reclaimed
> is never zero.
> 
> Maybe we should try a bit more times than just nr_reclaim but going
> over the whole lru, sounds a bit extreme.
> 

Where should we draw the line? We could come up with a ratio of the lists
depending on priority, but it'd be hard to measure the gain or loss
without having a profile of a problem case to look at.

> The issue isn't just for unevictable pages, that will be refiled
> during the scan but it will also happen in presence of lots of
> referenced pages. For example if we don't apply my fix, the current
> code can take down all young bits in all ptes in one go in the whole
> system before returning from shrink_zone, that is too much in my view,
> and losing all that information in one go (not even to tell the cost
> associated with losing it) can hardly be offset by the improvement
> given by 1 more hugepage.
> 
> But please let me know if I've misread something...
> 

I don't think you have misread anything but if we're going to weaken
this logic, I'd at least like to see the GFP_REPEAT option tried - i.e.
preserve being aggressive if set. I'm also not convinced we routinely get
into a situation where the LRU consists of almost all unevictable pages
and if we are in this situation, that is a serious problem on its own. It
would also be preferable if we could get latency figures on alloc_pages for
hugepage-sized allocations and a count of how many are succeeding or failing
to measure the impact (if any).

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] vmscan: fix zone shrinking exit when scan work is done
  2011-02-10 13:33           ` Mel Gorman
@ 2011-02-10 14:14             ` Andrea Arcangeli
  -1 siblings, 0 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2011-02-10 14:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Andrew Morton, Rik van Riel, Michal Hocko,
	Kent Overstreet, linux-mm, linux-kernel

On Thu, Feb 10, 2011 at 01:33:24PM +0000, Mel Gorman wrote:
> Also true, I commented on this in the "Note" your patch deletes and a
> suggestion on how an alternative would be to break early unless GFP_REPEAT.

Yep noticed that ;), doing that with __GFP_REPEAT sounds just fine to me.

> Sortof. Lumpy reclaim would have scanned more than SWAP_CLUSTER_MAX so
> scanning was still pretty high. The other costs of lumpy reclaim would hide
> it of course.

Ok but we know lumpy reclaim was not ok to start with.

> What about other cases such as order-1 allocations for stack or order-3
> allocations for those network cards using jumbo frames without
> scatter/gather?

Stack order-1 is one of the few cases that come to mind where failing
an allocation becomes fatal. Maybe we should use __GFP_REPEAT there
too.

But we probably need a way to discriminate callers that can fall back
gracefully. I'd be extremely surprised if the cost of looping all over
the lru taking down all young bits could ever be offset by the jumbo
frame. In fact I'm afraid the jumbo frame might be better off without
using compaction at all, because it's probably very latency
sensitive. We need a 'lowlatency' version of compaction for these
users, where the improvement of having a compound page instead of a
regular page isn't very significant.

On a separate topic, I'm currently trying to use the new async
compaction code upstream with jumbo frames. I'm also wondering if I'll
have to set sync=0 by default unless __GFP_REPEAT is set. It seems
adding compaction to jumbo frames is increasing latency for certain
workloads in a measurable way. Things were fine when compaction was
only used by THP and not for all order allocations (but I didn't test
the async mode yet, plus the other optimizations for compaction you did
recently; I hope they're enough for jumbo frames).
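
To illustrate the idea (purely hypothetical, not from any posted patch;
the helper name below is invented), gating synchronous
migration/compaction on __GFP_REPEAT could be as simple as:

/* Hypothetical helper, for illustration only: callers that really
 * want the high-order page (__GFP_REPEAT) pay for synchronous
 * migration/compaction, everyone else gets the cheaper async mode. */
static inline bool want_sync_compaction(gfp_t gfp_mask)
{
	return !!(gfp_mask & __GFP_REPEAT);
}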

> Don't get me wrong, I see your point but I'm wondering if there really are
> cases where we routinely scan an entire LRU list of unevictable pages that
> are somehow not being migrated properly to the unevictable lists. If
> this is happening, we are also in trouble for reclaiming for order-0
> pages, right?

Well, unevictable pages are just an example and, like you said, they last
only one round of the loop at most. But other caching bits like the
referenced bits and all young bits will all get taken down during all
later loops too. We definitely don't want to swap just to allow
compaction to succeed! I think this argument explains it pretty well:
if you take down all young bits in a constant loop, then the system may
end up swapping. That's definitely something we don't want.

Things may be different if this is a stack allocation without
fallback, or if it's hugetlbfs again without kernel fallback (only
userland fallback).

> It uses GFP_REPEAT. That is why I specifically mentioned it in the "NOTE"
> as an alternative to how we could break early while still being aggressive
> when required. The only reason it's not that way now is because a) I didn't
> consider an LRU mostly full of unevictable pages to be the normal case and b)
> for allocations such as order-3 that are preferable not to fail.

Ok.

> Where should we draw the line? We could come up with a ratio of the lists
> depending on priority but it'd be hard to measure the gain or loss
> without having a profile of a problem case to look at.

I would just stick to !nr_reclaimed to break the loop, and ignore the
nr_scanned unless __GFP_REPEAT is set, in which case you're welcome to
scan everything.
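
(In code, this proposal amounts to roughly the following in
should_continue_reclaim() -- just a sketch; the patch Mel posts further
down the thread implements essentially this:)

	/* sketch of the __GFP_REPEAT-gated exit being proposed */
	if (sc->gfp_mask & __GFP_REPEAT) {
		if (!nr_reclaimed && !nr_scanned)
			return false;
	} else {
		if (!nr_reclaimed)
			return false;
	}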

Then we have to decide whether to add __GFP_REPEAT to the stack allocation...

> I don't think you have misread anything but if we're going to weaken
> this logic, I'd at least like to see the GFP_REPEAT option tried - i.e.

I see the point of __GFP_REPEAT, that sounds the best. I should have
just followed your comment, but I felt scanning everything was too
heavyweight regardless. Ok, I see you want as much accuracy as
possible in that case, even if that ends up in a swap storm.

> preserve being aggressive if set. I'm also not convinced we routinely get
> into a situation where the LRU consists of almost all unevictable pages
> and if we are in this situation, that is a serious problem on its own. It
> would also be preferable if we could get latency figures on alloc_pages for
> hugepage-sized allocations and a count of how many are succeeding or failing
> to measure the impact (if any).

I think unevictable pages were not the best example. I mentioned them
because the code looks like the snippet below, but to me all the gotos
following the !page_evictable check were also relevant for this
shrink_zone loop. The real-life issue is avoiding a swap storm (or an
expensive loop flooding the whole system with IPIs to take down all
young bits in all ptes) just to allocate a hugepage or jumbo frame
that has a graceful fallback which is not hugely slower than the
hugepage/jumbo frame.

	sc->nr_scanned++;

	if (unlikely(!page_evictable(page, NULL)))
		goto cull_mlocked;

I think making the !nr_scanned check conditional on __GFP_REPEAT, as
the comment suggested, is for now the best way to go.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] vmscan: fix zone shrinking exit when scan work is done
  2011-02-10 14:14             ` Andrea Arcangeli
@ 2011-02-10 14:58               ` Mel Gorman
  -1 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2011-02-10 14:58 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Johannes Weiner, Andrew Morton, Rik van Riel, Michal Hocko,
	Kent Overstreet, linux-mm, linux-kernel

On Thu, Feb 10, 2011 at 03:14:47PM +0100, Andrea Arcangeli wrote:
> On Thu, Feb 10, 2011 at 01:33:24PM +0000, Mel Gorman wrote:
> > Also true, I commented on this in the "Note" your patch deletes and a
> > suggestion on how an alternative would be to break early unless GFP_REPEAT.
> 
> Yep noticed that ;), doing that with __GFP_REPEAT sounds just fine to me.
> 

Great.

> > Sortof. Lumpy reclaim would have scanned more than SWAP_CLUSTER_MAX so
> > scanning was still pretty high. The other costs of lumpy reclaim would hide
> > it of course.
> 
> Ok but we know lumpy reclaim was not ok to start with.
> 

Sure.

> > What about other cases such as order-1 allocations for stack or order-3
> > allocations for those network cards using jumbo frames without
> > scatter/gather?
> 
> stack order 1 is one of the few cases that come to mind where failing
> an allocation becomes fatal. Maybe we should use __GFP_REPEAT there
> too.
> 

Actually, there shouldn't be a need. Small allocations such as order-1
effectively loop indefinitely due to the check in should_alloc_retry().
This means that even if reclaim/compaction breaks earlier than it
should, it'll get tried again.
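
(The check referred to lives in mm/page_alloc.c; roughly, paraphrased
from the 2.6.38-era should_alloc_retry() rather than quoted exactly:)

static inline int
should_alloc_retry(gfp_t gfp_mask, unsigned int order,
				unsigned long pages_reclaimed)
{
	/* Do not loop if specifically requested */
	if (gfp_mask & __GFP_NORETRY)
		return 0;

	/*
	 * Orders up to PAGE_ALLOC_COSTLY_ORDER are retried
	 * indefinitely, i.e. effectively treated as __GFP_NOFAIL
	 */
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return 1;

	/*
	 * Larger orders keep retrying only if __GFP_REPEAT is set and
	 * we have not yet reclaimed an order worth of pages
	 */
	if ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))
		return 1;

	return 0;
}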

> But we probably need a way to discriminate callers that can gracefully
> fallback. I'd be extremely surprised if the cost of looping all over
> the lru taking down all young bits could ever be offset by the jumbo
> frame. In fact the jumbo frame I'm afraid might be better off without
> using compaction at all because it's probably very latency
> sensitive.

It depends entirely on whether the jumbo frame can be received with
order-0 pages. If not, it means it's dropping packets, which is worse
for latency.

> We need a 'lowlatency' version of compaction for these
> users where the improvement of having a compound page instead of a
> regular page isn't very significant.
> 

It's not impossible to pass this information in once the cases where it
is required are identified.

> On a separate topic, I'm currently trying to use the new async
> compaction code upstream with jumbo frames. I'm also wondering if I'll
> have to set sync=0 by default unless __GFP_REPEAT is set. It seems
> adding compaction to jumbo frames is increasing latency to certain
> workloads in a measurable way.

This is interesting. Any profiles showing where the time is being spent?
In the event an order-3 allocation fails with the particular network
card, is it able to fall back to order-0 pages?

> Things were fine when compaction was
> only used by THP and not for all order allocations (but I didn't test
> the async mode yet plus the other optimizations for compaction you did
> recently, I hope they're enough to jumbo frames).
> 

Wish I had your test rig :/

> > Don't get me wrong, I see your point but I'm wondering if there really are
> > cases where we routinely scan an entire LRU list of unevictable pages that
> > are somehow not being migrated properly to the unevictable lists. If
> > this is happening, we are also in trouble for reclaiming for order-0
> > pages, right?
> 
> Well unevictable pages are just an example and like you said they last
> only one round of the loop at most. But other caching bits like the
> referenced bits and all young bits will get all taken down during all
> later loops too. We definitely don't want to swap just to allow
> compaction to succeed! I think this argument explains it pretty well,
> if you takedown all young bits in a constant loop, then system may end
> up swapping. That's definitely something we don't want.
> 

Avoiding the clearing of young bits is a much stronger argument.

> Things may be different if this is a stack allocation without
> fallback, or if it's hugetlbfs again without kernel fallback (only
> userland fallback).
> 
> > It uses GFP_REPEAT. That is why I specifically mentioned it in the "NOTE"
> > as an alternative to how we could break early while still being aggressive
> > when required. The only reason it's not that way now is because a) I didn't
> > consider an LRU mostly full of unevictable pages to be the normal case and b)
> > for allocations such as order-3 that are preferable not to fail.
> 
> Ok.
> 
> > Where should we draw the line? We could come up with a ratio of the lists
> > depending on priority but it'd be hard to measure the gain or loss
> > without having a profile of a problem case to look at.
> 
> I would just stick to !nr_reclaimed to break the loop, and ignore the
> nr_scanned unless __GFP_REPEAT is set, in which case you're welcome to
> scan everything.
> 

Patch that should do this is below.

> Then we've to decide if to add __GFP_REPEAT to the stack allocation...
> 

It shouldn't be necessary as the allocator will continue looping.

> > I don't think you have misread anything but if we're going to weaken
> > this logic, I'd at least like to see the GFP_REPEAT option tried - i.e.
> 
> I see the point of __GFP_REPEAT, that sounds the best, I should have
> just followed your comment but I felt scanning everything was too
> heavyweight regardless, but ok I see you want as much accuracy as
> possible in that case, even if that end up in a swap storm.
> 
> > preserve being aggressive if set. I'm also not convinced we routinely get
> > into a situation where the LRU consists of almost all unevictable pages
> > and if we are in this situation, that is a serious problem on its own. It
> > would also be preferable if we could get latency figures on alloc_pages for
> > hugepage-sized allocations and a count of how many are succeeding or failing
> > to measure the impact (if any).
> 
> I think I made not the best example talking about unevictable pages. I
> said that because the code is like below, but to me all the goto
> something following the !page_evictable check were also relevant for
> this shrink_zone loop. The real life issue is avoid swap storm (or
> expensive loop flooding the whole system of ipis to takedown all young
> bits in all ptes), to allocate an hugepage or jumboframe that has a
> graceful fallback that performs not hugely slower than the
> hugepage/jumboframe.
> 
>      	  sc->nr_scanned++;
> 
>           if (unlikely(!page_evictable(page, NULL)))
> 	      goto cull_mlocked;
> 
> I think making the !nr_scanned check conditional to __GFP_REPEAT as
> the comment suggested, for now is the best way to go.
> 

Ok, here is a patch that should do that. This does *not* replace Johannes'
patch, which I think should still be merged. However, I am unable to test this
at the moment. My laptop and test machines are 200km away and inaccessible
until next Tuesday at the earliest. The machine I'm typing this mail from
is unsuitable for testing with. Are you in a position to test THP with it
applied for me, please?

==== CUT HERE ====
mm: vmscan: Stop reclaim/compaction when reclaim is failing for !__GFP_REPEAT allocations

should_continue_reclaim() for reclaim/compaction allows scanning to continue
even if pages are not being reclaimed until the full list is scanned. In
terms of allocation success, this makes sense but potentially it introduces
unwanted latency for transparent hugepages and network jumbo frames that
would prefer to fail the allocation attempt and fall back to order-0 pages.
Worse, there is a potential that the full LRU scan will clear all the young
bits, distort page aging information and potentially push pages into swap
that would have otherwise remained resident.

This patch will stop reclaim/compaction if no pages were reclaimed in the
last SWAP_CLUSTER_MAX pages that were considered. For allocations such as
hugetlbfs that use GFP_REPEAT and have fewer fallback options, the full LRU
list may still be scanned.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   32 ++++++++++++++++++++++----------
 1 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 148c6e6..591b907 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1841,16 +1841,28 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
 		return false;
 
-	/*
-	 * If we failed to reclaim and have scanned the full list, stop.
-	 * NOTE: Checking just nr_reclaimed would exit reclaim/compaction far
-	 *       faster but obviously would be less likely to succeed
-	 *       allocation. If this is desirable, use GFP_REPEAT to decide
-	 *       if both reclaimed and scanned should be checked or just
-	 *       reclaimed
-	 */
-	if (!nr_reclaimed && !nr_scanned)
-		return false;
+	/* Consider stopping depending on scan and reclaim activity */
+	if (sc->gfp_mask & __GFP_REPEAT) {
+		/*
+		 * For GFP_REPEAT allocations, stop reclaiming if the
+		 * full LRU list has been scanned and we are still failing
+		 * to reclaim pages. This full LRU scan is potentially
+		 * expensive but a GFP_REPEAT caller really wants to succeed
+		 */
+		if (!nr_reclaimed && !nr_scanned)
+			return false;
+	} else {
+		/*
+		 * For non-GFP_REPEAT allocations which can presumably
+		 * fail without consequence, stop if we failed to reclaim
+		 * any pages from the last SWAP_CLUSTER_MAX number of
+		 * pages that were scanned. This will return to the
+		 * caller faster at the risk reclaim/compaction and
+		 * the resulting allocation attempt will fail
+		 */
+		if (!nr_reclaimed)
+			return false;
+	}
 
 	/*
 	 * If we have not reclaimed enough pages for compaction and the

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH] mm: vmscan: Stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT
  2011-02-10 14:58               ` Mel Gorman
@ 2011-02-16  9:50                 ` Mel Gorman
  -1 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2011-02-16  9:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Andrea Arcangeli, Rik van Riel, Michal Hocko,
	Kent Overstreet, linux-mm, linux-kernel

should_continue_reclaim() for reclaim/compaction allows scanning to continue
even if pages are not being reclaimed until the full list is scanned. In
terms of allocation success, this makes sense but potentially it introduces
unwanted latency for high-order allocations such as transparent hugepages
and network jumbo frames that would prefer to fail the allocation attempt
and fallback to order-0 pages.  Worse, there is a potential that the full
LRU scan will clear all the young bits, distort page aging information and
potentially push pages into swap that would have otherwise remained resident.

This patch will stop reclaim/compaction if no pages were reclaimed in the
last SWAP_CLUSTER_MAX pages that were considered. For allocations such as
hugetlbfs that use GFP_REPEAT and have fewer fallback options, the full LRU
list may still be scanned.

To test this, a tool was developed based on ftrace that tracked the latency of
high-order allocations while transparent hugepage support was enabled and three
benchmarks were run. The "fix-infinite" figures are 2.6.38-rc4 with Johannes's
patch "vmscan: fix zone shrinking exit when scan work is done" applied.

STREAM Highorder Allocation Latency Statistics
	       fix-infinite	break-early
1 :: Count            10298           10229
1 :: Min             0.4560          0.4640
1 :: Mean            1.0589          1.0183
1 :: Max            14.5990         11.7510
1 :: Stddev          0.5208          0.4719
2 :: Count                2               1
2 :: Min             1.8610          3.7240
2 :: Mean            3.4325          3.7240
2 :: Max             5.0040          3.7240
2 :: Stddev          1.5715          0.0000
9 :: Count           111696          111694
9 :: Min             0.5230          0.4110
9 :: Mean           10.5831         10.5718
9 :: Max            38.4480         43.2900
9 :: Stddev          1.1147          1.1325

Mean time for order-1 allocations is reduced. order-2 looks increased
but with so few allocations, it's not particularly significant. THP mean
allocation latency is also reduced. That said, allocation time varies so
significantly that the reductions are within noise.

Max allocation time is reduced by a significant amount for low-order
allocations but increased for THP allocations, which presumably are now
breaking before reclaim has done enough work.

SysBench Highorder Allocation Latency Statistics
	       fix-infinite	break-early
1 :: Count            15745           15677
1 :: Min             0.4250          0.4550
1 :: Mean            1.1023          1.0810
1 :: Max            14.4590         10.8220
1 :: Stddev          0.5117          0.5100
2 :: Count                1               1
2 :: Min             3.0040          2.1530
2 :: Mean            3.0040          2.1530
2 :: Max             3.0040          2.1530
2 :: Stddev          0.0000          0.0000
9 :: Count             2017            1931
9 :: Min             0.4980          0.7480
9 :: Mean           10.4717         10.3840
9 :: Max            24.9460         26.2500
9 :: Stddev          1.1726          1.1966

Again, mean time for order-1 allocations is reduced while order-2 allocations
are too few to draw conclusions from. The mean time for THP allocations is
also slightly reduced, albeit the reductions are within the variance.

Once again, our maximum allocation time is significantly reduced for
low-order allocations and slightly increased for THP allocations.

Anon stream mmap reference Highorder Allocation Latency Statistics
	       fix-infinite	break-early
1 :: Count             1376            1790
1 :: Min             0.4940          0.5010
1 :: Mean            1.0289          0.9732
1 :: Max             6.2670          4.2540
1 :: Stddev          0.4142          0.2785
2 :: Count                1               -
2 :: Min             1.9060               -
2 :: Mean            1.9060               -
2 :: Max             1.9060               -
2 :: Stddev          0.0000               -
9 :: Count            11266           11257
9 :: Min             0.4990          0.4940
9 :: Mean        27250.4669      24256.1919
9 :: Max      11439211.0000    6008885.0000
9 :: Stddev     226427.4624     186298.1430

This benchmark creates one thread per CPU which references an amount of
anonymous memory 1.5 times the size of physical RAM. This pounds swap quite
heavily and is intended to exercise THP a bit.
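
(A minimal userspace sketch of this kind of workload, for anyone who
wants to reproduce something similar; the thread-per-CPU and 1.5x RAM
sizing follow the description above, but this is not the actual test
harness that was used:)

/* anon-pound.c: touch ~1.5x RAM of anonymous memory, one thread per CPU */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static size_t per_thread;

static void *pound(void *arg)
{
	char *buf = malloc(per_thread);

	if (!buf)
		return NULL;
	for (int pass = 0; pass < 10; pass++)
		for (size_t off = 0; off < per_thread; off += 4096)
			buf[off] = (char)pass;	/* fault and dirty every page */
	free(buf);
	return NULL;
}

int main(void)
{
	long cpus = sysconf(_SC_NPROCESSORS_ONLN);
	size_t ram = (size_t)sysconf(_SC_PHYS_PAGES) * sysconf(_SC_PAGESIZE);
	pthread_t tid[cpus];

	per_thread = ram * 3 / 2 / cpus;	/* ~1.5x RAM in total */
	for (long i = 0; i < cpus; i++)
		pthread_create(&tid[i], NULL, pound, NULL);
	for (long i = 0; i < cpus; i++)
		pthread_join(tid[i], NULL);
	return 0;
}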

Mean allocation time for order-1 is reduced as before. It's also reduced
for THP allocations but the variations here are pretty massive due to swap.
As before, maximum allocation times are significantly reduced.

Overall, the patch reduces the mean and maximum allocation latencies for
the smaller high-order allocations. This was with SLAB configured, so the
effect would be expected to be more significant with SLUB, which uses these
allocation sizes more aggressively.

The mean allocation times for THP allocations are also slightly reduced.
The maximum latency was slightly increased as predicted by the comments due
to reclaim/compaction breaking early. However, workloads care more about the
latency of lower-order allocations than THP so it's an acceptable trade-off.
Please consider merging for 2.6.38.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   32 ++++++++++++++++++++++----------
 1 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 148c6e6..591b907 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1841,16 +1841,28 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
 		return false;
 
-	/*
-	 * If we failed to reclaim and have scanned the full list, stop.
-	 * NOTE: Checking just nr_reclaimed would exit reclaim/compaction far
-	 *       faster but obviously would be less likely to succeed
-	 *       allocation. If this is desirable, use GFP_REPEAT to decide
-	 *       if both reclaimed and scanned should be checked or just
-	 *       reclaimed
-	 */
-	if (!nr_reclaimed && !nr_scanned)
-		return false;
+	/* Consider stopping depending on scan and reclaim activity */
+	if (sc->gfp_mask & __GFP_REPEAT) {
+		/*
+		 * For GFP_REPEAT allocations, stop reclaiming if the
+		 * full LRU list has been scanned and we are still failing
+		 * to reclaim pages. This full LRU scan is potentially
+		 * expensive but a GFP_REPEAT caller really wants to succeed
+		 */
+		if (!nr_reclaimed && !nr_scanned)
+			return false;
+	} else {
+		/*
+		 * For non-GFP_REPEAT allocations which can presumably
+		 * fail without consequence, stop if we failed to reclaim
+		 * any pages from the last SWAP_CLUSTER_MAX number of
+		 * pages that were scanned. This will return to the
+		 * caller faster at the risk reclaim/compaction and
+		 * the resulting allocation attempt fails
+		 */
+		if (!nr_reclaimed)
+			return false;
+	}
 
 	/*
 	 * If we have not reclaimed enough pages for compaction and the


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH] mm: vmscan: Stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT
@ 2011-02-16  9:50                 ` Mel Gorman
  0 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2011-02-16  9:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Andrea Arcangeli, Rik van Riel, Michal Hocko,
	Kent Overstreet, linux-mm, linux-kernel

should_continue_reclaim() for reclaim/compaction allows scanning to continue
even if pages are not being reclaimed until the full list is scanned. In
terms of allocation success, this makes sense but potentially it introduces
unwanted latency for high-order allocations such as transparent hugepages
and network jumbo frames that would prefer to fail the allocation attempt
and fallback to order-0 pages.  Worse, there is a potential that the full
LRU scan will clear all the young bits, distort page aging information and
potentially push pages into swap that would have otherwise remained resident.

This patch will stop reclaim/compaction if no pages were reclaimed in the
last SWAP_CLUSTER_MAX pages that were considered. For allocations such as
hugetlbfs that use GFP_REPEAT and have fewer fallback options, the full LRU
list may still be scanned.

To test this, a tool was developed based on ftrace that tracked the latency of
high-order allocations while transparent hugepage support was enabled and three
benchmarks were run. The "fix-infinite" figures are 2.6.38-rc4 with Johannes's
patch "vmscan: fix zone shrinking exit when scan work is done" applied.

STREAM Highorder Allocation Latency Statistics
	       fix-infinite	break-early
1 :: Count            10298           10229
1 :: Min             0.4560          0.4640
1 :: Mean            1.0589          1.0183
1 :: Max            14.5990         11.7510
1 :: Stddev          0.5208          0.4719
2 :: Count                2               1
2 :: Min             1.8610          3.7240
2 :: Mean            3.4325          3.7240
2 :: Max             5.0040          3.7240
2 :: Stddev          1.5715          0.0000
9 :: Count           111696          111694
9 :: Min             0.5230          0.4110
9 :: Mean           10.5831         10.5718
9 :: Max            38.4480         43.2900
9 :: Stddev          1.1147          1.1325

Mean time for order-1 allocations is reduced. order-2 looks increased
but with so few allocations, it's not particularly significant. THP mean
allocation latency is also reduced. That said, allocation time varies so
significantly that the reductions are within noise.

Max allocation time is reduced by a significant amount for low-order
allocations but reduced for THP allocations which presumably are now
breaking before reclaim has done enough work.

SysBench Highorder Allocation Latency Statistics
	       fix-infinite	break-early
1 :: Count            15745           15677
1 :: Min             0.4250          0.4550
1 :: Mean            1.1023          1.0810
1 :: Max            14.4590         10.8220
1 :: Stddev          0.5117          0.5100
2 :: Count                1               1
2 :: Min             3.0040          2.1530
2 :: Mean            3.0040          2.1530
2 :: Max             3.0040          2.1530
2 :: Stddev          0.0000          0.0000
9 :: Count             2017            1931
9 :: Min             0.4980          0.7480
9 :: Mean           10.4717         10.3840
9 :: Max            24.9460         26.2500
9 :: Stddev          1.1726          1.1966

Again, mean time for order-1 allocations is reduced while order-2 allocations
are too few to draw conclusions from. The mean time for THP allocations is
also slightly reduced albeit the reductions are within varianes.

Once again, our maximum allocation time is significantly reduced for
low-order allocations and slightly increased for THP allocations.

Anon stream mmap reference Highorder Allocation Latency Statistics
1 :: Count             1376            1790
1 :: Min             0.4940          0.5010
1 :: Mean            1.0289          0.9732
1 :: Max             6.2670          4.2540
1 :: Stddev          0.4142          0.2785
2 :: Count                1               -
2 :: Min             1.9060               -
2 :: Mean            1.9060               -
2 :: Max             1.9060               -
2 :: Stddev          0.0000               -
9 :: Count            11266           11257
9 :: Min             0.4990          0.4940
9 :: Mean        27250.4669      24256.1919
9 :: Max      11439211.0000    6008885.0000
9 :: Stddev     226427.4624     186298.1430

This benchmark creates one thread per CPU which references an amount of
anonymous memory 1.5 times the size of physical RAM. This pounds swap quite
heavily and is intended to exercise THP a bit.

Mean allocation time for order-1 is reduced as before. It's also reduced
for THP allocations but the variations here are pretty massive due to swap.
As before, maximum allocation times are significantly reduced.

Overall, the patch reduces the mean and maximum allocation latencies for
the smaller high-order allocations. This was with Slab configured so it
would be expected to be more significant with Slub which uses these size
allocations more aggressively.

The mean allocation times for THP allocations are also slightly reduced.
The maximum latency was slightly increased as predicted by the comments due
to reclaim/compaction breaking early. However, workloads care more about the
latency of lower-order allocations than THP so it's an acceptable trade-off.
Please consider merging for 2.6.38.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   32 ++++++++++++++++++++++----------
 1 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 148c6e6..591b907 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1841,16 +1841,28 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
 		return false;
 
-	/*
-	 * If we failed to reclaim and have scanned the full list, stop.
-	 * NOTE: Checking just nr_reclaimed would exit reclaim/compaction far
-	 *       faster but obviously would be less likely to succeed
-	 *       allocation. If this is desirable, use GFP_REPEAT to decide
-	 *       if both reclaimed and scanned should be checked or just
-	 *       reclaimed
-	 */
-	if (!nr_reclaimed && !nr_scanned)
-		return false;
+	/* Consider stopping depending on scan and reclaim activity */
+	if (sc->gfp_mask & __GFP_REPEAT) {
+		/*
+		 * For GFP_REPEAT allocations, stop reclaiming if the
+		 * full LRU list has been scanned and we are still failing
+		 * to reclaim pages. This full LRU scan is potentially
+		 * expensive but a GFP_REPEAT caller really wants to succeed
+		 */
+		if (!nr_reclaimed && !nr_scanned)
+			return false;
+	} else {
+		/*
+		 * For non-GFP_REPEAT allocations which can presumably
+		 * fail without consequence, stop if we failed to reclaim
+		 * any pages from the last SWAP_CLUSTER_MAX number of
+		 * pages that were scanned. This will return to the
+		 * caller faster at the risk that reclaim/compaction and
+		 * the resulting allocation attempt will fail
+		 */
+		if (!nr_reclaimed)
+			return false;
+	}
 
 	/*
 	 * If we have not reclaimed enough pages for compaction and the


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH] mm: vmscan: Stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT
  2011-02-16  9:50                 ` Mel Gorman
@ 2011-02-16 10:13                   ` Andrea Arcangeli
  -1 siblings, 0 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2011-02-16 10:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Michal Hocko,
	Kent Overstreet, linux-mm, linux-kernel

On Wed, Feb 16, 2011 at 09:50:49AM +0000, Mel Gorman wrote:
> The mean allocation times for THP allocations are also slightly reduced.
> The maximum latency was slightly increased as predicted by the comments due
> to reclaim/compaction breaking early. However, workloads care more about the
> latency of lower-order allocations than THP so it's an acceptable trade-off.
> Please consider merging for 2.6.38.

Full agreement. I'm currently dealing with latency issues (nothing
major! but still not nice to see a reproducible regression, even if a
small one only visible in the benchmark) with compaction and jumbo
frames. This won't be enough to close them completely though because I
didn't backport the change to vmscan.c and should_continue_reclaim (I
backported all the other compaction improvements though, so this
practically is the only missing bit). I also suspected the e1000
driver, that sets the NAPI latency to bulk_latency when it uses jumbo
frames, so I wonder if it could be that with compaction we get more
jumbo frames and the latency then gets reduced by the driver as side
effect. Not sure yet.

I like the above because it's less likely to give us compaction issues
with jumbo frames when I add should_continue_reclaim on top. It seems
anonymous memory allocations are orders of magnitude longer lived
than jumbo frames could ever be, so at this point I'm not even
entirely certain it's ok to enable compaction at all for jumbo
frames. But I still like the above regardless of my current issue
(just because of the young bits being nuked in one go the lumpy hammer
way, even if it actually increases latency a bit for THP allocations).

One issue with compaction for jumbo frames is the potentially very
long loop for the scan in isolate_migratepages. I added a counter to
break the loop after 1024 pages scanned. This is extreme but this is a
debug patch for now; I also did if (retval == bulk_latency) retval =
low_latency in the e1000* drivers to see if it makes a difference. If
either of the two helps I will track down how much each change
contributes to lowering the network latency to pre-compaction
levels. It may very well be only a compaction issue, or only a driver
issue, I don't know yet (the latter is less likely because this very
compaction loop spikes at the top of the oprofile output, but maybe
that only affects throughput and the driver is to blame for the
latency reduction... this is what I'm going to find out pretty
soon). Also this isolate_migratepages loop I think needs a
cond_resched() (I didn't add that yet ;). 1024 pages scanned is too
few, I just want to see how it behaves with an extremely permissive
setting. I'll let you know when I come to some more reliable
conclusion.
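
Roughly, the sort of debug change described above, sketched against the
pfn scanning loop in isolate_migratepages() (an illustrative fragment
only, not the actual debug patch; the 1024 cutoff is the experimental
value mentioned and the loop body is reduced to the skip logic):

	/* inside isolate_migratepages(), before the pfn scan */
	unsigned long nr_dbg_scanned = 0;

	for (; low_pfn < end_pfn; low_pfn++) {
		struct page *page;

		/* debug cutoff: stop scanning after 1024 pfns */
		if (++nr_dbg_scanned > 1024)
			break;

		/* a long scan should yield the CPU now and then */
		cond_resched();

		if (!pfn_valid_within(low_pfn))
			continue;
		page = pfn_to_page(low_pfn);
		if (PageBuddy(page))
			continue;

		/* ... LRU isolation onto cc->migratepages as before ... */
	}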

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] mm: vmscan: Stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT
  2011-02-16 10:13                   ` Andrea Arcangeli
@ 2011-02-16 11:22                     ` Mel Gorman
  -1 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2011-02-16 11:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Michal Hocko,
	Kent Overstreet, linux-mm, linux-kernel

On Wed, Feb 16, 2011 at 11:13:01AM +0100, Andrea Arcangeli wrote:
> On Wed, Feb 16, 2011 at 09:50:49AM +0000, Mel Gorman wrote:
> > The mean allocation times for THP allocations are also slightly reduced.
> > The maximum latency was slightly increased as predicted by the comments due
> > to reclaim/compaction breaking early. However, workloads care more about the
> > latency of lower-order allocations than THP so it's an acceptable trade-off.
> > Please consider merging for 2.6.38.
> 
> Full agreement. I'm currently dealing with latency issues (nothing
> major! but still not nice to see a reproducible regression, even if a
> small one only visible in the benchmark) with compaction and jumbo
> frames.

Out of curiosity, what are you measuring the latency of and how? I used
a combination of the function_graph ftrace analyser and the mm_page_alloc
tracepoint myself to avoid any additional patching and it was easier than
cobbling together something with kprobes. A perl script configures ftrace
and then parses the contents of trace_pipe - crude but it does the job
without patching the kernel.
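
For reference, a minimal sketch of that kind of setup (assumptions: the
standard debugfs tracing files under /sys/kernel/debug/tracing and
__alloc_pages_nodemask as the traced allocator entry point; the original
tool was a perl script, and the latency statistics would be computed by
a separate post-processor):

/* trace-alloc.c: enable function_graph on the allocator plus the
 * mm_page_alloc tracepoint, then stream trace_pipe to stdout.
 * Run as root with debugfs mounted.
 */
#include <stdio.h>
#include <stdlib.h>

#define TRACING "/sys/kernel/debug/tracing/"

static void trace_write(const char *file, const char *val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), TRACING "%s", file);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		exit(1);
	}
	fprintf(f, "%s\n", val);
	fclose(f);
}

int main(void)
{
	char line[4096];
	FILE *pipe;

	trace_write("current_tracer", "function_graph");
	trace_write("set_graph_function", "__alloc_pages_nodemask");
	trace_write("events/kmem/mm_page_alloc/enable", "1");
	trace_write("tracing_on", "1");

	/* Crude: dump the trace; post-process it for per-order latencies. */
	pipe = fopen(TRACING "trace_pipe", "r");
	if (!pipe) {
		perror("trace_pipe");
		return 1;
	}
	while (fgets(line, sizeof(line), pipe))
		fputs(line, stdout);
	return 0;
}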

> This won't be enough to close them completely though because I
> didn't backport the change to vmscan.c and should_continue_reclaim (I
> backported all the other compaction improvements though, so this
> practically is the only missing bit).

How big are the discrepancies?

> I also suspected the e1000
> driver, that sets the NAPI latency to bulk_latency when it uses jumbo
> frames, so I wonder if it could be that with compaction we get more
> jumbo frames and the latency then gets reduced by the driver as side
> effect. Not sure yet.
> 

No idea.

> I like the above because it's less likely to give us compaction issues
> with jumbo frames when I add should_continue_reclaim on top. It seems
> anonymous memory allocation are orders of magnitude more long lived
> than jumbo frames could ever be, so at this point I'm not even
> entirely certain it's ok to enable compaction at all for jumbo
> frames. But I still like the above regardless of my current issue
> (just because of the young bits going nuked in one go the lumpy hammer
> way, even if it actually increases latency a bit for THP allocations).
> 

Can I have your ack on the patch then? Even if it doesn't resolve the
jumbo frame problems, it's in the right direction. Measuring how it
currently behaves and what direction should be taken may be something
still worth discussing at LSF/MM.

> One issue with compaction for jumbo frames, is the potentially very
> long loop, for the scan in isolated_migratepages.

Yes, the scanner is poor. The scanner for free pages is potentially just
as bad. I prototyped some designs that should have been faster but they
didn't make any significant difference so they got discarded.

> I added a counter to
> break the loop after 1024 pages scanned. This is extreme but this is a
> debug patch for now, I also did if (retval == bulk_latency) reval =
> low_latency in the e1000* drivers to see if it makes a difference. If
> any of the two will help I will track down how much each change
> contributes to lowering the network latency to pre-compaction
> levels. It may very well be only a compaction issue, or only a driver
> issue, I don't know yet (the latter less likely because this very
> compaction loop spikes at the top of oprofile output, but maybe that
> only affects throughput and the driver is to blame for the latency
> reduction... this is what I'm going to find pretty soon). Also this
> isolate_migratepages loop I think needs a cond_resched()

This surprises me. In my own tests at least, the compaction stuff was
way down in the profile and I wouldn't have expected scanning to take so
long as to require a cond_resched. I was depending on the cond_resched()
in migrate_pages() to yield the processor as necessary.

> (I didn't add
> that yet ;). 1024 pages scanned is too few, I just want to see how it
> behaves with an extremely permissive setting. I'll let you know when I
> come to some more reliable conclusion.
> 

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] mm: vmscan: Stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT
  2011-02-16  9:50                 ` Mel Gorman
@ 2011-02-16 12:03                   ` Andrea Arcangeli
  -1 siblings, 0 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2011-02-16 12:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Michal Hocko,
	Kent Overstreet, linux-mm, linux-kernel

On Wed, Feb 16, 2011 at 09:50:49AM +0000, Mel Gorman wrote:
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/vmscan.c |   32 ++++++++++++++++++++++----------
>  1 files changed, 22 insertions(+), 10 deletions(-)

Acked-by: Andrea Arcangeli <aarcange@redhat.com>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] mm: vmscan: Stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT
  2011-02-16  9:50                 ` Mel Gorman
@ 2011-02-16 12:14                   ` Rik van Riel
  -1 siblings, 0 replies; 44+ messages in thread
From: Rik van Riel @ 2011-02-16 12:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Andrea Arcangeli, Michal Hocko,
	Kent Overstreet, linux-mm, linux-kernel

On 02/16/2011 04:50 AM, Mel Gorman wrote:

> This patch will stop reclaim/compaction if no pages were reclaimed in the
> last SWAP_CLUSTER_MAX pages that were considered. For allocations such as
> hugetlbfs that use GFP_REPEAT and have fewer fallback options, the full LRU
> list may still be scanned.

> Signed-off-by: Mel Gorman<mel@csn.ul.ie>

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] mm: vmscan: Stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT
  2011-02-16  9:50                 ` Mel Gorman
@ 2011-02-16 12:38                   ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2011-02-16 12:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Rik van Riel, Michal Hocko,
	Kent Overstreet, linux-mm, linux-kernel

On Wed, Feb 16, 2011 at 09:50:49AM +0000, Mel Gorman wrote:
> should_continue_reclaim() for reclaim/compaction allows scanning to continue
> even if pages are not being reclaimed until the full list is scanned. In
> terms of allocation success, this makes sense but potentially it introduces
> unwanted latency for high-order allocations such as transparent hugepages
> and network jumbo frames that would prefer to fail the allocation attempt
> and fallback to order-0 pages.  Worse, there is a potential that the full
> LRU scan will clear all the young bits, distort page aging information and
> potentially push pages into swap that would have otherwise remained resident.
> 
> This patch will stop reclaim/compaction if no pages were reclaimed in the
> last SWAP_CLUSTER_MAX pages that were considered. For allocations such as
> hugetlbfs that use GFP_REPEAT and have fewer fallback options, the full LRU
> list may still be scanned.
> 
> To test this, a tool was developed based on ftrace that tracked the latency of
> high-order allocations while transparent hugepage support was enabled and three
> benchmarks were run. The "fix-infinite" figures are 2.6.38-rc4 with Johannes's
> patch "vmscan: fix zone shrinking exit when scan work is done" applied.
> 
> STREAM Highorder Allocation Latency Statistics
> 	       fix-infinite	break-early
> 1 :: Count            10298           10229
> 1 :: Min             0.4560          0.4640
> 1 :: Mean            1.0589          1.0183
> 1 :: Max            14.5990         11.7510
> 1 :: Stddev          0.5208          0.4719
> 2 :: Count                2               1
> 2 :: Min             1.8610          3.7240
> 2 :: Mean            3.4325          3.7240
> 2 :: Max             5.0040          3.7240
> 2 :: Stddev          1.5715          0.0000
> 9 :: Count           111696          111694
> 9 :: Min             0.5230          0.4110
> 9 :: Mean           10.5831         10.5718
> 9 :: Max            38.4480         43.2900
> 9 :: Stddev          1.1147          1.1325
> 
> Mean time for order-1 allocations is reduced. order-2 looks increased
> but with so few allocations, it's not particularly significant. THP mean
> allocation latency is also reduced. That said, allocation time varies so
> significantly that the reductions are within noise.
> 
> Max allocation time is reduced by a significant amount for low-order
> allocations but reduced for THP allocations which presumably are now
> breaking before reclaim has done enough work.
> 
> SysBench Highorder Allocation Latency Statistics
> 	       fix-infinite	break-early
> 1 :: Count            15745           15677
> 1 :: Min             0.4250          0.4550
> 1 :: Mean            1.1023          1.0810
> 1 :: Max            14.4590         10.8220
> 1 :: Stddev          0.5117          0.5100
> 2 :: Count                1               1
> 2 :: Min             3.0040          2.1530
> 2 :: Mean            3.0040          2.1530
> 2 :: Max             3.0040          2.1530
> 2 :: Stddev          0.0000          0.0000
> 9 :: Count             2017            1931
> 9 :: Min             0.4980          0.7480
> 9 :: Mean           10.4717         10.3840
> 9 :: Max            24.9460         26.2500
> 9 :: Stddev          1.1726          1.1966
> 
> Again, mean time for order-1 allocations is reduced while order-2 allocations
> are too few to draw conclusions from. The mean time for THP allocations is
> also slightly reduced albeit the reductions are within varianes.
> 
> Once again, our maximum allocation time is significantly reduced for
> low-order allocations and slightly increased for THP allocations.
> 
> Anon stream mmap reference Highorder Allocation Latency Statistics
> 1 :: Count             1376            1790
> 1 :: Min             0.4940          0.5010
> 1 :: Mean            1.0289          0.9732
> 1 :: Max             6.2670          4.2540
> 1 :: Stddev          0.4142          0.2785
> 2 :: Count                1               -
> 2 :: Min             1.9060               -
> 2 :: Mean            1.9060               -
> 2 :: Max             1.9060               -
> 2 :: Stddev          0.0000               -
> 9 :: Count            11266           11257
> 9 :: Min             0.4990          0.4940
> 9 :: Mean        27250.4669      24256.1919
> 9 :: Max      11439211.0000    6008885.0000
> 9 :: Stddev     226427.4624     186298.1430
> 
> This benchmark creates one thread per CPU which references an amount of
> anonymous memory 1.5 times the size of physical RAM. This pounds swap quite
> heavily and is intended to exercise THP a bit.
> 
> Mean allocation time for order-1 is reduced as before. It's also reduced
> for THP allocations but the variations here are pretty massive due to swap.
> As before, maximum allocation times are significantly reduced.
> 
> Overall, the patch reduces the mean and maximum allocation latencies for
> the smaller high-order allocations. This was with Slab configured so it
> would be expected to be more significant with Slub which uses these size
> allocations more aggressively.
> 
> The mean allocation times for THP allocations are also slightly reduced.
> The maximum latency was slightly increased as predicted by the comments due
> to reclaim/compaction breaking early. However, workloads care more about the
> latency of lower-order allocations than THP so it's an acceptable trade-off.
> Please consider merging for 2.6.38.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] mm: vmscan: Stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT
  2011-02-16 11:22                     ` Mel Gorman
@ 2011-02-16 14:44                       ` Andrea Arcangeli
  -1 siblings, 0 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2011-02-16 14:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Michal Hocko,
	Kent Overstreet, linux-mm, linux-kernel

On Wed, Feb 16, 2011 at 11:22:32AM +0000, Mel Gorman wrote:
> Out of curiousity, what are you measuring the latency of and how? I used
> a combination of the function_graph ftrace analyser and the mm_page_alloc
> tracepoint myself to avoid any additional patching and it was easier than
> cobbling together something with kprobes. A perl script configures ftrace
> and then parses the contents of trace_pipe - crude but does the job without
> patching the kernel.

It's some complex benchmark that measures the latency from userland;
I think latency is measured from the clients (not the server running
compaction).

> How big are the discrepancies?

Latency in msec/op goes up from 1.1 to 5.4 starting from half the peak
load. But then latency stays flat with compaction, eventually the peak
load latency is similar. It just goes immediately from 1.1 to 5.4 in
the middle and it's slightly higher even for the light load runs.

> No idea.

I guess it's very hard to tell unless we try. I just nuked the
bulk_latency for the jumbo frames and forced the driver to always
stay in low_latency mode (in NAPI ->poll method of the driver), just
in case it's not compaction to blame but a side effect of compaction
providing jumbo frames much more frequently to the driver.

> Can I have your ack on the patch then? Even if it doesn't resolve the

Sure, I acked it explicitly in separate email ;).

> jumbo frame problems, it's in the right direction. Measuring how it
> currently behaves and what direction should be taken may be something
> still worth discussing at LSF/MM.

Agreed!

> > One issue with compaction for jumbo frames, is the potentially very
> > long loop, for the scan in isolated_migratepages.
> 
> Yes, the scanner is poor. The scanner for free pages is potentially just
> as bad. I prototyped some designs that should have been faster but they
> didn't make any significant difference so they got discarded.

But the scanner for free pages has a nr_scanned countdown and breaks
the loop way sooner. Also most of the higher-order allocations must
have a fallback, so scanning everything so that order 0 succeeds is
much more obviously safe than scanning everything to provide an order
2 allocation, if the order 0 allocation could be provided immediately
without scanning anything. It's not a trivial problem when we deal
with short lived allocations. Also the throughput is equal or a little
higher (not necessarily related to compaction though); the latency is
the real measurable regression.

> This surprises me. In my own tests at least, the compaction stuff was
> way down in the profile and I wouldn't have expected scanning to take so
> long as to require a cond_resched. I was depending on the cond_resched()
> in migrate_pages() to yield the processor as necessary.

If migrate_pages runs often we likely won't need to scan too many
pages in the first place. I think cond_resched() is a good idea in
that loop considering the current possible worst case.

This is the profiling. This is with basically the 2.6.37 compaction
code, so compaction is only enabled for THP-sized allocations, not for
order <= PAGE_ALLOC_COSTLY_ORDER and not for kswapd.
Samples          % of Total       Cum. Samples     Cum. % of Total   module:function
-------------------------------------------------------------------------------------------------
177786           6.178            177786           6.178             sunrpc:svc_recv
128779           4.475            306565           10.654            sunrpc:svc_xprt_enqueue
80786            2.807            387351           13.462            vmlinux:__d_lookup
62272            2.164            449623           15.626            ext4:ext4_htree_store_dirent
55896            1.942            505519           17.569            jbd2:journal_clean_one_cp_list
43868            1.524            549387           19.093            vmlinux:task_rq_lock
43572            1.514            592959           20.608            vmlinux:kfree
37620            1.307            630579           21.915            vmlinux:mwait_idle
36169            1.257            666748           23.172            vmlinux:schedule
34037            1.182            700785           24.355            e1000:e1000_clean
31945            1.110            732730           25.465            vmlinux:find_busiest_group
31491            1.094            764221           26.560            qla2xxx:qla24xx_intr_handler
30681            1.066            794902           27.626            vmlinux:_atomic_dec_and_lock 
7425             0.258            xxxxxx           xxxxxx            vmlinux:get_page_from_freelist

This is with the 2.6.38 compaction code enabled for all !order in both
direct compaction and kswapd (it includes async compaction/migrate and
the preferred pageblock selection in !cc->sync mode). Basically the
only thing it doesn't include is the should_continue_reclaim loop, as
that could only potentially increase the latency even further, so I
skipped it for now (I'll add it later with your new __GFP_REPEAT
patch).

Samples          % of Total       Cum. Samples     Cum. % of Total   module:function
-------------------------------------------------------------------------------------------------
1182928          17.358           1182928          17.358            vmlinux:get_page_from_freelist
657802           9.652            1840730          27.011            vmlinux:free_pcppages_bulk
579976           8.510            2420706          35.522            sunrpc:svc_xprt_enqueue
508953           7.468            2929659          42.991            sunrpc:svc_recv
490538           7.198            3420197          50.189            vmlinux:compaction_alloc
188620           2.767            3608817          52.957            vmlinux:tg_shares_up
97527            1.431            3706344          54.388            vmlinux:__d_lookup
85670            1.257            3792014          55.646            jbd2:journal_clean_one_cp_list
71738            1.052            3863752          56.698            vmlinux:mutex_spin_on_owner
71037            1.042            3934789          57.741            vmlinux:kfree

Basically it was my patch that enabled compaction for all allocation
orders and in kswapd as well that started this, but I think I only
exposed the problem: if the jumbo frames were order 4 instead of
order 1/2/3, it'd happen regardless of my patch. Later I'm also
going to check if it's the kswapd invocation that causes the problem
(so trying with only direct compaction) but I doubt it'll help.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] mm: vmscan: Stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT
  2011-02-16  9:50                 ` Mel Gorman
@ 2011-02-16 23:26                   ` Minchan Kim
  -1 siblings, 0 replies; 44+ messages in thread
From: Minchan Kim @ 2011-02-16 23:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Andrea Arcangeli, Rik van Riel,
	Michal Hocko, Kent Overstreet, linux-mm, linux-kernel

On Wed, Feb 16, 2011 at 6:50 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> should_continue_reclaim() for reclaim/compaction allows scanning to continue
> even if pages are not being reclaimed until the full list is scanned. In
> terms of allocation success, this makes sense but potentially it introduces
> unwanted latency for high-order allocations such as transparent hugepages
> and network jumbo frames that would prefer to fail the allocation attempt
> and fallback to order-0 pages.  Worse, there is a potential that the full
> LRU scan will clear all the young bits, distort page aging information and
> potentially push pages into swap that would have otherwise remained resident.
>
> This patch will stop reclaim/compaction if no pages were reclaimed in the
> last SWAP_CLUSTER_MAX pages that were considered. For allocations such as
> hugetlbfs that use GFP_REPEAT and have fewer fallback options, the full LRU
> list may still be scanned.
>
> To test this, a tool was developed based on ftrace that tracked the latency of
> high-order allocations while transparent hugepage support was enabled and three
> benchmarks were run. The "fix-infinite" figures are 2.6.38-rc4 with Johannes's
> patch "vmscan: fix zone shrinking exit when scan work is done" applied.
>
> STREAM Highorder Allocation Latency Statistics
>               fix-infinite     break-early
> 1 :: Count            10298           10229
> 1 :: Min             0.4560          0.4640
> 1 :: Mean            1.0589          1.0183
> 1 :: Max            14.5990         11.7510
> 1 :: Stddev          0.5208          0.4719
> 2 :: Count                2               1
> 2 :: Min             1.8610          3.7240
> 2 :: Mean            3.4325          3.7240
> 2 :: Max             5.0040          3.7240
> 2 :: Stddev          1.5715          0.0000
> 9 :: Count           111696          111694
> 9 :: Min             0.5230          0.4110
> 9 :: Mean           10.5831         10.5718
> 9 :: Max            38.4480         43.2900
> 9 :: Stddev          1.1147          1.1325
>
> Mean time for order-1 allocations is reduced. order-2 looks increased
> but with so few allocations, it's not particularly significant. THP mean
> allocation latency is also reduced. That said, allocation time varies so
> significantly that the reductions are within noise.
>
> Max allocation time is reduced by a significant amount for low-order
> allocations but reduced for THP allocations which presumably are now
> breaking before reclaim has done enough work.
>
> SysBench Highorder Allocation Latency Statistics
>               fix-infinite     break-early
> 1 :: Count            15745           15677
> 1 :: Min             0.4250          0.4550
> 1 :: Mean            1.1023          1.0810
> 1 :: Max            14.4590         10.8220
> 1 :: Stddev          0.5117          0.5100
> 2 :: Count                1               1
> 2 :: Min             3.0040          2.1530
> 2 :: Mean            3.0040          2.1530
> 2 :: Max             3.0040          2.1530
> 2 :: Stddev          0.0000          0.0000
> 9 :: Count             2017            1931
> 9 :: Min             0.4980          0.7480
> 9 :: Mean           10.4717         10.3840
> 9 :: Max            24.9460         26.2500
> 9 :: Stddev          1.1726          1.1966
>
> Again, mean time for order-1 allocations is reduced while order-2 allocations
> are too few to draw conclusions from. The mean time for THP allocations is
> also slightly reduced albeit the reductions are within varianes.
>
> Once again, our maximum allocation time is significantly reduced for
> low-order allocations and slightly increased for THP allocations.
>
> Anon stream mmap reference Highorder Allocation Latency Statistics
> 1 :: Count             1376            1790
> 1 :: Min             0.4940          0.5010
> 1 :: Mean            1.0289          0.9732
> 1 :: Max             6.2670          4.2540
> 1 :: Stddev          0.4142          0.2785
> 2 :: Count                1               -
> 2 :: Min             1.9060               -
> 2 :: Mean            1.9060               -
> 2 :: Max             1.9060               -
> 2 :: Stddev          0.0000               -
> 9 :: Count            11266           11257
> 9 :: Min             0.4990          0.4940
> 9 :: Mean        27250.4669      24256.1919
> 9 :: Max      11439211.0000    6008885.0000
> 9 :: Stddev     226427.4624     186298.1430
>
> This benchmark creates one thread per CPU which references an amount of
> anonymous memory 1.5 times the size of physical RAM. This pounds swap quite
> heavily and is intended to exercise THP a bit.
>
> Mean allocation time for order-1 is reduced as before. It's also reduced
> for THP allocations but the variations here are pretty massive due to swap.
> As before, maximum allocation times are significantly reduced.
>
> Overall, the patch reduces the mean and maximum allocation latencies for
> the smaller high-order allocations. This was with Slab configured so it
> would be expected to be more significant with Slub which uses these size
> allocations more aggressively.
>
> The mean allocation times for THP allocations are also slightly reduced.
> The maximum latency was slightly increased as predicted by the comments due
> to reclaim/compaction breaking early. However, workloads care more about the
> latency of lower-order allocations than THP so it's an acceptable trade-off.
> Please consider merging for 2.6.38.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

> ---
>  mm/vmscan.c |   32 ++++++++++++++++++++++----------
>  1 files changed, 22 insertions(+), 10 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 148c6e6..591b907 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1841,16 +1841,28 @@ static inline bool should_continue_reclaim(struct zone *zone,
>        if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
>                return false;
>
> -       /*
> -        * If we failed to reclaim and have scanned the full list, stop.
> -        * NOTE: Checking just nr_reclaimed would exit reclaim/compaction far
> -        *       faster but obviously would be less likely to succeed
> -        *       allocation. If this is desirable, use GFP_REPEAT to decide

Typo. __GFP_REPEAT

Otherwise, looks good to me.
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>






-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] mm: vmscan: Stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT
  2011-02-16  9:50                 ` Mel Gorman
@ 2011-02-17 22:22                   ` Andrew Morton
  -1 siblings, 0 replies; 44+ messages in thread
From: Andrew Morton @ 2011-02-17 22:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Andrea Arcangeli, Rik van Riel, Michal Hocko,
	Kent Overstreet, linux-mm, linux-kernel

On Wed, 16 Feb 2011 09:50:49 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> should_continue_reclaim() for reclaim/compaction allows scanning to continue
> even if pages are not being reclaimed until the full list is scanned. In
> terms of allocation success, this makes sense but potentially it introduces
> unwanted latency for high-order allocations such as transparent hugepages
> and network jumbo frames that would prefer to fail the allocation attempt
> and fallback to order-0 pages.  Worse, there is a potential that the full
> LRU scan will clear all the young bits, distort page aging information and
> potentially push pages into swap that would have otherwise remained resident.

afaict the patch affects order-0 allocations as well.  What are the
implications of this?

Also, what might be the downsides of this change, and did you test for
them?

> This patch will stop reclaim/compaction if no pages were reclaimed in the
> last SWAP_CLUSTER_MAX pages that were considered.

a) Why SWAP_CLUSTER_MAX?  Is (SWAP_CLUSTER_MAX+7) better or worse?

b) The sentence doesn't seem even vaguely accurate.  shrink_zone()
   will scan vastly more than SWAP_CLUSTER_MAX pages before calling
   should_continue_reclaim().  Confused.

c) The patch doesn't "stop reclaim/compaction" fully.  It stops it
   against one zone.  reclaim will then advance on to any other
   eligible zones.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] mm: vmscan: Stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT
  2011-02-17 22:22                   ` Andrew Morton
@ 2011-02-18 12:22                     ` Mel Gorman
  -1 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2011-02-18 12:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Andrea Arcangeli, Rik van Riel, Michal Hocko,
	Kent Overstreet, linux-mm, linux-kernel

On Thu, Feb 17, 2011 at 02:22:09PM -0800, Andrew Morton wrote:
> On Wed, 16 Feb 2011 09:50:49 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > should_continue_reclaim() for reclaim/compaction allows scanning to continue
> > even if pages are not being reclaimed until the full list is scanned. In
> > terms of allocation success, this makes sense but potentially it introduces
> > unwanted latency for high-order allocations such as transparent hugepages
> > and network jumbo frames that would prefer to fail the allocation attempt
> > and fallback to order-0 pages.  Worse, there is a potential that the full
> > LRU scan will clear all the young bits, distort page aging information and
> > potentially push pages into swap that would have otherwise remained resident.
> 
> afaict the patch affects order-0 allocations as well.  What are the
> implications of this?
> 

Order-0 allocations should not be affected because RECLAIM_MODE_COMPACTION
is not set for them, so the following check returns before gfp_mask is
examined;

        if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
                return false;
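
The mode selection happens in set_reclaim_mode(); for reference, a rough
sketch of that function as it stands (written from memory, so treat the
details as illustrative rather than the exact 2.6.38 source);

static void set_reclaim_mode(int priority, struct scan_control *sc,
				bool sync)
{
	reclaim_mode_t syncmode = sync ? RECLAIM_MODE_SYNC : RECLAIM_MODE_ASYNC;

	/* Start by assuming lumpy reclaim or reclaim/compaction */
	if (COMPACTION_BUILD)
		sc->reclaim_mode = RECLAIM_MODE_COMPACTION;
	else
		sc->reclaim_mode = RECLAIM_MODE_LUMPYRECLAIM;

	/* Keep that mode only for costly orders or under pressure ... */
	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
		sc->reclaim_mode |= syncmode;
	else if (sc->order && priority < DEF_PRIORITY - 2)
		sc->reclaim_mode |= syncmode;
	else
		/* ... so order-0 falls back to single-page async reclaim */
		sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;
}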

> Also, what might be the downsides of this change, and did you test for
> them?
> 

The main downside that I predict is that the worst-case latencies for
successful transparent hugepage allocations will be increased as there will
be more looping in do_try_to_free_pages() at higher priorities. I would also
not be surprised if there were fewer successful allocations.

Latencies did seem to be worse for order-9 allocations in testing but it was
offset by lower latencies for lower orders and seemed an acceptable trade-off.

Other major consequences did not spring to mind.

> > This patch will stop reclaim/compaction if no pages were reclaimed in the
> > last SWAP_CLUSTER_MAX pages that were considered.
> 
> a) Why SWAP_CLUSTER_MAX?  Is (SWAP_CLUSTER_MAX+7) better or worse?
> 

SWAP_CLUSTER_MAX is the standard "unit of reclaim" and that's what I had
in mind when writing the comment but it's wrong and misleading. More on
this below.

> b) The sentence doesn't seem even vaguely accurate.  shrink_zone()
>    will scan vastly more than SWAP_CLUSTER_MAX pages before calling
>    should_continue_reclaim().  Confused.
> 
> c) The patch doesn't "stop reclaim/compaction" fully.  It stops it
>    against one zone.  reclaim will then advance on to any other
>    eligible zones.

You're right on both counts and this comment is inaccurate. It should
have read;

This patch will stop reclaim/compaction for the current zone in shrink_zone()
if there were no pages reclaimed in the last batch of scanning at the
current priority.  For allocations such as hugetlbfs that use __GFP_REPEAT
and have fewer fallback options, the full LRU list may still be scanned.

The comment in the code itself then becomes

+               /*
+                * For non-__GFP_REPEAT allocations which can presumably
+                * fail without consequence, stop if we failed to reclaim
+                * any pages from the last batch of pages that were scanned.
+                * This will return to the caller faster at the risk that
+                * reclaim/compaction and the resulting allocation attempt
+                * fails
+                */
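
and the check that comment sits above would then read something like the
following (a sketch of the intended logic in should_continue_reclaim(),
not the exact hunk from the patch);

	/* Consider stopping depending on scan and reclaim activity */
	if (sc->gfp_mask & __GFP_REPEAT) {
		/*
		 * __GFP_REPEAT callers have few fallback options, so only
		 * stop once a full pass has reclaimed and scanned nothing.
		 */
		if (!nr_reclaimed && !nr_scanned)
			return false;
	} else {
		/* Otherwise stop as soon as a batch reclaims nothing */
		if (!nr_reclaimed)
			return false;
	}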

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2011-02-18 12:22 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-02-09 15:46 [patch] vmscan: fix zone shrinking exit when scan work is done Johannes Weiner
2011-02-09 15:46 ` Johannes Weiner
2011-02-09 15:54 ` Kent Overstreet
2011-02-09 15:54   ` Kent Overstreet
2011-02-09 16:46 ` Mel Gorman
2011-02-09 16:46   ` Mel Gorman
2011-02-09 18:28   ` Andrea Arcangeli
2011-02-09 18:28     ` Andrea Arcangeli
2011-02-09 20:05     ` Andrew Morton
2011-02-09 20:05       ` Andrew Morton
2011-02-10 10:21     ` Mel Gorman
2011-02-10 10:21       ` Mel Gorman
2011-02-10 10:41       ` Michal Hocko
2011-02-10 10:41         ` Michal Hocko
2011-02-10 12:48       ` Andrea Arcangeli
2011-02-10 12:48         ` Andrea Arcangeli
2011-02-10 13:33         ` Mel Gorman
2011-02-10 13:33           ` Mel Gorman
2011-02-10 14:14           ` Andrea Arcangeli
2011-02-10 14:14             ` Andrea Arcangeli
2011-02-10 14:58             ` Mel Gorman
2011-02-10 14:58               ` Mel Gorman
2011-02-16  9:50               ` [PATCH] mm: vmscan: Stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT Mel Gorman
2011-02-16  9:50                 ` Mel Gorman
2011-02-16 10:13                 ` Andrea Arcangeli
2011-02-16 10:13                   ` Andrea Arcangeli
2011-02-16 11:22                   ` Mel Gorman
2011-02-16 11:22                     ` Mel Gorman
2011-02-16 14:44                     ` Andrea Arcangeli
2011-02-16 14:44                       ` Andrea Arcangeli
2011-02-16 12:03                 ` Andrea Arcangeli
2011-02-16 12:03                   ` Andrea Arcangeli
2011-02-16 12:14                 ` Rik van Riel
2011-02-16 12:14                   ` Rik van Riel
2011-02-16 12:38                 ` Johannes Weiner
2011-02-16 12:38                   ` Johannes Weiner
2011-02-16 23:26                 ` Minchan Kim
2011-02-16 23:26                   ` Minchan Kim
2011-02-17 22:22                 ` Andrew Morton
2011-02-17 22:22                   ` Andrew Morton
2011-02-18 12:22                   ` Mel Gorman
2011-02-18 12:22                     ` Mel Gorman
2011-02-10  4:04 ` [patch] vmscan: fix zone shrinking exit when scan work is done Minchan Kim
2011-02-10  4:04   ` Minchan Kim

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.