* [patch 0/8] page reclaim bits v2
@ 2012-12-17 18:12 ` Johannes Weiner
  0 siblings, 0 replies; 46+ messages in thread
From: Johannes Weiner @ 2012-12-17 18:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins,
	Satoru Moriya, linux-mm, linux-kernel

Hello,

I dropped the controversial change of the meaning of !swappiness for
memcg and instead folded the two changes into one cleanup patch that
spells out the conditions in get_scan_count() more explicitly.

Aside from that, I expanded the changelogs with statistics and numbers
where there was confusion about the benefits, and added the review tags
to the patches that didn't otherwise change.  Let me know if I missed
any review feedback; I think I have it all addressed.

Thanks!


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [patch 1/7] mm: memcg: only evict file pages when we have plenty
  2012-12-17 18:12 ` Johannes Weiner
@ 2012-12-17 18:12   ` Johannes Weiner
  -1 siblings, 0 replies; 46+ messages in thread
From: Johannes Weiner @ 2012-12-17 18:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins,
	Satoru Moriya, linux-mm, linux-kernel

e986850 "mm, vmscan: only evict file pages when we have plenty" makes
a point of not going for anonymous memory while there is still enough
inactive cache around.

The check was added only for global reclaim, but it is just as useful
to reduce swapping in memory cgroup reclaim:

200M-memcg-defconfig-j2

                                 vanilla                   patched
Real time              454.06 (  +0.00%)         453.71 (  -0.08%)
User time              668.57 (  +0.00%)         668.73 (  +0.02%)
System time            128.92 (  +0.00%)         129.53 (  +0.46%)
Swap in               1246.80 (  +0.00%)         814.40 ( -34.65%)
Swap out              1198.90 (  +0.00%)         827.00 ( -30.99%)
Pages allocated   16431288.10 (  +0.00%)    16434035.30 (  +0.02%)
Major faults           681.50 (  +0.00%)         593.70 ( -12.86%)
THP faults             237.20 (  +0.00%)         242.40 (  +2.18%)
THP collapse           241.20 (  +0.00%)         248.50 (  +3.01%)
THP splits             157.30 (  +0.00%)         161.40 (  +2.59%)

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
---
 mm/vmscan.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7f30961..249ff94 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1688,19 +1688,21 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 			fraction[1] = 0;
 			denominator = 1;
 			goto out;
-		} else if (!inactive_file_is_low_global(zone)) {
-			/*
-			 * There is enough inactive page cache, do not
-			 * reclaim anything from the working set right now.
-			 */
-			fraction[0] = 0;
-			fraction[1] = 1;
-			denominator = 1;
-			goto out;
 		}
 	}
 
 	/*
+	 * There is enough inactive page cache, do not reclaim
+	 * anything from the anonymous working set right now.
+	 */
+	if (!inactive_file_is_low(lruvec)) {
+		fraction[0] = 0;
+		fraction[1] = 1;
+		denominator = 1;
+		goto out;
+	}
+
+	/*
 	 * With swappiness at 100, anonymous and file have the same priority.
 	 * This scanning priority is essentially the inverse of IO cost.
 	 */
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [patch 2/7] mm: vmscan: save work scanning (almost) empty LRU lists
  2012-12-17 18:12 ` Johannes Weiner
@ 2012-12-17 18:12   ` Johannes Weiner
  -1 siblings, 0 replies; 46+ messages in thread
From: Johannes Weiner @ 2012-12-17 18:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins,
	Satoru Moriya, linux-mm, linux-kernel

In certain cases (kswapd reclaim, memcg target reclaim), a fixed
minimum number of pages is scanned from the LRU lists on each
iteration to make progress.

Do not make this minimum bigger than the respective LRU list size,
however, and save some busy work trying to isolate and reclaim pages
that are not there.

Empty LRU lists are quite common with memory cgroups in NUMA
environments because there exists a set of LRU lists for each zone for
each memory cgroup, while the memory of a single cgroup is expected to
stay on just one node.  The number of expected empty LRU lists is thus

  memcgs * (nodes - 1) * lru types
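
For example (illustrative numbers, not taken from the measurements
below): with 100 memcgs on a 4-node machine and 4 evictable LRU types,
that is 100 * 3 * 4 = 1200 LRU lists that are expected to sit empty.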

Each attempt to reclaim from an empty LRU list does expensive size
comparisons between lists, acquires the zone's lru lock etc.  Avoid
that.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/swap.h |  2 +-
 mm/vmscan.c          | 10 ++++++----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 68df9c1..8c66486 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -156,7 +156,7 @@ enum {
 	SWP_SCANNING	= (1 << 8),	/* refcount in scan_swap_map */
 };
 
-#define SWAP_CLUSTER_MAX 32
+#define SWAP_CLUSTER_MAX 32UL
 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
 /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 249ff94..648a4db 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1749,15 +1749,17 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 out:
 	for_each_evictable_lru(lru) {
 		int file = is_file_lru(lru);
+		unsigned long size;
 		unsigned long scan;
 
-		scan = get_lru_size(lruvec, lru);
+		size = get_lru_size(lruvec, lru);
 		if (sc->priority || noswap || !vmscan_swappiness(sc)) {
-			scan >>= sc->priority;
+			scan = size >> sc->priority;
 			if (!scan && force_scan)
-				scan = SWAP_CLUSTER_MAX;
+				scan = min(size, SWAP_CLUSTER_MAX);
 			scan = div64_u64(scan * fraction[file], denominator);
-		}
+		} else
+			scan = size;
 		nr[lru] = scan;
 	}
 }
-- 
1.7.11.7
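
A minimal userspace sketch of the clamping this patch introduces.  The
helper names and the example numbers are made up for illustration and
are not kernel API; only SWAP_CLUSTER_MAX and the min() clamp are taken
from the diff:

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL	/* UL as in the patch: the kernel's min() is type-checked */

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Hypothetical model of the forced-scan target for a single LRU list. */
static unsigned long scan_target(unsigned long size, int priority, int force_scan)
{
	unsigned long scan = size >> priority;

	if (!scan && force_scan)
		scan = min_ul(size, SWAP_CLUSTER_MAX);	/* previously: scan = SWAP_CLUSTER_MAX */
	return scan;
}

int main(void)
{
	/* Empty memcg LRU on a remote NUMA node: no pointless isolation work. */
	printf("empty list  -> scan %lu\n", scan_target(0, 12, 1));
	/* Tiny list of 5 pages: the forced scan is capped at what is there. */
	printf("5-page list -> scan %lu\n", scan_target(5, 12, 1));
	return 0;
}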


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [patch 3/7] mm: vmscan: clarify how swappiness, highest priority, memcg interact
  2012-12-17 18:12 ` Johannes Weiner
@ 2012-12-17 18:12   ` Johannes Weiner
  -1 siblings, 0 replies; 46+ messages in thread
From: Johannes Weiner @ 2012-12-17 18:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins,
	Satoru Moriya, linux-mm, linux-kernel

A swappiness of 0 has a slightly different meaning for global reclaim
(may swap if the file cache is really low) and memory cgroup reclaim
(never swap, ever).

In addition, global reclaim at highest priority will scan all LRU
lists equal to their size and ignore other balancing heuristics.
UNLESS swappiness forbids swapping, then the lists are balanced based
on recent reclaim effectiveness.  UNLESS file cache is running low,
then anonymous pages are force-scanned.

This (total mess of a) behaviour is implicit and not obvious from the
way the code is organized.  At least make it apparent in the code flow
and document the conditions.  This will make it easier to come up with
sane semantics later.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 39 ++++++++++++++++++++++++++++++---------
 1 file changed, 30 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 648a4db..c37deaf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1644,7 +1644,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 	u64 fraction[2], denominator;
 	enum lru_list lru;
-	int noswap = 0;
 	bool force_scan = false;
 	struct zone *zone = lruvec_zone(lruvec);
 
@@ -1665,13 +1664,38 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 
 	/* If we have no swap space, do not bother scanning anon pages. */
 	if (!sc->may_swap || (nr_swap_pages <= 0)) {
-		noswap = 1;
 		fraction[0] = 0;
 		fraction[1] = 1;
 		denominator = 1;
 		goto out;
 	}
 
+	/*
+	 * Global reclaim will swap to prevent OOM even with no
+	 * swappiness, but memcg users want to use this knob to
+	 * disable swapping for individual groups completely when
+	 * using the memory controller's swap limit feature would be
+	 * too expensive.
+	 */
+	if (!global_reclaim(sc) && !vmscan_swappiness(sc)) {
+		fraction[0] = 0;
+		fraction[1] = 1;
+		denominator = 1;
+		goto out;
+	}
+
+	/*
+	 * Do not apply any pressure balancing cleverness when the
+	 * system is close to OOM, scan both anon and file equally
+	 * (unless the swappiness setting disagrees with swapping).
+	 */
+	if (!sc->priority && vmscan_swappiness(sc)) {
+		fraction[0] = 1;
+		fraction[1] = 1;
+		denominator = 1;
+		goto out;
+	}
+
 	anon  = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
 		get_lru_size(lruvec, LRU_INACTIVE_ANON);
 	file  = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
@@ -1753,13 +1777,10 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 		unsigned long scan;
 
 		size = get_lru_size(lruvec, lru);
-		if (sc->priority || noswap || !vmscan_swappiness(sc)) {
-			scan = size >> sc->priority;
-			if (!scan && force_scan)
-				scan = min(size, SWAP_CLUSTER_MAX);
-			scan = div64_u64(scan * fraction[file], denominator);
-		} else
-			scan = size;
+		scan = size >> sc->priority;
+		if (!scan && force_scan)
+			scan = min(size, SWAP_CLUSTER_MAX);
+		scan = div64_u64(scan * fraction[file], denominator);
 		nr[lru] = scan;
 	}
 }
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [patch 4/7] mm: vmscan: improve comment on low-page cache handling
  2012-12-17 18:12 ` Johannes Weiner
@ 2012-12-17 18:12   ` Johannes Weiner
  -1 siblings, 0 replies; 46+ messages in thread
From: Johannes Weiner @ 2012-12-17 18:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins,
	Satoru Moriya, linux-mm, linux-kernel

Fix comment style and elaborate on why anonymous memory is
force-scanned when file cache runs low.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
---
 mm/vmscan.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c37deaf..510c0d3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1701,13 +1701,15 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	file  = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
 		get_lru_size(lruvec, LRU_INACTIVE_FILE);
 
+	/*
+	 * If it's foreseeable that reclaiming the file cache won't be
+	 * enough to get the zone back into a desirable shape, we have
+	 * to swap.  Better start now and leave the - probably heavily
+	 * thrashing - remaining file pages alone.
+	 */
 	if (global_reclaim(sc)) {
-		free  = zone_page_state(zone, NR_FREE_PAGES);
+		free = zone_page_state(zone, NR_FREE_PAGES);
 		if (unlikely(file + free <= high_wmark_pages(zone))) {
-			/*
-			 * If we have very few page cache pages, force-scan
-			 * anon pages.
-			 */
 			fraction[0] = 1;
 			fraction[1] = 0;
 			denominator = 1;
-- 
1.7.11.7
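
To put a number on the condition documented above (illustrative
figures only): with a zone high watermark of 1000 pages and only 900
pages of free memory plus file cache combined, even reclaiming every
last file page cannot lift the zone back above its watermark.  That is
the file + free <= high_wmark_pages(zone) case in the hunk above, and
the point at which anonymous pages have to be scanned despite the
likely thrashing of the remaining cache.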


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [patch 5/7] mm: vmscan: clean up get_scan_count()
  2012-12-17 18:12 ` Johannes Weiner
@ 2012-12-17 18:12   ` Johannes Weiner
  -1 siblings, 0 replies; 46+ messages in thread
From: Johannes Weiner @ 2012-12-17 18:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins,
	Satoru Moriya, linux-mm, linux-kernel

Reclaim pressure balance between anon and file pages is calculated
through a tuple of numerators and a shared denominator.

Exceptional cases that want to force-scan anon or file pages configure
the numerators and denominator such that one list is preferred, which
is not necessarily the most obvious way:

    fraction[0] = 1;
    fraction[1] = 0;
    denominator = 1;
    goto out;

Make this easier to follow by making the force-scan cases explicit and
by using the fractions only when they are actually calculated from
reclaim history.

Also bring the variable declarations/definitions into order.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
---
 mm/vmscan.c | 64 +++++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 43 insertions(+), 21 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 510c0d3..785f4cd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1626,6 +1626,13 @@ static int vmscan_swappiness(struct scan_control *sc)
 	return mem_cgroup_swappiness(sc->target_mem_cgroup);
 }
 
+enum scan_balance {
+	SCAN_EQUAL,
+	SCAN_FRACT,
+	SCAN_ANON,
+	SCAN_FILE,
+};
+
 /*
  * Determine how aggressively the anon and file LRU lists should be
  * scanned.  The relative value of each set of LRU lists is determined
@@ -1638,14 +1645,15 @@ static int vmscan_swappiness(struct scan_control *sc)
 static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 			   unsigned long *nr)
 {
-	unsigned long anon, file, free;
+	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	u64 fraction[2], uninitialized_var(denominator);
+	struct zone *zone = lruvec_zone(lruvec);
 	unsigned long anon_prio, file_prio;
+	enum scan_balance scan_balance;
+	unsigned long anon, file, free;
+	bool force_scan = false;
 	unsigned long ap, fp;
-	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
-	u64 fraction[2], denominator;
 	enum lru_list lru;
-	bool force_scan = false;
-	struct zone *zone = lruvec_zone(lruvec);
 
 	/*
 	 * If the zone or memcg is small, nr[l] can be 0.  This
@@ -1664,9 +1672,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 
 	/* If we have no swap space, do not bother scanning anon pages. */
 	if (!sc->may_swap || (nr_swap_pages <= 0)) {
-		fraction[0] = 0;
-		fraction[1] = 1;
-		denominator = 1;
+		scan_balance = SCAN_FILE;
 		goto out;
 	}
 
@@ -1678,9 +1684,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 * too expensive.
 	 */
 	if (!global_reclaim(sc) && !vmscan_swappiness(sc)) {
-		fraction[0] = 0;
-		fraction[1] = 1;
-		denominator = 1;
+		scan_balance = SCAN_FILE;
 		goto out;
 	}
 
@@ -1690,9 +1694,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 * (unless the swappiness setting disagrees with swapping).
 	 */
 	if (!sc->priority && vmscan_swappiness(sc)) {
-		fraction[0] = 1;
-		fraction[1] = 1;
-		denominator = 1;
+		scan_balance = SCAN_EQUAL;
 		goto out;
 	}
 
@@ -1710,9 +1712,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	if (global_reclaim(sc)) {
 		free = zone_page_state(zone, NR_FREE_PAGES);
 		if (unlikely(file + free <= high_wmark_pages(zone))) {
-			fraction[0] = 1;
-			fraction[1] = 0;
-			denominator = 1;
+			scan_balance = SCAN_ANON;
 			goto out;
 		}
 	}
@@ -1722,12 +1722,12 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 * anything from the anonymous working set right now.
 	 */
 	if (!inactive_file_is_low(lruvec)) {
-		fraction[0] = 0;
-		fraction[1] = 1;
-		denominator = 1;
+		scan_balance = SCAN_FILE;
 		goto out;
 	}
 
+	scan_balance = SCAN_FRACT;
+
 	/*
 	 * With swappiness at 100, anonymous and file have the same priority.
 	 * This scanning priority is essentially the inverse of IO cost.
@@ -1780,9 +1780,31 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 
 		size = get_lru_size(lruvec, lru);
 		scan = size >> sc->priority;
+
 		if (!scan && force_scan)
 			scan = min(size, SWAP_CLUSTER_MAX);
-		scan = div64_u64(scan * fraction[file], denominator);
+
+		switch (scan_balance) {
+		case SCAN_EQUAL:
+			/* Scan lists relative to size */
+			break;
+		case SCAN_FRACT:
+			/*
+			 * Scan types proportional to swappiness and
+			 * their relative recent reclaim efficiency.
+			 */
+			scan = div64_u64(scan * fraction[file], denominator);
+			break;
+		case SCAN_FILE:
+		case SCAN_ANON:
+			/* Scan one type exclusively */
+			if ((scan_balance == SCAN_FILE) != file)
+				scan = 0;
+			break;
+		default:
+			/* Look ma, no brain */
+			BUG();
+		}
 		nr[lru] = scan;
 	}
 }
-- 
1.7.11.7
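
A compact userspace model of the decision ladder that get_scan_count()
ends up with after this patch.  Only the SCAN_* names come from the
diff; the helper, its parameters, and the example calls are
illustrative assumptions, not kernel code:

#include <stdbool.h>
#include <stdio.h>

enum scan_balance {
	SCAN_EQUAL,	/* scan both lists relative to their size */
	SCAN_FRACT,	/* balance by swappiness and reclaim efficiency */
	SCAN_ANON,	/* file cache too low to help, go for anon */
	SCAN_FILE,	/* plenty of file cache, leave anon alone */
};

/* Hypothetical helper mirroring the order of checks in the patch. */
static enum scan_balance pick_balance(bool may_swap, long nr_swap_pages,
				      bool global_reclaim, int swappiness,
				      int priority, bool file_cache_low,
				      bool inactive_file_low)
{
	if (!may_swap || nr_swap_pages <= 0)
		return SCAN_FILE;		/* nothing to swap to */
	if (!global_reclaim && !swappiness)
		return SCAN_FILE;		/* memcg swappiness==0: never swap */
	if (!priority && swappiness)
		return SCAN_EQUAL;		/* near OOM: no balancing cleverness */
	if (global_reclaim && file_cache_low)
		return SCAN_ANON;		/* cache alone cannot restore the zone */
	if (!inactive_file_low)
		return SCAN_FILE;		/* enough inactive cache around */
	return SCAN_FRACT;			/* use the recent-rotation fractions */
}

int main(void)
{
	/* memcg reclaim with swappiness 0: file only, regardless of priority. */
	printf("%d\n", pick_balance(true, 1000, false, 0, 3, false, false));
	/* global reclaim at priority 0 with swappiness 60: equal scanning. */
	printf("%d\n", pick_balance(true, 1000, true, 60, 0, false, false));
	return 0;
}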


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [patch 6/7] mm: vmscan: compaction works against zones, not lruvecs
  2012-12-17 18:12 ` Johannes Weiner
@ 2012-12-17 18:12   ` Johannes Weiner
  -1 siblings, 0 replies; 46+ messages in thread
From: Johannes Weiner @ 2012-12-17 18:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins,
	Satoru Moriya, linux-mm, linux-kernel

The restart logic for when reclaim operates back to back with
compaction is currently applied on the lruvec level.  But this does
not make sense, because the container of interest for compaction is a
zone as a whole, not the zone pages that are part of a certain memory
cgroup.

The negative impact is bounded.  For one, the code checks that the
lruvec has enough reclaim candidates, so it does not risk getting stuck
on a condition that cannot be fulfilled.  And the unfairness of
hammering on one particular memory cgroup to make progress in a zone
will be amortized by the round-robin manner in which reclaim goes
through the memory cgroups.  Still, this can lead to unnecessary
allocation latencies when the code elects to restart on a
hard-to-reclaim or small group while there are other, more reclaimable
groups in the zone.

Move this logic to the zone level and restart reclaim for all memory
cgroups in a zone when compaction requires more free pages from it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
---
 mm/vmscan.c | 180 +++++++++++++++++++++++++++++++-----------------------------
 1 file changed, 92 insertions(+), 88 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 785f4cd..2193396 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1809,6 +1809,59 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	}
 }
 
+/*
+ * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
+ */
+static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+	unsigned long nr[NR_LRU_LISTS];
+	unsigned long nr_to_scan;
+	enum lru_list lru;
+	unsigned long nr_reclaimed = 0;
+	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
+	struct blk_plug plug;
+
+	get_scan_count(lruvec, sc, nr);
+
+	blk_start_plug(&plug);
+	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
+					nr[LRU_INACTIVE_FILE]) {
+		for_each_evictable_lru(lru) {
+			if (nr[lru]) {
+				nr_to_scan = min_t(unsigned long,
+						   nr[lru], SWAP_CLUSTER_MAX);
+				nr[lru] -= nr_to_scan;
+
+				nr_reclaimed += shrink_list(lru, nr_to_scan,
+							    lruvec, sc);
+			}
+		}
+		/*
+		 * On large memory systems, scan >> priority can become
+		 * really large. This is fine for the starting priority;
+		 * we want to put equal scanning pressure on each zone.
+		 * However, if the VM has a harder time of freeing pages,
+		 * with multiple processes reclaiming pages, the total
+		 * freeing target can get unreasonably large.
+		 */
+		if (nr_reclaimed >= nr_to_reclaim &&
+		    sc->priority < DEF_PRIORITY)
+			break;
+	}
+	blk_finish_plug(&plug);
+	sc->nr_reclaimed += nr_reclaimed;
+
+	/*
+	 * Even if we did not try to evict anon pages at all, we want to
+	 * rebalance the anon lru active/inactive ratio.
+	 */
+	if (inactive_anon_is_low(lruvec))
+		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
+				   sc, LRU_ACTIVE_ANON);
+
+	throttle_vm_writeout(sc->gfp_mask);
+}
+
 /* Use reclaim/compaction for costly allocs or under memory pressure */
 static bool in_reclaim_compaction(struct scan_control *sc)
 {
@@ -1827,7 +1880,7 @@ static bool in_reclaim_compaction(struct scan_control *sc)
  * calls try_to_compact_zone() that it will have enough free pages to succeed.
  * It will give up earlier than that if there is difficulty reclaiming pages.
  */
-static inline bool should_continue_reclaim(struct lruvec *lruvec,
+static inline bool should_continue_reclaim(struct zone *zone,
 					unsigned long nr_reclaimed,
 					unsigned long nr_scanned,
 					struct scan_control *sc)
@@ -1867,15 +1920,15 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
 	 * inactive lists are large enough, continue reclaiming
 	 */
 	pages_for_compaction = (2UL << sc->order);
-	inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
+	inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
 	if (nr_swap_pages > 0)
-		inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);
+		inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
 	if (sc->nr_reclaimed < pages_for_compaction &&
 			inactive_lru_pages > pages_for_compaction)
 		return true;
 
 	/* If compaction would go ahead or the allocation would succeed, stop */
-	switch (compaction_suitable(lruvec_zone(lruvec), sc->order)) {
+	switch (compaction_suitable(zone, sc->order)) {
 	case COMPACT_PARTIAL:
 	case COMPACT_CONTINUE:
 		return false;
@@ -1884,98 +1937,49 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
 	}
 }
 
-/*
- * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
- */
-static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+static void shrink_zone(struct zone *zone, struct scan_control *sc)
 {
-	unsigned long nr[NR_LRU_LISTS];
-	unsigned long nr_to_scan;
-	enum lru_list lru;
 	unsigned long nr_reclaimed, nr_scanned;
-	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
-	struct blk_plug plug;
-
-restart:
-	nr_reclaimed = 0;
-	nr_scanned = sc->nr_scanned;
-	get_scan_count(lruvec, sc, nr);
-
-	blk_start_plug(&plug);
-	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
-					nr[LRU_INACTIVE_FILE]) {
-		for_each_evictable_lru(lru) {
-			if (nr[lru]) {
-				nr_to_scan = min_t(unsigned long,
-						   nr[lru], SWAP_CLUSTER_MAX);
-				nr[lru] -= nr_to_scan;
-
-				nr_reclaimed += shrink_list(lru, nr_to_scan,
-							    lruvec, sc);
-			}
-		}
-		/*
-		 * On large memory systems, scan >> priority can become
-		 * really large. This is fine for the starting priority;
-		 * we want to put equal scanning pressure on each zone.
-		 * However, if the VM has a harder time of freeing pages,
-		 * with multiple processes reclaiming pages, the total
-		 * freeing target can get unreasonably large.
-		 */
-		if (nr_reclaimed >= nr_to_reclaim &&
-		    sc->priority < DEF_PRIORITY)
-			break;
-	}
-	blk_finish_plug(&plug);
-	sc->nr_reclaimed += nr_reclaimed;
 
-	/*
-	 * Even if we did not try to evict anon pages at all, we want to
-	 * rebalance the anon lru active/inactive ratio.
-	 */
-	if (inactive_anon_is_low(lruvec))
-		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
-				   sc, LRU_ACTIVE_ANON);
-
-	/* reclaim/compaction might need reclaim to continue */
-	if (should_continue_reclaim(lruvec, nr_reclaimed,
-				    sc->nr_scanned - nr_scanned, sc))
-		goto restart;
+	do {
+		struct mem_cgroup *root = sc->target_mem_cgroup;
+		struct mem_cgroup_reclaim_cookie reclaim = {
+			.zone = zone,
+			.priority = sc->priority,
+		};
+		struct mem_cgroup *memcg;
 
-	throttle_vm_writeout(sc->gfp_mask);
-}
+		nr_reclaimed = sc->nr_reclaimed;
+		nr_scanned = sc->nr_scanned;
 
-static void shrink_zone(struct zone *zone, struct scan_control *sc)
-{
-	struct mem_cgroup *root = sc->target_mem_cgroup;
-	struct mem_cgroup_reclaim_cookie reclaim = {
-		.zone = zone,
-		.priority = sc->priority,
-	};
-	struct mem_cgroup *memcg;
+		memcg = mem_cgroup_iter(root, NULL, &reclaim);
+		do {
+			struct lruvec *lruvec;
 
-	memcg = mem_cgroup_iter(root, NULL, &reclaim);
-	do {
-		struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 
-		shrink_lruvec(lruvec, sc);
+			shrink_lruvec(lruvec, sc);
 
-		/*
-		 * Limit reclaim has historically picked one memcg and
-		 * scanned it with decreasing priority levels until
-		 * nr_to_reclaim had been reclaimed.  This priority
-		 * cycle is thus over after a single memcg.
-		 *
-		 * Direct reclaim and kswapd, on the other hand, have
-		 * to scan all memory cgroups to fulfill the overall
-		 * scan target for the zone.
-		 */
-		if (!global_reclaim(sc)) {
-			mem_cgroup_iter_break(root, memcg);
-			break;
-		}
-		memcg = mem_cgroup_iter(root, memcg, &reclaim);
-	} while (memcg);
+			/*
+			 * Limit reclaim has historically picked one
+			 * memcg and scanned it with decreasing
+			 * priority levels until nr_to_reclaim had
+			 * been reclaimed.  This priority cycle is
+			 * thus over after a single memcg.
+			 *
+			 * Direct reclaim and kswapd, on the other
+			 * hand, have to scan all memory cgroups to
+			 * fulfill the overall scan target for the
+			 * zone.
+			 */
+			if (!global_reclaim(sc)) {
+				mem_cgroup_iter_break(root, memcg);
+				break;
+			}
+			memcg = mem_cgroup_iter(root, memcg, &reclaim);
+		} while (memcg);
+	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
+					 sc->nr_scanned - nr_scanned, sc));
 }
 
 /* Returns true if compaction should go ahead for a high-order request */
-- 
1.7.11.7
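
A toy model of the control flow after this patch: the
compaction-driven restart now wraps the whole memcg walk of the zone
instead of restarting on a single lruvec.  All names and numbers below
are made up for illustration; should_continue() only mimics the "did
this pass free enough for compaction" part of should_continue_reclaim():

#include <stdbool.h>
#include <stdio.h>

#define NR_MEMCGS 3

/* Pretend per-memcg reclaim: the first group yields less on every pass. */
static unsigned long reclaim_one_lruvec(int memcg, int pass)
{
	return memcg == 0 ? 8UL >> pass : 2;
}

static bool should_continue(unsigned long reclaimed_this_pass,
			    unsigned long pages_for_compaction)
{
	return reclaimed_this_pass >= pages_for_compaction;
}

int main(void)
{
	unsigned long total = 0, delta;
	int pass = 0;

	do {
		delta = 0;
		/* One full round-robin pass over the zone's memcgs ... */
		for (int memcg = 0; memcg < NR_MEMCGS; memcg++)
			delta += reclaim_one_lruvec(memcg, pass);
		total += delta;
		pass++;
		/* ... and only then decide whether compaction needs another one. */
	} while (should_continue(delta, 8));

	printf("passes=%d, pages reclaimed=%lu\n", pass, total);
	return 0;
}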


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [patch 7/7] mm: reduce rmap overhead for ex-KSM page copies created on swap faults
  2012-12-17 18:12 ` Johannes Weiner
@ 2012-12-17 18:12   ` Johannes Weiner
  -1 siblings, 0 replies; 46+ messages in thread
From: Johannes Weiner @ 2012-12-17 18:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins,
	Satoru Moriya, linux-mm, linux-kernel

When ex-KSM pages are faulted from swap cache, the fault handler is
not capable of re-establishing anon_vma-spanning KSM pages.  In this
case, a copy of the page is created instead, just like during a COW
break.

These freshly made copies are known to be exclusive to the faulting
VMA, and there is no reason to go looking for the page in parent and
sibling processes during rmap operations.

Use page_add_new_anon_rmap() for these copies.  This also puts them on
the proper LRU lists and marks them SwapBacked, so the ad-hoc versions
of both steps can be dropped from the KSM copy code.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/ksm.c    | 6 ------
 mm/memory.c | 5 ++++-
 2 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 382d930..7275c74 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1590,13 +1590,7 @@ struct page *ksm_does_need_to_copy(struct page *page,
 
 		SetPageDirty(new_page);
 		__SetPageUptodate(new_page);
-		SetPageSwapBacked(new_page);
 		__set_page_locked(new_page);
-
-		if (!mlocked_vma_newpage(vma, new_page))
-			lru_cache_add_lru(new_page, LRU_ACTIVE_ANON);
-		else
-			add_page_to_unevictable_list(new_page);
 	}
 
 	return new_page;
diff --git a/mm/memory.c b/mm/memory.c
index db2e9e7..7e17eb0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3020,7 +3020,10 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	flush_icache_page(vma, page);
 	set_pte_at(mm, address, page_table, pte);
-	do_page_add_anon_rmap(page, vma, address, exclusive);
+	if (swapcache) /* ksm created a completely new copy */
+		page_add_new_anon_rmap(page, vma, address);
+	else
+		do_page_add_anon_rmap(page, vma, address, exclusive);
 	/* It's better to call commit-charge after rmap is established */
 	mem_cgroup_commit_charge_swapin(page, ptr);
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [patch 1/7] mm: memcg: only evict file pages when we have plenty
  2012-12-17 18:12   ` Johannes Weiner
@ 2012-12-17 18:15     ` Rik van Riel
  -1 siblings, 0 replies; 46+ messages in thread
From: Rik van Riel @ 2012-12-17 18:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Mel Gorman, Hugh Dickins,
	Satoru Moriya, linux-mm, linux-kernel

On 12/17/2012 01:12 PM, Johannes Weiner wrote:
> e986850 "mm, vmscan: only evict file pages when we have plenty" makes
> a point of not going for anonymous memory while there is still enough
> inactive cache around.
>
> The check was added only for global reclaim, but it is just as useful
> to reduce swapping in memory cgroup reclaim:
>
> 200M-memcg-defconfig-j2
>
>                                   vanilla                   patched
> Real time              454.06 (  +0.00%)         453.71 (  -0.08%)
> User time              668.57 (  +0.00%)         668.73 (  +0.02%)
> System time            128.92 (  +0.00%)         129.53 (  +0.46%)
> Swap in               1246.80 (  +0.00%)         814.40 ( -34.65%)
> Swap out              1198.90 (  +0.00%)         827.00 ( -30.99%)
> Pages allocated   16431288.10 (  +0.00%)    16434035.30 (  +0.02%)
> Major faults           681.50 (  +0.00%)         593.70 ( -12.86%)
> THP faults             237.20 (  +0.00%)         242.40 (  +2.18%)
> THP collapse           241.20 (  +0.00%)         248.50 (  +3.01%)
> THP splits             157.30 (  +0.00%)         161.40 (  +2.59%)
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Michal Hocko <mhocko@suse.cz>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 3/7] mm: vmscan: clarify how swappiness, highest priority, memcg interact
  2012-12-17 18:12   ` Johannes Weiner
@ 2012-12-17 18:16     ` Rik van Riel
  -1 siblings, 0 replies; 46+ messages in thread
From: Rik van Riel @ 2012-12-17 18:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Mel Gorman, Hugh Dickins,
	Satoru Moriya, linux-mm, linux-kernel

On 12/17/2012 01:12 PM, Johannes Weiner wrote:
> A swappiness of 0 has a slightly different meaning for global reclaim
> (may swap if file cache really low) and memory cgroup reclaim (never
> swap, ever).
>
> In addition, global reclaim at highest priority will scan all LRU
> lists equal to their size and ignore other balancing heuristics.
> UNLESS swappiness forbids swapping, then the lists are balanced based
> on recent reclaim effectiveness.  UNLESS file cache is running low,
> then anonymous pages are force-scanned.
>
> This (total mess of a) behaviour is implicit and not obvious from the
> way the code is organized.  At least make it apparent in the code flow
> and document the conditions.  It will be easier to come up with
> sane semantics later.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [patch 3/7] mm: vmscan: clarify how swappiness, highest priority, memcg interact
  2012-12-17 18:12   ` Johannes Weiner
@ 2012-12-17 19:37     ` Satoru Moriya
  -1 siblings, 0 replies; 46+ messages in thread
From: Satoru Moriya @ 2012-12-17 19:37 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

On 12/17/2012 01:12 PM, Johannes Weiner wrote:
> A swappiness of 0 has a slightly different meaning for global reclaim 
> (may swap if file cache really low) and memory cgroup reclaim (never 
> swap, ever).
> 
> In addition, global reclaim at highest priority will scan all LRU 
> lists equal to their size and ignore other balancing heuristics.
> UNLESS swappiness forbids swapping, then the lists are balanced based 
> on recent reclaim effectiveness.  UNLESS file cache is running low, 
> then anonymous pages are force-scanned.
> 
> This (total mess of a) behaviour is implicit and not obvious from the 
> way the code is organized.  At least make it apparent in the code flow 
> and document the conditions.  It will be easier to come up with 
> sane semantics later.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Satoru Moriya <satoru.moriya@hds.com>


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 3/7] mm: vmscan: clarify how swappiness, highest priority, memcg interact
  2012-12-17 18:12   ` Johannes Weiner
@ 2012-12-17 20:05     ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2012-12-17 20:05 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Hugh Dickins,
	Satoru Moriya, linux-mm, linux-kernel

On Mon 17-12-12 13:12:33, Johannes Weiner wrote:
> A swappiness of 0 has a slightly different meaning for global reclaim
> (may swap if file cache really low) and memory cgroup reclaim (never
> swap, ever).
> 
> In addition, global reclaim at highest priority will scan all LRU
> lists equal to their size and ignore other balancing heuristics.
> UNLESS swappiness forbids swapping, then the lists are balanced based
> on recent reclaim effectiveness.  UNLESS file cache is running low,
> then anonymous pages are force-scanned.
> 
> This (total mess of a) behaviour is implicit and not obvious from the
> way the code is organized.  At least make it apparent in the code flow
> and document the conditions.  It will be easier to come up with
> sane semantics later.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Michal Hocko <mhocko@suse.cz>

Thanks!

> ---
>  mm/vmscan.c | 39 ++++++++++++++++++++++++++++++---------
>  1 file changed, 30 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 648a4db..c37deaf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1644,7 +1644,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>  	u64 fraction[2], denominator;
>  	enum lru_list lru;
> -	int noswap = 0;
>  	bool force_scan = false;
>  	struct zone *zone = lruvec_zone(lruvec);
>  
> @@ -1665,13 +1664,38 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  
>  	/* If we have no swap space, do not bother scanning anon pages. */
>  	if (!sc->may_swap || (nr_swap_pages <= 0)) {
> -		noswap = 1;
>  		fraction[0] = 0;
>  		fraction[1] = 1;
>  		denominator = 1;
>  		goto out;
>  	}
>  
> +	/*
> +	 * Global reclaim will swap to prevent OOM even with no
> +	 * swappiness, but memcg users want to use this knob to
> +	 * disable swapping for individual groups completely when
> +	 * using the memory controller's swap limit feature would be
> +	 * too expensive.
> +	 */
> +	if (!global_reclaim(sc) && !vmscan_swappiness(sc)) {
> +		fraction[0] = 0;
> +		fraction[1] = 1;
> +		denominator = 1;
> +		goto out;
> +	}
> +
> +	/*
> +	 * Do not apply any pressure balancing cleverness when the
> +	 * system is close to OOM, scan both anon and file equally
> +	 * (unless the swappiness setting disagrees with swapping).
> +	 */
> +	if (!sc->priority && vmscan_swappiness(sc)) {
> +		fraction[0] = 1;
> +		fraction[1] = 1;
> +		denominator = 1;
> +		goto out;
> +	}
> +
>  	anon  = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
>  		get_lru_size(lruvec, LRU_INACTIVE_ANON);
>  	file  = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
> @@ -1753,13 +1777,10 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  		unsigned long scan;
>  
>  		size = get_lru_size(lruvec, lru);
> -		if (sc->priority || noswap || !vmscan_swappiness(sc)) {
> -			scan = size >> sc->priority;
> -			if (!scan && force_scan)
> -				scan = min(size, SWAP_CLUSTER_MAX);
> -			scan = div64_u64(scan * fraction[file], denominator);
> -		} else
> -			scan = size;
> +		scan = size >> sc->priority;
> +		if (!scan && force_scan)
> +			scan = min(size, SWAP_CLUSTER_MAX);
> +		scan = div64_u64(scan * fraction[file], denominator);
>  		nr[lru] = scan;
>  	}
>  }
> -- 
> 1.7.11.7
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 7/7] mm: reduce rmap overhead for ex-KSM page copies created on swap faults
  2012-12-17 18:12   ` Johannes Weiner
@ 2012-12-17 22:42     ` Hugh Dickins
  -1 siblings, 0 replies; 46+ messages in thread
From: Hugh Dickins @ 2012-12-17 22:42 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Michal Hocko, Mel Gorman,
	Satoru Moriya, linux-mm, linux-kernel

On Mon, 17 Dec 2012, Johannes Weiner wrote:

> When ex-KSM pages are faulted from swap cache, the fault handler is
> not capable of re-establishing anon_vma-spanning KSM pages.  In this
> case, a copy of the page is created instead, just like during a COW
> break.
> 
> These freshly made copies are known to be exclusive to the faulting
> VMA and there is no reason to go look for this page in parent and
> sibling processes during rmap operations.
> 
> Use page_add_new_anon_rmap() for these copies.  This also puts them on
> the proper LRU lists and marks them SwapBacked, so we can get rid of
> doing this ad-hoc in the KSM copy code.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Rik van Riel <riel@redhat.com>

Yes, that's good, thanks Hannes:
Acked-by: Hugh Dickins <hughd@google.com>

> ---
>  mm/ksm.c    | 6 ------
>  mm/memory.c | 5 ++++-
>  2 files changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 382d930..7275c74 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1590,13 +1590,7 @@ struct page *ksm_does_need_to_copy(struct page *page,
>  
>  		SetPageDirty(new_page);
>  		__SetPageUptodate(new_page);
> -		SetPageSwapBacked(new_page);
>  		__set_page_locked(new_page);
> -
> -		if (!mlocked_vma_newpage(vma, new_page))
> -			lru_cache_add_lru(new_page, LRU_ACTIVE_ANON);
> -		else
> -			add_page_to_unevictable_list(new_page);
>  	}
>  
>  	return new_page;
> diff --git a/mm/memory.c b/mm/memory.c
> index db2e9e7..7e17eb0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3020,7 +3020,10 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	}
>  	flush_icache_page(vma, page);
>  	set_pte_at(mm, address, page_table, pte);
> -	do_page_add_anon_rmap(page, vma, address, exclusive);
> +	if (swapcache) /* ksm created a completely new copy */
> +		page_add_new_anon_rmap(page, vma, address);
> +	else
> +		do_page_add_anon_rmap(page, vma, address, exclusive);
>  	/* It's better to call commit-charge after rmap is established */
>  	mem_cgroup_commit_charge_swapin(page, ptr);
>  
> -- 
> 1.7.11.7

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 1/7] mm: memcg: only evict file pages when we have plenty
  2012-12-17 18:12   ` Johannes Weiner
@ 2012-12-18  9:56     ` Mel Gorman
  -1 siblings, 0 replies; 46+ messages in thread
From: Mel Gorman @ 2012-12-18  9:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Michal Hocko, Hugh Dickins,
	Satoru Moriya, linux-mm, linux-kernel

On Mon, Dec 17, 2012 at 01:12:31PM -0500, Johannes Weiner wrote:
> e986850 "mm, vmscan: only evict file pages when we have plenty" makes
> a point of not going for anonymous memory while there is still enough
> inactive cache around.
> 
> The check was added only for global reclaim, but it is just as useful
> to reduce swapping in memory cgroup reclaim:
> 
> 200M-memcg-defconfig-j2
> 
>                                  vanilla                   patched
> Real time              454.06 (  +0.00%)         453.71 (  -0.08%)
> User time              668.57 (  +0.00%)         668.73 (  +0.02%)
> System time            128.92 (  +0.00%)         129.53 (  +0.46%)
> Swap in               1246.80 (  +0.00%)         814.40 ( -34.65%)
> Swap out              1198.90 (  +0.00%)         827.00 ( -30.99%)
> Pages allocated   16431288.10 (  +0.00%)    16434035.30 (  +0.02%)
> Major faults           681.50 (  +0.00%)         593.70 ( -12.86%)
> THP faults             237.20 (  +0.00%)         242.40 (  +2.18%)
> THP collapse           241.20 (  +0.00%)         248.50 (  +3.01%)
> THP splits             157.30 (  +0.00%)         161.40 (  +2.59%)
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Michal Hocko <mhocko@suse.cz>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 3/7] mm: vmscan: clarify how swappiness, highest priority, memcg interact
  2012-12-17 18:12   ` Johannes Weiner
@ 2012-12-18 10:04     ` Mel Gorman
  -1 siblings, 0 replies; 46+ messages in thread
From: Mel Gorman @ 2012-12-18 10:04 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Michal Hocko, Hugh Dickins,
	Satoru Moriya, linux-mm, linux-kernel

On Mon, Dec 17, 2012 at 01:12:33PM -0500, Johannes Weiner wrote:
> A swappiness of 0 has a slightly different meaning for global reclaim
> (may swap if file cache really low) and memory cgroup reclaim (never
> swap, ever).
> 
> In addition, global reclaim at highest priority will scan all LRU
> lists equal to their size and ignore other balancing heuristics.
> UNLESS swappiness forbids swapping, then the lists are balanced based
> on recent reclaim effectiveness.  UNLESS file cache is running low,
> then anonymous pages are force-scanned.
> 
> This (total mess of a) behaviour is implicit and not obvious from the
> way the code is organized.  At least make it apparent in the code flow
> and document the conditions.  It will be easier to come up with
> sane semantics later.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [patch 3/7] mm: vmscan: clarify how swappiness, highest priority, memcg interact
  2012-12-17 18:12   ` Johannes Weiner
@ 2012-12-18 15:16     ` Satoru Moriya
  -1 siblings, 0 replies; 46+ messages in thread
From: Satoru Moriya @ 2012-12-18 15:16 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

On 12/17/2012 01:12 PM, Johannes Weiner wrote:
> A swappiness of 0 has a slightly different meaning for global reclaim 
> (may swap if file cache really low) and memory cgroup reclaim (never 
> swap, ever).
> 
> In addition, global reclaim at highest priority will scan all LRU 
> lists equal to their size and ignore other balancing heuristics.
> UNLESS swappiness forbids swapping, then the lists are balanced based 
> on recent reclaim effectiveness.  UNLESS file cache is running low, 
> then anonymous pages are force-scanned.
> 
> This (total mess of a) behaviour is implicit and not obvious from the 
> way the code is organized.  At least make it apparent in the code flow 
> and document the conditions.  It will be easier to come up with 
> sane semantics later.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Hmmm, my company's mail server is in trouble... The mail I sent
yesterday has not been delivered yet.

Anyway, this is good for me.
Thanks!

Reviewed-by: Satoru Moriya <satoru.moriya@hds.com>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 7/7] mm: reduce rmap overhead for ex-KSM page copies created on swap faults
  2012-12-17 18:12   ` Johannes Weiner
@ 2012-12-19  7:01     ` Simon Jeons
  -1 siblings, 0 replies; 46+ messages in thread
From: Simon Jeons @ 2012-12-19  7:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Michal Hocko, Mel Gorman,
	Hugh Dickins, Satoru Moriya, linux-mm, linux-kernel

On Mon, 2012-12-17 at 13:12 -0500, Johannes Weiner wrote:
> When ex-KSM pages are faulted from swap cache, the fault handler is
> not capable of re-establishing anon_vma-spanning KSM pages.  In this
> case, a copy of the page is created instead, just like during a COW
> break.
> 
> These freshly made copies are known to be exclusive to the faulting
> VMA and there is no reason to go look for this page in parent and
> sibling processes during rmap operations.
> 
> Use page_add_new_anon_rmap() for these copies.  This also puts them on
> the proper LRU lists and marks them SwapBacked, so we can get rid of
> doing this ad-hoc in the KSM copy code.

Is it just a code cleanup, or does it actually reduce rmap overhead?

> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> ---
>  mm/ksm.c    | 6 ------
>  mm/memory.c | 5 ++++-
>  2 files changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 382d930..7275c74 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1590,13 +1590,7 @@ struct page *ksm_does_need_to_copy(struct page *page,
>  
>  		SetPageDirty(new_page);
>  		__SetPageUptodate(new_page);
> -		SetPageSwapBacked(new_page);
>  		__set_page_locked(new_page);
> -
> -		if (!mlocked_vma_newpage(vma, new_page))
> -			lru_cache_add_lru(new_page, LRU_ACTIVE_ANON);
> -		else
> -			add_page_to_unevictable_list(new_page);
>  	}
>  
>  	return new_page;
> diff --git a/mm/memory.c b/mm/memory.c
> index db2e9e7..7e17eb0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3020,7 +3020,10 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	}
>  	flush_icache_page(vma, page);
>  	set_pte_at(mm, address, page_table, pte);
> -	do_page_add_anon_rmap(page, vma, address, exclusive);
> +	if (swapcache) /* ksm created a completely new copy */
> +		page_add_new_anon_rmap(page, vma, address);
> +	else
> +		do_page_add_anon_rmap(page, vma, address, exclusive);
>  	/* It's better to call commit-charge after rmap is established */
>  	mem_cgroup_commit_charge_swapin(page, ptr);
>  



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 7/7] mm: reduce rmap overhead for ex-KSM page copies created on swap faults
  2012-12-19  7:01     ` Simon Jeons
@ 2012-12-19 17:58       ` Johannes Weiner
  -1 siblings, 0 replies; 46+ messages in thread
From: Johannes Weiner @ 2012-12-19 17:58 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Andrew Morton, Rik van Riel, Michal Hocko, Mel Gorman,
	Hugh Dickins, Satoru Moriya, linux-mm, linux-kernel

On Wed, Dec 19, 2012 at 02:01:19AM -0500, Simon Jeons wrote:
> On Mon, 2012-12-17 at 13:12 -0500, Johannes Weiner wrote:
> > When ex-KSM pages are faulted from swap cache, the fault handler is
> > not capable of re-establishing anon_vma-spanning KSM pages.  In this
> > case, a copy of the page is created instead, just like during a COW
> > break.
> > 
> > These freshly made copies are known to be exclusive to the faulting
> > VMA and there is no reason to go look for this page in parent and
> > sibling processes during rmap operations.
> > 
> > Use page_add_new_anon_rmap() for these copies.  This also puts them on
> > the proper LRU lists and marks them SwapBacked, so we can get rid of
> > doing this ad-hoc in the KSM copy code.
> 
> Is it just a code cleanup, or does it actually reduce rmap overhead?

Both.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 2/7] mm: vmscan: save work scanning (almost) empty LRU lists
  2012-12-17 18:12   ` Johannes Weiner
@ 2012-12-19 23:59     ` Andrew Morton
  -1 siblings, 0 replies; 46+ messages in thread
From: Andrew Morton @ 2012-12-19 23:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins,
	Satoru Moriya, linux-mm, linux-kernel

On Mon, 17 Dec 2012 13:12:32 -0500
Johannes Weiner <hannes@cmpxchg.org> wrote:

> In certain cases (kswapd reclaim, memcg target reclaim), a fixed
> minimum amount of pages is scanned from the LRU lists on each
> iteration, to make progress.
> 
> Do not make this minimum bigger than the respective LRU list size,
> however, and save some busy work trying to isolate and reclaim pages
> that are not there.
> 
> Empty LRU lists are quite common with memory cgroups in NUMA
> environments because there exists a set of LRU lists for each zone for
> each memory cgroup, while the memory of a single cgroup is expected to
> stay on just one node.  The number of expected empty LRU lists is thus
> 
>   memcgs * (nodes - 1) * lru types
> 
> Each attempt to reclaim from an empty LRU list does expensive size
> comparisons between lists, acquires the zone's lru lock etc.  Avoid
> that.
> 
> ...
>
> -#define SWAP_CLUSTER_MAX 32
> +#define SWAP_CLUSTER_MAX 32UL

You made me review the effects of this change.  It looks OK.  A few
cleanups are possible, please review.

I wonder what happens in __setup_per_zone_wmarks() if we set
SWAP_CLUSTER_MAX greater than 128.
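
For what it's worth, a quick standalone sketch of that case (plain
userspace C, not kernel code; the 256UL value and the helper macros are
made up for illustration).  Assuming clamp(val, lo, hi) expands to
min(max(val, lo), hi) as in include/linux/kernel.h, both the old
if-chain and the clamp() version silently cap the result at 128, so
WMARK_MIN would end up below SWAP_CLUSTER_MAX rather than failing
loudly:

#include <stdio.h>

#define SWAP_CLUSTER_MAX 256UL			/* hypothetical; normally 32UL */
#define max_ul(a, b)	((a) > (b) ? (a) : (b))
#define min_ul(a, b)	((a) < (b) ? (a) : (b))

/* the old if-chain from __setup_per_zone_wmarks() */
static unsigned long old_way(unsigned long min_pages)
{
	if (min_pages < SWAP_CLUSTER_MAX)
		min_pages = SWAP_CLUSTER_MAX;
	if (min_pages > 128)
		min_pages = 128;
	return min_pages;
}

/* the clamp() version, spelled out as min(max(val, lo), hi) */
static unsigned long clamp_way(unsigned long min_pages)
{
	return min_ul(max_ul(min_pages, SWAP_CLUSTER_MAX), 128UL);
}

int main(void)
{
	unsigned long min_pages = 65536 / 1024;	/* arbitrary small zone */

	/* both print 128, i.e. less than SWAP_CLUSTER_MAX */
	printf("old: %lu  clamp: %lu\n", old_way(min_pages), clamp_way(min_pages));
	return 0;
}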



From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm/page_alloc.c:__setup_per_zone_wmarks: make min_pages unsigned long

`int' is an inappropriate type for a number-of-pages counter.

While we're there, use the clamp() macro.

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff -puN mm/page_alloc.c~a mm/page_alloc.c
--- a/mm/page_alloc.c~a
+++ a/mm/page_alloc.c
@@ -5258,13 +5258,10 @@ static void __setup_per_zone_wmarks(void
 			 * deltas controls asynch page reclaim, and so should
 			 * not be capped for highmem.
 			 */
-			int min_pages;
+			unsigned long min_pages;
 
 			min_pages = zone->present_pages / 1024;
-			if (min_pages < SWAP_CLUSTER_MAX)
-				min_pages = SWAP_CLUSTER_MAX;
-			if (min_pages > 128)
-				min_pages = 128;
+			min_pages = clamp(min_pages, SWAP_CLUSTER_MAX, 128UL);
 			zone->watermark[WMARK_MIN] = min_pages;
 		} else {
 			/*
_


From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm/vmscan.c:shrink_lruvec(): switch to min()

"mm: vmscan: save work scanning (almost) empty LRU lists" made
SWAP_CLUSTER_MAX an unsigned long.

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff -puN mm/vmscan.c~a mm/vmscan.c
--- a/mm/vmscan.c~a
+++ a/mm/vmscan.c
@@ -1873,8 +1873,7 @@ restart:
 					nr[LRU_INACTIVE_FILE]) {
 		for_each_evictable_lru(lru) {
 			if (nr[lru]) {
-				nr_to_scan = min_t(unsigned long,
-						   nr[lru], SWAP_CLUSTER_MAX);
+				nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
 				nr[lru] -= nr_to_scan;
 
 				nr_reclaimed += shrink_list(lru, nr_to_scan,
_


From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm/vmscan.c:__zone_reclaim(): replace max_t() with max()

"mm: vmscan: save work scanning (almost) empty LRU lists" made
SWAP_CLUSTER_MAX an unsigned long.

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff -puN mm/vmscan.c~mm-vmscanc-__zone_reclaim-replace-max_t-with-max mm/vmscan.c
--- a/mm/vmscan.c~mm-vmscanc-__zone_reclaim-replace-max_t-with-max
+++ a/mm/vmscan.c
@@ -3347,8 +3347,7 @@ static int __zone_reclaim(struct zone *z
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
 		.may_swap = 1,
-		.nr_to_reclaim = max_t(unsigned long, nr_pages,
-				       SWAP_CLUSTER_MAX),
+		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
 		.gfp_mask = gfp_mask,
 		.order = order,
 		.priority = ZONE_RECLAIM_PRIORITY,
_
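
For anyone wondering why the min_t()/max_t() were needed in the first
place: the kernel's plain min()/max() refuse mismatched argument types
at compile time, roughly along the lines of the simplified userspace
sketch below (an approximation, not the real macro).  Comparing an
unsigned long against a plain int 32 therefore required the _t
variants; now that SWAP_CLUSTER_MAX is 32UL the types agree and plain
min()/max() are fine.

#include <stdio.h>

/* simplified stand-in for the kernel's type-checking min() */
#define xmin(x, y) ({					\
	typeof(x) _min1 = (x);				\
	typeof(y) _min2 = (y);				\
	(void) (&_min1 == &_min2);	/* warns when x and y differ in type */ \
	_min1 < _min2 ? _min1 : _min2; })

int main(void)
{
	unsigned long nr = 100;

	/* xmin(nr, 32) would warn: comparison of distinct pointer types */
	printf("%lu\n", xmin(nr, 32UL));	/* same types: no warning */
	return 0;
}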


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 5/7] mm: vmscan: clean up get_scan_count()
  2012-12-17 18:12   ` Johannes Weiner
@ 2012-12-20  0:08     ` Andrew Morton
  -1 siblings, 0 replies; 46+ messages in thread
From: Andrew Morton @ 2012-12-20  0:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins,
	Satoru Moriya, linux-mm, linux-kernel

On Mon, 17 Dec 2012 13:12:35 -0500
Johannes Weiner <hannes@cmpxchg.org> wrote:

> Reclaim pressure balance between anon and file pages is calculated
> through a tuple of numerators and a shared denominator.
> 
> Exceptional cases that want to force-scan anon or file pages configure
> the numerators and denominator such that one list is preferred, which
> is not necessarily the most obvious way:
> 
>     fraction[0] = 1;
>     fraction[1] = 0;
>     denominator = 1;
>     goto out;
> 
> Make this easier by making the force-scan cases explicit and use the
> fractionals only in case they are calculated from reclaim history.
> 
> And bring the variable declarations/definitions in order.
> 
> ...
>
> +	u64 fraction[2], uninitialized_var(denominator);

Using uninitialized_var() puts Linus into rant mode.  Unkindly, IMO:
uninitialized_var() is documentarily useful and reduces bloat.  There is
a move afoot to replace it with

	int foo = 0;	/* gcc */

To avoid getting ranted at we can do

--- a/mm/vmscan.c~mm-vmscan-clean-up-get_scan_count-fix
+++ a/mm/vmscan.c
@@ -1658,7 +1658,8 @@ static void get_scan_count(struct lruvec
 			   unsigned long *nr)
 {
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
-	u64 fraction[2], uninitialized_var(denominator);
+	u64 fraction[2];
+	u64 denominator = 0;
 	struct zone *zone = lruvec_zone(lruvec);
 	unsigned long anon_prio, file_prio;
 	enum scan_balance scan_balance;
_

Which bloats the text by six bytes, but will force a nice div-by-zero
if we ever hit that can't-happen path.
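
To make the failure mode concrete, a hand-rolled sketch of the pattern
(invented names and an invented "forgotten" branch, not the real
get_scan_count()): every path that reaches the division today also sets
the denominator, so the = 0 costs nothing; should a future branch
forget to set it, the division traps right away instead of quietly
scaling the scan targets by stack garbage:

#include <stdio.h>
#include <stdint.h>

static uint64_t scaled_scan(uint64_t size, int broken_new_branch)
{
	uint64_t fraction = 0;
	uint64_t denominator = 0;	/* instead of uninitialized_var() */

	if (!broken_new_branch) {	/* all current paths do this */
		fraction = 1;
		denominator = 1;
	}
	/* a branch that forgets denominator falls through with 0 here
	 * and dies loudly rather than dividing by leftover stack bits */
	return size * fraction / denominator;
}

int main(void)
{
	printf("%llu\n", (unsigned long long)scaled_scan(32, 0));
	return 0;
}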

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 2/7] mm: vmscan: save work scanning (almost) empty LRU lists
  2012-12-19 23:59     ` Andrew Morton
@ 2012-12-20 13:55       ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2012-12-20 13:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Mel Gorman, Hugh Dickins,
	Satoru Moriya, linux-mm, linux-kernel

On Wed 19-12-12 15:59:01, Andrew Morton wrote:
[...]
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: mm/page_alloc.c:__setup_per_zone_wmarks: make min_pages unsigned long
> 
> `int' is an inappropriate type for a number-of-pages counter.
> 
> While we're there, use the clamp() macro.
> 
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Satoru Moriya <satoru.moriya@hds.com>
> Cc: Simon Jeons <simon.jeons@gmail.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
> 
>  mm/page_alloc.c |    7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff -puN mm/page_alloc.c~a mm/page_alloc.c
> --- a/mm/page_alloc.c~a
> +++ a/mm/page_alloc.c
> @@ -5258,13 +5258,10 @@ static void __setup_per_zone_wmarks(void
>  			 * deltas controls asynch page reclaim, and so should
>  			 * not be capped for highmem.
>  			 */
> -			int min_pages;
> +			unsigned long min_pages;
>  
>  			min_pages = zone->present_pages / 1024;
> -			if (min_pages < SWAP_CLUSTER_MAX)
> -				min_pages = SWAP_CLUSTER_MAX;
> -			if (min_pages > 128)
> -				min_pages = 128;
> +			min_pages = clamp(min_pages, SWAP_CLUSTER_MAX, 128UL);
>  			zone->watermark[WMARK_MIN] = min_pages;
>  		} else {
>  			/*
> _
> 
> 
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: mm/vmscan.c:shrink_lruvec(): switch to min()
> 
> "mm: vmscan: save work scanning (almost) empty LRU lists" made
> SWAP_CLUSTER_MAX an unsigned long.
> 
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Satoru Moriya <satoru.moriya@hds.com>
> Cc: Simon Jeons <simon.jeons@gmail.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
> 
>  mm/vmscan.c |    3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff -puN mm/vmscan.c~a mm/vmscan.c
> --- a/mm/vmscan.c~a
> +++ a/mm/vmscan.c
> @@ -1873,8 +1873,7 @@ restart:
>  					nr[LRU_INACTIVE_FILE]) {
>  		for_each_evictable_lru(lru) {
>  			if (nr[lru]) {
> -				nr_to_scan = min_t(unsigned long,
> -						   nr[lru], SWAP_CLUSTER_MAX);
> +				nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
>  				nr[lru] -= nr_to_scan;
>  
>  				nr_reclaimed += shrink_list(lru, nr_to_scan,
> _
> 
> 
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: mm/vmscan.c:__zone_reclaim(): replace max_t() with max()
> 
> "mm: vmscan: save work scanning (almost) empty LRU lists" made
> SWAP_CLUSTER_MAX an unsigned long.
> 
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Satoru Moriya <satoru.moriya@hds.com>
> Cc: Simon Jeons <simon.jeons@gmail.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
> 
>  mm/vmscan.c |    3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff -puN mm/vmscan.c~mm-vmscanc-__zone_reclaim-replace-max_t-with-max mm/vmscan.c
> --- a/mm/vmscan.c~mm-vmscanc-__zone_reclaim-replace-max_t-with-max
> +++ a/mm/vmscan.c
> @@ -3347,8 +3347,7 @@ static int __zone_reclaim(struct zone *z
>  		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
>  		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
>  		.may_swap = 1,
> -		.nr_to_reclaim = max_t(unsigned long, nr_pages,
> -				       SWAP_CLUSTER_MAX),
> +		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
>  		.gfp_mask = gfp_mask,
>  		.order = order,
>  		.priority = ZONE_RECLAIM_PRIORITY,
> _
> 
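
(Aside for context: the reason these hunks become possible at all is the
type check built into the kernel's min()/max().  While SWAP_CLUSTER_MAX
was a plain int, min(nr[lru], SWAP_CLUSTER_MAX) would mix int with
unsigned long and warn, which is what min_t()/max_t() existed to paper
over.  Roughly, the min() of this era boils down to the following;
treat it as a simplified rendition, not a verbatim quote.)

/*
 * Simplified rendition of include/linux/kernel.h's min(); min_t() is
 * the same comparison with both sides first cast to the given type.
 */
#define min(x, y) ({				\
	typeof(x) _min1 = (x);			\
	typeof(y) _min2 = (y);			\
	(void) (&_min1 == &_min2);		\
	_min1 < _min2 ? _min1 : _min2; })

/*
 * The (void) pointer comparison triggers gcc's "comparison of distinct
 * pointer types" warning when x and y have different types.  With
 * "#define SWAP_CLUSTER_MAX 32UL" both arguments are unsigned long, so
 * plain min()/max() suffice and the min_t()/max_t() casts can go.
 */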

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 5/7] mm: vmscan: clean up get_scan_count()
  2012-12-20  0:08     ` Andrew Morton
@ 2012-12-21  2:49       ` Johannes Weiner
  -1 siblings, 0 replies; 46+ messages in thread
From: Johannes Weiner @ 2012-12-21  2:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins,
	Satoru Moriya, linux-mm, linux-kernel

On Wed, Dec 19, 2012 at 04:08:05PM -0800, Andrew Morton wrote:
> On Mon, 17 Dec 2012 13:12:35 -0500
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > Reclaim pressure balance between anon and file pages is calculated
> > through a tuple of numerators and a shared denominator.
> > 
> > Exceptional cases that want to force-scan anon or file pages configure
> > the numerators and denominator such that one list is preferred, which
> > is not necessarily the most obvious way:
> > 
> >     fraction[0] = 1;
> >     fraction[1] = 0;
> >     denominator = 1;
> >     goto out;
> > 
> > Make this easier by making the force-scan cases explicit and use the
> > fractionals only in case they are calculated from reclaim history.
> > 
> > And bring the variable declarations/definitions in order.
> > 
> > ...
> >
> > +	u64 fraction[2], uninitialized_var(denominator);
> 
> Using uninitialized_var() puts Linus into rant mode.  Unkindly, IMO:
> uninitialized_var() is useful as documentation and reduces bloat.  There is
> a move afoot to replace it with
> 
> 	int foo = 0;	/* gcc */
> 
> To avoid getting ranted at we can do
> 
> --- a/mm/vmscan.c~mm-vmscan-clean-up-get_scan_count-fix
> +++ a/mm/vmscan.c
> @@ -1658,7 +1658,8 @@ static void get_scan_count(struct lruvec
>  			   unsigned long *nr)
>  {
>  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> -	u64 fraction[2], uninitialized_var(denominator);
> +	u64 fraction[2];
> +	u64 denominator = 0;
>  	struct zone *zone = lruvec_zone(lruvec);
>  	unsigned long anon_prio, file_prio;
>  	enum scan_balance scan_balance;

Makes sense, I guess, but then you have to delete this line from the
changelog:

"And bring the variable declarations/definitions in order."

Or change it to "partial" order or something... :-)
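
(Aside for readers without the full patch in front of them: the
"explicit force-scan cases" the changelog refers to amount to recording
a decision first and acting on it in one place afterwards.  A compact,
self-contained paraphrase follows -- the enum values and the helper are
illustrative, not the verbatim hunk.)

#include <stdint.h>

/*
 * Record *why* a list is being scanned instead of encoding "file only"
 * or "anon only" through magic fraction[]/denominator values.
 */
enum scan_balance { SCAN_EQUAL, SCAN_FRACT, SCAN_ANON, SCAN_FILE };

uint64_t scan_amount(enum scan_balance balance, int file, uint64_t base,
		     const uint64_t fraction[2], uint64_t denominator)
{
	uint64_t scan = base;	/* list size >> reclaim priority */

	switch (balance) {
	case SCAN_EQUAL:
		break;				/* proportional to list size */
	case SCAN_FRACT:
		/* denominator assumed non-zero on this path */
		scan = scan * fraction[file] / denominator;
		break;
	case SCAN_ANON:
	case SCAN_FILE:
		if ((balance == SCAN_FILE) != file)
			scan = 0;		/* skip the list not forced */
		break;
	}
	return scan;
}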


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [patch 2/7] mm: vmscan: save work scanning (almost) empty LRU lists
  2012-12-19 23:59     ` Andrew Morton
@ 2012-12-21  3:02       ` Johannes Weiner
  -1 siblings, 0 replies; 46+ messages in thread
From: Johannes Weiner @ 2012-12-21  3:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins,
	Satoru Moriya, linux-mm, linux-kernel

On Wed, Dec 19, 2012 at 03:59:01PM -0800, Andrew Morton wrote:
> On Mon, 17 Dec 2012 13:12:32 -0500
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > In certain cases (kswapd reclaim, memcg target reclaim), a fixed
> > minimum number of pages is scanned from the LRU lists on each
> > iteration, to make progress.
> > 
> > Do not make this minimum bigger than the respective LRU list size,
> > however, and save some busy work trying to isolate and reclaim pages
> > that are not there.
> > 
> > Empty LRU lists are quite common with memory cgroups in NUMA
> > environments because there exists a set of LRU lists for each zone for
> > each memory cgroup, while the memory of a single cgroup is expected to
> > stay on just one node.  The number of expected empty LRU lists is thus
> > 
> >   memcgs * (nodes - 1) * lru types
> > 
> > Each attempt to reclaim from an empty LRU list does expensive size
> > comparisons between lists, acquires the zone's lru lock etc.  Avoid
> > that.
> > 
> > ...
> >
> > -#define SWAP_CLUSTER_MAX 32
> > +#define SWAP_CLUSTER_MAX 32UL
> 
> You made me review the effects of this change.  It looks OK.  A few
> cleanups are possible, please review.
> 
> I wonder what happens in __setup_per_zone_wmarks() if we set
> SWAP_CLUSTER_MAX greater than 128.

In the current clamp() implementation max overrides min, so...

BUILD_BUG_ON()?  Probably unnecessary, it seems like a rather
arbitrary range to begin with.
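
(Aside: the "max overrides min" behaviour falls out of the evaluation
order -- clamp() applies the lower bound first and the upper bound
last.  A simplified model, with a hypothetical oversized
SWAP_CLUSTER_MAX to show the effect; the real macro is a typeof-based
statement expression with extra type checks.)

/* Simplified model of clamp(val, lo, hi); ordering as in the kernel. */
unsigned long clamp_model(unsigned long val,
			  unsigned long lo, unsigned long hi)
{
	if (val < lo)
		val = lo;	/* enforce the minimum first */
	if (val > hi)
		val = hi;	/* enforce the maximum last, so it wins */
	return val;
}

/*
 * With, say, SWAP_CLUSTER_MAX defined as 256UL:
 *   clamp_model(min_pages, 256UL, 128UL) == 128 for any min_pages,
 * i.e. the 128-page cap silently wins rather than blowing up -- hence
 * a BUILD_BUG_ON() would only be defensive paranoia here.
 */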

> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: mm/page_alloc.c:__setup_per_zone_wmarks: make min_pages unsigned long
> 
> `int' is an inappropriate type for a number-of-pages counter.
> 
> While we're there, use the clamp() macro.
> 
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Satoru Moriya <satoru.moriya@hds.com>
> Cc: Simon Jeons <simon.jeons@gmail.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: mm/vmscan.c:shrink_lruvec(): switch to min()
> 
> "mm: vmscan: save work scanning (almost) empty LRU lists" made
> SWAP_CLUSTER_MAX an unsigned long.
> 
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Satoru Moriya <satoru.moriya@hds.com>
> Cc: Simon Jeons <simon.jeons@gmail.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: mm/vmscan.c:__zone_reclaim(): replace max_t() with max()
> 
> "mm: vmscan: save work scanning (almost) empty LRU lists" made
> SWAP_CLUSTER_MAX an unsigned long.
> 
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Satoru Moriya <satoru.moriya@hds.com>
> Cc: Simon Jeons <simon.jeons@gmail.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread

Thread overview: 46+ messages
2012-12-17 18:12 [patch 0/8] page reclaim bits v2 Johannes Weiner
2012-12-17 18:12 ` Johannes Weiner
2012-12-17 18:12 ` [patch 1/7] mm: memcg: only evict file pages when we have plenty Johannes Weiner
2012-12-17 18:12   ` Johannes Weiner
2012-12-17 18:15   ` Rik van Riel
2012-12-17 18:15     ` Rik van Riel
2012-12-18  9:56   ` Mel Gorman
2012-12-18  9:56     ` Mel Gorman
2012-12-17 18:12 ` [patch 2/7] mm: vmscan: save work scanning (almost) empty LRU lists Johannes Weiner
2012-12-17 18:12   ` Johannes Weiner
2012-12-19 23:59   ` Andrew Morton
2012-12-19 23:59     ` Andrew Morton
2012-12-20 13:55     ` Michal Hocko
2012-12-20 13:55       ` Michal Hocko
2012-12-21  3:02     ` Johannes Weiner
2012-12-21  3:02       ` Johannes Weiner
2012-12-17 18:12 ` [patch 3/7] mm: vmscan: clarify how swappiness, highest priority, memcg interact Johannes Weiner
2012-12-17 18:12   ` Johannes Weiner
2012-12-17 18:16   ` Rik van Riel
2012-12-17 18:16     ` Rik van Riel
2012-12-17 19:37   ` Satoru Moriya
2012-12-17 19:37     ` Satoru Moriya
2012-12-17 20:05   ` Michal Hocko
2012-12-17 20:05     ` Michal Hocko
2012-12-18 10:04   ` Mel Gorman
2012-12-18 10:04     ` Mel Gorman
2012-12-18 15:16   ` Satoru Moriya
2012-12-18 15:16     ` Satoru Moriya
2012-12-17 18:12 ` [patch 4/7] mm: vmscan: improve comment on low-page cache handling Johannes Weiner
2012-12-17 18:12   ` Johannes Weiner
2012-12-17 18:12 ` [patch 5/7] mm: vmscan: clean up get_scan_count() Johannes Weiner
2012-12-17 18:12   ` Johannes Weiner
2012-12-20  0:08   ` Andrew Morton
2012-12-20  0:08     ` Andrew Morton
2012-12-21  2:49     ` Johannes Weiner
2012-12-21  2:49       ` Johannes Weiner
2012-12-17 18:12 ` [patch 6/7] mm: vmscan: compaction works against zones, not lruvecs Johannes Weiner
2012-12-17 18:12   ` Johannes Weiner
2012-12-17 18:12 ` [patch 7/7] mm: reduce rmap overhead for ex-KSM page copies created on swap faults Johannes Weiner
2012-12-17 18:12   ` Johannes Weiner
2012-12-17 22:42   ` Hugh Dickins
2012-12-17 22:42     ` Hugh Dickins
2012-12-19  7:01   ` Simon Jeons
2012-12-19  7:01     ` Simon Jeons
2012-12-19 17:58     ` Johannes Weiner
2012-12-19 17:58       ` Johannes Weiner
