* [patch 0/8] page reclaim bits
@ 2012-12-12 21:43 ` Johannes Weiner
  0 siblings, 0 replies; 114+ messages in thread
From: Johannes Weiner @ 2012-12-12 21:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

Hello,

I had these in my queue and on test machines for a while, but they got
deferred over and over, partly because of the kswapd issues.  I hope
it's not too late for 3.8; they should be fairly straightforward.

#1 takes the anon workingset protection with plenty of file cache
from global reclaim, which was just merged into 3.8, and generalizes
it to include memcg reclaim.

#2-#6 are get_scan_count() fixes and cleanups.

#7 fixes reclaim-for-compaction to work against zones, not lruvecs,
since zones are what compaction operates on.  The practical impact is
limited to memcg setups, but the current code is confusing for
everybody.

#8 puts ksm pages that are copied-on-swapin into their own separate
anon_vma.

Thanks!

 include/linux/swap.h |   2 +-
 mm/ksm.c             |   6 --
 mm/memory.c          |   5 +-
 mm/vmscan.c          | 268 +++++++++++++++++++++++++++----------------------
 4 files changed, 152 insertions(+), 129 deletions(-)


* [patch 1/8] mm: memcg: only evict file pages when we have plenty
  2012-12-12 21:43 ` Johannes Weiner
@ 2012-12-12 21:43   ` Johannes Weiner
  0 siblings, 0 replies; 114+ messages in thread
From: Johannes Weiner @ 2012-12-12 21:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

dc0422c "mm: vmscan: only evict file pages when we have plenty" makes
a point of not going for anonymous memory while there is still enough
inactive cache around.

The check was added only for global reclaim, but it is just as useful
for memory cgroup reclaim.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 157bb11..3874dcb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1671,6 +1671,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 		denominator = 1;
 		goto out;
 	}
+	/*
+	 * There is enough inactive page cache, do not reclaim
+	 * anything from the anonymous working set right now.
+	 */
+	if (!inactive_file_is_low(lruvec)) {
+		fraction[0] = 0;
+		fraction[1] = 1;
+		denominator = 1;
+		goto out;
+	}
 
 	anon  = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
 		get_lru_size(lruvec, LRU_INACTIVE_ANON);
@@ -1688,15 +1698,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 			fraction[1] = 0;
 			denominator = 1;
 			goto out;
-		} else if (!inactive_file_is_low_global(zone)) {
-			/*
-			 * There is enough inactive page cache, do not
-			 * reclaim anything from the working set right now.
-			 */
-			fraction[0] = 0;
-			fraction[1] = 1;
-			denominator = 1;
-			goto out;
 		}
 	}
 
-- 
1.7.11.7


* [patch 2/8] mm: vmscan: disregard swappiness shortly before going OOM
  2012-12-12 21:43 ` Johannes Weiner
@ 2012-12-12 21:43   ` Johannes Weiner
  0 siblings, 0 replies; 114+ messages in thread
From: Johannes Weiner @ 2012-12-12 21:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

When a reclaim scanner is doing its final scan before giving up and
there is swap space available, pay no attention to swappiness
preference anymore.  Just swap.

Note that this change won't make too big of a difference for general
reclaim: anonymous pages are already force-scanned when only very
little file cache is left, and very little is likely to be left by the
time the reclaimer enters this final cycle.
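
(Illustration only, not part of the patch.)  With swappiness 0, swap
space available and sc->priority == 0, the old condition

    if (sc->priority || noswap || !vmscan_swappiness(sc))

was still true and took the proportional path, where a swappiness of
0 keeps the anon fraction at ~0, so nothing was swapped.  With the
swappiness check gone, the branch is skipped at priority 0 and scan
stays at the full list size, so anon does get scanned before going
OOM.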

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3874dcb..6e53446 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1751,7 +1751,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 		unsigned long scan;
 
 		scan = get_lru_size(lruvec, lru);
-		if (sc->priority || noswap || !vmscan_swappiness(sc)) {
+		if (sc->priority || noswap) {
 			scan >>= sc->priority;
 			if (!scan && force_scan)
 				scan = SWAP_CLUSTER_MAX;
-- 
1.7.11.7


* [patch 3/8] mm: vmscan: save work scanning (almost) empty LRU lists
  2012-12-12 21:43 ` Johannes Weiner
@ 2012-12-12 21:43   ` Johannes Weiner
  0 siblings, 0 replies; 114+ messages in thread
From: Johannes Weiner @ 2012-12-12 21:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

In certain cases (kswapd reclaim, memcg target reclaim), a fixed
minimum number of pages is scanned from the LRU lists on each
iteration, to make progress.

Do not make this minimum bigger than the respective LRU list size,
however, and save some busy work trying to isolate and reclaim pages
that are not there.
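
(Illustration only, not part of the patch; the numbers are made up.)
For a nearly empty list, the forced minimum used to exceed the list
size:

    size = get_lru_size(lruvec, lru);        /* say, 5 pages       */
    scan = size >> sc->priority;             /* 0 for priority > 2 */
    /* old: scan = SWAP_CLUSTER_MAX;            forced to 32       */
    /* new: scan = min(size, SWAP_CLUSTER_MAX); capped at 5        */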

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/swap.h |  2 +-
 mm/vmscan.c          | 10 ++++++----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 68df9c1..8c66486 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -156,7 +156,7 @@ enum {
 	SWP_SCANNING	= (1 << 8),	/* refcount in scan_swap_map */
 };
 
-#define SWAP_CLUSTER_MAX 32
+#define SWAP_CLUSTER_MAX 32UL
 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
 /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6e53446..1763e79 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1748,15 +1748,17 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 out:
 	for_each_evictable_lru(lru) {
 		int file = is_file_lru(lru);
+		unsigned long size;
 		unsigned long scan;
 
-		scan = get_lru_size(lruvec, lru);
+		size = get_lru_size(lruvec, lru);
 		if (sc->priority || noswap) {
-			scan >>= sc->priority;
+			scan = size >> sc->priority;
 			if (!scan && force_scan)
-				scan = SWAP_CLUSTER_MAX;
+				scan = min(size, SWAP_CLUSTER_MAX);
 			scan = div64_u64(scan * fraction[file], denominator);
-		}
+		} else
+			scan = size;
 		nr[lru] = scan;
 	}
 }
-- 
1.7.11.7


* [patch 4/8] mm: vmscan: clarify LRU balancing close to OOM
  2012-12-12 21:43 ` Johannes Weiner
@ 2012-12-12 21:43   ` Johannes Weiner
  0 siblings, 0 replies; 114+ messages in thread
From: Johannes Weiner @ 2012-12-12 21:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

There are currently several inter-LRU balancing heuristics that simply
get disabled when the reclaimer is at the last reclaim cycle before
giving up, but the code is quite cumbersome and not really obvious.

Make the heuristics visibly unreachable for the last reclaim cycle.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1763e79..5e1beed 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1644,7 +1644,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 	u64 fraction[2], denominator;
 	enum lru_list lru;
-	int noswap = 0;
 	bool force_scan = false;
 	struct zone *zone = lruvec_zone(lruvec);
 
@@ -1665,12 +1664,23 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 
 	/* If we have no swap space, do not bother scanning anon pages. */
 	if (!sc->may_swap || (nr_swap_pages <= 0)) {
-		noswap = 1;
 		fraction[0] = 0;
 		fraction[1] = 1;
 		denominator = 1;
 		goto out;
 	}
+
+	/*
+	 * Do not apply any pressure balancing cleverness when the
+	 * system is close to OOM, scan both anon and file equally.
+	 */
+	if (!sc->priority) {
+		fraction[0] = 1;
+		fraction[1] = 1;
+		denominator = 1;
+		goto out;
+	}
+
 	/*
 	 * There is enough inactive page cache, do not reclaim
 	 * anything from the anonymous working set right now.
@@ -1752,13 +1762,10 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 		unsigned long scan;
 
 		size = get_lru_size(lruvec, lru);
-		if (sc->priority || noswap) {
-			scan = size >> sc->priority;
-			if (!scan && force_scan)
-				scan = min(size, SWAP_CLUSTER_MAX);
-			scan = div64_u64(scan * fraction[file], denominator);
-		} else
-			scan = size;
+		scan = size >> sc->priority;
+		if (!scan && force_scan)
+			scan = min(size, SWAP_CLUSTER_MAX);
+		scan = div64_u64(scan * fraction[file], denominator);
 		nr[lru] = scan;
 	}
 }
-- 
1.7.11.7


* [patch 5/8] mm: vmscan: improve comment on low-page cache handling
  2012-12-12 21:43 ` Johannes Weiner
@ 2012-12-12 21:43   ` Johannes Weiner
  0 siblings, 0 replies; 114+ messages in thread
From: Johannes Weiner @ 2012-12-12 21:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

Fix comment style and elaborate on why anonymous memory is
force-scanned when file cache runs low.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e1beed..05475e1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1697,13 +1697,15 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	file  = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
 		get_lru_size(lruvec, LRU_INACTIVE_FILE);
 
+	/*
+	 * If it's foreseeable that reclaiming the file cache won't be
+	 * enough to get the zone back into a desirable shape, we have
+	 * to swap.  Better start now and leave the - probably heavily
+	 * thrashing - remaining file pages alone.
+	 */
 	if (global_reclaim(sc)) {
-		free  = zone_page_state(zone, NR_FREE_PAGES);
+		free = zone_page_state(zone, NR_FREE_PAGES);
 		if (unlikely(file + free <= high_wmark_pages(zone))) {
-			/*
-			 * If we have very few page cache pages, force-scan
-			 * anon pages.
-			 */
 			fraction[0] = 1;
 			fraction[1] = 0;
 			denominator = 1;
-- 
1.7.11.7


* [patch 6/8] mm: vmscan: clean up get_scan_count()
  2012-12-12 21:43 ` Johannes Weiner
@ 2012-12-12 21:43   ` Johannes Weiner
  0 siblings, 0 replies; 114+ messages in thread
From: Johannes Weiner @ 2012-12-12 21:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

Reclaim pressure balance between anon and file pages is calculated
through a tuple of numerators and a shared denominator.

Exceptional cases that want to force-scan anon or file pages configure
the numerators and denominator such that one list is preferred, which
is not necessarily the most obvious way:

    fraction[0] = 1;
    fraction[1] = 0;
    denominator = 1;
    goto out;

Make this easier to follow by making the force-scan cases explicit
and using the fractions only when they are actually calculated from
reclaim history.

Also bring the variable declarations/definitions into order.
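
(Illustration only, not part of the patch; the numbers are made up.)
With the states in place, the final loop boils down to one adjustment
per list, e.g. for a list of 4096 pages at priority 4:

    scan = 4096 >> 4;                  /* 256 */
    /* SCAN_EQUAL: anon and file lists both keep 256      */
    /* SCAN_FILE:  file lists keep 256, anon lists get 0  */
    /* SCAN_ANON:  anon lists keep 256, file lists get 0  */
    /* SCAN_FRACT: 256 * fraction[file] / denominator     */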

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 46 ++++++++++++++++++++++++++++------------------
 1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 05475e1..e20385a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1626,6 +1626,13 @@ static int vmscan_swappiness(struct scan_control *sc)
 	return mem_cgroup_swappiness(sc->target_mem_cgroup);
 }
 
+enum scan_balance {
+	SCAN_EQUAL,
+	SCAN_FRACT,
+	SCAN_ANON,
+	SCAN_FILE,
+};
+
 /*
  * Determine how aggressively the anon and file LRU lists should be
  * scanned.  The relative value of each set of LRU lists is determined
@@ -1638,14 +1645,15 @@ static int vmscan_swappiness(struct scan_control *sc)
 static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 			   unsigned long *nr)
 {
-	unsigned long anon, file, free;
+	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	u64 fraction[2], uninitialized_var(denominator);
+	struct zone *zone = lruvec_zone(lruvec);
 	unsigned long anon_prio, file_prio;
+	enum scan_balance scan_balance;
+	unsigned long anon, file, free;
+	bool force_scan = false;
 	unsigned long ap, fp;
-	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
-	u64 fraction[2], denominator;
 	enum lru_list lru;
-	bool force_scan = false;
-	struct zone *zone = lruvec_zone(lruvec);
 
 	/*
 	 * If the zone or memcg is small, nr[l] can be 0.  This
@@ -1664,9 +1672,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 
 	/* If we have no swap space, do not bother scanning anon pages. */
 	if (!sc->may_swap || (nr_swap_pages <= 0)) {
-		fraction[0] = 0;
-		fraction[1] = 1;
-		denominator = 1;
+		scan_balance = SCAN_FILE;
 		goto out;
 	}
 
@@ -1675,9 +1681,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 * system is close to OOM, scan both anon and file equally.
 	 */
 	if (!sc->priority) {
-		fraction[0] = 1;
-		fraction[1] = 1;
-		denominator = 1;
+		scan_balance = SCAN_EQUAL;
 		goto out;
 	}
 
@@ -1686,9 +1690,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 * anything from the anonymous working set right now.
 	 */
 	if (!inactive_file_is_low(lruvec)) {
-		fraction[0] = 0;
-		fraction[1] = 1;
-		denominator = 1;
+		scan_balance = SCAN_FILE;
 		goto out;
 	}
 
@@ -1706,13 +1708,13 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	if (global_reclaim(sc)) {
 		free = zone_page_state(zone, NR_FREE_PAGES);
 		if (unlikely(file + free <= high_wmark_pages(zone))) {
-			fraction[0] = 1;
-			fraction[1] = 0;
-			denominator = 1;
+			scan_balance = SCAN_ANON;
 			goto out;
 		}
 	}
 
+	scan_balance = SCAN_FRACT;
+
 	/*
 	 * With swappiness at 100, anonymous and file have the same priority.
 	 * This scanning priority is essentially the inverse of IO cost.
@@ -1765,9 +1767,17 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 
 		size = get_lru_size(lruvec, lru);
 		scan = size >> sc->priority;
+
 		if (!scan && force_scan)
 			scan = min(size, SWAP_CLUSTER_MAX);
-		scan = div64_u64(scan * fraction[file], denominator);
+
+		if (scan_balance == SCAN_EQUAL)
+			; /* scan relative to size */
+		else if (scan_balance == SCAN_FRACT)
+			scan = div64_u64(scan * fraction[file], denominator);
+		else if ((scan_balance == SCAN_FILE) != file)
+			scan = 0;
+
 		nr[lru] = scan;
 	}
 }
-- 
1.7.11.7


* [patch 7/8] mm: vmscan: compaction works against zones, not lruvecs
  2012-12-12 21:43 ` Johannes Weiner
@ 2012-12-12 21:43   ` Johannes Weiner
  0 siblings, 0 replies; 114+ messages in thread
From: Johannes Weiner @ 2012-12-12 21:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

The restart logic for when reclaim operates back to back with
compaction is currently applied on the lruvec level.  But this does
not make sense, because the container of interest for compaction is a
zone as a whole, not the zone pages that are part of a certain memory
cgroup.

The negative impact is bounded.  For one, the code checks that the
lruvec has enough reclaim candidates, so it does not risk getting
stuck on a condition that cannot be fulfilled.  And the unfairness of
hammering on one particular memory cgroup to make progress in a zone
will be amortized by the round-robin manner in which reclaim goes
through the memory cgroups.  Still, this can lead to unnecessary
allocation latencies when the code elects to restart on a
hard-to-reclaim or small group when there are other, more reclaimable
groups in the zone.

Move this logic to the zone level and restart reclaim for all memory
cgroups in a zone when compaction requires more free pages from it.
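
(Illustration only, not part of the patch.)  Condensed, the reworked
control flow is:

    do {
            memcg = mem_cgroup_iter(root, NULL, &reclaim);
            do {
                    lruvec = mem_cgroup_zone_lruvec(zone, memcg);
                    shrink_lruvec(lruvec, sc);
                    memcg = mem_cgroup_iter(root, memcg, &reclaim);
            } while (memcg);
    } while (should_continue_reclaim(zone, nr_reclaimed, nr_scanned, sc));

with the limit-reclaim early exit and the reclaimed/scanned
bookkeeping omitted for brevity.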

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 180 +++++++++++++++++++++++++++++++-----------------------------
 1 file changed, 92 insertions(+), 88 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e20385a..c9c841d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1782,6 +1782,59 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	}
 }
 
+/*
+ * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
+ */
+static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+{
+	unsigned long nr[NR_LRU_LISTS];
+	unsigned long nr_to_scan;
+	enum lru_list lru;
+	unsigned long nr_reclaimed = 0;
+	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
+	struct blk_plug plug;
+
+	get_scan_count(lruvec, sc, nr);
+
+	blk_start_plug(&plug);
+	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
+					nr[LRU_INACTIVE_FILE]) {
+		for_each_evictable_lru(lru) {
+			if (nr[lru]) {
+				nr_to_scan = min_t(unsigned long,
+						   nr[lru], SWAP_CLUSTER_MAX);
+				nr[lru] -= nr_to_scan;
+
+				nr_reclaimed += shrink_list(lru, nr_to_scan,
+							    lruvec, sc);
+			}
+		}
+		/*
+		 * On large memory systems, scan >> priority can become
+		 * really large. This is fine for the starting priority;
+		 * we want to put equal scanning pressure on each zone.
+		 * However, if the VM has a harder time of freeing pages,
+		 * with multiple processes reclaiming pages, the total
+		 * freeing target can get unreasonably large.
+		 */
+		if (nr_reclaimed >= nr_to_reclaim &&
+		    sc->priority < DEF_PRIORITY)
+			break;
+	}
+	blk_finish_plug(&plug);
+	sc->nr_reclaimed += nr_reclaimed;
+
+	/*
+	 * Even if we did not try to evict anon pages at all, we want to
+	 * rebalance the anon lru active/inactive ratio.
+	 */
+	if (inactive_anon_is_low(lruvec))
+		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
+				   sc, LRU_ACTIVE_ANON);
+
+	throttle_vm_writeout(sc->gfp_mask);
+}
+
 /* Use reclaim/compaction for costly allocs or under memory pressure */
 static bool in_reclaim_compaction(struct scan_control *sc)
 {
@@ -1800,7 +1853,7 @@ static bool in_reclaim_compaction(struct scan_control *sc)
  * calls try_to_compact_zone() that it will have enough free pages to succeed.
  * It will give up earlier than that if there is difficulty reclaiming pages.
  */
-static inline bool should_continue_reclaim(struct lruvec *lruvec,
+static inline bool should_continue_reclaim(struct zone *zone,
 					unsigned long nr_reclaimed,
 					unsigned long nr_scanned,
 					struct scan_control *sc)
@@ -1840,15 +1893,15 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
 	 * inactive lists are large enough, continue reclaiming
 	 */
 	pages_for_compaction = (2UL << sc->order);
-	inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
+	inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
 	if (nr_swap_pages > 0)
-		inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);
+		inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
 	if (sc->nr_reclaimed < pages_for_compaction &&
 			inactive_lru_pages > pages_for_compaction)
 		return true;
 
 	/* If compaction would go ahead or the allocation would succeed, stop */
-	switch (compaction_suitable(lruvec_zone(lruvec), sc->order)) {
+	switch (compaction_suitable(zone, sc->order)) {
 	case COMPACT_PARTIAL:
 	case COMPACT_CONTINUE:
 		return false;
@@ -1857,98 +1910,49 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
 	}
 }
 
-/*
- * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
- */
-static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
+static void shrink_zone(struct zone *zone, struct scan_control *sc)
 {
-	unsigned long nr[NR_LRU_LISTS];
-	unsigned long nr_to_scan;
-	enum lru_list lru;
 	unsigned long nr_reclaimed, nr_scanned;
-	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
-	struct blk_plug plug;
-
-restart:
-	nr_reclaimed = 0;
-	nr_scanned = sc->nr_scanned;
-	get_scan_count(lruvec, sc, nr);
-
-	blk_start_plug(&plug);
-	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
-					nr[LRU_INACTIVE_FILE]) {
-		for_each_evictable_lru(lru) {
-			if (nr[lru]) {
-				nr_to_scan = min_t(unsigned long,
-						   nr[lru], SWAP_CLUSTER_MAX);
-				nr[lru] -= nr_to_scan;
-
-				nr_reclaimed += shrink_list(lru, nr_to_scan,
-							    lruvec, sc);
-			}
-		}
-		/*
-		 * On large memory systems, scan >> priority can become
-		 * really large. This is fine for the starting priority;
-		 * we want to put equal scanning pressure on each zone.
-		 * However, if the VM has a harder time of freeing pages,
-		 * with multiple processes reclaiming pages, the total
-		 * freeing target can get unreasonably large.
-		 */
-		if (nr_reclaimed >= nr_to_reclaim &&
-		    sc->priority < DEF_PRIORITY)
-			break;
-	}
-	blk_finish_plug(&plug);
-	sc->nr_reclaimed += nr_reclaimed;
 
-	/*
-	 * Even if we did not try to evict anon pages at all, we want to
-	 * rebalance the anon lru active/inactive ratio.
-	 */
-	if (inactive_anon_is_low(lruvec))
-		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
-				   sc, LRU_ACTIVE_ANON);
-
-	/* reclaim/compaction might need reclaim to continue */
-	if (should_continue_reclaim(lruvec, nr_reclaimed,
-				    sc->nr_scanned - nr_scanned, sc))
-		goto restart;
+	do {
+		struct mem_cgroup *root = sc->target_mem_cgroup;
+		struct mem_cgroup_reclaim_cookie reclaim = {
+			.zone = zone,
+			.priority = sc->priority,
+		};
+		struct mem_cgroup *memcg;
 
-	throttle_vm_writeout(sc->gfp_mask);
-}
+		nr_reclaimed = sc->nr_reclaimed;
+		nr_scanned = sc->nr_scanned;
 
-static void shrink_zone(struct zone *zone, struct scan_control *sc)
-{
-	struct mem_cgroup *root = sc->target_mem_cgroup;
-	struct mem_cgroup_reclaim_cookie reclaim = {
-		.zone = zone,
-		.priority = sc->priority,
-	};
-	struct mem_cgroup *memcg;
+		memcg = mem_cgroup_iter(root, NULL, &reclaim);
+		do {
+			struct lruvec *lruvec;
 
-	memcg = mem_cgroup_iter(root, NULL, &reclaim);
-	do {
-		struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 
-		shrink_lruvec(lruvec, sc);
+			shrink_lruvec(lruvec, sc);
 
-		/*
-		 * Limit reclaim has historically picked one memcg and
-		 * scanned it with decreasing priority levels until
-		 * nr_to_reclaim had been reclaimed.  This priority
-		 * cycle is thus over after a single memcg.
-		 *
-		 * Direct reclaim and kswapd, on the other hand, have
-		 * to scan all memory cgroups to fulfill the overall
-		 * scan target for the zone.
-		 */
-		if (!global_reclaim(sc)) {
-			mem_cgroup_iter_break(root, memcg);
-			break;
-		}
-		memcg = mem_cgroup_iter(root, memcg, &reclaim);
-	} while (memcg);
+			/*
+			 * Limit reclaim has historically picked one
+			 * memcg and scanned it with decreasing
+			 * priority levels until nr_to_reclaim had
+			 * been reclaimed.  This priority cycle is
+			 * thus over after a single memcg.
+			 *
+			 * Direct reclaim and kswapd, on the other
+			 * hand, have to scan all memory cgroups to
+			 * fulfill the overall scan target for the
+			 * zone.
+			 */
+			if (!global_reclaim(sc)) {
+				mem_cgroup_iter_break(root, memcg);
+				break;
+			}
+			memcg = mem_cgroup_iter(root, memcg, &reclaim);
+		} while (memcg);
+	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
+					 sc->nr_scanned - nr_scanned, sc));
 }
 
 /* Returns true if compaction should go ahead for a high-order request */
-- 
1.7.11.7


* [patch 8/8] mm: reduce rmap overhead for ex-KSM page copies created on swap faults
  2012-12-12 21:43 ` Johannes Weiner
@ 2012-12-12 21:43   ` Johannes Weiner
  0 siblings, 0 replies; 114+ messages in thread
From: Johannes Weiner @ 2012-12-12 21:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

When ex-KSM pages are faulted from swap cache, the fault handler is
not capable of re-establishing anon_vma-spanning KSM pages.  In this
case, a copy of the page is created instead, just like during a COW
break.

These freshly made copies are known to be exclusive to the faulting
VMA, so there is no reason to go looking for them in parent and
sibling processes during rmap operations.

Use page_add_new_anon_rmap() for these copies.  This also puts them on
the proper LRU lists and marks them SwapBacked, so we can drop the
ad-hoc handling in the KSM copy code.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/ksm.c    | 6 ------
 mm/memory.c | 5 ++++-
 2 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 382d930..7275c74 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1590,13 +1590,7 @@ struct page *ksm_does_need_to_copy(struct page *page,
 
 		SetPageDirty(new_page);
 		__SetPageUptodate(new_page);
-		SetPageSwapBacked(new_page);
 		__set_page_locked(new_page);
-
-		if (!mlocked_vma_newpage(vma, new_page))
-			lru_cache_add_lru(new_page, LRU_ACTIVE_ANON);
-		else
-			add_page_to_unevictable_list(new_page);
 	}
 
 	return new_page;
diff --git a/mm/memory.c b/mm/memory.c
index 7653773..726ff11 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3039,7 +3039,10 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	flush_icache_page(vma, page);
 	set_pte_at(mm, address, page_table, pte);
-	do_page_add_anon_rmap(page, vma, address, exclusive);
+	if (swapcache) /* ksm created a completely new copy */
+		page_add_new_anon_rmap(page, vma, address);
+	else
+		do_page_add_anon_rmap(page, vma, address, exclusive);
 	/* It's better to call commit-charge after rmap is established */
 	mem_cgroup_commit_charge_swapin(page, ptr);
 
-- 
1.7.11.7


* Re: [patch 0/8] page reclaim bits
  2012-12-12 21:43 ` Johannes Weiner
@ 2012-12-12 21:50   ` Andrew Morton
  0 siblings, 0 replies; 114+ messages in thread
From: Andrew Morton @ 2012-12-12 21:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, Michal Hocko, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

On Wed, 12 Dec 2012 16:43:32 -0500
Johannes Weiner <hannes@cmpxchg.org> wrote:

> I had these in my queue and on test machines for a while, but they got
> deferred over and over, partly because of the kswapd issues.  I hope
> it's not too late for 3.8, they should be fairly straight forward.

um, that is rather late.

Let's review these promptly and thoroughly, please.  Then we can look
at squeaking at least the simple and/or observably-beneficial ones into
-rc1 or -rc2.


* Re: [patch 1/8] mm: memcg: only evict file pages when we have plenty
  2012-12-12 21:43   ` Johannes Weiner
@ 2012-12-12 21:53     ` Rik van Riel
  0 siblings, 0 replies; 114+ messages in thread
From: Rik van Riel @ 2012-12-12 21:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

On 12/12/2012 04:43 PM, Johannes Weiner wrote:
> dc0422c "mm: vmscan: only evict file pages when we have plenty" makes
> a point of not going for anonymous memory while there is still enough
> inactive cache around.
>
> The check was added only for global reclaim, but it is just as useful
> for memory cgroup reclaim.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>   mm/vmscan.c | 19 ++++++++++---------
>   1 file changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 157bb11..3874dcb 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1671,6 +1671,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>   		denominator = 1;
>   		goto out;
>   	}
> +	/*
> +	 * There is enough inactive page cache, do not reclaim
> +	 * anything from the anonymous working set right now.
> +	 */
> +	if (!inactive_file_is_low(lruvec)) {
> +		fraction[0] = 0;
> +		fraction[1] = 1;
> +		denominator = 1;
> +		goto out;
> +	}
>
>   	anon  = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
>   		get_lru_size(lruvec, LRU_INACTIVE_ANON);
> @@ -1688,15 +1698,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>   			fraction[1] = 0;
>   			denominator = 1;
>   			goto out;
> -		} else if (!inactive_file_is_low_global(zone)) {
> -			/*
> -			 * There is enough inactive page cache, do not
> -			 * reclaim anything from the working set right now.
> -			 */
> -			fraction[0] = 0;
> -			fraction[1] = 1;
> -			denominator = 1;
> -			goto out;
>   		}
>   	}
>
>

I believe the if() block should be moved to AFTER
the check where we make sure we actually have enough
file pages.
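
A rough sketch of the ordering being suggested, pieced together from the
hunks quoted above (the surrounding get_scan_count() context is
abbreviated and may differ in detail):

	if (global_reclaim(sc)) {
		free = zone_page_state(zone, NR_FREE_PAGES);
		if (unlikely(file + free <= high_wmark_pages(zone))) {
			/* too few file pages, force-scan anon */
			fraction[0] = 1;
			fraction[1] = 0;
			denominator = 1;
			goto out;
		}
	}

	/*
	 * Only once we know file cache is not critically low:
	 * plenty of inactive cache means we leave anon alone.
	 */
	if (!inactive_file_is_low(lruvec)) {
		fraction[0] = 0;
		fraction[1] = 1;
		denominator = 1;
		goto out;
	}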


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 2/8] mm: vmscan: disregard swappiness shortly before going OOM
  2012-12-12 21:43   ` Johannes Weiner
@ 2012-12-12 22:01     ` Rik van Riel
  -1 siblings, 0 replies; 114+ messages in thread
From: Rik van Riel @ 2012-12-12 22:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

On 12/12/2012 04:43 PM, Johannes Weiner wrote:
> When a reclaim scanner is doing its final scan before giving up and
> there is swap space available, pay no attention to swappiness
> preference anymore.  Just swap.
>
> Note that this change won't make too big of a difference for general
> reclaim: anonymous pages are already force-scanned when there is only
> very little file cache left, and there very likely isn't when the
> reclaimer enters this final cycle.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 3/8] mm: vmscan: save work scanning (almost) empty LRU lists
  2012-12-12 21:43   ` Johannes Weiner
@ 2012-12-12 22:02     ` Rik van Riel
  -1 siblings, 0 replies; 114+ messages in thread
From: Rik van Riel @ 2012-12-12 22:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

On 12/12/2012 04:43 PM, Johannes Weiner wrote:
> In certain cases (kswapd reclaim, memcg target reclaim), a fixed
> minimum amount of pages is scanned from the LRU lists on each
> iteration, to make progress.
>
> Do not make this minimum bigger than the respective LRU list size,
> however, and save some busy work trying to isolate and reclaim pages
> that are not there.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 4/8] mm: vmscan: clarify LRU balancing close to OOM
  2012-12-12 21:43   ` Johannes Weiner
@ 2012-12-12 22:03     ` Rik van Riel
  -1 siblings, 0 replies; 114+ messages in thread
From: Rik van Riel @ 2012-12-12 22:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

On 12/12/2012 04:43 PM, Johannes Weiner wrote:
> There are currently several inter-LRU balancing heuristics that simply
> get disabled when the reclaimer is at the last reclaim cycle before
> giving up, but the code is quite cumbersome and not really obvious.
>
> Make the heuristics visibly unreachable for the last reclaim cycle.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Nice cleanup!

Reviewed-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 5/8] mm: vmscan: improve comment on low-page cache handling
  2012-12-12 21:43   ` Johannes Weiner
@ 2012-12-12 22:04     ` Rik van Riel
  -1 siblings, 0 replies; 114+ messages in thread
From: Rik van Riel @ 2012-12-12 22:04 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

On 12/12/2012 04:43 PM, Johannes Weiner wrote:
> Fix comment style and elaborate on why anonymous memory is
> force-scanned when file cache runs low.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 6/8] mm: vmscan: clean up get_scan_count()
  2012-12-12 21:43   ` Johannes Weiner
@ 2012-12-12 22:06     ` Rik van Riel
  -1 siblings, 0 replies; 114+ messages in thread
From: Rik van Riel @ 2012-12-12 22:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

On 12/12/2012 04:43 PM, Johannes Weiner wrote:
> Reclaim pressure balance between anon and file pages is calculated
> through a tuple of numerators and a shared denominator.
>
> Exceptional cases that want to force-scan anon or file pages configure
> the numerators and denominator such that one list is preferred, which
> is not necessarily the most obvious way:
>
>      fraction[0] = 1;
>      fraction[1] = 0;
>      denominator = 1;
>      goto out;
>
> Make this easier by making the force-scan cases explicit and use the
> fractionals only in case they are calculated from reclaim history.
>
> And bring the variable declarations/definitions in order.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Rik van Riel <riel@redhat.com>
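
For reference, a sketch of what "making the force-scan cases explicit"
presumably looks like; the SCAN_* names appear later in this thread, but
the details below are guessed for illustration rather than taken from
the patch:

	enum scan_balance {
		SCAN_EQUAL,	/* scan both lists relative to their size */
		SCAN_FRACT,	/* scan proportionally to swappiness/history */
		SCAN_ANON,	/* force-scan the anon lists only */
		SCAN_FILE,	/* force-scan the file lists only */
	};

	/* a force-scan case in get_scan_count() then reads: */
	if (!inactive_file_is_low(lruvec)) {
		scan_balance = SCAN_FILE;
		goto out;
	}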


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 1/8] mm: memcg: only evict file pages when we have plenty
  2012-12-12 21:53     ` Rik van Riel
@ 2012-12-12 22:28       ` Johannes Weiner
  -1 siblings, 0 replies; 114+ messages in thread
From: Johannes Weiner @ 2012-12-12 22:28 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Michal Hocko, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

On Wed, Dec 12, 2012 at 04:53:36PM -0500, Rik van Riel wrote:
> On 12/12/2012 04:43 PM, Johannes Weiner wrote:
> >dc0422c "mm: vmscan: only evict file pages when we have plenty" makes
> >a point of not going for anonymous memory while there is still enough
> >inactive cache around.
> >
> >The check was added only for global reclaim, but it is just as useful
> >for memory cgroup reclaim.
> >
> >Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> >---
> >  mm/vmscan.c | 19 ++++++++++---------
> >  1 file changed, 10 insertions(+), 9 deletions(-)
> >
> >diff --git a/mm/vmscan.c b/mm/vmscan.c
> >index 157bb11..3874dcb 100644
> >--- a/mm/vmscan.c
> >+++ b/mm/vmscan.c
> >@@ -1671,6 +1671,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> >  		denominator = 1;
> >  		goto out;
> >  	}
> >+	/*
> >+	 * There is enough inactive page cache, do not reclaim
> >+	 * anything from the anonymous working set right now.
> >+	 */
> >+	if (!inactive_file_is_low(lruvec)) {
> >+		fraction[0] = 0;
> >+		fraction[1] = 1;
> >+		denominator = 1;
> >+		goto out;
> >+	}
> >
> >  	anon  = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
> >  		get_lru_size(lruvec, LRU_INACTIVE_ANON);
> >@@ -1688,15 +1698,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> >  			fraction[1] = 0;
> >  			denominator = 1;
> >  			goto out;
> >-		} else if (!inactive_file_is_low_global(zone)) {
> >-			/*
> >-			 * There is enough inactive page cache, do not
> >-			 * reclaim anything from the working set right now.
> >-			 */
> >-			fraction[0] = 0;
> >-			fraction[1] = 1;
> >-			denominator = 1;
> >-			goto out;
> >  		}
> >  	}
> >
> >
> 
> I believe the if() block should be moved to AFTER
> the check where we make sure we actually have enough
> file pages.

You are absolutely right, this makes more sense.  Although I'd figure
the impact would be small because if there actually is that little
file cache, it won't be there for long with force-file scanning... :-)

I moved the condition, but it throws conflicts in the rest of the
series.  Will re-run tests, wait for Michal and Mel, then resend.

Thanks, Rik!

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 7/8] mm: vmscan: compaction works against zones, not lruvecs
  2012-12-12 21:43   ` Johannes Weiner
@ 2012-12-12 22:31     ` Rik van Riel
  -1 siblings, 0 replies; 114+ messages in thread
From: Rik van Riel @ 2012-12-12 22:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

On 12/12/2012 04:43 PM, Johannes Weiner wrote:
> The restart logic for when reclaim operates back to back with
> compaction is currently applied on the lruvec level.  But this does
> not make sense, because the container of interest for compaction is a
> zone as a whole, not the zone pages that are part of a certain memory
> cgroup.
>
> Negative impact is bounded.  For one, the code checks that the lruvec
> has enough reclaim candidates, so it does not risk getting stuck on a
> condition that can not be fulfilled.  And the unfairness of hammering
> on one particular memory cgroup to make progress in a zone will be
> amortized by the round robin manner in which reclaim goes through the
> memory cgroups.  Still, this can lead to unnecessary allocation
> latencies when the code elects to restart on a hard to reclaim or
> small group when there are other, more reclaimable groups in the zone.
>
> Move this logic to the zone level and restart reclaim for all memory
> cgroups in a zone when compaction requires more free pages from it.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Rik van Riel <riel@redhat.com>
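
A simplified sketch of the structure described above, with the restart
decision wrapping the whole memcg round-robin for the zone instead of
sitting at the lruvec level; iterator setup and other details are elided
and this is not the literal patch:

	static void shrink_zone(struct zone *zone, struct scan_control *sc)
	{
		unsigned long nr_reclaimed, nr_scanned;

		do {
			struct mem_cgroup *memcg;

			nr_reclaimed = sc->nr_reclaimed;
			nr_scanned = sc->nr_scanned;

			/* round-robin over memcgs with pages in this zone */
			memcg = mem_cgroup_iter(root, NULL, &reclaim);
			do {
				struct lruvec *lruvec;

				lruvec = mem_cgroup_zone_lruvec(zone, memcg);
				shrink_lruvec(lruvec, sc);

				memcg = mem_cgroup_iter(root, memcg, &reclaim);
			} while (memcg);

			/* restart is decided on zone-wide progress */
		} while (should_continue_reclaim(zone,
						 sc->nr_reclaimed - nr_reclaimed,
						 sc->nr_scanned - nr_scanned,
						 sc));
	}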


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 8/8] mm: reduce rmap overhead for ex-KSM page copies created on swap faults
  2012-12-12 21:43   ` Johannes Weiner
@ 2012-12-12 22:34     ` Rik van Riel
  -1 siblings, 0 replies; 114+ messages in thread
From: Rik van Riel @ 2012-12-12 22:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

On 12/12/2012 04:43 PM, Johannes Weiner wrote:
> When ex-KSM pages are faulted from swap cache, the fault handler is
> not capable of re-establishing anon_vma-spanning KSM pages.  In this
> case, a copy of the page is created instead, just like during a COW
> break.
>
> These freshly made copies are known to be exclusive to the faulting
> VMA and there is no reason to go look for this page in parent and
> sibling processes during rmap operations.
>
> Use page_add_new_anon_rmap() for these copies.  This also puts them on
> the proper LRU lists and marks them SwapBacked, so we can get rid of
> doing this ad-hoc in the KSM copy code.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 1/8] mm: memcg: only evict file pages when we have plenty
  2012-12-12 21:43   ` Johannes Weiner
@ 2012-12-13  5:34     ` Simon Jeons
  -1 siblings, 0 replies; 114+ messages in thread
From: Simon Jeons @ 2012-12-13  5:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Michal Hocko, Mel Gorman,
	Hugh Dickins, linux-mm, linux-kernel

On Wed, 2012-12-12 at 16:43 -0500, Johannes Weiner wrote:
> dc0422c "mm: vmscan: only evict file pages when we have plenty" makes

Can't find dc0422c.

> a point of not going for anonymous memory while there is still enough
> inactive cache around.
> 
> The check was added only for global reclaim, but it is just as useful
> for memory cgroup reclaim.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/vmscan.c | 19 ++++++++++---------
>  1 file changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 157bb11..3874dcb 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1671,6 +1671,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  		denominator = 1;
>  		goto out;
>  	}
> +	/*
> +	 * There is enough inactive page cache, do not reclaim
> +	 * anything from the anonymous working set right now.
> +	 */
> +	if (!inactive_file_is_low(lruvec)) {
> +		fraction[0] = 0;
> +		fraction[1] = 1;
> +		denominator = 1;
> +		goto out;
> +	}
>  
>  	anon  = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
>  		get_lru_size(lruvec, LRU_INACTIVE_ANON);
> @@ -1688,15 +1698,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  			fraction[1] = 0;
>  			denominator = 1;
>  			goto out;
> -		} else if (!inactive_file_is_low_global(zone)) {
> -			/*
> -			 * There is enough inactive page cache, do not
> -			 * reclaim anything from the working set right now.
> -			 */
> -			fraction[0] = 0;
> -			fraction[1] = 1;
> -			denominator = 1;
> -			goto out;
>  		}
>  	}
>  



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 1/8] mm: memcg: only evict file pages when we have plenty
  2012-12-12 21:53     ` Rik van Riel
@ 2012-12-13  5:36       ` Simon Jeons
  -1 siblings, 0 replies; 114+ messages in thread
From: Simon Jeons @ 2012-12-13  5:36 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Johannes Weiner, Andrew Morton, Michal Hocko, Mel Gorman,
	Hugh Dickins, linux-mm, linux-kernel

On Wed, 2012-12-12 at 16:53 -0500, Rik van Riel wrote:
> On 12/12/2012 04:43 PM, Johannes Weiner wrote:
> > dc0422c "mm: vmscan: only evict file pages when we have plenty" makes
> > a point of not going for anonymous memory while there is still enough
> > inactive cache around.
> >
> > The check was added only for global reclaim, but it is just as useful
> > for memory cgroup reclaim.
> >
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >   mm/vmscan.c | 19 ++++++++++---------
> >   1 file changed, 10 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 157bb11..3874dcb 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1671,6 +1671,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> >   		denominator = 1;
> >   		goto out;
> >   	}
> > +	/*
> > +	 * There is enough inactive page cache, do not reclaim
> > +	 * anything from the anonymous working set right now.
> > +	 */
> > +	if (!inactive_file_is_low(lruvec)) {
> > +		fraction[0] = 0;
> > +		fraction[1] = 1;
> > +		denominator = 1;
> > +		goto out;
> > +	}
> >
> >   	anon  = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
> >   		get_lru_size(lruvec, LRU_INACTIVE_ANON);
> > @@ -1688,15 +1698,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> >   			fraction[1] = 0;
> >   			denominator = 1;
> >   			goto out;
> > -		} else if (!inactive_file_is_low_global(zone)) {
> > -			/*
> > -			 * There is enough inactive page cache, do not
> > -			 * reclaim anything from the working set right now.
> > -			 */
> > -			fraction[0] = 0;
> > -			fraction[1] = 1;
> > -			denominator = 1;
> > -			goto out;
> >   		}
> >   	}
> >
> >
> 
> I believe the if() block should be moved to AFTER
> the check where we make sure we actually have enough
> file pages.

Where is the check that we have enough file pages?
if (unlikely(file + free <= high_wmark_pages(zone))), correct?




^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 2/8] mm: vmscan: disregard swappiness shortly before going OOM
  2012-12-12 21:43   ` Johannes Weiner
@ 2012-12-13  5:56     ` Simon Jeons
  -1 siblings, 0 replies; 114+ messages in thread
From: Simon Jeons @ 2012-12-13  5:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Michal Hocko, Mel Gorman,
	Hugh Dickins, linux-mm, linux-kernel

On Wed, 2012-12-12 at 16:43 -0500, Johannes Weiner wrote:
> When a reclaim scanner is doing its final scan before giving up and
> there is swap space available, pay no attention to swappiness
> preference anymore.  Just swap.
> 

I'm confused. If it's the final scan and there is still swap space
available, why nr[lru] = div64_u64(scan * fraction[file], denominator);
instead of nr[lru] = scan; ?

> Note that this change won't make too big of a difference for general
> reclaim: anonymous pages are already force-scanned when there is only
> very little file cache left, and there very likely isn't when the
> reclaimer enters this final cycle.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/vmscan.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3874dcb..6e53446 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1751,7 +1751,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  		unsigned long scan;
>  
>  		scan = get_lru_size(lruvec, lru);
> -		if (sc->priority || noswap || !vmscan_swappiness(sc)) {
> +		if (sc->priority || noswap) {
>  			scan >>= sc->priority;
>  			if (!scan && force_scan)
>  				scan = SWAP_CLUSTER_MAX;
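
One reading of the quoted block for the case asked about above
(sc->priority == 0 with swap space available, i.e. noswap == 0):

	scan = get_lru_size(lruvec, lru);
	if (sc->priority || noswap) {	/* 0 || 0: block is skipped */
		...
	}
	nr[lru] = scan;			/* full LRU size, no fraction applied */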



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 1/8] mm: memcg: only evict file pages when we have plenty
  2012-12-12 22:28       ` Johannes Weiner
@ 2012-12-13 10:07         ` Mel Gorman
  -1 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2012-12-13 10:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, Andrew Morton, Michal Hocko, Hugh Dickins,
	linux-mm, linux-kernel

On Wed, Dec 12, 2012 at 05:28:44PM -0500, Johannes Weiner wrote:
> On Wed, Dec 12, 2012 at 04:53:36PM -0500, Rik van Riel wrote:
> > On 12/12/2012 04:43 PM, Johannes Weiner wrote:
> > >dc0422c "mm: vmscan: only evict file pages when we have plenty" makes

You are using some internal tree for that commit. Now that it's upstream
it is commit e9868505987a03a26a3979f27b82911ccc003752.

> > >a point of not going for anonymous memory while there is still enough
> > >inactive cache around.
> > >
> > >The check was added only for global reclaim, but it is just as useful
> > >for memory cgroup reclaim.
> > >
> > >Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > >---
> > >  mm/vmscan.c | 19 ++++++++++---------
> > >  1 file changed, 10 insertions(+), 9 deletions(-)
> > >
> > >diff --git a/mm/vmscan.c b/mm/vmscan.c
> > >index 157bb11..3874dcb 100644
> > >--- a/mm/vmscan.c
> > >+++ b/mm/vmscan.c
> > >@@ -1671,6 +1671,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> > >  		denominator = 1;
> > >  		goto out;
> > >  	}
> > >+	/*
> > >+	 * There is enough inactive page cache, do not reclaim
> > >+	 * anything from the anonymous working set right now.
> > >+	 */
> > >+	if (!inactive_file_is_low(lruvec)) {
> > >+		fraction[0] = 0;
> > >+		fraction[1] = 1;
> > >+		denominator = 1;
> > >+		goto out;
> > >+	}
> > >
> > >  	anon  = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
> > >  		get_lru_size(lruvec, LRU_INACTIVE_ANON);
> > >@@ -1688,15 +1698,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> > >  			fraction[1] = 0;
> > >  			denominator = 1;
> > >  			goto out;
> > >-		} else if (!inactive_file_is_low_global(zone)) {
> > >-			/*
> > >-			 * There is enough inactive page cache, do not
> > >-			 * reclaim anything from the working set right now.
> > >-			 */
> > >-			fraction[0] = 0;
> > >-			fraction[1] = 1;
> > >-			denominator = 1;
> > >-			goto out;
> > >  		}
> > >  	}
> > >
> > >
> > 
> > I believe the if() block should be moved to AFTER
> > the check where we make sure we actually have enough
> > file pages.
> 
> You are absolutely right, this makes more sense.  Although I'd figure
> the impact would be small because if there actually is that little
> file cache, it won't be there for long with force-file scanning... :-)
> 

Does it actually make sense? Let's take the global reclaim case.

low_file         == if (unlikely(file + free <= high_wmark_pages(zone)))
inactive_is_high == if (!inactive_file_is_low_global(zone))

Current
  low_file	inactive_is_high	force reclaim anon
  low_file	!inactive_is_high	force reclaim anon
  !low_file	inactive_is_high	force reclaim file
  !low_file	!inactive_is_high	normal split

Your patch

  low_file	inactive_is_high	force reclaim anon
  low_file	!inactive_is_high	force reclaim anon
  !low_file	inactive_is_high	force reclaim file
  !low_file	!inactive_is_high	normal split

However, if you move the inactive_file_is_low check down you get

Moving the check
  low_file	inactive_is_high	force reclaim file
  low_file	!inactive_is_high	force reclaim anon
  !low_file	inactive_is_high	force reclaim file
  !low_file	!inactive_is_high	normal split

There is a small but important change in results. I easily could have made
a mistake so double check.

I'm not being super thorough because I'm not quite sure this is the right
patch if the motivation is for memcg to use the same logic. Instead of
moving this if, why do you not estimate "free" for the memcg based on the
hard limit and current usage? 
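
A hypothetical sketch of that alternative; the helper names below are
made up for illustration and are not the actual memcg API:

	if (!global_reclaim(sc)) {
		/* hypothetical helpers, illustration only */
		unsigned long limit = memcg_limit_pages(sc->target_mem_cgroup);
		unsigned long usage = memcg_usage_pages(sc->target_mem_cgroup);
		unsigned long headroom = limit > usage ? limit - usage : 0;

		if (unlikely(file + headroom <= high_wmark_pages(zone))) {
			/* treat it like the global low-file case */
			fraction[0] = 1;
			fraction[1] = 0;
			denominator = 1;
			goto out;
		}
	}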

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 2/8] mm: vmscan: disregard swappiness shortly before going OOM
  2012-12-12 21:43   ` Johannes Weiner
@ 2012-12-13 10:34     ` Mel Gorman
  -1 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2012-12-13 10:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Michal Hocko, Hugh Dickins,
	linux-mm, linux-kernel

On Wed, Dec 12, 2012 at 04:43:34PM -0500, Johannes Weiner wrote:
> When a reclaim scanner is doing its final scan before giving up and
> there is swap space available, pay no attention to swappiness
> preference anymore.  Just swap.
> 
> Note that this change won't make too big of a difference for general
> reclaim: anonymous pages are already force-scanned when there is only
> very little file cache left, and there very likely isn't when the
> reclaimer enters this final cycle.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Ok, I see the motivation for your patch but is the block inside still
wrong for what you want? After your patch the block looks like this

                if (sc->priority || noswap) {
                        scan >>= sc->priority;
                        if (!scan && force_scan)
                                scan = SWAP_CLUSTER_MAX;
                        scan = div64_u64(scan * fraction[file], denominator);
                }

if sc->priority == 0 and swappiness==0 then you enter this block but
fraction[0] for anonymous pages will also be 0 and because of the ordering
of statements there, scan will be

scan = scan * 0 / denominator

so you are still not reclaiming anonymous pages in the swappiness=0
case. What did I miss?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 3/8] mm: vmscan: save work scanning (almost) empty LRU lists
  2012-12-12 21:43   ` Johannes Weiner
@ 2012-12-13 10:41     ` Mel Gorman
  -1 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2012-12-13 10:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Michal Hocko, Hugh Dickins,
	linux-mm, linux-kernel

On Wed, Dec 12, 2012 at 04:43:35PM -0500, Johannes Weiner wrote:
> In certain cases (kswapd reclaim, memcg target reclaim), a fixed
> minimum amount of pages is scanned from the LRU lists on each
> iteration, to make progress.
> 
> Do not make this minimum bigger than the respective LRU list size,
> however, and save some busy work trying to isolate and reclaim pages
> that are not there.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

This looks like a corner case where the LRU size would have to be smaller
than SWAP_CLUSTER_MAX. Is that common enough to care? It looks correct,
I'm just curious.
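
Presumably the clamp amounts to something along these lines (a
paraphrase of the description, not a quote of the patch):

	size = get_lru_size(lruvec, lru);
	scan = size >> sc->priority;
	if (!scan && force_scan)
		/* never ask for more pages than the list holds */
		scan = min(size, SWAP_CLUSTER_MAX);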

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 4/8] mm: vmscan: clarify LRU balancing close to OOM
  2012-12-12 21:43   ` Johannes Weiner
@ 2012-12-13 10:46     ` Mel Gorman
  -1 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2012-12-13 10:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Michal Hocko, Hugh Dickins,
	linux-mm, linux-kernel

On Wed, Dec 12, 2012 at 04:43:36PM -0500, Johannes Weiner wrote:
> There are currently several inter-LRU balancing heuristics that simply
> get disabled when the reclaimer is at the last reclaim cycle before
> giving up, but the code is quite cumbersome and not really obvious.
> 
> Make the heuristics visibly unreachable for the last reclaim cycle.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 5/8] mm: vmscan: improve comment on low-page cache handling
  2012-12-12 21:43   ` Johannes Weiner
@ 2012-12-13 10:47     ` Mel Gorman
  -1 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2012-12-13 10:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Michal Hocko, Hugh Dickins,
	linux-mm, linux-kernel

On Wed, Dec 12, 2012 at 04:43:37PM -0500, Johannes Weiner wrote:
> Fix comment style and elaborate on why anonymous memory is
> force-scanned when file cache runs low.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 6/8] mm: vmscan: clean up get_scan_count()
  2012-12-12 21:43   ` Johannes Weiner
@ 2012-12-13 11:07     ` Mel Gorman
  -1 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2012-12-13 11:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Michal Hocko, Hugh Dickins,
	linux-mm, linux-kernel

On Wed, Dec 12, 2012 at 04:43:38PM -0500, Johannes Weiner wrote:
> Reclaim pressure balance between anon and file pages is calculated
> through a tuple of numerators and a shared denominator.
> 
> Exceptional cases that want to force-scan anon or file pages configure
> the numerators and denominator such that one list is preferred, which
> is not necessarily the most obvious way:
> 
>     fraction[0] = 1;
>     fraction[1] = 0;
>     denominator = 1;
>     goto out;
> 
> Make this easier by making the force-scan cases explicit and use the
> fractionals only in case they are calculated from reclaim history.
> 
> And bring the variable declarations/definitions in order.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Mel Gorman <mgorman@suse.de>

The if at the end looks like it should have been a switch maybe?

switch (scan_balance) {
case SCAN_EQUAL:
	/* Scan relative to size */
	break;
case SCAN_FRACT:
	/* Scan proportional to swappiness */
	scan = div64_u64(scan * fraction[file], denominator);
	break;
case SCAN_FILE:
case SCAN_ANON:
	/* Scan only file or only anon LRU */
	if ((scan_balance == SCAN_FILE) != file)
		scan = 0;
	break;
default:
	/* Look ma, no brain */
	BUG();
}

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 7/8] mm: vmscan: compaction works against zones, not lruvecs
  2012-12-12 21:43   ` Johannes Weiner
@ 2012-12-13 11:12     ` Mel Gorman
  -1 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2012-12-13 11:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Michal Hocko, Hugh Dickins,
	linux-mm, linux-kernel

On Wed, Dec 12, 2012 at 04:43:39PM -0500, Johannes Weiner wrote:
> The restart logic for when reclaim operates back to back with
> compaction is currently applied on the lruvec level.  But this does
> not make sense, because the container of interest for compaction is a
> zone as a whole, not the zone pages that are part of a certain memory
> cgroup.
> 
> Negative impact is bounded.  For one, the code checks that the lruvec
> has enough reclaim candidates, so it does not risk getting stuck on a
> condition that can not be fulfilled.  And the unfairness of hammering
> on one particular memory cgroup to make progress in a zone will be
> amortized by the round robin manner in which reclaim goes through the
> memory cgroups.  Still, this can lead to unnecessary allocation
> latencies when the code elects to restart on a hard to reclaim or
> small group when there are other, more reclaimable groups in the zone.
> 
> Move this logic to the zone level and restart reclaim for all memory
> cgroups in a zone when compaction requires more free pages from it.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 1/8] mm: memcg: only evict file pages when we have plenty
  2012-12-13 10:07         ` Mel Gorman
@ 2012-12-13 14:44           ` Mel Gorman
  -1 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2012-12-13 14:44 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, Andrew Morton, Michal Hocko, Hugh Dickins,
	linux-mm, linux-kernel

On Thu, Dec 13, 2012 at 10:07:04AM +0000, Mel Gorman wrote:
> On Wed, Dec 12, 2012 at 05:28:44PM -0500, Johannes Weiner wrote:
> > On Wed, Dec 12, 2012 at 04:53:36PM -0500, Rik van Riel wrote:
> > > On 12/12/2012 04:43 PM, Johannes Weiner wrote:
> > > >dc0422c "mm: vmscan: only evict file pages when we have plenty" makes
> 
> You are using some internal tree for that commit. Now that it's upstream
> it is commit e9868505987a03a26a3979f27b82911ccc003752.
> 
> > > >a point of not going for anonymous memory while there is still enough
> > > >inactive cache around.
> > > >
> > > >The check was added only for global reclaim, but it is just as useful
> > > >for memory cgroup reclaim.
> > > >
> > > >Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > >---
> > > >  mm/vmscan.c | 19 ++++++++++---------
> > > >  1 file changed, 10 insertions(+), 9 deletions(-)
> > > >
> > > <SNIP>
> > > 
> > > I believe the if() block should be moved to AFTER
> > > the check where we make sure we actually have enough
> > > file pages.
> > 
> > You are absolutely right, this makes more sense.  Although I'd figure
> > the impact would be small because if there actually is that little
> > file cache, it won't be there for long with force-file scanning... :-)
> > 
> 
> Does it actually make sense? Lets take the global reclaim case.
> 
> <stupidity snipped>

I made a stupid mistake that Michal Hocko pointed out to me. The goto
out means that it should be fine either way.

> I'm not being super thorough because I'm not quite sure this is the right
> patch if the motivation is for memcg to use the same logic. Instead of
> moving this if, why do you not estimate "free" for the memcg based on the
> hard limit and current usage? 
> 

I'm still curious about this part.
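
For illustration, the kind of estimate I mean could look roughly like
this (the two memcg helpers are made up, just to show the shape of the
idea; this is not a tested patch):

	if (global_reclaim(sc))
		free = zone_page_state(zone, NR_FREE_PAGES);
	else
		/* hypothetical helpers: pages left until the hard limit */
		free = memcg_limit_pages(sc->target_mem_cgroup) -
		       memcg_usage_pages(sc->target_mem_cgroup);

	if (unlikely(file + free <= high_wmark_pages(zone))) {
		/* very little cache left, force-scan anon */
		fraction[0] = 1;
		fraction[1] = 0;
		denominator = 1;
		goto out;
	}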

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 1/8] mm: memcg: only evict file pages when we have plenty
  2012-12-12 22:28       ` Johannes Weiner
@ 2012-12-13 14:55         ` Michal Hocko
  -1 siblings, 0 replies; 114+ messages in thread
From: Michal Hocko @ 2012-12-13 14:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, Andrew Morton, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

On Wed 12-12-12 17:28:44, Johannes Weiner wrote:
> On Wed, Dec 12, 2012 at 04:53:36PM -0500, Rik van Riel wrote:
> > On 12/12/2012 04:43 PM, Johannes Weiner wrote:
> > >dc0422c "mm: vmscan: only evict file pages when we have plenty" makes
> > >a point of not going for anonymous memory while there is still enough
> > >inactive cache around.
> > >
> > >The check was added only for global reclaim, but it is just as useful
> > >for memory cgroup reclaim.
> > >
> > >Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > >---
> > >  mm/vmscan.c | 19 ++++++++++---------
> > >  1 file changed, 10 insertions(+), 9 deletions(-)
> > >
> > >diff --git a/mm/vmscan.c b/mm/vmscan.c
> > >index 157bb11..3874dcb 100644
> > >--- a/mm/vmscan.c
> > >+++ b/mm/vmscan.c
> > >@@ -1671,6 +1671,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> > >  		denominator = 1;
> > >  		goto out;
> > >  	}
> > >+	/*
> > >+	 * There is enough inactive page cache, do not reclaim
> > >+	 * anything from the anonymous working set right now.
> > >+	 */
> > >+	if (!inactive_file_is_low(lruvec)) {
> > >+		fraction[0] = 0;
> > >+		fraction[1] = 1;
> > >+		denominator = 1;
> > >+		goto out;
> > >+	}
> > >
> > >  	anon  = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
> > >  		get_lru_size(lruvec, LRU_INACTIVE_ANON);
> > >@@ -1688,15 +1698,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> > >  			fraction[1] = 0;
> > >  			denominator = 1;
> > >  			goto out;
> > >-		} else if (!inactive_file_is_low_global(zone)) {
> > >-			/*
> > >-			 * There is enough inactive page cache, do not
> > >-			 * reclaim anything from the working set right now.
> > >-			 */
> > >-			fraction[0] = 0;
> > >-			fraction[1] = 1;
> > >-			denominator = 1;
> > >-			goto out;
> > >  		}
> > >  	}
> > >
> > >
> > 
> > I believe the if() block should be moved to AFTER
> > the check where we make sure we actually have enough
> > file pages.
> 
> You are absolutely right, this makes more sense.  Although I'd figure
> the impact would be small because if there actually is that little
> file cache, it won't be there for long with force-file scanning... :-)

Yes, I think that the result would be worse (more swapping), so the
change can only help.
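
That is, the reordered flow from the patch would roughly be (sketch
only, based on the two hunks quoted above):

	if (global_reclaim(sc)) {
		free = zone_page_state(zone, NR_FREE_PAGES);
		if (unlikely(file + free <= high_wmark_pages(zone))) {
			/* very little cache left, force-scan anon */
			fraction[0] = 1;
			fraction[1] = 0;
			denominator = 1;
			goto out;
		}
	}

	/*
	 * Only checked once we know there is a useful amount of file
	 * cache at all: plenty of inactive cache means we can leave
	 * the anonymous working set alone.
	 */
	if (!inactive_file_is_low(lruvec)) {
		fraction[0] = 0;
		fraction[1] = 1;
		denominator = 1;
		goto out;
	}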

> I moved the condition, but it throws conflicts in the rest of the
> series.  Will re-run tests, wait for Michal and Mel, then resend.

Yes, the patch makes sense for memcg as well. I guess you have tested
this primarily with memcg. Do you have any numbers? It would be nice to
put them into the changelog if you have any (it should help reduce
swapping with a heavy streaming IO load).

Acked-by: Michal Hocko <mhocko@suse.cz>
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 2/8] mm: vmscan: disregard swappiness shortly before going OOM
  2012-12-13 10:34     ` Mel Gorman
@ 2012-12-13 15:29       ` Michal Hocko
  -1 siblings, 0 replies; 114+ messages in thread
From: Michal Hocko @ 2012-12-13 15:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Andrew Morton, Rik van Riel, Hugh Dickins,
	linux-mm, linux-kernel

On Thu 13-12-12 10:34:20, Mel Gorman wrote:
> On Wed, Dec 12, 2012 at 04:43:34PM -0500, Johannes Weiner wrote:
> > When a reclaim scanner is doing its final scan before giving up and
> > there is swap space available, pay no attention to swappiness
> > preference anymore.  Just swap.
> > 
> > Note that this change won't make too big of a difference for general
> > reclaim: anonymous pages are already force-scanned when there is only
> > very little file cache left, and there very likely isn't when the
> > reclaimer enters this final cycle.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Ok, I see the motivation for your patch but is the block inside still
> wrong for what you want? After your patch the block looks like this
> 
>                 if (sc->priority || noswap) {
>                         scan >>= sc->priority;
>                         if (!scan && force_scan)
>                                 scan = SWAP_CLUSTER_MAX;
>                         scan = div64_u64(scan * fraction[file], denominator);
>                 }
> 
> if sc->priority == 0 and swappiness==0 then you enter this block but
> fraction[0] for anonymous pages will also be 0 and because of the ordering
> of statements there, scan will be
> 
> scan = scan * 0 / denominator
> 
> so you are still not reclaiming anonymous pages in the swappiness=0
> case. What did I miss?

Yes, now that you have mentioned it I realize that it really doesn't
make any sense. fraction[0] is _always_ 0 for swappiness==0, so we just
put more pressure on the file LRUs. This sounds like a misuse of
swappiness. All of this was introduced with fe35004f (mm: avoid
swapping out with swappiness==0).

I think that removing the swappiness check makes sense, but I am not
sure the patch does what the changelog says. It should rather have said
that checking swappiness doesn't make any sense for small LRUs.
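
For reference, this is roughly where the zero comes from in
get_scan_count():

	anon_prio = vmscan_swappiness(sc);	/* 0 for swappiness==0 */
	file_prio = 200 - anon_prio;

	ap = anon_prio * (reclaim_stat->recent_scanned[0] + 1);
	ap /= reclaim_stat->recent_rotated[0] + 1;

	fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
	fp /= reclaim_stat->recent_rotated[1] + 1;

	fraction[0] = ap;	/* always 0, so the anon scan goal ends up 0 */
	fraction[1] = fp;
	denominator = ap + fp + 1;
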
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 3/8] mm: vmscan: save work scanning (almost) empty LRU lists
  2012-12-12 21:43   ` Johannes Weiner
@ 2012-12-13 15:43     ` Michal Hocko
  -1 siblings, 0 replies; 114+ messages in thread
From: Michal Hocko @ 2012-12-13 15:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

On Wed 12-12-12 16:43:35, Johannes Weiner wrote:
> In certain cases (kswapd reclaim, memcg target reclaim), a fixed
> minimum amount of pages is scanned from the LRU lists on each
> iteration, to make progress.
> 
> Do not make this minimum bigger than the respective LRU list size,
> however, and save some busy work trying to isolate and reclaim pages
> that are not there.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Hmm, shrink_lruvec would do:
	nr_to_scan = min_t(unsigned long,
			   nr[lru], SWAP_CLUSTER_MAX);
	nr[lru] -= nr_to_scan;
and isolate_lru_pages does
	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++)
so it shouldn't matter and we shouldn't do any additional loops, right?

Anyway, it would be better if get_scan_count didn't ask for more than is
available.
Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
>  include/linux/swap.h |  2 +-
>  mm/vmscan.c          | 10 ++++++----
>  2 files changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 68df9c1..8c66486 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -156,7 +156,7 @@ enum {
>  	SWP_SCANNING	= (1 << 8),	/* refcount in scan_swap_map */
>  };
>  
> -#define SWAP_CLUSTER_MAX 32
> +#define SWAP_CLUSTER_MAX 32UL
>  #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
>  
>  /*
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6e53446..1763e79 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1748,15 +1748,17 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  out:
>  	for_each_evictable_lru(lru) {
>  		int file = is_file_lru(lru);
> +		unsigned long size;
>  		unsigned long scan;
>  
> -		scan = get_lru_size(lruvec, lru);
> +		size = get_lru_size(lruvec, lru);
+		size = scan = get_lru_size(lruvec, lru);

>  		if (sc->priority || noswap) {
> -			scan >>= sc->priority;
> +			scan = size >> sc->priority;
>  			if (!scan && force_scan)
> -				scan = SWAP_CLUSTER_MAX;
> +				scan = min(size, SWAP_CLUSTER_MAX);
>  			scan = div64_u64(scan * fraction[file], denominator);
> -		}
> +		} else
> +			scan = size;

And this else branch is not necessary then, but that is a total nit.

>  		nr[lru] = scan;
>  	}
>  }
> -- 
> 1.7.11.7
> 

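Put together, the loop body with both of the above applied would read
something like this (untested):

	for_each_evictable_lru(lru) {
		int file = is_file_lru(lru);
		unsigned long size, scan;

		size = scan = get_lru_size(lruvec, lru);
		if (sc->priority || noswap) {
			scan = size >> sc->priority;
			if (!scan && force_scan)
				scan = min(size, SWAP_CLUSTER_MAX);
			scan = div64_u64(scan * fraction[file], denominator);
		}
		nr[lru] = scan;
	}
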
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 2/8] mm: vmscan: disregard swappiness shortly before going OOM
  2012-12-13 15:29       ` Michal Hocko
@ 2012-12-13 16:05         ` Michal Hocko
  -1 siblings, 0 replies; 114+ messages in thread
From: Michal Hocko @ 2012-12-13 16:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Andrew Morton, Rik van Riel, Hugh Dickins,
	linux-mm, linux-kernel

On Thu 13-12-12 16:29:59, Michal Hocko wrote:
> On Thu 13-12-12 10:34:20, Mel Gorman wrote:
> > On Wed, Dec 12, 2012 at 04:43:34PM -0500, Johannes Weiner wrote:
> > > When a reclaim scanner is doing its final scan before giving up and
> > > there is swap space available, pay no attention to swappiness
> > > preference anymore.  Just swap.
> > > 
> > > Note that this change won't make too big of a difference for general
> > > reclaim: anonymous pages are already force-scanned when there is only
> > > very little file cache left, and there very likely isn't when the
> > > reclaimer enters this final cycle.
> > > 
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > Ok, I see the motivation for your patch but is the block inside still
> > wrong for what you want? After your patch the block looks like this
> > 
> >                 if (sc->priority || noswap) {
> >                         scan >>= sc->priority;
> >                         if (!scan && force_scan)
> >                                 scan = SWAP_CLUSTER_MAX;
> >                         scan = div64_u64(scan * fraction[file], denominator);
> >                 }
> > 
> > if sc->priority == 0 and swappiness==0 then you enter this block but
> > fraction[0] for anonymous pages will also be 0 and because of the ordering
> > of statements there, scan will be
> > 
> > scan = scan * 0 / denominator
> > 
> > so you are still not reclaiming anonymous pages in the swappiness=0
> > case. What did I miss?
> 
> Yes, now that you have mentioned that I realized that it really doesn't
> make any sense. fraction[0] is _always_ 0 for swappiness==0. So we just
> made a bigger pressure on file LRUs. So this sounds like a misuse of the
> swappiness. This all has been introduced with fe35004f (mm: avoid
> swapping out with swappiness==0).
> 
> I think that removing swappiness check make sense but I am not sure it
> does what the changelog says. It should have said that checking
> swappiness doesn't make any sense for small LRUs.

Bahh, wait a moment. Now I remember why the check made sense, especially
for memcg.
It made "don't swap _at all_ for swappiness==0" real - you would even
accept an OOM rather than swap. Maybe this is OK for the global case,
because noswap would save you there (assuming that somebody who doesn't
want to swap at all has no swap configured, so swappiness doesn't play
such a big role), but for memcg you really might want to prevent
swapping - not everybody has the memcg swap extension enabled, and
swappiness is handy then.
So I am not sure this is actually what we want. I need to think about it.
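
For reference, the check in question is the swappiness term in the final
scaling condition; before and after the patch it reads roughly:

	/* before (since fe35004f): */
	if (sc->priority || noswap || !vmscan_swappiness(sc)) {
		/* scale by fraction[]; the anon fraction is 0 for swappiness==0 */
	}

	/* after the patch: */
	if (sc->priority || noswap) {
		/* at priority 0, scan the full list size, anon included */
	}
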
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 5/8] mm: vmscan: improve comment on low-page cache handling
  2012-12-12 21:43   ` Johannes Weiner
@ 2012-12-13 16:07     ` Michal Hocko
  -1 siblings, 0 replies; 114+ messages in thread
From: Michal Hocko @ 2012-12-13 16:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

On Wed 12-12-12 16:43:37, Johannes Weiner wrote:
> Fix comment style and elaborate on why anonymous memory is
> force-scanned when file cache runs low.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

yes, much better
Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/vmscan.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5e1beed..05475e1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1697,13 +1697,15 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  	file  = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
>  		get_lru_size(lruvec, LRU_INACTIVE_FILE);
>  
> +	/*
> +	 * If it's foreseeable that reclaiming the file cache won't be
> +	 * enough to get the zone back into a desirable shape, we have
> +	 * to swap.  Better start now and leave the - probably heavily
> +	 * thrashing - remaining file pages alone.
> +	 */
>  	if (global_reclaim(sc)) {
> -		free  = zone_page_state(zone, NR_FREE_PAGES);
> +		free = zone_page_state(zone, NR_FREE_PAGES);
>  		if (unlikely(file + free <= high_wmark_pages(zone))) {
> -			/*
> -			 * If we have very few page cache pages, force-scan
> -			 * anon pages.
> -			 */
>  			fraction[0] = 1;
>  			fraction[1] = 0;
>  			denominator = 1;
> -- 
> 1.7.11.7
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 6/8] mm: vmscan: clean up get_scan_count()
  2012-12-12 21:43   ` Johannes Weiner
@ 2012-12-13 16:18     ` Michal Hocko
  -1 siblings, 0 replies; 114+ messages in thread
From: Michal Hocko @ 2012-12-13 16:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

On Wed 12-12-12 16:43:38, Johannes Weiner wrote:
> Reclaim pressure balance between anon and file pages is calculated
> through a tuple of numerators and a shared denominator.
> 
> Exceptional cases that want to force-scan anon or file pages configure
> the numerators and denominator such that one list is preferred, which
> is not necessarily the most obvious way:
> 
>     fraction[0] = 1;
>     fraction[1] = 0;
>     denominator = 1;
>     goto out;
> 
> Make this easier by making the force-scan cases explicit and use the
> fractionals only in case they are calculated from reclaim history.
> 
> And bring the variable declarations/definitions in order.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

I like this.
Reviewed-by: Michal Hocko <mhocko@suse.cz>

[...]
> @@ -1638,14 +1645,15 @@ static int vmscan_swappiness(struct scan_control *sc)
>  static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  			   unsigned long *nr)
>  {
> -	unsigned long anon, file, free;
> +	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> +	u64 fraction[2], uninitialized_var(denominator);
> +	struct zone *zone = lruvec_zone(lruvec);
>  	unsigned long anon_prio, file_prio;
> +	enum scan_balance scan_balance;
> +	unsigned long anon, file, free;
> +	bool force_scan = false;
>  	unsigned long ap, fp;
> -	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> -	u64 fraction[2], denominator;
>  	enum lru_list lru;
> -	bool force_scan = false;
> -	struct zone *zone = lruvec_zone(lruvec);

You really do love trees, don't you :P

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 7/8] mm: vmscan: compaction works against zones, not lruvecs
  2012-12-12 21:43   ` Johannes Weiner
@ 2012-12-13 16:48     ` Michal Hocko
  -1 siblings, 0 replies; 114+ messages in thread
From: Michal Hocko @ 2012-12-13 16:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

On Wed 12-12-12 16:43:39, Johannes Weiner wrote:
> The restart logic for when reclaim operates back to back with
> compaction is currently applied on the lruvec level.  But this does
> not make sense, because the container of interest for compaction is a
> zone as a whole, not the zone pages that are part of a certain memory
> cgroup.
> 
> Negative impact is bounded.  For one, the code checks that the lruvec
> has enough reclaim candidates, so it does not risk getting stuck on a
> condition that can not be fulfilled.  And the unfairness of hammering
> on one particular memory cgroup to make progress in a zone will be
> amortized by the round robin manner in which reclaim goes through the
> memory cgroups.  Still, this can lead to unnecessary allocation
> latencies when the code elects to restart on a hard to reclaim or
> small group when there are other, more reclaimable groups in the zone.
> Move this logic to the zone level and restart reclaim for all memory
> cgroups in a zone when compaction requires more free pages from it.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/vmscan.c | 180 +++++++++++++++++++++++++++++++-----------------------------
>  1 file changed, 92 insertions(+), 88 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e20385a..c9c841d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1782,6 +1782,59 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  	}
>  }
>  
> +/*
> + * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
> + */
> +static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> +{
> +	unsigned long nr[NR_LRU_LISTS];
> +	unsigned long nr_to_scan;
> +	enum lru_list lru;
> +	unsigned long nr_reclaimed = 0;
> +	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> +	struct blk_plug plug;
> +
> +	get_scan_count(lruvec, sc, nr);
> +
> +	blk_start_plug(&plug);
> +	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> +					nr[LRU_INACTIVE_FILE]) {
> +		for_each_evictable_lru(lru) {
> +			if (nr[lru]) {
> +				nr_to_scan = min_t(unsigned long,
> +						   nr[lru], SWAP_CLUSTER_MAX);
> +				nr[lru] -= nr_to_scan;
> +
> +				nr_reclaimed += shrink_list(lru, nr_to_scan,
> +							    lruvec, sc);
> +			}
> +		}
> +		/*
> +		 * On large memory systems, scan >> priority can become
> +		 * really large. This is fine for the starting priority;
> +		 * we want to put equal scanning pressure on each zone.
> +		 * However, if the VM has a harder time of freeing pages,
> +		 * with multiple processes reclaiming pages, the total
> +		 * freeing target can get unreasonably large.
> +		 */
> +		if (nr_reclaimed >= nr_to_reclaim &&
> +		    sc->priority < DEF_PRIORITY)
> +			break;
> +	}
> +	blk_finish_plug(&plug);
> +	sc->nr_reclaimed += nr_reclaimed;
> +
> +	/*
> +	 * Even if we did not try to evict anon pages at all, we want to
> +	 * rebalance the anon lru active/inactive ratio.
> +	 */
> +	if (inactive_anon_is_low(lruvec))
> +		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> +				   sc, LRU_ACTIVE_ANON);
> +
> +	throttle_vm_writeout(sc->gfp_mask);
> +}
> +
>  /* Use reclaim/compaction for costly allocs or under memory pressure */
>  static bool in_reclaim_compaction(struct scan_control *sc)
>  {
> @@ -1800,7 +1853,7 @@ static bool in_reclaim_compaction(struct scan_control *sc)
>   * calls try_to_compact_zone() that it will have enough free pages to succeed.
>   * It will give up earlier than that if there is difficulty reclaiming pages.
>   */
> -static inline bool should_continue_reclaim(struct lruvec *lruvec,
> +static inline bool should_continue_reclaim(struct zone *zone,
>  					unsigned long nr_reclaimed,
>  					unsigned long nr_scanned,
>  					struct scan_control *sc)
> @@ -1840,15 +1893,15 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
>  	 * inactive lists are large enough, continue reclaiming
>  	 */
>  	pages_for_compaction = (2UL << sc->order);
> -	inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
> +	inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
>  	if (nr_swap_pages > 0)
> -		inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);
> +		inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
>  	if (sc->nr_reclaimed < pages_for_compaction &&
>  			inactive_lru_pages > pages_for_compaction)
>  		return true;
>  
>  	/* If compaction would go ahead or the allocation would succeed, stop */
> -	switch (compaction_suitable(lruvec_zone(lruvec), sc->order)) {
> +	switch (compaction_suitable(zone, sc->order)) {
>  	case COMPACT_PARTIAL:
>  	case COMPACT_CONTINUE:
>  		return false;
> @@ -1857,98 +1910,49 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
>  	}
>  }
>  
> -/*
> - * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
> - */
> -static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> +static void shrink_zone(struct zone *zone, struct scan_control *sc)
>  {
> -	unsigned long nr[NR_LRU_LISTS];
> -	unsigned long nr_to_scan;
> -	enum lru_list lru;
>  	unsigned long nr_reclaimed, nr_scanned;
> -	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> -	struct blk_plug plug;
> -
> -restart:
> -	nr_reclaimed = 0;
> -	nr_scanned = sc->nr_scanned;
> -	get_scan_count(lruvec, sc, nr);
> -
> -	blk_start_plug(&plug);
> -	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> -					nr[LRU_INACTIVE_FILE]) {
> -		for_each_evictable_lru(lru) {
> -			if (nr[lru]) {
> -				nr_to_scan = min_t(unsigned long,
> -						   nr[lru], SWAP_CLUSTER_MAX);
> -				nr[lru] -= nr_to_scan;
> -
> -				nr_reclaimed += shrink_list(lru, nr_to_scan,
> -							    lruvec, sc);
> -			}
> -		}
> -		/*
> -		 * On large memory systems, scan >> priority can become
> -		 * really large. This is fine for the starting priority;
> -		 * we want to put equal scanning pressure on each zone.
> -		 * However, if the VM has a harder time of freeing pages,
> -		 * with multiple processes reclaiming pages, the total
> -		 * freeing target can get unreasonably large.
> -		 */
> -		if (nr_reclaimed >= nr_to_reclaim &&
> -		    sc->priority < DEF_PRIORITY)
> -			break;
> -	}
> -	blk_finish_plug(&plug);
> -	sc->nr_reclaimed += nr_reclaimed;
>  
> -	/*
> -	 * Even if we did not try to evict anon pages at all, we want to
> -	 * rebalance the anon lru active/inactive ratio.
> -	 */
> -	if (inactive_anon_is_low(lruvec))
> -		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> -				   sc, LRU_ACTIVE_ANON);
> -
> -	/* reclaim/compaction might need reclaim to continue */
> -	if (should_continue_reclaim(lruvec, nr_reclaimed,
> -				    sc->nr_scanned - nr_scanned, sc))
> -		goto restart;
> +	do {
> +		struct mem_cgroup *root = sc->target_mem_cgroup;
> +		struct mem_cgroup_reclaim_cookie reclaim = {
> +			.zone = zone,
> +			.priority = sc->priority,
> +		};
> +		struct mem_cgroup *memcg;
>  
> -	throttle_vm_writeout(sc->gfp_mask);
> -}
> +		nr_reclaimed = sc->nr_reclaimed;
> +		nr_scanned = sc->nr_scanned;
>  
> -static void shrink_zone(struct zone *zone, struct scan_control *sc)
> -{
> -	struct mem_cgroup *root = sc->target_mem_cgroup;
> -	struct mem_cgroup_reclaim_cookie reclaim = {
> -		.zone = zone,
> -		.priority = sc->priority,
> -	};
> -	struct mem_cgroup *memcg;
> +		memcg = mem_cgroup_iter(root, NULL, &reclaim);
> +		do {
> +			struct lruvec *lruvec;
>  
> -	memcg = mem_cgroup_iter(root, NULL, &reclaim);
> -	do {
> -		struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> +			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
>  
> -		shrink_lruvec(lruvec, sc);
> +			shrink_lruvec(lruvec, sc);
>  
> -		/*
> -		 * Limit reclaim has historically picked one memcg and
> -		 * scanned it with decreasing priority levels until
> -		 * nr_to_reclaim had been reclaimed.  This priority
> -		 * cycle is thus over after a single memcg.
> -		 *
> -		 * Direct reclaim and kswapd, on the other hand, have
> -		 * to scan all memory cgroups to fulfill the overall
> -		 * scan target for the zone.
> -		 */
> -		if (!global_reclaim(sc)) {
> -			mem_cgroup_iter_break(root, memcg);
> -			break;
> -		}
> -		memcg = mem_cgroup_iter(root, memcg, &reclaim);
> -	} while (memcg);
> +			/*
> +			 * Limit reclaim has historically picked one
> +			 * memcg and scanned it with decreasing
> +			 * priority levels until nr_to_reclaim had
> +			 * been reclaimed.  This priority cycle is
> +			 * thus over after a single memcg.
> +			 *
> +			 * Direct reclaim and kswapd, on the other
> +			 * hand, have to scan all memory cgroups to
> +			 * fulfill the overall scan target for the
> +			 * zone.
> +			 */
> +			if (!global_reclaim(sc)) {
> +				mem_cgroup_iter_break(root, memcg);
> +				break;
> +			}
> +			memcg = mem_cgroup_iter(root, memcg, &reclaim);
> +		} while (memcg);
> +	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
> +					 sc->nr_scanned - nr_scanned, sc));
>  }
>  
>  /* Returns true if compaction should go ahead for a high-order request */
> -- 
> 1.7.11.7
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 7/8] mm: vmscan: compaction works against zones, not lruvecs
@ 2012-12-13 16:48     ` Michal Hocko
  0 siblings, 0 replies; 114+ messages in thread
From: Michal Hocko @ 2012-12-13 16:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

On Wed 12-12-12 16:43:39, Johannes Weiner wrote:
> The restart logic for when reclaim operates back to back with
> compaction is currently applied on the lruvec level.  But this does
> not make sense, because the container of interest for compaction is a
> zone as a whole, not the zone pages that are part of a certain memory
> cgroup.
> 
> Negative impact is bounded.  For one, the code checks that the lruvec
> has enough reclaim candidates, so it does not risk getting stuck on a
> condition that can not be fulfilled.  And the unfairness of hammering
> on one particular memory cgroup to make progress in a zone will be
> amortized by the round robin manner in which reclaim goes through the
> memory cgroups.  Still, this can lead to unnecessary allocation
> latencies when the code elects to restart on a hard to reclaim or
> small group when there are other, more reclaimable groups in the zone.
> Move this logic to the zone level and restart reclaim for all memory
> cgroups in a zone when compaction requires more free pages from it.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/vmscan.c | 180 +++++++++++++++++++++++++++++++-----------------------------
>  1 file changed, 92 insertions(+), 88 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e20385a..c9c841d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1782,6 +1782,59 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  	}
>  }
>  
> +/*
> + * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
> + */
> +static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> +{
> +	unsigned long nr[NR_LRU_LISTS];
> +	unsigned long nr_to_scan;
> +	enum lru_list lru;
> +	unsigned long nr_reclaimed = 0;
> +	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> +	struct blk_plug plug;
> +
> +	get_scan_count(lruvec, sc, nr);
> +
> +	blk_start_plug(&plug);
> +	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> +					nr[LRU_INACTIVE_FILE]) {
> +		for_each_evictable_lru(lru) {
> +			if (nr[lru]) {
> +				nr_to_scan = min_t(unsigned long,
> +						   nr[lru], SWAP_CLUSTER_MAX);
> +				nr[lru] -= nr_to_scan;
> +
> +				nr_reclaimed += shrink_list(lru, nr_to_scan,
> +							    lruvec, sc);
> +			}
> +		}
> +		/*
> +		 * On large memory systems, scan >> priority can become
> +		 * really large. This is fine for the starting priority;
> +		 * we want to put equal scanning pressure on each zone.
> +		 * However, if the VM has a harder time of freeing pages,
> +		 * with multiple processes reclaiming pages, the total
> +		 * freeing target can get unreasonably large.
> +		 */
> +		if (nr_reclaimed >= nr_to_reclaim &&
> +		    sc->priority < DEF_PRIORITY)
> +			break;
> +	}
> +	blk_finish_plug(&plug);
> +	sc->nr_reclaimed += nr_reclaimed;
> +
> +	/*
> +	 * Even if we did not try to evict anon pages at all, we want to
> +	 * rebalance the anon lru active/inactive ratio.
> +	 */
> +	if (inactive_anon_is_low(lruvec))
> +		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> +				   sc, LRU_ACTIVE_ANON);
> +
> +	throttle_vm_writeout(sc->gfp_mask);
> +}
> +
>  /* Use reclaim/compaction for costly allocs or under memory pressure */
>  static bool in_reclaim_compaction(struct scan_control *sc)
>  {
> @@ -1800,7 +1853,7 @@ static bool in_reclaim_compaction(struct scan_control *sc)
>   * calls try_to_compact_zone() that it will have enough free pages to succeed.
>   * It will give up earlier than that if there is difficulty reclaiming pages.
>   */
> -static inline bool should_continue_reclaim(struct lruvec *lruvec,
> +static inline bool should_continue_reclaim(struct zone *zone,
>  					unsigned long nr_reclaimed,
>  					unsigned long nr_scanned,
>  					struct scan_control *sc)
> @@ -1840,15 +1893,15 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
>  	 * inactive lists are large enough, continue reclaiming
>  	 */
>  	pages_for_compaction = (2UL << sc->order);
> -	inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
> +	inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
>  	if (nr_swap_pages > 0)
> -		inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);
> +		inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
>  	if (sc->nr_reclaimed < pages_for_compaction &&
>  			inactive_lru_pages > pages_for_compaction)
>  		return true;
>  
>  	/* If compaction would go ahead or the allocation would succeed, stop */
> -	switch (compaction_suitable(lruvec_zone(lruvec), sc->order)) {
> +	switch (compaction_suitable(zone, sc->order)) {
>  	case COMPACT_PARTIAL:
>  	case COMPACT_CONTINUE:
>  		return false;
> @@ -1857,98 +1910,49 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
>  	}
>  }
>  
> -/*
> - * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
> - */
> -static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> +static void shrink_zone(struct zone *zone, struct scan_control *sc)
>  {
> -	unsigned long nr[NR_LRU_LISTS];
> -	unsigned long nr_to_scan;
> -	enum lru_list lru;
>  	unsigned long nr_reclaimed, nr_scanned;
> -	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> -	struct blk_plug plug;
> -
> -restart:
> -	nr_reclaimed = 0;
> -	nr_scanned = sc->nr_scanned;
> -	get_scan_count(lruvec, sc, nr);
> -
> -	blk_start_plug(&plug);
> -	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> -					nr[LRU_INACTIVE_FILE]) {
> -		for_each_evictable_lru(lru) {
> -			if (nr[lru]) {
> -				nr_to_scan = min_t(unsigned long,
> -						   nr[lru], SWAP_CLUSTER_MAX);
> -				nr[lru] -= nr_to_scan;
> -
> -				nr_reclaimed += shrink_list(lru, nr_to_scan,
> -							    lruvec, sc);
> -			}
> -		}
> -		/*
> -		 * On large memory systems, scan >> priority can become
> -		 * really large. This is fine for the starting priority;
> -		 * we want to put equal scanning pressure on each zone.
> -		 * However, if the VM has a harder time of freeing pages,
> -		 * with multiple processes reclaiming pages, the total
> -		 * freeing target can get unreasonably large.
> -		 */
> -		if (nr_reclaimed >= nr_to_reclaim &&
> -		    sc->priority < DEF_PRIORITY)
> -			break;
> -	}
> -	blk_finish_plug(&plug);
> -	sc->nr_reclaimed += nr_reclaimed;
>  
> -	/*
> -	 * Even if we did not try to evict anon pages at all, we want to
> -	 * rebalance the anon lru active/inactive ratio.
> -	 */
> -	if (inactive_anon_is_low(lruvec))
> -		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> -				   sc, LRU_ACTIVE_ANON);
> -
> -	/* reclaim/compaction might need reclaim to continue */
> -	if (should_continue_reclaim(lruvec, nr_reclaimed,
> -				    sc->nr_scanned - nr_scanned, sc))
> -		goto restart;
> +	do {
> +		struct mem_cgroup *root = sc->target_mem_cgroup;
> +		struct mem_cgroup_reclaim_cookie reclaim = {
> +			.zone = zone,
> +			.priority = sc->priority,
> +		};
> +		struct mem_cgroup *memcg;
>  
> -	throttle_vm_writeout(sc->gfp_mask);
> -}
> +		nr_reclaimed = sc->nr_reclaimed;
> +		nr_scanned = sc->nr_scanned;
>  
> -static void shrink_zone(struct zone *zone, struct scan_control *sc)
> -{
> -	struct mem_cgroup *root = sc->target_mem_cgroup;
> -	struct mem_cgroup_reclaim_cookie reclaim = {
> -		.zone = zone,
> -		.priority = sc->priority,
> -	};
> -	struct mem_cgroup *memcg;
> +		memcg = mem_cgroup_iter(root, NULL, &reclaim);
> +		do {
> +			struct lruvec *lruvec;
>  
> -	memcg = mem_cgroup_iter(root, NULL, &reclaim);
> -	do {
> -		struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> +			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
>  
> -		shrink_lruvec(lruvec, sc);
> +			shrink_lruvec(lruvec, sc);
>  
> -		/*
> -		 * Limit reclaim has historically picked one memcg and
> -		 * scanned it with decreasing priority levels until
> -		 * nr_to_reclaim had been reclaimed.  This priority
> -		 * cycle is thus over after a single memcg.
> -		 *
> -		 * Direct reclaim and kswapd, on the other hand, have
> -		 * to scan all memory cgroups to fulfill the overall
> -		 * scan target for the zone.
> -		 */
> -		if (!global_reclaim(sc)) {
> -			mem_cgroup_iter_break(root, memcg);
> -			break;
> -		}
> -		memcg = mem_cgroup_iter(root, memcg, &reclaim);
> -	} while (memcg);
> +			/*
> +			 * Limit reclaim has historically picked one
> +			 * memcg and scanned it with decreasing
> +			 * priority levels until nr_to_reclaim had
> +			 * been reclaimed.  This priority cycle is
> +			 * thus over after a single memcg.
> +			 *
> +			 * Direct reclaim and kswapd, on the other
> +			 * hand, have to scan all memory cgroups to
> +			 * fulfill the overall scan target for the
> +			 * zone.
> +			 */
> +			if (!global_reclaim(sc)) {
> +				mem_cgroup_iter_break(root, memcg);
> +				break;
> +			}
> +			memcg = mem_cgroup_iter(root, memcg, &reclaim);
> +		} while (memcg);
> +	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
> +					 sc->nr_scanned - nr_scanned, sc));
>  }
>  
>  /* Returns true if compaction should go ahead for a high-order request */
> -- 
> 1.7.11.7
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 2/8] mm: vmscan: disregard swappiness shortly before going OOM
  2012-12-13 10:34     ` Mel Gorman
@ 2012-12-13 19:05       ` Johannes Weiner
  -1 siblings, 0 replies; 114+ messages in thread
From: Johannes Weiner @ 2012-12-13 19:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, Michal Hocko, Hugh Dickins,
	linux-mm, linux-kernel

On Thu, Dec 13, 2012 at 10:34:20AM +0000, Mel Gorman wrote:
> On Wed, Dec 12, 2012 at 04:43:34PM -0500, Johannes Weiner wrote:
> > When a reclaim scanner is doing its final scan before giving up and
> > there is swap space available, pay no attention to swappiness
> > preference anymore.  Just swap.
> > 
> > Note that this change won't make too big of a difference for general
> > reclaim: anonymous pages are already force-scanned when there is only
> > very little file cache left, and there very likely isn't when the
> > reclaimer enters this final cycle.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Ok, I see the motivation for your patch but is the block inside still
> wrong for what you want? After your patch the block looks like this
> 
>                 if (sc->priority || noswap) {
>                         scan >>= sc->priority;
>                         if (!scan && force_scan)
>                                 scan = SWAP_CLUSTER_MAX;
>                         scan = div64_u64(scan * fraction[file], denominator);
>                 }
> 
> if sc->priority == 0 and swappiness==0 then you enter this block but
> fraction[0] for anonymous pages will also be 0 and because of the ordering
> of statements there, scan will be
> 
> scan = scan * 0 / denominator
> 
> so you are still not reclaiming anonymous pages in the swappiness=0
> case. What did I miss?

Don't get confused by noswap, it is only set when there physically is
no swap space.  If !sc->priority, that block is skipped and
fraction[0] does not matter.
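
To make this concrete, here is a stand-alone toy program (not kernel
code; scan_target(), the list size and the denominator are made up)
that mimics the post-patch block quoted above:

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL

/* toy model of the scaling block in get_scan_count() after this patch */
static unsigned long scan_target(unsigned long lru_size, int priority,
				 int noswap, int force_scan,
				 unsigned long fraction, unsigned long denom)
{
	unsigned long scan = lru_size;

	if (priority || noswap) {
		scan >>= priority;
		if (!scan && force_scan)
			scan = SWAP_CLUSTER_MAX;
		scan = scan * fraction / denom;
	}
	return scan;
}

int main(void)
{
	/* 10000-page anon list, swappiness==0 => fraction[anon] == 0 */
	printf("priority 12: %lu\n", scan_target(10000, 12, 0, 1, 0, 1));
	printf("priority  0: %lu\n", scan_target(10000, 0, 0, 1, 0, 1));
	return 0;
}

With swap present (noswap == 0), the first call scans nothing because
fraction[anon] == 0, while the second call (sc->priority == 0) skips
the block entirely and targets the whole list.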

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 3/8] mm: vmscan: save work scanning (almost) empty LRU lists
  2012-12-13 10:41     ` Mel Gorman
@ 2012-12-13 19:33       ` Johannes Weiner
  -1 siblings, 0 replies; 114+ messages in thread
From: Johannes Weiner @ 2012-12-13 19:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, Michal Hocko, Hugh Dickins,
	linux-mm, linux-kernel

On Thu, Dec 13, 2012 at 10:41:04AM +0000, Mel Gorman wrote:
> On Wed, Dec 12, 2012 at 04:43:35PM -0500, Johannes Weiner wrote:
> > In certain cases (kswapd reclaim, memcg target reclaim), a fixed
> > minimum amount of pages is scanned from the LRU lists on each
> > iteration, to make progress.
> > 
> > Do not make this minimum bigger than the respective LRU list size,
> > however, and save some busy work trying to isolate and reclaim pages
> > that are not there.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> This looks like a corner case where the LRU size would have to be smaller
> than SWAP_CLUSTER_MAX. Is that common enough to care? It looks correct,
> I'm just curious.

We have one lruvec per memcg per zone, so consider memory cgroups in a
NUMA environment: NR_MEMCG * (NR_NODES - 1) * NR_LRU_LISTS permanently
empty lruvecs, assuming the memory of one cgroup is bound to one node.
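
To put a (made-up) number on it: 100 cgroups on a 4-node machine, with
each cgroup's memory bound to a single node, gives 100 * (4 - 1) * 5 =
1500 permanently empty lru lists, and without this patch the forced
minimum would still hand each of them a scan target on every iteration.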

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 3/8] mm: vmscan: save work scanning (almost) empty LRU lists
  2012-12-13 15:43     ` Michal Hocko
@ 2012-12-13 19:38       ` Johannes Weiner
  -1 siblings, 0 replies; 114+ messages in thread
From: Johannes Weiner @ 2012-12-13 19:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

On Thu, Dec 13, 2012 at 04:43:46PM +0100, Michal Hocko wrote:
> On Wed 12-12-12 16:43:35, Johannes Weiner wrote:
> > In certain cases (kswapd reclaim, memcg target reclaim), a fixed
> > minimum amount of pages is scanned from the LRU lists on each
> > iteration, to make progress.
> > 
> > Do not make this minimum bigger than the respective LRU list size,
> > however, and save some busy work trying to isolate and reclaim pages
> > that are not there.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Hmm, shrink_lruvec would do:
> 	nr_to_scan = min_t(unsigned long,
> 			   nr[lru], SWAP_CLUSTER_MAX);
> 	nr[lru] -= nr_to_scan;
> and isolate_lru_pages does
> 	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++)
> so it shouldn't matter and we shouldn't do any additional loops, right?
> 
> Anyway it would be beter if get_scan_count wouldn't ask for more than is
> available.

Consider the inactive_list_is_low() check (especially expensive for
memcg anon), lru_add_drain(), lru lock acquisition...

And as I wrote to Mel in the other email, this can happen a lot when
you have memory cgroups in a multi-node environment.

> Reviewed-by: Michal Hocko <mhocko@suse.cz>

Thanks!

> > @@ -1748,15 +1748,17 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> >  out:
> >  	for_each_evictable_lru(lru) {
> >  		int file = is_file_lru(lru);
> > +		unsigned long size;
> >  		unsigned long scan;
> >  
> > -		scan = get_lru_size(lruvec, lru);
> > +		size = get_lru_size(lruvec, lru);
> +		size = scan = get_lru_size(lruvec, lru);
> 
> >  		if (sc->priority || noswap) {
> > -			scan >>= sc->priority;
> > +			scan = size >> sc->priority;
> >  			if (!scan && force_scan)
> > -				scan = SWAP_CLUSTER_MAX;
> > +				scan = min(size, SWAP_CLUSTER_MAX);
> >  			scan = div64_u64(scan * fraction[file], denominator);
> > -		}
> > +		} else
> > +			scan = size;
> 
> And this is not necessary then but this is totally nit.

Do you actually find this more readable?  Setting size = scan and then
later scan = size >> sc->priority? :-)

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 2/8] mm: vmscan: disregard swappiness shortly before going OOM
  2012-12-13 19:05       ` Johannes Weiner
@ 2012-12-13 19:47         ` Mel Gorman
  -1 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2012-12-13 19:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Michal Hocko, Hugh Dickins,
	linux-mm, linux-kernel

On Thu, Dec 13, 2012 at 02:05:34PM -0500, Johannes Weiner wrote:
> On Thu, Dec 13, 2012 at 10:34:20AM +0000, Mel Gorman wrote:
> > On Wed, Dec 12, 2012 at 04:43:34PM -0500, Johannes Weiner wrote:
> > > When a reclaim scanner is doing its final scan before giving up and
> > > there is swap space available, pay no attention to swappiness
> > > preference anymore.  Just swap.
> > > 
> > > Note that this change won't make too big of a difference for general
> > > reclaim: anonymous pages are already force-scanned when there is only
> > > very little file cache left, and there very likely isn't when the
> > > reclaimer enters this final cycle.
> > > 
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > Ok, I see the motivation for your patch but is the block inside still
> > wrong for what you want? After your patch the block looks like this
> > 
> >                 if (sc->priority || noswap) {
> >                         scan >>= sc->priority;
> >                         if (!scan && force_scan)
> >                                 scan = SWAP_CLUSTER_MAX;
> >                         scan = div64_u64(scan * fraction[file], denominator);
> >                 }
> > 
> > if sc->priority == 0 and swappiness==0 then you enter this block but
> > fraction[0] for anonymous pages will also be 0 and because of the ordering
> > of statements there, scan will be
> > 
> > scan = scan * 0 / denominator
> > 
> > so you are still not reclaiming anonymous pages in the swappiness=0
> > case. What did I miss?
> 
> Don't get confused by noswap, it is only set when there physically is
> no swap space.  If !sc->priority, that block is skipped and
> fraction[0] does not matter.

/me slaps self

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* RE: [patch 2/8] mm: vmscan: disregard swappiness shortly before going OOM
  2012-12-13 16:05         ` Michal Hocko
@ 2012-12-13 22:25           ` Satoru Moriya
  -1 siblings, 0 replies; 114+ messages in thread
From: Satoru Moriya @ 2012-12-13 22:25 UTC (permalink / raw)
  To: Michal Hocko, Mel Gorman
  Cc: Johannes Weiner, Andrew Morton, Rik van Riel, Hugh Dickins,
	linux-mm, linux-kernel


On 12/13/2012 11:05 AM, Michal Hocko wrote:
> On Thu 13-12-12 16:29:59, Michal Hocko wrote:
>> On Thu 13-12-12 10:34:20, Mel Gorman wrote:
>>> On Wed, Dec 12, 2012 at 04:43:34PM -0500, Johannes Weiner wrote:
>>>> When a reclaim scanner is doing its final scan before giving up and 
>>>> there is swap space available, pay no attention to swappiness 
>>>> preference anymore.  Just swap.
>>>>
>>>> Note that this change won't make too big of a difference for 
>>>> general
>>>> reclaim: anonymous pages are already force-scanned when there is 
>>>> only very little file cache left, and there very likely isn't when 
>>>> the reclaimer enters this final cycle.
>>>>
>>>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>>>
>>> Ok, I see the motivation for your patch but is the block inside 
>>> still wrong for what you want? After your patch the block looks like 
>>> this
>>>
>>>                 if (sc->priority || noswap) {
>>>                         scan >>= sc->priority;
>>>                         if (!scan && force_scan)
>>>                                 scan = SWAP_CLUSTER_MAX;
>>>                         scan = div64_u64(scan * fraction[file], denominator);
>>>                 }
>>>
>>> if sc->priority == 0 and swappiness==0 then you enter this block but 
>>> fraction[0] for anonymous pages will also be 0 and because of the 
>>> ordering of statements there, scan will be
>>>
>>> scan = scan * 0 / denominator
>>>
>>> so you are still not reclaiming anonymous pages in the swappiness=0 
>>> case. What did I miss?
>>
>> Yes, now that you have mentioned that I realized that it really 
>> doesn't make any sense. fraction[0] is _always_ 0 for swappiness==0. 
>> So we just made a bigger pressure on file LRUs. So this sounds like a 
>> misuse of the swappiness. This all has been introduced with fe35004f 
>> (mm: avoid swapping out with swappiness==0).
>>
>> I think that removing swappiness check make sense but I am not sure 
>> it does what the changelog says. It should have said that checking 
>> swappiness doesn't make any sense for small LRUs.
>
> Bahh, wait a moment. Now I remember why the check made sense 
> especially for memcg.
> It made "don't swap _at all_ for swappiness==0" for real - you are 
> even willing to sacrifice OOM. Maybe this is OK for the global case 
> because noswap would safe you here (assuming that there is no swap if 
> somebody doesn't want to swap at all and swappiness doesn't play such 
> a big role) but for memcg you really might want to prevent from 
> swapping - not everybody has memcg swap extension enabled and swappiness is handy then.
> So I am not sure this is actually what we want. Need to think about it.

I introduced swappiness check here with fe35004f because, in some
cases, we prefer OOM to swap out pages to detect problems as soon
as possible. Basically, we design the system not to swap out and
so if it causes swapping, something goes wrong.

Regards,
Satoru

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 2/8] mm: vmscan: disregard swappiness shortly before going OOM
  2012-12-13 22:25           ` Satoru Moriya
@ 2012-12-14  4:50             ` Johannes Weiner
  -1 siblings, 0 replies; 114+ messages in thread
From: Johannes Weiner @ 2012-12-14  4:50 UTC (permalink / raw)
  To: Satoru Moriya
  Cc: Michal Hocko, Mel Gorman, Andrew Morton, Rik van Riel,
	Hugh Dickins, linux-mm, linux-kernel

On Thu, Dec 13, 2012 at 10:25:43PM +0000, Satoru Moriya wrote:
> 
> On 12/13/2012 11:05 AM, Michal Hocko wrote:
> > On Thu 13-12-12 16:29:59, Michal Hocko wrote:
> >> On Thu 13-12-12 10:34:20, Mel Gorman wrote:
> >>> On Wed, Dec 12, 2012 at 04:43:34PM -0500, Johannes Weiner wrote:
> >>>> When a reclaim scanner is doing its final scan before giving up and 
> >>>> there is swap space available, pay no attention to swappiness 
> >>>> preference anymore.  Just swap.
> >>>>
> >>>> Note that this change won't make too big of a difference for 
> >>>> general
> >>>> reclaim: anonymous pages are already force-scanned when there is 
> >>>> only very little file cache left, and there very likely isn't when 
> >>>> the reclaimer enters this final cycle.
> >>>>
> >>>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> >>>
> >>> Ok, I see the motivation for your patch but is the block inside 
> >>> still wrong for what you want? After your patch the block looks like 
> >>> this
> >>>
> >>>                 if (sc->priority || noswap) {
> >>>                         scan >>= sc->priority;
> >>>                         if (!scan && force_scan)
> >>>                                 scan = SWAP_CLUSTER_MAX;
> >>>                         scan = div64_u64(scan * fraction[file], denominator);
> >>>                 }
> >>>
> >>> if sc->priority == 0 and swappiness==0 then you enter this block but 
> >>> fraction[0] for anonymous pages will also be 0 and because of the 
> >>> ordering of statements there, scan will be
> >>>
> >>> scan = scan * 0 / denominator
> >>>
> >>> so you are still not reclaiming anonymous pages in the swappiness=0 
> >>> case. What did I miss?
> >>
> >> Yes, now that you have mentioned that I realized that it really 
> >> doesn't make any sense. fraction[0] is _always_ 0 for swappiness==0. 
> >> So we just made a bigger pressure on file LRUs. So this sounds like a 
> >> misuse of the swappiness. This all has been introduced with fe35004f 
> >> (mm: avoid swapping out with swappiness==0).
> >>
> >> I think that removing swappiness check make sense but I am not sure 
> >> it does what the changelog says. It should have said that checking 
> >> swappiness doesn't make any sense for small LRUs.
> >
> > Bahh, wait a moment. Now I remember why the check made sense 
> > especially for memcg.
> > It made "don't swap _at all_ for swappiness==0" for real - you are 
> > even willing to sacrifice OOM. Maybe this is OK for the global case 
> > because noswap would safe you here (assuming that there is no swap if 
> > somebody doesn't want to swap at all and swappiness doesn't play such 
> > a big role) but for memcg you really might want to prevent from 
> > swapping - not everybody has memcg swap extension enabled and swappiness is handy then.
> > So I am not sure this is actually what we want. Need to think about it.
> 
> I introduced swappiness check here with fe35004f because, in some
> cases, we prefer OOM to swap out pages to detect problems as soon
> as possible. Basically, we design the system not to swap out and
> so if it causes swapping, something goes wrong.

I might be missing something terribly obvious, but... why do you add
swap space to the system in the first place?  Or in case of cgroups,
why not set the memsw limit equal to the memory limit?

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 2/8] mm: vmscan: disregard swappiness shortly before going OOM
  2012-12-14  4:50             ` Johannes Weiner
@ 2012-12-14  8:37               ` Michal Hocko
  -1 siblings, 0 replies; 114+ messages in thread
From: Michal Hocko @ 2012-12-14  8:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Satoru Moriya, Mel Gorman, Andrew Morton, Rik van Riel,
	Hugh Dickins, linux-mm, linux-kernel

On Thu 13-12-12 23:50:30, Johannes Weiner wrote:
> On Thu, Dec 13, 2012 at 10:25:43PM +0000, Satoru Moriya wrote:
> > 
> > On 12/13/2012 11:05 AM, Michal Hocko wrote:
> > > On Thu 13-12-12 16:29:59, Michal Hocko wrote:
> > >> On Thu 13-12-12 10:34:20, Mel Gorman wrote:
> > >>> On Wed, Dec 12, 2012 at 04:43:34PM -0500, Johannes Weiner wrote:
> > >>>> When a reclaim scanner is doing its final scan before giving up and 
> > >>>> there is swap space available, pay no attention to swappiness 
> > >>>> preference anymore.  Just swap.
> > >>>>
> > >>>> Note that this change won't make too big of a difference for 
> > >>>> general
> > >>>> reclaim: anonymous pages are already force-scanned when there is 
> > >>>> only very little file cache left, and there very likely isn't when 
> > >>>> the reclaimer enters this final cycle.
> > >>>>
> > >>>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > >>>
> > >>> Ok, I see the motivation for your patch but is the block inside 
> > >>> still wrong for what you want? After your patch the block looks like 
> > >>> this
> > >>>
> > >>>                 if (sc->priority || noswap) {
> > >>>                         scan >>= sc->priority;
> > >>>                         if (!scan && force_scan)
> > >>>                                 scan = SWAP_CLUSTER_MAX;
> > >>>                         scan = div64_u64(scan * fraction[file], denominator);
> > >>>                 }
> > >>>
> > >>> if sc->priority == 0 and swappiness==0 then you enter this block but 
> > >>> fraction[0] for anonymous pages will also be 0 and because of the 
> > >>> ordering of statements there, scan will be
> > >>>
> > >>> scan = scan * 0 / denominator
> > >>>
> > >>> so you are still not reclaiming anonymous pages in the swappiness=0 
> > >>> case. What did I miss?
> > >>
> > >> Yes, now that you have mentioned that I realized that it really 
> > >> doesn't make any sense. fraction[0] is _always_ 0 for swappiness==0. 
> > >> So we just made a bigger pressure on file LRUs. So this sounds like a 
> > >> misuse of the swappiness. This all has been introduced with fe35004f 
> > >> (mm: avoid swapping out with swappiness==0).
> > >>
> > >> I think that removing swappiness check make sense but I am not sure 
> > >> it does what the changelog says. It should have said that checking 
> > >> swappiness doesn't make any sense for small LRUs.
> > >
> > > Bahh, wait a moment. Now I remember why the check made sense 
> > > especially for memcg.
> > > It made "don't swap _at all_ for swappiness==0" for real - you are 
> > > even willing to sacrifice OOM. Maybe this is OK for the global case 
> > > because noswap would safe you here (assuming that there is no swap if 
> > > somebody doesn't want to swap at all and swappiness doesn't play such 
> > > a big role) but for memcg you really might want to prevent from 
> > > swapping - not everybody has memcg swap extension enabled and swappiness is handy then.
> > > So I am not sure this is actually what we want. Need to think about it.
> > 
> > I introduced swappiness check here with fe35004f because, in some
> > cases, we prefer OOM to swap out pages to detect problems as soon
> > as possible. Basically, we design the system not to swap out and
> > so if it causes swapping, something goes wrong.
> 
> I might be missing something terribly obvious, but... why do you add
> swap space to the system in the first place?  Or in case of cgroups,
> why not set the memsw limit equal to the memory limit?

I can answer the latter. Because memsw comes with its price and
swappiness is much cheaper. On the other hand it makes sense that
swappiness==0 doesn't swap at all. Or do you think we should get back to
_almost_ doesn't swap at all?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 3/8] mm: vmscan: save work scanning (almost) empty LRU lists
  2012-12-13 19:38       ` Johannes Weiner
@ 2012-12-14  8:46         ` Michal Hocko
  -1 siblings, 0 replies; 114+ messages in thread
From: Michal Hocko @ 2012-12-14  8:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Hugh Dickins, linux-mm,
	linux-kernel

On Thu 13-12-12 14:38:20, Johannes Weiner wrote:
> On Thu, Dec 13, 2012 at 04:43:46PM +0100, Michal Hocko wrote:
> > On Wed 12-12-12 16:43:35, Johannes Weiner wrote:
> > > In certain cases (kswapd reclaim, memcg target reclaim), a fixed
> > > minimum amount of pages is scanned from the LRU lists on each
> > > iteration, to make progress.
> > > 
> > > Do not make this minimum bigger than the respective LRU list size,
> > > however, and save some busy work trying to isolate and reclaim pages
> > > that are not there.
> > > 
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > Hmm, shrink_lruvec would do:
> > 	nr_to_scan = min_t(unsigned long,
> > 			   nr[lru], SWAP_CLUSTER_MAX);
> > 	nr[lru] -= nr_to_scan;
> > and isolate_lru_pages does
> > 	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++)
> > so it shouldn't matter and we shouldn't do any additional loops, right?
> > 
> > Anyway it would be beter if get_scan_count wouldn't ask for more than is
> > available.
> 
> Consider the inactive_list_is_low() check (especially expensive for
> memcg anon), lru_add_drain(), lru lock acquisition...

Ohh, I totally missed that. Thanks for pointing that out (maybe
s/some busy work/$WITH_ALL_THIS/)?

Thanks for clarification!

> And as I wrote to Mel in the other email, this can happen a lot when
> you have memory cgroups in a multi-node environment.
> 
> > Reviewed-by: Michal Hocko <mhocko@suse.cz>
> 
> Thanks!
> 
> > > @@ -1748,15 +1748,17 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> > >  out:
> > >  	for_each_evictable_lru(lru) {
> > >  		int file = is_file_lru(lru);
> > > +		unsigned long size;
> > >  		unsigned long scan;
> > >  
> > > -		scan = get_lru_size(lruvec, lru);
> > > +		size = get_lru_size(lruvec, lru);
> > +		size = scan = get_lru_size(lruvec, lru);
> > 
> > >  		if (sc->priority || noswap) {
> > > -			scan >>= sc->priority;
> > > +			scan = size >> sc->priority;
> > >  			if (!scan && force_scan)
> > > -				scan = SWAP_CLUSTER_MAX;
> > > +				scan = min(size, SWAP_CLUSTER_MAX);
> > >  			scan = div64_u64(scan * fraction[file], denominator);
> > > -		}
> > > +		} else
> > > +			scan = size;
> > 
> > And this is not necessary then but this is totally nit.
> 
> Do you actually find this more readable?  Setting size = scan and then
> later scan = size >> sc->priority? :-)

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 2/8] mm: vmscan: disregard swappiness shortly before going OOM
  2012-12-14  8:37               ` Michal Hocko
@ 2012-12-14 15:43                 ` Rik van Riel
  -1 siblings, 0 replies; 114+ messages in thread
From: Rik van Riel @ 2012-12-14 15:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Satoru Moriya, Mel Gorman, Andrew Morton,
	Hugh Dickins, linux-mm, linux-kernel

On 12/14/2012 03:37 AM, Michal Hocko wrote:

> I can answer the latter. Because memsw comes with its price and
> swappiness is much cheaper. On the other hand it makes sense that
> swappiness==0 doesn't swap at all. Or do you think we should get back to
> _almost_ doesn't swap at all?

swappiness==0 will swap in emergencies, specifically when we have
almost no page cache left, we will still swap things out:

         if (global_reclaim(sc)) {
                 free  = zone_page_state(zone, NR_FREE_PAGES);
                 if (unlikely(file + free <= high_wmark_pages(zone))) {
                         /*
                          * If we have very few page cache pages, force-scan
                          * anon pages.
                          */
                         fraction[0] = 1;
                         fraction[1] = 0;
                         denominator = 1;
                         goto out;

This makes sense, because people who set swappiness==0 but
do have swap space available would probably prefer some
emergency swapping over an OOM kill.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 2/8] mm: vmscan: disregard swappiness shortly before going OOM
  2012-12-14 15:43                 ` Rik van Riel
@ 2012-12-14 16:13                   ` Michal Hocko
  -1 siblings, 0 replies; 114+ messages in thread
From: Michal Hocko @ 2012-12-14 16:13 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Johannes Weiner, Satoru Moriya, Mel Gorman, Andrew Morton,
	Hugh Dickins, linux-mm, linux-kernel

On Fri 14-12-12 10:43:55, Rik van Riel wrote:
> On 12/14/2012 03:37 AM, Michal Hocko wrote:
> 
> >I can answer the latter. Because memsw comes with its price and
> >swappiness is much cheaper. On the other hand it makes sense that
> >swappiness==0 doesn't swap at all. Or do you think we should get back to
> >_almost_ doesn't swap at all?
> 
> swappiness==0 will swap in emergencies, specifically when we have
> almost no page cache left, we will still swap things out:
> 
>         if (global_reclaim(sc)) {
>                 free  = zone_page_state(zone, NR_FREE_PAGES);
>                 if (unlikely(file + free <= high_wmark_pages(zone))) {
>                         /*
>                          * If we have very few page cache pages, force-scan
>                          * anon pages.
>                          */
>                         fraction[0] = 1;
>                         fraction[1] = 0;
>                         denominator = 1;
>                         goto out;
> 
> This makes sense, because people who set swappiness==0 but
> do have swap space available would probably prefer some
> emergency swapping over an OOM kill.

Yes, but this is the global reclaim path. I was arguing about
swappiness==0 & memcg. As this patch doesn't make a big difference for
the global case (as both the changelog and you mentioned), we should
focus on whether this is a desirable change for the memcg path. I think
it makes sense to keep the "no swapping at all for memcg" semantic as we
have it currently.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* RE: [patch 2/8] mm: vmscan: disregard swappiness shortly before going OOM
  2012-12-14  8:37               ` Michal Hocko
@ 2012-12-14 19:44                 ` Satoru Moriya
  -1 siblings, 0 replies; 114+ messages in thread
From: Satoru Moriya @ 2012-12-14 19:44 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner
  Cc: Mel Gorman, Andrew Morton, Rik van Riel, Hugh Dickins, linux-mm,
	linux-kernel

On 12/14/2012 03:37 AM, Michal Hocko wrote:
> On Thu 13-12-12 23:50:30, Johannes Weiner wrote:
>> On Thu, Dec 13, 2012 at 10:25:43PM +0000, Satoru Moriya wrote:
>>>
>>> I introduced swappiness check here with fe35004f because, in some 
>>> cases, we prefer OOM to swap out pages to detect problems as soon as 
>>> possible. Basically, we design the system not to swap out and so if 
>>> it causes swapping, something goes wrong.
>>
>> I might be missing something terribly obvious, but... why do you add 
>> swap space to the system in the first place?  Or in case of cgroups, 
>> why not set the memsw limit equal to the memory limit?
> 
> I can answer the later. Because memsw comes with its price and 
> swappiness is much cheaper. On the other hand it makes sense that
> swappiness==0 doesn't swap at all. Or do you think we should get back 
> to _almost_ doesn't swap at all?
> 

Right. One of the reasons is what Michal described above, and another
reason I had in mind is the soft limit. Soft limit reclaim always works
with priority=0. Therefore, if we set a soft limit on a memcg without
swappiness=0, the kernel scans both anonymous and file-backed pages
during soft limit reclaim for that memcg and reclaims them.
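
For illustration, here is a minimal userspace model (not kernel code;
the LRU sizes are made up) of how "scan >>= sc->priority" scales the
per-LRU scan target; with priority == 0 the full size of both lists
becomes the target:

#include <stdio.h>

/* Toy model of "scan >>= sc->priority" from get_scan_count(): the scan
 * target for an LRU list is its size shifted right by the priority.
 */
static unsigned long scan_target(unsigned long lru_size, int priority)
{
	return lru_size >> priority;
}

int main(void)
{
	unsigned long anon = 1UL << 20, file = 1UL << 20;	/* made-up sizes */

	/* Normal reclaim starts at DEF_PRIORITY (12): small initial targets. */
	printf("priority 12: anon %lu file %lu\n",
	       scan_target(anon, 12), scan_target(file, 12));

	/* Soft limit reclaim runs with priority 0: both LRUs get their full
	 * size as the target, so anon is scanned as well unless the anon
	 * fraction is forced to zero.
	 */
	printf("priority  0: anon %lu file %lu\n",
	       scan_target(anon, 0), scan_target(file, 0));
	return 0;
}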

Regards,
Satoru

^ permalink raw reply	[flat|nested] 114+ messages in thread

* RE: [patch 2/8] mm: vmscan: disregard swappiness shortly before going OOM
  2012-12-14 15:43                 ` Rik van Riel
@ 2012-12-14 20:17                   ` Satoru Moriya
  -1 siblings, 0 replies; 114+ messages in thread
From: Satoru Moriya @ 2012-12-14 20:17 UTC (permalink / raw)
  To: Rik van Riel, Michal Hocko
  Cc: Johannes Weiner, Mel Gorman, Andrew Morton, Hugh Dickins,
	linux-mm, linux-kernel

On 12/14/2012 10:43 AM, Rik van Riel wrote:
> On 12/14/2012 03:37 AM, Michal Hocko wrote:
> 
>> I can answer the later. Because memsw comes with its price and 
>> swappiness is much cheaper. On the other hand it makes sense that
>> swappiness==0 doesn't swap at all. Or do you think we should get back 
>> to _almost_ doesn't swap at all?
> 
> swappiness==0 will swap in emergencies, specifically when we have 
> almost no page cache left, we will still swap things out:
> 
>         if (global_reclaim(sc)) {
>                 free  = zone_page_state(zone, NR_FREE_PAGES);
>                 if (unlikely(file + free <= high_wmark_pages(zone))) {
>                         /*
>                          * If we have very few page cache pages, force-scan
>                          * anon pages.
>                          */
>                         fraction[0] = 1;
>                         fraction[1] = 0;
>                         denominator = 1;
>                         goto out;
> 
> This makes sense, because people who set swappiness==0 but do have 
> swap space available would probably prefer some emergency swapping 
> over an OOM kill.

This behavior seems reasonable to me for global reclaim. But when
we hit this condition, it may be better to print a message to notify
users who set swappiness==0 that anon pages are being scanned.
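
Just to sketch the idea (a throwaway userspace model, not a proposed
patch; the kernel side would presumably use a printk-once style helper
instead):

#include <stdio.h>
#include <stdbool.h>

/* One-shot notice when reclaim with swappiness==0 is forced to scan
 * anon pages anyway because almost no page cache is left.
 */
static void note_forced_anon_scan(void)
{
	static bool warned;

	if (!warned) {
		warned = true;
		fprintf(stderr,
			"vmscan: low on page cache, scanning anon despite swappiness=0\n");
	}
}

int main(void)
{
	/* Hitting the emergency branch twice still prints only one notice. */
	note_forced_anon_scan();
	note_forced_anon_scan();
	return 0;
}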

Regards,
Satoru

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 2/8] mm: vmscan: disregard swappiness shortly before going OOM
  2012-12-14 16:13                   ` Michal Hocko
@ 2012-12-15  0:18                     ` Johannes Weiner
  -1 siblings, 0 replies; 114+ messages in thread
From: Johannes Weiner @ 2012-12-15  0:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rik van Riel, Satoru Moriya, Mel Gorman, Andrew Morton,
	Hugh Dickins, linux-mm, linux-kernel

On Fri, Dec 14, 2012 at 05:13:45PM +0100, Michal Hocko wrote:
> On Fri 14-12-12 10:43:55, Rik van Riel wrote:
> > On 12/14/2012 03:37 AM, Michal Hocko wrote:
> > 
> > >I can answer the later. Because memsw comes with its price and
> > >swappiness is much cheaper. On the other hand it makes sense that
> > >swappiness==0 doesn't swap at all. Or do you think we should get back to
> > >_almost_ doesn't swap at all?
> > 
> > swappiness==0 will swap in emergencies, specifically when we have
> > almost no page cache left, we will still swap things out:
> > 
> >         if (global_reclaim(sc)) {
> >                 free  = zone_page_state(zone, NR_FREE_PAGES);
> >                 if (unlikely(file + free <= high_wmark_pages(zone))) {
> >                         /*
> >                          * If we have very few page cache pages, force-scan
> >                          * anon pages.
> >                          */
> >                         fraction[0] = 1;
> >                         fraction[1] = 0;
> >                         denominator = 1;
> >                         goto out;
> > 
> > This makes sense, because people who set swappiness==0 but
> > do have swap space available would probably prefer some
> > emergency swapping over an OOM kill.
> 
> Yes, but this is the global reclaim path. I was arguing about
> swappiness==0 & memcg. As this patch doesn't make a big difference for
> the global case (as both the changelog and you mentioned) then we should
> focus on whether this is desirable change for the memcg path. I think it
> makes sense to keep "no swapping at all for memcg semantic" as we have
> it currently.

I would prefer we could agree on one thing, though.  Having global
reclaim behave differently from memcg reclaim violates the principle of
least surprise.  Having the code behave like that implicitly, without
any mention of global_reclaim() and vm_swappiness(), is unacceptable.

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 1/8] mm: memcg: only evict file pages when we have plenty
  2012-12-13 14:55         ` Michal Hocko
@ 2012-12-16  1:21           ` Simon Jeons
  -1 siblings, 0 replies; 114+ messages in thread
From: Simon Jeons @ 2012-12-16  1:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Rik van Riel, Andrew Morton, Mel Gorman,
	Hugh Dickins, linux-mm, linux-kernel

On 12/13/2012 10:55 PM, Michal Hocko wrote:
> On Wed 12-12-12 17:28:44, Johannes Weiner wrote:
>> On Wed, Dec 12, 2012 at 04:53:36PM -0500, Rik van Riel wrote:
>>> On 12/12/2012 04:43 PM, Johannes Weiner wrote:
>>>> dc0422c "mm: vmscan: only evict file pages when we have plenty" makes
>>>> a point of not going for anonymous memory while there is still enough
>>>> inactive cache around.
>>>>
>>>> The check was added only for global reclaim, but it is just as useful
>>>> for memory cgroup reclaim.
>>>>
>>>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>>>> ---
>>>>   mm/vmscan.c | 19 ++++++++++---------
>>>>   1 file changed, 10 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 157bb11..3874dcb 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -1671,6 +1671,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>>>>   		denominator = 1;
>>>>   		goto out;
>>>>   	}
>>>> +	/*
>>>> +	 * There is enough inactive page cache, do not reclaim
>>>> +	 * anything from the anonymous working set right now.
>>>> +	 */
>>>> +	if (!inactive_file_is_low(lruvec)) {
>>>> +		fraction[0] = 0;
>>>> +		fraction[1] = 1;
>>>> +		denominator = 1;
>>>> +		goto out;
>>>> +	}
>>>>
>>>>   	anon  = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
>>>>   		get_lru_size(lruvec, LRU_INACTIVE_ANON);
>>>> @@ -1688,15 +1698,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>>>>   			fraction[1] = 0;
>>>>   			denominator = 1;
>>>>   			goto out;
>>>> -		} else if (!inactive_file_is_low_global(zone)) {
>>>> -			/*
>>>> -			 * There is enough inactive page cache, do not
>>>> -			 * reclaim anything from the working set right now.
>>>> -			 */
>>>> -			fraction[0] = 0;
>>>> -			fraction[1] = 1;
>>>> -			denominator = 1;
>>>> -			goto out;
>>>>   		}
>>>>   	}
>>>>
>>>>
>>> I believe the if() block should be moved to AFTER
>>> the check where we make sure we actually have enough
>>> file pages.
>> You are absolutely right, this makes more sense.  Although I'd figure
>> the impact would be small because if there actually is that little
>> file cache, it won't be there for long with force-file scanning... :-)
> Yes, I think that the result would be worse (more swapping) so the
> change can only help.
>
>> I moved the condition, but it throws conflicts in the rest of the
>> series.  Will re-run tests, wait for Michal and Mel, then resend.
> Yes the patch makes sense for memcg as well. I guess you have tested
> this primarily with memcg. Do you have any numbers? Would be nice to put
> them into the changelog if you have (it should help to reduce swapping
> with heavy streaming IO load).
>
> Acked-by: Michal Hocko <mhocko@suse.cz>

Hi Michal,

I still can't understand why "The goto out means that it should be fine
either way." Could you explain it to me? Sorry for my ignorance. :-)


Regards,
Simon




^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 1/8] mm: memcg: only evict file pages when we have plenty
  2012-12-16  1:21           ` Simon Jeons
@ 2012-12-17 15:54             ` Michal Hocko
  -1 siblings, 0 replies; 114+ messages in thread
From: Michal Hocko @ 2012-12-17 15:54 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Johannes Weiner, Rik van Riel, Andrew Morton, Mel Gorman,
	Hugh Dickins, linux-mm, linux-kernel

On Sun 16-12-12 09:21:54, Simon Jeons wrote:
> On 12/13/2012 10:55 PM, Michal Hocko wrote:
> >On Wed 12-12-12 17:28:44, Johannes Weiner wrote:
> >>On Wed, Dec 12, 2012 at 04:53:36PM -0500, Rik van Riel wrote:
> >>>On 12/12/2012 04:43 PM, Johannes Weiner wrote:
> >>>>dc0422c "mm: vmscan: only evict file pages when we have plenty" makes
> >>>>a point of not going for anonymous memory while there is still enough
> >>>>inactive cache around.
> >>>>
> >>>>The check was added only for global reclaim, but it is just as useful
> >>>>for memory cgroup reclaim.
> >>>>
> >>>>Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> >>>>---
> >>>>  mm/vmscan.c | 19 ++++++++++---------
> >>>>  1 file changed, 10 insertions(+), 9 deletions(-)
> >>>>
> >>>>diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>>>index 157bb11..3874dcb 100644
> >>>>--- a/mm/vmscan.c
> >>>>+++ b/mm/vmscan.c
> >>>>@@ -1671,6 +1671,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> >>>>  		denominator = 1;
> >>>>  		goto out;
> >>>>  	}
> >>>>+	/*
> >>>>+	 * There is enough inactive page cache, do not reclaim
> >>>>+	 * anything from the anonymous working set right now.
> >>>>+	 */
> >>>>+	if (!inactive_file_is_low(lruvec)) {
> >>>>+		fraction[0] = 0;
> >>>>+		fraction[1] = 1;
> >>>>+		denominator = 1;
> >>>>+		goto out;
> >>>>+	}
> >>>>
> >>>>  	anon  = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
> >>>>  		get_lru_size(lruvec, LRU_INACTIVE_ANON);
> >>>>@@ -1688,15 +1698,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> >>>>  			fraction[1] = 0;
> >>>>  			denominator = 1;
> >>>>  			goto out;
> >>>>-		} else if (!inactive_file_is_low_global(zone)) {
> >>>>-			/*
> >>>>-			 * There is enough inactive page cache, do not
> >>>>-			 * reclaim anything from the working set right now.
> >>>>-			 */
> >>>>-			fraction[0] = 0;
> >>>>-			fraction[1] = 1;
> >>>>-			denominator = 1;
> >>>>-			goto out;
> >>>>  		}
> >>>>  	}
> >>>>
> >>>>
> >>>I believe the if() block should be moved to AFTER
> >>>the check where we make sure we actually have enough
> >>>file pages.
> >>You are absolutely right, this makes more sense.  Although I'd figure
> >>the impact would be small because if there actually is that little
> >>file cache, it won't be there for long with force-file scanning... :-)
> >Yes, I think that the result would be worse (more swapping) so the
> >change can only help.
> >
> >>I moved the condition, but it throws conflicts in the rest of the
> >>series.  Will re-run tests, wait for Michal and Mel, then resend.
> >Yes the patch makes sense for memcg as well. I guess you have tested
> >this primarily with memcg. Do you have any numbers? Would be nice to put
> >them into the changelog if you have (it should help to reduce swapping
> >with heavy streaming IO load).
> >
> >Acked-by: Michal Hocko <mhocko@suse.cz>
> 
> Hi Michal,
> 
> I still can't understand why "The goto out means that it should be
> fine either way.",

Not sure I understand your question. goto out just says that either the
page cache is low or the inactive file LRU is too small. And it works
for both memcg and global reclaim because the "page cache is low"
condition is evaluated only for global reclaim, and always before the
"inactive file is small" check. Does that make more sense?
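
A compressed standalone model of that ordering (simplified from
get_scan_count(); the names and missing feedback terms are my own
shorthand, not the kernel's):

#include <stdio.h>
#include <stdbool.h>

/* Order of the checks after the patch: the "page cache is low" test is
 * global-reclaim only and comes first; the "enough inactive file" test
 * applies to both global and memcg reclaim; otherwise fall through to
 * the proportional anon/file split.
 */
static const char *pick_split(bool global_reclaim, unsigned long file,
			      unsigned long free, unsigned long high_wmark,
			      bool inactive_file_is_low)
{
	if (global_reclaim && file + free <= high_wmark)
		return "force anon";	/* almost no cache left */
	if (!inactive_file_is_low)
		return "force file";	/* plenty of inactive cache */
	return "proportional split";
}

int main(void)
{
	/* almost no cache left, global reclaim: force anon */
	printf("%s\n", pick_split(true, 10, 10, 100, false));
	/* memcg reclaim with plenty of inactive cache: force file */
	printf("%s\n", pick_split(false, 10, 10, 100, false));
	/* enough cache but inactive file is low: proportional split */
	printf("%s\n", pick_split(true, 1000, 1000, 100, true));
	return 0;
}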

> could you explain to me, sorry for my stupid. :-)

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 2/8] mm: vmscan: disregard swappiness shortly before going OOM
  2012-12-15  0:18                     ` Johannes Weiner
@ 2012-12-17 16:37                       ` Michal Hocko
  -1 siblings, 0 replies; 114+ messages in thread
From: Michal Hocko @ 2012-12-17 16:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, Satoru Moriya, Mel Gorman, Andrew Morton,
	Hugh Dickins, linux-mm, linux-kernel

On Fri 14-12-12 19:18:51, Johannes Weiner wrote:
> On Fri, Dec 14, 2012 at 05:13:45PM +0100, Michal Hocko wrote:
> > On Fri 14-12-12 10:43:55, Rik van Riel wrote:
> > > On 12/14/2012 03:37 AM, Michal Hocko wrote:
> > > 
> > > >I can answer the later. Because memsw comes with its price and
> > > >swappiness is much cheaper. On the other hand it makes sense that
> > > >swappiness==0 doesn't swap at all. Or do you think we should get back to
> > > >_almost_ doesn't swap at all?
> > > 
> > > swappiness==0 will swap in emergencies, specifically when we have
> > > almost no page cache left, we will still swap things out:
> > > 
> > >         if (global_reclaim(sc)) {
> > >                 free  = zone_page_state(zone, NR_FREE_PAGES);
> > >                 if (unlikely(file + free <= high_wmark_pages(zone))) {
> > >                         /*
> > >                          * If we have very few page cache pages, force-scan
> > >                          * anon pages.
> > >                          */
> > >                         fraction[0] = 1;
> > >                         fraction[1] = 0;
> > >                         denominator = 1;
> > >                         goto out;
> > > 
> > > This makes sense, because people who set swappiness==0 but
> > > do have swap space available would probably prefer some
> > > emergency swapping over an OOM kill.
> > 
> > Yes, but this is the global reclaim path. I was arguing about
> > swappiness==0 & memcg. As this patch doesn't make a big difference for
> > the global case (as both the changelog and you mentioned) then we should
> > focus on whether this is desirable change for the memcg path. I think it
> > makes sense to keep "no swapping at all for memcg semantic" as we have
> > it currently.
> 
> I would prefer we could agree on one thing, though.  Having global
> reclaim behave different from memcg reclaim violates the principle of
> least surprise. 

Hmm, I think that no swapping at all with swappiness==0 makes some sense
for global reclaim as well. Why should we swap if the admin told us not
to?
I don't feel strongly about that, though, because global swappiness has
been more relaxed in the past and people got used to that. We have
already seen bug reports where users were surprised by high IO wait
times when it turned out that they had swappiness set to 0; that used
to prevent swapping most of the time, but fe35004f changed that.

Use cases for memcg are more natural because memcg allows much better
control over OOM, and the requirements for (not) swapping are per group
rather than based on swap availability. We shouldn't push users into
using memcg swap accounting to accomplish the same thing, IMHO, because
the accounting has some cost and its primary purpose is not to disable
swapping but rather to keep it on a leash. The two approaches are also
different from a semantic point of view: swappiness is proportional,
while the limit is an absolute number.
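
To make "proportional" a bit more concrete, here is a minimal sketch of
the split (assuming the usual anon_prio = swappiness, file_prio =
200 - swappiness weighting, and ignoring the recent_scanned /
recent_rotated feedback):

#include <stdio.h>

/* Simplified proportional split: anon is weighted by swappiness, file
 * by 200 - swappiness.  swappiness==0 drives the anon share to zero
 * proportionally, whereas a memsw limit is an absolute cap -- two very
 * different kinds of control.
 */
static void split(unsigned long anon, unsigned long file,
		  unsigned int swappiness)
{
	unsigned int anon_prio = swappiness;
	unsigned int file_prio = 200 - swappiness;
	unsigned long denom = anon_prio + file_prio;	/* == 200 */

	printf("swappiness %3u: scan anon %lu, scan file %lu\n", swappiness,
	       anon * anon_prio / denom, file * file_prio / denom);
}

int main(void)
{
	split(100000, 100000, 60);	/* the default */
	split(100000, 100000, 1);	/* swap only very reluctantly */
	split(100000, 100000, 0);	/* anon share is exactly zero */
	return 0;
}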

> Having the code behave like that implicitely without any mention of
> global_reclaim() and vm_swappiness() is unacceptable.

So what about:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7f30961..e6d4f23 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1750,7 +1750,15 @@ out:
 		unsigned long scan;
 
 		scan = get_lru_size(lruvec, lru);
-		if (sc->priority || noswap || !vmscan_swappiness(sc)) {
+		/*
+		 * Memcg targeted reclaim, unlike the global reclaim, honours
+		 * swappiness==0 and no swapping is allowed even if that would
+		 * lead to an OOM killer which is a) local to the group resp.
+		 * hierarchy and moreover can be handled from userspace which
+		 * makes it different from the global reclaim.
+		 */
+		if (sc->priority || noswap ||
+				(!global_reclaim(sc) && !vmscan_swappiness(sc))) {
 			scan >>= sc->priority;
 			if (!scan && force_scan)
 				scan = SWAP_CLUSTER_MAX;
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 114+ messages in thread

* Re: [patch 2/8] mm: vmscan: disregard swappiness shortly before going OOM
  2012-12-17 16:37                       ` Michal Hocko
@ 2012-12-17 17:54                         ` Johannes Weiner
  -1 siblings, 0 replies; 114+ messages in thread
From: Johannes Weiner @ 2012-12-17 17:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rik van Riel, Satoru Moriya, Mel Gorman, Andrew Morton,
	Hugh Dickins, linux-mm, linux-kernel

On Mon, Dec 17, 2012 at 05:37:35PM +0100, Michal Hocko wrote:
> On Fri 14-12-12 19:18:51, Johannes Weiner wrote:
> > On Fri, Dec 14, 2012 at 05:13:45PM +0100, Michal Hocko wrote:
> > > On Fri 14-12-12 10:43:55, Rik van Riel wrote:
> > > > On 12/14/2012 03:37 AM, Michal Hocko wrote:
> > > > 
> > > > >I can answer the later. Because memsw comes with its price and
> > > > >swappiness is much cheaper. On the other hand it makes sense that
> > > > >swappiness==0 doesn't swap at all. Or do you think we should get back to
> > > > >_almost_ doesn't swap at all?
> > > > 
> > > > swappiness==0 will swap in emergencies, specifically when we have
> > > > almost no page cache left, we will still swap things out:
> > > > 
> > > >         if (global_reclaim(sc)) {
> > > >                 free  = zone_page_state(zone, NR_FREE_PAGES);
> > > >                 if (unlikely(file + free <= high_wmark_pages(zone))) {
> > > >                         /*
> > > >                          * If we have very few page cache pages, force-scan
> > > >                          * anon pages.
> > > >                          */
> > > >                         fraction[0] = 1;
> > > >                         fraction[1] = 0;
> > > >                         denominator = 1;
> > > >                         goto out;
> > > > 
> > > > This makes sense, because people who set swappiness==0 but
> > > > do have swap space available would probably prefer some
> > > > emergency swapping over an OOM kill.
> > > 
> > > Yes, but this is the global reclaim path. I was arguing about
> > > swappiness==0 & memcg. As this patch doesn't make a big difference for
> > > the global case (as both the changelog and you mentioned) then we should
> > > focus on whether this is desirable change for the memcg path. I think it
> > > makes sense to keep "no swapping at all for memcg semantic" as we have
> > > it currently.
> > 
> > I would prefer we could agree on one thing, though.  Having global
> > reclaim behave different from memcg reclaim violates the principle of
> > least surprise. 
> 
> Hmm, I think that no swapping at all with swappiness==0 makes some sense
> with the global reclaim as well. Why should we swap if admin told us not
> to do that?
> I am not so strong in that though because the global swappiness has been
> more relaxed in the past and people got used to that. We have seen bug
> reports already where users were surprised by a high io wait times when
> it turned out that they had swappiness set to 0 because that prevented
> swapping most of the time in the past but fe35004f changed that.
> 
> Usecases for memcg are more natural because memcg allows much better
> control over OOM and also requirements for (not) swapping are per group
> rather than on swap availability. We shouldn't push users into using
> memcg swap accounting to accomplish the same IMHO because the accounting
> has some costs and its primary usage is not to disable swapping but
> rather to keep it on the leash. The two approaches are also different
> from semantic point of view. Swappiness is proportional while the limit
> is an absolute number.

I agree with the use case that Rik described, though: it makes sense to
go for file cache exclusively as long as the VM can make progress, but
once we are getting close to OOM, we may as well swap.  Swappiness
describes an eagerness to swap, not a limit.  Never swapping with
!swappiness does not allow you to do this, and even with very low
swappiness settings you can end up swapping under just a little VM
load.

The way swappiness works for memcg gives you TWO options to prevent
swapping entirely for individual groups, but no option to swap only in
case of emergency, which I think is the broader use case.

But I also won't fight this in a last-minute submission, so I dropped
this change of behaviour for now; it'll just be a cleanup.
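
For reference, the two options boil down to the following knobs (a
minimal sketch only; the cgroup-v1 mount point and the "example" group
are hypothetical, and the memsw file requires swap accounting to be
enabled):

#include <stdio.h>

static int write_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	/* Option 1: per-group swappiness == 0. */
	write_knob("/sys/fs/cgroup/memory/example/memory.swappiness", "0");

	/* Option 2: set the memsw limit equal to the memory limit, so the
	 * group has no swap headroom at all.
	 */
	write_knob("/sys/fs/cgroup/memory/example/memory.limit_in_bytes",
		   "536870912");
	write_knob("/sys/fs/cgroup/memory/example/memory.memsw.limit_in_bytes",
		   "536870912");
	return 0;
}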

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 2/8] mm: vmscan: disregard swappiness shortly before going OOM
  2012-12-17 17:54                         ` Johannes Weiner
@ 2012-12-17 19:58                           ` Michal Hocko
  -1 siblings, 0 replies; 114+ messages in thread
From: Michal Hocko @ 2012-12-17 19:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, Satoru Moriya, Mel Gorman, Andrew Morton,
	Hugh Dickins, linux-mm, linux-kernel

On Mon 17-12-12 12:54:15, Johannes Weiner wrote:
> On Mon, Dec 17, 2012 at 05:37:35PM +0100, Michal Hocko wrote:
> > On Fri 14-12-12 19:18:51, Johannes Weiner wrote:
> > > On Fri, Dec 14, 2012 at 05:13:45PM +0100, Michal Hocko wrote:
> > > > On Fri 14-12-12 10:43:55, Rik van Riel wrote:
> > > > > On 12/14/2012 03:37 AM, Michal Hocko wrote:
> > > > > 
> > > > > >I can answer the later. Because memsw comes with its price and
> > > > > >swappiness is much cheaper. On the other hand it makes sense that
> > > > > >swappiness==0 doesn't swap at all. Or do you think we should get back to
> > > > > >_almost_ doesn't swap at all?
> > > > > 
> > > > > swappiness==0 will swap in emergencies, specifically when we have
> > > > > almost no page cache left, we will still swap things out:
> > > > > 
> > > > >         if (global_reclaim(sc)) {
> > > > >                 free  = zone_page_state(zone, NR_FREE_PAGES);
> > > > >                 if (unlikely(file + free <= high_wmark_pages(zone))) {
> > > > >                         /*
> > > > >                          * If we have very few page cache pages, force-scan
> > > > >                          * anon pages.
> > > > >                          */
> > > > >                         fraction[0] = 1;
> > > > >                         fraction[1] = 0;
> > > > >                         denominator = 1;
> > > > >                         goto out;
> > > > > 
> > > > > This makes sense, because people who set swappiness==0 but
> > > > > do have swap space available would probably prefer some
> > > > > emergency swapping over an OOM kill.
> > > > 
> > > > Yes, but this is the global reclaim path. I was arguing about
> > > > swappiness==0 & memcg. As this patch doesn't make a big difference for
> > > > the global case (as both the changelog and you mentioned) then we should
> > > > focus on whether this is desirable change for the memcg path. I think it
> > > > makes sense to keep "no swapping at all for memcg semantic" as we have
> > > > it currently.
> > > 
> > > I would prefer we could agree on one thing, though.  Having global
> > > reclaim behave different from memcg reclaim violates the principle of
> > > least surprise. 
> > 
> > Hmm, I think that no swapping at all with swappiness==0 makes some sense
> > with the global reclaim as well. Why should we swap if admin told us not
> > to do that?
> > I am not so strong in that though because the global swappiness has been
> > more relaxed in the past and people got used to that. We have seen bug
> > reports already where users were surprised by a high io wait times when
> > it turned out that they had swappiness set to 0 because that prevented
> > swapping most of the time in the past but fe35004f changed that.
> > 
> > Usecases for memcg are more natural because memcg allows much better
> > control over OOM and also requirements for (not) swapping are per group
> > rather than on swap availability. We shouldn't push users into using
> > memcg swap accounting to accomplish the same IMHO because the accounting
> > has some costs and its primary usage is not to disable swapping but
> > rather to keep it on the leash. The two approaches are also different
> > from semantic point of view. Swappiness is proportional while the limit
> > is an absolute number.
> 
> I agree with the usecase that Rik described, though: it makes sense to
> go for file cache exclusively as long as the VM can make progress, but
> once we are getting close to OOM, we may as well swap.  swappiness is
> describing an eagerness to swap, not a limit.  Not swapping ever with
> !swappiness does not allow you to do this, even with very low
> swappiness settings, you can end up swapping with just little VM load.
> 
> They way swappiness works for memcg gives you TWO options to prevent
> swapping entirely for individual groups, but no option to swap only in
> case of emergency, which I think is the broader usecase.

I think this is for a longer discussion.

> But I also won't fight this in this last-minute submission so I
> dropped this change of behaviour for now, it'll just be a cleanup.

Yes, this is reasonable. This change is in no way a cleanup, so it
would just delay an otherwise very nice cleanup.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 1/8] mm: memcg: only evict file pages when we have plenty
  2012-12-17 15:54             ` Michal Hocko
@ 2012-12-19  5:21               ` Simon Jeons
  -1 siblings, 0 replies; 114+ messages in thread
From: Simon Jeons @ 2012-12-19  5:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Rik van Riel, Andrew Morton, Mel Gorman,
	Hugh Dickins, linux-mm, linux-kernel

On Mon, 2012-12-17 at 16:54 +0100, Michal Hocko wrote:
> On Sun 16-12-12 09:21:54, Simon Jeons wrote:
> > On 12/13/2012 10:55 PM, Michal Hocko wrote:
> > >On Wed 12-12-12 17:28:44, Johannes Weiner wrote:
> > >>On Wed, Dec 12, 2012 at 04:53:36PM -0500, Rik van Riel wrote:
> > >>>On 12/12/2012 04:43 PM, Johannes Weiner wrote:
> > >>>>dc0422c "mm: vmscan: only evict file pages when we have plenty" makes
> > >>>>a point of not going for anonymous memory while there is still enough
> > >>>>inactive cache around.
> > >>>>
> > >>>>The check was added only for global reclaim, but it is just as useful
> > >>>>for memory cgroup reclaim.
> > >>>>
> > >>>>Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > >>>>---
> > >>>>  mm/vmscan.c | 19 ++++++++++---------
> > >>>>  1 file changed, 10 insertions(+), 9 deletions(-)
> > >>>>
> > >>>>diff --git a/mm/vmscan.c b/mm/vmscan.c
> > >>>>index 157bb11..3874dcb 100644
> > >>>>--- a/mm/vmscan.c
> > >>>>+++ b/mm/vmscan.c
> > >>>>@@ -1671,6 +1671,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> > >>>>  		denominator = 1;
> > >>>>  		goto out;
> > >>>>  	}
> > >>>>+	/*
> > >>>>+	 * There is enough inactive page cache, do not reclaim
> > >>>>+	 * anything from the anonymous working set right now.
> > >>>>+	 */
> > >>>>+	if (!inactive_file_is_low(lruvec)) {
> > >>>>+		fraction[0] = 0;
> > >>>>+		fraction[1] = 1;
> > >>>>+		denominator = 1;
> > >>>>+		goto out;
> > >>>>+	}
> > >>>>
> > >>>>  	anon  = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
> > >>>>  		get_lru_size(lruvec, LRU_INACTIVE_ANON);
> > >>>>@@ -1688,15 +1698,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> > >>>>  			fraction[1] = 0;
> > >>>>  			denominator = 1;
> > >>>>  			goto out;
> > >>>>-		} else if (!inactive_file_is_low_global(zone)) {
> > >>>>-			/*
> > >>>>-			 * There is enough inactive page cache, do not
> > >>>>-			 * reclaim anything from the working set right now.
> > >>>>-			 */
> > >>>>-			fraction[0] = 0;
> > >>>>-			fraction[1] = 1;
> > >>>>-			denominator = 1;
> > >>>>-			goto out;
> > >>>>  		}
> > >>>>  	}
> > >>>>
> > >>>>
> > >>>I believe the if() block should be moved to AFTER
> > >>>the check where we make sure we actually have enough
> > >>>file pages.
> > >>You are absolutely right, this makes more sense.  Although I'd figure
> > >>the impact would be small because if there actually is that little
> > >>file cache, it won't be there for long with force-file scanning... :-)
> > >Yes, I think that the result would be worse (more swapping) so the
> > >change can only help.
> > >
> > >>I moved the condition, but it throws conflicts in the rest of the
> > >>series.  Will re-run tests, wait for Michal and Mel, then resend.
> > >Yes the patch makes sense for memcg as well. I guess you have tested
> > >this primarily with memcg. Do you have any numbers? Would be nice to put
> > >them into the changelog if you have (it should help to reduce swapping
> > >with heavy streaming IO load).
> > >
> > >Acked-by: Michal Hocko <mhocko@suse.cz>
> > 
> > Hi Michal,
> > 
> > I still can't understand why "The goto out means that it should be
> > fine either way.",
> 
> Not sure I understand your question. goto out just says that either page
> cache is low or inactive file LRU is too small. And it works for both
> memcg and global because the page cache is low condition is evaluated
> only for the global reclaim and always before inactive file is small.
> Makes more sense?

Hi Michal,

I'm confused by Gorman's comments below: why is the logic change still fine?

Current
  low_file      inactive_is_high        force reclaim anon
  low_file      !inactive_is_high       force reclaim anon
  !low_file     inactive_is_high        force reclaim file
  !low_file     !inactive_is_high       normal split

Your patch

  low_file      inactive_is_high        force reclaim anon
  low_file      !inactive_is_high       force reclaim anon
  !low_file     inactive_is_high        force reclaim file
  !low_file     !inactive_is_high       normal split

However, if you move the inactive_file_is_low check down you get

Moving the check
  low_file      inactive_is_high        force reclaim file
  low_file      !inactive_is_high       force reclaim anon
  !low_file     inactive_is_high        force reclaim file
  !low_file     !inactive_is_high       normal split
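
For what it's worth, the two orderings can be modelled directly.  The toy
userspace program below is not the vmscan.c code: the helper names are made
up, the global_reclaim() condition on the low-file branch is elided, and it
assumes "moving the check down" means placing the !inactive_file_is_low()
test after the almost-no-file-pages branch, as Rik suggested.

#include <stdbool.h>
#include <stdio.h>

/* Patch as posted: the !inactive_file_is_low() branch runs first. */
static const char *check_before(bool low_file, bool inactive_is_high)
{
	if (inactive_is_high)
		return "file only";	/* plenty of inactive cache: skip anon */
	if (low_file)
		return "anon only";	/* hardly any cache left: go for anon */
	return "proportional";
}

/* Check moved below the low-file branch, as suggested. */
static const char *check_after(bool low_file, bool inactive_is_high)
{
	if (low_file)
		return "anon only";	/* the low-file branch now wins first */
	if (inactive_is_high)
		return "file only";
	return "proportional";
}

int main(void)
{
	for (int low = 1; low >= 0; low--)
		for (int high = 1; high >= 0; high--)
			printf("low_file=%d inactive_is_high=%d: posted=%-12s moved=%s\n",
			       low, high,
			       check_before(low, high), check_after(low, high));
	return 0;
}

With that reading, only the low_file + inactive_is_high row differs between
the two variants, and the moved check resolves it in favour of anon.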

> 
> > could you explain it to me? Sorry if this is a stupid question. :-)
> 



^ permalink raw reply	[flat|nested] 114+ messages in thread

* Re: [patch 1/8] mm: memcg: only evict file pages when we have plenty
  2012-12-19  5:21               ` Simon Jeons
@ 2012-12-19  9:20                 ` Mel Gorman
  -1 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2012-12-19  9:20 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Michal Hocko, Johannes Weiner, Rik van Riel, Andrew Morton,
	Hugh Dickins, linux-mm, linux-kernel

On Wed, Dec 19, 2012 at 12:21:55AM -0500, Simon Jeons wrote:
> On Mon, 2012-12-17 at 16:54 +0100, Michal Hocko wrote:
> > On Sun 16-12-12 09:21:54, Simon Jeons wrote:
> > > On 12/13/2012 10:55 PM, Michal Hocko wrote:
> > > >On Wed 12-12-12 17:28:44, Johannes Weiner wrote:
> > > >>On Wed, Dec 12, 2012 at 04:53:36PM -0500, Rik van Riel wrote:
> > > >>>On 12/12/2012 04:43 PM, Johannes Weiner wrote:
> > > >>>>dc0422c "mm: vmscan: only evict file pages when we have plenty" makes
> > > >>>>a point of not going for anonymous memory while there is still enough
> > > >>>>inactive cache around.
> > > >>>>
> > > >>>>The check was added only for global reclaim, but it is just as useful
> > > >>>>for memory cgroup reclaim.
> > > >>>>
> > > >>>>Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > >>>>---
> > > >>>>  mm/vmscan.c | 19 ++++++++++---------
> > > >>>>  1 file changed, 10 insertions(+), 9 deletions(-)
> > > >>>>
> > > >>>>diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > >>>>index 157bb11..3874dcb 100644
> > > >>>>--- a/mm/vmscan.c
> > > >>>>+++ b/mm/vmscan.c
> > > >>>>@@ -1671,6 +1671,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> > > >>>>  		denominator = 1;
> > > >>>>  		goto out;
> > > >>>>  	}
> > > >>>>+	/*
> > > >>>>+	 * There is enough inactive page cache, do not reclaim
> > > >>>>+	 * anything from the anonymous working set right now.
> > > >>>>+	 */
> > > >>>>+	if (!inactive_file_is_low(lruvec)) {
> > > >>>>+		fraction[0] = 0;
> > > >>>>+		fraction[1] = 1;
> > > >>>>+		denominator = 1;
> > > >>>>+		goto out;
> > > >>>>+	}
> > > >>>>
> > > >>>>  	anon  = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
> > > >>>>  		get_lru_size(lruvec, LRU_INACTIVE_ANON);
> > > >>>>@@ -1688,15 +1698,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> > > >>>>  			fraction[1] = 0;
> > > >>>>  			denominator = 1;
> > > >>>>  			goto out;
> > > >>>>-		} else if (!inactive_file_is_low_global(zone)) {
> > > >>>>-			/*
> > > >>>>-			 * There is enough inactive page cache, do not
> > > >>>>-			 * reclaim anything from the working set right now.
> > > >>>>-			 */
> > > >>>>-			fraction[0] = 0;
> > > >>>>-			fraction[1] = 1;
> > > >>>>-			denominator = 1;
> > > >>>>-			goto out;
> > > >>>>  		}
> > > >>>>  	}
> > > >>>>
> > > >>>>
> > > >>>I believe the if() block should be moved to AFTER
> > > >>>the check where we make sure we actually have enough
> > > >>>file pages.
> > > >>You are absolutely right, this makes more sense.  Although I'd figure
> > > >>the impact would be small because if there actually is that little
> > > >>file cache, it won't be there for long with force-file scanning... :-)
> > > >Yes, I think that the result would be worse (more swapping) so the
> > > >change can only help.
> > > >
> > > >>I moved the condition, but it throws conflicts in the rest of the
> > > >>series.  Will re-run tests, wait for Michal and Mel, then resend.
> > > >Yes the patch makes sense for memcg as well. I guess you have tested
> > > >this primarily with memcg. Do you have any numbers? Would be nice to put
> > > >them into the changelog if you have (it should help to reduce swapping
> > > >with heavy streaming IO load).
> > > >
> > > >Acked-by: Michal Hocko <mhocko@suse.cz>
> > > 
> > > Hi Michal,
> > > 
> > > I still can't understand why "The goto out means that it should be
> > > fine either way.",
> > 
> > Not sure I understand your question. goto out just says that either page
> > cache is low or inactive file LRU is too small. And it works for both
> > memcg and global because the page cache is low condition is evaluated
> > only for the global reclaim and always before inactive file is small.
> > Makes more sense?
> 
> Hi Michal,
> 
> I'm confused by Gorman's comments below: why is the logic change still fine?
> 

Those comments were wrong.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 114+ messages in thread

end of thread, other threads:[~2012-12-19  9:20 UTC | newest]

Thread overview: 114+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-12 21:43 [patch 0/8] page reclaim bits Johannes Weiner
2012-12-12 21:43 ` Johannes Weiner
2012-12-12 21:43 ` [patch 1/8] mm: memcg: only evict file pages when we have plenty Johannes Weiner
2012-12-12 21:43   ` Johannes Weiner
2012-12-12 21:53   ` Rik van Riel
2012-12-12 21:53     ` Rik van Riel
2012-12-12 22:28     ` Johannes Weiner
2012-12-12 22:28       ` Johannes Weiner
2012-12-13 10:07       ` Mel Gorman
2012-12-13 10:07         ` Mel Gorman
2012-12-13 14:44         ` Mel Gorman
2012-12-13 14:44           ` Mel Gorman
2012-12-13 14:55       ` Michal Hocko
2012-12-13 14:55         ` Michal Hocko
2012-12-16  1:21         ` Simon Jeons
2012-12-16  1:21           ` Simon Jeons
2012-12-17 15:54           ` Michal Hocko
2012-12-17 15:54             ` Michal Hocko
2012-12-19  5:21             ` Simon Jeons
2012-12-19  5:21               ` Simon Jeons
2012-12-19  9:20               ` Mel Gorman
2012-12-19  9:20                 ` Mel Gorman
2012-12-13  5:36     ` Simon Jeons
2012-12-13  5:36       ` Simon Jeons
2012-12-13  5:34   ` Simon Jeons
2012-12-13  5:34     ` Simon Jeons
2012-12-12 21:43 ` [patch 2/8] mm: vmscan: disregard swappiness shortly before going OOM Johannes Weiner
2012-12-12 21:43   ` Johannes Weiner
2012-12-12 22:01   ` Rik van Riel
2012-12-12 22:01     ` Rik van Riel
2012-12-13  5:56   ` Simon Jeons
2012-12-13  5:56     ` Simon Jeons
2012-12-13 10:34   ` Mel Gorman
2012-12-13 10:34     ` Mel Gorman
2012-12-13 15:29     ` Michal Hocko
2012-12-13 15:29       ` Michal Hocko
2012-12-13 16:05       ` Michal Hocko
2012-12-13 16:05         ` Michal Hocko
2012-12-13 22:25         ` Satoru Moriya
2012-12-13 22:25           ` Satoru Moriya
2012-12-14  4:50           ` Johannes Weiner
2012-12-14  4:50             ` Johannes Weiner
2012-12-14  8:37             ` Michal Hocko
2012-12-14  8:37               ` Michal Hocko
2012-12-14 15:43               ` Rik van Riel
2012-12-14 15:43                 ` Rik van Riel
2012-12-14 16:13                 ` Michal Hocko
2012-12-14 16:13                   ` Michal Hocko
2012-12-15  0:18                   ` Johannes Weiner
2012-12-15  0:18                     ` Johannes Weiner
2012-12-17 16:37                     ` Michal Hocko
2012-12-17 16:37                       ` Michal Hocko
2012-12-17 17:54                       ` Johannes Weiner
2012-12-17 17:54                         ` Johannes Weiner
2012-12-17 19:58                         ` Michal Hocko
2012-12-17 19:58                           ` Michal Hocko
2012-12-14 20:17                 ` Satoru Moriya
2012-12-14 20:17                   ` Satoru Moriya
2012-12-14 19:44               ` Satoru Moriya
2012-12-14 19:44                 ` Satoru Moriya
2012-12-13 19:05     ` Johannes Weiner
2012-12-13 19:05       ` Johannes Weiner
2012-12-13 19:47       ` Mel Gorman
2012-12-13 19:47         ` Mel Gorman
2012-12-12 21:43 ` [patch 3/8] mm: vmscan: save work scanning (almost) empty LRU lists Johannes Weiner
2012-12-12 21:43   ` Johannes Weiner
2012-12-12 22:02   ` Rik van Riel
2012-12-12 22:02     ` Rik van Riel
2012-12-13 10:41   ` Mel Gorman
2012-12-13 10:41     ` Mel Gorman
2012-12-13 19:33     ` Johannes Weiner
2012-12-13 19:33       ` Johannes Weiner
2012-12-13 15:43   ` Michal Hocko
2012-12-13 15:43     ` Michal Hocko
2012-12-13 19:38     ` Johannes Weiner
2012-12-13 19:38       ` Johannes Weiner
2012-12-14  8:46       ` Michal Hocko
2012-12-14  8:46         ` Michal Hocko
2012-12-12 21:43 ` [patch 4/8] mm: vmscan: clarify LRU balancing close to OOM Johannes Weiner
2012-12-12 21:43   ` Johannes Weiner
2012-12-12 22:03   ` Rik van Riel
2012-12-12 22:03     ` Rik van Riel
2012-12-13 10:46   ` Mel Gorman
2012-12-13 10:46     ` Mel Gorman
2012-12-12 21:43 ` [patch 5/8] mm: vmscan: improve comment on low-page cache handling Johannes Weiner
2012-12-12 21:43   ` Johannes Weiner
2012-12-12 22:04   ` Rik van Riel
2012-12-12 22:04     ` Rik van Riel
2012-12-13 10:47   ` Mel Gorman
2012-12-13 10:47     ` Mel Gorman
2012-12-13 16:07   ` Michal Hocko
2012-12-13 16:07     ` Michal Hocko
2012-12-12 21:43 ` [patch 6/8] mm: vmscan: clean up get_scan_count() Johannes Weiner
2012-12-12 21:43   ` Johannes Weiner
2012-12-12 22:06   ` Rik van Riel
2012-12-12 22:06     ` Rik van Riel
2012-12-13 11:07   ` Mel Gorman
2012-12-13 11:07     ` Mel Gorman
2012-12-13 16:18   ` Michal Hocko
2012-12-13 16:18     ` Michal Hocko
2012-12-12 21:43 ` [patch 7/8] mm: vmscan: compaction works against zones, not lruvecs Johannes Weiner
2012-12-12 21:43   ` Johannes Weiner
2012-12-12 22:31   ` Rik van Riel
2012-12-12 22:31     ` Rik van Riel
2012-12-13 11:12   ` Mel Gorman
2012-12-13 11:12     ` Mel Gorman
2012-12-13 16:48   ` Michal Hocko
2012-12-13 16:48     ` Michal Hocko
2012-12-12 21:43 ` [patch 8/8] mm: reduce rmap overhead for ex-KSM page copies created on swap faults Johannes Weiner
2012-12-12 21:43   ` Johannes Weiner
2012-12-12 22:34   ` Rik van Riel
2012-12-12 22:34     ` Rik van Riel
2012-12-12 21:50 ` [patch 0/8] page reclaim bits Andrew Morton
2012-12-12 21:50   ` Andrew Morton
