* [rfc patch 0/6] mm: memcg naturalization
From: Johannes Weiner @ 2011-05-12 14:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Andrew Morton, Rik van Riel, Minchan Kim,
	KOSAKI Motohiro, Mel Gorman
  Cc: linux-mm, linux-kernel

Hi!

Here is a patch series that is a result of the memcg discussions on
LSF (memcg-aware global reclaim, global lru removal, struct
page_cgroup reduction, soft limit implementation) and the recent
feature discussions on linux-mm.

The long-term idea is to have memcgs no longer bolted to the side of
the mm code, but to integrate them as much as possible, so that there
is a native understanding of containers and the traditional !memcg
setup is just a single group.  This series is a step in that
direction.

It is a rather early snapshot, WIP, barely tested etc., but I wanted
to get your opinions before pursuing it further.  It is also part of
my counter-argument to the proposals to add memcg-reclaim-related
user interfaces at this point in time, so I wanted to push it out the
door before things are merged into .40.

The patches are quite big; I am still looking for things to factor
and split out, sorry for that.  Documentation is on its way as well ;)

#1 and #2 are boring preparatory work.  #3 makes traditional reclaim
in vmscan.c memcg-aware, which is a prerequisite for both the removal
of the global lru in #5 and the reimplementation of soft limit
reclaim in #6.

The diffstat so far looks like this:

 include/linux/memcontrol.h  |   84 +++--
 include/linux/mm_inline.h   |   15 +-
 include/linux/mmzone.h      |   10 +-
 include/linux/page_cgroup.h |   35 --
 include/linux/swap.h        |    4 -
 mm/memcontrol.c             |  860 +++++++++++++------------------------------
 mm/page_alloc.c             |    2 +-
 mm/page_cgroup.c            |   39 +--
 mm/swap.c                   |   20 +-
 mm/vmscan.c                 |  273 +++++++--------
 10 files changed, 452 insertions(+), 890 deletions(-)

It is based on .39-rc7 because of the memcg churn in -mm, but I'll
rebase it in the near future.

Discuss!

	Hannes


* [rfc patch 1/6] memcg: remove unused retry signal from reclaim
From: Johannes Weiner @ 2011-05-12 14:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Andrew Morton, Rik van Riel, Minchan Kim,
	KOSAKI Motohiro, Mel Gorman
  Cc: linux-mm, linux-kernel

If the memcg reclaim code detects that the target memcg is below its
limit, it exits and returns a guaranteed non-zero value so that the
charge is retried.

Nowadays, the charge side checks the memcg limit itself and does not
rely on this non-zero return value trick.

This patch removes it.  The reclaim code will now always return the
true number of pages it reclaimed on its own.
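
For illustration, a minimal standalone sketch of the changed contract
(toy code, not the kernel's; the real margin check lives in
mem_cgroup_do_charge(), visible as context in patch 3 of this
series):

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model: reclaim reports plain page counts, and the charge side
 * decides on its own whether to retry, based on the current margin.
 */
static unsigned long reclaim_pages(unsigned long reclaimable)
{
	return reclaimable;		/* no more "1 + total" retry hint */
}

static bool charge_should_retry(unsigned long margin, unsigned long nr_pages)
{
	return margin >= nr_pages;	/* charge side checks the limit itself */
}

int main(void)
{
	unsigned long freed = reclaim_pages(0);

	/* Even with nothing reclaimed, available margin triggers a retry. */
	printf("freed=%lu retry=%d\n", freed, charge_should_retry(4, 1));
	return 0;
}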

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 010f916..bf5ab87 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1503,7 +1503,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 			if (!res_counter_soft_limit_excess(&root_mem->res))
 				return total;
 		} else if (mem_cgroup_margin(root_mem))
-			return 1 + total;
+			return total;
 	}
 	return total;
 }
-- 
1.7.5.1



* [rfc patch 2/6] vmscan: make distinction between memcg reclaim and LRU list selection
From: Johannes Weiner @ 2011-05-12 14:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Andrew Morton, Rik van Riel, Minchan Kim,
	KOSAKI Motohiro, Mel Gorman
  Cc: linux-mm, linux-kernel

The reclaim code uses a single predicate both for whether it
currently reclaims on behalf of a memory cgroup and for whether it is
reclaiming from the global LRU list or a memory cgroup LRU list.

Up to now, the two cases have always coincided, but subsequent
patches will change things such that global reclaim will scan memory
cgroup lists.

This patch adds a new predicate that distinguishes global reclaim
from memory cgroup reclaim, and converts all callsites that are
actually about global reclaim heuristics rather than strict LRU list
selection.
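
To make the distinction concrete, here is a minimal userspace model
of the two predicates (the struct members and helper names mirror the
patch below; the scenario itself is hypothetical):

#include <stdbool.h>
#include <stdio.h>

struct scan_control {
	void *memcg;		/* whom we reclaim on behalf of; NULL = global */
	void *current_memcg;	/* whose LRU list we scan; NULL = global list */
};

static bool global_reclaim(struct scan_control *sc)
{
	return !sc->memcg;
}

static bool scanning_global_lru(struct scan_control *sc)
{
	return !sc->current_memcg;
}

int main(void)
{
	int some_memcg;
	/* Global reclaim that is currently scanning a memcg LRU list:
	 * after this series, the two questions give different answers. */
	struct scan_control sc = { .memcg = NULL, .current_memcg = &some_memcg };

	printf("global_reclaim=%d scanning_global_lru=%d\n",
	       global_reclaim(&sc), scanning_global_lru(&sc));
	return 0;
}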

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c |   96 ++++++++++++++++++++++++++++++++++------------------------
 1 files changed, 56 insertions(+), 40 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f6b435c..ceeb2a5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -104,8 +104,12 @@ struct scan_control {
 	 */
 	reclaim_mode_t reclaim_mode;
 
-	/* Which cgroup do we reclaim from */
-	struct mem_cgroup *mem_cgroup;
+	/*
+	 * The memory cgroup we reclaim on behalf of, and the one we
+	 * are currently reclaiming from.
+	 */
+	struct mem_cgroup *memcg;
+	struct mem_cgroup *current_memcg;
 
 	/*
 	 * Nodemask of nodes allowed by the caller. If NULL, all nodes
@@ -154,16 +158,24 @@ static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
-#define scanning_global_lru(sc)	(!(sc)->mem_cgroup)
+static bool global_reclaim(struct scan_control *sc)
+{
+	return !sc->memcg;
+}
+static bool scanning_global_lru(struct scan_control *sc)
+{
+	return !sc->current_memcg;
+}
 #else
-#define scanning_global_lru(sc)	(1)
+static bool global_reclaim(struct scan_control *sc) { return 1; }
+static bool scanning_global_lru(struct scan_control *sc) { return 1; }
 #endif
 
 static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
 						  struct scan_control *sc)
 {
 	if (!scanning_global_lru(sc))
-		return mem_cgroup_get_reclaim_stat(sc->mem_cgroup, zone);
+		return mem_cgroup_get_reclaim_stat(sc->current_memcg, zone);
 
 	return &zone->reclaim_stat;
 }
@@ -172,7 +184,7 @@ static unsigned long zone_nr_lru_pages(struct zone *zone,
 				struct scan_control *sc, enum lru_list lru)
 {
 	if (!scanning_global_lru(sc))
-		return mem_cgroup_zone_nr_pages(sc->mem_cgroup, zone, lru);
+		return mem_cgroup_zone_nr_pages(sc->current_memcg, zone, lru);
 
 	return zone_page_state(zone, NR_LRU_BASE + lru);
 }
@@ -635,7 +647,7 @@ static enum page_references page_check_references(struct page *page,
 	int referenced_ptes, referenced_page;
 	unsigned long vm_flags;
 
-	referenced_ptes = page_referenced(page, 1, sc->mem_cgroup, &vm_flags);
+	referenced_ptes = page_referenced(page, 1, sc->current_memcg, &vm_flags);
 	referenced_page = TestClearPageReferenced(page);
 
 	/* Lumpy reclaim - ignore references */
@@ -1228,7 +1240,7 @@ static int too_many_isolated(struct zone *zone, int file,
 	if (current_is_kswapd())
 		return 0;
 
-	if (!scanning_global_lru(sc))
+	if (!global_reclaim(sc))
 		return 0;
 
 	if (file) {
@@ -1397,6 +1409,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 			sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
 					ISOLATE_BOTH : ISOLATE_INACTIVE,
 			zone, 0, file);
+	} else {
+		nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
+			&page_list, &nr_scanned, sc->order,
+			sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
+					ISOLATE_BOTH : ISOLATE_INACTIVE,
+			zone, sc->current_memcg,
+			0, file);
+	}
+
+	if (global_reclaim(sc)) {
 		zone->pages_scanned += nr_scanned;
 		if (current_is_kswapd())
 			__count_zone_vm_events(PGSCAN_KSWAPD, zone,
@@ -1404,17 +1426,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 		else
 			__count_zone_vm_events(PGSCAN_DIRECT, zone,
 					       nr_scanned);
-	} else {
-		nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
-			&page_list, &nr_scanned, sc->order,
-			sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
-					ISOLATE_BOTH : ISOLATE_INACTIVE,
-			zone, sc->mem_cgroup,
-			0, file);
-		/*
-		 * mem_cgroup_isolate_pages() keeps track of
-		 * scanned pages on its own.
-		 */
 	}
 
 	if (nr_taken == 0) {
@@ -1435,9 +1446,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	}
 
 	local_irq_disable();
-	if (current_is_kswapd())
-		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
-	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
+	if (global_reclaim(sc)) {
+		if (current_is_kswapd())
+			__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
+		__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
+	}
 
 	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
 
@@ -1520,18 +1533,16 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 						&pgscanned, sc->order,
 						ISOLATE_ACTIVE, zone,
 						1, file);
-		zone->pages_scanned += pgscanned;
 	} else {
 		nr_taken = mem_cgroup_isolate_pages(nr_pages, &l_hold,
 						&pgscanned, sc->order,
 						ISOLATE_ACTIVE, zone,
-						sc->mem_cgroup, 1, file);
-		/*
-		 * mem_cgroup_isolate_pages() keeps track of
-		 * scanned pages on its own.
-		 */
+						sc->current_memcg, 1, file);
 	}
 
+	if (global_reclaim(sc))
+		zone->pages_scanned += pgscanned;
+
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
 	__count_zone_vm_events(PGREFILL, zone, pgscanned);
@@ -1552,7 +1563,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 			continue;
 		}
 
-		if (page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) {
+		if (page_referenced(page, 0, sc->current_memcg, &vm_flags)) {
 			nr_rotated += hpage_nr_pages(page);
 			/*
 			 * Identify referenced, file-backed active pages and
@@ -1629,7 +1640,7 @@ static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
 	if (scanning_global_lru(sc))
 		low = inactive_anon_is_low_global(zone);
 	else
-		low = mem_cgroup_inactive_anon_is_low(sc->mem_cgroup);
+		low = mem_cgroup_inactive_anon_is_low(sc->current_memcg);
 	return low;
 }
 #else
@@ -1672,7 +1683,7 @@ static int inactive_file_is_low(struct zone *zone, struct scan_control *sc)
 	if (scanning_global_lru(sc))
 		low = inactive_file_is_low_global(zone);
 	else
-		low = mem_cgroup_inactive_file_is_low(sc->mem_cgroup);
+		low = mem_cgroup_inactive_file_is_low(sc->current_memcg);
 	return low;
 }
 
@@ -1752,7 +1763,7 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
 	file  = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
 		zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
 
-	if (scanning_global_lru(sc)) {
+	if (global_reclaim(sc)) {
 		free  = zone_page_state(zone, NR_FREE_PAGES);
 		/* If we have very few page cache pages,
 		   force-scan anon pages. */
@@ -1903,6 +1914,8 @@ restart:
 	nr_scanned = sc->nr_scanned;
 	get_scan_count(zone, sc, nr, priority);
 
+	sc->current_memcg = sc->memcg;
+
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
 					nr[LRU_INACTIVE_FILE]) {
 		for_each_evictable_lru(l) {
@@ -1941,6 +1954,9 @@ restart:
 		goto restart;
 
 	throttle_vm_writeout(sc->gfp_mask);
+
+	/* For good measure, noone higher up the stack should look at it */
+	sc->current_memcg = NULL;
 }
 
 /*
@@ -1973,7 +1989,7 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 		 * Take care memory controller reclaiming has small influence
 		 * to global LRU.
 		 */
-		if (scanning_global_lru(sc)) {
+		if (global_reclaim(sc)) {
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
@@ -2038,7 +2054,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	get_mems_allowed();
 	delayacct_freepages_start();
 
-	if (scanning_global_lru(sc))
+	if (global_reclaim(sc))
 		count_vm_event(ALLOCSTALL);
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
@@ -2050,7 +2066,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 * Don't shrink slabs when reclaiming memory from
 		 * over limit cgroups
 		 */
-		if (scanning_global_lru(sc)) {
+		if (global_reclaim(sc)) {
 			unsigned long lru_pages = 0;
 			for_each_zone_zonelist(zone, z, zonelist,
 					gfp_zone(sc->gfp_mask)) {
@@ -2111,7 +2127,7 @@ out:
 		return 0;
 
 	/* top priority shrink_zones still had more to do? don't OOM, then */
-	if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
+	if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc))
 		return 1;
 
 	return 0;
@@ -2129,7 +2145,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.may_swap = 1,
 		.swappiness = vm_swappiness,
 		.order = order,
-		.mem_cgroup = NULL,
+		.memcg = NULL,
 		.nodemask = nodemask,
 	};
 
@@ -2158,7 +2174,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 		.may_swap = !noswap,
 		.swappiness = swappiness,
 		.order = 0,
-		.mem_cgroup = mem,
+		.memcg = mem,
 	};
 	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
@@ -2195,7 +2211,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.swappiness = swappiness,
 		.order = 0,
-		.mem_cgroup = mem_cont,
+		.memcg = mem_cont,
 		.nodemask = NULL, /* we don't care the placement */
 	};
 
@@ -2333,7 +2349,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 		.nr_to_reclaim = ULONG_MAX,
 		.swappiness = vm_swappiness,
 		.order = order,
-		.mem_cgroup = NULL,
+		.memcg = NULL,
 	};
 loop_again:
 	total_scanned = 0;
-- 
1.7.5.1



* [rfc patch 3/6] mm: memcg-aware global reclaim
From: Johannes Weiner @ 2011-05-12 14:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Andrew Morton, Rik van Riel, Minchan Kim,
	KOSAKI Motohiro, Mel Gorman
  Cc: linux-mm, linux-kernel

A page charged to a memcg is linked to an lru list specific to that
memcg.  At the same time, traditional global reclaim is oblivious to
memcgs, and all pages are also linked to a global per-zone list.

This patch changes traditional global reclaim to iterate over all
existing memcgs, so that it no longer relies on the global list being
present.

This is one step forward in integrating memcg code better into the
rest of memory management.  It is also a prerequisite to get rid of
the global per-zone lru lists.

RFC:

The algorithm implemented in this patch is very naive.  For each zone
scanned at each priority level, it iterates over all existing memcgs
and considers them for scanning.

This is just a prototype and I have not optimized it yet, because I
am unsure what maximum number of memcgs still constitutes a sane
configuration relative to the machine size.

It is perfectly fair since all memcgs are scanned at each priority
level.

On my 4G quadcore laptop with 1000 memcgs, a significant amount of
CPU time was spent just iterating memcgs during reclaim.  But it
cannot really be claimed that the old code was much better, either:
global LRU reclaim could mean that a few hundred memcgs were emptied
out completely while others stayed untouched.

I am open to solutions that trade fairness against CPU time, but I
don't want an extreme in either direction.  Maybe break out early
once a number of memcgs have been successfully reclaimed from, and
remember the last one scanned.
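
To give a feel for the cost argument, a toy worst-case calculation
(all constants below are assumptions for a small machine; only the
memcg count is taken from the example above):

#include <stdio.h>

#define NR_MEMCGS	1000	/* the 1000-memcg example above */
#define NR_PRIORITIES	13	/* assumption: DEF_PRIORITY (12) down to 0 */
#define NR_ZONES	3	/* assumption: zones on a small x86 box */

int main(void)
{
	long visits = 0;
	int prio, zone, memcg;

	/* Worst case of the naive scheme: every zone at every priority
	 * level walks the full hierarchy and scans each memcg. */
	for (prio = 0; prio < NR_PRIORITIES; prio++)
		for (zone = 0; zone < NR_ZONES; zone++)
			for (memcg = 0; memcg < NR_MEMCGS; memcg++)
				visits++;

	printf("up to %ld memcg visits per reclaim pass\n", visits);
	return 0;
}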

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h |    7 ++
 mm/memcontrol.c            |  148 +++++++++++++++++++++++++++++---------------
 mm/vmscan.c                |   21 +++++--
 3 files changed, 120 insertions(+), 56 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5e9840f5..58728c7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -104,6 +104,7 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
 /*
  * For memory reclaim.
  */
+void mem_cgroup_hierarchy_walk(struct mem_cgroup *, struct mem_cgroup **);
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
@@ -289,6 +290,12 @@ static inline bool mem_cgroup_disabled(void)
 	return true;
 }
 
+static inline void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
+					     struct mem_cgroup **iter)
+{
+	*iter = start;
+}
+
 static inline int
 mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bf5ab87..edcd55a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -313,7 +313,7 @@ static bool move_file(void)
 }
 
 /*
- * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
+ * Maximum loops in mem_cgroup_soft_reclaim(), used for soft
  * limit reclaim to prevent infinite loops, if they ever occur.
  */
 #define	MEM_CGROUP_MAX_RECLAIM_LOOPS		(100)
@@ -339,16 +339,6 @@ enum charge_type {
 /* Used for OOM nofiier */
 #define OOM_CONTROL		(0)
 
-/*
- * Reclaim flags for mem_cgroup_hierarchical_reclaim
- */
-#define MEM_CGROUP_RECLAIM_NOSWAP_BIT	0x0
-#define MEM_CGROUP_RECLAIM_NOSWAP	(1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
-#define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
-#define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
-#define MEM_CGROUP_RECLAIM_SOFT_BIT	0x2
-#define MEM_CGROUP_RECLAIM_SOFT		(1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
-
 static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
@@ -1381,6 +1371,86 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
 	return min(limit, memsw);
 }
 
+void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
+			       struct mem_cgroup **iter)
+{
+	struct mem_cgroup *mem = *iter;
+	int id;
+
+	if (!start)
+		start = root_mem_cgroup;
+	/*
+	 * Even without hierarchy explicitely enabled in the root
+	 * memcg, it is the ultimate parent of all memcgs.
+	 */
+	if (!(start == root_mem_cgroup || start->use_hierarchy)) {
+		*iter = start;
+		return;
+	}
+
+	if (!mem)
+		id = css_id(&start->css);
+	else {
+		id = css_id(&mem->css);
+		css_put(&mem->css);
+		mem = NULL;
+	}
+
+	do {
+		struct cgroup_subsys_state *css;
+
+		rcu_read_lock();
+		css = css_get_next(&mem_cgroup_subsys, id+1, &start->css, &id);
+		/*
+		 * The caller must already have a reference to the
+		 * starting point of this hierarchy walk, do not grab
+		 * another one.  This way, the loop can be finished
+		 * when the hierarchy root is returned, without any
+		 * further cleanup required.
+		 */
+		if (css && (css == &start->css || css_tryget(css)))
+			mem = container_of(css, struct mem_cgroup, css);
+		rcu_read_unlock();
+		if (!css)
+			id = 0;
+	} while (!mem);
+
+	if (mem == root_mem_cgroup)
+		mem = NULL;
+
+	*iter = mem;
+}
+
+static unsigned long mem_cgroup_target_reclaim(struct mem_cgroup *mem,
+					       gfp_t gfp_mask,
+					       bool noswap,
+					       bool shrink)
+{
+	unsigned long total = 0;
+	int loop;
+
+	if (mem->memsw_is_minimum)
+		noswap = true;
+
+	for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
+		drain_all_stock_async();
+		total += try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap,
+						      get_swappiness(mem));
+		if (total && shrink)
+			break;
+		if (mem_cgroup_margin(mem))
+			break;
+		/*
+		 * If we have not been able to reclaim anything after
+		 * two reclaim attempts, there may be no reclaimable
+		 * pages under this hierarchy.
+		 */
+		if (loop && !total)
+			break;
+	}
+	return total;
+}
+
 /*
  * Visit the first child (need not be the first child as per the ordering
  * of the cgroup list, since we track last_scanned_child) of @mem and use
@@ -1427,21 +1497,16 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
  *
  * We give up and return to the caller when we visit root_mem twice.
  * (other groups can be removed while we're walking....)
- *
- * If shrink==true, for avoiding to free too much, this returns immedieately.
  */
-static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
-						struct zone *zone,
-						gfp_t gfp_mask,
-						unsigned long reclaim_options)
+static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_mem,
+				   struct zone *zone,
+				   gfp_t gfp_mask)
 {
 	struct mem_cgroup *victim;
 	int ret, total = 0;
 	int loop = 0;
-	bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
-	bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
-	bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
 	unsigned long excess;
+	bool noswap = false;
 
 	excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
 
@@ -1461,7 +1526,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 				 * anything, it might because there are
 				 * no reclaimable pages under this hierarchy
 				 */
-				if (!check_soft || !total) {
+				if (!total) {
 					css_put(&victim->css);
 					break;
 				}
@@ -1484,25 +1549,11 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 			continue;
 		}
 		/* we use swappiness of local cgroup */
-		if (check_soft)
-			ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
+		ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
 				noswap, get_swappiness(victim), zone);
-		else
-			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
-						noswap, get_swappiness(victim));
 		css_put(&victim->css);
-		/*
-		 * At shrinking usage, we can't check we should stop here or
-		 * reclaim more. It's depends on callers. last_scanned_child
-		 * will work enough for keeping fairness under tree.
-		 */
-		if (shrink)
-			return ret;
 		total += ret;
-		if (check_soft) {
-			if (!res_counter_soft_limit_excess(&root_mem->res))
-				return total;
-		} else if (mem_cgroup_margin(root_mem))
+		if (!res_counter_soft_limit_excess(&root_mem->res))
 			return total;
 	}
 	return total;
@@ -1897,7 +1948,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
 	unsigned long csize = nr_pages * PAGE_SIZE;
 	struct mem_cgroup *mem_over_limit;
 	struct res_counter *fail_res;
-	unsigned long flags = 0;
+	bool noswap = false;
 	int ret;
 
 	ret = res_counter_charge(&mem->res, csize, &fail_res);
@@ -1911,7 +1962,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
 
 		res_counter_uncharge(&mem->res, csize);
 		mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
-		flags |= MEM_CGROUP_RECLAIM_NOSWAP;
+		noswap = true;
 	} else
 		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
 	/*
@@ -1927,8 +1978,8 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
 	if (!(gfp_mask & __GFP_WAIT))
 		return CHARGE_WOULDBLOCK;
 
-	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
-					      gfp_mask, flags);
+	ret = mem_cgroup_target_reclaim(mem_over_limit, gfp_mask,
+					noswap, false);
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
 		return CHARGE_RETRY;
 	/*
@@ -3085,7 +3136,7 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
 
 /*
  * A call to try to shrink memory usage on charge failure at shmem's swapin.
- * Calling hierarchical_reclaim is not enough because we should update
+ * Calling target_reclaim is not enough because we should update
  * last_oom_jiffies to prevent pagefault_out_of_memory from invoking global OOM.
  * Moreover considering hierarchy, we should reclaim from the mem_over_limit,
  * not from the memcg which this page would be charged to.
@@ -3167,7 +3218,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 	int enlarge;
 
 	/*
-	 * For keeping hierarchical_reclaim simple, how long we should retry
+	 * For keeping target_reclaim simple, how long we should retry
 	 * is depends on callers. We set our retry-count to be function
 	 * of # of children which we should visit in this loop.
 	 */
@@ -3210,8 +3261,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 		if (!ret)
 			break;
 
-		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
-						MEM_CGROUP_RECLAIM_SHRINK);
+		mem_cgroup_target_reclaim(memcg, GFP_KERNEL, false, false);
 		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
 		/* Usage is reduced ? */
   		if (curusage >= oldusage)
@@ -3269,9 +3319,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 		if (!ret)
 			break;
 
-		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
-						MEM_CGROUP_RECLAIM_NOSWAP |
-						MEM_CGROUP_RECLAIM_SHRINK);
+		mem_cgroup_target_reclaim(memcg, GFP_KERNEL, true, false);
 		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
@@ -3311,9 +3359,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 		if (!mz)
 			break;
 
-		reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
-						gfp_mask,
-						MEM_CGROUP_RECLAIM_SOFT);
+		reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone, gfp_mask);
 		nr_reclaimed += reclaimed;
 		spin_lock(&mctz->lock);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ceeb2a5..e2a3647 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1900,8 +1900,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
 /*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
-static void shrink_zone(int priority, struct zone *zone,
-				struct scan_control *sc)
+static void do_shrink_zone(int priority, struct zone *zone,
+			   struct scan_control *sc)
 {
 	unsigned long nr[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
@@ -1914,8 +1914,6 @@ restart:
 	nr_scanned = sc->nr_scanned;
 	get_scan_count(zone, sc, nr, priority);
 
-	sc->current_memcg = sc->memcg;
-
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
 					nr[LRU_INACTIVE_FILE]) {
 		for_each_evictable_lru(l) {
@@ -1954,6 +1952,19 @@ restart:
 		goto restart;
 
 	throttle_vm_writeout(sc->gfp_mask);
+}
+
+static void shrink_zone(int priority, struct zone *zone,
+			struct scan_control *sc)
+{
+	struct mem_cgroup *root = sc->memcg;
+	struct mem_cgroup *mem = NULL;
+
+	do {
+		mem_cgroup_hierarchy_walk(root, &mem);
+		sc->current_memcg = mem;
+		do_shrink_zone(priority, zone, sc);
+	} while (mem != root);
 
 	/* For good measure, noone higher up the stack should look at it */
 	sc->current_memcg = NULL;
@@ -2190,7 +2201,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 	 * will pick up pages from other mem cgroup's as well. We hack
 	 * the priority and make it zero.
 	 */
-	shrink_zone(0, zone, &sc);
+	do_shrink_zone(0, zone, &sc);
 
 	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
 
-- 
1.7.5.1



* [rfc patch 4/6] memcg: reclaim statistics
From: Johannes Weiner @ 2011-05-12 14:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Andrew Morton, Rik van Riel, Minchan Kim,
	KOSAKI Motohiro, Mel Gorman
  Cc: linux-mm, linux-kernel

TODO: write proper changelog.  Here is an excerpt from
http://lkml.kernel.org/r/20110428123652.GM12437@cmpxchg.org:

: 1. Limit-triggered direct reclaim
:
: The memory cgroup hits its limit and the task does direct reclaim from
: its own memcg.  We probably want statistics for this separately from
: background reclaim to see how successful background reclaim is, the
: same reason we have this separation in the global vmstat as well.
:
: 	pgscan_direct_limit
: 	pgfree_direct_limit
:
: 2. Limit-triggered background reclaim
:
: This is the watermark-based asynchronous reclaim that is currently in
: discussion.  It's triggered by the memcg breaching its watermark,
: which is relative to its hard-limit.  I named it kswapd because I
: still think kswapd should do this job, but it is all open for
: discussion, obviously.  Treat it as meaning 'background' or
: 'asynchronous'.
:
: 	pgscan_kswapd_limit
: 	pgfree_kswapd_limit
:
: 3. Hierarchy-triggered direct reclaim
:
: A condition outside the memcg leads to a task directly reclaiming from
: this memcg.  This could be global memory pressure for example, but
: also a parent cgroup hitting its limit.  It's probably helpful to
: think of global memory pressure, conceptually, as the root cgroup
: hitting its limit.  We don't have that yet, but this could be the
: direct softlimit reclaim Ying mentioned above.
:
: 	pgscan_direct_hierarchy
: 	pgsteal_direct_hierarchy
:
: 4. Hierarchy-triggered background reclaim
:
: An outside condition leads to kswapd reclaiming from this memcg, like
: kswapd doing softlimit pushback due to global memory pressure.
:
: 	pgscan_kswapd_hierarchy
: 	pgsteal_kswapd_hierarchy
:
: ---
:
: With these stats in place, you can see how much pressure there is on
: your memcg hierarchy.  This reflects machine utilization; a lot of
: reclaim activity showing up in the hierarchical stats indicates that
: you overcommitted too much on a global level.
:
: With the limit-based stats, you can see the amount of internal
: pressure of memcgs, which shows you if you overcommitted on a local
: level.
:
: And for both cases, you can also see the effectiveness of background
: reclaim by comparing the direct and the kswapd stats.
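
For illustration, a minimal userspace sketch of the kind of comparison
meant above (not part of this patch: the reclaim_snapshot struct, the
efficiency() helper and the numbers are all made up; only the counter
names are the ones proposed in this series):

#include <stdio.h>

/*
 * Hypothetical snapshot of the limit-triggered counters, as they would
 * be read from the memcg's memory.stat file.
 */
struct reclaim_snapshot {
        unsigned long pgscan_direct_limit;
        unsigned long pgfree_direct_limit;
        unsigned long pgscan_kswapd_limit;
        unsigned long pgfree_kswapd_limit;
};

/* Reclaim efficiency: pages freed per page scanned. */
static double efficiency(unsigned long freed, unsigned long scanned)
{
        return scanned ? (double)freed / (double)scanned : 0.0;
}

int main(void)
{
        /* Made-up numbers, for illustration only. */
        struct reclaim_snapshot snap = {
                .pgscan_direct_limit = 1000,
                .pgfree_direct_limit = 400,
                .pgscan_kswapd_limit = 8000,
                .pgfree_kswapd_limit = 6000,
        };

        printf("direct limit reclaim efficiency: %.2f\n",
               efficiency(snap.pgfree_direct_limit, snap.pgscan_direct_limit));
        printf("kswapd limit reclaim efficiency: %.2f\n",
               efficiency(snap.pgfree_kswapd_limit, snap.pgscan_kswapd_limit));
        return 0;
}

Comparable efficiencies with little direct scanning would suggest
background reclaim keeps up; heavy direct scanning alongside kswapd
activity would suggest it does not.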

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h |    9 ++++++
 mm/memcontrol.c            |   63 ++++++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                |    7 +++++
 3 files changed, 79 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 58728c7..a4c84db 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -105,6 +105,8 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
  * For memory reclaim.
  */
 void mem_cgroup_hierarchy_walk(struct mem_cgroup *, struct mem_cgroup **);
+void mem_cgroup_count_reclaim(struct mem_cgroup *, bool, bool,
+			      unsigned long, unsigned long);
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
@@ -296,6 +298,13 @@ static inline void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
 	*iter = start;
 }
 
+static inline void mem_cgroup_count_reclaim(struct mem_cgroup *mem,
+					    bool kswapd, bool hierarchy,
+					    unsigned long scanned,
+					    unsigned long reclaimed)
+{
+}
+
 static inline int
 mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index edcd55a..d762706 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -90,10 +90,24 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_NSTATS,
 };
 
+#define RECLAIM_RECLAIMED 1
+#define RECLAIM_HIERARCHY 2
+#define RECLAIM_KSWAPD 4
+
 enum mem_cgroup_events_index {
 	MEM_CGROUP_EVENTS_PGPGIN,	/* # of pages paged in */
 	MEM_CGROUP_EVENTS_PGPGOUT,	/* # of pages paged out */
 	MEM_CGROUP_EVENTS_COUNT,	/* # of pages paged in/out */
+	RECLAIM_BASE,
+	PGSCAN_DIRECT_LIMIT = RECLAIM_BASE,
+	PGFREE_DIRECT_LIMIT = RECLAIM_BASE + RECLAIM_RECLAIMED,
+	PGSCAN_DIRECT_HIERARCHY = RECLAIM_BASE + RECLAIM_HIERARCHY,
+	PGSTEAL_DIRECT_HIERARCHY = RECLAIM_BASE + RECLAIM_HIERARCHY + RECLAIM_RECLAIMED,
+	/* you know the drill... */
+	PGSCAN_KSWAPD_LIMIT,
+	PGFREE_KSWAPD_LIMIT,
+	PGSCAN_KSWAPD_HIERARCHY,
+	PGSTEAL_KSWAPD_HIERARCHY,
 	MEM_CGROUP_EVENTS_NSTATS,
 };
 /*
@@ -575,6 +589,23 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
 	this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_SWAPOUT], val);
 }
 
+void mem_cgroup_count_reclaim(struct mem_cgroup *mem,
+			      bool kswapd, bool hierarchy,
+			      unsigned long scanned, unsigned long reclaimed)
+{
+	unsigned int base = RECLAIM_BASE;
+
+	if (!mem)
+		mem = root_mem_cgroup;
+	if (kswapd)
+		base += RECLAIM_KSWAPD;
+	if (hierarchy)
+		base += RECLAIM_HIERARCHY;
+
+	this_cpu_add(mem->stat->events[base], scanned);
+	this_cpu_add(mem->stat->events[base + RECLAIM_RECLAIMED], reclaimed);
+}
+
 static unsigned long mem_cgroup_read_events(struct mem_cgroup *mem,
 					    enum mem_cgroup_events_index idx)
 {
@@ -3817,6 +3848,14 @@ enum {
 	MCS_FILE_MAPPED,
 	MCS_PGPGIN,
 	MCS_PGPGOUT,
+	MCS_PGSCAN_DIRECT_LIMIT,
+	MCS_PGFREE_DIRECT_LIMIT,
+	MCS_PGSCAN_DIRECT_HIERARCHY,
+	MCS_PGSTEAL_DIRECT_HIERARCHY,
+	MCS_PGSCAN_KSWAPD_LIMIT,
+	MCS_PGFREE_KSWAPD_LIMIT,
+	MCS_PGSCAN_KSWAPD_HIERARCHY,
+	MCS_PGSTEAL_KSWAPD_HIERARCHY,
 	MCS_SWAP,
 	MCS_INACTIVE_ANON,
 	MCS_ACTIVE_ANON,
@@ -3839,6 +3878,14 @@ struct {
 	{"mapped_file", "total_mapped_file"},
 	{"pgpgin", "total_pgpgin"},
 	{"pgpgout", "total_pgpgout"},
+	{"pgscan_direct_limit", "total_pgscan_direct_limit"},
+	{"pgfree_direct_limit", "total_pgfree_direct_limit"},
+	{"pgscan_direct_hierarchy", "total_pgscan_direct_hierarchy"},
+	{"pgsteal_direct_hierarchy", "total_pgsteal_direct_hierarchy"},
+	{"pgscan_kswapd_limit", "total_pgscan_kswapd_limit"},
+	{"pgfree_kswapd_limit", "total_pgfree_kswapd_limit"},
+	{"pgscan_kswapd_hierarchy", "total_pgscan_kswapd_hierarchy"},
+	{"pgsteal_kswapd_hierarchy", "total_pgsteal_kswapd_hierarchy"},
 	{"swap", "total_swap"},
 	{"inactive_anon", "total_inactive_anon"},
 	{"active_anon", "total_active_anon"},
@@ -3864,6 +3911,22 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
 	s->stat[MCS_PGPGIN] += val;
 	val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGPGOUT);
 	s->stat[MCS_PGPGOUT] += val;
+	val = mem_cgroup_read_events(mem, PGSCAN_DIRECT_LIMIT);
+	s->stat[MCS_PGSCAN_DIRECT_LIMIT] += val;
+	val = mem_cgroup_read_events(mem, PGFREE_DIRECT_LIMIT);
+	s->stat[MCS_PGFREE_DIRECT_LIMIT] += val;
+	val = mem_cgroup_read_events(mem, PGSCAN_DIRECT_HIERARCHY);
+	s->stat[MCS_PGSCAN_DIRECT_HIERARCHY] += val;
+	val = mem_cgroup_read_events(mem, PGSTEAL_DIRECT_HIERARCHY);
+	s->stat[MCS_PGSTEAL_DIRECT_HIERARCHY] += val;
+	val = mem_cgroup_read_events(mem, PGSCAN_KSWAPD_LIMIT);
+	s->stat[MCS_PGSCAN_KSWAPD_LIMIT] += val;
+	val = mem_cgroup_read_events(mem, PGFREE_KSWAPD_LIMIT);
+	s->stat[MCS_PGFREE_KSWAPD_LIMIT] += val;
+	val = mem_cgroup_read_events(mem, PGSCAN_KSWAPD_HIERARCHY);
+	s->stat[MCS_PGSCAN_KSWAPD_HIERARCHY] += val;
+	val = mem_cgroup_read_events(mem, PGSTEAL_KSWAPD_HIERARCHY);
+	s->stat[MCS_PGSTEAL_KSWAPD_HIERARCHY] += val;
 	if (do_swap_account) {
 		val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
 		s->stat[MCS_SWAP] += val * PAGE_SIZE;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e2a3647..0e45ceb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1961,9 +1961,16 @@ static void shrink_zone(int priority, struct zone *zone,
 	struct mem_cgroup *mem = NULL;
 
 	do {
+		unsigned long reclaimed = sc->nr_reclaimed;
+		unsigned long scanned = sc->nr_scanned;
+
 		mem_cgroup_hierarchy_walk(root, &mem);
 		sc->current_memcg = mem;
 		do_shrink_zone(priority, zone, sc);
+		mem_cgroup_count_reclaim(mem, current_is_kswapd(),
+					 mem != root, /* limit or hierarchy? */
+					 sc->nr_scanned - scanned,
+					 sc->nr_reclaimed - reclaimed);
 	} while (mem != root);
 
 	/* For good measure, noone higher up the stack should look at it */
-- 
1.7.5.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [rfc patch 5/6] memcg: remove global LRU list
  2011-05-12 14:53 ` Johannes Weiner
@ 2011-05-12 14:53   ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2011-05-12 14:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Andrew Morton, Rik van Riel, Minchan Kim,
	KOSAKI Motohiro, Mel Gorman
  Cc: linux-mm, linux-kernel

Since the VM now has means to do global reclaim from the per-memcg lru
lists, the global LRU list is no longer required.

It saves two linked list pointers per page, since each page is now on
only one list.  Also, the memcg lru lists now link pages directly
instead of page_cgroup descriptors, which removes the need to map from
a page_cgroup back to its page.

A big change in behaviour is that pages are no longer aged on a global
level.  Instead, they are aged with respect to the other pages in the
same memcg, where the aging speed is determined by global memory
pressure and the size of the memcg itself.

[ TO EVALUATE: this should bring more fairness to reclaim in setups
with differently sized memcgs, and distribute pressure proportionally
among memcgs instead of reclaiming only from the one that has the
oldest pages on a global level.  There is potential unfairness if
unused pages are hiding in small memcgs that are never scanned while
reclaim goes only after a single, much bigger memcg.  The severity of
this also scales with the number of memcgs relative to the amount of
physical memory, so it again boils down to the question of what the
sane maximum number of memcgs on the system is ].

The patch introduces an lruvec structure that exists for both global
zones and for each zone per memcg.  All lru operations are now done in
generic code, with the memcg lru primitives only doing accounting and
returning the proper lruvec for the currently scanned memcg on
isolation, or for the respective page on putback.

The code that scans and rescues unevictable pages in a specific zone
had to be converted to iterate over all memcgs as well.
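
To make that division of labour concrete, here is a heavily simplified
userspace model (not the kernel code: the page_cgroup lookup is
replaced by a plain per-page memcg pointer, accounting and locking are
omitted, names are shortened, and only two lru lists exist):

struct list_head {
        struct list_head *prev, *next;
};

enum lru_list { LRU_INACTIVE_ANON, LRU_ACTIVE_ANON, NR_LRU_LISTS };

struct lruvec {
        struct list_head lists[NR_LRU_LISTS];
};

/* One lruvec per zone globally, one per memcg (per zone in the real code). */
struct mem_cgroup { struct lruvec lruvec; };
struct zone { struct lruvec lruvec; };
struct page { struct list_head lru; struct mem_cgroup *memcg; };

static void list_head_init(struct list_head *h)
{
        h->prev = h->next = h;
}

static void list_add(struct list_head *new, struct list_head *head)
{
        new->next = head->next;
        new->prev = head;
        head->next->prev = new;
        head->next = new;
}

/* memcg primitive: only picks the lruvec (accounting omitted here). */
static struct lruvec *memcg_lru_add_list(struct zone *zone, struct page *page)
{
        if (!page->memcg)               /* the "memcg disabled" case */
                return &zone->lruvec;
        return &page->memcg->lruvec;
}

/* generic code: links the page itself, no page_cgroup involved. */
static void add_page_to_lru_list(struct zone *zone, struct page *page,
                                 enum lru_list l)
{
        struct lruvec *lruvec = memcg_lru_add_list(zone, page);

        list_add(&page->lru, &lruvec->lists[l]);
}

int main(void)
{
        struct zone zone;
        struct mem_cgroup memcg;
        struct page page = { .memcg = &memcg };
        int l;

        for (l = 0; l < NR_LRU_LISTS; l++) {
                list_head_init(&zone.lruvec.lists[l]);
                list_head_init(&memcg.lruvec.lists[l]);
        }
        add_page_to_lru_list(&zone, &page, LRU_INACTIVE_ANON);
        return 0;
}

The list manipulation itself is the same with or without memcg; only
the lruvec it operates on differs.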

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h  |   52 ++++-----
 include/linux/mm_inline.h   |   15 ++-
 include/linux/mmzone.h      |   10 +-
 include/linux/page_cgroup.h |   35 ------
 mm/memcontrol.c             |  251 +++++++++++++++---------------------------
 mm/page_alloc.c             |    2 +-
 mm/page_cgroup.c            |   39 +------
 mm/swap.c                   |   20 ++--
 mm/vmscan.c                 |  149 ++++++++++++--------------
 9 files changed, 213 insertions(+), 360 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a4c84db..65163c2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -20,6 +20,7 @@
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
 #include <linux/cgroup.h>
+#include <linux/mmzone.h>
 struct mem_cgroup;
 struct page_cgroup;
 struct page;
@@ -30,13 +31,6 @@ enum mem_cgroup_page_stat_item {
 	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
 };
 
-extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
-					struct list_head *dst,
-					unsigned long *scanned, int order,
-					int mode, struct zone *z,
-					struct mem_cgroup *mem_cont,
-					int active, int file);
-
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
@@ -60,13 +54,13 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
 
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask);
-extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
-extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
-extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
-extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
-extern void mem_cgroup_del_lru(struct page *page);
-extern void mem_cgroup_move_lists(struct page *page,
-				  enum lru_list from, enum lru_list to);
+struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
+struct lruvec *mem_cgroup_lru_add_list(struct zone *, struct page *,
+				       enum lru_list);
+void mem_cgroup_lru_del_list(struct zone *, struct page *, enum lru_list);
+void mem_cgroup_lru_del(struct zone *, struct page *);
+struct lruvec *mem_cgroup_lru_move_lists(struct zone *, struct page *,
+					 enum lru_list, enum lru_list);
 
 /* For coalescing uncharge for reducing memcg' overhead*/
 extern void mem_cgroup_uncharge_start(void);
@@ -210,33 +204,35 @@ static inline int mem_cgroup_shmem_charge_fallback(struct page *page,
 	return 0;
 }
 
-static inline void mem_cgroup_add_lru_list(struct page *page, int lru)
+static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
+						    struct mem_cgroup *mem)
 {
+	return &zone->lruvec;
 }
 
-static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
+static inline struct lruvec *mem_cgroup_lru_add_list(struct zone *zone,
+						     struct page *page,
+						     enum lru_list lru)
 {
-	return ;
+	return &zone->lruvec;
 }
 
-static inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
+static inline void mem_cgroup_lru_del_list(struct zone *zone,
+					   struct page *page,
+					   enum lru_list lru)
 {
-	return ;
 }
 
-static inline void mem_cgroup_rotate_lru_list(struct page *page, int lru)
+static inline void mem_cgroup_lru_del(struct zone *zone, struct page *page)
 {
-	return ;
 }
 
-static inline void mem_cgroup_del_lru(struct page *page)
-{
-	return ;
-}
-
-static inline void
-mem_cgroup_move_lists(struct page *page, enum lru_list from, enum lru_list to)
+static inline struct lruvec *mem_cgroup_lru_move_lists(struct zone *zone,
+						       struct page *page,
+						       enum lru_list from,
+						       enum lru_list to)
 {
+	return &zone->lruvec;
 }
 
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 8f7d247..ca794f3 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -25,23 +25,28 @@ static inline void
 __add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
 		       struct list_head *head)
 {
+	/* NOTE! Caller must ensure @head is on the right lruvec! */
+	mem_cgroup_lru_add_list(zone, page, l);
 	list_add(&page->lru, head);
 	__mod_zone_page_state(zone, NR_LRU_BASE + l, hpage_nr_pages(page));
-	mem_cgroup_add_lru_list(page, l);
 }
 
 static inline void
 add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
 {
-	__add_page_to_lru_list(zone, page, l, &zone->lru[l].list);
+	struct lruvec *lruvec;
+
+	lruvec = mem_cgroup_lru_add_list(zone, page, l);
+	list_add(&page->lru, &lruvec->lists[l]);
+	__mod_zone_page_state(zone, NR_LRU_BASE + l, hpage_nr_pages(page));
 }
 
 static inline void
 del_page_from_lru_list(struct zone *zone, struct page *page, enum lru_list l)
 {
+	mem_cgroup_lru_del_list(zone, page, l);
 	list_del(&page->lru);
 	__mod_zone_page_state(zone, NR_LRU_BASE + l, -hpage_nr_pages(page));
-	mem_cgroup_del_lru_list(page, l);
 }
 
 /**
@@ -64,7 +69,6 @@ del_page_from_lru(struct zone *zone, struct page *page)
 {
 	enum lru_list l;
 
-	list_del(&page->lru);
 	if (PageUnevictable(page)) {
 		__ClearPageUnevictable(page);
 		l = LRU_UNEVICTABLE;
@@ -75,8 +79,9 @@ del_page_from_lru(struct zone *zone, struct page *page)
 			l += LRU_ACTIVE;
 		}
 	}
+	mem_cgroup_lru_del_list(zone, page, l);
+	list_del(&page->lru);
 	__mod_zone_page_state(zone, NR_LRU_BASE + l, -hpage_nr_pages(page));
-	mem_cgroup_del_lru_list(page, l);
 }
 
 /**
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e56f835..c2ddce5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -158,6 +158,10 @@ static inline int is_unevictable_lru(enum lru_list l)
 	return (l == LRU_UNEVICTABLE);
 }
 
+struct lruvec {
+	struct list_head lists[NR_LRU_LISTS];
+};
+
 enum zone_watermarks {
 	WMARK_MIN,
 	WMARK_LOW,
@@ -344,10 +348,8 @@ struct zone {
 	ZONE_PADDING(_pad1_)
 
 	/* Fields commonly accessed by the page reclaim scanner */
-	spinlock_t		lru_lock;	
-	struct zone_lru {
-		struct list_head list;
-	} lru[NR_LRU_LISTS];
+	spinlock_t		lru_lock;
+	struct lruvec		lruvec;
 
 	struct zone_reclaim_stat reclaim_stat;
 
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 961ecc7..2e7cbc5 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -31,7 +31,6 @@ enum {
 struct page_cgroup {
 	unsigned long flags;
 	struct mem_cgroup *mem_cgroup;
-	struct list_head lru;		/* per cgroup LRU list */
 };
 
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
@@ -49,7 +48,6 @@ static inline void __init page_cgroup_init(void)
 #endif
 
 struct page_cgroup *lookup_page_cgroup(struct page *page);
-struct page *lookup_cgroup_page(struct page_cgroup *pc);
 
 #define TESTPCGFLAG(uname, lname)			\
 static inline int PageCgroup##uname(struct page_cgroup *pc)	\
@@ -122,39 +120,6 @@ static inline void move_unlock_page_cgroup(struct page_cgroup *pc,
 	local_irq_restore(*flags);
 }
 
-#ifdef CONFIG_SPARSEMEM
-#define PCG_ARRAYID_WIDTH	SECTIONS_SHIFT
-#else
-#define PCG_ARRAYID_WIDTH	NODES_SHIFT
-#endif
-
-#if (PCG_ARRAYID_WIDTH > BITS_PER_LONG - NR_PCG_FLAGS)
-#error Not enough space left in pc->flags to store page_cgroup array IDs
-#endif
-
-/* pc->flags: ARRAY-ID | FLAGS */
-
-#define PCG_ARRAYID_MASK	((1UL << PCG_ARRAYID_WIDTH) - 1)
-
-#define PCG_ARRAYID_OFFSET	(BITS_PER_LONG - PCG_ARRAYID_WIDTH)
-/*
- * Zero the shift count for non-existent fields, to prevent compiler
- * warnings and ensure references are optimized away.
- */
-#define PCG_ARRAYID_SHIFT	(PCG_ARRAYID_OFFSET * (PCG_ARRAYID_WIDTH != 0))
-
-static inline void set_page_cgroup_array_id(struct page_cgroup *pc,
-					    unsigned long id)
-{
-	pc->flags &= ~(PCG_ARRAYID_MASK << PCG_ARRAYID_SHIFT);
-	pc->flags |= (id & PCG_ARRAYID_MASK) << PCG_ARRAYID_SHIFT;
-}
-
-static inline unsigned long page_cgroup_array_id(struct page_cgroup *pc)
-{
-	return (pc->flags >> PCG_ARRAYID_SHIFT) & PCG_ARRAYID_MASK;
-}
-
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct page_cgroup;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d762706..f5d90ba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -134,10 +134,7 @@ struct mem_cgroup_stat_cpu {
  * per-zone information in memory controller.
  */
 struct mem_cgroup_per_zone {
-	/*
-	 * spin_lock to protect the per cgroup LRU
-	 */
-	struct list_head	lists[NR_LRU_LISTS];
+	struct lruvec		lruvec;
 	unsigned long		count[NR_LRU_LISTS];
 
 	struct zone_reclaim_stat reclaim_stat;
@@ -834,6 +831,24 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
 	return (mem == root_mem_cgroup);
 }
 
+struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone, struct mem_cgroup *mem)
+{
+	struct mem_cgroup_per_zone *mz;
+	int nid, zid;
+
+	/* Pages are on the zone's own lru lists */
+	if (mem_cgroup_disabled())
+		return &zone->lruvec;
+
+	if (!mem)
+		mem = root_mem_cgroup;
+
+	nid = zone_to_nid(zone);
+	zid = zone_idx(zone);
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	return &mz->lruvec;
+}
+
 /*
  * Following LRU functions are allowed to be used without PCG_LOCK.
  * Operations are called by routine of global LRU independently from memcg.
@@ -848,10 +863,43 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
  * When moving account, the page is not on LRU. It's isolated.
  */
 
-void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
+struct lruvec *mem_cgroup_lru_add_list(struct zone *zone, struct page *page,
+				       enum lru_list lru)
 {
+	struct mem_cgroup_per_zone *mz;
 	struct page_cgroup *pc;
+	struct mem_cgroup *mem;
+
+	if (mem_cgroup_disabled())
+		return &zone->lruvec;
+
+	pc = lookup_page_cgroup(page);
+	VM_BUG_ON(PageCgroupAcctLRU(pc));
+	if (PageCgroupUsed(pc)) {
+		/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
+		smp_rmb();
+		mem = pc->mem_cgroup;
+	} else {
+		/*
+		 * If the page is uncharged, add it to the root's lru.
+		 * Either it will be freed soon, or it will get
+		 * charged again and the charger will relink it to the
+		 * right list.
+		 */
+		mem = root_mem_cgroup;
+	}
+	mz = page_cgroup_zoneinfo(mem, page);
+	/* huge page split is done under lru_lock. so, we have no races. */
+	MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
+	SetPageCgroupAcctLRU(pc);
+	return &mz->lruvec;
+}
+
+void mem_cgroup_lru_del_list(struct zone *zone, struct page *page,
+			     enum lru_list lru)
+{
 	struct mem_cgroup_per_zone *mz;
+	struct page_cgroup *pc;
 
 	if (mem_cgroup_disabled())
 		return;
@@ -867,83 +915,21 @@ void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
 	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
 	/* huge page split is done under lru_lock. so, we have no races. */
 	MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
-	if (mem_cgroup_is_root(pc->mem_cgroup))
-		return;
-	VM_BUG_ON(list_empty(&pc->lru));
-	list_del_init(&pc->lru);
 }
 
-void mem_cgroup_del_lru(struct page *page)
+void mem_cgroup_lru_del(struct zone *zone, struct page *page)
 {
-	mem_cgroup_del_lru_list(page, page_lru(page));
+	mem_cgroup_lru_del_list(zone, page, page_lru(page));
 }
 
-/*
- * Writeback is about to end against a page which has been marked for immediate
- * reclaim.  If it still appears to be reclaimable, move it to the tail of the
- * inactive list.
- */
-void mem_cgroup_rotate_reclaimable_page(struct page *page)
+struct lruvec *mem_cgroup_lru_move_lists(struct zone *zone,
+					 struct page *page,
+					 enum lru_list from,
+					 enum lru_list to)
 {
-	struct mem_cgroup_per_zone *mz;
-	struct page_cgroup *pc;
-	enum lru_list lru = page_lru(page);
-
-	if (mem_cgroup_disabled())
-		return;
-
-	pc = lookup_page_cgroup(page);
-	/* unused or root page is not rotated. */
-	if (!PageCgroupUsed(pc))
-		return;
-	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
-	smp_rmb();
-	if (mem_cgroup_is_root(pc->mem_cgroup))
-		return;
-	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
-	list_move_tail(&pc->lru, &mz->lists[lru]);
-}
-
-void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
-{
-	struct mem_cgroup_per_zone *mz;
-	struct page_cgroup *pc;
-
-	if (mem_cgroup_disabled())
-		return;
-
-	pc = lookup_page_cgroup(page);
-	/* unused or root page is not rotated. */
-	if (!PageCgroupUsed(pc))
-		return;
-	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
-	smp_rmb();
-	if (mem_cgroup_is_root(pc->mem_cgroup))
-		return;
-	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
-	list_move(&pc->lru, &mz->lists[lru]);
-}
-
-void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
-{
-	struct page_cgroup *pc;
-	struct mem_cgroup_per_zone *mz;
-
-	if (mem_cgroup_disabled())
-		return;
-	pc = lookup_page_cgroup(page);
-	VM_BUG_ON(PageCgroupAcctLRU(pc));
-	if (!PageCgroupUsed(pc))
-		return;
-	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
-	smp_rmb();
-	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
-	/* huge page split is done under lru_lock. so, we have no races. */
-	MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
-	SetPageCgroupAcctLRU(pc);
-	if (mem_cgroup_is_root(pc->mem_cgroup))
-		return;
-	list_add(&pc->lru, &mz->lists[lru]);
+	/* TODO: could be optimized, especially if from == to */
+	mem_cgroup_lru_del_list(zone, page, from);
+	return mem_cgroup_lru_add_list(zone, page, to);
 }
 
 /*
@@ -975,7 +961,7 @@ static void mem_cgroup_lru_del_before_commit(struct page *page)
 	 * is guarded by lock_page() because the page is SwapCache.
 	 */
 	if (!PageCgroupUsed(pc))
-		mem_cgroup_del_lru_list(page, page_lru(page));
+		del_page_from_lru(zone, page);
 	spin_unlock_irqrestore(&zone->lru_lock, flags);
 }
 
@@ -989,22 +975,11 @@ static void mem_cgroup_lru_add_after_commit(struct page *page)
 	if (likely(!PageLRU(page)))
 		return;
 	spin_lock_irqsave(&zone->lru_lock, flags);
-	/* link when the page is linked to LRU but page_cgroup isn't */
 	if (PageLRU(page) && !PageCgroupAcctLRU(pc))
-		mem_cgroup_add_lru_list(page, page_lru(page));
+		add_page_to_lru_list(zone, page, page_lru(page));
 	spin_unlock_irqrestore(&zone->lru_lock, flags);
 }
 
-
-void mem_cgroup_move_lists(struct page *page,
-			   enum lru_list from, enum lru_list to)
-{
-	if (mem_cgroup_disabled())
-		return;
-	mem_cgroup_del_lru_list(page, from);
-	mem_cgroup_add_lru_list(page, to);
-}
-
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
 {
 	int ret;
@@ -1063,6 +1038,9 @@ int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
 	unsigned long present_pages[2];
 	unsigned long inactive_ratio;
 
+	if (!memcg)
+		memcg = root_mem_cgroup;
+
 	inactive_ratio = calc_inactive_ratio(memcg, present_pages);
 
 	inactive = present_pages[0];
@@ -1079,6 +1057,9 @@ int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
 	unsigned long active;
 	unsigned long inactive;
 
+	if (!memcg)
+		memcg = root_mem_cgroup;
+
 	inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
 	active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
 
@@ -1091,8 +1072,12 @@ unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
 {
 	int nid = zone_to_nid(zone);
 	int zid = zone_idx(zone);
-	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+	struct mem_cgroup_per_zone *mz;
+
+	if (!memcg)
+		memcg = root_mem_cgroup;
 
+	mz = mem_cgroup_zoneinfo(memcg, nid, zid);
 	return MEM_CGROUP_ZSTAT(mz, lru);
 }
 
@@ -1101,8 +1086,12 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
 {
 	int nid = zone_to_nid(zone);
 	int zid = zone_idx(zone);
-	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+	struct mem_cgroup_per_zone *mz;
+
+	if (!memcg)
+		memcg = root_mem_cgroup;
 
+	mz = mem_cgroup_zoneinfo(memcg, nid, zid);
 	return &mz->reclaim_stat;
 }
 
@@ -1124,67 +1113,6 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
 	return &mz->reclaim_stat;
 }
 
-unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
-					struct list_head *dst,
-					unsigned long *scanned, int order,
-					int mode, struct zone *z,
-					struct mem_cgroup *mem_cont,
-					int active, int file)
-{
-	unsigned long nr_taken = 0;
-	struct page *page;
-	unsigned long scan;
-	LIST_HEAD(pc_list);
-	struct list_head *src;
-	struct page_cgroup *pc, *tmp;
-	int nid = zone_to_nid(z);
-	int zid = zone_idx(z);
-	struct mem_cgroup_per_zone *mz;
-	int lru = LRU_FILE * file + active;
-	int ret;
-
-	BUG_ON(!mem_cont);
-	mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
-	src = &mz->lists[lru];
-
-	scan = 0;
-	list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
-		if (scan >= nr_to_scan)
-			break;
-
-		if (unlikely(!PageCgroupUsed(pc)))
-			continue;
-
-		page = lookup_cgroup_page(pc);
-
-		if (unlikely(!PageLRU(page)))
-			continue;
-
-		scan++;
-		ret = __isolate_lru_page(page, mode, file);
-		switch (ret) {
-		case 0:
-			list_move(&page->lru, dst);
-			mem_cgroup_del_lru(page);
-			nr_taken += hpage_nr_pages(page);
-			break;
-		case -EBUSY:
-			/* we don't affect global LRU but rotate in our LRU */
-			mem_cgroup_rotate_lru_list(page, page_lru(page));
-			break;
-		default:
-			break;
-		}
-	}
-
-	*scanned = scan;
-
-	trace_mm_vmscan_memcg_isolate(0, nr_to_scan, scan, nr_taken,
-				      0, 0, 0, mode);
-
-	return nr_taken;
-}
-
 #define mem_cgroup_from_res_counter(counter, member)	\
 	container_of(counter, struct mem_cgroup, member)
 
@@ -3458,22 +3386,23 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
 				int node, int zid, enum lru_list lru)
 {
-	struct zone *zone;
 	struct mem_cgroup_per_zone *mz;
-	struct page_cgroup *pc, *busy;
 	unsigned long flags, loop;
 	struct list_head *list;
+	struct page *busy;
+	struct zone *zone;
 	int ret = 0;
 
 	zone = &NODE_DATA(node)->node_zones[zid];
 	mz = mem_cgroup_zoneinfo(mem, node, zid);
-	list = &mz->lists[lru];
+	list = &mz->lruvec.lists[lru];
 
 	loop = MEM_CGROUP_ZSTAT(mz, lru);
 	/* give some margin against EBUSY etc...*/
 	loop += 256;
 	busy = NULL;
 	while (loop--) {
+		struct page_cgroup *pc;
 		struct page *page;
 
 		ret = 0;
@@ -3482,16 +3411,16 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
 			spin_unlock_irqrestore(&zone->lru_lock, flags);
 			break;
 		}
-		pc = list_entry(list->prev, struct page_cgroup, lru);
-		if (busy == pc) {
-			list_move(&pc->lru, list);
+		page = list_entry(list->prev, struct page, lru);
+		if (busy == page) {
+			list_move(&page->lru, list);
 			busy = NULL;
 			spin_unlock_irqrestore(&zone->lru_lock, flags);
 			continue;
 		}
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 
-		page = lookup_cgroup_page(pc);
+		pc = lookup_page_cgroup(page);
 
 		ret = mem_cgroup_move_parent(page, pc, mem, GFP_KERNEL);
 		if (ret == -ENOMEM)
@@ -3499,7 +3428,7 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
 
 		if (ret == -EBUSY || ret == -EINVAL) {
 			/* found lock contention or "pc" is obsolete. */
-			busy = pc;
+			busy = page;
 			cond_resched();
 		} else
 			busy = NULL;
@@ -4519,7 +4448,7 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
 		mz = &pn->zoneinfo[zone];
 		for_each_lru(l)
-			INIT_LIST_HEAD(&mz->lists[l]);
+			INIT_LIST_HEAD(&mz->lruvec.lists[l]);
 		mz->usage_in_excess = 0;
 		mz->on_tree = false;
 		mz->mem = mem;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9f8a97b..4099e8c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4262,7 +4262,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 
 		zone_pcp_init(zone);
 		for_each_lru(l) {
-			INIT_LIST_HEAD(&zone->lru[l].list);
+			INIT_LIST_HEAD(&zone->lruvec.lists[l]);
 			zone->reclaim_stat.nr_saved_scan[l] = 0;
 		}
 		zone->reclaim_stat.recent_rotated[0] = 0;
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 9905501..313e1d7 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -11,12 +11,10 @@
 #include <linux/swapops.h>
 #include <linux/kmemleak.h>
 
-static void __meminit init_page_cgroup(struct page_cgroup *pc, unsigned long id)
+static void __meminit init_page_cgroup(struct page_cgroup *pc)
 {
 	pc->flags = 0;
-	set_page_cgroup_array_id(pc, id);
 	pc->mem_cgroup = NULL;
-	INIT_LIST_HEAD(&pc->lru);
 }
 static unsigned long total_usage;
 
@@ -42,19 +40,6 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
 	return base + offset;
 }
 
-struct page *lookup_cgroup_page(struct page_cgroup *pc)
-{
-	unsigned long pfn;
-	struct page *page;
-	pg_data_t *pgdat;
-
-	pgdat = NODE_DATA(page_cgroup_array_id(pc));
-	pfn = pc - pgdat->node_page_cgroup + pgdat->node_start_pfn;
-	page = pfn_to_page(pfn);
-	VM_BUG_ON(pc != lookup_page_cgroup(page));
-	return page;
-}
-
 static int __init alloc_node_page_cgroup(int nid)
 {
 	struct page_cgroup *base, *pc;
@@ -75,7 +60,7 @@ static int __init alloc_node_page_cgroup(int nid)
 		return -ENOMEM;
 	for (index = 0; index < nr_pages; index++) {
 		pc = base + index;
-		init_page_cgroup(pc, nid);
+		init_page_cgroup(pc);
 	}
 	NODE_DATA(nid)->node_page_cgroup = base;
 	total_usage += table_size;
@@ -117,19 +102,6 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
 	return section->page_cgroup + pfn;
 }
 
-struct page *lookup_cgroup_page(struct page_cgroup *pc)
-{
-	struct mem_section *section;
-	struct page *page;
-	unsigned long nr;
-
-	nr = page_cgroup_array_id(pc);
-	section = __nr_to_section(nr);
-	page = pfn_to_page(pc - section->page_cgroup);
-	VM_BUG_ON(pc != lookup_page_cgroup(page));
-	return page;
-}
-
 static void *__init_refok alloc_page_cgroup(size_t size, int nid)
 {
 	void *addr = NULL;
@@ -167,12 +139,9 @@ static int __init_refok init_section_page_cgroup(unsigned long pfn)
 	struct page_cgroup *base, *pc;
 	struct mem_section *section;
 	unsigned long table_size;
-	unsigned long nr;
 	int nid, index;
 
-	nr = pfn_to_section_nr(pfn);
-	section = __nr_to_section(nr);
-
+	section = __pfn_to_section(pfn);
 	if (section->page_cgroup)
 		return 0;
 
@@ -194,7 +163,7 @@ static int __init_refok init_section_page_cgroup(unsigned long pfn)
 
 	for (index = 0; index < PAGES_PER_SECTION; index++) {
 		pc = base + index;
-		init_page_cgroup(pc, nr);
+		init_page_cgroup(pc);
 	}
 
 	section->page_cgroup = base - pfn;
diff --git a/mm/swap.c b/mm/swap.c
index a448db3..12095a0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -209,12 +209,14 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 static void pagevec_move_tail_fn(struct page *page, void *arg)
 {
 	int *pgmoved = arg;
-	struct zone *zone = page_zone(page);
 
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		enum lru_list lru = page_lru_base_type(page);
-		list_move_tail(&page->lru, &zone->lru[lru].list);
-		mem_cgroup_rotate_reclaimable_page(page);
+		struct lruvec *lruvec;
+
+		lruvec = mem_cgroup_lru_move_lists(page_zone(page),
+						   page, lru, lru);
+		list_move_tail(&page->lru, &lruvec->lists[lru]);
 		(*pgmoved)++;
 	}
 }
@@ -417,12 +419,13 @@ static void lru_deactivate_fn(struct page *page, void *arg)
 		 */
 		SetPageReclaim(page);
 	} else {
+		struct lruvec *lruvec;
 		/*
 		 * The page's writeback ends up during pagevec
 		 * We moves tha page into tail of inactive.
 		 */
-		list_move_tail(&page->lru, &zone->lru[lru].list);
-		mem_cgroup_rotate_reclaimable_page(page);
+		lruvec = mem_cgroup_lru_move_lists(zone, page, lru, lru);
+		list_move_tail(&page->lru, &lruvec->lists[lru]);
 		__count_vm_event(PGROTATED);
 	}
 
@@ -594,7 +597,6 @@ void lru_add_page_tail(struct zone* zone,
 	int active;
 	enum lru_list lru;
 	const int file = 0;
-	struct list_head *head;
 
 	VM_BUG_ON(!PageHead(page));
 	VM_BUG_ON(PageCompound(page_tail));
@@ -614,10 +616,10 @@ void lru_add_page_tail(struct zone* zone,
 		}
 		update_page_reclaim_stat(zone, page_tail, file, active);
 		if (likely(PageLRU(page)))
-			head = page->lru.prev;
+			__add_page_to_lru_list(zone, page_tail, lru,
+					       page->lru.prev);
 		else
-			head = &zone->lru[lru].list;
-		__add_page_to_lru_list(zone, page_tail, lru, head);
+			add_page_to_lru_list(zone, page_tail, lru);
 	} else {
 		SetPageUnevictable(page_tail);
 		add_page_to_lru_list(zone, page_tail, LRU_UNEVICTABLE);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0e45ceb..0381a5d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -162,34 +162,27 @@ static bool global_reclaim(struct scan_control *sc)
 {
 	return !sc->memcg;
 }
-static bool scanning_global_lru(struct scan_control *sc)
-{
-	return !sc->current_memcg;
-}
 #else
 static bool global_reclaim(struct scan_control *sc) { return 1; }
-static bool scanning_global_lru(struct scan_control *sc) { return 1; }
 #endif
 
 static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
 						  struct scan_control *sc)
 {
-	if (!scanning_global_lru(sc))
-		return mem_cgroup_get_reclaim_stat(sc->current_memcg, zone);
-
-	return &zone->reclaim_stat;
+	if (mem_cgroup_disabled())
+		return &zone->reclaim_stat;
+	return mem_cgroup_get_reclaim_stat(sc->current_memcg, zone);
 }
 
 static unsigned long zone_nr_lru_pages(struct zone *zone,
-				struct scan_control *sc, enum lru_list lru)
+				       struct scan_control *sc,
+				       enum lru_list lru)
 {
-	if (!scanning_global_lru(sc))
-		return mem_cgroup_zone_nr_pages(sc->current_memcg, zone, lru);
-
-	return zone_page_state(zone, NR_LRU_BASE + lru);
+	if (mem_cgroup_disabled())
+		return zone_page_state(zone, NR_LRU_BASE + lru);
+	return mem_cgroup_zone_nr_pages(sc->current_memcg, zone, lru);
 }
 
-
 /*
  * Add a shrinker callback to be called from the vm
  */
@@ -1055,15 +1048,14 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 
 		switch (__isolate_lru_page(page, mode, file)) {
 		case 0:
+			mem_cgroup_lru_del(page_zone(page), page);
 			list_move(&page->lru, dst);
-			mem_cgroup_del_lru(page);
 			nr_taken += hpage_nr_pages(page);
 			break;
 
 		case -EBUSY:
 			/* else it is being freed elsewhere */
 			list_move(&page->lru, src);
-			mem_cgroup_rotate_lru_list(page, page_lru(page));
 			continue;
 
 		default:
@@ -1113,8 +1105,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 				break;
 
 			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
+				mem_cgroup_lru_del(page_zone(cursor_page),
+						   cursor_page);
 				list_move(&cursor_page->lru, dst);
-				mem_cgroup_del_lru(cursor_page);
 				nr_taken += hpage_nr_pages(page);
 				nr_lumpy_taken++;
 				if (PageDirty(cursor_page))
@@ -1143,19 +1136,22 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	return nr_taken;
 }
 
-static unsigned long isolate_pages_global(unsigned long nr,
-					struct list_head *dst,
-					unsigned long *scanned, int order,
-					int mode, struct zone *z,
-					int active, int file)
+static unsigned long isolate_pages(unsigned long nr,
+				   struct list_head *dst,
+				   unsigned long *scanned, int order,
+				   int mode, struct zone *z,
+				   int active, int file,
+				   struct mem_cgroup *mem)
 {
+	struct lruvec *lruvec = mem_cgroup_zone_lruvec(z, mem);
 	int lru = LRU_BASE;
+
 	if (active)
 		lru += LRU_ACTIVE;
 	if (file)
 		lru += LRU_FILE;
-	return isolate_lru_pages(nr, &z->lru[lru].list, dst, scanned, order,
-								mode, file);
+	return isolate_lru_pages(nr, &lruvec->lists[lru], dst,
+				 scanned, order, mode, file);
 }
 
 /*
@@ -1403,20 +1399,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
 
-	if (scanning_global_lru(sc)) {
-		nr_taken = isolate_pages_global(nr_to_scan,
-			&page_list, &nr_scanned, sc->order,
-			sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
-					ISOLATE_BOTH : ISOLATE_INACTIVE,
-			zone, 0, file);
-	} else {
-		nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
+	nr_taken = isolate_pages(nr_to_scan,
 			&page_list, &nr_scanned, sc->order,
 			sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
 					ISOLATE_BOTH : ISOLATE_INACTIVE,
-			zone, sc->current_memcg,
-			0, file);
-	}
+			zone, 0, file, sc->current_memcg);
 
 	if (global_reclaim(sc)) {
 		zone->pages_scanned += nr_scanned;
@@ -1491,13 +1478,15 @@ static void move_active_pages_to_lru(struct zone *zone,
 	pagevec_init(&pvec, 1);
 
 	while (!list_empty(list)) {
+		struct lruvec *lruvec;
+
 		page = lru_to_page(list);
 
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 
-		list_move(&page->lru, &zone->lru[lru].list);
-		mem_cgroup_add_lru_list(page, lru);
+		lruvec = mem_cgroup_lru_add_list(zone, page, lru);
+		list_move(&page->lru, &lruvec->lists[lru]);
 		pgmoved += hpage_nr_pages(page);
 
 		if (!pagevec_add(&pvec, page) || list_empty(list)) {
@@ -1528,17 +1517,10 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
-	if (scanning_global_lru(sc)) {
-		nr_taken = isolate_pages_global(nr_pages, &l_hold,
-						&pgscanned, sc->order,
-						ISOLATE_ACTIVE, zone,
-						1, file);
-	} else {
-		nr_taken = mem_cgroup_isolate_pages(nr_pages, &l_hold,
-						&pgscanned, sc->order,
-						ISOLATE_ACTIVE, zone,
-						sc->current_memcg, 1, file);
-	}
+	nr_taken = isolate_pages(nr_pages, &l_hold,
+				 &pgscanned, sc->order,
+				 ISOLATE_ACTIVE, zone,
+				 1, file, sc->current_memcg);
 
 	if (global_reclaim(sc))
 		zone->pages_scanned += pgscanned;
@@ -1628,8 +1610,6 @@ static int inactive_anon_is_low_global(struct zone *zone)
  */
 static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
 {
-	int low;
-
 	/*
 	 * If we don't have swap space, anonymous page deactivation
 	 * is pointless.
@@ -1637,11 +1617,9 @@ static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
 	if (!total_swap_pages)
 		return 0;
 
-	if (scanning_global_lru(sc))
-		low = inactive_anon_is_low_global(zone);
-	else
-		low = mem_cgroup_inactive_anon_is_low(sc->current_memcg);
-	return low;
+	if (mem_cgroup_disabled())
+		return inactive_anon_is_low_global(zone);
+	return mem_cgroup_inactive_anon_is_low(sc->current_memcg);
 }
 #else
 static inline int inactive_anon_is_low(struct zone *zone,
@@ -1678,13 +1656,9 @@ static int inactive_file_is_low_global(struct zone *zone)
  */
 static int inactive_file_is_low(struct zone *zone, struct scan_control *sc)
 {
-	int low;
-
-	if (scanning_global_lru(sc))
-		low = inactive_file_is_low_global(zone);
-	else
-		low = mem_cgroup_inactive_file_is_low(sc->current_memcg);
-	return low;
+	if (mem_cgroup_disabled())
+		return inactive_file_is_low_global(zone);
+	return mem_cgroup_inactive_file_is_low(sc->current_memcg);
 }
 
 static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
@@ -3161,16 +3135,18 @@ int page_evictable(struct page *page, struct vm_area_struct *vma)
  */
 static void check_move_unevictable_page(struct page *page, struct zone *zone)
 {
-	VM_BUG_ON(PageActive(page));
+	struct lruvec *lruvec;
 
+	VM_BUG_ON(PageActive(page));
 retry:
 	ClearPageUnevictable(page);
 	if (page_evictable(page, NULL)) {
 		enum lru_list l = page_lru_base_type(page);
 
 		__dec_zone_state(zone, NR_UNEVICTABLE);
-		list_move(&page->lru, &zone->lru[l].list);
-		mem_cgroup_move_lists(page, LRU_UNEVICTABLE, l);
+		lruvec = mem_cgroup_lru_move_lists(zone, page,
+						   LRU_UNEVICTABLE, l);
+		list_move(&page->lru, &lruvec->lists[l]);
 		__inc_zone_state(zone, NR_INACTIVE_ANON + l);
 		__count_vm_event(UNEVICTABLE_PGRESCUED);
 	} else {
@@ -3178,8 +3154,9 @@ retry:
 		 * rotate unevictable list
 		 */
 		SetPageUnevictable(page);
-		list_move(&page->lru, &zone->lru[LRU_UNEVICTABLE].list);
-		mem_cgroup_rotate_lru_list(page, LRU_UNEVICTABLE);
+		lruvec = mem_cgroup_lru_move_lists(zone, page, LRU_UNEVICTABLE,
+						   LRU_UNEVICTABLE);
+		list_move(&page->lru, &lruvec->lists[LRU_UNEVICTABLE]);
 		if (page_evictable(page, NULL))
 			goto retry;
 	}
@@ -3253,29 +3230,37 @@ void scan_mapping_unevictable_pages(struct address_space *mapping)
 #define SCAN_UNEVICTABLE_BATCH_SIZE 16UL /* arbitrary lock hold batch size */
 static void scan_zone_unevictable_pages(struct zone *zone)
 {
-	struct list_head *l_unevictable = &zone->lru[LRU_UNEVICTABLE].list;
-	unsigned long scan;
 	unsigned long nr_to_scan = zone_page_state(zone, NR_UNEVICTABLE);
 
 	while (nr_to_scan > 0) {
 		unsigned long batch_size = min(nr_to_scan,
 						SCAN_UNEVICTABLE_BATCH_SIZE);
+		struct mem_cgroup *mem = NULL;
 
-		spin_lock_irq(&zone->lru_lock);
-		for (scan = 0;  scan < batch_size; scan++) {
-			struct page *page = lru_to_page(l_unevictable);
-
-			if (!trylock_page(page))
-				continue;
+		do {
+			struct list_head *list;
+			struct lruvec *lruvec;
+			unsigned long scan;
 
-			prefetchw_prev_lru_page(page, l_unevictable, flags);
+			mem_cgroup_hierarchy_walk(NULL, &mem);
+			spin_lock_irq(&zone->lru_lock);
+			lruvec = mem_cgroup_zone_lruvec(zone, mem);
+			list = &lruvec->lists[LRU_UNEVICTABLE];
+			for (scan = 0;  scan < batch_size; scan++) {
+				struct page *page = lru_to_page(list);
 
-			if (likely(PageLRU(page) && PageUnevictable(page)))
+				if (!trylock_page(page))
+					continue;
+				prefetchw_prev_lru_page(page, list, flags);
+				if (unlikely(!PageLRU(page)))
+					continue;
+				if (unlikely(!PageUnevictable(page)))
+					continue;
 				check_move_unevictable_page(page, zone);
-
-			unlock_page(page);
-		}
-		spin_unlock_irq(&zone->lru_lock);
+				unlock_page(page);
+			}
+			spin_unlock_irq(&zone->lru_lock);
+		} while (mem);
 
 		nr_to_scan -= batch_size;
 	}
-- 
1.7.5.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

 };
 
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
@@ -49,7 +48,6 @@ static inline void __init page_cgroup_init(void)
 #endif
 
 struct page_cgroup *lookup_page_cgroup(struct page *page);
-struct page *lookup_cgroup_page(struct page_cgroup *pc);
 
 #define TESTPCGFLAG(uname, lname)			\
 static inline int PageCgroup##uname(struct page_cgroup *pc)	\
@@ -122,39 +120,6 @@ static inline void move_unlock_page_cgroup(struct page_cgroup *pc,
 	local_irq_restore(*flags);
 }
 
-#ifdef CONFIG_SPARSEMEM
-#define PCG_ARRAYID_WIDTH	SECTIONS_SHIFT
-#else
-#define PCG_ARRAYID_WIDTH	NODES_SHIFT
-#endif
-
-#if (PCG_ARRAYID_WIDTH > BITS_PER_LONG - NR_PCG_FLAGS)
-#error Not enough space left in pc->flags to store page_cgroup array IDs
-#endif
-
-/* pc->flags: ARRAY-ID | FLAGS */
-
-#define PCG_ARRAYID_MASK	((1UL << PCG_ARRAYID_WIDTH) - 1)
-
-#define PCG_ARRAYID_OFFSET	(BITS_PER_LONG - PCG_ARRAYID_WIDTH)
-/*
- * Zero the shift count for non-existent fields, to prevent compiler
- * warnings and ensure references are optimized away.
- */
-#define PCG_ARRAYID_SHIFT	(PCG_ARRAYID_OFFSET * (PCG_ARRAYID_WIDTH != 0))
-
-static inline void set_page_cgroup_array_id(struct page_cgroup *pc,
-					    unsigned long id)
-{
-	pc->flags &= ~(PCG_ARRAYID_MASK << PCG_ARRAYID_SHIFT);
-	pc->flags |= (id & PCG_ARRAYID_MASK) << PCG_ARRAYID_SHIFT;
-}
-
-static inline unsigned long page_cgroup_array_id(struct page_cgroup *pc)
-{
-	return (pc->flags >> PCG_ARRAYID_SHIFT) & PCG_ARRAYID_MASK;
-}
-
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct page_cgroup;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d762706..f5d90ba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -134,10 +134,7 @@ struct mem_cgroup_stat_cpu {
  * per-zone information in memory controller.
  */
 struct mem_cgroup_per_zone {
-	/*
-	 * spin_lock to protect the per cgroup LRU
-	 */
-	struct list_head	lists[NR_LRU_LISTS];
+	struct lruvec		lruvec;
 	unsigned long		count[NR_LRU_LISTS];
 
 	struct zone_reclaim_stat reclaim_stat;
@@ -834,6 +831,24 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
 	return (mem == root_mem_cgroup);
 }
 
+struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone, struct mem_cgroup *mem)
+{
+	struct mem_cgroup_per_zone *mz;
+	int nid, zid;
+
+	/* Pages are on the zone's own lru lists */
+	if (mem_cgroup_disabled())
+		return &zone->lruvec;
+
+	if (!mem)
+		mem = root_mem_cgroup;
+
+	nid = zone_to_nid(zone);
+	zid = zone_idx(zone);
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	return &mz->lruvec;
+}
+
 /*
  * Following LRU functions are allowed to be used without PCG_LOCK.
  * Operations are called by routine of global LRU independently from memcg.
@@ -848,10 +863,43 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
  * When moving account, the page is not on LRU. It's isolated.
  */
 
-void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
+struct lruvec *mem_cgroup_lru_add_list(struct zone *zone, struct page *page,
+				       enum lru_list lru)
 {
+	struct mem_cgroup_per_zone *mz;
 	struct page_cgroup *pc;
+	struct mem_cgroup *mem;
+
+	if (mem_cgroup_disabled())
+		return &zone->lruvec;
+
+	pc = lookup_page_cgroup(page);
+	VM_BUG_ON(PageCgroupAcctLRU(pc));
+	if (PageCgroupUsed(pc)) {
+		/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
+		smp_rmb();
+		mem = pc->mem_cgroup;
+	} else {
+		/*
+		 * If the page is uncharged, add it to the root's lru.
+		 * Either it will be freed soon, or it will get
+		 * charged again and the charger will relink it to the
+		 * right list.
+		 */
+		mem = root_mem_cgroup;
+	}
+	mz = page_cgroup_zoneinfo(mem, page);
+	/* huge page split is done under lru_lock. so, we have no races. */
+	MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
+	SetPageCgroupAcctLRU(pc);
+	return &mz->lruvec;
+}
+
+void mem_cgroup_lru_del_list(struct zone *zone, struct page *page,
+			     enum lru_list lru)
+{
 	struct mem_cgroup_per_zone *mz;
+	struct page_cgroup *pc;
 
 	if (mem_cgroup_disabled())
 		return;
@@ -867,83 +915,21 @@ void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
 	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
 	/* huge page split is done under lru_lock. so, we have no races. */
 	MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
-	if (mem_cgroup_is_root(pc->mem_cgroup))
-		return;
-	VM_BUG_ON(list_empty(&pc->lru));
-	list_del_init(&pc->lru);
 }
 
-void mem_cgroup_del_lru(struct page *page)
+void mem_cgroup_lru_del(struct zone *zone, struct page *page)
 {
-	mem_cgroup_del_lru_list(page, page_lru(page));
+	mem_cgroup_lru_del_list(zone, page, page_lru(page));
 }
 
-/*
- * Writeback is about to end against a page which has been marked for immediate
- * reclaim.  If it still appears to be reclaimable, move it to the tail of the
- * inactive list.
- */
-void mem_cgroup_rotate_reclaimable_page(struct page *page)
+struct lruvec *mem_cgroup_lru_move_lists(struct zone *zone,
+					 struct page *page,
+					 enum lru_list from,
+					 enum lru_list to)
 {
-	struct mem_cgroup_per_zone *mz;
-	struct page_cgroup *pc;
-	enum lru_list lru = page_lru(page);
-
-	if (mem_cgroup_disabled())
-		return;
-
-	pc = lookup_page_cgroup(page);
-	/* unused or root page is not rotated. */
-	if (!PageCgroupUsed(pc))
-		return;
-	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
-	smp_rmb();
-	if (mem_cgroup_is_root(pc->mem_cgroup))
-		return;
-	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
-	list_move_tail(&pc->lru, &mz->lists[lru]);
-}
-
-void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
-{
-	struct mem_cgroup_per_zone *mz;
-	struct page_cgroup *pc;
-
-	if (mem_cgroup_disabled())
-		return;
-
-	pc = lookup_page_cgroup(page);
-	/* unused or root page is not rotated. */
-	if (!PageCgroupUsed(pc))
-		return;
-	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
-	smp_rmb();
-	if (mem_cgroup_is_root(pc->mem_cgroup))
-		return;
-	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
-	list_move(&pc->lru, &mz->lists[lru]);
-}
-
-void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
-{
-	struct page_cgroup *pc;
-	struct mem_cgroup_per_zone *mz;
-
-	if (mem_cgroup_disabled())
-		return;
-	pc = lookup_page_cgroup(page);
-	VM_BUG_ON(PageCgroupAcctLRU(pc));
-	if (!PageCgroupUsed(pc))
-		return;
-	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
-	smp_rmb();
-	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
-	/* huge page split is done under lru_lock. so, we have no races. */
-	MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
-	SetPageCgroupAcctLRU(pc);
-	if (mem_cgroup_is_root(pc->mem_cgroup))
-		return;
-	list_add(&pc->lru, &mz->lists[lru]);
+	/* TODO: could be optimized, especially if from == to */
+	mem_cgroup_lru_del_list(zone, page, from);
+	return mem_cgroup_lru_add_list(zone, page, to);
 }
 
 /*
@@ -975,7 +961,7 @@ static void mem_cgroup_lru_del_before_commit(struct page *page)
 	 * is guarded by lock_page() because the page is SwapCache.
 	 */
 	if (!PageCgroupUsed(pc))
-		mem_cgroup_del_lru_list(page, page_lru(page));
+		del_page_from_lru(zone, page);
 	spin_unlock_irqrestore(&zone->lru_lock, flags);
 }
 
@@ -989,22 +975,11 @@ static void mem_cgroup_lru_add_after_commit(struct page *page)
 	if (likely(!PageLRU(page)))
 		return;
 	spin_lock_irqsave(&zone->lru_lock, flags);
-	/* link when the page is linked to LRU but page_cgroup isn't */
 	if (PageLRU(page) && !PageCgroupAcctLRU(pc))
-		mem_cgroup_add_lru_list(page, page_lru(page));
+		add_page_to_lru_list(zone, page, page_lru(page));
 	spin_unlock_irqrestore(&zone->lru_lock, flags);
 }
 
-
-void mem_cgroup_move_lists(struct page *page,
-			   enum lru_list from, enum lru_list to)
-{
-	if (mem_cgroup_disabled())
-		return;
-	mem_cgroup_del_lru_list(page, from);
-	mem_cgroup_add_lru_list(page, to);
-}
-
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
 {
 	int ret;
@@ -1063,6 +1038,9 @@ int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
 	unsigned long present_pages[2];
 	unsigned long inactive_ratio;
 
+	if (!memcg)
+		memcg = root_mem_cgroup;
+
 	inactive_ratio = calc_inactive_ratio(memcg, present_pages);
 
 	inactive = present_pages[0];
@@ -1079,6 +1057,9 @@ int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
 	unsigned long active;
 	unsigned long inactive;
 
+	if (!memcg)
+		memcg = root_mem_cgroup;
+
 	inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
 	active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
 
@@ -1091,8 +1072,12 @@ unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
 {
 	int nid = zone_to_nid(zone);
 	int zid = zone_idx(zone);
-	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+	struct mem_cgroup_per_zone *mz;
+
+	if (!memcg)
+		memcg = root_mem_cgroup;
 
+	mz = mem_cgroup_zoneinfo(memcg, nid, zid);
 	return MEM_CGROUP_ZSTAT(mz, lru);
 }
 
@@ -1101,8 +1086,12 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
 {
 	int nid = zone_to_nid(zone);
 	int zid = zone_idx(zone);
-	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+	struct mem_cgroup_per_zone *mz;
+
+	if (!memcg)
+		memcg = root_mem_cgroup;
 
+	mz = mem_cgroup_zoneinfo(memcg, nid, zid);
 	return &mz->reclaim_stat;
 }
 
@@ -1124,67 +1113,6 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
 	return &mz->reclaim_stat;
 }
 
-unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
-					struct list_head *dst,
-					unsigned long *scanned, int order,
-					int mode, struct zone *z,
-					struct mem_cgroup *mem_cont,
-					int active, int file)
-{
-	unsigned long nr_taken = 0;
-	struct page *page;
-	unsigned long scan;
-	LIST_HEAD(pc_list);
-	struct list_head *src;
-	struct page_cgroup *pc, *tmp;
-	int nid = zone_to_nid(z);
-	int zid = zone_idx(z);
-	struct mem_cgroup_per_zone *mz;
-	int lru = LRU_FILE * file + active;
-	int ret;
-
-	BUG_ON(!mem_cont);
-	mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
-	src = &mz->lists[lru];
-
-	scan = 0;
-	list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
-		if (scan >= nr_to_scan)
-			break;
-
-		if (unlikely(!PageCgroupUsed(pc)))
-			continue;
-
-		page = lookup_cgroup_page(pc);
-
-		if (unlikely(!PageLRU(page)))
-			continue;
-
-		scan++;
-		ret = __isolate_lru_page(page, mode, file);
-		switch (ret) {
-		case 0:
-			list_move(&page->lru, dst);
-			mem_cgroup_del_lru(page);
-			nr_taken += hpage_nr_pages(page);
-			break;
-		case -EBUSY:
-			/* we don't affect global LRU but rotate in our LRU */
-			mem_cgroup_rotate_lru_list(page, page_lru(page));
-			break;
-		default:
-			break;
-		}
-	}
-
-	*scanned = scan;
-
-	trace_mm_vmscan_memcg_isolate(0, nr_to_scan, scan, nr_taken,
-				      0, 0, 0, mode);
-
-	return nr_taken;
-}
-
 #define mem_cgroup_from_res_counter(counter, member)	\
 	container_of(counter, struct mem_cgroup, member)
 
@@ -3458,22 +3386,23 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
 				int node, int zid, enum lru_list lru)
 {
-	struct zone *zone;
 	struct mem_cgroup_per_zone *mz;
-	struct page_cgroup *pc, *busy;
 	unsigned long flags, loop;
 	struct list_head *list;
+	struct page *busy;
+	struct zone *zone;
 	int ret = 0;
 
 	zone = &NODE_DATA(node)->node_zones[zid];
 	mz = mem_cgroup_zoneinfo(mem, node, zid);
-	list = &mz->lists[lru];
+	list = &mz->lruvec.lists[lru];
 
 	loop = MEM_CGROUP_ZSTAT(mz, lru);
 	/* give some margin against EBUSY etc...*/
 	loop += 256;
 	busy = NULL;
 	while (loop--) {
+		struct page_cgroup *pc;
 		struct page *page;
 
 		ret = 0;
@@ -3482,16 +3411,16 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
 			spin_unlock_irqrestore(&zone->lru_lock, flags);
 			break;
 		}
-		pc = list_entry(list->prev, struct page_cgroup, lru);
-		if (busy == pc) {
-			list_move(&pc->lru, list);
+		page = list_entry(list->prev, struct page, lru);
+		if (busy == page) {
+			list_move(&page->lru, list);
 			busy = NULL;
 			spin_unlock_irqrestore(&zone->lru_lock, flags);
 			continue;
 		}
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 
-		page = lookup_cgroup_page(pc);
+		pc = lookup_page_cgroup(page);
 
 		ret = mem_cgroup_move_parent(page, pc, mem, GFP_KERNEL);
 		if (ret == -ENOMEM)
@@ -3499,7 +3428,7 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
 
 		if (ret == -EBUSY || ret == -EINVAL) {
 			/* found lock contention or "pc" is obsolete. */
-			busy = pc;
+			busy = page;
 			cond_resched();
 		} else
 			busy = NULL;
@@ -4519,7 +4448,7 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
 		mz = &pn->zoneinfo[zone];
 		for_each_lru(l)
-			INIT_LIST_HEAD(&mz->lists[l]);
+			INIT_LIST_HEAD(&mz->lruvec.lists[l]);
 		mz->usage_in_excess = 0;
 		mz->on_tree = false;
 		mz->mem = mem;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9f8a97b..4099e8c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4262,7 +4262,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 
 		zone_pcp_init(zone);
 		for_each_lru(l) {
-			INIT_LIST_HEAD(&zone->lru[l].list);
+			INIT_LIST_HEAD(&zone->lruvec.lists[l]);
 			zone->reclaim_stat.nr_saved_scan[l] = 0;
 		}
 		zone->reclaim_stat.recent_rotated[0] = 0;
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 9905501..313e1d7 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -11,12 +11,10 @@
 #include <linux/swapops.h>
 #include <linux/kmemleak.h>
 
-static void __meminit init_page_cgroup(struct page_cgroup *pc, unsigned long id)
+static void __meminit init_page_cgroup(struct page_cgroup *pc)
 {
 	pc->flags = 0;
-	set_page_cgroup_array_id(pc, id);
 	pc->mem_cgroup = NULL;
-	INIT_LIST_HEAD(&pc->lru);
 }
 static unsigned long total_usage;
 
@@ -42,19 +40,6 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
 	return base + offset;
 }
 
-struct page *lookup_cgroup_page(struct page_cgroup *pc)
-{
-	unsigned long pfn;
-	struct page *page;
-	pg_data_t *pgdat;
-
-	pgdat = NODE_DATA(page_cgroup_array_id(pc));
-	pfn = pc - pgdat->node_page_cgroup + pgdat->node_start_pfn;
-	page = pfn_to_page(pfn);
-	VM_BUG_ON(pc != lookup_page_cgroup(page));
-	return page;
-}
-
 static int __init alloc_node_page_cgroup(int nid)
 {
 	struct page_cgroup *base, *pc;
@@ -75,7 +60,7 @@ static int __init alloc_node_page_cgroup(int nid)
 		return -ENOMEM;
 	for (index = 0; index < nr_pages; index++) {
 		pc = base + index;
-		init_page_cgroup(pc, nid);
+		init_page_cgroup(pc);
 	}
 	NODE_DATA(nid)->node_page_cgroup = base;
 	total_usage += table_size;
@@ -117,19 +102,6 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
 	return section->page_cgroup + pfn;
 }
 
-struct page *lookup_cgroup_page(struct page_cgroup *pc)
-{
-	struct mem_section *section;
-	struct page *page;
-	unsigned long nr;
-
-	nr = page_cgroup_array_id(pc);
-	section = __nr_to_section(nr);
-	page = pfn_to_page(pc - section->page_cgroup);
-	VM_BUG_ON(pc != lookup_page_cgroup(page));
-	return page;
-}
-
 static void *__init_refok alloc_page_cgroup(size_t size, int nid)
 {
 	void *addr = NULL;
@@ -167,12 +139,9 @@ static int __init_refok init_section_page_cgroup(unsigned long pfn)
 	struct page_cgroup *base, *pc;
 	struct mem_section *section;
 	unsigned long table_size;
-	unsigned long nr;
 	int nid, index;
 
-	nr = pfn_to_section_nr(pfn);
-	section = __nr_to_section(nr);
-
+	section = __pfn_to_section(pfn);
 	if (section->page_cgroup)
 		return 0;
 
@@ -194,7 +163,7 @@ static int __init_refok init_section_page_cgroup(unsigned long pfn)
 
 	for (index = 0; index < PAGES_PER_SECTION; index++) {
 		pc = base + index;
-		init_page_cgroup(pc, nr);
+		init_page_cgroup(pc);
 	}
 
 	section->page_cgroup = base - pfn;
diff --git a/mm/swap.c b/mm/swap.c
index a448db3..12095a0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -209,12 +209,14 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 static void pagevec_move_tail_fn(struct page *page, void *arg)
 {
 	int *pgmoved = arg;
-	struct zone *zone = page_zone(page);
 
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		enum lru_list lru = page_lru_base_type(page);
-		list_move_tail(&page->lru, &zone->lru[lru].list);
-		mem_cgroup_rotate_reclaimable_page(page);
+		struct lruvec *lruvec;
+
+		lruvec = mem_cgroup_lru_move_lists(page_zone(page),
+						   page, lru, lru);
+		list_move_tail(&page->lru, &lruvec->lists[lru]);
 		(*pgmoved)++;
 	}
 }
@@ -417,12 +419,13 @@ static void lru_deactivate_fn(struct page *page, void *arg)
 		 */
 		SetPageReclaim(page);
 	} else {
+		struct lruvec *lruvec;
 		/*
 		 * The page's writeback ends up during pagevec
 		 * We moves tha page into tail of inactive.
 		 */
-		list_move_tail(&page->lru, &zone->lru[lru].list);
-		mem_cgroup_rotate_reclaimable_page(page);
+		lruvec = mem_cgroup_lru_move_lists(zone, page, lru, lru);
+		list_move_tail(&page->lru, &lruvec->lists[lru]);
 		__count_vm_event(PGROTATED);
 	}
 
@@ -594,7 +597,6 @@ void lru_add_page_tail(struct zone* zone,
 	int active;
 	enum lru_list lru;
 	const int file = 0;
-	struct list_head *head;
 
 	VM_BUG_ON(!PageHead(page));
 	VM_BUG_ON(PageCompound(page_tail));
@@ -614,10 +616,10 @@ void lru_add_page_tail(struct zone* zone,
 		}
 		update_page_reclaim_stat(zone, page_tail, file, active);
 		if (likely(PageLRU(page)))
-			head = page->lru.prev;
+			__add_page_to_lru_list(zone, page_tail, lru,
+					       page->lru.prev);
 		else
-			head = &zone->lru[lru].list;
-		__add_page_to_lru_list(zone, page_tail, lru, head);
+			add_page_to_lru_list(zone, page_tail, lru);
 	} else {
 		SetPageUnevictable(page_tail);
 		add_page_to_lru_list(zone, page_tail, LRU_UNEVICTABLE);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0e45ceb..0381a5d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -162,34 +162,27 @@ static bool global_reclaim(struct scan_control *sc)
 {
 	return !sc->memcg;
 }
-static bool scanning_global_lru(struct scan_control *sc)
-{
-	return !sc->current_memcg;
-}
 #else
 static bool global_reclaim(struct scan_control *sc) { return 1; }
-static bool scanning_global_lru(struct scan_control *sc) { return 1; }
 #endif
 
 static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
 						  struct scan_control *sc)
 {
-	if (!scanning_global_lru(sc))
-		return mem_cgroup_get_reclaim_stat(sc->current_memcg, zone);
-
-	return &zone->reclaim_stat;
+	if (mem_cgroup_disabled())
+		return &zone->reclaim_stat;
+	return mem_cgroup_get_reclaim_stat(sc->current_memcg, zone);
 }
 
 static unsigned long zone_nr_lru_pages(struct zone *zone,
-				struct scan_control *sc, enum lru_list lru)
+				       struct scan_control *sc,
+				       enum lru_list lru)
 {
-	if (!scanning_global_lru(sc))
-		return mem_cgroup_zone_nr_pages(sc->current_memcg, zone, lru);
-
-	return zone_page_state(zone, NR_LRU_BASE + lru);
+	if (mem_cgroup_disabled())
+		return zone_page_state(zone, NR_LRU_BASE + lru);
+	return mem_cgroup_zone_nr_pages(sc->current_memcg, zone, lru);
 }
 
-
 /*
  * Add a shrinker callback to be called from the vm
  */
@@ -1055,15 +1048,14 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 
 		switch (__isolate_lru_page(page, mode, file)) {
 		case 0:
+			mem_cgroup_lru_del(page_zone(page), page);
 			list_move(&page->lru, dst);
-			mem_cgroup_del_lru(page);
 			nr_taken += hpage_nr_pages(page);
 			break;
 
 		case -EBUSY:
 			/* else it is being freed elsewhere */
 			list_move(&page->lru, src);
-			mem_cgroup_rotate_lru_list(page, page_lru(page));
 			continue;
 
 		default:
@@ -1113,8 +1105,9 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 				break;
 
 			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
+				mem_cgroup_lru_del(page_zone(cursor_page),
+						   cursor_page);
 				list_move(&cursor_page->lru, dst);
-				mem_cgroup_del_lru(cursor_page);
 				nr_taken += hpage_nr_pages(page);
 				nr_lumpy_taken++;
 				if (PageDirty(cursor_page))
@@ -1143,19 +1136,22 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	return nr_taken;
 }
 
-static unsigned long isolate_pages_global(unsigned long nr,
-					struct list_head *dst,
-					unsigned long *scanned, int order,
-					int mode, struct zone *z,
-					int active, int file)
+static unsigned long isolate_pages(unsigned long nr,
+				   struct list_head *dst,
+				   unsigned long *scanned, int order,
+				   int mode, struct zone *z,
+				   int active, int file,
+				   struct mem_cgroup *mem)
 {
+	struct lruvec *lruvec = mem_cgroup_zone_lruvec(z, mem);
 	int lru = LRU_BASE;
+
 	if (active)
 		lru += LRU_ACTIVE;
 	if (file)
 		lru += LRU_FILE;
-	return isolate_lru_pages(nr, &z->lru[lru].list, dst, scanned, order,
-								mode, file);
+	return isolate_lru_pages(nr, &lruvec->lists[lru], dst,
+				 scanned, order, mode, file);
 }
 
 /*
@@ -1403,20 +1399,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
 
-	if (scanning_global_lru(sc)) {
-		nr_taken = isolate_pages_global(nr_to_scan,
-			&page_list, &nr_scanned, sc->order,
-			sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
-					ISOLATE_BOTH : ISOLATE_INACTIVE,
-			zone, 0, file);
-	} else {
-		nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
+	nr_taken = isolate_pages(nr_to_scan,
 			&page_list, &nr_scanned, sc->order,
 			sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
 					ISOLATE_BOTH : ISOLATE_INACTIVE,
-			zone, sc->current_memcg,
-			0, file);
-	}
+			zone, 0, file, sc->current_memcg);
 
 	if (global_reclaim(sc)) {
 		zone->pages_scanned += nr_scanned;
@@ -1491,13 +1478,15 @@ static void move_active_pages_to_lru(struct zone *zone,
 	pagevec_init(&pvec, 1);
 
 	while (!list_empty(list)) {
+		struct lruvec *lruvec;
+
 		page = lru_to_page(list);
 
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 
-		list_move(&page->lru, &zone->lru[lru].list);
-		mem_cgroup_add_lru_list(page, lru);
+		lruvec = mem_cgroup_lru_add_list(zone, page, lru);
+		list_move(&page->lru, &lruvec->lists[lru]);
 		pgmoved += hpage_nr_pages(page);
 
 		if (!pagevec_add(&pvec, page) || list_empty(list)) {
@@ -1528,17 +1517,10 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
-	if (scanning_global_lru(sc)) {
-		nr_taken = isolate_pages_global(nr_pages, &l_hold,
-						&pgscanned, sc->order,
-						ISOLATE_ACTIVE, zone,
-						1, file);
-	} else {
-		nr_taken = mem_cgroup_isolate_pages(nr_pages, &l_hold,
-						&pgscanned, sc->order,
-						ISOLATE_ACTIVE, zone,
-						sc->current_memcg, 1, file);
-	}
+	nr_taken = isolate_pages(nr_pages, &l_hold,
+				 &pgscanned, sc->order,
+				 ISOLATE_ACTIVE, zone,
+				 1, file, sc->current_memcg);
 
 	if (global_reclaim(sc))
 		zone->pages_scanned += pgscanned;
@@ -1628,8 +1610,6 @@ static int inactive_anon_is_low_global(struct zone *zone)
  */
 static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
 {
-	int low;
-
 	/*
 	 * If we don't have swap space, anonymous page deactivation
 	 * is pointless.
@@ -1637,11 +1617,9 @@ static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
 	if (!total_swap_pages)
 		return 0;
 
-	if (scanning_global_lru(sc))
-		low = inactive_anon_is_low_global(zone);
-	else
-		low = mem_cgroup_inactive_anon_is_low(sc->current_memcg);
-	return low;
+	if (mem_cgroup_disabled())
+		return inactive_anon_is_low_global(zone);
+	return mem_cgroup_inactive_anon_is_low(sc->current_memcg);
 }
 #else
 static inline int inactive_anon_is_low(struct zone *zone,
@@ -1678,13 +1656,9 @@ static int inactive_file_is_low_global(struct zone *zone)
  */
 static int inactive_file_is_low(struct zone *zone, struct scan_control *sc)
 {
-	int low;
-
-	if (scanning_global_lru(sc))
-		low = inactive_file_is_low_global(zone);
-	else
-		low = mem_cgroup_inactive_file_is_low(sc->current_memcg);
-	return low;
+	if (mem_cgroup_disabled())
+		return inactive_file_is_low_global(zone);
+	return mem_cgroup_inactive_file_is_low(sc->current_memcg);
 }
 
 static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
@@ -3161,16 +3135,18 @@ int page_evictable(struct page *page, struct vm_area_struct *vma)
  */
 static void check_move_unevictable_page(struct page *page, struct zone *zone)
 {
-	VM_BUG_ON(PageActive(page));
+	struct lruvec *lruvec;
 
+	VM_BUG_ON(PageActive(page));
 retry:
 	ClearPageUnevictable(page);
 	if (page_evictable(page, NULL)) {
 		enum lru_list l = page_lru_base_type(page);
 
 		__dec_zone_state(zone, NR_UNEVICTABLE);
-		list_move(&page->lru, &zone->lru[l].list);
-		mem_cgroup_move_lists(page, LRU_UNEVICTABLE, l);
+		lruvec = mem_cgroup_lru_move_lists(zone, page,
+						   LRU_UNEVICTABLE, l);
+		list_move(&page->lru, &lruvec->lists[l]);
 		__inc_zone_state(zone, NR_INACTIVE_ANON + l);
 		__count_vm_event(UNEVICTABLE_PGRESCUED);
 	} else {
@@ -3178,8 +3154,9 @@ retry:
 		 * rotate unevictable list
 		 */
 		SetPageUnevictable(page);
-		list_move(&page->lru, &zone->lru[LRU_UNEVICTABLE].list);
-		mem_cgroup_rotate_lru_list(page, LRU_UNEVICTABLE);
+		lruvec = mem_cgroup_lru_move_lists(zone, page, LRU_UNEVICTABLE,
+						   LRU_UNEVICTABLE);
+		list_move(&page->lru, &lruvec->lists[LRU_UNEVICTABLE]);
 		if (page_evictable(page, NULL))
 			goto retry;
 	}
@@ -3253,29 +3230,37 @@ void scan_mapping_unevictable_pages(struct address_space *mapping)
 #define SCAN_UNEVICTABLE_BATCH_SIZE 16UL /* arbitrary lock hold batch size */
 static void scan_zone_unevictable_pages(struct zone *zone)
 {
-	struct list_head *l_unevictable = &zone->lru[LRU_UNEVICTABLE].list;
-	unsigned long scan;
 	unsigned long nr_to_scan = zone_page_state(zone, NR_UNEVICTABLE);
 
 	while (nr_to_scan > 0) {
 		unsigned long batch_size = min(nr_to_scan,
 						SCAN_UNEVICTABLE_BATCH_SIZE);
+		struct mem_cgroup *mem = NULL;
 
-		spin_lock_irq(&zone->lru_lock);
-		for (scan = 0;  scan < batch_size; scan++) {
-			struct page *page = lru_to_page(l_unevictable);
-
-			if (!trylock_page(page))
-				continue;
+		do {
+			struct list_head *list;
+			struct lruvec *lruvec;
+			unsigned long scan;
 
-			prefetchw_prev_lru_page(page, l_unevictable, flags);
+			mem_cgroup_hierarchy_walk(NULL, &mem);
+			spin_lock_irq(&zone->lru_lock);
+			lruvec = mem_cgroup_zone_lruvec(zone, mem);
+			list = &lruvec->lists[LRU_UNEVICTABLE];
+			for (scan = 0;  scan < batch_size; scan++) {
+				struct page *page = lru_to_page(list);
 
-			if (likely(PageLRU(page) && PageUnevictable(page)))
+				if (!trylock_page(page))
+					continue;
+				prefetchw_prev_lru_page(page, list, flags);
+				if (unlikely(!PageLRU(page)))
+					continue;
+				if (unlikely(!PageUnevictable(page)))
+					continue;
 				check_move_unevictable_page(page, zone);
-
-			unlock_page(page);
-		}
-		spin_unlock_irq(&zone->lru_lock);
+				unlock_page(page);
+			}
+			spin_unlock_irq(&zone->lru_lock);
+		} while (mem);
 
 		nr_to_scan -= batch_size;
 	}
-- 
1.7.5.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [rfc patch 6/6] memcg: rework soft limit reclaim
  2011-05-12 14:53 ` Johannes Weiner
@ 2011-05-12 14:53   ` Johannes Weiner
  -1 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2011-05-12 14:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Andrew Morton, Rik van Riel, Minchan Kim,
	KOSAKI Motohiro, Mel Gorman
  Cc: linux-mm, linux-kernel

The current soft limit reclaim algorithm is entered from kswapd.  It
selects the memcg that exceeds its soft limit the most in absolute
bytes and reclaims from it most aggressively (priority 0).

This has several disadvantages:

	1. because of the aggressiveness, kswapd can stall for a long
	time on a memcg that is hard to reclaim before moving on to
	other pages.

	2. it only considers the biggest violator (in absolute bytes!)
	and does not put extra pressure on other memcgs in excess.

	3. it needs a ton of code to quickly find the target

This patch removes all the explicit soft limit target selection and
instead hooks into the hierarchical memcg walk that is done by direct
reclaim and kswapd balancing.  If it encounters a memcg that exceeds
its soft limit, or contributes to the soft limit excess in one of its
hierarchy parents, it scans the memcg one priority level below the
current reclaim priority.

	1. the primary goal is to reclaim pages, not to punish soft
	limit violators at any price

	2. increased pressure is applied to all violators, not just
	the biggest one

	3. the soft limit is no longer only meaningful under global
	memory pressure, but is considered for any hierarchical reclaim.
	This means that even for hard limit reclaim, the children in
	excess of their soft limit experience more pressure compared
	to their siblings

	4. direct reclaim, not only kswapd, now applies more pressure
	to memcgs in soft limit excess

	5. the implementation is only a few lines of straightforward
	code
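
Concretely, the hook added to shrink_zone() reduces to the following
(see the vmscan.c hunk below for the full context):

	mem_cgroup_hierarchy_walk(root, &mem);
	sc->current_memcg = mem;
	if (mem_cgroup_soft_limit_exceeded(root, mem))
		epriority -= 1;
	do_shrink_zone(epriority, zone, sc);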

RFC: since there is no longer a reliable way of counting the pages
reclaimed solely because of an exceeded soft limit, this patch
conflicts with Ying's exporting of exactly this number to userspace.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h |   16 +-
 include/linux/swap.h       |    4 -
 mm/memcontrol.c            |  450 +++-----------------------------------------
 mm/vmscan.c                |   48 +-----
 4 files changed, 34 insertions(+), 484 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 65163c2..b0c7323 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -99,6 +99,7 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
  * For memory reclaim.
  */
 void mem_cgroup_hierarchy_walk(struct mem_cgroup *, struct mem_cgroup **);
+bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *, struct mem_cgroup *);
 void mem_cgroup_count_reclaim(struct mem_cgroup *, bool, bool,
 			      unsigned long, unsigned long);
 int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
@@ -140,8 +141,6 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 	mem_cgroup_update_page_stat(page, idx, -1);
 }
 
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
-						gfp_t gfp_mask);
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -294,6 +293,12 @@ static inline void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
 	*iter = start;
 }
 
+static inline bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *root,
+						  struct mem_cgroup *mem)
+{
+	return 0;
+}
+
 static inline void mem_cgroup_count_reclaim(struct mem_cgroup *mem,
 					    bool kswapd, bool hierarchy,
 					    unsigned long scanned,
@@ -349,13 +354,6 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 }
 
 static inline
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
-					    gfp_t gfp_mask)
-{
-	return 0;
-}
-
-static inline
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
 {
 	return 0;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a5c6da5..885cf19 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -254,10 +254,6 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 						  gfp_t gfp_mask, bool noswap,
 						  unsigned int swappiness);
-extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
-						gfp_t gfp_mask, bool noswap,
-						unsigned int swappiness,
-						struct zone *zone);
 extern int __isolate_lru_page(struct page *page, int mode, int file);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f5d90ba..b0c6dd5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -34,7 +34,6 @@
 #include <linux/rcupdate.h>
 #include <linux/limits.h>
 #include <linux/mutex.h>
-#include <linux/rbtree.h>
 #include <linux/slab.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
@@ -138,12 +137,6 @@ struct mem_cgroup_per_zone {
 	unsigned long		count[NR_LRU_LISTS];
 
 	struct zone_reclaim_stat reclaim_stat;
-	struct rb_node		tree_node;	/* RB tree node */
-	unsigned long long	usage_in_excess;/* Set to the value by which */
-						/* the soft limit is exceeded*/
-	bool			on_tree;
-	struct mem_cgroup	*mem;		/* Back pointer, we cannot */
-						/* use container_of	   */
 };
 /* Macro for accessing counter */
 #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
@@ -156,26 +149,6 @@ struct mem_cgroup_lru_info {
 	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
 };
 
-/*
- * Cgroups above their limits are maintained in a RB-Tree, independent of
- * their hierarchy representation
- */
-
-struct mem_cgroup_tree_per_zone {
-	struct rb_root rb_root;
-	spinlock_t lock;
-};
-
-struct mem_cgroup_tree_per_node {
-	struct mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES];
-};
-
-struct mem_cgroup_tree {
-	struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
-};
-
-static struct mem_cgroup_tree soft_limit_tree __read_mostly;
-
 struct mem_cgroup_threshold {
 	struct eventfd_ctx *eventfd;
 	u64 threshold;
@@ -323,12 +296,7 @@ static bool move_file(void)
 					&mc.to->move_charge_at_immigrate);
 }
 
-/*
- * Maximum loops in mem_cgroup_soft_reclaim(), used for soft
- * limit reclaim to prevent infinite loops, if they ever occur.
- */
 #define	MEM_CGROUP_MAX_RECLAIM_LOOPS		(100)
-#define	MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS	(2)
 
 enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
@@ -375,164 +343,6 @@ page_cgroup_zoneinfo(struct mem_cgroup *mem, struct page *page)
 	return mem_cgroup_zoneinfo(mem, nid, zid);
 }
 
-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_node_zone(int nid, int zid)
-{
-	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
-}
-
-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_from_page(struct page *page)
-{
-	int nid = page_to_nid(page);
-	int zid = page_zonenum(page);
-
-	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
-}
-
-static void
-__mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
-				struct mem_cgroup_per_zone *mz,
-				struct mem_cgroup_tree_per_zone *mctz,
-				unsigned long long new_usage_in_excess)
-{
-	struct rb_node **p = &mctz->rb_root.rb_node;
-	struct rb_node *parent = NULL;
-	struct mem_cgroup_per_zone *mz_node;
-
-	if (mz->on_tree)
-		return;
-
-	mz->usage_in_excess = new_usage_in_excess;
-	if (!mz->usage_in_excess)
-		return;
-	while (*p) {
-		parent = *p;
-		mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
-					tree_node);
-		if (mz->usage_in_excess < mz_node->usage_in_excess)
-			p = &(*p)->rb_left;
-		/*
-		 * We can't avoid mem cgroups that are over their soft
-		 * limit by the same amount
-		 */
-		else if (mz->usage_in_excess >= mz_node->usage_in_excess)
-			p = &(*p)->rb_right;
-	}
-	rb_link_node(&mz->tree_node, parent, p);
-	rb_insert_color(&mz->tree_node, &mctz->rb_root);
-	mz->on_tree = true;
-}
-
-static void
-__mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
-				struct mem_cgroup_per_zone *mz,
-				struct mem_cgroup_tree_per_zone *mctz)
-{
-	if (!mz->on_tree)
-		return;
-	rb_erase(&mz->tree_node, &mctz->rb_root);
-	mz->on_tree = false;
-}
-
-static void
-mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
-				struct mem_cgroup_per_zone *mz,
-				struct mem_cgroup_tree_per_zone *mctz)
-{
-	spin_lock(&mctz->lock);
-	__mem_cgroup_remove_exceeded(mem, mz, mctz);
-	spin_unlock(&mctz->lock);
-}
-
-
-static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
-{
-	unsigned long long excess;
-	struct mem_cgroup_per_zone *mz;
-	struct mem_cgroup_tree_per_zone *mctz;
-	int nid = page_to_nid(page);
-	int zid = page_zonenum(page);
-	mctz = soft_limit_tree_from_page(page);
-
-	/*
-	 * Necessary to update all ancestors when hierarchy is used.
-	 * because their event counter is not touched.
-	 */
-	for (; mem; mem = parent_mem_cgroup(mem)) {
-		mz = mem_cgroup_zoneinfo(mem, nid, zid);
-		excess = res_counter_soft_limit_excess(&mem->res);
-		/*
-		 * We have to update the tree if mz is on RB-tree or
-		 * mem is over its softlimit.
-		 */
-		if (excess || mz->on_tree) {
-			spin_lock(&mctz->lock);
-			/* if on-tree, remove it */
-			if (mz->on_tree)
-				__mem_cgroup_remove_exceeded(mem, mz, mctz);
-			/*
-			 * Insert again. mz->usage_in_excess will be updated.
-			 * If excess is 0, no tree ops.
-			 */
-			__mem_cgroup_insert_exceeded(mem, mz, mctz, excess);
-			spin_unlock(&mctz->lock);
-		}
-	}
-}
-
-static void mem_cgroup_remove_from_trees(struct mem_cgroup *mem)
-{
-	int node, zone;
-	struct mem_cgroup_per_zone *mz;
-	struct mem_cgroup_tree_per_zone *mctz;
-
-	for_each_node_state(node, N_POSSIBLE) {
-		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
-			mz = mem_cgroup_zoneinfo(mem, node, zone);
-			mctz = soft_limit_tree_node_zone(node, zone);
-			mem_cgroup_remove_exceeded(mem, mz, mctz);
-		}
-	}
-}
-
-static struct mem_cgroup_per_zone *
-__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
-{
-	struct rb_node *rightmost = NULL;
-	struct mem_cgroup_per_zone *mz;
-
-retry:
-	mz = NULL;
-	rightmost = rb_last(&mctz->rb_root);
-	if (!rightmost)
-		goto done;		/* Nothing to reclaim from */
-
-	mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node);
-	/*
-	 * Remove the node now but someone else can add it back,
-	 * we will to add it back at the end of reclaim to its correct
-	 * position in the tree.
-	 */
-	__mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
-	if (!res_counter_soft_limit_excess(&mz->mem->res) ||
-		!css_tryget(&mz->mem->css))
-		goto retry;
-done:
-	return mz;
-}
-
-static struct mem_cgroup_per_zone *
-mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
-{
-	struct mem_cgroup_per_zone *mz;
-
-	spin_lock(&mctz->lock);
-	mz = __mem_cgroup_largest_soft_limit_node(mctz);
-	spin_unlock(&mctz->lock);
-	return mz;
-}
-
 /*
  * Implementation Note: reading percpu statistics for memcg.
  *
@@ -570,15 +380,6 @@ static long mem_cgroup_read_stat(struct mem_cgroup *mem,
 	return val;
 }
 
-static long mem_cgroup_local_usage(struct mem_cgroup *mem)
-{
-	long ret;
-
-	ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
-	ret += mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
-	return ret;
-}
-
 static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
 					 bool charge)
 {
@@ -699,7 +500,6 @@ static void memcg_check_events(struct mem_cgroup *mem, struct page *page)
 		__mem_cgroup_target_update(mem, MEM_CGROUP_TARGET_THRESH);
 		if (unlikely(__memcg_event_check(mem,
 			MEM_CGROUP_TARGET_SOFTLIMIT))){
-			mem_cgroup_update_tree(mem, page);
 			__mem_cgroup_target_update(mem,
 				MEM_CGROUP_TARGET_SOFTLIMIT);
 		}
@@ -1380,6 +1180,29 @@ void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
 	*iter = mem;
 }
 
+bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *root,
+				    struct mem_cgroup *mem)
+{
+	/* root_mem_cgroup never exceeds its soft limit */
+	if (!mem)
+		return false;
+	if (!root)
+		root = root_mem_cgroup;
+	/*
+	 * See whether the memcg in question exceeds its soft limit
+	 * directly, or contributes to the soft limit excess in the
+	 * hierarchy below the given root.
+	 */
+	while (mem != root) {
+		if (res_counter_soft_limit_excess(&mem->res))
+			return true;
+		if (!mem->use_hierarchy)
+			break;
+		mem = mem_cgroup_from_cont(mem->css.cgroup->parent);
+	}
+	return false;
+}
+
 static unsigned long mem_cgroup_target_reclaim(struct mem_cgroup *mem,
 					       gfp_t gfp_mask,
 					       bool noswap,
@@ -1411,114 +1234,6 @@ static unsigned long mem_cgroup_target_reclaim(struct mem_cgroup *mem,
 }
 
 /*
- * Visit the first child (need not be the first child as per the ordering
- * of the cgroup list, since we track last_scanned_child) of @mem and use
- * that to reclaim free pages from.
- */
-static struct mem_cgroup *
-mem_cgroup_select_victim(struct mem_cgroup *root_mem)
-{
-	struct mem_cgroup *ret = NULL;
-	struct cgroup_subsys_state *css;
-	int nextid, found;
-
-	if (!root_mem->use_hierarchy) {
-		css_get(&root_mem->css);
-		ret = root_mem;
-	}
-
-	while (!ret) {
-		rcu_read_lock();
-		nextid = root_mem->last_scanned_child + 1;
-		css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css,
-				   &found);
-		if (css && css_tryget(css))
-			ret = container_of(css, struct mem_cgroup, css);
-
-		rcu_read_unlock();
-		/* Updates scanning parameter */
-		if (!css) {
-			/* this means start scan from ID:1 */
-			root_mem->last_scanned_child = 0;
-		} else
-			root_mem->last_scanned_child = found;
-	}
-
-	return ret;
-}
-
-/*
- * Scan the hierarchy if needed to reclaim memory. We remember the last child
- * we reclaimed from, so that we don't end up penalizing one child extensively
- * based on its position in the children list.
- *
- * root_mem is the original ancestor that we've been reclaim from.
- *
- * We give up and return to the caller when we visit root_mem twice.
- * (other groups can be removed while we're walking....)
- */
-static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_mem,
-				   struct zone *zone,
-				   gfp_t gfp_mask)
-{
-	struct mem_cgroup *victim;
-	int ret, total = 0;
-	int loop = 0;
-	unsigned long excess;
-	bool noswap = false;
-
-	excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
-
-	/* If memsw_is_minimum==1, swap-out is of-no-use. */
-	if (root_mem->memsw_is_minimum)
-		noswap = true;
-
-	while (1) {
-		victim = mem_cgroup_select_victim(root_mem);
-		if (victim == root_mem) {
-			loop++;
-			if (loop >= 1)
-				drain_all_stock_async();
-			if (loop >= 2) {
-				/*
-				 * If we have not been able to reclaim
-				 * anything, it might because there are
-				 * no reclaimable pages under this hierarchy
-				 */
-				if (!total) {
-					css_put(&victim->css);
-					break;
-				}
-				/*
-				 * We want to do more targeted reclaim.
-				 * excess >> 2 is not to excessive so as to
-				 * reclaim too much, nor too less that we keep
-				 * coming back to reclaim from this cgroup
-				 */
-				if (total >= (excess >> 2) ||
-					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) {
-					css_put(&victim->css);
-					break;
-				}
-			}
-		}
-		if (!mem_cgroup_local_usage(victim)) {
-			/* this cgroup's local usage == 0 */
-			css_put(&victim->css);
-			continue;
-		}
-		/* we use swappiness of local cgroup */
-		ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
-				noswap, get_swappiness(victim), zone);
-		css_put(&victim->css);
-		total += ret;
-		if (!res_counter_soft_limit_excess(&root_mem->res))
-			return total;
-	}
-	return total;
-}
-
-/*
  * Check OOM-Killer is already running under our hierarchy.
  * If someone is running, return false.
  */
@@ -3291,94 +3006,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 	return ret;
 }
 
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
-					    gfp_t gfp_mask)
-{
-	unsigned long nr_reclaimed = 0;
-	struct mem_cgroup_per_zone *mz, *next_mz = NULL;
-	unsigned long reclaimed;
-	int loop = 0;
-	struct mem_cgroup_tree_per_zone *mctz;
-	unsigned long long excess;
-
-	if (order > 0)
-		return 0;
-
-	mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone));
-	/*
-	 * This loop can run a while, specially if mem_cgroup's continuously
-	 * keep exceeding their soft limit and putting the system under
-	 * pressure
-	 */
-	do {
-		if (next_mz)
-			mz = next_mz;
-		else
-			mz = mem_cgroup_largest_soft_limit_node(mctz);
-		if (!mz)
-			break;
-
-		reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone, gfp_mask);
-		nr_reclaimed += reclaimed;
-		spin_lock(&mctz->lock);
-
-		/*
-		 * If we failed to reclaim anything from this memory cgroup
-		 * it is time to move on to the next cgroup
-		 */
-		next_mz = NULL;
-		if (!reclaimed) {
-			do {
-				/*
-				 * Loop until we find yet another one.
-				 *
-				 * By the time we get the soft_limit lock
-				 * again, someone might have aded the
-				 * group back on the RB tree. Iterate to
-				 * make sure we get a different mem.
-				 * mem_cgroup_largest_soft_limit_node returns
-				 * NULL if no other cgroup is present on
-				 * the tree
-				 */
-				next_mz =
-				__mem_cgroup_largest_soft_limit_node(mctz);
-				if (next_mz == mz) {
-					css_put(&next_mz->mem->css);
-					next_mz = NULL;
-				} else /* next_mz == NULL or other memcg */
-					break;
-			} while (1);
-		}
-		__mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
-		excess = res_counter_soft_limit_excess(&mz->mem->res);
-		/*
-		 * One school of thought says that we should not add
-		 * back the node to the tree if reclaim returns 0.
-		 * But our reclaim could return 0, simply because due
-		 * to priority we are exposing a smaller subset of
-		 * memory to reclaim from. Consider this as a longer
-		 * term TODO.
-		 */
-		/* If excess == 0, no tree ops */
-		__mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess);
-		spin_unlock(&mctz->lock);
-		css_put(&mz->mem->css);
-		loop++;
-		/*
-		 * Could not reclaim anything and there are no more
-		 * mem cgroups to try or we seem to be looping without
-		 * reclaiming anything.
-		 */
-		if (!nr_reclaimed &&
-			(next_mz == NULL ||
-			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
-			break;
-	} while (!nr_reclaimed);
-	if (next_mz)
-		css_put(&next_mz->mem->css);
-	return nr_reclaimed;
-}
-
 /*
  * This routine traverse page_cgroup in given list and drop them all.
  * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
@@ -4449,9 +4076,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 		mz = &pn->zoneinfo[zone];
 		for_each_lru(l)
 			INIT_LIST_HEAD(&mz->lruvec.lists[l]);
-		mz->usage_in_excess = 0;
-		mz->on_tree = false;
-		mz->mem = mem;
 	}
 	return 0;
 }
@@ -4504,7 +4128,6 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
 {
 	int node;
 
-	mem_cgroup_remove_from_trees(mem);
 	free_css_id(&mem_cgroup_subsys, &mem->css);
 
 	for_each_node_state(node, N_POSSIBLE)
@@ -4559,31 +4182,6 @@ static void __init enable_swap_cgroup(void)
 }
 #endif
 
-static int mem_cgroup_soft_limit_tree_init(void)
-{
-	struct mem_cgroup_tree_per_node *rtpn;
-	struct mem_cgroup_tree_per_zone *rtpz;
-	int tmp, node, zone;
-
-	for_each_node_state(node, N_POSSIBLE) {
-		tmp = node;
-		if (!node_state(node, N_NORMAL_MEMORY))
-			tmp = -1;
-		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp);
-		if (!rtpn)
-			return 1;
-
-		soft_limit_tree.rb_tree_per_node[node] = rtpn;
-
-		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
-			rtpz = &rtpn->rb_tree_per_zone[zone];
-			rtpz->rb_root = RB_ROOT;
-			spin_lock_init(&rtpz->lock);
-		}
-	}
-	return 0;
-}
-
 static struct cgroup_subsys_state * __ref
 mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 {
@@ -4605,8 +4203,6 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 		enable_swap_cgroup();
 		parent = NULL;
 		root_mem_cgroup = mem;
-		if (mem_cgroup_soft_limit_tree_init())
-			goto free_out;
 		for_each_possible_cpu(cpu) {
 			struct memcg_stock_pcp *stock =
 						&per_cpu(memcg_stock, cpu);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0381a5d..2b701e0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1937,10 +1937,13 @@ static void shrink_zone(int priority, struct zone *zone,
 	do {
 		unsigned long reclaimed = sc->nr_reclaimed;
 		unsigned long scanned = sc->nr_scanned;
+		int epriority = priority;
 
 		mem_cgroup_hierarchy_walk(root, &mem);
 		sc->current_memcg = mem;
-		do_shrink_zone(priority, zone, sc);
+		if (mem_cgroup_soft_limit_exceeded(root, mem))
+			epriority -= 1;
+		do_shrink_zone(epriority, zone, sc);
 		mem_cgroup_count_reclaim(mem, current_is_kswapd(),
 					 mem != root, /* limit or hierarchy? */
 					 sc->nr_scanned - scanned,
@@ -2153,42 +2156,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 }
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
-
-unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
-						gfp_t gfp_mask, bool noswap,
-						unsigned int swappiness,
-						struct zone *zone)
-{
-	struct scan_control sc = {
-		.nr_to_reclaim = SWAP_CLUSTER_MAX,
-		.may_writepage = !laptop_mode,
-		.may_unmap = 1,
-		.may_swap = !noswap,
-		.swappiness = swappiness,
-		.order = 0,
-		.memcg = mem,
-	};
-	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
-			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
-
-	trace_mm_vmscan_memcg_softlimit_reclaim_begin(0,
-						      sc.may_writepage,
-						      sc.gfp_mask);
-
-	/*
-	 * NOTE: Although we can get the priority field, using it
-	 * here is not a good idea, since it limits the pages we can scan.
-	 * if we don't reclaim here, the shrink_zone from balance_pgdat
-	 * will pick up pages from other mem cgroup's as well. We hack
-	 * the priority and make it zero.
-	 */
-	do_shrink_zone(0, zone, &sc);
-
-	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
-
-	return sc.nr_reclaimed;
-}
-
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 					   gfp_t gfp_mask,
 					   bool noswap,
@@ -2418,13 +2385,6 @@ loop_again:
 				continue;
 
 			sc.nr_scanned = 0;
-
-			/*
-			 * Call soft limit reclaim before calling shrink_zone.
-			 * For now we ignore the return value
-			 */
-			mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask);
-
 			/*
 			 * We put equal pressure on every zone, unless
 			 * one zone has way too many pages free
-- 
1.7.5.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

-{
-	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
-}
-
-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_from_page(struct page *page)
-{
-	int nid = page_to_nid(page);
-	int zid = page_zonenum(page);
-
-	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
-}
-
-static void
-__mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
-				struct mem_cgroup_per_zone *mz,
-				struct mem_cgroup_tree_per_zone *mctz,
-				unsigned long long new_usage_in_excess)
-{
-	struct rb_node **p = &mctz->rb_root.rb_node;
-	struct rb_node *parent = NULL;
-	struct mem_cgroup_per_zone *mz_node;
-
-	if (mz->on_tree)
-		return;
-
-	mz->usage_in_excess = new_usage_in_excess;
-	if (!mz->usage_in_excess)
-		return;
-	while (*p) {
-		parent = *p;
-		mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
-					tree_node);
-		if (mz->usage_in_excess < mz_node->usage_in_excess)
-			p = &(*p)->rb_left;
-		/*
-		 * We can't avoid mem cgroups that are over their soft
-		 * limit by the same amount
-		 */
-		else if (mz->usage_in_excess >= mz_node->usage_in_excess)
-			p = &(*p)->rb_right;
-	}
-	rb_link_node(&mz->tree_node, parent, p);
-	rb_insert_color(&mz->tree_node, &mctz->rb_root);
-	mz->on_tree = true;
-}
-
-static void
-__mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
-				struct mem_cgroup_per_zone *mz,
-				struct mem_cgroup_tree_per_zone *mctz)
-{
-	if (!mz->on_tree)
-		return;
-	rb_erase(&mz->tree_node, &mctz->rb_root);
-	mz->on_tree = false;
-}
-
-static void
-mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
-				struct mem_cgroup_per_zone *mz,
-				struct mem_cgroup_tree_per_zone *mctz)
-{
-	spin_lock(&mctz->lock);
-	__mem_cgroup_remove_exceeded(mem, mz, mctz);
-	spin_unlock(&mctz->lock);
-}
-
-
-static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
-{
-	unsigned long long excess;
-	struct mem_cgroup_per_zone *mz;
-	struct mem_cgroup_tree_per_zone *mctz;
-	int nid = page_to_nid(page);
-	int zid = page_zonenum(page);
-	mctz = soft_limit_tree_from_page(page);
-
-	/*
-	 * Necessary to update all ancestors when hierarchy is used.
-	 * because their event counter is not touched.
-	 */
-	for (; mem; mem = parent_mem_cgroup(mem)) {
-		mz = mem_cgroup_zoneinfo(mem, nid, zid);
-		excess = res_counter_soft_limit_excess(&mem->res);
-		/*
-		 * We have to update the tree if mz is on RB-tree or
-		 * mem is over its softlimit.
-		 */
-		if (excess || mz->on_tree) {
-			spin_lock(&mctz->lock);
-			/* if on-tree, remove it */
-			if (mz->on_tree)
-				__mem_cgroup_remove_exceeded(mem, mz, mctz);
-			/*
-			 * Insert again. mz->usage_in_excess will be updated.
-			 * If excess is 0, no tree ops.
-			 */
-			__mem_cgroup_insert_exceeded(mem, mz, mctz, excess);
-			spin_unlock(&mctz->lock);
-		}
-	}
-}
-
-static void mem_cgroup_remove_from_trees(struct mem_cgroup *mem)
-{
-	int node, zone;
-	struct mem_cgroup_per_zone *mz;
-	struct mem_cgroup_tree_per_zone *mctz;
-
-	for_each_node_state(node, N_POSSIBLE) {
-		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
-			mz = mem_cgroup_zoneinfo(mem, node, zone);
-			mctz = soft_limit_tree_node_zone(node, zone);
-			mem_cgroup_remove_exceeded(mem, mz, mctz);
-		}
-	}
-}
-
-static struct mem_cgroup_per_zone *
-__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
-{
-	struct rb_node *rightmost = NULL;
-	struct mem_cgroup_per_zone *mz;
-
-retry:
-	mz = NULL;
-	rightmost = rb_last(&mctz->rb_root);
-	if (!rightmost)
-		goto done;		/* Nothing to reclaim from */
-
-	mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node);
-	/*
-	 * Remove the node now but someone else can add it back,
-	 * we will to add it back at the end of reclaim to its correct
-	 * position in the tree.
-	 */
-	__mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
-	if (!res_counter_soft_limit_excess(&mz->mem->res) ||
-		!css_tryget(&mz->mem->css))
-		goto retry;
-done:
-	return mz;
-}
-
-static struct mem_cgroup_per_zone *
-mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
-{
-	struct mem_cgroup_per_zone *mz;
-
-	spin_lock(&mctz->lock);
-	mz = __mem_cgroup_largest_soft_limit_node(mctz);
-	spin_unlock(&mctz->lock);
-	return mz;
-}
-
 /*
  * Implementation Note: reading percpu statistics for memcg.
  *
@@ -570,15 +380,6 @@ static long mem_cgroup_read_stat(struct mem_cgroup *mem,
 	return val;
 }
 
-static long mem_cgroup_local_usage(struct mem_cgroup *mem)
-{
-	long ret;
-
-	ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
-	ret += mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
-	return ret;
-}
-
 static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
 					 bool charge)
 {
@@ -699,7 +500,6 @@ static void memcg_check_events(struct mem_cgroup *mem, struct page *page)
 		__mem_cgroup_target_update(mem, MEM_CGROUP_TARGET_THRESH);
 		if (unlikely(__memcg_event_check(mem,
 			MEM_CGROUP_TARGET_SOFTLIMIT))){
-			mem_cgroup_update_tree(mem, page);
 			__mem_cgroup_target_update(mem,
 				MEM_CGROUP_TARGET_SOFTLIMIT);
 		}
@@ -1380,6 +1180,29 @@ void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
 	*iter = mem;
 }
 
+bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *root,
+				    struct mem_cgroup *mem)
+{
+	/* root_mem_cgroup never exceeds its soft limit */
+	if (!mem)
+		return false;
+	if (!root)
+		root = root_mem_cgroup;
+	/*
+	 * See whether the memcg in question exceeds its soft limit
+	 * directly, or contributes to the soft limit excess in the
+	 * hierarchy below the given root.
+	 */
+	while (mem != root) {
+		if (res_counter_soft_limit_excess(&mem->res))
+			return true;
+		if (!mem->use_hierarchy)
+			break;
+		mem = mem_cgroup_from_cont(mem->css.cgroup->parent);
+	}
+	return false;
+}
+
 static unsigned long mem_cgroup_target_reclaim(struct mem_cgroup *mem,
 					       gfp_t gfp_mask,
 					       bool noswap,
@@ -1411,114 +1234,6 @@ static unsigned long mem_cgroup_target_reclaim(struct mem_cgroup *mem,
 }
 
 /*
- * Visit the first child (need not be the first child as per the ordering
- * of the cgroup list, since we track last_scanned_child) of @mem and use
- * that to reclaim free pages from.
- */
-static struct mem_cgroup *
-mem_cgroup_select_victim(struct mem_cgroup *root_mem)
-{
-	struct mem_cgroup *ret = NULL;
-	struct cgroup_subsys_state *css;
-	int nextid, found;
-
-	if (!root_mem->use_hierarchy) {
-		css_get(&root_mem->css);
-		ret = root_mem;
-	}
-
-	while (!ret) {
-		rcu_read_lock();
-		nextid = root_mem->last_scanned_child + 1;
-		css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css,
-				   &found);
-		if (css && css_tryget(css))
-			ret = container_of(css, struct mem_cgroup, css);
-
-		rcu_read_unlock();
-		/* Updates scanning parameter */
-		if (!css) {
-			/* this means start scan from ID:1 */
-			root_mem->last_scanned_child = 0;
-		} else
-			root_mem->last_scanned_child = found;
-	}
-
-	return ret;
-}
-
-/*
- * Scan the hierarchy if needed to reclaim memory. We remember the last child
- * we reclaimed from, so that we don't end up penalizing one child extensively
- * based on its position in the children list.
- *
- * root_mem is the original ancestor that we've been reclaim from.
- *
- * We give up and return to the caller when we visit root_mem twice.
- * (other groups can be removed while we're walking....)
- */
-static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_mem,
-				   struct zone *zone,
-				   gfp_t gfp_mask)
-{
-	struct mem_cgroup *victim;
-	int ret, total = 0;
-	int loop = 0;
-	unsigned long excess;
-	bool noswap = false;
-
-	excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
-
-	/* If memsw_is_minimum==1, swap-out is of-no-use. */
-	if (root_mem->memsw_is_minimum)
-		noswap = true;
-
-	while (1) {
-		victim = mem_cgroup_select_victim(root_mem);
-		if (victim == root_mem) {
-			loop++;
-			if (loop >= 1)
-				drain_all_stock_async();
-			if (loop >= 2) {
-				/*
-				 * If we have not been able to reclaim
-				 * anything, it might because there are
-				 * no reclaimable pages under this hierarchy
-				 */
-				if (!total) {
-					css_put(&victim->css);
-					break;
-				}
-				/*
-				 * We want to do more targeted reclaim.
-				 * excess >> 2 is not to excessive so as to
-				 * reclaim too much, nor too less that we keep
-				 * coming back to reclaim from this cgroup
-				 */
-				if (total >= (excess >> 2) ||
-					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) {
-					css_put(&victim->css);
-					break;
-				}
-			}
-		}
-		if (!mem_cgroup_local_usage(victim)) {
-			/* this cgroup's local usage == 0 */
-			css_put(&victim->css);
-			continue;
-		}
-		/* we use swappiness of local cgroup */
-		ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
-				noswap, get_swappiness(victim), zone);
-		css_put(&victim->css);
-		total += ret;
-		if (!res_counter_soft_limit_excess(&root_mem->res))
-			return total;
-	}
-	return total;
-}
-
-/*
  * Check OOM-Killer is already running under our hierarchy.
  * If someone is running, return false.
  */
@@ -3291,94 +3006,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 	return ret;
 }
 
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
-					    gfp_t gfp_mask)
-{
-	unsigned long nr_reclaimed = 0;
-	struct mem_cgroup_per_zone *mz, *next_mz = NULL;
-	unsigned long reclaimed;
-	int loop = 0;
-	struct mem_cgroup_tree_per_zone *mctz;
-	unsigned long long excess;
-
-	if (order > 0)
-		return 0;
-
-	mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone));
-	/*
-	 * This loop can run a while, specially if mem_cgroup's continuously
-	 * keep exceeding their soft limit and putting the system under
-	 * pressure
-	 */
-	do {
-		if (next_mz)
-			mz = next_mz;
-		else
-			mz = mem_cgroup_largest_soft_limit_node(mctz);
-		if (!mz)
-			break;
-
-		reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone, gfp_mask);
-		nr_reclaimed += reclaimed;
-		spin_lock(&mctz->lock);
-
-		/*
-		 * If we failed to reclaim anything from this memory cgroup
-		 * it is time to move on to the next cgroup
-		 */
-		next_mz = NULL;
-		if (!reclaimed) {
-			do {
-				/*
-				 * Loop until we find yet another one.
-				 *
-				 * By the time we get the soft_limit lock
-				 * again, someone might have aded the
-				 * group back on the RB tree. Iterate to
-				 * make sure we get a different mem.
-				 * mem_cgroup_largest_soft_limit_node returns
-				 * NULL if no other cgroup is present on
-				 * the tree
-				 */
-				next_mz =
-				__mem_cgroup_largest_soft_limit_node(mctz);
-				if (next_mz == mz) {
-					css_put(&next_mz->mem->css);
-					next_mz = NULL;
-				} else /* next_mz == NULL or other memcg */
-					break;
-			} while (1);
-		}
-		__mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
-		excess = res_counter_soft_limit_excess(&mz->mem->res);
-		/*
-		 * One school of thought says that we should not add
-		 * back the node to the tree if reclaim returns 0.
-		 * But our reclaim could return 0, simply because due
-		 * to priority we are exposing a smaller subset of
-		 * memory to reclaim from. Consider this as a longer
-		 * term TODO.
-		 */
-		/* If excess == 0, no tree ops */
-		__mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess);
-		spin_unlock(&mctz->lock);
-		css_put(&mz->mem->css);
-		loop++;
-		/*
-		 * Could not reclaim anything and there are no more
-		 * mem cgroups to try or we seem to be looping without
-		 * reclaiming anything.
-		 */
-		if (!nr_reclaimed &&
-			(next_mz == NULL ||
-			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
-			break;
-	} while (!nr_reclaimed);
-	if (next_mz)
-		css_put(&next_mz->mem->css);
-	return nr_reclaimed;
-}
-
 /*
  * This routine traverse page_cgroup in given list and drop them all.
  * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
@@ -4449,9 +4076,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 		mz = &pn->zoneinfo[zone];
 		for_each_lru(l)
 			INIT_LIST_HEAD(&mz->lruvec.lists[l]);
-		mz->usage_in_excess = 0;
-		mz->on_tree = false;
-		mz->mem = mem;
 	}
 	return 0;
 }
@@ -4504,7 +4128,6 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
 {
 	int node;
 
-	mem_cgroup_remove_from_trees(mem);
 	free_css_id(&mem_cgroup_subsys, &mem->css);
 
 	for_each_node_state(node, N_POSSIBLE)
@@ -4559,31 +4182,6 @@ static void __init enable_swap_cgroup(void)
 }
 #endif
 
-static int mem_cgroup_soft_limit_tree_init(void)
-{
-	struct mem_cgroup_tree_per_node *rtpn;
-	struct mem_cgroup_tree_per_zone *rtpz;
-	int tmp, node, zone;
-
-	for_each_node_state(node, N_POSSIBLE) {
-		tmp = node;
-		if (!node_state(node, N_NORMAL_MEMORY))
-			tmp = -1;
-		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp);
-		if (!rtpn)
-			return 1;
-
-		soft_limit_tree.rb_tree_per_node[node] = rtpn;
-
-		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
-			rtpz = &rtpn->rb_tree_per_zone[zone];
-			rtpz->rb_root = RB_ROOT;
-			spin_lock_init(&rtpz->lock);
-		}
-	}
-	return 0;
-}
-
 static struct cgroup_subsys_state * __ref
 mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 {
@@ -4605,8 +4203,6 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 		enable_swap_cgroup();
 		parent = NULL;
 		root_mem_cgroup = mem;
-		if (mem_cgroup_soft_limit_tree_init())
-			goto free_out;
 		for_each_possible_cpu(cpu) {
 			struct memcg_stock_pcp *stock =
 						&per_cpu(memcg_stock, cpu);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0381a5d..2b701e0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1937,10 +1937,13 @@ static void shrink_zone(int priority, struct zone *zone,
 	do {
 		unsigned long reclaimed = sc->nr_reclaimed;
 		unsigned long scanned = sc->nr_scanned;
+		int epriority = priority;
 
 		mem_cgroup_hierarchy_walk(root, &mem);
 		sc->current_memcg = mem;
-		do_shrink_zone(priority, zone, sc);
+		if (mem_cgroup_soft_limit_exceeded(root, mem))
+			epriority -= 1;
+		do_shrink_zone(epriority, zone, sc);
 		mem_cgroup_count_reclaim(mem, current_is_kswapd(),
 					 mem != root, /* limit or hierarchy? */
 					 sc->nr_scanned - scanned,
@@ -2153,42 +2156,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 }
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
-
-unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
-						gfp_t gfp_mask, bool noswap,
-						unsigned int swappiness,
-						struct zone *zone)
-{
-	struct scan_control sc = {
-		.nr_to_reclaim = SWAP_CLUSTER_MAX,
-		.may_writepage = !laptop_mode,
-		.may_unmap = 1,
-		.may_swap = !noswap,
-		.swappiness = swappiness,
-		.order = 0,
-		.memcg = mem,
-	};
-	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
-			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
-
-	trace_mm_vmscan_memcg_softlimit_reclaim_begin(0,
-						      sc.may_writepage,
-						      sc.gfp_mask);
-
-	/*
-	 * NOTE: Although we can get the priority field, using it
-	 * here is not a good idea, since it limits the pages we can scan.
-	 * if we don't reclaim here, the shrink_zone from balance_pgdat
-	 * will pick up pages from other mem cgroup's as well. We hack
-	 * the priority and make it zero.
-	 */
-	do_shrink_zone(0, zone, &sc);
-
-	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
-
-	return sc.nr_reclaimed;
-}
-
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 					   gfp_t gfp_mask,
 					   bool noswap,
@@ -2418,13 +2385,6 @@ loop_again:
 				continue;
 
 			sc.nr_scanned = 0;
-
-			/*
-			 * Call soft limit reclaim before calling shrink_zone.
-			 * For now we ignore the return value
-			 */
-			mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask);
-
 			/*
 			 * We put equal pressure on every zone, unless
 			 * one zone has way too many pages free
-- 
1.7.5.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [rfc patch 1/6] memcg: remove unused retry signal from reclaim
  2011-05-12 14:53   ` Johannes Weiner
@ 2011-05-12 15:02     ` Rik van Riel
  -1 siblings, 0 replies; 83+ messages in thread
From: Rik van Riel @ 2011-05-12 15:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Andrew Morton, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On 05/12/2011 10:53 AM, Johannes Weiner wrote:
> If the memcg reclaim code detects the target memcg below its limit it
> exits and returns a guaranteed non-zero value so that the charge is
> retried.
>
> Nowadays, the charge side checks the memcg limit itself and does not
> rely on this non-zero return value trick.
>
> This patch removes it.  The reclaim code will now always return the
> true number of pages it reclaimed on its own.
>
> Signed-off-by: Johannes Weiner<hannes@cmpxchg.org>

Acked-by: Rik van Riel<riel@redhat.com>


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 2/6] vmscan: make distinction between memcg reclaim and LRU list selection
  2011-05-12 14:53   ` Johannes Weiner
@ 2011-05-12 15:33     ` Rik van Riel
  -1 siblings, 0 replies; 83+ messages in thread
From: Rik van Riel @ 2011-05-12 15:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Andrew Morton, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On 05/12/2011 10:53 AM, Johannes Weiner wrote:
> The reclaim code has a single predicate for whether it currently
> reclaims on behalf of a memory cgroup, as well as whether it is
> reclaiming from the global LRU list or a memory cgroup LRU list.
>
> Up to now, both cases always coincide, but subsequent patches will
> change things such that global reclaim will scan memory cgroup lists.
>
> This patch adds a new predicate that tells global reclaim from memory
> cgroup reclaim, and then changes all callsites that are actually about
> global reclaim heuristics rather than strict LRU list selection.
>
> Signed-off-by: Johannes Weiner<hannes@cmpxchg.org>
> ---
>   mm/vmscan.c |   96 ++++++++++++++++++++++++++++++++++------------------------
>   1 files changed, 56 insertions(+), 40 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f6b435c..ceeb2a5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -104,8 +104,12 @@ struct scan_control {
>   	 */
>   	reclaim_mode_t reclaim_mode;
>
> -	/* Which cgroup do we reclaim from */
> -	struct mem_cgroup *mem_cgroup;
> +	/*
> +	 * The memory cgroup we reclaim on behalf of, and the one we
> +	 * are currently reclaiming from.
> +	 */
> +	struct mem_cgroup *memcg;
> +	struct mem_cgroup *current_memcg;

I can't say I'm fond of these names.  I had to read the
rest of the patch to figure out that the old mem_cgroup
got renamed to current_memcg.

Would it be better to call them my_memcg and reclaim_memcg?

Maybe somebody else has better suggestions...

Other than the naming, no objection.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 2/6] vmscan: make distinction between memcg reclaim and LRU list selection
  2011-05-12 15:33     ` Rik van Riel
@ 2011-05-12 16:03       ` Johannes Weiner
  -1 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2011-05-12 16:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Andrew Morton, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Thu, May 12, 2011 at 11:33:13AM -0400, Rik van Riel wrote:
> On 05/12/2011 10:53 AM, Johannes Weiner wrote:
> >The reclaim code has a single predicate for whether it currently
> >reclaims on behalf of a memory cgroup, as well as whether it is
> >reclaiming from the global LRU list or a memory cgroup LRU list.
> >
> >Up to now, both cases always coincide, but subsequent patches will
> >change things such that global reclaim will scan memory cgroup lists.
> >
> >This patch adds a new predicate that tells global reclaim from memory
> >cgroup reclaim, and then changes all callsites that are actually about
> >global reclaim heuristics rather than strict LRU list selection.
> >
> >Signed-off-by: Johannes Weiner<hannes@cmpxchg.org>
> >---
> >  mm/vmscan.c |   96 ++++++++++++++++++++++++++++++++++------------------------
> >  1 files changed, 56 insertions(+), 40 deletions(-)
> >
> >diff --git a/mm/vmscan.c b/mm/vmscan.c
> >index f6b435c..ceeb2a5 100644
> >--- a/mm/vmscan.c
> >+++ b/mm/vmscan.c
> >@@ -104,8 +104,12 @@ struct scan_control {
> >  	 */
> >  	reclaim_mode_t reclaim_mode;
> >
> >-	/* Which cgroup do we reclaim from */
> >-	struct mem_cgroup *mem_cgroup;
> >+	/*
> >+	 * The memory cgroup we reclaim on behalf of, and the one we
> >+	 * are currently reclaiming from.
> >+	 */
> >+	struct mem_cgroup *memcg;
> >+	struct mem_cgroup *current_memcg;
> 
> I can't say I'm fond of these names.  I had to read the
> rest of the patch to figure out that the old mem_cgroup
> got renamed to current_memcg.

To clarify: sc->memcg will be the memcg that hit the hard limit and is
the main target of this reclaim invocation.  current_memcg is the
iterator over the hierarchy below the target.
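
Spelled out with comments, this is just a restatement of the two
fields from the quoted hunk (not additional code in the patch):

struct mem_cgroup;

struct scan_control {
	/* the memcg whose hard limit triggered this reclaim: the target */
	struct mem_cgroup *memcg;
	/* the memcg currently being scanned while iterating the
	 * hierarchy below sc->memcg */
	struct mem_cgroup *current_memcg;
	/* ... remaining fields unchanged ... */
};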

I realize this change in particular was placed a bit unfortunately in
the series in terms of readability; I just wanted to keep the
mem_cgroup to current_memcg renaming out of the next patch.  There is
probably a better way; I'll fix it up and improve the comment.

> Would it be better to call them my_memcg and reclaim_memcg?
> 
> Maybe somebody else has better suggestions...

Yes, suggestions welcome.  I'm not too fond of the naming, either.

> Other than the naming, no objection.

Thanks, Rik.

	Hannes

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 3/6] mm: memcg-aware global reclaim
  2011-05-12 14:53   ` Johannes Weiner
@ 2011-05-12 16:04     ` Rik van Riel
  -1 siblings, 0 replies; 83+ messages in thread
From: Rik van Riel @ 2011-05-12 16:04 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Andrew Morton, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On 05/12/2011 10:53 AM, Johannes Weiner wrote:

> I am open to solutions that trade fairness against CPU-time but don't
> want to have an extreme in either direction.  Maybe break out early if
> a number of memcgs has been successfully reclaimed from and remember
> the last one scanned.

The way we used to deal with this when we did per-process
virtual scanning (before rmap), was to scan the process at
the head of the list.

After we were done with that process, it got moved to the
back of the list.  If enough had been scanned, we bailed
out of the scanning code altogether; if more needed to
be scanned, we moved on to the next process.

Doing a list move after scanning a bunch of pages in the
LRU lists of a cgroup isn't nearly as expensive as having
to scan all the cgroups.
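
A tiny userspace sketch of that rotate-and-bail pattern (purely
illustrative, not code from this series or from the old scanner):

#include <stdio.h>

#define NGROUPS 4

struct group {
	const char *name;
	unsigned long scanned;		/* pages scanned from this group */
};

int main(void)
{
	struct group groups[NGROUPS] = {
		{ "A", 0 }, { "B", 0 }, { "C", 0 }, { "D", 0 },
	};
	int order[NGROUPS] = { 0, 1, 2, 3 };	/* order[0] is the list head */
	unsigned long total = 0, target = 96, batch = 32;
	int i;

	while (total < target) {
		int head = order[0];

		/* scan a batch from the group at the head of the list */
		groups[head].scanned += batch;
		total += batch;

		/* move the group we just scanned to the back of the list */
		for (i = 0; i < NGROUPS - 1; i++)
			order[i] = order[i + 1];
		order[NGROUPS - 1] = head;
	}

	for (i = 0; i < NGROUPS; i++)
		printf("group %s: scanned %lu pages\n",
		       groups[i].name, groups[i].scanned);
	return 0;
}

With target = 96 and batch = 32 the loop bails after scanning three of
the four groups; with a persistent list, the next pass would start
with the group that was skipped.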

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 1/6] memcg: remove unused retry signal from reclaim
  2011-05-12 15:02     ` Rik van Riel
  (?)
@ 2011-05-12 17:22     ` Ying Han
  -1 siblings, 0 replies; 83+ messages in thread
From: Ying Han @ 2011-05-12 17:22 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Johannes Weiner, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Michal Hocko, Andrew Morton, Minchan Kim,
	KOSAKI Motohiro, Mel Gorman, linux-mm, linux-kernel

On Thu, May 12, 2011 at 8:02 AM, Rik van Riel <riel@redhat.com> wrote:

> On 05/12/2011 10:53 AM, Johannes Weiner wrote:
>
>> If the memcg reclaim code detects the target memcg below its limit it
>> exits and returns a guaranteed non-zero value so that the charge is
>> retried.
>>
>> Nowadays, the charge side checks the memcg limit itself and does not
>> rely on this non-zero return value trick.
>>
>> This patch removes it.  The reclaim code will now always return the
>> true number of pages it reclaimed on its own.
>>
>> Signed-off-by: Johannes Weiner<hannes@cmpxchg.org>
>>
>
> Acked-by: Rik van Riel<riel@redhat.com>
>

Acked-by: Ying Han<yinghan@google.com>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 6/6] memcg: rework soft limit reclaim
  2011-05-12 14:53   ` Johannes Weiner
@ 2011-05-12 18:41     ` Ying Han
  -1 siblings, 0 replies; 83+ messages in thread
From: Ying Han @ 2011-05-12 18:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

Hi Johannes:

Thank you for the patchset; I will definitely take the time to read
it through later today.

Also, I have a patchset which implements the round-robin soft_limit
reclaim as we discussed at LSF.  Before I read through this set, I
don't know whether we are taking a similar approach or not.  My
implementation, as a first step, only replaces the RB-tree based
soft_limit reclaim with a linked-list round-robin.  Feel free to
comment on it.

--Ying

On Thu, May 12, 2011 at 7:53 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> The current soft limit reclaim algorithm is entered from kswapd.  It
> selects the memcg that exceeds its soft limit the most in absolute
> bytes and reclaims from it most aggressively (priority 0).
>
> This has several disadvantages:
>
>        1. because of the aggressiveness, kswapd can be stalled on a
>        memcg that is hard to reclaim for a long time before going for
>        other pages.
>
>        2. it only considers the biggest violator (in absolute bytes!)
>        and does not put extra pressure on other memcgs in excess.
>
>        3. it needs a ton of code to quickly find the target
>
> This patch removes all the explicit soft limit target selection and
> instead hooks into the hierarchical memcg walk that is done by direct
> reclaim and kswapd balancing.  If it encounters a memcg that exceeds
> its soft limit, or contributes to the soft limit excess in one of its
> hierarchy parents, it scans the memcg one priority level below the
> current reclaim priority.
>
>        1. the primary goal is to reclaim pages, not to punish soft
>        limit violators at any price
>
>        2. increased pressure is applied to all violators, not just
>        the biggest one
>
>        3. the soft limit is no longer only meaningful on global
>        memory pressure, but considered for any hierarchical reclaim.
>        This means that even for hard limit reclaim, the children in
>        excess of their soft limit experience more pressure compared
>        to their siblings
>
>        4. direct reclaim now also applies more pressure on memcgs in
>        soft limit excess, not only kswapd
>
>        5. the implementation is only a few lines of straight-forward
>        code
>
> RFC: since there is no longer a reliable way of counting the pages
> reclaimed solely because of an exceeding soft limit, this patch
> conflicts with Ying's exporting of exactly this number to userspace.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/memcontrol.h |   16 +-
>  include/linux/swap.h       |    4 -
>  mm/memcontrol.c            |  450 +++-----------------------------------------
>  mm/vmscan.c                |   48 +-----
>  4 files changed, 34 insertions(+), 484 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 65163c2..b0c7323 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -99,6 +99,7 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
>  * For memory reclaim.
>  */
>  void mem_cgroup_hierarchy_walk(struct mem_cgroup *, struct mem_cgroup **);
> +bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *, struct mem_cgroup *);
>  void mem_cgroup_count_reclaim(struct mem_cgroup *, bool, bool,
>                              unsigned long, unsigned long);
>  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
> @@ -140,8 +141,6 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>        mem_cgroup_update_page_stat(page, idx, -1);
>  }
>
> -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> -                                               gfp_t gfp_mask);
>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> @@ -294,6 +293,12 @@ static inline void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
>        *iter = start;
>  }
>
> +static inline bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *root,
> +                                                 struct mem_cgroup *mem)
> +{
> +       return 0;
> +}
> +
>  static inline void mem_cgroup_count_reclaim(struct mem_cgroup *mem,
>                                            bool kswapd, bool hierarchy,
>                                            unsigned long scanned,
> @@ -349,13 +354,6 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>  }
>
>  static inline
> -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> -                                           gfp_t gfp_mask)
> -{
> -       return 0;
> -}
> -
> -static inline
>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
>  {
>        return 0;
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a5c6da5..885cf19 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -254,10 +254,6 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
>                                                  gfp_t gfp_mask, bool noswap,
>                                                  unsigned int swappiness);
> -extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> -                                               gfp_t gfp_mask, bool noswap,
> -                                               unsigned int swappiness,
> -                                               struct zone *zone);
>  extern int __isolate_lru_page(struct page *page, int mode, int file);
>  extern unsigned long shrink_all_memory(unsigned long nr_pages);
>  extern int vm_swappiness;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f5d90ba..b0c6dd5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -34,7 +34,6 @@
>  #include <linux/rcupdate.h>
>  #include <linux/limits.h>
>  #include <linux/mutex.h>
> -#include <linux/rbtree.h>
>  #include <linux/slab.h>
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
> @@ -138,12 +137,6 @@ struct mem_cgroup_per_zone {
>        unsigned long           count[NR_LRU_LISTS];
>
>        struct zone_reclaim_stat reclaim_stat;
> -       struct rb_node          tree_node;      /* RB tree node */
> -       unsigned long long      usage_in_excess;/* Set to the value by which */
> -                                               /* the soft limit is exceeded*/
> -       bool                    on_tree;
> -       struct mem_cgroup       *mem;           /* Back pointer, we cannot */
> -                                               /* use container_of        */
>  };
>  /* Macro for accessing counter */
>  #define MEM_CGROUP_ZSTAT(mz, idx)      ((mz)->count[(idx)])
> @@ -156,26 +149,6 @@ struct mem_cgroup_lru_info {
>        struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
>  };
>
> -/*
> - * Cgroups above their limits are maintained in a RB-Tree, independent of
> - * their hierarchy representation
> - */
> -
> -struct mem_cgroup_tree_per_zone {
> -       struct rb_root rb_root;
> -       spinlock_t lock;
> -};
> -
> -struct mem_cgroup_tree_per_node {
> -       struct mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES];
> -};
> -
> -struct mem_cgroup_tree {
> -       struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
> -};
> -
> -static struct mem_cgroup_tree soft_limit_tree __read_mostly;
> -
>  struct mem_cgroup_threshold {
>        struct eventfd_ctx *eventfd;
>        u64 threshold;
> @@ -323,12 +296,7 @@ static bool move_file(void)
>                                        &mc.to->move_charge_at_immigrate);
>  }
>
> -/*
> - * Maximum loops in mem_cgroup_soft_reclaim(), used for soft
> - * limit reclaim to prevent infinite loops, if they ever occur.
> - */
>  #define        MEM_CGROUP_MAX_RECLAIM_LOOPS            (100)
> -#define        MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2)
>
>  enum charge_type {
>        MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
> @@ -375,164 +343,6 @@ page_cgroup_zoneinfo(struct mem_cgroup *mem, struct page *page)
>        return mem_cgroup_zoneinfo(mem, nid, zid);
>  }
>
> -static struct mem_cgroup_tree_per_zone *
> -soft_limit_tree_node_zone(int nid, int zid)
> -{
> -       return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
> -}
> -
> -static struct mem_cgroup_tree_per_zone *
> -soft_limit_tree_from_page(struct page *page)
> -{
> -       int nid = page_to_nid(page);
> -       int zid = page_zonenum(page);
> -
> -       return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
> -}
> -
> -static void
> -__mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
> -                               struct mem_cgroup_per_zone *mz,
> -                               struct mem_cgroup_tree_per_zone *mctz,
> -                               unsigned long long new_usage_in_excess)
> -{
> -       struct rb_node **p = &mctz->rb_root.rb_node;
> -       struct rb_node *parent = NULL;
> -       struct mem_cgroup_per_zone *mz_node;
> -
> -       if (mz->on_tree)
> -               return;
> -
> -       mz->usage_in_excess = new_usage_in_excess;
> -       if (!mz->usage_in_excess)
> -               return;
> -       while (*p) {
> -               parent = *p;
> -               mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
> -                                       tree_node);
> -               if (mz->usage_in_excess < mz_node->usage_in_excess)
> -                       p = &(*p)->rb_left;
> -               /*
> -                * We can't avoid mem cgroups that are over their soft
> -                * limit by the same amount
> -                */
> -               else if (mz->usage_in_excess >= mz_node->usage_in_excess)
> -                       p = &(*p)->rb_right;
> -       }
> -       rb_link_node(&mz->tree_node, parent, p);
> -       rb_insert_color(&mz->tree_node, &mctz->rb_root);
> -       mz->on_tree = true;
> -}
> -
> -static void
> -__mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
> -                               struct mem_cgroup_per_zone *mz,
> -                               struct mem_cgroup_tree_per_zone *mctz)
> -{
> -       if (!mz->on_tree)
> -               return;
> -       rb_erase(&mz->tree_node, &mctz->rb_root);
> -       mz->on_tree = false;
> -}
> -
> -static void
> -mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
> -                               struct mem_cgroup_per_zone *mz,
> -                               struct mem_cgroup_tree_per_zone *mctz)
> -{
> -       spin_lock(&mctz->lock);
> -       __mem_cgroup_remove_exceeded(mem, mz, mctz);
> -       spin_unlock(&mctz->lock);
> -}
> -
> -
> -static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
> -{
> -       unsigned long long excess;
> -       struct mem_cgroup_per_zone *mz;
> -       struct mem_cgroup_tree_per_zone *mctz;
> -       int nid = page_to_nid(page);
> -       int zid = page_zonenum(page);
> -       mctz = soft_limit_tree_from_page(page);
> -
> -       /*
> -        * Necessary to update all ancestors when hierarchy is used.
> -        * because their event counter is not touched.
> -        */
> -       for (; mem; mem = parent_mem_cgroup(mem)) {
> -               mz = mem_cgroup_zoneinfo(mem, nid, zid);
> -               excess = res_counter_soft_limit_excess(&mem->res);
> -               /*
> -                * We have to update the tree if mz is on RB-tree or
> -                * mem is over its softlimit.
> -                */
> -               if (excess || mz->on_tree) {
> -                       spin_lock(&mctz->lock);
> -                       /* if on-tree, remove it */
> -                       if (mz->on_tree)
> -                               __mem_cgroup_remove_exceeded(mem, mz, mctz);
> -                       /*
> -                        * Insert again. mz->usage_in_excess will be updated.
> -                        * If excess is 0, no tree ops.
> -                        */
> -                       __mem_cgroup_insert_exceeded(mem, mz, mctz, excess);
> -                       spin_unlock(&mctz->lock);
> -               }
> -       }
> -}
> -
> -static void mem_cgroup_remove_from_trees(struct mem_cgroup *mem)
> -{
> -       int node, zone;
> -       struct mem_cgroup_per_zone *mz;
> -       struct mem_cgroup_tree_per_zone *mctz;
> -
> -       for_each_node_state(node, N_POSSIBLE) {
> -               for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> -                       mz = mem_cgroup_zoneinfo(mem, node, zone);
> -                       mctz = soft_limit_tree_node_zone(node, zone);
> -                       mem_cgroup_remove_exceeded(mem, mz, mctz);
> -               }
> -       }
> -}
> -
> -static struct mem_cgroup_per_zone *
> -__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
> -{
> -       struct rb_node *rightmost = NULL;
> -       struct mem_cgroup_per_zone *mz;
> -
> -retry:
> -       mz = NULL;
> -       rightmost = rb_last(&mctz->rb_root);
> -       if (!rightmost)
> -               goto done;              /* Nothing to reclaim from */
> -
> -       mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node);
> -       /*
> -        * Remove the node now but someone else can add it back,
> -        * we will to add it back at the end of reclaim to its correct
> -        * position in the tree.
> -        */
> -       __mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
> -       if (!res_counter_soft_limit_excess(&mz->mem->res) ||
> -               !css_tryget(&mz->mem->css))
> -               goto retry;
> -done:
> -       return mz;
> -}
> -
> -static struct mem_cgroup_per_zone *
> -mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
> -{
> -       struct mem_cgroup_per_zone *mz;
> -
> -       spin_lock(&mctz->lock);
> -       mz = __mem_cgroup_largest_soft_limit_node(mctz);
> -       spin_unlock(&mctz->lock);
> -       return mz;
> -}
> -
>  /*
>  * Implementation Note: reading percpu statistics for memcg.
>  *
> @@ -570,15 +380,6 @@ static long mem_cgroup_read_stat(struct mem_cgroup *mem,
>        return val;
>  }
>
> -static long mem_cgroup_local_usage(struct mem_cgroup *mem)
> -{
> -       long ret;
> -
> -       ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
> -       ret += mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
> -       return ret;
> -}
> -
>  static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
>                                         bool charge)
>  {
> @@ -699,7 +500,6 @@ static void memcg_check_events(struct mem_cgroup *mem, struct page *page)
>                __mem_cgroup_target_update(mem, MEM_CGROUP_TARGET_THRESH);
>                if (unlikely(__memcg_event_check(mem,
>                        MEM_CGROUP_TARGET_SOFTLIMIT))){
> -                       mem_cgroup_update_tree(mem, page);
>                        __mem_cgroup_target_update(mem,
>                                MEM_CGROUP_TARGET_SOFTLIMIT);
>                }
> @@ -1380,6 +1180,29 @@ void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
>        *iter = mem;
>  }
>
> +bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *root,
> +                                   struct mem_cgroup *mem)
> +{
> +       /* root_mem_cgroup never exceeds its soft limit */
> +       if (!mem)
> +               return false;
> +       if (!root)
> +               root = root_mem_cgroup;
> +       /*
> +        * See whether the memcg in question exceeds its soft limit
> +        * directly, or contributes to the soft limit excess in the
> +        * hierarchy below the given root.
> +        */
> +       while (mem != root) {
> +               if (res_counter_soft_limit_excess(&mem->res))
> +                       return true;
> +               if (!mem->use_hierarchy)
> +                       break;
> +               mem = mem_cgroup_from_cont(mem->css.cgroup->parent);
> +       }
> +       return false;
> +}
> +
>  static unsigned long mem_cgroup_target_reclaim(struct mem_cgroup *mem,
>                                               gfp_t gfp_mask,
>                                               bool noswap,
> @@ -1411,114 +1234,6 @@ static unsigned long mem_cgroup_target_reclaim(struct mem_cgroup *mem,
>  }
>
>  /*
> - * Visit the first child (need not be the first child as per the ordering
> - * of the cgroup list, since we track last_scanned_child) of @mem and use
> - * that to reclaim free pages from.
> - */
> -static struct mem_cgroup *
> -mem_cgroup_select_victim(struct mem_cgroup *root_mem)
> -{
> -       struct mem_cgroup *ret = NULL;
> -       struct cgroup_subsys_state *css;
> -       int nextid, found;
> -
> -       if (!root_mem->use_hierarchy) {
> -               css_get(&root_mem->css);
> -               ret = root_mem;
> -       }
> -
> -       while (!ret) {
> -               rcu_read_lock();
> -               nextid = root_mem->last_scanned_child + 1;
> -               css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css,
> -                                  &found);
> -               if (css && css_tryget(css))
> -                       ret = container_of(css, struct mem_cgroup, css);
> -
> -               rcu_read_unlock();
> -               /* Updates scanning parameter */
> -               if (!css) {
> -                       /* this means start scan from ID:1 */
> -                       root_mem->last_scanned_child = 0;
> -               } else
> -                       root_mem->last_scanned_child = found;
> -       }
> -
> -       return ret;
> -}
> -
> -/*
> - * Scan the hierarchy if needed to reclaim memory. We remember the last child
> - * we reclaimed from, so that we don't end up penalizing one child extensively
> - * based on its position in the children list.
> - *
> - * root_mem is the original ancestor that we've been reclaim from.
> - *
> - * We give up and return to the caller when we visit root_mem twice.
> - * (other groups can be removed while we're walking....)
> - */
> -static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_mem,
> -                                  struct zone *zone,
> -                                  gfp_t gfp_mask)
> -{
> -       struct mem_cgroup *victim;
> -       int ret, total = 0;
> -       int loop = 0;
> -       unsigned long excess;
> -       bool noswap = false;
> -
> -       excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
> -
> -       /* If memsw_is_minimum==1, swap-out is of-no-use. */
> -       if (root_mem->memsw_is_minimum)
> -               noswap = true;
> -
> -       while (1) {
> -               victim = mem_cgroup_select_victim(root_mem);
> -               if (victim == root_mem) {
> -                       loop++;
> -                       if (loop >= 1)
> -                               drain_all_stock_async();
> -                       if (loop >= 2) {
> -                               /*
> -                                * If we have not been able to reclaim
> -                                * anything, it might because there are
> -                                * no reclaimable pages under this hierarchy
> -                                */
> -                               if (!total) {
> -                                       css_put(&victim->css);
> -                                       break;
> -                               }
> -                               /*
> -                                * We want to do more targeted reclaim.
> -                                * excess >> 2 is not to excessive so as to
> -                                * reclaim too much, nor too less that we keep
> -                                * coming back to reclaim from this cgroup
> -                                */
> -                               if (total >= (excess >> 2) ||
> -                                       (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) {
> -                                       css_put(&victim->css);
> -                                       break;
> -                               }
> -                       }
> -               }
> -               if (!mem_cgroup_local_usage(victim)) {
> -                       /* this cgroup's local usage == 0 */
> -                       css_put(&victim->css);
> -                       continue;
> -               }
> -               /* we use swappiness of local cgroup */
> -               ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> -                               noswap, get_swappiness(victim), zone);
> -               css_put(&victim->css);
> -               total += ret;
> -               if (!res_counter_soft_limit_excess(&root_mem->res))
> -                       return total;
> -       }
> -       return total;
> -}
> -
> -/*
>  * Check OOM-Killer is already running under our hierarchy.
>  * If someone is running, return false.
>  */
> @@ -3291,94 +3006,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
>        return ret;
>  }
>
> -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> -                                           gfp_t gfp_mask)
> -{
> -       unsigned long nr_reclaimed = 0;
> -       struct mem_cgroup_per_zone *mz, *next_mz = NULL;
> -       unsigned long reclaimed;
> -       int loop = 0;
> -       struct mem_cgroup_tree_per_zone *mctz;
> -       unsigned long long excess;
> -
> -       if (order > 0)
> -               return 0;
> -
> -       mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone));
> -       /*
> -        * This loop can run a while, specially if mem_cgroup's continuously
> -        * keep exceeding their soft limit and putting the system under
> -        * pressure
> -        */
> -       do {
> -               if (next_mz)
> -                       mz = next_mz;
> -               else
> -                       mz = mem_cgroup_largest_soft_limit_node(mctz);
> -               if (!mz)
> -                       break;
> -
> -               reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone, gfp_mask);
> -               nr_reclaimed += reclaimed;
> -               spin_lock(&mctz->lock);
> -
> -               /*
> -                * If we failed to reclaim anything from this memory cgroup
> -                * it is time to move on to the next cgroup
> -                */
> -               next_mz = NULL;
> -               if (!reclaimed) {
> -                       do {
> -                               /*
> -                                * Loop until we find yet another one.
> -                                *
> -                                * By the time we get the soft_limit lock
> -                                * again, someone might have aded the
> -                                * group back on the RB tree. Iterate to
> -                                * make sure we get a different mem.
> -                                * mem_cgroup_largest_soft_limit_node returns
> -                                * NULL if no other cgroup is present on
> -                                * the tree
> -                                */
> -                               next_mz =
> -                               __mem_cgroup_largest_soft_limit_node(mctz);
> -                               if (next_mz == mz) {
> -                                       css_put(&next_mz->mem->css);
> -                                       next_mz = NULL;
> -                               } else /* next_mz == NULL or other memcg */
> -                                       break;
> -                       } while (1);
> -               }
> -               __mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
> -               excess = res_counter_soft_limit_excess(&mz->mem->res);
> -               /*
> -                * One school of thought says that we should not add
> -                * back the node to the tree if reclaim returns 0.
> -                * But our reclaim could return 0, simply because due
> -                * to priority we are exposing a smaller subset of
> -                * memory to reclaim from. Consider this as a longer
> -                * term TODO.
> -                */
> -               /* If excess == 0, no tree ops */
> -               __mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess);
> -               spin_unlock(&mctz->lock);
> -               css_put(&mz->mem->css);
> -               loop++;
> -               /*
> -                * Could not reclaim anything and there are no more
> -                * mem cgroups to try or we seem to be looping without
> -                * reclaiming anything.
> -                */
> -               if (!nr_reclaimed &&
> -                       (next_mz == NULL ||
> -                       loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
> -                       break;
> -       } while (!nr_reclaimed);
> -       if (next_mz)
> -               css_put(&next_mz->mem->css);
> -       return nr_reclaimed;
> -}
> -
>  /*
>  * This routine traverse page_cgroup in given list and drop them all.
>  * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
> @@ -4449,9 +4076,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
>                mz = &pn->zoneinfo[zone];
>                for_each_lru(l)
>                        INIT_LIST_HEAD(&mz->lruvec.lists[l]);
> -               mz->usage_in_excess = 0;
> -               mz->on_tree = false;
> -               mz->mem = mem;
>        }
>        return 0;
>  }
> @@ -4504,7 +4128,6 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
>  {
>        int node;
>
> -       mem_cgroup_remove_from_trees(mem);
>        free_css_id(&mem_cgroup_subsys, &mem->css);
>
>        for_each_node_state(node, N_POSSIBLE)
> @@ -4559,31 +4182,6 @@ static void __init enable_swap_cgroup(void)
>  }
>  #endif
>
> -static int mem_cgroup_soft_limit_tree_init(void)
> -{
> -       struct mem_cgroup_tree_per_node *rtpn;
> -       struct mem_cgroup_tree_per_zone *rtpz;
> -       int tmp, node, zone;
> -
> -       for_each_node_state(node, N_POSSIBLE) {
> -               tmp = node;
> -               if (!node_state(node, N_NORMAL_MEMORY))
> -                       tmp = -1;
> -               rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp);
> -               if (!rtpn)
> -                       return 1;
> -
> -               soft_limit_tree.rb_tree_per_node[node] = rtpn;
> -
> -               for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> -                       rtpz = &rtpn->rb_tree_per_zone[zone];
> -                       rtpz->rb_root = RB_ROOT;
> -                       spin_lock_init(&rtpz->lock);
> -               }
> -       }
> -       return 0;
> -}
> -
>  static struct cgroup_subsys_state * __ref
>  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  {
> @@ -4605,8 +4203,6 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>                enable_swap_cgroup();
>                parent = NULL;
>                root_mem_cgroup = mem;
> -               if (mem_cgroup_soft_limit_tree_init())
> -                       goto free_out;
>                for_each_possible_cpu(cpu) {
>                        struct memcg_stock_pcp *stock =
>                                                &per_cpu(memcg_stock, cpu);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0381a5d..2b701e0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1937,10 +1937,13 @@ static void shrink_zone(int priority, struct zone *zone,
>        do {
>                unsigned long reclaimed = sc->nr_reclaimed;
>                unsigned long scanned = sc->nr_scanned;
> +               int epriority = priority;
>
>                mem_cgroup_hierarchy_walk(root, &mem);
>                sc->current_memcg = mem;
> -               do_shrink_zone(priority, zone, sc);
> +               if (mem_cgroup_soft_limit_exceeded(root, mem))
> +                       epriority -= 1;
> +               do_shrink_zone(epriority, zone, sc);
>                mem_cgroup_count_reclaim(mem, current_is_kswapd(),
>                                         mem != root, /* limit or hierarchy? */
>                                         sc->nr_scanned - scanned,
> @@ -2153,42 +2156,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  }
>
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> -
> -unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> -                                               gfp_t gfp_mask, bool noswap,
> -                                               unsigned int swappiness,
> -                                               struct zone *zone)
> -{
> -       struct scan_control sc = {
> -               .nr_to_reclaim = SWAP_CLUSTER_MAX,
> -               .may_writepage = !laptop_mode,
> -               .may_unmap = 1,
> -               .may_swap = !noswap,
> -               .swappiness = swappiness,
> -               .order = 0,
> -               .memcg = mem,
> -       };
> -       sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> -                       (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
> -
> -       trace_mm_vmscan_memcg_softlimit_reclaim_begin(0,
> -                                                     sc.may_writepage,
> -                                                     sc.gfp_mask);
> -
> -       /*
> -        * NOTE: Although we can get the priority field, using it
> -        * here is not a good idea, since it limits the pages we can scan.
> -        * if we don't reclaim here, the shrink_zone from balance_pgdat
> -        * will pick up pages from other mem cgroup's as well. We hack
> -        * the priority and make it zero.
> -        */
> -       do_shrink_zone(0, zone, &sc);
> -
> -       trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
> -
> -       return sc.nr_reclaimed;
> -}
> -
>  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>                                           gfp_t gfp_mask,
>                                           bool noswap,
> @@ -2418,13 +2385,6 @@ loop_again:
>                                continue;
>
>                        sc.nr_scanned = 0;
> -
> -                       /*
> -                        * Call soft limit reclaim before calling shrink_zone.
> -                        * For now we ignore the return value
> -                        */
> -                       mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask);
> -
>                        /*
>                         * We put equal pressure on every zone, unless
>                         * one zone has way too many pages free
> --
> 1.7.5.1
>
>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 6/6] memcg: rework soft limit reclaim
@ 2011-05-12 18:41     ` Ying Han
  0 siblings, 0 replies; 83+ messages in thread
From: Ying Han @ 2011-05-12 18:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

Hi Johannes:

Thank you for the patchset; I will definitely spend time reading it
through later today.

Also, I have a patchset which implements the round-robin soft_limit
reclaim as we discussed at LSF. Before I read through this set, I
don't know whether we are taking a similar approach or not. My
implementation is only a first step: it replaces the RB-tree based
soft_limit reclaim with a linked-list round-robin. Feel free to throw
comments at it.
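
A rough sketch of what such a linked-list round-robin selection could
look like (this is not the actual patch; the sketch_* names and the
soft_limit_rr member are made up for illustration, and css refcounting
is left out):

struct sketch_soft_limit_queue {
        struct list_head head;          /* memcgs currently above their soft limit */
        spinlock_t lock;
};

static struct mem_cgroup *
sketch_next_soft_limit_victim(struct sketch_soft_limit_queue *q)
{
        struct mem_cgroup *mem = NULL;

        spin_lock(&q->lock);
        if (!list_empty(&q->head)) {
                /* take the head and rotate it to the tail so that
                 * repeated calls cycle through all groups fairly */
                mem = list_first_entry(&q->head, struct mem_cgroup,
                                       soft_limit_rr); /* hypothetical member */
                list_move_tail(&mem->soft_limit_rr, &q->head);
        }
        spin_unlock(&q->lock);
        return mem;
}

The point of the rotation is simply that the next caller starts from a
different group, so no single over-limit memcg gets hammered repeatedly.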

--Ying

On Thu, May 12, 2011 at 7:53 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> The current soft limit reclaim algorithm is entered from kswapd.  It
> selects the memcg that exceeds its soft limit the most in absolute
> bytes and reclaims from it most aggressively (priority 0).
>
> This has several disadvantages:
>
>        1. because of the aggressiveness, kswapd can be stalled on a
>        memcg that is hard to reclaim for a long time before going for
>        other pages.
>
>        2. it only considers the biggest violator (in absolute bytes!)
>        and does not put extra pressure on other memcgs in excess.
>
>        3. it needs a ton of code to quickly find the target
>
> This patch removes all the explicit soft limit target selection and
> instead hooks into the hierarchical memcg walk that is done by direct
> reclaim and kswapd balancing.  If it encounters a memcg that exceeds
> its soft limit, or contributes to the soft limit excess in one of its
> hierarchy parents, it scans the memcg one priority level below the
> current reclaim priority.
>
>        1. the primary goal is to reclaim pages, not to punish soft
>        limit violators at any price
>
>        2. increased pressure is applied to all violators, not just
>        the biggest one
>
>        3. the soft limit is no longer only meaningful on global
>        memory pressure, but considered for any hierarchical reclaim.
>        This means that even for hard limit reclaim, the children in
>        excess of their soft limit experience more pressure compared
>        to their siblings
>
>        4. direct reclaim now also applies more pressure on memcgs in
>        soft limit excess, not only kswapd
>
>        5. the implementation is only a few lines of straight-forward
>        code
>
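
To make the description concrete, the new scheme boils down to roughly
the following, condensed from the shrink_zone() hunk quoted below (the
soft-limit check itself lives in the memcontrol.c hunk; the loop's
termination is not visible in the quoted context and is abbreviated
here):

        /* inside shrink_zone(): walk the memcg hierarchy below the
         * reclaim root and scan each group's per-zone LRU lists */
        do {
                int epriority = priority;

                mem_cgroup_hierarchy_walk(root, &mem);
                sc->current_memcg = mem;
                /*
                 * Groups above their soft limit, or contributing to an
                 * ancestor's excess below the reclaim root, are scanned
                 * one priority level harder than their siblings.
                 */
                if (mem_cgroup_soft_limit_exceeded(root, mem))
                        epriority -= 1;
                do_shrink_zone(epriority, zone, sc);
        } while (mem != root);  /* abbreviated */

Everything else in the old soft limit path (victim selection, the
per-zone RB-trees, the separate scan entry point) becomes unnecessary,
which is where most of the removals below come from.
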
> RFC: since there is no longer a reliable way of counting the pages
> reclaimed solely because of an exceeding soft limit, this patch
> conflicts with Ying's exporting of exactly this number to userspace.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/memcontrol.h |   16 +-
>  include/linux/swap.h       |    4 -
>  mm/memcontrol.c            |  450 +++-----------------------------------------
>  mm/vmscan.c                |   48 +-----
>  4 files changed, 34 insertions(+), 484 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 65163c2..b0c7323 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -99,6 +99,7 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
>  * For memory reclaim.
>  */
>  void mem_cgroup_hierarchy_walk(struct mem_cgroup *, struct mem_cgroup **);
> +bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *, struct mem_cgroup *);
>  void mem_cgroup_count_reclaim(struct mem_cgroup *, bool, bool,
>                              unsigned long, unsigned long);
>  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
> @@ -140,8 +141,6 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>        mem_cgroup_update_page_stat(page, idx, -1);
>  }
>
> -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> -                                               gfp_t gfp_mask);
>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> @@ -294,6 +293,12 @@ static inline void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
>        *iter = start;
>  }
>
> +static inline bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *root,
> +                                                 struct mem_cgroup *mem)
> +{
> +       return 0;
> +}
> +
>  static inline void mem_cgroup_count_reclaim(struct mem_cgroup *mem,
>                                            bool kswapd, bool hierarchy,
>                                            unsigned long scanned,
> @@ -349,13 +354,6 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>  }
>
>  static inline
> -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> -                                           gfp_t gfp_mask)
> -{
> -       return 0;
> -}
> -
> -static inline
>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
>  {
>        return 0;
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a5c6da5..885cf19 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -254,10 +254,6 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
>                                                  gfp_t gfp_mask, bool noswap,
>                                                  unsigned int swappiness);
> -extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> -                                               gfp_t gfp_mask, bool noswap,
> -                                               unsigned int swappiness,
> -                                               struct zone *zone);
>  extern int __isolate_lru_page(struct page *page, int mode, int file);
>  extern unsigned long shrink_all_memory(unsigned long nr_pages);
>  extern int vm_swappiness;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f5d90ba..b0c6dd5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -34,7 +34,6 @@
>  #include <linux/rcupdate.h>
>  #include <linux/limits.h>
>  #include <linux/mutex.h>
> -#include <linux/rbtree.h>
>  #include <linux/slab.h>
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
> @@ -138,12 +137,6 @@ struct mem_cgroup_per_zone {
>        unsigned long           count[NR_LRU_LISTS];
>
>        struct zone_reclaim_stat reclaim_stat;
> -       struct rb_node          tree_node;      /* RB tree node */
> -       unsigned long long      usage_in_excess;/* Set to the value by which */
> -                                               /* the soft limit is exceeded*/
> -       bool                    on_tree;
> -       struct mem_cgroup       *mem;           /* Back pointer, we cannot */
> -                                               /* use container_of        */
>  };
>  /* Macro for accessing counter */
>  #define MEM_CGROUP_ZSTAT(mz, idx)      ((mz)->count[(idx)])
> @@ -156,26 +149,6 @@ struct mem_cgroup_lru_info {
>        struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
>  };
>
> -/*
> - * Cgroups above their limits are maintained in a RB-Tree, independent of
> - * their hierarchy representation
> - */
> -
> -struct mem_cgroup_tree_per_zone {
> -       struct rb_root rb_root;
> -       spinlock_t lock;
> -};
> -
> -struct mem_cgroup_tree_per_node {
> -       struct mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES];
> -};
> -
> -struct mem_cgroup_tree {
> -       struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
> -};
> -
> -static struct mem_cgroup_tree soft_limit_tree __read_mostly;
> -
>  struct mem_cgroup_threshold {
>        struct eventfd_ctx *eventfd;
>        u64 threshold;
> @@ -323,12 +296,7 @@ static bool move_file(void)
>                                        &mc.to->move_charge_at_immigrate);
>  }
>
> -/*
> - * Maximum loops in mem_cgroup_soft_reclaim(), used for soft
> - * limit reclaim to prevent infinite loops, if they ever occur.
> - */
>  #define        MEM_CGROUP_MAX_RECLAIM_LOOPS            (100)
> -#define        MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2)
>
>  enum charge_type {
>        MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
> @@ -375,164 +343,6 @@ page_cgroup_zoneinfo(struct mem_cgroup *mem, struct page *page)
>        return mem_cgroup_zoneinfo(mem, nid, zid);
>  }
>
> -static struct mem_cgroup_tree_per_zone *
> -soft_limit_tree_node_zone(int nid, int zid)
> -{
> -       return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
> -}
> -
> -static struct mem_cgroup_tree_per_zone *
> -soft_limit_tree_from_page(struct page *page)
> -{
> -       int nid = page_to_nid(page);
> -       int zid = page_zonenum(page);
> -
> -       return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
> -}
> -
> -static void
> -__mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
> -                               struct mem_cgroup_per_zone *mz,
> -                               struct mem_cgroup_tree_per_zone *mctz,
> -                               unsigned long long new_usage_in_excess)
> -{
> -       struct rb_node **p = &mctz->rb_root.rb_node;
> -       struct rb_node *parent = NULL;
> -       struct mem_cgroup_per_zone *mz_node;
> -
> -       if (mz->on_tree)
> -               return;
> -
> -       mz->usage_in_excess = new_usage_in_excess;
> -       if (!mz->usage_in_excess)
> -               return;
> -       while (*p) {
> -               parent = *p;
> -               mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
> -                                       tree_node);
> -               if (mz->usage_in_excess < mz_node->usage_in_excess)
> -                       p = &(*p)->rb_left;
> -               /*
> -                * We can't avoid mem cgroups that are over their soft
> -                * limit by the same amount
> -                */
> -               else if (mz->usage_in_excess >= mz_node->usage_in_excess)
> -                       p = &(*p)->rb_right;
> -       }
> -       rb_link_node(&mz->tree_node, parent, p);
> -       rb_insert_color(&mz->tree_node, &mctz->rb_root);
> -       mz->on_tree = true;
> -}
> -
> -static void
> -__mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
> -                               struct mem_cgroup_per_zone *mz,
> -                               struct mem_cgroup_tree_per_zone *mctz)
> -{
> -       if (!mz->on_tree)
> -               return;
> -       rb_erase(&mz->tree_node, &mctz->rb_root);
> -       mz->on_tree = false;
> -}
> -
> -static void
> -mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
> -                               struct mem_cgroup_per_zone *mz,
> -                               struct mem_cgroup_tree_per_zone *mctz)
> -{
> -       spin_lock(&mctz->lock);
> -       __mem_cgroup_remove_exceeded(mem, mz, mctz);
> -       spin_unlock(&mctz->lock);
> -}
> -
> -
> -static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
> -{
> -       unsigned long long excess;
> -       struct mem_cgroup_per_zone *mz;
> -       struct mem_cgroup_tree_per_zone *mctz;
> -       int nid = page_to_nid(page);
> -       int zid = page_zonenum(page);
> -       mctz = soft_limit_tree_from_page(page);
> -
> -       /*
> -        * Necessary to update all ancestors when hierarchy is used.
> -        * because their event counter is not touched.
> -        */
> -       for (; mem; mem = parent_mem_cgroup(mem)) {
> -               mz = mem_cgroup_zoneinfo(mem, nid, zid);
> -               excess = res_counter_soft_limit_excess(&mem->res);
> -               /*
> -                * We have to update the tree if mz is on RB-tree or
> -                * mem is over its softlimit.
> -                */
> -               if (excess || mz->on_tree) {
> -                       spin_lock(&mctz->lock);
> -                       /* if on-tree, remove it */
> -                       if (mz->on_tree)
> -                               __mem_cgroup_remove_exceeded(mem, mz, mctz);
> -                       /*
> -                        * Insert again. mz->usage_in_excess will be updated.
> -                        * If excess is 0, no tree ops.
> -                        */
> -                       __mem_cgroup_insert_exceeded(mem, mz, mctz, excess);
> -                       spin_unlock(&mctz->lock);
> -               }
> -       }
> -}
> -
> -static void mem_cgroup_remove_from_trees(struct mem_cgroup *mem)
> -{
> -       int node, zone;
> -       struct mem_cgroup_per_zone *mz;
> -       struct mem_cgroup_tree_per_zone *mctz;
> -
> -       for_each_node_state(node, N_POSSIBLE) {
> -               for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> -                       mz = mem_cgroup_zoneinfo(mem, node, zone);
> -                       mctz = soft_limit_tree_node_zone(node, zone);
> -                       mem_cgroup_remove_exceeded(mem, mz, mctz);
> -               }
> -       }
> -}
> -
> -static struct mem_cgroup_per_zone *
> -__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
> -{
> -       struct rb_node *rightmost = NULL;
> -       struct mem_cgroup_per_zone *mz;
> -
> -retry:
> -       mz = NULL;
> -       rightmost = rb_last(&mctz->rb_root);
> -       if (!rightmost)
> -               goto done;              /* Nothing to reclaim from */
> -
> -       mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node);
> -       /*
> -        * Remove the node now but someone else can add it back,
> -        * we will to add it back at the end of reclaim to its correct
> -        * position in the tree.
> -        */
> -       __mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
> -       if (!res_counter_soft_limit_excess(&mz->mem->res) ||
> -               !css_tryget(&mz->mem->css))
> -               goto retry;
> -done:
> -       return mz;
> -}
> -
> -static struct mem_cgroup_per_zone *
> -mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
> -{
> -       struct mem_cgroup_per_zone *mz;
> -
> -       spin_lock(&mctz->lock);
> -       mz = __mem_cgroup_largest_soft_limit_node(mctz);
> -       spin_unlock(&mctz->lock);
> -       return mz;
> -}
> -
>  /*
>  * Implementation Note: reading percpu statistics for memcg.
>  *
> @@ -570,15 +380,6 @@ static long mem_cgroup_read_stat(struct mem_cgroup *mem,
>        return val;
>  }
>
> -static long mem_cgroup_local_usage(struct mem_cgroup *mem)
> -{
> -       long ret;
> -
> -       ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
> -       ret += mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
> -       return ret;
> -}
> -
>  static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
>                                         bool charge)
>  {
> @@ -699,7 +500,6 @@ static void memcg_check_events(struct mem_cgroup *mem, struct page *page)
>                __mem_cgroup_target_update(mem, MEM_CGROUP_TARGET_THRESH);
>                if (unlikely(__memcg_event_check(mem,
>                        MEM_CGROUP_TARGET_SOFTLIMIT))){
> -                       mem_cgroup_update_tree(mem, page);
>                        __mem_cgroup_target_update(mem,
>                                MEM_CGROUP_TARGET_SOFTLIMIT);
>                }
> @@ -1380,6 +1180,29 @@ void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
>        *iter = mem;
>  }
>
> +bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *root,
> +                                   struct mem_cgroup *mem)
> +{
> +       /* root_mem_cgroup never exceeds its soft limit */
> +       if (!mem)
> +               return false;
> +       if (!root)
> +               root = root_mem_cgroup;
> +       /*
> +        * See whether the memcg in question exceeds its soft limit
> +        * directly, or contributes to the soft limit excess in the
> +        * hierarchy below the given root.
> +        */
> +       while (mem != root) {
> +               if (res_counter_soft_limit_excess(&mem->res))
> +                       return true;
> +               if (!mem->use_hierarchy)
> +                       break;
> +               mem = mem_cgroup_from_cont(mem->css.cgroup->parent);
> +       }
> +       return false;
> +}
> +
>  static unsigned long mem_cgroup_target_reclaim(struct mem_cgroup *mem,
>                                               gfp_t gfp_mask,
>                                               bool noswap,
> @@ -1411,114 +1234,6 @@ static unsigned long mem_cgroup_target_reclaim(struct mem_cgroup *mem,
>  }
>
>  /*
> - * Visit the first child (need not be the first child as per the ordering
> - * of the cgroup list, since we track last_scanned_child) of @mem and use
> - * that to reclaim free pages from.
> - */
> -static struct mem_cgroup *
> -mem_cgroup_select_victim(struct mem_cgroup *root_mem)
> -{
> -       struct mem_cgroup *ret = NULL;
> -       struct cgroup_subsys_state *css;
> -       int nextid, found;
> -
> -       if (!root_mem->use_hierarchy) {
> -               css_get(&root_mem->css);
> -               ret = root_mem;
> -       }
> -
> -       while (!ret) {
> -               rcu_read_lock();
> -               nextid = root_mem->last_scanned_child + 1;
> -               css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css,
> -                                  &found);
> -               if (css && css_tryget(css))
> -                       ret = container_of(css, struct mem_cgroup, css);
> -
> -               rcu_read_unlock();
> -               /* Updates scanning parameter */
> -               if (!css) {
> -                       /* this means start scan from ID:1 */
> -                       root_mem->last_scanned_child = 0;
> -               } else
> -                       root_mem->last_scanned_child = found;
> -       }
> -
> -       return ret;
> -}
> -
> -/*
> - * Scan the hierarchy if needed to reclaim memory. We remember the last child
> - * we reclaimed from, so that we don't end up penalizing one child extensively
> - * based on its position in the children list.
> - *
> - * root_mem is the original ancestor that we've been reclaim from.
> - *
> - * We give up and return to the caller when we visit root_mem twice.
> - * (other groups can be removed while we're walking....)
> - */
> -static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_mem,
> -                                  struct zone *zone,
> -                                  gfp_t gfp_mask)
> -{
> -       struct mem_cgroup *victim;
> -       int ret, total = 0;
> -       int loop = 0;
> -       unsigned long excess;
> -       bool noswap = false;
> -
> -       excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
> -
> -       /* If memsw_is_minimum==1, swap-out is of-no-use. */
> -       if (root_mem->memsw_is_minimum)
> -               noswap = true;
> -
> -       while (1) {
> -               victim = mem_cgroup_select_victim(root_mem);
> -               if (victim == root_mem) {
> -                       loop++;
> -                       if (loop >= 1)
> -                               drain_all_stock_async();
> -                       if (loop >= 2) {
> -                               /*
> -                                * If we have not been able to reclaim
> -                                * anything, it might because there are
> -                                * no reclaimable pages under this hierarchy
> -                                */
> -                               if (!total) {
> -                                       css_put(&victim->css);
> -                                       break;
> -                               }
> -                               /*
> -                                * We want to do more targeted reclaim.
> -                                * excess >> 2 is not to excessive so as to
> -                                * reclaim too much, nor too less that we keep
> -                                * coming back to reclaim from this cgroup
> -                                */
> -                               if (total >= (excess >> 2) ||
> -                                       (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) {
> -                                       css_put(&victim->css);
> -                                       break;
> -                               }
> -                       }
> -               }
> -               if (!mem_cgroup_local_usage(victim)) {
> -                       /* this cgroup's local usage == 0 */
> -                       css_put(&victim->css);
> -                       continue;
> -               }
> -               /* we use swappiness of local cgroup */
> -               ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> -                               noswap, get_swappiness(victim), zone);
> -               css_put(&victim->css);
> -               total += ret;
> -               if (!res_counter_soft_limit_excess(&root_mem->res))
> -                       return total;
> -       }
> -       return total;
> -}
> -
> -/*
>  * Check OOM-Killer is already running under our hierarchy.
>  * If someone is running, return false.
>  */
> @@ -3291,94 +3006,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
>        return ret;
>  }
>
> -unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> -                                           gfp_t gfp_mask)
> -{
> -       unsigned long nr_reclaimed = 0;
> -       struct mem_cgroup_per_zone *mz, *next_mz = NULL;
> -       unsigned long reclaimed;
> -       int loop = 0;
> -       struct mem_cgroup_tree_per_zone *mctz;
> -       unsigned long long excess;
> -
> -       if (order > 0)
> -               return 0;
> -
> -       mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone));
> -       /*
> -        * This loop can run a while, specially if mem_cgroup's continuously
> -        * keep exceeding their soft limit and putting the system under
> -        * pressure
> -        */
> -       do {
> -               if (next_mz)
> -                       mz = next_mz;
> -               else
> -                       mz = mem_cgroup_largest_soft_limit_node(mctz);
> -               if (!mz)
> -                       break;
> -
> -               reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone, gfp_mask);
> -               nr_reclaimed += reclaimed;
> -               spin_lock(&mctz->lock);
> -
> -               /*
> -                * If we failed to reclaim anything from this memory cgroup
> -                * it is time to move on to the next cgroup
> -                */
> -               next_mz = NULL;
> -               if (!reclaimed) {
> -                       do {
> -                               /*
> -                                * Loop until we find yet another one.
> -                                *
> -                                * By the time we get the soft_limit lock
> -                                * again, someone might have aded the
> -                                * group back on the RB tree. Iterate to
> -                                * make sure we get a different mem.
> -                                * mem_cgroup_largest_soft_limit_node returns
> -                                * NULL if no other cgroup is present on
> -                                * the tree
> -                                */
> -                               next_mz =
> -                               __mem_cgroup_largest_soft_limit_node(mctz);
> -                               if (next_mz == mz) {
> -                                       css_put(&next_mz->mem->css);
> -                                       next_mz = NULL;
> -                               } else /* next_mz == NULL or other memcg */
> -                                       break;
> -                       } while (1);
> -               }
> -               __mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
> -               excess = res_counter_soft_limit_excess(&mz->mem->res);
> -               /*
> -                * One school of thought says that we should not add
> -                * back the node to the tree if reclaim returns 0.
> -                * But our reclaim could return 0, simply because due
> -                * to priority we are exposing a smaller subset of
> -                * memory to reclaim from. Consider this as a longer
> -                * term TODO.
> -                */
> -               /* If excess == 0, no tree ops */
> -               __mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess);
> -               spin_unlock(&mctz->lock);
> -               css_put(&mz->mem->css);
> -               loop++;
> -               /*
> -                * Could not reclaim anything and there are no more
> -                * mem cgroups to try or we seem to be looping without
> -                * reclaiming anything.
> -                */
> -               if (!nr_reclaimed &&
> -                       (next_mz == NULL ||
> -                       loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
> -                       break;
> -       } while (!nr_reclaimed);
> -       if (next_mz)
> -               css_put(&next_mz->mem->css);
> -       return nr_reclaimed;
> -}
> -
>  /*
>  * This routine traverse page_cgroup in given list and drop them all.
>  * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
> @@ -4449,9 +4076,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
>                mz = &pn->zoneinfo[zone];
>                for_each_lru(l)
>                        INIT_LIST_HEAD(&mz->lruvec.lists[l]);
> -               mz->usage_in_excess = 0;
> -               mz->on_tree = false;
> -               mz->mem = mem;
>        }
>        return 0;
>  }
> @@ -4504,7 +4128,6 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
>  {
>        int node;
>
> -       mem_cgroup_remove_from_trees(mem);
>        free_css_id(&mem_cgroup_subsys, &mem->css);
>
>        for_each_node_state(node, N_POSSIBLE)
> @@ -4559,31 +4182,6 @@ static void __init enable_swap_cgroup(void)
>  }
>  #endif
>
> -static int mem_cgroup_soft_limit_tree_init(void)
> -{
> -       struct mem_cgroup_tree_per_node *rtpn;
> -       struct mem_cgroup_tree_per_zone *rtpz;
> -       int tmp, node, zone;
> -
> -       for_each_node_state(node, N_POSSIBLE) {
> -               tmp = node;
> -               if (!node_state(node, N_NORMAL_MEMORY))
> -                       tmp = -1;
> -               rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp);
> -               if (!rtpn)
> -                       return 1;
> -
> -               soft_limit_tree.rb_tree_per_node[node] = rtpn;
> -
> -               for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> -                       rtpz = &rtpn->rb_tree_per_zone[zone];
> -                       rtpz->rb_root = RB_ROOT;
> -                       spin_lock_init(&rtpz->lock);
> -               }
> -       }
> -       return 0;
> -}
> -
>  static struct cgroup_subsys_state * __ref
>  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  {
> @@ -4605,8 +4203,6 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>                enable_swap_cgroup();
>                parent = NULL;
>                root_mem_cgroup = mem;
> -               if (mem_cgroup_soft_limit_tree_init())
> -                       goto free_out;
>                for_each_possible_cpu(cpu) {
>                        struct memcg_stock_pcp *stock =
>                                                &per_cpu(memcg_stock, cpu);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0381a5d..2b701e0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1937,10 +1937,13 @@ static void shrink_zone(int priority, struct zone *zone,
>        do {
>                unsigned long reclaimed = sc->nr_reclaimed;
>                unsigned long scanned = sc->nr_scanned;
> +               int epriority = priority;
>
>                mem_cgroup_hierarchy_walk(root, &mem);
>                sc->current_memcg = mem;
> -               do_shrink_zone(priority, zone, sc);
> +               if (mem_cgroup_soft_limit_exceeded(root, mem))
> +                       epriority -= 1;
> +               do_shrink_zone(epriority, zone, sc);
>                mem_cgroup_count_reclaim(mem, current_is_kswapd(),
>                                         mem != root, /* limit or hierarchy? */
>                                         sc->nr_scanned - scanned,
> @@ -2153,42 +2156,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  }
>
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> -
> -unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> -                                               gfp_t gfp_mask, bool noswap,
> -                                               unsigned int swappiness,
> -                                               struct zone *zone)
> -{
> -       struct scan_control sc = {
> -               .nr_to_reclaim = SWAP_CLUSTER_MAX,
> -               .may_writepage = !laptop_mode,
> -               .may_unmap = 1,
> -               .may_swap = !noswap,
> -               .swappiness = swappiness,
> -               .order = 0,
> -               .memcg = mem,
> -       };
> -       sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> -                       (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
> -
> -       trace_mm_vmscan_memcg_softlimit_reclaim_begin(0,
> -                                                     sc.may_writepage,
> -                                                     sc.gfp_mask);
> -
> -       /*
> -        * NOTE: Although we can get the priority field, using it
> -        * here is not a good idea, since it limits the pages we can scan.
> -        * if we don't reclaim here, the shrink_zone from balance_pgdat
> -        * will pick up pages from other mem cgroup's as well. We hack
> -        * the priority and make it zero.
> -        */
> -       do_shrink_zone(0, zone, &sc);
> -
> -       trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
> -
> -       return sc.nr_reclaimed;
> -}
> -
>  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>                                           gfp_t gfp_mask,
>                                           bool noswap,
> @@ -2418,13 +2385,6 @@ loop_again:
>                                continue;
>
>                        sc.nr_scanned = 0;
> -
> -                       /*
> -                        * Call soft limit reclaim before calling shrink_zone.
> -                        * For now we ignore the return value
> -                        */
> -                       mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask);
> -
>                        /*
>                         * We put equal pressure on every zone, unless
>                         * one zone has way too many pages free
> --
> 1.7.5.1
>
>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 0/6] mm: memcg naturalization
  2011-05-12 14:53 ` Johannes Weiner
                   ` (6 preceding siblings ...)
  (?)
@ 2011-05-12 18:53 ` Ying Han
  2011-05-13  7:20     ` Johannes Weiner
  -1 siblings, 1 reply; 83+ messages in thread
From: Ying Han @ 2011-05-12 18:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Thu, May 12, 2011 at 7:53 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:

> Hi!
>
> Here is a patch series that is a result of the memcg discussions on
> LSF (memcg-aware global reclaim, global lru removal, struct
> page_cgroup reduction, soft limit implementation) and the recent
> feature discussions on linux-mm.
>
> The long-term idea is to have memcgs no longer bolted to the side of
> the mm code, but integrate it as much as possible such that there is a
> native understanding of containers, and that the traditional !memcg
> setup is just a singular group.  This series is an approach in that
> direction.
>
> It is a rather early snapshot, WIP, barely tested etc., but I wanted
> to get your opinions before further pursuing it.  It is also part of
> my counter-argument to the proposals of adding memcg-reclaim-related
> user interfaces at this point in time, so I wanted to push this out
> the door before things are merged into .40.
>

The memcg-reclaim-related user interface is, I assume, the configurable
watermark tunable we were talking about in the per-memcg background
reclaim patch. I think we reached some agreement to remove the watermark
tunable as a first step. But the newly added
memory.soft_limit_async_reclaim that you proposed seems to be a usable
interface.


>
> The patches are quite big, I am still looking for things to factor and
> split out, sorry for this.  Documentation is on its way as well ;)
>

This is quite a big patchset and it includes several different parts. We
might want to split it into steps. I will read them through now.

--Ying

>
> #1 and #2 are boring preparational work.  #3 makes traditional reclaim
> in vmscan.c memcg-aware, which is a prerequisite for both removal of
> the global lru in #5 and the way I reimplemented soft limit reclaim in
> #6.
>
> The diffstat so far looks like this:
>
>  include/linux/memcontrol.h  |   84 +++--
>  include/linux/mm_inline.h   |   15 +-
>  include/linux/mmzone.h      |   10 +-
>  include/linux/page_cgroup.h |   35 --
>  include/linux/swap.h        |    4 -
>  mm/memcontrol.c             |  860 +++++++++++++------------------------------
>  mm/page_alloc.c             |    2 +-
>  mm/page_cgroup.c            |   39 +--
>  mm/swap.c                   |   20 +-
>  mm/vmscan.c                 |  273 +++++++--------
>  10 files changed, 452 insertions(+), 890 deletions(-)
>
> It is based on .39-rc7 because of the memcg churn in -mm, but I'll
> rebase it in the near future.
>
> Discuss!
>
>        Hannes
>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 3/6] mm: memcg-aware global reclaim
  2011-05-12 14:53   ` Johannes Weiner
  (?)
  (?)
@ 2011-05-12 19:19   ` Ying Han
  2011-05-13  7:08       ` Johannes Weiner
  -1 siblings, 1 reply; 83+ messages in thread
From: Ying Han @ 2011-05-12 19:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Thu, May 12, 2011 at 7:53 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:

> A page charged to a memcg is linked to a lru list specific to that
> memcg.  At the same time, traditional global reclaim is oblivious to
> memcgs, and all the pages are also linked to a global per-zone list.
>
> This patch changes traditional global reclaim to iterate over all
> existing memcgs, so that it no longer relies on the global list being
> present.
>

> This is one step forward in integrating memcg code better into the
> rest of memory management.  It is also a prerequisite to get rid of
> the global per-zone lru lists.
>
Sorry if I misunderstood something here. I assume this patch has not much
to do with global soft_limit reclaim, but only makes the system scan the
per-memcg lru lists under global memory pressure.


> RFC:
>
> The algorithm implemented in this patch is very naive.  For each zone
> scanned at each priority level, it iterates over all existing memcgs
> and considers them for scanning.
>
> This is just a prototype and I did not optimize it yet because I am
> unsure about the maximum number of memcgs that still constitute a sane
> configuration in comparison to the machine size.
>

So we also scan memcgs which have no pages allocated on this zone? I
will read the following patch in case I missed something here :)
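
If that is the case, one obvious refinement would be to skip memcgs that
have nothing on this zone's LRU lists before scanning them. A minimal
sketch, assuming a hypothetical per-memcg per-zone LRU counter
(sketch_memcg_zone_nr_pages() is made up, not an existing helper), which
the shrink_zone() walk could consult before calling do_shrink_zone():

static bool sketch_memcg_has_zone_pages(struct mem_cgroup *mem,
                                        struct zone *zone)
{
        unsigned long pages = 0;
        enum lru_list l;

        /* sum this memcg's LRU counts for this zone only */
        for_each_lru(l)
                pages += sketch_memcg_zone_nr_pages(mem, zone, l);
        return pages != 0;
}

Whether such a check pays off probably depends on how many empty groups
a typical hierarchy has per zone.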

--Ying

>
> It is perfectly fair since all memcgs are scanned at each priority
> level.
>
> On my 4G quadcore laptop with 1000 memcgs, a significant amount of CPU
> time was spent just iterating memcgs during reclaim.  But it can not
> really be claimed that the old code was much better, either: global
> LRU reclaim could mean that a few hundred memcgs would have been
> emptied out completely, while others stayed untouched.
>
> I am open to solutions that trade fairness against CPU-time but don't
> want to have an extreme in either direction.  Maybe break out early if
> a number of memcgs has been successfully reclaimed from and remember
> the last one scanned.
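
Purely as an illustration of that idea (not part of this series; the
last_scanned cursor is hypothetical and css reference handling is
ignored), such an early-break walk could look like:

        int batch = 0;

        mem = sc->last_scanned;                 /* hypothetical cursor field */
        do {
                unsigned long reclaimed = sc->nr_reclaimed;

                mem_cgroup_hierarchy_walk(root, &mem);
                sc->current_memcg = mem;
                do_shrink_zone(priority, zone, sc);
                /* stop once a handful of memcgs have yielded pages;
                 * "8" is an arbitrary illustrative batch size */
                if (sc->nr_reclaimed > reclaimed && ++batch >= 8)
                        break;
        } while (mem != root);
        sc->last_scanned = mem;                 /* resume here on the next pass */
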
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/memcontrol.h |    7 ++
>  mm/memcontrol.c            |  148 +++++++++++++++++++++++++++++---------------
>  mm/vmscan.c                |   21 +++++--
>  3 files changed, 120 insertions(+), 56 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5e9840f5..58728c7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -104,6 +104,7 @@ extern void mem_cgroup_end_migration(struct mem_cgroup
> *mem,
>  /*
>  * For memory reclaim.
>  */
> +void mem_cgroup_hierarchy_walk(struct mem_cgroup *, struct mem_cgroup **);
>  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
>  int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> @@ -289,6 +290,12 @@ static inline bool mem_cgroup_disabled(void)
>        return true;
>  }
>
> +static inline void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
> +                                            struct mem_cgroup **iter)
> +{
> +       *iter = start;
> +}
> +
>  static inline int
>  mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bf5ab87..edcd55a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -313,7 +313,7 @@ static bool move_file(void)
>  }
>
>  /*
> - * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
> + * Maximum loops in mem_cgroup_soft_reclaim(), used for soft
>  * limit reclaim to prevent infinite loops, if they ever occur.
>  */
>  #define        MEM_CGROUP_MAX_RECLAIM_LOOPS            (100)
> @@ -339,16 +339,6 @@ enum charge_type {
>  /* Used for OOM nofiier */
>  #define OOM_CONTROL            (0)
>
> -/*
> - * Reclaim flags for mem_cgroup_hierarchical_reclaim
> - */
> -#define MEM_CGROUP_RECLAIM_NOSWAP_BIT  0x0
> -#define MEM_CGROUP_RECLAIM_NOSWAP      (1 <<
> MEM_CGROUP_RECLAIM_NOSWAP_BIT)
> -#define MEM_CGROUP_RECLAIM_SHRINK_BIT  0x1
> -#define MEM_CGROUP_RECLAIM_SHRINK      (1 <<
> MEM_CGROUP_RECLAIM_SHRINK_BIT)
> -#define MEM_CGROUP_RECLAIM_SOFT_BIT    0x2
> -#define MEM_CGROUP_RECLAIM_SOFT                (1 <<
> MEM_CGROUP_RECLAIM_SOFT_BIT)
> -
>  static void mem_cgroup_get(struct mem_cgroup *mem);
>  static void mem_cgroup_put(struct mem_cgroup *mem);
>  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
> @@ -1381,6 +1371,86 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
>        return min(limit, memsw);
>  }
>
> +void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
> +                              struct mem_cgroup **iter)
> +{
> +       struct mem_cgroup *mem = *iter;
> +       int id;
> +
> +       if (!start)
> +               start = root_mem_cgroup;
> +       /*
> +        * Even without hierarchy explicitely enabled in the root
> +        * memcg, it is the ultimate parent of all memcgs.
> +        */
> +       if (!(start == root_mem_cgroup || start->use_hierarchy)) {
> +               *iter = start;
> +               return;
> +       }
> +
> +       if (!mem)
> +               id = css_id(&start->css);
> +       else {
> +               id = css_id(&mem->css);
> +               css_put(&mem->css);
> +               mem = NULL;
> +       }
> +
> +       do {
> +               struct cgroup_subsys_state *css;
> +
> +               rcu_read_lock();
> +               css = css_get_next(&mem_cgroup_subsys, id+1, &start->css,
> &id);
> +               /*
> +                * The caller must already have a reference to the
> +                * starting point of this hierarchy walk, do not grab
> +                * another one.  This way, the loop can be finished
> +                * when the hierarchy root is returned, without any
> +                * further cleanup required.
> +                */
> +               if (css && (css == &start->css || css_tryget(css)))
> +                       mem = container_of(css, struct mem_cgroup, css);
> +               rcu_read_unlock();
> +               if (!css)
> +                       id = 0;
> +       } while (!mem);
> +
> +       if (mem == root_mem_cgroup)
> +               mem = NULL;
> +
> +       *iter = mem;
> +}
> +
> +static unsigned long mem_cgroup_target_reclaim(struct mem_cgroup *mem,
> +                                              gfp_t gfp_mask,
> +                                              bool noswap,
> +                                              bool shrink)
> +{
> +       unsigned long total = 0;
> +       int loop;
> +
> +       if (mem->memsw_is_minimum)
> +               noswap = true;
> +
> +       for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
> +               drain_all_stock_async();
> +               total += try_to_free_mem_cgroup_pages(mem, gfp_mask,
> noswap,
> +                                                     get_swappiness(mem));
> +               if (total && shrink)
> +                       break;
> +               if (mem_cgroup_margin(mem))
> +                       break;
> +               /*
> +                * If we have not been able to reclaim anything after
> +                * two reclaim attempts, there may be no reclaimable
> +                * pages under this hierarchy.
> +                */
> +               if (loop && !total)
> +                       break;
> +       }
> +       return total;
> +}
> +
>  /*
>  * Visit the first child (need not be the first child as per the ordering
>  * of the cgroup list, since we track last_scanned_child) of @mem and use
> @@ -1427,21 +1497,16 @@ mem_cgroup_select_victim(struct mem_cgroup
> *root_mem)
>  *
>  * We give up and return to the caller when we visit root_mem twice.
>  * (other groups can be removed while we're walking....)
> - *
> - * If shrink==true, for avoiding to free too much, this returns immedieately.
>  */
> -static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> -                                               struct zone *zone,
> -                                               gfp_t gfp_mask,
> -                                               unsigned long reclaim_options)
> +static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_mem,
> +                                  struct zone *zone,
> +                                  gfp_t gfp_mask)
>  {
>        struct mem_cgroup *victim;
>        int ret, total = 0;
>        int loop = 0;
> -       bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
> -       bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
> -       bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
>        unsigned long excess;
> +       bool noswap = false;
>
>        excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
>
> @@ -1461,7 +1526,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>                                 * anything, it might because there are
>                                 * no reclaimable pages under this hierarchy
>                                 */
> -                               if (!check_soft || !total) {
> +                               if (!total) {
>                                        css_put(&victim->css);
>                                        break;
>                                }
> @@ -1484,25 +1549,11 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>                        continue;
>                }
>                /* we use swappiness of local cgroup */
> -               if (check_soft)
> -                       ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> +               ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
>                                noswap, get_swappiness(victim), zone);
> -               else
> -                       ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> -                                               noswap, get_swappiness(victim));
>                css_put(&victim->css);
> -               /*
> -                * At shrinking usage, we can't check we should stop here or
> -                * reclaim more. It's depends on callers. last_scanned_child
> -                * will work enough for keeping fairness under tree.
> -                */
> -               if (shrink)
> -                       return ret;
>                total += ret;
> -               if (check_soft) {
> -                       if (!res_counter_soft_limit_excess(&root_mem->res))
> -                               return total;
> -               } else if (mem_cgroup_margin(root_mem))
> +               if (!res_counter_soft_limit_excess(&root_mem->res))
>                        return total;
>        }
>        return total;
> @@ -1897,7 +1948,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
>        unsigned long csize = nr_pages * PAGE_SIZE;
>        struct mem_cgroup *mem_over_limit;
>        struct res_counter *fail_res;
> -       unsigned long flags = 0;
> +       bool noswap = false;
>        int ret;
>
>        ret = res_counter_charge(&mem->res, csize, &fail_res);
> @@ -1911,7 +1962,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
>
>                res_counter_uncharge(&mem->res, csize);
>                mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
> -               flags |= MEM_CGROUP_RECLAIM_NOSWAP;
> +               noswap = true;
>        } else
>                mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
>        /*
> @@ -1927,8 +1978,8 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
>        if (!(gfp_mask & __GFP_WAIT))
>                return CHARGE_WOULDBLOCK;
>
> -       ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> -                                             gfp_mask, flags);
> +       ret = mem_cgroup_target_reclaim(mem_over_limit, gfp_mask,
> +                                       noswap, false);
>        if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
>                return CHARGE_RETRY;
>        /*
> @@ -3085,7 +3136,7 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
>
>  /*
>  * A call to try to shrink memory usage on charge failure at shmem's swapin.
> - * Calling hierarchical_reclaim is not enough because we should update
> + * Calling target_reclaim is not enough because we should update
>  * last_oom_jiffies to prevent pagefault_out_of_memory from invoking global OOM.
>  * Moreover considering hierarchy, we should reclaim from the mem_over_limit,
>  * not from the memcg which this page would be charged to.
> @@ -3167,7 +3218,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>        int enlarge;
>
>        /*
> -        * For keeping hierarchical_reclaim simple, how long we should retry
> +        * For keeping target_reclaim simple, how long we should retry
>         * is depends on callers. We set our retry-count to be function
>         * of # of children which we should visit in this loop.
>         */
> @@ -3210,8 +3261,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>                if (!ret)
>                        break;
>
> -               mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
> -                                               MEM_CGROUP_RECLAIM_SHRINK);
> +               mem_cgroup_target_reclaim(memcg, GFP_KERNEL, false, false);
>                curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
>                /* Usage is reduced ? */
>                if (curusage >= oldusage)
> @@ -3269,9 +3319,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
>                if (!ret)
>                        break;
>
> -               mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
> -                                               MEM_CGROUP_RECLAIM_NOSWAP |
> -                                               MEM_CGROUP_RECLAIM_SHRINK);
> +               mem_cgroup_target_reclaim(memcg, GFP_KERNEL, true, false);
>                curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
>                /* Usage is reduced ? */
>                if (curusage >= oldusage)
> @@ -3311,9 +3359,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>                if (!mz)
>                        break;
>
> -               reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
> -                                               gfp_mask,
> -                                               MEM_CGROUP_RECLAIM_SOFT);
> +               reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone, gfp_mask);
>                nr_reclaimed += reclaimed;
>                spin_lock(&mctz->lock);
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ceeb2a5..e2a3647 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1900,8 +1900,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
>  /*
>  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
>  */
> -static void shrink_zone(int priority, struct zone *zone,
> -                               struct scan_control *sc)
> +static void do_shrink_zone(int priority, struct zone *zone,
> +                          struct scan_control *sc)
>  {
>        unsigned long nr[NR_LRU_LISTS];
>        unsigned long nr_to_scan;
> @@ -1914,8 +1914,6 @@ restart:
>        nr_scanned = sc->nr_scanned;
>        get_scan_count(zone, sc, nr, priority);
>
> -       sc->current_memcg = sc->memcg;
>
>        while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
>                                        nr[LRU_INACTIVE_FILE]) {
>                for_each_evictable_lru(l) {
> @@ -1954,6 +1952,19 @@ restart:
>                goto restart;
>
>        throttle_vm_writeout(sc->gfp_mask);
> +}
> +
> +static void shrink_zone(int priority, struct zone *zone,
> +                       struct scan_control *sc)
> +{
> +       struct mem_cgroup *root = sc->memcg;
> +       struct mem_cgroup *mem = NULL;
> +
> +       do {
> +               mem_cgroup_hierarchy_walk(root, &mem);
> +               sc->current_memcg = mem;
> +               do_shrink_zone(priority, zone, sc);
> +       } while (mem != root);
>
>        /* For good measure, noone higher up the stack should look at it */
>        sc->current_memcg = NULL;
> @@ -2190,7 +2201,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
>         * will pick up pages from other mem cgroup's as well. We hack
>         * the priority and make it zero.
>         */
> -       shrink_zone(0, zone, &sc);
> +       do_shrink_zone(0, zone, &sc);
>
>        trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
>
> --
> 1.7.5.1
>
>


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 4/6] memcg: reclaim statistics
  2011-05-12 14:53   ` Johannes Weiner
  (?)
@ 2011-05-12 19:33   ` Ying Han
  2011-05-16 23:10       ` Johannes Weiner
  -1 siblings, 1 reply; 83+ messages in thread
From: Ying Han @ 2011-05-12 19:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel


On Thu, May 12, 2011 at 7:53 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:

> TODO: write proper changelog.  Here is an excerpt from
> http://lkml.kernel.org/r/20110428123652.GM12437@cmpxchg.org:
>
> : 1. Limit-triggered direct reclaim
> :
> : The memory cgroup hits its limit and the task does direct reclaim from
> : its own memcg.  We probably want statistics for this separately from
> : background reclaim to see how successful background reclaim is, the
> : same reason we have this separation in the global vmstat as well.
> :
> :       pgscan_direct_limit
> :       pgfree_direct_limit
>

Can we use "pgsteal_" instead? Not big fan of the naming but want to make
them consistent to other stats.

> :
> : 2. Limit-triggered background reclaim
> :
> : This is the watermark-based asynchroneous reclaim that is currently in
> : discussion.  It's triggered by the memcg breaching its watermark,
> : which is relative to its hard-limit.  I named it kswapd because I
> : still think kswapd should do this job, but it is all open for
> : discussion, obviously.  Treat it as meaning 'background' or
> : 'asynchroneous'.
> :
> :       pgscan_kswapd_limit
> :       pgfree_kswapd_limit
>

Kame might already have these stats in the per-memcg background reclaim
patch. Just mentioning it here since it will make a later merge a bit harder.

> :
> : 3. Hierarchy-triggered direct reclaim
> :
> : A condition outside the memcg leads to a task directly reclaiming from
> : this memcg.  This could be global memory pressure for example, but
> : also a parent cgroup hitting its limit.  It's probably helpful to
> : assume global memory pressure meaning that the root cgroup hit its
> : limit, conceptually.  We don't have that yet, but this could be the
> : direct softlimit reclaim Ying mentioned above.
> :
> :       pgscan_direct_hierarchy
> :       pgsteal_direct_hierarchy
>

The stats for soft_limit reclaim from global ttfp have been merged in
mmotm, I believe, as the following:

"soft_direct_steal"
"soft_direct_scan"

I wonder whether we might want to separate that out from the other case,
where the reclaim is triggered by a parent hitting its limit.

> :
> : 4. Hierarchy-triggered background reclaim
> :
> : An outside condition leads to kswapd reclaiming from this memcg, like
> : kswapd doing softlimit pushback due to global memory pressure.
> :
> :       pgscan_kswapd_hierarchy
> :       pgsteal_kswapd_hierarchy
>

The stats for soft_limit reclaim from global background reclaim have been
merged in mmotm, I believe, as the following:
"soft_kswapd_steal"
"soft_kswapd_scan"

 --Ying

> :
> : ---
> :
> : With these stats in place, you can see how much pressure there is on
> : your memcg hierarchy.  This includes machine utilization and if you
> : overcommitted too much on a global level if there is a lot of reclaim
> : activity indicated in the hierarchical stats.
> :
> : With the limit-based stats, you can see the amount of internal
> : pressure of memcgs, which shows you if you overcommitted on a local
> : level.
> :
> : And for both cases, you can also see the effectiveness of background
> : reclaim by comparing the direct and the kswapd stats.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/memcontrol.h |    9 ++++++
>  mm/memcontrol.c            |   63 ++++++++++++++++++++++++++++++++++++++++++++
>  mm/vmscan.c                |    7 +++++
>  3 files changed, 79 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 58728c7..a4c84db 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -105,6 +105,8 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
>  * For memory reclaim.
>  */
>  void mem_cgroup_hierarchy_walk(struct mem_cgroup *, struct mem_cgroup **);
> +void mem_cgroup_count_reclaim(struct mem_cgroup *, bool, bool,
> +                             unsigned long, unsigned long);
>  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
>  int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> @@ -296,6 +298,13 @@ static inline void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
>        *iter = start;
>  }
>
> +static inline void mem_cgroup_count_reclaim(struct mem_cgroup *mem,
> +                                           bool kswapd, bool hierarchy,
> +                                           unsigned long scanned,
> +                                           unsigned long reclaimed)
> +{
> +}
> +
>  static inline int
>  mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index edcd55a..d762706 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -90,10 +90,24 @@ enum mem_cgroup_stat_index {
>        MEM_CGROUP_STAT_NSTATS,
>  };
>
> +#define RECLAIM_RECLAIMED 1
> +#define RECLAIM_HIERARCHY 2
> +#define RECLAIM_KSWAPD 4
> +
>  enum mem_cgroup_events_index {
>        MEM_CGROUP_EVENTS_PGPGIN,       /* # of pages paged in */
>        MEM_CGROUP_EVENTS_PGPGOUT,      /* # of pages paged out */
>        MEM_CGROUP_EVENTS_COUNT,        /* # of pages paged in/out */
> +       RECLAIM_BASE,
> +       PGSCAN_DIRECT_LIMIT = RECLAIM_BASE,
> +       PGFREE_DIRECT_LIMIT = RECLAIM_BASE + RECLAIM_RECLAIMED,
> +       PGSCAN_DIRECT_HIERARCHY = RECLAIM_BASE + RECLAIM_HIERARCHY,
> +       PGSTEAL_DIRECT_HIERARCHY = RECLAIM_BASE + RECLAIM_HIERARCHY + RECLAIM_RECLAIMED,
> +       /* you know the drill... */
> +       PGSCAN_KSWAPD_LIMIT,
> +       PGFREE_KSWAPD_LIMIT,
> +       PGSCAN_KSWAPD_HIERARCHY,
> +       PGSTEAL_KSWAPD_HIERARCHY,
>        MEM_CGROUP_EVENTS_NSTATS,
>  };
>  /*
> @@ -575,6 +589,23 @@ static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
>        this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_SWAPOUT], val);
>  }
>
> +void mem_cgroup_count_reclaim(struct mem_cgroup *mem,
> +                             bool kswapd, bool hierarchy,
> +                             unsigned long scanned, unsigned long reclaimed)
> +{
> +       unsigned int base = RECLAIM_BASE;
> +
> +       if (!mem)
> +               mem = root_mem_cgroup;
> +       if (kswapd)
> +               base += RECLAIM_KSWAPD;
> +       if (hierarchy)
> +               base += RECLAIM_HIERARCHY;
> +
> +       this_cpu_add(mem->stat->events[base], scanned);
> +       this_cpu_add(mem->stat->events[base + RECLAIM_RECLAIMED], reclaimed);
> +}
> +
>  static unsigned long mem_cgroup_read_events(struct mem_cgroup *mem,
>                                            enum mem_cgroup_events_index idx)
>  {
> @@ -3817,6 +3848,14 @@ enum {
>        MCS_FILE_MAPPED,
>        MCS_PGPGIN,
>        MCS_PGPGOUT,
> +       MCS_PGSCAN_DIRECT_LIMIT,
> +       MCS_PGFREE_DIRECT_LIMIT,
> +       MCS_PGSCAN_DIRECT_HIERARCHY,
> +       MCS_PGSTEAL_DIRECT_HIERARCHY,
> +       MCS_PGSCAN_KSWAPD_LIMIT,
> +       MCS_PGFREE_KSWAPD_LIMIT,
> +       MCS_PGSCAN_KSWAPD_HIERARCHY,
> +       MCS_PGSTEAL_KSWAPD_HIERARCHY,
>        MCS_SWAP,
>        MCS_INACTIVE_ANON,
>        MCS_ACTIVE_ANON,
> @@ -3839,6 +3878,14 @@ struct {
>        {"mapped_file", "total_mapped_file"},
>        {"pgpgin", "total_pgpgin"},
>        {"pgpgout", "total_pgpgout"},
> +       {"pgscan_direct_limit", "total_pgscan_direct_limit"},
> +       {"pgfree_direct_limit", "total_pgfree_direct_limit"},
> +       {"pgscan_direct_hierarchy", "total_pgscan_direct_hierarchy"},
> +       {"pgsteal_direct_hierarchy", "total_pgsteal_direct_hierarchy"},
> +       {"pgscan_kswapd_limit", "total_pgscan_kswapd_limit"},
> +       {"pgfree_kswapd_limit", "total_pgfree_kswapd_limit"},
> +       {"pgscan_kswapd_hierarchy", "total_pgscan_kswapd_hierarchy"},
> +       {"pgsteal_kswapd_hierarchy", "total_pgsteal_kswapd_hierarchy"},
>        {"swap", "total_swap"},
>        {"inactive_anon", "total_inactive_anon"},
>        {"active_anon", "total_active_anon"},
> @@ -3864,6 +3911,22 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
>        s->stat[MCS_PGPGIN] += val;
>        val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGPGOUT);
>        s->stat[MCS_PGPGOUT] += val;
> +       val = mem_cgroup_read_events(mem, PGSCAN_DIRECT_LIMIT);
> +       s->stat[MCS_PGSCAN_DIRECT_LIMIT] += val;
> +       val = mem_cgroup_read_events(mem, PGFREE_DIRECT_LIMIT);
> +       s->stat[MCS_PGFREE_DIRECT_LIMIT] += val;
> +       val = mem_cgroup_read_events(mem, PGSCAN_DIRECT_HIERARCHY);
> +       s->stat[MCS_PGSCAN_DIRECT_HIERARCHY] += val;
> +       val = mem_cgroup_read_events(mem, PGSTEAL_DIRECT_HIERARCHY);
> +       s->stat[MCS_PGSTEAL_DIRECT_HIERARCHY] += val;
> +       val = mem_cgroup_read_events(mem, PGSCAN_KSWAPD_LIMIT);
> +       s->stat[MCS_PGSCAN_KSWAPD_LIMIT] += val;
> +       val = mem_cgroup_read_events(mem, PGFREE_KSWAPD_LIMIT);
> +       s->stat[MCS_PGFREE_KSWAPD_LIMIT] += val;
> +       val = mem_cgroup_read_events(mem, PGSCAN_KSWAPD_HIERARCHY);
> +       s->stat[MCS_PGSCAN_KSWAPD_HIERARCHY] += val;
> +       val = mem_cgroup_read_events(mem, PGSTEAL_KSWAPD_HIERARCHY);
> +       s->stat[MCS_PGSTEAL_KSWAPD_HIERARCHY] += val;
>        if (do_swap_account) {
>                val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
>                s->stat[MCS_SWAP] += val * PAGE_SIZE;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e2a3647..0e45ceb 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1961,9 +1961,16 @@ static void shrink_zone(int priority, struct zone *zone,
>        struct mem_cgroup *mem = NULL;
>
>        do {
> +               unsigned long reclaimed = sc->nr_reclaimed;
> +               unsigned long scanned = sc->nr_scanned;
> +
>                mem_cgroup_hierarchy_walk(root, &mem);
>                sc->current_memcg = mem;
>                do_shrink_zone(priority, zone, sc);
> +               mem_cgroup_count_reclaim(mem, current_is_kswapd(),
> +                                        mem != root, /* limit or hierarchy? */
> +                                        sc->nr_scanned - scanned,
> +                                        sc->nr_reclaimed - reclaimed);
>        } while (mem != root);
>
>        /* For good measure, noone higher up the stack should look at it */
> --
> 1.7.5.1
>
>
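
As a rough illustration of how these counters would be consumed, a minimal
userspace sketch follows. It is illustrative only: it assumes the
pgscan_*/pgfree_*/pgsteal_* names proposed above appear verbatim in
memory.stat, and it takes the path of that file on the command line. The
"effective" number is simply reclaimed/scanned per category.

/*
 * Illustrative sketch only: compare direct vs. kswapd reclaim
 * effectiveness from the counters proposed above.  Assumes those
 * names appear verbatim in the memory.stat file given as argv[1].
 */
#include <stdio.h>
#include <string.h>

static unsigned long long get_stat(FILE *f, const char *name)
{
	char key[64];
	unsigned long long val;

	rewind(f);
	while (fscanf(f, "%63s %llu", key, &val) == 2)
		if (!strcmp(key, name))
			return val;
	return 0;
}

int main(int argc, char **argv)
{
	static const char *pairs[][2] = {
		{ "pgscan_direct_limit",     "pgfree_direct_limit" },
		{ "pgscan_kswapd_limit",     "pgfree_kswapd_limit" },
		{ "pgscan_direct_hierarchy", "pgsteal_direct_hierarchy" },
		{ "pgscan_kswapd_hierarchy", "pgsteal_kswapd_hierarchy" },
	};
	FILE *f;
	int i;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <path to memory.stat>\n", argv[0]);
		return 1;
	}
	f = fopen(argv[1], "r");
	if (!f) {
		perror(argv[1]);
		return 1;
	}
	for (i = 0; i < 4; i++) {
		unsigned long long scanned = get_stat(f, pairs[i][0]);
		unsigned long long freed = get_stat(f, pairs[i][1]);

		printf("%-26s scanned=%llu reclaimed=%llu (%.0f%% effective)\n",
		       pairs[i][0], scanned, freed,
		       scanned ? 100.0 * freed / scanned : 0.0);
	}
	fclose(f);
	return 0;
}

Sampling a group's memory.stat before and after a load spike with something
like this would show directly how much of the pressure background reclaim
absorbed versus direct reclaim.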


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 1/6] memcg: remove unused retry signal from reclaim
  2011-05-12 14:53   ` Johannes Weiner
@ 2011-05-12 23:44     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 83+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-12 23:44 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Daisuke Nishimura, Balbir Singh, Ying Han, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Thu, 12 May 2011 16:53:53 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> If the memcg reclaim code detects the target memcg below its limit it
> exits and returns a guaranteed non-zero value so that the charge is
> retried.
> 
> Nowadays, the charge side checks the memcg limit itself and does not
> rely on this non-zero return value trick.
> 
> This patch removes it.  The reclaim code will now always return the
> true number of pages it reclaimed on its own.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 2/6] vmscan: make distinction between memcg reclaim and LRU list selection
  2011-05-12 14:53   ` Johannes Weiner
@ 2011-05-12 23:50     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 83+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-12 23:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Daisuke Nishimura, Balbir Singh, Ying Han, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Thu, 12 May 2011 16:53:54 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> The reclaim code has a single predicate for whether it currently
> reclaims on behalf of a memory cgroup, as well as whether it is
> reclaiming from the global LRU list or a memory cgroup LRU list.
> 
> Up to now, both cases always coincide, but subsequent patches will
> change things such that global reclaim will scan memory cgroup lists.
> 
> This patch adds a new predicate that tells global reclaim from memory
> cgroup reclaim, and then changes all callsites that are actually about
> global reclaim heuristics rather than strict LRU list selection.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>


Hmm, isn't it better to merge this into the patches where the meaning of
the new variable becomes clearer?

> ---
>  mm/vmscan.c |   96 ++++++++++++++++++++++++++++++++++------------------------
>  1 files changed, 56 insertions(+), 40 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f6b435c..ceeb2a5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -104,8 +104,12 @@ struct scan_control {
>  	 */
>  	reclaim_mode_t reclaim_mode;
>  
> -	/* Which cgroup do we reclaim from */
> -	struct mem_cgroup *mem_cgroup;
> +	/*
> +	 * The memory cgroup we reclaim on behalf of, and the one we
> +	 * are currently reclaiming from.
> +	 */
> +	struct mem_cgroup *memcg;
> +	struct mem_cgroup *current_memcg;
>  

I wonder if the patch would be clearer if you avoided renaming the
existing one...



>  	/*
>  	 * Nodemask of nodes allowed by the caller. If NULL, all nodes
> @@ -154,16 +158,24 @@ static LIST_HEAD(shrinker_list);
>  static DECLARE_RWSEM(shrinker_rwsem);
>  
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> -#define scanning_global_lru(sc)	(!(sc)->mem_cgroup)
> +static bool global_reclaim(struct scan_control *sc)
> +{
> +	return !sc->memcg;
> +}
> +static bool scanning_global_lru(struct scan_control *sc)
> +{
> +	return !sc->current_memcg;
> +}


Could you add comments?
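
One possible shape for such comments, purely as a sketch over the two
predicates quoted above (the wording is only a suggestion, not taken from
the patch):

/*
 * Sketch: the function bodies are the ones from the patch above; only
 * the documentation wording is suggested here.
 */

/*
 * global_reclaim() - true when reclaim runs on behalf of the whole
 * machine rather than to push one memcg back below its limit.  Call
 * sites that care about global reclaim heuristics (as opposed to which
 * LRU list the pages come from) should test this.
 */
static bool global_reclaim(struct scan_control *sc)
{
	return !sc->memcg;
}

/*
 * scanning_global_lru() - true while the pages currently being scanned
 * come off the global LRU lists rather than a memcg's private lists.
 * This is purely about list selection: once global reclaim iterates
 * the memcg hierarchy, it is still global reclaim but no longer
 * scanning the global LRU.
 */
static bool scanning_global_lru(struct scan_control *sc)
{
	return !sc->current_memcg;
}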

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 3/6] mm: memcg-aware global reclaim
  2011-05-12 14:53   ` Johannes Weiner
@ 2011-05-13  0:04     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 83+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-13  0:04 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Daisuke Nishimura, Balbir Singh, Ying Han, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Thu, 12 May 2011 16:53:55 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> A page charged to a memcg is linked to a lru list specific to that
> memcg.  At the same time, traditional global reclaim is obvlivious to
> memcgs, and all the pages are also linked to a global per-zone list.
> 
> This patch changes traditional global reclaim to iterate over all
> existing memcgs, so that it no longer relies on the global list being
> present.
> 
> This is one step forward in integrating memcg code better into the
> rest of memory management.  It is also a prerequisite to get rid of
> the global per-zone lru lists.
> 


As I said, I don't want to remove global reclaim until we have dirty_ratio
support and a better softlimit algorithm, at least. My current concern is
dirty_ratio; if you want to speed things up, please help Greg and implement
dirty_ratio first.

BTW, could you separate the cleanup code from your new logic? The first half
of the code seems to be just cleanup and looks nice. But, IIUC, someone
changed the arguments from a chunk of params into flags... in some patch.
...
commit 75822b4495b62e8721e9b88e3cf9e653a0c85b73
Author: Balbir Singh <balbir@linux.vnet.ibm.com>
Date:   Wed Sep 23 15:56:38 2009 -0700

    memory controller: soft limit refactor reclaim flags

    Refactor mem_cgroup_hierarchical_reclaim()

    Refactor the arguments passed to mem_cgroup_hierarchical_reclaim() into
    flags, so that new parameters don't have to be passed as we make the
    reclaim routine more flexible

...

Balbir?  Both are OK with me, please ask him.


And hmm...

+	do {
+		mem_cgroup_hierarchy_walk(root, &mem);
+		sc->current_memcg = mem;
+		do_shrink_zone(priority, zone, sc);
+	} while (mem != root);

This moves the hierarchy walk from memcontrol.c to vmscan.c?

About moving the hierarchy walk, I may say okay... because my patch does
this, too.

But... doesn't this reclaim too much memory if the hierarchy is very deep?
Could you add some 'quit' path?
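
To make the question concrete, one conceivable shape of such a 'quit' path,
sketched against the loop quoted above. This is an illustrative sketch only;
the bail-out condition, the css reference handling and any resume
bookkeeping are exactly what is in question here.

static void shrink_zone(int priority, struct zone *zone,
			struct scan_control *sc)
{
	struct mem_cgroup *root = sc->memcg;
	struct mem_cgroup *mem = NULL;

	do {
		mem_cgroup_hierarchy_walk(root, &mem);
		sc->current_memcg = mem;
		do_shrink_zone(priority, zone, sc);
		/*
		 * Possible 'quit' path: stop walking once this
		 * invocation has reclaimed enough.  nr_to_reclaim is
		 * ULONG_MAX for kswapd, so this mostly bounds direct
		 * reclaim.  Note that the walk as posted only drops
		 * its css reference when it reaches the root, so a
		 * real version would need a css_put() here (and would
		 * ideally remember 'mem' as the resume point to keep
		 * the walk fair across invocations).
		 */
		if (mem != root && sc->nr_reclaimed >= sc->nr_to_reclaim)
			break;
	} while (mem != root);

	/* For good measure, noone higher up the stack should look at it */
	sc->current_memcg = NULL;
}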


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 3/6] mm: memcg-aware global reclaim
  2011-05-12 14:53   ` Johannes Weiner
@ 2011-05-13  0:40     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 83+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-13  0:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Daisuke Nishimura, Balbir Singh, Ying Han, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Thu, 12 May 2011 16:53:55 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> A page charged to a memcg is linked to a lru list specific to that
> memcg.  At the same time, traditional global reclaim is obvlivious to
> memcgs, and all the pages are also linked to a global per-zone list.
> 
> This patch changes traditional global reclaim to iterate over all
> existing memcgs, so that it no longer relies on the global list being
> present.
> 
> This is one step forward in integrating memcg code better into the
> rest of memory management.  It is also a prerequisite to get rid of
> the global per-zone lru lists.
> 
> RFC:
> 
> The algorithm implemented in this patch is very naive.  For each zone
> scanned at each priority level, it iterates over all existing memcgs
> and considers them for scanning.
> 
> This is just a prototype and I did not optimize it yet because I am
> unsure about the maximum number of memcgs that still constitute a sane
> configuration in comparison to the machine size.
> 
> It is perfectly fair since all memcgs are scanned at each priority
> level.
> 
> On my 4G quadcore laptop with 1000 memcgs, a significant amount of CPU
> time was spent just iterating memcgs during reclaim.  But it can not
> really be claimed that the old code was much better, either: global
> LRU reclaim could mean that a few hundred memcgs would have been
> emptied out completely, while others stayed untouched.
> 
> I am open to solutions that trade fairness against CPU-time but don't
> want to have an extreme in either direction.  Maybe break out early if
> a number of memcgs has been successfully reclaimed from and remember
> the last one scanned.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/memcontrol.h |    7 ++
>  mm/memcontrol.c            |  148 +++++++++++++++++++++++++++++---------------
>  mm/vmscan.c                |   21 +++++--
>  3 files changed, 120 insertions(+), 56 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5e9840f5..58728c7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -104,6 +104,7 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
>  /*
>   * For memory reclaim.
>   */
> +void mem_cgroup_hierarchy_walk(struct mem_cgroup *, struct mem_cgroup **);
>  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
>  int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> @@ -289,6 +290,12 @@ static inline bool mem_cgroup_disabled(void)
>  	return true;
>  }
>  
> +static inline void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
> +					     struct mem_cgroup **iter)
> +{
> +	*iter = start;
> +}
> +
>  static inline int
>  mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bf5ab87..edcd55a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -313,7 +313,7 @@ static bool move_file(void)
>  }
>  
>  /*
> - * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
> + * Maximum loops in mem_cgroup_soft_reclaim(), used for soft
>   * limit reclaim to prevent infinite loops, if they ever occur.
>   */
>  #define	MEM_CGROUP_MAX_RECLAIM_LOOPS		(100)
> @@ -339,16 +339,6 @@ enum charge_type {
>  /* Used for OOM nofiier */
>  #define OOM_CONTROL		(0)
>  
> -/*
> - * Reclaim flags for mem_cgroup_hierarchical_reclaim
> - */
> -#define MEM_CGROUP_RECLAIM_NOSWAP_BIT	0x0
> -#define MEM_CGROUP_RECLAIM_NOSWAP	(1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
> -#define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
> -#define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
> -#define MEM_CGROUP_RECLAIM_SOFT_BIT	0x2
> -#define MEM_CGROUP_RECLAIM_SOFT		(1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
> -
>  static void mem_cgroup_get(struct mem_cgroup *mem);
>  static void mem_cgroup_put(struct mem_cgroup *mem);
>  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
> @@ -1381,6 +1371,86 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
>  	return min(limit, memsw);
>  }
>  
> +void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
> +			       struct mem_cgroup **iter)
> +{
> +	struct mem_cgroup *mem = *iter;
> +	int id;
> +
> +	if (!start)
> +		start = root_mem_cgroup;
> +	/*
> +	 * Even without hierarchy explicitely enabled in the root
> +	 * memcg, it is the ultimate parent of all memcgs.
> +	 */
> +	if (!(start == root_mem_cgroup || start->use_hierarchy)) {
> +		*iter = start;
> +		return;
> +	}
> +
> +	if (!mem)
> +		id = css_id(&start->css);
> +	else {
> +		id = css_id(&mem->css);
> +		css_put(&mem->css);
> +		mem = NULL;
> +	}
> +
> +	do {
> +		struct cgroup_subsys_state *css;
> +
> +		rcu_read_lock();
> +		css = css_get_next(&mem_cgroup_subsys, id+1, &start->css, &id);
> +		/*
> +		 * The caller must already have a reference to the
> +		 * starting point of this hierarchy walk, do not grab
> +		 * another one.  This way, the loop can be finished
> +		 * when the hierarchy root is returned, without any
> +		 * further cleanup required.
> +		 */
> +		if (css && (css == &start->css || css_tryget(css)))
> +			mem = container_of(css, struct mem_cgroup, css);
> +		rcu_read_unlock();
> +		if (!css)
> +			id = 0;
> +	} while (!mem);
> +
> +	if (mem == root_mem_cgroup)
> +		mem = NULL;
> +
> +	*iter = mem;
> +}
> +
> +static unsigned long mem_cgroup_target_reclaim(struct mem_cgroup *mem,
> +					       gfp_t gfp_mask,
> +					       bool noswap,
> +					       bool shrink)
> +{
> +	unsigned long total = 0;
> +	int loop;
> +
> +	if (mem->memsw_is_minimum)
> +		noswap = true;
> +
> +	for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
> +		drain_all_stock_async();
> +		total += try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap,
> +						      get_swappiness(mem));
> +		if (total && shrink)
> +			break;
> +		if (mem_cgroup_margin(mem))
> +			break;
> +		/*
> +		 * If we have not been able to reclaim anything after
> +		 * two reclaim attempts, there may be no reclaimable
> +		 * pages under this hierarchy.
> +		 */
> +		if (loop && !total)
> +			break;
> +	}
> +	return total;
> +}
> +
>  /*
>   * Visit the first child (need not be the first child as per the ordering
>   * of the cgroup list, since we track last_scanned_child) of @mem and use
> @@ -1427,21 +1497,16 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
>   *
>   * We give up and return to the caller when we visit root_mem twice.
>   * (other groups can be removed while we're walking....)
> - *
> - * If shrink==true, for avoiding to free too much, this returns immedieately.
>   */
> -static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> -						struct zone *zone,
> -						gfp_t gfp_mask,
> -						unsigned long reclaim_options)
> +static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_mem,
> +				   struct zone *zone,
> +				   gfp_t gfp_mask)
>  {
>  	struct mem_cgroup *victim;
>  	int ret, total = 0;
>  	int loop = 0;
> -	bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
> -	bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
> -	bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
>  	unsigned long excess;
> +	bool noswap = false;
>  
>  	excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
>  
> @@ -1461,7 +1526,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  				 * anything, it might because there are
>  				 * no reclaimable pages under this hierarchy
>  				 */
> -				if (!check_soft || !total) {
> +				if (!total) {
>  					css_put(&victim->css);
>  					break;
>  				}
> @@ -1484,25 +1549,11 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  			continue;
>  		}
>  		/* we use swappiness of local cgroup */
> -		if (check_soft)
> -			ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> +		ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
>  				noswap, get_swappiness(victim), zone);
> -		else
> -			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> -						noswap, get_swappiness(victim));
>  		css_put(&victim->css);
> -		/*
> -		 * At shrinking usage, we can't check we should stop here or
> -		 * reclaim more. It's depends on callers. last_scanned_child
> -		 * will work enough for keeping fairness under tree.
> -		 */
> -		if (shrink)
> -			return ret;
>  		total += ret;
> -		if (check_soft) {
> -			if (!res_counter_soft_limit_excess(&root_mem->res))
> -				return total;
> -		} else if (mem_cgroup_margin(root_mem))
> +		if (!res_counter_soft_limit_excess(&root_mem->res))
>  			return total;
>  	}
>  	return total;
> @@ -1897,7 +1948,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
>  	unsigned long csize = nr_pages * PAGE_SIZE;
>  	struct mem_cgroup *mem_over_limit;
>  	struct res_counter *fail_res;
> -	unsigned long flags = 0;
> +	bool noswap = false;
>  	int ret;
>  
>  	ret = res_counter_charge(&mem->res, csize, &fail_res);
> @@ -1911,7 +1962,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
>  
>  		res_counter_uncharge(&mem->res, csize);
>  		mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
> -		flags |= MEM_CGROUP_RECLAIM_NOSWAP;
> +		noswap = true;
>  	} else
>  		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
>  	/*
> @@ -1927,8 +1978,8 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
>  	if (!(gfp_mask & __GFP_WAIT))
>  		return CHARGE_WOULDBLOCK;
>  
> -	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> -					      gfp_mask, flags);
> +	ret = mem_cgroup_target_reclaim(mem_over_limit, gfp_mask,
> +					noswap, false);
>  	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
>  		return CHARGE_RETRY;
>  	/*
> @@ -3085,7 +3136,7 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
>  
>  /*
>   * A call to try to shrink memory usage on charge failure at shmem's swapin.
> - * Calling hierarchical_reclaim is not enough because we should update
> + * Calling target_reclaim is not enough because we should update
>   * last_oom_jiffies to prevent pagefault_out_of_memory from invoking global OOM.
>   * Moreover considering hierarchy, we should reclaim from the mem_over_limit,
>   * not from the memcg which this page would be charged to.
> @@ -3167,7 +3218,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>  	int enlarge;
>  
>  	/*
> -	 * For keeping hierarchical_reclaim simple, how long we should retry
> +	 * For keeping target_reclaim simple, how long we should retry
>  	 * is depends on callers. We set our retry-count to be function
>  	 * of # of children which we should visit in this loop.
>  	 */
> @@ -3210,8 +3261,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>  		if (!ret)
>  			break;
>  
> -		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
> -						MEM_CGROUP_RECLAIM_SHRINK);
> +		mem_cgroup_target_reclaim(memcg, GFP_KERNEL, false, false);
>  		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
>  		/* Usage is reduced ? */
>    		if (curusage >= oldusage)
> @@ -3269,9 +3319,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
>  		if (!ret)
>  			break;
>  
> -		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
> -						MEM_CGROUP_RECLAIM_NOSWAP |
> -						MEM_CGROUP_RECLAIM_SHRINK);
> +		mem_cgroup_target_reclaim(memcg, GFP_KERNEL, true, false);
>  		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
>  		/* Usage is reduced ? */
>  		if (curusage >= oldusage)
> @@ -3311,9 +3359,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  		if (!mz)
>  			break;
>  
> -		reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
> -						gfp_mask,
> -						MEM_CGROUP_RECLAIM_SOFT);
> +		reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone, gfp_mask);
>  		nr_reclaimed += reclaimed;
>  		spin_lock(&mctz->lock);
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ceeb2a5..e2a3647 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1900,8 +1900,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
>  /*
>   * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
>   */
> -static void shrink_zone(int priority, struct zone *zone,
> -				struct scan_control *sc)
> +static void do_shrink_zone(int priority, struct zone *zone,
> +			   struct scan_control *sc)
>  {
>  	unsigned long nr[NR_LRU_LISTS];
>  	unsigned long nr_to_scan;
> @@ -1914,8 +1914,6 @@ restart:
>  	nr_scanned = sc->nr_scanned;
>  	get_scan_count(zone, sc, nr, priority);
>  
> -	sc->current_memcg = sc->memcg;
> -
>  	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
>  					nr[LRU_INACTIVE_FILE]) {
>  		for_each_evictable_lru(l) {
> @@ -1954,6 +1952,19 @@ restart:
>  		goto restart;
>  
>  	throttle_vm_writeout(sc->gfp_mask);
> +}
> +
> +static void shrink_zone(int priority, struct zone *zone,
> +			struct scan_control *sc)
> +{
> +	struct mem_cgroup *root = sc->memcg;
> +	struct mem_cgroup *mem = NULL;
> +
> +	do {
> +		mem_cgroup_hierarchy_walk(root, &mem);
> +		sc->current_memcg = mem;
> +		do_shrink_zone(priority, zone, sc);

If I don't miss something, css_put() against mem->css will be required somewhere.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 3/6] mm: memcg-aware global reclaim
@ 2011-05-13  0:40     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 83+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-13  0:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Daisuke Nishimura, Balbir Singh, Ying Han, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Thu, 12 May 2011 16:53:55 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> A page charged to a memcg is linked to a lru list specific to that
> memcg.  At the same time, traditional global reclaim is obvlivious to
> memcgs, and all the pages are also linked to a global per-zone list.
> 
> This patch changes traditional global reclaim to iterate over all
> existing memcgs, so that it no longer relies on the global list being
> present.
> 
> This is one step forward in integrating memcg code better into the
> rest of memory management.  It is also a prerequisite to get rid of
> the global per-zone lru lists.
> 
> RFC:
> 
> The algorithm implemented in this patch is very naive.  For each zone
> scanned at each priority level, it iterates over all existing memcgs
> and considers them for scanning.
> 
> This is just a prototype and I did not optimize it yet because I am
> unsure about the maximum number of memcgs that still constitute a sane
> configuration in comparison to the machine size.
> 
> It is perfectly fair since all memcgs are scanned at each priority
> level.
> 
> On my 4G quadcore laptop with 1000 memcgs, a significant amount of CPU
> time was spent just iterating memcgs during reclaim.  But it can not
> really be claimed that the old code was much better, either: global
> LRU reclaim could mean that a few hundred memcgs would have been
> emptied out completely, while others stayed untouched.
> 
> I am open to solutions that trade fairness against CPU-time but don't
> want to have an extreme in either direction.  Maybe break out early if
> a number of memcgs has been successfully reclaimed from and remember
> the last one scanned.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/memcontrol.h |    7 ++
>  mm/memcontrol.c            |  148 +++++++++++++++++++++++++++++---------------
>  mm/vmscan.c                |   21 +++++--
>  3 files changed, 120 insertions(+), 56 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5e9840f5..58728c7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -104,6 +104,7 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
>  /*
>   * For memory reclaim.
>   */
> +void mem_cgroup_hierarchy_walk(struct mem_cgroup *, struct mem_cgroup **);
>  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
>  int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> @@ -289,6 +290,12 @@ static inline bool mem_cgroup_disabled(void)
>  	return true;
>  }
>  
> +static inline void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
> +					     struct mem_cgroup **iter)
> +{
> +	*iter = start;
> +}
> +
>  static inline int
>  mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bf5ab87..edcd55a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -313,7 +313,7 @@ static bool move_file(void)
>  }
>  
>  /*
> - * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
> + * Maximum loops in mem_cgroup_soft_reclaim(), used for soft
>   * limit reclaim to prevent infinite loops, if they ever occur.
>   */
>  #define	MEM_CGROUP_MAX_RECLAIM_LOOPS		(100)
> @@ -339,16 +339,6 @@ enum charge_type {
>  /* Used for OOM nofiier */
>  #define OOM_CONTROL		(0)
>  
> -/*
> - * Reclaim flags for mem_cgroup_hierarchical_reclaim
> - */
> -#define MEM_CGROUP_RECLAIM_NOSWAP_BIT	0x0
> -#define MEM_CGROUP_RECLAIM_NOSWAP	(1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
> -#define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
> -#define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
> -#define MEM_CGROUP_RECLAIM_SOFT_BIT	0x2
> -#define MEM_CGROUP_RECLAIM_SOFT		(1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
> -
>  static void mem_cgroup_get(struct mem_cgroup *mem);
>  static void mem_cgroup_put(struct mem_cgroup *mem);
>  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
> @@ -1381,6 +1371,86 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
>  	return min(limit, memsw);
>  }
>  
> +void mem_cgroup_hierarchy_walk(struct mem_cgroup *start,
> +			       struct mem_cgroup **iter)
> +{
> +	struct mem_cgroup *mem = *iter;
> +	int id;
> +
> +	if (!start)
> +		start = root_mem_cgroup;
> +	/*
> +	 * Even without hierarchy explicitely enabled in the root
> +	 * memcg, it is the ultimate parent of all memcgs.
> +	 */
> +	if (!(start == root_mem_cgroup || start->use_hierarchy)) {
> +		*iter = start;
> +		return;
> +	}
> +
> +	if (!mem)
> +		id = css_id(&start->css);
> +	else {
> +		id = css_id(&mem->css);
> +		css_put(&mem->css);
> +		mem = NULL;
> +	}
> +
> +	do {
> +		struct cgroup_subsys_state *css;
> +
> +		rcu_read_lock();
> +		css = css_get_next(&mem_cgroup_subsys, id+1, &start->css, &id);
> +		/*
> +		 * The caller must already have a reference to the
> +		 * starting point of this hierarchy walk, do not grab
> +		 * another one.  This way, the loop can be finished
> +		 * when the hierarchy root is returned, without any
> +		 * further cleanup required.
> +		 */
> +		if (css && (css == &start->css || css_tryget(css)))
> +			mem = container_of(css, struct mem_cgroup, css);
> +		rcu_read_unlock();
> +		if (!css)
> +			id = 0;
> +	} while (!mem);
> +
> +	if (mem == root_mem_cgroup)
> +		mem = NULL;
> +
> +	*iter = mem;
> +}
> +
> +static unsigned long mem_cgroup_target_reclaim(struct mem_cgroup *mem,
> +					       gfp_t gfp_mask,
> +					       bool noswap,
> +					       bool shrink)
> +{
> +	unsigned long total = 0;
> +	int loop;
> +
> +	if (mem->memsw_is_minimum)
> +		noswap = true;
> +
> +	for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
> +		drain_all_stock_async();
> +		total += try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap,
> +						      get_swappiness(mem));
> +		if (total && shrink)
> +			break;
> +		if (mem_cgroup_margin(mem))
> +			break;
> +		/*
> +		 * If we have not been able to reclaim anything after
> +		 * two reclaim attempts, there may be no reclaimable
> +		 * pages under this hierarchy.
> +		 */
> +		if (loop && !total)
> +			break;
> +	}
> +	return total;
> +}
> +
>  /*
>   * Visit the first child (need not be the first child as per the ordering
>   * of the cgroup list, since we track last_scanned_child) of @mem and use
> @@ -1427,21 +1497,16 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
>   *
>   * We give up and return to the caller when we visit root_mem twice.
>   * (other groups can be removed while we're walking....)
> - *
> - * If shrink==true, for avoiding to free too much, this returns immedieately.
>   */
> -static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> -						struct zone *zone,
> -						gfp_t gfp_mask,
> -						unsigned long reclaim_options)
> +static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_mem,
> +				   struct zone *zone,
> +				   gfp_t gfp_mask)
>  {
>  	struct mem_cgroup *victim;
>  	int ret, total = 0;
>  	int loop = 0;
> -	bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
> -	bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
> -	bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
>  	unsigned long excess;
> +	bool noswap = false;
>  
>  	excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
>  
> @@ -1461,7 +1526,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  				 * anything, it might because there are
>  				 * no reclaimable pages under this hierarchy
>  				 */
> -				if (!check_soft || !total) {
> +				if (!total) {
>  					css_put(&victim->css);
>  					break;
>  				}
> @@ -1484,25 +1549,11 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  			continue;
>  		}
>  		/* we use swappiness of local cgroup */
> -		if (check_soft)
> -			ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> +		ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
>  				noswap, get_swappiness(victim), zone);
> -		else
> -			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> -						noswap, get_swappiness(victim));
>  		css_put(&victim->css);
> -		/*
> -		 * At shrinking usage, we can't check we should stop here or
> -		 * reclaim more. It's depends on callers. last_scanned_child
> -		 * will work enough for keeping fairness under tree.
> -		 */
> -		if (shrink)
> -			return ret;
>  		total += ret;
> -		if (check_soft) {
> -			if (!res_counter_soft_limit_excess(&root_mem->res))
> -				return total;
> -		} else if (mem_cgroup_margin(root_mem))
> +		if (!res_counter_soft_limit_excess(&root_mem->res))
>  			return total;
>  	}
>  	return total;
> @@ -1897,7 +1948,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
>  	unsigned long csize = nr_pages * PAGE_SIZE;
>  	struct mem_cgroup *mem_over_limit;
>  	struct res_counter *fail_res;
> -	unsigned long flags = 0;
> +	bool noswap = false;
>  	int ret;
>  
>  	ret = res_counter_charge(&mem->res, csize, &fail_res);
> @@ -1911,7 +1962,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
>  
>  		res_counter_uncharge(&mem->res, csize);
>  		mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
> -		flags |= MEM_CGROUP_RECLAIM_NOSWAP;
> +		noswap = true;
>  	} else
>  		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
>  	/*
> @@ -1927,8 +1978,8 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
>  	if (!(gfp_mask & __GFP_WAIT))
>  		return CHARGE_WOULDBLOCK;
>  
> -	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> -					      gfp_mask, flags);
> +	ret = mem_cgroup_target_reclaim(mem_over_limit, gfp_mask,
> +					noswap, false);
>  	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
>  		return CHARGE_RETRY;
>  	/*
> @@ -3085,7 +3136,7 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
>  
>  /*
>   * A call to try to shrink memory usage on charge failure at shmem's swapin.
> - * Calling hierarchical_reclaim is not enough because we should update
> + * Calling target_reclaim is not enough because we should update
>   * last_oom_jiffies to prevent pagefault_out_of_memory from invoking global OOM.
>   * Moreover considering hierarchy, we should reclaim from the mem_over_limit,
>   * not from the memcg which this page would be charged to.
> @@ -3167,7 +3218,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>  	int enlarge;
>  
>  	/*
> -	 * For keeping hierarchical_reclaim simple, how long we should retry
> +	 * For keeping target_reclaim simple, how long we should retry
>  	 * is depends on callers. We set our retry-count to be function
>  	 * of # of children which we should visit in this loop.
>  	 */
> @@ -3210,8 +3261,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>  		if (!ret)
>  			break;
>  
> -		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
> -						MEM_CGROUP_RECLAIM_SHRINK);
> +		mem_cgroup_target_reclaim(memcg, GFP_KERNEL, false, false);
>  		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
>  		/* Usage is reduced ? */
>    		if (curusage >= oldusage)
> @@ -3269,9 +3319,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
>  		if (!ret)
>  			break;
>  
> -		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
> -						MEM_CGROUP_RECLAIM_NOSWAP |
> -						MEM_CGROUP_RECLAIM_SHRINK);
> +		mem_cgroup_target_reclaim(memcg, GFP_KERNEL, true, false);
>  		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
>  		/* Usage is reduced ? */
>  		if (curusage >= oldusage)
> @@ -3311,9 +3359,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  		if (!mz)
>  			break;
>  
> -		reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
> -						gfp_mask,
> -						MEM_CGROUP_RECLAIM_SOFT);
> +		reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone, gfp_mask);
>  		nr_reclaimed += reclaimed;
>  		spin_lock(&mctz->lock);
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ceeb2a5..e2a3647 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1900,8 +1900,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
>  /*
>   * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
>   */
> -static void shrink_zone(int priority, struct zone *zone,
> -				struct scan_control *sc)
> +static void do_shrink_zone(int priority, struct zone *zone,
> +			   struct scan_control *sc)
>  {
>  	unsigned long nr[NR_LRU_LISTS];
>  	unsigned long nr_to_scan;
> @@ -1914,8 +1914,6 @@ restart:
>  	nr_scanned = sc->nr_scanned;
>  	get_scan_count(zone, sc, nr, priority);
>  
> -	sc->current_memcg = sc->memcg;
> -
>  	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
>  					nr[LRU_INACTIVE_FILE]) {
>  		for_each_evictable_lru(l) {
> @@ -1954,6 +1952,19 @@ restart:
>  		goto restart;
>  
>  	throttle_vm_writeout(sc->gfp_mask);
> +}
> +
> +static void shrink_zone(int priority, struct zone *zone,
> +			struct scan_control *sc)
> +{
> +	struct mem_cgroup *root = sc->memcg;
> +	struct mem_cgroup *mem = NULL;
> +
> +	do {
> +		mem_cgroup_hierarchy_walk(root, &mem);
> +		sc->current_memcg = mem;
> +		do_shrink_zone(priority, zone, sc);

If I don't miss something, css_put() against mem->css will be required somewhere.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 3/6] mm: memcg-aware global reclaim
  2011-05-13  0:40     ` KAMEZAWA Hiroyuki
@ 2011-05-13  6:54       ` Johannes Weiner
  -1 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2011-05-13  6:54 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Balbir Singh, Ying Han, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Fri, May 13, 2011 at 09:40:50AM +0900, KAMEZAWA Hiroyuki wrote:
> > @@ -1954,6 +1952,19 @@ restart:
> >  		goto restart;
> >  
> >  	throttle_vm_writeout(sc->gfp_mask);
> > +}
> > +
> > +static void shrink_zone(int priority, struct zone *zone,
> > +			struct scan_control *sc)
> > +{
> > +	struct mem_cgroup *root = sc->memcg;
> > +	struct mem_cgroup *mem = NULL;
> > +
> > +	do {
> > +		mem_cgroup_hierarchy_walk(root, &mem);
> > +		sc->current_memcg = mem;
> > +		do_shrink_zone(priority, zone, sc);
> 
> Unless I am missing something, a css_put() against mem->css will be required somewhere.

That's a bit of a hack.  mem_cgroup_hierarchy_walk() always does
css_put() on *mem before advancing to the next child.

At the last iteration, it returns mem == root.  Since the caller must
have a reference on root to begin with, it does not css_get() root.

So when mem == root, there are no outstanding references from the walk
anymore.

This only works since it always does the full hierarchy walk, so it's
going away anyway when the hierarchy walk becomes intermittent.
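
For reference, here is the loop in question again, with the reference
counting protocol described above spelled out as comments (identifiers
as in the posted diff; this just restates the patch, it is not new code):

static void shrink_zone(int priority, struct zone *zone,
			struct scan_control *sc)
{
	struct mem_cgroup *root = sc->memcg;	/* caller already holds a reference */
	struct mem_cgroup *mem = NULL;

	do {
		/*
		 * Drops the reference on the previous *mem (unless it was
		 * root) and returns the next child with a reference held;
		 * the final call returns root itself without taking a new
		 * reference.
		 */
		mem_cgroup_hierarchy_walk(root, &mem);
		sc->current_memcg = mem;
		do_shrink_zone(priority, zone, sc);
	} while (mem != root);	/* ends with mem == root: nothing left to put */
}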

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 2/6] vmscan: make distinction between memcg reclaim and LRU list selection
  2011-05-12 23:50     ` KAMEZAWA Hiroyuki
@ 2011-05-13  6:58       ` Johannes Weiner
  -1 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2011-05-13  6:58 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Balbir Singh, Ying Han, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Fri, May 13, 2011 at 08:50:27AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 12 May 2011 16:53:54 +0200
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > The reclaim code has a single predicate for whether it currently
> > reclaims on behalf of a memory cgroup, as well as whether it is
> > reclaiming from the global LRU list or a memory cgroup LRU list.
> > 
> > Up to now, both cases always coincide, but subsequent patches will
> > change things such that global reclaim will scan memory cgroup lists.
> > 
> > This patch adds a new predicate that tells global reclaim from memory
> > cgroup reclaim, and then changes all callsites that are actually about
> > global reclaim heuristics rather than strict LRU list selection.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> 
> Hmm, isn't it better to merge this into the patches where the meaning of
> the new variable becomes clearer?

I apologize for the confusing order.  I am going to merge them.

> >  mm/vmscan.c |   96 ++++++++++++++++++++++++++++++++++------------------------
> >  1 files changed, 56 insertions(+), 40 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index f6b435c..ceeb2a5 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -104,8 +104,12 @@ struct scan_control {
> >  	 */
> >  	reclaim_mode_t reclaim_mode;
> >  
> > -	/* Which cgroup do we reclaim from */
> > -	struct mem_cgroup *mem_cgroup;
> > +	/*
> > +	 * The memory cgroup we reclaim on behalf of, and the one we
> > +	 * are currently reclaiming from.
> > +	 */
> > +	struct mem_cgroup *memcg;
> > +	struct mem_cgroup *current_memcg;
> >  
> 
> I wonder if the patch would be clearer if you avoided renaming the
> existing one...

I renamed it mostly because I thought current_mem_cgroup too long.
It's probably best if both get more descriptive names.

> > @@ -154,16 +158,24 @@ static LIST_HEAD(shrinker_list);
> >  static DECLARE_RWSEM(shrinker_rwsem);
> >  
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > -#define scanning_global_lru(sc)	(!(sc)->mem_cgroup)
> > +static bool global_reclaim(struct scan_control *sc)
> > +{
> > +	return !sc->memcg;
> > +}
> > +static bool scanning_global_lru(struct scan_control *sc)
> > +{
> > +	return !sc->current_memcg;
> > +}
> 
> 
> Could you add comments?

Yes, I will.

Thanks for your input!

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 3/6] mm: memcg-aware global reclaim
  2011-05-12 19:19   ` Ying Han
@ 2011-05-13  7:08       ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2011-05-13  7:08 UTC (permalink / raw)
  To: Ying Han
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Thu, May 12, 2011 at 12:19:45PM -0700, Ying Han wrote:
> On Thu, May 12, 2011 at 7:53 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > A page charged to a memcg is linked to a lru list specific to that
> > memcg.  At the same time, traditional global reclaim is oblivious to
> > memcgs, and all the pages are also linked to a global per-zone list.
> >
> > This patch changes traditional global reclaim to iterate over all
> > existing memcgs, so that it no longer relies on the global list being
> > present.
> >
> 
> > This is one step forward in integrating memcg code better into the
> > rest of memory management.  It is also a prerequisite to get rid
> > of the global per-zone lru lists.
> >
> Sorry if I misunderstood something here. I assume this patch has not
> much to do with the global soft_limit reclaim, but only allows the
> system to scan the per-memcg lru under global memory pressure.

I see you found 6/6 in the meantime :) Did it answer your question?

> > The algorithm implemented in this patch is very naive.  For each zone
> > scanned at each priority level, it iterates over all existing memcgs
> > and considers them for scanning.
> >
> > This is just a prototype and I did not optimize it yet because I am
> > unsure about the maximum number of memcgs that still constitute a sane
> > configuration in comparison to the machine size.
> 
> So do we also scan a memcg which has no pages allocated on this zone? I
> will read the following patches in case I missed something here :)

The old hierarchy walk skipped a memcg if it had no local pages at
all.  I thought this was a rather unlikely situation and ripped it
out.

It will not loop persistently over a specific memcg and node
combination, like soft limit reclaim does at the moment.

Since this is integrated much more deeply into memory reclaim now, it
benefits from all the existing mechanisms and will calculate the scan
target based on the number of lru pages on memcg->zone->lru, and do
nothing if there are no pages there.
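
Roughly, the effect is the following (illustrative only: the wrapper
function below is made up, mem_cgroup_zone_nr_pages() is the per-memcg,
per-zone LRU counter as far as I remember, and the real logic lives in
get_scan_count()):

static unsigned long memcg_zone_scan_target(struct mem_cgroup *mem,
					    struct zone *zone,
					    enum lru_list lru, int priority)
{
	/*
	 * Scan target scales with the size of this memcg's LRU list in
	 * this zone; an empty memcg/zone pair yields zero and is skipped.
	 */
	return mem_cgroup_zone_nr_pages(mem, zone, lru) >> priority;
}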

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 3/6] mm: memcg-aware global reclaim
  2011-05-13  0:04     ` KAMEZAWA Hiroyuki
@ 2011-05-13  7:18       ` Johannes Weiner
  -1 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2011-05-13  7:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Balbir Singh, Ying Han, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Fri, May 13, 2011 at 09:04:50AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 12 May 2011 16:53:55 +0200
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > A page charged to a memcg is linked to a lru list specific to that
> > memcg.  At the same time, traditional global reclaim is oblivious to
> > memcgs, and all the pages are also linked to a global per-zone list.
> > 
> > This patch changes traditional global reclaim to iterate over all
> > existing memcgs, so that it no longer relies on the global list being
> > present.
> > 
> > This is one step forward in integrating memcg code better into the
> > rest of memory management.  It is also a prerequisite to get rid of
> > the global per-zone lru lists.
> 
> As I said, I don't want to remove global reclaim until we have dirty_ratio
> support and a better softlimit algorithm, at least. My current concern is
> dirty_ratio; if you want to speed things up, please help Greg implement dirty_ratio first.

As I said, I am not proposing this for integration now.  It was more
like asking if people were okay with this direction before we put
things in place that could be in the way of the long-term plan.

Note that 6/6 is an attempt to improve the soft limit algorithm.

> BTW, could you separate the cleanup code from your new logic? The 1st half of
> the code seems to be just a cleanup and looks nice. But, IIUC, someone
> changed the arguments from a chunk of params to flags....in some patch.

Sorry again, I know that the series is pretty unorganized.

> +	do {
> +		mem_cgroup_hierarchy_walk(root, &mem);
> +		sc->current_memcg = mem;
> +		do_shrink_zone(priority, zone, sc);
> +	} while (mem != root);
> 
> This moves the hierarchy walk from memcontrol.c to vmscan.c?
> 
> About moving the hierarchy walk, I may say okay...because my patch does this, too.
> 
> But....doesn't this reclaim too much memory if the hierarchy is very deep?
> Could you add some 'quit' path?

Yes, I think I'll just reinstate the logic from
mem_cgroup_select_victim() to remember the last child, and add an exit
condition based on the number of reclaimed pages.

This was also suggested by Rik in this thread already.
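
A hypothetical sketch of such an exit condition (not code from the
series; it also glosses over the reference counting issue discussed
earlier, which an intermittent walk would have to solve):

	do {
		mem_cgroup_hierarchy_walk(root, &mem);
		sc->current_memcg = mem;
		do_shrink_zone(priority, zone, sc);
		/*
		 * Quit early once enough has been reclaimed; a real
		 * implementation would also remember the last scanned
		 * child, like mem_cgroup_select_victim() does today.
		 */
		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
			break;
	} while (mem != root);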

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 0/6] mm: memcg naturalization
  2011-05-12 18:53 ` [rfc patch 0/6] mm: memcg naturalization Ying Han
@ 2011-05-13  7:20     ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2011-05-13  7:20 UTC (permalink / raw)
  To: Ying Han
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Thu, May 12, 2011 at 11:53:37AM -0700, Ying Han wrote:
> On Thu, May 12, 2011 at 7:53 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > Hi!
> >
> > Here is a patch series that is a result of the memcg discussions on
> > LSF (memcg-aware global reclaim, global lru removal, struct
> > page_cgroup reduction, soft limit implementation) and the recent
> > feature discussions on linux-mm.
> >
> > The long-term idea is to have memcgs no longer bolted to the side of
> > the mm code, but integrate it as much as possible such that there is a
> > native understanding of containers, and that the traditional !memcg
> > setup is just a singular group.  This series is an approach in that
> > direction.
> >
> > It is a rather early snapshot, WIP, barely tested etc., but I wanted
> > to get your opinions before further pursuing it.  It is also part of
> > my counter-argument to the proposals of adding memcg-reclaim-related
> > user interfaces at this point in time, so I wanted to push this out
> > the door before things are merged into .40.
> >
> 
> The memcg-reclaim-related user interface, I assume, was the configurable
> watermark tunable we were talking about in the per-memcg
> background reclaim patch. I think we got some agreement to remove
> the watermark tunable as a first step. But the newly added
> memory.soft_limit_async_reclaim you proposed seems to be a usable
> interface.

Actually, I meant the soft limit reclaim statistics.  There is a
comment about that in the 6/6 changelog.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 1/6] memcg: remove unused retry signal from reclaim
  2011-05-12 14:53   ` Johannes Weiner
@ 2011-05-13  9:23     ` Michal Hocko
  -1 siblings, 0 replies; 83+ messages in thread
From: Michal Hocko @ 2011-05-13  9:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Thu 12-05-11 16:53:53, Johannes Weiner wrote:
> If the memcg reclaim code detects the target memcg below its limit it
> exits and returns a guaranteed non-zero value so that the charge is
> retried.
> 
> Nowadays, the charge side checks the memcg limit itself and does not
> rely on this non-zero return value trick.
> 
> This patch removes it.  The reclaim code will now always return the
> true number of pages it reclaimed on its own.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Makes sense
Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/memcontrol.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 010f916..bf5ab87 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1503,7 +1503,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  			if (!res_counter_soft_limit_excess(&root_mem->res))
>  				return total;
>  		} else if (mem_cgroup_margin(root_mem))
> -			return 1 + total;
> +			return total;
>  	}
>  	return total;
>  }
> -- 
> 1.7.5.1
> 

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 3/6] mm: memcg-aware global reclaim
  2011-05-12 14:53   ` Johannes Weiner
@ 2011-05-13  9:53     ` Michal Hocko
  -1 siblings, 0 replies; 83+ messages in thread
From: Michal Hocko @ 2011-05-13  9:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Thu 12-05-11 16:53:55, Johannes Weiner wrote:
> A page charged to a memcg is linked to a lru list specific to that
> memcg.  At the same time, traditional global reclaim is oblivious to
> memcgs, and all the pages are also linked to a global per-zone list.
> 
> This patch changes traditional global reclaim to iterate over all
> existing memcgs, so that it no longer relies on the global list being
> present.

At LSF we discussed keeping the over-(soft)limit cgroups in a list which
would be the first target for reclaiming (in round-robin fashion). If we
are not able to reclaim enough from those (the list becomes empty), we
should fall back to reclaiming from all groups (what you did in this
patchset).
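
A hypothetical sketch of that scheme, just to make the idea concrete
(none of this is code from the series; over_soft_limit, soft_limit_link
and the function name are made up, and locking as well as reference
counting are left out):

static LIST_HEAD(over_soft_limit);	/* memcgs currently above their soft limit */

static void shrink_zone_soft_first(int priority, struct zone *zone,
				   struct scan_control *sc)
{
	struct mem_cgroup *mem, *next;

	/* first pass: only the groups that exceed their soft limit */
	list_for_each_entry_safe(mem, next, &over_soft_limit, soft_limit_link) {
		sc->current_memcg = mem;
		do_shrink_zone(priority, zone, sc);
		if (!res_counter_soft_limit_excess(&mem->res))
			list_del_init(&mem->soft_limit_link);	/* back under the limit */
		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
			return;
	}
	/* not enough reclaimed (or the list was empty): fall back to all groups */
	shrink_zone(priority, zone, sc);
}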

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 5/6] memcg: remove global LRU list
  2011-05-12 14:53   ` Johannes Weiner
@ 2011-05-13  9:53     ` Michal Hocko
  -1 siblings, 0 replies; 83+ messages in thread
From: Michal Hocko @ 2011-05-13  9:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Thu 12-05-11 16:53:57, Johannes Weiner wrote:
> Since the VM now has means to do global reclaim from the per-memcg lru
> lists, the global LRU list is no longer required.

Shouldn't this one be at the end of the series?

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 3/6] mm: memcg-aware global reclaim
  2011-05-13  9:53     ` Michal Hocko
@ 2011-05-13 10:28       ` Johannes Weiner
  -1 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2011-05-13 10:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Fri, May 13, 2011 at 11:53:08AM +0200, Michal Hocko wrote:
> On Thu 12-05-11 16:53:55, Johannes Weiner wrote:
> > A page charged to a memcg is linked to a lru list specific to that
> > memcg.  At the same time, traditional global reclaim is oblivious to
> > memcgs, and all the pages are also linked to a global per-zone list.
> > 
> > This patch changes traditional global reclaim to iterate over all
> > existing memcgs, so that it no longer relies on the global list being
> > present.
> 
> At LSF we discussed keeping the over-(soft)limit cgroups in a list which
> would be the first target for reclaiming (in round-robin fashion). If we
> are not able to reclaim enough from those (the list becomes empty), we
> should fall back to reclaiming from all groups (what you did in this
> patchset).

This would be on top of, or instead of, 6/6.  This, 3/6, is independent of
soft limit reclaim.  It is mainly in preparation to remove the global
LRU.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 5/6] memcg: remove global LRU list
  2011-05-13  9:53     ` Michal Hocko
@ 2011-05-13 10:36       ` Johannes Weiner
  -1 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2011-05-13 10:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Fri, May 13, 2011 at 11:53:48AM +0200, Michal Hocko wrote:
> On Thu 12-05-11 16:53:57, Johannes Weiner wrote:
> > Since the VM now has means to do global reclaim from the per-memcg lru
> > lists, the global LRU list is no longer required.
> 
> Shouldn't this one be at the end of the series?

I don't really have an opinion.  Why do you think it should?

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 5/6] memcg: remove global LRU list
  2011-05-13 10:36       ` Johannes Weiner
@ 2011-05-13 11:01         ` Michal Hocko
  -1 siblings, 0 replies; 83+ messages in thread
From: Michal Hocko @ 2011-05-13 11:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Fri 13-05-11 12:36:08, Johannes Weiner wrote:
> On Fri, May 13, 2011 at 11:53:48AM +0200, Michal Hocko wrote:
> > On Thu 12-05-11 16:53:57, Johannes Weiner wrote:
> > > Since the VM now has means to do global reclaim from the per-memcg lru
> > > lists, the global LRU list is no longer required.
> > 
> > Shouldn't this one be at the end of the series?
> 
> I don't really have an opinion.  Why do you think it should?

It is the last step in my eyes, and maybe we want to keep the global
LRU as a fallback for some time, just to get an impression (with some
tracepoints) of how well the per-cgroup reclaim goes.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 3/6] mm: memcg-aware global reclaim
  2011-05-13 10:28       ` Johannes Weiner
@ 2011-05-13 11:02         ` Michal Hocko
  -1 siblings, 0 replies; 83+ messages in thread
From: Michal Hocko @ 2011-05-13 11:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Fri 13-05-11 12:28:58, Johannes Weiner wrote:
> On Fri, May 13, 2011 at 11:53:08AM +0200, Michal Hocko wrote:
> > On Thu 12-05-11 16:53:55, Johannes Weiner wrote:
> > > A page charged to a memcg is linked to a lru list specific to that
> > > memcg.  At the same time, traditional global reclaim is oblivious to
> > > memcgs, and all the pages are also linked to a global per-zone list.
> > > 
> > > This patch changes traditional global reclaim to iterate over all
> > > existing memcgs, so that it no longer relies on the global list being
> > > present.
> > 
> > At LSF we discussed keeping the over-(soft)limit cgroups in a list which
> > would be the first target for reclaiming (in round-robin fashion). If we
> > are not able to reclaim enough from those (the list becomes empty), we
> > should fall back to reclaiming from all groups (what you did in this
> > patchset).
> 
> This would be on top of, or instead of, 6/6.  This, 3/6, is independent of
> soft limit reclaim.  It is mainly in preparation to remove the global
> LRU.

OK.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 0/6] mm: memcg naturalization
  2011-05-12 14:53 ` Johannes Weiner
@ 2011-05-16 10:30   ` Balbir Singh
  -1 siblings, 0 replies; 83+ messages in thread
From: Balbir Singh @ 2011-05-16 10:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Ying Han, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

* Johannes Weiner <hannes@cmpxchg.org> [2011-05-12 16:53:52]:

> Hi!
> 
> Here is a patch series that is a result of the memcg discussions on
> LSF (memcg-aware global reclaim, global lru removal, struct
> page_cgroup reduction, soft limit implementation) and the recent
> feature discussions on linux-mm.
> 
> The long-term idea is to have memcgs no longer bolted to the side of
> the mm code, but integrate it as much as possible such that there is a
> native understanding of containers, and that the traditional !memcg
> setup is just a singular group.  This series is an approach in that
> direction.
> 
> It is a rather early snapshot, WIP, barely tested etc., but I wanted
> to get your opinions before further pursuing it.  It is also part of
> my counter-argument to the proposals of adding memcg-reclaim-related
> user interfaces at this point in time, so I wanted to push this out
> the door before things are merged into .40.
> 
> The patches are quite big, I am still looking for things to factor and
> split out, sorry for this.  Documentation is on its way as well ;)
> 
> #1 and #2 are boring preparational work.  #3 makes traditional reclaim
> in vmscan.c memcg-aware, which is a prerequisite for both removal of
> the global lru in #5 and the way I reimplemented soft limit reclaim in
> #6.

A large part of the acceptance would be based on what the test results
for common mm benchmarks show.

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 0/6] mm: memcg naturalization
  2011-05-16 10:30   ` Balbir Singh
@ 2011-05-16 10:57     ` Johannes Weiner
  -1 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2011-05-16 10:57 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Ying Han, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Mon, May 16, 2011 at 04:00:34PM +0530, Balbir Singh wrote:
> * Johannes Weiner <hannes@cmpxchg.org> [2011-05-12 16:53:52]:
> 
> > Hi!
> > 
> > Here is a patch series that is a result of the memcg discussions on
> > LSF (memcg-aware global reclaim, global lru removal, struct
> > page_cgroup reduction, soft limit implementation) and the recent
> > feature discussions on linux-mm.
> > 
> > The long-term idea is to have memcgs no longer bolted to the side of
> > the mm code, but integrate it as much as possible such that there is a
> > native understanding of containers, and that the traditional !memcg
> > setup is just a singular group.  This series is an approach in that
> > direction.
> > 
> > It is a rather early snapshot, WIP, barely tested etc., but I wanted
> > to get your opinions before further pursuing it.  It is also part of
> > my counter-argument to the proposals of adding memcg-reclaim-related
> > user interfaces at this point in time, so I wanted to push this out
> > the door before things are merged into .40.
> > 
> > The patches are quite big, I am still looking for things to factor and
> > split out, sorry for this.  Documentation is on its way as well ;)
> > 
> > #1 and #2 are boring preparational work.  #3 makes traditional reclaim
> > in vmscan.c memcg-aware, which is a prerequisite for both removal of
> > the global lru in #5 and the way I reimplemented soft limit reclaim in
> > #6.
> 
> A large part of the acceptance would be based on what the test results
> for common mm benchmarks show.

I will try to ensure the following things:

1. will not degrade performance on !CONFIG_MEMCG kernels

2. will not degrade performance on CONFIG_MEMCG kernels without
configured memcgs.  This might be the most important one, as most
desktop/server distributions enable the memory controller by default.

3. will not degrade overall performance of workloads running
concurrently in separate memory control groups.  I expect some shifts,
however, that even out performance differences.

Please let me know what you consider common mm benchmarks.

Thanks!

	Hannes

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 2/6] vmscan: make distinction between memcg reclaim and LRU list selection
  2011-05-13  6:58       ` Johannes Weiner
@ 2011-05-16 22:36         ` Andrew Morton
  -1 siblings, 0 replies; 83+ messages in thread
From: Andrew Morton @ 2011-05-16 22:36 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Fri, 13 May 2011 08:58:54 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> > > @@ -154,16 +158,24 @@ static LIST_HEAD(shrinker_list);
> > >  static DECLARE_RWSEM(shrinker_rwsem);
> > >  
> > >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > > -#define scanning_global_lru(sc)	(!(sc)->mem_cgroup)
> > > +static bool global_reclaim(struct scan_control *sc)
> > > +{
> > > +	return !sc->memcg;
> > > +}
> > > +static bool scanning_global_lru(struct scan_control *sc)
> > > +{
> > > +	return !sc->current_memcg;
> > > +}
> > 
> > 
> > Could you add comments?

oy, that's my job.

> Yes, I will.

> +static bool global_reclaim(struct scan_control *sc) { return 1; }
> +static bool scanning_global_lru(struct scan_control *sc) { return 1; }

s/1/true/

And we may as well format the functions properly?

And it would be nice for the names of the functions to identify what
subsystem they belong to: memcg_global_reclaim() or such.  Although
that's already been a bit messed up in memcg (and in the VM generally).
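
For illustration, the !CONFIG_CGROUP_MEM_RES_CTLR stubs after those
tweaks might look like this (names kept as in the patch; the memcg_
prefix would be a further rename):

static bool global_reclaim(struct scan_control *sc)
{
	return true;
}

static bool scanning_global_lru(struct scan_control *sc)
{
	return true;
}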


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 4/6] memcg: reclaim statistics
  2011-05-12 19:33   ` Ying Han
@ 2011-05-16 23:10       ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2011-05-16 23:10 UTC (permalink / raw)
  To: Ying Han
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Thu, May 12, 2011 at 12:33:50PM -0700, Ying Han wrote:
> On Thu, May 12, 2011 at 7:53 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > TODO: write proper changelog.  Here is an excerpt from
> > http://lkml.kernel.org/r/20110428123652.GM12437@cmpxchg.org:
> >
> > : 1. Limit-triggered direct reclaim
> > :
> > : The memory cgroup hits its limit and the task does direct reclaim from
> > : its own memcg.  We probably want statistics for this separately from
> > : background reclaim to see how successful background reclaim is, the
> > : same reason we have this separation in the global vmstat as well.
> > :
> > :       pgscan_direct_limit
> > :       pgfree_direct_limit
> >
> 
> Can we use "pgsteal_" instead? Not big fan of the naming but want to make
> them consistent to other stats.

Actually, I thought what KAME-san said made sense.  'Stealing' is a
good fit for reclaim due to outside pressure.  But if the memcg is
target-reclaimed from the inside because it hit the limit, is
'stealing' the appropriate term?

> > : 2. Limit-triggered background reclaim
> > :
> > : This is the watermark-based asynchroneous reclaim that is currently in
> > : discussion.  It's triggered by the memcg breaching its watermark,
> > : which is relative to its hard-limit.  I named it kswapd because I
> > : still think kswapd should do this job, but it is all open for
> > : discussion, obviously.  Treat it as meaning 'background' or
> > : 'asynchroneous'.
> > :
> > :       pgscan_kswapd_limit
> > :       pgfree_kswapd_limit
> >
> 
> Kame might have this stats on the per-memcg bg reclaim patch. Just mention
> here since it will make later merge
> a bit harder

I'll have a look, thanks for the heads up.

> > : 3. Hierarchy-triggered direct reclaim
> > :
> > : A condition outside the memcg leads to a task directly reclaiming from
> > : this memcg.  This could be global memory pressure for example, but
> > : also a parent cgroup hitting its limit.  It's probably helpful to
> > : assume global memory pressure meaning that the root cgroup hit its
> > : limit, conceptually.  We don't have that yet, but this could be the
> > : direct softlimit reclaim Ying mentioned above.
> > :
> > :       pgscan_direct_hierarchy
> > :       pgsteal_direct_hierarchy
> >
> 
>  The stats for soft_limit reclaim from global ttfp have been merged in mmotm
> i believe as the following:
> 
> "soft_direct_steal"
> "soft_direct_scan"
> 
> I wonder we might want to separate that out from the other case where the
> reclaim is from the parent triggers its limit.

The way I implemented soft limits in 6/6 is to increase pressure on
exceeding children whenever hierarchical reclaim is taking place.

This changes soft limit from

	Global memory pressure: reclaim from exceeding memcg(s) first

to

	Memory pressure on a memcg: reclaim from all its children,
	with increased pressure on those exceeding their soft limit
	(where global memory pressure means root_mem_cgroup and all
	existing memcgs are considered its children)

which makes the soft limit much more generic and more powerful, as it
allows the admin to prioritize reclaim throughout the hierarchy, not
only for global memory pressure.  Consider one memcg with two
subgroups.  You can now prioritize reclaim to prefer one subgroup over
another through soft limiting.
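
A minimal sketch of that idea (mem_cgroup_over_soft_limit() and the
factor of two are made up for illustration; this is not the actual 6/6
code):

	/* Scale the scan target of a child during the hierarchy walk. */
	static unsigned long soft_limit_scale(struct mem_cgroup *memcg,
					      unsigned long nr_to_scan)
	{
		/* Groups in excess of their soft limit get scanned harder. */
		if (mem_cgroup_over_soft_limit(memcg))
			return nr_to_scan * 2;
		return nr_to_scan;
	}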

This is one reason why I think that the approach of maintaining a
global list of memcgs that exceed their soft limits is an inferior
approach; it does not take the hierarchy into account at all.

This scheme would not provide a natural way of counting pages that
were reclaimed because of the soft limit, and thus I still oppose the
merging of soft limit counters.

	Hannes

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 4/6] memcg: reclaim statistics
  2011-05-16 23:10       ` Johannes Weiner
  (?)
@ 2011-05-17  0:20       ` Ying Han
  2011-05-17  7:42           ` Johannes Weiner
  -1 siblings, 1 reply; 83+ messages in thread
From: Ying Han @ 2011-05-17  0:20 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4629 bytes --]

On Mon, May 16, 2011 at 4:10 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Thu, May 12, 2011 at 12:33:50PM -0700, Ying Han wrote:
> > On Thu, May 12, 2011 at 7:53 AM, Johannes Weiner <hannes@cmpxchg.org>
> wrote:
> >
> > > TODO: write proper changelog.  Here is an excerpt from
> > > http://lkml.kernel.org/r/20110428123652.GM12437@cmpxchg.org:
> > >
> > > : 1. Limit-triggered direct reclaim
> > > :
> > > : The memory cgroup hits its limit and the task does direct reclaim
> from
> > > : its own memcg.  We probably want statistics for this separately from
> > > : background reclaim to see how successful background reclaim is, the
> > > : same reason we have this separation in the global vmstat as well.
> > > :
> > > :       pgscan_direct_limit
> > > :       pgfree_direct_limit
> > >
> >
> > Can we use "pgsteal_" instead? Not big fan of the naming but want to make
> > them consistent to other stats.
>
> Actually, I thought what KAME-san said made sense.  'Stealing' is a
> good fit for reclaim due to outside pressure.  But if the memcg is
> target-reclaimed from the inside because it hit the limit, is
> 'stealing' the appropriate term?
>
> > > : 2. Limit-triggered background reclaim
> > > :
> > > : This is the watermark-based asynchroneous reclaim that is currently
> in
> > > : discussion.  It's triggered by the memcg breaching its watermark,
> > > : which is relative to its hard-limit.  I named it kswapd because I
> > > : still think kswapd should do this job, but it is all open for
> > > : discussion, obviously.  Treat it as meaning 'background' or
> > > : 'asynchroneous'.
> > > :
> > > :       pgscan_kswapd_limit
> > > :       pgfree_kswapd_limit
> > >
> >
> > Kame might have this stats on the per-memcg bg reclaim patch. Just
> mention
> > here since it will make later merge
> > a bit harder
>
> I'll have a look, thanks for the heads up.
>
> > > : 3. Hierarchy-triggered direct reclaim
> > > :
> > > : A condition outside the memcg leads to a task directly reclaiming
> from
> > > : this memcg.  This could be global memory pressure for example, but
> > > : also a parent cgroup hitting its limit.  It's probably helpful to
> > > : assume global memory pressure meaning that the root cgroup hit its
> > > : limit, conceptually.  We don't have that yet, but this could be the
> > > : direct softlimit reclaim Ying mentioned above.
> > > :
> > > :       pgscan_direct_hierarchy
> > > :       pgsteal_direct_hierarchy
> > >
> >
> >  The stats for soft_limit reclaim from global ttfp have been merged in
> mmotm
> > i believe as the following:
> >
> > "soft_direct_steal"
> > "soft_direct_scan"
> >
> > I wonder we might want to separate that out from the other case where the
> > reclaim is from the parent triggers its limit.
>
> The way I implemented soft limits in 6/6 is to increase pressure on
> exceeding children whenever hierarchical reclaim is taking place.
>
> This changes soft limit from
>
>        Global memory pressure: reclaim from exceeding memcg(s) first
>
> to
>
>        Memory pressure on a memcg: reclaim from all its children,
>        with increased pressure on those exceeding their soft limit
>        (where global memory pressure means root_mem_cgroup and all
>        existing memcgs are considered its children)
>
> which makes the soft limit much more generic and more powerful, as it
> allows the admin to prioritize reclaim throughout the hierarchy, not
> only for global memory pressure.  Consider one memcg with two
> subgroups.  You can now prioritize reclaim to prefer one subgroup over
> another through soft limiting.
>
> This is one reason why I think that the approach of maintaining a
> global list of memcgs that exceed their soft limits is an inferior
> approach; it does not take the hierarchy into account at all.
>


> This scheme would not provide a natural way of counting pages that
> were reclaimed because of the soft limit, and thus I still oppose the
> merging of soft limit counters.
>
The proposal we discussed during LSF (implemented in the patch "memcg:
revisit soft_limit reclaim on contention") does take hierarchical
reclaim into consideration.  The memcg is linked into the list if it
exceeds its soft_limit, and the per-memcg soft_limit reclaim calls
mem_cgroup_hierarchical_reclaim().

The current "soft_steal" and "soft_scan" count pages being stolen/scanned
inside mem_cgroup_hierarchical_reclaim() with check_soft set, which thus
counts pages reclaimed because of the soft_limit and also counts the
hierarchical reclaim.

Sorry if I missed something before reading through your whole patch set.

--Ying

>	Hannes

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 0/6] mm: memcg naturalization
  2011-05-13  7:20     ` Johannes Weiner
  (?)
@ 2011-05-17  0:53     ` Ying Han
  2011-05-17  8:11         ` Johannes Weiner
  -1 siblings, 1 reply; 83+ messages in thread
From: Ying Han @ 2011-05-17  0:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2201 bytes --]

On Fri, May 13, 2011 at 12:20 AM, Johannes Weiner <hannes@cmpxchg.org>wrote:

> On Thu, May 12, 2011 at 11:53:37AM -0700, Ying Han wrote:
> > On Thu, May 12, 2011 at 7:53 AM, Johannes Weiner <hannes@cmpxchg.org>
> wrote:
> >
> > > Hi!
> > >
> > > Here is a patch series that is a result of the memcg discussions on
> > > LSF (memcg-aware global reclaim, global lru removal, struct
> > > page_cgroup reduction, soft limit implementation) and the recent
> > > feature discussions on linux-mm.
> > >
> > > The long-term idea is to have memcgs no longer bolted to the side of
> > > the mm code, but integrate it as much as possible such that there is a
> > > native understanding of containers, and that the traditional !memcg
> > > setup is just a singular group.  This series is an approach in that
> > > direction.
>

This sounds like a good long-term plan. Now I wonder whether we should
take it step by step by doing:

1. improving the existing soft_limit reclaim from RB-tree based to
linked-list based, also in a round-robin fashion (a rough sketch follows
after this list).  We can keep the existing APIs and only change the
underlying implementation of mem_cgroup_soft_limit_reclaim().

2. removing the global lru list after the first step has proven to be
efficient.

3. then having better integration of memcg reclaim into the mm code.
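
A rough sketch of what step 1 could look like (the list head, the lock
and the soft_limit_node member are made up for illustration):

	static LIST_HEAD(soft_limit_list);
	static DEFINE_SPINLOCK(soft_limit_lock);

	/* Pick the next memcg in excess of its soft limit, round-robin. */
	static struct mem_cgroup *soft_limit_next(void)
	{
		struct mem_cgroup *memcg = NULL;

		spin_lock(&soft_limit_lock);
		if (!list_empty(&soft_limit_list)) {
			memcg = list_first_entry(&soft_limit_list,
						 struct mem_cgroup,
						 soft_limit_node);
			/* Rotate to the tail so siblings take turns. */
			list_move_tail(&memcg->soft_limit_node,
				       &soft_limit_list);
		}
		spin_unlock(&soft_limit_lock);
		return memcg;
	}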

--Ying


> > >
> > > It is a rather early snapshot, WIP, barely tested etc., but I wanted
> > > to get your opinions before further pursuing it.  It is also part of
> > > my counter-argument to the proposals of adding memcg-reclaim-related
> > > user interfaces at this point in time, so I wanted to push this out
> > > the door before things are merged into .40.
> > >
> >
> > The memcg-reclaim-related user interface I assume was the watermark
> > configurable tunable we were talking about in the per-memcg
> > background reclaim patch. I think we got some agreement to remove
> > the watermark tunable at the first step. But the newly added
> > memory.soft_limit_async_reclaim as you proposed seems to be a usable
> > interface.
>
> Actually, I meant the soft limit reclaim statistics.  There is a
> comment about that in the 6/6 changelog.
>

OK, I get it now. I will move the discussion to that thread.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 0/6] mm: memcg naturalization
  2011-05-16 10:57     ` Johannes Weiner
@ 2011-05-17  6:32       ` Balbir Singh
  -1 siblings, 0 replies; 83+ messages in thread
From: Balbir Singh @ 2011-05-17  6:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Ying Han, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

* Johannes Weiner <hannes@cmpxchg.org> [2011-05-16 12:57:29]:

> On Mon, May 16, 2011 at 04:00:34PM +0530, Balbir Singh wrote:
> > * Johannes Weiner <hannes@cmpxchg.org> [2011-05-12 16:53:52]:
> > 
> > > Hi!
> > > 
> > > Here is a patch series that is a result of the memcg discussions on
> > > LSF (memcg-aware global reclaim, global lru removal, struct
> > > page_cgroup reduction, soft limit implementation) and the recent
> > > feature discussions on linux-mm.
> > > 
> > > The long-term idea is to have memcgs no longer bolted to the side of
> > > the mm code, but integrate it as much as possible such that there is a
> > > native understanding of containers, and that the traditional !memcg
> > > setup is just a singular group.  This series is an approach in that
> > > direction.
> > > 
> > > It is a rather early snapshot, WIP, barely tested etc., but I wanted
> > > to get your opinions before further pursuing it.  It is also part of
> > > my counter-argument to the proposals of adding memcg-reclaim-related
> > > user interfaces at this point in time, so I wanted to push this out
> > > the door before things are merged into .40.
> > > 
> > > The patches are quite big, I am still looking for things to factor and
> > > split out, sorry for this.  Documentation is on its way as well ;)
> > > 
> > > #1 and #2 are boring preparational work.  #3 makes traditional reclaim
> > > in vmscan.c memcg-aware, which is a prerequisite for both removal of
> > > the global lru in #5 and the way I reimplemented soft limit reclaim in
> > > #6.
> > 
> > A large part of the acceptance would be based on what the test results
> > for common mm benchmarks show.
> 
> I will try to ensure the following things:
> 
> 1. will not degrade performance on !CONFIG_MEMCG kernels
> 
> 2. will not degrade performance on CONFIG_MEMCG kernels without
> configured memcgs.  This might be the most important one as most
> desktop/server distributions enable the memory controller per default
> 
> 3. will not degrade overall performance of workloads running
> concurrently in separate memory control groups.  I expect some shifts,
> however, that even out performance differences.
> 
> Please let me know what you consider common mm benchmarks.

1, 2 and 3 do sound nice.  What workloads do you intend to run? We used
reaim, lmbench, and page-fault-rate-based tests.

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 2/6] vmscan: make distinction between memcg reclaim and LRU list selection
  2011-05-12 16:03       ` Johannes Weiner
@ 2011-05-17  6:38         ` Ying Han
  -1 siblings, 0 replies; 83+ messages in thread
From: Ying Han @ 2011-05-17  6:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh,
	Michal Hocko, Andrew Morton, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Thu, May 12, 2011 at 9:03 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Thu, May 12, 2011 at 11:33:13AM -0400, Rik van Riel wrote:
>> On 05/12/2011 10:53 AM, Johannes Weiner wrote:
>> >The reclaim code has a single predicate for whether it currently
>> >reclaims on behalf of a memory cgroup, as well as whether it is
>> >reclaiming from the global LRU list or a memory cgroup LRU list.
>> >
>> >Up to now, both cases always coincide, but subsequent patches will
>> >change things such that global reclaim will scan memory cgroup lists.
>> >
>> >This patch adds a new predicate that tells global reclaim from memory
>> >cgroup reclaim, and then changes all callsites that are actually about
>> >global reclaim heuristics rather than strict LRU list selection.
>> >
>> >Signed-off-by: Johannes Weiner<hannes@cmpxchg.org>
>> >---
>> >  mm/vmscan.c |   96 ++++++++++++++++++++++++++++++++++------------------------
>> >  1 files changed, 56 insertions(+), 40 deletions(-)
>> >
>> >diff --git a/mm/vmscan.c b/mm/vmscan.c
>> >index f6b435c..ceeb2a5 100644
>> >--- a/mm/vmscan.c
>> >+++ b/mm/vmscan.c
>> >@@ -104,8 +104,12 @@ struct scan_control {
>> >      */
>> >     reclaim_mode_t reclaim_mode;
>> >
>> >-    /* Which cgroup do we reclaim from */
>> >-    struct mem_cgroup *mem_cgroup;
>> >+    /*
>> >+     * The memory cgroup we reclaim on behalf of, and the one we
>> >+     * are currently reclaiming from.
>> >+     */
>> >+    struct mem_cgroup *memcg;
>> >+    struct mem_cgroup *current_memcg;
>>
>> I can't say I'm fond of these names.  I had to read the
>> rest of the patch to figure out that the old mem_cgroup
>> got renamed to current_memcg.
>
> To clarify: sc->memcg will be the memcg that hit the hard limit and is
> the main target of this reclaim invocation.  current_memcg is the
> iterator over the hierarchy below the target.

I would assume the new variable memcg is a renaming of "mem_cgroup",
indicating which cgroup we reclaim on behalf of.  As for
"current_memcg", I couldn't find where it is set to the current cgroup
in the hierarchy below "memcg".

Both mem_cgroup_shrink_node_zone() and try_to_free_mem_cgroup_pages()
are called within mem_cgroup_hierarchical_reclaim(), and sc->memcg is
initialized with the victim passed down, which is already the memcg
selected from the hierarchy.

--Ying


> I realize this change in particular was placed a bit unfortunate in
> terms of understanding in the series, I just wanted to keep out the
> mem_cgroup to current_memcg renaming out of the next patch.  There is
> probably a better way, I'll fix it up and improve the comment.
>
>> Would it be better to call them my_memcg and reclaim_memcg?
>>
>> Maybe somebody else has better suggestions...
>
> Yes, suggestions welcome.  I'm not too fond of the naming, either.
>
>> Other than the naming, no objection.
>
> Thanks, Rik.
>
>        Hannes
>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 4/6] memcg: reclaim statistics
  2011-05-17  0:20       ` Ying Han
@ 2011-05-17  7:42           ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2011-05-17  7:42 UTC (permalink / raw)
  To: Ying Han
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Mon, May 16, 2011 at 05:20:31PM -0700, Ying Han wrote:
> On Mon, May 16, 2011 at 4:10 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Thu, May 12, 2011 at 12:33:50PM -0700, Ying Han wrote:
> > > The stats for soft_limit reclaim from global ttfp have been merged in
> > > mmotm i believe as the following:
> > >
> > > "soft_direct_steal"
> > > "soft_direct_scan"
> > >
> > > I wonder we might want to separate that out from the other case where the
> > > reclaim is from the parent triggers its limit.
> >
> > The way I implemented soft limits in 6/6 is to increase pressure on
> > exceeding children whenever hierarchical reclaim is taking place.
> >
> > This changes soft limit from
> >
> >        Global memory pressure: reclaim from exceeding memcg(s) first
> >
> > to
> >
> >        Memory pressure on a memcg: reclaim from all its children,
> >        with increased pressure on those exceeding their soft limit
> >        (where global memory pressure means root_mem_cgroup and all
> >        existing memcgs are considered its children)
> >
> > which makes the soft limit much more generic and more powerful, as it
> > allows the admin to prioritize reclaim throughout the hierarchy, not
> > only for global memory pressure.  Consider one memcg with two
> > subgroups.  You can now prioritize reclaim to prefer one subgroup over
> > another through soft limiting.
> >
> > This is one reason why I think that the approach of maintaining a
> > global list of memcgs that exceed their soft limits is an inferior
> > approach; it does not take the hierarchy into account at all.
> >
> > This scheme would not provide a natural way of counting pages that
> > were reclaimed because of the soft limit, and thus I still oppose the
> > merging of soft limit counters.
> 
> The proposal we discussed during LSF ( implemented in the patch " memcg:
> revisit soft_limit reclaim on contention") takes consideration
> of hierarchical reclaim. The memcg is linked in the list if it exceeds the
> soft_limit, and the soft_limit reclaim per-memcg is calling
> mem_cgroup_hierarchical_reclaim().

It does hierarchical soft limit reclaim once triggered, but I meant
that soft limits themselves have no hierarchical meaning.  Say you
have the following hierarchy:

                root_mem_cgroup

             aaa               bbb

           a1  a2             b1  b2

        a1-1

Consider aaa and a1 had a soft limit.  If global memory pressure arose, aaa and
all its children would be pushed back with the current scheme, the one
you are proposing, and the one I am proposing.

But now consider aaa hitting its hard limit.  Regular target reclaim
will be triggered, and a1, a2, and a1-1 will be scanned equally from
hierarchical reclaim.  That a1 is in excess of its soft limit is not
considered at all.

With what I am proposing, a1 and a1-1 would be pushed back more
aggressively than a2, because a1 is in excess of its soft limit and
a1-1 is contributing to that.

It would mean that given a group of siblings, you distribute the
pressure weighted by the soft limit configuration, independent of the
kind of hierarchical/external pressure (global memory scarcity or
parent hit the hard limit).

It's much easier to understand if you think of global memory pressure
to mean that root_mem_cgroup hit its hard limit, and that all existing
memcgs are hierarchically below the root_mem_cgroup.  Although it is
technically not implemented that way, that would be the consistent
model.

My proposal is a generic and native way of enforcing soft limits: a
memcg hit its hard limit, reclaim from the hierarchy below it, prefer
those in excess of their soft limit.

While yours is special-cased to immediate descendants of the
root_mem_cgroup.

> The current "soft_steal" and "soft_scan" is counting pages being steal/scan
>  inside mem_cgroup_hierarchical_reclaim() w check_soft checking, which then
> counts pages being reclaimed because of soft_limit and also counting the
> hierarchical reclaim.

Yeah, I understand that.  What I am saying is that in my code, every
time a hierarchy of memcgs is scanned (global memory reclaim, target
reclaim, kswapd or direct, it's all the same), a memcg that is in
excess of its soft limit has more pressure put on it than its siblings.

There is no stand-alone 'now, go reclaim soft limits' cycle anymore.
As such, it would be impossible to maintain that counter.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 0/6] mm: memcg naturalization
  2011-05-17  0:53     ` Ying Han
@ 2011-05-17  8:11         ` Johannes Weiner
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2011-05-17  8:11 UTC (permalink / raw)
  To: Ying Han
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Mon, May 16, 2011 at 05:53:04PM -0700, Ying Han wrote:
> On Fri, May 13, 2011 at 12:20 AM, Johannes Weiner <hannes@cmpxchg.org>wrote:
> 
> > On Thu, May 12, 2011 at 11:53:37AM -0700, Ying Han wrote:
> > > On Thu, May 12, 2011 at 7:53 AM, Johannes Weiner <hannes@cmpxchg.org>
> > wrote:
> > >
> > > > Hi!
> > > >
> > > > Here is a patch series that is a result of the memcg discussions on
> > > > LSF (memcg-aware global reclaim, global lru removal, struct
> > > > page_cgroup reduction, soft limit implementation) and the recent
> > > > feature discussions on linux-mm.
> > > >
> > > > The long-term idea is to have memcgs no longer bolted to the side of
> > > > the mm code, but integrate it as much as possible such that there is a
> > > > native understanding of containers, and that the traditional !memcg
> > > > setup is just a singular group.  This series is an approach in that
> > > > direction.
> >
> 
> This sounds like a good long term plan. Now I would wonder should we take it
> step by step by doing:
>
> 1. improving the existing soft_limit reclaim from RB-tree based to link-list
> based, also in a round_robin fashion.
> We can keep the existing APIs but only changing the underlying
> implementation of  mem_cgroup_soft_limit_reclaim()
> 
> 2. remove the global lru list after the first one being proved to be
> efficient.
> 
> 3. then have better integration of memcg reclaim to the mm code.

I chose to go the other way because it did not seem more complex to me
and it fixes many things we had planned anyway: deeper integration, better
soft limit implementation (including better pressure distribution,
enforcement also from direct reclaim, not just kswapd), global lru
removal etc.

That ground work was a bit unwieldy and I think quite some confusion
ensued, but I am currently reorganizing, cleaning up, and documenting.
I expect the next version to be much easier to understand.

The three steps are still this:

1. make traditional reclaim memcg-aware.

2. improve soft limit based on 1.

3. remove global lru based on 1.

But 1. already effectively disables the global LRU for memcg-enabled
kernels, so 3. can be deferred until we are comfortable with 1.

	Hannes

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 2/6] vmscan: make distinction between memcg reclaim and LRU list selection
  2011-05-17  6:38         ` Ying Han
@ 2011-05-17  8:25           ` Johannes Weiner
  -1 siblings, 0 replies; 83+ messages in thread
From: Johannes Weiner @ 2011-05-17  8:25 UTC (permalink / raw)
  To: Ying Han
  Cc: Rik van Riel, KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh,
	Michal Hocko, Andrew Morton, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On Mon, May 16, 2011 at 11:38:07PM -0700, Ying Han wrote:
> On Thu, May 12, 2011 at 9:03 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Thu, May 12, 2011 at 11:33:13AM -0400, Rik van Riel wrote:
> >> On 05/12/2011 10:53 AM, Johannes Weiner wrote:
> >> >The reclaim code has a single predicate for whether it currently
> >> >reclaims on behalf of a memory cgroup, as well as whether it is
> >> >reclaiming from the global LRU list or a memory cgroup LRU list.
> >> >
> >> >Up to now, both cases always coincide, but subsequent patches will
> >> >change things such that global reclaim will scan memory cgroup lists.
> >> >
> >> >This patch adds a new predicate that tells global reclaim from memory
> >> >cgroup reclaim, and then changes all callsites that are actually about
> >> >global reclaim heuristics rather than strict LRU list selection.
> >> >
> >> >Signed-off-by: Johannes Weiner<hannes@cmpxchg.org>
> >> >---
> >> >  mm/vmscan.c |   96 ++++++++++++++++++++++++++++++++++------------------------
> >> >  1 files changed, 56 insertions(+), 40 deletions(-)
> >> >
> >> >diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> >index f6b435c..ceeb2a5 100644
> >> >--- a/mm/vmscan.c
> >> >+++ b/mm/vmscan.c
> >> >@@ -104,8 +104,12 @@ struct scan_control {
> >> >      */
> >> >     reclaim_mode_t reclaim_mode;
> >> >
> >> >-    /* Which cgroup do we reclaim from */
> >> >-    struct mem_cgroup *mem_cgroup;
> >> >+    /*
> >> >+     * The memory cgroup we reclaim on behalf of, and the one we
> >> >+     * are currently reclaiming from.
> >> >+     */
> >> >+    struct mem_cgroup *memcg;
> >> >+    struct mem_cgroup *current_memcg;
> >>
> >> I can't say I'm fond of these names.  I had to read the
> >> rest of the patch to figure out that the old mem_cgroup
> >> got renamed to current_memcg.
> >
> > To clarify: sc->memcg will be the memcg that hit the hard limit and is
> > the main target of this reclaim invocation.  current_memcg is the
> > iterator over the hierarchy below the target.
> 
> I would assume the new variable memcg is a renaming of the
> "mem_cgroup" which indicating which cgroup we reclaim on behalf of.

The thing is, mem_cgroup would mean both the group we are reclaiming
on behalf of AND the group we are currently reclaiming from.  Because
the hierarchy walk was implemented in memcontrol.c, vmscan.c only ever
saw one cgroup at a time.

> About the "current_memcg", i couldn't find where it is indicating to
> be the current cgroup under the hierarchy below the "memcg".

It's codified in shrink_zone().

	for each child of sc->memcg:
	  sc->current_memcg = child
	  reclaim(sc)

In the new version I named (and documented) them:

	sc->target_mem_cgroup: the entry point into the hierarchy, set
	by the functions that have the scan control structure on their
	stack.  That's the one hitting its hard limit.

	sc->mem_cgroup: the current position in the hierarchy below
	sc->target_mem_cgroup.  That's the one that actively gets its
	pages reclaimed.
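
Roughly, as a sketch of the above (the iterator macro is made up; this
is not a quote from the new series):

	struct scan_control {
		/* ... existing fields ... */
		struct mem_cgroup *target_mem_cgroup;	/* hierarchy entry point */
		struct mem_cgroup *mem_cgroup;		/* current position below it */
	};

	static void shrink_zone(int priority, struct zone *zone,
				struct scan_control *sc)
	{
		struct mem_cgroup *memcg;

		/* Walk the hierarchy rooted at the reclaim target. */
		for_each_mem_cgroup_in_hierarchy(memcg, sc->target_mem_cgroup) {
			sc->mem_cgroup = memcg;
			do_shrink_zone(priority, zone, sc);
		}
	}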

> Both mem_cgroup_shrink_node_zone() and try_to_free_mem_cgroup_pages()
> are called within mem_cgroup_hierarchical_reclaim(), and the sc->memcg
> is initialized w/ the victim passed down which is already the memcg
> under hierarchy.

I changed mem_cgroup_shrink_node_zone() to use do_shrink_zone(), and
mem_cgroup_hierarchical_reclaim() no longer calls
try_to_free_mem_cgroup_pages().

So there is no hierarchy walk triggered from within a hierarchy walk.

I just noticed that there is, however, a bug in that
mem_cgroup_shrink_node_zone() does not initialize sc->current_memcg.
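
A minimal sketch of the missing piece (the helper is hypothetical; the
point is only that both fields get set before reclaim starts):

	static void memcg_init_scan_control(struct scan_control *sc,
					    struct mem_cgroup *memcg)
	{
		/*
		 * When a single group is reclaimed directly, the reclaim
		 * target and the group currently being scanned start out
		 * as the same memcg.
		 */
		sc->memcg = memcg;
		sc->current_memcg = memcg;
	}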

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 4/6] memcg: reclaim statistics
  2011-05-17  7:42           ` Johannes Weiner
@ 2011-05-17 13:55             ` Rik van Riel
  -1 siblings, 0 replies; 83+ messages in thread
From: Rik van Riel @ 2011-05-17 13:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ying Han, KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh,
	Michal Hocko, Andrew Morton, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel

On 05/17/2011 03:42 AM, Johannes Weiner wrote:

> It does hierarchical soft limit reclaim once triggered, but I meant
> that soft limits themselves have no hierarchical meaning.  Say you
> have the following hierarchy:
>
>                  root_mem_cgroup
>
>               aaa               bbb
>
>             a1  a2             b1  b2
>
>          a1-1
>
> Consider aaa and a1 had a soft limit.  If global memory pressure arose, aaa and
> all its children would be pushed back with the current scheme, the one
> you are proposing, and the one I am proposing.
>
> But now consider aaa hitting its hard limit.  Regular target reclaim
> will be triggered, and a1, a2, and a1-1 will be scanned equally from
> hierarchical reclaim.  That a1 is in excess of its soft limit is not
> considered at all.
>
> With what I am proposing, a1 and a1-1 would be pushed back more
> aggressively than a2, because a1 is in excess of its soft limit and
> a1-1 is contributing to that.

Ying, I think Johannes has a good point.  I do not see
a way to enforce the limits properly with the scheme we
came up with at LSF, in the hierarchical scenario above.

There may be a way, but until we think of it, I suspect
it will be better to go with Johannes's scheme for now.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [rfc patch 0/6] mm: memcg naturalization
  2011-05-17  8:11         ` Johannes Weiner
  (?)
@ 2011-05-17 14:45         ` Ying Han
  -1 siblings, 0 replies; 83+ messages in thread
From: Ying Han @ 2011-05-17 14:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Michal Hocko,
	Andrew Morton, Rik van Riel, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, linux-mm, linux-kernel


On Tue, May 17, 2011 at 1:11 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Mon, May 16, 2011 at 05:53:04PM -0700, Ying Han wrote:
> > On Fri, May 13, 2011 at 12:20 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > > On Thu, May 12, 2011 at 11:53:37AM -0700, Ying Han wrote:
> > > > On Thu, May 12, 2011 at 7:53 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > > Hi!
> > > > >
> > > > > Here is a patch series that is a result of the memcg discussions on
> > > > > LSF (memcg-aware global reclaim, global lru removal, struct
> > > > > page_cgroup reduction, soft limit implementation) and the recent
> > > > > feature discussions on linux-mm.
> > > > >
> > > > > The long-term idea is to have memcgs no longer bolted to the side of
> > > > > the mm code, but integrate it as much as possible such that there is a
> > > > > native understanding of containers, and that the traditional !memcg
> > > > > setup is just a singular group.  This series is an approach in that
> > > > > direction.
> > >
> >
> > This sounds like a good long-term plan.  Now I wonder whether we should
> > take it step by step by doing:
> >
> > 1. Improve the existing soft_limit reclaim from RB-tree based to
> > linked-list based, in a round-robin fashion.  We can keep the existing
> > APIs and only change the underlying implementation of
> > mem_cgroup_soft_limit_reclaim().
> >
> > 2. Remove the global LRU list after the first step has proven to be
> > efficient.
> >
> > 3. Then integrate memcg reclaim more closely into the mm code.
>
> I chose to go the other way because it did not seem more complex to me
> and it fixed many things we had planned anyway: deeper integration,
> better soft limit implementation (including better pressure
> distribution, and enforcement also from direct reclaim, not just
> kswapd), global lru removal, etc.


> That ground work was a bit unwieldy and I think quite some confusion
> ensued, but I am currently reorganizing, cleaning up, and documenting.
> I expect the next version to be much easier to understand.
>
> The three steps are still this:
>
> 1. make traditional reclaim memcg-aware.
>
> 2. improve soft limit based on 1.
>

I don't see the soft_limit round-robin implementation in patch 6/6;
maybe I missed it somewhere.  I have posted a patch which does the
linked-list round-robin across memcgs per zone.  Do you have a plan to
merge the two?
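
(For concreteness, a minimal userspace sketch of what such a per-zone
linked-list round-robin could look like is below; the list layout, the
32-page chunk, and soft_limit_round_robin() are made-up assumptions for
illustration, not the code in the patch I posted.)

/*
 * Toy model: memcgs in excess of their soft limit sit on a circular
 * per-zone list; reclaim rotates through them, taking a small chunk
 * from each, and the cursor survives across invocations.
 */
#include <stdio.h>

#define NGROUPS 3

struct soft_memcg {
        const char *name;
        long excess;                    /* pages over the soft limit */
        struct soft_memcg *next;        /* circular per-zone list */
};

static long soft_limit_round_robin(struct soft_memcg **pos, long to_reclaim)
{
        long freed = 0;
        int idle = 0;   /* consecutive groups that had nothing left to give */

        while (freed < to_reclaim && idle < NGROUPS) {
                struct soft_memcg *victim = *pos;
                long chunk = victim->excess < 32 ? victim->excess : 32;

                if (chunk) {
                        victim->excess -= chunk;
                        freed += chunk;
                        idle = 0;
                        printf("reclaimed %ld from %s\n", chunk, victim->name);
                } else {
                        idle++;         /* already back under its soft limit */
                }
                *pos = victim->next;    /* round robin: always advance */
        }
        return freed;
}

int main(void)
{
        struct soft_memcg g[NGROUPS] = {
                { "a1",   100, &g[1] },
                { "b2",    64, &g[2] },
                { "a1-1",  48, &g[0] }, /* close the circle */
        };
        struct soft_memcg *pos = &g[0]; /* per-zone cursor */

        soft_limit_round_robin(&pos, 128);      /* first pass, e.g. from kswapd */
        soft_limit_round_robin(&pos, 128);      /* next pass continues where we left off */
        return 0;
}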

>
> 3. remove global lru based on 1.
>
> But 1. already effectively disables the global LRU for memcg-enabled
> kernels, so 3. can be deferred until we are comfortable with 1.
>
Thank you for the details and clarification; I am looking forward to
your next post.

--Ying

>        Hannes
>


^ permalink raw reply	[flat|nested] 83+ messages in thread

end of thread, other threads:[~2011-05-17 14:45 UTC | newest]

Thread overview: 83+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-12 14:53 [rfc patch 0/6] mm: memcg naturalization Johannes Weiner
2011-05-12 14:53 ` Johannes Weiner
2011-05-12 14:53 ` [rfc patch 1/6] memcg: remove unused retry signal from reclaim Johannes Weiner
2011-05-12 14:53   ` Johannes Weiner
2011-05-12 15:02   ` Rik van Riel
2011-05-12 15:02     ` Rik van Riel
2011-05-12 17:22     ` Ying Han
2011-05-12 23:44   ` KAMEZAWA Hiroyuki
2011-05-12 23:44     ` KAMEZAWA Hiroyuki
2011-05-13  9:23   ` Michal Hocko
2011-05-13  9:23     ` Michal Hocko
2011-05-12 14:53 ` [rfc patch 2/6] vmscan: make distinction between memcg reclaim and LRU list selection Johannes Weiner
2011-05-12 14:53   ` Johannes Weiner
2011-05-12 15:33   ` Rik van Riel
2011-05-12 15:33     ` Rik van Riel
2011-05-12 16:03     ` Johannes Weiner
2011-05-12 16:03       ` Johannes Weiner
2011-05-17  6:38       ` Ying Han
2011-05-17  6:38         ` Ying Han
2011-05-17  8:25         ` Johannes Weiner
2011-05-17  8:25           ` Johannes Weiner
2011-05-12 23:50   ` KAMEZAWA Hiroyuki
2011-05-12 23:50     ` KAMEZAWA Hiroyuki
2011-05-13  6:58     ` Johannes Weiner
2011-05-13  6:58       ` Johannes Weiner
2011-05-16 22:36       ` Andrew Morton
2011-05-16 22:36         ` Andrew Morton
2011-05-12 14:53 ` [rfc patch 3/6] mm: memcg-aware global reclaim Johannes Weiner
2011-05-12 14:53   ` Johannes Weiner
2011-05-12 16:04   ` Rik van Riel
2011-05-12 16:04     ` Rik van Riel
2011-05-12 19:19   ` Ying Han
2011-05-13  7:08     ` Johannes Weiner
2011-05-13  7:08       ` Johannes Weiner
2011-05-13  0:04   ` KAMEZAWA Hiroyuki
2011-05-13  0:04     ` KAMEZAWA Hiroyuki
2011-05-13  7:18     ` Johannes Weiner
2011-05-13  7:18       ` Johannes Weiner
2011-05-13  0:40   ` KAMEZAWA Hiroyuki
2011-05-13  0:40     ` KAMEZAWA Hiroyuki
2011-05-13  6:54     ` Johannes Weiner
2011-05-13  6:54       ` Johannes Weiner
2011-05-13  9:53   ` Michal Hocko
2011-05-13  9:53     ` Michal Hocko
2011-05-13 10:28     ` Johannes Weiner
2011-05-13 10:28       ` Johannes Weiner
2011-05-13 11:02       ` Michal Hocko
2011-05-13 11:02         ` Michal Hocko
2011-05-12 14:53 ` [rfc patch 4/6] memcg: reclaim statistics Johannes Weiner
2011-05-12 14:53   ` Johannes Weiner
2011-05-12 19:33   ` Ying Han
2011-05-16 23:10     ` Johannes Weiner
2011-05-16 23:10       ` Johannes Weiner
2011-05-17  0:20       ` Ying Han
2011-05-17  7:42         ` Johannes Weiner
2011-05-17  7:42           ` Johannes Weiner
2011-05-17 13:55           ` Rik van Riel
2011-05-17 13:55             ` Rik van Riel
2011-05-12 14:53 ` [rfc patch 5/6] memcg: remove global LRU list Johannes Weiner
2011-05-12 14:53   ` Johannes Weiner
2011-05-13  9:53   ` Michal Hocko
2011-05-13  9:53     ` Michal Hocko
2011-05-13 10:36     ` Johannes Weiner
2011-05-13 10:36       ` Johannes Weiner
2011-05-13 11:01       ` Michal Hocko
2011-05-13 11:01         ` Michal Hocko
2011-05-12 14:53 ` [rfc patch 6/6] memcg: rework soft limit reclaim Johannes Weiner
2011-05-12 14:53   ` Johannes Weiner
2011-05-12 18:41   ` Ying Han
2011-05-12 18:41     ` Ying Han
2011-05-12 18:53 ` [rfc patch 0/6] mm: memcg naturalization Ying Han
2011-05-13  7:20   ` Johannes Weiner
2011-05-13  7:20     ` Johannes Weiner
2011-05-17  0:53     ` Ying Han
2011-05-17  8:11       ` Johannes Weiner
2011-05-17  8:11         ` Johannes Weiner
2011-05-17 14:45         ` Ying Han
2011-05-16 10:30 ` Balbir Singh
2011-05-16 10:30   ` Balbir Singh
2011-05-16 10:57   ` Johannes Weiner
2011-05-16 10:57     ` Johannes Weiner
2011-05-17  6:32     ` Balbir Singh
2011-05-17  6:32       ` Balbir Singh
