* [patch 0/11] mm: memcg naturalization -rc3
From: Johannes Weiner @ 2011-09-12 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

Hi everyone,

this is the third revision of the memcg naturalization patch set.  Due
to controversy, I dropped the reclaim statistics and the soft limit
reclaim rewrite.  What's left is mostly making the per-memcg LRU lists
exclusive.

Christoph suggested making struct mem_cgroup part of the core and having
reclaim always operate on at least a skeleton root_mem_cgroup with basic
LRU info, even on !CONFIG_MEMCG kernels.  I agree that we should go
there, but in its current form this would drag a lot of ugly memcg
internals out into the open, and I'd prefer another struct mem_cgroup
shakedown and the soft limit stuff to be done before this step.  But we
are getting there.

Changelog since -rc2
- consolidated all memcg hierarchy iteration constructs
- pass struct mem_cgroup_zone down the reclaim stack
- fix concurrent full hierarchy round-trip detection
- split out moving memcg reclaim from hierarchical global reclaim
- drop reclaim statistics
- rename do_shrink_zone to shrink_mem_cgroup_zone
- fix anon pre-aging to operate on per-memcg lrus
- revert to traditional limit reclaim hierarchy iteration
- split out lruvec introduction
- kill __add_page_to_lru_list
- fix LRU-accounting during swapcache/pagecache charging
- fix LRU-accounting of uncharged swapcache
- split out removing array id from pc->flags
- drop soft limit rework

More introduction and test results are included in the changelog of
the first patch.
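
Two of the changelog items above -- "pass struct mem_cgroup_zone down
the reclaim stack" and "split out lruvec introduction" -- revolve around
small helper structures added by later patches in the series.  They are
not shown in this cover letter, so the following is only a rough sketch
of their likely shape; the exact definitions and field names are
assumptions:

/*
 * Hypothetical sketch -- the real definitions live in later patches of
 * this series and may differ in naming and placement.
 */
#include <linux/list.h>
#include <linux/mmzone.h>	/* NR_LRU_LISTS, struct zone */

struct mem_cgroup;

/* One set of LRU lists; one per zone globally, one per (memcg, zone). */
struct lruvec {
	struct list_head lists[NR_LRU_LISTS];
};

/*
 * Pair of (memcg, zone) handed down the reclaim stack so the lower
 * levels know which LRU lists to operate on.
 */
struct mem_cgroup_zone {
	struct mem_cgroup *mem_cgroup;	/* NULL for global reclaim */
	struct zone *zone;
};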

 include/linux/memcontrol.h  |   74 +++--
 include/linux/mm_inline.h   |   21 +-
 include/linux/mmzone.h      |   10 +-
 include/linux/page_cgroup.h |   34 ---
 mm/memcontrol.c             |  688 ++++++++++++++++++++-----------------------
 mm/page_alloc.c             |    2 +-
 mm/page_cgroup.c            |   59 +----
 mm/swap.c                   |   24 +-
 mm/vmscan.c                 |  447 +++++++++++++++++-----------
 9 files changed, 674 insertions(+), 685 deletions(-)

The series is based on v3.1-rc3-mmotm-2011-08-24-14-08-14 plus the
following fixes from the not-yet-released -mm:

    Revert "memcg: add memory.vmscan_stat"
    memcg: skip scanning active lists based on individual list size
    mm: memcg: close race between charge and putback

Rolled-up diff of those attached.  Comments welcome!

	Hannes

---

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 6f3c598..06eb6d9 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -380,7 +380,7 @@ will be charged as a new owner of it.
 
 5.2 stat file
 
-5.2.1 memory.stat file includes following statistics
+memory.stat file includes following statistics
 
 # per-memory cgroup local status
 cache		- # of bytes of page cache memory.
@@ -438,89 +438,6 @@ Note:
 	 file_mapped is accounted only when the memory cgroup is owner of page
 	 cache.)
 
-5.2.2 memory.vmscan_stat
-
-memory.vmscan_stat includes statistics information for memory scanning and
-freeing, reclaiming. The statistics shows memory scanning information since
-memory cgroup creation and can be reset to 0 by writing 0 as
-
- #echo 0 > ../memory.vmscan_stat
-
-This file contains following statistics.
-
-[param]_[file_or_anon]_pages_by_[reason]_[under_heararchy]
-[param]_elapsed_ns_by_[reason]_[under_hierarchy]
-
-For example,
-
-  scanned_file_pages_by_limit indicates the number of scanned
-  file pages at vmscan.
-
-Now, 3 parameters are supported
-
-  scanned - the number of pages scanned by vmscan
-  rotated - the number of pages activated at vmscan
-  freed   - the number of pages freed by vmscan
-
-If "rotated" is high against scanned/freed, the memcg seems busy.
-
-Now, 2 reason are supported
-
-  limit - the memory cgroup's limit
-  system - global memory pressure + softlimit
-           (global memory pressure not under softlimit is not handled now)
-
-When under_hierarchy is added in the tail, the number indicates the
-total memcg scan of its children and itself.
-
-elapsed_ns is a elapsed time in nanosecond. This may include sleep time
-and not indicates CPU usage. So, please take this as just showing
-latency.
-
-Here is an example.
-
-# cat /cgroup/memory/A/memory.vmscan_stat
-scanned_pages_by_limit 9471864
-scanned_anon_pages_by_limit 6640629
-scanned_file_pages_by_limit 2831235
-rotated_pages_by_limit 4243974
-rotated_anon_pages_by_limit 3971968
-rotated_file_pages_by_limit 272006
-freed_pages_by_limit 2318492
-freed_anon_pages_by_limit 962052
-freed_file_pages_by_limit 1356440
-elapsed_ns_by_limit 351386416101
-scanned_pages_by_system 0
-scanned_anon_pages_by_system 0
-scanned_file_pages_by_system 0
-rotated_pages_by_system 0
-rotated_anon_pages_by_system 0
-rotated_file_pages_by_system 0
-freed_pages_by_system 0
-freed_anon_pages_by_system 0
-freed_file_pages_by_system 0
-elapsed_ns_by_system 0
-scanned_pages_by_limit_under_hierarchy 9471864
-scanned_anon_pages_by_limit_under_hierarchy 6640629
-scanned_file_pages_by_limit_under_hierarchy 2831235
-rotated_pages_by_limit_under_hierarchy 4243974
-rotated_anon_pages_by_limit_under_hierarchy 3971968
-rotated_file_pages_by_limit_under_hierarchy 272006
-freed_pages_by_limit_under_hierarchy 2318492
-freed_anon_pages_by_limit_under_hierarchy 962052
-freed_file_pages_by_limit_under_hierarchy 1356440
-elapsed_ns_by_limit_under_hierarchy 351386416101
-scanned_pages_by_system_under_hierarchy 0
-scanned_anon_pages_by_system_under_hierarchy 0
-scanned_file_pages_by_system_under_hierarchy 0
-rotated_pages_by_system_under_hierarchy 0
-rotated_anon_pages_by_system_under_hierarchy 0
-rotated_file_pages_by_system_under_hierarchy 0
-freed_pages_by_system_under_hierarchy 0
-freed_anon_pages_by_system_under_hierarchy 0
-freed_file_pages_by_system_under_hierarchy 0
-elapsed_ns_by_system_under_hierarchy 0
-
 5.3 swappiness
 
 Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fb1ed1c..b87068a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -40,16 +40,6 @@ extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct mem_cgroup *mem_cont,
 					int active, int file);
 
-struct memcg_scanrecord {
-	struct mem_cgroup *mem; /* scanend memory cgroup */
-	struct mem_cgroup *root; /* scan target hierarchy root */
-	int context;		/* scanning context (see memcontrol.c) */
-	unsigned long nr_scanned[2]; /* the number of scanned pages */
-	unsigned long nr_rotated[2]; /* the number of rotated pages */
-	unsigned long nr_freed[2]; /* the number of freed pages */
-	unsigned long elapsed; /* nsec of time elapsed while scanning */
-};
-
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
@@ -116,8 +106,10 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *memcg,
 /*
  * For memory reclaim.
  */
-int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
-int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
+int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg,
+				    struct zone *zone);
+int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg,
+				    struct zone *zone);
 int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
 unsigned long mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg,
 					int nid, int zid, unsigned int lrumask);
@@ -128,15 +120,6 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 					struct task_struct *p);
 
-extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
-						  gfp_t gfp_mask, bool noswap,
-						  struct memcg_scanrecord *rec);
-extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
-						gfp_t gfp_mask, bool noswap,
-						struct zone *zone,
-						struct memcg_scanrecord *rec,
-						unsigned long *nr_scanned);
-
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 extern int do_swap_account;
 #endif
@@ -314,13 +297,13 @@ static inline bool mem_cgroup_disabled(void)
 }
 
 static inline int
-mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
+mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg, struct zone *zone)
 {
 	return 1;
 }
 
 static inline int
-mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
+mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg, struct zone *zone)
 {
 	return 1;
 }
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3808f10..b156e80 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -249,6 +249,12 @@ static inline void lru_cache_add_file(struct page *page)
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file);
+extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
+						  gfp_t gfp_mask, bool noswap);
+extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
+						gfp_t gfp_mask, bool noswap,
+						struct zone *zone,
+						unsigned long *nr_scanned);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 54b35b3..b76011a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -205,50 +205,6 @@ struct mem_cgroup_eventfd_list {
 static void mem_cgroup_threshold(struct mem_cgroup *memcg);
 static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
 
-enum {
-	SCAN_BY_LIMIT,
-	SCAN_BY_SYSTEM,
-	NR_SCAN_CONTEXT,
-	SCAN_BY_SHRINK,	/* not recorded now */
-};
-
-enum {
-	SCAN,
-	SCAN_ANON,
-	SCAN_FILE,
-	ROTATE,
-	ROTATE_ANON,
-	ROTATE_FILE,
-	FREED,
-	FREED_ANON,
-	FREED_FILE,
-	ELAPSED,
-	NR_SCANSTATS,
-};
-
-struct scanstat {
-	spinlock_t	lock;
-	unsigned long	stats[NR_SCAN_CONTEXT][NR_SCANSTATS];
-	unsigned long	rootstats[NR_SCAN_CONTEXT][NR_SCANSTATS];
-};
-
-const char *scanstat_string[NR_SCANSTATS] = {
-	"scanned_pages",
-	"scanned_anon_pages",
-	"scanned_file_pages",
-	"rotated_pages",
-	"rotated_anon_pages",
-	"rotated_file_pages",
-	"freed_pages",
-	"freed_anon_pages",
-	"freed_file_pages",
-	"elapsed_ns",
-};
-#define SCANSTAT_WORD_LIMIT	"_by_limit"
-#define SCANSTAT_WORD_SYSTEM	"_by_system"
-#define SCANSTAT_WORD_HIERARCHY	"_under_hierarchy"
-
-
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -314,8 +270,7 @@ struct mem_cgroup {
 
 	/* For oom notifier event fd */
 	struct list_head oom_notify;
-	/* For recording LRU-scan statistics */
-	struct scanstat scanstat;
+
 	/*
 	 * Should we move charges of a task when a task is moved into this
 	 * mem_cgroup ? And what type of charges should we move ?
@@ -1035,6 +990,16 @@ void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
 		return;
 	pc = lookup_page_cgroup(page);
 	VM_BUG_ON(PageCgroupAcctLRU(pc));
+	/*
+	 * putback:				charge:
+	 * SetPageLRU				SetPageCgroupUsed
+	 * smp_mb				smp_mb
+	 * PageCgroupUsed && add to memcg LRU	PageLRU && add to memcg LRU
+	 *
+	 * Ensure that one of the two sides adds the page to the memcg
+	 * LRU during a race.
+	 */
+	smp_mb();
 	if (!PageCgroupUsed(pc))
 		return;
 	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
@@ -1086,7 +1051,16 @@ static void mem_cgroup_lru_add_after_commit(struct page *page)
 	unsigned long flags;
 	struct zone *zone = page_zone(page);
 	struct page_cgroup *pc = lookup_page_cgroup(page);
-
+	/*
+	 * putback:				charge:
+	 * SetPageLRU				SetPageCgroupUsed
+	 * smp_mb				smp_mb
+	 * PageCgroupUsed && add to memcg LRU	PageLRU && add to memcg LRU
+	 *
+	 * Ensure that one of the two sides adds the page to the memcg
+	 * LRU during a race.
+	 */
+	smp_mb();
 	/* taking care of that the page is added to LRU while we commit it */
 	if (likely(!PageLRU(page)))
 		return;
@@ -1146,15 +1120,19 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *memcg)
 	return ret;
 }
 
-static int calc_inactive_ratio(struct mem_cgroup *memcg, unsigned long *present_pages)
+int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg, struct zone *zone)
 {
-	unsigned long active;
+	unsigned long inactive_ratio;
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
 	unsigned long inactive;
+	unsigned long active;
 	unsigned long gb;
-	unsigned long inactive_ratio;
 
-	inactive = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_INACTIVE_ANON));
-	active = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_ACTIVE_ANON));
+	inactive = mem_cgroup_zone_nr_lru_pages(memcg, nid, zid,
+						BIT(LRU_INACTIVE_ANON));
+	active = mem_cgroup_zone_nr_lru_pages(memcg, nid, zid,
+					      BIT(LRU_ACTIVE_ANON));
 
 	gb = (inactive + active) >> (30 - PAGE_SHIFT);
 	if (gb)
@@ -1162,39 +1140,20 @@ static int calc_inactive_ratio(struct mem_cgroup *memcg, unsigned long *present_
 	else
 		inactive_ratio = 1;
 
-	if (present_pages) {
-		present_pages[0] = inactive;
-		present_pages[1] = active;
-	}
-
-	return inactive_ratio;
+	return inactive * inactive_ratio < active;
 }
 
-int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
-{
-	unsigned long active;
-	unsigned long inactive;
-	unsigned long present_pages[2];
-	unsigned long inactive_ratio;
-
-	inactive_ratio = calc_inactive_ratio(memcg, present_pages);
-
-	inactive = present_pages[0];
-	active = present_pages[1];
-
-	if (inactive * inactive_ratio < active)
-		return 1;
-
-	return 0;
-}
-
-int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
+int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg, struct zone *zone)
 {
 	unsigned long active;
 	unsigned long inactive;
+	int zid = zone_idx(zone);
+	int nid = zone_to_nid(zone);
 
-	inactive = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_INACTIVE_FILE));
-	active = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_ACTIVE_FILE));
+	inactive = mem_cgroup_zone_nr_lru_pages(memcg, nid, zid,
+						BIT(LRU_INACTIVE_FILE));
+	active = mem_cgroup_zone_nr_lru_pages(memcg, nid, zid,
+					      BIT(LRU_ACTIVE_FILE));
 
 	return (active > inactive);
 }
@@ -1679,44 +1638,6 @@ bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
 }
 #endif
 
-static void __mem_cgroup_record_scanstat(unsigned long *stats,
-			   struct memcg_scanrecord *rec)
-{
-
-	stats[SCAN] += rec->nr_scanned[0] + rec->nr_scanned[1];
-	stats[SCAN_ANON] += rec->nr_scanned[0];
-	stats[SCAN_FILE] += rec->nr_scanned[1];
-
-	stats[ROTATE] += rec->nr_rotated[0] + rec->nr_rotated[1];
-	stats[ROTATE_ANON] += rec->nr_rotated[0];
-	stats[ROTATE_FILE] += rec->nr_rotated[1];
-
-	stats[FREED] += rec->nr_freed[0] + rec->nr_freed[1];
-	stats[FREED_ANON] += rec->nr_freed[0];
-	stats[FREED_FILE] += rec->nr_freed[1];
-
-	stats[ELAPSED] += rec->elapsed;
-}
-
-static void mem_cgroup_record_scanstat(struct memcg_scanrecord *rec)
-{
-	struct mem_cgroup *memcg;
-	int context = rec->context;
-
-	if (context >= NR_SCAN_CONTEXT)
-		return;
-
-	memcg = rec->mem;
-	spin_lock(&memcg->scanstat.lock);
-	__mem_cgroup_record_scanstat(memcg->scanstat.stats[context], rec);
-	spin_unlock(&memcg->scanstat.lock);
-
-	memcg = rec->root;
-	spin_lock(&memcg->scanstat.lock);
-	__mem_cgroup_record_scanstat(memcg->scanstat.rootstats[context], rec);
-	spin_unlock(&memcg->scanstat.lock);
-}
-
 /*
  * Scan the hierarchy if needed to reclaim memory. We remember the last child
  * we reclaimed from, so that we don't end up penalizing one child extensively
@@ -1741,9 +1662,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
 	bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
 	bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
 	bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
-	struct memcg_scanrecord rec;
 	unsigned long excess;
-	unsigned long scanned;
+	unsigned long nr_scanned;
 
 	excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
 
@@ -1751,15 +1671,6 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
 	if (!check_soft && !shrink && root_memcg->memsw_is_minimum)
 		noswap = true;
 
-	if (shrink)
-		rec.context = SCAN_BY_SHRINK;
-	else if (check_soft)
-		rec.context = SCAN_BY_SYSTEM;
-	else
-		rec.context = SCAN_BY_LIMIT;
-
-	rec.root = root_memcg;
-
 	while (1) {
 		victim = mem_cgroup_select_victim(root_memcg);
 		if (victim == root_memcg) {
@@ -1800,23 +1711,14 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
 			css_put(&victim->css);
 			continue;
 		}
-		rec.mem = victim;
-		rec.nr_scanned[0] = 0;
-		rec.nr_scanned[1] = 0;
-		rec.nr_rotated[0] = 0;
-		rec.nr_rotated[1] = 0;
-		rec.nr_freed[0] = 0;
-		rec.nr_freed[1] = 0;
-		rec.elapsed = 0;
 		/* we use swappiness of local cgroup */
 		if (check_soft) {
 			ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
-				noswap, zone, &rec, &scanned);
-			*total_scanned += scanned;
+				noswap, zone, &nr_scanned);
+			*total_scanned += nr_scanned;
 		} else
 			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
-						noswap, &rec);
-		mem_cgroup_record_scanstat(&rec);
+						noswap);
 		css_put(&victim->css);
 		/*
 		 * At shrinking usage, we can't check we should stop here or
@@ -3853,18 +3755,14 @@ try_to_free:
 	/* try to free all pages in this cgroup */
 	shrink = 1;
 	while (nr_retries && memcg->res.usage > 0) {
-		struct memcg_scanrecord rec;
 		int progress;
 
 		if (signal_pending(current)) {
 			ret = -EINTR;
 			goto out;
 		}
-		rec.context = SCAN_BY_SHRINK;
-		rec.mem = memcg;
-		rec.root = memcg;
 		progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL,
-						false, &rec);
+						false);
 		if (!progress) {
 			nr_retries--;
 			/* maybe some writeback is necessary */
@@ -4293,8 +4191,6 @@ static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft,
 	}
 
 #ifdef CONFIG_DEBUG_VM
-	cb->fill(cb, "inactive_ratio", calc_inactive_ratio(mem_cont, NULL));
-
 	{
 		int nid, zid;
 		struct mem_cgroup_per_zone *mz;
@@ -4708,57 +4604,6 @@ static int mem_control_numa_stat_open(struct inode *unused, struct file *file)
 }
 #endif /* CONFIG_NUMA */
 
-static int mem_cgroup_vmscan_stat_read(struct cgroup *cgrp,
-				struct cftype *cft,
-				struct cgroup_map_cb *cb)
-{
-	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
-	char string[64];
-	int i;
-
-	for (i = 0; i < NR_SCANSTATS; i++) {
-		strcpy(string, scanstat_string[i]);
-		strcat(string, SCANSTAT_WORD_LIMIT);
-		cb->fill(cb, string, memcg->scanstat.stats[SCAN_BY_LIMIT][i]);
-	}
-
-	for (i = 0; i < NR_SCANSTATS; i++) {
-		strcpy(string, scanstat_string[i]);
-		strcat(string, SCANSTAT_WORD_SYSTEM);
-		cb->fill(cb, string, memcg->scanstat.stats[SCAN_BY_SYSTEM][i]);
-	}
-
-	for (i = 0; i < NR_SCANSTATS; i++) {
-		strcpy(string, scanstat_string[i]);
-		strcat(string, SCANSTAT_WORD_LIMIT);
-		strcat(string, SCANSTAT_WORD_HIERARCHY);
-		cb->fill(cb,
-			string, memcg->scanstat.rootstats[SCAN_BY_LIMIT][i]);
-	}
-	for (i = 0; i < NR_SCANSTATS; i++) {
-		strcpy(string, scanstat_string[i]);
-		strcat(string, SCANSTAT_WORD_SYSTEM);
-		strcat(string, SCANSTAT_WORD_HIERARCHY);
-		cb->fill(cb,
-			string, memcg->scanstat.rootstats[SCAN_BY_SYSTEM][i]);
-	}
-	return 0;
-}
-
-static int mem_cgroup_reset_vmscan_stat(struct cgroup *cgrp,
-				unsigned int event)
-{
-	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
-
-	spin_lock(&memcg->scanstat.lock);
-	memset(&memcg->scanstat.stats, 0, sizeof(memcg->scanstat.stats));
-	memset(&memcg->scanstat.rootstats,
-		0, sizeof(memcg->scanstat.rootstats));
-	spin_unlock(&memcg->scanstat.lock);
-	return 0;
-}
-
-
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -4829,11 +4674,6 @@ static struct cftype mem_cgroup_files[] = {
 		.mode = S_IRUGO,
 	},
 #endif
-	{
-		.name = "vmscan_stat",
-		.read_map = mem_cgroup_vmscan_stat_read,
-		.trigger = mem_cgroup_reset_vmscan_stat,
-	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
@@ -5097,7 +4937,6 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	atomic_set(&memcg->refcnt, 1);
 	memcg->move_charge_at_immigrate = 0;
 	mutex_init(&memcg->thresholds_lock);
-	spin_lock_init(&memcg->scanstat.lock);
 	return &memcg->css;
 free_out:
 	__mem_cgroup_free(memcg);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 23256e8..7502726 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -105,7 +105,6 @@ struct scan_control {
 
 	/* Which cgroup do we reclaim from */
 	struct mem_cgroup *mem_cgroup;
-	struct memcg_scanrecord *memcg_record;
 
 	/*
 	 * Nodemask of nodes allowed by the caller. If NULL, all nodes
@@ -1381,8 +1380,6 @@ putback_lru_pages(struct zone *zone, struct scan_control *sc,
 			int file = is_file_lru(lru);
 			int numpages = hpage_nr_pages(page);
 			reclaim_stat->recent_rotated[file] += numpages;
-			if (!scanning_global_lru(sc))
-				sc->memcg_record->nr_rotated[file] += numpages;
 		}
 		if (!pagevec_add(&pvec, page)) {
 			spin_unlock_irq(&zone->lru_lock);
@@ -1426,10 +1423,6 @@ static noinline_for_stack void update_isolated_counts(struct zone *zone,
 
 	reclaim_stat->recent_scanned[0] += *nr_anon;
 	reclaim_stat->recent_scanned[1] += *nr_file;
-	if (!scanning_global_lru(sc)) {
-		sc->memcg_record->nr_scanned[0] += *nr_anon;
-		sc->memcg_record->nr_scanned[1] += *nr_file;
-	}
 }
 
 /*
@@ -1551,9 +1544,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 					priority, &nr_dirty, &nr_writeback);
 	}
 
-	if (!scanning_global_lru(sc))
-		sc->memcg_record->nr_freed[file] += nr_reclaimed;
-
 	local_irq_disable();
 	if (current_is_kswapd())
 		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
@@ -1670,8 +1660,6 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 	}
 
 	reclaim_stat->recent_scanned[file] += nr_taken;
-	if (!scanning_global_lru(sc))
-		sc->memcg_record->nr_scanned[file] += nr_taken;
 
 	__count_zone_vm_events(PGREFILL, zone, pgscanned);
 	if (file)
@@ -1723,8 +1711,6 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 	 * get_scan_ratio.
 	 */
 	reclaim_stat->recent_rotated[file] += nr_rotated;
-	if (!scanning_global_lru(sc))
-		sc->memcg_record->nr_rotated[file] += nr_rotated;
 
 	move_active_pages_to_lru(zone, &l_active,
 						LRU_ACTIVE + file * LRU_FILE);
@@ -1770,7 +1756,7 @@ static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
 	if (scanning_global_lru(sc))
 		low = inactive_anon_is_low_global(zone);
 	else
-		low = mem_cgroup_inactive_anon_is_low(sc->mem_cgroup);
+		low = mem_cgroup_inactive_anon_is_low(sc->mem_cgroup, zone);
 	return low;
 }
 #else
@@ -1813,7 +1799,7 @@ static int inactive_file_is_low(struct zone *zone, struct scan_control *sc)
 	if (scanning_global_lru(sc))
 		low = inactive_file_is_low_global(zone);
 	else
-		low = mem_cgroup_inactive_file_is_low(sc->mem_cgroup);
+		low = mem_cgroup_inactive_file_is_low(sc->mem_cgroup, zone);
 	return low;
 }
 
@@ -2313,10 +2299,9 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 
 unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
-					gfp_t gfp_mask, bool noswap,
-					struct zone *zone,
-					struct memcg_scanrecord *rec,
-					unsigned long *scanned)
+						gfp_t gfp_mask, bool noswap,
+						struct zone *zone,
+						unsigned long *nr_scanned)
 {
 	struct scan_control sc = {
 		.nr_scanned = 0,
@@ -2326,9 +2311,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 		.may_swap = !noswap,
 		.order = 0,
 		.mem_cgroup = mem,
-		.memcg_record = rec,
 	};
-	ktime_t start, end;
 
 	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
@@ -2337,7 +2320,6 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 						      sc.may_writepage,
 						      sc.gfp_mask);
 
-	start = ktime_get();
 	/*
 	 * NOTE: Although we can get the priority field, using it
 	 * here is not a good idea, since it limits the pages we can scan.
@@ -2346,25 +2328,19 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 	 * the priority and make it zero.
 	 */
 	shrink_zone(0, zone, &sc);
-	end = ktime_get();
-
-	if (rec)
-		rec->elapsed += ktime_to_ns(ktime_sub(end, start));
-	*scanned = sc.nr_scanned;
 
 	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
 
+	*nr_scanned = sc.nr_scanned;
 	return sc.nr_reclaimed;
 }
 
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 					   gfp_t gfp_mask,
-					   bool noswap,
-					   struct memcg_scanrecord *rec)
+					   bool noswap)
 {
 	struct zonelist *zonelist;
 	unsigned long nr_reclaimed;
-	ktime_t start, end;
 	int nid;
 	struct scan_control sc = {
 		.may_writepage = !laptop_mode,
@@ -2373,7 +2349,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.order = 0,
 		.mem_cgroup = mem_cont,
-		.memcg_record = rec,
 		.nodemask = NULL, /* we don't care the placement */
 		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
@@ -2382,7 +2357,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 		.gfp_mask = sc.gfp_mask,
 	};
 
-	start = ktime_get();
 	/*
 	 * Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't
 	 * take care of from where we get pages. So the node where we start the
@@ -2397,9 +2371,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 					    sc.gfp_mask);
 
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);
-	end = ktime_get();
-	if (rec)
-		rec->elapsed += ktime_to_ns(ktime_sub(end, start));
 
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 

* [patch 01/11] mm: memcg: consolidate hierarchy iteration primitives
From: Johannes Weiner @ 2011-09-12 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

Memory control groups are currently bolted onto the side of
traditional memory management in places where better integration would
be preferable.  To reclaim memory, for example, memory control groups
maintain their own LRU list and reclaim strategy aside from the global
per-zone LRU list reclaim.  But an extra list head for each existing
page frame is expensive and maintaining it requires additional code.

This patchset disables the global per-zone LRU lists on memory cgroup
configurations and converts all their users to operate on the per-memory
cgroup lists instead.  As LRU pages are then exclusively on one list,
this saves two list pointers for each page frame in the system:

page_cgroup array size with 4G physical memory

  vanilla: [    0.000000] allocated 31457280 bytes of page_cgroup
  patched: [    0.000000] allocated 15728640 bytes of page_cgroup
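
Both numbers are consistent with struct page_cgroup shrinking from 32 to
16 bytes per page on 64-bit, i.e. with dropping exactly the embedded
struct list_head (two pointers, 16 bytes): 31457280 / 32 = 983040
entries before, and 983040 * 16 = 15728640 bytes afterwards.  As a
sketch only -- the field layout below is an assumption, not taken from
this series:

#include <linux/list.h>

struct mem_cgroup;

struct page_cgroup {
	unsigned long flags;
	struct mem_cgroup *mem_cgroup;
	struct list_head lru;	/* removed by this series: 16 bytes/page */
};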

At the same time, system performance for various workloads is
unaffected:

100G sparse file cat, 4G physical memory, 10 runs, to test for code
bloat in the traditional LRU handling and kswapd & direct reclaim
paths, without/with the memory controller configured in

  vanilla: 71.603(0.207) seconds
  patched: 71.640(0.156) seconds

  vanilla: 79.558(0.288) seconds
  patched: 77.233(0.147) seconds

100G sparse file cat in 1G memory cgroup, 10 runs, to test for code
bloat in the traditional memory cgroup LRU handling and reclaim path

  vanilla: 96.844(0.281) seconds
  patched: 94.454(0.311) seconds

4 unlimited memcgs running kbuild -j32 each, 4G physical memory, 500M
swap on SSD, 10 runs, to test for regressions in kswapd & direct
reclaim using per-memcg LRU lists with multiple memcgs and multiple
allocators within each memcg

  vanilla: 717.722(1.440) seconds [ 69720.100(11600.835) majfaults ]
  patched: 714.106(2.313) seconds [ 71109.300(14886.186) majfaults ]

16 unlimited memcgs running kbuild, 1900M hierarchical limit, 500M
swap on SSD, 10 runs, to test for regressions in hierarchical memcg
setups

  vanilla: 2742.058(1.992) seconds [ 26479.600(1736.737) majfaults ]
  patched: 2743.267(1.214) seconds [ 27240.700(1076.063) majfaults ]

This patch:

There are currently two different implementations for iterating over a
memory cgroup hierarchy tree.

Consolidate them into one worker function and base the convenience
looping macros on top of it.
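
As an illustration only -- not part of the patch -- a caller of the
consolidated iterator defined in the diff below would look roughly like
this; walk_tree() and do_something() are made-up placeholders:

static void walk_tree(struct mem_cgroup *root)
{
	struct mem_cgroup *iter;

	for_each_mem_cgroup_tree(iter, root) {
		if (do_something(iter) < 0) {
			/*
			 * Leaving the loop early keeps a css reference
			 * on @iter; drop it through the helper.
			 */
			mem_cgroup_iter_break(root, iter);
			break;
		}
	}
}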

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
---
 mm/memcontrol.c |  196 ++++++++++++++++++++----------------------------------
 1 files changed, 73 insertions(+), 123 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b76011a..912c7c7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -781,83 +781,75 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
 	return memcg;
 }
 
-/* The caller has to guarantee "mem" exists before calling this */
-static struct mem_cgroup *mem_cgroup_start_loop(struct mem_cgroup *memcg)
+static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
+					  struct mem_cgroup *prev,
+					  bool remember)
 {
-	struct cgroup_subsys_state *css;
-	int found;
+	struct mem_cgroup *mem = NULL;
+	int id = 0;
 
-	if (!memcg) /* ROOT cgroup has the smallest ID */
-		return root_mem_cgroup; /*css_put/get against root is ignored*/
-	if (!memcg->use_hierarchy) {
-		if (css_tryget(&memcg->css))
-			return memcg;
-		return NULL;
-	}
-	rcu_read_lock();
-	/*
-	 * searching a memory cgroup which has the smallest ID under given
-	 * ROOT cgroup. (ID >= 1)
-	 */
-	css = css_get_next(&mem_cgroup_subsys, 1, &memcg->css, &found);
-	if (css && css_tryget(css))
-		memcg = container_of(css, struct mem_cgroup, css);
-	else
-		memcg = NULL;
-	rcu_read_unlock();
-	return memcg;
-}
+	if (!root)
+		root = root_mem_cgroup;
 
-static struct mem_cgroup *mem_cgroup_get_next(struct mem_cgroup *iter,
-					struct mem_cgroup *root,
-					bool cond)
-{
-	int nextid = css_id(&iter->css) + 1;
-	int found;
-	int hierarchy_used;
-	struct cgroup_subsys_state *css;
+	if (prev && !remember)
+		id = css_id(&prev->css);
 
-	hierarchy_used = iter->use_hierarchy;
+	if (prev && prev != root)
+		css_put(&prev->css);
 
-	css_put(&iter->css);
-	/* If no ROOT, walk all, ignore hierarchy */
-	if (!cond || (root && !hierarchy_used))
-		return NULL;
+	if (!root->use_hierarchy && root != root_mem_cgroup) {
+		if (prev)
+			return NULL;
+		return root;
+	}
 
-	if (!root)
-		root = root_mem_cgroup;
+	while (!mem) {
+		struct cgroup_subsys_state *css;
 
-	do {
-		iter = NULL;
-		rcu_read_lock();
+		if (remember)
+			id = root->last_scanned_child;
 
-		css = css_get_next(&mem_cgroup_subsys, nextid,
-				&root->css, &found);
-		if (css && css_tryget(css))
-			iter = container_of(css, struct mem_cgroup, css);
+		rcu_read_lock();
+		css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
+		if (css) {
+			if (css == &root->css || css_tryget(css))
+				mem = container_of(css, struct mem_cgroup, css);
+		} else
+			id = 0;
 		rcu_read_unlock();
-		/* If css is NULL, no more cgroups will be found */
-		nextid = found + 1;
-	} while (css && !iter);
 
-	return iter;
+		if (remember)
+			root->last_scanned_child = id;
+
+		if (prev && !css)
+			return NULL;
+	}
+	return mem;
 }
-/*
- * for_eacn_mem_cgroup_tree() for visiting all cgroup under tree. Please
- * be careful that "break" loop is not allowed. We have reference count.
- * Instead of that modify "cond" to be false and "continue" to exit the loop.
- */
-#define for_each_mem_cgroup_tree_cond(iter, root, cond)	\
-	for (iter = mem_cgroup_start_loop(root);\
-	     iter != NULL;\
-	     iter = mem_cgroup_get_next(iter, root, cond))
 
-#define for_each_mem_cgroup_tree(iter, root) \
-	for_each_mem_cgroup_tree_cond(iter, root, true)
+static void mem_cgroup_iter_break(struct mem_cgroup *root,
+				  struct mem_cgroup *prev)
+{
+	if (!root)
+		root = root_mem_cgroup;
+	if (prev && prev != root)
+		css_put(&prev->css);
+}
 
-#define for_each_mem_cgroup_all(iter) \
-	for_each_mem_cgroup_tree_cond(iter, NULL, true)
+/*
+ * Iteration constructs for visiting all cgroups (under a tree).  If
+ * loops are exited prematurely (break), mem_cgroup_iter_break() must
+ * be used for reference counting.
+ */
+#define for_each_mem_cgroup_tree(iter, root)		\
+	for (iter = mem_cgroup_iter(root, NULL, false);	\
+	     iter != NULL;				\
+	     iter = mem_cgroup_iter(root, iter, false))
 
+#define for_each_mem_cgroup(iter)			\
+	for (iter = mem_cgroup_iter(NULL, NULL, false);	\
+	     iter != NULL;				\
+	     iter = mem_cgroup_iter(NULL, iter, false))
 
 static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
 {
@@ -1464,43 +1456,6 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
 	return min(limit, memsw);
 }
 
-/*
- * Visit the first child (need not be the first child as per the ordering
- * of the cgroup list, since we track last_scanned_child) of @mem and use
- * that to reclaim free pages from.
- */
-static struct mem_cgroup *
-mem_cgroup_select_victim(struct mem_cgroup *root_memcg)
-{
-	struct mem_cgroup *ret = NULL;
-	struct cgroup_subsys_state *css;
-	int nextid, found;
-
-	if (!root_memcg->use_hierarchy) {
-		css_get(&root_memcg->css);
-		ret = root_memcg;
-	}
-
-	while (!ret) {
-		rcu_read_lock();
-		nextid = root_memcg->last_scanned_child + 1;
-		css = css_get_next(&mem_cgroup_subsys, nextid, &root_memcg->css,
-				   &found);
-		if (css && css_tryget(css))
-			ret = container_of(css, struct mem_cgroup, css);
-
-		rcu_read_unlock();
-		/* Updates scanning parameter */
-		if (!css) {
-			/* this means start scan from ID:1 */
-			root_memcg->last_scanned_child = 0;
-		} else
-			root_memcg->last_scanned_child = found;
-	}
-
-	return ret;
-}
-
 /**
  * test_mem_cgroup_node_reclaimable
  * @mem: the target memcg
@@ -1656,7 +1611,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
 						unsigned long reclaim_options,
 						unsigned long *total_scanned)
 {
-	struct mem_cgroup *victim;
+	struct mem_cgroup *victim = NULL;
 	int ret, total = 0;
 	int loop = 0;
 	bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
@@ -1672,8 +1627,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
 		noswap = true;
 
 	while (1) {
-		victim = mem_cgroup_select_victim(root_memcg);
-		if (victim == root_memcg) {
+		victim = mem_cgroup_iter(root_memcg, victim, true);
+		if (!victim) {
 			loop++;
 			/*
 			 * We are not draining per cpu cached charges during
@@ -1689,10 +1644,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
 				 * anything, it might because there are
 				 * no reclaimable pages under this hierarchy
 				 */
-				if (!check_soft || !total) {
-					css_put(&victim->css);
+				if (!check_soft || !total)
 					break;
-				}
 				/*
 				 * We want to do more targeted reclaim.
 				 * excess >> 2 is not to excessive so as to
@@ -1700,15 +1653,13 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
 				 * coming back to reclaim from this cgroup
 				 */
 				if (total >= (excess >> 2) ||
-					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) {
-					css_put(&victim->css);
+					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
 					break;
-				}
 			}
+			continue;
 		}
 		if (!mem_cgroup_reclaimable(victim, noswap)) {
 			/* this cgroup's local usage == 0 */
-			css_put(&victim->css);
 			continue;
 		}
 		/* we use swappiness of local cgroup */
@@ -1719,21 +1670,21 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
 		} else
 			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
 						noswap);
-		css_put(&victim->css);
 		/*
 		 * At shrinking usage, we can't check we should stop here or
 		 * reclaim more. It's depends on callers. last_scanned_child
 		 * will work enough for keeping fairness under tree.
 		 */
 		if (shrink)
-			return ret;
+			break;
 		total += ret;
 		if (check_soft) {
 			if (!res_counter_soft_limit_excess(&root_memcg->res))
-				return total;
+				break;
 		} else if (mem_cgroup_margin(root_memcg))
-			return total;
+			break;
 	}
+	mem_cgroup_iter_break(root_memcg, victim);
 	return total;
 }
 
@@ -1745,16 +1696,16 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
 static bool mem_cgroup_oom_lock(struct mem_cgroup *memcg)
 {
 	struct mem_cgroup *iter, *failed = NULL;
-	bool cond = true;
 
-	for_each_mem_cgroup_tree_cond(iter, memcg, cond) {
+	for_each_mem_cgroup_tree(iter, memcg) {
 		if (iter->oom_lock) {
 			/*
 			 * this subtree of our hierarchy is already locked
 			 * so we cannot give a lock.
 			 */
 			failed = iter;
-			cond = false;
+			mem_cgroup_iter_break(memcg, iter);
+			break;
 		} else
 			iter->oom_lock = true;
 	}
@@ -1766,11 +1717,10 @@ static bool mem_cgroup_oom_lock(struct mem_cgroup *memcg)
 	 * OK, we failed to lock the whole subtree so we have to clean up
 	 * what we set up to the failing subtree
 	 */
-	cond = true;
-	for_each_mem_cgroup_tree_cond(iter, memcg, cond) {
+	for_each_mem_cgroup_tree(iter, memcg) {
 		if (iter == failed) {
-			cond = false;
-			continue;
+			mem_cgroup_iter_break(memcg, iter);
+			break;
 		}
 		iter->oom_lock = false;
 	}
@@ -2166,7 +2116,7 @@ static int __cpuinit memcg_cpu_hotplug_callback(struct notifier_block *nb,
 	struct mem_cgroup *iter;
 
 	if ((action == CPU_ONLINE)) {
-		for_each_mem_cgroup_all(iter)
+		for_each_mem_cgroup(iter)
 			synchronize_mem_cgroup_on_move(iter, cpu);
 		return NOTIFY_OK;
 	}
@@ -2174,7 +2124,7 @@ static int __cpuinit memcg_cpu_hotplug_callback(struct notifier_block *nb,
 	if ((action != CPU_DEAD) || action != CPU_DEAD_FROZEN)
 		return NOTIFY_OK;
 
-	for_each_mem_cgroup_all(iter)
+	for_each_mem_cgroup(iter)
 		mem_cgroup_drain_pcp_counter(iter, cpu);
 
 	stock = &per_cpu(memcg_stock, cpu);
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 130+ messages in thread

* [patch 02/11] mm: vmscan: distinguish global reclaim from global LRU scanning
  2011-09-12 10:57 ` Johannes Weiner
@ 2011-09-12 10:57   ` Johannes Weiner
  -1 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-12 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

The traditional zone reclaim code is scanning the per-zone LRU lists
during direct reclaim and kswapd, and the per-zone per-memory cgroup
LRU lists when reclaiming on behalf of a memory cgroup limit.

Subsequent patches will convert the traditional reclaim code to
reclaim exclusively from the per-memory cgroup LRU lists.  As a
result, checking which LRU list is being scanned will no longer be
enough to tell global reclaim from limit reclaim.

This patch adds a global_reclaim() predicate to tell direct/kswapd
reclaim from memory cgroup limit reclaim and substitutes it in all
places where scanning_global_lru() is currently used for that purpose.
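
As a self-contained illustration of the state after this patch (a
simplified, userspace-compilable sketch; scan_control is pared down to
the one field that matters here):

#include <stdbool.h>
#include <stdio.h>

struct mem_cgroup;

struct scan_control {
	struct mem_cgroup *mem_cgroup;	/* memcg whose limit was hit, or NULL */
};

/* Was reclaim entered from kswapd/direct reclaim rather than a limit? */
static bool global_reclaim(struct scan_control *sc)
{
	return !sc->mem_cgroup;
}

/* Is the global LRU, rather than a per-memcg LRU, being scanned? */
static bool scanning_global_lru(struct scan_control *sc)
{
	return !sc->mem_cgroup;
}

int main(void)
{
	struct scan_control sc = { .mem_cgroup = NULL };

	/* Identical answers for now; the split is about call-site intent. */
	printf("%d %d\n", global_reclaim(&sc), scanning_global_lru(&sc));
	return 0;
}

The two predicates are deliberately identical at this point; once
global reclaim starts walking the per-memcg LRU lists, only
global_reclaim() keeps answering the "who triggered reclaim" question.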

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
---
 mm/vmscan.c |   60 +++++++++++++++++++++++++++++++++++-----------------------
 1 files changed, 36 insertions(+), 24 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7502726..354f125 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -153,9 +153,25 @@ static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
-#define scanning_global_lru(sc)	(!(sc)->mem_cgroup)
+static bool global_reclaim(struct scan_control *sc)
+{
+	return !sc->mem_cgroup;
+}
+
+static bool scanning_global_lru(struct scan_control *sc)
+{
+	return !sc->mem_cgroup;
+}
 #else
-#define scanning_global_lru(sc)	(1)
+static bool global_reclaim(struct scan_control *sc)
+{
+	return true;
+}
+
+static bool scanning_global_lru(struct scan_control *sc)
+{
+	return true;
+}
 #endif
 
 static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
@@ -1011,7 +1027,7 @@ keep_lumpy:
 	 * back off and wait for congestion to clear because further reclaim
 	 * will encounter the same problem
 	 */
-	if (nr_dirty && nr_dirty == nr_congested && scanning_global_lru(sc))
+	if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc))
 		zone_set_flag(zone, ZONE_CONGESTED);
 
 	free_page_list(&free_pages);
@@ -1330,7 +1346,7 @@ static int too_many_isolated(struct zone *zone, int file,
 	if (current_is_kswapd())
 		return 0;
 
-	if (!scanning_global_lru(sc))
+	if (!global_reclaim(sc))
 		return 0;
 
 	if (file) {
@@ -1508,6 +1524,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	if (scanning_global_lru(sc)) {
 		nr_taken = isolate_pages_global(nr_to_scan, &page_list,
 			&nr_scanned, sc->order, reclaim_mode, zone, 0, file);
+	} else {
+		nr_taken = mem_cgroup_isolate_pages(nr_to_scan, &page_list,
+			&nr_scanned, sc->order, reclaim_mode, zone,
+			sc->mem_cgroup, 0, file);
+	}
+	if (global_reclaim(sc)) {
 		zone->pages_scanned += nr_scanned;
 		if (current_is_kswapd())
 			__count_zone_vm_events(PGSCAN_KSWAPD, zone,
@@ -1515,14 +1537,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 		else
 			__count_zone_vm_events(PGSCAN_DIRECT, zone,
 					       nr_scanned);
-	} else {
-		nr_taken = mem_cgroup_isolate_pages(nr_to_scan, &page_list,
-			&nr_scanned, sc->order, reclaim_mode, zone,
-			sc->mem_cgroup, 0, file);
-		/*
-		 * mem_cgroup_isolate_pages() keeps track of
-		 * scanned pages on its own.
-		 */
 	}
 
 	if (nr_taken == 0) {
@@ -1647,18 +1661,16 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 						&pgscanned, sc->order,
 						reclaim_mode, zone,
 						1, file);
-		zone->pages_scanned += pgscanned;
 	} else {
 		nr_taken = mem_cgroup_isolate_pages(nr_pages, &l_hold,
 						&pgscanned, sc->order,
 						reclaim_mode, zone,
 						sc->mem_cgroup, 1, file);
-		/*
-		 * mem_cgroup_isolate_pages() keeps track of
-		 * scanned pages on its own.
-		 */
 	}
 
+	if (global_reclaim(sc))
+		zone->pages_scanned += pgscanned;
+
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
 	__count_zone_vm_events(PGREFILL, zone, pgscanned);
@@ -1863,9 +1875,9 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
 	 * latencies, so it's better to scan a minimum amount there as
 	 * well.
 	 */
-	if (scanning_global_lru(sc) && current_is_kswapd())
+	if (current_is_kswapd())
 		force_scan = true;
-	if (!scanning_global_lru(sc))
+	if (!global_reclaim(sc))
 		force_scan = true;
 
 	/* If we have no swap space, do not bother scanning anon pages. */
@@ -1882,7 +1894,7 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
 	file  = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
 		zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
 
-	if (scanning_global_lru(sc)) {
+	if (global_reclaim(sc)) {
 		free  = zone_page_state(zone, NR_FREE_PAGES);
 		/* If we have very few page cache pages,
 		   force-scan anon pages. */
@@ -2109,7 +2121,7 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 		 * Take care memory controller reclaiming has small influence
 		 * to global LRU.
 		 */
-		if (scanning_global_lru(sc)) {
+		if (global_reclaim(sc)) {
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
@@ -2188,7 +2200,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	get_mems_allowed();
 	delayacct_freepages_start();
 
-	if (scanning_global_lru(sc))
+	if (global_reclaim(sc))
 		count_vm_event(ALLOCSTALL);
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
@@ -2200,7 +2212,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 * Don't shrink slabs when reclaiming memory from
 		 * over limit cgroups
 		 */
-		if (scanning_global_lru(sc)) {
+		if (global_reclaim(sc)) {
 			unsigned long lru_pages = 0;
 			for_each_zone_zonelist(zone, z, zonelist,
 					gfp_zone(sc->gfp_mask)) {
@@ -2261,7 +2273,7 @@ out:
 		return 0;
 
 	/* top priority shrink_zones still had more to do? don't OOM, then */
-	if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
+	if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc))
 		return 1;
 
 	return 0;
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 130+ messages in thread

* [patch 03/11] mm: vmscan: distinguish between memcg triggering reclaim and memcg being scanned
  2011-09-12 10:57 ` Johannes Weiner
@ 2011-09-12 10:57   ` Johannes Weiner
  -1 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-12 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

Memory cgroup hierarchies are currently handled completely outside of
the traditional reclaim code, which is invoked with a single memory
cgroup as an argument for the whole call stack.

Subsequent patches will switch this code to do hierarchical reclaim,
so there needs to be a distinction between a) the memory cgroup that
is triggering reclaim due to hitting its limit and b) the memory
cgroup that is being scanned as a child of a).

This patch introduces a struct mem_cgroup_zone that contains the
combination of the memory cgroup and the zone being scanned, which is
then passed down the stack instead of the zone argument.
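
A condensed, userspace-compilable sketch of the resulting split (struct
layouts are reduced to the relevant fields, and the child/zone values
are made up): scan_control records who triggered reclaim for the whole
invocation, while mem_cgroup_zone names what is being scanned at the
current level, so the two can describe different cgroups once the later
patches walk the hierarchy.

#include <stdbool.h>
#include <stdio.h>

struct mem_cgroup { const char *name; };
struct zone { const char *name; };

struct scan_control {
	/* memcg that hit its limit; NULL for kswapd/direct reclaim */
	struct mem_cgroup *target_mem_cgroup;
};

struct mem_cgroup_zone {
	/* memcg and zone currently being scanned */
	struct mem_cgroup *mem_cgroup;
	struct zone *zone;
};

static bool global_reclaim(struct scan_control *sc)
{
	return !sc->target_mem_cgroup;
}

static bool scanning_global_lru(struct mem_cgroup_zone *mz)
{
	return !mz->mem_cgroup;
}

int main(void)
{
	struct mem_cgroup child = { "some-child-memcg" };
	struct zone normal = { "Normal" };

	/* global (kswapd) reclaim that has descended into a child memcg */
	struct scan_control sc = { .target_mem_cgroup = NULL };
	struct mem_cgroup_zone mz = { .mem_cgroup = &child, .zone = &normal };

	printf("global_reclaim=%d scanning_global_lru=%d\n",
	       global_reclaim(&sc), scanning_global_lru(&mz));
	return 0;
}

Within this patch the two still agree, because shrink_zone() fills
mz->mem_cgroup straight from sc->target_mem_cgroup; the example
anticipates the hierarchical walk added later in the series.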

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
---
 mm/vmscan.c |  251 +++++++++++++++++++++++++++++++++--------------------------
 1 files changed, 142 insertions(+), 109 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 354f125..92f4e22 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -103,8 +103,11 @@ struct scan_control {
 	 */
 	reclaim_mode_t reclaim_mode;
 
-	/* Which cgroup do we reclaim from */
-	struct mem_cgroup *mem_cgroup;
+	/*
+	 * The memory cgroup that hit its limit and as a result is the
+	 * primary target of this reclaim invocation.
+	 */
+	struct mem_cgroup *target_mem_cgroup;
 
 	/*
 	 * Nodemask of nodes allowed by the caller. If NULL, all nodes
@@ -113,6 +116,11 @@ struct scan_control {
 	nodemask_t	*nodemask;
 };
 
+struct mem_cgroup_zone {
+	struct mem_cgroup *mem_cgroup;
+	struct zone *zone;
+};
+
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
 
 #ifdef ARCH_HAS_PREFETCH
@@ -155,12 +163,12 @@ static DECLARE_RWSEM(shrinker_rwsem);
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 static bool global_reclaim(struct scan_control *sc)
 {
-	return !sc->mem_cgroup;
+	return !sc->target_mem_cgroup;
 }
 
-static bool scanning_global_lru(struct scan_control *sc)
+static bool scanning_global_lru(struct mem_cgroup_zone *mz)
 {
-	return !sc->mem_cgroup;
+	return !mz->mem_cgroup;
 }
 #else
 static bool global_reclaim(struct scan_control *sc)
@@ -168,29 +176,30 @@ static bool global_reclaim(struct scan_control *sc)
 	return true;
 }
 
-static bool scanning_global_lru(struct scan_control *sc)
+static bool scanning_global_lru(struct mem_cgroup_zone *mz)
 {
 	return true;
 }
 #endif
 
-static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
-						  struct scan_control *sc)
+static struct zone_reclaim_stat *get_reclaim_stat(struct mem_cgroup_zone *mz)
 {
-	if (!scanning_global_lru(sc))
-		return mem_cgroup_get_reclaim_stat(sc->mem_cgroup, zone);
+	if (!scanning_global_lru(mz))
+		return mem_cgroup_get_reclaim_stat(mz->mem_cgroup, mz->zone);
 
-	return &zone->reclaim_stat;
+	return &mz->zone->reclaim_stat;
 }
 
-static unsigned long zone_nr_lru_pages(struct zone *zone,
-				struct scan_control *sc, enum lru_list lru)
+static unsigned long zone_nr_lru_pages(struct mem_cgroup_zone *mz,
+				       enum lru_list lru)
 {
-	if (!scanning_global_lru(sc))
-		return mem_cgroup_zone_nr_lru_pages(sc->mem_cgroup,
-				zone_to_nid(zone), zone_idx(zone), BIT(lru));
+	if (!scanning_global_lru(mz))
+		return mem_cgroup_zone_nr_lru_pages(mz->mem_cgroup,
+						    zone_to_nid(mz->zone),
+						    zone_idx(mz->zone),
+						    BIT(lru));
 
-	return zone_page_state(zone, NR_LRU_BASE + lru);
+	return zone_page_state(mz->zone, NR_LRU_BASE + lru);
 }
 
 
@@ -692,12 +701,13 @@ enum page_references {
 };
 
 static enum page_references page_check_references(struct page *page,
+						  struct mem_cgroup_zone *mz,
 						  struct scan_control *sc)
 {
 	int referenced_ptes, referenced_page;
 	unsigned long vm_flags;
 
-	referenced_ptes = page_referenced(page, 1, sc->mem_cgroup, &vm_flags);
+	referenced_ptes = page_referenced(page, 1, mz->mem_cgroup, &vm_flags);
 	referenced_page = TestClearPageReferenced(page);
 
 	/* Lumpy reclaim - ignore references */
@@ -771,7 +781,7 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
  * shrink_page_list() returns the number of reclaimed pages
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
-				      struct zone *zone,
+				      struct mem_cgroup_zone *mz,
 				      struct scan_control *sc,
 				      int priority,
 				      unsigned long *ret_nr_dirty,
@@ -802,7 +812,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			goto keep;
 
 		VM_BUG_ON(PageActive(page));
-		VM_BUG_ON(page_zone(page) != zone);
+		VM_BUG_ON(page_zone(page) != mz->zone);
 
 		sc->nr_scanned++;
 
@@ -836,7 +846,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-		references = page_check_references(page, sc);
+		references = page_check_references(page, mz, sc);
 		switch (references) {
 		case PAGEREF_ACTIVATE:
 			goto activate_locked;
@@ -1028,7 +1038,7 @@ keep_lumpy:
 	 * will encounter the same problem
 	 */
 	if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc))
-		zone_set_flag(zone, ZONE_CONGESTED);
+		zone_set_flag(mz->zone, ZONE_CONGESTED);
 
 	free_page_list(&free_pages);
 
@@ -1364,13 +1374,14 @@ static int too_many_isolated(struct zone *zone, int file,
  * TODO: Try merging with migrations version of putback_lru_pages
  */
 static noinline_for_stack void
-putback_lru_pages(struct zone *zone, struct scan_control *sc,
-				unsigned long nr_anon, unsigned long nr_file,
-				struct list_head *page_list)
+putback_lru_pages(struct mem_cgroup_zone *mz, struct scan_control *sc,
+		  unsigned long nr_anon, unsigned long nr_file,
+		  struct list_head *page_list)
 {
 	struct page *page;
 	struct pagevec pvec;
-	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+	struct zone *zone = mz->zone;
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(mz);
 
 	pagevec_init(&pvec, 1);
 
@@ -1410,15 +1421,17 @@ putback_lru_pages(struct zone *zone, struct scan_control *sc,
 	pagevec_release(&pvec);
 }
 
-static noinline_for_stack void update_isolated_counts(struct zone *zone,
-					struct scan_control *sc,
-					unsigned long *nr_anon,
-					unsigned long *nr_file,
-					struct list_head *isolated_list)
+static noinline_for_stack void
+update_isolated_counts(struct mem_cgroup_zone *mz,
+		       struct scan_control *sc,
+		       unsigned long *nr_anon,
+		       unsigned long *nr_file,
+		       struct list_head *isolated_list)
 {
 	unsigned long nr_active;
+	struct zone *zone = mz->zone;
 	unsigned int count[NR_LRU_LISTS] = { 0, };
-	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(mz);
 
 	nr_active = clear_active_flags(isolated_list, count);
 	__count_vm_events(PGDEACTIVATE, nr_active);
@@ -1487,8 +1500,8 @@ static inline bool should_reclaim_stall(unsigned long nr_taken,
  * of reclaimed pages
  */
 static noinline_for_stack unsigned long
-shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
-			struct scan_control *sc, int priority, int file)
+shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
+		     struct scan_control *sc, int priority, int file)
 {
 	LIST_HEAD(page_list);
 	unsigned long nr_scanned;
@@ -1499,6 +1512,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	unsigned long nr_dirty = 0;
 	unsigned long nr_writeback = 0;
 	isolate_mode_t reclaim_mode = ISOLATE_INACTIVE;
+	struct zone *zone = mz->zone;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1521,13 +1535,13 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	spin_lock_irq(&zone->lru_lock);
 
-	if (scanning_global_lru(sc)) {
+	if (scanning_global_lru(mz)) {
 		nr_taken = isolate_pages_global(nr_to_scan, &page_list,
 			&nr_scanned, sc->order, reclaim_mode, zone, 0, file);
 	} else {
 		nr_taken = mem_cgroup_isolate_pages(nr_to_scan, &page_list,
 			&nr_scanned, sc->order, reclaim_mode, zone,
-			sc->mem_cgroup, 0, file);
+			mz->mem_cgroup, 0, file);
 	}
 	if (global_reclaim(sc)) {
 		zone->pages_scanned += nr_scanned;
@@ -1544,17 +1558,17 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 		return 0;
 	}
 
-	update_isolated_counts(zone, sc, &nr_anon, &nr_file, &page_list);
+	update_isolated_counts(mz, sc, &nr_anon, &nr_file, &page_list);
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	nr_reclaimed = shrink_page_list(&page_list, zone, sc, priority,
+	nr_reclaimed = shrink_page_list(&page_list, mz, sc, priority,
 						&nr_dirty, &nr_writeback);
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
 		set_reclaim_mode(priority, sc, true);
-		nr_reclaimed += shrink_page_list(&page_list, zone, sc,
+		nr_reclaimed += shrink_page_list(&page_list, mz, sc,
 					priority, &nr_dirty, &nr_writeback);
 	}
 
@@ -1563,7 +1577,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
 	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
+	putback_lru_pages(mz, sc, nr_anon, nr_file, &page_list);
 
 	/*
 	 * If we have encountered a high number of dirty pages under writeback
@@ -1634,8 +1648,10 @@ static void move_active_pages_to_lru(struct zone *zone,
 		__count_vm_events(PGDEACTIVATE, pgmoved);
 }
 
-static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
-			struct scan_control *sc, int priority, int file)
+static void shrink_active_list(unsigned long nr_pages,
+			       struct mem_cgroup_zone *mz,
+			       struct scan_control *sc,
+			       int priority, int file)
 {
 	unsigned long nr_taken;
 	unsigned long pgscanned;
@@ -1644,9 +1660,10 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 	LIST_HEAD(l_active);
 	LIST_HEAD(l_inactive);
 	struct page *page;
-	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(mz);
 	unsigned long nr_rotated = 0;
 	isolate_mode_t reclaim_mode = ISOLATE_ACTIVE;
+	struct zone *zone = mz->zone;
 
 	lru_add_drain();
 
@@ -1656,7 +1673,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 		reclaim_mode |= ISOLATE_CLEAN;
 
 	spin_lock_irq(&zone->lru_lock);
-	if (scanning_global_lru(sc)) {
+	if (scanning_global_lru(mz)) {
 		nr_taken = isolate_pages_global(nr_pages, &l_hold,
 						&pgscanned, sc->order,
 						reclaim_mode, zone,
@@ -1665,7 +1682,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 		nr_taken = mem_cgroup_isolate_pages(nr_pages, &l_hold,
 						&pgscanned, sc->order,
 						reclaim_mode, zone,
-						sc->mem_cgroup, 1, file);
+						mz->mem_cgroup, 1, file);
 	}
 
 	if (global_reclaim(sc))
@@ -1691,7 +1708,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 			continue;
 		}
 
-		if (page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) {
+		if (page_referenced(page, 0, mz->mem_cgroup, &vm_flags)) {
 			nr_rotated += hpage_nr_pages(page);
 			/*
 			 * Identify referenced, file-backed active pages and
@@ -1754,10 +1771,8 @@ static int inactive_anon_is_low_global(struct zone *zone)
  * Returns true if the zone does not have enough inactive anon pages,
  * meaning some active anon pages need to be deactivated.
  */
-static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
+static int inactive_anon_is_low(struct mem_cgroup_zone *mz)
 {
-	int low;
-
 	/*
 	 * If we don't have swap space, anonymous page deactivation
 	 * is pointless.
@@ -1765,15 +1780,14 @@ static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
 	if (!total_swap_pages)
 		return 0;
 
-	if (scanning_global_lru(sc))
-		low = inactive_anon_is_low_global(zone);
-	else
-		low = mem_cgroup_inactive_anon_is_low(sc->mem_cgroup, zone);
-	return low;
+	if (!scanning_global_lru(mz))
+		return mem_cgroup_inactive_anon_is_low(mz->mem_cgroup,
+						       mz->zone);
+
+	return inactive_anon_is_low_global(mz->zone);
 }
 #else
-static inline int inactive_anon_is_low(struct zone *zone,
-					struct scan_control *sc)
+static inline int inactive_anon_is_low(struct mem_cgroup_zone *mz)
 {
 	return 0;
 }
@@ -1791,8 +1805,7 @@ static int inactive_file_is_low_global(struct zone *zone)
 
 /**
  * inactive_file_is_low - check if file pages need to be deactivated
- * @zone: zone to check
- * @sc:   scan control of this context
+ * @mz: memory cgroup and zone to check
  *
  * When the system is doing streaming IO, memory pressure here
  * ensures that active file pages get deactivated, until more
@@ -1804,45 +1817,44 @@ static int inactive_file_is_low_global(struct zone *zone)
  * This uses a different ratio than the anonymous pages, because
  * the page cache uses a use-once replacement algorithm.
  */
-static int inactive_file_is_low(struct zone *zone, struct scan_control *sc)
+static int inactive_file_is_low(struct mem_cgroup_zone *mz)
 {
-	int low;
+	if (!scanning_global_lru(mz))
+		return mem_cgroup_inactive_file_is_low(mz->mem_cgroup,
+						       mz->zone);
 
-	if (scanning_global_lru(sc))
-		low = inactive_file_is_low_global(zone);
-	else
-		low = mem_cgroup_inactive_file_is_low(sc->mem_cgroup, zone);
-	return low;
+	return inactive_file_is_low_global(mz->zone);
 }
 
-static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
-				int file)
+static int inactive_list_is_low(struct mem_cgroup_zone *mz, int file)
 {
 	if (file)
-		return inactive_file_is_low(zone, sc);
+		return inactive_file_is_low(mz);
 	else
-		return inactive_anon_is_low(zone, sc);
+		return inactive_anon_is_low(mz);
 }
 
 static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
-	struct zone *zone, struct scan_control *sc, int priority)
+				 struct mem_cgroup_zone *mz,
+				 struct scan_control *sc, int priority)
 {
 	int file = is_file_lru(lru);
 
 	if (is_active_lru(lru)) {
-		if (inactive_list_is_low(zone, sc, file))
-		    shrink_active_list(nr_to_scan, zone, sc, priority, file);
+		if (inactive_list_is_low(mz, file))
+		    shrink_active_list(nr_to_scan, mz, sc, priority, file);
 		return 0;
 	}
 
-	return shrink_inactive_list(nr_to_scan, zone, sc, priority, file);
+	return shrink_inactive_list(nr_to_scan, mz, sc, priority, file);
 }
 
-static int vmscan_swappiness(struct scan_control *sc)
+static int vmscan_swappiness(struct mem_cgroup_zone *mz,
+			     struct scan_control *sc)
 {
-	if (scanning_global_lru(sc))
+	if (global_reclaim(sc))
 		return vm_swappiness;
-	return mem_cgroup_swappiness(sc->mem_cgroup);
+	return mem_cgroup_swappiness(mz->mem_cgroup);
 }
 
 /*
@@ -1853,13 +1865,13 @@ static int vmscan_swappiness(struct scan_control *sc)
  *
  * nr[0] = anon pages to scan; nr[1] = file pages to scan
  */
-static void get_scan_count(struct zone *zone, struct scan_control *sc,
-					unsigned long *nr, int priority)
+static void get_scan_count(struct mem_cgroup_zone *mz, struct scan_control *sc,
+			   unsigned long *nr, int priority)
 {
 	unsigned long anon, file, free;
 	unsigned long anon_prio, file_prio;
 	unsigned long ap, fp;
-	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(mz);
 	u64 fraction[2], denominator;
 	enum lru_list l;
 	int noswap = 0;
@@ -1889,16 +1901,16 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
 		goto out;
 	}
 
-	anon  = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_ANON) +
-		zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON);
-	file  = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
-		zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
+	anon  = zone_nr_lru_pages(mz, LRU_ACTIVE_ANON) +
+		zone_nr_lru_pages(mz, LRU_INACTIVE_ANON);
+	file  = zone_nr_lru_pages(mz, LRU_ACTIVE_FILE) +
+		zone_nr_lru_pages(mz, LRU_INACTIVE_FILE);
 
 	if (global_reclaim(sc)) {
-		free  = zone_page_state(zone, NR_FREE_PAGES);
+		free  = zone_page_state(mz->zone, NR_FREE_PAGES);
 		/* If we have very few page cache pages,
 		   force-scan anon pages. */
-		if (unlikely(file + free <= high_wmark_pages(zone))) {
+		if (unlikely(file + free <= high_wmark_pages(mz->zone))) {
 			fraction[0] = 1;
 			fraction[1] = 0;
 			denominator = 1;
@@ -1910,8 +1922,8 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
 	 * With swappiness at 100, anonymous and file have the same priority.
 	 * This scanning priority is essentially the inverse of IO cost.
 	 */
-	anon_prio = vmscan_swappiness(sc);
-	file_prio = 200 - vmscan_swappiness(sc);
+	anon_prio = vmscan_swappiness(mz, sc);
+	file_prio = 200 - vmscan_swappiness(mz, sc);
 
 	/*
 	 * OK, so we have swap space and a fair amount of page cache
@@ -1924,7 +1936,7 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
 	 *
 	 * anon in [0], file in [1]
 	 */
-	spin_lock_irq(&zone->lru_lock);
+	spin_lock_irq(&mz->zone->lru_lock);
 	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
 		reclaim_stat->recent_scanned[0] /= 2;
 		reclaim_stat->recent_rotated[0] /= 2;
@@ -1945,7 +1957,7 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
 
 	fp = (file_prio + 1) * (reclaim_stat->recent_scanned[1] + 1);
 	fp /= reclaim_stat->recent_rotated[1] + 1;
-	spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(&mz->zone->lru_lock);
 
 	fraction[0] = ap;
 	fraction[1] = fp;
@@ -1955,7 +1967,7 @@ out:
 		int file = is_file_lru(l);
 		unsigned long scan;
 
-		scan = zone_nr_lru_pages(zone, sc, l);
+		scan = zone_nr_lru_pages(mz, l);
 		if (priority || noswap) {
 			scan >>= priority;
 			if (!scan && force_scan)
@@ -1973,7 +1985,7 @@ out:
  * back to the allocator and call try_to_compact_zone(), we ensure that
  * there are enough free pages for it to be likely successful
  */
-static inline bool should_continue_reclaim(struct zone *zone,
+static inline bool should_continue_reclaim(struct mem_cgroup_zone *mz,
 					unsigned long nr_reclaimed,
 					unsigned long nr_scanned,
 					struct scan_control *sc)
@@ -2013,14 +2025,14 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	 * inactive lists are large enough, continue reclaiming
 	 */
 	pages_for_compaction = (2UL << sc->order);
-	inactive_lru_pages = zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON) +
-				zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
+	inactive_lru_pages = zone_nr_lru_pages(mz, LRU_INACTIVE_ANON) +
+				zone_nr_lru_pages(mz, LRU_INACTIVE_FILE);
 	if (sc->nr_reclaimed < pages_for_compaction &&
 			inactive_lru_pages > pages_for_compaction)
 		return true;
 
 	/* If compaction would go ahead or the allocation would succeed, stop */
-	switch (compaction_suitable(zone, sc->order)) {
+	switch (compaction_suitable(mz->zone, sc->order)) {
 	case COMPACT_PARTIAL:
 	case COMPACT_CONTINUE:
 		return false;
@@ -2032,8 +2044,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
 /*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
-static void shrink_zone(int priority, struct zone *zone,
-				struct scan_control *sc)
+static void shrink_mem_cgroup_zone(int priority, struct mem_cgroup_zone *mz,
+				   struct scan_control *sc)
 {
 	unsigned long nr[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
@@ -2045,7 +2057,7 @@ static void shrink_zone(int priority, struct zone *zone,
 restart:
 	nr_reclaimed = 0;
 	nr_scanned = sc->nr_scanned;
-	get_scan_count(zone, sc, nr, priority);
+	get_scan_count(mz, sc, nr, priority);
 
 	blk_start_plug(&plug);
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
@@ -2057,7 +2069,7 @@ restart:
 				nr[l] -= nr_to_scan;
 
 				nr_reclaimed += shrink_list(l, nr_to_scan,
-							    zone, sc, priority);
+							    mz, sc, priority);
 			}
 		}
 		/*
@@ -2078,17 +2090,28 @@ restart:
 	 * Even if we did not try to evict anon pages at all, we want to
 	 * rebalance the anon lru active/inactive ratio.
 	 */
-	if (inactive_anon_is_low(zone, sc))
-		shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
+	if (inactive_anon_is_low(mz))
+		shrink_active_list(SWAP_CLUSTER_MAX, mz, sc, priority, 0);
 
 	/* reclaim/compaction might need reclaim to continue */
-	if (should_continue_reclaim(zone, nr_reclaimed,
+	if (should_continue_reclaim(mz, nr_reclaimed,
 					sc->nr_scanned - nr_scanned, sc))
 		goto restart;
 
 	throttle_vm_writeout(sc->gfp_mask);
 }
 
+static void shrink_zone(int priority, struct zone *zone,
+			struct scan_control *sc)
+{
+	struct mem_cgroup_zone mz = {
+		.mem_cgroup = sc->target_mem_cgroup,
+		.zone = zone,
+	};
+
+	shrink_mem_cgroup_zone(priority, &mz, sc);
+}
+
 /*
  * This is the direct reclaim path, for page-allocating processes.  We only
  * try to reclaim pages from zones which will satisfy the caller's allocation
@@ -2206,7 +2229,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		sc->nr_scanned = 0;
 		if (!priority)
-			disable_swap_token(sc->mem_cgroup);
+			disable_swap_token(sc->target_mem_cgroup);
 		shrink_zones(priority, zonelist, sc);
 		/*
 		 * Don't shrink slabs when reclaiming memory from
@@ -2290,7 +2313,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.may_unmap = 1,
 		.may_swap = 1,
 		.order = order,
-		.mem_cgroup = NULL,
+		.target_mem_cgroup = NULL,
 		.nodemask = nodemask,
 	};
 	struct shrink_control shrink = {
@@ -2322,7 +2345,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 		.may_unmap = 1,
 		.may_swap = !noswap,
 		.order = 0,
-		.mem_cgroup = mem,
+		.target_mem_cgroup = mem,
 	};
 
 	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
@@ -2360,7 +2383,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 		.may_swap = !noswap,
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.order = 0,
-		.mem_cgroup = mem_cont,
+		.target_mem_cgroup = mem_cont,
 		.nodemask = NULL, /* we don't care the placement */
 		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
@@ -2390,6 +2413,18 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 }
 #endif
 
+static void age_active_anon(struct zone *zone, struct scan_control *sc,
+			    int priority)
+{
+	struct mem_cgroup_zone mz = {
+		.mem_cgroup = NULL,
+		.zone = zone,
+	};
+
+	if (inactive_anon_is_low(&mz))
+		shrink_active_list(SWAP_CLUSTER_MAX, &mz, sc, priority, 0);
+}
+
 /*
  * pgdat_balanced is used when checking if a node is balanced for high-order
  * allocations. Only zones that meet watermarks and are in a zone allowed
@@ -2510,7 +2545,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 		 */
 		.nr_to_reclaim = ULONG_MAX,
 		.order = order,
-		.mem_cgroup = NULL,
+		.target_mem_cgroup = NULL,
 	};
 	struct shrink_control shrink = {
 		.gfp_mask = sc.gfp_mask,
@@ -2549,9 +2584,7 @@ loop_again:
 			 * Do some background aging of the anon list, to give
 			 * pages a chance to be referenced before reclaiming.
 			 */
-			if (inactive_anon_is_low(zone, &sc))
-				shrink_active_list(SWAP_CLUSTER_MAX, zone,
-							&sc, priority, 0);
+			age_active_anon(zone, &sc, priority);
 
 			if (!zone_watermark_ok_safe(zone, order,
 					high_wmark_pages(zone), 0, 0)) {
-- 
1.7.6


* [patch 04/11] mm: memcg: per-priority per-zone hierarchy scan generations
  2011-09-12 10:57 ` Johannes Weiner
@ 2011-09-12 10:57   ` Johannes Weiner
  -1 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-12 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

Memory cgroup limit reclaim currently picks one memory cgroup out of
the target hierarchy, remembers it as the last scanned child, and
reclaims all zones in it with decreasing priority levels.

The new hierarchy reclaim code will pick memory cgroups from the same
hierarchy concurrently from different zones and priority levels, so it
becomes necessary that hierarchy roots not only remember the last
scanned child, but do so separately for each zone and priority level.

Furthermore, detecting full hierarchy round-trips reliably will become
crucial, so instead of counting on one iterator site seeing a certain
memory cgroup twice, use a generation counter that is increased every
time the child with the highest ID has been visited.
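
To illustrate the mechanism outside the kernel, here is a minimal
standalone C sketch of a shared per-(zone, priority) iterator slot with
a generation counter.  The names (iter_state, walk_token, next_child)
and the fixed three-child hierarchy are illustrative assumptions, not
the code added below:

#include <stdio.h>

#define NR_CHILDREN 3

/*
 * Shared state for one (zone, priority) slot: the last visited child
 * and a generation counter that is bumped once the walk wraps around.
 */
struct iter_state {
	int position;
	unsigned int generation;
};

/* Per-walker token: the generation observed when this walk started. */
struct walk_token {
	unsigned int generation;
};

/* Next child id after @id, or -1 once the highest id has been visited. */
static int next_child(int id)
{
	return id + 1 < NR_CHILDREN ? id + 1 : -1;
}

/*
 * One iteration step.  Returns the next child to scan, or -1 when the
 * round-trip is complete -- either because this walker wrapped around
 * itself, or because a concurrent walker already did, which is detected
 * through the generation mismatch.
 */
static int iter_next(struct iter_state *is, struct walk_token *tok,
		     int started)
{
	int id;

	if (started && tok->generation != is->generation)
		return -1;

	id = next_child(is->position);
	if (id < 0) {
		is->position = -1;
		is->generation++;
		return -1;
	}
	is->position = id;
	if (!started)
		tok->generation = is->generation;
	return id;
}

int main(void)
{
	struct iter_state shared = { .position = -1, .generation = 0 };
	struct walk_token t1, t2;
	int id;

	/* Two walkers sharing the same slot split the children. */
	printf("walker 1 scans child %d\n", iter_next(&shared, &t1, 0));
	printf("walker 2 scans child %d\n", iter_next(&shared, &t2, 0));

	while ((id = iter_next(&shared, &t1, 1)) >= 0)
		printf("walker 1 scans child %d\n", id);

	/* Walker 2 sees the new generation and stops instead of rescanning. */
	id = iter_next(&shared, &t2, 1);
	printf("walker 2 round-trip complete (%d)\n", id);
	return 0;
}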

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
---
 mm/memcontrol.c |   60 +++++++++++++++++++++++++++++++++++++++---------------
 1 files changed, 43 insertions(+), 17 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 912c7c7..f4b404e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -121,6 +121,11 @@ struct mem_cgroup_stat_cpu {
 	unsigned long targets[MEM_CGROUP_NTARGETS];
 };
 
+struct mem_cgroup_iter_state {
+	int position;
+	unsigned int generation;
+};
+
 /*
  * per-zone information in memory controller.
  */
@@ -131,6 +136,8 @@ struct mem_cgroup_per_zone {
 	struct list_head	lists[NR_LRU_LISTS];
 	unsigned long		count[NR_LRU_LISTS];
 
+	struct mem_cgroup_iter_state iter_state[DEF_PRIORITY + 1];
+
 	struct zone_reclaim_stat reclaim_stat;
 	struct rb_node		tree_node;	/* RB tree node */
 	unsigned long long	usage_in_excess;/* Set to the value by which */
@@ -231,11 +238,6 @@ struct mem_cgroup {
 	 * per zone LRU lists.
 	 */
 	struct mem_cgroup_lru_info info;
-	/*
-	 * While reclaiming in a hierarchy, we cache the last child we
-	 * reclaimed from.
-	 */
-	int last_scanned_child;
 	int last_scanned_node;
 #if MAX_NUMNODES > 1
 	nodemask_t	scan_nodes;
@@ -781,9 +783,15 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
 	return memcg;
 }
 
+struct mem_cgroup_iter {
+	struct zone *zone;
+	int priority;
+	unsigned int generation;
+};
+
 static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 					  struct mem_cgroup *prev,
-					  bool remember)
+					  struct mem_cgroup_iter *iter)
 {
 	struct mem_cgroup *mem = NULL;
 	int id = 0;
@@ -791,7 +799,7 @@ static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	if (!root)
 		root = root_mem_cgroup;
 
-	if (prev && !remember)
+	if (prev && !iter)
 		id = css_id(&prev->css);
 
 	if (prev && prev != root)
@@ -804,10 +812,20 @@ static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	}
 
 	while (!mem) {
+		struct mem_cgroup_iter_state *uninitialized_var(is);
 		struct cgroup_subsys_state *css;
 
-		if (remember)
-			id = root->last_scanned_child;
+		if (iter) {
+			int nid = zone_to_nid(iter->zone);
+			int zid = zone_idx(iter->zone);
+			struct mem_cgroup_per_zone *mz;
+
+			mz = mem_cgroup_zoneinfo(root, nid, zid);
+			is = &mz->iter_state[iter->priority];
+			if (prev && iter->generation != is->generation)
+				return NULL;
+			id = is->position;
+		}
 
 		rcu_read_lock();
 		css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
@@ -818,8 +836,13 @@ static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			id = 0;
 		rcu_read_unlock();
 
-		if (remember)
-			root->last_scanned_child = id;
+		if (iter) {
+			is->position = id;
+			if (!css)
+				is->generation++;
+			else if (!prev && mem)
+				iter->generation = is->generation;
+		}
 
 		if (prev && !css)
 			return NULL;
@@ -842,14 +865,14 @@ static void mem_cgroup_iter_break(struct mem_cgroup *root,
  * be used for reference counting.
  */
 #define for_each_mem_cgroup_tree(iter, root)		\
-	for (iter = mem_cgroup_iter(root, NULL, false);	\
+	for (iter = mem_cgroup_iter(root, NULL, NULL);	\
 	     iter != NULL;				\
-	     iter = mem_cgroup_iter(root, iter, false))
+	     iter = mem_cgroup_iter(root, iter, NULL))
 
 #define for_each_mem_cgroup(iter)			\
-	for (iter = mem_cgroup_iter(NULL, NULL, false);	\
+	for (iter = mem_cgroup_iter(NULL, NULL, NULL);	\
 	     iter != NULL;				\
-	     iter = mem_cgroup_iter(NULL, iter, false))
+	     iter = mem_cgroup_iter(NULL, iter, NULL))
 
 static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
 {
@@ -1619,6 +1642,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
 	bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
 	unsigned long excess;
 	unsigned long nr_scanned;
+	struct mem_cgroup_iter iter = {
+		.zone = zone,
+		.priority = 0,
+	};
 
 	excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
 
@@ -1627,7 +1654,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
 		noswap = true;
 
 	while (1) {
-		victim = mem_cgroup_iter(root_memcg, victim, true);
+		victim = mem_cgroup_iter(root_memcg, victim, &iter);
 		if (!victim) {
 			loop++;
 			/*
@@ -4878,7 +4905,6 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 		res_counter_init(&memcg->res, NULL);
 		res_counter_init(&memcg->memsw, NULL);
 	}
-	memcg->last_scanned_child = 0;
 	memcg->last_scanned_node = MAX_NUMNODES;
 	INIT_LIST_HEAD(&memcg->oom_notify);
 
-- 
1.7.6


* [patch 05/11] mm: move memcg hierarchy reclaim to generic reclaim code
  2011-09-12 10:57 ` Johannes Weiner
@ 2011-09-12 10:57   ` Johannes Weiner
  -1 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-12 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

Memory cgroup limit reclaim and traditional global pressure reclaim
will soon share the same code to reclaim from a hierarchical tree of
memory cgroups.

In preparation for this, move the two right next to each other in
shrink_zone().
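
For reference, the core of the new shrink_zone() loop added by the
vmscan.c hunk below, condensed here for readability (the limit-reclaim
early break and the global-reclaim shortcut are omitted; root, zone,
priority and sc are the surrounding function's variables):

	struct mem_cgroup *mem;
	struct mem_cgroup_iter iter = {
		.zone = zone,
		.priority = priority,
	};

	mem = mem_cgroup_iter(root, NULL, &iter);
	do {
		struct mem_cgroup_zone mz = {
			.mem_cgroup = mem,
			.zone = zone,
		};

		shrink_mem_cgroup_zone(priority, &mz, sc);
		mem = mem_cgroup_iter(root, mem, &iter);
	} while (mem);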

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
---
 include/linux/memcontrol.h |   25 ++++++-
 mm/memcontrol.c            |  167 ++++++++++++++++++++++----------------------
 mm/vmscan.c                |   43 ++++++++++-
 3 files changed, 147 insertions(+), 88 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b87068a..6575931 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -40,6 +40,12 @@ extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct mem_cgroup *mem_cont,
 					int active, int file);
 
+struct mem_cgroup_iter {
+	struct zone *zone;
+	int priority;
+	unsigned int generation;
+};
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
@@ -103,6 +109,11 @@ mem_cgroup_prepare_migration(struct page *page,
 extern void mem_cgroup_end_migration(struct mem_cgroup *memcg,
 	struct page *oldpage, struct page *newpage, bool migration_ok);
 
+struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
+				   struct mem_cgroup *,
+				   struct mem_cgroup_iter *);
+void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
+
 /*
  * For memory reclaim.
  */
@@ -276,7 +287,19 @@ static inline void mem_cgroup_end_migration(struct mem_cgroup *memcg,
 {
 }
 
-static inline int mem_cgroup_get_reclaim_priority(struct mem_cgroup *memcg)
+static inline struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
+						 struct mem_cgroup *prev,
+						 struct mem_cgroup_iter *iter)
+{
+	return NULL;
+}
+
+static inline void mem_cgroup_iter_break(struct mem_cgroup *root,
+					 struct mem_cgroup *prev)
+{
+}
+
+static inline int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem)
 {
 	return 0;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f4b404e..413e1f8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -362,8 +362,6 @@ enum charge_type {
 #define MEM_CGROUP_RECLAIM_NOSWAP	(1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
 #define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
 #define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
-#define MEM_CGROUP_RECLAIM_SOFT_BIT	0x2
-#define MEM_CGROUP_RECLAIM_SOFT		(1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
 
 static void mem_cgroup_get(struct mem_cgroup *memcg);
 static void mem_cgroup_put(struct mem_cgroup *memcg);
@@ -783,19 +781,33 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
 	return memcg;
 }
 
-struct mem_cgroup_iter {
-	struct zone *zone;
-	int priority;
-	unsigned int generation;
-};
-
-static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
-					  struct mem_cgroup *prev,
-					  struct mem_cgroup_iter *iter)
+/**
+ * mem_cgroup_iter - iterate over memory cgroup hierarchy
+ * @root: hierarchy root
+ * @prev: previously returned memcg, NULL on first invocation
+ * @iter: token for partial walks, NULL for full walks
+ *
+ * Returns references to children of the hierarchy starting at @root,
+ * or @root itself, or %NULL after a full round-trip.
+ *
+ * Caller must pass the return value in @prev on subsequent
+ * invocations for reference counting, or use mem_cgroup_iter_break()
+ * to cancel a hierarchy walk before the round-trip is complete.
+ *
+ * Reclaimers can specify a zone and a priority level in @iter to
+ * divide up the memcgs in the hierarchy among all concurrent
+ * reclaimers operating on the same zone and priority.
+ */
+struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
+				   struct mem_cgroup *prev,
+				   struct mem_cgroup_iter *iter)
 {
 	struct mem_cgroup *mem = NULL;
 	int id = 0;
 
+	if (mem_cgroup_disabled())
+		return NULL;
+
 	if (!root)
 		root = root_mem_cgroup;
 
@@ -850,8 +862,13 @@ static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	return mem;
 }
 
-static void mem_cgroup_iter_break(struct mem_cgroup *root,
-				  struct mem_cgroup *prev)
+/**
+ * mem_cgroup_iter_break - abort a hierarchy walk prematurely
+ * @root: hierarchy root
+ * @prev: last visited hierarchy member as returned by mem_cgroup_iter()
+ */
+void mem_cgroup_iter_break(struct mem_cgroup *root,
+			   struct mem_cgroup *prev)
 {
 	if (!root)
 		root = root_mem_cgroup;
@@ -1479,6 +1496,41 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
 	return min(limit, memsw);
 }
 
+static unsigned long mem_cgroup_reclaim(struct mem_cgroup *mem,
+					gfp_t gfp_mask,
+					unsigned long flags)
+{
+	unsigned long total = 0;
+	bool noswap = false;
+	int loop;
+
+	if (flags & MEM_CGROUP_RECLAIM_NOSWAP)
+		noswap = true;
+	else if (!(flags & MEM_CGROUP_RECLAIM_SHRINK) && mem->memsw_is_minimum)
+		noswap = true;
+
+	for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
+		if (loop)
+			drain_all_stock_async(mem);
+		total += try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap);
+		/*
+		 * Avoid freeing too much when shrinking to resize the
+		 * limit.  XXX: Shouldn't the margin check be enough?
+		 */
+		if (total && (flags & MEM_CGROUP_RECLAIM_SHRINK))
+			break;
+		if (mem_cgroup_margin(mem))
+			break;
+		/*
+		 * If nothing was reclaimed after two attempts, there
+		 * may be no reclaimable pages in this hierarchy.
+		 */
+		if (loop && !total)
+			break;
+	}
+	return total;
+}
+
 /**
  * test_mem_cgroup_node_reclaimable
  * @mem: the target memcg
@@ -1616,30 +1668,14 @@ bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
 }
 #endif
 
-/*
- * Scan the hierarchy if needed to reclaim memory. We remember the last child
- * we reclaimed from, so that we don't end up penalizing one child extensively
- * based on its position in the children list.
- *
- * root_memcg is the original ancestor that we've been reclaim from.
- *
- * We give up and return to the caller when we visit root_memcg twice.
- * (other groups can be removed while we're walking....)
- *
- * If shrink==true, for avoiding to free too much, this returns immedieately.
- */
-static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
-						struct zone *zone,
-						gfp_t gfp_mask,
-						unsigned long reclaim_options,
-						unsigned long *total_scanned)
+static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
+				   struct zone *zone,
+				   gfp_t gfp_mask,
+				   unsigned long *total_scanned)
 {
 	struct mem_cgroup *victim = NULL;
-	int ret, total = 0;
+	int total = 0;
 	int loop = 0;
-	bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
-	bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
-	bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
 	unsigned long excess;
 	unsigned long nr_scanned;
 	struct mem_cgroup_iter iter = {
@@ -1649,29 +1685,17 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
 
 	excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
 
-	/* If memsw_is_minimum==1, swap-out is of-no-use. */
-	if (!check_soft && !shrink && root_memcg->memsw_is_minimum)
-		noswap = true;
-
 	while (1) {
 		victim = mem_cgroup_iter(root_memcg, victim, &iter);
 		if (!victim) {
 			loop++;
-			/*
-			 * We are not draining per cpu cached charges during
-			 * soft limit reclaim  because global reclaim doesn't
-			 * care about charges. It tries to free some memory and
-			 * charges will not give any.
-			 */
-			if (!check_soft && loop >= 1)
-				drain_all_stock_async(root_memcg);
 			if (loop >= 2) {
 				/*
 				 * If we have not been able to reclaim
 				 * anything, it might because there are
 				 * no reclaimable pages under this hierarchy
 				 */
-				if (!check_soft || !total)
+				if (!total)
 					break;
 				/*
 				 * We want to do more targeted reclaim.
@@ -1685,30 +1709,12 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
 			}
 			continue;
 		}
-		if (!mem_cgroup_reclaimable(victim, noswap)) {
-			/* this cgroup's local usage == 0 */
+		if (!mem_cgroup_reclaimable(victim, false))
 			continue;
-		}
-		/* we use swappiness of local cgroup */
-		if (check_soft) {
-			ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
-				noswap, zone, &nr_scanned);
-			*total_scanned += nr_scanned;
-		} else
-			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
-						noswap);
-		/*
-		 * At shrinking usage, we can't check we should stop here or
-		 * reclaim more. It's depends on callers. last_scanned_child
-		 * will work enough for keeping fairness under tree.
-		 */
-		if (shrink)
-			break;
-		total += ret;
-		if (check_soft) {
-			if (!res_counter_soft_limit_excess(&root_memcg->res))
-				break;
-		} else if (mem_cgroup_margin(root_memcg))
+		total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
+						     zone, &nr_scanned);
+		*total_scanned += nr_scanned;
+		if (!res_counter_soft_limit_excess(&root_memcg->res))
 			break;
 	}
 	mem_cgroup_iter_break(root_memcg, victim);
@@ -2205,8 +2211,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (!(gfp_mask & __GFP_WAIT))
 		return CHARGE_WOULDBLOCK;
 
-	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
-					      gfp_mask, flags, NULL);
+	ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
 		return CHARGE_RETRY;
 	/*
@@ -3437,9 +3442,8 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 		if (!ret)
 			break;
 
-		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
-						MEM_CGROUP_RECLAIM_SHRINK,
-						NULL);
+		mem_cgroup_reclaim(memcg, GFP_KERNEL,
+				   MEM_CGROUP_RECLAIM_SHRINK);
 		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
 		/* Usage is reduced ? */
   		if (curusage >= oldusage)
@@ -3497,10 +3501,9 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 		if (!ret)
 			break;
 
-		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
-						MEM_CGROUP_RECLAIM_NOSWAP |
-						MEM_CGROUP_RECLAIM_SHRINK,
-						NULL);
+		mem_cgroup_reclaim(memcg, GFP_KERNEL,
+				   MEM_CGROUP_RECLAIM_NOSWAP |
+				   MEM_CGROUP_RECLAIM_SHRINK);
 		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
@@ -3543,10 +3546,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 			break;
 
 		nr_scanned = 0;
-		reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
-						gfp_mask,
-						MEM_CGROUP_RECLAIM_SOFT,
-						&nr_scanned);
+		reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone,
+						    gfp_mask, &nr_scanned);
 		nr_reclaimed += reclaimed;
 		*total_scanned += nr_scanned;
 		spin_lock(&mctz->lock);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 92f4e22..8419e8f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2104,12 +2104,43 @@ restart:
 static void shrink_zone(int priority, struct zone *zone,
 			struct scan_control *sc)
 {
-	struct mem_cgroup_zone mz = {
-		.mem_cgroup = sc->target_mem_cgroup,
+	struct mem_cgroup *root = sc->target_mem_cgroup;
+	struct mem_cgroup_iter iter = {
 		.zone = zone,
+		.priority = priority,
 	};
+	struct mem_cgroup *mem;
+
+	if (global_reclaim(sc)) {
+		struct mem_cgroup_zone mz = {
+			.mem_cgroup = NULL,
+			.zone = zone,
+		};
+
+		shrink_mem_cgroup_zone(priority, &mz, sc);
+		return;
+	}
+
+	mem = mem_cgroup_iter(root, NULL, &iter);
+	do {
+		struct mem_cgroup_zone mz = {
+			.mem_cgroup = mem,
+			.zone = zone,
+		};
 
-	shrink_mem_cgroup_zone(priority, &mz, sc);
+		shrink_mem_cgroup_zone(priority, &mz, sc);
+		/*
+		 * Limit reclaim has historically picked one memcg and
+		 * scanned it with decreasing priority levels until
+		 * nr_to_reclaim had been reclaimed.  This priority
+		 * cycle is thus over after a single memcg.
+		 */
+		if (!global_reclaim(sc)) {
+			mem_cgroup_iter_break(root, mem);
+			break;
+		}
+		mem = mem_cgroup_iter(root, mem, &iter);
+	} while (mem);
 }
 
 /*
@@ -2347,6 +2378,10 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 		.order = 0,
 		.target_mem_cgroup = mem,
 	};
+	struct mem_cgroup_zone mz = {
+		.mem_cgroup = mem,
+		.zone = zone,
+	};
 
 	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
@@ -2362,7 +2397,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 	 * will pick up pages from other mem cgroup's as well. We hack
 	 * the priority and make it zero.
 	 */
-	shrink_zone(0, zone, &sc);
+	shrink_mem_cgroup_zone(0, &mz, &sc);
 
 	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
 
-- 
1.7.6



* [patch 05/11] mm: move memcg hierarchy reclaim to generic reclaim code
@ 2011-09-12 10:57   ` Johannes Weiner
  0 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-12 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

Memory cgroup limit reclaim and traditional global pressure reclaim
will soon share the same code to reclaim from a hierarchical tree of
memory cgroups.

In preparation for this, move the two right next to each other in
shrink_zone().

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
---
 include/linux/memcontrol.h |   25 ++++++-
 mm/memcontrol.c            |  167 ++++++++++++++++++++++----------------------
 mm/vmscan.c                |   43 ++++++++++-
 3 files changed, 147 insertions(+), 88 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b87068a..6575931 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -40,6 +40,12 @@ extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct mem_cgroup *mem_cont,
 					int active, int file);
 
+struct mem_cgroup_iter {
+	struct zone *zone;
+	int priority;
+	unsigned int generation;
+};
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
@@ -103,6 +109,11 @@ mem_cgroup_prepare_migration(struct page *page,
 extern void mem_cgroup_end_migration(struct mem_cgroup *memcg,
 	struct page *oldpage, struct page *newpage, bool migration_ok);
 
+struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
+				   struct mem_cgroup *,
+				   struct mem_cgroup_iter *);
+void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
+
 /*
  * For memory reclaim.
  */
@@ -276,7 +287,19 @@ static inline void mem_cgroup_end_migration(struct mem_cgroup *memcg,
 {
 }
 
-static inline int mem_cgroup_get_reclaim_priority(struct mem_cgroup *memcg)
+static inline struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
+						 struct mem_cgroup *prev,
+						 struct mem_cgroup_iter *iter)
+{
+	return NULL;
+}
+
+static inline void mem_cgroup_iter_break(struct mem_cgroup *root,
+					 struct mem_cgroup *prev)
+{
+}
+
+static inline int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem)
 {
 	return 0;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f4b404e..413e1f8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -362,8 +362,6 @@ enum charge_type {
 #define MEM_CGROUP_RECLAIM_NOSWAP	(1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
 #define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
 #define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
-#define MEM_CGROUP_RECLAIM_SOFT_BIT	0x2
-#define MEM_CGROUP_RECLAIM_SOFT		(1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
 
 static void mem_cgroup_get(struct mem_cgroup *memcg);
 static void mem_cgroup_put(struct mem_cgroup *memcg);
@@ -783,19 +781,33 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
 	return memcg;
 }
 
-struct mem_cgroup_iter {
-	struct zone *zone;
-	int priority;
-	unsigned int generation;
-};
-
-static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
-					  struct mem_cgroup *prev,
-					  struct mem_cgroup_iter *iter)
+/**
+ * mem_cgroup_iter - iterate over memory cgroup hierarchy
+ * @root: hierarchy root
+ * @prev: previously returned memcg, NULL on first invocation
+ * @iter: token for partial walks, NULL for full walks
+ *
+ * Returns references to children of the hierarchy starting at @root,
+ * or @root itself, or %NULL after a full round-trip.
+ *
+ * Caller must pass the return value in @prev on subsequent
+ * invocations for reference counting, or use mem_cgroup_iter_break()
+ * to cancel a hierarchy walk before the round-trip is complete.
+ *
+ * Reclaimers can specify a zone and a priority level in @iter to
+ * divide up the memcgs in the hierarchy among all concurrent
+ * reclaimers operating on the same zone and priority.
+ */
+struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
+				   struct mem_cgroup *prev,
+				   struct mem_cgroup_iter *iter)
 {
 	struct mem_cgroup *mem = NULL;
 	int id = 0;
 
+	if (mem_cgroup_disabled())
+		return NULL;
+
 	if (!root)
 		root = root_mem_cgroup;
 
@@ -850,8 +862,13 @@ static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	return mem;
 }
 
-static void mem_cgroup_iter_break(struct mem_cgroup *root,
-				  struct mem_cgroup *prev)
+/**
+ * mem_cgroup_iter_break - abort a hierarchy walk prematurely
+ * @root: hierarchy root
+ * @prev: last visited hierarchy member as returned by mem_cgroup_iter()
+ */
+void mem_cgroup_iter_break(struct mem_cgroup *root,
+			   struct mem_cgroup *prev)
 {
 	if (!root)
 		root = root_mem_cgroup;
@@ -1479,6 +1496,41 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
 	return min(limit, memsw);
 }
 
+static unsigned long mem_cgroup_reclaim(struct mem_cgroup *mem,
+					gfp_t gfp_mask,
+					unsigned long flags)
+{
+	unsigned long total = 0;
+	bool noswap = false;
+	int loop;
+
+	if (flags & MEM_CGROUP_RECLAIM_NOSWAP)
+		noswap = true;
+	else if (!(flags & MEM_CGROUP_RECLAIM_SHRINK) && mem->memsw_is_minimum)
+		noswap = true;
+
+	for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
+		if (loop)
+			drain_all_stock_async(mem);
+		total += try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap);
+		/*
+		 * Avoid freeing too much when shrinking to resize the
+		 * limit.  XXX: Shouldn't the margin check be enough?
+		 */
+		if (total && (flags & MEM_CGROUP_RECLAIM_SHRINK))
+			break;
+		if (mem_cgroup_margin(mem))
+			break;
+		/*
+		 * If nothing was reclaimed after two attempts, there
+		 * may be no reclaimable pages in this hierarchy.
+		 */
+		if (loop && !total)
+			break;
+	}
+	return total;
+}
+
 /**
  * test_mem_cgroup_node_reclaimable
  * @mem: the target memcg
@@ -1616,30 +1668,14 @@ bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
 }
 #endif
 
-/*
- * Scan the hierarchy if needed to reclaim memory. We remember the last child
- * we reclaimed from, so that we don't end up penalizing one child extensively
- * based on its position in the children list.
- *
- * root_memcg is the original ancestor that we've been reclaim from.
- *
- * We give up and return to the caller when we visit root_memcg twice.
- * (other groups can be removed while we're walking....)
- *
- * If shrink==true, for avoiding to free too much, this returns immedieately.
- */
-static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
-						struct zone *zone,
-						gfp_t gfp_mask,
-						unsigned long reclaim_options,
-						unsigned long *total_scanned)
+static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
+				   struct zone *zone,
+				   gfp_t gfp_mask,
+				   unsigned long *total_scanned)
 {
 	struct mem_cgroup *victim = NULL;
-	int ret, total = 0;
+	int total = 0;
 	int loop = 0;
-	bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
-	bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
-	bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
 	unsigned long excess;
 	unsigned long nr_scanned;
 	struct mem_cgroup_iter iter = {
@@ -1649,29 +1685,17 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
 
 	excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
 
-	/* If memsw_is_minimum==1, swap-out is of-no-use. */
-	if (!check_soft && !shrink && root_memcg->memsw_is_minimum)
-		noswap = true;
-
 	while (1) {
 		victim = mem_cgroup_iter(root_memcg, victim, &iter);
 		if (!victim) {
 			loop++;
-			/*
-			 * We are not draining per cpu cached charges during
-			 * soft limit reclaim  because global reclaim doesn't
-			 * care about charges. It tries to free some memory and
-			 * charges will not give any.
-			 */
-			if (!check_soft && loop >= 1)
-				drain_all_stock_async(root_memcg);
 			if (loop >= 2) {
 				/*
 				 * If we have not been able to reclaim
 				 * anything, it might because there are
 				 * no reclaimable pages under this hierarchy
 				 */
-				if (!check_soft || !total)
+				if (!total)
 					break;
 				/*
 				 * We want to do more targeted reclaim.
@@ -1685,30 +1709,12 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
 			}
 			continue;
 		}
-		if (!mem_cgroup_reclaimable(victim, noswap)) {
-			/* this cgroup's local usage == 0 */
+		if (!mem_cgroup_reclaimable(victim, false))
 			continue;
-		}
-		/* we use swappiness of local cgroup */
-		if (check_soft) {
-			ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
-				noswap, zone, &nr_scanned);
-			*total_scanned += nr_scanned;
-		} else
-			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
-						noswap);
-		/*
-		 * At shrinking usage, we can't check we should stop here or
-		 * reclaim more. It's depends on callers. last_scanned_child
-		 * will work enough for keeping fairness under tree.
-		 */
-		if (shrink)
-			break;
-		total += ret;
-		if (check_soft) {
-			if (!res_counter_soft_limit_excess(&root_memcg->res))
-				break;
-		} else if (mem_cgroup_margin(root_memcg))
+		total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
+						     zone, &nr_scanned);
+		*total_scanned += nr_scanned;
+		if (!res_counter_soft_limit_excess(&root_memcg->res))
 			break;
 	}
 	mem_cgroup_iter_break(root_memcg, victim);
@@ -2205,8 +2211,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (!(gfp_mask & __GFP_WAIT))
 		return CHARGE_WOULDBLOCK;
 
-	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
-					      gfp_mask, flags, NULL);
+	ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
 		return CHARGE_RETRY;
 	/*
@@ -3437,9 +3442,8 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 		if (!ret)
 			break;
 
-		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
-						MEM_CGROUP_RECLAIM_SHRINK,
-						NULL);
+		mem_cgroup_reclaim(memcg, GFP_KERNEL,
+				   MEM_CGROUP_RECLAIM_SHRINK);
 		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
 		/* Usage is reduced ? */
   		if (curusage >= oldusage)
@@ -3497,10 +3501,9 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 		if (!ret)
 			break;
 
-		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
-						MEM_CGROUP_RECLAIM_NOSWAP |
-						MEM_CGROUP_RECLAIM_SHRINK,
-						NULL);
+		mem_cgroup_reclaim(memcg, GFP_KERNEL,
+				   MEM_CGROUP_RECLAIM_NOSWAP |
+				   MEM_CGROUP_RECLAIM_SHRINK);
 		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
@@ -3543,10 +3546,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 			break;
 
 		nr_scanned = 0;
-		reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
-						gfp_mask,
-						MEM_CGROUP_RECLAIM_SOFT,
-						&nr_scanned);
+		reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone,
+						    gfp_mask, &nr_scanned);
 		nr_reclaimed += reclaimed;
 		*total_scanned += nr_scanned;
 		spin_lock(&mctz->lock);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 92f4e22..8419e8f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2104,12 +2104,43 @@ restart:
 static void shrink_zone(int priority, struct zone *zone,
 			struct scan_control *sc)
 {
-	struct mem_cgroup_zone mz = {
-		.mem_cgroup = sc->target_mem_cgroup,
+	struct mem_cgroup *root = sc->target_mem_cgroup;
+	struct mem_cgroup_iter iter = {
 		.zone = zone,
+		.priority = priority,
 	};
+	struct mem_cgroup *mem;
+
+	if (global_reclaim(sc)) {
+		struct mem_cgroup_zone mz = {
+			.mem_cgroup = NULL,
+			.zone = zone,
+		};
+
+		shrink_mem_cgroup_zone(priority, &mz, sc);
+		return;
+	}
+
+	mem = mem_cgroup_iter(root, NULL, &iter);
+	do {
+		struct mem_cgroup_zone mz = {
+			.mem_cgroup = mem,
+			.zone = zone,
+		};
 
-	shrink_mem_cgroup_zone(priority, &mz, sc);
+		shrink_mem_cgroup_zone(priority, &mz, sc);
+		/*
+		 * Limit reclaim has historically picked one memcg and
+		 * scanned it with decreasing priority levels until
+		 * nr_to_reclaim had been reclaimed.  This priority
+		 * cycle is thus over after a single memcg.
+		 */
+		if (!global_reclaim(sc)) {
+			mem_cgroup_iter_break(root, mem);
+			break;
+		}
+		mem = mem_cgroup_iter(root, mem, &iter);
+	} while (mem);
 }
 
 /*
@@ -2347,6 +2378,10 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 		.order = 0,
 		.target_mem_cgroup = mem,
 	};
+	struct mem_cgroup_zone mz = {
+		.mem_cgroup = mem,
+		.zone = zone,
+	};
 
 	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
@@ -2362,7 +2397,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 	 * will pick up pages from other mem cgroup's as well. We hack
 	 * the priority and make it zero.
 	 */
-	shrink_zone(0, zone, &sc);
+	shrink_mem_cgroup_zone(0, &mz, &sc);
 
 	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
 
-- 
1.7.6
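
For reference, the iteration protocol described in the mem_cgroup_iter()
kerneldoc above boils down to the pattern below.  This is an illustrative
sketch against the interface added by the patch, not code from the
series; walk_hierarchy() and should_abort_walk() are made-up stand-ins
for an arbitrary reclaim-style caller.

static void walk_hierarchy(struct mem_cgroup *root,
			   struct zone *zone, int priority)
{
	/* token shared by concurrent walkers of this zone/priority */
	struct mem_cgroup_iter iter = {
		.zone = zone,
		.priority = priority,
	};
	struct mem_cgroup *mem;

	mem = mem_cgroup_iter(root, NULL, &iter);
	while (mem) {
		/* ... operate on (mem, zone) here ... */
		if (should_abort_walk()) {	/* hypothetical condition */
			/* drops the reference held on @mem */
			mem_cgroup_iter_break(root, mem);
			return;
		}
		/* returns the next child and drops @mem's reference */
		mem = mem_cgroup_iter(root, mem, &iter);
	}
}

Passing a NULL @iter instead performs a plain full walk, and a NULL
@root starts the walk at root_mem_cgroup, as the hunks above show.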


* [patch 06/11] mm: memcg: remove optimization of keeping the root_mem_cgroup LRU lists empty
  2011-09-12 10:57 ` Johannes Weiner
@ 2011-09-12 10:57   ` Johannes Weiner
  -1 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-12 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

root_mem_cgroup, lacking a configurable limit, was never subject to
limit reclaim, so the pages charged to it could be kept off its LRU
lists.  They would be found on the global per-zone LRU lists upon
physical memory pressure and it made sense to avoid uselessly linking
them to both lists.

The global per-zone LRU lists are about to go away on memcg-enabled
kernels, with all pages being exclusively linked to their respective
per-memcg LRU lists.  As a result, pages of the root_mem_cgroup must
also be linked to its LRU lists again.

The overhead is temporary until the double-LRU scheme goes away
completely.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
---
 mm/memcontrol.c |   12 ++----------
 1 files changed, 2 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 413e1f8..518f640 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -956,8 +956,6 @@ void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
 	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
 	/* huge page split is done under lru_lock. so, we have no races. */
 	MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
-	if (mem_cgroup_is_root(pc->mem_cgroup))
-		return;
 	VM_BUG_ON(list_empty(&pc->lru));
 	list_del_init(&pc->lru);
 }
@@ -982,13 +980,11 @@ void mem_cgroup_rotate_reclaimable_page(struct page *page)
 		return;
 
 	pc = lookup_page_cgroup(page);
-	/* unused or root page is not rotated. */
+	/* unused page is not rotated. */
 	if (!PageCgroupUsed(pc))
 		return;
 	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
 	smp_rmb();
-	if (mem_cgroup_is_root(pc->mem_cgroup))
-		return;
 	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
 	list_move_tail(&pc->lru, &mz->lists[lru]);
 }
@@ -1002,13 +998,11 @@ void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
 		return;
 
 	pc = lookup_page_cgroup(page);
-	/* unused or root page is not rotated. */
+	/* unused page is not rotated. */
 	if (!PageCgroupUsed(pc))
 		return;
 	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
 	smp_rmb();
-	if (mem_cgroup_is_root(pc->mem_cgroup))
-		return;
 	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
 	list_move(&pc->lru, &mz->lists[lru]);
 }
@@ -1040,8 +1034,6 @@ void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
 	/* huge page split is done under lru_lock. so, we have no races. */
 	MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
 	SetPageCgroupAcctLRU(pc);
-	if (mem_cgroup_is_root(pc->mem_cgroup))
-		return;
 	list_add(&pc->lru, &mz->lists[lru]);
 }
 
-- 
1.7.6



* [patch 07/11] mm: vmscan: convert unevictable page rescue scanner to per-memcg LRU lists
  2011-09-12 10:57 ` Johannes Weiner
@ 2011-09-12 10:57   ` Johannes Weiner
  -1 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-12 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

The global per-zone LRU lists are about to go away on memcg-enabled
kernels, so the unevictable page rescue scanner must be able to find
its pages on the per-memcg LRU lists.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
---
 include/linux/memcontrol.h |    3 ++
 mm/memcontrol.c            |   11 ++++++++
 mm/vmscan.c                |   61 ++++++++++++++++++++++++++++---------------
 3 files changed, 54 insertions(+), 21 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6575931..7795b72 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -40,6 +40,9 @@ extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct mem_cgroup *mem_cont,
 					int active, int file);
 
+struct page *mem_cgroup_lru_to_page(struct zone *, struct mem_cgroup *,
+				    enum lru_list);
+
 struct mem_cgroup_iter {
 	struct zone *zone;
 	int priority;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 518f640..27d78dc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -937,6 +937,17 @@ EXPORT_SYMBOL(mem_cgroup_count_vm_event);
  * When moving account, the page is not on LRU. It's isolated.
  */
 
+struct page *mem_cgroup_lru_to_page(struct zone *zone, struct mem_cgroup *mem,
+				    enum lru_list lru)
+{
+	struct mem_cgroup_per_zone *mz;
+	struct page_cgroup *pc;
+
+	mz = mem_cgroup_zoneinfo(mem, zone_to_nid(zone), zone_idx(zone));
+	pc = list_entry(mz->lists[lru].prev, struct page_cgroup, lru);
+	return lookup_cgroup_page(pc);
+}
+
 void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
 {
 	struct page_cgroup *pc;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8419e8f..bb4d8b8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3477,6 +3477,17 @@ void scan_mapping_unevictable_pages(struct address_space *mapping)
 
 }
 
+/*
+ * XXX: Temporary helper to get to the last page of a mem_cgroup_zone
+ * lru list.  This will be reasonably unified in a second.
+ */
+static struct page *lru_tailpage(struct mem_cgroup_zone *mz, enum lru_list lru)
+{
+	if (!scanning_global_lru(mz))
+		return mem_cgroup_lru_to_page(mz->zone, mz->mem_cgroup, lru);
+	return lru_to_page(&mz->zone->lru[lru].list);
+}
+
 /**
  * scan_zone_unevictable_pages - check unevictable list for evictable pages
  * @zone - zone of which to scan the unevictable list
@@ -3490,32 +3501,40 @@ void scan_mapping_unevictable_pages(struct address_space *mapping)
 #define SCAN_UNEVICTABLE_BATCH_SIZE 16UL /* arbitrary lock hold batch size */
 static void scan_zone_unevictable_pages(struct zone *zone)
 {
-	struct list_head *l_unevictable = &zone->lru[LRU_UNEVICTABLE].list;
-	unsigned long scan;
-	unsigned long nr_to_scan = zone_page_state(zone, NR_UNEVICTABLE);
-
-	while (nr_to_scan > 0) {
-		unsigned long batch_size = min(nr_to_scan,
-						SCAN_UNEVICTABLE_BATCH_SIZE);
-
-		spin_lock_irq(&zone->lru_lock);
-		for (scan = 0;  scan < batch_size; scan++) {
-			struct page *page = lru_to_page(l_unevictable);
+	struct mem_cgroup *mem;
 
-			if (!trylock_page(page))
-				continue;
+	mem = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		struct mem_cgroup_zone mz = {
+			.mem_cgroup = mem,
+			.zone = zone,
+		};
+		unsigned long nr_to_scan;
 
-			prefetchw_prev_lru_page(page, l_unevictable, flags);
+		nr_to_scan = zone_nr_lru_pages(&mz, LRU_UNEVICTABLE);
+		while (nr_to_scan > 0) {
+			unsigned long batch_size;
+			unsigned long scan;
 
-			if (likely(PageLRU(page) && PageUnevictable(page)))
-				check_move_unevictable_page(page, zone);
+			batch_size = min(nr_to_scan,
+					 SCAN_UNEVICTABLE_BATCH_SIZE);
+			spin_lock_irq(&zone->lru_lock);
+			for (scan = 0; scan < batch_size; scan++) {
+				struct page *page;
 
-			unlock_page(page);
+				page = lru_tailpage(&mz, LRU_UNEVICTABLE);
+				if (!trylock_page(page))
+					continue;
+				if (likely(PageLRU(page) &&
+					   PageUnevictable(page)))
+					check_move_unevictable_page(page, zone);
+				unlock_page(page);
+			}
+			spin_unlock_irq(&zone->lru_lock);
+			nr_to_scan -= batch_size;
 		}
-		spin_unlock_irq(&zone->lru_lock);
-
-		nr_to_scan -= batch_size;
-	}
+		mem = mem_cgroup_iter(NULL, mem, NULL);
+	} while (mem);
 }
 
 
-- 
1.7.6



* [patch 08/11] mm: vmscan: convert global reclaim to per-memcg LRU lists
  2011-09-12 10:57 ` Johannes Weiner
@ 2011-09-12 10:57   ` Johannes Weiner
  -1 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-12 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

The global per-zone LRU lists are about to go away on memcg-enabled
kernels, so global reclaim must be able to find its pages on the
per-memcg LRU lists.

Since the LRU pages of a zone are distributed over all existing memory
cgroups, a scan target for a zone is complete when all memory cgroups
are scanned for their proportional share of a zone's memory.

The forced scanning of small scan targets from kswapd is limited to
zones marked unreclaimable; otherwise, kswapd can quickly overreclaim
by force-scanning the LRU lists of multiple memory cgroups.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
---
 mm/vmscan.c |   39 ++++++++++++++++++++++-----------------
 1 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bb4d8b8..053609e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1887,7 +1887,7 @@ static void get_scan_count(struct mem_cgroup_zone *mz, struct scan_control *sc,
 	 * latencies, so it's better to scan a minimum amount there as
 	 * well.
 	 */
-	if (current_is_kswapd())
+	if (current_is_kswapd() && mz->zone->all_unreclaimable)
 		force_scan = true;
 	if (!global_reclaim(sc))
 		force_scan = true;
@@ -2111,16 +2111,6 @@ static void shrink_zone(int priority, struct zone *zone,
 	};
 	struct mem_cgroup *mem;
 
-	if (global_reclaim(sc)) {
-		struct mem_cgroup_zone mz = {
-			.mem_cgroup = NULL,
-			.zone = zone,
-		};
-
-		shrink_mem_cgroup_zone(priority, &mz, sc);
-		return;
-	}
-
 	mem = mem_cgroup_iter(root, NULL, &iter);
 	do {
 		struct mem_cgroup_zone mz = {
@@ -2134,6 +2124,10 @@ static void shrink_zone(int priority, struct zone *zone,
 		 * scanned it with decreasing priority levels until
 		 * nr_to_reclaim had been reclaimed.  This priority
 		 * cycle is thus over after a single memcg.
+		 *
+		 * Direct reclaim and kswapd, on the other hand, have
+		 * to scan all memory cgroups to fulfill the overall
+		 * scan target for the zone.
 		 */
 		if (!global_reclaim(sc)) {
 			mem_cgroup_iter_break(root, mem);
@@ -2451,13 +2445,24 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 static void age_active_anon(struct zone *zone, struct scan_control *sc,
 			    int priority)
 {
-	struct mem_cgroup_zone mz = {
-		.mem_cgroup = NULL,
-		.zone = zone,
-	};
+	struct mem_cgroup *mem;
+
+	if (!total_swap_pages)
+		return;
+
+	mem = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		struct mem_cgroup_zone mz = {
+			.mem_cgroup = mem,
+			.zone = zone,
+		};
 
-	if (inactive_anon_is_low(&mz))
-		shrink_active_list(SWAP_CLUSTER_MAX, &mz, sc, priority, 0);
+		if (inactive_anon_is_low(&mz))
+			shrink_active_list(SWAP_CLUSTER_MAX, &mz,
+					   sc, priority, 0);
+
+		mem = mem_cgroup_iter(NULL, mem, NULL);
+	} while (mem);
 }
 
 /*
-- 
1.7.6
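
To illustrate the proportional-share point from the changelog above:
with exclusive per-memcg LRU lists, the zone-wide scan target at a
given priority is simply the sum of per-memcg slices.  A rough sketch,
assuming the usual >> priority scaling of per-list scan counts;
memcg_zone_scan_slice() is a made-up name, not a function from the
series:

static unsigned long memcg_zone_scan_slice(struct mem_cgroup_zone *mz,
					   enum lru_list lru, int priority)
{
	/*
	 * Illustrative only: each memcg contributes in proportion to
	 * the LRU pages it holds in this zone.
	 */
	return zone_nr_lru_pages(mz, lru) >> priority;
}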



* [patch 09/11] mm: collect LRU list heads into struct lruvec
  2011-09-12 10:57 ` Johannes Weiner
@ 2011-09-12 10:57   ` Johannes Weiner
  -1 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-12 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

Having a unified structure with an LRU list set for both global zones
and per-memcg zones allows the code that deals with LRU lists, but
does not care about the container itself, to be kept simple.

Once the per-memcg LRU lists directly link struct pages, the isolation
function and all other list manipulations are shared between the memcg
case and the global LRU case.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
---
 include/linux/mm_inline.h |    2 +-
 include/linux/mmzone.h    |   10 ++++++----
 mm/memcontrol.c           |   19 ++++++++-----------
 mm/page_alloc.c           |    2 +-
 mm/swap.c                 |   11 +++++------
 mm/vmscan.c               |   12 ++++++------
 6 files changed, 27 insertions(+), 29 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 8f7d247..e6a7ffe 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -33,7 +33,7 @@ __add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
 static inline void
 add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
 {
-	__add_page_to_lru_list(zone, page, l, &zone->lru[l].list);
+	__add_page_to_lru_list(zone, page, l, &zone->lruvec.lists[l]);
 }
 
 static inline void
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1ed4116..37970b9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -159,6 +159,10 @@ static inline int is_unevictable_lru(enum lru_list l)
 	return (l == LRU_UNEVICTABLE);
 }
 
+struct lruvec {
+	struct list_head lists[NR_LRU_LISTS];
+};
+
 /* Mask used at gathering information at once (see memcontrol.c) */
 #define LRU_ALL_FILE (BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE))
 #define LRU_ALL_ANON (BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON))
@@ -358,10 +362,8 @@ struct zone {
 	ZONE_PADDING(_pad1_)
 
 	/* Fields commonly accessed by the page reclaim scanner */
-	spinlock_t		lru_lock;	
-	struct zone_lru {
-		struct list_head list;
-	} lru[NR_LRU_LISTS];
+	spinlock_t		lru_lock;
+	struct lruvec		lruvec;
 
 	struct zone_reclaim_stat reclaim_stat;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 27d78dc..465001c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -130,10 +130,7 @@ struct mem_cgroup_iter_state {
  * per-zone information in memory controller.
  */
 struct mem_cgroup_per_zone {
-	/*
-	 * spin_lock to protect the per cgroup LRU
-	 */
-	struct list_head	lists[NR_LRU_LISTS];
+	struct lruvec		lruvec;
 	unsigned long		count[NR_LRU_LISTS];
 
 	struct mem_cgroup_iter_state iter_state[DEF_PRIORITY + 1];
@@ -944,7 +941,7 @@ struct page *mem_cgroup_lru_to_page(struct zone *zone, struct mem_cgroup *mem,
 	struct page_cgroup *pc;
 
 	mz = mem_cgroup_zoneinfo(mem, zone_to_nid(zone), zone_idx(zone));
-	pc = list_entry(mz->lists[lru].prev, struct page_cgroup, lru);
+	pc = list_entry(mz->lruvec.lists[lru].prev, struct page_cgroup, lru);
 	return lookup_cgroup_page(pc);
 }
 
@@ -997,7 +994,7 @@ void mem_cgroup_rotate_reclaimable_page(struct page *page)
 	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
 	smp_rmb();
 	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
-	list_move_tail(&pc->lru, &mz->lists[lru]);
+	list_move_tail(&pc->lru, &mz->lruvec.lists[lru]);
 }
 
 void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
@@ -1015,7 +1012,7 @@ void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
 	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
 	smp_rmb();
 	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
-	list_move(&pc->lru, &mz->lists[lru]);
+	list_move(&pc->lru, &mz->lruvec.lists[lru]);
 }
 
 void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
@@ -1045,7 +1042,7 @@ void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
 	/* huge page split is done under lru_lock. so, we have no races. */
 	MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
 	SetPageCgroupAcctLRU(pc);
-	list_add(&pc->lru, &mz->lists[lru]);
+	list_add(&pc->lru, &mz->lruvec.lists[lru]);
 }
 
 /*
@@ -1243,7 +1240,7 @@ unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 
 	BUG_ON(!mem_cont);
 	mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
-	src = &mz->lists[lru];
+	src = &mz->lruvec.lists[lru];
 
 	scan = 0;
 	list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
@@ -3627,7 +3624,7 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
 
 	zone = &NODE_DATA(node)->node_zones[zid];
 	mz = mem_cgroup_zoneinfo(memcg, node, zid);
-	list = &mz->lists[lru];
+	list = &mz->lruvec.lists[lru];
 
 	loop = MEM_CGROUP_ZSTAT(mz, lru);
 	/* give some margin against EBUSY etc...*/
@@ -4723,7 +4720,7 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
 		mz = &pn->zoneinfo[zone];
 		for_each_lru(l)
-			INIT_LIST_HEAD(&mz->lists[l]);
+			INIT_LIST_HEAD(&mz->lruvec.lists[l]);
 		mz->usage_in_excess = 0;
 		mz->on_tree = false;
 		mz->mem = memcg;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1dba05e..33b25b6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4335,7 +4335,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 
 		zone_pcp_init(zone);
 		for_each_lru(l)
-			INIT_LIST_HEAD(&zone->lru[l].list);
+			INIT_LIST_HEAD(&zone->lruvec.lists[l]);
 		zone->reclaim_stat.recent_rotated[0] = 0;
 		zone->reclaim_stat.recent_rotated[1] = 0;
 		zone->reclaim_stat.recent_scanned[0] = 0;
diff --git a/mm/swap.c b/mm/swap.c
index 3a442f1..66e8292 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -213,7 +213,7 @@ static void pagevec_move_tail_fn(struct page *page, void *arg)
 
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		enum lru_list lru = page_lru_base_type(page);
-		list_move_tail(&page->lru, &zone->lru[lru].list);
+		list_move_tail(&page->lru, &zone->lruvec.lists[lru]);
 		mem_cgroup_rotate_reclaimable_page(page);
 		(*pgmoved)++;
 	}
@@ -457,7 +457,7 @@ static void lru_deactivate_fn(struct page *page, void *arg)
 		 * The page's writeback ends up during pagevec
 		 * We moves tha page into tail of inactive.
 		 */
-		list_move_tail(&page->lru, &zone->lru[lru].list);
+		list_move_tail(&page->lru, &zone->lruvec.lists[lru]);
 		mem_cgroup_rotate_reclaimable_page(page);
 		__count_vm_event(PGROTATED);
 	}
@@ -639,7 +639,6 @@ void lru_add_page_tail(struct zone* zone,
 	int active;
 	enum lru_list lru;
 	const int file = 0;
-	struct list_head *head;
 
 	VM_BUG_ON(!PageHead(page));
 	VM_BUG_ON(PageCompound(page_tail));
@@ -659,10 +658,10 @@ void lru_add_page_tail(struct zone* zone,
 		}
 		update_page_reclaim_stat(zone, page_tail, file, active);
 		if (likely(PageLRU(page)))
-			head = page->lru.prev;
+			__add_page_to_lru_list(zone, page_tail, lru,
+					       page->lru.prev);
 		else
-			head = &zone->lru[lru].list;
-		__add_page_to_lru_list(zone, page_tail, lru, head);
+			add_page_to_lru_list(zone, page_tail, lru);
 	} else {
 		SetPageUnevictable(page_tail);
 		add_page_to_lru_list(zone, page_tail, LRU_UNEVICTABLE);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 053609e..df00195 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1267,8 +1267,8 @@ static unsigned long isolate_pages_global(unsigned long nr,
 		lru += LRU_ACTIVE;
 	if (file)
 		lru += LRU_FILE;
-	return isolate_lru_pages(nr, &z->lru[lru].list, dst, scanned, order,
-								mode, file);
+	return isolate_lru_pages(nr, &z->lruvec.lists[lru], dst,
+				 scanned, order, mode, file);
 }
 
 /*
@@ -1631,7 +1631,7 @@ static void move_active_pages_to_lru(struct zone *zone,
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 
-		list_move(&page->lru, &zone->lru[lru].list);
+		list_move(&page->lru, &zone->lruvec.lists[lru]);
 		mem_cgroup_add_lru_list(page, lru);
 		pgmoved += hpage_nr_pages(page);
 
@@ -3411,7 +3411,7 @@ retry:
 		enum lru_list l = page_lru_base_type(page);
 
 		__dec_zone_state(zone, NR_UNEVICTABLE);
-		list_move(&page->lru, &zone->lru[l].list);
+		list_move(&page->lru, &zone->lruvec.lists[l]);
 		mem_cgroup_move_lists(page, LRU_UNEVICTABLE, l);
 		__inc_zone_state(zone, NR_INACTIVE_ANON + l);
 		__count_vm_event(UNEVICTABLE_PGRESCUED);
@@ -3420,7 +3420,7 @@ retry:
 		 * rotate unevictable list
 		 */
 		SetPageUnevictable(page);
-		list_move(&page->lru, &zone->lru[LRU_UNEVICTABLE].list);
+		list_move(&page->lru, &zone->lruvec.lists[LRU_UNEVICTABLE]);
 		mem_cgroup_rotate_lru_list(page, LRU_UNEVICTABLE);
 		if (page_evictable(page, NULL))
 			goto retry;
@@ -3490,7 +3490,7 @@ static struct page *lru_tailpage(struct mem_cgroup_zone *mz, enum lru_list lru)
 {
 	if (!scanning_global_lru(mz))
 		return mem_cgroup_lru_to_page(mz->zone, mz->mem_cgroup, lru);
-	return lru_to_page(&mz->zone->lru[lru].list);
+	return lru_to_page(&mz->zone->lruvec.lists[lru]);
 }
 
 /**
-- 
1.7.6
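
The payoff of the unification described above comes once pages sit
directly on these lists (next patch): helpers can then be written
purely against a struct lruvec and never need to know whether it is
embedded in a zone or in a mem_cgroup_per_zone.  A minimal sketch;
lruvec_add_page_tail() is a made-up name, not part of the series:

static void lruvec_add_page_tail(struct lruvec *lruvec, struct page *page,
				 enum lru_list lru)
{
	/* works the same for &zone->lruvec and &mz->lruvec */
	list_add_tail(&page->lru, &lruvec->lists[lru]);
}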



* [patch 10/11] mm: make per-memcg LRU lists exclusive
  2011-09-12 10:57 ` Johannes Weiner
@ 2011-09-12 10:57   ` Johannes Weiner
  -1 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-12 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

Now that all code that operated on global per-zone LRU lists has been
converted to operate on per-memory cgroup LRU lists instead, there is
no reason to keep the double-LRU scheme around any longer.

The pc->lru member is removed and page->lru is linked directly to the
per-memory cgroup LRU lists, which removes two pointers from a
descriptor that exists for every page frame in the system.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ying Han <yinghan@google.com>
---
 include/linux/memcontrol.h  |   54 +++-----
 include/linux/mm_inline.h   |   21 +--
 include/linux/page_cgroup.h |    1 -
 mm/memcontrol.c             |  319 ++++++++++++++++++++-----------------------
 mm/page_cgroup.c            |    1 -
 mm/swap.c                   |   23 ++-
 mm/vmscan.c                 |   81 +++++-------
 7 files changed, 228 insertions(+), 272 deletions(-)
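
To put a number on the two pointers mentioned in the changelog above,
assuming 64-bit pointers and 4K pages (illustrative, not part of the
patch; lru_link_overhead_per_gb() is a made-up helper):

static unsigned long lru_link_overhead_per_gb(void)
{
	/*
	 * 262144 page frames per GB at 16 bytes of list_head each:
	 * roughly 4 MB of page_cgroup memory saved per GB of RAM.
	 */
	return ((1UL << 30) / PAGE_SIZE) * sizeof(struct list_head);
}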

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7795b72..a12f16f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -32,17 +32,6 @@ enum mem_cgroup_page_stat_item {
 	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
 };
 
-extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
-					struct list_head *dst,
-					unsigned long *scanned, int order,
-					isolate_mode_t mode,
-					struct zone *z,
-					struct mem_cgroup *mem_cont,
-					int active, int file);
-
-struct page *mem_cgroup_lru_to_page(struct zone *, struct mem_cgroup *,
-				    enum lru_list);
-
 struct mem_cgroup_iter {
 	struct zone *zone;
 	int priority;
@@ -72,13 +61,14 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
 
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask);
-extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
-extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
-extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
-extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
-extern void mem_cgroup_del_lru(struct page *page);
-extern void mem_cgroup_move_lists(struct page *page,
-				  enum lru_list from, enum lru_list to);
+
+struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
+struct lruvec *mem_cgroup_lru_add_list(struct zone *, struct page *,
+				       enum lru_list);
+void mem_cgroup_lru_del_list(struct page *, enum lru_list);
+void mem_cgroup_lru_del(struct page *);
+struct lruvec *mem_cgroup_lru_move_lists(struct zone *, struct page *,
+					 enum lru_list, enum lru_list);
 
 /* For coalescing uncharge for reducing memcg' overhead*/
 extern void mem_cgroup_uncharge_start(void);
@@ -221,33 +211,33 @@ static inline void mem_cgroup_uncharge_cache_page(struct page *page)
 {
 }
 
-static inline void mem_cgroup_add_lru_list(struct page *page, int lru)
+static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
+						    struct mem_cgroup *mem)
 {
+	return &zone->lruvec;
 }
 
-static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
+static inline struct lruvec *mem_cgroup_lru_add_list(struct zone *zone,
+						     struct page *page,
+						     enum lru_list lru)
 {
-	return ;
+	return &zone->lruvec;
 }
 
-static inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
+static inline void mem_cgroup_lru_del_list(struct page *page, enum lru_list lru)
 {
-	return ;
 }
 
-static inline void mem_cgroup_rotate_lru_list(struct page *page, int lru)
+static inline void mem_cgroup_lru_del(struct page *page)
 {
-	return ;
 }
 
-static inline void mem_cgroup_del_lru(struct page *page)
-{
-	return ;
-}
-
-static inline void
-mem_cgroup_move_lists(struct page *page, enum lru_list from, enum lru_list to)
+static inline struct lruvec *mem_cgroup_lru_move_lists(struct zone *zone,
+						       struct page *page,
+						       enum lru_list from,
+						       enum lru_list to)
 {
+	return &zone->lruvec;
 }
 
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index e6a7ffe..4e3478e 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -22,26 +22,21 @@ static inline int page_is_file_cache(struct page *page)
 }
 
 static inline void
-__add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
-		       struct list_head *head)
-{
-	list_add(&page->lru, head);
-	__mod_zone_page_state(zone, NR_LRU_BASE + l, hpage_nr_pages(page));
-	mem_cgroup_add_lru_list(page, l);
-}
-
-static inline void
 add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
 {
-	__add_page_to_lru_list(zone, page, l, &zone->lruvec.lists[l]);
+	struct lruvec *lruvec;
+
+	lruvec = mem_cgroup_lru_add_list(zone, page, l);
+	list_add(&page->lru, &lruvec->lists[l]);
+	__mod_zone_page_state(zone, NR_LRU_BASE + l, hpage_nr_pages(page));
 }
 
 static inline void
 del_page_from_lru_list(struct zone *zone, struct page *page, enum lru_list l)
 {
+	mem_cgroup_lru_del_list(page, l);
 	list_del(&page->lru);
 	__mod_zone_page_state(zone, NR_LRU_BASE + l, -hpage_nr_pages(page));
-	mem_cgroup_del_lru_list(page, l);
 }
 
 /**
@@ -64,7 +59,6 @@ del_page_from_lru(struct zone *zone, struct page *page)
 {
 	enum lru_list l;
 
-	list_del(&page->lru);
 	if (PageUnevictable(page)) {
 		__ClearPageUnevictable(page);
 		l = LRU_UNEVICTABLE;
@@ -75,8 +69,9 @@ del_page_from_lru(struct zone *zone, struct page *page)
 			l += LRU_ACTIVE;
 		}
 	}
+	mem_cgroup_lru_del_list(page, l);
+	list_del(&page->lru);
 	__mod_zone_page_state(zone, NR_LRU_BASE + l, -hpage_nr_pages(page));
-	mem_cgroup_del_lru_list(page, l);
 }
 
 /**
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 961ecc7..5bae753 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -31,7 +31,6 @@ enum {
 struct page_cgroup {
 	unsigned long flags;
 	struct mem_cgroup *mem_cgroup;
-	struct list_head lru;		/* per cgroup LRU list */
 };
 
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 465001c..a7d14a5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -920,6 +920,26 @@ out:
 }
 EXPORT_SYMBOL(mem_cgroup_count_vm_event);
 
+/**
+ * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
+ * @zone: zone of the wanted lruvec
+ * @mem: memcg of the wanted lruvec
+ *
+ * Returns the lru list vector holding pages for the given @zone and
+ * @mem.  This can be the global zone lruvec, if the memory controller
+ * is disabled.
+ */
+struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone, struct mem_cgroup *mem)
+{
+	struct mem_cgroup_per_zone *mz;
+
+	if (mem_cgroup_disabled())
+		return &zone->lruvec;
+
+	mz = mem_cgroup_zoneinfo(mem, zone_to_nid(zone), zone_idx(zone));
+	return &mz->lruvec;
+}
+
 /*
  * Following LRU functions are allowed to be used without PCG_LOCK.
  * Operations are called by routine of global LRU independently from memcg.
@@ -934,115 +954,123 @@ EXPORT_SYMBOL(mem_cgroup_count_vm_event);
  * When moving account, the page is not on LRU. It's isolated.
  */
 
-struct page *mem_cgroup_lru_to_page(struct zone *zone, struct mem_cgroup *mem,
-				    enum lru_list lru)
+/**
+ * mem_cgroup_lru_add_list - account for adding an lru page and return lruvec
+ * @zone: zone of the page
+ * @page: the page
+ * @lru: current lru
+ *
+ * This function accounts for @page being added to @lru, and returns
+ * the lruvec for the given @zone and the memcg @page is charged to.
+ *
+ * The callsite is then responsible for physically linking the page to
+ * the returned lruvec->lists[@lru].
+ */
+struct lruvec *mem_cgroup_lru_add_list(struct zone *zone, struct page *page,
+				       enum lru_list lru)
 {
 	struct mem_cgroup_per_zone *mz;
 	struct page_cgroup *pc;
-
-	mz = mem_cgroup_zoneinfo(mem, zone_to_nid(zone), zone_idx(zone));
-	pc = list_entry(mz->lruvec.lists[lru].prev, struct page_cgroup, lru);
-	return lookup_cgroup_page(pc);
-}
-
-void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
-{
-	struct page_cgroup *pc;
-	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup *mem;
 
 	if (mem_cgroup_disabled())
-		return;
+		return &zone->lruvec;
+
 	pc = lookup_page_cgroup(page);
-	/* can happen while we handle swapcache. */
-	if (!TestClearPageCgroupAcctLRU(pc))
-		return;
-	VM_BUG_ON(!pc->mem_cgroup);
+	VM_BUG_ON(PageCgroupAcctLRU(pc));
 	/*
-	 * We don't check PCG_USED bit. It's cleared when the "page" is finally
-	 * removed from global LRU.
+	 * putback:				charge:
+	 * SetPageLRU				SetPageCgroupUsed
+	 * smp_mb				smp_mb
+	 * PageCgroupUsed && add to memcg LRU	PageLRU && add to memcg LRU
+	 *
+	 * Ensure that one of the two sides adds the page to the memcg
+	 * LRU during a race.
 	 */
-	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
-	/* huge page split is done under lru_lock. so, we have no races. */
-	MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
-	VM_BUG_ON(list_empty(&pc->lru));
-	list_del_init(&pc->lru);
-}
-
-void mem_cgroup_del_lru(struct page *page)
-{
-	mem_cgroup_del_lru_list(page, page_lru(page));
+	smp_mb();
+	/*
+	 * If the page is uncharged, it may be freed soon, but it
+	 * could also be swap cache (readahead, swapoff) that needs to
+	 * be reclaimable in the future.  root_mem_cgroup will babysit
+	 * it for the time being.
+	 */
+	if (PageCgroupUsed(pc)) {
+		/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
+		smp_rmb();
+		mem = pc->mem_cgroup;
+		SetPageCgroupAcctLRU(pc);
+	} else
+		mem = root_mem_cgroup;
+	mz = page_cgroup_zoneinfo(mem, page);
+	/* compound_order() is stabilized through lru_lock */
+	MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
+	return &mz->lruvec;
 }
 
-/*
- * Writeback is about to end against a page which has been marked for immediate
- * reclaim.  If it still appears to be reclaimable, move it to the tail of the
- * inactive list.
+/**
+ * mem_cgroup_lru_del_list - account for removing an lru page
+ * @page: the page
+ * @lru: target lru
+ *
+ * This function accounts for @page being removed from @lru.
+ *
+ * The callsite is then responsible for physically unlinking
+ * @page->lru.
  */
-void mem_cgroup_rotate_reclaimable_page(struct page *page)
+void mem_cgroup_lru_del_list(struct page *page, enum lru_list lru)
 {
 	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
-	enum lru_list lru = page_lru(page);
 
 	if (mem_cgroup_disabled())
 		return;
 
 	pc = lookup_page_cgroup(page);
-	/* unused page is not rotated. */
-	if (!PageCgroupUsed(pc))
-		return;
-	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
-	smp_rmb();
-	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
-	list_move_tail(&pc->lru, &mz->lruvec.lists[lru]);
+	/*
+	 * root_mem_cgroup babysits uncharged LRU pages, but
+	 * PageCgroupUsed is cleared when the page is about to get
+	 * freed.  PageCgroupAcctLRU remembers whether the
+	 * LRU-accounting happened against pc->mem_cgroup or
+	 * root_mem_cgroup.
+	 */
+	if (TestClearPageCgroupAcctLRU(pc)) {
+		VM_BUG_ON(!pc->mem_cgroup);
+		mem = pc->mem_cgroup;
+	} else
+		mem = root_mem_cgroup;
+	mz = page_cgroup_zoneinfo(mem, page);
+	/* huge page split is done under lru_lock. so, we have no races. */
+	MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
 }
 
-void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
+void mem_cgroup_lru_del(struct page *page)
 {
-	struct mem_cgroup_per_zone *mz;
-	struct page_cgroup *pc;
-
-	if (mem_cgroup_disabled())
-		return;
-
-	pc = lookup_page_cgroup(page);
-	/* unused page is not rotated. */
-	if (!PageCgroupUsed(pc))
-		return;
-	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
-	smp_rmb();
-	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
-	list_move(&pc->lru, &mz->lruvec.lists[lru]);
+	mem_cgroup_lru_del_list(page, page_lru(page));
 }
 
-void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
+/**
+ * mem_cgroup_lru_move_lists - account for moving a page between lrus
+ * @zone: zone of the page
+ * @page: the page
+ * @from: current lru
+ * @to: target lru
+ *
+ * This function accounts for @page being moved between the lrus @from
+ * and @to, and returns the lruvec for the given @zone and the memcg
+ * @page is charged to.
+ *
+ * The callsite is then responsible for physically relinking
+ * @page->lru to the returned lruvec->lists[@to].
+ */
+struct lruvec *mem_cgroup_lru_move_lists(struct zone *zone,
+					 struct page *page,
+					 enum lru_list from,
+					 enum lru_list to)
 {
-	struct page_cgroup *pc;
-	struct mem_cgroup_per_zone *mz;
-
-	if (mem_cgroup_disabled())
-		return;
-	pc = lookup_page_cgroup(page);
-	VM_BUG_ON(PageCgroupAcctLRU(pc));
-	/*
-	 * putback:				charge:
-	 * SetPageLRU				SetPageCgroupUsed
-	 * smp_mb				smp_mb
-	 * PageCgroupUsed && add to memcg LRU	PageLRU && add to memcg LRU
-	 *
-	 * Ensure that one of the two sides adds the page to the memcg
-	 * LRU during a race.
-	 */
-	smp_mb();
-	if (!PageCgroupUsed(pc))
-		return;
-	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
-	smp_rmb();
-	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
-	/* huge page split is done under lru_lock. so, we have no races. */
-	MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
-	SetPageCgroupAcctLRU(pc);
-	list_add(&pc->lru, &mz->lruvec.lists[lru]);
+	/* XXX: Optimize this, especially for @from == @to */
+	mem_cgroup_lru_del_list(page, from);
+	return mem_cgroup_lru_add_list(zone, page, to);
 }
 
 /*
@@ -1053,6 +1081,7 @@ void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
  */
 static void mem_cgroup_lru_del_before_commit(struct page *page)
 {
+	enum lru_list lru;
 	unsigned long flags;
 	struct zone *zone = page_zone(page);
 	struct page_cgroup *pc = lookup_page_cgroup(page);
@@ -1069,17 +1098,28 @@ static void mem_cgroup_lru_del_before_commit(struct page *page)
 		return;
 
 	spin_lock_irqsave(&zone->lru_lock, flags);
+	lru = page_lru(page);
 	/*
-	 * Forget old LRU when this page_cgroup is *not* used. This Used bit
-	 * is guarded by lock_page() because the page is SwapCache.
+	 * The uncharged page could still be registered to the LRU of
+	 * the stale pc->mem_cgroup.
+	 *
+	 * As pc->mem_cgroup is about to get overwritten, the old LRU
+	 * accounting needs to be taken care of.  Let root_mem_cgroup
+	 * babysit the page until the new memcg is responsible for it.
+	 *
+	 * The PCG_USED bit is guarded by lock_page() as the page is
+	 * swapcache/pagecache.
 	 */
-	if (!PageCgroupUsed(pc))
-		mem_cgroup_del_lru_list(page, page_lru(page));
+	if (PageLRU(page) && PageCgroupAcctLRU(pc) && !PageCgroupUsed(pc)) {
+		del_page_from_lru_list(zone, page, lru);
+		add_page_to_lru_list(zone, page, lru);
+	}
 	spin_unlock_irqrestore(&zone->lru_lock, flags);
 }
 
 static void mem_cgroup_lru_add_after_commit(struct page *page)
 {
+	enum lru_list lru;
 	unsigned long flags;
 	struct zone *zone = page_zone(page);
 	struct page_cgroup *pc = lookup_page_cgroup(page);
@@ -1097,22 +1137,22 @@ static void mem_cgroup_lru_add_after_commit(struct page *page)
 	if (likely(!PageLRU(page)))
 		return;
 	spin_lock_irqsave(&zone->lru_lock, flags);
-	/* link when the page is linked to LRU but page_cgroup isn't */
-	if (PageLRU(page) && !PageCgroupAcctLRU(pc))
-		mem_cgroup_add_lru_list(page, page_lru(page));
+	lru = page_lru(page);
+	/*
+	 * If the page is not on the LRU, someone will soon put it
+	 * there.  If it is, and also already accounted for on the
+	 * memcg-side, it must be on the right lruvec as setting
+	 * pc->mem_cgroup and PageCgroupUsed is properly ordered.
+	 * Otherwise, root_mem_cgroup has been babysitting the page
+	 * during the charge.  Move it to the new memcg now.
+	 */
+	if (PageLRU(page) && !PageCgroupAcctLRU(pc)) {
+		del_page_from_lru_list(zone, page, lru);
+		add_page_to_lru_list(zone, page, lru);
+	}
 	spin_unlock_irqrestore(&zone->lru_lock, flags);
 }
 
-
-void mem_cgroup_move_lists(struct page *page,
-			   enum lru_list from, enum lru_list to)
-{
-	if (mem_cgroup_disabled())
-		return;
-	mem_cgroup_del_lru_list(page, from);
-	mem_cgroup_add_lru_list(page, to);
-}
-
 /*
  * Checks whether given mem is same or in the root_mem's
  * hierarchy subtree
@@ -1218,68 +1258,6 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
 	return &mz->reclaim_stat;
 }
 
-unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
-					struct list_head *dst,
-					unsigned long *scanned, int order,
-					isolate_mode_t mode,
-					struct zone *z,
-					struct mem_cgroup *mem_cont,
-					int active, int file)
-{
-	unsigned long nr_taken = 0;
-	struct page *page;
-	unsigned long scan;
-	LIST_HEAD(pc_list);
-	struct list_head *src;
-	struct page_cgroup *pc, *tmp;
-	int nid = zone_to_nid(z);
-	int zid = zone_idx(z);
-	struct mem_cgroup_per_zone *mz;
-	int lru = LRU_FILE * file + active;
-	int ret;
-
-	BUG_ON(!mem_cont);
-	mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
-	src = &mz->lruvec.lists[lru];
-
-	scan = 0;
-	list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
-		if (scan >= nr_to_scan)
-			break;
-
-		if (unlikely(!PageCgroupUsed(pc)))
-			continue;
-
-		page = lookup_cgroup_page(pc);
-
-		if (unlikely(!PageLRU(page)))
-			continue;
-
-		scan++;
-		ret = __isolate_lru_page(page, mode, file);
-		switch (ret) {
-		case 0:
-			list_move(&page->lru, dst);
-			mem_cgroup_del_lru(page);
-			nr_taken += hpage_nr_pages(page);
-			break;
-		case -EBUSY:
-			/* we don't affect global LRU but rotate in our LRU */
-			mem_cgroup_rotate_lru_list(page, page_lru(page));
-			break;
-		default:
-			break;
-		}
-	}
-
-	*scanned = scan;
-
-	trace_mm_vmscan_memcg_isolate(0, nr_to_scan, scan, nr_taken,
-				      0, 0, 0, mode);
-
-	return nr_taken;
-}
-
 #define mem_cgroup_from_res_counter(counter, member)	\
 	container_of(counter, struct mem_cgroup, member)
 
@@ -3615,11 +3593,11 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
 				int node, int zid, enum lru_list lru)
 {
-	struct zone *zone;
 	struct mem_cgroup_per_zone *mz;
-	struct page_cgroup *pc, *busy;
 	unsigned long flags, loop;
 	struct list_head *list;
+	struct page *busy;
+	struct zone *zone;
 	int ret = 0;
 
 	zone = &NODE_DATA(node)->node_zones[zid];
@@ -3631,6 +3609,7 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
 	loop += 256;
 	busy = NULL;
 	while (loop--) {
+		struct page_cgroup *pc;
 		struct page *page;
 
 		ret = 0;
@@ -3639,16 +3618,16 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
 			spin_unlock_irqrestore(&zone->lru_lock, flags);
 			break;
 		}
-		pc = list_entry(list->prev, struct page_cgroup, lru);
-		if (busy == pc) {
-			list_move(&pc->lru, list);
+		page = list_entry(list->prev, struct page, lru);
+		if (busy == page) {
+			list_move(&page->lru, list);
 			busy = NULL;
 			spin_unlock_irqrestore(&zone->lru_lock, flags);
 			continue;
 		}
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 
-		page = lookup_cgroup_page(pc);
+		pc = lookup_page_cgroup(page);
 
 		ret = mem_cgroup_move_parent(page, pc, memcg, GFP_KERNEL);
 		if (ret == -ENOMEM)
@@ -3656,7 +3635,7 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
 
 		if (ret == -EBUSY || ret == -EINVAL) {
 			/* found lock contention or "pc" is obsolete. */
-			busy = pc;
+			busy = page;
 			cond_resched();
 		} else
 			busy = NULL;
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 6bdc67d..256dee8 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -16,7 +16,6 @@ static void __meminit init_page_cgroup(struct page_cgroup *pc, unsigned long id)
 	pc->flags = 0;
 	set_page_cgroup_array_id(pc, id);
 	pc->mem_cgroup = NULL;
-	INIT_LIST_HEAD(&pc->lru);
 }
 static unsigned long total_usage;
 
diff --git a/mm/swap.c b/mm/swap.c
index 66e8292..81c6da9 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -209,12 +209,14 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 static void pagevec_move_tail_fn(struct page *page, void *arg)
 {
 	int *pgmoved = arg;
-	struct zone *zone = page_zone(page);
 
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		enum lru_list lru = page_lru_base_type(page);
-		list_move_tail(&page->lru, &zone->lruvec.lists[lru]);
-		mem_cgroup_rotate_reclaimable_page(page);
+		struct lruvec *lruvec;
+
+		lruvec = mem_cgroup_lru_move_lists(page_zone(page),
+						   page, lru, lru);
+		list_move_tail(&page->lru, &lruvec->lists[lru]);
 		(*pgmoved)++;
 	}
 }
@@ -453,12 +455,13 @@ static void lru_deactivate_fn(struct page *page, void *arg)
 		 */
 		SetPageReclaim(page);
 	} else {
+		struct lruvec *lruvec;
 		/*
 		 * The page's writeback ends up during pagevec
 		 * We moves tha page into tail of inactive.
 		 */
-		list_move_tail(&page->lru, &zone->lruvec.lists[lru]);
-		mem_cgroup_rotate_reclaimable_page(page);
+		lruvec = mem_cgroup_lru_move_lists(zone, page, lru, lru);
+		list_move_tail(&page->lru, &lruvec->lists[lru]);
 		__count_vm_event(PGROTATED);
 	}
 
@@ -648,6 +651,8 @@ void lru_add_page_tail(struct zone* zone,
 	SetPageLRU(page_tail);
 
 	if (page_evictable(page_tail, NULL)) {
+		struct lruvec *lruvec;
+
 		if (PageActive(page)) {
 			SetPageActive(page_tail);
 			active = 1;
@@ -657,11 +662,13 @@ void lru_add_page_tail(struct zone* zone,
 			lru = LRU_INACTIVE_ANON;
 		}
 		update_page_reclaim_stat(zone, page_tail, file, active);
+		lruvec = mem_cgroup_lru_add_list(zone, page_tail, lru);
 		if (likely(PageLRU(page)))
-			__add_page_to_lru_list(zone, page_tail, lru,
-					       page->lru.prev);
+			list_add(&page_tail->lru, page->lru.prev);
 		else
-			add_page_to_lru_list(zone, page_tail, lru);
+			list_add(&page_tail->lru, &lruvec->lists[lru]);
+		__mod_zone_page_state(zone, NR_LRU_BASE + lru,
+				      hpage_nr_pages(page_tail));
 	} else {
 		SetPageUnevictable(page_tail);
 		add_page_to_lru_list(zone, page_tail, LRU_UNEVICTABLE);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index df00195..10f5edb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1156,15 +1156,14 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 
 		switch (__isolate_lru_page(page, mode, file)) {
 		case 0:
+			mem_cgroup_lru_del(page);
 			list_move(&page->lru, dst);
-			mem_cgroup_del_lru(page);
 			nr_taken += hpage_nr_pages(page);
 			break;
 
 		case -EBUSY:
 			/* else it is being freed elsewhere */
 			list_move(&page->lru, src);
-			mem_cgroup_rotate_lru_list(page, page_lru(page));
 			continue;
 
 		default:
@@ -1214,8 +1213,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 				break;
 
 			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
+				mem_cgroup_lru_del(cursor_page);
 				list_move(&cursor_page->lru, dst);
-				mem_cgroup_del_lru(cursor_page);
 				nr_taken += hpage_nr_pages(page);
 				nr_lumpy_taken++;
 				if (PageDirty(cursor_page))
@@ -1256,18 +1255,20 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	return nr_taken;
 }
 
-static unsigned long isolate_pages_global(unsigned long nr,
-					struct list_head *dst,
-					unsigned long *scanned, int order,
-					isolate_mode_t mode,
-					struct zone *z,	int active, int file)
+static unsigned long isolate_pages(unsigned long nr, struct mem_cgroup_zone *mz,
+				   struct list_head *dst,
+				   unsigned long *scanned, int order,
+				   isolate_mode_t mode, int active, int file)
 {
+	struct lruvec *lruvec;
 	int lru = LRU_BASE;
+
+	lruvec = mem_cgroup_zone_lruvec(mz->zone, mz->mem_cgroup);
 	if (active)
 		lru += LRU_ACTIVE;
 	if (file)
 		lru += LRU_FILE;
-	return isolate_lru_pages(nr, &z->lruvec.lists[lru], dst,
+	return isolate_lru_pages(nr, &lruvec->lists[lru], dst,
 				 scanned, order, mode, file);
 }
 
@@ -1535,14 +1536,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
 
 	spin_lock_irq(&zone->lru_lock);
 
-	if (scanning_global_lru(mz)) {
-		nr_taken = isolate_pages_global(nr_to_scan, &page_list,
-			&nr_scanned, sc->order, reclaim_mode, zone, 0, file);
-	} else {
-		nr_taken = mem_cgroup_isolate_pages(nr_to_scan, &page_list,
-			&nr_scanned, sc->order, reclaim_mode, zone,
-			mz->mem_cgroup, 0, file);
-	}
+	nr_taken = isolate_pages(nr_to_scan, mz, &page_list,
+				 &nr_scanned, sc->order,
+				 reclaim_mode, 0, file);
 	if (global_reclaim(sc)) {
 		zone->pages_scanned += nr_scanned;
 		if (current_is_kswapd())
@@ -1626,13 +1622,15 @@ static void move_active_pages_to_lru(struct zone *zone,
 	pagevec_init(&pvec, 1);
 
 	while (!list_empty(list)) {
+		struct lruvec *lruvec;
+
 		page = lru_to_page(list);
 
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 
-		list_move(&page->lru, &zone->lruvec.lists[lru]);
-		mem_cgroup_add_lru_list(page, lru);
+		lruvec = mem_cgroup_lru_add_list(zone, page, lru);
+		list_move(&page->lru, &lruvec->lists[lru]);
 		pgmoved += hpage_nr_pages(page);
 
 		if (!pagevec_add(&pvec, page) || list_empty(list)) {
@@ -1673,17 +1671,10 @@ static void shrink_active_list(unsigned long nr_pages,
 		reclaim_mode |= ISOLATE_CLEAN;
 
 	spin_lock_irq(&zone->lru_lock);
-	if (scanning_global_lru(mz)) {
-		nr_taken = isolate_pages_global(nr_pages, &l_hold,
-						&pgscanned, sc->order,
-						reclaim_mode, zone,
-						1, file);
-	} else {
-		nr_taken = mem_cgroup_isolate_pages(nr_pages, &l_hold,
-						&pgscanned, sc->order,
-						reclaim_mode, zone,
-						mz->mem_cgroup, 1, file);
-	}
+
+	nr_taken = isolate_pages(nr_pages, mz, &l_hold,
+				 &pgscanned, sc->order,
+				 reclaim_mode, 1, file);
 
 	if (global_reclaim(sc))
 		zone->pages_scanned += pgscanned;
@@ -3403,16 +3394,18 @@ int page_evictable(struct page *page, struct vm_area_struct *vma)
  */
 static void check_move_unevictable_page(struct page *page, struct zone *zone)
 {
-	VM_BUG_ON(PageActive(page));
+	struct lruvec *lruvec;
 
+	VM_BUG_ON(PageActive(page));
 retry:
 	ClearPageUnevictable(page);
 	if (page_evictable(page, NULL)) {
 		enum lru_list l = page_lru_base_type(page);
 
 		__dec_zone_state(zone, NR_UNEVICTABLE);
-		list_move(&page->lru, &zone->lruvec.lists[l]);
-		mem_cgroup_move_lists(page, LRU_UNEVICTABLE, l);
+		lruvec = mem_cgroup_lru_move_lists(zone, page,
+						   LRU_UNEVICTABLE, l);
+		list_move(&page->lru, &lruvec->lists[l]);
 		__inc_zone_state(zone, NR_INACTIVE_ANON + l);
 		__count_vm_event(UNEVICTABLE_PGRESCUED);
 	} else {
@@ -3420,8 +3413,9 @@ retry:
 		 * rotate unevictable list
 		 */
 		SetPageUnevictable(page);
-		list_move(&page->lru, &zone->lruvec.lists[LRU_UNEVICTABLE]);
-		mem_cgroup_rotate_lru_list(page, LRU_UNEVICTABLE);
+		lruvec = mem_cgroup_lru_move_lists(zone, page, LRU_UNEVICTABLE,
+						   LRU_UNEVICTABLE);
+		list_move(&page->lru, &lruvec->lists[LRU_UNEVICTABLE]);
 		if (page_evictable(page, NULL))
 			goto retry;
 	}
@@ -3482,17 +3476,6 @@ void scan_mapping_unevictable_pages(struct address_space *mapping)
 
 }
 
-/*
- * XXX: Temporary helper to get to the last page of a mem_cgroup_zone
- * lru list.  This will be reasonably unified in a second.
- */
-static struct page *lru_tailpage(struct mem_cgroup_zone *mz, enum lru_list lru)
-{
-	if (!scanning_global_lru(mz))
-		return mem_cgroup_lru_to_page(mz->zone, mz->mem_cgroup, lru);
-	return lru_to_page(&mz->zone->lruvec.lists[lru]);
-}
-
 /**
  * scan_zone_unevictable_pages - check unevictable list for evictable pages
  * @zone - zone of which to scan the unevictable list
@@ -3515,8 +3498,12 @@ static void scan_zone_unevictable_pages(struct zone *zone)
 			.zone = zone,
 		};
 		unsigned long nr_to_scan;
+		struct list_head *list;
+		struct lruvec *lruvec;
 
 		nr_to_scan = zone_nr_lru_pages(&mz, LRU_UNEVICTABLE);
+		lruvec = mem_cgroup_zone_lruvec(zone, mem);
+		list = &lruvec->lists[LRU_UNEVICTABLE];
 		while (nr_to_scan > 0) {
 			unsigned long batch_size;
 			unsigned long scan;
@@ -3527,7 +3514,7 @@ static void scan_zone_unevictable_pages(struct zone *zone)
 			for (scan = 0; scan < batch_size; scan++) {
 				struct page *page;
 
-				page = lru_tailpage(&mz, LRU_UNEVICTABLE);
+				page = lru_to_page(list);
 				if (!trylock_page(page))
 					continue;
 				if (likely(PageLRU(page) &&
-- 
1.7.6
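
As an aside, here is a condensed, userspace-only sketch of the calling
convention this patch establishes: the memcg side only accounts the page
and hands back the lruvec, while the caller links page->lru itself.  The
names mirror the kernel functions in the diff, but the types are
simplified stand-ins and the program is purely illustrative, not kernel
code.

/* Simplified model of exclusive per-memcg LRU linking (illustrative only). */
#include <stdio.h>

struct list_head { struct list_head *prev, *next; };

static void list_init(struct list_head *h) { h->prev = h->next = h; }

static void list_add(struct list_head *new, struct list_head *head)
{
	new->next = head->next;
	new->prev = head;
	head->next->prev = new;
	head->next = new;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->prev = entry->next = entry;
}

enum lru_list { LRU_INACTIVE_ANON, LRU_ACTIVE_ANON, NR_LRU_LISTS };

struct lruvec { struct list_head lists[NR_LRU_LISTS]; };

/* One lruvec per memcg here; the kernel keeps one per memcg per zone. */
struct mem_cgroup { struct lruvec lruvec; };

struct page {
	struct list_head lru;		/* linked into exactly one lruvec */
	struct mem_cgroup *memcg;	/* stands in for pc->mem_cgroup */
};

/* Accounting side: returns the lruvec, physical linking is the caller's job. */
static struct lruvec *mem_cgroup_lru_add_list(struct page *page, enum lru_list lru)
{
	/* the real function bumps MEM_CGROUP_ZSTAT(mz, lru) here */
	return &page->memcg->lruvec;
}

static void add_page_to_lru_list(struct page *page, enum lru_list lru)
{
	struct lruvec *lruvec = mem_cgroup_lru_add_list(page, lru);

	list_add(&page->lru, &lruvec->lists[lru]);
}

static void del_page_from_lru_list(struct page *page, enum lru_list lru)
{
	/* mem_cgroup_lru_del_list() would undo the accounting here */
	list_del(&page->lru);
}

int main(void)
{
	struct mem_cgroup memcg;
	struct page page = { .memcg = &memcg };
	int i;

	for (i = 0; i < NR_LRU_LISTS; i++)
		list_init(&memcg.lruvec.lists[i]);

	add_page_to_lru_list(&page, LRU_INACTIVE_ANON);
	printf("linked: %d\n", memcg.lruvec.lists[LRU_INACTIVE_ANON].next == &page.lru);
	del_page_from_lru_list(&page, LRU_INACTIVE_ANON);
	printf("linked: %d\n", memcg.lruvec.lists[LRU_INACTIVE_ANON].next == &page.lru);
	return 0;
}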


^ permalink raw reply related	[flat|nested] 130+ messages in thread

* [patch 11/11] mm: memcg: remove unused node/section info from pc->flags
@ 2011-09-12 10:57   ` Johannes Weiner
  0 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-12 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

To find the page corresponding to a certain page_cgroup, pc->flags
encoded the node or section ID of the base array that the pc pointer
is an offset into, so the pfn could be computed from that offset.

Now that the per-memory cgroup LRU lists link page descriptors
directly, there is no longer any code that knows the page_cgroup but
not the page.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
---
 include/linux/page_cgroup.h |   33 ------------------------
 mm/page_cgroup.c            |   58 ++++++-------------------------------------
 2 files changed, 8 insertions(+), 83 deletions(-)
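
With the reverse lookup gone, the only remaining direction is page ->
page_cgroup by plain pfn-based array indexing, which is why nothing
about the array needs to live in pc->flags anymore.  A minimal
userspace sketch of that indexing under the flatmem layout follows; the
pfn constants are made up and, unlike the kernel's lookup_page_cgroup(),
the helper here takes a raw pfn instead of a struct page.

/* Illustrative model only, not kernel code. */
#include <stdio.h>

struct mem_cgroup;

struct page_cgroup {
	unsigned long flags;		/* no array id encoded here anymore */
	struct mem_cgroup *mem_cgroup;
};

#define NODE_START_PFN	0x1000UL	/* made-up node layout */
#define NODE_NR_PAGES	16UL

static struct page_cgroup node_page_cgroup[NODE_NR_PAGES];

/* Mirrors the flatmem lookup: base + (pfn - node_start_pfn). */
static struct page_cgroup *lookup_page_cgroup(unsigned long pfn)
{
	return node_page_cgroup + (pfn - NODE_START_PFN);
}

int main(void)
{
	unsigned long pfn = NODE_START_PFN + 5;
	struct page_cgroup *pc = lookup_page_cgroup(pfn);

	printf("pc index for pfn %#lx: %ld\n", pfn, (long)(pc - node_page_cgroup));
	return 0;
}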

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 5bae753..aaa60da 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -121,39 +121,6 @@ static inline void move_unlock_page_cgroup(struct page_cgroup *pc,
 	local_irq_restore(*flags);
 }
 
-#ifdef CONFIG_SPARSEMEM
-#define PCG_ARRAYID_WIDTH	SECTIONS_SHIFT
-#else
-#define PCG_ARRAYID_WIDTH	NODES_SHIFT
-#endif
-
-#if (PCG_ARRAYID_WIDTH > BITS_PER_LONG - NR_PCG_FLAGS)
-#error Not enough space left in pc->flags to store page_cgroup array IDs
-#endif
-
-/* pc->flags: ARRAY-ID | FLAGS */
-
-#define PCG_ARRAYID_MASK	((1UL << PCG_ARRAYID_WIDTH) - 1)
-
-#define PCG_ARRAYID_OFFSET	(BITS_PER_LONG - PCG_ARRAYID_WIDTH)
-/*
- * Zero the shift count for non-existent fields, to prevent compiler
- * warnings and ensure references are optimized away.
- */
-#define PCG_ARRAYID_SHIFT	(PCG_ARRAYID_OFFSET * (PCG_ARRAYID_WIDTH != 0))
-
-static inline void set_page_cgroup_array_id(struct page_cgroup *pc,
-					    unsigned long id)
-{
-	pc->flags &= ~(PCG_ARRAYID_MASK << PCG_ARRAYID_SHIFT);
-	pc->flags |= (id & PCG_ARRAYID_MASK) << PCG_ARRAYID_SHIFT;
-}
-
-static inline unsigned long page_cgroup_array_id(struct page_cgroup *pc)
-{
-	return (pc->flags >> PCG_ARRAYID_SHIFT) & PCG_ARRAYID_MASK;
-}
-
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct page_cgroup;
 
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 256dee8..2601a65 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -11,12 +11,6 @@
 #include <linux/swapops.h>
 #include <linux/kmemleak.h>
 
-static void __meminit init_page_cgroup(struct page_cgroup *pc, unsigned long id)
-{
-	pc->flags = 0;
-	set_page_cgroup_array_id(pc, id);
-	pc->mem_cgroup = NULL;
-}
 static unsigned long total_usage;
 
 #if !defined(CONFIG_SPARSEMEM)
@@ -41,24 +35,11 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
 	return base + offset;
 }
 
-struct page *lookup_cgroup_page(struct page_cgroup *pc)
-{
-	unsigned long pfn;
-	struct page *page;
-	pg_data_t *pgdat;
-
-	pgdat = NODE_DATA(page_cgroup_array_id(pc));
-	pfn = pc - pgdat->node_page_cgroup + pgdat->node_start_pfn;
-	page = pfn_to_page(pfn);
-	VM_BUG_ON(pc != lookup_page_cgroup(page));
-	return page;
-}
-
 static int __init alloc_node_page_cgroup(int nid)
 {
-	struct page_cgroup *base, *pc;
+	struct page_cgroup *base;
 	unsigned long table_size;
-	unsigned long start_pfn, nr_pages, index;
+	unsigned long nr_pages;
 
 	start_pfn = NODE_DATA(nid)->node_start_pfn;
 	nr_pages = NODE_DATA(nid)->node_spanned_pages;
@@ -72,10 +53,6 @@ static int __init alloc_node_page_cgroup(int nid)
 			table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
 	if (!base)
 		return -ENOMEM;
-	for (index = 0; index < nr_pages; index++) {
-		pc = base + index;
-		init_page_cgroup(pc, nid);
-	}
 	NODE_DATA(nid)->node_page_cgroup = base;
 	total_usage += table_size;
 	return 0;
@@ -116,31 +93,19 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
 	return section->page_cgroup + pfn;
 }
 
-struct page *lookup_cgroup_page(struct page_cgroup *pc)
-{
-	struct mem_section *section;
-	struct page *page;
-	unsigned long nr;
-
-	nr = page_cgroup_array_id(pc);
-	section = __nr_to_section(nr);
-	page = pfn_to_page(pc - section->page_cgroup);
-	VM_BUG_ON(pc != lookup_page_cgroup(page));
-	return page;
-}
-
 static void *__meminit alloc_page_cgroup(size_t size, int nid)
 {
 	void *addr = NULL;
 
-	addr = alloc_pages_exact_nid(nid, size, GFP_KERNEL | __GFP_NOWARN);
+	addr = alloc_pages_exact_nid(nid, size,
+				     GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN);
 	if (addr)
 		return addr;
 
 	if (node_state(nid, N_HIGH_MEMORY))
-		addr = vmalloc_node(size, nid);
+		addr = vzalloc_node(size, nid);
 	else
-		addr = vmalloc(size);
+		addr = vzalloc(size);
 
 	return addr;
 }
@@ -163,14 +128,11 @@ static void free_page_cgroup(void *addr)
 
 static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
 {
-	struct page_cgroup *base, *pc;
 	struct mem_section *section;
+	struct page_cgroup *base;
 	unsigned long table_size;
-	unsigned long nr;
-	int index;
 
-	nr = pfn_to_section_nr(pfn);
-	section = __nr_to_section(nr);
+	section = __pfn_to_section(pfn);
 
 	if (section->page_cgroup)
 		return 0;
@@ -190,10 +152,6 @@ static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
 		return -ENOMEM;
 	}
 
-	for (index = 0; index < PAGES_PER_SECTION; index++) {
-		pc = base + index;
-		init_page_cgroup(pc, nr);
-	}
 	/*
 	 * The passed "pfn" may not be aligned to SECTION.  For the calculation
 	 * we need to apply a mask.
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 130+ messages in thread

* Re: [patch 01/11] mm: memcg: consolidate hierarchy iteration primitives
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-12 22:37     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 130+ messages in thread
From: Kirill A. Shutemov @ 2011-09-12 22:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Michal Hocko, Greg Thelen,
	Michel Lespinasse, Rik van Riel, Minchan Kim, Christoph Hellwig,
	linux-mm, linux-kernel

On Mon, Sep 12, 2011 at 12:57:18PM +0200, Johannes Weiner wrote:
> Memory control groups are currently bolted onto the side of
> traditional memory management in places where better integration would
> be preferable.  To reclaim memory, for example, memory control groups
> maintain their own LRU list and reclaim strategy aside from the global
> per-zone LRU list reclaim.  But an extra list head for each existing
> page frame is expensive and maintaining it requires additional code.
> 
> This patchset disables the global per-zone LRU lists on memory cgroup
> configurations and converts all their users to operate on the per-memory
> cgroup lists instead.  As LRU pages are then exclusively on one list,
> this saves two list pointers for each page frame in the system:
> 
> page_cgroup array size with 4G physical memory
> 
>   vanilla: [    0.000000] allocated 31457280 bytes of page_cgroup
>   patched: [    0.000000] allocated 15728640 bytes of page_cgroup
> 
> At the same time, system performance for various workloads is
> unaffected:
> 
> 100G sparse file cat, 4G physical memory, 10 runs, to test for code
> bloat in the traditional LRU handling and kswapd & direct reclaim
> paths, without/with the memory controller configured in
> 
>   vanilla: 71.603(0.207) seconds
>   patched: 71.640(0.156) seconds
> 
>   vanilla: 79.558(0.288) seconds
>   patched: 77.233(0.147) seconds
> 
> 100G sparse file cat in 1G memory cgroup, 10 runs, to test for code
> bloat in the traditional memory cgroup LRU handling and reclaim path
> 
>   vanilla: 96.844(0.281) seconds
>   patched: 94.454(0.311) seconds
> 
> 4 unlimited memcgs running kbuild -j32 each, 4G physical memory, 500M
> swap on SSD, 10 runs, to test for regressions in kswapd & direct
> reclaim using per-memcg LRU lists with multiple memcgs and multiple
> allocators within each memcg
> 
>   vanilla: 717.722(1.440) seconds [ 69720.100(11600.835) majfaults ]
>   patched: 714.106(2.313) seconds [ 71109.300(14886.186) majfaults ]
> 
> 16 unlimited memcgs running kbuild, 1900M hierarchical limit, 500M
> swap on SSD, 10 runs, to test for regressions in hierarchical memcg
> setups
> 
>   vanilla: 2742.058(1.992) seconds [ 26479.600(1736.737) majfaults ]
>   patched: 2743.267(1.214) seconds [ 27240.700(1076.063) majfaults ]
> 
> This patch:
> 
> There are currently two different implementations of iterating over a
> memory cgroup hierarchy tree.
> 
> Consolidate them into one worker function and base the convenience
> looping-macros on top of it.
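
A usage sketch of the consolidated constructs (the predicate and the
per-memcg work are placeholders, not code from the patch):

	struct mem_cgroup *iter;

	for_each_mem_cgroup_tree(iter, root) {
		if (want_to_stop(iter)) {	/* hypothetical predicate */
			/* leaving the loop early must drop the reference: */
			mem_cgroup_iter_break(root, iter);
			break;
		}
		visit_memcg(iter);		/* hypothetical per-memcg work */
	}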

Looks nice!

A few comments below.

> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> ---
>  mm/memcontrol.c |  196 ++++++++++++++++++++----------------------------------
>  1 files changed, 73 insertions(+), 123 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b76011a..912c7c7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -781,83 +781,75 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
>  	return memcg;
>  }
>  
> -/* The caller has to guarantee "mem" exists before calling this */
> -static struct mem_cgroup *mem_cgroup_start_loop(struct mem_cgroup *memcg)
> +static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> +					  struct mem_cgroup *prev,
> +					  bool remember)
>  {
> -	struct cgroup_subsys_state *css;
> -	int found;
> +	struct mem_cgroup *mem = NULL;
> +	int id = 0;
>  
> -	if (!memcg) /* ROOT cgroup has the smallest ID */
> -		return root_mem_cgroup; /*css_put/get against root is ignored*/
> -	if (!memcg->use_hierarchy) {
> -		if (css_tryget(&memcg->css))
> -			return memcg;
> -		return NULL;
> -	}
> -	rcu_read_lock();
> -	/*
> -	 * searching a memory cgroup which has the smallest ID under given
> -	 * ROOT cgroup. (ID >= 1)
> -	 */
> -	css = css_get_next(&mem_cgroup_subsys, 1, &memcg->css, &found);
> -	if (css && css_tryget(css))
> -		memcg = container_of(css, struct mem_cgroup, css);
> -	else
> -		memcg = NULL;
> -	rcu_read_unlock();
> -	return memcg;
> -}
> +	if (!root)
> +		root = root_mem_cgroup;
>  
> -static struct mem_cgroup *mem_cgroup_get_next(struct mem_cgroup *iter,
> -					struct mem_cgroup *root,
> -					bool cond)
> -{
> -	int nextid = css_id(&iter->css) + 1;
> -	int found;
> -	int hierarchy_used;
> -	struct cgroup_subsys_state *css;
> +	if (prev && !remember)
> +		id = css_id(&prev->css);
>  
> -	hierarchy_used = iter->use_hierarchy;
> +	if (prev && prev != root)
> +		css_put(&prev->css);
>  
> -	css_put(&iter->css);
> -	/* If no ROOT, walk all, ignore hierarchy */
> -	if (!cond || (root && !hierarchy_used))
> -		return NULL;
> +	if (!root->use_hierarchy && root != root_mem_cgroup) {
> +		if (prev)
> +			return NULL;
> +		return root;
> +	}
>  
> -	if (!root)
> -		root = root_mem_cgroup;
> +	while (!mem) {
> +		struct cgroup_subsys_state *css;
>  
> -	do {
> -		iter = NULL;
> -		rcu_read_lock();
> +		if (remember)
> +			id = root->last_scanned_child;
>  
> -		css = css_get_next(&mem_cgroup_subsys, nextid,
> -				&root->css, &found);
> -		if (css && css_tryget(css))
> -			iter = container_of(css, struct mem_cgroup, css);
> +		rcu_read_lock();
> +		css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
> +		if (css) {
> +			if (css == &root->css || css_tryget(css))

When does css != &root->css here?

> +				mem = container_of(css, struct mem_cgroup, css);
> +		} else
> +			id = 0;
>  		rcu_read_unlock();
> -		/* If css is NULL, no more cgroups will be found */
> -		nextid = found + 1;
> -	} while (css && !iter);
>  
> -	return iter;
> +		if (remember)
> +			root->last_scanned_child = id;
> +
> +		if (prev && !css)
> +			return NULL;
> +	}
> +	return mem;
>  }
> -/*
> - * for_eacn_mem_cgroup_tree() for visiting all cgroup under tree. Please
> - * be careful that "break" loop is not allowed. We have reference count.
> - * Instead of that modify "cond" to be false and "continue" to exit the loop.
> - */
> -#define for_each_mem_cgroup_tree_cond(iter, root, cond)	\
> -	for (iter = mem_cgroup_start_loop(root);\
> -	     iter != NULL;\
> -	     iter = mem_cgroup_get_next(iter, root, cond))
>  
> -#define for_each_mem_cgroup_tree(iter, root) \
> -	for_each_mem_cgroup_tree_cond(iter, root, true)
> +static void mem_cgroup_iter_break(struct mem_cgroup *root,
> +				  struct mem_cgroup *prev)
> +{
> +	if (!root)
> +		root = root_mem_cgroup;
> +	if (prev && prev != root)
> +		css_put(&prev->css);
> +}
>  
> -#define for_each_mem_cgroup_all(iter) \
> -	for_each_mem_cgroup_tree_cond(iter, NULL, true)
> +/*
> + * Iteration constructs for visiting all cgroups (under a tree).  If
> + * loops are exited prematurely (break), mem_cgroup_iter_break() must
> + * be used for reference counting.
> + */
> +#define for_each_mem_cgroup_tree(iter, root)		\
> +	for (iter = mem_cgroup_iter(root, NULL, false);	\
> +	     iter != NULL;				\
> +	     iter = mem_cgroup_iter(root, iter, false))
>  
> +#define for_each_mem_cgroup(iter)			\
> +	for (iter = mem_cgroup_iter(NULL, NULL, false);	\
> +	     iter != NULL;				\
> +	     iter = mem_cgroup_iter(NULL, iter, false))
>  
>  static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
>  {
> @@ -1464,43 +1456,6 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
>  	return min(limit, memsw);
>  }
>  
> -/*
> - * Visit the first child (need not be the first child as per the ordering
> - * of the cgroup list, since we track last_scanned_child) of @mem and use
> - * that to reclaim free pages from.
> - */
> -static struct mem_cgroup *
> -mem_cgroup_select_victim(struct mem_cgroup *root_memcg)
> -{
> -	struct mem_cgroup *ret = NULL;
> -	struct cgroup_subsys_state *css;
> -	int nextid, found;
> -
> -	if (!root_memcg->use_hierarchy) {
> -		css_get(&root_memcg->css);
> -		ret = root_memcg;
> -	}
> -
> -	while (!ret) {
> -		rcu_read_lock();
> -		nextid = root_memcg->last_scanned_child + 1;
> -		css = css_get_next(&mem_cgroup_subsys, nextid, &root_memcg->css,
> -				   &found);
> -		if (css && css_tryget(css))
> -			ret = container_of(css, struct mem_cgroup, css);
> -
> -		rcu_read_unlock();
> -		/* Updates scanning parameter */
> -		if (!css) {
> -			/* this means start scan from ID:1 */
> -			root_memcg->last_scanned_child = 0;
> -		} else
> -			root_memcg->last_scanned_child = found;
> -	}
> -
> -	return ret;
> -}
> -
>  /**
>   * test_mem_cgroup_node_reclaimable
>   * @mem: the target memcg
> @@ -1656,7 +1611,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
>  						unsigned long reclaim_options,
>  						unsigned long *total_scanned)
>  {
> -	struct mem_cgroup *victim;
> +	struct mem_cgroup *victim = NULL;
>  	int ret, total = 0;
>  	int loop = 0;
>  	bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
> @@ -1672,8 +1627,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
>  		noswap = true;
>  
>  	while (1) {
> -		victim = mem_cgroup_select_victim(root_memcg);
> -		if (victim == root_memcg) {
> +		victim = mem_cgroup_iter(root_memcg, victim, true);
> +		if (!victim) {
>  			loop++;
>  			/*
>  			 * We are not draining per cpu cached charges during
> @@ -1689,10 +1644,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
>  				 * anything, it might because there are
>  				 * no reclaimable pages under this hierarchy
>  				 */
> -				if (!check_soft || !total) {
> -					css_put(&victim->css);
> +				if (!check_soft || !total)
>  					break;
> -				}
>  				/*
>  				 * We want to do more targeted reclaim.
>  				 * excess >> 2 is not to excessive so as to
> @@ -1700,15 +1653,13 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
>  				 * coming back to reclaim from this cgroup
>  				 */
>  				if (total >= (excess >> 2) ||
> -					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) {
> -					css_put(&victim->css);
> +					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
>  					break;
> -				}
>  			}
> +			continue;

Shouldn't we do

victim = root_memcg;

instead?

>  		}
>  		if (!mem_cgroup_reclaimable(victim, noswap)) {
>  			/* this cgroup's local usage == 0 */
> -			css_put(&victim->css);
>  			continue;
>  		}
>  		/* we use swappiness of local cgroup */
> @@ -1719,21 +1670,21 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
>  		} else
>  			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
>  						noswap);
> -		css_put(&victim->css);
>  		/*
>  		 * At shrinking usage, we can't check we should stop here or
>  		 * reclaim more. It's depends on callers. last_scanned_child
>  		 * will work enough for keeping fairness under tree.
>  		 */
>  		if (shrink)
> -			return ret;
> +			break;
>  		total += ret;
>  		if (check_soft) {
>  			if (!res_counter_soft_limit_excess(&root_memcg->res))
> -				return total;
> +				break;
>  		} else if (mem_cgroup_margin(root_memcg))
> -			return total;
> +			break;
>  	}
> +	mem_cgroup_iter_break(root_memcg, victim);
>  	return total;
>  }
>  
> @@ -1745,16 +1696,16 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
>  static bool mem_cgroup_oom_lock(struct mem_cgroup *memcg)
>  {
>  	struct mem_cgroup *iter, *failed = NULL;
> -	bool cond = true;
>  
> -	for_each_mem_cgroup_tree_cond(iter, memcg, cond) {
> +	for_each_mem_cgroup_tree(iter, memcg) {
>  		if (iter->oom_lock) {
>  			/*
>  			 * this subtree of our hierarchy is already locked
>  			 * so we cannot give a lock.
>  			 */
>  			failed = iter;
> -			cond = false;
> +			mem_cgroup_iter_break(memcg, iter);
> +			break;
>  		} else
>  			iter->oom_lock = true;
>  	}
> @@ -1766,11 +1717,10 @@ static bool mem_cgroup_oom_lock(struct mem_cgroup *memcg)
>  	 * OK, we failed to lock the whole subtree so we have to clean up
>  	 * what we set up to the failing subtree
>  	 */
> -	cond = true;
> -	for_each_mem_cgroup_tree_cond(iter, memcg, cond) {
> +	for_each_mem_cgroup_tree(iter, memcg) {
>  		if (iter == failed) {
> -			cond = false;
> -			continue;
> +			mem_cgroup_iter_break(memcg, iter);
> +			break;
>  		}
>  		iter->oom_lock = false;
>  	}
> @@ -2166,7 +2116,7 @@ static int __cpuinit memcg_cpu_hotplug_callback(struct notifier_block *nb,
>  	struct mem_cgroup *iter;
>  
>  	if ((action == CPU_ONLINE)) {
> -		for_each_mem_cgroup_all(iter)
> +		for_each_mem_cgroup(iter)
>  			synchronize_mem_cgroup_on_move(iter, cpu);
>  		return NOTIFY_OK;
>  	}
> @@ -2174,7 +2124,7 @@ static int __cpuinit memcg_cpu_hotplug_callback(struct notifier_block *nb,
>  	if ((action != CPU_DEAD) || action != CPU_DEAD_FROZEN)
>  		return NOTIFY_OK;
>  
> -	for_each_mem_cgroup_all(iter)
> +	for_each_mem_cgroup(iter)
>  		mem_cgroup_drain_pcp_counter(iter, cpu);
>  
>  	stock = &per_cpu(memcg_stock, cpu);
> -- 
> 1.7.6
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 02/11] mm: vmscan: distinguish global reclaim from global LRU scanning
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-12 23:02     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 130+ messages in thread
From: Kirill A. Shutemov @ 2011-09-12 23:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Michal Hocko, Greg Thelen,
	Michel Lespinasse, Rik van Riel, Minchan Kim, Christoph Hellwig,
	linux-mm, linux-kernel

On Mon, Sep 12, 2011 at 12:57:19PM +0200, Johannes Weiner wrote:
> The traditional zone reclaim code is scanning the per-zone LRU lists
> during direct reclaim and kswapd, and the per-zone per-memory cgroup
> LRU lists when reclaiming on behalf of a memory cgroup limit.
> 
> Subsequent patches will convert the traditional reclaim code to
> reclaim exclusively from the per-memory cgroup LRU lists.  As a
> result, using the predicate for which LRU list is scanned will no
> longer be appropriate to tell global reclaim from limit reclaim.
> 
> This patch adds a global_reclaim() predicate to tell direct/kswapd
> reclaim from memory cgroup limit reclaim and substitutes it in all
> places where currently scanning_global_lru() is used for that.
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>
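
A condensed view of the split this prepares (both predicates still
test !sc->mem_cgroup at this point; the calls below are taken from the
hunks that follow, with their argument lists trimmed):

	if (scanning_global_lru(sc))	/* which LRU the pages come from */
		nr_taken = isolate_pages_global(nr_to_scan, &page_list, ...);
	else
		nr_taken = mem_cgroup_isolate_pages(nr_to_scan, &page_list, ...);

	if (global_reclaim(sc))		/* kswapd/direct vs. limit reclaim */
		zone->pages_scanned += nr_scanned;
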
> ---
>  mm/vmscan.c |   60 +++++++++++++++++++++++++++++++++++-----------------------
>  1 files changed, 36 insertions(+), 24 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7502726..354f125 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -153,9 +153,25 @@ static LIST_HEAD(shrinker_list);
>  static DECLARE_RWSEM(shrinker_rwsem);
>  
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> -#define scanning_global_lru(sc)	(!(sc)->mem_cgroup)
> +static bool global_reclaim(struct scan_control *sc)
> +{
> +	return !sc->mem_cgroup;
> +}
> +
> +static bool scanning_global_lru(struct scan_control *sc)
> +{
> +	return !sc->mem_cgroup;
> +}
>  #else
> -#define scanning_global_lru(sc)	(1)
> +static bool global_reclaim(struct scan_control *sc)
> +{
> +	return true;
> +}
> +
> +static bool scanning_global_lru(struct scan_control *sc)
> +{
> +	return true;
> +}
>  #endif
>  
>  static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> @@ -1011,7 +1027,7 @@ keep_lumpy:
>  	 * back off and wait for congestion to clear because further reclaim
>  	 * will encounter the same problem
>  	 */
> -	if (nr_dirty && nr_dirty == nr_congested && scanning_global_lru(sc))
> +	if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc))
>  		zone_set_flag(zone, ZONE_CONGESTED);
>  
>  	free_page_list(&free_pages);
> @@ -1330,7 +1346,7 @@ static int too_many_isolated(struct zone *zone, int file,
>  	if (current_is_kswapd())
>  		return 0;
>  
> -	if (!scanning_global_lru(sc))
> +	if (!global_reclaim(sc))
>  		return 0;
>  
>  	if (file) {
> @@ -1508,6 +1524,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  	if (scanning_global_lru(sc)) {
>  		nr_taken = isolate_pages_global(nr_to_scan, &page_list,
>  			&nr_scanned, sc->order, reclaim_mode, zone, 0, file);
> +	} else {
> +		nr_taken = mem_cgroup_isolate_pages(nr_to_scan, &page_list,
> +			&nr_scanned, sc->order, reclaim_mode, zone,
> +			sc->mem_cgroup, 0, file);
> +	}

Redundant braces.

> +	if (global_reclaim(sc)) {
>  		zone->pages_scanned += nr_scanned;
>  		if (current_is_kswapd())
>  			__count_zone_vm_events(PGSCAN_KSWAPD, zone,
> @@ -1515,14 +1537,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  		else
>  			__count_zone_vm_events(PGSCAN_DIRECT, zone,
>  					       nr_scanned);
> -	} else {
> -		nr_taken = mem_cgroup_isolate_pages(nr_to_scan, &page_list,
> -			&nr_scanned, sc->order, reclaim_mode, zone,
> -			sc->mem_cgroup, 0, file);
> -		/*
> -		 * mem_cgroup_isolate_pages() keeps track of
> -		 * scanned pages on its own.
> -		 */
>  	}
>  
>  	if (nr_taken == 0) {
> @@ -1647,18 +1661,16 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
>  						&pgscanned, sc->order,
>  						reclaim_mode, zone,
>  						1, file);
> -		zone->pages_scanned += pgscanned;
>  	} else {
>  		nr_taken = mem_cgroup_isolate_pages(nr_pages, &l_hold,
>  						&pgscanned, sc->order,
>  						reclaim_mode, zone,
>  						sc->mem_cgroup, 1, file);
> -		/*
> -		 * mem_cgroup_isolate_pages() keeps track of
> -		 * scanned pages on its own.
> -		 */
>  	}

Ditto.


-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 01/11] mm: memcg: consolidate hierarchy iteration primitives
  2011-09-12 22:37     ` Kirill A. Shutemov
@ 2011-09-13  5:40       ` Johannes Weiner
  -1 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-13  5:40 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Michal Hocko, Greg Thelen,
	Michel Lespinasse, Rik van Riel, Minchan Kim, Christoph Hellwig,
	linux-mm, linux-kernel

On Tue, Sep 13, 2011 at 01:37:46AM +0300, Kirill A. Shutemov wrote:
> On Mon, Sep 12, 2011 at 12:57:18PM +0200, Johannes Weiner wrote:
> > -static struct mem_cgroup *mem_cgroup_get_next(struct mem_cgroup *iter,
> > -					struct mem_cgroup *root,
> > -					bool cond)
> > -{
> > -	int nextid = css_id(&iter->css) + 1;
> > -	int found;
> > -	int hierarchy_used;
> > -	struct cgroup_subsys_state *css;
> > +	if (prev && !remember)
> > +		id = css_id(&prev->css);
> >  
> > -	hierarchy_used = iter->use_hierarchy;
> > +	if (prev && prev != root)
> > +		css_put(&prev->css);
> >  
> > -	css_put(&iter->css);
> > -	/* If no ROOT, walk all, ignore hierarchy */
> > -	if (!cond || (root && !hierarchy_used))
> > -		return NULL;
> > +	if (!root->use_hierarchy && root != root_mem_cgroup) {
> > +		if (prev)
> > +			return NULL;
> > +		return root;
> > +	}
> >  
> > -	if (!root)
> > -		root = root_mem_cgroup;
> > +	while (!mem) {
> > +		struct cgroup_subsys_state *css;
> >  
> > -	do {
> > -		iter = NULL;
> > -		rcu_read_lock();
> > +		if (remember)
> > +			id = root->last_scanned_child;
> >  
> > -		css = css_get_next(&mem_cgroup_subsys, nextid,
> > -				&root->css, &found);
> > -		if (css && css_tryget(css))
> > -			iter = container_of(css, struct mem_cgroup, css);
> > +		rcu_read_lock();
> > +		css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
> > +		if (css) {
> > +			if (css == &root->css || css_tryget(css))
> 
> When does css != &root->css here?

It does not grab an extra reference to the passed hierarchy root, as
all callsites must already hold one to guarantee it's not going away.
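
A caller-side sketch of that rule (pinning the root explicitly with
css_get() is purely for illustration; real call sites typically hold
the reference implicitly through their charging or reclaim context):

	struct mem_cgroup *iter;

	css_get(&root->css);		/* caller's reference on the root */
	for_each_mem_cgroup_tree(iter, root) {
		/* iter == root is covered by the caller's reference above;
		 * any other iter was pinned by css_tryget() inside
		 * mem_cgroup_iter() and is dropped on the next iteration. */
		touch_memcg(iter);	/* hypothetical per-memcg work */
	}
	css_put(&root->css);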

> > +static void mem_cgroup_iter_break(struct mem_cgroup *root,
> > +				  struct mem_cgroup *prev)
> > +{
> > +	if (!root)
> > +		root = root_mem_cgroup;
> > +	if (prev && prev != root)
> > +		css_put(&prev->css);
> > +}

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 02/11] mm: vmscan: distinguish global reclaim from global LRU scanning
  2011-09-12 23:02     ` Kirill A. Shutemov
@ 2011-09-13  5:48       ` Johannes Weiner
  -1 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-13  5:48 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Michal Hocko, Greg Thelen,
	Michel Lespinasse, Rik van Riel, Minchan Kim, Christoph Hellwig,
	linux-mm, linux-kernel

On Tue, Sep 13, 2011 at 02:02:46AM +0300, Kirill A. Shutemov wrote:
> On Mon, Sep 12, 2011 at 12:57:19PM +0200, Johannes Weiner wrote:
> > @@ -1508,6 +1524,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> >  	if (scanning_global_lru(sc)) {
> >  		nr_taken = isolate_pages_global(nr_to_scan, &page_list,
> >  			&nr_scanned, sc->order, reclaim_mode, zone, 0, file);
> > +	} else {
> > +		nr_taken = mem_cgroup_isolate_pages(nr_to_scan, &page_list,
> > +			&nr_scanned, sc->order, reclaim_mode, zone,
> > +			sc->mem_cgroup, 0, file);
> > +	}
> 
> Redundant braces.

I usually keep them for multiline branches, no matter how many
statements.

But this is temporary anyway; 10/11 gets rid of this branch, leaving
only

	nr_taken = isolate_pages(...)

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 01/11] mm: memcg: consolidate hierarchy iteration primitives
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-13 10:06     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 130+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-13 10:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

On Mon, 12 Sep 2011 12:57:18 +0200
Johannes Weiner <jweiner@redhat.com> wrote:

> Memory control groups are currently bolted onto the side of
> traditional memory management in places where better integration would
> be preferable.  To reclaim memory, for example, memory control groups
> maintain their own LRU list and reclaim strategy aside from the global
> per-zone LRU list reclaim.  But an extra list head for each existing
> page frame is expensive and maintaining it requires additional code.
> 
> This patchset disables the global per-zone LRU lists on memory cgroup
> configurations and converts all their users to operate on the per-memory
> cgroup lists instead.  As LRU pages are then exclusively on one list,
> this saves two list pointers for each page frame in the system:
> 
> page_cgroup array size with 4G physical memory
> 
>   vanilla: [    0.000000] allocated 31457280 bytes of page_cgroup
>   patched: [    0.000000] allocated 15728640 bytes of page_cgroup
> 
> At the same time, system performance for various workloads is
> unaffected:
> 
> 100G sparse file cat, 4G physical memory, 10 runs, to test for code
> bloat in the traditional LRU handling and kswapd & direct reclaim
> paths, without/with the memory controller configured in
> 
>   vanilla: 71.603(0.207) seconds
>   patched: 71.640(0.156) seconds
> 
>   vanilla: 79.558(0.288) seconds
>   patched: 77.233(0.147) seconds
> 
> 100G sparse file cat in 1G memory cgroup, 10 runs, to test for code
> bloat in the traditional memory cgroup LRU handling and reclaim path
> 
>   vanilla: 96.844(0.281) seconds
>   patched: 94.454(0.311) seconds
> 
> 4 unlimited memcgs running kbuild -j32 each, 4G physical memory, 500M
> swap on SSD, 10 runs, to test for regressions in kswapd & direct
> reclaim using per-memcg LRU lists with multiple memcgs and multiple
> allocators within each memcg
> 
>   vanilla: 717.722(1.440) seconds [ 69720.100(11600.835) majfaults ]
>   patched: 714.106(2.313) seconds [ 71109.300(14886.186) majfaults ]
> 
> 16 unlimited memcgs running kbuild, 1900M hierarchical limit, 500M
> swap on SSD, 10 runs, to test for regressions in hierarchical memcg
> setups
> 
>   vanilla: 2742.058(1.992) seconds [ 26479.600(1736.737) majfaults ]
>   patched: 2743.267(1.214) seconds [ 27240.700(1076.063) majfaults ]
> 
> This patch:
> 
> There are currently two different implementations of iterating over a
> memory cgroup hierarchy tree.
> 
> Consolidate them into one worker function and base the convenience
> looping-macros on top of it.
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>

Seems nice.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 02/11] mm: vmscan: distinguish global reclaim from global LRU scanning
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-13 10:07     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 130+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-13 10:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

On Mon, 12 Sep 2011 12:57:19 +0200
Johannes Weiner <jweiner@redhat.com> wrote:

> The traditional zone reclaim code is scanning the per-zone LRU lists
> during direct reclaim and kswapd, and the per-zone per-memory cgroup
> LRU lists when reclaiming on behalf of a memory cgroup limit.
> 
> Subsequent patches will convert the traditional reclaim code to
> reclaim exclusively from the per-memory cgroup LRU lists.  As a
> result, using the predicate for which LRU list is scanned will no
> longer be appropriate to tell global reclaim from limit reclaim.
> 
> This patch adds a global_reclaim() predicate to tell direct/kswapd
> reclaim from memory cgroup limit reclaim and substitutes it in all
> places where currently scanning_global_lru() is used for that.
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>
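
A minimal sketch of such a predicate, assuming the memcg that triggered
reclaim is carried in struct scan_control (the field and config names here are
illustrative, not quoted from the patch itself):

	static bool global_reclaim(struct scan_control *sc)
	{
	#ifdef CONFIG_CGROUP_MEM_RES_CTLR
		/* no triggering memcg means direct/kswapd reclaim of the machine */
		return !sc->mem_cgroup;
	#else
		return true;
	#endif
	}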

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 03/11] mm: vmscan: distinguish between memcg triggering reclaim and memcg being scanned
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-13 10:23     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 130+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-13 10:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

On Mon, 12 Sep 2011 12:57:20 +0200
Johannes Weiner <jweiner@redhat.com> wrote:

> Memory cgroup hierarchies are currently handled completely outside of
> the traditional reclaim code, which is invoked with a single memory
> cgroup as an argument for the whole call stack.
> 
> Subsequent patches will switch this code to do hierarchical reclaim,
> so there needs to be a distinction between a) the memory cgroup that
> is triggering reclaim due to hitting its limit and b) the memory
> cgroup that is being scanned as a child of a).
> 
> This patch introduces a struct mem_cgroup_zone that contains the
> combination of the memory cgroup and the zone being scanned, which is
> then passed down the stack instead of the zone argument.
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>
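
The structure itself is tiny; a sketch based on the description above (the
shrink-function prototype is illustrative only):

	struct mem_cgroup_zone {
		struct mem_cgroup *mem_cgroup;	/* memcg being scanned */
		struct zone *zone;		/* zone being scanned */
	};

	/* reclaim helpers then take the pair instead of a bare zone */
	static void shrink_mem_cgroup_zone(int priority, struct mem_cgroup_zone *mz,
					   struct scan_control *sc);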

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

> 


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 04/11] mm: memcg: per-priority per-zone hierarchy scan generations
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-13 10:27     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 130+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-13 10:27 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

On Mon, 12 Sep 2011 12:57:21 +0200
Johannes Weiner <jweiner@redhat.com> wrote:

> Memory cgroup limit reclaim currently picks one memory cgroup out of
> the target hierarchy, remembers it as the last scanned child, and
> reclaims all zones in it with decreasing priority levels.
> 
> The new hierarchy reclaim code will pick memory cgroups from the same
> hierarchy concurrently from different zones and priority levels, so it
> becomes necessary that hierarchy roots not only remember the last
> scanned child, but do so for each zone and priority level.
> 
> Furthermore, detecting full hierarchy round-trips reliably will become
> crucial, so instead of counting on one iterator site seeing a certain
> memory cgroup twice, use a generation counter that is increased every
> time the child with the highest ID has been visited.
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>

I cannot imagine how this works. Could you illustrate it with an easy example?

Thanks,
-Kame

> ---
>  mm/memcontrol.c |   60 +++++++++++++++++++++++++++++++++++++++---------------
>  1 files changed, 43 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 912c7c7..f4b404e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -121,6 +121,11 @@ struct mem_cgroup_stat_cpu {
>  	unsigned long targets[MEM_CGROUP_NTARGETS];
>  };
>  
> +struct mem_cgroup_iter_state {
> +	int position;
> +	unsigned int generation;
> +};
> +
>  /*
>   * per-zone information in memory controller.
>   */
> @@ -131,6 +136,8 @@ struct mem_cgroup_per_zone {
>  	struct list_head	lists[NR_LRU_LISTS];
>  	unsigned long		count[NR_LRU_LISTS];
>  
> +	struct mem_cgroup_iter_state iter_state[DEF_PRIORITY + 1];
> +
>  	struct zone_reclaim_stat reclaim_stat;
>  	struct rb_node		tree_node;	/* RB tree node */
>  	unsigned long long	usage_in_excess;/* Set to the value by which */
> @@ -231,11 +238,6 @@ struct mem_cgroup {
>  	 * per zone LRU lists.
>  	 */
>  	struct mem_cgroup_lru_info info;
> -	/*
> -	 * While reclaiming in a hierarchy, we cache the last child we
> -	 * reclaimed from.
> -	 */
> -	int last_scanned_child;
>  	int last_scanned_node;
>  #if MAX_NUMNODES > 1
>  	nodemask_t	scan_nodes;
> @@ -781,9 +783,15 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
>  	return memcg;
>  }
>  
> +struct mem_cgroup_iter {
> +	struct zone *zone;
> +	int priority;
> +	unsigned int generation;
> +};
> +
>  static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  					  struct mem_cgroup *prev,
> -					  bool remember)
> +					  struct mem_cgroup_iter *iter)
>  {
>  	struct mem_cgroup *mem = NULL;
>  	int id = 0;
> @@ -791,7 +799,7 @@ static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  	if (!root)
>  		root = root_mem_cgroup;
>  
> -	if (prev && !remember)
> +	if (prev && !iter)
>  		id = css_id(&prev->css);
>  
>  	if (prev && prev != root)
> @@ -804,10 +812,20 @@ static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  	}
>  
>  	while (!mem) {
> +		struct mem_cgroup_iter_state *uninitialized_var(is);
>  		struct cgroup_subsys_state *css;
>  
> -		if (remember)
> -			id = root->last_scanned_child;
> +		if (iter) {
> +			int nid = zone_to_nid(iter->zone);
> +			int zid = zone_idx(iter->zone);
> +			struct mem_cgroup_per_zone *mz;
> +
> +			mz = mem_cgroup_zoneinfo(root, nid, zid);
> +			is = &mz->iter_state[iter->priority];
> +			if (prev && iter->generation != is->generation)
> +				return NULL;
> +			id = is->position;
> +		}
>  
>  		rcu_read_lock();
>  		css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
> @@ -818,8 +836,13 @@ static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  			id = 0;
>  		rcu_read_unlock();
>  
> -		if (remember)
> -			root->last_scanned_child = id;
> +		if (iter) {
> +			is->position = id;
> +			if (!css)
> +				is->generation++;
> +			else if (!prev && mem)
> +				iter->generation = is->generation;
> +		}
>  
>  		if (prev && !css)
>  			return NULL;
> @@ -842,14 +865,14 @@ static void mem_cgroup_iter_break(struct mem_cgroup *root,
>   * be used for reference counting.
>   */
>  #define for_each_mem_cgroup_tree(iter, root)		\
> -	for (iter = mem_cgroup_iter(root, NULL, false);	\
> +	for (iter = mem_cgroup_iter(root, NULL, NULL);	\
>  	     iter != NULL;				\
> -	     iter = mem_cgroup_iter(root, iter, false))
> +	     iter = mem_cgroup_iter(root, iter, NULL))
>  
>  #define for_each_mem_cgroup(iter)			\
> -	for (iter = mem_cgroup_iter(NULL, NULL, false);	\
> +	for (iter = mem_cgroup_iter(NULL, NULL, NULL);	\
>  	     iter != NULL;				\
> -	     iter = mem_cgroup_iter(NULL, iter, false))
> +	     iter = mem_cgroup_iter(NULL, iter, NULL))
>  
>  static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
>  {
> @@ -1619,6 +1642,10 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
>  	bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
>  	unsigned long excess;
>  	unsigned long nr_scanned;
> +	struct mem_cgroup_iter iter = {
> +		.zone = zone,
> +		.priority = 0,
> +	};
>  
>  	excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
>  
> @@ -1627,7 +1654,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
>  		noswap = true;
>  
>  	while (1) {
> -		victim = mem_cgroup_iter(root_memcg, victim, true);
> +		victim = mem_cgroup_iter(root_memcg, victim, &iter);
>  		if (!victim) {
>  			loop++;
>  			/*
> @@ -4878,7 +4905,6 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  		res_counter_init(&memcg->res, NULL);
>  		res_counter_init(&memcg->memsw, NULL);
>  	}
> -	memcg->last_scanned_child = 0;
>  	memcg->last_scanned_node = MAX_NUMNODES;
>  	INIT_LIST_HEAD(&memcg->oom_notify);
>  
> -- 
> 1.7.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 05/11] mm: move memcg hierarchy reclaim to generic reclaim code
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-13 10:31     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 130+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-13 10:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

On Mon, 12 Sep 2011 12:57:22 +0200
Johannes Weiner <jweiner@redhat.com> wrote:

> Memory cgroup limit reclaim and traditional global pressure reclaim
> will soon share the same code to reclaim from a hierarchical tree of
> memory cgroups.
> 
> In preparation of this, move the two right next to each other in
> shrink_zone().
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>

OK, this will be a good direction.

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 06/11] mm: memcg: remove optimization of keeping the root_mem_cgroup LRU lists empty
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-13 10:34     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 130+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-13 10:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

On Mon, 12 Sep 2011 12:57:23 +0200
Johannes Weiner <jweiner@redhat.com> wrote:

> root_mem_cgroup, lacking a configurable limit, was never subject to
> limit reclaim, so the pages charged to it could be kept off its LRU
> lists.  They would be found on the global per-zone LRU lists upon
> physical memory pressure and it made sense to avoid uselessly linking
> them to both lists.
> 
> The global per-zone LRU lists are about to go away on memcg-enabled
> kernels, with all pages being exclusively linked to their respective
> per-memcg LRU lists.  As a result, pages of the root_mem_cgroup must
> also be linked to its LRU lists again.
> 
> The overhead is temporary until the double-LRU scheme is going away
> completely.
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 07/11] mm: vmscan: convert unevictable page rescue scanner to per-memcg LRU lists
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-13 10:37     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 130+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-13 10:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

On Mon, 12 Sep 2011 12:57:24 +0200
Johannes Weiner <jweiner@redhat.com> wrote:

> The global per-zone LRU lists are about to go away on memcg-enabled
> kernels, the unevictable page rescue scanner must be able to find its
> pages on the per-memcg LRU lists.
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Hmm, how about showing per-memcg unevictable pages in the memory.stat file?



^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 08/11] mm: vmscan: convert global reclaim to per-memcg LRU lists
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-13 10:41     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 130+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-13 10:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

On Mon, 12 Sep 2011 12:57:25 +0200
Johannes Weiner <jweiner@redhat.com> wrote:

> The global per-zone LRU lists are about to go away on memcg-enabled
> kernels, global reclaim must be able to find its pages on the
> per-memcg LRU lists.
> 
> Since the LRU pages of a zone are distributed over all existing memory
> cgroups, a scan target for a zone is complete when all memory cgroups
> are scanned for their proportional share of a zone's memory.
> 
> The forced scanning of small scan targets from kswapd is limited to
> zones marked unreclaimable, otherwise kswapd can quickly overreclaim
> by force-scanning the LRU lists of multiple memory cgroups.
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>
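
The kswapd part boils down to a small tweak of the force-scan decision in
get_scan_count(), roughly like this (a sketch, not the literal hunk):

	/* only force-scan tiny targets where overreclaim cannot happen */
	if (current_is_kswapd() && zone->all_unreclaimable)
		force_scan = true;
	if (!global_reclaim(sc))
		force_scan = true;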


Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 09/11] mm: collect LRU list heads into struct lruvec
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-13 10:43     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 130+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-13 10:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

On Mon, 12 Sep 2011 12:57:26 +0200
Johannes Weiner <jweiner@redhat.com> wrote:

> Having a unified structure with an LRU list set for both global zones
> and per-memcg zones allows the code that deals with LRU lists, and does
> not care about the container itself, to be kept simple.
> 
> Once the per-memcg LRU lists directly link struct pages, the isolation
> function and all other list manipulations are shared between the memcg
> case and the global LRU case.
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>
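
The lruvec itself is nothing more than the list heads, e.g.:

	struct lruvec {
		struct list_head lists[NR_LRU_LISTS];
	};

with both struct zone and the per-memcg per-zone structure embedding one, so
the list-manipulation code only ever has to see a struct lruvec.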

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 10/11] mm: make per-memcg LRU lists exclusive
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-13 10:47     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 130+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-13 10:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

On Mon, 12 Sep 2011 12:57:27 +0200
Johannes Weiner <jweiner@redhat.com> wrote:

> Now that all code that operated on global per-zone LRU lists is
> converted to operate on per-memory cgroup LRU lists instead, there is
> no reason to keep the double-LRU scheme around any longer.
> 
> The pc->lru member is removed and page->lru is linked directly to the
> per-memory cgroup LRU lists, which removes two pointers from a
> descriptor that exists for every page frame in the system.
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Ying Han <yinghan@google.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 11/11] mm: memcg: remove unused node/section info from pc->flags
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-13 10:50     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 130+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-13 10:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

On Mon, 12 Sep 2011 12:57:28 +0200
Johannes Weiner <jweiner@redhat.com> wrote:

> To find the page corresponding to a certain page_cgroup, the pc->flags
> encoded the node or section ID with the base array to compare the pc
> pointer to.
> 
> Now that the per-memory cgroup LRU lists link page descriptors
> directly, there is no longer any code that knows the page_cgroup but
> not the page.
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>

Ah, OK. Remove the init code and use zalloc().

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 04/11] mm: memcg: per-priority per-zone hierarchy scan generations
  2011-09-13 10:27     ` KAMEZAWA Hiroyuki
@ 2011-09-13 11:03       ` Johannes Weiner
  -1 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-13 11:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

On Tue, Sep 13, 2011 at 07:27:59PM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 12 Sep 2011 12:57:21 +0200
> Johannes Weiner <jweiner@redhat.com> wrote:
> 
> > Memory cgroup limit reclaim currently picks one memory cgroup out of
> > the target hierarchy, remembers it as the last scanned child, and
> > reclaims all zones in it with decreasing priority levels.
> > 
> > The new hierarchy reclaim code will pick memory cgroups from the same
> > hierarchy concurrently from different zones and priority levels, so it
> > becomes necessary that hierarchy roots not only remember the last
> > scanned child, but do so for each zone and priority level.
> > 
> > Furthermore, detecting full hierarchy round-trips reliably will become
> > crucial, so instead of counting on one iterator site seeing a certain
> > memory cgroup twice, use a generation counter that is increased every
> > time the child with the highest ID has been visited.
> > 
> > Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> 
> I cannot imagine how this works. Could you illustrate it with an easy example?

Previously, we did

	mem = mem_cgroup_iter(root)
	  for each priority level:
	    for each zone in zonelist:

and this would reclaim memcg-1-zone-1, memcg-1-zone-2, memcg-1-zone-3
etc.

The new code does

	for each priority level
	  for each zone in zonelist
            mem = mem_cgroup_iter(root)

but with a single last_scanned_child per memcg, this would scan
memcg-1-zone-1, memcg-2-zone-2, memcg-3-zone-3 etc, which does not
make much sense.

Now imagine two reclaimers.  With the old code, the first reclaimer
would pick memcg-1 and scan all its zones, the second reclaimer would
pick memcg-2 and reclaim all its zones.  Without this patch, the first
reclaimer would pick memcg-1 and scan zone-1, the second reclaimer
would pick memcg-2 and scan zone-1, then the first reclaimer would
pick memcg-3 and scan zone-2.  If the reclaimers are concurrently
scanning at different priority levels, things are even worse because
one reclaimer may put much more force on the memcgs it gets from
mem_cgroup_iter() than the other reclaimer.  They must not share the
same iterator.

The generations are needed because the old algorithm did not rely too
much on detecting full round-trips.  After every reclaim cycle, it
checked the limit and broke out of the loop if enough was reclaimed,
no matter how many children were reclaimed from.  The new algorithm is
used for global reclaim, where the only exit condition of the
hierarchy reclaim is the full roundtrip, because equal pressure needs
to be applied to all zones.
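
In other words, every reclaimer carries its own cookie and keeps walking until
the shared per-zone per-priority iterator state wraps around.  A global reclaim
call site could look roughly like this (simplified from the patch 04 diff
quoted above; shrink_one_memcg_zone() is just an illustrative name):

	struct mem_cgroup_iter iter = {
		.zone = zone,
		.priority = priority,
	};
	struct mem_cgroup *mem;

	/* first call records the generation this walk started in */
	mem = mem_cgroup_iter(root, NULL, &iter);
	do {
		shrink_one_memcg_zone(mem, zone, priority, sc);
		/* returns NULL once the generation advanced, i.e. a full round-trip */
		mem = mem_cgroup_iter(root, mem, &iter);
	} while (mem);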

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 0/11] mm: memcg naturalization -rc3
  2011-09-12 10:57 ` Johannes Weiner
@ 2011-09-13 20:35   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 130+ messages in thread
From: Kirill A. Shutemov @ 2011-09-13 20:35 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Michal Hocko, Greg Thelen,
	Michel Lespinasse, Rik van Riel, Minchan Kim, Christoph Hellwig,
	linux-mm, linux-kernel

On Mon, Sep 12, 2011 at 12:57:17PM +0200, Johannes Weiner wrote:
> Hi everyone,
> 
> this is the third revision of the memcg naturalization patch set.  Due
> to controversy, I dropped the reclaim statistics and the soft limit
> reclaim rewrite.  What's left is mostly making the per-memcg LRU lists
> exclusive.
> 
> Christoph suggested making struct mem_cgroup part of the core and have
> reclaim always operate on at least a skeleton root_mem_cgroup with
> basic LRU info even on !CONFIG_MEMCG kernels.  I agree that we should
> go there, but in its current form this would drag a lot of ugly memcg
> internals out into the public and I'd prefer another struct mem_cgroup
> shakedown and the soft limit stuff to be done before this step.  But
> we are getting there.
> 
> Changelog since -rc2
> - consolidated all memcg hierarchy iteration constructs
> - pass struct mem_cgroup_zone down the reclaim stack
> - fix concurrent full hierarchy round-trip detection
> - split out moving memcg reclaim from hierarchical global reclaim
> - drop reclaim statistics
> - rename do_shrink_zone to shrink_mem_cgroup_zone
> - fix anon pre-aging to operate on per-memcg lrus
> - revert to traditional limit reclaim hierarchy iteration
> - split out lruvec introduction
> - kill __add_page_to_lru_list
> - fix LRU-accounting during swapcache/pagecache charging
> - fix LRU-accounting of uncharged swapcache
> - split out removing array id from pc->flags
> - drop soft limit rework
> 
> More introduction and test results are included in the changelog of
> the first patch.
> 
>  include/linux/memcontrol.h  |   74 +++--
>  include/linux/mm_inline.h   |   21 +-
>  include/linux/mmzone.h      |   10 +-
>  include/linux/page_cgroup.h |   34 ---
>  mm/memcontrol.c             |  688 ++++++++++++++++++++-----------------------
>  mm/page_alloc.c             |    2 +-
>  mm/page_cgroup.c            |   59 +----
>  mm/swap.c                   |   24 +-
>  mm/vmscan.c                 |  447 +++++++++++++++++-----------
>  9 files changed, 674 insertions(+), 685 deletions(-)

Nice patchset. Thank you.

Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 04/11] mm: memcg: per-priority per-zone hierarchy scan generations
  2011-09-13 11:03       ` Johannes Weiner
@ 2011-09-14  0:55         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 130+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-14  0:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

On Tue, 13 Sep 2011 13:03:01 +0200
Johannes Weiner <jweiner@redhat.com> wrote:

> On Tue, Sep 13, 2011 at 07:27:59PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Mon, 12 Sep 2011 12:57:21 +0200
> > Johannes Weiner <jweiner@redhat.com> wrote:
> > 
> > > Memory cgroup limit reclaim currently picks one memory cgroup out of
> > > the target hierarchy, remembers it as the last scanned child, and
> > > reclaims all zones in it with decreasing priority levels.
> > > 
> > > The new hierarchy reclaim code will pick memory cgroups from the same
> > > hierarchy concurrently from different zones and priority levels, so it
> > > becomes necessary that hierarchy roots not only remember the last
> > > scanned child, but do so for each zone and priority level.
> > > 
> > > Furthermore, detecting full hierarchy round-trips reliably will become
> > > crucial, so instead of counting on one iterator site seeing a certain
> > > memory cgroup twice, use a generation counter that is increased every
> > > time the child with the highest ID has been visited.
> > > 
> > > Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> > 
> > I cannot imagine how this works. Could you illustrate it with an easy example?
> 
> Previously, we did
> 
> 	mem = mem_cgroup_iter(root)
> 	  for each priority level:
> 	    for each zone in zonelist:
> 
> and this would reclaim memcg-1-zone-1, memcg-1-zone-2, memcg-1-zone-3
> etc.
> 
yes.

> The new code does
> 
> 	for each priority level
> 	  for each zone in zonelist
>             mem = mem_cgroup_iter(root)
> 
> but with a single last_scanned_child per memcg, this would scan
> memcg-1-zone-1, memcg-2-zone-2, memcg-3-zone-3 etc, which does not
> make much sense.
> 
> Now imagine two reclaimers.  With the old code, the first reclaimer
> would pick memcg-1 and scan all its zones, the second reclaimer would
> pick memcg-2 and reclaim all its zones.  Without this patch, the first
> reclaimer would pick memcg-1 and scan zone-1, the second reclaimer
> would pick memcg-2 and scan zone-1, then the first reclaimer would
> pick memcg-3 and scan zone-2.  If the reclaimers are concurrently
> scanning at different priority levels, things are even worse because
> one reclaimer may put much more force on the memcgs it gets from
> mem_cgroup_iter() than the other reclaimer.  They must not share the
> same iterator.
> 
> The generations are needed because the old algorithm did not rely too
> much on detecting full round-trips.  After every reclaim cycle, it
> checked the limit and broke out of the loop if enough was reclaimed,
> no matter how many children were reclaimed from.  The new algorithm is
> used for global reclaim, where the only exit condition of the
> hierarchy reclaim is the full roundtrip, because equal pressure needs
> to be applied to all zones.
> 
Hm, OK, maybe this is good for global reclaim.
Is this used for both reclaim-by-limit and global reclaim?
If so, I need to abandon the node-selection logic for reclaim-by-limit
and nodemask-for-memcg, which has shown me very good results.
I'll be sad ;)

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 04/11] mm: memcg: per-priority per-zone hierarchy scan generations
  2011-09-14  0:55         ` KAMEZAWA Hiroyuki
@ 2011-09-14  5:56           ` Johannes Weiner
  -1 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-14  5:56 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

On Wed, Sep 14, 2011 at 09:55:04AM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 13 Sep 2011 13:03:01 +0200
> Johannes Weiner <jweiner@redhat.com> wrote:
> 
> > On Tue, Sep 13, 2011 at 07:27:59PM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Mon, 12 Sep 2011 12:57:21 +0200
> > > Johannes Weiner <jweiner@redhat.com> wrote:
> > > 
> > > > Memory cgroup limit reclaim currently picks one memory cgroup out of
> > > > the target hierarchy, remembers it as the last scanned child, and
> > > > reclaims all zones in it with decreasing priority levels.
> > > > 
> > > > The new hierarchy reclaim code will pick memory cgroups from the same
> > > > hierarchy concurrently from different zones and priority levels, it
> > > > becomes necessary that hierarchy roots not only remember the last
> > > > scanned child, but do so for each zone and priority level.
> > > > 
> > > > Furthermore, detecting full hierarchy round-trips reliably will become
> > > > crucial, so instead of counting on one iterator site seeing a certain
> > > > memory cgroup twice, use a generation counter that is increased every
> > > > time the child with the highest ID has been visited.
> > > > 
> > > > Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> > > 
> > > I cannot imagine how this works. Could you illustrate more with an easy example?
> > 
> > Previously, we did
> > 
> > 	mem = mem_cgroup_iter(root)
> > 	  for each priority level:
> > 	    for each zone in zonelist:
> > 
> > and this would reclaim memcg-1-zone-1, memcg-1-zone-2, memcg-1-zone-3
> > etc.
> > 
> yes.
> 
> > The new code does
> > 
> > 	for each priority level
> > 	  for each zone in zonelist
> >             mem = mem_cgroup_iter(root)
> > 
> > but with a single last_scanned_child per memcg, this would scan
> > memcg-1-zone-1, memcg-2-zone-2, memcg-3-zone-3 etc, which does not
> > make much sense.
> > 
> > Now imagine two reclaimers.  With the old code, the first reclaimer
> > would pick memcg-1 and scan all its zones, the second reclaimer would
> > pick memcg-2 and reclaim all its zones.  Without this patch, the first
> > reclaimer would pick memcg-1 and scan zone-1, the second reclaimer
> > would pick memcg-2 and scan zone-1, then the first reclaimer would
> > pick memcg-3 and scan zone-2.  If the reclaimers are concurrently
> > scanning at different priority levels, things are even worse because
> > one reclaimer may put much more force on the memcgs it gets from
> > mem_cgroup_iter() than the other reclaimer.  They must not share the
> > same iterator.
> > 
> > The generations are needed because the old algorithm did not rely too
> > much on detecting full round-trips.  After every reclaim cycle, it
> > checked the limit and broke out of the loop if enough was reclaimed,
> > no matter how many children were reclaimed from.  The new algorithm is
> > used for global reclaim, where the only exit condition of the
> > hierarchy reclaim is the full roundtrip, because equal pressure needs
> > to be applied to all zones.
> > 
> Hm, ok, maybe good for global reclaim.
> Is this used for both reclaim-by-limit and global reclaim?

No, the hierarchy iteration in shrink_zone() is done after a single
memcg, which is equivalent to the old code: scan all zones at all
priority levels from a memcg, then move on to the next memcg.  This
also works because of the per-zone per-priority last_scanned_child:

	for each priority
	  for each zone
	    mem = mem_cgroup_iter(root)
	    scan(mem)

priority-12 + zone-1 will yield memcg-1.  priority-12 + zone-2 starts
at its own last_scanned_child, so yields memcg-1 as well, etc.  A
second reclaimer that comes in with priority-12 + zone-1 will receive
memcg-2 for scanning.  So there is no change in behaviour for limit
reclaim.
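
For reference, the iterator state this relies on boils down to
something like the following (a simplified sketch; the exact names may
differ slightly from the patch):

	struct mem_cgroup_reclaim_iter {
		int position;			/* last scanned child */
		unsigned int generation;	/* bumped on full round-trips */
	};

	struct mem_cgroup_per_zone {
		/*
		 * One iterator per priority level, so reclaimers working
		 * on different zones or priority levels never share
		 * iteration state.
		 */
		struct mem_cgroup_reclaim_iter iter[DEF_PRIORITY + 1];
		/* ... per-zone LRU state etc. ... */
	};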

> If so, I need to abandon the node-selection logic for reclaim-by-limit
> and nodemask-for-memcg, which has shown very good results.
> I'll be sad ;)

With my clarification, do you still think so?

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 04/11] mm: memcg: per-priority per-zone hierarchy scan generations
  2011-09-14  5:56           ` Johannes Weiner
@ 2011-09-14  7:40             ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 130+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-14  7:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Ying Han,
	Michal Hocko, Greg Thelen, Michel Lespinasse, Rik van Riel,
	Minchan Kim, Christoph Hellwig, linux-mm, linux-kernel

On Wed, 14 Sep 2011 07:56:34 +0200
Johannes Weiner <jweiner@redhat.com> wrote:

> On Wed, Sep 14, 2011 at 09:55:04AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Tue, 13 Sep 2011 13:03:01 +0200
> > Johannes Weiner <jweiner@redhat.com> wrote:
> No, the hierarchy iteration in shrink_zone() is done after a single
> memcg, which is equivalent to the old code: scan all zones at all
> priority levels from a memcg, then move on to the next memcg.  This
> also works because of the per-zone per-priority last_scanned_child:
> 
> 	for each priority
> 	  for each zone
> 	    mem = mem_cgroup_iter(root)
> 	    scan(mem)
> 
> priority-12 + zone-1 will yield memcg-1.  priority-12 + zone-2 starts
> at its own last_scanned_child, so yields memcg-1 as well, etc.  A
> second reclaimer that comes in with priority-12 + zone-1 will receive
> memcg-2 for scanning.  So there is no change in behaviour for limit
> reclaim.
> 
ok, thanks.

> > If so, I need to abandon the node-selection logic for reclaim-by-limit
> > and nodemask-for-memcg, which has shown very good results.
> > I'll be sad ;)
> 
> With my clarification, do you still think so?
> 

No. Thank you. 

Regards,
-Kame


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 01/11] mm: memcg: consolidate hierarchy iteration primitives
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-19 12:53     ` Michal Hocko
  -1 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2011-09-19 12:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

Hi,

On Mon 12-09-11 12:57:18, Johannes Weiner wrote:
> Memory control groups are currently bolted onto the side of
> traditional memory management in places where better integration would
> be preferrable.  To reclaim memory, for example, memory control groups
> maintain their own LRU list and reclaim strategy aside from the global
> per-zone LRU list reclaim.  But an extra list head for each existing
> page frame is expensive and maintaining it requires additional code.
> 
> This patchset disables the global per-zone LRU lists on memory cgroup
> configurations and converts all its users to operate on the per-memory
> cgroup lists instead.  As LRU pages are then exclusively on one list,
> this saves two list pointers for each page frame in the system:
> 
> page_cgroup array size with 4G physical memory
> 
>   vanilla: [    0.000000] allocated 31457280 bytes of page_cgroup
>   patched: [    0.000000] allocated 15728640 bytes of page_cgroup
> 
> At the same time, system performance for various workloads is
> unaffected:
> 
> 100G sparse file cat, 4G physical memory, 10 runs, to test for code
> bloat in the traditional LRU handling and kswapd & direct reclaim
> paths, without/with the memory controller configured in
> 
>   vanilla: 71.603(0.207) seconds
>   patched: 71.640(0.156) seconds
> 
>   vanilla: 79.558(0.288) seconds
>   patched: 77.233(0.147) seconds
> 
> 100G sparse file cat in 1G memory cgroup, 10 runs, to test for code
> bloat in the traditional memory cgroup LRU handling and reclaim path
> 
>   vanilla: 96.844(0.281) seconds
>   patched: 94.454(0.311) seconds
> 
> 4 unlimited memcgs running kbuild -j32 each, 4G physical memory, 500M
> swap on SSD, 10 runs, to test for regressions in kswapd & direct
> reclaim using per-memcg LRU lists with multiple memcgs and multiple
> allocators within each memcg
> 
>   vanilla: 717.722(1.440) seconds [ 69720.100(11600.835) majfaults ]
>   patched: 714.106(2.313) seconds [ 71109.300(14886.186) majfaults ]
> 
> 16 unlimited memcgs running kbuild, 1900M hierarchical limit, 500M
> swap on SSD, 10 runs, to test for regressions in hierarchical memcg
> setups
> 
>   vanilla: 2742.058(1.992) seconds [ 26479.600(1736.737) majfaults ]
>   patched: 2743.267(1.214) seconds [ 27240.700(1076.063) majfaults ]

I guess you want this in the first patch so that it is there for
reference once it gets to the tree, right? I have no objections, but it
seems unrelated to the patch and so it might be a bit confusing. I
haven't seen the other patches in the series, so there is probably no
better place to put this.

> 
> This patch:
> 
> There are currently two different implementations of iterating over a
> memory cgroup hierarchy tree.
> 
> Consolidate them into one worker function and base the convenience
> looping-macros on top of it.
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>

Looks mostly good. There is just one issue I spotted and I guess we
want some comments. After the issue is fixed:
Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/memcontrol.c |  196 ++++++++++++++++++++----------------------------------
>  1 files changed, 73 insertions(+), 123 deletions(-)

Nice diet.

> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b76011a..912c7c7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -781,83 +781,75 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
>  	return memcg;
>  }
>  
> -/* The caller has to guarantee "mem" exists before calling this */

Shouldn't we have a similar comment saying that we have to keep a
reference to root if it is non-NULL? A mention of the remember parameter
and what it is used for (hierarchical reclaim) would be helpful as well.

/*
 * Find the next cgroup in the hierarchy tree with the given root (or
 * root_mem_cgroup if NULL), starting from the given prev (iterator)
 * position and releasing the reference to it.  Start from the root if
 * prev is NULL.
 * If remember is true, ignore the iterator position, follow
 * last_scanned_child instead and remember the new value (used during
 * hierarchical reclaim).
 * The caller is supposed to grab a reference to the root (if non-NULL)
 * before calling this for the first time.
 *
 * Returns a cgroup with an increased reference count (except for the
 * root), or NULL if there are no more groups to visit.
 *
 * Use for_each_mem_cgroup_tree and for_each_mem_cgroup instead, and
 * mem_cgroup_iter_break for the final cleanup if you are using this
 * function directly.
 */
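
To make the intended usage concrete, a caller that uses the function
directly and bails out early would do roughly the following (do_work()
is just a made-up placeholder):

	struct mem_cgroup *iter = NULL;

	while ((iter = mem_cgroup_iter(root, iter, false))) {
		if (do_work(iter) < 0) {
			/* drop the reference held on iter before leaving */
			mem_cgroup_iter_break(root, iter);
			break;
		}
	}
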
> -static struct mem_cgroup *mem_cgroup_start_loop(struct mem_cgroup *memcg)
> +static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> +					  struct mem_cgroup *prev,
> +					  bool remember)
[...]
> @@ -1656,7 +1611,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
>  						unsigned long reclaim_options,
>  						unsigned long *total_scanned)
>  {
> -	struct mem_cgroup *victim;
> +	struct mem_cgroup *victim = NULL;
>  	int ret, total = 0;
>  	int loop = 0;
>  	bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
> @@ -1672,8 +1627,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
>  		noswap = true;
>  
>  	while (1) {
> -		victim = mem_cgroup_select_victim(root_memcg);
> -		if (victim == root_memcg) {
> +		victim = mem_cgroup_iter(root_memcg, victim, true);
> +		if (!victim) {
>  			loop++;
>  			/*
>  			 * We are not draining per cpu cached charges during
> @@ -1689,10 +1644,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
>  				 * anything, it might because there are
>  				 * no reclaimable pages under this hierarchy
>  				 */
> -				if (!check_soft || !total) {
> -					css_put(&victim->css);
> +				if (!check_soft || !total)
>  					break;
> -				}
>  				/*
>  				 * We want to do more targeted reclaim.
>  				 * excess >> 2 is not to excessive so as to
> @@ -1700,15 +1653,13 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
>  				 * coming back to reclaim from this cgroup
>  				 */
>  				if (total >= (excess >> 2) ||
> -					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) {
> -					css_put(&victim->css);
> +					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
>  					break;
> -				}
>  			}
> +			continue;
>  		}
>  		if (!mem_cgroup_reclaimable(victim, noswap)) {
>  			/* this cgroup's local usage == 0 */
> -			css_put(&victim->css);
>  			continue;
>  		}
>  		/* we use swappiness of local cgroup */
> @@ -1719,21 +1670,21 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
>  		} else
>  			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
>  						noswap);
> -		css_put(&victim->css);
>  		/*
>  		 * At shrinking usage, we can't check we should stop here or
>  		 * reclaim more. It's depends on callers. last_scanned_child
>  		 * will work enough for keeping fairness under tree.
>  		 */
>  		if (shrink)
> -			return ret;
> +			break;

Hmm, we are returning total, but ret never gets added to it in the
shrinking case, so we always return 0 there. You want to move the line
below up.
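
I.e. roughly this ordering (sketch only, untested):

		ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap);
		total += ret;	/* account ret before the shrink check */
		/*
		 * At shrinking usage, we can't check whether we should
		 * stop here or reclaim more; that depends on the caller.
		 */
		if (shrink)
			break;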

>  		total += ret;
>  		if (check_soft) {
>  			if (!res_counter_soft_limit_excess(&root_memcg->res))
> -				return total;
> +				break;
>  		} else if (mem_cgroup_margin(root_memcg))
> -			return total;
> +			break;
>  	}
> +	mem_cgroup_iter_break(root_memcg, victim);
>  	return total;
>  }

[...]

Thanks
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 01/11] mm: memcg: consolidate hierarchy iteration primitives
  2011-09-12 22:37     ` Kirill A. Shutemov
@ 2011-09-19 13:06       ` Michal Hocko
  -1 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2011-09-19 13:06 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Johannes Weiner, Andrew Morton, KAMEZAWA Hiroyuki,
	Daisuke Nishimura, Balbir Singh, Ying Han, Greg Thelen,
	Michel Lespinasse, Rik van Riel, Minchan Kim, Christoph Hellwig,
	linux-mm, linux-kernel

On Tue 13-09-11 01:37:46, Kirill A. Shutemov wrote:
> On Mon, Sep 12, 2011 at 12:57:18PM +0200, Johannes Weiner wrote:
[...]
> >  	while (1) {
> > -		victim = mem_cgroup_select_victim(root_memcg);
> > -		if (victim == root_memcg) {
> > +		victim = mem_cgroup_iter(root_memcg, victim, true);
> > +		if (!victim) {
> >  			loop++;
> >  			/*
> >  			 * We are not draining per cpu cached charges during
> > @@ -1689,10 +1644,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
> >  				 * anything, it might because there are
> >  				 * no reclaimable pages under this hierarchy
> >  				 */
> > -				if (!check_soft || !total) {
> > -					css_put(&victim->css);
> > +				if (!check_soft || !total)
> >  					break;
> > -				}
> >  				/*
> >  				 * We want to do more targeted reclaim.
> >  				 * excess >> 2 is not to excessive so as to
> > @@ -1700,15 +1653,13 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
> >  				 * coming back to reclaim from this cgroup
> >  				 */
> >  				if (total >= (excess >> 2) ||
> > -					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) {
> > -					css_put(&victim->css);
> > +					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
> >  					break;
> > -				}
> >  			}
> > +			continue;
> 
> Shouldn't we do
> 
> victim = root_memcg;
> 
> instead?

You want to save a mem_cgroup_iter call?
Yes, it will work... I am not sure it is really an improvement. If we
just continue, we can rely on mem_cgroup_iter doing the right thing.
The assignment might not be that obvious. But I dunno.
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 02/11] mm: vmscan: distinguish global reclaim from global LRU scanning
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-19 13:23     ` Michal Hocko
  -1 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2011-09-19 13:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Mon 12-09-11 12:57:19, Johannes Weiner wrote:
> The traditional zone reclaim code is scanning the per-zone LRU lists
> during direct reclaim and kswapd, and the per-zone per-memory cgroup
> LRU lists when reclaiming on behalf of a memory cgroup limit.
> 
> Subsequent patches will convert the traditional reclaim code to
> reclaim exclusively from the per-memory cgroup LRU lists.  As a
> result, using the predicate for which LRU list is scanned will no
> longer be appropriate to tell global reclaim from limit reclaim.
> 
> This patch adds a global_reclaim() predicate to tell direct/kswapd
> reclaim from memory cgroup limit reclaim and substitutes it in all
> places where currently scanning_global_lru() is used for that.

I am wondering about vmscan_swappiness. Shouldn't it use global_reclaim
instead?
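
That is, something along these lines (sketch; assuming the existing
memcg swappiness helper keeps its current name):

	static int vmscan_swappiness(struct scan_control *sc)
	{
		if (global_reclaim(sc))
			return vm_swappiness;
		return mem_cgroup_swappiness(sc->mem_cgroup);
	}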

Other than that it looks good to me.

Reviewed-by: Michal Hocko <mhocko@suse.cz>

> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> ---
>  mm/vmscan.c |   60 +++++++++++++++++++++++++++++++++++-----------------------
>  1 files changed, 36 insertions(+), 24 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7502726..354f125 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -153,9 +153,25 @@ static LIST_HEAD(shrinker_list);
>  static DECLARE_RWSEM(shrinker_rwsem);
>  
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> -#define scanning_global_lru(sc)	(!(sc)->mem_cgroup)
> +static bool global_reclaim(struct scan_control *sc)
> +{
> +	return !sc->mem_cgroup;
> +}
> +
> +static bool scanning_global_lru(struct scan_control *sc)
> +{
> +	return !sc->mem_cgroup;
> +}
>  #else
> -#define scanning_global_lru(sc)	(1)
> +static bool global_reclaim(struct scan_control *sc)
> +{
> +	return true;
> +}
> +
> +static bool scanning_global_lru(struct scan_control *sc)
> +{
> +	return true;
> +}
>  #endif
>  
>  static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> @@ -1011,7 +1027,7 @@ keep_lumpy:
>  	 * back off and wait for congestion to clear because further reclaim
>  	 * will encounter the same problem
>  	 */
> -	if (nr_dirty && nr_dirty == nr_congested && scanning_global_lru(sc))
> +	if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc))
>  		zone_set_flag(zone, ZONE_CONGESTED);
>  
>  	free_page_list(&free_pages);
> @@ -1330,7 +1346,7 @@ static int too_many_isolated(struct zone *zone, int file,
>  	if (current_is_kswapd())
>  		return 0;
>  
> -	if (!scanning_global_lru(sc))
> +	if (!global_reclaim(sc))
>  		return 0;
>  
>  	if (file) {
> @@ -1508,6 +1524,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  	if (scanning_global_lru(sc)) {
>  		nr_taken = isolate_pages_global(nr_to_scan, &page_list,
>  			&nr_scanned, sc->order, reclaim_mode, zone, 0, file);
> +	} else {
> +		nr_taken = mem_cgroup_isolate_pages(nr_to_scan, &page_list,
> +			&nr_scanned, sc->order, reclaim_mode, zone,
> +			sc->mem_cgroup, 0, file);
> +	}
> +	if (global_reclaim(sc)) {
>  		zone->pages_scanned += nr_scanned;
>  		if (current_is_kswapd())
>  			__count_zone_vm_events(PGSCAN_KSWAPD, zone,
> @@ -1515,14 +1537,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  		else
>  			__count_zone_vm_events(PGSCAN_DIRECT, zone,
>  					       nr_scanned);
> -	} else {
> -		nr_taken = mem_cgroup_isolate_pages(nr_to_scan, &page_list,
> -			&nr_scanned, sc->order, reclaim_mode, zone,
> -			sc->mem_cgroup, 0, file);
> -		/*
> -		 * mem_cgroup_isolate_pages() keeps track of
> -		 * scanned pages on its own.
> -		 */
>  	}
>  
>  	if (nr_taken == 0) {
> @@ -1647,18 +1661,16 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
>  						&pgscanned, sc->order,
>  						reclaim_mode, zone,
>  						1, file);
> -		zone->pages_scanned += pgscanned;
>  	} else {
>  		nr_taken = mem_cgroup_isolate_pages(nr_pages, &l_hold,
>  						&pgscanned, sc->order,
>  						reclaim_mode, zone,
>  						sc->mem_cgroup, 1, file);
> -		/*
> -		 * mem_cgroup_isolate_pages() keeps track of
> -		 * scanned pages on its own.
> -		 */
>  	}
>  
> +	if (global_reclaim(sc))
> +		zone->pages_scanned += pgscanned;
> +
>  	reclaim_stat->recent_scanned[file] += nr_taken;
>  
>  	__count_zone_vm_events(PGREFILL, zone, pgscanned);
> @@ -1863,9 +1875,9 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
>  	 * latencies, so it's better to scan a minimum amount there as
>  	 * well.
>  	 */
> -	if (scanning_global_lru(sc) && current_is_kswapd())
> +	if (current_is_kswapd())
>  		force_scan = true;
> -	if (!scanning_global_lru(sc))
> +	if (!global_reclaim(sc))
>  		force_scan = true;
>  
>  	/* If we have no swap space, do not bother scanning anon pages. */
> @@ -1882,7 +1894,7 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
>  	file  = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
>  		zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
>  
> -	if (scanning_global_lru(sc)) {
> +	if (global_reclaim(sc)) {
>  		free  = zone_page_state(zone, NR_FREE_PAGES);
>  		/* If we have very few page cache pages,
>  		   force-scan anon pages. */
> @@ -2109,7 +2121,7 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
>  		 * Take care memory controller reclaiming has small influence
>  		 * to global LRU.
>  		 */
> -		if (scanning_global_lru(sc)) {
> +		if (global_reclaim(sc)) {
>  			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>  				continue;
>  			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
> @@ -2188,7 +2200,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  	get_mems_allowed();
>  	delayacct_freepages_start();
>  
> -	if (scanning_global_lru(sc))
> +	if (global_reclaim(sc))
>  		count_vm_event(ALLOCSTALL);
>  
>  	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> @@ -2200,7 +2212,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  		 * Don't shrink slabs when reclaiming memory from
>  		 * over limit cgroups
>  		 */
> -		if (scanning_global_lru(sc)) {
> +		if (global_reclaim(sc)) {
>  			unsigned long lru_pages = 0;
>  			for_each_zone_zonelist(zone, z, zonelist,
>  					gfp_zone(sc->gfp_mask)) {
> @@ -2261,7 +2273,7 @@ out:
>  		return 0;
>  
>  	/* top priority shrink_zones still had more to do? don't OOM, then */
> -	if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
> +	if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc))
>  		return 1;
>  
>  	return 0;
> -- 
> 1.7.6
> 

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 02/11] mm: vmscan: distinguish global reclaim from global LRU scanning
  2011-09-19 13:23     ` Michal Hocko
@ 2011-09-19 13:46       ` Michal Hocko
  -1 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2011-09-19 13:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Mon 19-09-11 15:23:44, Michal Hocko wrote:
> On Mon 12-09-11 12:57:19, Johannes Weiner wrote:
> > The traditional zone reclaim code is scanning the per-zone LRU lists
> > during direct reclaim and kswapd, and the per-zone per-memory cgroup
> > LRU lists when reclaiming on behalf of a memory cgroup limit.
> > 
> > Subsequent patches will convert the traditional reclaim code to
> > reclaim exclusively from the per-memory cgroup LRU lists.  As a
> > result, using the predicate for which LRU list is scanned will no
> > longer be appropriate to tell global reclaim from limit reclaim.
> > 
> > This patch adds a global_reclaim() predicate to tell direct/kswapd
> > reclaim from memory cgroup limit reclaim and substitutes it in all
> > places where currently scanning_global_lru() is used for that.
> 
> I am wondering about vmscan_swappiness. Shouldn't it use global_reclaim
> instead?

Ahh, it looks like the next patch does that. Wouldn't it make more sense
to have that change here? I see that this makes the patch smaller but...
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 03/11] mm: vmscan: distinguish between memcg triggering reclaim and memcg being scanned
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-19 14:29     ` Michal Hocko
  -1 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2011-09-19 14:29 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Mon 12-09-11 12:57:20, Johannes Weiner wrote:
> Memory cgroup hierarchies are currently handled completely outside of
> the traditional reclaim code, which is invoked with a single memory
> cgroup as an argument for the whole call stack.
> 
> Subsequent patches will switch this code to do hierarchical reclaim,
> so there needs to be a distinction between a) the memory cgroup that
> is triggering reclaim due to hitting its limit and b) the memory
> cgroup that is being scanned as a child of a).
> 
> This patch introduces a struct mem_cgroup_zone that contains the
> combination of the memory cgroup and the zone being scanned, which is
> then passed down the stack instead of the zone argument.
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>

Looks good to me. Some minor comments below.
Anyway:
Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/vmscan.c |  251 +++++++++++++++++++++++++++++++++--------------------------
>  1 files changed, 142 insertions(+), 109 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 354f125..92f4e22 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
[...]
> @@ -1853,13 +1865,13 @@ static int vmscan_swappiness(struct scan_control *sc)
>   *
>   * nr[0] = anon pages to scan; nr[1] = file pages to scan
>   */
> -static void get_scan_count(struct zone *zone, struct scan_control *sc,
> -					unsigned long *nr, int priority)
> +static void get_scan_count(struct mem_cgroup_zone *mz, struct scan_control *sc,
> +			   unsigned long *nr, int priority)
>  {
>  	unsigned long anon, file, free;
>  	unsigned long anon_prio, file_prio;
>  	unsigned long ap, fp;
> -	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> +	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(mz);
>  	u64 fraction[2], denominator;
>  	enum lru_list l;
>  	int noswap = 0;

You can save some patch lines by doing:
	struct zone *zone = mz->zone;
and then avoiding the zone => mz->zone changes that follow.

> @@ -1889,16 +1901,16 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
>  		goto out;
>  	}
>  
> -	anon  = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_ANON) +
> -		zone_nr_lru_pages(zone, sc, LRU_INACTIVE_ANON);
> -	file  = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
> -		zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
> +	anon  = zone_nr_lru_pages(mz, LRU_ACTIVE_ANON) +
> +		zone_nr_lru_pages(mz, LRU_INACTIVE_ANON);
> +	file  = zone_nr_lru_pages(mz, LRU_ACTIVE_FILE) +
> +		zone_nr_lru_pages(mz, LRU_INACTIVE_FILE);
>  
>  	if (global_reclaim(sc)) {
> -		free  = zone_page_state(zone, NR_FREE_PAGES);
> +		free  = zone_page_state(mz->zone, NR_FREE_PAGES);
>  		/* If we have very few page cache pages,
>  		   force-scan anon pages. */
> -		if (unlikely(file + free <= high_wmark_pages(zone))) {
> +		if (unlikely(file + free <= high_wmark_pages(mz->zone))) {
>  			fraction[0] = 1;
>  			fraction[1] = 0;
>  			denominator = 1;
[...]
> @@ -2390,6 +2413,18 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>  }
>  #endif
>  
> +static void age_active_anon(struct zone *zone, struct scan_control *sc,
> +			    int priority)
> +{
> +	struct mem_cgroup_zone mz = {
> +		.mem_cgroup = NULL,
> +		.zone = zone,
> +	};
> +
> +	if (inactive_anon_is_low(&mz))
> +		shrink_active_list(SWAP_CLUSTER_MAX, &mz, sc, priority, 0);
> +}
> +

I do not like this very much because we are using a similar construct in
shrink_mem_cgroup_zone, so we are duplicating that code.
What about adding an age_mem_cgroup_active_anon helper (something like
shrink_zone)?
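
E.g. a shared helper along these lines (sketch only, the signature is
a guess):

	static void age_mem_cgroup_active_anon(struct mem_cgroup_zone *mz,
					       struct scan_control *sc,
					       int priority)
	{
		if (inactive_anon_is_low(mz))
			shrink_active_list(SWAP_CLUSTER_MAX, mz, sc,
					   priority, 0);
	}

Both age_active_anon() and shrink_mem_cgroup_zone() could then call it
instead of open-coding the aging check.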

>  /*
>   * pgdat_balanced is used when checking if a node is balanced for high-order
>   * allocations. Only zones that meet watermarks and are in a zone allowed
> @@ -2510,7 +2545,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>  		 */
>  		.nr_to_reclaim = ULONG_MAX,
>  		.order = order,
> -		.mem_cgroup = NULL,
> +		.target_mem_cgroup = NULL,
>  	};
>  	struct shrink_control shrink = {
>  		.gfp_mask = sc.gfp_mask,
> @@ -2549,9 +2584,7 @@ loop_again:
>  			 * Do some background aging of the anon list, to give
>  			 * pages a chance to be referenced before reclaiming.
>  			 */
> -			if (inactive_anon_is_low(zone, &sc))
> -				shrink_active_list(SWAP_CLUSTER_MAX, zone,
> -							&sc, priority, 0);
> +			age_active_anon(zone, &sc, priority);
>  
>  			if (!zone_watermark_ok_safe(zone, order,
>  					high_wmark_pages(zone), 0, 0)) {

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 04/11] mm: memcg: per-priority per-zone hierarchy scan generations
  2011-09-13 11:03       ` Johannes Weiner
@ 2011-09-20  8:15         ` Michal Hocko
  -1 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2011-09-20  8:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Tue 13-09-11 13:03:01, Johannes Weiner wrote:
> On Tue, Sep 13, 2011 at 07:27:59PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Mon, 12 Sep 2011 12:57:21 +0200
> > Johannes Weiner <jweiner@redhat.com> wrote:
> > 
> > > Memory cgroup limit reclaim currently picks one memory cgroup out of
> > > the target hierarchy, remembers it as the last scanned child, and
> > > reclaims all zones in it with decreasing priority levels.
> > > 
> > > The new hierarchy reclaim code will pick memory cgroups from the same
> > > hierarchy concurrently from different zones and priority levels, so it
> > > becomes necessary that hierarchy roots not only remember the last
> > > scanned child, but do so for each zone and priority level.
> > > 
> > > Furthermore, detecting full hierarchy round-trips reliably will become
> > > crucial, so instead of counting on one iterator site seeing a certain
> > > memory cgroup twice, use a generation counter that is increased every
> > > time the child with the highest ID has been visited.
> > > 
> > > Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> > 
> > I cannot imagine how this works. Could you illustrate it more with an easy example?
> 
> Previously, we did
> 
> 	mem = mem_cgroup_iter(root)
> 	  for each priority level:
> 	    for each zone in zonelist:
> 
> and this would reclaim memcg-1-zone-1, memcg-1-zone-2, memcg-1-zone-3
> etc.
> 
> The new code does
> 
> 	for each priority level
> 	  for each zone in zonelist
>             mem = mem_cgroup_iter(root)
> 
> but with a single last_scanned_child per memcg, this would scan
> memcg-1-zone-1, memcg-2-zone-2, memcg-3-zone-3 etc, which does not
> make much sense.
> 
> Now imagine two reclaimers.  With the old code, the first reclaimer
> would pick memcg-1 and scan all its zones, the second reclaimer would
> pick memcg-2 and reclaim all its zones.  Without this patch, the first
> reclaimer would pick memcg-1 and scan zone-1, the second reclaimer
> would pick memcg-2 and scan zone-1, then the first reclaimer would
> pick memcg-3 and scan zone-2.  If the reclaimers are concurrently
> scanning at different priority levels, things are even worse because
> one reclaimer may put much more force on the memcgs it gets from
> mem_cgroup_iter() than the other reclaimer.  They must not share the
> same iterator.
> 
> The generations are needed because the old algorithm did not rely too
> much on detecting full round-trips.  After every reclaim cycle, it
> checked the limit and broke out of the loop if enough was reclaimed,
> no matter how many children were reclaimed from.  The new algorithm is
> used for global reclaim, where the only exit condition of the
> hierarchy reclaim is the full roundtrip, because equal pressure needs
> to be applied to all zones.
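
In code terms, the shared state this relies on boils down to roughly
the following (simplified sketch; the actual patch has the details):

	struct mem_cgroup_iter_state {
		int position;			/* css id of the last child visited */
		unsigned int generation;	/* bumped on each completed round-trip */
	};

	/* one instance per (memcg, zone, priority) instead of one per memcg */
	struct mem_cgroup_per_zone {
		/* ... existing members ... */
		struct mem_cgroup_iter_state iter_state[DEF_PRIORITY + 1];
	};

A reclaimer records the generation it started with, and
mem_cgroup_iter() returns NULL once the shared state has moved on to a
newer generation, so a full round-trip is detected even when several
reclaimers advance the same per-zone cursor.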

Could you fold this into the patch description, please?

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 04/11] mm: memcg: per-priority per-zone hierarchy scan generations
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-20  8:45     ` Michal Hocko
  -1 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2011-09-20  8:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Mon 12-09-11 12:57:21, Johannes Weiner wrote:
> Memory cgroup limit reclaim currently picks one memory cgroup out of
> the target hierarchy, remembers it as the last scanned child, and
> reclaims all zones in it with decreasing priority levels.
> 
> The new hierarchy reclaim code will pick memory cgroups from the same
> hierarchy concurrently from different zones and priority levels, so it
> becomes necessary that hierarchy roots not only remember the last
> scanned child, but do so for each zone and priority level.
> 
> Furthermore, detecting full hierarchy round-trips reliably will become
> crucial, so instead of counting on one iterator site seeing a certain
> memory cgroup twice, use a generation counter that is increased every
> time the child with the highest ID has been visited.

In principle I think the patch is good. I have some concerns about
locking and I would really appreciate some more description (like you
provided in the other email in this thread).

> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> ---
>  mm/memcontrol.c |   60 +++++++++++++++++++++++++++++++++++++++---------------
>  1 files changed, 43 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 912c7c7..f4b404e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -121,6 +121,11 @@ struct mem_cgroup_stat_cpu {
>  	unsigned long targets[MEM_CGROUP_NTARGETS];
>  };
>  
> +struct mem_cgroup_iter_state {
> +	int position;
> +	unsigned int generation;
> +};
> +
>  /*
>   * per-zone information in memory controller.
>   */
> @@ -131,6 +136,8 @@ struct mem_cgroup_per_zone {
>  	struct list_head	lists[NR_LRU_LISTS];
>  	unsigned long		count[NR_LRU_LISTS];
>  
> +	struct mem_cgroup_iter_state iter_state[DEF_PRIORITY + 1];
> +
>  	struct zone_reclaim_stat reclaim_stat;
>  	struct rb_node		tree_node;	/* RB tree node */
>  	unsigned long long	usage_in_excess;/* Set to the value by which */
[...]
> @@ -781,9 +783,15 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
>  	return memcg;
>  }
>  
> +struct mem_cgroup_iter {

Wouldn't mem_cgroup_zone_iter_state be a better name? It is true it is
rather long, but I find mem_cgroup_iter very confusing because the actual
position is stored in the zone's state. The other thing is that it looks
like we have two iterators in the mem_cgroup_iter function now, but in fact
the iter parameter is just the state at which we start the iteration.

> +	struct zone *zone;
> +	int priority;
> +	unsigned int generation;
> +};
> +
>  static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  					  struct mem_cgroup *prev,
> -					  bool remember)
> +					  struct mem_cgroup_iter *iter)

I would rather see a different name for the last parameter
(iter_state?).

>  {
>  	struct mem_cgroup *mem = NULL;
>  	int id = 0;
> @@ -791,7 +799,7 @@ static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  	if (!root)
>  		root = root_mem_cgroup;
>  
> -	if (prev && !remember)
> +	if (prev && !iter)
>  		id = css_id(&prev->css);
>  
>  	if (prev && prev != root)
> @@ -804,10 +812,20 @@ static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  	}
>  
>  	while (!mem) {
> +		struct mem_cgroup_iter_state *uninitialized_var(is);
>  		struct cgroup_subsys_state *css;
>  
> -		if (remember)
> -			id = root->last_scanned_child;
> +		if (iter) {
> +			int nid = zone_to_nid(iter->zone);
> +			int zid = zone_idx(iter->zone);
> +			struct mem_cgroup_per_zone *mz;
> +
> +			mz = mem_cgroup_zoneinfo(root, nid, zid);
> +			is = &mz->iter_state[iter->priority];
> +			if (prev && iter->generation != is->generation)
> +				return NULL;
> +			id = is->position;

Do we need any kind of locking here (spin_lock(&is->lock))?
If two parallel reclaimers start on the same zone and priority, they will
see the same position and so bang on the same cgroup.

> +		}
>  
>  		rcu_read_lock();
>  		css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
> @@ -818,8 +836,13 @@ static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  			id = 0;
>  		rcu_read_unlock();
>  
> -		if (remember)
> -			root->last_scanned_child = id;
> +		if (iter) {
> +			is->position = id;
> +			if (!css)
> +				is->generation++;
> +			else if (!prev && mem)
> +				iter->generation = is->generation;

unlock it here.

> +		}
>  
>  		if (prev && !css)
>  			return NULL;
[...]

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 01/11] mm: memcg: consolidate hierarchy iteration primitives
  2011-09-19 12:53     ` Michal Hocko
@ 2011-09-20  8:45       ` Johannes Weiner
  -1 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-20  8:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Mon, Sep 19, 2011 at 02:53:33PM +0200, Michal Hocko wrote:
> Hi,
> 
> On Mon 12-09-11 12:57:18, Johannes Weiner wrote:
> > Memory control groups are currently bolted onto the side of
> > traditional memory management in places where better integration would
> > be preferrable.  To reclaim memory, for example, memory control groups
> > maintain their own LRU list and reclaim strategy aside from the global
> > per-zone LRU list reclaim.  But an extra list head for each existing
> > page frame is expensive and maintaining it requires additional code.
> > 
> > This patchset disables the global per-zone LRU lists on memory cgroup
> > configurations and converts all its users to operate on the per-memory
> > cgroup lists instead.  As LRU pages are then exclusively on one list,
> > this saves two list pointers for each page frame in the system:
> > 
> > page_cgroup array size with 4G physical memory
> > 
> >   vanilla: [    0.000000] allocated 31457280 bytes of page_cgroup
> >   patched: [    0.000000] allocated 15728640 bytes of page_cgroup
> > 
> > At the same time, system performance for various workloads is
> > unaffected:
> > 
> > 100G sparse file cat, 4G physical memory, 10 runs, to test for code
> > bloat in the traditional LRU handling and kswapd & direct reclaim
> > paths, without/with the memory controller configured in
> > 
> >   vanilla: 71.603(0.207) seconds
> >   patched: 71.640(0.156) seconds
> > 
> >   vanilla: 79.558(0.288) seconds
> >   patched: 77.233(0.147) seconds
> > 
> > 100G sparse file cat in 1G memory cgroup, 10 runs, to test for code
> > bloat in the traditional memory cgroup LRU handling and reclaim path
> > 
> >   vanilla: 96.844(0.281) seconds
> >   patched: 94.454(0.311) seconds
> > 
> > 4 unlimited memcgs running kbuild -j32 each, 4G physical memory, 500M
> > swap on SSD, 10 runs, to test for regressions in kswapd & direct
> > reclaim using per-memcg LRU lists with multiple memcgs and multiple
> > allocators within each memcg
> > 
> >   vanilla: 717.722(1.440) seconds [ 69720.100(11600.835) majfaults ]
> >   patched: 714.106(2.313) seconds [ 71109.300(14886.186) majfaults ]
> > 
> > 16 unlimited memcgs running kbuild, 1900M hierarchical limit, 500M
> > swap on SSD, 10 runs, to test for regressions in hierarchical memcg
> > setups
> > 
> >   vanilla: 2742.058(1.992) seconds [ 26479.600(1736.737) majfaults ]
> >   patched: 2743.267(1.214) seconds [ 27240.700(1076.063) majfaults ]
> 
> I guess you want to have this in the first patch to have it for
> reference once it gets to the tree, right? I have no objections, but it
> seems unrelated to the patch and so it might be a bit confusing. I
> haven't seen the other patches in the series, so there is probably no better
> place to put this.

Andrew usually hand-picks what's of long-term interest from the series
description and puts it in the first patch.  I thought I'd save him
the trouble.

> > This patch:
> > 
> > There are currently two different implementations of iterating over a
> > memory cgroup hierarchy tree.
> > 
> > Consolidate them into one worker function and base the convenience
> > looping-macros on top of it.
> > 
> > Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> 
> Looks mostly good. There is just one issue I spotted and I guess we
> want some comments. After the issue is fixed:
> Reviewed-by: Michal Hocko <mhocko@suse.cz>

Thanks!

> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index b76011a..912c7c7 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -781,83 +781,75 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
> >  	return memcg;
> >  }
> >  
> > -/* The caller has to guarantee "mem" exists before calling this */
> 
> Shouldn't we have a similar comment that we have to keep a reference to
> root if it is non-NULL? A mention of the remember parameter and what it is
> used for (hierarchical reclaim) would be helpful as well.

The only thing that dictates the lifetime of a memcg is its reference
count, so holding a reference while operating on a memcg is not
even a question for all existing memcg-internal callsites.

But I did, in fact, add kernel-doc style documentation to
mem_cgroup_iter() when it becomes a public interface in 5/11.  Can you
take a look and tell me whether you are okay with that?

> > @@ -1719,21 +1670,21 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
> >  		} else
> >  			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> >  						noswap);
> > -		css_put(&victim->css);
> >  		/*
> >  		 * At shrinking usage, we can't check we should stop here or
> >  		 * reclaim more. It's depends on callers. last_scanned_child
> >  		 * will work enough for keeping fairness under tree.
> >  		 */
> >  		if (shrink)
> > -			return ret;
> > +			break;
> 
> Hmm, we are returning total but it doesn't get set to ret for the shrinking
> case, so we are always returning 0. You want to move the line below up.

Heh, none of the limit shrinkers actually mind the return value.  But
you are right, of course, nice catch!

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
---

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 912c7c7..316f3ed 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1670,6 +1670,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
 		} else
 			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
 						noswap);
+		total += ret;
 		/*
 		 * At shrinking usage, we can't check we should stop here or
 		 * reclaim more. It's depends on callers. last_scanned_child
@@ -1677,7 +1678,6 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_memcg,
 		 */
 		if (shrink)
 			break;
-		total += ret;
 		if (check_soft) {
 			if (!res_counter_soft_limit_excess(&root_memcg->res))
 				break;

^ permalink raw reply related	[flat|nested] 130+ messages in thread

* Re: [patch 02/11] mm: vmscan: distinguish global reclaim from global LRU scanning
  2011-09-19 13:23     ` Michal Hocko
@ 2011-09-20  8:52       ` Johannes Weiner
  -1 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-20  8:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Mon, Sep 19, 2011 at 03:23:44PM +0200, Michal Hocko wrote:
> On Mon 12-09-11 12:57:19, Johannes Weiner wrote:
> > The traditional zone reclaim code is scanning the per-zone LRU lists
> > during direct reclaim and kswapd, and the per-zone per-memory cgroup
> > LRU lists when reclaiming on behalf of a memory cgroup limit.
> > 
> > Subsequent patches will convert the traditional reclaim code to
> > reclaim exclusively from the per-memory cgroup LRU lists.  As a
> > result, using the predicate for which LRU list is scanned will no
> > longer be appropriate to tell global reclaim from limit reclaim.
> > 
> > This patch adds a global_reclaim() predicate to tell direct/kswapd
> > reclaim from memory cgroup limit reclaim and substitutes it in all
> > places where currently scanning_global_lru() is used for that.
> 
> I am wondering about vmscan_swappiness. Shouldn't it use global_reclaim
> instead?

Thanks for noticing, you are right.  Too many rebases...

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
---

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 354f125..c2b0903 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1840,7 +1840,7 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 
 static int vmscan_swappiness(struct scan_control *sc)
 {
-	if (scanning_global_lru(sc))
+	if (global_reclaim(sc))
 		return vm_swappiness;
 	return mem_cgroup_swappiness(sc->mem_cgroup);
 }

^ permalink raw reply related	[flat|nested] 130+ messages in thread

* Re: [patch 01/11] mm: memcg: consolidate hierarchy iteration primitives
  2011-09-20  8:45       ` Johannes Weiner
@ 2011-09-20  8:53         ` Michal Hocko
  -1 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2011-09-20  8:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Tue 20-09-11 10:45:53, Johannes Weiner wrote:
> On Mon, Sep 19, 2011 at 02:53:33PM +0200, Michal Hocko wrote:
> > Hi,
> > 
> > On Mon 12-09-11 12:57:18, Johannes Weiner wrote:
> > > Memory control groups are currently bolted onto the side of
> > > traditional memory management in places where better integration would
> > > be preferrable.  To reclaim memory, for example, memory control groups
> > > maintain their own LRU list and reclaim strategy aside from the global
> > > per-zone LRU list reclaim.  But an extra list head for each existing
> > > page frame is expensive and maintaining it requires additional code.
> > > 
> > > This patchset disables the global per-zone LRU lists on memory cgroup
> > > configurations and converts all its users to operate on the per-memory
> > > cgroup lists instead.  As LRU pages are then exclusively on one list,
> > > this saves two list pointers for each page frame in the system:
> > > 
> > > page_cgroup array size with 4G physical memory
> > > 
> > >   vanilla: [    0.000000] allocated 31457280 bytes of page_cgroup
> > >   patched: [    0.000000] allocated 15728640 bytes of page_cgroup
> > > 
> > > At the same time, system performance for various workloads is
> > > unaffected:
> > > 
> > > 100G sparse file cat, 4G physical memory, 10 runs, to test for code
> > > bloat in the traditional LRU handling and kswapd & direct reclaim
> > > paths, without/with the memory controller configured in
> > > 
> > >   vanilla: 71.603(0.207) seconds
> > >   patched: 71.640(0.156) seconds
> > > 
> > >   vanilla: 79.558(0.288) seconds
> > >   patched: 77.233(0.147) seconds
> > > 
> > > 100G sparse file cat in 1G memory cgroup, 10 runs, to test for code
> > > bloat in the traditional memory cgroup LRU handling and reclaim path
> > > 
> > >   vanilla: 96.844(0.281) seconds
> > >   patched: 94.454(0.311) seconds
> > > 
> > > 4 unlimited memcgs running kbuild -j32 each, 4G physical memory, 500M
> > > swap on SSD, 10 runs, to test for regressions in kswapd & direct
> > > reclaim using per-memcg LRU lists with multiple memcgs and multiple
> > > allocators within each memcg
> > > 
> > >   vanilla: 717.722(1.440) seconds [ 69720.100(11600.835) majfaults ]
> > >   patched: 714.106(2.313) seconds [ 71109.300(14886.186) majfaults ]
> > > 
> > > 16 unlimited memcgs running kbuild, 1900M hierarchical limit, 500M
> > > swap on SSD, 10 runs, to test for regressions in hierarchical memcg
> > > setups
> > > 
> > >   vanilla: 2742.058(1.992) seconds [ 26479.600(1736.737) majfaults ]
> > >   patched: 2743.267(1.214) seconds [ 27240.700(1076.063) majfaults ]
> > 
> > I guess you want to have this in the first patch to have it for
> > reference once it gets to the tree, right? I have no objections, but it
> > seems unrelated to the patch and so it might be a bit confusing. I
> > haven't seen the other patches in the series, so there is probably no better
> > place to put this.
> 
> Andrew usually hand-picks what's of long-term interest from the series
> description and puts it in the first patch.  I thought I'd save him
> the trouble.

Understood

[...]

> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index b76011a..912c7c7 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -781,83 +781,75 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
> > >  	return memcg;
> > >  }
> > >  
> > > -/* The caller has to guarantee "mem" exists before calling this */
> > 
> > Shouldn't we have a similar comment that we have to keep a reference to
> > root if it is non-NULL? A mention of the remember parameter and what it is
> > used for (hierarchical reclaim) would be helpful as well.
> 
> The only thing that dictates the lifetime of a memcg is its reference
> count, so holding a reference while operating on a memcg is not
> even a question for all existing memcg-internal callsites.

Fair enough.

> 
> But I did, in fact, add kernel-doc style documentation to
> mem_cgroup_iter() when it becomes a public interface in 5/11.  Can you
> take a look and tell me whether you are okay with that?

OK, I will comment on that patch once I get to it.

[...]

Thanks!
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 03/11] mm: vmscan: distinguish between memcg triggering reclaim and memcg being scanned
  2011-09-19 14:29     ` Michal Hocko
@ 2011-09-20  8:58       ` Johannes Weiner
  -1 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-20  8:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Mon, Sep 19, 2011 at 04:29:55PM +0200, Michal Hocko wrote:
> On Mon 12-09-11 12:57:20, Johannes Weiner wrote:
> > Memory cgroup hierarchies are currently handled completely outside of
> > the traditional reclaim code, which is invoked with a single memory
> > cgroup as an argument for the whole call stack.
> > 
> > Subsequent patches will switch this code to do hierarchical reclaim,
> > so there needs to be a distinction between a) the memory cgroup that
> > is triggering reclaim due to hitting its limit and b) the memory
> > cgroup that is being scanned as a child of a).
> > 
> > This patch introduces a struct mem_cgroup_zone that contains the
> > combination of the memory cgroup and the zone being scanned, which is
> > then passed down the stack instead of the zone argument.
> > 
> > Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> 
> Looks good to me. Some minor comments below.
> Anyway:
> Reviewed-by: Michal Hocko <mhocko@suse.cz>

Thanks!

> > @@ -1853,13 +1865,13 @@ static int vmscan_swappiness(struct scan_control *sc)
> >   *
> >   * nr[0] = anon pages to scan; nr[1] = file pages to scan
> >   */
> > -static void get_scan_count(struct zone *zone, struct scan_control *sc,
> > -					unsigned long *nr, int priority)
> > +static void get_scan_count(struct mem_cgroup_zone *mz, struct scan_control *sc,
> > +			   unsigned long *nr, int priority)
> >  {
> >  	unsigned long anon, file, free;
> >  	unsigned long anon_prio, file_prio;
> >  	unsigned long ap, fp;
> > -	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> > +	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(mz);
> >  	u64 fraction[2], denominator;
> >  	enum lru_list l;
> >  	int noswap = 0;
> 
> You can save some patch lines by:
> 	struct zone *zone = mz->zone;
> and not doing zone => mz->zone changes that follow.

Actually, I really hate that I had to add that local zone variable in
other places.  I only did it where the zone is used so often that it would
have changed every other line.  If you insist, I'll change it, but I
would prefer to avoid it when possible.

> > @@ -2390,6 +2413,18 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> >  }
> >  #endif
> >  
> > +static void age_active_anon(struct zone *zone, struct scan_control *sc,
> > +			    int priority)
> > +{
> > +	struct mem_cgroup_zone mz = {
> > +		.mem_cgroup = NULL,
> > +		.zone = zone,
> > +	};
> > +
> > +	if (inactive_anon_is_low(&mz))
> > +		shrink_active_list(SWAP_CLUSTER_MAX, &mz, sc, priority, 0);
> > +}
> > +
> 
> I do not like this very much because we are using a similar construct in
> shrink_mem_cgroup_zone, so we are duplicating that code.
> What about adding an age_mem_cgroup_active_anon (something like shrink_zone)?

I am not sure I follow and I don't see what could be shared between
the zone shrinking and this as there are different exit conditions to
the hierarchy walk.  Can you elaborate?

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 04/11] mm: memcg: per-priority per-zone hierarchy scan generations
  2011-09-20  8:45     ` Michal Hocko
@ 2011-09-20  9:10       ` Johannes Weiner
  -1 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-20  9:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Tue, Sep 20, 2011 at 10:45:32AM +0200, Michal Hocko wrote:
> On Mon 12-09-11 12:57:21, Johannes Weiner wrote:
> > Memory cgroup limit reclaim currently picks one memory cgroup out of
> > the target hierarchy, remembers it as the last scanned child, and
> > reclaims all zones in it with decreasing priority levels.
> > 
> > The new hierarchy reclaim code will pick memory cgroups from the same
> > hierarchy concurrently from different zones and priority levels, it
> > becomes necessary that hierarchy roots not only remember the last
> > scanned child, but do so for each zone and priority level.
> > 
> > Furthermore, detecting full hierarchy round-trips reliably will become
> > crucial, so instead of counting on one iterator site seeing a certain
> > memory cgroup twice, use a generation counter that is increased every
> > time the child with the highest ID has been visited.
> 
> In principle I think the patch is good. I have some concerns about
> locking and I would really appreciate some more description (like you
> provided in the other email in this thread).

Okay, I'll incorporate that description into the changelog.

> > @@ -131,6 +136,8 @@ struct mem_cgroup_per_zone {
> >  	struct list_head	lists[NR_LRU_LISTS];
> >  	unsigned long		count[NR_LRU_LISTS];
> >  
> > +	struct mem_cgroup_iter_state iter_state[DEF_PRIORITY + 1];
> > +
> >  	struct zone_reclaim_stat reclaim_stat;
> >  	struct rb_node		tree_node;	/* RB tree node */
> >  	unsigned long long	usage_in_excess;/* Set to the value by which */
> [...]
> > @@ -781,9 +783,15 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
> >  	return memcg;
> >  }
> >  
> > +struct mem_cgroup_iter {
> 
> Wouldn't be mem_cgroup_zone_iter_state a better name. It is true it is
> rather long but I find mem_cgroup_iter very confusing because the actual
> position is stored in the zone's state. The other thing is that it looks
> like we have two iterators in mem_cgroup_iter function now but in fact
> the iter parameter is just a state when we start iteration.

Agreed, the naming is unfortunate.  How about
mem_cgroup_reclaim_cookie or something comparable?  It's limited to
reclaim anyway, hierarchy walkers that do not age the LRU lists should
not advance the shared iterator state, so might as well encode it in
the name.
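
To make that concrete, the cookie would keep the zone/priority/generation
fields from the patch and only the name would change, roughly (a sketch,
name not final):

	struct mem_cgroup_reclaim_cookie {
		struct zone *zone;
		int priority;
		unsigned int generation;
	};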

> > +	struct zone *zone;
> > +	int priority;
> > +	unsigned int generation;
> > +};
> > +
> >  static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> >  					  struct mem_cgroup *prev,
> > -					  bool remember)
> > +					  struct mem_cgroup_iter *iter)
> 
> I would rather see a different name for the last parameter
> (iter_state?).

I'm with you on this.  Will think something up.

> > @@ -804,10 +812,20 @@ static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> >  	}
> >  
> >  	while (!mem) {
> > +		struct mem_cgroup_iter_state *uninitialized_var(is);
> >  		struct cgroup_subsys_state *css;
> >  
> > -		if (remember)
> > -			id = root->last_scanned_child;
> > +		if (iter) {
> > +			int nid = zone_to_nid(iter->zone);
> > +			int zid = zone_idx(iter->zone);
> > +			struct mem_cgroup_per_zone *mz;
> > +
> > +			mz = mem_cgroup_zoneinfo(root, nid, zid);
> > +			is = &mz->iter_state[iter->priority];
> > +			if (prev && iter->generation != is->generation)
> > +				return NULL;
> > +			id = is->position;
> 
> Do we need any kind of locking here (spin_lock(&is->lock))?
> If two parallel reclaimers start on the same zone and priority they will
> see the same position and so bang on the same cgroup.

Note that last_scanned_child wasn't lock-protected before this series,
so there is no actual difference.

I can say, though, that during development I had a lock in there for
some time and it didn't make any difference for 32 concurrent
reclaimers on a quadcore.  Feel free to evaluate with higher
concurrency :)
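
For the record, the locking you have in mind would amount to roughly the
following around the position lookup (purely hypothetical, there is no
is->lock in the series):

	spin_lock(&is->lock);
	if (prev && iter->generation != is->generation) {
		spin_unlock(&is->lock);
		return NULL;
	}
	id = is->position;
	spin_unlock(&is->lock);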


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 03/11] mm: vmscan: distinguish between memcg triggering reclaim and memcg being scanned
  2011-09-20  8:58       ` Johannes Weiner
@ 2011-09-20  9:17         ` Michal Hocko
  -1 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2011-09-20  9:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Tue 20-09-11 10:58:11, Johannes Weiner wrote:
> On Mon, Sep 19, 2011 at 04:29:55PM +0200, Michal Hocko wrote:
> > On Mon 12-09-11 12:57:20, Johannes Weiner wrote:
[...]
> > > @@ -1853,13 +1865,13 @@ static int vmscan_swappiness(struct scan_control *sc)
> > >   *
> > >   * nr[0] = anon pages to scan; nr[1] = file pages to scan
> > >   */
> > > -static void get_scan_count(struct zone *zone, struct scan_control *sc,
> > > -					unsigned long *nr, int priority)
> > > +static void get_scan_count(struct mem_cgroup_zone *mz, struct scan_control *sc,
> > > +			   unsigned long *nr, int priority)
> > >  {
> > >  	unsigned long anon, file, free;
> > >  	unsigned long anon_prio, file_prio;
> > >  	unsigned long ap, fp;
> > > -	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> > > +	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(mz);
> > >  	u64 fraction[2], denominator;
> > >  	enum lru_list l;
> > >  	int noswap = 0;
> > 
> > You can save some patch lines by:
> > 	struct zone *zone = mz->zone;
> > and not doing zone => mz->zone changes that follow.
> 
> Actually, I really hate that I had to do that local zone variable in
> other places.  I only did it where it's used so often that it would
> have changed every other line.  If you insist, I'll change it, but I
> would prefer to avoid it when possible.
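
What I had in mind was only something along these lines (an illustrative
sketch, not a patch):

	static void get_scan_count(struct mem_cgroup_zone *mz,
				   struct scan_control *sc,
				   unsigned long *nr, int priority)
	{
		struct zone *zone = mz->zone;	/* derive the zone once */

		/* ... the rest keeps using "zone" unchanged ... */
	}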

Just a suggestion, feel free to ignore it. I have no preference.

> 
> > > @@ -2390,6 +2413,18 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> > >  }
> > >  #endif
> > >  
> > > +static void age_active_anon(struct zone *zone, struct scan_control *sc,
> > > +			    int priority)
> > > +{
> > > +	struct mem_cgroup_zone mz = {
> > > +		.mem_cgroup = NULL,
> > > +		.zone = zone,
> > > +	};
> > > +
> > > +	if (inactive_anon_is_low(&mz))
> > > +		shrink_active_list(SWAP_CLUSTER_MAX, &mz, sc, priority, 0);
> > > +}
> > > +
> > 
> > I do not like this very much because we are using a similar construct in
> > shrink_mem_cgroup_zone so we are duplicating that code. 
> > What about adding age_mem_cgroup_active_anon (something like shrink_zone).
> 
> I am not sure I follow and I don't see what could be shared between
> the zone shrinking and this as there are different exit conditions to
> the hierarchy walk.  Can you elaborate?

Sorry for not being clear enough. Maybe it is not very important,
but what about something like:

Index: linus_tree/mm/vmscan.c
===================================================================
--- linus_tree.orig/mm/vmscan.c	2011-09-20 11:07:57.000000000 +0200
+++ linus_tree/mm/vmscan.c	2011-09-20 11:12:53.000000000 +0200
@@ -2041,6 +2041,13 @@ static inline bool should_continue_recla
 	}
 }
 
+static void age_mem_cgroup_active_anon(struct mem_cgroup_zone *mz,
+		struct scan_control *sc, int priority)
+{
+	if (inactive_anon_is_low(mz))
+		shrink_active_list(SWAP_CLUSTER_MAX, mz, sc, priority, 0);
+}
+
 /*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
@@ -2090,8 +2097,7 @@ restart:
 	 * Even if we did not try to evict anon pages at all, we want to
 	 * rebalance the anon lru active/inactive ratio.
 	 */
-	if (inactive_anon_is_low(mz))
-		shrink_active_list(SWAP_CLUSTER_MAX, mz, sc, priority, 0);
+	age_mem_cgroup_active_anon(mz, sc, priority);
 
 	/* reclaim/compaction might need reclaim to continue */
 	if (should_continue_reclaim(mz, nr_reclaimed,
@@ -2421,8 +2427,7 @@ static void age_active_anon(struct zone
 		.zone = zone,
 	};
 
-	if (inactive_anon_is_low(&mz))
-		shrink_active_list(SWAP_CLUSTER_MAX, &mz, sc, priority, 0);
+	age_mem_cgroup_active_anon(&mz, sc, priority);
 }
 
 /*

Thanks
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 04/11] mm: memcg: per-priority per-zone hierarchy scan generations
  2011-09-20  9:10       ` Johannes Weiner
@ 2011-09-20 12:37         ` Michal Hocko
  -1 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2011-09-20 12:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Tue 20-09-11 11:10:32, Johannes Weiner wrote:
> On Tue, Sep 20, 2011 at 10:45:32AM +0200, Michal Hocko wrote:
> > On Mon 12-09-11 12:57:21, Johannes Weiner wrote:
> > > Memory cgroup limit reclaim currently picks one memory cgroup out of
> > > the target hierarchy, remembers it as the last scanned child, and
> > > reclaims all zones in it with decreasing priority levels.
> > > 
> > > The new hierarchy reclaim code will pick memory cgroups from the same
> > > hierarchy concurrently from different zones and priority levels, it
> > > becomes necessary that hierarchy roots not only remember the last
> > > scanned child, but do so for each zone and priority level.
> > > 
> > > Furthermore, detecting full hierarchy round-trips reliably will become
> > > crucial, so instead of counting on one iterator site seeing a certain
> > > memory cgroup twice, use a generation counter that is increased every
> > > time the child with the highest ID has been visited.
> > 
> > In principle I think the patch is good. I have some concerns about
> > locking and I would really appreciate some more description (like you
> > provided in the other email in this thread).
> 
> Okay, I'll incorporate that description into the changelog.

Thanks!

> 
> > > @@ -131,6 +136,8 @@ struct mem_cgroup_per_zone {
> > >  	struct list_head	lists[NR_LRU_LISTS];
> > >  	unsigned long		count[NR_LRU_LISTS];
> > >  
> > > +	struct mem_cgroup_iter_state iter_state[DEF_PRIORITY + 1];
> > > +
> > >  	struct zone_reclaim_stat reclaim_stat;
> > >  	struct rb_node		tree_node;	/* RB tree node */
> > >  	unsigned long long	usage_in_excess;/* Set to the value by which */
> > [...]
> > > @@ -781,9 +783,15 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
> > >  	return memcg;
> > >  }
> > >  
> > > +struct mem_cgroup_iter {
> > 
> > Wouldn't be mem_cgroup_zone_iter_state a better name. It is true it is
> > rather long but I find mem_cgroup_iter very confusing because the actual
> > position is stored in the zone's state. The other thing is that it looks
> > like we have two iterators in mem_cgroup_iter function now but in fact
> > the iter parameter is just a state when we start iteration.
> 
> Agreed, the naming is unfortunate.  How about
> mem_cgroup_reclaim_cookie or something comparable?  It's limited to
> reclaim anyway, hierarchy walkers that do not age the LRU lists should
> not advance the shared iterator state, so might as well encode it in
> the name.

Sounds good.

> 
> > > +	struct zone *zone;
> > > +	int priority;
> > > +	unsigned int generation;
> > > +};
> > > +
> > >  static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> > >  					  struct mem_cgroup *prev,
> > > -					  bool remember)
> > > +					  struct mem_cgroup_iter *iter)
> > 
> > I would rather see a different name for the last parameter
> > (iter_state?).
> 
> I'm with you on this.  Will think something up.
> 
> > > @@ -804,10 +812,20 @@ static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> > >  	}
> > >  
> > >  	while (!mem) {
> > > +		struct mem_cgroup_iter_state *uninitialized_var(is);
> > >  		struct cgroup_subsys_state *css;
> > >  
> > > -		if (remember)
> > > -			id = root->last_scanned_child;
> > > +		if (iter) {
> > > +			int nid = zone_to_nid(iter->zone);
> > > +			int zid = zone_idx(iter->zone);
> > > +			struct mem_cgroup_per_zone *mz;
> > > +
> > > +			mz = mem_cgroup_zoneinfo(root, nid, zid);
> > > +			is = &mz->iter_state[iter->priority];
> > > +			if (prev && iter->generation != is->generation)
> > > +				return NULL;
> > > +			id = is->position;
> > 
> > Do we need any kind of locking here (spin_lock(&is->lock))?
> > If two parallel reclaimers start on the same zone and priority they will
> > see the same position and so bang on the same cgroup.
> 
> Note that last_scanned_child wasn't lock-protected before this series,
> so there is no actual difference.

That's a fair point. Anyway, I think it is worth mentioning this in the
patch description or in a comment to make it clear that this is intentional.
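
Something along these lines next to the position lookup would be enough in
my opinion (wording is only a suggestion):

	/*
	 * The shared per-zone per-priority iterator state is read and
	 * updated without locking.  Concurrent reclaimers may briefly
	 * pick the same memcg, which is harmless and no worse than the
	 * old unlocked last_scanned_child.
	 */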

> 
> I can say, though, that during development I had a lock in there for
> some time and it didn't make any difference for 32 concurrent
> reclaimers on a quadcore.  Feel free to evaluate with higher
> concurrency :)

Thanks!
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 05/11] mm: move memcg hierarchy reclaim to generic reclaim code
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-20 13:09     ` Michal Hocko
  -1 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2011-09-20 13:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Mon 12-09-11 12:57:22, Johannes Weiner wrote:
> Memory cgroup limit reclaim and traditional global pressure reclaim
> will soon share the same code to reclaim from a hierarchical tree of
> memory cgroups.
> 
> In preparation of this, move the two right next to each other in
> shrink_zone().

I like the way you've split mem_cgroup_hierarchical_reclaim into
mem_cgroup_reclaim and mem_cgroup_soft_reclaim and I guess this deserves
a note in the patch description. Especially that mem_cgroup_reclaim is
hierarchical even though it doesn't use mem_cgroup_iter directly but
rather via do_try_to_free_pages and shrink_zone.
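
In other words, if I read it correctly, the limit reclaim call chain now
looks roughly like:

	mem_cgroup_reclaim()
	  -> try_to_free_mem_cgroup_pages()
	       -> do_try_to_free_pages()
	            -> shrink_zone()		/* walks the memcg hierarchy */
	                 -> shrink_mem_cgroup_zone()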

I am not sure I see how shrink_mem_cgroup_zone works. See comments and
questions below:

> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> ---
>  include/linux/memcontrol.h |   25 ++++++-
>  mm/memcontrol.c            |  167 ++++++++++++++++++++++----------------------
>  mm/vmscan.c                |   43 ++++++++++-
>  3 files changed, 147 insertions(+), 88 deletions(-)
> 
[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f4b404e..413e1f8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
[...]
> @@ -783,19 +781,33 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
>  	return memcg;
>  }
>  
> -struct mem_cgroup_iter {
> -	struct zone *zone;
> -	int priority;
> -	unsigned int generation;
> -};
> -
> -static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> -					  struct mem_cgroup *prev,
> -					  struct mem_cgroup_iter *iter)
> +/**
> + * mem_cgroup_iter - iterate over memory cgroup hierarchy
> + * @root: hierarchy root
> + * @prev: previously returned memcg, NULL on first invocation
> + * @iter: token for partial walks, NULL for full walks
> + *
> + * Returns references to children of the hierarchy starting at @root,

I guess you meant "starting at @prev"

> + * or @root itself, or %NULL after a full round-trip.
> + *
> + * Caller must pass the return value in @prev on subsequent
> + * invocations for reference counting, or use mem_cgroup_iter_break()
> + * to cancel a hierarchy walk before the round-trip is complete.
> + *
> + * Reclaimers can specify a zone and a priority level in @iter to
> + * divide up the memcgs in the hierarchy among all concurrent
> + * reclaimers operating on the same zone and priority.
> + */
> +struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> +				   struct mem_cgroup *prev,
> +				   struct mem_cgroup_iter *iter)
>  {
>  	struct mem_cgroup *mem = NULL;
>  	int id = 0;
>  
> +	if (mem_cgroup_disabled())
> +		return NULL;
> +
>  	if (!root)
>  		root = root_mem_cgroup;
>  
[...]
> @@ -1479,6 +1496,41 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
>  	return min(limit, memsw);
>  }
>  
> +static unsigned long mem_cgroup_reclaim(struct mem_cgroup *mem,
> +					gfp_t gfp_mask,
> +					unsigned long flags)
> +{
> +	unsigned long total = 0;
> +	bool noswap = false;
> +	int loop;
> +
> +	if (flags & MEM_CGROUP_RECLAIM_NOSWAP)
> +		noswap = true;
> +	else if (!(flags & MEM_CGROUP_RECLAIM_SHRINK) && mem->memsw_is_minimum)
> +		noswap = true;
> +
> +	for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
> +		if (loop)
> +			drain_all_stock_async(mem);
> +		total += try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap);
> +		/*
> +		 * Avoid freeing too much when shrinking to resize the
> +		 * limit.  XXX: Shouldn't the margin check be enough?

I guess the MEM_CGROUP_RECLAIM_SHRINK condition should help shrinkers to
die more easily on signal even if they make some progress.

> +		 */
> +		if (total && (flags & MEM_CGROUP_RECLAIM_SHRINK))
> +			break;
> +		if (mem_cgroup_margin(mem))
> +			break;
> +		/*
> +		 * If nothing was reclaimed after two attempts, there
> +		 * may be no reclaimable pages in this hierarchy.
> +		 */
> +		if (loop && !total)
> +			break;
> +	}
> +	return total;
> +}
> +
>  /**
>   * test_mem_cgroup_node_reclaimable
>   * @mem: the target memcg
[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 92f4e22..8419e8f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2104,12 +2104,43 @@ restart:
>  static void shrink_zone(int priority, struct zone *zone,
>  			struct scan_control *sc)
>  {
> -	struct mem_cgroup_zone mz = {
> -		.mem_cgroup = sc->target_mem_cgroup,
> +	struct mem_cgroup *root = sc->target_mem_cgroup;
> +	struct mem_cgroup_iter iter = {
>  		.zone = zone,
> +		.priority = priority,
>  	};
> +	struct mem_cgroup *mem;
> +
> +	if (global_reclaim(sc)) {
> +		struct mem_cgroup_zone mz = {
> +			.mem_cgroup = NULL,
> +			.zone = zone,
> +		};
> +
> +		shrink_mem_cgroup_zone(priority, &mz, sc);
> +		return;
> +	}
> +
> +	mem = mem_cgroup_iter(root, NULL, &iter);
> +	do {
> +		struct mem_cgroup_zone mz = {
> +			.mem_cgroup = mem,
> +			.zone = zone,
> +		};
>  
> -	shrink_mem_cgroup_zone(priority, &mz, sc);
> +		shrink_mem_cgroup_zone(priority, &mz, sc);
> +		/*
> +		 * Limit reclaim has historically picked one memcg and
> +		 * scanned it with decreasing priority levels until
> +		 * nr_to_reclaim had been reclaimed.  This priority
> +		 * cycle is thus over after a single memcg.
> +		 */
> +		if (!global_reclaim(sc)) {

How can we have global_reclaim(sc) == true here?
Shouldn't we just check how much we have reclaimed from that group and
iterate only if it wasn't sufficient (at least SWAP_CLUSTER_MAX)?

> +			mem_cgroup_iter_break(root, mem);
> +			break;
> +		}
> +		mem = mem_cgroup_iter(root, mem, &iter);
> +	} while (mem);
>  }
>  
>  /*
[...]
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 05/11] mm: move memcg hierarchy reclaim to generic reclaim code
  2011-09-20 13:09     ` Michal Hocko
@ 2011-09-20 13:29       ` Johannes Weiner
  -1 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-20 13:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Tue, Sep 20, 2011 at 03:09:15PM +0200, Michal Hocko wrote:
> On Mon 12-09-11 12:57:22, Johannes Weiner wrote:
> > Memory cgroup limit reclaim and traditional global pressure reclaim
> > will soon share the same code to reclaim from a hierarchical tree of
> > memory cgroups.
> > 
> > In preparation of this, move the two right next to each other in
> > shrink_zone().
> 
> I like the way you've split mem_cgroup_hierarchical_reclaim into
> mem_cgroup_reclaim and mem_cgroup_soft_reclaim and I guess this deserves
> a note in the patch description. Especially that mem_cgroup_reclaim is
> hierarchical even though it doesn't use mem_cgroup_iter directly but
> rather via do_try_to_free_pages and shrink_zone.
> 
> I am not sure I see how shrink_mem_cgroup_zone works. See comments and
> questions below:
> 
> > 
> > Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> > ---
> >  include/linux/memcontrol.h |   25 ++++++-
> >  mm/memcontrol.c            |  167 ++++++++++++++++++++++----------------------
> >  mm/vmscan.c                |   43 ++++++++++-
> >  3 files changed, 147 insertions(+), 88 deletions(-)
> > 
> [...]
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index f4b404e..413e1f8 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> [...]
> > @@ -783,19 +781,33 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
> >  	return memcg;
> >  }
> >  
> > -struct mem_cgroup_iter {
> > -	struct zone *zone;
> > -	int priority;
> > -	unsigned int generation;
> > -};
> > -
> > -static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> > -					  struct mem_cgroup *prev,
> > -					  struct mem_cgroup_iter *iter)
> > +/**
> > + * mem_cgroup_iter - iterate over memory cgroup hierarchy
> > + * @root: hierarchy root
> > + * @prev: previously returned memcg, NULL on first invocation
> > + * @iter: token for partial walks, NULL for full walks
> > + *
> > + * Returns references to children of the hierarchy starting at @root,
> 
> I guess you meant "starting at @prev"

Nope, although it is a bit ambiguous: both the hierarchy and the
iteration start at @root.  Unless @iter is specified, but even then the
hierarchy still starts at @root.

I attached a patch below that fixes the phrasing.

> > @@ -1479,6 +1496,41 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
> >  	return min(limit, memsw);
> >  }
> >  
> > +static unsigned long mem_cgroup_reclaim(struct mem_cgroup *mem,
> > +					gfp_t gfp_mask,
> > +					unsigned long flags)
> > +{
> > +	unsigned long total = 0;
> > +	bool noswap = false;
> > +	int loop;
> > +
> > +	if (flags & MEM_CGROUP_RECLAIM_NOSWAP)
> > +		noswap = true;
> > +	else if (!(flags & MEM_CGROUP_RECLAIM_SHRINK) && mem->memsw_is_minimum)
> > +		noswap = true;
> > +
> > +	for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
> > +		if (loop)
> > +			drain_all_stock_async(mem);
> > +		total += try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap);
> > +		/*
> > +		 * Avoid freeing too much when shrinking to resize the
> > +		 * limit.  XXX: Shouldn't the margin check be enough?
> 
> I guess the MEM_CGROUP_RECLAIM_SHRINK condition should help shrinkers to
> die more easily on signal even if they make some progress.

Ah, that makes sense.  I'll remove the XXX.  Folded in the patch
below.

> > @@ -2104,12 +2104,43 @@ restart:
> >  static void shrink_zone(int priority, struct zone *zone,
> >  			struct scan_control *sc)
> >  {
> > -	struct mem_cgroup_zone mz = {
> > -		.mem_cgroup = sc->target_mem_cgroup,
> > +	struct mem_cgroup *root = sc->target_mem_cgroup;
> > +	struct mem_cgroup_iter iter = {
> >  		.zone = zone,
> > +		.priority = priority,
> >  	};
> > +	struct mem_cgroup *mem;
> > +
> > +	if (global_reclaim(sc)) {
> > +		struct mem_cgroup_zone mz = {
> > +			.mem_cgroup = NULL,
> > +			.zone = zone,
> > +		};
> > +
> > +		shrink_mem_cgroup_zone(priority, &mz, sc);
> > +		return;
> > +	}
> > +
> > +	mem = mem_cgroup_iter(root, NULL, &iter);
> > +	do {
> > +		struct mem_cgroup_zone mz = {
> > +			.mem_cgroup = mem,
> > +			.zone = zone,
> > +		};
> >  
> > -	shrink_mem_cgroup_zone(priority, &mz, sc);
> > +		shrink_mem_cgroup_zone(priority, &mz, sc);
> > +		/*
> > +		 * Limit reclaim has historically picked one memcg and
> > +		 * scanned it with decreasing priority levels until
> > +		 * nr_to_reclaim had been reclaimed.  This priority
> > +		 * cycle is thus over after a single memcg.
> > +		 */
> > +		if (!global_reclaim(sc)) {
> 
> How can we have global_reclaim(sc) == true here?

We can't yet, but will when global reclaim and limit reclaim use that
same loop a patch or two later.  I added it early to have an anchor
for the comment above it, so that it's obvious how limit reclaim
behaves, and that I don't have to change limit reclaim-specific
conditions when changing how global reclaim works.
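
For reference, global_reclaim() is nothing more than a check whether there
is a target memcg at all, along these lines (sketch):

	static bool global_reclaim(struct scan_control *sc)
	{
		/* global reclaim has no target memcg, limit reclaim does */
		return !sc->target_mem_cgroup;
	}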

> Shouldn't we just check how much we have reclaimed from that group and
> iterate only if it wasn't sufficient (at least SWAP_CLUSTER_MAX)?

I played with various exit conditions and in the end just left the
behaviour exactly like we had it before this patch, just that the
algorithm is now inside out.  If you want to make such a fundamental
change, you have to prove it works :-)
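
Just so we are talking about the same thing, I read your suggestion as
something like this in that loop (a sketch, assuming the nr_reclaimed /
nr_to_reclaim bookkeeping in scan_control is usable at this point):

		shrink_mem_cgroup_zone(priority, &mz, sc);
		/* hypothetical: stop the walk once enough has been reclaimed */
		if (sc->nr_reclaimed >= sc->nr_to_reclaim) {
			mem_cgroup_iter_break(root, mem);
			break;
		}
		mem = mem_cgroup_iter(root, mem, &iter);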


Signed-off-by: Johannes Weiner <jweiner@redhat.com>
---

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a7d14a5..349620c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -784,8 +784,8 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
  * @prev: previously returned memcg, NULL on first invocation
  * @iter: token for partial walks, NULL for full walks
  *
- * Returns references to children of the hierarchy starting at @root,
- * or @root itself, or %NULL after a full round-trip.
+ * Returns references to children of the hierarchy below @root, or
+ * @root itself, or %NULL after a full round-trip.
  *
  * Caller must pass the return value in @prev on subsequent
  * invocations for reference counting, or use mem_cgroup_iter_break()
@@ -1493,7 +1493,8 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *mem,
 		total += try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap);
 		/*
 		 * Avoid freeing too much when shrinking to resize the
-		 * limit.  XXX: Shouldn't the margin check be enough?
+		 * limit, and bail easily so that signals have a
+		 * chance to kill the resizing task.
 		 */
 		if (total && (flags & MEM_CGROUP_RECLAIM_SHRINK))
 			break;

^ permalink raw reply related	[flat|nested] 130+ messages in thread

* Re: [patch 05/11] mm: move memcg hierarchy reclaim to generic reclaim code
  2011-09-20 13:29       ` Johannes Weiner
@ 2011-09-20 14:08         ` Michal Hocko
  -1 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2011-09-20 14:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Tue 20-09-11 15:29:28, Johannes Weiner wrote:
> On Tue, Sep 20, 2011 at 03:09:15PM +0200, Michal Hocko wrote:
> > On Mon 12-09-11 12:57:22, Johannes Weiner wrote:
> > > Memory cgroup limit reclaim and traditional global pressure reclaim
> > > will soon share the same code to reclaim from a hierarchical tree of
> > > memory cgroups.
> > > 
> > > In preparation of this, move the two right next to each other in
> > > shrink_zone().
> > 
> > I like the way you've split mem_cgroup_hierarchical_reclaim into
> > mem_cgroup_reclaim and mem_cgroup_soft_reclaim and I guess this deserves
> > a note in the patch description. Especially that mem_cgroup_reclaim is
> > hierarchical even though it doesn't use mem_cgroup_iter directly but
> > rather via do_try_to_free_pages and shrink_zone.
> > 
> > I am not sure I see how shrink_mem_cgroup_zone works. See comments and
> > questions below:
> > 
> > > 
> > > Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> > > ---
> > >  include/linux/memcontrol.h |   25 ++++++-
> > >  mm/memcontrol.c            |  167 ++++++++++++++++++++++----------------------
> > >  mm/vmscan.c                |   43 ++++++++++-
> > >  3 files changed, 147 insertions(+), 88 deletions(-)
> > > 
> > [...]
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index f4b404e..413e1f8 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > [...]
> > > @@ -783,19 +781,33 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
> > >  	return memcg;
> > >  }
> > >  
> > > -struct mem_cgroup_iter {
> > > -	struct zone *zone;
> > > -	int priority;
> > > -	unsigned int generation;
> > > -};
> > > -
> > > -static struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> > > -					  struct mem_cgroup *prev,
> > > -					  struct mem_cgroup_iter *iter)
> > > +/**
> > > + * mem_cgroup_iter - iterate over memory cgroup hierarchy
> > > + * @root: hierarchy root
> > > + * @prev: previously returned memcg, NULL on first invocation
> > > + * @iter: token for partial walks, NULL for full walks
> > > + *
> > > + * Returns references to children of the hierarchy starting at @root,
> > 
> > I guess you meant "starting at @prev"
> 
> Nope, although it is a bit ambiguous, both the hierarchy as well as
> the iteration start at @root.  Unless @iter is specified, but then the
> hierarchy still starts at @root.

OK, I guess I see where my misunderstanding comes from. You are describing
the function in the way it is supposed to be used (with prev=NULL on the
first call and then reusing the return value). I was thinking of a general
description where someone provides an arbitrary prev and expects the
function to return the next group in the hierarchy. In that case we would
have to go into ids, so this is probably better.
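
In other words, if I read the rest of the series right, the intended
full-walk pattern is basically (just a sketch, early termination via
mem_cgroup_iter_break() left out):

	mem = mem_cgroup_iter(root, NULL, NULL);
	do {
		/* ... operate on mem ... */
		mem = mem_cgroup_iter(root, mem, NULL);
	} while (mem);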

> 
> I attached a patch below that fixes the phrasing.

Makes more sense.

[...]
> > > @@ -2104,12 +2104,43 @@ restart:
> > >  static void shrink_zone(int priority, struct zone *zone,
> > >  			struct scan_control *sc)
> > >  {
> > > -	struct mem_cgroup_zone mz = {
> > > -		.mem_cgroup = sc->target_mem_cgroup,
> > > +	struct mem_cgroup *root = sc->target_mem_cgroup;
> > > +	struct mem_cgroup_iter iter = {
> > >  		.zone = zone,
> > > +		.priority = priority,
> > >  	};
> > > +	struct mem_cgroup *mem;
> > > +
> > > +	if (global_reclaim(sc)) {
> > > +		struct mem_cgroup_zone mz = {
> > > +			.mem_cgroup = NULL,
> > > +			.zone = zone,
> > > +		};
> > > +
> > > +		shrink_mem_cgroup_zone(priority, &mz, sc);
> > > +		return;
> > > +	}
> > > +
> > > +	mem = mem_cgroup_iter(root, NULL, &iter);
> > > +	do {
> > > +		struct mem_cgroup_zone mz = {
> > > +			.mem_cgroup = mem,
> > > +			.zone = zone,
> > > +		};
> > >  
> > > -	shrink_mem_cgroup_zone(priority, &mz, sc);
> > > +		shrink_mem_cgroup_zone(priority, &mz, sc);
> > > +		/*
> > > +		 * Limit reclaim has historically picked one memcg and
> > > +		 * scanned it with decreasing priority levels until
> > > +		 * nr_to_reclaim had been reclaimed.  This priority
> > > +		 * cycle is thus over after a single memcg.
> > > +		 */
> > > +		if (!global_reclaim(sc)) {
> > 
> > How can we have global_reclaim(sc) == true here?
> 
> We can't yet, but will when global reclaim and limit reclaim use that
> same loop a patch or two later.  

OK, I see. Still a bit tricky for review, though.

> I added it early to have an anchor for the comment above it, so that
> it's obvious how limit reclaim behaves, and that I don't have to
> change limit reclaim-specific conditions when changing how global
> reclaim works.
> 
> > Shouldn't we just check how much have we reclaimed from that group and
> > iterate only if it wasn't sufficient (at least SWAP_CLUSTER_MAX)?
> 
> I played with various exit conditions and in the end just left the
> behaviour exactly like we had it before this patch, just that the
> algorithm is now inside out.  If you want to make such a fundamental
> change, you have to prove it works :-)

OK, I see.

> 
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>

For the whole patch.
Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a7d14a5..349620c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -784,8 +784,8 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
>   * @prev: previously returned memcg, NULL on first invocation
>   * @iter: token for partial walks, NULL for full walks
>   *
> - * Returns references to children of the hierarchy starting at @root,
> - * or @root itself, or %NULL after a full round-trip.
> + * Returns references to children of the hierarchy below @root, or
> + * @root itself, or %NULL after a full round-trip.
>   *
>   * Caller must pass the return value in @prev on subsequent
>   * invocations for reference counting, or use mem_cgroup_iter_break()
> @@ -1493,7 +1493,8 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *mem,
>  		total += try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap);
>  		/*
>  		 * Avoid freeing too much when shrinking to resize the
> -		 * limit.  XXX: Shouldn't the margin check be enough?
> +		 * limit, and bail easily so that signals have a
> +		 * chance to kill the resizing task.
>  		 */
>  		if (total && (flags & MEM_CGROUP_RECLAIM_SHRINK))
>  			break;
> 
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 06/11] mm: memcg: remove optimization of keeping the root_mem_cgroup LRU lists empty
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-20 15:02     ` Michal Hocko
  -1 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2011-09-20 15:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Mon 12-09-11 12:57:23, Johannes Weiner wrote:
> root_mem_cgroup, lacking a configurable limit, was never subject to
> limit reclaim, so the pages charged to it could be kept off its LRU
> lists.  They would be found on the global per-zone LRU lists upon
> physical memory pressure and it made sense to avoid uselessly linking
> them to both lists.
> 
> The global per-zone LRU lists are about to go away on memcg-enabled
> kernels, with all pages being exclusively linked to their respective
> per-memcg LRU lists.  As a result, pages of the root_mem_cgroup must
> also be linked to its LRU lists again.

Nevertheless, we still do not charge them, so this should be mentioned
here?

> 
> The overhead is temporary until the double-LRU scheme is going away
> completely.
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>

Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/memcontrol.c |   12 ++----------
>  1 files changed, 2 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 413e1f8..518f640 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -956,8 +956,6 @@ void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru)
>  	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
>  	/* huge page split is done under lru_lock. so, we have no races. */
>  	MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
> -	if (mem_cgroup_is_root(pc->mem_cgroup))
> -		return;
>  	VM_BUG_ON(list_empty(&pc->lru));
>  	list_del_init(&pc->lru);
>  }
> @@ -982,13 +980,11 @@ void mem_cgroup_rotate_reclaimable_page(struct page *page)
>  		return;
>  
>  	pc = lookup_page_cgroup(page);
> -	/* unused or root page is not rotated. */
> +	/* unused page is not rotated. */
>  	if (!PageCgroupUsed(pc))
>  		return;
>  	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
>  	smp_rmb();
> -	if (mem_cgroup_is_root(pc->mem_cgroup))
> -		return;
>  	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
>  	list_move_tail(&pc->lru, &mz->lists[lru]);
>  }
> @@ -1002,13 +998,11 @@ void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
>  		return;
>  
>  	pc = lookup_page_cgroup(page);
> -	/* unused or root page is not rotated. */
> +	/* unused page is not rotated. */
>  	if (!PageCgroupUsed(pc))
>  		return;
>  	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
>  	smp_rmb();
> -	if (mem_cgroup_is_root(pc->mem_cgroup))
> -		return;
>  	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
>  	list_move(&pc->lru, &mz->lists[lru]);
>  }
> @@ -1040,8 +1034,6 @@ void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
>  	/* huge page split is done under lru_lock. so, we have no races. */
>  	MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
>  	SetPageCgroupAcctLRU(pc);
> -	if (mem_cgroup_is_root(pc->mem_cgroup))
> -		return;
>  	list_add(&pc->lru, &mz->lists[lru]);
>  }
>  
> -- 
> 1.7.6
> 

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 07/11] mm: vmscan: convert unevictable page rescue scanner to per-memcg LRU lists
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-21 12:33     ` Michal Hocko
  -1 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2011-09-21 12:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Mon 12-09-11 12:57:24, Johannes Weiner wrote:
> The global per-zone LRU lists are about to go away on memcg-enabled
> kernels, the unevictable page rescue scanner must be able to find its
> pages on the per-memcg LRU lists.
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>

The patch is correct but I guess the original implementation of
scan_zone_unevictable_pages is buggy (see below). This should be
addressed separately, though.

Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
>  include/linux/memcontrol.h |    3 ++
>  mm/memcontrol.c            |   11 ++++++++
>  mm/vmscan.c                |   61 ++++++++++++++++++++++++++++---------------
>  3 files changed, 54 insertions(+), 21 deletions(-)
> 
[...]
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
[...]
> @@ -3490,32 +3501,40 @@ void scan_mapping_unevictable_pages(struct address_space *mapping)
>  #define SCAN_UNEVICTABLE_BATCH_SIZE 16UL /* arbitrary lock hold batch size */
>  static void scan_zone_unevictable_pages(struct zone *zone)
>  {
> -	struct list_head *l_unevictable = &zone->lru[LRU_UNEVICTABLE].list;
> -	unsigned long scan;
> -	unsigned long nr_to_scan = zone_page_state(zone, NR_UNEVICTABLE);
> -
> -	while (nr_to_scan > 0) {
> -		unsigned long batch_size = min(nr_to_scan,
> -						SCAN_UNEVICTABLE_BATCH_SIZE);
> -
> -		spin_lock_irq(&zone->lru_lock);
> -		for (scan = 0;  scan < batch_size; scan++) {
> -			struct page *page = lru_to_page(l_unevictable);
> +	struct mem_cgroup *mem;
>  
> -			if (!trylock_page(page))
> -				continue;
> +	mem = mem_cgroup_iter(NULL, NULL, NULL);
> +	do {
> +		struct mem_cgroup_zone mz = {
> +			.mem_cgroup = mem,
> +			.zone = zone,
> +		};
> +		unsigned long nr_to_scan;
>  
> -			prefetchw_prev_lru_page(page, l_unevictable, flags);
> +		nr_to_scan = zone_nr_lru_pages(&mz, LRU_UNEVICTABLE);
> +		while (nr_to_scan > 0) {
> +			unsigned long batch_size;
> +			unsigned long scan;
>  
> -			if (likely(PageLRU(page) && PageUnevictable(page)))
> -				check_move_unevictable_page(page, zone);
> +			batch_size = min(nr_to_scan,
> +					 SCAN_UNEVICTABLE_BATCH_SIZE);
> +			spin_lock_irq(&zone->lru_lock);
> +			for (scan = 0; scan < batch_size; scan++) {
> +				struct page *page;
>  
> -			unlock_page(page);
> +				page = lru_tailpage(&mz, LRU_UNEVICTABLE);
> +				if (!trylock_page(page))
> +					continue;

We are not moving on to the next page, so we will try it again in the
next round even though we have already increased the scan count. In the
end we will miss some pages.

> +				if (likely(PageLRU(page) &&
> +					   PageUnevictable(page)))
> +					check_move_unevictable_page(page, zone);
> +				unlock_page(page);
> +			}
> +			spin_unlock_irq(&zone->lru_lock);
> +			nr_to_scan -= batch_size;
>  		}
> -		spin_unlock_irq(&zone->lru_lock);
> -
> -		nr_to_scan -= batch_size;
> -	}
> +		mem = mem_cgroup_iter(NULL, mem, NULL);
> +	} while (mem);
>  }
>  
>  
> -- 
> 1.7.6
> 

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 08/11] mm: vmscan: convert global reclaim to per-memcg LRU lists
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-21 13:10     ` Michal Hocko
  -1 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2011-09-21 13:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Mon 12-09-11 12:57:25, Johannes Weiner wrote:
> The global per-zone LRU lists are about to go away on memcg-enabled
> kernels, global reclaim must be able to find its pages on the
> per-memcg LRU lists.
> 
> Since the LRU pages of a zone are distributed over all existing memory
> cgroups, a scan target for a zone is complete when all memory cgroups
> are scanned for their proportional share of a zone's memory.
> 
> The forced scanning of small scan targets from kswapd is limited to
> zones marked unreclaimable, otherwise kswapd can quickly overreclaim
> by force-scanning the LRU lists of multiple memory cgroups.
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>

Reviewed-by: Michal Hocko <mhocko@suse.cz>

Minor nit below

> ---
>  mm/vmscan.c |   39 ++++++++++++++++++++++-----------------
>  1 files changed, 22 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bb4d8b8..053609e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2451,13 +2445,24 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>  static void age_active_anon(struct zone *zone, struct scan_control *sc,
>  			    int priority)
>  {
> -	struct mem_cgroup_zone mz = {
> -		.mem_cgroup = NULL,
> -		.zone = zone,
> -	};
> +	struct mem_cgroup *mem;
> +
> +	if (!total_swap_pages)
> +		return;
> +
> +	mem = mem_cgroup_iter(NULL, NULL, NULL);

Wouldn't for_each_mem_cgroup be more appropriate? The macro is not
exported, but it is probably worth exporting. The same applies to
scan_zone_unevictable_pages from the previous patch.
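
Something along these lines is what I have in mind (untested sketch, the
exact name and placement are of course up for debate):

#define for_each_mem_cgroup(iter)				\
	for (iter = mem_cgroup_iter(NULL, NULL, NULL);		\
	     iter != NULL;					\
	     iter = mem_cgroup_iter(NULL, iter, NULL))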

> +	do {
> +		struct mem_cgroup_zone mz = {
> +			.mem_cgroup = mem,
> +			.zone = zone,
> +		};
>  
> -	if (inactive_anon_is_low(&mz))
> -		shrink_active_list(SWAP_CLUSTER_MAX, &mz, sc, priority, 0);
> +		if (inactive_anon_is_low(&mz))
> +			shrink_active_list(SWAP_CLUSTER_MAX, &mz,
> +					   sc, priority, 0);
> +
> +		mem = mem_cgroup_iter(NULL, mem, NULL);
> +	} while (mem);
>  }

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 09/11] mm: collect LRU list heads into struct lruvec
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-21 13:43     ` Michal Hocko
  -1 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2011-09-21 13:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Mon 12-09-11 12:57:26, Johannes Weiner wrote:
> Having a unified structure with a LRU list set for both global zones
> and per-memcg zones allows to keep that code simple which deals with
> LRU lists and does not care about the container itself.
> 
> Once the per-memcg LRU lists directly link struct pages, the isolation
> function and all other list manipulations are shared between the memcg
> case and the global LRU case.
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>

Thanks for splitting this off from the other patch. Much easier to review now.
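
If I read it right, the unified structure boils down to just

	struct lruvec {
		struct list_head lists[NR_LRU_LISTS];
	};

embedded in both struct zone and the per-memcg per-zone info, which is
what keeps the list-manipulation code container-agnostic.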

Reviewed-by: Michal Hocko <mhocko@suse.cz>

Small nit below
> ---
>  include/linux/mm_inline.h |    2 +-
>  include/linux/mmzone.h    |   10 ++++++----
>  mm/memcontrol.c           |   19 ++++++++-----------
>  mm/page_alloc.c           |    2 +-
>  mm/swap.c                 |   11 +++++------
>  mm/vmscan.c               |   12 ++++++------
>  6 files changed, 27 insertions(+), 29 deletions(-)
> 
[...]
> diff --git a/mm/swap.c b/mm/swap.c
> index 3a442f1..66e8292 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
[...]
> @@ -639,7 +639,6 @@ void lru_add_page_tail(struct zone* zone,
>  	int active;
>  	enum lru_list lru;
>  	const int file = 0;
> -	struct list_head *head;
>  
>  	VM_BUG_ON(!PageHead(page));
>  	VM_BUG_ON(PageCompound(page_tail));
> @@ -659,10 +658,10 @@ void lru_add_page_tail(struct zone* zone,
>  		}
>  		update_page_reclaim_stat(zone, page_tail, file, active);
>  		if (likely(PageLRU(page)))
> -			head = page->lru.prev;
> +			__add_page_to_lru_list(zone, page_tail, lru,
> +					       page->lru.prev);

{ } around multiline __add_page_to_lru_list?
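
i.e. something like:

		if (likely(PageLRU(page))) {
			__add_page_to_lru_list(zone, page_tail, lru,
					       page->lru.prev);
		}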

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 07/11] mm: vmscan: convert unevictable page rescue scanner to per-memcg LRU lists
  2011-09-21 12:33     ` Michal Hocko
@ 2011-09-21 13:47       ` Johannes Weiner
  -1 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-21 13:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Wed, Sep 21, 2011 at 02:33:56PM +0200, Michal Hocko wrote:
> On Mon 12-09-11 12:57:24, Johannes Weiner wrote:
> > The global per-zone LRU lists are about to go away on memcg-enabled
> > kernels, the unevictable page rescue scanner must be able to find its
> > pages on the per-memcg LRU lists.
> > 
> > Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> 
> The patch is correct but I guess the original implementation of
> scan_zone_unevictable_pages is buggy (see bellow). This should be
> addressed separatelly, though.
> 
> Reviewed-by: Michal Hocko <mhocko@suse.cz>

Thanks for your effort, Michal, I really appreciate it.

> > @@ -3490,32 +3501,40 @@ void scan_mapping_unevictable_pages(struct address_space *mapping)
> >  #define SCAN_UNEVICTABLE_BATCH_SIZE 16UL /* arbitrary lock hold batch size */
> >  static void scan_zone_unevictable_pages(struct zone *zone)
> >  {
> > -	struct list_head *l_unevictable = &zone->lru[LRU_UNEVICTABLE].list;
> > -	unsigned long scan;
> > -	unsigned long nr_to_scan = zone_page_state(zone, NR_UNEVICTABLE);
> > -
> > -	while (nr_to_scan > 0) {
> > -		unsigned long batch_size = min(nr_to_scan,
> > -						SCAN_UNEVICTABLE_BATCH_SIZE);
> > -
> > -		spin_lock_irq(&zone->lru_lock);
> > -		for (scan = 0;  scan < batch_size; scan++) {
> > -			struct page *page = lru_to_page(l_unevictable);
> > +	struct mem_cgroup *mem;
> >  
> > -			if (!trylock_page(page))
> > -				continue;
> > +	mem = mem_cgroup_iter(NULL, NULL, NULL);
> > +	do {
> > +		struct mem_cgroup_zone mz = {
> > +			.mem_cgroup = mem,
> > +			.zone = zone,
> > +		};
> > +		unsigned long nr_to_scan;
> >  
> > -			prefetchw_prev_lru_page(page, l_unevictable, flags);
> > +		nr_to_scan = zone_nr_lru_pages(&mz, LRU_UNEVICTABLE);
> > +		while (nr_to_scan > 0) {
> > +			unsigned long batch_size;
> > +			unsigned long scan;
> >  
> > -			if (likely(PageLRU(page) && PageUnevictable(page)))
> > -				check_move_unevictable_page(page, zone);
> > +			batch_size = min(nr_to_scan,
> > +					 SCAN_UNEVICTABLE_BATCH_SIZE);
> > +			spin_lock_irq(&zone->lru_lock);
> > +			for (scan = 0; scan < batch_size; scan++) {
> > +				struct page *page;
> >  
> > -			unlock_page(page);
> > +				page = lru_tailpage(&mz, LRU_UNEVICTABLE);
> > +				if (!trylock_page(page))
> > +					continue;
> 
> We are not moving to the next page so we will try it again in the next
> round while we already increased the scan count. In the end we will
> missed some pages.

I guess this is about latency.  This code is only executed when the
user explicitly requests it by writing to a proc file; see the comment
above scan_all_zones_unevictable_pages.  I think at one point Lee wanted
to move anon pages to the unevictable LRU when no swap is configured,
but we have separate anon LRUs now that are not scanned without swap,
and except for bugs there is no actual need to move these pages by
hand, let alone reliably for every single page.
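
(Nothing triggers this in normal operation; it only runs on an explicit
request along the lines of

	# echo 1 > /proc/sys/vm/scan_unevictable_pages

if I remember the knob correctly.)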

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 08/11] mm: vmscan: convert global reclaim to per-memcg LRU lists
  2011-09-21 13:10     ` Michal Hocko
@ 2011-09-21 13:51       ` Johannes Weiner
  -1 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-21 13:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Wed, Sep 21, 2011 at 03:10:45PM +0200, Michal Hocko wrote:
> On Mon 12-09-11 12:57:25, Johannes Weiner wrote:
> > The global per-zone LRU lists are about to go away on memcg-enabled
> > kernels, global reclaim must be able to find its pages on the
> > per-memcg LRU lists.
> > 
> > Since the LRU pages of a zone are distributed over all existing memory
> > cgroups, a scan target for a zone is complete when all memory cgroups
> > are scanned for their proportional share of a zone's memory.
> > 
> > The forced scanning of small scan targets from kswapd is limited to
> > zones marked unreclaimable, otherwise kswapd can quickly overreclaim
> > by force-scanning the LRU lists of multiple memory cgroups.
> > 
> > Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> 
> Reviewed-by: Michal Hocko <mhocko@suse.cz>

Thanks

> Minor nit bellow

> > @@ -2451,13 +2445,24 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> >  static void age_active_anon(struct zone *zone, struct scan_control *sc,
> >  			    int priority)
> >  {
> > -	struct mem_cgroup_zone mz = {
> > -		.mem_cgroup = NULL,
> > -		.zone = zone,
> > -	};
> > +	struct mem_cgroup *mem;
> > +
> > +	if (!total_swap_pages)
> > +		return;
> > +
> > +	mem = mem_cgroup_iter(NULL, NULL, NULL);
> 
> Wouldn't be for_each_mem_cgroup more appropriate? Macro is not exported
> but probably worth exporting? The same applies for
> scan_zone_unevictable_pages from the previous patch.

Unfortunately, in generic code, these loops need to be laid out like
this for !CONFIG_MEMCG to do the right thing: mem_cgroup_iter() will
return NULL and the loop has to execute exactly once.
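
For reference, the !CONFIG_MEMCG side is just a stub along these lines
(sketch, not the literal header):

	static inline struct mem_cgroup *
	mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup *prev,
			struct mem_cgroup_iter *iter)
	{
		return NULL;
	}

so the do { } while (mem) layout still makes one pass for the global
LRU lists, while a for_each-style loop with an up-front NULL test would
not execute the body at all.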

This is something that will go away once we implement Christoph's
suggestion of always having a (skeleton) root_mem_cgroup around, even
for !CONFIG_MEMCG.

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 08/11] mm: vmscan: convert global reclaim to per-memcg LRU lists
  2011-09-21 13:51       ` Johannes Weiner
@ 2011-09-21 13:57         ` Michal Hocko
  -1 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2011-09-21 13:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Wed 21-09-11 15:51:43, Johannes Weiner wrote:
> On Wed, Sep 21, 2011 at 03:10:45PM +0200, Michal Hocko wrote:
> > On Mon 12-09-11 12:57:25, Johannes Weiner wrote:
[...]
> > > @@ -2451,13 +2445,24 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> > >  static void age_active_anon(struct zone *zone, struct scan_control *sc,
> > >  			    int priority)
> > >  {
> > > -	struct mem_cgroup_zone mz = {
> > > -		.mem_cgroup = NULL,
> > > -		.zone = zone,
> > > -	};
> > > +	struct mem_cgroup *mem;
> > > +
> > > +	if (!total_swap_pages)
> > > +		return;
> > > +
> > > +	mem = mem_cgroup_iter(NULL, NULL, NULL);
> > 
> > Wouldn't be for_each_mem_cgroup more appropriate? Macro is not exported
> > but probably worth exporting? The same applies for
> > scan_zone_unevictable_pages from the previous patch.
> 
> Unfortunately, in generic code, these loops need to be layed out like
> this for !CONFIG_MEMCG to do the right thing.  mem_cgroup_iter() will
> return NULL and the loop has to execute exactly once.

Ahh, right you are. I had missed that.

> 
> This is something that will go away once we implement Christoph's
> suggestion of always having a (skeleton) root_mem_cgroup around, even
> for !CONFIG_MEMCG.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 07/11] mm: vmscan: convert unevictable page rescue scanner to per-memcg LRU lists
  2011-09-21 13:47       ` Johannes Weiner
@ 2011-09-21 14:08         ` Michal Hocko
  -1 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2011-09-21 14:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Wed 21-09-11 15:47:51, Johannes Weiner wrote:
> On Wed, Sep 21, 2011 at 02:33:56PM +0200, Michal Hocko wrote:
> > On Mon 12-09-11 12:57:24, Johannes Weiner wrote:
> > > The global per-zone LRU lists are about to go away on memcg-enabled
> > > kernels, the unevictable page rescue scanner must be able to find its
> > > pages on the per-memcg LRU lists.
> > > 
> > > Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> > 
> > The patch is correct but I guess the original implementation of
> > scan_zone_unevictable_pages is buggy (see bellow). This should be
> > addressed separatelly, though.
> > 
> > Reviewed-by: Michal Hocko <mhocko@suse.cz>
> 
> Thanks for your effort, Michal, I really appreciate it.

You're welcome. You've really done a good job, so it is not that hard to
review.

> 
> > > @@ -3490,32 +3501,40 @@ void scan_mapping_unevictable_pages(struct address_space *mapping)
> > >  #define SCAN_UNEVICTABLE_BATCH_SIZE 16UL /* arbitrary lock hold batch size */
> > >  static void scan_zone_unevictable_pages(struct zone *zone)
> > >  {
> > > -	struct list_head *l_unevictable = &zone->lru[LRU_UNEVICTABLE].list;
> > > -	unsigned long scan;
> > > -	unsigned long nr_to_scan = zone_page_state(zone, NR_UNEVICTABLE);
> > > -
> > > -	while (nr_to_scan > 0) {
> > > -		unsigned long batch_size = min(nr_to_scan,
> > > -						SCAN_UNEVICTABLE_BATCH_SIZE);
> > > -
> > > -		spin_lock_irq(&zone->lru_lock);
> > > -		for (scan = 0;  scan < batch_size; scan++) {
> > > -			struct page *page = lru_to_page(l_unevictable);
> > > +	struct mem_cgroup *mem;
> > >  
> > > -			if (!trylock_page(page))
> > > -				continue;
> > > +	mem = mem_cgroup_iter(NULL, NULL, NULL);
> > > +	do {
> > > +		struct mem_cgroup_zone mz = {
> > > +			.mem_cgroup = mem,
> > > +			.zone = zone,
> > > +		};
> > > +		unsigned long nr_to_scan;
> > >  
> > > -			prefetchw_prev_lru_page(page, l_unevictable, flags);
> > > +		nr_to_scan = zone_nr_lru_pages(&mz, LRU_UNEVICTABLE);
> > > +		while (nr_to_scan > 0) {
> > > +			unsigned long batch_size;
> > > +			unsigned long scan;
> > >  
> > > -			if (likely(PageLRU(page) && PageUnevictable(page)))
> > > -				check_move_unevictable_page(page, zone);
> > > +			batch_size = min(nr_to_scan,
> > > +					 SCAN_UNEVICTABLE_BATCH_SIZE);
> > > +			spin_lock_irq(&zone->lru_lock);
> > > +			for (scan = 0; scan < batch_size; scan++) {
> > > +				struct page *page;
> > >  
> > > -			unlock_page(page);
> > > +				page = lru_tailpage(&mz, LRU_UNEVICTABLE);
> > > +				if (!trylock_page(page))
> > > +					continue;
> > 
> > We are not moving to the next page, so we will try it again in the next
> > round even though we already increased the scan count. In the end we
> > will miss some pages.
> 
> I guess this is about latency.  This code is only executed when the
> user requests it by writing to a proc file; check the comment above
> scan_all_zones_unevictable_pages. I think at one point Lee wanted to
> move anon pages to the unevictable LRU when no swap is configured, but
> we have separate anon LRUs now that are not scanned without swap, and
> I think except for bugs there is no actual need to move these pages by
> hand, let alone reliably every single page.

OK, fair point. Probably not worth fixing (I will put it on my TODO list
with a low priority).
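
For the record, a hypothetical sketch of the direction such a fix could
take (not part of the posted series): instead of retrying the same tail
page after a failed trylock, rotate it to the head of the list so the
rest of the batch looks at different pages.  Here "list" stands for the
per-memcg unevictable list that lru_tailpage() reads from; that name is
assumed for illustration only, and zone->lru_lock is already held at
this point in the loop quoted above.

			page = lru_tailpage(&mz, LRU_UNEVICTABLE);
			if (!trylock_page(page)) {
				/*
				 * Rotate away from the tail so the next
				 * iteration picks a different page.
				 */
				list_move(&page->lru, list);
				continue;
			}
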
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 09/11] mm: collect LRU list heads into struct lruvec
  2011-09-21 13:43     ` Michal Hocko
@ 2011-09-21 15:15       ` Michal Hocko
  -1 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2011-09-21 15:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Wed 21-09-11 15:43:23, Michal Hocko wrote:
> On Mon 12-09-11 12:57:26, Johannes Weiner wrote:
[...]
> > @@ -659,10 +658,10 @@ void lru_add_page_tail(struct zone* zone,
> >  		}
> >  		update_page_reclaim_stat(zone, page_tail, file, active);
> >  		if (likely(PageLRU(page)))
> > -			head = page->lru.prev;
> > +			__add_page_to_lru_list(zone, page_tail, lru,
> > +					       page->lru.prev);
> 
> { } around multiline __add_page_to_lru_list?

Ah, the code is removed in the next patch. Sorry for the noise.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 10/11] mm: make per-memcg LRU lists exclusive
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-21 15:24     ` Michal Hocko
  -1 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2011-09-21 15:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Mon 12-09-11 12:57:27, Johannes Weiner wrote:
> Now that all code that operated on global per-zone LRU lists is
> converted to operate on per-memory cgroup LRU lists instead, there is
> no reason to keep the double-LRU scheme around any longer.
> 
> The pc->lru member is removed and page->lru is linked directly to the
> per-memory cgroup LRU lists, which removes two pointers from a
> descriptor that exists for every page frame in the system.
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Ying Han <yinghan@google.com>

Reviewed-by: Michal Hocko <mhocko@suse.cz>

Minor comments/questions below.
> ---
>  include/linux/memcontrol.h  |   54 +++-----
>  include/linux/mm_inline.h   |   21 +--
>  include/linux/page_cgroup.h |    1 -
>  mm/memcontrol.c             |  319 ++++++++++++++++++++-----------------------
>  mm/page_cgroup.c            |    1 -
>  mm/swap.c                   |   23 ++-
>  mm/vmscan.c                 |   81 +++++-------
>  7 files changed, 228 insertions(+), 272 deletions(-)
> 
[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 465001c..a7d14a5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
[...]
> @@ -934,115 +954,123 @@ EXPORT_SYMBOL(mem_cgroup_count_vm_event);
>   * When moving account, the page is not on LRU. It's isolated.
>   */
>  
> -struct page *mem_cgroup_lru_to_page(struct zone *zone, struct mem_cgroup *mem,
> -				    enum lru_list lru)
> +/**
> + * mem_cgroup_lru_add_list - account for adding an lru page and return lruvec
> + * @zone: zone of the page
> + * @page: the page
> + * @lru: current lru
> + *
> + * This function accounts for @page being added to @lru, and returns
> + * the lruvec for the given @zone and the memcg @page is charged to.
> + *
> + * The callsite is then responsible for physically linking the page to
> + * the returned lruvec->lists[@lru].
> + */
> +struct lruvec *mem_cgroup_lru_add_list(struct zone *zone, struct page *page,
> +				       enum lru_list lru)

I know that names are always tricky, but what about mem_cgroup_acct_lru_add?
Analogously for mem_cgroup_lru_del_list, mem_cgroup_lru_del and
mem_cgroup_lru_move_lists.

[...]
> @@ -3615,11 +3593,11 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
>  				int node, int zid, enum lru_list lru)
>  {
> -	struct zone *zone;
>  	struct mem_cgroup_per_zone *mz;
> -	struct page_cgroup *pc, *busy;
>  	unsigned long flags, loop;
>  	struct list_head *list;
> +	struct page *busy;
> +	struct zone *zone;

Any specific reason to move zone declaration down here? Not that it
matters much. Just curious.

>  	int ret = 0;
>  
>  	zone = &NODE_DATA(node)->node_zones[zid];
> @@ -3639,16 +3618,16 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
>  			spin_unlock_irqrestore(&zone->lru_lock, flags);
>  			break;
>  		}
> -		pc = list_entry(list->prev, struct page_cgroup, lru);
> -		if (busy == pc) {
> -			list_move(&pc->lru, list);
> +		page = list_entry(list->prev, struct page, lru);
> +		if (busy == page) {
> +			list_move(&page->lru, list);
>  			busy = NULL;
>  			spin_unlock_irqrestore(&zone->lru_lock, flags);
>  			continue;
>  		}
>  		spin_unlock_irqrestore(&zone->lru_lock, flags);
>  
> -		page = lookup_cgroup_page(pc);
> +		pc = lookup_page_cgroup(page);

lookup_page_cgroup might return NULL so we probably want BUG_ON(!pc)
here. We are not very consistent about checking the return value,
though.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 11/11] mm: memcg: remove unused node/section info from pc->flags
  2011-09-12 10:57   ` Johannes Weiner
@ 2011-09-21 15:32     ` Michal Hocko
  -1 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2011-09-21 15:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Mon 12-09-11 12:57:28, Johannes Weiner wrote:
> To find the page corresponding to a certain page_cgroup, pc->flags
> encoded the node or section ID identifying the base array to compare
> the pc pointer against.
> 
> Now that the per-memory cgroup LRU lists link page descriptors
> directly, there is no longer any code that knows the page_cgroup but
> not the page.
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>

Nice.
Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
>  include/linux/page_cgroup.h |   33 ------------------------
>  mm/page_cgroup.c            |   58 ++++++-------------------------------------
>  2 files changed, 8 insertions(+), 83 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 5bae753..aaa60da 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -121,39 +121,6 @@ static inline void move_unlock_page_cgroup(struct page_cgroup *pc,
>  	local_irq_restore(*flags);
>  }
>  
> -#ifdef CONFIG_SPARSEMEM
> -#define PCG_ARRAYID_WIDTH	SECTIONS_SHIFT
> -#else
> -#define PCG_ARRAYID_WIDTH	NODES_SHIFT
> -#endif
> -
> -#if (PCG_ARRAYID_WIDTH > BITS_PER_LONG - NR_PCG_FLAGS)
> -#error Not enough space left in pc->flags to store page_cgroup array IDs
> -#endif
> -
> -/* pc->flags: ARRAY-ID | FLAGS */
> -
> -#define PCG_ARRAYID_MASK	((1UL << PCG_ARRAYID_WIDTH) - 1)
> -
> -#define PCG_ARRAYID_OFFSET	(BITS_PER_LONG - PCG_ARRAYID_WIDTH)
> -/*
> - * Zero the shift count for non-existent fields, to prevent compiler
> - * warnings and ensure references are optimized away.
> - */
> -#define PCG_ARRAYID_SHIFT	(PCG_ARRAYID_OFFSET * (PCG_ARRAYID_WIDTH != 0))
> -
> -static inline void set_page_cgroup_array_id(struct page_cgroup *pc,
> -					    unsigned long id)
> -{
> -	pc->flags &= ~(PCG_ARRAYID_MASK << PCG_ARRAYID_SHIFT);
> -	pc->flags |= (id & PCG_ARRAYID_MASK) << PCG_ARRAYID_SHIFT;
> -}
> -
> -static inline unsigned long page_cgroup_array_id(struct page_cgroup *pc)
> -{
> -	return (pc->flags >> PCG_ARRAYID_SHIFT) & PCG_ARRAYID_MASK;
> -}
> -
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct page_cgroup;
>  
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index 256dee8..2601a65 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -11,12 +11,6 @@
>  #include <linux/swapops.h>
>  #include <linux/kmemleak.h>
>  
> -static void __meminit init_page_cgroup(struct page_cgroup *pc, unsigned long id)
> -{
> -	pc->flags = 0;
> -	set_page_cgroup_array_id(pc, id);
> -	pc->mem_cgroup = NULL;
> -}
>  static unsigned long total_usage;
>  
>  #if !defined(CONFIG_SPARSEMEM)
> @@ -41,24 +35,11 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
>  	return base + offset;
>  }
>  
> -struct page *lookup_cgroup_page(struct page_cgroup *pc)
> -{
> -	unsigned long pfn;
> -	struct page *page;
> -	pg_data_t *pgdat;
> -
> -	pgdat = NODE_DATA(page_cgroup_array_id(pc));
> -	pfn = pc - pgdat->node_page_cgroup + pgdat->node_start_pfn;
> -	page = pfn_to_page(pfn);
> -	VM_BUG_ON(pc != lookup_page_cgroup(page));
> -	return page;
> -}
> -
>  static int __init alloc_node_page_cgroup(int nid)
>  {
> -	struct page_cgroup *base, *pc;
> +	struct page_cgroup *base;
>  	unsigned long table_size;
> -	unsigned long start_pfn, nr_pages, index;
> +	unsigned long nr_pages;
>  
>  	start_pfn = NODE_DATA(nid)->node_start_pfn;
>  	nr_pages = NODE_DATA(nid)->node_spanned_pages;
> @@ -72,10 +53,6 @@ static int __init alloc_node_page_cgroup(int nid)
>  			table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
>  	if (!base)
>  		return -ENOMEM;
> -	for (index = 0; index < nr_pages; index++) {
> -		pc = base + index;
> -		init_page_cgroup(pc, nid);
> -	}
>  	NODE_DATA(nid)->node_page_cgroup = base;
>  	total_usage += table_size;
>  	return 0;
> @@ -116,31 +93,19 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
>  	return section->page_cgroup + pfn;
>  }
>  
> -struct page *lookup_cgroup_page(struct page_cgroup *pc)
> -{
> -	struct mem_section *section;
> -	struct page *page;
> -	unsigned long nr;
> -
> -	nr = page_cgroup_array_id(pc);
> -	section = __nr_to_section(nr);
> -	page = pfn_to_page(pc - section->page_cgroup);
> -	VM_BUG_ON(pc != lookup_page_cgroup(page));
> -	return page;
> -}
> -
>  static void *__meminit alloc_page_cgroup(size_t size, int nid)
>  {
>  	void *addr = NULL;
>  
> -	addr = alloc_pages_exact_nid(nid, size, GFP_KERNEL | __GFP_NOWARN);
> +	addr = alloc_pages_exact_nid(nid, size,
> +				     GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN);
>  	if (addr)
>  		return addr;
>  
>  	if (node_state(nid, N_HIGH_MEMORY))
> -		addr = vmalloc_node(size, nid);
> +		addr = vzalloc_node(size, nid);
>  	else
> -		addr = vmalloc(size);
> +		addr = vzalloc(size);
>  
>  	return addr;
>  }
> @@ -163,14 +128,11 @@ static void free_page_cgroup(void *addr)
>  
>  static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
>  {
> -	struct page_cgroup *base, *pc;
>  	struct mem_section *section;
> +	struct page_cgroup *base;
>  	unsigned long table_size;
> -	unsigned long nr;
> -	int index;
>  
> -	nr = pfn_to_section_nr(pfn);
> -	section = __nr_to_section(nr);
> +	section = __pfn_to_section(pfn);
>  
>  	if (section->page_cgroup)
>  		return 0;
> @@ -190,10 +152,6 @@ static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
>  		return -ENOMEM;
>  	}
>  
> -	for (index = 0; index < PAGES_PER_SECTION; index++) {
> -		pc = base + index;
> -		init_page_cgroup(pc, nr);
> -	}
>  	/*
>  	 * The passed "pfn" may not be aligned to SECTION.  For the calculation
>  	 * we need to apply a mask.
> -- 
> 1.7.6
> 

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 10/11] mm: make per-memcg LRU lists exclusive
  2011-09-21 15:24     ` Michal Hocko
@ 2011-09-21 15:47       ` Johannes Weiner
  -1 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-21 15:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Wed, Sep 21, 2011 at 05:24:58PM +0200, Michal Hocko wrote:
> On Mon 12-09-11 12:57:27, Johannes Weiner wrote:
> > Now that all code that operated on global per-zone LRU lists is
> > converted to operate on per-memory cgroup LRU lists instead, there is
> > no reason to keep the double-LRU scheme around any longer.
> > 
> > The pc->lru member is removed and page->lru is linked directly to the
> > per-memory cgroup LRU lists, which removes two pointers from a
> > descriptor that exists for every page frame in the system.
> > 
> > Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > Signed-off-by: Ying Han <yinghan@google.com>
> 
> Reviewed-by: Michal Hocko <mhocko@suse.cz>

Thanks.

> > @@ -934,115 +954,123 @@ EXPORT_SYMBOL(mem_cgroup_count_vm_event);
> >   * When moving account, the page is not on LRU. It's isolated.
> >   */
> >  
> > -struct page *mem_cgroup_lru_to_page(struct zone *zone, struct mem_cgroup *mem,
> > -				    enum lru_list lru)
> > +/**
> > + * mem_cgroup_lru_add_list - account for adding an lru page and return lruvec
> > + * @zone: zone of the page
> > + * @page: the page
> > + * @lru: current lru
> > + *
> > + * This function accounts for @page being added to @lru, and returns
> > + * the lruvec for the given @zone and the memcg @page is charged to.
> > + *
> > + * The callsite is then responsible for physically linking the page to
> > + * the returned lruvec->lists[@lru].
> > + */
> > +struct lruvec *mem_cgroup_lru_add_list(struct zone *zone, struct page *page,
> > +				       enum lru_list lru)
> 
> > I know that names are always tricky, but what about mem_cgroup_acct_lru_add?
> Analogously for mem_cgroup_lru_del_list, mem_cgroup_lru_del and
> mem_cgroup_lru_move_lists.

Hmm, but it doesn't just lru-account, it also looks up the right
lruvec for the caller to link the page to, so it's not necessarily an
improvement, although I agree that the name could be better.
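
For reference, a minimal sketch of the caller pattern that the kerneldoc
comment describes (illustrative only; the helper name
example_add_page_to_lru is made up): the function does the accounting
and hands back the lruvec, and the callsite does the physical linking.

	static void example_add_page_to_lru(struct zone *zone,
					    struct page *page,
					    enum lru_list lru)
	{
		struct lruvec *lruvec;

		/* account the page and get the lruvec it belongs to */
		lruvec = mem_cgroup_lru_add_list(zone, page, lru);
		/* physically link it, as the kerneldoc asks the caller to */
		list_add(&page->lru, &lruvec->lists[lru]);
	}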

> > @@ -3615,11 +3593,11 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> >  static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
> >  				int node, int zid, enum lru_list lru)
> >  {
> > -	struct zone *zone;
> >  	struct mem_cgroup_per_zone *mz;
> > -	struct page_cgroup *pc, *busy;
> >  	unsigned long flags, loop;
> >  	struct list_head *list;
> > +	struct page *busy;
> > +	struct zone *zone;
> 
> Any specific reason to move zone declaration down here? Not that it
> matters much. Just curious.

I find this arrangement more readable, I believe Ingo Molnar called it
the reverse christmas tree once :-).  Longest lines first, then sort
lines of equal length alphabetically.

And since it was basically complete, except for @zone, I just HAD to!
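
For reference, the resulting declaration block in the hunk above then
reads (longest lines first, ties sorted alphabetically):

	struct mem_cgroup_per_zone *mz;
	unsigned long flags, loop;
	struct list_head *list;
	struct page *busy;
	struct zone *zone;
	int ret = 0;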

> > @@ -3639,16 +3618,16 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
> >  			spin_unlock_irqrestore(&zone->lru_lock, flags);
> >  			break;
> >  		}
> > -		pc = list_entry(list->prev, struct page_cgroup, lru);
> > -		if (busy == pc) {
> > -			list_move(&pc->lru, list);
> > +		page = list_entry(list->prev, struct page, lru);
> > +		if (busy == page) {
> > +			list_move(&page->lru, list);
> >  			busy = NULL;
> >  			spin_unlock_irqrestore(&zone->lru_lock, flags);
> >  			continue;
> >  		}
> >  		spin_unlock_irqrestore(&zone->lru_lock, flags);
> >  
> > -		page = lookup_cgroup_page(pc);
> > +		pc = lookup_page_cgroup(page);
> 
> lookup_page_cgroup might return NULL so we probably want BUG_ON(!pc)
> here. We are not very consistent about checking the return value,
> though.

I think this is a myth and we should remove all those checks.  How can
pages circulate in userspace before they are fully onlined and their
page_cgroup buddies allocated?  In this case: how would they have been
charged in the first place and sit on a list without a list_head? :-)

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 10/11] mm: make per-memcg LRU lists exclusive
  2011-09-21 15:47       ` Johannes Weiner
@ 2011-09-21 16:05         ` Michal Hocko
  -1 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2011-09-21 16:05 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Wed 21-09-11 17:47:45, Johannes Weiner wrote:
> On Wed, Sep 21, 2011 at 05:24:58PM +0200, Michal Hocko wrote:
> > On Mon 12-09-11 12:57:27, Johannes Weiner wrote:
[...]
> > > @@ -934,115 +954,123 @@ EXPORT_SYMBOL(mem_cgroup_count_vm_event);
> > >   * When moving account, the page is not on LRU. It's isolated.
> > >   */
> > >  
> > > -struct page *mem_cgroup_lru_to_page(struct zone *zone, struct mem_cgroup *mem,
> > > -				    enum lru_list lru)
> > > +/**
> > > + * mem_cgroup_lru_add_list - account for adding an lru page and return lruvec
> > > + * @zone: zone of the page
> > > + * @page: the page
> > > + * @lru: current lru
> > > + *
> > > + * This function accounts for @page being added to @lru, and returns
> > > + * the lruvec for the given @zone and the memcg @page is charged to.
> > > + *
> > > + * The callsite is then responsible for physically linking the page to
> > > + * the returned lruvec->lists[@lru].
> > > + */
> > > +struct lruvec *mem_cgroup_lru_add_list(struct zone *zone, struct page *page,
> > > +				       enum lru_list lru)
> > 
> > I know that names are always tricky, but what about mem_cgroup_acct_lru_add?
> > Analogously for mem_cgroup_lru_del_list, mem_cgroup_lru_del and
> > mem_cgroup_lru_move_lists.
> 
> Hmm, but it doesn't just lru-account, it also looks up the right
> lruvec for the caller to link the page to, so it's not necessarily an
> improvement, although I agree that the name could be better.

Sorry, I do not have any better idea. I would just prefer it if the name
didn't suggest that we actually modify the list.

> 
> > > @@ -3615,11 +3593,11 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> > >  static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
> > >  				int node, int zid, enum lru_list lru)
> > >  {
> > > -	struct zone *zone;
> > >  	struct mem_cgroup_per_zone *mz;
> > > -	struct page_cgroup *pc, *busy;
> > >  	unsigned long flags, loop;
> > >  	struct list_head *list;
> > > +	struct page *busy;
> > > +	struct zone *zone;
> > 
> > Any specific reason to move zone declaration down here? Not that it
> > matters much. Just curious.
> 
> I find this arrangement more readable, I believe Ingo Molnar called it
> the reverse christmas tree once :-).  Longest lines first, then sort
> lines of equal length alphabetically.
> 
> And since it was basically complete, except for @zone, I just HAD to!

:)

> 
> > > @@ -3639,16 +3618,16 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
> > >  			spin_unlock_irqrestore(&zone->lru_lock, flags);
> > >  			break;
> > >  		}
> > > -		pc = list_entry(list->prev, struct page_cgroup, lru);
> > > -		if (busy == pc) {
> > > -			list_move(&pc->lru, list);
> > > +		page = list_entry(list->prev, struct page, lru);
> > > +		if (busy == page) {
> > > +			list_move(&page->lru, list);
> > >  			busy = NULL;
> > >  			spin_unlock_irqrestore(&zone->lru_lock, flags);
> > >  			continue;
> > >  		}
> > >  		spin_unlock_irqrestore(&zone->lru_lock, flags);
> > >  
> > > -		page = lookup_cgroup_page(pc);
> > > +		pc = lookup_page_cgroup(page);
> > 
> > lookup_page_cgroup might return NULL so we probably want BUG_ON(!pc)
> > here. We are not very consistent about checking the return value,
> > though.
> 
> I think this is a myth and we should remove all those checks.  How can
> pages circulate in userspace before they are fully onlined and their
> page_cgroup buddies allocated?  In this case: how would they have been
> charged in the first place and sit on a list without a list_head? :-)

Yes, that is right. This should never happen (famous last words). I can
imagine that a memory offlining bug could cause issues.

Anyway, the more appropriate way to handle that would be a BUG_ON directly
in lookup_page_cgroup.
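
Something like the following, perhaps (a sketch only, against the
SPARSEMEM variant; the function body is reconstructed from the hunks
above, and whether BUG_ON or VM_BUG_ON is more appropriate is of course
up for discussion):

	struct page_cgroup *lookup_page_cgroup(struct page *page)
	{
		unsigned long pfn = page_to_pfn(page);
		struct mem_section *section = __pfn_to_section(pfn);

		/* every onlined page is expected to have a page_cgroup */
		BUG_ON(!section->page_cgroup);
		return section->page_cgroup + pfn;
	}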

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 03/11] mm: vmscan: distinguish between memcg triggering reclaim and memcg being scanned
  2011-09-20  9:17         ` Michal Hocko
@ 2011-09-29  7:55           ` Johannes Weiner
  -1 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-29  7:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Tue, Sep 20, 2011 at 11:17:38AM +0200, Michal Hocko wrote:
> On Tue 20-09-11 10:58:11, Johannes Weiner wrote:
> > On Mon, Sep 19, 2011 at 04:29:55PM +0200, Michal Hocko wrote:
> > > On Mon 12-09-11 12:57:20, Johannes Weiner wrote:
> > > > @@ -2390,6 +2413,18 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> > > >  }
> > > >  #endif
> > > >  
> > > > +static void age_active_anon(struct zone *zone, struct scan_control *sc,
> > > > +			    int priority)
> > > > +{
> > > > +	struct mem_cgroup_zone mz = {
> > > > +		.mem_cgroup = NULL,
> > > > +		.zone = zone,
> > > > +	};
> > > > +
> > > > +	if (inactive_anon_is_low(&mz))
> > > > +		shrink_active_list(SWAP_CLUSTER_MAX, &mz, sc, priority, 0);
> > > > +}
> > > > +
> > > 
> > > I do not like this very much because we are using a similar construct in
> > > shrink_mem_cgroup_zone so we are duplicating that code. 
> > > What about adding age_mem_cgroup_active_anon (something like shrink_zone).
> > 
> > I am not sure I follow and I don't see what could be shared between
> > the zone shrinking and this as there are different exit conditions to
> > the hierarchy walk.  Can you elaborate?
> 
> Sorry for not being clear enough. Maybe it is not very important, but
> what about something like:
> 
> Index: linus_tree/mm/vmscan.c
> ===================================================================
> --- linus_tree.orig/mm/vmscan.c	2011-09-20 11:07:57.000000000 +0200
> +++ linus_tree/mm/vmscan.c	2011-09-20 11:12:53.000000000 +0200
> @@ -2041,6 +2041,13 @@ static inline bool should_continue_recla
>  	}
>  }
>  
> +static void age_mem_cgroup_active_anon(struct mem_cgroup_zone *mz,
> +		struct scan_control *sc, int priority)
> +{
> +	if (inactive_anon_is_low(mz))
> +		shrink_active_list(SWAP_CLUSTER_MAX, mz, sc, priority, 0);
> +}
> +
>  /*
>   * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
>   */
> @@ -2090,8 +2097,7 @@ restart:
>  	 * Even if we did not try to evict anon pages at all, we want to
>  	 * rebalance the anon lru active/inactive ratio.
>  	 */
> -	if (inactive_anon_is_low(mz))
> -		shrink_active_list(SWAP_CLUSTER_MAX, mz, sc, priority, 0);
> +	age_mem_cgroup_active_anon(mz, sc, priority);
>  
>  	/* reclaim/compaction might need reclaim to continue */
>  	if (should_continue_reclaim(mz, nr_reclaimed,
> @@ -2421,8 +2427,7 @@ static void age_active_anon(struct zone
>  		.zone = zone,
>  	};
>  
> -	if (inactive_anon_is_low(&mz))
> -		shrink_active_list(SWAP_CLUSTER_MAX, &mz, sc, priority, 0);
> +	age_mem_cgroup_active_anon(&mz, sc, priority);
>  }

Ahh, understood.

I think it would be an unrelated change, though.  There already are
two of those constructs, I just move one of them.

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 06/11] mm: memcg: remove optimization of keeping the root_mem_cgroup LRU lists empty
  2011-09-20 15:02     ` Michal Hocko
@ 2011-09-29  9:20       ` Johannes Weiner
  -1 siblings, 0 replies; 130+ messages in thread
From: Johannes Weiner @ 2011-09-29  9:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Tue, Sep 20, 2011 at 05:02:29PM +0200, Michal Hocko wrote:
> On Mon 12-09-11 12:57:23, Johannes Weiner wrote:
> > root_mem_cgroup, lacking a configurable limit, was never subject to
> > limit reclaim, so the pages charged to it could be kept off its LRU
> > lists.  They would be found on the global per-zone LRU lists upon
> > physical memory pressure and it made sense to avoid uselessly linking
> > them to both lists.
> > 
> > The global per-zone LRU lists are about to go away on memcg-enabled
> > kernels, with all pages being exclusively linked to their respective
> > per-memcg LRU lists.  As a result, pages of the root_mem_cgroup must
> > also be linked to its LRU lists again.
> 
> Nevertheless we still do not charge them so this should be mentioned
> here?

Added for the next revision:

	"This is purely about the LRU list, root_mem_cgroup is still
	not charged."

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [patch 06/11] mm: memcg: remove optimization of keeping the root_mem_cgroup LRU lists empty
  2011-09-29  9:20       ` Johannes Weiner
@ 2011-09-29  9:49         ` Michal Hocko
  -1 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2011-09-29  9:49 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
	Balbir Singh, Ying Han, Greg Thelen, Michel Lespinasse,
	Rik van Riel, Minchan Kim, Christoph Hellwig, linux-mm,
	linux-kernel

On Thu 29-09-11 11:20:33, Johannes Weiner wrote:
> On Tue, Sep 20, 2011 at 05:02:29PM +0200, Michal Hocko wrote:
> > On Mon 12-09-11 12:57:23, Johannes Weiner wrote:
> > > root_mem_cgroup, lacking a configurable limit, was never subject to
> > > limit reclaim, so the pages charged to it could be kept off its LRU
> > > lists.  They would be found on the global per-zone LRU lists upon
> > > physical memory pressure and it made sense to avoid uselessly linking
> > > them to both lists.
> > > 
> > > The global per-zone LRU lists are about to go away on memcg-enabled
> > > kernels, with all pages being exclusively linked to their respective
> > > per-memcg LRU lists.  As a result, pages of the root_mem_cgroup must
> > > also be linked to its LRU lists again.
> > 
> > Nevertheless we still do not charge them so this should be mentioned
> > here?
> 
> Added for the next revision:
> 
> 	"This is purely about the LRU list, root_mem_cgroup is still
> 	not charged."

OK, that should be more clear. Thanks!

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 130+ messages in thread

end of thread, other threads:[~2011-09-29  9:49 UTC | newest]

Thread overview: 130+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-09-12 10:57 [patch 0/11] mm: memcg naturalization -rc3 Johannes Weiner
2011-09-12 10:57 ` Johannes Weiner
2011-09-12 10:57 ` [patch 01/11] mm: memcg: consolidate hierarchy iteration primitives Johannes Weiner
2011-09-12 10:57   ` Johannes Weiner
2011-09-12 22:37   ` Kirill A. Shutemov
2011-09-12 22:37     ` Kirill A. Shutemov
2011-09-13  5:40     ` Johannes Weiner
2011-09-13  5:40       ` Johannes Weiner
2011-09-19 13:06     ` Michal Hocko
2011-09-19 13:06       ` Michal Hocko
2011-09-13 10:06   ` KAMEZAWA Hiroyuki
2011-09-13 10:06     ` KAMEZAWA Hiroyuki
2011-09-19 12:53   ` Michal Hocko
2011-09-19 12:53     ` Michal Hocko
2011-09-20  8:45     ` Johannes Weiner
2011-09-20  8:45       ` Johannes Weiner
2011-09-20  8:53       ` Michal Hocko
2011-09-20  8:53         ` Michal Hocko
2011-09-12 10:57 ` [patch 02/11] mm: vmscan: distinguish global reclaim from global LRU scanning Johannes Weiner
2011-09-12 10:57   ` Johannes Weiner
2011-09-12 23:02   ` Kirill A. Shutemov
2011-09-12 23:02     ` Kirill A. Shutemov
2011-09-13  5:48     ` Johannes Weiner
2011-09-13  5:48       ` Johannes Weiner
2011-09-13 10:07   ` KAMEZAWA Hiroyuki
2011-09-13 10:07     ` KAMEZAWA Hiroyuki
2011-09-19 13:23   ` Michal Hocko
2011-09-19 13:23     ` Michal Hocko
2011-09-19 13:46     ` Michal Hocko
2011-09-19 13:46       ` Michal Hocko
2011-09-20  8:52     ` Johannes Weiner
2011-09-20  8:52       ` Johannes Weiner
2011-09-12 10:57 ` [patch 03/11] mm: vmscan: distinguish between memcg triggering reclaim and memcg being scanned Johannes Weiner
2011-09-12 10:57   ` Johannes Weiner
2011-09-13 10:23   ` KAMEZAWA Hiroyuki
2011-09-13 10:23     ` KAMEZAWA Hiroyuki
2011-09-19 14:29   ` Michal Hocko
2011-09-19 14:29     ` Michal Hocko
2011-09-20  8:58     ` Johannes Weiner
2011-09-20  8:58       ` Johannes Weiner
2011-09-20  9:17       ` Michal Hocko
2011-09-20  9:17         ` Michal Hocko
2011-09-29  7:55         ` Johannes Weiner
2011-09-29  7:55           ` Johannes Weiner
2011-09-12 10:57 ` [patch 04/11] mm: memcg: per-priority per-zone hierarchy scan generations Johannes Weiner
2011-09-12 10:57   ` Johannes Weiner
2011-09-13 10:27   ` KAMEZAWA Hiroyuki
2011-09-13 10:27     ` KAMEZAWA Hiroyuki
2011-09-13 11:03     ` Johannes Weiner
2011-09-13 11:03       ` Johannes Weiner
2011-09-14  0:55       ` KAMEZAWA Hiroyuki
2011-09-14  0:55         ` KAMEZAWA Hiroyuki
2011-09-14  5:56         ` Johannes Weiner
2011-09-14  5:56           ` Johannes Weiner
2011-09-14  7:40           ` KAMEZAWA Hiroyuki
2011-09-14  7:40             ` KAMEZAWA Hiroyuki
2011-09-20  8:15       ` Michal Hocko
2011-09-20  8:15         ` Michal Hocko
2011-09-20  8:45   ` Michal Hocko
2011-09-20  8:45     ` Michal Hocko
2011-09-20  9:10     ` Johannes Weiner
2011-09-20  9:10       ` Johannes Weiner
2011-09-20 12:37       ` Michal Hocko
2011-09-20 12:37         ` Michal Hocko
2011-09-12 10:57 ` [patch 05/11] mm: move memcg hierarchy reclaim to generic reclaim code Johannes Weiner
2011-09-12 10:57   ` Johannes Weiner
2011-09-13 10:31   ` KAMEZAWA Hiroyuki
2011-09-13 10:31     ` KAMEZAWA Hiroyuki
2011-09-20 13:09   ` Michal Hocko
2011-09-20 13:09     ` Michal Hocko
2011-09-20 13:29     ` Johannes Weiner
2011-09-20 13:29       ` Johannes Weiner
2011-09-20 14:08       ` Michal Hocko
2011-09-20 14:08         ` Michal Hocko
2011-09-12 10:57 ` [patch 06/11] mm: memcg: remove optimization of keeping the root_mem_cgroup LRU lists empty Johannes Weiner
2011-09-12 10:57   ` Johannes Weiner
2011-09-13 10:34   ` KAMEZAWA Hiroyuki
2011-09-13 10:34     ` KAMEZAWA Hiroyuki
2011-09-20 15:02   ` Michal Hocko
2011-09-20 15:02     ` Michal Hocko
2011-09-29  9:20     ` Johannes Weiner
2011-09-29  9:20       ` Johannes Weiner
2011-09-29  9:49       ` Michal Hocko
2011-09-29  9:49         ` Michal Hocko
2011-09-12 10:57 ` [patch 07/11] mm: vmscan: convert unevictable page rescue scanner to per-memcg LRU lists Johannes Weiner
2011-09-12 10:57   ` Johannes Weiner
2011-09-13 10:37   ` KAMEZAWA Hiroyuki
2011-09-13 10:37     ` KAMEZAWA Hiroyuki
2011-09-21 12:33   ` Michal Hocko
2011-09-21 12:33     ` Michal Hocko
2011-09-21 13:47     ` Johannes Weiner
2011-09-21 13:47       ` Johannes Weiner
2011-09-21 14:08       ` Michal Hocko
2011-09-21 14:08         ` Michal Hocko
2011-09-12 10:57 ` [patch 08/11] mm: vmscan: convert global reclaim " Johannes Weiner
2011-09-12 10:57   ` Johannes Weiner
2011-09-13 10:41   ` KAMEZAWA Hiroyuki
2011-09-13 10:41     ` KAMEZAWA Hiroyuki
2011-09-21 13:10   ` Michal Hocko
2011-09-21 13:10     ` Michal Hocko
2011-09-21 13:51     ` Johannes Weiner
2011-09-21 13:51       ` Johannes Weiner
2011-09-21 13:57       ` Michal Hocko
2011-09-21 13:57         ` Michal Hocko
2011-09-12 10:57 ` [patch 09/11] mm: collect LRU list heads into struct lruvec Johannes Weiner
2011-09-12 10:57   ` Johannes Weiner
2011-09-13 10:43   ` KAMEZAWA Hiroyuki
2011-09-13 10:43     ` KAMEZAWA Hiroyuki
2011-09-21 13:43   ` Michal Hocko
2011-09-21 13:43     ` Michal Hocko
2011-09-21 15:15     ` Michal Hocko
2011-09-21 15:15       ` Michal Hocko
2011-09-12 10:57 ` [patch 10/11] mm: make per-memcg LRU lists exclusive Johannes Weiner
2011-09-12 10:57   ` Johannes Weiner
2011-09-13 10:47   ` KAMEZAWA Hiroyuki
2011-09-13 10:47     ` KAMEZAWA Hiroyuki
2011-09-21 15:24   ` Michal Hocko
2011-09-21 15:24     ` Michal Hocko
2011-09-21 15:47     ` Johannes Weiner
2011-09-21 15:47       ` Johannes Weiner
2011-09-21 16:05       ` Michal Hocko
2011-09-21 16:05         ` Michal Hocko
2011-09-12 10:57 ` [patch 11/11] mm: memcg: remove unused node/section info from pc->flags Johannes Weiner
2011-09-12 10:57   ` Johannes Weiner
2011-09-13 10:50   ` KAMEZAWA Hiroyuki
2011-09-13 10:50     ` KAMEZAWA Hiroyuki
2011-09-21 15:32   ` Michal Hocko
2011-09-21 15:32     ` Michal Hocko
2011-09-13 20:35 ` [patch 0/11] mm: memcg naturalization -rc3 Kirill A. Shutemov
2011-09-13 20:35   ` Kirill A. Shutemov
