* [RFC][PATCH 0/10] memcg  clean up and some fixes for softlimit (Sep25)
@ 2009-09-25  8:17 KAMEZAWA Hiroyuki
  2009-09-25  8:18 ` [RFC][PATCH 1/10] memcg : modifications for softlimit uncharge KAMEZAWA Hiroyuki
                   ` (10 more replies)
  0 siblings, 11 replies; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-25  8:17 UTC (permalink / raw)
  To: linux-mm; +Cc: balbir, nishimura


As I posted on Sep 18, I'm now planning to make memcontrol.c cleaner.
I'll post this to Andrew next week if there are no objections.
(IOW, I'll post this again, so the review itself is not very urgent.)

In this version, I dropped the batched-charge/uncharge set.
It includes some delicate changes and should be discussed in a separate thread.
The patches are organized as follows.

Clean up/fix softlimit charge/uncharge under hierarchy.
1. softlimit uncharge fix
2. softlimit charge fix
These two have not changed for three weeks.

The following are new (no functional changes):
3.  reorder layout in memcontrol.c
4.  memcg_charge_cancel.patch, based on Nishimura's one
5.  clean up memcg's per-cpu statistics
6.  remove unused macro
7.  rename "cont" to "cgroup"
8.  remove unused check in charge/uncharge
9.  clean up memcg's per-zone statistics
10. add commentary

Because my commentary tends not to be very good, a review
of 10. would be especially helpful ;)

I think this kind of fix should be done while the -mm queue is empty,
so I'd like to do this first.

Thanks,
-Kame





* [RFC][PATCH 1/10] memcg : modifications for softlimit uncharge
  2009-09-25  8:17 [RFC][PATCH 0/10] memcg clean up and some fixes for softlimit (Sep25) KAMEZAWA Hiroyuki
@ 2009-09-25  8:18 ` KAMEZAWA Hiroyuki
  2009-09-25  8:20 ` [RFC][PATCH 2/10] memcg : clean up in softlimit charge KAMEZAWA Hiroyuki
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-25  8:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura

This patch cleans up/fixes memcg's soft limit handling in the uncharge path.
It is also a preparation for batched uncharge.

Problems:
  Currently, res_counter_charge()/uncharge() handle soft limit information
  at charge/uncharge time, and the soft limit check is done when the
  per-memcg event counter goes over the threshold. But the per-memcg event
  counter is updated only when the memcg is over its soft limit, and
  ancestors are handled in the charge path but not in the uncharge path.
  For batched charge/uncharge, the event counter check should be stricter.

  In detail:
  1. memcg's event counter is incremented only when the soft limit is hit.
     That's bad: it makes the event counter hard to reuse for other purposes.
  2. At uncharge, only the lowest-level res_counter is handled. This is a bug.
     Because the ancestors' event counters are not incremented, children
     should take care of them.
  3. res_counter_uncharge()'s 3rd argument is NULL in most cases.
     Operations under res_counter->lock should be small; avoiding the "if"
     is better.

Fixes:
  * Removed the soft_limit_xx pointers and checks from charge and uncharge.
    The check-only-when-necessary scheme works well enough without them.

  * Make the memcg event counter checked at every charge/uncharge.
    (The per-cpu area will be accessed soon anyway.)

  * All ancestors are checked at the soft limit check. This is necessary
    because an ancestor's event counter may never be modified otherwise,
    so they should be checked at the same time.

Todo:
  We may need to modify EVENT_COUNTER_THRESH for a parent with many children.
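
To make the intended flow concrete, here is a minimal user-space sketch
(not the kernel code; memcg_stub, EVENTS_THRESH and the helpers below are
made-up stand-ins, and the event-counter bump really lives in
mem_cgroup_charge_statistics()): bump a counter on every charge/uncharge,
and once it crosses the threshold, reset it and walk all ancestors to
update the soft-limit tree.

#include <stdbool.h>
#include <stdio.h>

#define EVENTS_THRESH 1000

struct memcg_stub {
	struct memcg_stub *parent;
	long events;		/* bumped on every charge/uncharge */
	long usage;
	long soft_limit;
	bool on_tree;		/* stands in for mz->on_tree */
};

static long soft_limit_excess(struct memcg_stub *m)
{
	return m->usage > m->soft_limit ? m->usage - m->soft_limit : 0;
}

static bool soft_limit_check(struct memcg_stub *m)
{
	/* stand-in for the per-cpu event bump in charge_statistics()
	 * plus the threshold test in mem_cgroup_soft_limit_check() */
	if (++m->events <= EVENTS_THRESH)
		return false;
	m->events = 0;
	return true;
}

static void update_tree(struct memcg_stub *m)
{
	/* counterpart of mem_cgroup_update_tree(): walk *all* ancestors,
	 * since their own event counters may never fire */
	for (; m; m = m->parent) {
		long excess = soft_limit_excess(m);

		if (excess || m->on_tree)
			m->on_tree = (excess != 0);
	}
}

int main(void)
{
	struct memcg_stub parent = { .soft_limit = 100 };
	struct memcg_stub child  = { .parent = &parent, .soft_limit = 50 };
	int i;

	for (i = 0; i < 2000; i++) {
		child.usage++;
		parent.usage++;		/* hierarchical charge */
		if (soft_limit_check(&child))
			update_tree(&child);	/* fixes ancestors too */
	}
	printf("child on_tree=%d, parent on_tree=%d\n",
	       child.on_tree, parent.on_tree);
	return 0;
}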

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 include/linux/res_counter.h |    6 --
 kernel/res_counter.c        |   18 ------
 mm/memcontrol.c             |  115 +++++++++++++++++++-------------------------
 3 files changed, 55 insertions(+), 84 deletions(-)

Index: temp-mmotm/kernel/res_counter.c
===================================================================
--- temp-mmotm.orig/kernel/res_counter.c
+++ temp-mmotm/kernel/res_counter.c
@@ -37,27 +37,17 @@ int res_counter_charge_locked(struct res
 }
 
 int res_counter_charge(struct res_counter *counter, unsigned long val,
-			struct res_counter **limit_fail_at,
-			struct res_counter **soft_limit_fail_at)
+			struct res_counter **limit_fail_at)
 {
 	int ret;
 	unsigned long flags;
 	struct res_counter *c, *u;
 
 	*limit_fail_at = NULL;
-	if (soft_limit_fail_at)
-		*soft_limit_fail_at = NULL;
 	local_irq_save(flags);
 	for (c = counter; c != NULL; c = c->parent) {
 		spin_lock(&c->lock);
 		ret = res_counter_charge_locked(c, val);
-		/*
-		 * With soft limits, we return the highest ancestor
-		 * that exceeds its soft limit
-		 */
-		if (soft_limit_fail_at &&
-			!res_counter_soft_limit_check_locked(c))
-			*soft_limit_fail_at = c;
 		spin_unlock(&c->lock);
 		if (ret < 0) {
 			*limit_fail_at = c;
@@ -85,8 +75,7 @@ void res_counter_uncharge_locked(struct 
 	counter->usage -= val;
 }
 
-void res_counter_uncharge(struct res_counter *counter, unsigned long val,
-				bool *was_soft_limit_excess)
+void res_counter_uncharge(struct res_counter *counter, unsigned long val)
 {
 	unsigned long flags;
 	struct res_counter *c;
@@ -94,9 +83,6 @@ void res_counter_uncharge(struct res_cou
 	local_irq_save(flags);
 	for (c = counter; c != NULL; c = c->parent) {
 		spin_lock(&c->lock);
-		if (was_soft_limit_excess)
-			*was_soft_limit_excess =
-				!res_counter_soft_limit_check_locked(c);
 		res_counter_uncharge_locked(c, val);
 		spin_unlock(&c->lock);
 	}
Index: temp-mmotm/include/linux/res_counter.h
===================================================================
--- temp-mmotm.orig/include/linux/res_counter.h
+++ temp-mmotm/include/linux/res_counter.h
@@ -114,8 +114,7 @@ void res_counter_init(struct res_counter
 int __must_check res_counter_charge_locked(struct res_counter *counter,
 		unsigned long val);
 int __must_check res_counter_charge(struct res_counter *counter,
-		unsigned long val, struct res_counter **limit_fail_at,
-		struct res_counter **soft_limit_at);
+		unsigned long val, struct res_counter **limit_fail_at);
 
 /*
  * uncharge - tell that some portion of the resource is released
@@ -128,8 +127,7 @@ int __must_check res_counter_charge(stru
  */
 
 void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
-void res_counter_uncharge(struct res_counter *counter, unsigned long val,
-				bool *was_soft_limit_excess);
+void res_counter_uncharge(struct res_counter *counter, unsigned long val);
 
 static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
 {
Index: temp-mmotm/mm/memcontrol.c
===================================================================
--- temp-mmotm.orig/mm/memcontrol.c
+++ temp-mmotm/mm/memcontrol.c
@@ -353,16 +353,6 @@ __mem_cgroup_remove_exceeded(struct mem_
 }
 
 static void
-mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
-				struct mem_cgroup_per_zone *mz,
-				struct mem_cgroup_tree_per_zone *mctz)
-{
-	spin_lock(&mctz->lock);
-	__mem_cgroup_insert_exceeded(mem, mz, mctz);
-	spin_unlock(&mctz->lock);
-}
-
-static void
 mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
 				struct mem_cgroup_per_zone *mz,
 				struct mem_cgroup_tree_per_zone *mctz)
@@ -392,34 +382,40 @@ static bool mem_cgroup_soft_limit_check(
 
 static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
 {
-	unsigned long long prev_usage_in_excess, new_usage_in_excess;
-	bool updated_tree = false;
+	unsigned long long new_usage_in_excess;
 	struct mem_cgroup_per_zone *mz;
 	struct mem_cgroup_tree_per_zone *mctz;
-
-	mz = mem_cgroup_zoneinfo(mem, page_to_nid(page), page_zonenum(page));
+	int nid = page_to_nid(page);
+	int zid = page_zonenum(page);
 	mctz = soft_limit_tree_from_page(page);
 
 	/*
-	 * We do updates in lazy mode, mem's are removed
-	 * lazily from the per-zone, per-node rb tree
+	 * Necessary to update all ancestors when hierarchy is used.
+	 * because their event counter is not touched.
 	 */
-	prev_usage_in_excess = mz->usage_in_excess;
-
-	new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
-	if (prev_usage_in_excess) {
-		mem_cgroup_remove_exceeded(mem, mz, mctz);
-		updated_tree = true;
-	}
-	if (!new_usage_in_excess)
-		goto done;
-	mem_cgroup_insert_exceeded(mem, mz, mctz);
-
-done:
-	if (updated_tree) {
-		spin_lock(&mctz->lock);
-		mz->usage_in_excess = new_usage_in_excess;
-		spin_unlock(&mctz->lock);
+	for (; mem; mem = parent_mem_cgroup(mem)) {
+		mz = mem_cgroup_zoneinfo(mem, nid, zid);
+		new_usage_in_excess =
+			res_counter_soft_limit_excess(&mem->res);
+		/*
+		 * We have to update the tree if mz is on RB-tree or
+		 * mem is over its softlimit.
+		 */
+		if (new_usage_in_excess || mz->on_tree) {
+			spin_lock(&mctz->lock);
+			/* if on-tree, remove it */
+			if (mz->on_tree)
+				__mem_cgroup_remove_exceeded(mem, mz, mctz);
+			/*
+			 * if over soft limit, insert again. mz->usage_in_excess
+			 * will be updated properly.
+			 */
+			if (new_usage_in_excess)
+				__mem_cgroup_insert_exceeded(mem, mz, mctz);
+			else
+				mz->usage_in_excess = 0;
+			spin_unlock(&mctz->lock);
+		}
 	}
 }
 
@@ -1270,9 +1266,9 @@ static int __mem_cgroup_try_charge(struc
 			gfp_t gfp_mask, struct mem_cgroup **memcg,
 			bool oom, struct page *page)
 {
-	struct mem_cgroup *mem, *mem_over_limit, *mem_over_soft_limit;
+	struct mem_cgroup *mem, *mem_over_limit;
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
-	struct res_counter *fail_res, *soft_fail_res = NULL;
+	struct res_counter *fail_res;
 
 	if (unlikely(test_thread_flag(TIF_MEMDIE))) {
 		/* Don't account this! */
@@ -1304,17 +1300,16 @@ static int __mem_cgroup_try_charge(struc
 
 		if (mem_cgroup_is_root(mem))
 			goto done;
-		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
-						&soft_fail_res);
+		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
 		if (likely(!ret)) {
 			if (!do_swap_account)
 				break;
 			ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
-							&fail_res, NULL);
+							&fail_res);
 			if (likely(!ret))
 				break;
 			/* mem+swap counter fails */
-			res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
+			res_counter_uncharge(&mem->res, PAGE_SIZE);
 			flags |= MEM_CGROUP_RECLAIM_NOSWAP;
 			mem_over_limit = mem_cgroup_from_res_counter(fail_res,
 									memsw);
@@ -1353,16 +1348,11 @@ static int __mem_cgroup_try_charge(struc
 		}
 	}
 	/*
-	 * Insert just the ancestor, we should trickle down to the correct
-	 * cgroup for reclaim, since the other nodes will be below their
-	 * soft limit
-	 */
-	if (soft_fail_res) {
-		mem_over_soft_limit =
-			mem_cgroup_from_res_counter(soft_fail_res, res);
-		if (mem_cgroup_soft_limit_check(mem_over_soft_limit))
-			mem_cgroup_update_tree(mem_over_soft_limit, page);
-	}
+	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
+	 * if they exceeds softlimit.
+	 */
+	if (mem_cgroup_soft_limit_check(mem))
+		mem_cgroup_update_tree(mem, page);
 done:
 	return 0;
 nomem:
@@ -1437,10 +1427,9 @@ static void __mem_cgroup_commit_charge(s
 	if (unlikely(PageCgroupUsed(pc))) {
 		unlock_page_cgroup(pc);
 		if (!mem_cgroup_is_root(mem)) {
-			res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
+			res_counter_uncharge(&mem->res, PAGE_SIZE);
 			if (do_swap_account)
-				res_counter_uncharge(&mem->memsw, PAGE_SIZE,
-							NULL);
+				res_counter_uncharge(&mem->memsw, PAGE_SIZE);
 		}
 		css_put(&mem->css);
 		return;
@@ -1519,7 +1508,7 @@ static int mem_cgroup_move_account(struc
 		goto out;
 
 	if (!mem_cgroup_is_root(from))
-		res_counter_uncharge(&from->res, PAGE_SIZE, NULL);
+		res_counter_uncharge(&from->res, PAGE_SIZE);
 	mem_cgroup_charge_statistics(from, pc, false);
 
 	page = pc->page;
@@ -1539,7 +1528,7 @@ static int mem_cgroup_move_account(struc
 	}
 
 	if (do_swap_account && !mem_cgroup_is_root(from))
-		res_counter_uncharge(&from->memsw, PAGE_SIZE, NULL);
+		res_counter_uncharge(&from->memsw, PAGE_SIZE);
 	css_put(&from->css);
 
 	css_get(&to->css);
@@ -1610,9 +1599,9 @@ uncharge:
 	css_put(&parent->css);
 	/* uncharge if move fails */
 	if (!mem_cgroup_is_root(parent)) {
-		res_counter_uncharge(&parent->res, PAGE_SIZE, NULL);
+		res_counter_uncharge(&parent->res, PAGE_SIZE);
 		if (do_swap_account)
-			res_counter_uncharge(&parent->memsw, PAGE_SIZE, NULL);
+			res_counter_uncharge(&parent->memsw, PAGE_SIZE);
 	}
 	return ret;
 }
@@ -1803,8 +1792,7 @@ __mem_cgroup_commit_charge_swapin(struct
 			 * calling css_tryget
 			 */
 			if (!mem_cgroup_is_root(memcg))
-				res_counter_uncharge(&memcg->memsw, PAGE_SIZE,
-							NULL);
+				res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
 			mem_cgroup_swap_statistics(memcg, false);
 			mem_cgroup_put(memcg);
 		}
@@ -1831,9 +1819,9 @@ void mem_cgroup_cancel_charge_swapin(str
 	if (!mem)
 		return;
 	if (!mem_cgroup_is_root(mem)) {
-		res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
+		res_counter_uncharge(&mem->res, PAGE_SIZE);
 		if (do_swap_account)
-			res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
+			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
 	}
 	css_put(&mem->css);
 }
@@ -1848,7 +1836,6 @@ __mem_cgroup_uncharge_common(struct page
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
 	struct mem_cgroup_per_zone *mz;
-	bool soft_limit_excess = false;
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -1888,10 +1875,10 @@ __mem_cgroup_uncharge_common(struct page
 	}
 
 	if (!mem_cgroup_is_root(mem)) {
-		res_counter_uncharge(&mem->res, PAGE_SIZE, &soft_limit_excess);
+		res_counter_uncharge(&mem->res, PAGE_SIZE);
 		if (do_swap_account &&
 				(ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
-			res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
+			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
 	}
 	if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
 		mem_cgroup_swap_statistics(mem, true);
@@ -1908,7 +1895,7 @@ __mem_cgroup_uncharge_common(struct page
 	mz = page_cgroup_zoneinfo(pc);
 	unlock_page_cgroup(pc);
 
-	if (soft_limit_excess && mem_cgroup_soft_limit_check(mem))
+	if (mem_cgroup_soft_limit_check(mem))
 		mem_cgroup_update_tree(mem, page);
 	/* at swapout, this memcg will be accessed to record to swap */
 	if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
@@ -1986,7 +1973,7 @@ void mem_cgroup_uncharge_swap(swp_entry_
 		 * This memcg can be obsolete one. We avoid calling css_tryget
 		 */
 		if (!mem_cgroup_is_root(memcg))
-			res_counter_uncharge(&memcg->memsw, PAGE_SIZE, NULL);
+			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
 		mem_cgroup_swap_statistics(memcg, false);
 		mem_cgroup_put(memcg);
 	}


* [RFC][PATCH 2/10] memcg : clean up in softlimit charge
  2009-09-25  8:17 [RFC][PATCH 0/10] memcg clean up and some fixes for softlimit (Sep25) KAMEZAWA Hiroyuki
  2009-09-25  8:18 ` [RFC][PATCH 1/10] memcg : modifications for softlimit uncharge KAMEZAWA Hiroyuki
@ 2009-09-25  8:20 ` KAMEZAWA Hiroyuki
  2009-09-25  8:21 ` [RFC][PATCH 3/10] memcg: reorganize memcontrol.c KAMEZAWA Hiroyuki
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-25  8:20 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura

In the charge/uncharge/reclaim paths, usage_in_excess is calculated repeatedly,
and each calculation takes res_counter's spinlock.

This patch removes unnecessary calls to res_counter_soft_limit_excess().
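
As a rough illustration of the idea (a user-space sketch with a made-up
zone_stub type, not the kernel code): the caller computes the excess once,
outside the lock, and passes it to the insert helper; a zero excess makes
the helper a no-op, so the caller no longer needs an if/else around it.

#include <stdbool.h>
#include <stdio.h>

struct zone_stub {
	unsigned long long usage_in_excess;
	bool on_tree;
};

/* counterpart of __mem_cgroup_insert_exceeded() after this patch */
static void insert_exceeded(struct zone_stub *mz, unsigned long long excess)
{
	if (mz->on_tree)
		return;
	mz->usage_in_excess = excess;
	if (!excess)
		return;		/* nothing over the soft limit: no tree op */
	/* ... rb-tree insertion would go here ... */
	mz->on_tree = true;
}

int main(void)
{
	struct zone_stub mz = { 0 };
	unsigned long long excess;

	excess = 0;			/* counter says "under soft limit" */
	insert_exceeded(&mz, excess);	/* no-op, no if/else in the caller */

	excess = 4096;
	insert_exceeded(&mz, excess);	/* actually inserted */
	printf("on_tree=%d excess=%llu\n", mz.on_tree, mz.usage_in_excess);
	return 0;
}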

Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   31 +++++++++++++++----------------
 1 file changed, 15 insertions(+), 16 deletions(-)

Index: mmotm-2.6.31-Sep17/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep17.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep17/mm/memcontrol.c
@@ -313,7 +313,8 @@ soft_limit_tree_from_page(struct page *p
 static void
 __mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
 				struct mem_cgroup_per_zone *mz,
-				struct mem_cgroup_tree_per_zone *mctz)
+				struct mem_cgroup_tree_per_zone *mctz,
+				unsigned long long new_usage_in_excess)
 {
 	struct rb_node **p = &mctz->rb_root.rb_node;
 	struct rb_node *parent = NULL;
@@ -322,7 +323,9 @@ __mem_cgroup_insert_exceeded(struct mem_
 	if (mz->on_tree)
 		return;
 
-	mz->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+	mz->usage_in_excess = new_usage_in_excess;
+	if (!mz->usage_in_excess)
+		return;
 	while (*p) {
 		parent = *p;
 		mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
@@ -382,7 +385,7 @@ static bool mem_cgroup_soft_limit_check(
 
 static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
 {
-	unsigned long long new_usage_in_excess;
+	unsigned long long excess;
 	struct mem_cgroup_per_zone *mz;
 	struct mem_cgroup_tree_per_zone *mctz;
 	int nid = page_to_nid(page);
@@ -395,25 +398,21 @@ static void mem_cgroup_update_tree(struc
 	 */
 	for (; mem; mem = parent_mem_cgroup(mem)) {
 		mz = mem_cgroup_zoneinfo(mem, nid, zid);
-		new_usage_in_excess =
-			res_counter_soft_limit_excess(&mem->res);
+		excess = res_counter_soft_limit_excess(&mem->res);
 		/*
 		 * We have to update the tree if mz is on RB-tree or
 		 * mem is over its softlimit.
 		 */
-		if (new_usage_in_excess || mz->on_tree) {
+		if (excess || mz->on_tree) {
 			spin_lock(&mctz->lock);
 			/* if on-tree, remove it */
 			if (mz->on_tree)
 				__mem_cgroup_remove_exceeded(mem, mz, mctz);
 			/*
-			 * if over soft limit, insert again. mz->usage_in_excess
-			 * will be updated properly.
+			 * Insert again. mz->usage_in_excess will be updated.
+			 * If excess is 0, no tree ops.
 			 */
-			if (new_usage_in_excess)
-				__mem_cgroup_insert_exceeded(mem, mz, mctz);
-			else
-				mz->usage_in_excess = 0;
+			__mem_cgroup_insert_exceeded(mem, mz, mctz, excess);
 			spin_unlock(&mctz->lock);
 		}
 	}
@@ -2220,6 +2219,7 @@ unsigned long mem_cgroup_soft_limit_recl
 	unsigned long reclaimed;
 	int loop = 0;
 	struct mem_cgroup_tree_per_zone *mctz;
+	unsigned long long excess;
 
 	if (order > 0)
 		return 0;
@@ -2271,9 +2271,8 @@ unsigned long mem_cgroup_soft_limit_recl
 					break;
 			} while (1);
 		}
-		mz->usage_in_excess =
-			res_counter_soft_limit_excess(&mz->mem->res);
 		__mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
+		excess = res_counter_soft_limit_excess(&mz->mem->res);
 		/*
 		 * One school of thought says that we should not add
 		 * back the node to the tree if reclaim returns 0.
@@ -2282,8 +2281,8 @@ unsigned long mem_cgroup_soft_limit_recl
 		 * memory to reclaim from. Consider this as a longer
 		 * term TODO.
 		 */
-		if (mz->usage_in_excess)
-			__mem_cgroup_insert_exceeded(mz->mem, mz, mctz);
+		/* If excess == 0, no tree ops */
+		__mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess);
 		spin_unlock(&mctz->lock);
 		css_put(&mz->mem->css);
 		loop++;


* [RFC][PATCH 3/10] memcg: reorganize memcontrol.c
  2009-09-25  8:17 [RFC][PATCH 0/10] memcg clean up and some fixes for softlimit (Sep25) KAMEZAWA Hiroyuki
  2009-09-25  8:18 ` [RFC][PATCH 1/10] memcg : modifications for softlimit uncharge KAMEZAWA Hiroyuki
  2009-09-25  8:20 ` [RFC][PATCH 2/10] memcg : clean up in softlimit charge KAMEZAWA Hiroyuki
@ 2009-09-25  8:21 ` KAMEZAWA Hiroyuki
  2009-09-25  8:22 ` [RFC][PATCH 4/10] memcg: add memcg charge cancel KAMEZAWA Hiroyuki
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-25  8:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura

This patch just reorders functions/definitions in memcontrol.c.

I think it's time to clean up memcontrol.c to make it readable. Before adding
commentary or functions, it seems necessary to reorder the functions in
memcontrol.c for better organization.

After this, memcontrol.c will be ordered in the following way.

  - definitions of structs.
  - functions for accessing struct mem_cgroup.
  - functions for per-cpu statistics of mem_cgroup
  - functions for per-zone statistics of mem_cgroup
  - functions for per-zone softlimit-tree handling.
  - functions for LRU management.
  - functions for memory reclaim...called by vmscan.c
  - functions for OOM handling.
  - functions for hierarchical memory reclaim (includes softlimit)
  - functions for charge
  - functions for uncharge
  - functions for move charges.
  - functions for user interfaces
  - functions for cgroup callbacks.
  - functions for memcg creation/deletion.

This patch looks big, but it just moves code.
There are no functionality/logic changes at all.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c | 1589 ++++++++++++++++++++++++++++----------------------------
 1 file changed, 812 insertions(+), 777 deletions(-)

Index: temp-mmotm/mm/memcontrol.c
===================================================================
--- temp-mmotm.orig/mm/memcontrol.c
+++ temp-mmotm/mm/memcontrol.c
@@ -55,7 +55,6 @@ static int really_do_swap_account __init
 #endif
 
 static DEFINE_MUTEX(memcg_tasklist);	/* can be hold under cgroup_mutex */
-#define SOFTLIMIT_EVENTS_THRESH (1000)
 
 /*
  * Statistics for memory cgroup.
@@ -83,47 +82,6 @@ struct mem_cgroup_stat {
 	struct mem_cgroup_stat_cpu cpustat[0];
 };
 
-static inline void
-__mem_cgroup_stat_reset_safe(struct mem_cgroup_stat_cpu *stat,
-				enum mem_cgroup_stat_index idx)
-{
-	stat->count[idx] = 0;
-}
-
-static inline s64
-__mem_cgroup_stat_read_local(struct mem_cgroup_stat_cpu *stat,
-				enum mem_cgroup_stat_index idx)
-{
-	return stat->count[idx];
-}
-
-/*
- * For accounting under irq disable, no need for increment preempt count.
- */
-static inline void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat_cpu *stat,
-		enum mem_cgroup_stat_index idx, int val)
-{
-	stat->count[idx] += val;
-}
-
-static s64 mem_cgroup_read_stat(struct mem_cgroup_stat *stat,
-		enum mem_cgroup_stat_index idx)
-{
-	int cpu;
-	s64 ret = 0;
-	for_each_possible_cpu(cpu)
-		ret += stat->cpustat[cpu].count[idx];
-	return ret;
-}
-
-static s64 mem_cgroup_local_usage(struct mem_cgroup_stat *stat)
-{
-	s64 ret;
-
-	ret = mem_cgroup_read_stat(stat, MEM_CGROUP_STAT_CACHE);
-	ret += mem_cgroup_read_stat(stat, MEM_CGROUP_STAT_RSS);
-	return ret;
-}
 
 /*
  * per-zone information in memory controller.
@@ -231,12 +189,6 @@ struct mem_cgroup {
 	struct mem_cgroup_stat stat;
 };
 
-/*
- * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
- * limit reclaim to prevent infinite loops, if they ever occur.
- */
-#define	MEM_CGROUP_MAX_RECLAIM_LOOPS		(100)
-#define	MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS	(2)
 
 enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
@@ -263,6 +215,14 @@ enum charge_type {
 #define MEMFILE_ATTR(val)	((val) & 0xffff)
 
 /*
+ * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
+ * limit reclaim to prevent infinite loops, if they ever occur.
+ */
+#define MEM_CGROUP_MAX_RECLAIM_LOOPS		(100)
+#define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS	(2)
+#define SOFTLIMIT_EVENTS_THRESH 		(1000)
+
+/*
  * Reclaim flags for mem_cgroup_hierarchical_reclaim
  */
 #define MEM_CGROUP_RECLAIM_NOSWAP_BIT	0x0
@@ -276,242 +236,239 @@ static void mem_cgroup_get(struct mem_cg
 static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
 
-static struct mem_cgroup_per_zone *
-mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
+/*
+ * root_cgroup is very special and we don't use res_counter in it.
+ */
+static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
 {
-	return &mem->info.nodeinfo[nid]->zoneinfo[zid];
+	return (mem == root_mem_cgroup);
 }
 
-static struct mem_cgroup_per_zone *
-page_cgroup_zoneinfo(struct page_cgroup *pc)
-{
-	struct mem_cgroup *mem = pc->mem_cgroup;
-	int nid = page_cgroup_nid(pc);
-	int zid = page_cgroup_zid(pc);
+/*
+ * Functions for getting mem_cgroup struct from misc structs.
+ */
 
-	if (!mem)
-		return NULL;
+#define mem_cgroup_from_res_counter(counter, member)	\
+	container_of(counter, struct mem_cgroup, member)
 
-	return mem_cgroup_zoneinfo(mem, nid, zid);
+static struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
+{
+	return container_of(cgroup_subsys_state(cont,
+				mem_cgroup_subsys_id), struct mem_cgroup,
+				css);
 }
 
-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_node_zone(int nid, int zid)
+struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
 {
-	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
+	/*
+	 * mm_update_next_owner() may clear mm->owner to NULL
+	 * if it races with swapoff, page migration, etc.
+	 * So this can be called with p == NULL.
+	 */
+	if (unlikely(!p))
+		return NULL;
+
+	return container_of(task_subsys_state(p, mem_cgroup_subsys_id),
+				struct mem_cgroup, css);
 }
 
-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_from_page(struct page *page)
+static struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
 {
-	int nid = page_to_nid(page);
-	int zid = page_zonenum(page);
+	struct mem_cgroup *mem = NULL;
 
-	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
+	if (!mm)
+		return NULL;
+	/*
+	 * Because we have no locks, mm->owner's may be being moved to other
+	 * cgroup. We use css_tryget() here even if this looks
+	 * pessimistic (rather than adding locks here).
+	 */
+	rcu_read_lock();
+	do {
+		mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+		if (unlikely(!mem))
+			break;
+	} while (!css_tryget(&mem->css));
+	rcu_read_unlock();
+	return mem;
 }
 
-static void
-__mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
-				struct mem_cgroup_per_zone *mz,
-				struct mem_cgroup_tree_per_zone *mctz,
-				unsigned long long new_usage_in_excess)
+/*
+ * A helper function to get mem_cgroup from ID. must be called under
+ * rcu_read_lock(). The caller must check css_is_removed() or some if
+ * it's concern. (dropping refcnt from swap can be called against removed
+ * memcg.)
+ */
+static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
 {
-	struct rb_node **p = &mctz->rb_root.rb_node;
-	struct rb_node *parent = NULL;
-	struct mem_cgroup_per_zone *mz_node;
+	struct cgroup_subsys_state *css;
 
-	if (mz->on_tree)
-		return;
+	/* ID 0 is unused ID */
+	if (!id)
+		return NULL;
+	css = css_lookup(&mem_cgroup_subsys, id);
+	if (!css)
+		return NULL;
+	return container_of(css, struct mem_cgroup, css);
+}
 
-	mz->usage_in_excess = new_usage_in_excess;
-	if (!mz->usage_in_excess)
-		return;
-	while (*p) {
-		parent = *p;
-		mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
-					tree_node);
-		if (mz->usage_in_excess < mz_node->usage_in_excess)
-			p = &(*p)->rb_left;
-		/*
-		 * We can't avoid mem cgroups that are over their soft
-		 * limit by the same amount
-		 */
-		else if (mz->usage_in_excess >= mz_node->usage_in_excess)
-			p = &(*p)->rb_right;
-	}
-	rb_link_node(&mz->tree_node, parent, p);
-	rb_insert_color(&mz->tree_node, &mctz->rb_root);
-	mz->on_tree = true;
+/*
+ * Handlers for memcg's private percpu counters.
+ */
+
+static inline void
+__mem_cgroup_stat_reset_safe(struct mem_cgroup_stat_cpu *stat,
+				enum mem_cgroup_stat_index idx)
+{
+	stat->count[idx] = 0;
 }
 
-static void
-__mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
-				struct mem_cgroup_per_zone *mz,
-				struct mem_cgroup_tree_per_zone *mctz)
+static inline s64
+__mem_cgroup_stat_read_local(struct mem_cgroup_stat_cpu *stat,
+				enum mem_cgroup_stat_index idx)
 {
-	if (!mz->on_tree)
-		return;
-	rb_erase(&mz->tree_node, &mctz->rb_root);
-	mz->on_tree = false;
+	return stat->count[idx];
 }
 
-static void
-mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
-				struct mem_cgroup_per_zone *mz,
-				struct mem_cgroup_tree_per_zone *mctz)
+/*
+ * For accounting under irq disable, no need for increment preempt count.
+ */
+static inline void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat_cpu *stat,
+		enum mem_cgroup_stat_index idx, int val)
 {
-	spin_lock(&mctz->lock);
-	__mem_cgroup_remove_exceeded(mem, mz, mctz);
-	spin_unlock(&mctz->lock);
+	stat->count[idx] += val;
 }
 
-static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem)
+static s64 mem_cgroup_read_stat(struct mem_cgroup_stat *stat,
+		enum mem_cgroup_stat_index idx)
 {
-	bool ret = false;
 	int cpu;
-	s64 val;
-	struct mem_cgroup_stat_cpu *cpustat;
-
-	cpu = get_cpu();
-	cpustat = &mem->stat.cpustat[cpu];
-	val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_EVENTS);
-	if (unlikely(val > SOFTLIMIT_EVENTS_THRESH)) {
-		__mem_cgroup_stat_reset_safe(cpustat, MEM_CGROUP_STAT_EVENTS);
-		ret = true;
-	}
-	put_cpu();
+	s64 ret = 0;
+	for_each_possible_cpu(cpu)
+		ret += stat->cpustat[cpu].count[idx];
 	return ret;
 }
 
-static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
+static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
+					 bool charge)
 {
-	unsigned long long excess;
-	struct mem_cgroup_per_zone *mz;
-	struct mem_cgroup_tree_per_zone *mctz;
-	int nid = page_to_nid(page);
-	int zid = page_zonenum(page);
-	mctz = soft_limit_tree_from_page(page);
+	int val = (charge) ? 1 : -1;
+	struct mem_cgroup_stat *stat = &mem->stat;
+	struct mem_cgroup_stat_cpu *cpustat;
+	int cpu = get_cpu();
 
-	/*
-	 * Necessary to update all ancestors when hierarchy is used.
-	 * because their event counter is not touched.
-	 */
-	for (; mem; mem = parent_mem_cgroup(mem)) {
-		mz = mem_cgroup_zoneinfo(mem, nid, zid);
-		excess = res_counter_soft_limit_excess(&mem->res);
-		/*
-		 * We have to update the tree if mz is on RB-tree or
-		 * mem is over its softlimit.
-		 */
-		if (excess || mz->on_tree) {
-			spin_lock(&mctz->lock);
-			/* if on-tree, remove it */
-			if (mz->on_tree)
-				__mem_cgroup_remove_exceeded(mem, mz, mctz);
-			/*
-			 * Insert again. mz->usage_in_excess will be updated.
-			 * If excess is 0, no tree ops.
-			 */
-			__mem_cgroup_insert_exceeded(mem, mz, mctz, excess);
-			spin_unlock(&mctz->lock);
-		}
-	}
+	cpustat = &stat->cpustat[cpu];
+	__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_SWAPOUT, val);
+	put_cpu();
 }
 
-static void mem_cgroup_remove_from_trees(struct mem_cgroup *mem)
+static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
+					 struct page_cgroup *pc,
+					 bool charge)
 {
-	int node, zone;
-	struct mem_cgroup_per_zone *mz;
-	struct mem_cgroup_tree_per_zone *mctz;
-
-	for_each_node_state(node, N_POSSIBLE) {
-		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
-			mz = mem_cgroup_zoneinfo(mem, node, zone);
-			mctz = soft_limit_tree_node_zone(node, zone);
-			mem_cgroup_remove_exceeded(mem, mz, mctz);
-		}
-	}
+	int val = (charge) ? 1 : -1;
+	struct mem_cgroup_stat *stat = &mem->stat;
+	struct mem_cgroup_stat_cpu *cpustat;
+	int cpu = get_cpu();
+
+	cpustat = &stat->cpustat[cpu];
+	if (PageCgroupCache(pc))
+		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_CACHE, val);
+	else
+		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_RSS, val);
+
+	if (charge)
+		__mem_cgroup_stat_add_safe(cpustat,
+				MEM_CGROUP_STAT_PGPGIN_COUNT, 1);
+	else
+		__mem_cgroup_stat_add_safe(cpustat,
+				MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
+	__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_EVENTS, 1);
+	put_cpu();
 }
 
-static inline unsigned long mem_cgroup_get_excess(struct mem_cgroup *mem)
+static s64 mem_cgroup_local_usage(struct mem_cgroup_stat *stat)
 {
-	return res_counter_soft_limit_excess(&mem->res) >> PAGE_SHIFT;
+	s64 ret;
+
+	ret = mem_cgroup_read_stat(stat, MEM_CGROUP_STAT_CACHE);
+	ret += mem_cgroup_read_stat(stat, MEM_CGROUP_STAT_RSS);
+	return ret;
 }
 
-static struct mem_cgroup_per_zone *
-__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
+/*
+ * Currently used to update mapped file statistics, but the routine can be
+ * generalized to update other statistics as well.
+ */
+void mem_cgroup_update_mapped_file_stat(struct page *page, int val)
 {
-	struct rb_node *rightmost = NULL;
-	struct mem_cgroup_per_zone *mz = NULL;
+	struct mem_cgroup *mem;
+	struct mem_cgroup_stat *stat;
+	struct mem_cgroup_stat_cpu *cpustat;
+	int cpu;
+	struct page_cgroup *pc;
 
-retry:
-	rightmost = rb_last(&mctz->rb_root);
-	if (!rightmost)
-		goto done;		/* Nothing to reclaim from */
+	if (!page_is_file_cache(page))
+		return;
+
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!pc))
+		return;
+
+	lock_page_cgroup(pc);
+	mem = pc->mem_cgroup;
+	if (!mem)
+		goto done;
+
+	if (!PageCgroupUsed(pc))
+		goto done;
 
-	mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node);
 	/*
-	 * Remove the node now but someone else can add it back,
-	 * we will to add it back at the end of reclaim to its correct
-	 * position in the tree.
+	 * Preemption is already disabled, we don't need get_cpu()
 	 */
-	__mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
-	if (!res_counter_soft_limit_excess(&mz->mem->res) ||
-		!css_tryget(&mz->mem->css))
-		goto retry;
+	cpu = smp_processor_id();
+	stat = &mem->stat;
+	cpustat = &stat->cpustat[cpu];
+
+	__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
 done:
-	return mz;
+	unlock_page_cgroup(pc);
 }
 
+/*
+ * Handlers for memcg's private perzone counters.
+ */
 static struct mem_cgroup_per_zone *
-mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
+mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
 {
-	struct mem_cgroup_per_zone *mz;
-
-	spin_lock(&mctz->lock);
-	mz = __mem_cgroup_largest_soft_limit_node(mctz);
-	spin_unlock(&mctz->lock);
-	return mz;
+	return &mem->info.nodeinfo[nid]->zoneinfo[zid];
 }
 
-static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
-					 bool charge)
+static struct mem_cgroup_per_zone *
+page_cgroup_zoneinfo(struct page_cgroup *pc)
 {
-	int val = (charge) ? 1 : -1;
-	struct mem_cgroup_stat *stat = &mem->stat;
-	struct mem_cgroup_stat_cpu *cpustat;
-	int cpu = get_cpu();
+	struct mem_cgroup *mem = pc->mem_cgroup;
+	int nid = page_cgroup_nid(pc);
+	int zid = page_cgroup_zid(pc);
 
-	cpustat = &stat->cpustat[cpu];
-	__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_SWAPOUT, val);
-	put_cpu();
+	if (!mem)
+		return NULL;
+
+	return mem_cgroup_zoneinfo(mem, nid, zid);
 }
 
-static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
-					 struct page_cgroup *pc,
-					 bool charge)
+unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
+				       struct zone *zone,
+				       enum lru_list lru)
 {
-	int val = (charge) ? 1 : -1;
-	struct mem_cgroup_stat *stat = &mem->stat;
-	struct mem_cgroup_stat_cpu *cpustat;
-	int cpu = get_cpu();
-
-	cpustat = &stat->cpustat[cpu];
-	if (PageCgroupCache(pc))
-		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_CACHE, val);
-	else
-		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_RSS, val);
+	int nid = zone->zone_pgdat->node_id;
+	int zid = zone_idx(zone);
+	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
 
-	if (charge)
-		__mem_cgroup_stat_add_safe(cpustat,
-				MEM_CGROUP_STAT_PGPGIN_COUNT, 1);
-	else
-		__mem_cgroup_stat_add_safe(cpustat,
-				MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
-	__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_EVENTS, 1);
-	put_cpu();
+	return MEM_CGROUP_ZSTAT(mz, lru);
 }
 
+
 static unsigned long mem_cgroup_get_local_zonestat(struct mem_cgroup *mem,
 					enum lru_list idx)
 {
@@ -527,48 +484,186 @@ static unsigned long mem_cgroup_get_loca
 	return total;
 }
 
-static struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
+static struct mem_cgroup_tree_per_zone *
+soft_limit_tree_node_zone(int nid, int zid)
 {
-	return container_of(cgroup_subsys_state(cont,
-				mem_cgroup_subsys_id), struct mem_cgroup,
-				css);
+	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
 }
 
-struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
+static struct mem_cgroup_tree_per_zone *
+soft_limit_tree_from_page(struct page *page)
+{
+	int nid = page_to_nid(page);
+	int zid = page_zonenum(page);
+
+	return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
+}
+
+static void
+__mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
+				struct mem_cgroup_per_zone *mz,
+				struct mem_cgroup_tree_per_zone *mctz,
+				unsigned long long new_usage_in_excess)
+{
+	struct rb_node **p = &mctz->rb_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct mem_cgroup_per_zone *mz_node;
+
+	if (mz->on_tree)
+		return;
+
+	mz->usage_in_excess = new_usage_in_excess;
+	if (!mz->usage_in_excess)
+		return;
+	while (*p) {
+		parent = *p;
+		mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
+					tree_node);
+		if (mz->usage_in_excess < mz_node->usage_in_excess)
+			p = &(*p)->rb_left;
+		/*
+		 * We can't avoid mem cgroups that are over their soft
+		 * limit by the same amount
+		 */
+		else if (mz->usage_in_excess >= mz_node->usage_in_excess)
+			p = &(*p)->rb_right;
+	}
+	rb_link_node(&mz->tree_node, parent, p);
+	rb_insert_color(&mz->tree_node, &mctz->rb_root);
+	mz->on_tree = true;
+}
+
+static void
+__mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
+				struct mem_cgroup_per_zone *mz,
+				struct mem_cgroup_tree_per_zone *mctz)
 {
+	if (!mz->on_tree)
+		return;
+	rb_erase(&mz->tree_node, &mctz->rb_root);
+	mz->on_tree = false;
+}
+
+static void
+mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
+				struct mem_cgroup_per_zone *mz,
+				struct mem_cgroup_tree_per_zone *mctz)
+{
+	spin_lock(&mctz->lock);
+	__mem_cgroup_remove_exceeded(mem, mz, mctz);
+	spin_unlock(&mctz->lock);
+}
+
+static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem)
+{
+	bool ret = false;
+	int cpu;
+	s64 val;
+	struct mem_cgroup_stat_cpu *cpustat;
+
+	cpu = get_cpu();
+	cpustat = &mem->stat.cpustat[cpu];
+	val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_EVENTS);
+	if (unlikely(val > SOFTLIMIT_EVENTS_THRESH)) {
+		__mem_cgroup_stat_reset_safe(cpustat, MEM_CGROUP_STAT_EVENTS);
+		ret = true;
+	}
+	put_cpu();
+	return ret;
+}
+
+static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
+{
+	unsigned long long excess;
+	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_tree_per_zone *mctz;
+	int nid = page_to_nid(page);
+	int zid = page_zonenum(page);
+	mctz = soft_limit_tree_from_page(page);
+
+	/*
+	 * Necessary to update all ancestors when hierarchy is used.
+	 * because their event counter is not touched.
+	 */
+	for (; mem; mem = parent_mem_cgroup(mem)) {
+		mz = mem_cgroup_zoneinfo(mem, nid, zid);
+		excess = res_counter_soft_limit_excess(&mem->res);
+		/*
+		 * We have to update the tree if mz is on RB-tree or
+		 * mem is over its softlimit.
+		 */
+		if (excess || mz->on_tree) {
+			spin_lock(&mctz->lock);
+			/* if on-tree, remove it */
+			if (mz->on_tree)
+				__mem_cgroup_remove_exceeded(mem, mz, mctz);
+			/*
+			 * Insert again. mz->usage_in_excess will be updated.
+			 * If excess is 0, no tree ops.
+			 */
+			__mem_cgroup_insert_exceeded(mem, mz, mctz, excess);
+			spin_unlock(&mctz->lock);
+		}
+	}
+}
+
+static void mem_cgroup_remove_from_trees(struct mem_cgroup *mem)
+{
+	int node, zone;
+	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup_tree_per_zone *mctz;
+
+	for_each_node_state(node, N_POSSIBLE) {
+		for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+			mz = mem_cgroup_zoneinfo(mem, node, zone);
+			mctz = soft_limit_tree_node_zone(node, zone);
+			mem_cgroup_remove_exceeded(mem, mz, mctz);
+		}
+	}
+}
+
+static inline unsigned long mem_cgroup_get_excess(struct mem_cgroup *mem)
+{
+	return res_counter_soft_limit_excess(&mem->res) >> PAGE_SHIFT;
+}
+
+static struct mem_cgroup_per_zone *
+__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
+{
+	struct rb_node *rightmost = NULL;
+	struct mem_cgroup_per_zone *mz = NULL;
+
+retry:
+	rightmost = rb_last(&mctz->rb_root);
+	if (!rightmost)
+		goto done;		/* Nothing to reclaim from */
+
+	mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node);
 	/*
-	 * mm_update_next_owner() may clear mm->owner to NULL
-	 * if it races with swapoff, page migration, etc.
-	 * So this can be called with p == NULL.
+	 * Remove the node now but someone else can add it back,
+	 * we will to add it back at the end of reclaim to its correct
+	 * position in the tree.
 	 */
-	if (unlikely(!p))
-		return NULL;
-
-	return container_of(task_subsys_state(p, mem_cgroup_subsys_id),
-				struct mem_cgroup, css);
+	__mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
+	if (!res_counter_soft_limit_excess(&mz->mem->res) ||
+		!css_tryget(&mz->mem->css))
+		goto retry;
+done:
+	return mz;
 }
 
-static struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
+static struct mem_cgroup_per_zone *
+mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
 {
-	struct mem_cgroup *mem = NULL;
+	struct mem_cgroup_per_zone *mz;
 
-	if (!mm)
-		return NULL;
-	/*
-	 * Because we have no locks, mm->owner's may be being moved to other
-	 * cgroup. We use css_tryget() here even if this looks
-	 * pessimistic (rather than adding locks here).
-	 */
-	rcu_read_lock();
-	do {
-		mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
-		if (unlikely(!mem))
-			break;
-	} while (!css_tryget(&mem->css));
-	rcu_read_unlock();
-	return mem;
+	spin_lock(&mctz->lock);
+	mz = __mem_cgroup_largest_soft_limit_node(mctz);
+	spin_unlock(&mctz->lock);
+	return mz;
 }
 
+
 /*
  * Call callback function against all cgroup under hierarchy tree.
  */
@@ -604,11 +699,6 @@ static int mem_cgroup_walk_tree(struct m
 	return ret;
 }
 
-static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
-{
-	return (mem == root_mem_cgroup);
-}
-
 /*
  * Following LRU functions are allowed to be used without PCG_LOCK.
  * Operations are called by routine of global LRU independently from memcg.
@@ -699,6 +789,15 @@ void mem_cgroup_add_lru_list(struct page
 	list_add(&pc->lru, &mz->lists[lru]);
 }
 
+void mem_cgroup_move_lists(struct page *page,
+			   enum lru_list from, enum lru_list to)
+{
+	if (mem_cgroup_disabled())
+		return;
+	mem_cgroup_del_lru_list(page, from);
+	mem_cgroup_add_lru_list(page, to);
+}
+
 /*
  * At handling SwapCache, pc->mem_cgroup may be changed while it's linked to
  * lru because the page may.be reused after it's fully uncharged (because of
@@ -736,35 +835,6 @@ static void mem_cgroup_lru_add_after_com
 }
 
 
-void mem_cgroup_move_lists(struct page *page,
-			   enum lru_list from, enum lru_list to)
-{
-	if (mem_cgroup_disabled())
-		return;
-	mem_cgroup_del_lru_list(page, from);
-	mem_cgroup_add_lru_list(page, to);
-}
-
-int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
-{
-	int ret;
-	struct mem_cgroup *curr = NULL;
-
-	task_lock(task);
-	rcu_read_lock();
-	curr = try_get_mem_cgroup_from_mm(task->mm);
-	rcu_read_unlock();
-	task_unlock(task);
-	if (!curr)
-		return 0;
-	if (curr->use_hierarchy)
-		ret = css_is_ancestor(&curr->css, &mem->css);
-	else
-		ret = (curr == mem);
-	css_put(&curr->css);
-	return ret;
-}
-
 /*
  * prev_priority control...this will be used in memory reclaim path.
  */
@@ -847,16 +917,6 @@ int mem_cgroup_inactive_file_is_low(stru
 	return (active > inactive);
 }
 
-unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
-				       struct zone *zone,
-				       enum lru_list lru)
-{
-	int nid = zone->zone_pgdat->node_id;
-	int zid = zone_idx(zone);
-	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
-
-	return MEM_CGROUP_ZSTAT(mz, lru);
-}
 
 struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
 						      struct zone *zone)
@@ -868,6 +928,22 @@ struct zone_reclaim_stat *mem_cgroup_get
 	return &mz->reclaim_stat;
 }
 
+static unsigned int get_swappiness(struct mem_cgroup *memcg)
+{
+	struct cgroup *cgrp = memcg->css.cgroup;
+	unsigned int swappiness;
+
+	/* root ? */
+	if (cgrp->parent == NULL)
+		return vm_swappiness;
+
+	spin_lock(&memcg->reclaim_param_lock);
+	swappiness = memcg->swappiness;
+	spin_unlock(&memcg->reclaim_param_lock);
+
+	return swappiness;
+}
+
 struct zone_reclaim_stat *
 mem_cgroup_get_reclaim_stat_from_page(struct page *page)
 {
@@ -948,35 +1024,26 @@ unsigned long mem_cgroup_isolate_pages(u
 	return nr_taken;
 }
 
-#define mem_cgroup_from_res_counter(counter, member)	\
-	container_of(counter, struct mem_cgroup, member)
 
-static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem)
-{
-	if (do_swap_account) {
-		if (res_counter_check_under_limit(&mem->res) &&
-			res_counter_check_under_limit(&mem->memsw))
-			return true;
-	} else
-		if (res_counter_check_under_limit(&mem->res))
-			return true;
-	return false;
-}
 
-static unsigned int get_swappiness(struct mem_cgroup *memcg)
+int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
 {
-	struct cgroup *cgrp = memcg->css.cgroup;
-	unsigned int swappiness;
-
-	/* root ? */
-	if (cgrp->parent == NULL)
-		return vm_swappiness;
-
-	spin_lock(&memcg->reclaim_param_lock);
-	swappiness = memcg->swappiness;
-	spin_unlock(&memcg->reclaim_param_lock);
+	int ret;
+	struct mem_cgroup *curr = NULL;
 
-	return swappiness;
+	task_lock(task);
+	rcu_read_lock();
+	curr = try_get_mem_cgroup_from_mm(task->mm);
+	rcu_read_unlock();
+	task_unlock(task);
+	if (!curr)
+		return 0;
+	if (curr->use_hierarchy)
+		ret = css_is_ancestor(&curr->css, &mem->css);
+	else
+		ret = (curr == mem);
+	css_put(&curr->css);
+	return ret;
 }
 
 static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
@@ -1063,6 +1130,45 @@ static int mem_cgroup_count_children(str
  	mem_cgroup_walk_tree(mem, &num, mem_cgroup_count_children_cb);
 	return num;
 }
+bool mem_cgroup_oom_called(struct task_struct *task)
+{
+	bool ret = false;
+	struct mem_cgroup *mem;
+	struct mm_struct *mm;
+
+	rcu_read_lock();
+	mm = task->mm;
+	if (!mm)
+		mm = &init_mm;
+	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+	if (mem && time_before(jiffies, mem->last_oom_jiffies + HZ/10))
+		ret = true;
+	rcu_read_unlock();
+	return ret;
+}
+
+static int record_last_oom_cb(struct mem_cgroup *mem, void *data)
+{
+	mem->last_oom_jiffies = jiffies;
+	return 0;
+}
+
+static void record_last_oom(struct mem_cgroup *mem)
+{
+	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
+}
+
+static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem)
+{
+	if (do_swap_account) {
+		if (res_counter_check_under_limit(&mem->res) &&
+			res_counter_check_under_limit(&mem->memsw))
+			return true;
+	} else
+		if (res_counter_check_under_limit(&mem->res))
+			return true;
+	return false;
+}
 
 /*
  * Visit the first child (need not be the first child as per the ordering
@@ -1190,71 +1296,95 @@ static int mem_cgroup_hierarchical_recla
 	return total;
 }
 
-bool mem_cgroup_oom_called(struct task_struct *task)
-{
-	bool ret = false;
-	struct mem_cgroup *mem;
-	struct mm_struct *mm;
-
-	rcu_read_lock();
-	mm = task->mm;
-	if (!mm)
-		mm = &init_mm;
-	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
-	if (mem && time_before(jiffies, mem->last_oom_jiffies + HZ/10))
-		ret = true;
-	rcu_read_unlock();
-	return ret;
-}
-
-static int record_last_oom_cb(struct mem_cgroup *mem, void *data)
-{
-	mem->last_oom_jiffies = jiffies;
-	return 0;
-}
-
-static void record_last_oom(struct mem_cgroup *mem)
-{
-	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
-}
-
-/*
- * Currently used to update mapped file statistics, but the routine can be
- * generalized to update other statistics as well.
- */
-void mem_cgroup_update_mapped_file_stat(struct page *page, int val)
+unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
+						gfp_t gfp_mask, int nid,
+						int zid)
 {
-	struct mem_cgroup *mem;
-	struct mem_cgroup_stat *stat;
-	struct mem_cgroup_stat_cpu *cpustat;
-	int cpu;
-	struct page_cgroup *pc;
-
-	if (!page_is_file_cache(page))
-		return;
-
-	pc = lookup_page_cgroup(page);
-	if (unlikely(!pc))
-		return;
-
-	lock_page_cgroup(pc);
-	mem = pc->mem_cgroup;
-	if (!mem)
-		goto done;
+	unsigned long nr_reclaimed = 0;
+	struct mem_cgroup_per_zone *mz, *next_mz = NULL;
+	unsigned long reclaimed;
+	int loop = 0;
+	struct mem_cgroup_tree_per_zone *mctz;
+	unsigned long long excess;
 
-	if (!PageCgroupUsed(pc))
-		goto done;
+	if (order > 0)
+		return 0;
 
+	mctz = soft_limit_tree_node_zone(nid, zid);
 	/*
-	 * Preemption is already disabled, we don't need get_cpu()
+	 * This loop can run a while, specially if mem_cgroup's continuously
+	 * keep exceeding their soft limit and putting the system under
+	 * pressure
 	 */
-	cpu = smp_processor_id();
-	stat = &mem->stat;
-	cpustat = &stat->cpustat[cpu];
+	do {
+		if (next_mz)
+			mz = next_mz;
+		else
+			mz = mem_cgroup_largest_soft_limit_node(mctz);
+		if (!mz)
+			break;
 
-	__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
-done:
-	unlock_page_cgroup(pc);
+		reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
+						gfp_mask,
+						MEM_CGROUP_RECLAIM_SOFT);
+		nr_reclaimed += reclaimed;
+		spin_lock(&mctz->lock);
+
+		/*
+		 * If we failed to reclaim anything from this memory cgroup
+		 * it is time to move on to the next cgroup
+		 */
+		next_mz = NULL;
+		if (!reclaimed) {
+			do {
+				/*
+				 * Loop until we find yet another one.
+				 *
+				 * By the time we get the soft_limit lock
+				 * again, someone might have aded the
+				 * group back on the RB tree. Iterate to
+				 * make sure we get a different mem.
+				 * mem_cgroup_largest_soft_limit_node returns
+				 * NULL if no other cgroup is present on
+				 * the tree
+				 */
+				next_mz =
+				__mem_cgroup_largest_soft_limit_node(mctz);
+				if (next_mz == mz) {
+					css_put(&next_mz->mem->css);
+					next_mz = NULL;
+				} else /* next_mz == NULL or other memcg */
+					break;
+			} while (1);
+		}
+		__mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
+		excess = res_counter_soft_limit_excess(&mz->mem->res);
+		/*
+		 * One school of thought says that we should not add
+		 * back the node to the tree if reclaim returns 0.
+		 * But our reclaim could return 0, simply because due
+		 * to priority we are exposing a smaller subset of
+		 * memory to reclaim from. Consider this as a longer
+		 * term TODO.
+		 */
+		/* If excess == 0, no tree ops */
+		__mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess);
+		spin_unlock(&mctz->lock);
+		css_put(&mz->mem->css);
+		loop++;
+		/*
+		 * Could not reclaim anything and there are no more
+		 * mem cgroups to try or we seem to be looping without
+		 * reclaiming anything.
+		 */
+		if (!nr_reclaimed &&
+			(next_mz == NULL ||
+			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
+			break;
+	} while (!nr_reclaimed);
+	if (next_mz)
+		css_put(&next_mz->mem->css);
+	return nr_reclaimed;
 }
 
 /*
@@ -1359,24 +1489,6 @@ nomem:
 	return -ENOMEM;
 }
 
-/*
- * A helper function to get mem_cgroup from ID. must be called under
- * rcu_read_lock(). The caller must check css_is_removed() or some if
- * it's concern. (dropping refcnt from swap can be called against removed
- * memcg.)
- */
-static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
-{
-	struct cgroup_subsys_state *css;
-
-	/* ID 0 is unused ID */
-	if (!id)
-		return NULL;
-	css = css_lookup(&mem_cgroup_subsys, id);
-	if (!css)
-		return NULL;
-	return container_of(css, struct mem_cgroup, css);
-}
 
 static struct mem_cgroup *try_get_mem_cgroup_from_swapcache(struct page *page)
 {
@@ -1420,189 +1532,46 @@ static void __mem_cgroup_commit_charge(s
 {
 	/* try_charge() can return NULL to *memcg, taking care of it. */
 	if (!mem)
-		return;
-
-	lock_page_cgroup(pc);
-	if (unlikely(PageCgroupUsed(pc))) {
-		unlock_page_cgroup(pc);
-		if (!mem_cgroup_is_root(mem)) {
-			res_counter_uncharge(&mem->res, PAGE_SIZE);
-			if (do_swap_account)
-				res_counter_uncharge(&mem->memsw, PAGE_SIZE);
-		}
-		css_put(&mem->css);
-		return;
-	}
-
-	pc->mem_cgroup = mem;
-	/*
-	 * We access a page_cgroup asynchronously without lock_page_cgroup().
-	 * Especially when a page_cgroup is taken from a page, pc->mem_cgroup
-	 * is accessed after testing USED bit. To make pc->mem_cgroup visible
-	 * before USED bit, we need memory barrier here.
-	 * See mem_cgroup_add_lru_list(), etc.
- 	 */
-	smp_wmb();
-	switch (ctype) {
-	case MEM_CGROUP_CHARGE_TYPE_CACHE:
-	case MEM_CGROUP_CHARGE_TYPE_SHMEM:
-		SetPageCgroupCache(pc);
-		SetPageCgroupUsed(pc);
-		break;
-	case MEM_CGROUP_CHARGE_TYPE_MAPPED:
-		ClearPageCgroupCache(pc);
-		SetPageCgroupUsed(pc);
-		break;
-	default:
-		break;
-	}
-
-	mem_cgroup_charge_statistics(mem, pc, true);
-
-	unlock_page_cgroup(pc);
-}
-
-/**
- * mem_cgroup_move_account - move account of the page
- * @pc:	page_cgroup of the page.
- * @from: mem_cgroup which the page is moved from.
- * @to:	mem_cgroup which the page is moved to. @from != @to.
- *
- * The caller must confirm following.
- * - page is not on LRU (isolate_page() is useful.)
- *
- * returns 0 at success,
- * returns -EBUSY when lock is busy or "pc" is unstable.
- *
- * This function does "uncharge" from old cgroup but doesn't do "charge" to
- * new cgroup. It should be done by a caller.
- */
-
-static int mem_cgroup_move_account(struct page_cgroup *pc,
-	struct mem_cgroup *from, struct mem_cgroup *to)
-{
-	struct mem_cgroup_per_zone *from_mz, *to_mz;
-	int nid, zid;
-	int ret = -EBUSY;
-	struct page *page;
-	int cpu;
-	struct mem_cgroup_stat *stat;
-	struct mem_cgroup_stat_cpu *cpustat;
-
-	VM_BUG_ON(from == to);
-	VM_BUG_ON(PageLRU(pc->page));
-
-	nid = page_cgroup_nid(pc);
-	zid = page_cgroup_zid(pc);
-	from_mz =  mem_cgroup_zoneinfo(from, nid, zid);
-	to_mz =  mem_cgroup_zoneinfo(to, nid, zid);
-
-	if (!trylock_page_cgroup(pc))
-		return ret;
-
-	if (!PageCgroupUsed(pc))
-		goto out;
-
-	if (pc->mem_cgroup != from)
-		goto out;
-
-	if (!mem_cgroup_is_root(from))
-		res_counter_uncharge(&from->res, PAGE_SIZE);
-	mem_cgroup_charge_statistics(from, pc, false);
-
-	page = pc->page;
-	if (page_is_file_cache(page) && page_mapped(page)) {
-		cpu = smp_processor_id();
-		/* Update mapped_file data for mem_cgroup "from" */
-		stat = &from->stat;
-		cpustat = &stat->cpustat[cpu];
-		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE,
-						-1);
-
-		/* Update mapped_file data for mem_cgroup "to" */
-		stat = &to->stat;
-		cpustat = &stat->cpustat[cpu];
-		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE,
-						1);
-	}
-
-	if (do_swap_account && !mem_cgroup_is_root(from))
-		res_counter_uncharge(&from->memsw, PAGE_SIZE);
-	css_put(&from->css);
-
-	css_get(&to->css);
-	pc->mem_cgroup = to;
-	mem_cgroup_charge_statistics(to, pc, true);
-	ret = 0;
-out:
-	unlock_page_cgroup(pc);
-	/*
-	 * We charges against "to" which may not have any tasks. Then, "to"
-	 * can be under rmdir(). But in current implementation, caller of
-	 * this function is just force_empty() and it's garanteed that
-	 * "to" is never removed. So, we don't check rmdir status here.
-	 */
-	return ret;
-}
-
-/*
- * move charges to its parent.
- */
-
-static int mem_cgroup_move_parent(struct page_cgroup *pc,
-				  struct mem_cgroup *child,
-				  gfp_t gfp_mask)
-{
-	struct page *page = pc->page;
-	struct cgroup *cg = child->css.cgroup;
-	struct cgroup *pcg = cg->parent;
-	struct mem_cgroup *parent;
-	int ret;
-
-	/* Is ROOT ? */
-	if (!pcg)
-		return -EINVAL;
-
-
-	parent = mem_cgroup_from_cont(pcg);
-
-
-	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page);
-	if (ret || !parent)
-		return ret;
-
-	if (!get_page_unless_zero(page)) {
-		ret = -EBUSY;
-		goto uncharge;
-	}
-
-	ret = isolate_lru_page(page);
-
-	if (ret)
-		goto cancel;
-
-	ret = mem_cgroup_move_account(pc, child, parent);
+		return;
 
-	putback_lru_page(page);
-	if (!ret) {
-		put_page(page);
-		/* drop extra refcnt by try_charge() */
-		css_put(&parent->css);
-		return 0;
+	lock_page_cgroup(pc);
+	if (unlikely(PageCgroupUsed(pc))) {
+		unlock_page_cgroup(pc);
+		if (!mem_cgroup_is_root(mem)) {
+			res_counter_uncharge(&mem->res, PAGE_SIZE);
+			if (do_swap_account)
+				res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+		}
+		css_put(&mem->css);
+		return;
 	}
 
-cancel:
-	put_page(page);
-uncharge:
-	/* drop extra refcnt by try_charge() */
-	css_put(&parent->css);
-	/* uncharge if move fails */
-	if (!mem_cgroup_is_root(parent)) {
-		res_counter_uncharge(&parent->res, PAGE_SIZE);
-		if (do_swap_account)
-			res_counter_uncharge(&parent->memsw, PAGE_SIZE);
+	pc->mem_cgroup = mem;
+	/*
+	 * We access a page_cgroup asynchronously without lock_page_cgroup().
+	 * Especially when a page_cgroup is taken from a page, pc->mem_cgroup
+	 * is accessed after testing USED bit. To make pc->mem_cgroup visible
+	 * before USED bit, we need memory barrier here.
+	 * See mem_cgroup_add_lru_list(), etc.
+	 */
+	smp_wmb();
+	switch (ctype) {
+	case MEM_CGROUP_CHARGE_TYPE_CACHE:
+	case MEM_CGROUP_CHARGE_TYPE_SHMEM:
+		SetPageCgroupCache(pc);
+		SetPageCgroupUsed(pc);
+		break;
+	case MEM_CGROUP_CHARGE_TYPE_MAPPED:
+		ClearPageCgroupCache(pc);
+		SetPageCgroupUsed(pc);
+		break;
+	default:
+		break;
 	}
-	return ret;
+
+	mem_cgroup_charge_statistics(mem, pc, true);
+
+	unlock_page_cgroup(pc);
 }
 
 /*
@@ -1981,6 +1950,32 @@ void mem_cgroup_uncharge_swap(swp_entry_
 #endif
 
 /*
+ * A call to try to shrink memory usage on charge failure at shmem's swapin.
+ * Calling hierarchical_reclaim is not enough because we should update
+ * last_oom_jiffies to prevent pagefault_out_of_memory from invoking global OOM.
+ * Moreover considering hierarchy, we should reclaim from the mem_over_limit,
+ * not from the memcg which this page would be charged to.
+ * try_charge_swapin does all of these works properly.
+ */
+int mem_cgroup_shmem_charge_fallback(struct page *page,
+			    struct mm_struct *mm,
+			    gfp_t gfp_mask)
+{
+	struct mem_cgroup *mem = NULL;
+	int ret;
+
+	if (mem_cgroup_disabled())
+		return 0;
+
+	ret = mem_cgroup_try_charge_swapin(mm, page, gfp_mask, &mem);
+	if (!ret)
+		mem_cgroup_cancel_charge_swapin(mem); /* it does !mem check */
+
+	return ret;
+}
+
+
+/*
  * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
  * page belongs to.
  */
@@ -2030,70 +2025,231 @@ void mem_cgroup_end_migration(struct mem
 		unused = oldpage;
 	}
 
-	if (PageAnon(target))
-		ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
-	else if (page_is_file_cache(target))
-		ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
-	else
-		ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
-
-	/* unused page is not on radix-tree now. */
-	if (unused)
-		__mem_cgroup_uncharge_common(unused, ctype);
+	if (PageAnon(target))
+		ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
+	else if (page_is_file_cache(target))
+		ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
+	else
+		ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
+
+	/* unused page is not on radix-tree now. */
+	if (unused)
+		__mem_cgroup_uncharge_common(unused, ctype);
+
+	pc = lookup_page_cgroup(target);
+	/*
+	 * __mem_cgroup_commit_charge() check PCG_USED bit of page_cgroup.
+	 * So, double-counting is effectively avoided.
+	 */
+	__mem_cgroup_commit_charge(mem, pc, ctype);
+
+	/*
+	 * Both of oldpage and newpage are still under lock_page().
+	 * Then, we don't have to care about race in radix-tree.
+	 * But we have to be careful that this page is unmapped or not.
+	 *
+	 * There is a case for !page_mapped(). At the start of
+	 * migration, oldpage was mapped. But now, it's zapped.
+	 * But we know *target* page is not freed/reused under us.
+	 * mem_cgroup_uncharge_page() does all necessary checks.
+	 */
+	if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
+		mem_cgroup_uncharge_page(target);
+	/*
+	 * At migration, we may charge account against cgroup which has no tasks
+	 * So, rmdir()->pre_destroy() can be called while we do this charge.
+	 * In that case, we need to call pre_destroy() again. check it here.
+	 */
+	cgroup_release_and_wakeup_rmdir(&mem->css);
+}
+
+
+/**
+ * mem_cgroup_move_account - move account of the page
+ * @pc:	page_cgroup of the page.
+ * @from: mem_cgroup which the page is moved from.
+ * @to:	mem_cgroup which the page is moved to. @from != @to.
+ *
+ * The caller must confirm following.
+ * - page is not on LRU (isolate_page() is useful.)
+ *
+ * returns 0 at success,
+ * returns -EBUSY when lock is busy or "pc" is unstable.
+ *
+ * This function does "uncharge" from old cgroup but doesn't do "charge" to
+ * new cgroup. It should be done by a caller.
+ */
+
+static int mem_cgroup_move_account(struct page_cgroup *pc,
+	struct mem_cgroup *from, struct mem_cgroup *to)
+{
+	struct mem_cgroup_per_zone *from_mz, *to_mz;
+	int nid, zid;
+	int ret = -EBUSY;
+	struct page *page;
+	int cpu;
+	struct mem_cgroup_stat *stat;
+	struct mem_cgroup_stat_cpu *cpustat;
+
+	VM_BUG_ON(from == to);
+	VM_BUG_ON(PageLRU(pc->page));
+
+	nid = page_cgroup_nid(pc);
+	zid = page_cgroup_zid(pc);
+	from_mz =  mem_cgroup_zoneinfo(from, nid, zid);
+	to_mz =  mem_cgroup_zoneinfo(to, nid, zid);
+
+	if (!trylock_page_cgroup(pc))
+		return ret;
+
+	if (!PageCgroupUsed(pc))
+		goto out;
+
+	if (pc->mem_cgroup != from)
+		goto out;
+
+	if (!mem_cgroup_is_root(from))
+		res_counter_uncharge(&from->res, PAGE_SIZE);
+	mem_cgroup_charge_statistics(from, pc, false);
+
+	page = pc->page;
+	if (page_is_file_cache(page) && page_mapped(page)) {
+		cpu = smp_processor_id();
+		/* Update mapped_file data for mem_cgroup "from" */
+		stat = &from->stat;
+		cpustat = &stat->cpustat[cpu];
+		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE,
+						-1);
+
+		/* Update mapped_file data for mem_cgroup "to" */
+		stat = &to->stat;
+		cpustat = &stat->cpustat[cpu];
+		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE,
+						1);
+	}
+
+	if (do_swap_account && !mem_cgroup_is_root(from))
+		res_counter_uncharge(&from->memsw, PAGE_SIZE);
+	css_put(&from->css);
+
+	css_get(&to->css);
+	pc->mem_cgroup = to;
+	mem_cgroup_charge_statistics(to, pc, true);
+	ret = 0;
+out:
+	unlock_page_cgroup(pc);
+	/*
+	 * We charges against "to" which may not have any tasks. Then, "to"
+	 * can be under rmdir(). But in current implementation, caller of
+	 * this function is just force_empty() and it's garanteed that
+	 * "to" is never removed. So, we don't check rmdir status here.
+	 */
+	return ret;
+}
+
+/*
+ * move charges to its parent.
+ */
+
+static int mem_cgroup_move_parent(struct page_cgroup *pc,
+				  struct mem_cgroup *child,
+				  gfp_t gfp_mask)
+{
+	struct page *page = pc->page;
+	struct cgroup *cg = child->css.cgroup;
+	struct cgroup *pcg = cg->parent;
+	struct mem_cgroup *parent;
+	int ret;
+
+	/* Is ROOT ? */
+	if (!pcg)
+		return -EINVAL;
+
+
+	parent = mem_cgroup_from_cont(pcg);
+
+
+	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page);
+	if (ret || !parent)
+		return ret;
+
+	if (!get_page_unless_zero(page)) {
+		ret = -EBUSY;
+		goto uncharge;
+	}
+
+	ret = isolate_lru_page(page);
+
+	if (ret)
+		goto cancel;
+
+	ret = mem_cgroup_move_account(pc, child, parent);
+
+	putback_lru_page(page);
+	if (!ret) {
+		put_page(page);
+		/* drop extra refcnt by try_charge() */
+		css_put(&parent->css);
+		return 0;
+	}
+
+cancel:
+	put_page(page);
+uncharge:
+	/* drop extra refcnt by try_charge() */
+	css_put(&parent->css);
+	/* uncharge if move fails */
+	if (!mem_cgroup_is_root(parent)) {
+		res_counter_uncharge(&parent->res, PAGE_SIZE);
+		if (do_swap_account)
+			res_counter_uncharge(&parent->memsw, PAGE_SIZE);
+	}
+	return ret;
+}
+
+/*
+ * For User Interfaces.
+ */
+static DEFINE_MUTEX(set_limit_mutex);
 
-	pc = lookup_page_cgroup(target);
-	/*
-	 * __mem_cgroup_commit_charge() check PCG_USED bit of page_cgroup.
-	 * So, double-counting is effectively avoided.
-	 */
-	__mem_cgroup_commit_charge(mem, pc, ctype);
+static u64 mem_cgroup_swappiness_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
 
-	/*
-	 * Both of oldpage and newpage are still under lock_page().
-	 * Then, we don't have to care about race in radix-tree.
-	 * But we have to be careful that this page is unmapped or not.
-	 *
-	 * There is a case for !page_mapped(). At the start of
-	 * migration, oldpage was mapped. But now, it's zapped.
-	 * But we know *target* page is not freed/reused under us.
-	 * mem_cgroup_uncharge_page() does all necessary checks.
-	 */
-	if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
-		mem_cgroup_uncharge_page(target);
-	/*
-	 * At migration, we may charge account against cgroup which has no tasks
-	 * So, rmdir()->pre_destroy() can be called while we do this charge.
-	 * In that case, we need to call pre_destroy() again. check it here.
-	 */
-	cgroup_release_and_wakeup_rmdir(&mem->css);
+	return get_swappiness(memcg);
 }
 
-/*
- * A call to try to shrink memory usage on charge failure at shmem's swapin.
- * Calling hierarchical_reclaim is not enough because we should update
- * last_oom_jiffies to prevent pagefault_out_of_memory from invoking global OOM.
- * Moreover considering hierarchy, we should reclaim from the mem_over_limit,
- * not from the memcg which this page would be charged to.
- * try_charge_swapin does all of these works properly.
- */
-int mem_cgroup_shmem_charge_fallback(struct page *page,
-			    struct mm_struct *mm,
-			    gfp_t gfp_mask)
+static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
+				       u64 val)
 {
-	struct mem_cgroup *mem = NULL;
-	int ret;
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	struct mem_cgroup *parent;
 
-	if (mem_cgroup_disabled())
-		return 0;
+	if (val > 100)
+		return -EINVAL;
 
-	ret = mem_cgroup_try_charge_swapin(mm, page, gfp_mask, &mem);
-	if (!ret)
-		mem_cgroup_cancel_charge_swapin(mem); /* it does !mem check */
+	if (cgrp->parent == NULL)
+		return -EINVAL;
 
-	return ret;
-}
+	parent = mem_cgroup_from_cont(cgrp->parent);
 
-static DEFINE_MUTEX(set_limit_mutex);
+	cgroup_lock();
+
+	/* If under hierarchy, only empty-root can set this value */
+	if ((parent->use_hierarchy) ||
+	    (memcg->use_hierarchy && !list_empty(&cgrp->children))) {
+		cgroup_unlock();
+		return -EINVAL;
+	}
+
+	spin_lock(&memcg->reclaim_param_lock);
+	memcg->swappiness = val;
+	spin_unlock(&memcg->reclaim_param_lock);
+
+	cgroup_unlock();
+
+	return 0;
+}
 
 static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 				unsigned long long val)
@@ -2210,96 +2366,6 @@ static int mem_cgroup_resize_memsw_limit
 	return ret;
 }
 
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
-						gfp_t gfp_mask, int nid,
-						int zid)
-{
-	unsigned long nr_reclaimed = 0;
-	struct mem_cgroup_per_zone *mz, *next_mz = NULL;
-	unsigned long reclaimed;
-	int loop = 0;
-	struct mem_cgroup_tree_per_zone *mctz;
-	unsigned long long excess;
-
-	if (order > 0)
-		return 0;
-
-	mctz = soft_limit_tree_node_zone(nid, zid);
-	/*
-	 * This loop can run a while, specially if mem_cgroup's continuously
-	 * keep exceeding their soft limit and putting the system under
-	 * pressure
-	 */
-	do {
-		if (next_mz)
-			mz = next_mz;
-		else
-			mz = mem_cgroup_largest_soft_limit_node(mctz);
-		if (!mz)
-			break;
-
-		reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
-						gfp_mask,
-						MEM_CGROUP_RECLAIM_SOFT);
-		nr_reclaimed += reclaimed;
-		spin_lock(&mctz->lock);
-
-		/*
-		 * If we failed to reclaim anything from this memory cgroup
-		 * it is time to move on to the next cgroup
-		 */
-		next_mz = NULL;
-		if (!reclaimed) {
-			do {
-				/*
-				 * Loop until we find yet another one.
-				 *
-				 * By the time we get the soft_limit lock
-				 * again, someone might have aded the
-				 * group back on the RB tree. Iterate to
-				 * make sure we get a different mem.
-				 * mem_cgroup_largest_soft_limit_node returns
-				 * NULL if no other cgroup is present on
-				 * the tree
-				 */
-				next_mz =
-				__mem_cgroup_largest_soft_limit_node(mctz);
-				if (next_mz == mz) {
-					css_put(&next_mz->mem->css);
-					next_mz = NULL;
-				} else /* next_mz == NULL or other memcg */
-					break;
-			} while (1);
-		}
-		__mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
-		excess = res_counter_soft_limit_excess(&mz->mem->res);
-		/*
-		 * One school of thought says that we should not add
-		 * back the node to the tree if reclaim returns 0.
-		 * But our reclaim could return 0, simply because due
-		 * to priority we are exposing a smaller subset of
-		 * memory to reclaim from. Consider this as a longer
-		 * term TODO.
-		 */
-		/* If excess == 0, no tree ops */
-		__mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess);
-		spin_unlock(&mctz->lock);
-		css_put(&mz->mem->css);
-		loop++;
-		/*
-		 * Could not reclaim anything and there are no more
-		 * mem cgroups to try or we seem to be looping without
-		 * reclaiming anything.
-		 */
-		if (!nr_reclaimed &&
-			(next_mz == NULL ||
-			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
-			break;
-	} while (!nr_reclaimed);
-	if (next_mz)
-		css_put(&next_mz->mem->css);
-	return nr_reclaimed;
-}
 
 /*
  * This routine traverse page_cgroup in given list and drop them all.
@@ -2484,7 +2550,6 @@ static int mem_cgroup_hierarchy_write(st
 
 	return retval;
 }
-
 struct mem_cgroup_idx_data {
 	s64 val;
 	enum mem_cgroup_stat_index idx;
@@ -2655,6 +2720,7 @@ static int mem_cgroup_reset(struct cgrou
 }
 
 
+
 /* For read statistics */
 enum {
 	MCS_CACHE,
@@ -2799,44 +2865,6 @@ static int mem_control_stat_show(struct 
 	return 0;
 }
 
-static u64 mem_cgroup_swappiness_read(struct cgroup *cgrp, struct cftype *cft)
-{
-	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
-
-	return get_swappiness(memcg);
-}
-
-static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
-				       u64 val)
-{
-	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
-	struct mem_cgroup *parent;
-
-	if (val > 100)
-		return -EINVAL;
-
-	if (cgrp->parent == NULL)
-		return -EINVAL;
-
-	parent = mem_cgroup_from_cont(cgrp->parent);
-
-	cgroup_lock();
-
-	/* If under hierarchy, only empty-root can set this value */
-	if ((parent->use_hierarchy) ||
-	    (memcg->use_hierarchy && !list_empty(&cgrp->children))) {
-		cgroup_unlock();
-		return -EINVAL;
-	}
-
-	spin_lock(&memcg->reclaim_param_lock);
-	memcg->swappiness = val;
-	spin_unlock(&memcg->reclaim_param_lock);
-
-	cgroup_unlock();
-
-	return 0;
-}
 
 
 static struct cftype mem_cgroup_files[] = {
@@ -2916,6 +2944,27 @@ static struct cftype memsw_cgroup_files[
 	},
 };
 
+/*
+ * Moving tasks.
+ */
+static void mem_cgroup_move_task(struct cgroup_subsys *ss,
+				struct cgroup *cont,
+				struct cgroup *old_cont,
+				struct task_struct *p,
+				bool threadgroup)
+{
+	mutex_lock(&memcg_tasklist);
+	/*
+	 * FIXME: It's better to move charges of this process from old
+	 * memcg to new memcg. But it's just on TODO-List now.
+	 */
+	mutex_unlock(&memcg_tasklist);
+}
+
+/*
+ * memcg creation and destruction.
+ */
+
 static int register_memsw_files(struct cgroup *cont, struct cgroup_subsys *ss)
 {
 	if (!do_swap_account)
@@ -3163,20 +3212,6 @@ static int mem_cgroup_populate(struct cg
 	return ret;
 }
 
-static void mem_cgroup_move_task(struct cgroup_subsys *ss,
-				struct cgroup *cont,
-				struct cgroup *old_cont,
-				struct task_struct *p,
-				bool threadgroup)
-{
-	mutex_lock(&memcg_tasklist);
-	/*
-	 * FIXME: It's better to move charges of this process from old
-	 * memcg to new memcg. But it's just on TODO-List now.
-	 */
-	mutex_unlock(&memcg_tasklist);
-}
-
 struct cgroup_subsys mem_cgroup_subsys = {
 	.name = "memory",
 	.subsys_id = mem_cgroup_subsys_id,


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC][PATCH 4/10] memcg: add memcg charge cancel
  2009-09-25  8:17 [RFC][PATCH 0/10] memcg clean up and some fixes for softlimit (Sep25) KAMEZAWA Hiroyuki
                   ` (2 preceding siblings ...)
  2009-09-25  8:21 ` [RFC][PATCH 3/10] memcg: reorganize memcontrol.c KAMEZAWA Hiroyuki
@ 2009-09-25  8:22 ` KAMEZAWA Hiroyuki
  2009-09-25  8:24 ` [RFC][PATCH 5/10] memcg: clean up percpu statistics access KAMEZAWA Hiroyuki
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-25  8:22 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura

From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

There are some places calling both res_counter_uncharge() and css_put()
to cancel the charge and drop the refcnt we got by mem_cgroup_try_charge().

This patch introduces mem_cgroup_cancel_charge() and calls it in those places.
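
For illustration, the open-coded pattern each call site carried and what it
becomes with the helper (simplified from the hunks below, not an exact quote
of any one call site):

	/* before: repeated at every cancel point */
	if (!mem_cgroup_is_root(mem)) {
		res_counter_uncharge(&mem->res, PAGE_SIZE);
		if (do_swap_account)
			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
	}
	css_put(&mem->css);

	/* after: one helper does both the uncharge and the css_put() */
	mem_cgroup_cancel_charge(mem);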

Modifications from Nishimura's version:
 - removed 'inline'
 - adjusted for a change in res_counter_uncharge.
 - added comment

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 mm/memcontrol.c |   38 ++++++++++++++++++--------------------
 1 file changed, 18 insertions(+), 20 deletions(-)

Index: temp-mmotm/mm/memcontrol.c
===================================================================
--- temp-mmotm.orig/mm/memcontrol.c
+++ temp-mmotm/mm/memcontrol.c
@@ -1489,6 +1489,21 @@ nomem:
 	return -ENOMEM;
 }
 
+/*
+ * Sometimes we have to undo a charge we got by try_charge().
+ * This function does the uncharge and puts the css's refcnt
+ * gotten by try_charge().
+ */
+static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
+{
+	if (!mem_cgroup_is_root(mem)) {
+		res_counter_uncharge(&mem->res, PAGE_SIZE);
+		if (do_swap_account)
+			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+	}
+	css_put(&mem->css);
+}
+
 
 static struct mem_cgroup *try_get_mem_cgroup_from_swapcache(struct page *page)
 {
@@ -1537,12 +1552,7 @@ static void __mem_cgroup_commit_charge(s
 	lock_page_cgroup(pc);
 	if (unlikely(PageCgroupUsed(pc))) {
 		unlock_page_cgroup(pc);
-		if (!mem_cgroup_is_root(mem)) {
-			res_counter_uncharge(&mem->res, PAGE_SIZE);
-			if (do_swap_account)
-				res_counter_uncharge(&mem->memsw, PAGE_SIZE);
-		}
-		css_put(&mem->css);
+		mem_cgroup_cancel_charge(mem);
 		return;
 	}
 
@@ -1786,12 +1796,7 @@ void mem_cgroup_cancel_charge_swapin(str
 		return;
 	if (!mem)
 		return;
-	if (!mem_cgroup_is_root(mem)) {
-		res_counter_uncharge(&mem->res, PAGE_SIZE);
-		if (do_swap_account)
-			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
-	}
-	css_put(&mem->css);
+	mem_cgroup_cancel_charge(mem);
 }
 
 
@@ -2196,14 +2201,7 @@ static int mem_cgroup_move_parent(struct
 cancel:
 	put_page(page);
 uncharge:
-	/* drop extra refcnt by try_charge() */
-	css_put(&parent->css);
-	/* uncharge if move fails */
-	if (!mem_cgroup_is_root(parent)) {
-		res_counter_uncharge(&parent->res, PAGE_SIZE);
-		if (do_swap_account)
-			res_counter_uncharge(&parent->memsw, PAGE_SIZE);
-	}
+	mem_cgroup_cancel_charge(parent);
 	return ret;
 }
 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC][PATCH 5/10] memcg: clean up percpu statistics access
  2009-09-25  8:17 [RFC][PATCH 0/10] memcg clean up and some fixes for softlimit (Sep25) KAMEZAWA Hiroyuki
                   ` (3 preceding siblings ...)
  2009-09-25  8:22 ` [RFC][PATCH 4/10] memcg: add memcg charge cancel KAMEZAWA Hiroyuki
@ 2009-09-25  8:24 ` KAMEZAWA Hiroyuki
  2009-09-25  8:25 ` [RFC][PATCH 6/10] memcg: remove unsued macros KAMEZAWA Hiroyuki
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-25  8:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Cleanup.
There are several places which update memcg's percpu counters.

Originally, there was only mem_cgroup_charge_statistics() and
we used a bare access pattern such as
   get_cpu()
   cpustat = &mem_cgroup->stat.cpustat[cpu];
   /* update cpu stat */
   put_cpu()

But several callers of the above code have been added since, and it no
longer looks clean. This patch adds
  mem_cgroup_stat_read_pcpu()
  mem_cgroup_stat_add_pcpu()
and renames
  mem_cgroup_local_usage() to mem_cgroup_usage()
because "local" here is ambiguous.
(I used "local" to mean "no hierarchy consideration", but..)

It also changes mem_cgroup_read_stat()'s argument from
struct mem_cgroup_stat * to struct mem_cgroup *.
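
As a quick illustration (names taken from this patch; see the hunks below
for the real call sites), a caller now reads like:

	s64 val, usage;

	/* add to one percpu counter; get_cpu()/put_cpu() handled inside */
	mem_cgroup_stat_add_pcpu(mem, MEM_CGROUP_STAT_MAPPED_FILE, -1);

	/* read this cpu's value of the event counter */
	val = mem_cgroup_stat_read_pcpu(mem, MEM_CGROUP_STAT_EVENTS);

	/* sum over all possible cpus; the argument is now struct mem_cgroup * */
	usage = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);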

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |  126 ++++++++++++++++++++++++++++++--------------------------
 1 file changed, 68 insertions(+), 58 deletions(-)

Index: temp-mmotm/mm/memcontrol.c
===================================================================
--- temp-mmotm.orig/mm/memcontrol.c
+++ temp-mmotm/mm/memcontrol.c
@@ -314,22 +314,41 @@ static struct mem_cgroup *mem_cgroup_loo
 
 /*
  * Handlers for memcg's private percpu counters.
+ * The percpu counters are used for memcg's local (i.e. ignoring hierarchy)
+ * statistics. An event counter is also maintained. Updates to all fields
+ * should be done with preemption disabled.
  */
 
-static inline void
-__mem_cgroup_stat_reset_safe(struct mem_cgroup_stat_cpu *stat,
+static inline void mem_cgroup_stat_reset_pcpu(struct mem_cgroup *mem,
 				enum mem_cgroup_stat_index idx)
 {
-	stat->count[idx] = 0;
+	struct mem_cgroup_stat_cpu *cstat;
+	int cpu = get_cpu();
+
+	cstat = &mem->stat.cpustat[cpu];
+	cstat->count[idx] = 0;
+	put_cpu();
 }
 
-static inline s64
-__mem_cgroup_stat_read_local(struct mem_cgroup_stat_cpu *stat,
+static inline s64 __mem_cgroup_stat_read_safe(struct mem_cgroup_stat_cpu *stat,
 				enum mem_cgroup_stat_index idx)
 {
 	return stat->count[idx];
 }
 
+static inline s64 mem_cgroup_stat_read_pcpu(struct mem_cgroup *mem,
+				enum mem_cgroup_stat_index idx)
+{
+	struct mem_cgroup_stat_cpu *cstat;
+	int cpu = get_cpu();
+	s64 ret;
+
+	cstat = &mem->stat.cpustat[cpu];
+	ret = __mem_cgroup_stat_read_safe(cstat, idx);
+	put_cpu();
+	return ret;
+}
+
 /*
  * For accounting under irq disable, no need for increment preempt count.
  */
@@ -339,13 +358,24 @@ static inline void __mem_cgroup_stat_add
 	stat->count[idx] += val;
 }
 
-static s64 mem_cgroup_read_stat(struct mem_cgroup_stat *stat,
+static inline void mem_cgroup_stat_add_pcpu(struct mem_cgroup *mem,
+		enum mem_cgroup_stat_index idx, int val)
+{
+	struct mem_cgroup_stat_cpu *cstat;
+	int cpu = get_cpu();
+
+	cstat = &mem->stat.cpustat[cpu];
+	__mem_cgroup_stat_add_safe(cstat, idx, val);
+	put_cpu();
+}
+
+static s64 mem_cgroup_read_stat(struct mem_cgroup *mem,
 		enum mem_cgroup_stat_index idx)
 {
 	int cpu;
 	s64 ret = 0;
 	for_each_possible_cpu(cpu)
-		ret += stat->cpustat[cpu].count[idx];
+		ret += mem->stat.cpustat[cpu].count[idx];
 	return ret;
 }
 
@@ -353,25 +383,22 @@ static void mem_cgroup_swap_statistics(s
 					 bool charge)
 {
 	int val = (charge) ? 1 : -1;
-	struct mem_cgroup_stat *stat = &mem->stat;
-	struct mem_cgroup_stat_cpu *cpustat;
-	int cpu = get_cpu();
-
-	cpustat = &stat->cpustat[cpu];
-	__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_SWAPOUT, val);
-	put_cpu();
+	mem_cgroup_stat_add_pcpu(mem, MEM_CGROUP_STAT_SWAPOUT, val);
 }
 
+/*
+ * We update several counters at once. Do all of them within one
+ * get_cpu()/put_cpu() pair.
+ */
 static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
 					 struct page_cgroup *pc,
 					 bool charge)
 {
 	int val = (charge) ? 1 : -1;
-	struct mem_cgroup_stat *stat = &mem->stat;
 	struct mem_cgroup_stat_cpu *cpustat;
 	int cpu = get_cpu();
 
-	cpustat = &stat->cpustat[cpu];
+	cpustat = &mem->stat.cpustat[cpu];
 	if (PageCgroupCache(pc))
 		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_CACHE, val);
 	else
@@ -387,12 +414,17 @@ static void mem_cgroup_charge_statistics
 	put_cpu();
 }
 
-static s64 mem_cgroup_local_usage(struct mem_cgroup_stat *stat)
+/*
+ * When mem_cgroup is used with hierarchy inheritance enabled, cgroup local
+ * memory usage is just shown by the sum of percpu statistics. This function returns
+ * cgroup local memory usage even if it's under hierarchy.
+ */
+static s64 mem_cgroup_usage(struct mem_cgroup *mem)
 {
 	s64 ret;
 
-	ret = mem_cgroup_read_stat(stat, MEM_CGROUP_STAT_CACHE);
-	ret += mem_cgroup_read_stat(stat, MEM_CGROUP_STAT_RSS);
+	ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
+	ret += mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
 	return ret;
 }
 
@@ -403,9 +435,6 @@ static s64 mem_cgroup_local_usage(struct
 void mem_cgroup_update_mapped_file_stat(struct page *page, int val)
 {
 	struct mem_cgroup *mem;
-	struct mem_cgroup_stat *stat;
-	struct mem_cgroup_stat_cpu *cpustat;
-	int cpu;
 	struct page_cgroup *pc;
 
 	if (!page_is_file_cache(page))
@@ -423,14 +452,7 @@ void mem_cgroup_update_mapped_file_stat(
 	if (!PageCgroupUsed(pc))
 		goto done;
 
-	/*
-	 * Preemption is already disabled, we don't need get_cpu()
-	 */
-	cpu = smp_processor_id();
-	stat = &mem->stat;
-	cpustat = &stat->cpustat[cpu];
-
-	__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
+	mem_cgroup_stat_add_pcpu(mem, MEM_CGROUP_STAT_MAPPED_FILE, val);
 done:
 	unlock_page_cgroup(pc);
 }
@@ -557,18 +579,13 @@ mem_cgroup_remove_exceeded(struct mem_cg
 static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem)
 {
 	bool ret = false;
-	int cpu;
 	s64 val;
-	struct mem_cgroup_stat_cpu *cpustat;
 
-	cpu = get_cpu();
-	cpustat = &mem->stat.cpustat[cpu];
-	val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_EVENTS);
+	val = mem_cgroup_stat_read_pcpu(mem, MEM_CGROUP_STAT_EVENTS);
 	if (unlikely(val > SOFTLIMIT_EVENTS_THRESH)) {
-		__mem_cgroup_stat_reset_safe(cpustat, MEM_CGROUP_STAT_EVENTS);
+		mem_cgroup_stat_reset_pcpu(mem, MEM_CGROUP_STAT_EVENTS);
 		ret = true;
 	}
-	put_cpu();
 	return ret;
 }
 
@@ -1265,7 +1282,11 @@ static int mem_cgroup_hierarchical_recla
 				}
 			}
 		}
-		if (!mem_cgroup_local_usage(&victim->stat)) {
+		/*
+		 * mem->res can include memory usage of children. We have to
+		 * check memory usage of this victim.
+		 */
+		if (!mem_cgroup_usage(victim)) {
 			/* this cgroup's local usage == 0 */
 			css_put(&victim->css);
 			continue;
@@ -2092,9 +2113,6 @@ static int mem_cgroup_move_account(struc
 	int nid, zid;
 	int ret = -EBUSY;
 	struct page *page;
-	int cpu;
-	struct mem_cgroup_stat *stat;
-	struct mem_cgroup_stat_cpu *cpustat;
 
 	VM_BUG_ON(from == to);
 	VM_BUG_ON(PageLRU(pc->page));
@@ -2119,18 +2137,10 @@ static int mem_cgroup_move_account(struc
 
 	page = pc->page;
 	if (page_is_file_cache(page) && page_mapped(page)) {
-		cpu = smp_processor_id();
 		/* Update mapped_file data for mem_cgroup "from" */
-		stat = &from->stat;
-		cpustat = &stat->cpustat[cpu];
-		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE,
-						-1);
-
+		mem_cgroup_stat_add_pcpu(from, MEM_CGROUP_STAT_MAPPED_FILE, -1);
 		/* Update mapped_file data for mem_cgroup "to" */
-		stat = &to->stat;
-		cpustat = &stat->cpustat[cpu];
-		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE,
-						1);
+		mem_cgroup_stat_add_pcpu(to, MEM_CGROUP_STAT_MAPPED_FILE, 1);
 	}
 
 	if (do_swap_account && !mem_cgroup_is_root(from))
@@ -2557,7 +2567,7 @@ static int
 mem_cgroup_get_idx_stat(struct mem_cgroup *mem, void *data)
 {
 	struct mem_cgroup_idx_data *d = data;
-	d->val += mem_cgroup_read_stat(&mem->stat, d->idx);
+	d->val += mem_cgroup_read_stat(mem, d->idx);
 	return 0;
 }
 
@@ -2763,18 +2773,18 @@ static int mem_cgroup_get_local_stat(str
 	s64 val;
 
 	/* per cpu stat */
-	val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_CACHE);
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
 	s->stat[MCS_CACHE] += val * PAGE_SIZE;
-	val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
 	s->stat[MCS_RSS] += val * PAGE_SIZE;
-	val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_MAPPED_FILE);
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_MAPPED_FILE);
 	s->stat[MCS_MAPPED_FILE] += val * PAGE_SIZE;
-	val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_PGPGIN_COUNT);
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PGPGIN_COUNT);
 	s->stat[MCS_PGPGIN] += val;
-	val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_PGPGOUT_COUNT);
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PGPGOUT_COUNT);
 	s->stat[MCS_PGPGOUT] += val;
 	if (do_swap_account) {
-		val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_SWAPOUT);
+		val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
 		s->stat[MCS_SWAP] += val * PAGE_SIZE;
 	}
 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC][PATCH 6/10] memcg: remove unsued macros
  2009-09-25  8:17 [RFC][PATCH 0/10] memcg clean up and some fixes for softlimit (Sep25) KAMEZAWA Hiroyuki
                   ` (4 preceding siblings ...)
  2009-09-25  8:24 ` [RFC][PATCH 5/10] memcg: clean up percpu statistics access KAMEZAWA Hiroyuki
@ 2009-09-25  8:25 ` KAMEZAWA Hiroyuki
  2009-09-25  8:25 ` [RFC][PATCH 0/10] memcg clean up and some fixes for softlimit (Sep25) Balbir Singh
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-25  8:25 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

This patch removes unused macros/enums from memcontrol.c
and adds more commentary on the charge_type enum definition.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

Index: temp-mmotm/mm/memcontrol.c
===================================================================
--- temp-mmotm.orig/mm/memcontrol.c
+++ temp-mmotm/mm/memcontrol.c
@@ -189,24 +189,21 @@ struct mem_cgroup {
 	struct mem_cgroup_stat stat;
 };
 
-
+/*
+ * Types of charge/uncharge. memcg's behavior depends on these types.
+ * SWAPOUT is for mem+swap accounting. It's used when a page is dropped
+ * from memory but there is a valid reference in swap. DROP means
+ * a page is removed from swap cache and no reference from swap itself.
+ */
 enum charge_type {
-	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
-	MEM_CGROUP_CHARGE_TYPE_MAPPED,
+	MEM_CGROUP_CHARGE_TYPE_CACHE = 0, /* charge from file cache */
+	MEM_CGROUP_CHARGE_TYPE_MAPPED,  /* charge from anonymous memory */
 	MEM_CGROUP_CHARGE_TYPE_SHMEM,	/* used by page migration of shmem */
-	MEM_CGROUP_CHARGE_TYPE_FORCE,	/* used by force_empty */
 	MEM_CGROUP_CHARGE_TYPE_SWAPOUT,	/* for accounting swapcache */
 	MEM_CGROUP_CHARGE_TYPE_DROP,	/* a page was unused swap cache */
 	NR_CHARGE_TYPE,
 };
 
-/* only for here (for easy reading.) */
-#define PCGF_CACHE	(1UL << PCG_CACHE)
-#define PCGF_USED	(1UL << PCG_USED)
-#define PCGF_LOCK	(1UL << PCG_LOCK)
-/* Not used, but added here for completeness */
-#define PCGF_ACCT	(1UL << PCG_ACCT)
-
 /* for encoding cft->private value on file */
 #define _MEM			(0)
 #define _MEMSWAP		(1)


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][PATCH 0/10] memcg  clean up and some fixes for softlimit (Sep25)
  2009-09-25  8:17 [RFC][PATCH 0/10] memcg clean up and some fixes for softlimit (Sep25) KAMEZAWA Hiroyuki
                   ` (5 preceding siblings ...)
  2009-09-25  8:25 ` [RFC][PATCH 6/10] memcg: remove unsued macros KAMEZAWA Hiroyuki
@ 2009-09-25  8:25 ` Balbir Singh
  2009-09-25  8:27 ` [RFC][PATCH 7/10] memcg: replace cont with cgroup KAMEZAWA Hiroyuki
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Balbir Singh @ 2009-09-25  8:25 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, nishimura

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-09-25 17:17:21]:

> 
> As I posted Sep/18, I'm now planning to make memcontrol.c cleaner.
> I'll post this to Andrew in the next week if no objections.
> (IOW, I'll post this again. So, review itself is not very urgent.)
>

Sure, please do. Personally, I would like to see the
batched-charge/uncharge patches first, due to their impact on users
and potential gain.
 
> In this version, I dropped batched-charge/uncharge set.
> They includes something delicate and should not be discussed in this thread.
> The patches are organized as..
> 
> Clean up/ fix softlimit charge/uncharge under hierarchy.
> 1. softlimit uncharge fix
> 2. softlimit charge fix
> These 2 are not changed for 3 weeks.
> 
> Followings are new (no functional changes.)
> 3.  reorder layout in memcontrol.c
> 4.  memcg_charge_cancel.patch from Nishimura's one
> 5.  clean up for memcg's percpu statistics
> 6.  removing unsued macro
> 7.  rename "cont" to "cgroup"
> 8.  remove unused check in charge/uncharge
> 9.  clean up for memcg's perzone statistics
> 10. Add commentary.
> 
> Because my commentary is tend to be not very good, review
> for 10. is helpful ;)
> 
> I think this kind of fixes should be done while -mm queue is empty.
> Then, do this first.
> 
> 

-- 
	Balbir


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC][PATCH 7/10] memcg: replace cont with cgroup
  2009-09-25  8:17 [RFC][PATCH 0/10] memcg clean up and some fixes for softlimit (Sep25) KAMEZAWA Hiroyuki
                   ` (6 preceding siblings ...)
  2009-09-25  8:25 ` [RFC][PATCH 0/10] memcg clean up and some fixes for softlimit (Sep25) Balbir Singh
@ 2009-09-25  8:27 ` KAMEZAWA Hiroyuki
  2009-09-25  8:28 ` [RFC][PATCH 8/10] memcg: clean up charge/uncharge anon KAMEZAWA Hiroyuki
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-25  8:27 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 memcontrol.c uses "cont" to indicate a "cgroup" for historical reasons.
 It was called a "container" in its very early days.

 Now, memcontrol.c is a subsystem of cgroups. Rename from_cont() to
 from_cgroup(). This should help newcomers...they won't have to
 figure out what "cont" means or which variable name is preferred.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   94 ++++++++++++++++++++++++++++----------------------------
 1 file changed, 47 insertions(+), 47 deletions(-)

Index: temp-mmotm/mm/memcontrol.c
===================================================================
--- temp-mmotm.orig/mm/memcontrol.c
+++ temp-mmotm/mm/memcontrol.c
@@ -248,9 +248,9 @@ static inline bool mem_cgroup_is_root(st
 #define mem_cgroup_from_res_counter(counter, member)	\
 	container_of(counter, struct mem_cgroup, member)
 
-static struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
+static struct mem_cgroup *mem_cgroup_from_cgroup(struct cgroup *cgroup)
 {
-	return container_of(cgroup_subsys_state(cont,
+	return container_of(cgroup_subsys_state(cgroup,
 				mem_cgroup_subsys_id), struct mem_cgroup,
 				css);
 }
@@ -987,7 +987,7 @@ unsigned long mem_cgroup_isolate_pages(u
 					struct list_head *dst,
 					unsigned long *scanned, int order,
 					int mode, struct zone *z,
-					struct mem_cgroup *mem_cont,
+					struct mem_cgroup *mem,
 					int active, int file)
 {
 	unsigned long nr_taken = 0;
@@ -1002,8 +1002,8 @@ unsigned long mem_cgroup_isolate_pages(u
 	int lru = LRU_FILE * file + active;
 	int ret;
 
-	BUG_ON(!mem_cont);
-	mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
+	BUG_ON(!mem);
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
 	src = &mz->lists[lru];
 
 	scan = 0;
@@ -2178,7 +2178,7 @@ static int mem_cgroup_move_parent(struct
 		return -EINVAL;
 
 
-	parent = mem_cgroup_from_cont(pcg);
+	parent = mem_cgroup_from_cgroup(pcg);
 
 
 	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page);
@@ -2219,7 +2219,7 @@ static DEFINE_MUTEX(set_limit_mutex);
 
 static u64 mem_cgroup_swappiness_read(struct cgroup *cgrp, struct cftype *cft)
 {
-	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	struct mem_cgroup *memcg = mem_cgroup_from_cgroup(cgrp);
 
 	return get_swappiness(memcg);
 }
@@ -2227,7 +2227,7 @@ static u64 mem_cgroup_swappiness_read(st
 static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
 				       u64 val)
 {
-	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	struct mem_cgroup *memcg = mem_cgroup_from_cgroup(cgrp);
 	struct mem_cgroup *parent;
 
 	if (val > 100)
@@ -2236,7 +2236,7 @@ static int mem_cgroup_swappiness_write(s
 	if (cgrp->parent == NULL)
 		return -EINVAL;
 
-	parent = mem_cgroup_from_cont(cgrp->parent);
+	parent = mem_cgroup_from_cgroup(cgrp->parent);
 
 	cgroup_lock();
 
@@ -2512,27 +2512,27 @@ try_to_free:
 	goto out;
 }
 
-int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event)
+int mem_cgroup_force_empty_write(struct cgroup *cgroup, unsigned int event)
 {
-	return mem_cgroup_force_empty(mem_cgroup_from_cont(cont), true);
+	return mem_cgroup_force_empty(mem_cgroup_from_cgroup(cgroup), true);
 }
 
 
-static u64 mem_cgroup_hierarchy_read(struct cgroup *cont, struct cftype *cft)
+static u64 mem_cgroup_hierarchy_read(struct cgroup *cgroup, struct cftype *cft)
 {
-	return mem_cgroup_from_cont(cont)->use_hierarchy;
+	return mem_cgroup_from_cgroup(cgroup)->use_hierarchy;
 }
 
-static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
+static int mem_cgroup_hierarchy_write(struct cgroup *cgroup, struct cftype *cft,
 					u64 val)
 {
 	int retval = 0;
-	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
-	struct cgroup *parent = cont->parent;
+	struct mem_cgroup *mem = mem_cgroup_from_cgroup(cgroup);
+	struct cgroup *parent = cgroup->parent;
 	struct mem_cgroup *parent_mem = NULL;
 
 	if (parent)
-		parent_mem = mem_cgroup_from_cont(parent);
+		parent_mem = mem_cgroup_from_cgroup(parent);
 
 	cgroup_lock();
 	/*
@@ -2545,7 +2545,7 @@ static int mem_cgroup_hierarchy_write(st
 	 */
 	if ((!parent_mem || !parent_mem->use_hierarchy) &&
 				(val == 1 || val == 0)) {
-		if (list_empty(&cont->children))
+		if (list_empty(&cgroup->children))
 			mem->use_hierarchy = val;
 		else
 			retval = -EBUSY;
@@ -2579,9 +2579,9 @@ mem_cgroup_get_recursive_idx_stat(struct
 	*val = d.val;
 }
 
-static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
+static u64 mem_cgroup_read(struct cgroup *cgroup, struct cftype *cft)
 {
-	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+	struct mem_cgroup *mem = mem_cgroup_from_cgroup(cgroup);
 	u64 idx_val, val;
 	int type, name;
 
@@ -2624,10 +2624,10 @@ static u64 mem_cgroup_read(struct cgroup
  * The user of this function is...
  * RES_LIMIT.
  */
-static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
+static int mem_cgroup_write(struct cgroup *cgroup, struct cftype *cft,
 			    const char *buffer)
 {
-	struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+	struct mem_cgroup *memcg = mem_cgroup_from_cgroup(cgroup);
 	int type, name;
 	unsigned long long val;
 	int ret;
@@ -2684,7 +2684,7 @@ static void memcg_get_hierarchical_limit
 
 	while (cgroup->parent) {
 		cgroup = cgroup->parent;
-		memcg = mem_cgroup_from_cont(cgroup);
+		memcg = mem_cgroup_from_cgroup(cgroup);
 		if (!memcg->use_hierarchy)
 			break;
 		tmp = res_counter_read_u64(&memcg->res, RES_LIMIT);
@@ -2698,12 +2698,12 @@ out:
 	return;
 }
 
-static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
+static int mem_cgroup_reset(struct cgroup *cgroup, unsigned int event)
 {
 	struct mem_cgroup *mem;
 	int type, name;
 
-	mem = mem_cgroup_from_cont(cont);
+	mem = mem_cgroup_from_cgroup(cgroup);
 	type = MEMFILE_TYPE(event);
 	name = MEMFILE_ATTR(event);
 	switch (name) {
@@ -2805,15 +2805,15 @@ mem_cgroup_get_total_stat(struct mem_cgr
 	mem_cgroup_walk_tree(mem, s, mem_cgroup_get_local_stat);
 }
 
-static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft,
+static int mem_control_stat_show(struct cgroup *cgroup, struct cftype *cft,
 				 struct cgroup_map_cb *cb)
 {
-	struct mem_cgroup *mem_cont = mem_cgroup_from_cont(cont);
+	struct mem_cgroup *mem = mem_cgroup_from_cgroup(cgroup);
 	struct mcs_total_stat mystat;
 	int i;
 
 	memset(&mystat, 0, sizeof(mystat));
-	mem_cgroup_get_local_stat(mem_cont, &mystat);
+	mem_cgroup_get_local_stat(mem, &mystat);
 
 	for (i = 0; i < NR_MCS_STAT; i++) {
 		if (i == MCS_SWAP && !do_swap_account)
@@ -2824,14 +2824,14 @@ static int mem_control_stat_show(struct 
 	/* Hierarchical information */
 	{
 		unsigned long long limit, memsw_limit;
-		memcg_get_hierarchical_limit(mem_cont, &limit, &memsw_limit);
+		memcg_get_hierarchical_limit(mem, &limit, &memsw_limit);
 		cb->fill(cb, "hierarchical_memory_limit", limit);
 		if (do_swap_account)
 			cb->fill(cb, "hierarchical_memsw_limit", memsw_limit);
 	}
 
 	memset(&mystat, 0, sizeof(mystat));
-	mem_cgroup_get_total_stat(mem_cont, &mystat);
+	mem_cgroup_get_total_stat(mem, &mystat);
 	for (i = 0; i < NR_MCS_STAT; i++) {
 		if (i == MCS_SWAP && !do_swap_account)
 			continue;
@@ -2839,7 +2839,7 @@ static int mem_control_stat_show(struct 
 	}
 
 #ifdef CONFIG_DEBUG_VM
-	cb->fill(cb, "inactive_ratio", calc_inactive_ratio(mem_cont, NULL));
+	cb->fill(cb, "inactive_ratio", calc_inactive_ratio(mem, NULL));
 
 	{
 		int nid, zid;
@@ -2849,7 +2849,7 @@ static int mem_control_stat_show(struct 
 
 		for_each_online_node(nid)
 			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
-				mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
+				mz = mem_cgroup_zoneinfo(mem, nid, zid);
 
 				recent_rotated[0] +=
 					mz->reclaim_stat.recent_rotated[0];
@@ -2953,8 +2953,8 @@ static struct cftype memsw_cgroup_files[
  * Moving tasks.
  */
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
-				struct cgroup *cont,
-				struct cgroup *old_cont,
+				struct cgroup *cgroup,
+				struct cgroup *old_cgroup,
 				struct task_struct *p,
 				bool threadgroup)
 {
@@ -2970,15 +2970,15 @@ static void mem_cgroup_move_task(struct 
  * memcg creation and destruction.
  */
 
-static int register_memsw_files(struct cgroup *cont, struct cgroup_subsys *ss)
+static int register_memsw_files(struct cgroup *cgroup, struct cgroup_subsys *ss)
 {
 	if (!do_swap_account)
 		return 0;
-	return cgroup_add_files(cont, ss, memsw_cgroup_files,
+	return cgroup_add_files(cgroup, ss, memsw_cgroup_files,
 				ARRAY_SIZE(memsw_cgroup_files));
 };
 #else
-static int register_memsw_files(struct cgroup *cont, struct cgroup_subsys *ss)
+static int register_memsw_files(struct cgroup *cgroup, struct cgroup_subsys *ss)
 {
 	return 0;
 }
@@ -3134,7 +3134,7 @@ static int mem_cgroup_soft_limit_tree_in
 }
 
 static struct cgroup_subsys_state * __ref
-mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
+mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgroup)
 {
 	struct mem_cgroup *mem, *parent;
 	long error = -ENOMEM;
@@ -3149,7 +3149,7 @@ mem_cgroup_create(struct cgroup_subsys *
 			goto free_out;
 
 	/* root ? */
-	if (cont->parent == NULL) {
+	if (cgroup->parent == NULL) {
 		enable_swap_cgroup();
 		parent = NULL;
 		root_mem_cgroup = mem;
@@ -3157,7 +3157,7 @@ mem_cgroup_create(struct cgroup_subsys *
 			goto free_out;
 
 	} else {
-		parent = mem_cgroup_from_cont(cont->parent);
+		parent = mem_cgroup_from_cgroup(cgroup->parent);
 		mem->use_hierarchy = parent->use_hierarchy;
 	}
 
@@ -3189,31 +3189,31 @@ free_out:
 }
 
 static int mem_cgroup_pre_destroy(struct cgroup_subsys *ss,
-					struct cgroup *cont)
+					struct cgroup *cgroup)
 {
-	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+	struct mem_cgroup *mem = mem_cgroup_from_cgroup(cgroup);
 
 	return mem_cgroup_force_empty(mem, false);
 }
 
 static void mem_cgroup_destroy(struct cgroup_subsys *ss,
-				struct cgroup *cont)
+				struct cgroup *cgroup)
 {
-	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+	struct mem_cgroup *mem = mem_cgroup_from_cgroup(cgroup);
 
 	mem_cgroup_put(mem);
 }
 
 static int mem_cgroup_populate(struct cgroup_subsys *ss,
-				struct cgroup *cont)
+				struct cgroup *cgroup)
 {
 	int ret;
 
-	ret = cgroup_add_files(cont, ss, mem_cgroup_files,
+	ret = cgroup_add_files(cgroup, ss, mem_cgroup_files,
 				ARRAY_SIZE(mem_cgroup_files));
 
 	if (!ret)
-		ret = register_memsw_files(cont, ss);
+		ret = register_memsw_files(cgroup, ss);
 	return ret;
 }
 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC][PATCH 8/10] memcg: clean up charge/uncharge anon
  2009-09-25  8:17 [RFC][PATCH 0/10] memcg clean up and some fixes for softlimit (Sep25) KAMEZAWA Hiroyuki
                   ` (7 preceding siblings ...)
  2009-09-25  8:27 ` [RFC][PATCH 7/10] memcg: replace cont with cgroup KAMEZAWA Hiroyuki
@ 2009-09-25  8:28 ` KAMEZAWA Hiroyuki
  2009-09-29  0:24   ` Daisuke Nishimura
  2009-09-25  8:29 ` [RFC][PATCH 9/10] memcg: clean up perzone stat KAMEZAWA Hiroyuki
  2009-09-25  8:30 ` [RFC][PATCH 10/10] memcg: add commentary KAMEZAWA Hiroyuki
  10 siblings, 1 reply; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-25  8:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura

This may need careful review.

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

In the old code, this function was used for purposes other than
charging new anon pages. But now, this function has been renamed and
is used only for new pages.

For the same kind of reason, mem_cgroup_uncharge_page() should use VM_BUG_ON().
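
For reference, a rough sketch of the situation the new VM_BUG_ON() assumes:
the page comes straight from the allocator in the fault path, so it cannot
be mapped and has no mapping yet (paraphrased, not an exact quote of the
callers):

	/* e.g. the anonymous fault path, roughly: */
	page = alloc_zeroed_user_highpage_movable(vma, address);
	if (!page)
		goto oom;
	/* brand-new anon page: page_mapped() is false, page->mapping is NULL */
	if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
		goto oom_free_page;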

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

Index: temp-mmotm/mm/memcontrol.c
===================================================================
--- temp-mmotm.orig/mm/memcontrol.c
+++ temp-mmotm/mm/memcontrol.c
@@ -1638,15 +1638,8 @@ int mem_cgroup_newpage_charge(struct pag
 		return 0;
 	if (PageCompound(page))
 		return 0;
-	/*
-	 * If already mapped, we don't have to account.
-	 * If page cache, page->mapping has address_space.
-	 * But page->mapping may have out-of-use anon_vma pointer,
-	 * detecit it by PageAnon() check. newly-mapped-anon's page->mapping
-	 * is NULL.
-  	 */
-	if (page_mapped(page) || (page->mapping && !PageAnon(page)))
-		return 0;
+	/* This function is "newpage_charge" and called right after alloc */
+	VM_BUG_ON(page_mapped(page) || (page->mapping && !PageAnon(page)));
 	if (unlikely(!mm))
 		mm = &init_mm;
 	return mem_cgroup_charge_common(page, mm, gfp_mask,
@@ -1901,11 +1894,11 @@ unlock_out:
 
 void mem_cgroup_uncharge_page(struct page *page)
 {
-	/* early check. */
-	if (page_mapped(page))
-		return;
-	if (page->mapping && !PageAnon(page))
-		return;
+	/*
+	 * Called when an anonymous page's page->mapcount goes down to zero,
+	 * or to cancel a charge gotten by newpage_charge().
+	 */
+	VM_BUG_ON(page_mapped(page) || (page->mapping && !PageAnon(page)));
 	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_MAPPED);
 }
 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC][PATCH 9/10] memcg: clean up perzone stat
  2009-09-25  8:17 [RFC][PATCH 0/10] memcg clean up and some fixes for softlimit (Sep25) KAMEZAWA Hiroyuki
                   ` (8 preceding siblings ...)
  2009-09-25  8:28 ` [RFC][PATCH 8/10] memcg: clean up charge/uncharge anon KAMEZAWA Hiroyuki
@ 2009-09-25  8:29 ` KAMEZAWA Hiroyuki
  2009-09-25  8:30 ` [RFC][PATCH 10/10] memcg: add commentary KAMEZAWA Hiroyuki
  10 siblings, 0 replies; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-25  8:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura

This patch renames
  mem_cgroup_get_local_zonestat() to be
  mem_cgroup_get_lru_stat().

That "local" means "don't consider hierarchy", but it is ambiguous.
We all know the LRU is private to each memcg, so this name is better.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 mm/memcontrol.c |   20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

Index: temp-mmotm/mm/memcontrol.c
===================================================================
--- temp-mmotm.orig/mm/memcontrol.c
+++ temp-mmotm/mm/memcontrol.c
@@ -488,7 +488,7 @@ unsigned long mem_cgroup_zone_nr_pages(s
 }
 
 
-static unsigned long mem_cgroup_get_local_zonestat(struct mem_cgroup *mem,
+static unsigned long mem_cgroup_get_lru_stat(struct mem_cgroup *mem,
 					enum lru_list idx)
 {
 	int nid, zid;
@@ -885,8 +885,8 @@ static int calc_inactive_ratio(struct me
 	unsigned long gb;
 	unsigned long inactive_ratio;
 
-	inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_ANON);
-	active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_ANON);
+	inactive = mem_cgroup_get_lru_stat(memcg, LRU_INACTIVE_ANON);
+	active = mem_cgroup_get_lru_stat(memcg, LRU_ACTIVE_ANON);
 
 	gb = (inactive + active) >> (30 - PAGE_SHIFT);
 	if (gb)
@@ -925,8 +925,8 @@ int mem_cgroup_inactive_file_is_low(stru
 	unsigned long active;
 	unsigned long inactive;
 
-	inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
-	active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
+	inactive = mem_cgroup_get_lru_stat(memcg, LRU_INACTIVE_FILE);
+	active = mem_cgroup_get_lru_stat(memcg, LRU_ACTIVE_FILE);
 
 	return (active > inactive);
 }
@@ -2779,15 +2779,15 @@ static int mem_cgroup_get_local_stat(str
 	}
 
 	/* per zone stat */
-	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
+	val = mem_cgroup_get_lru_stat(mem, LRU_INACTIVE_ANON);
 	s->stat[MCS_INACTIVE_ANON] += val * PAGE_SIZE;
-	val = mem_cgroup_get_local_zonestat(mem, LRU_ACTIVE_ANON);
+	val = mem_cgroup_get_lru_stat(mem, LRU_ACTIVE_ANON);
 	s->stat[MCS_ACTIVE_ANON] += val * PAGE_SIZE;
-	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_FILE);
+	val = mem_cgroup_get_lru_stat(mem, LRU_INACTIVE_FILE);
 	s->stat[MCS_INACTIVE_FILE] += val * PAGE_SIZE;
-	val = mem_cgroup_get_local_zonestat(mem, LRU_ACTIVE_FILE);
+	val = mem_cgroup_get_lru_stat(mem, LRU_ACTIVE_FILE);
 	s->stat[MCS_ACTIVE_FILE] += val * PAGE_SIZE;
-	val = mem_cgroup_get_local_zonestat(mem, LRU_UNEVICTABLE);
+	val = mem_cgroup_get_lru_stat(mem, LRU_UNEVICTABLE);
 	s->stat[MCS_UNEVICTABLE] += val * PAGE_SIZE;
 	return 0;
 }


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC][PATCH 10/10] memcg: add commentary
  2009-09-25  8:17 [RFC][PATCH 0/10] memcg clean up and some fixes for softlimit (Sep25) KAMEZAWA Hiroyuki
                   ` (9 preceding siblings ...)
  2009-09-25  8:29 ` [RFC][PATCH 9/10] memcg: clean up perzone stat KAMEZAWA Hiroyuki
@ 2009-09-25  8:30 ` KAMEZAWA Hiroyuki
  2009-09-30  2:21   ` Daisuke Nishimura
  10 siblings, 1 reply; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-25  8:30 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, balbir, nishimura


From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Adds commentary to memcontrol.c

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 mm/memcontrol.c |  105 +++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 86 insertions(+), 19 deletions(-)

Index: temp-mmotm/mm/memcontrol.c
===================================================================
--- temp-mmotm.orig/mm/memcontrol.c
+++ temp-mmotm/mm/memcontrol.c
@@ -47,7 +47,10 @@ struct cgroup_subsys mem_cgroup_subsys _
 struct mem_cgroup *root_mem_cgroup __read_mostly;
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
-/* Turned on only when memory cgroup is enabled && really_do_swap_account = 1 */
+/*
+ * If this is true, we do mem+swap accounting. (see mem->memsw handling)
+ * Turned on only when memory cgroup is enabled && really_do_swap_account = 1
+ */
 int do_swap_account __read_mostly;
 static int really_do_swap_account __initdata = 1; /* for remember boot option*/
 #else
@@ -88,12 +91,13 @@ struct mem_cgroup_stat {
  */
 struct mem_cgroup_per_zone {
 	/*
-	 * spin_lock to protect the per cgroup LRU
+	 * zone->lru_lock protects the per cgroup LRU
 	 */
 	struct list_head	lists[NR_LRU_LISTS];
 	unsigned long		count[NR_LRU_LISTS];
-
+	/* reclaim_stat is used by vmscan.c. struct zone has this, too */
 	struct zone_reclaim_stat reclaim_stat;
+	/* For Softlimit handling */
 	struct rb_node		tree_node;	/* RB tree node */
 	unsigned long long	usage_in_excess;/* Set to the value by which */
 						/* the soft limit is exceeded*/
@@ -165,20 +169,32 @@ struct mem_cgroup {
 	spinlock_t reclaim_param_lock;
 
 	int	prev_priority;	/* for recording reclaim priority */
+	unsigned int	swappiness; /* for memory reclaim calculation */
 
 	/*
+	 * Should the accounting and control be hierarchical, per subtree?
+	 */
+	bool use_hierarchy;
+	/*
 	 * While reclaiming in a hiearchy, we cache the last child we
 	 * reclaimed from.
 	 */
 	int last_scanned_child;
 	/*
-	 * Should the accounting and control be hierarchical, per subtree?
+	 * Because what we handle is permanent objects, we may have references
+	 * even after all tasks are gone. Even then, rmdir() of the cgroup
+	 * should be allowed to some extent. We have a private refcnt
+	 * (from swap_cgroup etc..) other than css_get/put for such cases.
 	 */
-	bool use_hierarchy;
-	unsigned long	last_oom_jiffies;
 	atomic_t	refcnt;
+	/*
+	 * When oom_kill is invoked from page fault path,
+	 * oom_kill.c::pagefault_out_of_memory() is called...but, in memcg,
+	 * we already calls oom_killer by ourselves. This jiffies is used
+	 * for avoiding calling OOM-kill twice. (see oom_kill.c also)
+	 */
+	unsigned long	last_oom_jiffies;
 
-	unsigned int	swappiness;
 
 	/* set when res.limit == memsw.limit */
 	bool		memsw_is_minimum;
@@ -204,7 +220,10 @@ enum charge_type {
 	NR_CHARGE_TYPE,
 };
 
-/* for encoding cft->private value on file */
+/*
+ * Because we have 2 types of similar controls, memory and memory.memsw, we
+ * use some encoding macro for cft->private value to share codes between them.
+ */
 #define _MEM			(0)
 #define _MEMSWAP		(1)
 #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
@@ -1038,7 +1057,16 @@ unsigned long mem_cgroup_isolate_pages(u
 	return nr_taken;
 }
 
-
+/**
+ * task_in_mem_cgroup: check whether a task is under a mem_cgroup's control
+ * @task: task struct to be checked.
+ * @mem: mem_cgroup to be checked
+ *
+ * Returns non-zero if the task is under control of the mem_cgroup.
+ *
+ * This function checks whether a task is under a mem_cgroup. Because we do
+ * hierarchical accounting, we have to check all ancestors.
+ */
 
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
 {
@@ -1060,12 +1088,6 @@ int task_in_mem_cgroup(struct task_struc
 	return ret;
 }
 
-static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
-{
-	int *val = data;
-	(*val)++;
-	return 0;
-}
 
 /**
  * mem_cgroup_print_mem_info: Called from OOM with tasklist_lock held in read mode.
@@ -1134,6 +1156,12 @@ done:
 		res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
 }
 
+static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
+{
+	int *val = data;
+	(*val)++;
+	return 0;
+}
 /*
  * This function returns the number of memcg under hierarchy tree. Returns
  * 1(self count) if no children.
@@ -1144,6 +1172,13 @@ static int mem_cgroup_count_children(str
  	mem_cgroup_walk_tree(mem, &num, mem_cgroup_count_children_cb);
 	return num;
 }
+
+/**
+ * mem_cgroup_oom_called - check whether oom-kill was called recently under memcg
+ * @mem: mem_cgroup to be checked.
+ *
+ * Returns true if oom-kill was invoked in this memcg recently.
+ */
 bool mem_cgroup_oom_called(struct task_struct *task)
 {
 	bool ret = false;
@@ -1314,6 +1349,16 @@ static int mem_cgroup_hierarchical_recla
 	return total;
 }
 
+/*
+ * This function is called by kswapd before entering per-zone memory reclaim.
+ * This selects a victim mem_cgroup from soft-limit tree and memory will be
+ * reclaimed from that.
+ *
+ * Soft-limit tree is sorted by the extent how many mem_cgroup's memoyr usage
+ * excess the soft limit and a memory cgroup which has the largest excess
+ * s selected as a victim. This Soft-limit tree is maintained perzone and
+ * we never select a memcg which has no memory usage on this zone.
+ */
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask, int nid,
 						int zid)
@@ -1407,7 +1452,11 @@ unsigned long mem_cgroup_soft_limit_recl
 
 /*
  * Unlike exported interface, "oom" parameter is added. if oom==true,
- * oom-killer can be invoked.
+ * oom-killer can be invoked. The caller should set oom=false if it's ok
+ * to return -ENOMEM, for example, when resizing the limit of a memcg.
+ *
+ * In the usual case, *memcg is NULL because we charge against the current
+ * task. If we charge against another memcg, *memcg should be filled in.
  */
 static int __mem_cgroup_try_charge(struct mm_struct *mm,
 			gfp_t gfp_mask, struct mem_cgroup **memcg,
@@ -1418,7 +1467,7 @@ static int __mem_cgroup_try_charge(struc
 	struct res_counter *fail_res;
 
 	if (unlikely(test_thread_flag(TIF_MEMDIE))) {
-		/* Don't account this! */
+		/* This task is being killed by the OOM killer. Don't account this! */
 		*memcg = NULL;
 		return 0;
 	}
@@ -1447,8 +1496,10 @@ static int __mem_cgroup_try_charge(struc
 
 		if (mem_cgroup_is_root(mem))
 			goto done;
+		/* Check memory usage hits limit */
 		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
 		if (likely(!ret)) {
+			/* Memory is OK. Then, how about Mem+Swap? */
 			if (!do_swap_account)
 				break;
 			ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
@@ -1486,9 +1537,12 @@ static int __mem_cgroup_try_charge(struc
 
 		if (!nr_retries--) {
 			if (oom) {
+				/* Call OOM-Killer */
 				mutex_lock(&memcg_tasklist);
-				mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
+				mem_cgroup_out_of_memory(mem_over_limit,
+							gfp_mask);
 				mutex_unlock(&memcg_tasklist);
+				/* Record we called OOM-Killer */
 				record_last_oom(mem_over_limit);
 			}
 			goto nomem;
@@ -1534,14 +1588,21 @@ static struct mem_cgroup *try_get_mem_cg
 
 	if (!PageSwapCache(page))
 		return NULL;
-
+	/*
+	 * When charging SwapCache, there are usually 2 cases:
+	 * (1) a newly swapped-in page is mapped.
+	 * (2) an existing swap cache page is mapped again.
+	 * In case (2), we need to check PageCgroupUsed().
+	 */
 	pc = lookup_page_cgroup(page);
 	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc)) {
+		/* Already accounted SwapCache is mapped */
 		mem = pc->mem_cgroup;
 		if (mem && !css_tryget(&mem->css))
 			mem = NULL;
 	} else {
+		/* Maybe a new page swapped-in */
 		ent.val = page_private(page);
 		id = lookup_swap_cgroup(ent);
 		rcu_read_lock();
@@ -1569,6 +1630,7 @@ static void __mem_cgroup_commit_charge(s
 
 	lock_page_cgroup(pc);
 	if (unlikely(PageCgroupUsed(pc))) {
+		/* This means we charged this page twice. cancel ours */
 		unlock_page_cgroup(pc);
 		mem_cgroup_cancel_charge(mem);
 		return;
@@ -1757,6 +1819,11 @@ __mem_cgroup_commit_charge_swapin(struct
 		return;
 	cgroup_exclude_rmdir(&ptr->css);
 	pc = lookup_page_cgroup(page);
+	/*
+ 	 * We may overwrite pc->memcgoup in commit_charge(). But SwapCache
+ 	 * can be on LRU before we reach here. Remove it from LRU for avoiding
+ 	 * confliction.
+ 	 */
 	mem_cgroup_lru_del_before_commit_swapcache(page);
 	__mem_cgroup_commit_charge(ptr, pc, ctype);
 	mem_cgroup_lru_add_after_commit_swapcache(page);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][PATCH 8/10] memcg: clean up charge/uncharge anon
  2009-09-25  8:28 ` [RFC][PATCH 8/10] memcg: clean up charge/uncharge anon KAMEZAWA Hiroyuki
@ 2009-09-29  0:24   ` Daisuke Nishimura
  2009-09-29  1:26     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 20+ messages in thread
From: Daisuke Nishimura @ 2009-09-29  0:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, balbir

On Fri, 25 Sep 2009 17:28:50 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> This may need careful review.
> 
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> In old codes, this function was used for other purposes rather
> than charginc new anon pages. But now, this function is (ranamed) and
> used only for new pages.
> 
> For the same kind of reason, ucharge_page() should use VM_BUG_ON().
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/memcontrol.c |   27 +++++++++++++--------------
>  1 file changed, 13 insertions(+), 14 deletions(-)
> 
> Index: temp-mmotm/mm/memcontrol.c
> ===================================================================
> --- temp-mmotm.orig/mm/memcontrol.c
> +++ temp-mmotm/mm/memcontrol.c
> @@ -1638,15 +1638,8 @@ int mem_cgroup_newpage_charge(struct pag
>  		return 0;
>  	if (PageCompound(page))
>  		return 0;
> -	/*
> -	 * If already mapped, we don't have to account.
> -	 * If page cache, page->mapping has address_space.
> -	 * But page->mapping may have out-of-use anon_vma pointer,
> -	 * detecit it by PageAnon() check. newly-mapped-anon's page->mapping
> -	 * is NULL.
> -  	 */
> -	if (page_mapped(page) || (page->mapping && !PageAnon(page)))
> -		return 0;
> +	/* This function is "newpage_charge" and called right after alloc */
> +	VM_BUG_ON(page_mapped(page) || (page->mapping && !PageAnon(page)));
>  	if (unlikely(!mm))
>  		mm = &init_mm;
>  	return mem_cgroup_charge_common(page, mm, gfp_mask,
I think this VM_BUG_ON() is valid.

> @@ -1901,11 +1894,11 @@ unlock_out:
>  
>  void mem_cgroup_uncharge_page(struct page *page)
>  {
> -	/* early check. */
> -	if (page_mapped(page))
> -		return;
> -	if (page->mapping && !PageAnon(page))
> -		return;
> +	/*
> + 	 * Called when anonymous page's page->mapcount goes down to zero,
> + 	 * or cancel a charge gotten by newpage_charge().
> +	 */
> +	VM_BUG_ON(page_mapped(page) || (page->mapping && !PageAnon(page)));
>  	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_MAPPED);
>  }
>  
> 
I hit this VM_BUG_ON() when I'm testing mmotm-2009-09-25-14-35 + these patches
+ my latest recharge-at-task-move patches(not posted yet).

[ 2966.031815] kernel BUG at mm/memcontrol.c:2014!
[ 2966.031815] invalid opcode: 0000 [#1] SMP
[ 2966.031815] last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:09:00.2/0000:0b:
04.1/irq
[ 2966.031815] CPU 2
[ 2966.031815] Modules linked in: autofs4 lockd sunrpc iscsi_tcp libiscsi_tcp libiscsi scs
i_transport_iscsi dm_mirror dm_multipath video output lp sg ide_cd_mod cdrom serio_raw but
ton parport_pc parport e1000 i2c_i801 i2c_core pata_acpi ata_generic pcspkr dm_region_hash
 dm_log dm_mod ata_piix libata shpchp megaraid_mbox megaraid_mm sd_mod scsi_mod ext3 jbd u
hci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
[ 2966.031815] Pid: 19677, comm: mmapstress10 Not tainted 2.6.31-mmotm-2009-09-25-14-35-35
ace741 #1 Express5800/140Rd-4 [N8100-1065]
[ 2966.031815] RIP: 0010:[<ffffffff800efbe2>]  [<ffffffff800efbe2>] mem_cgroup_uncharge_pa
ge+0x18/0x28
[ 2966.031815] RSP: 0000:ffff880381713da8  EFLAGS: 00010246
[ 2966.031815] RAX: 0000000000000000 RBX: ffffea001668fef0 RCX: ffffea001763afe0
[ 2966.031815] RDX: 0000000000000080 RSI: 0000000000000008 RDI: ffffea001668fef0
[ 2966.031815] RBP: ffff880381713da8 R08: ffffea001763afe0 R09: ffff88039dfafa28
[ 2966.031815] R10: ffff880381617b80 R11: 0000000000000000 R12: ffffea001668fef0
[ 2966.031815] R13: ffffea001668fef0 R14: 0000000000000008 R15: ffff880381617b80
[ 2966.031815] FS:  00007f9a617fa6e0(0000) GS:ffff88002a000000(0000) knlGS:000000000000000
0
[ 2966.031815] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2966.031815] CR2: 000000385021c4b8 CR3: 0000000372107000 CR4: 00000000000006e0
[ 2966.031815] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2966.031815] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2966.031815] Process mmapstress10 (pid: 19677, threadinfo ffff880381712000, task ffff880
3817c2920)
[ 2966.031815] Stack:
[ 2966.031815]  ffff880381713dc8 ffffffff800d6bff 00000003992ec067 00000003992ec067
[ 2966.031815] <0> ffff880381713e58 ffffffff800ce2f3 ffff880381713e18 ffff88038a90c408
[ 2966.031815] <0> 000000385021c4b8 ffff88039dfaf6c0 ffffea000168fef0 ffffea0016c91508
[ 2966.031815] Call Trace:
[ 2966.031815]  [<ffffffff800d6bff>] page_remove_rmap+0x28/0x44
[ 2966.031815]  [<ffffffff800ce2f3>] do_wp_page+0x626/0x73c
[ 2966.031815]  [<ffffffff8005e9e3>] ? __wake_up_bit+0x2c/0x2e
[ 2966.031815]  [<ffffffff800cfbfc>] handle_mm_fault+0x712/0x824
[ 2966.031815]  [<ffffffff8034e6d7>] do_page_fault+0x255/0x2e5
[ 2966.031815]  [<ffffffff8034c59f>] page_fault+0x1f/0x30
[ 2966.031815] Code: 83 7f 18 00 74 04 0f 0b eb fe 31 f6 e8 75 fe ff ff c9 c3 8b 47 0c 55
48 89 e5 85 c0 79 0d 48 8b 47 18 48 85 c0 74 08 a8 01 75 04 <0f> 0b eb fe be 01 00 00 00 e
8 4d fe ff ff c9 c3 55 65 48 8b 04
[ 2966.031815] RIP  [<ffffffff800efbe2>] mem_cgroup_uncharge_page+0x18/0x28
[ 2966.031815]  RSP <ffff880381713da8>

I don't think my patch is the culprit, because this also happens even if I don't
set "recharge_at_immigrate".

It might be better for me to test w/o my patches, but IIUC, this can happen
in a scenario like the following:

Assume processes A and B have the same swap pte.

          process A                       process B
  do_swap_page()                    do_swap_page()
    read_swap_cache_async()
    lock_page()                       lookup_swap_cache()
    page_add_anon_rmap()
    unlock_page()
                                      lock_page()
    do_wp_page()
      page_remove_rmap()
        atomic_add_negative()
          -> page->_mapcount = -1     page_add_anon_rmap()
                                        atomic_inc_and_test()
                                          -> page->_mapcount = 0
        mem_cgroup_uncharge_page()
          -> hit the VM_BUG_ON()


So, I think this should be something like:

void mem_cgroup_uncharge_page(struct page *page)
{
	/*
	 * Called when an anonymous page's page->mapcount goes down to zero,
	 * or to cancel a charge gotten by newpage_charge().
	 * But there is a small race between page_remove_rmap() and
	 * page_add_anon_rmap(), so we can reach here with page_mapped().
	 */
	VM_BUG_ON(page->mapping && !PageAnon(page));
	if (unlikely(page_mapped(page)))
		return;
	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_MAPPED);
}



Thanks,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][PATCH 8/10] memcg: clean up charge/uncharge anon
  2009-09-29  0:24   ` Daisuke Nishimura
@ 2009-09-29  1:26     ` KAMEZAWA Hiroyuki
  2009-09-29  2:18       ` Daisuke Nishimura
  0 siblings, 1 reply; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-29  1:26 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, balbir


Thank you for testing.

On Tue, 29 Sep 2009 09:24:13 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Fri, 25 Sep 2009 17:28:50 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > This may need careful review.
> > 
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > In old codes, this function was used for other purposes rather
> > than charginc new anon pages. But now, this function is (ranamed) and
> > used only for new pages.
> > 
> > For the same kind of reason, ucharge_page() should use VM_BUG_ON().
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  mm/memcontrol.c |   27 +++++++++++++--------------
> >  1 file changed, 13 insertions(+), 14 deletions(-)
> > 
> > Index: temp-mmotm/mm/memcontrol.c
> > ===================================================================
> > --- temp-mmotm.orig/mm/memcontrol.c
> > +++ temp-mmotm/mm/memcontrol.c
> > @@ -1638,15 +1638,8 @@ int mem_cgroup_newpage_charge(struct pag
> >  		return 0;
> >  	if (PageCompound(page))
> >  		return 0;
> > -	/*
> > -	 * If already mapped, we don't have to account.
> > -	 * If page cache, page->mapping has address_space.
> > -	 * But page->mapping may have out-of-use anon_vma pointer,
> > -	 * detecit it by PageAnon() check. newly-mapped-anon's page->mapping
> > -	 * is NULL.
> > -  	 */
> > -	if (page_mapped(page) || (page->mapping && !PageAnon(page)))
> > -		return 0;
> > +	/* This function is "newpage_charge" and called right after alloc */
> > +	VM_BUG_ON(page_mapped(page) || (page->mapping && !PageAnon(page)));
> >  	if (unlikely(!mm))
> >  		mm = &init_mm;
> >  	return mem_cgroup_charge_common(page, mm, gfp_mask,
> I think this VM_BUG_ON() is vaild.
> 
> > @@ -1901,11 +1894,11 @@ unlock_out:
> >  
> >  void mem_cgroup_uncharge_page(struct page *page)
> >  {
> > -	/* early check. */
> > -	if (page_mapped(page))
> > -		return;
> > -	if (page->mapping && !PageAnon(page))
> > -		return;
> > +	/*
> > + 	 * Called when anonymous page's page->mapcount goes down to zero,
> > + 	 * or cancel a charge gotten by newpage_charge().
> > +	 */
> > +	VM_BUG_ON(page_mapped(page) || (page->mapping && !PageAnon(page)));
> >  	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> >  }
> >  
> > 
> I hit this VM_BUG_ON() when I'm testing mmotm-2009-09-25-14-35 + these patches
> + my latest recharge-at-task-move patches(not posted yet).
> 

Ouch.


> [ 2966.031815] kernel BUG at mm/memcontrol.c:2014!
> [ 2966.031815] invalid opcode: 0000 [#1] SMP
> [ 2966.031815] last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:09:00.2/0000:0b:
> 04.1/irq
> [ 2966.031815] CPU 2
> [ 2966.031815] Modules linked in: autofs4 lockd sunrpc iscsi_tcp libiscsi_tcp libiscsi scs
> i_transport_iscsi dm_mirror dm_multipath video output lp sg ide_cd_mod cdrom serio_raw but
> ton parport_pc parport e1000 i2c_i801 i2c_core pata_acpi ata_generic pcspkr dm_region_hash
>  dm_log dm_mod ata_piix libata shpchp megaraid_mbox megaraid_mm sd_mod scsi_mod ext3 jbd u
> hci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
> [ 2966.031815] Pid: 19677, comm: mmapstress10 Not tainted 2.6.31-mmotm-2009-09-25-14-35-35
> ace741 #1 Express5800/140Rd-4 [N8100-1065]
> [ 2966.031815] RIP: 0010:[<ffffffff800efbe2>]  [<ffffffff800efbe2>] mem_cgroup_uncharge_pa
> ge+0x18/0x28
> [ 2966.031815] RSP: 0000:ffff880381713da8  EFLAGS: 00010246
> [ 2966.031815] RAX: 0000000000000000 RBX: ffffea001668fef0 RCX: ffffea001763afe0
> [ 2966.031815] RDX: 0000000000000080 RSI: 0000000000000008 RDI: ffffea001668fef0
> [ 2966.031815] RBP: ffff880381713da8 R08: ffffea001763afe0 R09: ffff88039dfafa28
> [ 2966.031815] R10: ffff880381617b80 R11: 0000000000000000 R12: ffffea001668fef0
> [ 2966.031815] R13: ffffea001668fef0 R14: 0000000000000008 R15: ffff880381617b80
> [ 2966.031815] FS:  00007f9a617fa6e0(0000) GS:ffff88002a000000(0000) knlGS:000000000000000
> 0
> [ 2966.031815] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 2966.031815] CR2: 000000385021c4b8 CR3: 0000000372107000 CR4: 00000000000006e0
> [ 2966.031815] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 2966.031815] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 2966.031815] Process mmapstress10 (pid: 19677, threadinfo ffff880381712000, task ffff880
> 3817c2920)
> [ 2966.031815] Stack:
> [ 2966.031815]  ffff880381713dc8 ffffffff800d6bff 00000003992ec067 00000003992ec067
> [ 2966.031815] <0> ffff880381713e58 ffffffff800ce2f3 ffff880381713e18 ffff88038a90c408
> [ 2966.031815] <0> 000000385021c4b8 ffff88039dfaf6c0 ffffea000168fef0 ffffea0016c91508
> [ 2966.031815] Call Trace:
> [ 2966.031815]  [<ffffffff800d6bff>] page_remove_rmap+0x28/0x44
> [ 2966.031815]  [<ffffffff800ce2f3>] do_wp_page+0x626/0x73c
> [ 2966.031815]  [<ffffffff8005e9e3>] ? __wake_up_bit+0x2c/0x2e
> [ 2966.031815]  [<ffffffff800cfbfc>] handle_mm_fault+0x712/0x824
> [ 2966.031815]  [<ffffffff8034e6d7>] do_page_fault+0x255/0x2e5
> [ 2966.031815]  [<ffffffff8034c59f>] page_fault+0x1f/0x30
> [ 2966.031815] Code: 83 7f 18 00 74 04 0f 0b eb fe 31 f6 e8 75 fe ff ff c9 c3 8b 47 0c 55
> 48 89 e5 85 c0 79 0d 48 8b 47 18 48 85 c0 74 08 a8 01 75 04 <0f> 0b eb fe be 01 00 00 00 e
> 8 4d fe ff ff c9 c3 55 65 48 8b 04
> [ 2966.031815] RIP  [<ffffffff800efbe2>] mem_cgroup_uncharge_page+0x18/0x28
> [ 2966.031815]  RSP <ffff880381713da8>
> 
> I don't think my patch is the guilt because this also happens even if I don't
> set "recharge_at_immigrate".
> 
> It might be better that I test w/o my patches, but IIUC, this can happen
> in following scenario like:
> 
> Assume process A and B has the same swap pte.
> 
>           process A                       process B
>   do_swap_page()                    do_swap_page()
>     read_swap_cache_async()
>     lock_page()                       lookup_swap_cache()
>     page_add_anon_rmap()
>     unlock_page()
>                                       lock_page()
>     do_wp_page()
I think do_wp_page() does lock_page() here,
because the page is ANON.

>       page_remove_rmap()
>         atomic_add_negative()
>           -> page->_mapcount = -1     page_add_anon_rmap()
>                                         atomic_inc_and_test()
>                                           -> page->_mapcount = 0
>         mem_cgroup_uncharge_page()
>           -> hit the VM_BUG_ON()
> 
> 
> So, I think this should be like:
> 
> void mem_cgroup_uncharge_page(struct page *page)
> {
> 	/*
> 	 * Called when anonymous page's page->mapcount goes down to zero,
> 	 * or cancel a charge gotten by newpage_charge().
> 	 * But there is a small race between page_remove_rmap() and
> 	 * page_add_anon_rmap(), so we can reach here with page_mapped().
> 	 */
> 	VM_BUG_ON(page->mapping && !PageAnon(page))
>         if (unlikely(page_mapped(page))
> 		return;
> 	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> }
> 

From the backtrace, mem_cgroup_uncharge_page() is called via page_remove_rmap().
Then, page_mapped() should be false.

Hmm.
 VM_BUG_ON(page_mapped(page) || (page->mapping && !PageAnon(page)));

mem_cgroup_uncharge_page() is called by page_remove_rmap() only when
 !page_mapped() and PageAnon(page) == true.
Could you show me a disassembly of (the head lines of) mem_cgroup_uncharge_page()?

Maybe there is something I don't understand..
IIUC, when page_remove_rmap() is called by do_wp_page(),
there must be pte(s) which point to the page, and a pte is guarded by the
page table lock. So, I think page_mapcount() > 0 before calling page_remove_rmap(),
because there must be a valid pte, at least.

Can this scenario happen ?
==
    Thread A.                                      Thread B.

    do_wp_page()                                 do_swap_page()
       PageAnon(oldpage)                         
         lock_page()                             lock_page()=> wait.
         reuse = false.
         unlock_page()                           get lock.      
       do copy-on-write
       pte_same() == true
         page_remove_rmap(oldpage) (mapcount goes to -1)
                                                 page_set_anon_rmap() (new anon rmap again)
==
Then, oldpage's mapcount goes down to 0 and up to 1 immediately.

As for the charge itself, Thread B executes the swap charge sequence, so maybe
there is no problem. Hmm.

I'll apply your change. Thank you for the report.

Regards,
-Kame



> 
> 
> Thanks,
> Daisuke Nishimura.
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][PATCH 8/10] memcg: clean up charge/uncharge anon
  2009-09-29  1:26     ` KAMEZAWA Hiroyuki
@ 2009-09-29  2:18       ` Daisuke Nishimura
  2009-09-29  3:03         ` Daisuke Nishimura
  0 siblings, 1 reply; 20+ messages in thread
From: Daisuke Nishimura @ 2009-09-29  2:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, balbir

On Tue, 29 Sep 2009 10:26:53 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> Thank you for testing.
> 
> On Tue, 29 Sep 2009 09:24:13 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > On Fri, 25 Sep 2009 17:28:50 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > This may need careful review.
> > > 
> > > ==
> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > 
> > > In old codes, this function was used for other purposes rather
> > > than charginc new anon pages. But now, this function is (ranamed) and
> > > used only for new pages.
> > > 
> > > For the same kind of reason, ucharge_page() should use VM_BUG_ON().
> > > 
> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > ---
> > >  mm/memcontrol.c |   27 +++++++++++++--------------
> > >  1 file changed, 13 insertions(+), 14 deletions(-)
> > > 
> > > Index: temp-mmotm/mm/memcontrol.c
> > > ===================================================================
> > > --- temp-mmotm.orig/mm/memcontrol.c
> > > +++ temp-mmotm/mm/memcontrol.c
> > > @@ -1638,15 +1638,8 @@ int mem_cgroup_newpage_charge(struct pag
> > >  		return 0;
> > >  	if (PageCompound(page))
> > >  		return 0;
> > > -	/*
> > > -	 * If already mapped, we don't have to account.
> > > -	 * If page cache, page->mapping has address_space.
> > > -	 * But page->mapping may have out-of-use anon_vma pointer,
> > > -	 * detecit it by PageAnon() check. newly-mapped-anon's page->mapping
> > > -	 * is NULL.
> > > -  	 */
> > > -	if (page_mapped(page) || (page->mapping && !PageAnon(page)))
> > > -		return 0;
> > > +	/* This function is "newpage_charge" and called right after alloc */
> > > +	VM_BUG_ON(page_mapped(page) || (page->mapping && !PageAnon(page)));
> > >  	if (unlikely(!mm))
> > >  		mm = &init_mm;
> > >  	return mem_cgroup_charge_common(page, mm, gfp_mask,
> > I think this VM_BUG_ON() is vaild.
> > 
> > > @@ -1901,11 +1894,11 @@ unlock_out:
> > >  
> > >  void mem_cgroup_uncharge_page(struct page *page)
> > >  {
> > > -	/* early check. */
> > > -	if (page_mapped(page))
> > > -		return;
> > > -	if (page->mapping && !PageAnon(page))
> > > -		return;
> > > +	/*
> > > + 	 * Called when anonymous page's page->mapcount goes down to zero,
> > > + 	 * or cancel a charge gotten by newpage_charge().
> > > +	 */
> > > +	VM_BUG_ON(page_mapped(page) || (page->mapping && !PageAnon(page)));
> > >  	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> > >  }
> > >  
> > > 
> > I hit this VM_BUG_ON() when I'm testing mmotm-2009-09-25-14-35 + these patches
> > + my latest recharge-at-task-move patches(not posted yet).
> > 
> 
> Ouch.
> 
> 
> > [ 2966.031815] kernel BUG at mm/memcontrol.c:2014!
> > [ 2966.031815] invalid opcode: 0000 [#1] SMP
> > [ 2966.031815] last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:09:00.2/0000:0b:
> > 04.1/irq
> > [ 2966.031815] CPU 2
> > [ 2966.031815] Modules linked in: autofs4 lockd sunrpc iscsi_tcp libiscsi_tcp libiscsi scs
> > i_transport_iscsi dm_mirror dm_multipath video output lp sg ide_cd_mod cdrom serio_raw but
> > ton parport_pc parport e1000 i2c_i801 i2c_core pata_acpi ata_generic pcspkr dm_region_hash
> >  dm_log dm_mod ata_piix libata shpchp megaraid_mbox megaraid_mm sd_mod scsi_mod ext3 jbd u
> > hci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
> > [ 2966.031815] Pid: 19677, comm: mmapstress10 Not tainted 2.6.31-mmotm-2009-09-25-14-35-35
> > ace741 #1 Express5800/140Rd-4 [N8100-1065]
> > [ 2966.031815] RIP: 0010:[<ffffffff800efbe2>]  [<ffffffff800efbe2>] mem_cgroup_uncharge_pa
> > ge+0x18/0x28
> > [ 2966.031815] RSP: 0000:ffff880381713da8  EFLAGS: 00010246
> > [ 2966.031815] RAX: 0000000000000000 RBX: ffffea001668fef0 RCX: ffffea001763afe0
> > [ 2966.031815] RDX: 0000000000000080 RSI: 0000000000000008 RDI: ffffea001668fef0
> > [ 2966.031815] RBP: ffff880381713da8 R08: ffffea001763afe0 R09: ffff88039dfafa28
> > [ 2966.031815] R10: ffff880381617b80 R11: 0000000000000000 R12: ffffea001668fef0
> > [ 2966.031815] R13: ffffea001668fef0 R14: 0000000000000008 R15: ffff880381617b80
> > [ 2966.031815] FS:  00007f9a617fa6e0(0000) GS:ffff88002a000000(0000) knlGS:000000000000000
> > 0
> > [ 2966.031815] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > [ 2966.031815] CR2: 000000385021c4b8 CR3: 0000000372107000 CR4: 00000000000006e0
> > [ 2966.031815] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 2966.031815] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [ 2966.031815] Process mmapstress10 (pid: 19677, threadinfo ffff880381712000, task ffff880
> > 3817c2920)
> > [ 2966.031815] Stack:
> > [ 2966.031815]  ffff880381713dc8 ffffffff800d6bff 00000003992ec067 00000003992ec067
> > [ 2966.031815] <0> ffff880381713e58 ffffffff800ce2f3 ffff880381713e18 ffff88038a90c408
> > [ 2966.031815] <0> 000000385021c4b8 ffff88039dfaf6c0 ffffea000168fef0 ffffea0016c91508
> > [ 2966.031815] Call Trace:
> > [ 2966.031815]  [<ffffffff800d6bff>] page_remove_rmap+0x28/0x44
> > [ 2966.031815]  [<ffffffff800ce2f3>] do_wp_page+0x626/0x73c
> > [ 2966.031815]  [<ffffffff8005e9e3>] ? __wake_up_bit+0x2c/0x2e
> > [ 2966.031815]  [<ffffffff800cfbfc>] handle_mm_fault+0x712/0x824
> > [ 2966.031815]  [<ffffffff8034e6d7>] do_page_fault+0x255/0x2e5
> > [ 2966.031815]  [<ffffffff8034c59f>] page_fault+0x1f/0x30
> > [ 2966.031815] Code: 83 7f 18 00 74 04 0f 0b eb fe 31 f6 e8 75 fe ff ff c9 c3 8b 47 0c 55
> > 48 89 e5 85 c0 79 0d 48 8b 47 18 48 85 c0 74 08 a8 01 75 04 <0f> 0b eb fe be 01 00 00 00 e
> > 8 4d fe ff ff c9 c3 55 65 48 8b 04
> > [ 2966.031815] RIP  [<ffffffff800efbe2>] mem_cgroup_uncharge_page+0x18/0x28
> > [ 2966.031815]  RSP <ffff880381713da8>
> > 
> > I don't think my patch is the guilt because this also happens even if I don't
> > set "recharge_at_immigrate".
> > 
> > It might be better that I test w/o my patches, but IIUC, this can happen
> > in following scenario like:
> > 
> > Assume process A and B has the same swap pte.
> > 
> >           process A                       process B
> >   do_swap_page()                    do_swap_page()
> >     read_swap_cache_async()
> >     lock_page()                       lookup_swap_cache()
> >     page_add_anon_rmap()
> >     unlock_page()
> >                                       lock_page()
> >     do_wp_page()
> I think do_wp_page() do lock_page() here.
> Because the page is ANON.
> 
hmm, but it calls unlock_page() soon.
So, I think this can happen if lock_page() of B gets the lock after do_wp_page()
of A releases the lock.

> >       page_remove_rmap()
> >         atomic_add_negative()
> >           -> page->_mapcount = -1     page_add_anon_rmap()
> >                                         atomic_inc_and_test()
> >                                           -> page->_mapcount = 0
> >         mem_cgroup_uncharge_page()
> >           -> hit the VM_BUG_ON()
> > 
> > 
> > So, I think this should be like:
> > 
> > void mem_cgroup_uncharge_page(struct page *page)
> > {
> > 	/*
> > 	 * Called when anonymous page's page->mapcount goes down to zero,
> > 	 * or cancel a charge gotten by newpage_charge().
> > 	 * But there is a small race between page_remove_rmap() and
> > 	 * page_add_anon_rmap(), so we can reach here with page_mapped().
> > 	 */
> > 	VM_BUG_ON(page->mapping && !PageAnon(page))
> >         if (unlikely(page_mapped(page))
> > 		return;
> > 	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> > }
> > 
> 
> From backtrace, mem_cgroup_uncharge_page() is called via page_remove_rmap().
> Then, page_mapped() should be false.
> 
> Hmm.
>  VM_BUG_ON(page_mapped(page) || (page->mapping && !PageAnon(page)));
> 
> mem_cgroup_uncharge_page() is called by page_remove_rmap() only when
>  !page_mapped() and PageAnon(page) == true.
> Could you show me disassemble of (head lines of) mem_cgroup_uncharge_page() ?
> 
here.

===
(gdb) disas mem_cgroup_uncharge_page
Dump of assembler code for function mem_cgroup_uncharge_page:
0xffffffff800efbca <mem_cgroup_uncharge_page+0>:        mov    0xc(%rdi),%eax
0xffffffff800efbcd <mem_cgroup_uncharge_page+3>:        push   %rbp
0xffffffff800efbce <mem_cgroup_uncharge_page+4>:        mov    %rsp,%rbp
0xffffffff800efbd1 <mem_cgroup_uncharge_page+7>:        test   %eax,%eax
0xffffffff800efbd3 <mem_cgroup_uncharge_page+9>:        jns    0xffffffff800efbe2 <mem_cgroup_uncharge_page+24>
0xffffffff800efbd5 <mem_cgroup_uncharge_page+11>:       mov    0x18(%rdi),%rax
0xffffffff800efbd9 <mem_cgroup_uncharge_page+15>:       test   %rax,%rax
0xffffffff800efbdc <mem_cgroup_uncharge_page+18>:       je     0xffffffff800efbe6 <mem_cgroup_uncharge_page+28>
0xffffffff800efbde <mem_cgroup_uncharge_page+20>:       test   $0x1,%al
0xffffffff800efbe0 <mem_cgroup_uncharge_page+22>:       jne    0xffffffff800efbe6 <mem_cgroup_uncharge_page+28>
0xffffffff800efbe2 <mem_cgroup_uncharge_page+24>:       ud2a
0xffffffff800efbe4 <mem_cgroup_uncharge_page+26>:       jmp    0xffffffff800efbe4 <mem_cgroup_uncharge_page+26>
0xffffffff800efbe6 <mem_cgroup_uncharge_page+28>:       mov    $0x1,%esi
0xffffffff800efbeb <mem_cgroup_uncharge_page+33>:       callq  0xffffffff800efa3d <__mem_cgroup_uncharge_common>
0xffffffff800efbf0 <mem_cgroup_uncharge_page+38>:       leaveq
0xffffffff800efbf1 <mem_cgroup_uncharge_page+39>:       retq
End of assembler dump.
===

> Maybe there is something I don't understand..
> IIUC, when page_remove_rmap() is called by do_wp_page(),
> there must be pte(s) which points to the page and a pte is guarded by
> page table lock. So, I think page_mapcount() > 0 before calling page_remove_rmap()
> because there must be a valid pte, at least.
> 
> Can this scenario happen ?
I think so. I intended to mention this case :)
I'm sorry for my vague explanation.

> ==
>     Thread A.                                      Thread B.
> 
>     do_wp_page()                                 do_swap_page()
>        PageAnon(oldpage)                         
>          lock_page()                             lock_page()=> wait.
>          reuse = false.
>          unlock_page()                           get lock.      
>        do copy-on-write
>        pte_same() == true
>          page_remove_rmap(oldpage) (mapcount goes to -1)
>                                                  page_set_anon_rmap() (new anon rmap again)
> ==
> Then, oldpage's mapcount goes down to 0 and up to 1 immediately.
> 
> About charge itself, Thread B. executes swap's charge sequence and maybe
> no problem. Hmm.
> 
I think so too.


Thanks,
Daisuke Nishimura.

> I'll apply your change. Thank you for report.
> 
> Regards,
> -Kame
> 
> 
> 
> > 
> > 
> > Thanks,
> > Daisuke Nishimura.
> > 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> > 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][PATCH 8/10] memcg: clean up charge/uncharge anon
  2009-09-29  2:18       ` Daisuke Nishimura
@ 2009-09-29  3:03         ` Daisuke Nishimura
  2009-09-29  3:14           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 20+ messages in thread
From: Daisuke Nishimura @ 2009-09-29  3:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, balbir

Just to make sure.

> > Maybe there is something I don't understand..
> > IIUC, when page_remove_rmap() is called by do_wp_page(),
> > there must be pte(s) which points to the page and a pte is guarded by
> > page table lock. So, I think page_mapcount() > 0 before calling page_remove_rmap()
> > because there must be a valid pte, at least.
> > 
> > Can this scenario happen ?
> I think so. I intended to mention this case :)
> I'm sorry for my vague explanation.
> 
> > ==
> >     Thread A.                                      Thread B.
> > 
> >     do_wp_page()                                 do_swap_page()
> >        PageAnon(oldpage)                         
> >          lock_page()                             lock_page()=> wait.
> >          reuse = false.
> >          unlock_page()                           get lock.      
> >        do copy-on-write
> >        pte_same() == true
> >          page_remove_rmap(oldpage) (mapcount goes to -1)
> >                                                  page_set_anon_rmap() (new anon rmap again)
> > ==
> > Then, oldpage's mapcount goes down to 0 and up to 1 immediately.
> > 
I meant "process" not "thread".
I think this cannot happen in the case of threads, because these page_remove_rmap()
and page_set_anon_rmap() are called under the pte lock (they share the pte).


Thanks,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][PATCH 8/10] memcg: clean up charge/uncharge anon
  2009-09-29  3:03         ` Daisuke Nishimura
@ 2009-09-29  3:14           ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-29  3:14 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, balbir

On Tue, 29 Sep 2009 12:03:48 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> Just to make sure.
> 
> > > Maybe there is something I don't understand..
> > > IIUC, when page_remove_rmap() is called by do_wp_page(),
> > > there must be pte(s) which points to the page and a pte is guarded by
> > > page table lock. So, I think page_mapcount() > 0 before calling page_remove_rmap()
> > > because there must be a valid pte, at least.
> > > 
> > > Can this scenario happen ?
> > I think so. I intended to mention this case :)
> > I'm sorry for my vague explanation.
> > 
> > > ==
> > >     Thread A.                                      Thread B.
> > > 
> > >     do_wp_page()                                 do_swap_page()
> > >        PageAnon(oldpage)                         
> > >          lock_page()                             lock_page()=> wait.
> > >          reuse = false.
> > >          unlock_page()                           get lock.      
> > >        do copy-on-write
> > >        pte_same() == true
> > >          page_remove_rmap(oldpage) (mapcount goes to -1)
> > >                                                  page_set_anon_rmap() (new anon rmap again)
> > > ==
> > > Then, oldpage's mapcount goes down to 0 and up to 1 immediately.
> > > 
> I meant "process" not "thread".
Okay ;)

> I think this cannot happen in the case of threads, because these page_remove_rmap()
> and page_set_anon_rmap() are called under pte lock(they share the pte).
> 
Anyway, I'll fix this patch.
But Balbir asked me to post batched_charge/uncharge first, so this clean up series
will be postponed.

I think..

  1. post softlimit fixes.
  2. batched uncharge/charge
  3. post some fixes from this set.

I personally want to reorder all functions, but it makes the diff (between versions)
too big. So, I think I should avoid a big reorganization.
I'll go a moderate way.
Hmm.. but I will move the percpu/perzone functions below the definitions of the structs.

Thanks,
-Kame

> 
> Thanks,
> Daisuke Nishimura.
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][PATCH 10/10] memcg: add commentary
  2009-09-25  8:30 ` [RFC][PATCH 10/10] memcg: add commentary KAMEZAWA Hiroyuki
@ 2009-09-30  2:21   ` Daisuke Nishimura
  2009-09-30  2:41     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 20+ messages in thread
From: Daisuke Nishimura @ 2009-09-30  2:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, balbir

A few trivial comments and a question.

> @@ -1144,6 +1172,13 @@ static int mem_cgroup_count_children(str
>   	mem_cgroup_walk_tree(mem, &num, mem_cgroup_count_children_cb);
>  	return num;
>  }
> +
> +/**
> + * mem_cgroup_oon_called - check oom-kill is called recentlry under memcg
s/oon/oom/

> + * @mem: mem_cgroup to be checked.
> + *
> + * Returns true if oom-kill was invoked in this memcg recently.
> + */
>  bool mem_cgroup_oom_called(struct task_struct *task)
>  {
>  	bool ret = false;



> @@ -1314,6 +1349,16 @@ static int mem_cgroup_hierarchical_recla
>  	return total;
>  }
>  
> +/*
> + * This function is called by kswapd before entering per-zone memory reclaim.
> + * This selects a victim mem_cgroup from soft-limit tree and memory will be
> + * reclaimed from that.
> + *
> + * Soft-limit tree is sorted by the extent how many mem_cgroup's memoyr usage
> + * excess the soft limit and a memory cgroup which has the largest excess
> + * s selected as a victim. This Soft-limit tree is maintained perzone and
"is selected"
 ^

> + * we never select a memcg which has no memory usage on this zone.
> + */
I'm sorry if I misunderstand the softlimit implementation, but what prevents
a memcg which has no memory usage on this zone from being selected?
IIUC, mz->usage_in_excess has a value calculated from res_counter_soft_limit_excess(),
which doesn't take the zone into account but only calculates "usage - soft_limit".

>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask, int nid,
>  						int zid)



> @@ -1757,6 +1819,11 @@ __mem_cgroup_commit_charge_swapin(struct
>  		return;
>  	cgroup_exclude_rmdir(&ptr->css);
>  	pc = lookup_page_cgroup(page);
> +	/*
> + 	 * We may overwrite pc->memcgoup in commit_charge(). But SwapCache
should be "pc->mem_cgroup".

> + 	 * can be on LRU before we reach here. Remove it from LRU for avoiding
> + 	 * confliction.
> + 	 */
>  	mem_cgroup_lru_del_before_commit_swapcache(page);
>  	__mem_cgroup_commit_charge(ptr, pc, ctype);
>  	mem_cgroup_lru_add_after_commit_swapcache(page);
> 



Thanks,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][PATCH 10/10] memcg: add commentary
  2009-09-30  2:21   ` Daisuke Nishimura
@ 2009-09-30  2:41     ` KAMEZAWA Hiroyuki
  2009-09-30  4:36       ` Daisuke Nishimura
  0 siblings, 1 reply; 20+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-30  2:41 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, balbir

Thank you for review.

On Wed, 30 Sep 2009 11:21:49 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> A few trivial comments and a question.
> 
> > @@ -1144,6 +1172,13 @@ static int mem_cgroup_count_children(str
> >   	mem_cgroup_walk_tree(mem, &num, mem_cgroup_count_children_cb);
> >  	return num;
> >  }
> > +
> > +/**
> > + * mem_cgroup_oon_called - check oom-kill is called recentlry under memcg
> s/oon/oom/
> 
yes.

> > + * @mem: mem_cgroup to be checked.
> > + *
> > + * Returns true if oom-kill was invoked in this memcg recently.
> > + */
> >  bool mem_cgroup_oom_called(struct task_struct *task)
> >  {
> >  	bool ret = false;
> 
> 
> 
> > @@ -1314,6 +1349,16 @@ static int mem_cgroup_hierarchical_recla
> >  	return total;
> >  }
> >  
> > +/*
> > + * This function is called by kswapd before entering per-zone memory reclaim.
> > + * This selects a victim mem_cgroup from soft-limit tree and memory will be
> > + * reclaimed from that.
> > + *
> > + * Soft-limit tree is sorted by the extent how many mem_cgroup's memoyr usage
> > + * excess the soft limit and a memory cgroup which has the largest excess
> > + * s selected as a victim. This Soft-limit tree is maintained perzone and
> "is selected"
>  ^
> 
will fix.


> > + * we never select a memcg which has no memory usage on this zone.
> > + */
> I'm sorry if I misunderstand about softlimit implementation, what prevents
> a memcg which has no memory usage on this zone from being selected ?
> IIUC, mz->usage_in_excess has a value calculated from res_counter_soft_limit_excess(),
> which doesn't take account of zone but only calculates "usage - soft_limit".
> 
right. But the point is that if a memcg has _no_ pages in the zone, the memcg is
not on the RB-tree. So, hmm, how about this?
==
Because this soft-limit tree is maintained per zone, if memcg has little usage on
this zone, we can expect such memcg won't be found on this per-zone RB-tree.
==

I wonder whether there is something that should be improved in this tree management.
Maybe we should add some per-zone check around here.
==
>                 __mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
>                 excess = res_counter_soft_limit_excess(&mz->mem->res);
>                 /*
>                  * One school of thought says that we should not add
>                  * back the node to the tree if reclaim returns 0.
>                  * But our reclaim could return 0, simply because due
>                  * to priority we are exposing a smaller subset of
>                  * memory to reclaim from. Consider this as a longer
>                  * term TODO.
>                  */
>                 /* If excess == 0, no tree ops */
>                 __mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess);
>                 spin_unlock(&mctz->lock);
==
Its cost will not be high.
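
For illustration, a minimal sketch of such a per-zone check (not part of the
posted patches; the mem_cgroup_zone_usage() helper below is hypothetical and
just sums the existing MEM_CGROUP_ZSTAT per-zone LRU counters) could be:
==
/*
 * Hypothetical helper: how many LRU pages this memcg currently has
 * on one zone, taken from the per-zone counters we already maintain.
 */
static unsigned long mem_cgroup_zone_usage(struct mem_cgroup_per_zone *mz)
{
	unsigned long total = 0;
	enum lru_list l;

	for_each_lru(l)
		total += MEM_CGROUP_ZSTAT(mz, l);
	return total;
}

		/* ... in mem_cgroup_soft_limit_reclaim(), in the hunk quoted above ... */
		__mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
		excess = res_counter_soft_limit_excess(&mz->mem->res);
		/*
		 * Re-insert this memcg into the per-zone tree only when it
		 * still has global excess and some usage left on this zone.
		 */
		if (excess && mem_cgroup_zone_usage(mz))
			__mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess);
		spin_unlock(&mctz->lock);
==
It reuses the calls from the hunk quoted above and only changes when re-insertion
happens; the counters it reads are updated anyway, so the extra cost stays at
O(NR_LRU_LISTS) per tree update.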

Thanks,
-Kame



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][PATCH 10/10] memcg: add commentary
  2009-09-30  2:41     ` KAMEZAWA Hiroyuki
@ 2009-09-30  4:36       ` Daisuke Nishimura
  0 siblings, 0 replies; 20+ messages in thread
From: Daisuke Nishimura @ 2009-09-30  4:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, balbir

> > > + * we never select a memcg which has no memory usage on this zone.
> > > + */
> > I'm sorry if I misunderstand about softlimit implementation, what prevents
> > a memcg which has no memory usage on this zone from being selected ?
> > IIUC, mz->usage_in_excess has a value calculated from res_counter_soft_limit_excess(),
> > which doesn't take account of zone but only calculates "usage - soft_limit".
> > 
> right. But the point is that if memcg has _no_ pages in the zone, memcg is
> not on RB-tree. So, Hmm, How about this ?
Thank you for your clarification.

> ==
> Because this soft-limit tree is maintained per zone, if memcg has little usage on
> this zone, we can expect such memcg won't be found on this per-zone RB-tree.
> ==
> 
I think "never" above is exaggeration a bit, but otherwise it looks good for me.

> I wonder there are something should be improved on this tree management.
I agree.
But I think it would be enough for now to leave it on the TODO list.


Thanks,
Daisuke Nishimura.

> Maybe we should add some per-zone check around here.
> ==
> >                 __mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
> >                 excess = res_counter_soft_limit_excess(&mz->mem->res);
> >                 /*
> >                  * One school of thought says that we should not add
> >                  * back the node to the tree if reclaim returns 0.
> >                  * But our reclaim could return 0, simply because due
> >                  * to priority we are exposing a smaller subset of
> >                  * memory to reclaim from. Consider this as a longer
> >                  * term TODO.
> >                  */
> >                 /* If excess == 0, no tree ops */
> >                 __mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess);
> >                 spin_unlock(&mctz->lock);
> ==
> Its cost will not be high.
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2009-09-30  4:39 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-09-25  8:17 [RFC][PATCH 0/10] memcg clean up and some fixes for softlimit (Sep25) KAMEZAWA Hiroyuki
2009-09-25  8:18 ` [RFC][PATCH 1/10] memcg : modifications for softlimit uncharge KAMEZAWA Hiroyuki
2009-09-25  8:20 ` [RFC][PATCH 2/10] memcg : clean up in softlimit charge KAMEZAWA Hiroyuki
2009-09-25  8:21 ` [RFC][PATCH 3/10] memcg: reorganize memcontrol.c KAMEZAWA Hiroyuki
2009-09-25  8:22 ` [RFC][PATCH 4/10] memcg: add memcg charge cancel KAMEZAWA Hiroyuki
2009-09-25  8:24 ` [RFC][PATCH 5/10] memcg: clean up percpu statistics access KAMEZAWA Hiroyuki
2009-09-25  8:25 ` [RFC][PATCH 6/10] memcg: remove unsued macros KAMEZAWA Hiroyuki
2009-09-25  8:25 ` [RFC][PATCH 0/10] memcg clean up and some fixes for softlimit (Sep25) Balbir Singh
2009-09-25  8:27 ` [RFC][PATCH 7/10] memcg: replace cont with cgroup KAMEZAWA Hiroyuki
2009-09-25  8:28 ` [RFC][PATCH 8/10] memcg: clean up charge/uncharge anon KAMEZAWA Hiroyuki
2009-09-29  0:24   ` Daisuke Nishimura
2009-09-29  1:26     ` KAMEZAWA Hiroyuki
2009-09-29  2:18       ` Daisuke Nishimura
2009-09-29  3:03         ` Daisuke Nishimura
2009-09-29  3:14           ` KAMEZAWA Hiroyuki
2009-09-25  8:29 ` [RFC][PATCH 9/10] memcg: clean up perzone stat KAMEZAWA Hiroyuki
2009-09-25  8:30 ` [RFC][PATCH 10/10] memcg: add commentary KAMEZAWA Hiroyuki
2009-09-30  2:21   ` Daisuke Nishimura
2009-09-30  2:41     ` KAMEZAWA Hiroyuki
2009-09-30  4:36       ` Daisuke Nishimura
