* [PATCH -mmotm 0/8] memcg: recharge at task move
@ 2009-11-06  5:10 Daisuke Nishimura
  2009-11-06  5:11 ` [PATCH -mmotm 1/8] cgroup: introduce cancel_attach() Daisuke Nishimura
                   ` (10 more replies)
  0 siblings, 11 replies; 39+ messages in thread
From: Daisuke Nishimura @ 2009-11-06  5:10 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan,
	Paul Menage, Daisuke Nishimura

Hi.

In the current memcg, charges associated with a task aren't moved to the new cgroup
when the task moves. These patches add that feature, that is, recharging to the
new cgroup and, of course, uncharging from the old cgroup at task move.

The current version supports only recharging of non-shared (mapcount == 1) anonymous
pages and the swap entries of those pages. I think that's enough as a first step.

[1/8] cgroup: introduce cancel_attach()
[2/8] memcg: move memcg_tasklist mutex
[3/8] memcg: add mem_cgroup_cancel_charge()
[4/8] memcg: cleanup mem_cgroup_move_parent()
[5/8] memcg: add interface to recharge at task move
[6/8] memcg: recharge charges of anonymous page
[7/8] memcg: avoid oom during recharge at task move
[8/8] memcg: recharge charges of anonymous swap

2 is dependent on 1 and 4 is dependent on 3.
3 and 4 are just cleanups.
5-8 are the body of this feature.

Major Changes from Oct13:
- removed "[RFC]".
- rebased on mmotm-2009-11-01-10-01.
- dropped support for file cache and shmem/tmpfs (will revisit in the future).
- Updated Documentation/cgroups/memory.txt.

TODO:
- add support for file cache, shmem/tmpfs, and shared (mapcount > 1) pages.
- implement madvise(2) to let users decide the target vma for recharge.

Any comments or suggestions would be welcome.


Thanks,
Daisuke Nishimura.


* [PATCH -mmotm 1/8] cgroup: introduce cancel_attach()
  2009-11-06  5:10 [PATCH -mmotm 0/8] memcg: recharge at task move Daisuke Nishimura
@ 2009-11-06  5:11 ` Daisuke Nishimura
  2009-11-09  6:57   ` Balbir Singh
  2009-11-09  7:23   ` Li Zefan
  2009-11-06  5:11 ` [PATCH -mmotm 2/8] memcg: move memcg_tasklist mutex Daisuke Nishimura
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 39+ messages in thread
From: Daisuke Nishimura @ 2009-11-06  5:11 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan,
	Paul Menage, Daisuke Nishimura

This patch adds a cancel_attach() operation to struct cgroup_subsys.
cancel_attach() can be used when the can_attach() operation prepares something
for the subsystem, and we need to roll back what can_attach() has prepared if
attaching the task fails after can_attach() has succeeded.
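
As an illustration (not part of this patch), a hypothetical subsystem whose
can_attach() takes a lock or makes some other preparation could pair the two
callbacks roughly like this; patch 2/8 does exactly this for memcg_tasklist:

	/* Hypothetical subsystem, for illustration only. */
	static DEFINE_MUTEX(example_lock);

	static int example_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
				      struct task_struct *tsk, bool threadgroup)
	{
		mutex_lock(&example_lock);	/* prepare for the attach */
		return 0;
	}

	static void example_cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
					  struct task_struct *tsk, bool threadgroup)
	{
		mutex_unlock(&example_lock);	/* roll back what can_attach() prepared */
	}

	static void example_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
				   struct cgroup *old_cgrp, struct task_struct *tsk,
				   bool threadgroup)
	{
		mutex_unlock(&example_lock);	/* the commit path releases it too */
	}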

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 Documentation/cgroups/cgroups.txt |   13 +++++++++++-
 include/linux/cgroup.h            |    2 +
 kernel/cgroup.c                   |   38 ++++++++++++++++++++++++++++++------
 3 files changed, 45 insertions(+), 8 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 0b33bfe..c86947c 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -536,10 +536,21 @@ returns an error, this will abort the attach operation.  If a NULL
 task is passed, then a successful result indicates that *any*
 unspecified task can be moved into the cgroup. Note that this isn't
 called on a fork. If this method returns 0 (success) then this should
-remain valid while the caller holds cgroup_mutex. If threadgroup is
+remain valid while the caller holds cgroup_mutex and it is ensured that either
+attach() or cancel_attach() will be called afterwards. If threadgroup is
 true, then a successful result indicates that all threads in the given
 thread's threadgroup can be moved together.
 
+void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+	       struct task_struct *task, bool threadgroup)
+(cgroup_mutex held by caller)
+
+Called when a task attach operation has failed after can_attach() has succeeded.
+A subsystem whose can_attach() has some side-effects should provide this
+function, so that the subsystem can implement a rollback. If it has none, this
+is not necessary. This will be called only for subsystems whose can_attach()
+operation has succeeded.
+
 void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	    struct cgroup *old_cgrp, struct task_struct *task,
 	    bool threadgroup)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 0008dee..d4cc200 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -427,6 +427,8 @@ struct cgroup_subsys {
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			  struct task_struct *tsk, bool threadgroup);
+	void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
+			  struct task_struct *tsk, bool threadgroup);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			struct cgroup *old_cgrp, struct task_struct *tsk,
 			bool threadgroup);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0249f4b..e443742 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1539,7 +1539,7 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 	int retval = 0;
-	struct cgroup_subsys *ss;
+	struct cgroup_subsys *ss, *failed_ss = NULL;
 	struct cgroup *oldcgrp;
 	struct css_set *cg;
 	struct css_set *newcg;
@@ -1553,8 +1553,16 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	for_each_subsys(root, ss) {
 		if (ss->can_attach) {
 			retval = ss->can_attach(ss, cgrp, tsk, false);
-			if (retval)
-				return retval;
+			if (retval) {
+				/*
+				 * Remember at which subsystem we've failed in
+				 * can_attach(), so that we only call
+				 * cancel_attach() against the subsystems whose
+				 * can_attach() has succeeded (see below).
+				 */
+				failed_ss = ss;
+				goto out;
+			}
 		}
 	}
 
@@ -1568,14 +1576,17 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	 */
 	newcg = find_css_set(cg, cgrp);
 	put_css_set(cg);
-	if (!newcg)
-		return -ENOMEM;
+	if (!newcg) {
+		retval = -ENOMEM;
+		goto out;
+	}
 
 	task_lock(tsk);
 	if (tsk->flags & PF_EXITING) {
 		task_unlock(tsk);
 		put_css_set(newcg);
-		return -ESRCH;
+		retval = -ESRCH;
+		goto out;
 	}
 	rcu_assign_pointer(tsk->cgroups, newcg);
 	task_unlock(tsk);
@@ -1601,7 +1612,20 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	 * is no longer empty.
 	 */
 	cgroup_wakeup_rmdir_waiter(cgrp);
-	return 0;
+out:
+	if (retval)
+		for_each_subsys(root, ss) {
+			if (ss == failed_ss)
+				/*
+				 * This means can_attach() of this subsystem
+				 * has failed, so we don't need to call
+				 * cancel_attach() against the rest of the subsystems.
+				 */
+				break;
+			if (ss->cancel_attach)
+				ss->cancel_attach(ss, cgrp, tsk, false);
+		}
+	return retval;
 }
 
 /*
-- 
1.5.6.1


* [PATCH -mmotm 2/8] memcg: move memcg_tasklist mutex
  2009-11-06  5:10 [PATCH -mmotm 0/8] memcg: recharge at task move Daisuke Nishimura
  2009-11-06  5:11 ` [PATCH -mmotm 1/8] cgroup: introduce cancel_attach() Daisuke Nishimura
@ 2009-11-06  5:11 ` Daisuke Nishimura
  2009-11-06  5:54   ` KAMEZAWA Hiroyuki
  2009-11-10 19:14   ` Balbir Singh
  2009-11-06  5:12 ` [PATCH -mmotm 3/8] memcg: add mem_cgroup_cancel_charge() Daisuke Nishimura
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 39+ messages in thread
From: Daisuke Nishimura @ 2009-11-06  5:11 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan,
	Paul Menage, Daisuke Nishimura

memcg_tasklist was introduced to serialize mem_cgroup_out_of_memory() and
mem_cgroup_move_task() to ensure tasks cannot be moved to another cgroup
during select_bad_process().

task_in_mem_cgroup(), which can be called by select_bad_process(), will check
whether a task is in the mem_cgroup or not by dereferencing task->cgroups
->subsys[]. So, it would be desirable to change task->cgroups
(rcu_assign_pointer() in cgroup_attach_task() does it) with memcg_tasklist held.

Now that we can define cancel_attach(), we can safely release memcg_tasklist on
the failure path even though we take it in can_attach(). So let's move the
mutex_lock()/mutex_unlock() of memcg_tasklist accordingly.
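
The resulting ordering, sketched (memcg only; the other steps of
cgroup_attach_task() are unchanged):

	/*
	 * cgroup_attach_task()
	 *   -> mem_cgroup_can_attach()              : mutex_lock(&memcg_tasklist)
	 *   rcu_assign_pointer(tsk->cgroups, newcg) : task->cgroups changes with
	 *                                             memcg_tasklist held
	 *   -> mem_cgroup_move_task() (attach)      : mutex_unlock(&memcg_tasklist)
	 *
	 * and, if the attach fails after can_attach() has succeeded:
	 *   -> mem_cgroup_cancel_attach()           : mutex_unlock(&memcg_tasklist)
	 */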

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 mm/memcontrol.c |   22 ++++++++++++++++++++--
 1 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4bd3451..d3b2ac0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3395,18 +3395,34 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
 	return ret;
 }
 
+static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
+				struct cgroup *cgroup,
+				struct task_struct *p,
+				bool threadgroup)
+{
+	mutex_lock(&memcg_tasklist);
+	return 0;
+}
+
+static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
+				struct cgroup *cgroup,
+				struct task_struct *p,
+				bool threadgroup)
+{
+	mutex_unlock(&memcg_tasklist);
+}
+
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct cgroup *cont,
 				struct cgroup *old_cont,
 				struct task_struct *p,
 				bool threadgroup)
 {
-	mutex_lock(&memcg_tasklist);
+	mutex_unlock(&memcg_tasklist);
 	/*
 	 * FIXME: It's better to move charges of this process from old
 	 * memcg to new memcg. But it's just on TODO-List now.
 	 */
-	mutex_unlock(&memcg_tasklist);
 }
 
 struct cgroup_subsys mem_cgroup_subsys = {
@@ -3416,6 +3432,8 @@ struct cgroup_subsys mem_cgroup_subsys = {
 	.pre_destroy = mem_cgroup_pre_destroy,
 	.destroy = mem_cgroup_destroy,
 	.populate = mem_cgroup_populate,
+	.can_attach = mem_cgroup_can_attach,
+	.cancel_attach = mem_cgroup_cancel_attach,
 	.attach = mem_cgroup_move_task,
 	.early_init = 0,
 	.use_id = 1,
-- 
1.5.6.1


* [PATCH -mmotm 3/8] memcg: add mem_cgroup_cancel_charge()
  2009-11-06  5:10 [PATCH -mmotm 0/8] memcg: recharge at task move Daisuke Nishimura
  2009-11-06  5:11 ` [PATCH -mmotm 1/8] cgroup: introduce cancel_attach() Daisuke Nishimura
  2009-11-06  5:11 ` [PATCH -mmotm 2/8] memcg: move memcg_tasklist mutex Daisuke Nishimura
@ 2009-11-06  5:12 ` Daisuke Nishimura
  2009-11-06  5:13 ` [PATCH -mmotm 4/8] memcg: cleanup mem_cgroup_move_parent() Daisuke Nishimura
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Daisuke Nishimura @ 2009-11-06  5:12 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan,
	Paul Menage, Daisuke Nishimura

There are some places calling both res_counter_uncharge() and css_put()
to cancel the charge and the css refcount we have got from mem_cgroup_try_charge().

This patch introduces mem_cgroup_cancel_charge() and calls it in those places.
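
A rough sketch of the pattern the helper consolidates (the condition name is
illustrative only; the real call sites are in the diff below):

	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page);
	if (ret || !mem)
		return ret;
	...
	if (commit_failed)			/* e.g. pc already used, move failed */
		mem_cgroup_cancel_charge(mem);	/* res_counter_uncharge() + css_put() */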

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 mm/memcontrol.c |   38 ++++++++++++++++++--------------------
 1 files changed, 18 insertions(+), 20 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d3b2ac0..05e837c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1500,6 +1500,21 @@ nomem:
 }
 
 /*
+ * Sometimes we have to undo a charge we have got from try_charge().
+ * This function does that: it uncharges and puts the css refcount
+ * we have got from try_charge().
+ */
+static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
+{
+	if (!mem_cgroup_is_root(mem)) {
+		res_counter_uncharge(&mem->res, PAGE_SIZE);
+		if (do_swap_account)
+			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+	}
+	css_put(&mem->css);
+}
+
+/*
  * A helper function to get mem_cgroup from ID. must be called under
  * rcu_read_lock(). The caller must check css_is_removed() or some if
  * it's concern. (dropping refcnt from swap can be called against removed
@@ -1565,12 +1580,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 	lock_page_cgroup(pc);
 	if (unlikely(PageCgroupUsed(pc))) {
 		unlock_page_cgroup(pc);
-		if (!mem_cgroup_is_root(mem)) {
-			res_counter_uncharge(&mem->res, PAGE_SIZE);
-			if (do_swap_account)
-				res_counter_uncharge(&mem->memsw, PAGE_SIZE);
-		}
-		css_put(&mem->css);
+		mem_cgroup_cancel_charge(mem);
 		return;
 	}
 
@@ -1734,14 +1744,7 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc,
 cancel:
 	put_page(page);
 uncharge:
-	/* drop extra refcnt by try_charge() */
-	css_put(&parent->css);
-	/* uncharge if move fails */
-	if (!mem_cgroup_is_root(parent)) {
-		res_counter_uncharge(&parent->res, PAGE_SIZE);
-		if (do_swap_account)
-			res_counter_uncharge(&parent->memsw, PAGE_SIZE);
-	}
+	mem_cgroup_cancel_charge(parent);
 	return ret;
 }
 
@@ -1957,12 +1960,7 @@ void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
 		return;
 	if (!mem)
 		return;
-	if (!mem_cgroup_is_root(mem)) {
-		res_counter_uncharge(&mem->res, PAGE_SIZE);
-		if (do_swap_account)
-			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
-	}
-	css_put(&mem->css);
+	mem_cgroup_cancel_charge(mem);
 }
 
 static void
-- 
1.5.6.1


* [PATCH -mmotm 4/8] memcg: cleanup mem_cgroup_move_parent()
  2009-11-06  5:10 [PATCH -mmotm 0/8] memcg: recharge at task move Daisuke Nishimura
                   ` (2 preceding siblings ...)
  2009-11-06  5:12 ` [PATCH -mmotm 3/8] memcg: add mem_cgroup_cancel_charge() Daisuke Nishimura
@ 2009-11-06  5:13 ` Daisuke Nishimura
  2009-11-06  5:56   ` KAMEZAWA Hiroyuki
  2009-11-06  5:14 ` [PATCH -mmotm 5/8] memcg: add interface to recharge at task move Daisuke Nishimura
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Daisuke Nishimura @ 2009-11-06  5:13 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan,
	Paul Menage, Daisuke Nishimura

mem_cgroup_move_parent() calls try_charge first and cancel_charge on failure.
IMHO, charge/uncharge (especially charge) is a high-cost operation, so we should
avoid it as far as possible.

This patch tries to delay try_charge in mem_cgroup_move_parent() by re-ordering
the checks it does.

And this patch renames mem_cgroup_move_account() to __mem_cgroup_move_account(),
changes the return value of __mem_cgroup_move_account() from int to void,
and adds a new wrapper (mem_cgroup_move_account()), which checks whether @pc
is valid for moving the account and then calls __mem_cgroup_move_account().

This patch removes the last caller of trylock_page_cgroup(), so its definition
is removed too.
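
The caller-visible contract becomes simpler (a sketch; the caller still has to
pin the page and isolate it from the LRU beforehand):

	pc = lookup_page_cgroup(page);
	ret = mem_cgroup_move_account(pc, from, to);
	/*
	 * mem_cgroup_move_account() now takes lock_page_cgroup() itself and
	 * returns -EINVAL if pc is unused or not charged to "from"; there is
	 * no -EBUSY/trylock retry for callers to handle any more.
	 */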

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 include/linux/page_cgroup.h |    7 +---
 mm/memcontrol.c             |   84 ++++++++++++++++++-------------------------
 2 files changed, 37 insertions(+), 54 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 4b938d4..b0e4eb1 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -57,6 +57,8 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
 static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
 	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
 
+TESTPCGFLAG(Locked, LOCK)
+
 /* Cache flag is set only once (at allocation) */
 TESTPCGFLAG(Cache, CACHE)
 CLEARPCGFLAG(Cache, CACHE)
@@ -86,11 +88,6 @@ static inline void lock_page_cgroup(struct page_cgroup *pc)
 	bit_spin_lock(PCG_LOCK, &pc->flags);
 }
 
-static inline int trylock_page_cgroup(struct page_cgroup *pc)
-{
-	return bit_spin_trylock(PCG_LOCK, &pc->flags);
-}
-
 static inline void unlock_page_cgroup(struct page_cgroup *pc)
 {
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 05e837c..1ad3248 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1613,27 +1613,22 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 }
 
 /**
- * mem_cgroup_move_account - move account of the page
+ * __mem_cgroup_move_account - move account of the page
  * @pc:	page_cgroup of the page.
  * @from: mem_cgroup which the page is moved from.
  * @to:	mem_cgroup which the page is moved to. @from != @to.
  *
  * The caller must confirm following.
  * - page is not on LRU (isolate_page() is useful.)
- *
- * returns 0 at success,
- * returns -EBUSY when lock is busy or "pc" is unstable.
+ * - the pc is locked, used, and ->mem_cgroup points to @from.
  *
  * This function does "uncharge" from old cgroup but doesn't do "charge" to
  * new cgroup. It should be done by a caller.
  */
 
-static int mem_cgroup_move_account(struct page_cgroup *pc,
+static void __mem_cgroup_move_account(struct page_cgroup *pc,
 	struct mem_cgroup *from, struct mem_cgroup *to)
 {
-	struct mem_cgroup_per_zone *from_mz, *to_mz;
-	int nid, zid;
-	int ret = -EBUSY;
 	struct page *page;
 	int cpu;
 	struct mem_cgroup_stat *stat;
@@ -1641,20 +1636,9 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
 
 	VM_BUG_ON(from == to);
 	VM_BUG_ON(PageLRU(pc->page));
-
-	nid = page_cgroup_nid(pc);
-	zid = page_cgroup_zid(pc);
-	from_mz =  mem_cgroup_zoneinfo(from, nid, zid);
-	to_mz =  mem_cgroup_zoneinfo(to, nid, zid);
-
-	if (!trylock_page_cgroup(pc))
-		return ret;
-
-	if (!PageCgroupUsed(pc))
-		goto out;
-
-	if (pc->mem_cgroup != from)
-		goto out;
+	VM_BUG_ON(!PageCgroupLocked(pc));
+	VM_BUG_ON(!PageCgroupUsed(pc));
+	VM_BUG_ON(pc->mem_cgroup != from);
 
 	if (!mem_cgroup_is_root(from))
 		res_counter_uncharge(&from->res, PAGE_SIZE);
@@ -1683,15 +1667,28 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
 	css_get(&to->css);
 	pc->mem_cgroup = to;
 	mem_cgroup_charge_statistics(to, pc, true);
-	ret = 0;
-out:
-	unlock_page_cgroup(pc);
 	/*
 	 * We charges against "to" which may not have any tasks. Then, "to"
 	 * can be under rmdir(). But in current implementation, caller of
 	 * this function is just force_empty() and it's garanteed that
 	 * "to" is never removed. So, we don't check rmdir status here.
 	 */
+}
+
+/*
+ * check whether the @pc is valid for moving account and call
+ * __mem_cgroup_move_account()
+ */
+static int mem_cgroup_move_account(struct page_cgroup *pc,
+				struct mem_cgroup *from, struct mem_cgroup *to)
+{
+	int ret = -EINVAL;
+	lock_page_cgroup(pc);
+	if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
+		__mem_cgroup_move_account(pc, from, to);
+		ret = 0;
+	}
+	unlock_page_cgroup(pc);
 	return ret;
 }
 
@@ -1713,38 +1710,27 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc,
 	if (!pcg)
 		return -EINVAL;
 
+	ret = -EBUSY;
+	if (!get_page_unless_zero(page))
+		goto out;
+	if (isolate_lru_page(page))
+		goto put;
 
 	parent = mem_cgroup_from_cont(pcg);
-
-
 	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page);
 	if (ret || !parent)
-		return ret;
-
-	if (!get_page_unless_zero(page)) {
-		ret = -EBUSY;
-		goto uncharge;
-	}
-
-	ret = isolate_lru_page(page);
-
-	if (ret)
-		goto cancel;
+		goto put_back;
 
 	ret = mem_cgroup_move_account(pc, child, parent);
-
+	if (!ret)
+		css_put(&parent->css);	/* drop extra refcnt by try_charge() */
+	else
+		mem_cgroup_cancel_charge(parent);	/* does css_put */
+put_back:
 	putback_lru_page(page);
-	if (!ret) {
-		put_page(page);
-		/* drop extra refcnt by try_charge() */
-		css_put(&parent->css);
-		return 0;
-	}
-
-cancel:
+put:
 	put_page(page);
-uncharge:
-	mem_cgroup_cancel_charge(parent);
+out:
 	return ret;
 }
 
-- 
1.5.6.1


* [PATCH -mmotm 5/8] memcg: add interface to recharge at task move
  2009-11-06  5:10 [PATCH -mmotm 0/8] memcg: recharge at task move Daisuke Nishimura
                   ` (3 preceding siblings ...)
  2009-11-06  5:13 ` [PATCH -mmotm 4/8] memcg: cleanup mem_cgroup_move_parent() Daisuke Nishimura
@ 2009-11-06  5:14 ` Daisuke Nishimura
  2009-11-06  6:06   ` KAMEZAWA Hiroyuki
  2009-11-06  5:14 ` [PATCH -mmotm 6/8] memcg: recharge charges of anonymous page Daisuke Nishimura
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Daisuke Nishimura @ 2009-11-06  5:14 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan,
	Paul Menage, Daisuke Nishimura

In the current memcg, charges associated with a task aren't moved to the new cgroup
when the task moves. These patches add that feature, that is, recharging to the
new cgroup and, of course, uncharging from the old cgroup at task move.

This patch adds a "memory.recharge_at_immigrate" file, which is a flag file that
determines whether charges should be moved to the new cgroup at task move or
not, and the read/write handlers for the file.
This patch also adds no-op handlers for the feature itself. These handlers will
be implemented in later patches.
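
From userspace, enabling the flag on the destination cgroup is just a write to
the new file (a sketch; the mount point /cgroup/B is an assumption):

	/* Userspace sketch: enable recharge-at-immigrate on the destination group. */
	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/cgroup/B/memory.recharge_at_immigrate", "w");
		if (!f)
			return 1;
		fputs("1\n", f);	/* only 0 and 1 are accepted for now */
		return fclose(f) ? 1 : 0;
	}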

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 mm/memcontrol.c |   65 +++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 59 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1ad3248..afa1179 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -227,6 +227,12 @@ struct mem_cgroup {
 	bool		memsw_is_minimum;
 
 	/*
+	 * Should we recharge charges of a task when a task is moved into this
+	 * mem_cgroup ?
+	 */
+	bool	 	recharge_at_immigrate;
+
+	/*
 	 * statistics. This must be placed at the end of memcg.
 	 */
 	struct mem_cgroup_stat stat;
@@ -2863,6 +2869,30 @@ static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
 	return 0;
 }
 
+static u64 mem_cgroup_recharge_read(struct cgroup *cgrp,
+					struct cftype *cft)
+{
+	return mem_cgroup_from_cont(cgrp)->recharge_at_immigrate;
+}
+
+static int mem_cgroup_recharge_write(struct cgroup *cgrp,
+					struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+
+	if (val != 0 && val != 1)
+		return -EINVAL;
+	/*
+	 * We check this value both in can_attach() and attach(), so we need
+	 * cgroup lock to prevent this value from being inconsistent.
+	 */
+	cgroup_lock();
+	mem->recharge_at_immigrate = val;
+	cgroup_unlock();
+
+	return 0;
+}
+
 
 /* For read statistics */
 enum {
@@ -3096,6 +3126,11 @@ static struct cftype mem_cgroup_files[] = {
 		.read_u64 = mem_cgroup_swappiness_read,
 		.write_u64 = mem_cgroup_swappiness_write,
 	},
+	{
+		.name = "recharge_at_immigrate",
+		.read_u64 = mem_cgroup_recharge_read,
+		.write_u64 = mem_cgroup_recharge_write,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
@@ -3343,6 +3378,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	if (parent)
 		mem->swappiness = get_swappiness(parent);
 	atomic_set(&mem->refcnt, 1);
+	mem->recharge_at_immigrate = 0;
 	return &mem->css;
 free_out:
 	__mem_cgroup_free(mem);
@@ -3379,13 +3415,26 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
 	return ret;
 }
 
+/* Handlers for recharge at task move. */
+static int mem_cgroup_can_recharge(struct mem_cgroup *mem,
+					struct task_struct *p)
+{
+	return 0;
+}
+
 static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
 				struct task_struct *p,
 				bool threadgroup)
 {
-	mutex_lock(&memcg_tasklist);
-	return 0;
+	int ret = 0;
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
+
+	if (mem->recharge_at_immigrate && thread_group_leader(p))
+		ret = mem_cgroup_can_recharge(mem, p);
+	if (!ret)
+		mutex_lock(&memcg_tasklist);
+	return ret;
 }
 
 static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
@@ -3396,17 +3445,21 @@ static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
 	mutex_unlock(&memcg_tasklist);
 }
 
+static void mem_cgroup_recharge(void)
+{
+}
+
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct cgroup *cont,
 				struct cgroup *old_cont,
 				struct task_struct *p,
 				bool threadgroup)
 {
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+
 	mutex_unlock(&memcg_tasklist);
-	/*
-	 * FIXME: It's better to move charges of this process from old
-	 * memcg to new memcg. But it's just on TODO-List now.
-	 */
+	if (mem->recharge_at_immigrate && thread_group_leader(p))
+		mem_cgroup_recharge();
 }
 
 struct cgroup_subsys mem_cgroup_subsys = {
-- 
1.5.6.1


* [PATCH -mmotm 6/8] memcg: recharge charges of anonymous page
  2009-11-06  5:10 [PATCH -mmotm 0/8] memcg: recharge at task move Daisuke Nishimura
                   ` (4 preceding siblings ...)
  2009-11-06  5:14 ` [PATCH -mmotm 5/8] memcg: add interface to recharge at task move Daisuke Nishimura
@ 2009-11-06  5:14 ` Daisuke Nishimura
  2009-11-06  6:35   ` KAMEZAWA Hiroyuki
  2009-11-06  5:15 ` [PATCH -mmotm 7/8] memcg: avoid oom during recharge at task move Daisuke Nishimura
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Daisuke Nishimura @ 2009-11-06  5:14 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan,
	Paul Menage, Daisuke Nishimura

This patch is the core part of this recharge-at-task-move feature.
It implements functions to recharge charges of anonymous pages mapped only by
the target task.

Implementation (a rough sketch of the two-pass flow follows below):
- Define struct recharge_struct and a variable of it (recharge) to remember
  the count of pre-charges and other information.
- At can_attach(), walk the page tables of the target task, count the mapped
  pages that are charged to the source mem_cgroup, and call
  __mem_cgroup_try_charge() repeatedly, counting up recharge.precharge.
- At attach(), walk the page tables again, find the target pages as we did in
  can_attach(), and call mem_cgroup_move_account() on each of them.
- Cancel any remaining pre-charges (recharge.precharge > 0) on failure or at
  the end of the task move.
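
The two-pass flow, sketched:

	/*
	 * can_attach():   mem_cgroup_can_recharge()
	 *   -> walk the target task's page tables
	 *   -> for each target pte: mem_cgroup_recharge_do_precharge(), i.e. one
	 *      __mem_cgroup_try_charge() against "to", counted in recharge.precharge
	 *
	 * attach():       mem_cgroup_recharge()
	 *   -> walk the page tables again
	 *   -> for each target page: isolate_lru_page() +
	 *      mem_cgroup_move_account(pc, recharge.from, recharge.to),
	 *      consuming one precharge per page moved
	 *                 mem_cgroup_clear_recharge()
	 *   -> cancel whatever precharges are left
	 *
	 * cancel_attach(): mem_cgroup_clear_recharge() -> cancel all precharges
	 */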

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 Documentation/cgroups/memory.txt |   36 +++++-
 mm/memcontrol.c                  |  275 +++++++++++++++++++++++++++++++++++++-
 2 files changed, 306 insertions(+), 5 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index b871f25..54281ff 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -262,10 +262,12 @@ some of the pages cached in the cgroup (page cache pages).
 4.2 Task migration
 
 When a task migrates from one cgroup to another, it's charge is not
-carried forward. The pages allocated from the original cgroup still
+carried forward by default. The pages allocated from the original cgroup still
 remain charged to it, the charge is dropped when the page is freed or
 reclaimed.
 
+Note: You can move charges of a task along with task migration. See 8.
+
 4.3 Removing a cgroup
 
 A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
@@ -414,7 +416,37 @@ NOTE1: Soft limits take effect over a long period of time, since they involve
 NOTE2: It is recommended to set the soft limit always below the hard limit,
        otherwise the hard limit will take precedence.
 
-8. TODO
+8. Recharge at task move
+
+Users can move charges associated with a task along with task move, that is,
+uncharge from the old cgroup and charge to the new cgroup.
+
+8.1 Interface
+
+This feature is disabled by default. It can be enabled (and disabled again) by
+writing to memory.recharge_at_immigrate of the destination cgroup.
+
+If you want to enable it:
+
+# echo 1 > memory.recharge_at_immigrate
+
+Note: Values greater than 1 will be supported in the future. See 8.2.
+
+And if you want to disable it again:
+
+# echo 0 > memory.recharge_at_immigrate
+
+8.2 Types of charges which can be recharged
+
+We recharge a charge which meets the following conditions.
+
+a. It must be charged to the old cgroup.
+b. A charge of an anonymous page used by the target task. The page must be used
+   only by the target task.
+
+Note: More types of pages (e.g. file cache, shmem) will be supported in the future.
+
+9. TODO
 
 1. Add support for accounting huge pages (as a separate controller)
 2. Make per-cgroup scanner reclaim not-shared pages first
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index afa1179..f4b7116 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -21,6 +21,8 @@
 #include <linux/memcontrol.h>
 #include <linux/cgroup.h>
 #include <linux/mm.h>
+#include <linux/migrate.h>
+#include <linux/hugetlb.h>
 #include <linux/pagemap.h>
 #include <linux/smp.h>
 #include <linux/page-flags.h>
@@ -239,6 +241,18 @@ struct mem_cgroup {
 };
 
 /*
+ * A data structure and a variable for recharging charges at task move.
+ * "recharge" and its members are protected by cgroup_lock
+ */
+struct recharge_struct {
+	struct mem_cgroup *from;
+	struct mem_cgroup *to;
+	struct task_struct *target;	/* the target task being moved */
+	unsigned long precharge;
+};
+static struct recharge_struct recharge;
+
+/*
  * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
  * limit reclaim to prevent infinite loops, if they ever occur.
  */
@@ -1496,7 +1510,7 @@ charged:
 	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
 	 * if they exceeds softlimit.
 	 */
-	if (mem_cgroup_soft_limit_check(mem))
+	if (page && mem_cgroup_soft_limit_check(mem))
 		mem_cgroup_update_tree(mem, page);
 done:
 	return 0;
@@ -3416,10 +3430,170 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
 }
 
 /* Handlers for recharge at task move. */
+/**
+ * is_target_pte_for_recharge - check a pte whether it is valid for recharge
+ * @vma: the vma the pte to be checked belongs
+ * @addr: the address corresponding to the pte to be checked
+ * @ptent: the pte to be checked
+ * @target: the pointer the target page will be stored(can be NULL)
+ *
+ * Returns
+ *   0(RECHARGE_TARGET_NONE): if the pte is not a target for recharge.
+ *   1(RECHARGE_TARGET_PAGE): if the page corresponding to this pte is a target
+ *     for recharge. if @target is not NULL, the page is stored in target->page
+ *     with extra refcnt got(Callers should handle it).
+ *
+ * Called with pte lock held.
+ */
+/* We add a new member later. */
+union recharge_target {
+	struct page	*page;
+};
+
+/* We add a new type later. */
+enum recharge_target_type {
+	RECHARGE_TARGET_NONE,	/* not used */
+	RECHARGE_TARGET_PAGE,
+};
+
+static int is_target_pte_for_recharge(struct vm_area_struct *vma,
+		unsigned long addr, pte_t ptent, union recharge_target *target)
+{
+	struct page *page;
+	struct page_cgroup *pc;
+	int ret = 0;
+
+	if (!pte_present(ptent))
+		return 0;
+
+	page = vm_normal_page(vma, addr, ptent);
+	if (!page || !page_mapped(page))
+		return 0;
+	/* TODO: We don't recharge file(including shmem/tmpfs) pages for now. */
+	if (!PageAnon(page))
+		return 0;
+	/*
+	 * TODO: We don't recharge shared(used by multiple processes) pages
+	 * for now.
+	 */
+	if (page_mapcount(page) > 1)
+		return 0;
+	if (!get_page_unless_zero(page))
+		return 0;
+
+	pc = lookup_page_cgroup(page);
+	lock_page_cgroup(pc);
+	if (PageCgroupUsed(pc) && pc->mem_cgroup == recharge.from) {
+		ret = RECHARGE_TARGET_PAGE;
+		if (target)
+			target->page = page;
+	}
+	unlock_page_cgroup(pc);
+
+	if (!ret || !target)
+		put_page(page);
+
+	return ret;
+}
+
+static int mem_cgroup_recharge_do_precharge(void)
+{
+	int ret = -ENOMEM;
+	struct mem_cgroup *mem = recharge.to;
+
+	ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false, NULL);
+	if (ret || !mem)
+		return -ENOMEM;
+
+	recharge.precharge++;
+	return ret;
+}
+
+static int mem_cgroup_recharge_prepare_pte_range(pmd_t *pmd,
+					unsigned long addr, unsigned long end,
+					struct mm_walk *walk)
+{
+	int ret = 0;
+	unsigned long count = 0;
+	struct vm_area_struct *vma = walk->private;
+	pte_t *pte;
+	spinlock_t *ptl;
+
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (; addr != end; pte++, addr += PAGE_SIZE)
+		if (is_target_pte_for_recharge(vma, addr, *pte, NULL))
+			count++;
+	pte_unmap_unlock(pte - 1, ptl);
+
+	while (count-- && !ret)
+		ret = mem_cgroup_recharge_do_precharge();
+
+	return ret;
+}
+
+static int mem_cgroup_recharge_prepare(void)
+{
+	int ret = 0;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+
+	mm = get_task_mm(recharge.target);
+	if (!mm)
+		return 0;
+
+	down_read(&mm->mmap_sem);
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		struct mm_walk mem_cgroup_recharge_prepare_walk = {
+			.pmd_entry = mem_cgroup_recharge_prepare_pte_range,
+			.mm = mm,
+			.private = vma,
+		};
+		if (signal_pending(current)) {
+			ret = -EINTR;
+			break;
+		}
+		if (is_vm_hugetlb_page(vma))
+			continue;
+		ret = walk_page_range(vma->vm_start, vma->vm_end,
+					&mem_cgroup_recharge_prepare_walk);
+		if (ret)
+			break;
+		cond_resched();
+	}
+	up_read(&mm->mmap_sem);
+
+	mmput(mm);
+	return ret;
+}
+
+static void mem_cgroup_clear_recharge(void)
+{
+	while (recharge.precharge--)
+		mem_cgroup_cancel_charge(recharge.to);
+	recharge.from = NULL;
+	recharge.to = NULL;
+	recharge.target = NULL;
+}
+
 static int mem_cgroup_can_recharge(struct mem_cgroup *mem,
 					struct task_struct *p)
 {
-	return 0;
+	int ret;
+	struct mem_cgroup *from = mem_cgroup_from_task(p);
+
+	if (from == mem)
+		return 0;
+
+	recharge.from = from;
+	recharge.to = mem;
+	recharge.target = p;
+	recharge.precharge = 0;
+
+	ret = mem_cgroup_recharge_prepare();
+
+	if (ret)
+		mem_cgroup_clear_recharge();
+	return ret;
 }
 
 static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
@@ -3442,11 +3616,104 @@ static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
 				struct task_struct *p,
 				bool threadgroup)
 {
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
+
 	mutex_unlock(&memcg_tasklist);
+	if (mem->recharge_at_immigrate && thread_group_leader(p))
+		mem_cgroup_clear_recharge();
+}
+
+static int mem_cgroup_recharge_pte_range(pmd_t *pmd,
+				unsigned long addr, unsigned long end,
+				struct mm_walk *walk)
+{
+	int ret = 0;
+	struct vm_area_struct *vma = walk->private;
+	pte_t *pte;
+	spinlock_t *ptl;
+
+retry:
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (; addr != end; addr += PAGE_SIZE) {
+		pte_t ptent = *(pte++);
+		union recharge_target target;
+		int type;
+		struct page *page;
+		struct page_cgroup *pc;
+
+		if (!recharge.precharge)
+			break;
+
+		type = is_target_pte_for_recharge(vma, addr, ptent, &target);
+		switch (type) {
+		case RECHARGE_TARGET_PAGE:
+			page = target.page;
+			if (isolate_lru_page(page))
+				goto put;
+			pc = lookup_page_cgroup(page);
+			if (!mem_cgroup_move_account(pc,
+						recharge.from, recharge.to)) {
+				css_put(&recharge.to->css);
+				recharge.precharge--;
+			}
+			putback_lru_page(page);
+put:			/* is_target_pte_for_recharge() gets the page */
+			put_page(page);
+			break;
+		default:
+			continue;
+		}
+	}
+	pte_unmap_unlock(pte - 1, ptl);
+
+	if (addr != end) {
+		/*
+		 * We have consumed all precharges we got in can_attach().
+		 * We try to precharge one by one here, but we don't do any
+		 * further precharges or recharges to recharge.to once a
+		 * precharge has failed in the attach() phase.
+		 */
+		ret = mem_cgroup_recharge_do_precharge();
+		if (!ret)
+			goto retry;
+	}
+
+	return ret;
 }
 
 static void mem_cgroup_recharge(void)
 {
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+
+	mm = get_task_mm(recharge.target);
+	if (!mm)
+		return;
+
+	lru_add_drain_all();
+	down_read(&mm->mmap_sem);
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		int ret;
+		struct mm_walk mem_cgroup_recharge_walk = {
+			.pmd_entry = mem_cgroup_recharge_pte_range,
+			.mm = mm,
+			.private = vma,
+		};
+		if (is_vm_hugetlb_page(vma))
+			continue;
+		ret = walk_page_range(vma->vm_start, vma->vm_end,
+						&mem_cgroup_recharge_walk);
+		if (ret)
+			/*
+			 * This means we have consumed all precharges and failed
+			 * in doing an additional precharge. Just abandon here.
+			 */
+			break;
+		cond_resched();
+	}
+	up_read(&mm->mmap_sem);
+
+	mmput(mm);
 }
 
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
@@ -3458,8 +3725,10 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
 
 	mutex_unlock(&memcg_tasklist);
-	if (mem->recharge_at_immigrate && thread_group_leader(p))
+	if (mem->recharge_at_immigrate && thread_group_leader(p)) {
 		mem_cgroup_recharge();
+		mem_cgroup_clear_recharge();
+	}
 }
 
 struct cgroup_subsys mem_cgroup_subsys = {
-- 
1.5.6.1


* [PATCH -mmotm 7/8] memcg: avoid oom during recharge at task move
  2009-11-06  5:10 [PATCH -mmotm 0/8] memcg: recharge at task move Daisuke Nishimura
                   ` (5 preceding siblings ...)
  2009-11-06  5:14 ` [PATCH -mmotm 6/8] memcg: recharge charges of anonymous page Daisuke Nishimura
@ 2009-11-06  5:15 ` Daisuke Nishimura
  2009-11-06  6:39   ` KAMEZAWA Hiroyuki
  2009-11-06  5:16 ` [PATCH -mmotm 8/8] memcg: recharge charges of anonymous swap Daisuke Nishimura
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 39+ messages in thread
From: Daisuke Nishimura @ 2009-11-06  5:15 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan,
	Paul Menage, Daisuke Nishimura

This recharge-at-task-move feature holds extra charges (pre-charges) on the "to"
mem_cgroup while recharging is in progress. This means an unnecessary oom can happen.

This patch tries to avoid such an oom.
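
The added check, refactored into a hypothetical helper for illustration (the
real code open-codes this in __mem_cgroup_try_charge()'s retry loop and also
takes rcu_read_lock()/css_tryget() to guard "dest" against rmdir):

	/* Should this task keep retrying reclaim instead of declaring oom? */
	static bool recharge_in_flight_to(struct mem_cgroup *mem_over_limit)
	{
		struct mem_cgroup *dest = recharge.to;

		if (!recharge.working || current == recharge.working)
			return false;	/* no recharge in flight, or we are the mover */
		if (!dest)
			return false;
		if (dest->use_hierarchy)
			return css_is_ancestor(&dest->css, &mem_over_limit->css);
		return dest == mem_over_limit;
	}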

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 mm/memcontrol.c |   27 +++++++++++++++++++++++++++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f4b7116..7e96f3b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -248,6 +248,7 @@ struct recharge_struct {
 	struct mem_cgroup *from;
 	struct mem_cgroup *to;
 	struct task_struct *target;	/* the target task being moved */
+	struct task_struct *working;	/* a task moving the target task */
 	unsigned long precharge;
 };
 static struct recharge_struct recharge;
@@ -1493,6 +1494,30 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 		if (mem_cgroup_check_under_limit(mem_over_limit))
 			continue;
 
+		/* try to avoid oom while someone is recharging */
+		if (recharge.working && current != recharge.working) {
+			struct mem_cgroup *dest;
+			bool do_continue = false;
+			/*
+			 * There is a small race that "dest" can be freed by
+			 * rmdir, so we use css_tryget().
+			 */
+			rcu_read_lock();
+			dest = recharge.to;
+			if (dest && css_tryget(&dest->css)) {
+				if (dest->use_hierarchy)
+					do_continue = css_is_ancestor(
+							&dest->css,
+							&mem_over_limit->css);
+				else
+					do_continue = (dest == mem_over_limit);
+				css_put(&dest->css);
+			}
+			rcu_read_unlock();
+			if (do_continue)
+				continue;
+		}
+
 		if (!nr_retries--) {
 			if (oom) {
 				mutex_lock(&memcg_tasklist);
@@ -3573,6 +3598,7 @@ static void mem_cgroup_clear_recharge(void)
 	recharge.from = NULL;
 	recharge.to = NULL;
 	recharge.target = NULL;
+	recharge.working = NULL;
 }
 
 static int mem_cgroup_can_recharge(struct mem_cgroup *mem,
@@ -3587,6 +3613,7 @@ static int mem_cgroup_can_recharge(struct mem_cgroup *mem,
 	recharge.from = from;
 	recharge.to = mem;
 	recharge.target = p;
+	recharge.working = current;
 	recharge.precharge = 0;
 
 	ret = mem_cgroup_recharge_prepare();
-- 
1.5.6.1


* [PATCH -mmotm 8/8] memcg: recharge charges of anonymous swap
  2009-11-06  5:10 [PATCH -mmotm 0/8] memcg: recharge at task move Daisuke Nishimura
                   ` (6 preceding siblings ...)
  2009-11-06  5:15 ` [PATCH -mmotm 7/8] memcg: avoid oom during recharge at task move Daisuke Nishimura
@ 2009-11-06  5:16 ` Daisuke Nishimura
  2009-11-06  6:45 ` [PATCH -mmotm 0/8] memcg: recharge at task move KAMEZAWA Hiroyuki
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Daisuke Nishimura @ 2009-11-06  5:16 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki, Li Zefan,
	Paul Menage, Daisuke Nishimura

This patch is another core part of this recharge-at-task-move feature.
It enables recharge of anonymous swap entries.

To move the charge of a swap entry, we need to exchange swap_cgroup's record.

In the current implementation, swap_cgroup's record is protected by:

  - page lock: if the entry is on swap cache.
  - swap_lock: if the entry is not on swap cache.

This works well in usual swap-in/out activity.

But this behavior makes charge migration of swap have to check many conditions
to exchange swap_cgroup's record safely.

So I changed the modification of swap_cgroup's record (swap_cgroup_record())
to use xchg, and defined a new function to cmpxchg swap_cgroup's record.

This patch also enables recharge of swap caches which are not pte_present but
not yet uncharged (these can exist on the swap-out path), by getting the target
pages via find_get_page() as do_mincore() does.
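
With that, the record move itself reduces to a single compare-and-swap on the
per-entry id (a sketch of the core of mem_cgroup_move_swap_account() below;
the counter and refcount fixups are elided):

	unsigned short old_id = css_id(&from->css);
	unsigned short new_id = css_id(&to->css);

	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
		/* we won: fix up memsw/res counters and mem_cgroup refcounts */
	} else {
		/* the record changed under us (swap-in, another mover): give up */
	}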

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 Documentation/cgroups/memory.txt |    5 +-
 include/linux/page_cgroup.h      |    2 +
 include/linux/swap.h             |    1 +
 mm/memcontrol.c                  |  140 ++++++++++++++++++++++++++++++--------
 mm/page_cgroup.c                 |   35 +++++++++-
 mm/swapfile.c                    |   32 +++++++++
 6 files changed, 182 insertions(+), 33 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 54281ff..2820626 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -441,8 +441,9 @@ And if you want disable it again
 We recharge a charge which meets the following conditions.
 
 a. It must be charged to the old cgroup.
-b. A charge of an anonymous page used by the target task. The page must be used
-   only by the target task.
+b. A charge of an anonymous page (or the swap of it) used by the target task.
+   The page (or swap) must be used only by the target task. You must enable
+   Swap Extension (see 2.4) to enable recharge of swap.
 
 Note: More types of pages (e.g. file cache, shmem) will be supported in the future.
 
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index b0e4eb1..30b0813 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -118,6 +118,8 @@ static inline void __init page_cgroup_init_flatmem(void)
 #include <linux/swap.h>
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
+					unsigned short old, unsigned short new);
 extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
 extern unsigned short lookup_swap_cgroup(swp_entry_t ent);
 extern int swap_cgroup_swapon(int type, unsigned long max_pages);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 9f0ca32..2a3209e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -355,6 +355,7 @@ static inline void disable_swap_token(void)
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 extern void
 mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout);
+extern int mem_cgroup_count_swap_user(swp_entry_t ent, struct page **pagep);
 #else
 static inline void
 mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7e96f3b..50e28df 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -34,6 +34,7 @@
 #include <linux/rbtree.h>
 #include <linux/slab.h>
 #include <linux/swap.h>
+#include <linux/swapops.h>
 #include <linux/spinlock.h>
 #include <linux/fs.h>
 #include <linux/seq_file.h>
@@ -2231,6 +2232,49 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
 	}
 	rcu_read_unlock();
 }
+
+/**
+ * mem_cgroup_move_swap_account - move swap charge and swap_cgroup's record.
+ * @entry: swap entry to be moved
+ * @from:  mem_cgroup which the entry is moved from
+ * @to:  mem_cgroup which the entry is moved to
+ *
+ * It succeeds only when the swap_cgroup's record for this entry is the same
+ * as the mem_cgroup's id of @from.
+ *
+ * Returns 0 on success, 1 on failure.
+ *
+ * The caller must have called __mem_cgroup_try_charge on @to.
+ */
+static int mem_cgroup_move_swap_account(swp_entry_t entry,
+				struct mem_cgroup *from, struct mem_cgroup *to)
+{
+	unsigned short old_id, new_id;
+
+	old_id = css_id(&from->css);
+	new_id = css_id(&to->css);
+
+	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
+		if (!mem_cgroup_is_root(from))
+			res_counter_uncharge(&from->memsw, PAGE_SIZE);
+		mem_cgroup_swap_statistics(from, false);
+		mem_cgroup_put(from);
+
+		if (!mem_cgroup_is_root(to))
+			res_counter_uncharge(&to->res, PAGE_SIZE);
+		mem_cgroup_swap_statistics(to, true);
+		mem_cgroup_get(to);
+
+		return 0;
+	}
+	return 1;
+}
+#else
+static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
+				struct mem_cgroup *from, struct mem_cgroup *to)
+{
+	return 1;
+}
 #endif
 
 /*
@@ -3460,63 +3504,92 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
  * @vma: the vma the pte to be checked belongs
  * @addr: the address corresponding to the pte to be checked
  * @ptent: the pte to be checked
- * @target: the pointer the target page will be stored(can be NULL)
+ * @target: the pointer the target page or entry will be stored(can be NULL)
  *
  * Returns
  *   0(RECHARGE_TARGET_NONE): if the pte is not a target for recharge.
  *   1(RECHARGE_TARGET_PAGE): if the page corresponding to this pte is a target
  *     for recharge. if @target is not NULL, the page is stored in target->page
  *     with extra refcnt got(Callers should handle it).
+ *   2(RECHARGE_TARGET_SWAP): if the swap entry corresponding to this pte is a
+ *     target for recharge. if @target is not NULL, the entry is stored
+ *     in target->ent.
  *
  * Called with pte lock held.
  */
-/* We add a new member later. */
 union recharge_target {
 	struct page	*page;
+	swp_entry_t	ent;
 };
 
-/* We add a new type later. */
 enum recharge_target_type {
 	RECHARGE_TARGET_NONE,	/* not used */
 	RECHARGE_TARGET_PAGE,
+	RECHARGE_TARGET_SWAP,
 };
 
 static int is_target_pte_for_recharge(struct vm_area_struct *vma,
 		unsigned long addr, pte_t ptent, union recharge_target *target)
 {
-	struct page *page;
+	struct page *page = NULL;
 	struct page_cgroup *pc;
+	swp_entry_t ent = { .val = 0 };
 	int ret = 0;
+	int user = 0;
 
-	if (!pte_present(ptent))
-		return 0;
-
-	page = vm_normal_page(vma, addr, ptent);
-	if (!page || !page_mapped(page))
-		return 0;
-	/* TODO: We don't recharge file(including shmem/tmpfs) pages for now. */
-	if (!PageAnon(page))
-		return 0;
-	/*
-	 * TODO: We don't recharge shared(used by multiple processes) pages
-	 * for now.
-	 */
-	if (page_mapcount(page) > 1)
-		return 0;
-	if (!get_page_unless_zero(page))
+	if (!pte_present(ptent)) {
+		/* TODO: handle swap of shmem/tmpfs */
+		if (pte_none(ptent) || pte_file(ptent))
+			return 0;
+		else if (is_swap_pte(ptent)) {
+			ent = pte_to_swp_entry(ptent);
+			if (non_swap_entry(ent))
+				return 0;
+			user = mem_cgroup_count_swap_user(ent, &page);
+		}
+	} else {
+		page = vm_normal_page(vma, addr, ptent);
+		if (!page || !page_mapped(page))
+			return 0;
+		/*
+		 * TODO: We don't recharge file(including shmem/tmpfs) pages
+		 * for now.
+		 */
+		if (!PageAnon(page))
+			return 0;
+		if (!get_page_unless_zero(page))
+			return 0;
+		user = page_mapcount(page);
+	}
+	if (user > 1) {
+		/*
+		 * TODO: We don't recharge shared(used by multiple processes)
+		 * pages for now.
+		 */
+		if (page)
+			put_page(page);
 		return 0;
+	}
 
-	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
-	if (PageCgroupUsed(pc) && pc->mem_cgroup == recharge.from) {
-		ret = RECHARGE_TARGET_PAGE;
+	if (page) {
+		pc = lookup_page_cgroup(page);
+		lock_page_cgroup(pc);
+		if (PageCgroupUsed(pc) && pc->mem_cgroup == recharge.from) {
+			ret = RECHARGE_TARGET_PAGE;
+			if (target)
+				target->page = page;
+		}
+		unlock_page_cgroup(pc);
+		if (!ret || !target)
+			put_page(page);
+	}
+	/* fall through */
+	if (ent.val && do_swap_account && !ret &&
+		css_id(&recharge.from->css) == lookup_swap_cgroup(ent)) {
+		ret = RECHARGE_TARGET_SWAP;
 		if (target)
-			target->page = page;
+			target->ent = ent;
 	}
-	unlock_page_cgroup(pc);
-
-	if (!ret || !target)
-		put_page(page);
 
 	return ret;
 }
@@ -3667,6 +3740,7 @@ retry:
 		int type;
 		struct page *page;
 		struct page_cgroup *pc;
+		swp_entry_t ent;
 
 		if (!recharge.precharge)
 			break;
@@ -3687,6 +3761,14 @@ retry:
 put:			/* is_target_pte_for_recharge() gets the page */
 			put_page(page);
 			break;
+		case RECHARGE_TARGET_SWAP:
+			ent = target.ent;
+			if (!mem_cgroup_move_swap_account(ent,
+						recharge.from, recharge.to)) {
+				css_put(&recharge.to->css);
+				recharge.precharge--;
+			}
+			break;
 		default:
 			continue;
 		}
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 3d535d5..213b0ee 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -9,6 +9,7 @@
 #include <linux/vmalloc.h>
 #include <linux/cgroup.h>
 #include <linux/swapops.h>
+#include <asm/cmpxchg.h>
 
 static void __meminit
 __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
@@ -335,6 +336,37 @@ not_enough_page:
 }
 
 /**
+ * swap_cgroup_cmpxchg - cmpxchg mem_cgroup's id for this swp_entry.
+ * @ent: swap entry to be cmpxchged
+ * @old: old id
+ * @new: new id
+ *
+ * Returns the old id on success, 0 on failure.
+ * (There is no mem_cgroup using 0 as its id.)
+ */
+unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
+					unsigned short old, unsigned short new)
+{
+	int type = swp_type(ent);
+	unsigned long offset = swp_offset(ent);
+	unsigned long idx = offset / SC_PER_PAGE;
+	unsigned long pos = offset & SC_POS_MASK;
+	struct swap_cgroup_ctrl *ctrl;
+	struct page *mappage;
+	struct swap_cgroup *sc;
+
+	ctrl = &swap_cgroup_ctrl[type];
+
+	mappage = ctrl->map[idx];
+	sc = page_address(mappage);
+	sc += pos;
+	if (cmpxchg(&sc->id, old, new) == old)
+		return old;
+	else
+		return 0;
+}
+
+/**
  * swap_cgroup_record - record mem_cgroup for this swp_entry.
  * @ent: swap entry to be recorded into
  * @mem: mem_cgroup to be recorded
@@ -358,8 +390,7 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
 	mappage = ctrl->map[idx];
 	sc = page_address(mappage);
 	sc += pos;
-	old = sc->id;
-	sc->id = id;
+	old = xchg(&sc->id, id);
 
 	return old;
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 93e71cf..28eef54 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -719,6 +719,38 @@ int free_swap_and_cache(swp_entry_t entry)
 	return p != NULL;
 }
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+/**
+ * mem_cgroup_count_swap_user - count the users of a swap entry
+ * @ent: the swap entry to be checked
+ * @pagep: pointer where the swap cache page of the entry (if any) is stored
+ *
+ * Returns the number of users of the swap entry.
+ * If the entry is found in the swap cache, the page is stored in *pagep with
+ * its refcount incremented.
+ *
+ * This function can be used only for swaps for anonymous pages.
+ */
+int mem_cgroup_count_swap_user(swp_entry_t ent, struct page **pagep)
+{
+	struct page *page;
+	struct swap_info_struct *p;
+	int count = 0;
+
+	page = find_get_page(&swapper_space, ent.val);
+	if (page)
+		count += page_mapcount(page);
+	p = swap_info_get(ent);
+	if (p) {
+		count += swap_count(p->swap_map[swp_offset(ent)]);
+		spin_unlock(&swap_lock);
+	}
+
+	*pagep = page;
+	return count;
+}
+#endif
+
 #ifdef CONFIG_HIBERNATION
 /*
  * Find the swap type that corresponds to given device (if any).
-- 
1.5.6.1


* Re: [PATCH -mmotm 2/8] memcg: move memcg_tasklist mutex
  2009-11-06  5:11 ` [PATCH -mmotm 2/8] memcg: move memcg_tasklist mutex Daisuke Nishimura
@ 2009-11-06  5:54   ` KAMEZAWA Hiroyuki
  2009-11-06  7:49     ` Daisuke Nishimura
  2009-11-10 19:14   ` Balbir Singh
  1 sibling, 1 reply; 39+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-06  5:54 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: linux-mm, Andrew Morton, Balbir Singh, Li Zefan, Paul Menage

On Fri, 6 Nov 2009 14:11:49 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> memcg_tasklist was introduced to serialize mem_cgroup_out_of_memory() and
> mem_cgroup_move_task() to ensure tasks cannot be moved to another cgroup
> during select_bad_process().
> 
> task_in_mem_cgroup(), which can be called by select_bad_process(), will check
> whether a task is in the mem_cgroup or not by dereferencing task->cgroups
> ->subsys[]. So, it would be desirable to change task->cgroups
> (rcu_assign_pointer() in cgroup_attach_task() does it) with memcg_tasklist held.
> 
> Now that we can define cancel_attach(), we can safely release memcg_tasklist on
> the failure path even though we take it in can_attach(). So let's move the
> mutex_lock()/mutex_unlock() of memcg_tasklist accordingly.
> 
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
>  mm/memcontrol.c |   22 ++++++++++++++++++++--
>  1 files changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 4bd3451..d3b2ac0 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3395,18 +3395,34 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
>  	return ret;
>  }
>  
> +static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
> +				struct cgroup *cgroup,
> +				struct task_struct *p,
> +				bool threadgroup)
> +{
> +	mutex_lock(&memcg_tasklist);
> +	return 0;
> +}

Hmm...Is this lock really necessary ?
IOW, can't we just remove memcg_tasklist mutex ?
What kind of bad race happens when we remove this ?

Thanks,
-Kame

> +
> +static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
> +				struct cgroup *cgroup,
> +				struct task_struct *p,
> +				bool threadgroup)
> +{
> +	mutex_unlock(&memcg_tasklist);
> +}
> +
>  static void mem_cgroup_move_task(struct cgroup_subsys *ss,
>  				struct cgroup *cont,
>  				struct cgroup *old_cont,
>  				struct task_struct *p,
>  				bool threadgroup)
>  {
> -	mutex_lock(&memcg_tasklist);
> +	mutex_unlock(&memcg_tasklist);
>  	/*
>  	 * FIXME: It's better to move charges of this process from old
>  	 * memcg to new memcg. But it's just on TODO-List now.
>  	 */
> -	mutex_unlock(&memcg_tasklist);
>  }
>  
>  struct cgroup_subsys mem_cgroup_subsys = {
> @@ -3416,6 +3432,8 @@ struct cgroup_subsys mem_cgroup_subsys = {
>  	.pre_destroy = mem_cgroup_pre_destroy,
>  	.destroy = mem_cgroup_destroy,
>  	.populate = mem_cgroup_populate,
> +	.can_attach = mem_cgroup_can_attach,
> +	.cancel_attach = mem_cgroup_cancel_attach,
>  	.attach = mem_cgroup_move_task,
>  	.early_init = 0,
>  	.use_id = 1,


Thanks,
-Kame




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 4/8] memcg: cleanup mem_cgroup_move_parent()
  2009-11-06  5:13 ` [PATCH -mmotm 4/8] memcg: cleanup mem_cgroup_move_parent() Daisuke Nishimura
@ 2009-11-06  5:56   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 39+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-06  5:56 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: linux-mm, Andrew Morton, Balbir Singh, Li Zefan, Paul Menage

On Fri, 6 Nov 2009 14:13:01 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> mem_cgroup_move_parent() calls try_charge first and cancel_charge on failure.
> IMHO, charge/uncharge (especially charge) is a high-cost operation, so we should
> avoid it as much as possible.
> 
> This patch tries to delay try_charge in mem_cgroup_move_parent() by re-ordering
> checks it does.
> 
> And this patch renames mem_cgroup_move_account() to __mem_cgroup_move_account(),
> changes the return value of __mem_cgroup_move_account() from int to void,
> and adds a new wrapper(mem_cgroup_move_account()), which checks whether a @pc
> is valid for moving account and calls __mem_cgroup_move_account().
> 
> This patch removes the last caller of trylock_page_cgroup(), so removes its
> definition too.
> 
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

> ---
>  include/linux/page_cgroup.h |    7 +---
>  mm/memcontrol.c             |   84 ++++++++++++++++++-------------------------
>  2 files changed, 37 insertions(+), 54 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 4b938d4..b0e4eb1 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -57,6 +57,8 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
>  static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
>  	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
>  
> +TESTPCGFLAG(Locked, LOCK)
> +
>  /* Cache flag is set only once (at allocation) */
>  TESTPCGFLAG(Cache, CACHE)
>  CLEARPCGFLAG(Cache, CACHE)
> @@ -86,11 +88,6 @@ static inline void lock_page_cgroup(struct page_cgroup *pc)
>  	bit_spin_lock(PCG_LOCK, &pc->flags);
>  }
>  
> -static inline int trylock_page_cgroup(struct page_cgroup *pc)
> -{
> -	return bit_spin_trylock(PCG_LOCK, &pc->flags);
> -}
> -
>  static inline void unlock_page_cgroup(struct page_cgroup *pc)
>  {
>  	bit_spin_unlock(PCG_LOCK, &pc->flags);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 05e837c..1ad3248 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1613,27 +1613,22 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>  }
>  
>  /**
> - * mem_cgroup_move_account - move account of the page
> + * __mem_cgroup_move_account - move account of the page
>   * @pc:	page_cgroup of the page.
>   * @from: mem_cgroup which the page is moved from.
>   * @to:	mem_cgroup which the page is moved to. @from != @to.
>   *
>   * The caller must confirm following.
>   * - page is not on LRU (isolate_page() is useful.)
> - *
> - * returns 0 at success,
> - * returns -EBUSY when lock is busy or "pc" is unstable.
> + * - the pc is locked, used, and ->mem_cgroup points to @from.
>   *
>   * This function does "uncharge" from old cgroup but doesn't do "charge" to
>   * new cgroup. It should be done by a caller.
>   */
>  
> -static int mem_cgroup_move_account(struct page_cgroup *pc,
> +static void __mem_cgroup_move_account(struct page_cgroup *pc,
>  	struct mem_cgroup *from, struct mem_cgroup *to)
>  {
> -	struct mem_cgroup_per_zone *from_mz, *to_mz;
> -	int nid, zid;
> -	int ret = -EBUSY;
>  	struct page *page;
>  	int cpu;
>  	struct mem_cgroup_stat *stat;
> @@ -1641,20 +1636,9 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
>  
>  	VM_BUG_ON(from == to);
>  	VM_BUG_ON(PageLRU(pc->page));
> -
> -	nid = page_cgroup_nid(pc);
> -	zid = page_cgroup_zid(pc);
> -	from_mz =  mem_cgroup_zoneinfo(from, nid, zid);
> -	to_mz =  mem_cgroup_zoneinfo(to, nid, zid);
> -
> -	if (!trylock_page_cgroup(pc))
> -		return ret;
> -
> -	if (!PageCgroupUsed(pc))
> -		goto out;
> -
> -	if (pc->mem_cgroup != from)
> -		goto out;
> +	VM_BUG_ON(!PageCgroupLocked(pc));
> +	VM_BUG_ON(!PageCgroupUsed(pc));
> +	VM_BUG_ON(pc->mem_cgroup != from);
>  
>  	if (!mem_cgroup_is_root(from))
>  		res_counter_uncharge(&from->res, PAGE_SIZE);
> @@ -1683,15 +1667,28 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
>  	css_get(&to->css);
>  	pc->mem_cgroup = to;
>  	mem_cgroup_charge_statistics(to, pc, true);
> -	ret = 0;
> -out:
> -	unlock_page_cgroup(pc);
>  	/*
>  	 * We charges against "to" which may not have any tasks. Then, "to"
>  	 * can be under rmdir(). But in current implementation, caller of
>  	 * this function is just force_empty() and it's garanteed that
>  	 * "to" is never removed. So, we don't check rmdir status here.
>  	 */
> +}
> +
> +/*
> + * check whether the @pc is valid for moving account and call
> + * __mem_cgroup_move_account()
> + */
> +static int mem_cgroup_move_account(struct page_cgroup *pc,
> +				struct mem_cgroup *from, struct mem_cgroup *to)
> +{
> +	int ret = -EINVAL;
> +	lock_page_cgroup(pc);
> +	if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
> +		__mem_cgroup_move_account(pc, from, to);
> +		ret = 0;
> +	}
> +	unlock_page_cgroup(pc);
>  	return ret;
>  }
>  
> @@ -1713,38 +1710,27 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc,
>  	if (!pcg)
>  		return -EINVAL;
>  
> +	ret = -EBUSY;
> +	if (!get_page_unless_zero(page))
> +		goto out;
> +	if (isolate_lru_page(page))
> +		goto put;
>  
>  	parent = mem_cgroup_from_cont(pcg);
> -
> -
>  	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page);
>  	if (ret || !parent)
> -		return ret;
> -
> -	if (!get_page_unless_zero(page)) {
> -		ret = -EBUSY;
> -		goto uncharge;
> -	}
> -
> -	ret = isolate_lru_page(page);
> -
> -	if (ret)
> -		goto cancel;
> +		goto put_back;
>  
>  	ret = mem_cgroup_move_account(pc, child, parent);
> -
> +	if (!ret)
> +		css_put(&parent->css);	/* drop extra refcnt by try_charge() */
> +	else
> +		mem_cgroup_cancel_charge(parent);	/* does css_put */
> +put_back:
>  	putback_lru_page(page);
> -	if (!ret) {
> -		put_page(page);
> -		/* drop extra refcnt by try_charge() */
> -		css_put(&parent->css);
> -		return 0;
> -	}
> -
> -cancel:
> +put:
>  	put_page(page);
> -uncharge:
> -	mem_cgroup_cancel_charge(parent);
> +out:
>  	return ret;
>  }
>  
> -- 
> 1.5.6.1
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 5/8] memcg: add interface to recharge at task move
  2009-11-06  5:14 ` [PATCH -mmotm 5/8] memcg: add interface to recharge at task move Daisuke Nishimura
@ 2009-11-06  6:06   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 39+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-06  6:06 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: linux-mm, Andrew Morton, Balbir Singh, Li Zefan, Paul Menage

On Fri, 6 Nov 2009 14:14:18 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> In current memcg, charges associated with a task aren't moved to the new cgroup
> at task move. These patches are for this feature, that is, for recharging to
> the new cgroup and, of course, uncharging from old cgroup at task move.
> 
> This patch adds "memory.recharge_at_immigrate" file, which is a flag file to
> determine whether charges should be moved to the new cgroup at task move or
> not, and read/write handlers of the file.
> This patch also adds no-op handlers for this feature. These handlers will be
> implemented in later patches.
> 
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

BTW, even if this patch is just for the interface, adding Documentation here
is a good choice, I think. It makes things clear.

> +
> +	if (mem->recharge_at_immigrate && thread_group_leader(p))
> +		ret = mem_cgroup_can_recharge(mem, p);

My small concern is whether thread_group_leader(p) is _always_ equal to
mm->owner. If not, things will be complicated.

Could you clarify ? If mm->owner is better, we should use mm->owner here.
(Non-usual case but..)
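
Just to illustrate the mm->owner alternative in can_attach() (a rough, untested
sketch; the exact lifetime/locking rules of mm->owner would need more thought):
==
	struct mm_struct *mm = get_task_mm(p);

	if (mm) {
		bool is_owner;

		rcu_read_lock();
		is_owner = (rcu_dereference(mm->owner) == p);
		rcu_read_unlock();
		if (is_owner && mem->recharge_at_immigrate)
			ret = mem_cgroup_can_recharge(mem, p);
		mmput(mm);
	}
==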

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 6/8] memcg: recharge charges of anonymous page
  2009-11-06  5:14 ` [PATCH -mmotm 6/8] memcg: recharge charges of anonymous page Daisuke Nishimura
@ 2009-11-06  6:35   ` KAMEZAWA Hiroyuki
  2009-11-09  0:31     ` Daisuke Nishimura
  0 siblings, 1 reply; 39+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-06  6:35 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: linux-mm, Andrew Morton, Balbir Singh, Li Zefan, Paul Menage

On Fri, 6 Nov 2009 14:14:48 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> This patch is the core part of this recharge-at-task-move feature.
> It implements functions to recharge charges of anonymous pages mapped only by
> the target task.
> 
> Implementation:
> - define struct recharge_struct and a variable of it (recharge) to remember
>   the count of pre-charges and other information.
> - At can_attach(), parse the page table of the task and count the number of
>   mapped pages which are charged to the source mem_cgroup, and call
>   __mem_cgroup_try_charge() repeatedly and count up recharge.precharge.
> - At attach(), parse the page table again, find a target page as we did in
>   can_attach(), and call mem_cgroup_move_account() about the page.
> - Cancel all charges if recharge.precharge > 0 on failure or at the end of
>   task move.
> 

Changelog ?


> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
>  Documentation/cgroups/memory.txt |   36 +++++-
>  mm/memcontrol.c                  |  275 +++++++++++++++++++++++++++++++++++++-
>  2 files changed, 306 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index b871f25..54281ff 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -262,10 +262,12 @@ some of the pages cached in the cgroup (page cache pages).
>  4.2 Task migration
>  
>  When a task migrates from one cgroup to another, it's charge is not
> -carried forward. The pages allocated from the original cgroup still
> +carried forward by default. The pages allocated from the original cgroup still
>  remain charged to it, the charge is dropped when the page is freed or
>  reclaimed.
>  
> +Note: You can move charges of a task along with task migration. See 8.
> +
>  4.3 Removing a cgroup
>  
>  A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
> @@ -414,7 +416,37 @@ NOTE1: Soft limits take effect over a long period of time, since they involve
>  NOTE2: It is recommended to set the soft limit always below the hard limit,
>         otherwise the hard limit will take precedence.
>  
> -8. TODO
> +8. Recharge at task move
> +
> +Users can move charges associated with a task along with task move, that is,
> +uncharge from the old cgroup and charge to the new cgroup.
> +
> +8.1 Interface
> +
> +This feature is disabled by default. It can be enabled(and disabled again) by
> +writing to memory.recharge_at_immigrate of the destination cgroup.
> +
> +If you want to enable it
> +
> +# echo 1 > memory.recharge_at_immigrate
> +
> +Note: Values greater than 1 will be supported in the future. See 8.2.
> +
> +And if you want to disable it again
> +
> +# echo 0 > memory.recharge_at_immigrate
> +
> +8.2 Type of charges which can be recharged
> +
> +We recharge a charge which meets the following conditions.
> +
> +a. It must be charged to the old cgroup.
> +b. A charge of an anonymous page used by the target task. The page must be used
> +   only by the target task.
> +
> +Note: More types of pages (e.g. file cache, shmem) will be supported in the future.
> +
> +9. TODO
>  
>  1. Add support for accounting huge pages (as a separate controller)
>  2. Make per-cgroup scanner reclaim not-shared pages first
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index afa1179..f4b7116 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -21,6 +21,8 @@
>  #include <linux/memcontrol.h>
>  #include <linux/cgroup.h>
>  #include <linux/mm.h>
> +#include <linux/migrate.h>
> +#include <linux/hugetlb.h>
>  #include <linux/pagemap.h>
>  #include <linux/smp.h>
>  #include <linux/page-flags.h>
> @@ -239,6 +241,18 @@ struct mem_cgroup {
>  };
>  
>  /*
> + * A data structure and a variable for recharging charges at task move.
> + * "recharge" and its members are protected by cgroup_lock
> + */
> +struct recharge_struct {
> +	struct mem_cgroup *from;
> +	struct mem_cgroup *to;
> +	struct task_struct *target;	/* the target task being moved */
> +	unsigned long precharge;
> +};
> +static struct recharge_struct recharge;
> +
> +/*
>   * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
>   * limit reclaim to prevent infinite loops, if they ever occur.
>   */
> @@ -1496,7 +1510,7 @@ charged:
>  	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
>  	 * if they exceeds softlimit.
>  	 */
> -	if (mem_cgroup_soft_limit_check(mem))
> +	if (page && mem_cgroup_soft_limit_check(mem))
>  		mem_cgroup_update_tree(mem, page);
>  done:
>  	return 0;
> @@ -3416,10 +3430,170 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
>  }
>  
>  /* Handlers for recharge at task move. */
> +/**
> + * is_target_pte_for_recharge - check a pte whether it is valid for recharge
> + * @vma: the vma the pte to be checked belongs
> + * @addr: the address corresponding to the pte to be checked
> + * @ptent: the pte to be checked
> + * @target: the pointer the target page will be stored(can be NULL)
> + *
> + * Returns
> + *   0(RECHARGE_TARGET_NONE): if the pte is not a target for recharge.
> + *   1(RECHARGE_TARGET_PAGE): if the page corresponding to this pte is a target
> + *     for recharge. if @target is not NULL, the page is stored in target->page
> + *     with extra refcnt got(Callers should handle it).
> + *
> + * Called with pte lock held.
> + */
> +/* We add a new member later. */
> +union recharge_target {
> +	struct page	*page;
> +};
> +
> +/* We add a new type later. */
> +enum recharge_target_type {
> +	RECHARGE_TARGET_NONE,	/* not used */
> +	RECHARGE_TARGET_PAGE,
> +};
> +
> +static int is_target_pte_for_recharge(struct vm_area_struct *vma,
> +		unsigned long addr, pte_t ptent, union recharge_target *target)
> +{
> +	struct page *page;
> +	struct page_cgroup *pc;
> +	int ret = 0;
> +
> +	if (!pte_present(ptent))
> +		return 0;
> +
> +	page = vm_normal_page(vma, addr, ptent);
> +	if (!page || !page_mapped(page))
> +		return 0;
> +	/* TODO: We don't recharge file(including shmem/tmpfs) pages for now. */
> +	if (!PageAnon(page))
> +		return 0;
> +	/*
> +	 * TODO: We don't recharge shared(used by multiple processes) pages
> +	 * for now.
> +	 */
> +	if (page_mapcount(page) > 1)
> +		return 0;
> +	if (!get_page_unless_zero(page))
> +		return 0;
> +
> +	pc = lookup_page_cgroup(page);
> +	lock_page_cgroup(pc);
> +	if (PageCgroupUsed(pc) && pc->mem_cgroup == recharge.from) {
> +		ret = RECHARGE_TARGET_PAGE;
> +		if (target)
> +			target->page = page;
> +	}
> +	unlock_page_cgroup(pc);
> +
> +	if (!ret || !target)
> +		put_page(page);
> +
> +	return ret;
> +}
> +
> +static int mem_cgroup_recharge_do_precharge(void)
> +{
> +	int ret = -ENOMEM;
> +	struct mem_cgroup *mem = recharge.to;
> +
> +	ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false, NULL);
> +	if (ret || !mem)
> +		return -ENOMEM;
> +
> +	recharge.precharge++;
> +	return ret;
> +}
> +
> +static int mem_cgroup_recharge_prepare_pte_range(pmd_t *pmd,
> +					unsigned long addr, unsigned long end,
> +					struct mm_walk *walk)
> +{
> +	int ret = 0;
> +	unsigned long count = 0;
> +	struct vm_area_struct *vma = walk->private;
> +	pte_t *pte;
> +	spinlock_t *ptl;
> +
> +	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	for (; addr != end; pte++, addr += PAGE_SIZE)
> +		if (is_target_pte_for_recharge(vma, addr, *pte, NULL))
> +			count++;
> +	pte_unmap_unlock(pte - 1, ptl);
> +
> +	while (count-- && !ret)
> +		ret = mem_cgroup_recharge_do_precharge();
> +
> +	return ret;
> +}
> +
> +static int mem_cgroup_recharge_prepare(void)
> +{
> +	int ret = 0;
> +	struct mm_struct *mm;
> +	struct vm_area_struct *vma;
> +
> +	mm = get_task_mm(recharge.target);
> +	if (!mm)
> +		return 0;
> +
> +	down_read(&mm->mmap_sem);
> +	for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +		struct mm_walk mem_cgroup_recharge_prepare_walk = {
> +			.pmd_entry = mem_cgroup_recharge_prepare_pte_range,
> +			.mm = mm,
> +			.private = vma,
> +		};
> +		if (signal_pending(current)) {
> +			ret = -EINTR;
> +			break;
> +		}
> +		if (is_vm_hugetlb_page(vma))
> +			continue;
> +		ret = walk_page_range(vma->vm_start, vma->vm_end,
> +					&mem_cgroup_recharge_prepare_walk);
> +		if (ret)
> +			break;
> +		cond_resched();
> +	}
> +	up_read(&mm->mmap_sem);
> +
> +	mmput(mm);
> +	return ret;
> +}
> +
> +static void mem_cgroup_clear_recharge(void)
> +{
> +	while (recharge.precharge--)
> +		mem_cgroup_cancel_charge(recharge.to);
> +	recharge.from = NULL;
> +	recharge.to = NULL;
> +	recharge.target = NULL;
> +}
> +
>  static int mem_cgroup_can_recharge(struct mem_cgroup *mem,
>  					struct task_struct *p)
>  {
> -	return 0;
> +	int ret;
> +	struct mem_cgroup *from = mem_cgroup_from_task(p);
> +
> +	if (from == mem)
> +		return 0;
> +
> +	recharge.from = from;
> +	recharge.to = mem;
> +	recharge.target = p;
> +	recharge.precharge = 0;
> +
> +	ret = mem_cgroup_recharge_prepare();
> +
> +	if (ret)
> +		mem_cgroup_clear_recharge();
> +	return ret;
>  }
>  

Hmm...Hmm...looks nicer. But can I have another suggestion ?

==
static int mem_cgroup_can_recharge(struct mem_cgroup *mem,
				struct task_struct *p)
{
	unsigned long rss;
	struct mm_struct *mm;

	mm = get_task_mm(p);
	if (!mm)
		return;

	rss = get_mm_counter(mm, anon_rss);
	precharge(rss);
	mmput(mm);
}
==
Do you think anonymous memory is so often shared at "move" time that
we need a page table scan ?

If typical sequence is
==
	fork()
		-> exec()

	move child
==
No problem will happen.

>  static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
> @@ -3442,11 +3616,104 @@ static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
>  				struct task_struct *p,
>  				bool threadgroup)
>  {
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
> +
>  	mutex_unlock(&memcg_tasklist);
> +	if (mem->recharge_at_immigrate && thread_group_leader(p))
> +		mem_cgroup_clear_recharge();
> +}
> +
> +static int mem_cgroup_recharge_pte_range(pmd_t *pmd,
> +				unsigned long addr, unsigned long end,
> +				struct mm_walk *walk)
> +{
> +	int ret = 0;
> +	struct vm_area_struct *vma = walk->private;
> +	pte_t *pte;
> +	spinlock_t *ptl;
> +
> +retry:
> +	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	for (; addr != end; addr += PAGE_SIZE) {
> +		pte_t ptent = *(pte++);
> +		union recharge_target target;
> +		int type;
> +		struct page *page;
> +		struct page_cgroup *pc;
> +
> +		if (!recharge.precharge)
> +			break;
> +
> +		type = is_target_pte_for_recharge(vma, addr, ptent, &target);
> +		switch (type) {
> +		case RECHARGE_TARGET_PAGE:
> +			page = target.page;
> +			if (isolate_lru_page(page))
> +				goto put;
> +			pc = lookup_page_cgroup(page);
> +			if (!mem_cgroup_move_account(pc,
> +						recharge.from, recharge.to)) {
> +				css_put(&recharge.to->css);
> +				recharge.precharge--;
> +			}
> +			putback_lru_page(page);
> +put:			/* is_target_pte_for_recharge() gets the page */
> +			put_page(page);
> +			break;
> +		default:
> +			continue;

continue for what ?

And we forget about "failed to move" ptes. This "move" is a best-effort service.
Right ?

> +		}
> +	}
> +	pte_unmap_unlock(pte - 1, ptl);
> +
> +	if (addr != end) {
> +		/*
> +		 * We have consumed all precharges we got in can_attach().
> +		 * We try precharge one by one, but don't do any additional
> +		 * precharges nor recharges to recharge.to if we have failed in
> +		 * precharge once in attach() phase.
> +		 */
> +		ret = mem_cgroup_recharge_do_precharge();
> +		if (!ret)
> +			goto retry;
> +	}
> +
> +	return ret;
>  }
>  



>  static void mem_cgroup_recharge(void)
>  {
> +	struct mm_struct *mm;
> +	struct vm_area_struct *vma;
> +
> +	mm = get_task_mm(recharge.target);
> +	if (!mm)
> +		return;
> +
> +	lru_add_drain_all();
> +	down_read(&mm->mmap_sem);
> +	for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +		int ret;
> +		struct mm_walk mem_cgroup_recharge_walk = {
> +			.pmd_entry = mem_cgroup_recharge_pte_range,
> +			.mm = mm,
> +			.private = vma,
> +		};
> +		if (is_vm_hugetlb_page(vma))
> +			continue;
> +		ret = walk_page_range(vma->vm_start, vma->vm_end,
> +						&mem_cgroup_recharge_walk);

At _this_ point, checking VM_SHARED and skipping the scan is a sane operation.
Could you add such a check ?
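
For example, next to the is_vm_hugetlb_page() check in the vma loop
(a rough, untested sketch):
==
		/*
		 * Only non-shared anonymous pages are moved for now, and
		 * they never live in a VM_SHARED mapping, so the scan can
		 * simply be skipped for such vmas.
		 */
		if (vma->vm_flags & VM_SHARED)
			continue;
==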


> +		if (ret)
> +			/*
> +			 * means we have consumed all precharges and failed in
> +			 * doing additional precharge. Just abandon here.
> +			 */
> +			break;
> +		cond_resched();
> +	}
> +	up_read(&mm->mmap_sem);
> +
> +	mmput(mm);
>  }
>  
>  static void mem_cgroup_move_task(struct cgroup_subsys *ss,
> @@ -3458,8 +3725,10 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
>  	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
>  
>  	mutex_unlock(&memcg_tasklist);
> -	if (mem->recharge_at_immigrate && thread_group_leader(p))
> +	if (mem->recharge_at_immigrate && thread_group_leader(p)) {
>  		mem_cgroup_recharge();
> +		mem_cgroup_clear_recharge();
> +	}

Is it guaranteed that thread_group_leader(p) is true if this was true at
can_attach() ?
If no,
	if (.....) {
		mem_cgroup_recharge()
	}
	mem_cgroup_clear_recharge()

is better.


Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 7/8] memcg: avoid oom during recharge at task move
  2009-11-06  5:15 ` [PATCH -mmotm 7/8] memcg: avoid oom during recharge at task move Daisuke Nishimura
@ 2009-11-06  6:39   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 39+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-06  6:39 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: linux-mm, Andrew Morton, Balbir Singh, Li Zefan, Paul Menage

On Fri, 6 Nov 2009 14:15:32 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> This recharge-at-task-move feature has extra charges(pre-charges) on "to"
> mem_cgroup during recharging. This means unnecessary oom can happen.
> 
> This patch tries to avoid such oom.
> 
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
>  mm/memcontrol.c |   27 +++++++++++++++++++++++++++
>  1 files changed, 27 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f4b7116..7e96f3b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -248,6 +248,7 @@ struct recharge_struct {
>  	struct mem_cgroup *from;
>  	struct mem_cgroup *to;
>  	struct task_struct *target;	/* the target task being moved */
> +	struct task_struct *working;	/* a task moving the target task */
>  	unsigned long precharge;
>  };
>  static struct recharge_struct recharge;
> @@ -1493,6 +1494,30 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  		if (mem_cgroup_check_under_limit(mem_over_limit))
>  			continue;
>  
> +		/* try to avoid oom while someone is recharging */
> +		if (recharge.working && current != recharge.working) {
> +			struct mem_cgroup *dest;
> +			bool do_continue = false;
> +			/*
> +			 * There is a small race that "dest" can be freed by
> +			 * rmdir, so we use css_tryget().
> +			 */
> +			rcu_read_lock();
> +			dest = recharge.to;
> +			if (dest && css_tryget(&dest->css)) {
> +				if (dest->use_hierarchy)
> +					do_continue = css_is_ancestor(
> +							&dest->css,
> +							&mem_over_limit->css);
> +				else
> +					do_continue = (dest == mem_over_limit);
> +				css_put(&dest->css);
> +			}
> +			rcu_read_unlock();
> +			if (do_continue)
> +				continue;
> +		}

I think it's better to do this here, rather than continue.

==
	if (do_continue) {
		mutex_lock(&memcg_tasklist);
		mutex_unlock(&memcg_tasklist);
		continue;
	}
==

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 0/8] memcg: recharge at task move
  2009-11-06  5:10 [PATCH -mmotm 0/8] memcg: recharge at task move Daisuke Nishimura
                   ` (7 preceding siblings ...)
  2009-11-06  5:16 ` [PATCH -mmotm 8/8] memcg: recharge charges of anonymous swap Daisuke Nishimura
@ 2009-11-06  6:45 ` KAMEZAWA Hiroyuki
  2009-11-09  1:44   ` Daisuke Nishimura
  2009-11-09  5:08 ` Balbir Singh
  2009-11-11  1:35 ` [PATCH -mmotm 0/3] some cleanups for memcg Daisuke Nishimura
  10 siblings, 1 reply; 39+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-06  6:45 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: linux-mm, Andrew Morton, Balbir Singh, Li Zefan, Paul Menage

On Fri, 6 Nov 2009 14:10:11 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> Hi.
> 
> In current memcg, charges associated with a task aren't moved to the new cgroup
> at task move. These patches are for this feature, that is, for recharging to
> the new cgroup and, of course, uncharging from old cgroup at task move.
> 
> Current virsion supports only recharge of non-shared(mapcount == 1) anonymous pages
> and swaps of those pages. I think it's enough as a first step.
> 
> [1/8] cgroup: introduce cancel_attach()
> [2/8] memcg: move memcg_tasklist mutex
> [3/8] memcg: add mem_cgroup_cancel_charge()
> [4/8] memcg: cleanup mem_cgroup_move_parent()
> [5/8] memcg: add interface to recharge at task move
> [6/8] memcg: recharge charges of anonymous page
> [7/8] memcg: avoid oom during recharge at task move
> [8/8] memcg: recharge charges of anonymous swap
> 
> 2 is dependent on 1 and 4 is dependent on 3.
> 3 and 4 are just for cleanups.
> 5-8 are the body of this feature.
> 
> Major Changes from Oct13:
> - removed "[RFC]".
> - rebased on mmotm-2009-11-01-10-01.
> - dropped support for file cache and shmem/tmpfs(revisit in future).
> - Updated Documentation/cgroup/memory.txt.
> 

Seems much nicer but I have some nitpicks as already commented.

For [8/8], an mm->swap_usage counter may be a help for making it faster.
My concern is how it's shared, but the error will not be very big.

> TODO:
> - add support for file cache, shmem/tmpfs, and shared(mapcount > 1) pages.
> - implement madvise(2) to let users decide the target vma for recharge.
> 

About this, I think a "force_move_shared_account" flag is enough.
But we have to clarify that "mmap()ed but not on page table" entries are not
moved....

Thanks,
-Kame




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 2/8] memcg: move memcg_tasklist mutex
  2009-11-06  5:54   ` KAMEZAWA Hiroyuki
@ 2009-11-06  7:49     ` Daisuke Nishimura
  2009-11-06  8:02       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 39+ messages in thread
From: Daisuke Nishimura @ 2009-11-06  7:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Andrew Morton, Balbir Singh, Li Zefan, Paul Menage,
	Daisuke Nishimura

On Fri, 6 Nov 2009 14:54:59 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 6 Nov 2009 14:11:49 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > memcg_tasklist was introduced to serialize mem_cgroup_out_of_memory() and
> > mem_cgroup_move_task() to ensure tasks cannot be moved to another cgroup
> > during select_bad_process().
> > 
> > task_in_mem_cgroup(), which can be called by select_bad_process(), will check
> > whether a task is in the mem_cgroup or not by dereferencing task->cgroups
> > ->subsys[]. So, it would be desirable to change task->cgroups
> > (rcu_assign_pointer() in cgroup_attach_task() does it) with memcg_tasklist held.
> > 
> > Now that we can define cancel_attach(), we can safely release memcg_tasklist
> > on fail path even if we hold memcg_tasklist in can_attach(). So let's move
> > mutex_lock/unlock() of memcg_tasklist.
> > 
> > Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > ---
> >  mm/memcontrol.c |   22 ++++++++++++++++++++--
> >  1 files changed, 20 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 4bd3451..d3b2ac0 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -3395,18 +3395,34 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
> >  	return ret;
> >  }
> >  
> > +static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
> > +				struct cgroup *cgroup,
> > +				struct task_struct *p,
> > +				bool threadgroup)
> > +{
> > +	mutex_lock(&memcg_tasklist);
> > +	return 0;
> > +}
> 
> Hmm...Is this lock really necessary ?
> IOW, can't we just remove memcg_tasklist mutex ?
> What kind of bad race happens when we remove this ?
> 
It was introduced at commit 7f4d454d, in which I introduced the mutex instead of
using cgroup_mutex to fix a deadlock problem.

commit 7f4d454dee2e0bdd21bafd413d1c53e443a26540
Author: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Date:   Wed Jan 7 18:08:29 2009 -0800

    memcg: avoid deadlock caused by race between oom and cpuset_attach

    mpol_rebind_mm(), which can be called from cpuset_attach(), does
    down_write(mm->mmap_sem).  This means down_write(mm->mmap_sem) can be
    called under cgroup_mutex.

    OTOH, page fault path does down_read(mm->mmap_sem) and calls
    mem_cgroup_try_charge_xxx(), which may eventually calls
    mem_cgroup_out_of_memory().  And mem_cgroup_out_of_memory() calls
    cgroup_lock().  This means cgroup_lock() can be called under
    down_read(mm->mmap_sem).

    If those two paths race, deadlock can happen.

    This patch avoid this deadlock by:
      - remove cgroup_lock() from mem_cgroup_out_of_memory().
      - define new mutex (memcg_tasklist) and serialize mem_cgroup_move_task()
        (->attach handler of memory cgroup) and mem_cgroup_out_of_memory.

    Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
    Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Before that commit, mem_cgroup_out_of_memory() held (and released afterward) cgroup_mutex.
That code was introduced by commit c7ba5c9e.

commit c7ba5c9e8176704bfac0729875fa62798037584d
Author: Pavel Emelianov <xemul@openvz.org>
Date:   Thu Feb 7 00:13:58 2008 -0800

    Memory controller: OOM handling

    Out of memory handling for cgroups over their limit. A task from the
    cgroup over limit is chosen using the existing OOM logic and killed.

    TODO:
    1. As discussed in the OLS BOF session, consider implementing a user
    space policy for OOM handling.

    [akpm@linux-foundation.org: fix build due to oom-killer changes]
    Signed-off-by: Pavel Emelianov <xemul@openvz.org>
    Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
    Cc: Paul Menage <menage@google.com>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Nick Piggin <nickpiggin@yahoo.com.au>
    Cc: Kirill Korotaev <dev@sw.ru>
    Cc: Herbert Poetzl <herbert@13thfloor.at>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

I'm not sure about the intention of the original cgroup_lock() here, but I imagine that
it was for preventing task move during select_bad_process().

If there were no such lock:

  Assume cgroup foo has exceeded its limit and is about to trigger oom.
  1. Process A, which has been in cgroup baa and uses a large amount of memory,
     is just moved to cgroup foo. Process A can become a candidate for being killed.
  2. Process B, which has been in cgroup foo and uses a large amount of memory,
     is just moved from cgroup foo. Process B can be excluded from the candidates for
     being killed.

Hmm, but considering it more, those race windows exist anyway even if we hold the lock,
because try_charge decides whether it should trigger oom or not outside of the lock.

If this recharge feature is enabled, I think those problems might be avoided by doing something like:

__mem_cgroup_try_charge()
{
	...
	if (oom) {
		mutex_lock(&memcg_tasklist);
		if (unlikely(mem_cgroup_check_under_limit(mem_over_limit))) {
			mutex_unlock(&memcg_tasklist);
			continue;
		}
		mem_cgroup_out_of_memory();
		mutex_unlock(&memcg_tasklist);
		record_last_oom();
	}
	...
}

but it makes the code more complex, and the recharge feature isn't necessarily enabled.

Well, I personally think we can remove these locks completely and make the code simpler.
What do you think ?


Thanks,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 2/8] memcg: move memcg_tasklist mutex
  2009-11-06  7:49     ` Daisuke Nishimura
@ 2009-11-06  8:02       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 39+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-06  8:02 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: linux-mm, Andrew Morton, Balbir Singh, Li Zefan, Paul Menage

On Fri, 6 Nov 2009 16:49:34 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> If there is no such a lock:
> 
>   Assume cgroup foo has exceeded its limit and is about to triggering oom.
>   1. Process A, which has been in cgroup baa and uses large memory,
>      is just moved to cgroup foo. Process A can be the candidates for being killed.
>   2. Process B, which has been in cgroup foo and uses large memory,
>      is just moved from cgroup foo. Process B can be excluded from the candidates for
>      being killed. 
> 
> Hmm, but considering more, those race window exist anyway even if we holds a lock,
> because try_charge decides wether it should trigger oom or not outside of the lock.
> 
yes, that's the point.


> If this recharge feature is enabled, I think those problems might be avoided by doing like:
> 
> __mem_cgroup_try_charge()
> {
> 	...
> 	if (oom) {
> 		mutex_lock(&memcg_tasklist);
> 		if (unlikely(mem_cgroup_check_under_limit)) {
> 			mutex_unlock(&memcg_tasklist);
> 			continue
> 		}
> 		mem_cgroup_out_of_memory();
> 		mutex_unlock(&memcg_tasklist);
> 		record_last_oom();
> 	}
> 	...
> }
> 
> but it makes codes more complex and the recharge feature isn't necessarily enabled.
> 
> Well, I personally think we can remove these locks completely and make codes simpler.
> What do you think ?

I myself vote for removing this lock ;)

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 6/8] memcg: recharge charges of anonymous page
  2009-11-06  6:35   ` KAMEZAWA Hiroyuki
@ 2009-11-09  0:31     ` Daisuke Nishimura
  0 siblings, 0 replies; 39+ messages in thread
From: Daisuke Nishimura @ 2009-11-09  0:31 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Andrew Morton, Balbir Singh, Li Zefan, Paul Menage,
	Daisuke Nishimura

On Fri, 6 Nov 2009 15:35:26 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 6 Nov 2009 14:14:48 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > This patch is the core part of this recharge-at-task-move feature.
> > It implements functions to recharge charges of anonymous pages mapped only by
> > the target task.
> > 
> > Implementation:
> > - define struct recharge_struct and a variable of it (recharge) to remember
> >   the count of pre-charges and other information.
> > - At can_attach(), parse the page table of the task and count the number of
> >   mapped pages which are charged to the source mem_cgroup, and call
> >   __mem_cgroup_try_charge() repeatedly and count up recharge.precharge.
> > - At attach(), parse the page table again, find a target page as we did in
> >   can_attach(), and call mem_cgroup_move_account() about the page.
> > - Cancel all charges if recharge.precharge > 0 on failure or at the end of
> >   task move.
> > 
> 
> Changelog ?
> 
> 
will add.

> > Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > ---
> >  Documentation/cgroups/memory.txt |   36 +++++-
> >  mm/memcontrol.c                  |  275 +++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 306 insertions(+), 5 deletions(-)
> > 
> > diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> > index b871f25..54281ff 100644
> > --- a/Documentation/cgroups/memory.txt
> > +++ b/Documentation/cgroups/memory.txt
> > @@ -262,10 +262,12 @@ some of the pages cached in the cgroup (page cache pages).
> >  4.2 Task migration
> >  
> >  When a task migrates from one cgroup to another, it's charge is not
> > -carried forward. The pages allocated from the original cgroup still
> > +carried forward by default. The pages allocated from the original cgroup still
> >  remain charged to it, the charge is dropped when the page is freed or
> >  reclaimed.
> >  
> > +Note: You can move charges of a task along with task migration. See 8.
> > +
> >  4.3 Removing a cgroup
> >  
> >  A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
> > @@ -414,7 +416,37 @@ NOTE1: Soft limits take effect over a long period of time, since they involve
> >  NOTE2: It is recommended to set the soft limit always below the hard limit,
> >         otherwise the hard limit will take precedence.
> >  
> > -8. TODO
> > +8. Recharge at task move
> > +
> > +Users can move charges associated with a task along with task move, that is,
> > +uncharge from the old cgroup and charge to the new cgroup.
> > +
> > +8.1 Interface
> > +
> > +This feature is disabled by default. It can be enabled(and disabled again) by
> > +writing to memory.recharge_at_immigrate of the destination cgroup.
> > +
> > +If you want to enable it
> > +
> > +# echo 1 > memory.recharge_at_immigrate
> > +
> > +Note: Values greater than 1 will be supported in the future. See 8.2.
> > +
> > +And if you want to disable it again
> > +
> > +# echo 0 > memory.recharge_at_immigrate
> > +
> > +8.2 Type of charges which can be recharged
> > +
> > +We recharge a charge which meets the following conditions.
> > +
> > +a. It must be charged to the old cgroup.
> > +b. A charge of an anonymous page used by the target task. The page must be used
> > +   only by the target task.
> > +
> > +Note: More types of pages (e.g. file cache, shmem) will be supported in the future.
> > +
> > +9. TODO
> >  
> >  1. Add support for accounting huge pages (as a separate controller)
> >  2. Make per-cgroup scanner reclaim not-shared pages first
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index afa1179..f4b7116 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -21,6 +21,8 @@
> >  #include <linux/memcontrol.h>
> >  #include <linux/cgroup.h>
> >  #include <linux/mm.h>
> > +#include <linux/migrate.h>
> > +#include <linux/hugetlb.h>
> >  #include <linux/pagemap.h>
> >  #include <linux/smp.h>
> >  #include <linux/page-flags.h>
> > @@ -239,6 +241,18 @@ struct mem_cgroup {
> >  };
> >  
> >  /*
> > + * A data structure and a variable for recharging charges at task move.
> > + * "recharge" and its members are protected by cgroup_lock
> > + */
> > +struct recharge_struct {
> > +	struct mem_cgroup *from;
> > +	struct mem_cgroup *to;
> > +	struct task_struct *target;	/* the target task being moved */
> > +	unsigned long precharge;
> > +};
> > +static struct recharge_struct recharge;
> > +
> > +/*
> >   * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
> >   * limit reclaim to prevent infinite loops, if they ever occur.
> >   */
> > @@ -1496,7 +1510,7 @@ charged:
> >  	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
> >  	 * if they exceeds softlimit.
> >  	 */
> > -	if (mem_cgroup_soft_limit_check(mem))
> > +	if (page && mem_cgroup_soft_limit_check(mem))
> >  		mem_cgroup_update_tree(mem, page);
> >  done:
> >  	return 0;
> > @@ -3416,10 +3430,170 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
> >  }
> >  
> >  /* Handlers for recharge at task move. */
> > +/**
> > + * is_target_pte_for_recharge - check a pte whether it is valid for recharge
> > + * @vma: the vma the pte to be checked belongs
> > + * @addr: the address corresponding to the pte to be checked
> > + * @ptent: the pte to be checked
> > + * @target: the pointer the target page will be stored(can be NULL)
> > + *
> > + * Returns
> > + *   0(RECHARGE_TARGET_NONE): if the pte is not a target for recharge.
> > + *   1(RECHARGE_TARGET_PAGE): if the page corresponding to this pte is a target
> > + *     for recharge. if @target is not NULL, the page is stored in target->page
> > + *     with extra refcnt got(Callers should handle it).
> > + *
> > + * Called with pte lock held.
> > + */
> > +/* We add a new member later. */
> > +union recharge_target {
> > +	struct page	*page;
> > +};
> > +
> > +/* We add a new type later. */
> > +enum recharge_target_type {
> > +	RECHARGE_TARGET_NONE,	/* not used */
> > +	RECHARGE_TARGET_PAGE,
> > +};
> > +
> > +static int is_target_pte_for_recharge(struct vm_area_struct *vma,
> > +		unsigned long addr, pte_t ptent, union recharge_target *target)
> > +{
> > +	struct page *page;
> > +	struct page_cgroup *pc;
> > +	int ret = 0;
> > +
> > +	if (!pte_present(ptent))
> > +		return 0;
> > +
> > +	page = vm_normal_page(vma, addr, ptent);
> > +	if (!page || !page_mapped(page))
> > +		return 0;
> > +	/* TODO: We don't recharge file(including shmem/tmpfs) pages for now. */
> > +	if (!PageAnon(page))
> > +		return 0;
> > +	/*
> > +	 * TODO: We don't recharge shared(used by multiple processes) pages
> > +	 * for now.
> > +	 */
> > +	if (page_mapcount(page) > 1)
> > +		return 0;
> > +	if (!get_page_unless_zero(page))
> > +		return 0;
> > +
> > +	pc = lookup_page_cgroup(page);
> > +	lock_page_cgroup(pc);
> > +	if (PageCgroupUsed(pc) && pc->mem_cgroup == recharge.from) {
> > +		ret = RECHARGE_TARGET_PAGE;
> > +		if (target)
> > +			target->page = page;
> > +	}
> > +	unlock_page_cgroup(pc);
> > +
> > +	if (!ret || !target)
> > +		put_page(page);
> > +
> > +	return ret;
> > +}
> > +
> > +static int mem_cgroup_recharge_do_precharge(void)
> > +{
> > +	int ret = -ENOMEM;
> > +	struct mem_cgroup *mem = recharge.to;
> > +
> > +	ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false, NULL);
> > +	if (ret || !mem)
> > +		return -ENOMEM;
> > +
> > +	recharge.precharge++;
> > +	return ret;
> > +}
> > +
> > +static int mem_cgroup_recharge_prepare_pte_range(pmd_t *pmd,
> > +					unsigned long addr, unsigned long end,
> > +					struct mm_walk *walk)
> > +{
> > +	int ret = 0;
> > +	unsigned long count = 0;
> > +	struct vm_area_struct *vma = walk->private;
> > +	pte_t *pte;
> > +	spinlock_t *ptl;
> > +
> > +	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +	for (; addr != end; pte++, addr += PAGE_SIZE)
> > +		if (is_target_pte_for_recharge(vma, addr, *pte, NULL))
> > +			count++;
> > +	pte_unmap_unlock(pte - 1, ptl);
> > +
> > +	while (count-- && !ret)
> > +		ret = mem_cgroup_recharge_do_precharge();
> > +
> > +	return ret;
> > +}
> > +
> > +static int mem_cgroup_recharge_prepare(void)
> > +{
> > +	int ret = 0;
> > +	struct mm_struct *mm;
> > +	struct vm_area_struct *vma;
> > +
> > +	mm = get_task_mm(recharge.target);
> > +	if (!mm)
> > +		return 0;
> > +
> > +	down_read(&mm->mmap_sem);
> > +	for (vma = mm->mmap; vma; vma = vma->vm_next) {
> > +		struct mm_walk mem_cgroup_recharge_prepare_walk = {
> > +			.pmd_entry = mem_cgroup_recharge_prepare_pte_range,
> > +			.mm = mm,
> > +			.private = vma,
> > +		};
> > +		if (signal_pending(current)) {
> > +			ret = -EINTR;
> > +			break;
> > +		}
> > +		if (is_vm_hugetlb_page(vma))
> > +			continue;
> > +		ret = walk_page_range(vma->vm_start, vma->vm_end,
> > +					&mem_cgroup_recharge_prepare_walk);
> > +		if (ret)
> > +			break;
> > +		cond_resched();
> > +	}
> > +	up_read(&mm->mmap_sem);
> > +
> > +	mmput(mm);
> > +	return ret;
> > +}
> > +
> > +static void mem_cgroup_clear_recharge(void)
> > +{
> > +	while (recharge.precharge--)
> > +		mem_cgroup_cancel_charge(recharge.to);
> > +	recharge.from = NULL;
> > +	recharge.to = NULL;
> > +	recharge.target = NULL;
> > +}
> > +
> >  static int mem_cgroup_can_recharge(struct mem_cgroup *mem,
> >  					struct task_struct *p)
> >  {
> > -	return 0;
> > +	int ret;
> > +	struct mem_cgroup *from = mem_cgroup_from_task(p);
> > +
> > +	if (from == mem)
> > +		return 0;
> > +
> > +	recharge.from = from;
> > +	recharge.to = mem;
> > +	recharge.target = p;
> > +	recharge.precharge = 0;
> > +
> > +	ret = mem_cgroup_recharge_prepare();
> > +
> > +	if (ret)
> > +		mem_cgroup_clear_recharge();
> > +	return ret;
> >  }
> >  
> 
> Hmm...Hmm...looks nicer. But can I have another suggestion ?
> 
> ==
> static int mem_cgroup_can_recharge(struct mem_cgroup *mem,
> 				struct task_struct *p)
> {
> 	unsigned long rss;
> 	struct mm_struct *mm;
> 
> 	mm = get_task_mm(p);
> 	if (!mm)
> 		return;
> 
> 	rss = get_mm_counter(mm, anon_rss);
> 	precharge(rss);
> 	mmput(mm);
> }
> ==
> Do you think anonymous memory are so shared at "move" as that
> we need page table scan ?
> 
The reason why I scanned the page table twice was to move "swap" charges.
There was no per-process counter for swap.
Yes, such a counter is just being added (by you :)).
It cannot be used for shmem's swaps, but I won't handle them for now and
we can recharge them in the attach() phase (best-effort) anyway.
I'll just count the mm_counter in the can_attach() phase.
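
So the precharge side of can_attach() could become something like this (a rough,
untested sketch; the swap part would additionally need the per-process swap
counter once it is available):
==
	int ret = 0;
	unsigned long count;
	struct mm_struct *mm;

	mm = get_task_mm(recharge.target);
	if (!mm)
		return 0;
	/* use the anon rss count instead of a first page table scan */
	count = get_mm_counter(mm, anon_rss);
	mmput(mm);

	while (count-- && !ret)
		ret = mem_cgroup_recharge_do_precharge();
	return ret;
==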

> If typical sequence is
> ==
> 	fork()
> 		-> exec()
> 
> 	move child
> ==
> No problem will happen.
> 
> >  static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
> > @@ -3442,11 +3616,104 @@ static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
> >  				struct task_struct *p,
> >  				bool threadgroup)
> >  {
> > +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
> > +
> >  	mutex_unlock(&memcg_tasklist);
> > +	if (mem->recharge_at_immigrate && thread_group_leader(p))
> > +		mem_cgroup_clear_recharge();
> > +}
> > +
> > +static int mem_cgroup_recharge_pte_range(pmd_t *pmd,
> > +				unsigned long addr, unsigned long end,
> > +				struct mm_walk *walk)
> > +{
> > +	int ret = 0;
> > +	struct vm_area_struct *vma = walk->private;
> > +	pte_t *pte;
> > +	spinlock_t *ptl;
> > +
> > +retry:
> > +	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +	for (; addr != end; addr += PAGE_SIZE) {
> > +		pte_t ptent = *(pte++);
> > +		union recharge_target target;
> > +		int type;
> > +		struct page *page;
> > +		struct page_cgroup *pc;
> > +
> > +		if (!recharge.precharge)
> > +			break;
> > +
> > +		type = is_target_pte_for_recharge(vma, addr, ptent, &target);
> > +		switch (type) {
> > +		case RECHARGE_TARGET_PAGE:
> > +			page = target.page;
> > +			if (isolate_lru_page(page))
> > +				goto put;
> > +			pc = lookup_page_cgroup(page);
> > +			if (!mem_cgroup_move_account(pc,
> > +						recharge.from, recharge.to)) {
> > +				css_put(&recharge.to->css);
> > +				recharge.precharge--;
> > +			}
> > +			putback_lru_page(page);
> > +put:			/* is_target_pte_for_recharge() gets the page */
> > +			put_page(page);
> > +			break;
> > +		default:
> > +			continue;
> 
> continue for what ?
> 
Nothing :)
Just move to the next step in this for-loop. I think it's a leftover of my
old code...

> And we forget "failed to move" pte. This "move" is best-effort service.
> Right ?
> 
"if (!recharge.precharge)" above and "if (addr != end)" bellow will do that.

> > +		}
> > +	}
> > +	pte_unmap_unlock(pte - 1, ptl);
> > +
> > +	if (addr != end) {
> > +		/*
> > +		 * We have consumed all precharges we got in can_attach().
> > +		 * We try precharge one by one, but don't do any additional
> > +		 * precharges nor recharges to recharge.to if we have failed in
> > +		 * precharge once in attach() phase.
> > +		 */
> > +		ret = mem_cgroup_recharge_do_precharge();
> > +		if (!ret)
> > +			goto retry;
> > +	}
> > +
> > +	return ret;
> >  }
> >  
> 
> 
> 
> >  static void mem_cgroup_recharge(void)
> >  {
> > +	struct mm_struct *mm;
> > +	struct vm_area_struct *vma;
> > +
> > +	mm = get_task_mm(recharge.target);
> > +	if (!mm)
> > +		return;
> > +
> > +	lru_add_drain_all();
> > +	down_read(&mm->mmap_sem);
> > +	for (vma = mm->mmap; vma; vma = vma->vm_next) {
> > +		int ret;
> > +		struct mm_walk mem_cgroup_recharge_walk = {
> > +			.pmd_entry = mem_cgroup_recharge_pte_range,
> > +			.mm = mm,
> > +			.private = vma,
> > +		};
> > +		if (is_vm_hugetlb_page(vma))
> > +			continue;
> > +		ret = walk_page_range(vma->vm_start, vma->vm_end,
> > +						&mem_cgroup_recharge_walk);
> 
> At _this_ point, check VM_SHARED and skip scan is a sane operation.
> Could you add checks ?
> 
will do.

> 
> > +		if (ret)
> > +			/*
> > +			 * means we have consumed all precharges and failed in
> > +			 * doing additional precharge. Just abandon here.
> > +			 */
> > +			break;
> > +		cond_resched();
> > +	}
> > +	up_read(&mm->mmap_sem);
> > +
> > +	mmput(mm);
> >  }
> >  
> >  static void mem_cgroup_move_task(struct cgroup_subsys *ss,
> > @@ -3458,8 +3725,10 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
> >  	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> >  
> >  	mutex_unlock(&memcg_tasklist);
> > -	if (mem->recharge_at_immigrate && thread_group_leader(p))
> > +	if (mem->recharge_at_immigrate && thread_group_leader(p)) {
> >  		mem_cgroup_recharge();
> > +		mem_cgroup_clear_recharge();
> > +	}
> 
> Is it guranteed that thread_group_leader(p) is true if this is true at
> can_attach() ?
> If no,
> 	if (.....) {
> 		mem_cgroup_recharge()
> 	}
> 	mem_cgroup_clear_recharge()
> 
> is better.
> 
will change.

Thank you for your review.


Thanks,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 0/8] memcg: recharge at task move
  2009-11-06  6:45 ` [PATCH -mmotm 0/8] memcg: recharge at task move KAMEZAWA Hiroyuki
@ 2009-11-09  1:44   ` Daisuke Nishimura
  2009-11-09  5:16     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 39+ messages in thread
From: Daisuke Nishimura @ 2009-11-09  1:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Andrew Morton, Balbir Singh, Li Zefan, Paul Menage,
	Daisuke Nishimura

On Fri, 6 Nov 2009 15:45:42 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 6 Nov 2009 14:10:11 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > Hi.
> > 
> > In current memcg, charges associated with a task aren't moved to the new cgroup
> > at task move. These patches are for this feature, that is, for recharging to
> > the new cgroup and, of course, uncharging from old cgroup at task move.
> > 
> > Current virsion supports only recharge of non-shared(mapcount == 1) anonymous pages
> > and swaps of those pages. I think it's enough as a first step.
> > 
> > [1/8] cgroup: introduce cancel_attach()
> > [2/8] memcg: move memcg_tasklist mutex
> > [3/8] memcg: add mem_cgroup_cancel_charge()
> > [4/8] memcg: cleanup mem_cgroup_move_parent()
> > [5/8] memcg: add interface to recharge at task move
> > [6/8] memcg: recharge charges of anonymous page
> > [7/8] memcg: avoid oom during recharge at task move
> > [8/8] memcg: recharge charges of anonymous swap
> > 
> > 2 is dependent on 1 and 4 is dependent on 3.
> > 3 and 4 are just for cleanups.
> > 5-8 are the body of this feature.
> > 
> > Major Changes from Oct13:
> > - removed "[RFC]".
> > - rebased on mmotm-2009-11-01-10-01.
> > - dropped support for file cache and shmem/tmpfs(revisit in future).
> > - Updated Documentation/cgroup/memory.txt.
> > 
> 
> Seems much nicer but I have some nitpicks as already commented.
> 
> For [8/8], mm->swap_usage counter may be a help for making it faster.
> Concern is how it's shared but will not be very big error.
> 
will change as I mentioned in another mail.

I'll repost 3 and 4 as cleanup(I think they are ready for inclusion),
and post removal-of-memcg_tasklist as a separate patch.

I'll postpone the body of this feature(waiting for your percpu change
and per-process swap counter at least).

> > TODO:
> > - add support for file cache, shmem/tmpfs, and shared(mapcount > 1) pages.
> > - implement madvise(2) to let users decide the target vma for recharge.
> > 
> 
> About this, I think "force_move_shared_account" flag is enough, I think.
> But we have to clarify "mmap()ed but not on page table" entries are not
> moved....
> 
You mean swap entries of shmem/tmpfs, don't you ? I agree they are hard to handle..


My concerns are:

- I want to add support for private file caches, shmem/tmpfs pages (including their swaps),
  and "shared" pages by some means in the future, and let an admin or a middleware
  decide how to handle them.
- Once this feature has been merged (at .33?), I don't want to change the behavior
  when a user sets "recharge_at_immigrate=1".
  So, I'll "extend" the meaning of "recharge_at_immigrate" or add a new flag file
  to support other types of charges.


Thanks,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 0/8] memcg: recharge at task move
  2009-11-06  5:10 [PATCH -mmotm 0/8] memcg: recharge at task move Daisuke Nishimura
                   ` (8 preceding siblings ...)
  2009-11-06  6:45 ` [PATCH -mmotm 0/8] memcg: recharge at task move KAMEZAWA Hiroyuki
@ 2009-11-09  5:08 ` Balbir Singh
  2009-11-09  8:24   ` Daisuke Nishimura
  2009-11-11  1:35 ` [PATCH -mmotm 0/3] some cleanups for memcg Daisuke Nishimura
  10 siblings, 1 reply; 39+ messages in thread
From: Balbir Singh @ 2009-11-09  5:08 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: linux-mm, Andrew Morton, KAMEZAWA Hiroyuki, Li Zefan, Paul Menage

* nishimura@mxp.nes.nec.co.jp <nishimura@mxp.nes.nec.co.jp> [2009-11-06 14:10:11]:

> Hi.
> 
> In current memcg, charges associated with a task aren't moved to the new cgroup
> at task move. These patches are for this feature, that is, for recharging to
> the new cgroup and, of course, uncharging from old cgroup at task move.
> 
> Current virsion supports only recharge of non-shared(mapcount == 1) anonymous pages
> and swaps of those pages. I think it's enough as a first step.
>

What is the motivation? Is it to provide better accountability? I think
it is worthwhile to look into it, provided the cost of migration is
not too high. It was on my list of things to look at, but I found that
if cpu/memory are mounted together and the cost of migration is high,
it can be a bottleneck in some cases. I'll review the patches a bit
more.
 

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 0/8] memcg: recharge at task move
  2009-11-09  1:44   ` Daisuke Nishimura
@ 2009-11-09  5:16     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 39+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-09  5:16 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: linux-mm, Andrew Morton, Balbir Singh, Li Zefan, Paul Menage

On Mon, 9 Nov 2009 10:44:46 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote: 
> > Seems much nicer but I have some nitpicks as already commented.
> > 
> > For [8/8], mm->swap_usage counter may be a help for making it faster.
> > Concern is how it's shared but will not be very big error.
> > 
> will change as I mentioned in another mail.
> 
> I'll repost 3 and 4 as cleanup(I think they are ready for inclusion),
> and post removal-of-memcg_tasklist as a separate patch.
> 
> I'll postpone the body of this feature(waiting for your percpu change
> and per-process swap counter at least).
> 
Thanks. I agree 1-4 are ready for merge. But please get an ack from Paul
or Li Zefan. Paul, Li, what do you think about [1/8]?


> > > TODO:
> > > - add support for file cache, shmem/tmpfs, and shared(mapcount > 1) pages.
> > > - implement madvise(2) to let users decide the target vma for recharge.
> > > 
> > 
> > About this, I think "force_move_shared_account" flag is enough, I think.
> > But we have to clarify "mmap()ed but not on page table" entries are not
> > moved....
> > 
> You mean swap entries of shmem/tmpfs, don't you? I agree they are hard to handle.
> 
Yes and no. Including mmap()ed files.

> 
> My concern is:
> 
> - I want to add support for private file caches, shmem/tmpfs pages (including swaps of them),
>   and "shared" pages by some means in the future, and let an admin or a middleware
>   decide how to handle them.
Yes, I agree that middleware may want that.

> - Once this feature has been merged (at .33?), I don't want to change the behavior
>   when a user sets "recharge_at_immigrate=1".
>   So, I'll "extend" the meaning of "recharge_at_immigrate" or add a new flag file
>   to support other types of charges.

Yes, I think making recharge_at_immigrate a "bitmask" makes sense.
So, showing clearly that "recharge_at_immigrate" is a bitmask _now_, both in code
and documentation, will be a help for the future.
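
For illustration only, a rough sketch of how a bitmask-style interface could be
used (all names and the struct below are hypothetical, nothing here is a final ABI):

	/* hypothetical bit assignments for the recharge_at_immigrate file */
	#define RECHARGE_TYPE_ANON	(1UL << 0)	/* non-shared anon pages + their swap */
	#define RECHARGE_TYPE_FILE	(1UL << 1)	/* reserved: file cache, shmem/tmpfs, ... */

	struct memcg_sketch {
		unsigned long recharge_at_immigrate;	/* value written to the cgroup file */
	};

	/* only move charges of a given type if the destination asked for that bit */
	static int should_recharge(const struct memcg_sketch *to, unsigned long type)
	{
		return (to->recharge_at_immigrate & type) != 0;
	}

Treating the value as a bitmask from day one means new bits (file cache, shared
pages, ...) can be added later without changing what bit 0 means.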

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 1/8] cgroup: introduce cancel_attach()
  2009-11-06  5:11 ` [PATCH -mmotm 1/8] cgroup: introduce cancel_attach() Daisuke Nishimura
@ 2009-11-09  6:57   ` Balbir Singh
  2009-11-09  7:18     ` Li Zefan
  2009-11-09  7:23     ` Daisuke Nishimura
  2009-11-09  7:23   ` Li Zefan
  1 sibling, 2 replies; 39+ messages in thread
From: Balbir Singh @ 2009-11-09  6:57 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: linux-mm, Andrew Morton, KAMEZAWA Hiroyuki, Li Zefan, Paul Menage

* nishimura@mxp.nes.nec.co.jp <nishimura@mxp.nes.nec.co.jp> [2009-11-06 14:11:06]:

> This patch adds cancel_attach() operation to struct cgroup_subsys.
> cancel_attach() can be used when can_attach() operation prepares something
> for the subsys, but we should rollback what can_attach() operation has prepared
> if attach task fails after we've succeeded in can_attach().
> 
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
>  Documentation/cgroups/cgroups.txt |   13 +++++++++++-
>  include/linux/cgroup.h            |    2 +
>  kernel/cgroup.c                   |   38 ++++++++++++++++++++++++++++++------
>  3 files changed, 45 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
> index 0b33bfe..c86947c 100644
> --- a/Documentation/cgroups/cgroups.txt
> +++ b/Documentation/cgroups/cgroups.txt
> @@ -536,10 +536,21 @@ returns an error, this will abort the attach operation.  If a NULL
>  task is passed, then a successful result indicates that *any*
>  unspecified task can be moved into the cgroup. Note that this isn't
>  called on a fork. If this method returns 0 (success) then this should
> -remain valid while the caller holds cgroup_mutex. If threadgroup is
> +remain valid while the caller holds cgroup_mutex and it is ensured that either
> +attach() or cancel_attach() will be called in futer. If threadgroup is
>  true, then a successful result indicates that all threads in the given
>  thread's threadgroup can be moved together.
> 
> +void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
> +	       struct task_struct *task, bool threadgroup)
> +(cgroup_mutex held by caller)
> +
> +Called when a task attach operation has failed after can_attach() has succeeded.
> +A subsystem whose can_attach() has some side-effects should provide this
> +function, so that the subsytem can implement a rollback. If not, not necessary.
> +This will be called only about subsystems whose can_attach() operation have
> +succeeded.
> +
>  void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
>  	    struct cgroup *old_cgrp, struct task_struct *task,
>  	    bool threadgroup)
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 0008dee..d4cc200 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -427,6 +427,8 @@ struct cgroup_subsys {
>  	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>  	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
>  			  struct task_struct *tsk, bool threadgroup);
> +	void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
> +			  struct task_struct *tsk, bool threadgroup);
>  	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
>  			struct cgroup *old_cgrp, struct task_struct *tsk,
>  			bool threadgroup);
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 0249f4b..e443742 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1539,7 +1539,7 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
>  int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  {
>  	int retval = 0;
> -	struct cgroup_subsys *ss;
> +	struct cgroup_subsys *ss, *failed_ss = NULL;
>  	struct cgroup *oldcgrp;
>  	struct css_set *cg;
>  	struct css_set *newcg;
> @@ -1553,8 +1553,16 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  	for_each_subsys(root, ss) {
>  		if (ss->can_attach) {
>  			retval = ss->can_attach(ss, cgrp, tsk, false);
> -			if (retval)
> -				return retval;
> +			if (retval) {
> +				/*
> +				 * Remember at which subsystem we've failed in
> +				 * can_attach() to call cancel_attach() only
> +				 * against subsystems whose attach() have
> +				 * succeeded(see below).
> +				 */
> +				failed_ss = ss;

failed_ss is global? Is it a marker into an array of subsystems? Don't
we need more than one failed_ss for each failed subsystem? Or do we
find the first failed subsystem, cancel_attach and fail all
migrations?

> +				goto out;
> +			}
>  		}
>  	}
> 
> @@ -1568,14 +1576,17 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  	 */
>  	newcg = find_css_set(cg, cgrp);
>  	put_css_set(cg);
> -	if (!newcg)
> -		return -ENOMEM;
> +	if (!newcg) {
> +		retval = -ENOMEM;
> +		goto out;
> +	}
> 
>  	task_lock(tsk);
>  	if (tsk->flags & PF_EXITING) {
>  		task_unlock(tsk);
>  		put_css_set(newcg);
> -		return -ESRCH;
> +		retval = -ESRCH;
> +		goto out;
>  	}
>  	rcu_assign_pointer(tsk->cgroups, newcg);
>  	task_unlock(tsk);
> @@ -1601,7 +1612,20 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  	 * is no longer empty.
>  	 */
>  	cgroup_wakeup_rmdir_waiter(cgrp);
> -	return 0;
> +out:
> +	if (retval)
> +		for_each_subsys(root, ss) {
> +			if (ss == failed_ss)
> +				/*
> +				 * This means can_attach() of this subsystem
> +				 * have failed, so we don't need to call
> +				 * cancel_attach() against rests of subsystems.
> +				 */
> +				break;
> +			if (ss->cancel_attach)
> +				ss->cancel_attach(ss, cgrp, tsk, false);
> +		}
> +	return retval;
>  }
> 

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 1/8] cgroup: introduce cancel_attach()
  2009-11-09  6:57   ` Balbir Singh
@ 2009-11-09  7:18     ` Li Zefan
  2009-11-09  7:23     ` Daisuke Nishimura
  1 sibling, 0 replies; 39+ messages in thread
From: Li Zefan @ 2009-11-09  7:18 UTC (permalink / raw)
  To: balbir
  Cc: Daisuke Nishimura, linux-mm, Andrew Morton, KAMEZAWA Hiroyuki,
	Paul Menage

>> @@ -1553,8 +1553,16 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>>  	for_each_subsys(root, ss) {
>>  		if (ss->can_attach) {
>>  			retval = ss->can_attach(ss, cgrp, tsk, false);
>> -			if (retval)
>> -				return retval;
>> +			if (retval) {
>> +				/*
>> +				 * Remember at which subsystem we've failed in
>> +				 * can_attach() to call cancel_attach() only
>> +				 * against subsystems whose attach() have
>> +				 * succeeded(see below).
>> +				 */
>> +				failed_ss = ss;
> 
> failed_ss is global? Is it a marker into an array of subsystems? Don't
> we need more than one failed_ss for each failed subsystem? Or do we
> find the first failed subsystem, cancel_attach and fail all
> migrations?
> 

The latter.

We record the first subsys that failed can_attach(), break out of
the for loop, and call cancel_attach() on those subsystems that
have succeeded in can_attach().
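
To make it concrete, suppose three hypothetical subsystems A, B and C are attached
to the hierarchy and B's can_attach() fails (names are just for illustration):

	can_attach(A)    -> 0         A prepared its side effects
	can_attach(B)    -> -EBUSY    failed_ss = B, break out of the loop
	can_attach(C)                 never called
	...
	cancel_attach(A)              A rolls back what its can_attach() prepared
	                              loop stops at failed_ss, so B and C are skipped

So only subsystems whose can_attach() returned success ever see cancel_attach().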

>> +				goto out;
>> +			}
>>  		}
>>  	}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 1/8] cgroup: introduce cancel_attach()
  2009-11-06  5:11 ` [PATCH -mmotm 1/8] cgroup: introduce cancel_attach() Daisuke Nishimura
  2009-11-09  6:57   ` Balbir Singh
@ 2009-11-09  7:23   ` Li Zefan
  2009-11-09  7:38     ` Daisuke Nishimura
  1 sibling, 1 reply; 39+ messages in thread
From: Li Zefan @ 2009-11-09  7:23 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: linux-mm, Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki, Paul Menage

Daisuke Nishimura wrote:
> This patch adds cancel_attach() operation to struct cgroup_subsys.
> cancel_attach() can be used when can_attach() operation prepares something
> for the subsys, but we should rollback what can_attach() operation has prepared
> if attach task fails after we've succeeded in can_attach().
> 
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

Acked-by: Li Zefan <lizf@cn.fujitsu.com>

> ---
>  Documentation/cgroups/cgroups.txt |   13 +++++++++++-
>  include/linux/cgroup.h            |    2 +
>  kernel/cgroup.c                   |   38 ++++++++++++++++++++++++++++++------
>  3 files changed, 45 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
> index 0b33bfe..c86947c 100644
> --- a/Documentation/cgroups/cgroups.txt
> +++ b/Documentation/cgroups/cgroups.txt
> @@ -536,10 +536,21 @@ returns an error, this will abort the attach operation.  If a NULL
>  task is passed, then a successful result indicates that *any*
>  unspecified task can be moved into the cgroup. Note that this isn't
>  called on a fork. If this method returns 0 (success) then this should
> -remain valid while the caller holds cgroup_mutex. If threadgroup is
> +remain valid while the caller holds cgroup_mutex and it is ensured that either
> +attach() or cancel_attach() will be called in futer. If threadgroup is

s/futer/future

>  true, then a successful result indicates that all threads in the given
>  thread's threadgroup can be moved together.
...
> +out:
> +	if (retval)

I prefer:

	if (retval) {
		...
	}

> +		for_each_subsys(root, ss) {
> +			if (ss == failed_ss)
> +				/*
> +				 * This means can_attach() of this subsystem
> +				 * have failed, so we don't need to call
> +				 * cancel_attach() against rests of subsystems.
> +				 */
> +				break;
> +			if (ss->cancel_attach)
> +				ss->cancel_attach(ss, cgrp, tsk, false);
> +		}
> +	return retval;
>  }
>  
>  /*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 1/8] cgroup: introduce cancel_attach()
  2009-11-09  6:57   ` Balbir Singh
  2009-11-09  7:18     ` Li Zefan
@ 2009-11-09  7:23     ` Daisuke Nishimura
  1 sibling, 0 replies; 39+ messages in thread
From: Daisuke Nishimura @ 2009-11-09  7:23 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, Andrew Morton, KAMEZAWA Hiroyuki, Li Zefan,
	Paul Menage, Daisuke Nishimura

On Mon, 9 Nov 2009 12:27:59 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * nishimura@mxp.nes.nec.co.jp <nishimura@mxp.nes.nec.co.jp> [2009-11-06 14:11:06]:
> 
> > This patch adds cancel_attach() operation to struct cgroup_subsys.
> > cancel_attach() can be used when can_attach() operation prepares something
> > for the subsys, but we should rollback what can_attach() operation has prepared
> > if attach task fails after we've succeeded in can_attach().
> > 
> > Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > ---
> >  Documentation/cgroups/cgroups.txt |   13 +++++++++++-
> >  include/linux/cgroup.h            |    2 +
> >  kernel/cgroup.c                   |   38 ++++++++++++++++++++++++++++++------
> >  3 files changed, 45 insertions(+), 8 deletions(-)
> > 
> > diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
> > index 0b33bfe..c86947c 100644
> > --- a/Documentation/cgroups/cgroups.txt
> > +++ b/Documentation/cgroups/cgroups.txt
> > @@ -536,10 +536,21 @@ returns an error, this will abort the attach operation.  If a NULL
> >  task is passed, then a successful result indicates that *any*
> >  unspecified task can be moved into the cgroup. Note that this isn't
> >  called on a fork. If this method returns 0 (success) then this should
> > -remain valid while the caller holds cgroup_mutex. If threadgroup is
> > +remain valid while the caller holds cgroup_mutex and it is ensured that either
> > +attach() or cancel_attach() will be called in futer. If threadgroup is
> >  true, then a successful result indicates that all threads in the given
> >  thread's threadgroup can be moved together.
> > 
> > +void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
> > +	       struct task_struct *task, bool threadgroup)
> > +(cgroup_mutex held by caller)
> > +
> > +Called when a task attach operation has failed after can_attach() has succeeded.
> > +A subsystem whose can_attach() has some side-effects should provide this
> > +function, so that the subsytem can implement a rollback. If not, not necessary.
> > +This will be called only about subsystems whose can_attach() operation have
> > +succeeded.
> > +
> >  void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
> >  	    struct cgroup *old_cgrp, struct task_struct *task,
> >  	    bool threadgroup)
> > diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> > index 0008dee..d4cc200 100644
> > --- a/include/linux/cgroup.h
> > +++ b/include/linux/cgroup.h
> > @@ -427,6 +427,8 @@ struct cgroup_subsys {
> >  	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> >  	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
> >  			  struct task_struct *tsk, bool threadgroup);
> > +	void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
> > +			  struct task_struct *tsk, bool threadgroup);
> >  	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
> >  			struct cgroup *old_cgrp, struct task_struct *tsk,
> >  			bool threadgroup);
> > diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> > index 0249f4b..e443742 100644
> > --- a/kernel/cgroup.c
> > +++ b/kernel/cgroup.c
> > @@ -1539,7 +1539,7 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
> >  int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
> >  {
> >  	int retval = 0;
> > -	struct cgroup_subsys *ss;
> > +	struct cgroup_subsys *ss, *failed_ss = NULL;
> >  	struct cgroup *oldcgrp;
> >  	struct css_set *cg;
> >  	struct css_set *newcg;
> > @@ -1553,8 +1553,16 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
> >  	for_each_subsys(root, ss) {
> >  		if (ss->can_attach) {
> >  			retval = ss->can_attach(ss, cgrp, tsk, false);
> > -			if (retval)
> > -				return retval;
> > +			if (retval) {
> > +				/*
> > +				 * Remember at which subsystem we've failed in
> > +				 * can_attach() to call cancel_attach() only
> > +				 * against subsystems whose attach() have
> > +				 * succeeded(see below).
> > +				 */
> > +				failed_ss = ss;
> 
> failed_ss is global? Is it a marker into an array of subsystems? Don't
> we need more than one failed_ss for each failed subsystem? Or do we
> find the first failed subsystem, cancel_attach and fail all
> migrations?
> 
failed_ss is just a local variable (see the definition above) :)
It is used to remember at which subsystem of for_each_subsys() we failed in can_attach().
By remembering it, we can avoid calling cancel_attach() against subsystems whose
can_attach() has failed or has not even been called.


Thanks,
Daisuke Nishimura.

> > +				goto out;
> > +			}
> >  		}
> >  	}
> > 
> > @@ -1568,14 +1576,17 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
> >  	 */
> >  	newcg = find_css_set(cg, cgrp);
> >  	put_css_set(cg);
> > -	if (!newcg)
> > -		return -ENOMEM;
> > +	if (!newcg) {
> > +		retval = -ENOMEM;
> > +		goto out;
> > +	}
> > 
> >  	task_lock(tsk);
> >  	if (tsk->flags & PF_EXITING) {
> >  		task_unlock(tsk);
> >  		put_css_set(newcg);
> > -		return -ESRCH;
> > +		retval = -ESRCH;
> > +		goto out;
> >  	}
> >  	rcu_assign_pointer(tsk->cgroups, newcg);
> >  	task_unlock(tsk);
> > @@ -1601,7 +1612,20 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
> >  	 * is no longer empty.
> >  	 */
> >  	cgroup_wakeup_rmdir_waiter(cgrp);
> > -	return 0;
> > +out:
> > +	if (retval)
> > +		for_each_subsys(root, ss) {
> > +			if (ss == failed_ss)
> > +				/*
> > +				 * This means can_attach() of this subsystem
> > +				 * have failed, so we don't need to call
> > +				 * cancel_attach() against rests of subsystems.
> > +				 */
> > +				break;
> > +			if (ss->cancel_attach)
> > +				ss->cancel_attach(ss, cgrp, tsk, false);
> > +		}
> > +	return retval;
> >  }
> > 
> 
> -- 
> 	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 1/8] cgroup: introduce cancel_attach()
  2009-11-09  7:23   ` Li Zefan
@ 2009-11-09  7:38     ` Daisuke Nishimura
  0 siblings, 0 replies; 39+ messages in thread
From: Daisuke Nishimura @ 2009-11-09  7:38 UTC (permalink / raw)
  To: Li Zefan
  Cc: linux-mm, Andrew Morton, Balbir Singh, KAMEZAWA Hiroyuki,
	Paul Menage, Daisuke Nishimura

On Mon, 09 Nov 2009 15:23:02 +0800, Li Zefan <lizf@cn.fujitsu.com> wrote:
> Daisuke Nishimura wrote:
> > This patch adds cancel_attach() operation to struct cgroup_subsys.
> > cancel_attach() can be used when can_attach() operation prepares something
> > for the subsys, but we should rollback what can_attach() operation has prepared
> > if attach task fails after we've succeeded in can_attach().
> > 
> > Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> 
> Acked-by: Li Zefan <lizf@cn.fujitsu.com>
> 
Thank you for your ack.

> > ---
> >  Documentation/cgroups/cgroups.txt |   13 +++++++++++-
> >  include/linux/cgroup.h            |    2 +
> >  kernel/cgroup.c                   |   38 ++++++++++++++++++++++++++++++------
> >  3 files changed, 45 insertions(+), 8 deletions(-)
> > 
> > diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
> > index 0b33bfe..c86947c 100644
> > --- a/Documentation/cgroups/cgroups.txt
> > +++ b/Documentation/cgroups/cgroups.txt
> > @@ -536,10 +536,21 @@ returns an error, this will abort the attach operation.  If a NULL
> >  task is passed, then a successful result indicates that *any*
> >  unspecified task can be moved into the cgroup. Note that this isn't
> >  called on a fork. If this method returns 0 (success) then this should
> > -remain valid while the caller holds cgroup_mutex. If threadgroup is
> > +remain valid while the caller holds cgroup_mutex and it is ensured that either
> > +attach() or cancel_attach() will be called in futer. If threadgroup is
> 
> s/futer/future
> 
Ah, I'm sorry.
will fix.

> >  true, then a successful result indicates that all threads in the given
> >  thread's threadgroup can be moved together.
> ...
> > +out:
> > +	if (retval)
> 
> I prefer:
> 
> 	if (retval) {
> 		...
> 	}
> 
Okay.
Will change it in the next post.


Thanks,
Daisuke Nishimura.

> > +		for_each_subsys(root, ss) {
> > +			if (ss == failed_ss)
> > +				/*
> > +				 * This means can_attach() of this subsystem
> > +				 * have failed, so we don't need to call
> > +				 * cancel_attach() against rests of subsystems.
> > +				 */
> > +				break;
> > +			if (ss->cancel_attach)
> > +				ss->cancel_attach(ss, cgrp, tsk, false);
> > +		}
> > +	return retval;
> >  }
> >  
> >  /*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 0/8] memcg: recharge at task move
  2009-11-09  5:08 ` Balbir Singh
@ 2009-11-09  8:24   ` Daisuke Nishimura
  0 siblings, 0 replies; 39+ messages in thread
From: Daisuke Nishimura @ 2009-11-09  8:24 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, Andrew Morton, KAMEZAWA Hiroyuki, Li Zefan,
	Paul Menage, Daisuke Nishimura

On Mon, 9 Nov 2009 10:38:06 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * nishimura@mxp.nes.nec.co.jp <nishimura@mxp.nes.nec.co.jp> [2009-11-06 14:10:11]:
> 
> > Hi.
> > 
> > In current memcg, charges associated with a task aren't moved to the new cgroup
> > at task move. These patches are for this feature, that is, for recharging to
> > the new cgroup and, of course, uncharging from old cgroup at task move.
> > 
> > Current virsion supports only recharge of non-shared(mapcount == 1) anonymous pages
> > and swaps of those pages. I think it's enough as a first step.
> >
> 
> What is the motivation? Is it to provide better accountability? I think
> it is worthwhile to look into it, provided the cost of migration is
> not too high. It was on my list of things to look at, but I found that
> if cpu/memory are mounted together and the cost of migration is high,
> it can be a bottleneck in some cases. I'll review the patches a bit
> more.
>  
My purpose of using memcg is to restrict the usage of memory/swap by a group of processes.
IOW, prepare a box with a configurable size and put processes into it.

The point is, there are users who think charges are associated with processes, not with the box.
The current behavior is very unnatural for those users.
If a user later wants to manage some processes under another new box,
for example because they use more memory than expected and he wants to isolate
their influence from the original group, the current behavior is very bad.

Anyway, we don't move any charges by default, and an admin or middleware can
decide the behavior.


Thanks,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 2/8] memcg: move memcg_tasklist mutex
  2009-11-06  5:11 ` [PATCH -mmotm 2/8] memcg: move memcg_tasklist mutex Daisuke Nishimura
  2009-11-06  5:54   ` KAMEZAWA Hiroyuki
@ 2009-11-10 19:14   ` Balbir Singh
  2009-11-10 23:44     ` Daisuke Nishimura
  1 sibling, 1 reply; 39+ messages in thread
From: Balbir Singh @ 2009-11-10 19:14 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: linux-mm, Andrew Morton, KAMEZAWA Hiroyuki, Li Zefan, Paul Menage

* nishimura@mxp.nes.nec.co.jp <nishimura@mxp.nes.nec.co.jp> [2009-11-06 14:11:49]:

> memcg_tasklist was introduced to serialize mem_cgroup_out_of_memory() and
> mem_cgroup_move_task() to ensure tasks cannot be moved to another cgroup
> during select_bad_process().
> 
> task_in_mem_cgroup(), which can be called by select_bad_process(), will check
> whether a task is in the mem_cgroup or not by dereferencing task->cgroups
> ->subsys[]. So, it would be desirable to change task->cgroups
> (rcu_assign_pointer() in cgroup_attach_task() does it) with memcg_tasklist held.
> 
> Now that we can define cancel_attach(), we can safely release memcg_tasklist
> on fail path even if we hold memcg_tasklist in can_attach(). So let's move
> mutex_lock/unlock() of memcg_tasklist.
> 
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
>  mm/memcontrol.c |   22 ++++++++++++++++++++--
>  1 files changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 4bd3451..d3b2ac0 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3395,18 +3395,34 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
>  	return ret;
>  }
> 
> +static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
> +				struct cgroup *cgroup,
> +				struct task_struct *p,
> +				bool threadgroup)
> +{
> +	mutex_lock(&memcg_tasklist);
> +	return 0;
> +}
> +
> +static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
> +				struct cgroup *cgroup,
> +				struct task_struct *p,
> +				bool threadgroup)
> +{
> +	mutex_unlock(&memcg_tasklist);
> +}
> +
>  static void mem_cgroup_move_task(struct cgroup_subsys *ss,
>  				struct cgroup *cont,
>  				struct cgroup *old_cont,
>  				struct task_struct *p,
>  				bool threadgroup)
>  {
> -	mutex_lock(&memcg_tasklist);
> +	mutex_unlock(&memcg_tasklist);

What does this mean for nesting? I think the API's are called with
cgroup_mutex held, so memcg_tasklist nests under cgroup_mutex right?
Could you please document that at the mutex declaration point.
Shouldn't you be removing the FIXME as well?

>  	/*
>  	 * FIXME: It's better to move charges of this process from old
>  	 * memcg to new memcg. But it's just on TODO-List now.
>  	 */
> -	mutex_unlock(&memcg_tasklist);
>  }
> 
>  struct cgroup_subsys mem_cgroup_subsys = {
> @@ -3416,6 +3432,8 @@ struct cgroup_subsys mem_cgroup_subsys = {
>  	.pre_destroy = mem_cgroup_pre_destroy,
>  	.destroy = mem_cgroup_destroy,
>  	.populate = mem_cgroup_populate,
> +	.can_attach = mem_cgroup_can_attach,
> +	.cancel_attach = mem_cgroup_cancel_attach,
>  	.attach = mem_cgroup_move_task,
>  	.early_init = 0,
>  	.use_id = 1,
> -- 
> 1.5.6.1
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 2/8] memcg: move memcg_tasklist mutex
  2009-11-10 19:14   ` Balbir Singh
@ 2009-11-10 23:44     ` Daisuke Nishimura
  0 siblings, 0 replies; 39+ messages in thread
From: Daisuke Nishimura @ 2009-11-10 23:44 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, Andrew Morton, KAMEZAWA Hiroyuki, Li Zefan,
	Paul Menage, Daisuke Nishimura

On Wed, 11 Nov 2009 00:44:23 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * nishimura@mxp.nes.nec.co.jp <nishimura@mxp.nes.nec.co.jp> [2009-11-06 14:11:49]:
> 
> > memcg_tasklist was introduced to serialize mem_cgroup_out_of_memory() and
> > mem_cgroup_move_task() to ensure tasks cannot be moved to another cgroup
> > during select_bad_process().
> > 
> > task_in_mem_cgroup(), which can be called by select_bad_process(), will check
> > whether a task is in the mem_cgroup or not by dereferencing task->cgroups
> > ->subsys[]. So, it would be desirable to change task->cgroups
> > (rcu_assign_pointer() in cgroup_attach_task() does it) with memcg_tasklist held.
> > 
> > Now that we can define cancel_attach(), we can safely release memcg_tasklist
> > on fail path even if we hold memcg_tasklist in can_attach(). So let's move
> > mutex_lock/unlock() of memcg_tasklist.
> > 
> > Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > ---
> >  mm/memcontrol.c |   22 ++++++++++++++++++++--
> >  1 files changed, 20 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 4bd3451..d3b2ac0 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -3395,18 +3395,34 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
> >  	return ret;
> >  }
> > 
> > +static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
> > +				struct cgroup *cgroup,
> > +				struct task_struct *p,
> > +				bool threadgroup)
> > +{
> > +	mutex_lock(&memcg_tasklist);
> > +	return 0;
> > +}
> > +
> > +static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
> > +				struct cgroup *cgroup,
> > +				struct task_struct *p,
> > +				bool threadgroup)
> > +{
> > +	mutex_unlock(&memcg_tasklist);
> > +}
> > +
> >  static void mem_cgroup_move_task(struct cgroup_subsys *ss,
> >  				struct cgroup *cont,
> >  				struct cgroup *old_cont,
> >  				struct task_struct *p,
> >  				bool threadgroup)
> >  {
> > -	mutex_lock(&memcg_tasklist);
> > +	mutex_unlock(&memcg_tasklist);
> 
> What does this mean for nesting? I think the API's are called with
> cgroup_mutex held, so memcg_tasklist nests under cgroup_mutex right?
Yes.

> Could you please document that at the mutex declaration point.
I'm going to remove this mutex completely. It's no use as I said
in another mail(http://marc.info/?l=linux-mm&m=125749423314702&w=2).

> Shouldn't you be removing the FIXME as well?
> 
I remove this FIXME comment in [5/8] :)
This patch itself has nothing to do with this recharge feature.


Thanks,
Daisuke Nishimura.

> >  	/*
> >  	 * FIXME: It's better to move charges of this process from old
> >  	 * memcg to new memcg. But it's just on TODO-List now.
> >  	 */
> > -	mutex_unlock(&memcg_tasklist);
> >  }
> > 
> >  struct cgroup_subsys mem_cgroup_subsys = {
> > @@ -3416,6 +3432,8 @@ struct cgroup_subsys mem_cgroup_subsys = {
> >  	.pre_destroy = mem_cgroup_pre_destroy,
> >  	.destroy = mem_cgroup_destroy,
> >  	.populate = mem_cgroup_populate,
> > +	.can_attach = mem_cgroup_can_attach,
> > +	.cancel_attach = mem_cgroup_cancel_attach,
> >  	.attach = mem_cgroup_move_task,
> >  	.early_init = 0,
> >  	.use_id = 1,
> > -- 
> > 1.5.6.1
> > 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 
> -- 
> 	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH -mmotm 0/3] some cleanups for memcg
  2009-11-06  5:10 [PATCH -mmotm 0/8] memcg: recharge at task move Daisuke Nishimura
                   ` (9 preceding siblings ...)
  2009-11-09  5:08 ` Balbir Singh
@ 2009-11-11  1:35 ` Daisuke Nishimura
  2009-11-11  1:36   ` [PATCH -mmotm 1/3] memcg: add mem_cgroup_cancel_charge() Daisuke Nishimura
                     ` (2 more replies)
  10 siblings, 3 replies; 39+ messages in thread
From: Daisuke Nishimura @ 2009-11-11  1:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-mm

Hi.

These are cleanup patches split from my recharge-at-task-move patch set
posted on Nov 6.

[1/3] memcg: add mem_cgroup_cancel_charge()
[2/3] memcg: cleanup mem_cgroup_move_parent()
[3/3] memcg: remove memcg_tasklist

1 corresponds to 3 in the original patch set, and 2 to 4. There is no practical
change since then. I think they are ready for merge.

3 is a substitute for 2 in the original patch set. I'd like an ack or some comments
on this patch.


Thanks,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH -mmotm 1/3] memcg: add mem_cgroup_cancel_charge()
  2009-11-11  1:35 ` [PATCH -mmotm 0/3] some cleanups for memcg Daisuke Nishimura
@ 2009-11-11  1:36   ` Daisuke Nishimura
  2009-11-11  4:24     ` Balbir Singh
  2009-11-11  1:37   ` [PATCH -mmotm 2/3] memcg: cleanup mem_cgroup_move_parent() Daisuke Nishimura
  2009-11-11  1:39   ` [PATCH -mmotm 3/3] memcg: remove memcg_tasklist Daisuke Nishimura
  2 siblings, 1 reply; 39+ messages in thread
From: Daisuke Nishimura @ 2009-11-11  1:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-mm

There are some places calling both res_counter_uncharge() and css_put()
to cancel the charge and the refcnt we have got by mem_cgroup_try_charge().

This patch introduces mem_cgroup_cancel_charge() and calls it in those places.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 mm/memcontrol.c |   38 ++++++++++++++++++--------------------
 1 files changed, 18 insertions(+), 20 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4bd3451..d92c398 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1500,6 +1500,21 @@ nomem:
 }
 
 /*
+ * Sometimes we have to undo a charge we got by try_charge().
+ * This function is for that: it does the uncharge and puts the css's refcnt
+ * gotten by try_charge().
+ */
+static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
+{
+	if (!mem_cgroup_is_root(mem)) {
+		res_counter_uncharge(&mem->res, PAGE_SIZE);
+		if (do_swap_account)
+			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+	}
+	css_put(&mem->css);
+}
+
+/*
  * A helper function to get mem_cgroup from ID. must be called under
  * rcu_read_lock(). The caller must check css_is_removed() or some if
  * it's concern. (dropping refcnt from swap can be called against removed
@@ -1565,12 +1580,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 	lock_page_cgroup(pc);
 	if (unlikely(PageCgroupUsed(pc))) {
 		unlock_page_cgroup(pc);
-		if (!mem_cgroup_is_root(mem)) {
-			res_counter_uncharge(&mem->res, PAGE_SIZE);
-			if (do_swap_account)
-				res_counter_uncharge(&mem->memsw, PAGE_SIZE);
-		}
-		css_put(&mem->css);
+		mem_cgroup_cancel_charge(mem);
 		return;
 	}
 
@@ -1734,14 +1744,7 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc,
 cancel:
 	put_page(page);
 uncharge:
-	/* drop extra refcnt by try_charge() */
-	css_put(&parent->css);
-	/* uncharge if move fails */
-	if (!mem_cgroup_is_root(parent)) {
-		res_counter_uncharge(&parent->res, PAGE_SIZE);
-		if (do_swap_account)
-			res_counter_uncharge(&parent->memsw, PAGE_SIZE);
-	}
+	mem_cgroup_cancel_charge(parent);
 	return ret;
 }
 
@@ -1957,12 +1960,7 @@ void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
 		return;
 	if (!mem)
 		return;
-	if (!mem_cgroup_is_root(mem)) {
-		res_counter_uncharge(&mem->res, PAGE_SIZE);
-		if (do_swap_account)
-			res_counter_uncharge(&mem->memsw, PAGE_SIZE);
-	}
-	css_put(&mem->css);
+	mem_cgroup_cancel_charge(mem);
 }
 
 static void
-- 
1.5.6.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH -mmotm 2/3] memcg: cleanup mem_cgroup_move_parent()
  2009-11-11  1:35 ` [PATCH -mmotm 0/3] some cleanups for memcg Daisuke Nishimura
  2009-11-11  1:36   ` [PATCH -mmotm 1/3] memcg: add mem_cgroup_cancel_charge() Daisuke Nishimura
@ 2009-11-11  1:37   ` Daisuke Nishimura
  2009-11-11 14:40     ` Balbir Singh
  2009-11-11  1:39   ` [PATCH -mmotm 3/3] memcg: remove memcg_tasklist Daisuke Nishimura
  2 siblings, 1 reply; 39+ messages in thread
From: Daisuke Nishimura @ 2009-11-11  1:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-mm

mem_cgroup_move_parent() calls try_charge first and cancel_charge on failure.
IMHO, charge/uncharge (especially charge) is a high-cost operation, so we should
avoid it as far as possible.

This patch tries to delay try_charge in mem_cgroup_move_parent() by re-ordering
checks it does.

And this patch renames mem_cgroup_move_account() to __mem_cgroup_move_account(),
changes the return value of __mem_cgroup_move_account() from int to void,
and adds a new wrapper(mem_cgroup_move_account()), which checks whether a @pc
is valid for moving account and calls __mem_cgroup_move_account().

This patch removes the last caller of trylock_page_cgroup(), so removes its
definition too.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/page_cgroup.h |    7 +---
 mm/memcontrol.c             |   84 ++++++++++++++++++-------------------------
 2 files changed, 37 insertions(+), 54 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 4b938d4..b0e4eb1 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -57,6 +57,8 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
 static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
 	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
 
+TESTPCGFLAG(Locked, LOCK)
+
 /* Cache flag is set only once (at allocation) */
 TESTPCGFLAG(Cache, CACHE)
 CLEARPCGFLAG(Cache, CACHE)
@@ -86,11 +88,6 @@ static inline void lock_page_cgroup(struct page_cgroup *pc)
 	bit_spin_lock(PCG_LOCK, &pc->flags);
 }
 
-static inline int trylock_page_cgroup(struct page_cgroup *pc)
-{
-	return bit_spin_trylock(PCG_LOCK, &pc->flags);
-}
-
 static inline void unlock_page_cgroup(struct page_cgroup *pc)
 {
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d92c398..2f1283b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1613,27 +1613,22 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 }
 
 /**
- * mem_cgroup_move_account - move account of the page
+ * __mem_cgroup_move_account - move account of the page
  * @pc:	page_cgroup of the page.
  * @from: mem_cgroup which the page is moved from.
  * @to:	mem_cgroup which the page is moved to. @from != @to.
  *
  * The caller must confirm following.
  * - page is not on LRU (isolate_page() is useful.)
- *
- * returns 0 at success,
- * returns -EBUSY when lock is busy or "pc" is unstable.
+ * - the pc is locked, used, and ->mem_cgroup points to @from.
  *
  * This function does "uncharge" from old cgroup but doesn't do "charge" to
  * new cgroup. It should be done by a caller.
  */
 
-static int mem_cgroup_move_account(struct page_cgroup *pc,
+static void __mem_cgroup_move_account(struct page_cgroup *pc,
 	struct mem_cgroup *from, struct mem_cgroup *to)
 {
-	struct mem_cgroup_per_zone *from_mz, *to_mz;
-	int nid, zid;
-	int ret = -EBUSY;
 	struct page *page;
 	int cpu;
 	struct mem_cgroup_stat *stat;
@@ -1641,20 +1636,9 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
 
 	VM_BUG_ON(from == to);
 	VM_BUG_ON(PageLRU(pc->page));
-
-	nid = page_cgroup_nid(pc);
-	zid = page_cgroup_zid(pc);
-	from_mz =  mem_cgroup_zoneinfo(from, nid, zid);
-	to_mz =  mem_cgroup_zoneinfo(to, nid, zid);
-
-	if (!trylock_page_cgroup(pc))
-		return ret;
-
-	if (!PageCgroupUsed(pc))
-		goto out;
-
-	if (pc->mem_cgroup != from)
-		goto out;
+	VM_BUG_ON(!PageCgroupLocked(pc));
+	VM_BUG_ON(!PageCgroupUsed(pc));
+	VM_BUG_ON(pc->mem_cgroup != from);
 
 	if (!mem_cgroup_is_root(from))
 		res_counter_uncharge(&from->res, PAGE_SIZE);
@@ -1683,15 +1667,28 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
 	css_get(&to->css);
 	pc->mem_cgroup = to;
 	mem_cgroup_charge_statistics(to, pc, true);
-	ret = 0;
-out:
-	unlock_page_cgroup(pc);
 	/*
 	 * We charges against "to" which may not have any tasks. Then, "to"
 	 * can be under rmdir(). But in current implementation, caller of
 	 * this function is just force_empty() and it's garanteed that
 	 * "to" is never removed. So, we don't check rmdir status here.
 	 */
+}
+
+/*
+ * check whether the @pc is valid for moving account and call
+ * __mem_cgroup_move_account()
+ */
+static int mem_cgroup_move_account(struct page_cgroup *pc,
+				struct mem_cgroup *from, struct mem_cgroup *to)
+{
+	int ret = -EINVAL;
+	lock_page_cgroup(pc);
+	if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
+		__mem_cgroup_move_account(pc, from, to);
+		ret = 0;
+	}
+	unlock_page_cgroup(pc);
 	return ret;
 }
 
@@ -1713,38 +1710,27 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc,
 	if (!pcg)
 		return -EINVAL;
 
+	ret = -EBUSY;
+	if (!get_page_unless_zero(page))
+		goto out;
+	if (isolate_lru_page(page))
+		goto put;
 
 	parent = mem_cgroup_from_cont(pcg);
-
-
 	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page);
 	if (ret || !parent)
-		return ret;
-
-	if (!get_page_unless_zero(page)) {
-		ret = -EBUSY;
-		goto uncharge;
-	}
-
-	ret = isolate_lru_page(page);
-
-	if (ret)
-		goto cancel;
+		goto put_back;
 
 	ret = mem_cgroup_move_account(pc, child, parent);
-
+	if (!ret)
+		css_put(&parent->css);	/* drop extra refcnt by try_charge() */
+	else
+		mem_cgroup_cancel_charge(parent);	/* does css_put */
+put_back:
 	putback_lru_page(page);
-	if (!ret) {
-		put_page(page);
-		/* drop extra refcnt by try_charge() */
-		css_put(&parent->css);
-		return 0;
-	}
-
-cancel:
+put:
 	put_page(page);
-uncharge:
-	mem_cgroup_cancel_charge(parent);
+out:
 	return ret;
 }
 
-- 
1.5.6.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH -mmotm 3/3] memcg: remove memcg_tasklist
  2009-11-11  1:35 ` [PATCH -mmotm 0/3] some cleanups for memcg Daisuke Nishimura
  2009-11-11  1:36   ` [PATCH -mmotm 1/3] memcg: add mem_cgroup_cancel_charge() Daisuke Nishimura
  2009-11-11  1:37   ` [PATCH -mmotm 2/3] memcg: cleanup mem_cgroup_move_parent() Daisuke Nishimura
@ 2009-11-11  1:39   ` Daisuke Nishimura
  2009-11-11  1:49     ` KAMEZAWA Hiroyuki
  2009-11-11 16:01     ` Balbir Singh
  2 siblings, 2 replies; 39+ messages in thread
From: Daisuke Nishimura @ 2009-11-11  1:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-mm

memcg_tasklist was introduced at commit 7f4d454d(memcg: avoid deadlock caused
by race between oom and cpuset_attach) instead of cgroup_mutex to fix a deadlock
problem.  The cgroup_mutex, which was removed by the commit, in
mem_cgroup_out_of_memory() was originally introduced at commit c7ba5c9e
(Memory controller: OOM handling).

IIUC, the intention of this cgroup_mutex was to prevent task move during
select_bad_process() so that situations like below can be avoided.

  Assume cgroup "foo" has exceeded its limit and is about to trigger oom.
  1. Process A, which has been in cgroup "baa" and uses large memory, is just
     moved to cgroup "foo". Process A can be a candidate for being killed.
  2. Process B, which has been in cgroup "foo" and uses large memory, is just
     moved from cgroup "foo". Process B can be excluded from the candidates for
     being killed.

But these race windows exist anyway even if we hold a lock, because
__mem_cgroup_try_charge() decides whether it should trigger oom or not outside
of the lock. So the original cgroup_mutex in mem_cgroup_out_of_memory() and
thus the current memcg_tasklist has no use. And IMHO, those races are not so
critical for users.
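
To illustrate (a sketch of one possible interleaving, not an observed trace),
the window stays open like this even with the mutex:

	charging task (hits limit of "foo")     task mover
	-----------------------------------     -----------------------------------
	__mem_cgroup_try_charge()
	  retries exhausted, decides to oom
	                                         moves B out of "foo" / A into "foo"
	                                         (takes and drops memcg_tasklist)
	  mutex_lock(&memcg_tasklist)
	  mem_cgroup_out_of_memory()
	    select_bad_process() already sees
	    the changed task membership
	  mutex_unlock(&memcg_tasklist)

The mutex serializes the scan against a concurrent move, but it cannot help with
a move that completes between the decision to oom and the lock acquisition,
which is exactly scenario 1/2 above.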

This patch removes it and makes the code simpler.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 mm/memcontrol.c |    5 -----
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2f1283b..a74fcc2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -55,7 +55,6 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
 #define do_swap_account		(0)
 #endif
 
-static DEFINE_MUTEX(memcg_tasklist);	/* can be hold under cgroup_mutex */
 #define SOFTLIMIT_EVENTS_THRESH (1000)
 
 /*
@@ -1475,9 +1474,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 
 		if (!nr_retries--) {
 			if (oom) {
-				mutex_lock(&memcg_tasklist);
 				mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
-				mutex_unlock(&memcg_tasklist);
 				record_last_oom(mem_over_limit);
 			}
 			goto nomem;
@@ -3385,12 +3382,10 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct task_struct *p,
 				bool threadgroup)
 {
-	mutex_lock(&memcg_tasklist);
 	/*
 	 * FIXME: It's better to move charges of this process from old
 	 * memcg to new memcg. But it's just on TODO-List now.
 	 */
-	mutex_unlock(&memcg_tasklist);
 }
 
 struct cgroup_subsys mem_cgroup_subsys = {
-- 
1.5.6.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: remove memcg_tasklist
  2009-11-11  1:39   ` [PATCH -mmotm 3/3] memcg: remove memcg_tasklist Daisuke Nishimura
@ 2009-11-11  1:49     ` KAMEZAWA Hiroyuki
  2009-11-11 16:01     ` Balbir Singh
  1 sibling, 0 replies; 39+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-11  1:49 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Andrew Morton, Balbir Singh, linux-mm

On Wed, 11 Nov 2009 10:39:06 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> memcg_tasklist was introduced at commit 7f4d454d(memcg: avoid deadlock caused
> by race between oom and cpuset_attach) instead of cgroup_mutex to fix a deadlock
> problem.  The cgroup_mutex, which was removed by the commit, in
> mem_cgroup_out_of_memory() was originally introduced at commit c7ba5c9e
> (Memory controller: OOM handling).
> 
> IIUC, the intention of this cgroup_mutex was to prevent task move during
> select_bad_process() so that situations like below can be avoided.
> 
>   Assume cgroup "foo" has exceeded its limit and is about to trigger oom.
>   1. Process A, which has been in cgroup "baa" and uses large memory, is just
>      moved to cgroup "foo". Process A can be a candidate for being killed.
>   2. Process B, which has been in cgroup "foo" and uses large memory, is just
>      moved from cgroup "foo". Process B can be excluded from the candidates for
>      being killed.
> 
> But these race windows exist anyway even if we hold a lock, because
> __mem_cgroup_try_charge() decides whether it should trigger oom or not outside
> of the lock. So the original cgroup_mutex in mem_cgroup_out_of_memory() and
> thus the current memcg_tasklist has no use. And IMHO, those races are not so
> critical for users.
> 
> This patch removes it and makes the code simpler.
> 
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 1/3] memcg: add mem_cgroup_cancel_charge()
  2009-11-11  1:36   ` [PATCH -mmotm 1/3] memcg: add mem_cgroup_cancel_charge() Daisuke Nishimura
@ 2009-11-11  4:24     ` Balbir Singh
  0 siblings, 0 replies; 39+ messages in thread
From: Balbir Singh @ 2009-11-11  4:24 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm

* nishimura@mxp.nes.nec.co.jp <nishimura@mxp.nes.nec.co.jp> [2009-11-11 10:36:49]:

> There are some places calling both res_counter_uncharge() and css_put()
> to cancel the charge and the refcnt we have got by mem_cgroup_try_charge().
> 
> This patch introduces mem_cgroup_cancel_charge() and calls it in those places.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>


Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 
-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 2/3] memcg: cleanup mem_cgroup_move_parent()
  2009-11-11  1:37   ` [PATCH -mmotm 2/3] memcg: cleanup mem_cgroup_move_parent() Daisuke Nishimura
@ 2009-11-11 14:40     ` Balbir Singh
  2009-11-11 15:16       ` Daisuke Nishimura
  0 siblings, 1 reply; 39+ messages in thread
From: Balbir Singh @ 2009-11-11 14:40 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm

* nishimura@mxp.nes.nec.co.jp <nishimura@mxp.nes.nec.co.jp> [2009-11-11 10:37:41]:

> mem_cgroup_move_parent() calls try_charge first and cancel_charge on failure.
> IMHO, charge/uncharge (especially charge) is a high-cost operation, so we should
> avoid it as far as possible.
> 

But cancel_charge and move_account are not frequent operations, does
optimizing this give a significant benefit?


> This patch tries to delay try_charge in mem_cgroup_move_parent() by re-ordering
> checks it does.
> 
> And this patch renames mem_cgroup_move_account() to __mem_cgroup_move_account(),
> changes the return value of __mem_cgroup_move_account() from int to void,
> and adds a new wrapper(mem_cgroup_move_account()), which checks whether a @pc
> is valid for moving account and calls __mem_cgroup_move_account().
> 
> This patch removes the last caller of trylock_page_cgroup(), so removes its
> definition too.
> 
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/page_cgroup.h |    7 +---
>  mm/memcontrol.c             |   84 ++++++++++++++++++-------------------------
>  2 files changed, 37 insertions(+), 54 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 4b938d4..b0e4eb1 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -57,6 +57,8 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
>  static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
>  	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
> 
> +TESTPCGFLAG(Locked, LOCK)
> +
>  /* Cache flag is set only once (at allocation) */
>  TESTPCGFLAG(Cache, CACHE)
>  CLEARPCGFLAG(Cache, CACHE)
> @@ -86,11 +88,6 @@ static inline void lock_page_cgroup(struct page_cgroup *pc)
>  	bit_spin_lock(PCG_LOCK, &pc->flags);
>  }
> 
> -static inline int trylock_page_cgroup(struct page_cgroup *pc)
> -{
> -	return bit_spin_trylock(PCG_LOCK, &pc->flags);
> -}
> -
>  static inline void unlock_page_cgroup(struct page_cgroup *pc)
>  {
>  	bit_spin_unlock(PCG_LOCK, &pc->flags);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d92c398..2f1283b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1613,27 +1613,22 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>  }
> 
>  /**
> - * mem_cgroup_move_account - move account of the page
> + * __mem_cgroup_move_account - move account of the page
>   * @pc:	page_cgroup of the page.
>   * @from: mem_cgroup which the page is moved from.
>   * @to:	mem_cgroup which the page is moved to. @from != @to.
>   *
>   * The caller must confirm following.
>   * - page is not on LRU (isolate_page() is useful.)
> - *
> - * returns 0 at success,
> - * returns -EBUSY when lock is busy or "pc" is unstable.
> + * - the pc is locked, used, and ->mem_cgroup points to @from.
>   *
>   * This function does "uncharge" from old cgroup but doesn't do "charge" to
>   * new cgroup. It should be done by a caller.
>   */
> 
> -static int mem_cgroup_move_account(struct page_cgroup *pc,
> +static void __mem_cgroup_move_account(struct page_cgroup *pc,
>  	struct mem_cgroup *from, struct mem_cgroup *to)
>  {
> -	struct mem_cgroup_per_zone *from_mz, *to_mz;
> -	int nid, zid;
> -	int ret = -EBUSY;
>  	struct page *page;
>  	int cpu;
>  	struct mem_cgroup_stat *stat;
> @@ -1641,20 +1636,9 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
> 
>  	VM_BUG_ON(from == to);
>  	VM_BUG_ON(PageLRU(pc->page));
> -
> -	nid = page_cgroup_nid(pc);
> -	zid = page_cgroup_zid(pc);
> -	from_mz =  mem_cgroup_zoneinfo(from, nid, zid);
> -	to_mz =  mem_cgroup_zoneinfo(to, nid, zid);
> -
> -	if (!trylock_page_cgroup(pc))
> -		return ret;
> -
> -	if (!PageCgroupUsed(pc))
> -		goto out;
> -
> -	if (pc->mem_cgroup != from)
> -		goto out;
> +	VM_BUG_ON(!PageCgroupLocked(pc));
> +	VM_BUG_ON(!PageCgroupUsed(pc));
> +	VM_BUG_ON(pc->mem_cgroup != from);
> 
>  	if (!mem_cgroup_is_root(from))
>  		res_counter_uncharge(&from->res, PAGE_SIZE);
> @@ -1683,15 +1667,28 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
>  	css_get(&to->css);
>  	pc->mem_cgroup = to;
>  	mem_cgroup_charge_statistics(to, pc, true);
> -	ret = 0;
> -out:
> -	unlock_page_cgroup(pc);
>  	/*
>  	 * We charges against "to" which may not have any tasks. Then, "to"
>  	 * can be under rmdir(). But in current implementation, caller of
>  	 * this function is just force_empty() and it's garanteed that
>  	 * "to" is never removed. So, we don't check rmdir status here.
>  	 */
> +}
> +
> +/*
> + * check whether the @pc is valid for moving account and call
> + * __mem_cgroup_move_account()
> + */
> +static int mem_cgroup_move_account(struct page_cgroup *pc,
> +				struct mem_cgroup *from, struct mem_cgroup *to)
> +{
> +	int ret = -EINVAL;
> +	lock_page_cgroup(pc);
> +	if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
> +		__mem_cgroup_move_account(pc, from, to);
> +		ret = 0;
> +	}
> +	unlock_page_cgroup(pc);
>  	return ret;
>  }
> 
> @@ -1713,38 +1710,27 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc,
>  	if (!pcg)
>  		return -EINVAL;
> 
> +	ret = -EBUSY;
> +	if (!get_page_unless_zero(page))
> +		goto out;
> +	if (isolate_lru_page(page))
> +		goto put;
> 
>  	parent = mem_cgroup_from_cont(pcg);
> -
> -
>  	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page);
>  	if (ret || !parent)
> -		return ret;
> -
> -	if (!get_page_unless_zero(page)) {
> -		ret = -EBUSY;
> -		goto uncharge;
> -	}
> -
> -	ret = isolate_lru_page(page);
> -
> -	if (ret)
> -		goto cancel;
> +		goto put_back;
> 
>  	ret = mem_cgroup_move_account(pc, child, parent);
> -
> +	if (!ret)
> +		css_put(&parent->css);	/* drop extra refcnt by try_charge() */
> +	else
> +		mem_cgroup_cancel_charge(parent);	/* does css_put */
> +put_back:
>  	putback_lru_page(page);
> -	if (!ret) {
> -		put_page(page);
> -		/* drop extra refcnt by try_charge() */
> -		css_put(&parent->css);
> -		return 0;
> -	}
> -
> -cancel:
> +put:
>  	put_page(page);
> -uncharge:
> -	mem_cgroup_cancel_charge(parent);
> +out:
>  	return ret;
>  }
>


Looks good overall
 
-- 
	Balbir


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 2/3] memcg: cleanup mem_cgroup_move_parent()
  2009-11-11 14:40     ` Balbir Singh
@ 2009-11-11 15:16       ` Daisuke Nishimura
  0 siblings, 0 replies; 39+ messages in thread
From: Daisuke Nishimura @ 2009-11-11 15:16 UTC (permalink / raw)
  To: balbir
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, d-nishimura,
	Daisuke Nishimura

On Wed, 11 Nov 2009 20:10:50 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * nishimura@mxp.nes.nec.co.jp <nishimura@mxp.nes.nec.co.jp> [2009-11-11 10:37:41]:
> 
> > mem_cgroup_move_parent() calls try_charge first and cancel_charge on failure.
> > IMHO, charge/uncharge (especially charge) is a high-cost operation, so we should
> > avoid it as far as possible.
> > 
> 
> But cancel_charge and move_account are not frequent operations; does
> optimizing this give a significant benefit?
> 
I agree they are not called very frequently, so the benefit would not be big.
But the number of lines in memcontrol.c decreases a bit with these patches ;)

IMHO, the current mem_cgroup_move_parent() is a bit hard to understand, and
mem_cgroup_move_account() has some redundant code, so I cleaned them up.

Moreover, mem_cgroup_cancel_charge(), which I introduced in [1/3], and the new
wrapper version of mem_cgroup_move_account(), which I introduced in this patch,
will be used in my recharge-at-task-move patches and will make them more readable.
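
For reference, here is a rough sketch (not code from the posted patches;
mc_move_one_page() is a made-up name, and error handling is simplified) of how
such a recharge path can reuse these helpers:

static int mc_move_one_page(struct page *page, gfp_t gfp_mask,
			    struct mem_cgroup *from, struct mem_cgroup *to)
{
	struct page_cgroup *pc = lookup_page_cgroup(page);
	int ret;

	/* pre-charge the destination; this may reclaim, like any charge */
	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &to, false, page);
	if (ret)
		return ret;

	if (isolate_lru_page(page)) {
		mem_cgroup_cancel_charge(to);	/* roll back the pre-charge */
		return -EBUSY;
	}

	ret = mem_cgroup_move_account(pc, from, to);
	if (!ret)
		css_put(&to->css);		/* drop the extra ref from try_charge() */
	else
		mem_cgroup_cancel_charge(to);	/* does css_put() */

	putback_lru_page(page);
	return ret;
}

The page reference and mmap_sem handling of the real recharge patches are
omitted here; the point is only that try_charge/cancel_charge/move_account
compose cleanly.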

> Looks good overall
>  
Thank you.


Daisuke Nishimura.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: remove memcg_tasklist
  2009-11-11  1:39   ` [PATCH -mmotm 3/3] memcg: remove memcg_tasklist Daisuke Nishimura
  2009-11-11  1:49     ` KAMEZAWA Hiroyuki
@ 2009-11-11 16:01     ` Balbir Singh
  2009-11-12  8:05       ` Daisuke Nishimura
  1 sibling, 1 reply; 39+ messages in thread
From: Balbir Singh @ 2009-11-11 16:01 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm

* nishimura@mxp.nes.nec.co.jp <nishimura@mxp.nes.nec.co.jp> [2009-11-11 10:39:06]:

> memcg_tasklist was introduced at commit 7f4d454d(memcg: avoid deadlock caused
> by race between oom and cpuset_attach) instead of cgroup_mutex to fix a deadlock
> problem.  The cgroup_mutex, which was removed by the commit, in
> mem_cgroup_out_of_memory() was originally introduced at commit c7ba5c9e
> (Memory controller: OOM handling).
> 
> IIUC, the intention of this cgroup_mutex was to prevent task move during
> select_bad_process() so that situations like below can be avoided.
> 
>   Assume cgroup "foo" has exceeded its limit and is about to trigger oom.
>   1. Process A, which has been in cgroup "baa" and uses large memory, is just
>      moved to cgroup "foo". Process A can become a candidate for being killed.
>   2. Process B, which has been in cgroup "foo" and uses large memory, is just
>      moved from cgroup "foo". Process B can be excluded from the candidates for
>      being killed.
> 
> But this race window exists anyway even if we hold the lock, because
> __mem_cgroup_try_charge() decides whether it should trigger oom or not outside
> of the lock. So the original cgroup_mutex in mem_cgroup_out_of_memory() and thus
> the current memcg_tasklist has no use. And IMHO, those races are not so critical
> for users.
> 
> This patch removes it and makes the code simpler.
>

Could you please test for side-effects like concurrent OOM? An idea of
how the patchset was tested would be good to have, given the
implications of these changes.

Not-Yet-Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 

-- 
	Balbir


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH -mmotm 3/3] memcg: remove memcg_tasklist
  2009-11-11 16:01     ` Balbir Singh
@ 2009-11-12  8:05       ` Daisuke Nishimura
  0 siblings, 0 replies; 39+ messages in thread
From: Daisuke Nishimura @ 2009-11-12  8:05 UTC (permalink / raw)
  To: balbir; +Cc: Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, Daisuke Nishimura

On Wed, 11 Nov 2009 21:31:34 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * nishimura@mxp.nes.nec.co.jp <nishimura@mxp.nes.nec.co.jp> [2009-11-11 10:39:06]:
> 
> > memcg_tasklist was introduced at commit 7f4d454d(memcg: avoid deadlock caused
> > by race between oom and cpuset_attach) instead of cgroup_mutex to fix a deadlock
> > problem.  The cgroup_mutex, which was removed by the commit, in
> > mem_cgroup_out_of_memory() was originally introduced at commit c7ba5c9e
> > (Memory controller: OOM handling).
> > 
> > IIUC, the intention of this cgroup_mutex was to prevent task move during
> > select_bad_process() so that situations like below can be avoided.
> > 
> >   Assume cgroup "foo" has exceeded its limit and is about to trigger oom.
> >   1. Process A, which has been in cgroup "baa" and uses large memory, is just
> >      moved to cgroup "foo". Process A can become a candidate for being killed.
> >   2. Process B, which has been in cgroup "foo" and uses large memory, is just
> >      moved from cgroup "foo". Process B can be excluded from the candidates for
> >      being killed.
> > 
> > But this race window exists anyway even if we hold the lock, because
> > __mem_cgroup_try_charge() decides whether it should trigger oom or not outside
> > of the lock. So the original cgroup_mutex in mem_cgroup_out_of_memory() and thus
> > the current memcg_tasklist has no use. And IMHO, those races are not so critical
> > for users.
> > 
> > This patch removes it and makes the code simpler.
> >
> 
> Could you please test for side-effects like concurrent OOM? An idea of
> how the patchset was tested would be good to have, given the
> implications of these changes.
> 
Hmm, I'm not sure what your concern is.

My point is that __mem_cgroup_try_charge() decides whether it should trigger
oom or not outside of this mutex. mem_cgroup_out_of_memory() (selecting a
target task and killing it) is itself serialized by the mutex, but because
the decision whether to trigger oom is made outside of it, the mutex has no
real effect (oom will happen anyway).
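
To illustrate, here is a schematic sketch of the charge path. This is not the
actual mm/memcontrol.c code: charge_would_exceed_limit() and
try_to_reclaim_from() are made-up names, and only the locking pattern matters.

static int try_charge_sketch(struct mem_cgroup *mem, gfp_t gfp_mask)
{
	int nr_retries = 5;	/* stands in for the real retry constant */

	while (charge_would_exceed_limit(mem)) {
		try_to_reclaim_from(mem, gfp_mask);
		if (nr_retries--)
			continue;
		/*
		 * The decision to go into oom is made here, with no
		 * memcg_tasklist held; tasks can still be moved into or
		 * out of "mem" at this point.
		 */
		mutex_lock(&memcg_tasklist);
		mem_cgroup_out_of_memory(mem, gfp_mask);  /* only the kill is serialized */
		mutex_unlock(&memcg_tasklist);
		return -ENOMEM;
	}
	return 0;
}

So whether or not memcg_tasklist (or cgroup_mutex before it) is held around
mem_cgroup_out_of_memory(), a task move racing with the retry loop can still
change which tasks end up as candidates.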

Actually, I tested the following scenario.

- make a cgroup with mem.limit == memsw.limit == 128M.
- under the cgroup, run a test program which consumes about 8MB of anonymous
  memory per instance (a minimal version is sketched below).
- I can run up to 15 instances of this test program, but when I start the 16th
  one, oom is triggered. The number of processes that get killed is random
  (not necessarily one process).

This behavior doesn't change before or after this patch.
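
FWIW, a minimal stand-in for that test program (not the exact one I used)
looks like this:

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ANON_SIZE (8UL * 1024 * 1024)	/* about 8MB of anonymous memory */

int main(void)
{
	char *buf = malloc(ANON_SIZE);

	if (!buf)
		return 1;
	memset(buf, 1, ANON_SIZE);	/* touch every page so it is really charged */
	pause();			/* hold the memory until killed */
	return 0;
}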


Thanks,
Daisuke Nishimura.


^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2009-11-12  8:22 UTC | newest]

Thread overview: 39+ messages
2009-11-06  5:10 [PATCH -mmotm 0/8] memcg: recharge at task move Daisuke Nishimura
2009-11-06  5:11 ` [PATCH -mmotm 1/8] cgroup: introduce cancel_attach() Daisuke Nishimura
2009-11-09  6:57   ` Balbir Singh
2009-11-09  7:18     ` Li Zefan
2009-11-09  7:23     ` Daisuke Nishimura
2009-11-09  7:23   ` Li Zefan
2009-11-09  7:38     ` Daisuke Nishimura
2009-11-06  5:11 ` [PATCH -mmotm 2/8] memcg: move memcg_tasklist mutex Daisuke Nishimura
2009-11-06  5:54   ` KAMEZAWA Hiroyuki
2009-11-06  7:49     ` Daisuke Nishimura
2009-11-06  8:02       ` KAMEZAWA Hiroyuki
2009-11-10 19:14   ` Balbir Singh
2009-11-10 23:44     ` Daisuke Nishimura
2009-11-06  5:12 ` [PATCH -mmotm 3/8] memcg: add mem_cgroup_cancel_charge() Daisuke Nishimura
2009-11-06  5:13 ` [PATCH -mmotm 4/8] memcg: cleanup mem_cgroup_move_parent() Daisuke Nishimura
2009-11-06  5:56   ` KAMEZAWA Hiroyuki
2009-11-06  5:14 ` [PATCH -mmotm 5/8] memcg: add interface to recharge at task move Daisuke Nishimura
2009-11-06  6:06   ` KAMEZAWA Hiroyuki
2009-11-06  5:14 ` [PATCH -mmotm 6/8] memcg: recharge charges of anonymous page Daisuke Nishimura
2009-11-06  6:35   ` KAMEZAWA Hiroyuki
2009-11-09  0:31     ` Daisuke Nishimura
2009-11-06  5:15 ` [PATCH -mmotm 7/8] memcg: avoid oom during recharge at task move Daisuke Nishimura
2009-11-06  6:39   ` KAMEZAWA Hiroyuki
2009-11-06  5:16 ` [PATCH -mmotm 8/8] memcg: recharge charges of anonymous swap Daisuke Nishimura
2009-11-06  6:45 ` [PATCH -mmotm 0/8] memcg: recharge at task move KAMEZAWA Hiroyuki
2009-11-09  1:44   ` Daisuke Nishimura
2009-11-09  5:16     ` KAMEZAWA Hiroyuki
2009-11-09  5:08 ` Balbir Singh
2009-11-09  8:24   ` Daisuke Nishimura
2009-11-11  1:35 ` [PATCH -mmotm 0/3] some cleanups for memcg Daisuke Nishimura
2009-11-11  1:36   ` [PATCH -mmotm 1/3] memcg: add mem_cgroup_cancel_charge() Daisuke Nishimura
2009-11-11  4:24     ` Balbir Singh
2009-11-11  1:37   ` [PATCH -mmotm 2/3] memcg: cleanup mem_cgroup_move_parent() Daisuke Nishimura
2009-11-11 14:40     ` Balbir Singh
2009-11-11 15:16       ` Daisuke Nishimura
2009-11-11  1:39   ` [PATCH -mmotm 3/3] memcg: remove memcg_tasklist Daisuke Nishimura
2009-11-11  1:49     ` KAMEZAWA Hiroyuki
2009-11-11 16:01     ` Balbir Singh
2009-11-12  8:05       ` Daisuke Nishimura
