All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] memcg: cgroup fix rmdir hang
@ 2009-06-23  7:07 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-23  7:07 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, nishimura, balbir, lizf, menage, akpm

previous discussion was this => http://marc.info/?t=124478543600001&r=1&w=2

This patch tries to fix problem as
  - rmdir can sleep very very long if swap entry is shared between multiple
    cgroups

Now, cgroup's rmdir path does following

==
again:
	check there are no tasks and children group.
	call pre_destroy()
	check css's refcnt
	if (refcnt > 0) {
		sleep until css's refcnt goes down to 0.
		goto again
	}
==

Unfortunately, memory cgroup does following at charge.

	css_get(&memcg->css)
	....
	charge(memcg) (increase USAGE)
	...
And this "memcg" is not necessary to include the caller, task.

pre_destroy() tries to reduce memory usage until USAGE goes down to 0.
Then, there is a race that
	- css's refcnt > 0 (and memcg's usage > 0)
	- rmdir() caller sleeps until css->refcnt goes down 0.
	- But to make css->refcnt be 0, pre_destroy() should be called again.

This patch tries to fix this in asyhcnrounos way (i.e. without big lock.)
Any comments are welcome.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 0/2] memcg: cgroup fix rmdir hang
@ 2009-06-23  7:07 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-23  7:07 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, nishimura, balbir, lizf, menage, akpm

previous discussion was this => http://marc.info/?t=124478543600001&r=1&w=2

This patch tries to fix problem as
  - rmdir can sleep very very long if swap entry is shared between multiple
    cgroups

Now, cgroup's rmdir path does following

==
again:
	check there are no tasks and children group.
	call pre_destroy()
	check css's refcnt
	if (refcnt > 0) {
		sleep until css's refcnt goes down to 0.
		goto again
	}
==

Unfortunately, memory cgroup does following at charge.

	css_get(&memcg->css)
	....
	charge(memcg) (increase USAGE)
	...
And this "memcg" is not necessary to include the caller, task.

pre_destroy() tries to reduce memory usage until USAGE goes down to 0.
Then, there is a race that
	- css's refcnt > 0 (and memcg's usage > 0)
	- rmdir() caller sleeps until css->refcnt goes down 0.
	- But to make css->refcnt be 0, pre_destroy() should be called again.

This patch tries to fix this in asyhcnrounos way (i.e. without big lock.)
Any comments are welcome.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/2] memcg: cgroup fix rmdir hang
  2009-06-23  7:07 ` KAMEZAWA Hiroyuki
@ 2009-06-23  7:08   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-23  7:08 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, nishimura, balbir, lizf, menage, akpm

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Now, cgroup has a logic to wait until ready-to-rmdir for avoiding
frequent -EBUSY at rmdir.
 (See Commit ec64f51545fffbc4cb968f0cea56341a4b07e85a
  cgroup: fix frequent -EBUSY at rmdir.

Nishimura-san reported bad case for waiting and This is a fix to
make it reliable. A thread waiting for thread cannot be waken up
when a refcnt gotten by css_tryget() isn't put immediately.
(Original code assumed css_put() will be called soon.)

memcg has this case and this is a fix for the problem. This adds
retry_rmdir() callback to subsys and check we can sleep or not.

Note: another solution will be adding "rmdir state" to subsys.
But it will be much complicated than this do-enough-check solution.

Changelog v1 -> v2:
 - splitted into 2 patches. This just includes retry_rmdir() modification.

Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/cgroups/cgroups.txt |   11 +++++++++++
 include/linux/cgroup.h            |    1 +
 kernel/cgroup.c                   |   20 +++++++++++++++++++-
 mm/memcontrol.c                   |   14 ++++++++++++--
 4 files changed, 43 insertions(+), 3 deletions(-)

Index: fix-rmdir-cgroup/include/linux/cgroup.h
===================================================================
--- fix-rmdir-cgroup.orig/include/linux/cgroup.h
+++ fix-rmdir-cgroup/include/linux/cgroup.h
@@ -374,6 +374,7 @@ struct cgroup_subsys {
 	struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss,
 						  struct cgroup *cgrp);
 	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
+	int (*retry_rmdir)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup_subsys *ss,
 			  struct cgroup *cgrp, struct task_struct *tsk);
Index: fix-rmdir-cgroup/kernel/cgroup.c
===================================================================
--- fix-rmdir-cgroup.orig/kernel/cgroup.c
+++ fix-rmdir-cgroup/kernel/cgroup.c
@@ -636,6 +636,23 @@ static int cgroup_call_pre_destroy(struc
 		}
 	return ret;
 }
+/*
+ * Call subsys's retry_rmdir() handler. If this returns non-Zero, we retry
+ * rmdir immediately and call pre_destroy again.
+ */
+static int cgroup_check_retry_rmdir(struct cgroup *cgrp)
+{
+	struct cgroup_subsys *ss;
+	int ret = 0;
+
+	for_each_subsys(cgrp->root, ss)
+		if (ss->pre_destroy) {
+			ret = ss->retry_rmdir(ss, cgrp);
+			if (ret)
+				break;
+		}
+	return ret;
+}
 
 static void free_cgroup_rcu(struct rcu_head *obj)
 {
@@ -2722,7 +2739,8 @@ again:
 
 	if (!cgroup_clear_css_refs(cgrp)) {
 		mutex_unlock(&cgroup_mutex);
-		schedule();
+		if (!cgroup_check_retry_rmdir(cgrp))
+			schedule();
 		finish_wait(&cgroup_rmdir_waitq, &wait);
 		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 		if (signal_pending(current))
Index: fix-rmdir-cgroup/mm/memcontrol.c
===================================================================
--- fix-rmdir-cgroup.orig/mm/memcontrol.c
+++ fix-rmdir-cgroup/mm/memcontrol.c
@@ -1457,8 +1457,6 @@ __mem_cgroup_commit_charge_swapin(struct
 		}
 		rcu_read_unlock();
 	}
-	/* add this page(page_cgroup) to the LRU we want. */
-
 }
 
 void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
@@ -2571,6 +2569,17 @@ static int mem_cgroup_pre_destroy(struct
 	return mem_cgroup_force_empty(mem, false);
 }
 
+static int mem_cgroup_retry_rmdir(struct cgroup_subsys *ss,
+				  struct cgroup *cont)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+
+	if (res_counter_read_u64(&mem->res, RES_USAGE))
+		return 1;
+	return 0;
+}
+
+
 static void mem_cgroup_destroy(struct cgroup_subsys *ss,
 				struct cgroup *cont)
 {
@@ -2610,6 +2619,7 @@ struct cgroup_subsys mem_cgroup_subsys =
 	.subsys_id = mem_cgroup_subsys_id,
 	.create = mem_cgroup_create,
 	.pre_destroy = mem_cgroup_pre_destroy,
+	.retry_rmdir = mem_cgroup_retry_rmdir,
 	.destroy = mem_cgroup_destroy,
 	.populate = mem_cgroup_populate,
 	.attach = mem_cgroup_move_task,
Index: fix-rmdir-cgroup/Documentation/cgroups/cgroups.txt
===================================================================
--- fix-rmdir-cgroup.orig/Documentation/cgroups/cgroups.txt
+++ fix-rmdir-cgroup/Documentation/cgroups/cgroups.txt
@@ -500,6 +500,17 @@ there are not tasks in the cgroup. If pr
 rmdir() will fail with it. From this behavior, pre_destroy() can be
 called multiple times against a cgroup.
 
+int retry_rmdir(struct cgroup_subsys *ss, struct cgroup *cgrp);
+
+Called at rmdir right after the kernel finds there are remaining refcnt on
+subsystems after pre_destroy(). When retry_rmdir() returns 0, the caller enter
+sleep and wakes up when css's refcnt goes down to 0 by css_put().
+When this returns 1, the caller doesn't sleep and retry rmdir immediately.
+This is useful when the subsys knows remaining css's refcnt is not temporal
+and to calling pre_destroy() again is proper way to remove that.
+(or proper way to retrun -EBUSY.)
+
+
 int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	       struct task_struct *task)
 (cgroup_mutex held by caller)


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/2] memcg: cgroup fix rmdir hang
@ 2009-06-23  7:08   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-23  7:08 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, nishimura, balbir, lizf, menage, akpm

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Now, cgroup has a logic to wait until ready-to-rmdir for avoiding
frequent -EBUSY at rmdir.
 (See Commit ec64f51545fffbc4cb968f0cea56341a4b07e85a
  cgroup: fix frequent -EBUSY at rmdir.

Nishimura-san reported bad case for waiting and This is a fix to
make it reliable. A thread waiting for thread cannot be waken up
when a refcnt gotten by css_tryget() isn't put immediately.
(Original code assumed css_put() will be called soon.)

memcg has this case and this is a fix for the problem. This adds
retry_rmdir() callback to subsys and check we can sleep or not.

Note: another solution will be adding "rmdir state" to subsys.
But it will be much complicated than this do-enough-check solution.

Changelog v1 -> v2:
 - splitted into 2 patches. This just includes retry_rmdir() modification.

Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/cgroups/cgroups.txt |   11 +++++++++++
 include/linux/cgroup.h            |    1 +
 kernel/cgroup.c                   |   20 +++++++++++++++++++-
 mm/memcontrol.c                   |   14 ++++++++++++--
 4 files changed, 43 insertions(+), 3 deletions(-)

Index: fix-rmdir-cgroup/include/linux/cgroup.h
===================================================================
--- fix-rmdir-cgroup.orig/include/linux/cgroup.h
+++ fix-rmdir-cgroup/include/linux/cgroup.h
@@ -374,6 +374,7 @@ struct cgroup_subsys {
 	struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss,
 						  struct cgroup *cgrp);
 	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
+	int (*retry_rmdir)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup_subsys *ss,
 			  struct cgroup *cgrp, struct task_struct *tsk);
Index: fix-rmdir-cgroup/kernel/cgroup.c
===================================================================
--- fix-rmdir-cgroup.orig/kernel/cgroup.c
+++ fix-rmdir-cgroup/kernel/cgroup.c
@@ -636,6 +636,23 @@ static int cgroup_call_pre_destroy(struc
 		}
 	return ret;
 }
+/*
+ * Call subsys's retry_rmdir() handler. If this returns non-Zero, we retry
+ * rmdir immediately and call pre_destroy again.
+ */
+static int cgroup_check_retry_rmdir(struct cgroup *cgrp)
+{
+	struct cgroup_subsys *ss;
+	int ret = 0;
+
+	for_each_subsys(cgrp->root, ss)
+		if (ss->pre_destroy) {
+			ret = ss->retry_rmdir(ss, cgrp);
+			if (ret)
+				break;
+		}
+	return ret;
+}
 
 static void free_cgroup_rcu(struct rcu_head *obj)
 {
@@ -2722,7 +2739,8 @@ again:
 
 	if (!cgroup_clear_css_refs(cgrp)) {
 		mutex_unlock(&cgroup_mutex);
-		schedule();
+		if (!cgroup_check_retry_rmdir(cgrp))
+			schedule();
 		finish_wait(&cgroup_rmdir_waitq, &wait);
 		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 		if (signal_pending(current))
Index: fix-rmdir-cgroup/mm/memcontrol.c
===================================================================
--- fix-rmdir-cgroup.orig/mm/memcontrol.c
+++ fix-rmdir-cgroup/mm/memcontrol.c
@@ -1457,8 +1457,6 @@ __mem_cgroup_commit_charge_swapin(struct
 		}
 		rcu_read_unlock();
 	}
-	/* add this page(page_cgroup) to the LRU we want. */
-
 }
 
 void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
@@ -2571,6 +2569,17 @@ static int mem_cgroup_pre_destroy(struct
 	return mem_cgroup_force_empty(mem, false);
 }
 
+static int mem_cgroup_retry_rmdir(struct cgroup_subsys *ss,
+				  struct cgroup *cont)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+
+	if (res_counter_read_u64(&mem->res, RES_USAGE))
+		return 1;
+	return 0;
+}
+
+
 static void mem_cgroup_destroy(struct cgroup_subsys *ss,
 				struct cgroup *cont)
 {
@@ -2610,6 +2619,7 @@ struct cgroup_subsys mem_cgroup_subsys =
 	.subsys_id = mem_cgroup_subsys_id,
 	.create = mem_cgroup_create,
 	.pre_destroy = mem_cgroup_pre_destroy,
+	.retry_rmdir = mem_cgroup_retry_rmdir,
 	.destroy = mem_cgroup_destroy,
 	.populate = mem_cgroup_populate,
 	.attach = mem_cgroup_move_task,
Index: fix-rmdir-cgroup/Documentation/cgroups/cgroups.txt
===================================================================
--- fix-rmdir-cgroup.orig/Documentation/cgroups/cgroups.txt
+++ fix-rmdir-cgroup/Documentation/cgroups/cgroups.txt
@@ -500,6 +500,17 @@ there are not tasks in the cgroup. If pr
 rmdir() will fail with it. From this behavior, pre_destroy() can be
 called multiple times against a cgroup.
 
+int retry_rmdir(struct cgroup_subsys *ss, struct cgroup *cgrp);
+
+Called at rmdir right after the kernel finds there are remaining refcnt on
+subsystems after pre_destroy(). When retry_rmdir() returns 0, the caller enter
+sleep and wakes up when css's refcnt goes down to 0 by css_put().
+When this returns 1, the caller doesn't sleep and retry rmdir immediately.
+This is useful when the subsys knows remaining css's refcnt is not temporal
+and to calling pre_destroy() again is proper way to remove that.
+(or proper way to retrun -EBUSY.)
+
+
 int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	       struct task_struct *task)
 (cgroup_mutex held by caller)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 2/2] memcg: wakeup rmdir waiter if necessary
  2009-06-23  7:07 ` KAMEZAWA Hiroyuki
@ 2009-06-23  7:10   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-23  7:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, nishimura, balbir, lizf, menage, akpm

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

mem_cgroup's pre_destroy() handler tries to reduce its resource usage to 0.
But in some case, a charge comes after pre_destroy and rmdir() never finishes
because the caller of rmdir() sleeps.

This patch wakes up the caller of rmdir() and let it call pre_destroy(), again.

Note: Making pre_destroy() synchrounous is a way, but it will require
some synchronization...as global lock. A method this patch uses is
"do asynchronous and check if necessary". Maybe this works better than global
synchronization if properly commented.

Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/cgroup.h |    7 +++++++
 kernel/cgroup.c        |    5 ++---
 mm/memcontrol.c        |   17 +++++++++++++++++
 3 files changed, 26 insertions(+), 3 deletions(-)

Index: fix-rmdir-cgroup/include/linux/cgroup.h
===================================================================
--- fix-rmdir-cgroup.orig/include/linux/cgroup.h
+++ fix-rmdir-cgroup/include/linux/cgroup.h
@@ -365,6 +365,13 @@ int cgroup_task_count(const struct cgrou
 /* Return true if cgrp is a descendant of the task's cgroup */
 int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task);
 
+void __cgroup_wakeup_rmdir_waiters(void);
+static inline void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
+{
+	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
+		__cgroup_wakeup_rmdir_waiters();
+}
+
 /*
  * Control Group subsystem type.
  * See Documentation/cgroups/cgroups.txt for details
Index: fix-rmdir-cgroup/kernel/cgroup.c
===================================================================
--- fix-rmdir-cgroup.orig/kernel/cgroup.c
+++ fix-rmdir-cgroup/kernel/cgroup.c
@@ -755,10 +755,9 @@ static void cgroup_d_remove_dir(struct d
  */
 DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
 
-static void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
+void __cgroup_wakeup_rmdir_waiters(void)
 {
-	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
-		wake_up_all(&cgroup_rmdir_waitq);
+	wake_up_all(&cgroup_rmdir_waitq);
 }
 
 static int rebind_subsystems(struct cgroupfs_root *root,
Index: fix-rmdir-cgroup/mm/memcontrol.c
===================================================================
--- fix-rmdir-cgroup.orig/mm/memcontrol.c
+++ fix-rmdir-cgroup/mm/memcontrol.c
@@ -1428,6 +1428,7 @@ __mem_cgroup_commit_charge_swapin(struct
 		return;
 	if (!ptr)
 		return;
+	css_get(&ptr->css);
 	pc = lookup_page_cgroup(page);
 	mem_cgroup_lru_del_before_commit_swapcache(page);
 	__mem_cgroup_commit_charge(ptr, pc, ctype);
@@ -1457,6 +1458,13 @@ __mem_cgroup_commit_charge_swapin(struct
 		}
 		rcu_read_unlock();
 	}
+	/*
+	 * At swapin, we may charge against cgroup which has no tasks. Such
+	 * cgroups can be removed by rmdir(). If we do charge after
+	 * pre_destroy(), we should call pre_destroy(), again.
+	 */
+	cgroup_wakeup_rmdir_waiters(ptr->css.cgroup);
+	css_put(&ptr->css);
 }
 
 void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
@@ -1663,6 +1671,7 @@ void mem_cgroup_end_migration(struct mem
 	if (!mem)
 		return;
 
+	css_get(&mem->css);
 	/* at migration success, oldpage->mapping is NULL. */
 	if (oldpage->mapping) {
 		target = oldpage;
@@ -1702,6 +1711,14 @@ void mem_cgroup_end_migration(struct mem
 	 */
 	if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
 		mem_cgroup_uncharge_page(target);
+
+	/*
+	 * At migration, we may charge against cgroup which has no tasks. Such
+	 * cgroups can be removed by rmdir(). If we do charge after
+	 * pre_destroy(), we should call pre_destroy(), again.
+	 */
+	cgroup_wakeup_rmdir_waiters(mem->css.cgroup);
+	css_put(&mem->css);
 }
 
 /*


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 2/2] memcg: wakeup rmdir waiter if necessary
@ 2009-06-23  7:10   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-23  7:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, nishimura, balbir, lizf, menage, akpm

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

mem_cgroup's pre_destroy() handler tries to reduce its resource usage to 0.
But in some case, a charge comes after pre_destroy and rmdir() never finishes
because the caller of rmdir() sleeps.

This patch wakes up the caller of rmdir() and let it call pre_destroy(), again.

Note: Making pre_destroy() synchrounous is a way, but it will require
some synchronization...as global lock. A method this patch uses is
"do asynchronous and check if necessary". Maybe this works better than global
synchronization if properly commented.

Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/cgroup.h |    7 +++++++
 kernel/cgroup.c        |    5 ++---
 mm/memcontrol.c        |   17 +++++++++++++++++
 3 files changed, 26 insertions(+), 3 deletions(-)

Index: fix-rmdir-cgroup/include/linux/cgroup.h
===================================================================
--- fix-rmdir-cgroup.orig/include/linux/cgroup.h
+++ fix-rmdir-cgroup/include/linux/cgroup.h
@@ -365,6 +365,13 @@ int cgroup_task_count(const struct cgrou
 /* Return true if cgrp is a descendant of the task's cgroup */
 int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task);
 
+void __cgroup_wakeup_rmdir_waiters(void);
+static inline void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
+{
+	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
+		__cgroup_wakeup_rmdir_waiters();
+}
+
 /*
  * Control Group subsystem type.
  * See Documentation/cgroups/cgroups.txt for details
Index: fix-rmdir-cgroup/kernel/cgroup.c
===================================================================
--- fix-rmdir-cgroup.orig/kernel/cgroup.c
+++ fix-rmdir-cgroup/kernel/cgroup.c
@@ -755,10 +755,9 @@ static void cgroup_d_remove_dir(struct d
  */
 DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
 
-static void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
+void __cgroup_wakeup_rmdir_waiters(void)
 {
-	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
-		wake_up_all(&cgroup_rmdir_waitq);
+	wake_up_all(&cgroup_rmdir_waitq);
 }
 
 static int rebind_subsystems(struct cgroupfs_root *root,
Index: fix-rmdir-cgroup/mm/memcontrol.c
===================================================================
--- fix-rmdir-cgroup.orig/mm/memcontrol.c
+++ fix-rmdir-cgroup/mm/memcontrol.c
@@ -1428,6 +1428,7 @@ __mem_cgroup_commit_charge_swapin(struct
 		return;
 	if (!ptr)
 		return;
+	css_get(&ptr->css);
 	pc = lookup_page_cgroup(page);
 	mem_cgroup_lru_del_before_commit_swapcache(page);
 	__mem_cgroup_commit_charge(ptr, pc, ctype);
@@ -1457,6 +1458,13 @@ __mem_cgroup_commit_charge_swapin(struct
 		}
 		rcu_read_unlock();
 	}
+	/*
+	 * At swapin, we may charge against cgroup which has no tasks. Such
+	 * cgroups can be removed by rmdir(). If we do charge after
+	 * pre_destroy(), we should call pre_destroy(), again.
+	 */
+	cgroup_wakeup_rmdir_waiters(ptr->css.cgroup);
+	css_put(&ptr->css);
 }
 
 void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
@@ -1663,6 +1671,7 @@ void mem_cgroup_end_migration(struct mem
 	if (!mem)
 		return;
 
+	css_get(&mem->css);
 	/* at migration success, oldpage->mapping is NULL. */
 	if (oldpage->mapping) {
 		target = oldpage;
@@ -1702,6 +1711,14 @@ void mem_cgroup_end_migration(struct mem
 	 */
 	if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
 		mem_cgroup_uncharge_page(target);
+
+	/*
+	 * At migration, we may charge against cgroup which has no tasks. Such
+	 * cgroups can be removed by rmdir(). If we do charge after
+	 * pre_destroy(), we should call pre_destroy(), again.
+	 */
+	cgroup_wakeup_rmdir_waiters(mem->css.cgroup);
+	css_put(&mem->css);
 }
 
 /*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] memcg: cgroup fix rmdir hang
  2009-06-23  7:08   ` KAMEZAWA Hiroyuki
@ 2009-06-24  4:05     ` Daisuke Nishimura
  -1 siblings, 0 replies; 28+ messages in thread
From: Daisuke Nishimura @ 2009-06-24  4:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, linux-mm, linux-kernel, balbir, lizf, menage, akpm

On Tue, 23 Jun 2009 16:08:54 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Now, cgroup has a logic to wait until ready-to-rmdir for avoiding
> frequent -EBUSY at rmdir.
>  (See Commit ec64f51545fffbc4cb968f0cea56341a4b07e85a
>   cgroup: fix frequent -EBUSY at rmdir.
> 
> Nishimura-san reported bad case for waiting and This is a fix to
> make it reliable. A thread waiting for thread cannot be waken up
> when a refcnt gotten by css_tryget() isn't put immediately.
> (Original code assumed css_put() will be called soon.)
> 
> memcg has this case and this is a fix for the problem. This adds
> retry_rmdir() callback to subsys and check we can sleep or not.
> 
> Note: another solution will be adding "rmdir state" to subsys.
> But it will be much complicated than this do-enough-check solution.
> 
> Changelog v1 -> v2:
>  - splitted into 2 patches. This just includes retry_rmdir() modification.
> 
> Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Looks good to me.

	Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

I've been testing with [2/2] applied (because both of these patches are
necessary to fix this problem completely), and it works fine so far.
But I want to test in more test cases.


Thanks,
Daisuke Nishimura.

> ---
>  Documentation/cgroups/cgroups.txt |   11 +++++++++++
>  include/linux/cgroup.h            |    1 +
>  kernel/cgroup.c                   |   20 +++++++++++++++++++-
>  mm/memcontrol.c                   |   14 ++++++++++++--
>  4 files changed, 43 insertions(+), 3 deletions(-)
> 
> Index: fix-rmdir-cgroup/include/linux/cgroup.h
> ===================================================================
> --- fix-rmdir-cgroup.orig/include/linux/cgroup.h
> +++ fix-rmdir-cgroup/include/linux/cgroup.h
> @@ -374,6 +374,7 @@ struct cgroup_subsys {
>  	struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss,
>  						  struct cgroup *cgrp);
>  	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> +	int (*retry_rmdir)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>  	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>  	int (*can_attach)(struct cgroup_subsys *ss,
>  			  struct cgroup *cgrp, struct task_struct *tsk);
> Index: fix-rmdir-cgroup/kernel/cgroup.c
> ===================================================================
> --- fix-rmdir-cgroup.orig/kernel/cgroup.c
> +++ fix-rmdir-cgroup/kernel/cgroup.c
> @@ -636,6 +636,23 @@ static int cgroup_call_pre_destroy(struc
>  		}
>  	return ret;
>  }
> +/*
> + * Call subsys's retry_rmdir() handler. If this returns non-Zero, we retry
> + * rmdir immediately and call pre_destroy again.
> + */
> +static int cgroup_check_retry_rmdir(struct cgroup *cgrp)
> +{
> +	struct cgroup_subsys *ss;
> +	int ret = 0;
> +
> +	for_each_subsys(cgrp->root, ss)
> +		if (ss->pre_destroy) {
> +			ret = ss->retry_rmdir(ss, cgrp);
> +			if (ret)
> +				break;
> +		}
> +	return ret;
> +}
>  
>  static void free_cgroup_rcu(struct rcu_head *obj)
>  {
> @@ -2722,7 +2739,8 @@ again:
>  
>  	if (!cgroup_clear_css_refs(cgrp)) {
>  		mutex_unlock(&cgroup_mutex);
> -		schedule();
> +		if (!cgroup_check_retry_rmdir(cgrp))
> +			schedule();
>  		finish_wait(&cgroup_rmdir_waitq, &wait);
>  		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  		if (signal_pending(current))
> Index: fix-rmdir-cgroup/mm/memcontrol.c
> ===================================================================
> --- fix-rmdir-cgroup.orig/mm/memcontrol.c
> +++ fix-rmdir-cgroup/mm/memcontrol.c
> @@ -1457,8 +1457,6 @@ __mem_cgroup_commit_charge_swapin(struct
>  		}
>  		rcu_read_unlock();
>  	}
> -	/* add this page(page_cgroup) to the LRU we want. */
> -
>  }
>  
>  void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> @@ -2571,6 +2569,17 @@ static int mem_cgroup_pre_destroy(struct
>  	return mem_cgroup_force_empty(mem, false);
>  }
>  
> +static int mem_cgroup_retry_rmdir(struct cgroup_subsys *ss,
> +				  struct cgroup *cont)
> +{
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> +
> +	if (res_counter_read_u64(&mem->res, RES_USAGE))
> +		return 1;
> +	return 0;
> +}
> +
> +
>  static void mem_cgroup_destroy(struct cgroup_subsys *ss,
>  				struct cgroup *cont)
>  {
> @@ -2610,6 +2619,7 @@ struct cgroup_subsys mem_cgroup_subsys =
>  	.subsys_id = mem_cgroup_subsys_id,
>  	.create = mem_cgroup_create,
>  	.pre_destroy = mem_cgroup_pre_destroy,
> +	.retry_rmdir = mem_cgroup_retry_rmdir,
>  	.destroy = mem_cgroup_destroy,
>  	.populate = mem_cgroup_populate,
>  	.attach = mem_cgroup_move_task,
> Index: fix-rmdir-cgroup/Documentation/cgroups/cgroups.txt
> ===================================================================
> --- fix-rmdir-cgroup.orig/Documentation/cgroups/cgroups.txt
> +++ fix-rmdir-cgroup/Documentation/cgroups/cgroups.txt
> @@ -500,6 +500,17 @@ there are not tasks in the cgroup. If pr
>  rmdir() will fail with it. From this behavior, pre_destroy() can be
>  called multiple times against a cgroup.
>  
> +int retry_rmdir(struct cgroup_subsys *ss, struct cgroup *cgrp);
> +
> +Called at rmdir right after the kernel finds there are remaining refcnt on
> +subsystems after pre_destroy(). When retry_rmdir() returns 0, the caller enter
> +sleep and wakes up when css's refcnt goes down to 0 by css_put().
> +When this returns 1, the caller doesn't sleep and retry rmdir immediately.
> +This is useful when the subsys knows remaining css's refcnt is not temporal
> +and to calling pre_destroy() again is proper way to remove that.
> +(or proper way to retrun -EBUSY.)
> +
> +
>  int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
>  	       struct task_struct *task)
>  (cgroup_mutex held by caller)
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] memcg: cgroup fix rmdir hang
@ 2009-06-24  4:05     ` Daisuke Nishimura
  0 siblings, 0 replies; 28+ messages in thread
From: Daisuke Nishimura @ 2009-06-24  4:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, linux-mm, linux-kernel, balbir, lizf, menage, akpm

On Tue, 23 Jun 2009 16:08:54 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Now, cgroup has a logic to wait until ready-to-rmdir for avoiding
> frequent -EBUSY at rmdir.
>  (See Commit ec64f51545fffbc4cb968f0cea56341a4b07e85a
>   cgroup: fix frequent -EBUSY at rmdir.
> 
> Nishimura-san reported bad case for waiting and This is a fix to
> make it reliable. A thread waiting for thread cannot be waken up
> when a refcnt gotten by css_tryget() isn't put immediately.
> (Original code assumed css_put() will be called soon.)
> 
> memcg has this case and this is a fix for the problem. This adds
> retry_rmdir() callback to subsys and check we can sleep or not.
> 
> Note: another solution will be adding "rmdir state" to subsys.
> But it will be much complicated than this do-enough-check solution.
> 
> Changelog v1 -> v2:
>  - splitted into 2 patches. This just includes retry_rmdir() modification.
> 
> Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Looks good to me.

	Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

I've been testing with [2/2] applied (because both of these patches are
necessary to fix this problem completely), and it works fine so far.
But I want to test in more test cases.


Thanks,
Daisuke Nishimura.

> ---
>  Documentation/cgroups/cgroups.txt |   11 +++++++++++
>  include/linux/cgroup.h            |    1 +
>  kernel/cgroup.c                   |   20 +++++++++++++++++++-
>  mm/memcontrol.c                   |   14 ++++++++++++--
>  4 files changed, 43 insertions(+), 3 deletions(-)
> 
> Index: fix-rmdir-cgroup/include/linux/cgroup.h
> ===================================================================
> --- fix-rmdir-cgroup.orig/include/linux/cgroup.h
> +++ fix-rmdir-cgroup/include/linux/cgroup.h
> @@ -374,6 +374,7 @@ struct cgroup_subsys {
>  	struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss,
>  						  struct cgroup *cgrp);
>  	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> +	int (*retry_rmdir)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>  	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>  	int (*can_attach)(struct cgroup_subsys *ss,
>  			  struct cgroup *cgrp, struct task_struct *tsk);
> Index: fix-rmdir-cgroup/kernel/cgroup.c
> ===================================================================
> --- fix-rmdir-cgroup.orig/kernel/cgroup.c
> +++ fix-rmdir-cgroup/kernel/cgroup.c
> @@ -636,6 +636,23 @@ static int cgroup_call_pre_destroy(struc
>  		}
>  	return ret;
>  }
> +/*
> + * Call subsys's retry_rmdir() handler. If this returns non-Zero, we retry
> + * rmdir immediately and call pre_destroy again.
> + */
> +static int cgroup_check_retry_rmdir(struct cgroup *cgrp)
> +{
> +	struct cgroup_subsys *ss;
> +	int ret = 0;
> +
> +	for_each_subsys(cgrp->root, ss)
> +		if (ss->pre_destroy) {
> +			ret = ss->retry_rmdir(ss, cgrp);
> +			if (ret)
> +				break;
> +		}
> +	return ret;
> +}
>  
>  static void free_cgroup_rcu(struct rcu_head *obj)
>  {
> @@ -2722,7 +2739,8 @@ again:
>  
>  	if (!cgroup_clear_css_refs(cgrp)) {
>  		mutex_unlock(&cgroup_mutex);
> -		schedule();
> +		if (!cgroup_check_retry_rmdir(cgrp))
> +			schedule();
>  		finish_wait(&cgroup_rmdir_waitq, &wait);
>  		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  		if (signal_pending(current))
> Index: fix-rmdir-cgroup/mm/memcontrol.c
> ===================================================================
> --- fix-rmdir-cgroup.orig/mm/memcontrol.c
> +++ fix-rmdir-cgroup/mm/memcontrol.c
> @@ -1457,8 +1457,6 @@ __mem_cgroup_commit_charge_swapin(struct
>  		}
>  		rcu_read_unlock();
>  	}
> -	/* add this page(page_cgroup) to the LRU we want. */
> -
>  }
>  
>  void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> @@ -2571,6 +2569,17 @@ static int mem_cgroup_pre_destroy(struct
>  	return mem_cgroup_force_empty(mem, false);
>  }
>  
> +static int mem_cgroup_retry_rmdir(struct cgroup_subsys *ss,
> +				  struct cgroup *cont)
> +{
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> +
> +	if (res_counter_read_u64(&mem->res, RES_USAGE))
> +		return 1;
> +	return 0;
> +}
> +
> +
>  static void mem_cgroup_destroy(struct cgroup_subsys *ss,
>  				struct cgroup *cont)
>  {
> @@ -2610,6 +2619,7 @@ struct cgroup_subsys mem_cgroup_subsys =
>  	.subsys_id = mem_cgroup_subsys_id,
>  	.create = mem_cgroup_create,
>  	.pre_destroy = mem_cgroup_pre_destroy,
> +	.retry_rmdir = mem_cgroup_retry_rmdir,
>  	.destroy = mem_cgroup_destroy,
>  	.populate = mem_cgroup_populate,
>  	.attach = mem_cgroup_move_task,
> Index: fix-rmdir-cgroup/Documentation/cgroups/cgroups.txt
> ===================================================================
> --- fix-rmdir-cgroup.orig/Documentation/cgroups/cgroups.txt
> +++ fix-rmdir-cgroup/Documentation/cgroups/cgroups.txt
> @@ -500,6 +500,17 @@ there are not tasks in the cgroup. If pr
>  rmdir() will fail with it. From this behavior, pre_destroy() can be
>  called multiple times against a cgroup.
>  
> +int retry_rmdir(struct cgroup_subsys *ss, struct cgroup *cgrp);
> +
> +Called at rmdir right after the kernel finds there are remaining refcnt on
> +subsystems after pre_destroy(). When retry_rmdir() returns 0, the caller enter
> +sleep and wakes up when css's refcnt goes down to 0 by css_put().
> +When this returns 1, the caller doesn't sleep and retry rmdir immediately.
> +This is useful when the subsys knows remaining css's refcnt is not temporal
> +and to calling pre_destroy() again is proper way to remove that.
> +(or proper way to retrun -EBUSY.)
> +
> +
>  int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
>  	       struct task_struct *task)
>  (cgroup_mutex held by caller)
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/2] memcg: cgroup fix rmdir hang
  2009-06-23  7:07 ` KAMEZAWA Hiroyuki
@ 2009-06-25 21:28   ` Andrew Morton
  -1 siblings, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2009-06-25 21:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, balbir, lizf, menage

On Tue, 23 Jun 2009 16:07:20 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> previous discussion was this => http://marc.info/?t=124478543600001&r=1&w=2
> 
> This patch tries to fix problem as
>   - rmdir can sleep very very long if swap entry is shared between multiple
>     cgroups
> 
> Now, cgroup's rmdir path does following
> 
> ==
> again:
> 	check there are no tasks and children group.
> 	call pre_destroy()
> 	check css's refcnt
> 	if (refcnt > 0) {
> 		sleep until css's refcnt goes down to 0.
> 		goto again
> 	}
> ==
> 
> Unfortunately, memory cgroup does following at charge.
> 
> 	css_get(&memcg->css)
> 	....
> 	charge(memcg) (increase USAGE)
> 	...
> And this "memcg" is not necessary to include the caller, task.
> 
> pre_destroy() tries to reduce memory usage until USAGE goes down to 0.
> Then, there is a race that
> 	- css's refcnt > 0 (and memcg's usage > 0)
> 	- rmdir() caller sleeps until css->refcnt goes down 0.
> 	- But to make css->refcnt be 0, pre_destroy() should be called again.
> 
> This patch tries to fix this in asyhcnrounos way (i.e. without big lock.)
> Any comments are welcome.
> 

Do you believe that these fixes should be backported into 2.6.30.x?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/2] memcg: cgroup fix rmdir hang
@ 2009-06-25 21:28   ` Andrew Morton
  0 siblings, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2009-06-25 21:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, balbir, lizf, menage

On Tue, 23 Jun 2009 16:07:20 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> previous discussion was this => http://marc.info/?t=124478543600001&r=1&w=2
> 
> This patch tries to fix problem as
>   - rmdir can sleep very very long if swap entry is shared between multiple
>     cgroups
> 
> Now, cgroup's rmdir path does following
> 
> ==
> again:
> 	check there are no tasks and children group.
> 	call pre_destroy()
> 	check css's refcnt
> 	if (refcnt > 0) {
> 		sleep until css's refcnt goes down to 0.
> 		goto again
> 	}
> ==
> 
> Unfortunately, memory cgroup does following at charge.
> 
> 	css_get(&memcg->css)
> 	....
> 	charge(memcg) (increase USAGE)
> 	...
> And this "memcg" is not necessary to include the caller, task.
> 
> pre_destroy() tries to reduce memory usage until USAGE goes down to 0.
> Then, there is a race that
> 	- css's refcnt > 0 (and memcg's usage > 0)
> 	- rmdir() caller sleeps until css->refcnt goes down 0.
> 	- But to make css->refcnt be 0, pre_destroy() should be called again.
> 
> This patch tries to fix this in asyhcnrounos way (i.e. without big lock.)
> Any comments are welcome.
> 

Do you believe that these fixes should be backported into 2.6.30.x?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] memcg: cgroup fix rmdir hang
  2009-06-23  7:08   ` KAMEZAWA Hiroyuki
@ 2009-06-25 21:50     ` Andrew Morton
  -1 siblings, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2009-06-25 21:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kamezawa.hiroyu, linux-mm, linux-kernel, nishimura, balbir, lizf, menage

On Tue, 23 Jun 2009 16:08:54 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Now, cgroup has a logic to wait until ready-to-rmdir for avoiding
> frequent -EBUSY at rmdir.
>  (See Commit ec64f51545fffbc4cb968f0cea56341a4b07e85a
>   cgroup: fix frequent -EBUSY at rmdir.
> 
> Nishimura-san reported bad case for waiting and This is a fix to
> make it reliable. A thread waiting for thread cannot be waken up
> when a refcnt gotten by css_tryget() isn't put immediately.
> (Original code assumed css_put() will be called soon.)
> 
> memcg has this case and this is a fix for the problem. This adds
> retry_rmdir() callback to subsys and check we can sleep or not.
> 
> Note: another solution will be adding "rmdir state" to subsys.
> But it will be much complicated than this do-enough-check solution.
> 

A few issues..

Firstly, this code (both before and after the patch) looks like a
rather horrid hack.  

<ooh look, a comment!>

	/*
	 * css_put/get is provided for subsys to grab refcnt to css. In typical
	 * case, subsystem has no reference after pre_destroy(). But, under
	 * hierarchy management, some *temporal* refcnt can be hold.
	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
	 * is called when css_put() is called and refcnt goes down to 0.
	 */

(The correct word here is "temporary").

Where and under what circumstances is this temporary reference taken? 
Is there any way in which we can fix all this properly, so that the
directory removal will happen deterministically, without needing the
in-kernel polling loop?

ie: refcounting?

(I have a vague feeling that I've asked all this before.  But that's OK
- the code's still horrid ;))


> Index: fix-rmdir-cgroup/include/linux/cgroup.h
> ===================================================================
> --- fix-rmdir-cgroup.orig/include/linux/cgroup.h
> +++ fix-rmdir-cgroup/include/linux/cgroup.h
> @@ -374,6 +374,7 @@ struct cgroup_subsys {
>  	struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss,
>  						  struct cgroup *cgrp);
>  	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> +	int (*retry_rmdir)(struct cgroup_subsys *ss, struct cgroup *cgrp);

This is poorly named.  The reader will expect that a function called
"retry_rmdir" will, umm, retry an rmdir.

But this function doesn't do that.  It's a predicate which the caller
will use to determine whether the caller should retry the rmdir.

A better name would be should_retry_rmdir(), for example.

But even that isn't very good, because "should_retry_rmdir()" implies
that the caller will only use this function for a single purpose.  The
callee shouldn't assume this!

So can we come up with a name which accurately reflects what the
function actually does?  Like "has_remaining_references()", or somesuch?

Also, making the return value `bool' would have come clarification
benefits.

>  	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>  	int (*can_attach)(struct cgroup_subsys *ss,
>  			  struct cgroup *cgrp, struct task_struct *tsk);
> Index: fix-rmdir-cgroup/kernel/cgroup.c
> ===================================================================
> --- fix-rmdir-cgroup.orig/kernel/cgroup.c
> +++ fix-rmdir-cgroup/kernel/cgroup.c
> @@ -636,6 +636,23 @@ static int cgroup_call_pre_destroy(struc
>  		}
>  	return ret;
>  }
> +/*
> + * Call subsys's retry_rmdir() handler. If this returns non-Zero, we retry
> + * rmdir immediately and call pre_destroy again.
> + */
> +static int cgroup_check_retry_rmdir(struct cgroup *cgrp)
> +{
> +	struct cgroup_subsys *ss;
> +	int ret = 0;
> +
> +	for_each_subsys(cgrp->root, ss)
> +		if (ss->pre_destroy) {
> +			ret = ss->retry_rmdir(ss, cgrp);
> +			if (ret)
> +				break;
> +		}
> +	return ret;
> +}

There's an important and subtle precondition for this function: it is
called in state TASK_INTERRUPTIBLE.  This means that the ->retry_rmdir
handler must be careful to not disturb that state.  For if that
function were to accidentally enter state TASK_RUNNING (say, it does a
mutex_lock/unlock) then the kernel could enter a busy loop and would use
lots of CPU time.  I guess that code comments are sufficient to cover
this.  It's a property of ->retry_rmdir, really.

Also, what sense does it make to call ->retry_rmdir() if ->pre_destroy
is non-NULL?  Was that actually intentional?  If so, it is strange to link
those two fields in this way.  The retry_rmdir() documentation didn't describe
this.

>  static void free_cgroup_rcu(struct rcu_head *obj)
>  {
> @@ -2722,7 +2739,8 @@ again:
>  
>  	if (!cgroup_clear_css_refs(cgrp)) {
>  		mutex_unlock(&cgroup_mutex);
> -		schedule();
> +		if (!cgroup_check_retry_rmdir(cgrp))
> +			schedule();
>  		finish_wait(&cgroup_rmdir_waitq, &wait);
>  		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  		if (signal_pending(current))
>
> ...
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] memcg: cgroup fix rmdir hang
@ 2009-06-25 21:50     ` Andrew Morton
  0 siblings, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2009-06-25 21:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, balbir, lizf, menage

On Tue, 23 Jun 2009 16:08:54 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Now, cgroup has a logic to wait until ready-to-rmdir for avoiding
> frequent -EBUSY at rmdir.
>  (See Commit ec64f51545fffbc4cb968f0cea56341a4b07e85a
>   cgroup: fix frequent -EBUSY at rmdir.
> 
> Nishimura-san reported bad case for waiting and This is a fix to
> make it reliable. A thread waiting for thread cannot be waken up
> when a refcnt gotten by css_tryget() isn't put immediately.
> (Original code assumed css_put() will be called soon.)
> 
> memcg has this case and this is a fix for the problem. This adds
> retry_rmdir() callback to subsys and check we can sleep or not.
> 
> Note: another solution will be adding "rmdir state" to subsys.
> But it will be much complicated than this do-enough-check solution.
> 

A few issues..

Firstly, this code (both before and after the patch) looks like a
rather horrid hack.  

<ooh look, a comment!>

	/*
	 * css_put/get is provided for subsys to grab refcnt to css. In typical
	 * case, subsystem has no reference after pre_destroy(). But, under
	 * hierarchy management, some *temporal* refcnt can be hold.
	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
	 * is called when css_put() is called and refcnt goes down to 0.
	 */

(The correct word here is "temporary").

Where and under what circumstances is this temporary reference taken? 
Is there any way in which we can fix all this properly, so that the
directory removal will happen deterministically, without needing the
in-kernel polling loop?

ie: refcounting?

(I have a vague feeling that I've asked all this before.  But that's OK
- the code's still horrid ;))


> Index: fix-rmdir-cgroup/include/linux/cgroup.h
> ===================================================================
> --- fix-rmdir-cgroup.orig/include/linux/cgroup.h
> +++ fix-rmdir-cgroup/include/linux/cgroup.h
> @@ -374,6 +374,7 @@ struct cgroup_subsys {
>  	struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss,
>  						  struct cgroup *cgrp);
>  	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> +	int (*retry_rmdir)(struct cgroup_subsys *ss, struct cgroup *cgrp);

This is poorly named.  The reader will expect that a function called
"retry_rmdir" will, umm, retry an rmdir.

But this function doesn't do that.  It's a predicate which the caller
will use to determine whether the caller should retry the rmdir.

A better name would be should_retry_rmdir(), for example.

But even that isn't very good, because "should_retry_rmdir()" implies
that the caller will only use this function for a single purpose.  The
callee shouldn't assume this!

So can we come up with a name which accurately reflects what the
function actually does?  Like "has_remaining_references()", or somesuch?

Also, making the return value `bool' would have come clarification
benefits.

>  	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>  	int (*can_attach)(struct cgroup_subsys *ss,
>  			  struct cgroup *cgrp, struct task_struct *tsk);
> Index: fix-rmdir-cgroup/kernel/cgroup.c
> ===================================================================
> --- fix-rmdir-cgroup.orig/kernel/cgroup.c
> +++ fix-rmdir-cgroup/kernel/cgroup.c
> @@ -636,6 +636,23 @@ static int cgroup_call_pre_destroy(struc
>  		}
>  	return ret;
>  }
> +/*
> + * Call subsys's retry_rmdir() handler. If this returns non-Zero, we retry
> + * rmdir immediately and call pre_destroy again.
> + */
> +static int cgroup_check_retry_rmdir(struct cgroup *cgrp)
> +{
> +	struct cgroup_subsys *ss;
> +	int ret = 0;
> +
> +	for_each_subsys(cgrp->root, ss)
> +		if (ss->pre_destroy) {
> +			ret = ss->retry_rmdir(ss, cgrp);
> +			if (ret)
> +				break;
> +		}
> +	return ret;
> +}

There's an important and subtle precondition for this function: it is
called in state TASK_INTERRUPTIBLE.  This means that the ->retry_rmdir
handler must be careful to not disturb that state.  For if that
function were to accidentally enter state TASK_RUNNING (say, it does a
mutex_lock/unlock) then the kernel could enter a busy loop and would use
lots of CPU time.  I guess that code comments are sufficient to cover
this.  It's a property of ->retry_rmdir, really.

Also, what sense does it make to call ->retry_rmdir() if ->pre_destroy
is non-NULL?  Was that actually intentional?  If so, it is strange to link
those two fields in this way.  The retry_rmdir() documentation didn't describe
this.

>  static void free_cgroup_rcu(struct rcu_head *obj)
>  {
> @@ -2722,7 +2739,8 @@ again:
>  
>  	if (!cgroup_clear_css_refs(cgrp)) {
>  		mutex_unlock(&cgroup_mutex);
> -		schedule();
> +		if (!cgroup_check_retry_rmdir(cgrp))
> +			schedule();
>  		finish_wait(&cgroup_rmdir_waitq, &wait);
>  		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  		if (signal_pending(current))
>
> ...
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/2] memcg: cgroup fix rmdir hang
  2009-06-25 21:28   ` Andrew Morton
@ 2009-06-25 23:29     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-25 23:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, nishimura, balbir, lizf, menage

On Thu, 25 Jun 2009 14:28:09 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 23 Jun 2009 16:07:20 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > previous discussion was this => http://marc.info/?t=124478543600001&r=1&w=2
> > 
> > This patch tries to fix problem as
> >   - rmdir can sleep very very long if swap entry is shared between multiple
> >     cgroups
> > 
> > Now, cgroup's rmdir path does following
> > 
> > ==
> > again:
> > 	check there are no tasks and children group.
> > 	call pre_destroy()
> > 	check css's refcnt
> > 	if (refcnt > 0) {
> > 		sleep until css's refcnt goes down to 0.
> > 		goto again
> > 	}
> > ==
> > 
> > Unfortunately, memory cgroup does following at charge.
> > 
> > 	css_get(&memcg->css)
> > 	....
> > 	charge(memcg) (increase USAGE)
> > 	...
> > And this "memcg" is not necessary to include the caller, task.
> > 
> > pre_destroy() tries to reduce memory usage until USAGE goes down to 0.
> > Then, there is a race that
> > 	- css's refcnt > 0 (and memcg's usage > 0)
> > 	- rmdir() caller sleeps until css->refcnt goes down 0.
> > 	- But to make css->refcnt be 0, pre_destroy() should be called again.
> > 
> > This patch tries to fix this in asyhcnrounos way (i.e. without big lock.)
> > Any comments are welcome.
> > 
> 
> Do you believe that these fixes should be backported into 2.6.30.x?

Yes, I think so. (If it's easy)

To be honest:

To cause the problem,
  - swap cgroup should be shared between cgroup.
  - rmdir should be called in critical chance.

Considering usual usage of cgroup is "container", there will be no share of swap
in typical users. But,  2.6.30 can be a base kernel of a major distro. So,
I hope this in 2.6.30 if we have no difficulties.

Thanks,
-Kame











^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/2] memcg: cgroup fix rmdir hang
@ 2009-06-25 23:29     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-25 23:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, nishimura, balbir, lizf, menage

On Thu, 25 Jun 2009 14:28:09 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 23 Jun 2009 16:07:20 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > previous discussion was this => http://marc.info/?t=124478543600001&r=1&w=2
> > 
> > This patch tries to fix problem as
> >   - rmdir can sleep very very long if swap entry is shared between multiple
> >     cgroups
> > 
> > Now, cgroup's rmdir path does following
> > 
> > ==
> > again:
> > 	check there are no tasks and children group.
> > 	call pre_destroy()
> > 	check css's refcnt
> > 	if (refcnt > 0) {
> > 		sleep until css's refcnt goes down to 0.
> > 		goto again
> > 	}
> > ==
> > 
> > Unfortunately, memory cgroup does following at charge.
> > 
> > 	css_get(&memcg->css)
> > 	....
> > 	charge(memcg) (increase USAGE)
> > 	...
> > And this "memcg" is not necessary to include the caller, task.
> > 
> > pre_destroy() tries to reduce memory usage until USAGE goes down to 0.
> > Then, there is a race that
> > 	- css's refcnt > 0 (and memcg's usage > 0)
> > 	- rmdir() caller sleeps until css->refcnt goes down 0.
> > 	- But to make css->refcnt be 0, pre_destroy() should be called again.
> > 
> > This patch tries to fix this in asyhcnrounos way (i.e. without big lock.)
> > Any comments are welcome.
> > 
> 
> Do you believe that these fixes should be backported into 2.6.30.x?

Yes, I think so. (If it's easy)

To be honest:

To cause the problem,
  - swap cgroup should be shared between cgroup.
  - rmdir should be called in critical chance.

Considering usual usage of cgroup is "container", there will be no share of swap
in typical users. But,  2.6.30 can be a base kernel of a major distro. So,
I hope this in 2.6.30 if we have no difficulties.

Thanks,
-Kame










--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] memcg: cgroup fix rmdir hang
  2009-06-25 21:50     ` Andrew Morton
@ 2009-06-26  0:14       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-26  0:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, nishimura, balbir, lizf, menage

Thank you for review.

On Thu, 25 Jun 2009 14:50:44 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 23 Jun 2009 16:08:54 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > Now, cgroup has a logic to wait until ready-to-rmdir for avoiding
> > frequent -EBUSY at rmdir.
> >  (See Commit ec64f51545fffbc4cb968f0cea56341a4b07e85a
> >   cgroup: fix frequent -EBUSY at rmdir.
> > 
> > Nishimura-san reported bad case for waiting and This is a fix to
> > make it reliable. A thread waiting for thread cannot be waken up
> > when a refcnt gotten by css_tryget() isn't put immediately.
> > (Original code assumed css_put() will be called soon.)
> > 
> > memcg has this case and this is a fix for the problem. This adds
> > retry_rmdir() callback to subsys and check we can sleep or not.
> > 
> > Note: another solution will be adding "rmdir state" to subsys.
> > But it will be much complicated than this do-enough-check solution.
> > 
> 
> A few issues..
> 
> Firstly, this code (both before and after the patch) looks like a
> rather horrid hack.  
> 
> <ooh look, a comment!>
> 
> 	/*
> 	 * css_put/get is provided for subsys to grab refcnt to css. In typical
> 	 * case, subsystem has no reference after pre_destroy(). But, under
> 	 * hierarchy management, some *temporal* refcnt can be hold.
> 	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> 	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> 	 * is called when css_put() is called and refcnt goes down to 0.
> 	 */
> 
> (The correct word here is "temporary").
> 
yes.

> Where and under what circumstances is this temporary reference taken? 

Typical case is when we're using "hierarchy". In following hirerarchy,

	group-A/01
		02
		03

There are several codes which needs to scan A,01,02,03. Even while 03 is
under rmdir(), this scan can happen. Most of cause is memory.stat file
and hierarchical reclaim.

Without this "sleep" logic, we can see very frequent -EBUSY in following
case.

  (Shell-1)# while true; do cat group-A/memory.stat > myfile ; done
  (Shell-2)# rmdir group-A/03

> Is there any way in which we can fix all this properly, so that the
> directory removal will happen deterministically, without needing the
> in-kernel polling loop?
> 
> ie: refcounting?
> 
> (I have a vague feeling that I've asked all this before.  But that's OK
> - the code's still horrid ;))
> 
Sorry..

As far as this "rmdir" problem patch concerns, one way to go is removing
css->ref per page at at all. 

A problem I feel recently is that the cgroup is designed to stand on "task"
and not for others like pages and bios, etc...So making memcg's refcounting
against "page" should be removed. (and no permanent refcnt which requires
pre_destroy())

Option "A" we have (but painful for me) is
 1. revert commit:ec64f51545fffbc4cb968f0cea56341a4b07e85a
 2. remove all css reference per page for memcg. And just use "temporal" ones.
 3. retry commit:ec64f51545fffbc4cb968f0cea56341a4b07e85a

We'll see frequent EBUSY again at rmdir but this can keep codes clean.
And there will be "misaccouted usage against dead cgroup" probelm. This race
will be a difficult one, anyway.


Option "B" is making pre_destroy() stateful.
 1. add cancel_destroy().
 2. When pre_destroy() is called, we record memcg that "we're under rmdir" 
    And don't do any charge after that.
 3. clear "we're under rmdir" when cancel_destroy() is called.

Concern of this option is tons of racy case I can expect.
So, I didn't select this.

> 
> > Index: fix-rmdir-cgroup/include/linux/cgroup.h
> > ===================================================================
> > --- fix-rmdir-cgroup.orig/include/linux/cgroup.h
> > +++ fix-rmdir-cgroup/include/linux/cgroup.h
> > @@ -374,6 +374,7 @@ struct cgroup_subsys {
> >  	struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss,
> >  						  struct cgroup *cgrp);
> >  	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> > +	int (*retry_rmdir)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> 
> This is poorly named.  The reader will expect that a function called
> "retry_rmdir" will, umm, retry an rmdir.
> 
> But this function doesn't do that.  It's a predicate which the caller
> will use to determine whether the caller should retry the rmdir.
> 
> A better name would be should_retry_rmdir(), for example.
> 
Ah, thanks.

> But even that isn't very good, because "should_retry_rmdir()" implies
> that the caller will only use this function for a single purpose.  The
> callee shouldn't assume this!
> 
ok.

> So can we come up with a name which accurately reflects what the
> function actually does?  Like "has_remaining_references()", or somesuch?
> 
> Also, making the return value `bool' would have come clarification
> benefits.
> 
Thank you for suggestions. I'll update this.




> >  	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> >  	int (*can_attach)(struct cgroup_subsys *ss,
> >  			  struct cgroup *cgrp, struct task_struct *tsk);
> > Index: fix-rmdir-cgroup/kernel/cgroup.c
> > ===================================================================
> > --- fix-rmdir-cgroup.orig/kernel/cgroup.c
> > +++ fix-rmdir-cgroup/kernel/cgroup.c
> > @@ -636,6 +636,23 @@ static int cgroup_call_pre_destroy(struc
> >  		}
> >  	return ret;
> >  }
> > +/*
> > + * Call subsys's retry_rmdir() handler. If this returns non-Zero, we retry
> > + * rmdir immediately and call pre_destroy again.
> > + */
> > +static int cgroup_check_retry_rmdir(struct cgroup *cgrp)
> > +{
> > +	struct cgroup_subsys *ss;
> > +	int ret = 0;
> > +
> > +	for_each_subsys(cgrp->root, ss)
> > +		if (ss->pre_destroy) {
> > +			ret = ss->retry_rmdir(ss, cgrp);
> > +			if (ret)
> > +				break;
> > +		}
> > +	return ret;
> > +}
> 
> There's an important and subtle precondition for this function: it is
> called in state TASK_INTERRUPTIBLE.  This means that the ->retry_rmdir
> handler must be careful to not disturb that state.  For if that
> function were to accidentally enter state TASK_RUNNING (say, it does a
> mutex_lock/unlock) then the kernel could enter a busy loop and would use
> lots of CPU time.  I guess that code comments are sufficient to cover
> this.  It's a property of ->retry_rmdir, really.
> 
> Also, what sense does it make to call ->retry_rmdir() if ->pre_destroy
> is non-NULL?  Was that actually intentional?  If so, it is strange to link
> those two fields in this way.  The retry_rmdir() documentation didn't describe
> this.
> 
ok, How about this ?

rename retry_rmdir() as
	bool should_rerun_pre_destroy(cgroup, subsys).
Then, function name describes its functionality.

Hmm, the code is being complicated/dirty day by day..(most of reason is me..)


Thanks,
-Kame


> >  static void free_cgroup_rcu(struct rcu_head *obj)
> >  {
> > @@ -2722,7 +2739,8 @@ again:
> >  
> >  	if (!cgroup_clear_css_refs(cgrp)) {
> >  		mutex_unlock(&cgroup_mutex);
> > -		schedule();
> > +		if (!cgroup_check_retry_rmdir(cgrp))
> > +			schedule();
> >  		finish_wait(&cgroup_rmdir_waitq, &wait);
> >  		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> >  		if (signal_pending(current))
> >
> > ...
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] memcg: cgroup fix rmdir hang
@ 2009-06-26  0:14       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-26  0:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, nishimura, balbir, lizf, menage

Thank you for review.

On Thu, 25 Jun 2009 14:50:44 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 23 Jun 2009 16:08:54 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > Now, cgroup has a logic to wait until ready-to-rmdir for avoiding
> > frequent -EBUSY at rmdir.
> >  (See Commit ec64f51545fffbc4cb968f0cea56341a4b07e85a
> >   cgroup: fix frequent -EBUSY at rmdir.
> > 
> > Nishimura-san reported bad case for waiting and This is a fix to
> > make it reliable. A thread waiting for thread cannot be waken up
> > when a refcnt gotten by css_tryget() isn't put immediately.
> > (Original code assumed css_put() will be called soon.)
> > 
> > memcg has this case and this is a fix for the problem. This adds
> > retry_rmdir() callback to subsys and check we can sleep or not.
> > 
> > Note: another solution will be adding "rmdir state" to subsys.
> > But it will be much complicated than this do-enough-check solution.
> > 
> 
> A few issues..
> 
> Firstly, this code (both before and after the patch) looks like a
> rather horrid hack.  
> 
> <ooh look, a comment!>
> 
> 	/*
> 	 * css_put/get is provided for subsys to grab refcnt to css. In typical
> 	 * case, subsystem has no reference after pre_destroy(). But, under
> 	 * hierarchy management, some *temporal* refcnt can be hold.
> 	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> 	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> 	 * is called when css_put() is called and refcnt goes down to 0.
> 	 */
> 
> (The correct word here is "temporary").
> 
yes.

> Where and under what circumstances is this temporary reference taken? 

Typical case is when we're using "hierarchy". In following hirerarchy,

	group-A/01
		02
		03

There are several codes which needs to scan A,01,02,03. Even while 03 is
under rmdir(), this scan can happen. Most of cause is memory.stat file
and hierarchical reclaim.

Without this "sleep" logic, we can see very frequent -EBUSY in following
case.

  (Shell-1)# while true; do cat group-A/memory.stat > myfile ; done
  (Shell-2)# rmdir group-A/03

> Is there any way in which we can fix all this properly, so that the
> directory removal will happen deterministically, without needing the
> in-kernel polling loop?
> 
> ie: refcounting?
> 
> (I have a vague feeling that I've asked all this before.  But that's OK
> - the code's still horrid ;))
> 
Sorry..

As far as this "rmdir" problem patch concerns, one way to go is removing
css->ref per page at at all. 

A problem I feel recently is that the cgroup is designed to stand on "task"
and not for others like pages and bios, etc...So making memcg's refcounting
against "page" should be removed. (and no permanent refcnt which requires
pre_destroy())

Option "A" we have (but painful for me) is
 1. revert commit:ec64f51545fffbc4cb968f0cea56341a4b07e85a
 2. remove all css reference per page for memcg. And just use "temporal" ones.
 3. retry commit:ec64f51545fffbc4cb968f0cea56341a4b07e85a

We'll see frequent EBUSY again at rmdir but this can keep codes clean.
And there will be "misaccouted usage against dead cgroup" probelm. This race
will be a difficult one, anyway.


Option "B" is making pre_destroy() stateful.
 1. add cancel_destroy().
 2. When pre_destroy() is called, we record memcg that "we're under rmdir" 
    And don't do any charge after that.
 3. clear "we're under rmdir" when cancel_destroy() is called.

Concern of this option is tons of racy case I can expect.
So, I didn't select this.

> 
> > Index: fix-rmdir-cgroup/include/linux/cgroup.h
> > ===================================================================
> > --- fix-rmdir-cgroup.orig/include/linux/cgroup.h
> > +++ fix-rmdir-cgroup/include/linux/cgroup.h
> > @@ -374,6 +374,7 @@ struct cgroup_subsys {
> >  	struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss,
> >  						  struct cgroup *cgrp);
> >  	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> > +	int (*retry_rmdir)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> 
> This is poorly named.  The reader will expect that a function called
> "retry_rmdir" will, umm, retry an rmdir.
> 
> But this function doesn't do that.  It's a predicate which the caller
> will use to determine whether the caller should retry the rmdir.
> 
> A better name would be should_retry_rmdir(), for example.
> 
Ah, thanks.

> But even that isn't very good, because "should_retry_rmdir()" implies
> that the caller will only use this function for a single purpose.  The
> callee shouldn't assume this!
> 
ok.

> So can we come up with a name which accurately reflects what the
> function actually does?  Like "has_remaining_references()", or somesuch?
> 
> Also, making the return value `bool' would have come clarification
> benefits.
> 
Thank you for suggestions. I'll update this.




> >  	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> >  	int (*can_attach)(struct cgroup_subsys *ss,
> >  			  struct cgroup *cgrp, struct task_struct *tsk);
> > Index: fix-rmdir-cgroup/kernel/cgroup.c
> > ===================================================================
> > --- fix-rmdir-cgroup.orig/kernel/cgroup.c
> > +++ fix-rmdir-cgroup/kernel/cgroup.c
> > @@ -636,6 +636,23 @@ static int cgroup_call_pre_destroy(struc
> >  		}
> >  	return ret;
> >  }
> > +/*
> > + * Call subsys's retry_rmdir() handler. If this returns non-Zero, we retry
> > + * rmdir immediately and call pre_destroy again.
> > + */
> > +static int cgroup_check_retry_rmdir(struct cgroup *cgrp)
> > +{
> > +	struct cgroup_subsys *ss;
> > +	int ret = 0;
> > +
> > +	for_each_subsys(cgrp->root, ss)
> > +		if (ss->pre_destroy) {
> > +			ret = ss->retry_rmdir(ss, cgrp);
> > +			if (ret)
> > +				break;
> > +		}
> > +	return ret;
> > +}
> 
> There's an important and subtle precondition for this function: it is
> called in state TASK_INTERRUPTIBLE.  This means that the ->retry_rmdir
> handler must be careful to not disturb that state.  For if that
> function were to accidentally enter state TASK_RUNNING (say, it does a
> mutex_lock/unlock) then the kernel could enter a busy loop and would use
> lots of CPU time.  I guess that code comments are sufficient to cover
> this.  It's a property of ->retry_rmdir, really.
> 
> Also, what sense does it make to call ->retry_rmdir() if ->pre_destroy
> is non-NULL?  Was that actually intentional?  If so, it is strange to link
> those two fields in this way.  The retry_rmdir() documentation didn't describe
> this.
> 
ok, How about this ?

rename retry_rmdir() as
	bool should_rerun_pre_destroy(cgroup, subsys).
Then, function name describes its functionality.

Hmm, the code is being complicated/dirty day by day..(most of reason is me..)


Thanks,
-Kame


> >  static void free_cgroup_rcu(struct rcu_head *obj)
> >  {
> > @@ -2722,7 +2739,8 @@ again:
> >  
> >  	if (!cgroup_clear_css_refs(cgrp)) {
> >  		mutex_unlock(&cgroup_mutex);
> > -		schedule();
> > +		if (!cgroup_check_retry_rmdir(cgrp))
> > +			schedule();
> >  		finish_wait(&cgroup_rmdir_waitq, &wait);
> >  		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> >  		if (signal_pending(current))
> >
> > ...
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH] memcg: cgroup fix rmdir hang
  2009-06-23  7:07 ` KAMEZAWA Hiroyuki
@ 2009-06-26  5:10   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-26  5:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, nishimura, balbir, lizf, menage, akpm

I hope this will be a final bullet..
I myself think this one is enough simple and good.
I'm sorry that we need test again.

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

After commit: cgroup: fix frequent -EBUSY at rmdir
	      ec64f51545fffbc4cb968f0cea56341a4b07e85a
cgroup's rmdir (especially against memcg) doesn't return -EBUSY
by temporal ref counts. That commit expects all refs after pre_destroy()
is temporary but...it wasn't. Then, rmdir can wait permanently.
This patch tries to fix that and change followings.

 - set CGRP_WAIT_ON_RMDIR flag before pre_destroy().
 - clear CGRP_WAIT_ON_RMDIR flag when the subsys finds racy case.
   if there are sleeping ones, wakes them up.
 - rmdir() sleeps only when CGRP_WAIT_ON_RMDIR flag is set.

Changelog v2->v3:
  - removed retry_rmdir() callback.
  - make use of CGRP_WAIT_ON_RMDIR flag more.

Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/cgroup.h |   13 +++++++++++++
 kernel/cgroup.c        |   38 ++++++++++++++++++++++----------------
 mm/memcontrol.c        |   25 +++++++++++++++++++++++--
 3 files changed, 58 insertions(+), 18 deletions(-)

Index: mmotm-2.6.31-Jun25/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.31-Jun25.orig/include/linux/cgroup.h
+++ mmotm-2.6.31-Jun25/include/linux/cgroup.h
@@ -366,6 +366,19 @@ int cgroup_task_count(const struct cgrou
 int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task);
 
 /*
+ * Allow to use CGRP_WAIT_ON_RMDIR flag to check race with rmdir() for subsys.
+ * Subsys can call this function if it's necessary to call pre_destroy() again
+ * because it adds not-temporary refs to css after or while pre_destroy().
+ * The caller of this function should use css_tryget(), too.
+ */
+void __cgroup_wakeup_rmdir_waiters(void);
+static inline void cgroup_wakeup_rmdir_waiters(struct cgroup *cgrp)
+{
+	if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
+		__cgroup_wakeup_rmdir_waiters();
+}
+
+/*
  * Control Group subsystem type.
  * See Documentation/cgroups/cgroups.txt for details
  */
Index: mmotm-2.6.31-Jun25/kernel/cgroup.c
===================================================================
--- mmotm-2.6.31-Jun25.orig/kernel/cgroup.c
+++ mmotm-2.6.31-Jun25/kernel/cgroup.c
@@ -734,14 +734,13 @@ static void cgroup_d_remove_dir(struct d
  * reference to css->refcnt. In general, this refcnt is expected to goes down
  * to zero, soon.
  *
- * CGRP_WAIT_ON_RMDIR flag is modified under cgroup's inode->i_mutex;
+ * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
  */
 DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
 
-static void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
+void __cgroup_wakeup_rmdir_waiters(void)
 {
-	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
-		wake_up_all(&cgroup_rmdir_waitq);
+	wake_up_all(&cgroup_rmdir_waitq);
 }
 
 static int rebind_subsystems(struct cgroupfs_root *root,
@@ -2696,33 +2695,40 @@ again:
 	mutex_unlock(&cgroup_mutex);
 
 	/*
+	 * css_put/get is provided for subsys to grab refcnt to css. In typical
+	 * case, subsystem has no reference after pre_destroy(). But, under
+	 * hierarchy management, some *temporal* refcnt can be hold.
+	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
+	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
+	 * is called when css_put() is called and refcnt goes down to 0.
+	 * And this WAIT_ON_RMDIR flag is cleared when subsys detect a race
+	 * condition under pre_destroy()->rmdir. If flag is cleared, we need
+	 * to call pre_destroy(), again.
+	 */
+	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
+
+	/*
 	 * Call pre_destroy handlers of subsys. Notify subsystems
 	 * that rmdir() request comes.
 	 */
 	ret = cgroup_call_pre_destroy(cgrp);
-	if (ret)
+	if (ret) {
+		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 		return ret;
+	}
 
 	mutex_lock(&cgroup_mutex);
 	parent = cgrp->parent;
 	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
+		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 		mutex_unlock(&cgroup_mutex);
 		return -EBUSY;
 	}
-	/*
-	 * css_put/get is provided for subsys to grab refcnt to css. In typical
-	 * case, subsystem has no reference after pre_destroy(). But, under
-	 * hierarchy management, some *temporal* refcnt can be hold.
-	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
-	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
-	 * is called when css_put() is called and refcnt goes down to 0.
-	 */
-	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
-
 	if (!cgroup_clear_css_refs(cgrp)) {
 		mutex_unlock(&cgroup_mutex);
-		schedule();
+		if (test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))
+			schedule();
 		finish_wait(&cgroup_rmdir_waitq, &wait);
 		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 		if (signal_pending(current))
Index: mmotm-2.6.31-Jun25/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Jun25.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Jun25/mm/memcontrol.c
@@ -1234,6 +1234,12 @@ static int mem_cgroup_move_account(struc
 	ret = 0;
 out:
 	unlock_page_cgroup(pc);
+	/*
+	 * We charges against "to" which may not have any tasks. Then, "to"
+	 * can be under rmdir(). But in current implementation, caller of
+	 * this function is just force_empty() and it's garanteed that
+	 * "to" is never removed. So, we don't check rmdir status here.
+	 */
 	return ret;
 }
 
@@ -1455,6 +1461,7 @@ __mem_cgroup_commit_charge_swapin(struct
 		return;
 	if (!ptr)
 		return;
+	css_get(&ptr->css);
 	pc = lookup_page_cgroup(page);
 	mem_cgroup_lru_del_before_commit_swapcache(page);
 	__mem_cgroup_commit_charge(ptr, pc, ctype);
@@ -1484,7 +1491,13 @@ __mem_cgroup_commit_charge_swapin(struct
 		}
 		rcu_read_unlock();
 	}
-	/* add this page(page_cgroup) to the LRU we want. */
+	/*
+	 * At swapin, we may charge account against cgroup which has no tasks.
+	 * So, rmdir()->pre_destroy() can be called while we do this charge.
+	 * In that case, we need to call pre_destroy() again. check it here.
+	 */
+	cgroup_wakeup_rmdir_waiters(ptr->css.cgroup);
+	css_put(&ptr->css);
 
 }
 
@@ -1691,7 +1704,7 @@ void mem_cgroup_end_migration(struct mem
 
 	if (!mem)
 		return;
-
+	css_get(&mem->css);
 	/* at migration success, oldpage->mapping is NULL. */
 	if (oldpage->mapping) {
 		target = oldpage;
@@ -1731,6 +1744,14 @@ void mem_cgroup_end_migration(struct mem
 	 */
 	if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
 		mem_cgroup_uncharge_page(target);
+	/*
+	 * At migration, we may charge account against cgroup which has no tasks
+	 * So, rmdir()->pre_destroy() can be called while we do this charge.
+	 * In that case, we need to call pre_destroy() again. check it here.
+	 */
+	cgroup_wakeup_rmdir_waiters(mem->css.cgroup);
+	css_put(&mem->css);
+
 }
 
 /*


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH] memcg: cgroup fix rmdir hang
@ 2009-06-26  5:10   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-26  5:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, nishimura, balbir, lizf, menage, akpm

I hope this will be a final bullet..
I myself think this one is enough simple and good.
I'm sorry that we need test again.

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

After commit: cgroup: fix frequent -EBUSY at rmdir
	      ec64f51545fffbc4cb968f0cea56341a4b07e85a
cgroup's rmdir (especially against memcg) doesn't return -EBUSY
by temporal ref counts. That commit expects all refs after pre_destroy()
is temporary but...it wasn't. Then, rmdir can wait permanently.
This patch tries to fix that and change followings.

 - set CGRP_WAIT_ON_RMDIR flag before pre_destroy().
 - clear CGRP_WAIT_ON_RMDIR flag when the subsys finds racy case.
   if there are sleeping ones, wakes them up.
 - rmdir() sleeps only when CGRP_WAIT_ON_RMDIR flag is set.

Changelog v2->v3:
  - removed retry_rmdir() callback.
  - make use of CGRP_WAIT_ON_RMDIR flag more.

Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/cgroup.h |   13 +++++++++++++
 kernel/cgroup.c        |   38 ++++++++++++++++++++++----------------
 mm/memcontrol.c        |   25 +++++++++++++++++++++++--
 3 files changed, 58 insertions(+), 18 deletions(-)

Index: mmotm-2.6.31-Jun25/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.31-Jun25.orig/include/linux/cgroup.h
+++ mmotm-2.6.31-Jun25/include/linux/cgroup.h
@@ -366,6 +366,19 @@ int cgroup_task_count(const struct cgrou
 int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task);
 
 /*
+ * Allow to use CGRP_WAIT_ON_RMDIR flag to check race with rmdir() for subsys.
+ * Subsys can call this function if it's necessary to call pre_destroy() again
+ * because it adds not-temporary refs to css after or while pre_destroy().
+ * The caller of this function should use css_tryget(), too.
+ */
+void __cgroup_wakeup_rmdir_waiters(void);
+static inline void cgroup_wakeup_rmdir_waiters(struct cgroup *cgrp)
+{
+	if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
+		__cgroup_wakeup_rmdir_waiters();
+}
+
+/*
  * Control Group subsystem type.
  * See Documentation/cgroups/cgroups.txt for details
  */
Index: mmotm-2.6.31-Jun25/kernel/cgroup.c
===================================================================
--- mmotm-2.6.31-Jun25.orig/kernel/cgroup.c
+++ mmotm-2.6.31-Jun25/kernel/cgroup.c
@@ -734,14 +734,13 @@ static void cgroup_d_remove_dir(struct d
  * reference to css->refcnt. In general, this refcnt is expected to goes down
  * to zero, soon.
  *
- * CGRP_WAIT_ON_RMDIR flag is modified under cgroup's inode->i_mutex;
+ * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
  */
 DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
 
-static void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
+void __cgroup_wakeup_rmdir_waiters(void)
 {
-	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
-		wake_up_all(&cgroup_rmdir_waitq);
+	wake_up_all(&cgroup_rmdir_waitq);
 }
 
 static int rebind_subsystems(struct cgroupfs_root *root,
@@ -2696,33 +2695,40 @@ again:
 	mutex_unlock(&cgroup_mutex);
 
 	/*
+	 * css_put/get is provided for subsys to grab refcnt to css. In typical
+	 * case, subsystem has no reference after pre_destroy(). But, under
+	 * hierarchy management, some *temporal* refcnt can be hold.
+	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
+	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
+	 * is called when css_put() is called and refcnt goes down to 0.
+	 * And this WAIT_ON_RMDIR flag is cleared when subsys detect a race
+	 * condition under pre_destroy()->rmdir. If flag is cleared, we need
+	 * to call pre_destroy(), again.
+	 */
+	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
+
+	/*
 	 * Call pre_destroy handlers of subsys. Notify subsystems
 	 * that rmdir() request comes.
 	 */
 	ret = cgroup_call_pre_destroy(cgrp);
-	if (ret)
+	if (ret) {
+		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 		return ret;
+	}
 
 	mutex_lock(&cgroup_mutex);
 	parent = cgrp->parent;
 	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
+		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 		mutex_unlock(&cgroup_mutex);
 		return -EBUSY;
 	}
-	/*
-	 * css_put/get is provided for subsys to grab refcnt to css. In typical
-	 * case, subsystem has no reference after pre_destroy(). But, under
-	 * hierarchy management, some *temporal* refcnt can be hold.
-	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
-	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
-	 * is called when css_put() is called and refcnt goes down to 0.
-	 */
-	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
-
 	if (!cgroup_clear_css_refs(cgrp)) {
 		mutex_unlock(&cgroup_mutex);
-		schedule();
+		if (test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))
+			schedule();
 		finish_wait(&cgroup_rmdir_waitq, &wait);
 		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 		if (signal_pending(current))
Index: mmotm-2.6.31-Jun25/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Jun25.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Jun25/mm/memcontrol.c
@@ -1234,6 +1234,12 @@ static int mem_cgroup_move_account(struc
 	ret = 0;
 out:
 	unlock_page_cgroup(pc);
+	/*
+	 * We charges against "to" which may not have any tasks. Then, "to"
+	 * can be under rmdir(). But in current implementation, caller of
+	 * this function is just force_empty() and it's garanteed that
+	 * "to" is never removed. So, we don't check rmdir status here.
+	 */
 	return ret;
 }
 
@@ -1455,6 +1461,7 @@ __mem_cgroup_commit_charge_swapin(struct
 		return;
 	if (!ptr)
 		return;
+	css_get(&ptr->css);
 	pc = lookup_page_cgroup(page);
 	mem_cgroup_lru_del_before_commit_swapcache(page);
 	__mem_cgroup_commit_charge(ptr, pc, ctype);
@@ -1484,7 +1491,13 @@ __mem_cgroup_commit_charge_swapin(struct
 		}
 		rcu_read_unlock();
 	}
-	/* add this page(page_cgroup) to the LRU we want. */
+	/*
+	 * At swapin, we may charge account against cgroup which has no tasks.
+	 * So, rmdir()->pre_destroy() can be called while we do this charge.
+	 * In that case, we need to call pre_destroy() again. check it here.
+	 */
+	cgroup_wakeup_rmdir_waiters(ptr->css.cgroup);
+	css_put(&ptr->css);
 
 }
 
@@ -1691,7 +1704,7 @@ void mem_cgroup_end_migration(struct mem
 
 	if (!mem)
 		return;
-
+	css_get(&mem->css);
 	/* at migration success, oldpage->mapping is NULL. */
 	if (oldpage->mapping) {
 		target = oldpage;
@@ -1731,6 +1744,14 @@ void mem_cgroup_end_migration(struct mem
 	 */
 	if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
 		mem_cgroup_uncharge_page(target);
+	/*
+	 * At migration, we may charge account against cgroup which has no tasks
+	 * So, rmdir()->pre_destroy() can be called while we do this charge.
+	 * In that case, we need to call pre_destroy() again. check it here.
+	 */
+	cgroup_wakeup_rmdir_waiters(mem->css.cgroup);
+	css_put(&mem->css);
+
 }
 
 /*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] memcg: cgroup fix rmdir hang
  2009-06-26  5:10   ` KAMEZAWA Hiroyuki
@ 2009-06-26  6:00     ` Daisuke Nishimura
  -1 siblings, 0 replies; 28+ messages in thread
From: Daisuke Nishimura @ 2009-06-26  6:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, balbir, lizf, menage, akpm, Daisuke Nishimura

On Fri, 26 Jun 2009 14:10:20 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> I hope this will be a final bullet..
> I myself think this one is enough simple and good.
I think so too :)
Using test_and_clear_bit() and checking CGRP_WAIT_ON_RMDIR before sleeping
would be a good idea to make the patch simple.

> I'm sorry that we need test again.
No problem.
I'll test this one this weekend.


Thanks,
Daisuke Nishimura.

> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> After commit: cgroup: fix frequent -EBUSY at rmdir
> 	      ec64f51545fffbc4cb968f0cea56341a4b07e85a
> cgroup's rmdir (especially against memcg) doesn't return -EBUSY
> by temporal ref counts. That commit expects all refs after pre_destroy()
> is temporary but...it wasn't. Then, rmdir can wait permanently.
> This patch tries to fix that and change followings.
> 
>  - set CGRP_WAIT_ON_RMDIR flag before pre_destroy().
>  - clear CGRP_WAIT_ON_RMDIR flag when the subsys finds racy case.
>    if there are sleeping ones, wakes them up.
>  - rmdir() sleeps only when CGRP_WAIT_ON_RMDIR flag is set.
> 
> Changelog v2->v3:
>   - removed retry_rmdir() callback.
>   - make use of CGRP_WAIT_ON_RMDIR flag more.
> 
> Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/cgroup.h |   13 +++++++++++++
>  kernel/cgroup.c        |   38 ++++++++++++++++++++++----------------
>  mm/memcontrol.c        |   25 +++++++++++++++++++++++--
>  3 files changed, 58 insertions(+), 18 deletions(-)
> 
> Index: mmotm-2.6.31-Jun25/include/linux/cgroup.h
> ===================================================================
> --- mmotm-2.6.31-Jun25.orig/include/linux/cgroup.h
> +++ mmotm-2.6.31-Jun25/include/linux/cgroup.h
> @@ -366,6 +366,19 @@ int cgroup_task_count(const struct cgrou
>  int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task);
>  
>  /*
> + * Allow to use CGRP_WAIT_ON_RMDIR flag to check race with rmdir() for subsys.
> + * Subsys can call this function if it's necessary to call pre_destroy() again
> + * because it adds not-temporary refs to css after or while pre_destroy().
> + * The caller of this function should use css_tryget(), too.
> + */
> +void __cgroup_wakeup_rmdir_waiters(void);
> +static inline void cgroup_wakeup_rmdir_waiters(struct cgroup *cgrp)
> +{
> +	if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> +		__cgroup_wakeup_rmdir_waiters();
> +}
> +
> +/*
>   * Control Group subsystem type.
>   * See Documentation/cgroups/cgroups.txt for details
>   */
> Index: mmotm-2.6.31-Jun25/kernel/cgroup.c
> ===================================================================
> --- mmotm-2.6.31-Jun25.orig/kernel/cgroup.c
> +++ mmotm-2.6.31-Jun25/kernel/cgroup.c
> @@ -734,14 +734,13 @@ static void cgroup_d_remove_dir(struct d
>   * reference to css->refcnt. In general, this refcnt is expected to goes down
>   * to zero, soon.
>   *
> - * CGRP_WAIT_ON_RMDIR flag is modified under cgroup's inode->i_mutex;
> + * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
>   */
>  DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
>  
> -static void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
> +void __cgroup_wakeup_rmdir_waiters(void)
>  {
> -	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> -		wake_up_all(&cgroup_rmdir_waitq);
> +	wake_up_all(&cgroup_rmdir_waitq);
>  }
>  
>  static int rebind_subsystems(struct cgroupfs_root *root,
> @@ -2696,33 +2695,40 @@ again:
>  	mutex_unlock(&cgroup_mutex);
>  
>  	/*
> +	 * css_put/get is provided for subsys to grab refcnt to css. In typical
> +	 * case, subsystem has no reference after pre_destroy(). But, under
> +	 * hierarchy management, some *temporal* refcnt can be hold.
> +	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> +	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> +	 * is called when css_put() is called and refcnt goes down to 0.
> +	 * And this WAIT_ON_RMDIR flag is cleared when subsys detect a race
> +	 * condition under pre_destroy()->rmdir. If flag is cleared, we need
> +	 * to call pre_destroy(), again.
> +	 */
> +	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> +
> +	/*
>  	 * Call pre_destroy handlers of subsys. Notify subsystems
>  	 * that rmdir() request comes.
>  	 */
>  	ret = cgroup_call_pre_destroy(cgrp);
> -	if (ret)
> +	if (ret) {
> +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  		return ret;
> +	}
>  
>  	mutex_lock(&cgroup_mutex);
>  	parent = cgrp->parent;
>  	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
> +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  		mutex_unlock(&cgroup_mutex);
>  		return -EBUSY;
>  	}
> -	/*
> -	 * css_put/get is provided for subsys to grab refcnt to css. In typical
> -	 * case, subsystem has no reference after pre_destroy(). But, under
> -	 * hierarchy management, some *temporal* refcnt can be hold.
> -	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> -	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> -	 * is called when css_put() is called and refcnt goes down to 0.
> -	 */
> -	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> -
>  	if (!cgroup_clear_css_refs(cgrp)) {
>  		mutex_unlock(&cgroup_mutex);
> -		schedule();
> +		if (test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))
> +			schedule();
>  		finish_wait(&cgroup_rmdir_waitq, &wait);
>  		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  		if (signal_pending(current))
> Index: mmotm-2.6.31-Jun25/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.31-Jun25.orig/mm/memcontrol.c
> +++ mmotm-2.6.31-Jun25/mm/memcontrol.c
> @@ -1234,6 +1234,12 @@ static int mem_cgroup_move_account(struc
>  	ret = 0;
>  out:
>  	unlock_page_cgroup(pc);
> +	/*
> +	 * We charges against "to" which may not have any tasks. Then, "to"
> +	 * can be under rmdir(). But in current implementation, caller of
> +	 * this function is just force_empty() and it's garanteed that
> +	 * "to" is never removed. So, we don't check rmdir status here.
> +	 */
>  	return ret;
>  }
>  
> @@ -1455,6 +1461,7 @@ __mem_cgroup_commit_charge_swapin(struct
>  		return;
>  	if (!ptr)
>  		return;
> +	css_get(&ptr->css);
>  	pc = lookup_page_cgroup(page);
>  	mem_cgroup_lru_del_before_commit_swapcache(page);
>  	__mem_cgroup_commit_charge(ptr, pc, ctype);
> @@ -1484,7 +1491,13 @@ __mem_cgroup_commit_charge_swapin(struct
>  		}
>  		rcu_read_unlock();
>  	}
> -	/* add this page(page_cgroup) to the LRU we want. */
> +	/*
> +	 * At swapin, we may charge account against cgroup which has no tasks.
> +	 * So, rmdir()->pre_destroy() can be called while we do this charge.
> +	 * In that case, we need to call pre_destroy() again. check it here.
> +	 */
> +	cgroup_wakeup_rmdir_waiters(ptr->css.cgroup);
> +	css_put(&ptr->css);
>  
>  }
>  
> @@ -1691,7 +1704,7 @@ void mem_cgroup_end_migration(struct mem
>  
>  	if (!mem)
>  		return;
> -
> +	css_get(&mem->css);
>  	/* at migration success, oldpage->mapping is NULL. */
>  	if (oldpage->mapping) {
>  		target = oldpage;
> @@ -1731,6 +1744,14 @@ void mem_cgroup_end_migration(struct mem
>  	 */
>  	if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
>  		mem_cgroup_uncharge_page(target);
> +	/*
> +	 * At migration, we may charge account against cgroup which has no tasks
> +	 * So, rmdir()->pre_destroy() can be called while we do this charge.
> +	 * In that case, we need to call pre_destroy() again. check it here.
> +	 */
> +	cgroup_wakeup_rmdir_waiters(mem->css.cgroup);
> +	css_put(&mem->css);
> +
>  }
>  
>  /*
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] memcg: cgroup fix rmdir hang
@ 2009-06-26  6:00     ` Daisuke Nishimura
  0 siblings, 0 replies; 28+ messages in thread
From: Daisuke Nishimura @ 2009-06-26  6:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, balbir, lizf, menage, akpm, Daisuke Nishimura

On Fri, 26 Jun 2009 14:10:20 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> I hope this will be a final bullet..
> I myself think this one is enough simple and good.
I think so too :)
Using test_and_clear_bit() and checking CGRP_WAIT_ON_RMDIR before sleeping
would be a good idea to make the patch simple.

> I'm sorry that we need test again.
No problem.
I'll test this one this weekend.


Thanks,
Daisuke Nishimura.

> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> After commit: cgroup: fix frequent -EBUSY at rmdir
> 	      ec64f51545fffbc4cb968f0cea56341a4b07e85a
> cgroup's rmdir (especially against memcg) doesn't return -EBUSY
> by temporal ref counts. That commit expects all refs after pre_destroy()
> is temporary but...it wasn't. Then, rmdir can wait permanently.
> This patch tries to fix that and change followings.
> 
>  - set CGRP_WAIT_ON_RMDIR flag before pre_destroy().
>  - clear CGRP_WAIT_ON_RMDIR flag when the subsys finds racy case.
>    if there are sleeping ones, wakes them up.
>  - rmdir() sleeps only when CGRP_WAIT_ON_RMDIR flag is set.
> 
> Changelog v2->v3:
>   - removed retry_rmdir() callback.
>   - make use of CGRP_WAIT_ON_RMDIR flag more.
> 
> Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/cgroup.h |   13 +++++++++++++
>  kernel/cgroup.c        |   38 ++++++++++++++++++++++----------------
>  mm/memcontrol.c        |   25 +++++++++++++++++++++++--
>  3 files changed, 58 insertions(+), 18 deletions(-)
> 
> Index: mmotm-2.6.31-Jun25/include/linux/cgroup.h
> ===================================================================
> --- mmotm-2.6.31-Jun25.orig/include/linux/cgroup.h
> +++ mmotm-2.6.31-Jun25/include/linux/cgroup.h
> @@ -366,6 +366,19 @@ int cgroup_task_count(const struct cgrou
>  int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task);
>  
>  /*
> + * Allow to use CGRP_WAIT_ON_RMDIR flag to check race with rmdir() for subsys.
> + * Subsys can call this function if it's necessary to call pre_destroy() again
> + * because it adds not-temporary refs to css after or while pre_destroy().
> + * The caller of this function should use css_tryget(), too.
> + */
> +void __cgroup_wakeup_rmdir_waiters(void);
> +static inline void cgroup_wakeup_rmdir_waiters(struct cgroup *cgrp)
> +{
> +	if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> +		__cgroup_wakeup_rmdir_waiters();
> +}
> +
> +/*
>   * Control Group subsystem type.
>   * See Documentation/cgroups/cgroups.txt for details
>   */
> Index: mmotm-2.6.31-Jun25/kernel/cgroup.c
> ===================================================================
> --- mmotm-2.6.31-Jun25.orig/kernel/cgroup.c
> +++ mmotm-2.6.31-Jun25/kernel/cgroup.c
> @@ -734,14 +734,13 @@ static void cgroup_d_remove_dir(struct d
>   * reference to css->refcnt. In general, this refcnt is expected to goes down
>   * to zero, soon.
>   *
> - * CGRP_WAIT_ON_RMDIR flag is modified under cgroup's inode->i_mutex;
> + * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
>   */
>  DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
>  
> -static void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
> +void __cgroup_wakeup_rmdir_waiters(void)
>  {
> -	if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> -		wake_up_all(&cgroup_rmdir_waitq);
> +	wake_up_all(&cgroup_rmdir_waitq);
>  }
>  
>  static int rebind_subsystems(struct cgroupfs_root *root,
> @@ -2696,33 +2695,40 @@ again:
>  	mutex_unlock(&cgroup_mutex);
>  
>  	/*
> +	 * css_put/get is provided for subsys to grab refcnt to css. In typical
> +	 * case, subsystem has no reference after pre_destroy(). But, under
> +	 * hierarchy management, some *temporal* refcnt can be hold.
> +	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> +	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> +	 * is called when css_put() is called and refcnt goes down to 0.
> +	 * And this WAIT_ON_RMDIR flag is cleared when subsys detect a race
> +	 * condition under pre_destroy()->rmdir. If flag is cleared, we need
> +	 * to call pre_destroy(), again.
> +	 */
> +	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> +
> +	/*
>  	 * Call pre_destroy handlers of subsys. Notify subsystems
>  	 * that rmdir() request comes.
>  	 */
>  	ret = cgroup_call_pre_destroy(cgrp);
> -	if (ret)
> +	if (ret) {
> +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  		return ret;
> +	}
>  
>  	mutex_lock(&cgroup_mutex);
>  	parent = cgrp->parent;
>  	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
> +		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  		mutex_unlock(&cgroup_mutex);
>  		return -EBUSY;
>  	}
> -	/*
> -	 * css_put/get is provided for subsys to grab refcnt to css. In typical
> -	 * case, subsystem has no reference after pre_destroy(). But, under
> -	 * hierarchy management, some *temporal* refcnt can be hold.
> -	 * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> -	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> -	 * is called when css_put() is called and refcnt goes down to 0.
> -	 */
> -	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> -
>  	if (!cgroup_clear_css_refs(cgrp)) {
>  		mutex_unlock(&cgroup_mutex);
> -		schedule();
> +		if (test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))
> +			schedule();
>  		finish_wait(&cgroup_rmdir_waitq, &wait);
>  		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>  		if (signal_pending(current))
> Index: mmotm-2.6.31-Jun25/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.31-Jun25.orig/mm/memcontrol.c
> +++ mmotm-2.6.31-Jun25/mm/memcontrol.c
> @@ -1234,6 +1234,12 @@ static int mem_cgroup_move_account(struc
>  	ret = 0;
>  out:
>  	unlock_page_cgroup(pc);
> +	/*
> +	 * We charges against "to" which may not have any tasks. Then, "to"
> +	 * can be under rmdir(). But in current implementation, caller of
> +	 * this function is just force_empty() and it's garanteed that
> +	 * "to" is never removed. So, we don't check rmdir status here.
> +	 */
>  	return ret;
>  }
>  
> @@ -1455,6 +1461,7 @@ __mem_cgroup_commit_charge_swapin(struct
>  		return;
>  	if (!ptr)
>  		return;
> +	css_get(&ptr->css);
>  	pc = lookup_page_cgroup(page);
>  	mem_cgroup_lru_del_before_commit_swapcache(page);
>  	__mem_cgroup_commit_charge(ptr, pc, ctype);
> @@ -1484,7 +1491,13 @@ __mem_cgroup_commit_charge_swapin(struct
>  		}
>  		rcu_read_unlock();
>  	}
> -	/* add this page(page_cgroup) to the LRU we want. */
> +	/*
> +	 * At swapin, we may charge account against cgroup which has no tasks.
> +	 * So, rmdir()->pre_destroy() can be called while we do this charge.
> +	 * In that case, we need to call pre_destroy() again. check it here.
> +	 */
> +	cgroup_wakeup_rmdir_waiters(ptr->css.cgroup);
> +	css_put(&ptr->css);
>  
>  }
>  
> @@ -1691,7 +1704,7 @@ void mem_cgroup_end_migration(struct mem
>  
>  	if (!mem)
>  		return;
> -
> +	css_get(&mem->css);
>  	/* at migration success, oldpage->mapping is NULL. */
>  	if (oldpage->mapping) {
>  		target = oldpage;
> @@ -1731,6 +1744,14 @@ void mem_cgroup_end_migration(struct mem
>  	 */
>  	if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
>  		mem_cgroup_uncharge_page(target);
> +	/*
> +	 * At migration, we may charge account against cgroup which has no tasks
> +	 * So, rmdir()->pre_destroy() can be called while we do this charge.
> +	 * In that case, we need to call pre_destroy() again. check it here.
> +	 */
> +	cgroup_wakeup_rmdir_waiters(mem->css.cgroup);
> +	css_put(&mem->css);
> +
>  }
>  
>  /*
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] memcg: cgroup fix rmdir hang
  2009-06-26  5:10   ` KAMEZAWA Hiroyuki
@ 2009-06-26 20:16     ` Paul Menage
  -1 siblings, 0 replies; 28+ messages in thread
From: Paul Menage @ 2009-06-26 20:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, balbir, lizf, akpm

Hi Kamezawa,

Sorry that I didn't get a chance to look at these patches before now.

On Thu, Jun 25, 2009 at 10:10 PM, KAMEZAWA
Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> I hope this will be a final bullet..
> I myself think this one is enough simple and good.
> I'm sorry that we need test again.
>
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> After commit: cgroup: fix frequent -EBUSY at rmdir
>              ec64f51545fffbc4cb968f0cea56341a4b07e85a
> cgroup's rmdir (especially against memcg) doesn't return -EBUSY
> by temporal ref counts. That commit expects all refs after pre_destroy()
> is temporary but...it wasn't. Then, rmdir can wait permanently.
> This patch tries to fix that and change followings.
>
>  - set CGRP_WAIT_ON_RMDIR flag before pre_destroy().
>  - clear CGRP_WAIT_ON_RMDIR flag when the subsys finds racy case.
>   if there are sleeping ones, wakes them up.
>  - rmdir() sleeps only when CGRP_WAIT_ON_RMDIR flag is set.
>
> Changelog v2->v3:
>  - removed retry_rmdir() callback.
>  - make use of CGRP_WAIT_ON_RMDIR flag more.
>
> Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/cgroup.h |   13 +++++++++++++
>  kernel/cgroup.c        |   38 ++++++++++++++++++++++----------------
>  mm/memcontrol.c        |   25 +++++++++++++++++++++++--
>  3 files changed, 58 insertions(+), 18 deletions(-)
>
> Index: mmotm-2.6.31-Jun25/include/linux/cgroup.h
> ===================================================================
> --- mmotm-2.6.31-Jun25.orig/include/linux/cgroup.h
> +++ mmotm-2.6.31-Jun25/include/linux/cgroup.h
> @@ -366,6 +366,19 @@ int cgroup_task_count(const struct cgrou
>  int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task);
>
>  /*
> + * Allow to use CGRP_WAIT_ON_RMDIR flag to check race with rmdir() for subsys.
> + * Subsys can call this function if it's necessary to call pre_destroy() again
> + * because it adds not-temporary refs to css after or while pre_destroy().
> + * The caller of this function should use css_tryget(), too.
> + */
> +void __cgroup_wakeup_rmdir_waiters(void);
> +static inline void cgroup_wakeup_rmdir_waiters(struct cgroup *cgrp)
> +{
> +       if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> +               __cgroup_wakeup_rmdir_waiters();
> +}
> +
> +/*
>  * Control Group subsystem type.
>  * See Documentation/cgroups/cgroups.txt for details
>  */
> Index: mmotm-2.6.31-Jun25/kernel/cgroup.c
> ===================================================================
> --- mmotm-2.6.31-Jun25.orig/kernel/cgroup.c
> +++ mmotm-2.6.31-Jun25/kernel/cgroup.c
> @@ -734,14 +734,13 @@ static void cgroup_d_remove_dir(struct d
>  * reference to css->refcnt. In general, this refcnt is expected to goes down
>  * to zero, soon.
>  *
> - * CGRP_WAIT_ON_RMDIR flag is modified under cgroup's inode->i_mutex;
> + * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
>  */
>  DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
>
> -static void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
> +void __cgroup_wakeup_rmdir_waiters(void)
>  {

Maybe we should name this wakeup_rmdir_waiter() to emphasise that fact
that there will only be one waiter (the thread doing the rmdir).

> -       if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> -               wake_up_all(&cgroup_rmdir_waitq);
> +       wake_up_all(&cgroup_rmdir_waitq);
>  }
>
>  static int rebind_subsystems(struct cgroupfs_root *root,
> @@ -2696,33 +2695,40 @@ again:
>        mutex_unlock(&cgroup_mutex);
>
>        /*
> +        * css_put/get is provided for subsys to grab refcnt to css. In typical
> +        * case, subsystem has no reference after pre_destroy(). But, under
> +        * hierarchy management, some *temporal* refcnt can be hold.

This sentence needs improvement/clarification. (Yes, I know it's just
copied from later in the file, but it wasn't clear there either :-) )


> +        * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> +        * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> +        * is called when css_put() is called and refcnt goes down to 0.
> +        * And this WAIT_ON_RMDIR flag is cleared when subsys detect a race
> +        * condition under pre_destroy()->rmdir.

What exactly do you mean by pre_destroy()->rmdir ?

> If flag is cleared, we need
> +        * to call pre_destroy(), again.
> +        */
> +       set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> +
> +       /*
>         * Call pre_destroy handlers of subsys. Notify subsystems
>         * that rmdir() request comes.
>         */
>        ret = cgroup_call_pre_destroy(cgrp);
> -       if (ret)
> +       if (ret) {
> +               clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>                return ret;
> +       }
>
>        mutex_lock(&cgroup_mutex);
>        parent = cgrp->parent;
>        if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
> +               clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>                mutex_unlock(&cgroup_mutex);
>                return -EBUSY;
>        }
> -       /*
> -        * css_put/get is provided for subsys to grab refcnt to css. In typical
> -        * case, subsystem has no reference after pre_destroy(). But, under
> -        * hierarchy management, some *temporal* refcnt can be hold.
> -        * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> -        * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> -        * is called when css_put() is called and refcnt goes down to 0.
> -        */
> -       set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>        prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> -
>        if (!cgroup_clear_css_refs(cgrp)) {
>                mutex_unlock(&cgroup_mutex);
> -               schedule();
> +               if (test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))
> +                       schedule();

The need for this test_bit() should be commented, I think - at first
sight it doesn't seem necessary since if we've already been woken up,
then the schedule() is logically a no-op anyway. We only need it
because we are worried about the case where someone calls
wakeup_rmdir_waiters() prior to the prepare_to_wait()


> ===================================================================
> --- mmotm-2.6.31-Jun25.orig/mm/memcontrol.c
> +++ mmotm-2.6.31-Jun25/mm/memcontrol.c
> @@ -1234,6 +1234,12 @@ static int mem_cgroup_move_account(struc
>        ret = 0;
>  out:
>        unlock_page_cgroup(pc);
> +       /*
> +        * We charges against "to" which may not have any tasks. Then, "to"
> +        * can be under rmdir(). But in current implementation, caller of
> +        * this function is just force_empty() and it's garanteed that
> +        * "to" is never removed. So, we don't check rmdir status here.
> +        */
>        return ret;
>  }
>
> @@ -1455,6 +1461,7 @@ __mem_cgroup_commit_charge_swapin(struct
>                return;
>        if (!ptr)
>                return;
> +       css_get(&ptr->css);
>        pc = lookup_page_cgroup(page);
>        mem_cgroup_lru_del_before_commit_swapcache(page);
>        __mem_cgroup_commit_charge(ptr, pc, ctype);
> @@ -1484,7 +1491,13 @@ __mem_cgroup_commit_charge_swapin(struct
>                }
>                rcu_read_unlock();
>        }
> -       /* add this page(page_cgroup) to the LRU we want. */
> +       /*
> +        * At swapin, we may charge account against cgroup which has no tasks.
> +        * So, rmdir()->pre_destroy() can be called while we do this charge.
> +        * In that case, we need to call pre_destroy() again. check it here.
> +        */
> +       cgroup_wakeup_rmdir_waiters(ptr->css.cgroup);
> +       css_put(&ptr->css);
>
>  }
>
> @@ -1691,7 +1704,7 @@ void mem_cgroup_end_migration(struct mem
>
>        if (!mem)
>                return;
> -
> +       css_get(&mem->css);
>        /* at migration success, oldpage->mapping is NULL. */
>        if (oldpage->mapping) {
>                target = oldpage;
> @@ -1731,6 +1744,14 @@ void mem_cgroup_end_migration(struct mem
>         */
>        if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
>                mem_cgroup_uncharge_page(target);
> +       /*
> +        * At migration, we may charge account against cgroup which has no tasks
> +        * So, rmdir()->pre_destroy() can be called while we do this charge.
> +        * In that case, we need to call pre_destroy() again. check it here.
> +        */
> +       cgroup_wakeup_rmdir_waiters(mem->css.cgroup);
> +       css_put(&mem->css);

Having to do an extra get/put purely in order to seems unfortunate -
is that purely to force the cgroup_clear_css_refs() to fail?

Maybe we could wrap the get and the wakeup/put inside functions named
"cgroup_exclude_rmdir()" "cgroup_release_rmdir()" so that the mm
cgroup code is more self-explanatory.

Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] memcg: cgroup fix rmdir hang
@ 2009-06-26 20:16     ` Paul Menage
  0 siblings, 0 replies; 28+ messages in thread
From: Paul Menage @ 2009-06-26 20:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, balbir, lizf, akpm

Hi Kamezawa,

Sorry that I didn't get a chance to look at these patches before now.

On Thu, Jun 25, 2009 at 10:10 PM, KAMEZAWA
Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> I hope this will be a final bullet..
> I myself think this one is enough simple and good.
> I'm sorry that we need test again.
>
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> After commit: cgroup: fix frequent -EBUSY at rmdir
>              ec64f51545fffbc4cb968f0cea56341a4b07e85a
> cgroup's rmdir (especially against memcg) doesn't return -EBUSY
> by temporal ref counts. That commit expects all refs after pre_destroy()
> is temporary but...it wasn't. Then, rmdir can wait permanently.
> This patch tries to fix that and change followings.
>
>  - set CGRP_WAIT_ON_RMDIR flag before pre_destroy().
>  - clear CGRP_WAIT_ON_RMDIR flag when the subsys finds racy case.
>   if there are sleeping ones, wakes them up.
>  - rmdir() sleeps only when CGRP_WAIT_ON_RMDIR flag is set.
>
> Changelog v2->v3:
>  - removed retry_rmdir() callback.
>  - make use of CGRP_WAIT_ON_RMDIR flag more.
>
> Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/cgroup.h |   13 +++++++++++++
>  kernel/cgroup.c        |   38 ++++++++++++++++++++++----------------
>  mm/memcontrol.c        |   25 +++++++++++++++++++++++--
>  3 files changed, 58 insertions(+), 18 deletions(-)
>
> Index: mmotm-2.6.31-Jun25/include/linux/cgroup.h
> ===================================================================
> --- mmotm-2.6.31-Jun25.orig/include/linux/cgroup.h
> +++ mmotm-2.6.31-Jun25/include/linux/cgroup.h
> @@ -366,6 +366,19 @@ int cgroup_task_count(const struct cgrou
>  int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task);
>
>  /*
> + * Allow to use CGRP_WAIT_ON_RMDIR flag to check race with rmdir() for subsys.
> + * Subsys can call this function if it's necessary to call pre_destroy() again
> + * because it adds not-temporary refs to css after or while pre_destroy().
> + * The caller of this function should use css_tryget(), too.
> + */
> +void __cgroup_wakeup_rmdir_waiters(void);
> +static inline void cgroup_wakeup_rmdir_waiters(struct cgroup *cgrp)
> +{
> +       if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> +               __cgroup_wakeup_rmdir_waiters();
> +}
> +
> +/*
>  * Control Group subsystem type.
>  * See Documentation/cgroups/cgroups.txt for details
>  */
> Index: mmotm-2.6.31-Jun25/kernel/cgroup.c
> ===================================================================
> --- mmotm-2.6.31-Jun25.orig/kernel/cgroup.c
> +++ mmotm-2.6.31-Jun25/kernel/cgroup.c
> @@ -734,14 +734,13 @@ static void cgroup_d_remove_dir(struct d
>  * reference to css->refcnt. In general, this refcnt is expected to goes down
>  * to zero, soon.
>  *
> - * CGRP_WAIT_ON_RMDIR flag is modified under cgroup's inode->i_mutex;
> + * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
>  */
>  DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
>
> -static void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
> +void __cgroup_wakeup_rmdir_waiters(void)
>  {

Maybe we should name this wakeup_rmdir_waiter() to emphasise that fact
that there will only be one waiter (the thread doing the rmdir).

> -       if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> -               wake_up_all(&cgroup_rmdir_waitq);
> +       wake_up_all(&cgroup_rmdir_waitq);
>  }
>
>  static int rebind_subsystems(struct cgroupfs_root *root,
> @@ -2696,33 +2695,40 @@ again:
>        mutex_unlock(&cgroup_mutex);
>
>        /*
> +        * css_put/get is provided for subsys to grab refcnt to css. In typical
> +        * case, subsystem has no reference after pre_destroy(). But, under
> +        * hierarchy management, some *temporal* refcnt can be hold.

This sentence needs improvement/clarification. (Yes, I know it's just
copied from later in the file, but it wasn't clear there either :-) )


> +        * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> +        * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> +        * is called when css_put() is called and refcnt goes down to 0.
> +        * And this WAIT_ON_RMDIR flag is cleared when subsys detect a race
> +        * condition under pre_destroy()->rmdir.

What exactly do you mean by pre_destroy()->rmdir ?

> If flag is cleared, we need
> +        * to call pre_destroy(), again.
> +        */
> +       set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> +
> +       /*
>         * Call pre_destroy handlers of subsys. Notify subsystems
>         * that rmdir() request comes.
>         */
>        ret = cgroup_call_pre_destroy(cgrp);
> -       if (ret)
> +       if (ret) {
> +               clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>                return ret;
> +       }
>
>        mutex_lock(&cgroup_mutex);
>        parent = cgrp->parent;
>        if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
> +               clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>                mutex_unlock(&cgroup_mutex);
>                return -EBUSY;
>        }
> -       /*
> -        * css_put/get is provided for subsys to grab refcnt to css. In typical
> -        * case, subsystem has no reference after pre_destroy(). But, under
> -        * hierarchy management, some *temporal* refcnt can be hold.
> -        * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> -        * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> -        * is called when css_put() is called and refcnt goes down to 0.
> -        */
> -       set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>        prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> -
>        if (!cgroup_clear_css_refs(cgrp)) {
>                mutex_unlock(&cgroup_mutex);
> -               schedule();
> +               if (test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))
> +                       schedule();

The need for this test_bit() should be commented, I think - at first
sight it doesn't seem necessary since if we've already been woken up,
then the schedule() is logically a no-op anyway. We only need it
because we are worried about the case where someone calls
wakeup_rmdir_waiters() prior to the prepare_to_wait()


> ===================================================================
> --- mmotm-2.6.31-Jun25.orig/mm/memcontrol.c
> +++ mmotm-2.6.31-Jun25/mm/memcontrol.c
> @@ -1234,6 +1234,12 @@ static int mem_cgroup_move_account(struc
>        ret = 0;
>  out:
>        unlock_page_cgroup(pc);
> +       /*
> +        * We charges against "to" which may not have any tasks. Then, "to"
> +        * can be under rmdir(). But in current implementation, caller of
> +        * this function is just force_empty() and it's garanteed that
> +        * "to" is never removed. So, we don't check rmdir status here.
> +        */
>        return ret;
>  }
>
> @@ -1455,6 +1461,7 @@ __mem_cgroup_commit_charge_swapin(struct
>                return;
>        if (!ptr)
>                return;
> +       css_get(&ptr->css);
>        pc = lookup_page_cgroup(page);
>        mem_cgroup_lru_del_before_commit_swapcache(page);
>        __mem_cgroup_commit_charge(ptr, pc, ctype);
> @@ -1484,7 +1491,13 @@ __mem_cgroup_commit_charge_swapin(struct
>                }
>                rcu_read_unlock();
>        }
> -       /* add this page(page_cgroup) to the LRU we want. */
> +       /*
> +        * At swapin, we may charge account against cgroup which has no tasks.
> +        * So, rmdir()->pre_destroy() can be called while we do this charge.
> +        * In that case, we need to call pre_destroy() again. check it here.
> +        */
> +       cgroup_wakeup_rmdir_waiters(ptr->css.cgroup);
> +       css_put(&ptr->css);
>
>  }
>
> @@ -1691,7 +1704,7 @@ void mem_cgroup_end_migration(struct mem
>
>        if (!mem)
>                return;
> -
> +       css_get(&mem->css);
>        /* at migration success, oldpage->mapping is NULL. */
>        if (oldpage->mapping) {
>                target = oldpage;
> @@ -1731,6 +1744,14 @@ void mem_cgroup_end_migration(struct mem
>         */
>        if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
>                mem_cgroup_uncharge_page(target);
> +       /*
> +        * At migration, we may charge account against cgroup which has no tasks
> +        * So, rmdir()->pre_destroy() can be called while we do this charge.
> +        * In that case, we need to call pre_destroy() again. check it here.
> +        */
> +       cgroup_wakeup_rmdir_waiters(mem->css.cgroup);
> +       css_put(&mem->css);

Having to do an extra get/put purely in order to seems unfortunate -
is that purely to force the cgroup_clear_css_refs() to fail?

Maybe we could wrap the get and the wakeup/put inside functions named
"cgroup_exclude_rmdir()" "cgroup_release_rmdir()" so that the mm
cgroup code is more self-explanatory.

Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] memcg: cgroup fix rmdir hang
  2009-06-26 20:16     ` Paul Menage
@ 2009-06-28 23:50       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-28 23:50 UTC (permalink / raw)
  To: Paul Menage; +Cc: linux-mm, linux-kernel, nishimura, balbir, lizf, akpm

On Fri, 26 Jun 2009 13:16:23 -0700
Paul Menage <menage@google.com> wrote:

> Hi Kamezawa,
> 
> Sorry that I didn't get a chance to look at these patches before now.
> 
> On Thu, Jun 25, 2009 at 10:10 PM, KAMEZAWA
> Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > I hope this will be a final bullet..
> > I myself think this one is enough simple and good.
> > I'm sorry that we need test again.
> >
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > After commit: cgroup: fix frequent -EBUSY at rmdir
> >              ec64f51545fffbc4cb968f0cea56341a4b07e85a
> > cgroup's rmdir (especially against memcg) doesn't return -EBUSY
> > by temporal ref counts. That commit expects all refs after pre_destroy()
> > is temporary but...it wasn't. Then, rmdir can wait permanently.
> > This patch tries to fix that and change followings.
> >
> >  - set CGRP_WAIT_ON_RMDIR flag before pre_destroy().
> >  - clear CGRP_WAIT_ON_RMDIR flag when the subsys finds racy case.
> >   if there are sleeping ones, wakes them up.
> >  - rmdir() sleeps only when CGRP_WAIT_ON_RMDIR flag is set.
> >
> > Changelog v2->v3:
> >  - removed retry_rmdir() callback.
> >  - make use of CGRP_WAIT_ON_RMDIR flag more.
> >
> > Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  include/linux/cgroup.h |   13 +++++++++++++
> >  kernel/cgroup.c        |   38 ++++++++++++++++++++++----------------
> >  mm/memcontrol.c        |   25 +++++++++++++++++++++++--
> >  3 files changed, 58 insertions(+), 18 deletions(-)
> >
> > Index: mmotm-2.6.31-Jun25/include/linux/cgroup.h
> > ===================================================================
> > --- mmotm-2.6.31-Jun25.orig/include/linux/cgroup.h
> > +++ mmotm-2.6.31-Jun25/include/linux/cgroup.h
> > @@ -366,6 +366,19 @@ int cgroup_task_count(const struct cgrou
> >  int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task);
> >
> >  /*
> > + * Allow to use CGRP_WAIT_ON_RMDIR flag to check race with rmdir() for subsys.
> > + * Subsys can call this function if it's necessary to call pre_destroy() again
> > + * because it adds not-temporary refs to css after or while pre_destroy().
> > + * The caller of this function should use css_tryget(), too.
> > + */
> > +void __cgroup_wakeup_rmdir_waiters(void);
> > +static inline void cgroup_wakeup_rmdir_waiters(struct cgroup *cgrp)
> > +{
> > +       if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> > +               __cgroup_wakeup_rmdir_waiters();
> > +}
> > +
> > +/*
> >  * Control Group subsystem type.
> >  * See Documentation/cgroups/cgroups.txt for details
> >  */
> > Index: mmotm-2.6.31-Jun25/kernel/cgroup.c
> > ===================================================================
> > --- mmotm-2.6.31-Jun25.orig/kernel/cgroup.c
> > +++ mmotm-2.6.31-Jun25/kernel/cgroup.c
> > @@ -734,14 +734,13 @@ static void cgroup_d_remove_dir(struct d
> >  * reference to css->refcnt. In general, this refcnt is expected to goes down
> >  * to zero, soon.
> >  *
> > - * CGRP_WAIT_ON_RMDIR flag is modified under cgroup's inode->i_mutex;
> > + * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
> >  */
> >  DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
> >
> > -static void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
> > +void __cgroup_wakeup_rmdir_waiters(void)
> >  {
> 
> Maybe we should name this wakeup_rmdir_waiter() to emphasise that fact
> that there will only be one waiter (the thread doing the rmdir).
> 
Hm, but, there is no guarantee that there will "an" waiter.


> > -       if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> > -               wake_up_all(&cgroup_rmdir_waitq);
> > +       wake_up_all(&cgroup_rmdir_waitq);
> >  }
> >
> >  static int rebind_subsystems(struct cgroupfs_root *root,
> > @@ -2696,33 +2695,40 @@ again:
> >        mutex_unlock(&cgroup_mutex);
> >
> >        /*
> > +        * css_put/get is provided for subsys to grab refcnt to css. In typical
> > +        * case, subsystem has no reference after pre_destroy(). But, under
> > +        * hierarchy management, some *temporal* refcnt can be hold.
> 
> This sentence needs improvement/clarification. (Yes, I know it's just
> copied from later in the file, but it wasn't clear there either :-) )
> 
ok, I'll modify here. How about this ?
==
In general, subsystem has no css->refcnt after pre_destroy(). But in racy cases,
subsystem may have to get css->refcnt after pre_destroy() and it makes rmdir
return with -EBUSY. But we don't like frequent -EBUSY. To avoid that, we use
waitqueue for cgroup's rmdir. CGROUP_WAIT_ON_RMDIR bit is for synchronizing
rmdir waitqueue and subsystem behavior. Please see css_get()/put()/tryget()
implementation, it allows subsystem to avoid unnecessary failure of rmdir().
If css_put()/get()/tryget() is not enough, cgroup_wakeup_rmdir_waiter() can be
used.
==


> 
> > +        * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> > +        * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> > +        * is called when css_put() is called and refcnt goes down to 0.
> > +        * And this WAIT_ON_RMDIR flag is cleared when subsys detect a race
> > +        * condition under pre_destroy()->rmdir.
> 
> What exactly do you mean by pre_destroy()->rmdir ?
> 
Ah, between pre_destroy() and "cgroup is removed" state. I'll remove this.
Maybe above comment is enough.


> > If flag is cleared, we need
> > +        * to call pre_destroy(), again.
> > +        */
> > +       set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > +
> > +       /*
> >         * Call pre_destroy handlers of subsys. Notify subsystems
> >         * that rmdir() request comes.
> >         */
> >        ret = cgroup_call_pre_destroy(cgrp);
> > -       if (ret)
> > +       if (ret) {
> > +               clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> >                return ret;
> > +       }
> >
> >        mutex_lock(&cgroup_mutex);
> >        parent = cgrp->parent;
> >        if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
> > +               clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> >                mutex_unlock(&cgroup_mutex);
> >                return -EBUSY;
> >        }
> > -       /*
> > -        * css_put/get is provided for subsys to grab refcnt to css. In typical
> > -        * case, subsystem has no reference after pre_destroy(). But, under
> > -        * hierarchy management, some *temporal* refcnt can be hold.
> > -        * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> > -        * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> > -        * is called when css_put() is called and refcnt goes down to 0.
> > -        */
> > -       set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> >        prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> > -
> >        if (!cgroup_clear_css_refs(cgrp)) {
> >                mutex_unlock(&cgroup_mutex);
> > -               schedule();
> > +               if (test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))
> > +                       schedule();
> 
> The need for this test_bit() should be commented, I think - at first
> sight it doesn't seem necessary since if we've already been woken up,
> then the schedule() is logically a no-op anyway. We only need it
> because we are worried about the case where someone calls
> wakeup_rmdir_waiters() prior to the prepare_to_wait()
> 
> 
yes. will add comments.


> > ===================================================================
> > --- mmotm-2.6.31-Jun25.orig/mm/memcontrol.c
> > +++ mmotm-2.6.31-Jun25/mm/memcontrol.c
> > @@ -1234,6 +1234,12 @@ static int mem_cgroup_move_account(struc
> >        ret = 0;
> >  out:
> >        unlock_page_cgroup(pc);
> > +       /*
> > +        * We charges against "to" which may not have any tasks. Then, "to"
> > +        * can be under rmdir(). But in current implementation, caller of
> > +        * this function is just force_empty() and it's garanteed that
> > +        * "to" is never removed. So, we don't check rmdir status here.
> > +        */
> >        return ret;
> >  }
> >
> > @@ -1455,6 +1461,7 @@ __mem_cgroup_commit_charge_swapin(struct
> >                return;
> >        if (!ptr)
> >                return;
> > +       css_get(&ptr->css);
> >        pc = lookup_page_cgroup(page);
> >        mem_cgroup_lru_del_before_commit_swapcache(page);
> >        __mem_cgroup_commit_charge(ptr, pc, ctype);
> > @@ -1484,7 +1491,13 @@ __mem_cgroup_commit_charge_swapin(struct
> >                }
> >                rcu_read_unlock();
> >        }
> > -       /* add this page(page_cgroup) to the LRU we want. */
> > +       /*
> > +        * At swapin, we may charge account against cgroup which has no tasks.
> > +        * So, rmdir()->pre_destroy() can be called while we do this charge.
> > +        * In that case, we need to call pre_destroy() again. check it here.
> > +        */
> > +       cgroup_wakeup_rmdir_waiters(ptr->css.cgroup);
> > +       css_put(&ptr->css);
> >
> >  }
> >
> > @@ -1691,7 +1704,7 @@ void mem_cgroup_end_migration(struct mem
> >
> >        if (!mem)
> >                return;
> > -
> > +       css_get(&mem->css);
> >        /* at migration success, oldpage->mapping is NULL. */
> >        if (oldpage->mapping) {
> >                target = oldpage;
> > @@ -1731,6 +1744,14 @@ void mem_cgroup_end_migration(struct mem
> >         */
> >        if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
> >                mem_cgroup_uncharge_page(target);
> > +       /*
> > +        * At migration, we may charge account against cgroup which has no tasks
> > +        * So, rmdir()->pre_destroy() can be called while we do this charge.
> > +        * In that case, we need to call pre_destroy() again. check it here.
> > +        */
> > +       cgroup_wakeup_rmdir_waiters(mem->css.cgroup);
> > +       css_put(&mem->css);
> 
> Having to do an extra get/put purely in order to seems unfortunate -
> is that purely to force the cgroup_clear_css_refs() to fail?
> 
yes.

> Maybe we could wrap the get and the wakeup/put inside functions named
> "cgroup_exclude_rmdir()" "cgroup_release_rmdir()" so that the mm
> cgroup code is more self-explanatory.
> 
Ah, it sounds nice idea. I'll prepare that as [2/2] patch.

Thanks,
-Kame


> Paul
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] memcg: cgroup fix rmdir hang
@ 2009-06-28 23:50       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-28 23:50 UTC (permalink / raw)
  To: Paul Menage; +Cc: linux-mm, linux-kernel, nishimura, balbir, lizf, akpm

On Fri, 26 Jun 2009 13:16:23 -0700
Paul Menage <menage@google.com> wrote:

> Hi Kamezawa,
> 
> Sorry that I didn't get a chance to look at these patches before now.
> 
> On Thu, Jun 25, 2009 at 10:10 PM, KAMEZAWA
> Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > I hope this will be a final bullet..
> > I myself think this one is enough simple and good.
> > I'm sorry that we need test again.
> >
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > After commit: cgroup: fix frequent -EBUSY at rmdir
> > A  A  A  A  A  A  A ec64f51545fffbc4cb968f0cea56341a4b07e85a
> > cgroup's rmdir (especially against memcg) doesn't return -EBUSY
> > by temporal ref counts. That commit expects all refs after pre_destroy()
> > is temporary but...it wasn't. Then, rmdir can wait permanently.
> > This patch tries to fix that and change followings.
> >
> > A - set CGRP_WAIT_ON_RMDIR flag before pre_destroy().
> > A - clear CGRP_WAIT_ON_RMDIR flag when the subsys finds racy case.
> > A  if there are sleeping ones, wakes them up.
> > A - rmdir() sleeps only when CGRP_WAIT_ON_RMDIR flag is set.
> >
> > Changelog v2->v3:
> > A - removed retry_rmdir() callback.
> > A - make use of CGRP_WAIT_ON_RMDIR flag more.
> >
> > Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> > A include/linux/cgroup.h | A  13 +++++++++++++
> > A kernel/cgroup.c A  A  A  A | A  38 ++++++++++++++++++++++----------------
> > A mm/memcontrol.c A  A  A  A | A  25 +++++++++++++++++++++++--
> > A 3 files changed, 58 insertions(+), 18 deletions(-)
> >
> > Index: mmotm-2.6.31-Jun25/include/linux/cgroup.h
> > ===================================================================
> > --- mmotm-2.6.31-Jun25.orig/include/linux/cgroup.h
> > +++ mmotm-2.6.31-Jun25/include/linux/cgroup.h
> > @@ -366,6 +366,19 @@ int cgroup_task_count(const struct cgrou
> > A int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task);
> >
> > A /*
> > + * Allow to use CGRP_WAIT_ON_RMDIR flag to check race with rmdir() for subsys.
> > + * Subsys can call this function if it's necessary to call pre_destroy() again
> > + * because it adds not-temporary refs to css after or while pre_destroy().
> > + * The caller of this function should use css_tryget(), too.
> > + */
> > +void __cgroup_wakeup_rmdir_waiters(void);
> > +static inline void cgroup_wakeup_rmdir_waiters(struct cgroup *cgrp)
> > +{
> > + A  A  A  if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> > + A  A  A  A  A  A  A  __cgroup_wakeup_rmdir_waiters();
> > +}
> > +
> > +/*
> > A * Control Group subsystem type.
> > A * See Documentation/cgroups/cgroups.txt for details
> > A */
> > Index: mmotm-2.6.31-Jun25/kernel/cgroup.c
> > ===================================================================
> > --- mmotm-2.6.31-Jun25.orig/kernel/cgroup.c
> > +++ mmotm-2.6.31-Jun25/kernel/cgroup.c
> > @@ -734,14 +734,13 @@ static void cgroup_d_remove_dir(struct d
> > A * reference to css->refcnt. In general, this refcnt is expected to goes down
> > A * to zero, soon.
> > A *
> > - * CGRP_WAIT_ON_RMDIR flag is modified under cgroup's inode->i_mutex;
> > + * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
> > A */
> > A DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
> >
> > -static void cgroup_wakeup_rmdir_waiters(const struct cgroup *cgrp)
> > +void __cgroup_wakeup_rmdir_waiters(void)
> > A {
> 
> Maybe we should name this wakeup_rmdir_waiter() to emphasise that fact
> that there will only be one waiter (the thread doing the rmdir).
> 
Hm, but, there is no guarantee that there will "an" waiter.


> > - A  A  A  if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> > - A  A  A  A  A  A  A  wake_up_all(&cgroup_rmdir_waitq);
> > + A  A  A  wake_up_all(&cgroup_rmdir_waitq);
> > A }
> >
> > A static int rebind_subsystems(struct cgroupfs_root *root,
> > @@ -2696,33 +2695,40 @@ again:
> > A  A  A  A mutex_unlock(&cgroup_mutex);
> >
> > A  A  A  A /*
> > + A  A  A  A * css_put/get is provided for subsys to grab refcnt to css. In typical
> > + A  A  A  A * case, subsystem has no reference after pre_destroy(). But, under
> > + A  A  A  A * hierarchy management, some *temporal* refcnt can be hold.
> 
> This sentence needs improvement/clarification. (Yes, I know it's just
> copied from later in the file, but it wasn't clear there either :-) )
> 
ok, I'll modify here. How about this ?
==
In general, subsystem has no css->refcnt after pre_destroy(). But in racy cases,
subsystem may have to get css->refcnt after pre_destroy() and it makes rmdir
return with -EBUSY. But we don't like frequent -EBUSY. To avoid that, we use
waitqueue for cgroup's rmdir. CGROUP_WAIT_ON_RMDIR bit is for synchronizing
rmdir waitqueue and subsystem behavior. Please see css_get()/put()/tryget()
implementation, it allows subsystem to avoid unnecessary failure of rmdir().
If css_put()/get()/tryget() is not enough, cgroup_wakeup_rmdir_waiter() can be
used.
==


> 
> > + A  A  A  A * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> > + A  A  A  A * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> > + A  A  A  A * is called when css_put() is called and refcnt goes down to 0.
> > + A  A  A  A * And this WAIT_ON_RMDIR flag is cleared when subsys detect a race
> > + A  A  A  A * condition under pre_destroy()->rmdir.
> 
> What exactly do you mean by pre_destroy()->rmdir ?
> 
Ah, between pre_destroy() and "cgroup is removed" state. I'll remove this.
Maybe above comment is enough.


> > If flag is cleared, we need
> > + A  A  A  A * to call pre_destroy(), again.
> > + A  A  A  A */
> > + A  A  A  set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > +
> > + A  A  A  /*
> > A  A  A  A  * Call pre_destroy handlers of subsys. Notify subsystems
> > A  A  A  A  * that rmdir() request comes.
> > A  A  A  A  */
> > A  A  A  A ret = cgroup_call_pre_destroy(cgrp);
> > - A  A  A  if (ret)
> > + A  A  A  if (ret) {
> > + A  A  A  A  A  A  A  clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > A  A  A  A  A  A  A  A return ret;
> > + A  A  A  }
> >
> > A  A  A  A mutex_lock(&cgroup_mutex);
> > A  A  A  A parent = cgrp->parent;
> > A  A  A  A if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
> > + A  A  A  A  A  A  A  clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > A  A  A  A  A  A  A  A mutex_unlock(&cgroup_mutex);
> > A  A  A  A  A  A  A  A return -EBUSY;
> > A  A  A  A }
> > - A  A  A  /*
> > - A  A  A  A * css_put/get is provided for subsys to grab refcnt to css. In typical
> > - A  A  A  A * case, subsystem has no reference after pre_destroy(). But, under
> > - A  A  A  A * hierarchy management, some *temporal* refcnt can be hold.
> > - A  A  A  A * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> > - A  A  A  A * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> > - A  A  A  A * is called when css_put() is called and refcnt goes down to 0.
> > - A  A  A  A */
> > - A  A  A  set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > A  A  A  A prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> > -
> > A  A  A  A if (!cgroup_clear_css_refs(cgrp)) {
> > A  A  A  A  A  A  A  A mutex_unlock(&cgroup_mutex);
> > - A  A  A  A  A  A  A  schedule();
> > + A  A  A  A  A  A  A  if (test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))
> > + A  A  A  A  A  A  A  A  A  A  A  schedule();
> 
> The need for this test_bit() should be commented, I think - at first
> sight it doesn't seem necessary since if we've already been woken up,
> then the schedule() is logically a no-op anyway. We only need it
> because we are worried about the case where someone calls
> wakeup_rmdir_waiters() prior to the prepare_to_wait()
> 
> 
yes. will add comments.


> > ===================================================================
> > --- mmotm-2.6.31-Jun25.orig/mm/memcontrol.c
> > +++ mmotm-2.6.31-Jun25/mm/memcontrol.c
> > @@ -1234,6 +1234,12 @@ static int mem_cgroup_move_account(struc
> > A  A  A  A ret = 0;
> > A out:
> > A  A  A  A unlock_page_cgroup(pc);
> > + A  A  A  /*
> > + A  A  A  A * We charges against "to" which may not have any tasks. Then, "to"
> > + A  A  A  A * can be under rmdir(). But in current implementation, caller of
> > + A  A  A  A * this function is just force_empty() and it's garanteed that
> > + A  A  A  A * "to" is never removed. So, we don't check rmdir status here.
> > + A  A  A  A */
> > A  A  A  A return ret;
> > A }
> >
> > @@ -1455,6 +1461,7 @@ __mem_cgroup_commit_charge_swapin(struct
> > A  A  A  A  A  A  A  A return;
> > A  A  A  A if (!ptr)
> > A  A  A  A  A  A  A  A return;
> > + A  A  A  css_get(&ptr->css);
> > A  A  A  A pc = lookup_page_cgroup(page);
> > A  A  A  A mem_cgroup_lru_del_before_commit_swapcache(page);
> > A  A  A  A __mem_cgroup_commit_charge(ptr, pc, ctype);
> > @@ -1484,7 +1491,13 @@ __mem_cgroup_commit_charge_swapin(struct
> > A  A  A  A  A  A  A  A }
> > A  A  A  A  A  A  A  A rcu_read_unlock();
> > A  A  A  A }
> > - A  A  A  /* add this page(page_cgroup) to the LRU we want. */
> > + A  A  A  /*
> > + A  A  A  A * At swapin, we may charge account against cgroup which has no tasks.
> > + A  A  A  A * So, rmdir()->pre_destroy() can be called while we do this charge.
> > + A  A  A  A * In that case, we need to call pre_destroy() again. check it here.
> > + A  A  A  A */
> > + A  A  A  cgroup_wakeup_rmdir_waiters(ptr->css.cgroup);
> > + A  A  A  css_put(&ptr->css);
> >
> > A }
> >
> > @@ -1691,7 +1704,7 @@ void mem_cgroup_end_migration(struct mem
> >
> > A  A  A  A if (!mem)
> > A  A  A  A  A  A  A  A return;
> > -
> > + A  A  A  css_get(&mem->css);
> > A  A  A  A /* at migration success, oldpage->mapping is NULL. */
> > A  A  A  A if (oldpage->mapping) {
> > A  A  A  A  A  A  A  A target = oldpage;
> > @@ -1731,6 +1744,14 @@ void mem_cgroup_end_migration(struct mem
> > A  A  A  A  */
> > A  A  A  A if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
> > A  A  A  A  A  A  A  A mem_cgroup_uncharge_page(target);
> > + A  A  A  /*
> > + A  A  A  A * At migration, we may charge account against cgroup which has no tasks
> > + A  A  A  A * So, rmdir()->pre_destroy() can be called while we do this charge.
> > + A  A  A  A * In that case, we need to call pre_destroy() again. check it here.
> > + A  A  A  A */
> > + A  A  A  cgroup_wakeup_rmdir_waiters(mem->css.cgroup);
> > + A  A  A  css_put(&mem->css);
> 
> Having to do an extra get/put purely in order to seems unfortunate -
> is that purely to force the cgroup_clear_css_refs() to fail?
> 
yes.

> Maybe we could wrap the get and the wakeup/put inside functions named
> "cgroup_exclude_rmdir()" "cgroup_release_rmdir()" so that the mm
> cgroup code is more self-explanatory.
> 
Ah, it sounds nice idea. I'll prepare that as [2/2] patch.

Thanks,
-Kame


> Paul
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] memcg: cgroup fix rmdir hang
  2009-06-28 23:50       ` KAMEZAWA Hiroyuki
@ 2009-06-28 23:56         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-28 23:56 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Paul Menage, linux-mm, linux-kernel, nishimura, balbir, lizf, akpm

On Mon, 29 Jun 2009 08:50:26 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > Maybe we should name this wakeup_rmdir_waiter() to emphasise that fact
> > that there will only be one waiter (the thread doing the rmdir).
> > 
> Hm, but, there is no guarantee that there will "an" waiter.
> 
Ignore above comment....I forget inode->mutex is held.

Thanks,
-Kame

> 
> > > -       if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> > > -               wake_up_all(&cgroup_rmdir_waitq);
> > > +       wake_up_all(&cgroup_rmdir_waitq);
> > >  }
> > >
> > >  static int rebind_subsystems(struct cgroupfs_root *root,
> > > @@ -2696,33 +2695,40 @@ again:
> > >        mutex_unlock(&cgroup_mutex);
> > >
> > >        /*
> > > +        * css_put/get is provided for subsys to grab refcnt to css. In typical
> > > +        * case, subsystem has no reference after pre_destroy(). But, under
> > > +        * hierarchy management, some *temporal* refcnt can be hold.
> > 
> > This sentence needs improvement/clarification. (Yes, I know it's just
> > copied from later in the file, but it wasn't clear there either :-) )
> > 
> ok, I'll modify here. How about this ?
> ==
> In general, subsystem has no css->refcnt after pre_destroy(). But in racy cases,
> subsystem may have to get css->refcnt after pre_destroy() and it makes rmdir
> return with -EBUSY. But we don't like frequent -EBUSY. To avoid that, we use
> waitqueue for cgroup's rmdir. CGROUP_WAIT_ON_RMDIR bit is for synchronizing
> rmdir waitqueue and subsystem behavior. Please see css_get()/put()/tryget()
> implementation, it allows subsystem to avoid unnecessary failure of rmdir().
> If css_put()/get()/tryget() is not enough, cgroup_wakeup_rmdir_waiter() can be
> used.
> ==
> 
> 
> > 
> > > +        * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> > > +        * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> > > +        * is called when css_put() is called and refcnt goes down to 0.
> > > +        * And this WAIT_ON_RMDIR flag is cleared when subsys detect a race
> > > +        * condition under pre_destroy()->rmdir.
> > 
> > What exactly do you mean by pre_destroy()->rmdir ?
> > 
> Ah, between pre_destroy() and "cgroup is removed" state. I'll remove this.
> Maybe above comment is enough.
> 
> 
> > > If flag is cleared, we need
> > > +        * to call pre_destroy(), again.
> > > +        */
> > > +       set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > > +
> > > +       /*
> > >         * Call pre_destroy handlers of subsys. Notify subsystems
> > >         * that rmdir() request comes.
> > >         */
> > >        ret = cgroup_call_pre_destroy(cgrp);
> > > -       if (ret)
> > > +       if (ret) {
> > > +               clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > >                return ret;
> > > +       }
> > >
> > >        mutex_lock(&cgroup_mutex);
> > >        parent = cgrp->parent;
> > >        if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
> > > +               clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > >                mutex_unlock(&cgroup_mutex);
> > >                return -EBUSY;
> > >        }
> > > -       /*
> > > -        * css_put/get is provided for subsys to grab refcnt to css. In typical
> > > -        * case, subsystem has no reference after pre_destroy(). But, under
> > > -        * hierarchy management, some *temporal* refcnt can be hold.
> > > -        * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> > > -        * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> > > -        * is called when css_put() is called and refcnt goes down to 0.
> > > -        */
> > > -       set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > >        prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> > > -
> > >        if (!cgroup_clear_css_refs(cgrp)) {
> > >                mutex_unlock(&cgroup_mutex);
> > > -               schedule();
> > > +               if (test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))
> > > +                       schedule();
> > 
> > The need for this test_bit() should be commented, I think - at first
> > sight it doesn't seem necessary since if we've already been woken up,
> > then the schedule() is logically a no-op anyway. We only need it
> > because we are worried about the case where someone calls
> > wakeup_rmdir_waiters() prior to the prepare_to_wait()
> > 
> > 
> yes. will add comments.
> 
> 
> > > ===================================================================
> > > --- mmotm-2.6.31-Jun25.orig/mm/memcontrol.c
> > > +++ mmotm-2.6.31-Jun25/mm/memcontrol.c
> > > @@ -1234,6 +1234,12 @@ static int mem_cgroup_move_account(struc
> > >        ret = 0;
> > >  out:
> > >        unlock_page_cgroup(pc);
> > > +       /*
> > > +        * We charges against "to" which may not have any tasks. Then, "to"
> > > +        * can be under rmdir(). But in current implementation, caller of
> > > +        * this function is just force_empty() and it's garanteed that
> > > +        * "to" is never removed. So, we don't check rmdir status here.
> > > +        */
> > >        return ret;
> > >  }
> > >
> > > @@ -1455,6 +1461,7 @@ __mem_cgroup_commit_charge_swapin(struct
> > >                return;
> > >        if (!ptr)
> > >                return;
> > > +       css_get(&ptr->css);
> > >        pc = lookup_page_cgroup(page);
> > >        mem_cgroup_lru_del_before_commit_swapcache(page);
> > >        __mem_cgroup_commit_charge(ptr, pc, ctype);
> > > @@ -1484,7 +1491,13 @@ __mem_cgroup_commit_charge_swapin(struct
> > >                }
> > >                rcu_read_unlock();
> > >        }
> > > -       /* add this page(page_cgroup) to the LRU we want. */
> > > +       /*
> > > +        * At swapin, we may charge account against cgroup which has no tasks.
> > > +        * So, rmdir()->pre_destroy() can be called while we do this charge.
> > > +        * In that case, we need to call pre_destroy() again. check it here.
> > > +        */
> > > +       cgroup_wakeup_rmdir_waiters(ptr->css.cgroup);
> > > +       css_put(&ptr->css);
> > >
> > >  }
> > >
> > > @@ -1691,7 +1704,7 @@ void mem_cgroup_end_migration(struct mem
> > >
> > >        if (!mem)
> > >                return;
> > > -
> > > +       css_get(&mem->css);
> > >        /* at migration success, oldpage->mapping is NULL. */
> > >        if (oldpage->mapping) {
> > >                target = oldpage;
> > > @@ -1731,6 +1744,14 @@ void mem_cgroup_end_migration(struct mem
> > >         */
> > >        if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
> > >                mem_cgroup_uncharge_page(target);
> > > +       /*
> > > +        * At migration, we may charge account against cgroup which has no tasks
> > > +        * So, rmdir()->pre_destroy() can be called while we do this charge.
> > > +        * In that case, we need to call pre_destroy() again. check it here.
> > > +        */
> > > +       cgroup_wakeup_rmdir_waiters(mem->css.cgroup);
> > > +       css_put(&mem->css);
> > 
> > Having to do an extra get/put purely in order to seems unfortunate -
> > is that purely to force the cgroup_clear_css_refs() to fail?
> > 
> yes.
> 
> > Maybe we could wrap the get and the wakeup/put inside functions named
> > "cgroup_exclude_rmdir()" "cgroup_release_rmdir()" so that the mm
> > cgroup code is more self-explanatory.
> > 
> Ah, it sounds nice idea. I'll prepare that as [2/2] patch.
> 
> Thanks,
> -Kame
> 
> 
> > Paul
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> > 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] memcg: cgroup fix rmdir hang
@ 2009-06-28 23:56         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-28 23:56 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Paul Menage, linux-mm, linux-kernel, nishimura, balbir, lizf, akpm

On Mon, 29 Jun 2009 08:50:26 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > Maybe we should name this wakeup_rmdir_waiter() to emphasise that fact
> > that there will only be one waiter (the thread doing the rmdir).
> > 
> Hm, but, there is no guarantee that there will "an" waiter.
> 
Ignore above comment....I forget inode->mutex is held.

Thanks,
-Kame

> 
> > > - A  A  A  if (unlikely(test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> > > - A  A  A  A  A  A  A  wake_up_all(&cgroup_rmdir_waitq);
> > > + A  A  A  wake_up_all(&cgroup_rmdir_waitq);
> > > A }
> > >
> > > A static int rebind_subsystems(struct cgroupfs_root *root,
> > > @@ -2696,33 +2695,40 @@ again:
> > > A  A  A  A mutex_unlock(&cgroup_mutex);
> > >
> > > A  A  A  A /*
> > > + A  A  A  A * css_put/get is provided for subsys to grab refcnt to css. In typical
> > > + A  A  A  A * case, subsystem has no reference after pre_destroy(). But, under
> > > + A  A  A  A * hierarchy management, some *temporal* refcnt can be hold.
> > 
> > This sentence needs improvement/clarification. (Yes, I know it's just
> > copied from later in the file, but it wasn't clear there either :-) )
> > 
> ok, I'll modify here. How about this ?
> ==
> In general, subsystem has no css->refcnt after pre_destroy(). But in racy cases,
> subsystem may have to get css->refcnt after pre_destroy() and it makes rmdir
> return with -EBUSY. But we don't like frequent -EBUSY. To avoid that, we use
> waitqueue for cgroup's rmdir. CGROUP_WAIT_ON_RMDIR bit is for synchronizing
> rmdir waitqueue and subsystem behavior. Please see css_get()/put()/tryget()
> implementation, it allows subsystem to avoid unnecessary failure of rmdir().
> If css_put()/get()/tryget() is not enough, cgroup_wakeup_rmdir_waiter() can be
> used.
> ==
> 
> 
> > 
> > > + A  A  A  A * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> > > + A  A  A  A * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> > > + A  A  A  A * is called when css_put() is called and refcnt goes down to 0.
> > > + A  A  A  A * And this WAIT_ON_RMDIR flag is cleared when subsys detect a race
> > > + A  A  A  A * condition under pre_destroy()->rmdir.
> > 
> > What exactly do you mean by pre_destroy()->rmdir ?
> > 
> Ah, between pre_destroy() and "cgroup is removed" state. I'll remove this.
> Maybe above comment is enough.
> 
> 
> > > If flag is cleared, we need
> > > + A  A  A  A * to call pre_destroy(), again.
> > > + A  A  A  A */
> > > + A  A  A  set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > > +
> > > + A  A  A  /*
> > > A  A  A  A  * Call pre_destroy handlers of subsys. Notify subsystems
> > > A  A  A  A  * that rmdir() request comes.
> > > A  A  A  A  */
> > > A  A  A  A ret = cgroup_call_pre_destroy(cgrp);
> > > - A  A  A  if (ret)
> > > + A  A  A  if (ret) {
> > > + A  A  A  A  A  A  A  clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > > A  A  A  A  A  A  A  A return ret;
> > > + A  A  A  }
> > >
> > > A  A  A  A mutex_lock(&cgroup_mutex);
> > > A  A  A  A parent = cgrp->parent;
> > > A  A  A  A if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
> > > + A  A  A  A  A  A  A  clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > > A  A  A  A  A  A  A  A mutex_unlock(&cgroup_mutex);
> > > A  A  A  A  A  A  A  A return -EBUSY;
> > > A  A  A  A }
> > > - A  A  A  /*
> > > - A  A  A  A * css_put/get is provided for subsys to grab refcnt to css. In typical
> > > - A  A  A  A * case, subsystem has no reference after pre_destroy(). But, under
> > > - A  A  A  A * hierarchy management, some *temporal* refcnt can be hold.
> > > - A  A  A  A * To avoid returning -EBUSY to a user, waitqueue is used. If subsys
> > > - A  A  A  A * is really busy, it should return -EBUSY at pre_destroy(). wake_up
> > > - A  A  A  A * is called when css_put() is called and refcnt goes down to 0.
> > > - A  A  A  A */
> > > - A  A  A  set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> > > A  A  A  A prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> > > -
> > > A  A  A  A if (!cgroup_clear_css_refs(cgrp)) {
> > > A  A  A  A  A  A  A  A mutex_unlock(&cgroup_mutex);
> > > - A  A  A  A  A  A  A  schedule();
> > > + A  A  A  A  A  A  A  if (test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))
> > > + A  A  A  A  A  A  A  A  A  A  A  schedule();
> > 
> > The need for this test_bit() should be commented, I think - at first
> > sight it doesn't seem necessary since if we've already been woken up,
> > then the schedule() is logically a no-op anyway. We only need it
> > because we are worried about the case where someone calls
> > wakeup_rmdir_waiters() prior to the prepare_to_wait()
> > 
> > 
> yes. will add comments.
> 
> 
> > > ===================================================================
> > > --- mmotm-2.6.31-Jun25.orig/mm/memcontrol.c
> > > +++ mmotm-2.6.31-Jun25/mm/memcontrol.c
> > > @@ -1234,6 +1234,12 @@ static int mem_cgroup_move_account(struc
> > > A  A  A  A ret = 0;
> > > A out:
> > > A  A  A  A unlock_page_cgroup(pc);
> > > + A  A  A  /*
> > > + A  A  A  A * We charges against "to" which may not have any tasks. Then, "to"
> > > + A  A  A  A * can be under rmdir(). But in current implementation, caller of
> > > + A  A  A  A * this function is just force_empty() and it's garanteed that
> > > + A  A  A  A * "to" is never removed. So, we don't check rmdir status here.
> > > + A  A  A  A */
> > > A  A  A  A return ret;
> > > A }
> > >
> > > @@ -1455,6 +1461,7 @@ __mem_cgroup_commit_charge_swapin(struct
> > > A  A  A  A  A  A  A  A return;
> > > A  A  A  A if (!ptr)
> > > A  A  A  A  A  A  A  A return;
> > > + A  A  A  css_get(&ptr->css);
> > > A  A  A  A pc = lookup_page_cgroup(page);
> > > A  A  A  A mem_cgroup_lru_del_before_commit_swapcache(page);
> > > A  A  A  A __mem_cgroup_commit_charge(ptr, pc, ctype);
> > > @@ -1484,7 +1491,13 @@ __mem_cgroup_commit_charge_swapin(struct
> > > A  A  A  A  A  A  A  A }
> > > A  A  A  A  A  A  A  A rcu_read_unlock();
> > > A  A  A  A }
> > > - A  A  A  /* add this page(page_cgroup) to the LRU we want. */
> > > + A  A  A  /*
> > > + A  A  A  A * At swapin, we may charge account against cgroup which has no tasks.
> > > + A  A  A  A * So, rmdir()->pre_destroy() can be called while we do this charge.
> > > + A  A  A  A * In that case, we need to call pre_destroy() again. check it here.
> > > + A  A  A  A */
> > > + A  A  A  cgroup_wakeup_rmdir_waiters(ptr->css.cgroup);
> > > + A  A  A  css_put(&ptr->css);
> > >
> > > A }
> > >
> > > @@ -1691,7 +1704,7 @@ void mem_cgroup_end_migration(struct mem
> > >
> > > A  A  A  A if (!mem)
> > > A  A  A  A  A  A  A  A return;
> > > -
> > > + A  A  A  css_get(&mem->css);
> > > A  A  A  A /* at migration success, oldpage->mapping is NULL. */
> > > A  A  A  A if (oldpage->mapping) {
> > > A  A  A  A  A  A  A  A target = oldpage;
> > > @@ -1731,6 +1744,14 @@ void mem_cgroup_end_migration(struct mem
> > > A  A  A  A  */
> > > A  A  A  A if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
> > > A  A  A  A  A  A  A  A mem_cgroup_uncharge_page(target);
> > > + A  A  A  /*
> > > + A  A  A  A * At migration, we may charge account against cgroup which has no tasks
> > > + A  A  A  A * So, rmdir()->pre_destroy() can be called while we do this charge.
> > > + A  A  A  A * In that case, we need to call pre_destroy() again. check it here.
> > > + A  A  A  A */
> > > + A  A  A  cgroup_wakeup_rmdir_waiters(mem->css.cgroup);
> > > + A  A  A  css_put(&mem->css);
> > 
> > Having to do an extra get/put purely in order to seems unfortunate -
> > is that purely to force the cgroup_clear_css_refs() to fail?
> > 
> yes.
> 
> > Maybe we could wrap the get and the wakeup/put inside functions named
> > "cgroup_exclude_rmdir()" "cgroup_release_rmdir()" so that the mm
> > cgroup code is more self-explanatory.
> > 
> Ah, it sounds nice idea. I'll prepare that as [2/2] patch.
> 
> Thanks,
> -Kame
> 
> 
> > Paul
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> > 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] memcg: cgroup fix rmdir hang
  2009-06-26  6:00     ` Daisuke Nishimura
@ 2009-06-29  4:10       ` Daisuke Nishimura
  -1 siblings, 0 replies; 28+ messages in thread
From: Daisuke Nishimura @ 2009-06-29  4:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, balbir, lizf, menage, akpm, Daisuke Nishimura

On Fri, 26 Jun 2009 15:00:52 +0900, Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> On Fri, 26 Jun 2009 14:10:20 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > I hope this will be a final bullet..
> > I myself think this one is enough simple and good.
> I think so too :)
> Using test_and_clear_bit() and checking CGRP_WAIT_ON_RMDIR before sleeping
> would be a good idea to make the patch simple.
> 
> > I'm sorry that we need test again.
> No problem.
> I'll test this one this weekend.
> 
It has survived my test(rmdir under memory pressure) this weekend.
Please feel free to add my Tested-by.

	Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] memcg: cgroup fix rmdir hang
@ 2009-06-29  4:10       ` Daisuke Nishimura
  0 siblings, 0 replies; 28+ messages in thread
From: Daisuke Nishimura @ 2009-06-29  4:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, balbir, lizf, menage, akpm, Daisuke Nishimura

On Fri, 26 Jun 2009 15:00:52 +0900, Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> On Fri, 26 Jun 2009 14:10:20 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > I hope this will be a final bullet..
> > I myself think this one is enough simple and good.
> I think so too :)
> Using test_and_clear_bit() and checking CGRP_WAIT_ON_RMDIR before sleeping
> would be a good idea to make the patch simple.
> 
> > I'm sorry that we need test again.
> No problem.
> I'll test this one this weekend.
> 
It has survived my test(rmdir under memory pressure) this weekend.
Please feel free to add my Tested-by.

	Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

Thanks,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2009-06-29  4:13 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-06-23  7:07 [PATCH 0/2] memcg: cgroup fix rmdir hang KAMEZAWA Hiroyuki
2009-06-23  7:07 ` KAMEZAWA Hiroyuki
2009-06-23  7:08 ` [PATCH 1/2] " KAMEZAWA Hiroyuki
2009-06-23  7:08   ` KAMEZAWA Hiroyuki
2009-06-24  4:05   ` Daisuke Nishimura
2009-06-24  4:05     ` Daisuke Nishimura
2009-06-25 21:50   ` Andrew Morton
2009-06-25 21:50     ` Andrew Morton
2009-06-26  0:14     ` KAMEZAWA Hiroyuki
2009-06-26  0:14       ` KAMEZAWA Hiroyuki
2009-06-23  7:10 ` [PATCH 2/2] memcg: wakeup rmdir waiter if necessary KAMEZAWA Hiroyuki
2009-06-23  7:10   ` KAMEZAWA Hiroyuki
2009-06-25 21:28 ` [PATCH 0/2] memcg: cgroup fix rmdir hang Andrew Morton
2009-06-25 21:28   ` Andrew Morton
2009-06-25 23:29   ` KAMEZAWA Hiroyuki
2009-06-25 23:29     ` KAMEZAWA Hiroyuki
2009-06-26  5:10 ` [PATCH] " KAMEZAWA Hiroyuki
2009-06-26  5:10   ` KAMEZAWA Hiroyuki
2009-06-26  6:00   ` Daisuke Nishimura
2009-06-26  6:00     ` Daisuke Nishimura
2009-06-29  4:10     ` Daisuke Nishimura
2009-06-29  4:10       ` Daisuke Nishimura
2009-06-26 20:16   ` Paul Menage
2009-06-26 20:16     ` Paul Menage
2009-06-28 23:50     ` KAMEZAWA Hiroyuki
2009-06-28 23:50       ` KAMEZAWA Hiroyuki
2009-06-28 23:56       ` KAMEZAWA Hiroyuki
2009-06-28 23:56         ` KAMEZAWA Hiroyuki

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.