linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC][PATCH 0/7 v2] memcg: prevent failure in pre_destroy()
@ 2012-04-27  5:45 KAMEZAWA Hiroyuki
  2012-04-27  5:49 ` [RFC][PATCH 1/7 v2] temporal compile-fix in linux-next KAMEZAWA Hiroyuki
                   ` (9 more replies)
  0 siblings, 10 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-04-27  5:45 UTC (permalink / raw)
  To: Linux Kernel
  Cc: linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying,
	Aneesh Kumar K.V, Andrew Morton, kamezawa.hiroyuki

This is a v2 patch for preventing failure in memcg->pre_destroy().
With this patch, ->pre_destroy() will never return error code and
users will not see warning at rmdir(). And this work will simplify
memcg->pre_destroy(), largely.

This patch is based on linux-next + hugetlb memory control patches.

I post this as RFC because I'll have vacation in the next week and
hugetlb patches are not visible in linux-next yet.
So, I'm not in hurry. Please review when you have time.

I'll rebase this onto memcg-devel in the next post.
== BTW, memory cgroup's github is here == 
git://github.com/mstsxfx/memcg-devel.git

Since v1, Whole patch designs are changed. In this version, I didn't
remove ->pre_destroy() but make it succeed always. There are no
asynchronous operation and no big patches. But this introduces
2 changes to cgroup core.

After this series, if use_hierarchy==0, all resources will be moved
to root cgroup at rmdir() or force_empty().

Brief patch conents are

0001 : my version of compile-fix for linux-next, Aneesh will post his own version.
0002 : fix error code in hugetlb_force_memcg_empty
0003 : add res_counter_uncharge_until()
0004 : use res_counter_uncharge_until() at move_parent()
0005 : move charges to root cgroup at rmdir, if use_hierarchy=0
0006 : clean up mem_cgroup_move_account()
0007 : cgroup : avoid attaching task to cgroup where ->pre_destroy() is running.
0008 : cgroup : avoid creating a new cgroup under a cgroup where ->pre_destroy() is running.
0009 : remove -EINTR from memcg->pre_destroy().

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC][PATCH 1/7 v2] temporal compile-fix in linux-next
  2012-04-27  5:45 [RFC][PATCH 0/7 v2] memcg: prevent failure in pre_destroy() KAMEZAWA Hiroyuki
@ 2012-04-27  5:49 ` KAMEZAWA Hiroyuki
  2012-04-30  8:47   ` Aneesh Kumar K.V
  2012-04-27  5:51 ` [RFC][PATCH 2/7 v2] memcg: fix error code in hugetlb_force_memcg_empty() KAMEZAWA Hiroyuki
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-04-27  5:49 UTC (permalink / raw)
  To: Linux Kernel
  Cc: linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying,
	Aneesh Kumar K.V, Andrew Morton, kamezawa.hiroyuki

Maybe Aneesh will post his own version. This is just for my work.

>From eaf25c555ec809006220ef22ef152aa04c30c1af Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Thu, 26 Apr 2012 17:51:52 +0900
Subject: [PATCH 1/9] compile-fix

My version of compile fix for linux-next, discussed between Andrew and
Aneesh.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/hugetlb.h    |    4 ----
 include/linux/memcontrol.h |   15 +++++++++++++++
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 828b073..fc226be 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -343,8 +343,4 @@ static inline unsigned int pages_per_huge_page(struct hstate *h)
 #define hstate_index(h) 0
 #endif
 
-#ifdef CONFIG_MEM_RES_CTLR_HUGETLB
-extern int hugetlb_force_memcg_empty(struct cgroup *cgroup);
-#endif
-
 #endif /* _LINUX_HUGETLB_H */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f173811..ca0914a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -450,9 +450,14 @@ extern int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
 					  struct page *page);
 extern bool mem_cgroup_have_hugetlb_usage(struct cgroup *cgroup);
 
+extern int hugetlb_force_memcg_empty(struct cgroup *cgroup);
+
 extern void  mem_cgroup_hugetlb_migrate(struct page *oldhpage,
 					struct page *newhpage);
 #else
+struct cgroup;
+struct mem_cgroup;
+
 static inline int
 mem_cgroup_hugetlb_charge_page(int idx, unsigned long nr_pages,
 						 struct mem_cgroup **ptr)
@@ -494,6 +499,16 @@ mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
 	return 0;
 }
 
+static inline void  mem_cgroup_hugetlb_migrate(struct page *oldhpage,
+				struct page *newhpage)
+{
+	return;
+}
+
+static inline int hugetlb_force_memcg_empty(struct cgroup *cgroup)
+{
+	return 0;
+}
 #endif  /* CONFIG_MEM_RES_CTLR_HUGETLB */
 #endif /* _LINUX_MEMCONTROL_H */
 
-- 
1.7.4.1



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC][PATCH 2/7 v2] memcg: fix error code in hugetlb_force_memcg_empty()
  2012-04-27  5:45 [RFC][PATCH 0/7 v2] memcg: prevent failure in pre_destroy() KAMEZAWA Hiroyuki
  2012-04-27  5:49 ` [RFC][PATCH 1/7 v2] temporal compile-fix in linux-next KAMEZAWA Hiroyuki
@ 2012-04-27  5:51 ` KAMEZAWA Hiroyuki
  2012-04-30  8:49   ` Aneesh Kumar K.V
  2012-04-27  5:53 ` [RFC][PATCH 3/7 v2] res_counter: add res_counter_uncharge_until() KAMEZAWA Hiroyuki
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-04-27  5:51 UTC (permalink / raw)
  To: Linux Kernel
  Cc: linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying,
	Aneesh Kumar K.V, Andrew Morton, kamezawa.hiroyuki

EBUSY should be returned.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/hugetlb.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 17ae2e4..4dd6b39 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1922,8 +1922,11 @@ int hugetlb_force_memcg_empty(struct cgroup *cgroup)
 	int ret = 0, idx = 0;
 
 	do {
-		if (cgroup_task_count(cgroup) || !list_empty(&cgroup->children))
+		if (cgroup_task_count(cgroup)
+			|| !list_empty(&cgroup->children)) {
+			ret = -EBUSY;
 			goto out;
+		}
 		/*
 		 * If the task doing the cgroup_rmdir got a signal
 		 * we don't really need to loop till the hugetlb resource
-- 
1.7.4.1



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC][PATCH 3/7 v2] res_counter: add res_counter_uncharge_until()
  2012-04-27  5:45 [RFC][PATCH 0/7 v2] memcg: prevent failure in pre_destroy() KAMEZAWA Hiroyuki
  2012-04-27  5:49 ` [RFC][PATCH 1/7 v2] temporal compile-fix in linux-next KAMEZAWA Hiroyuki
  2012-04-27  5:51 ` [RFC][PATCH 2/7 v2] memcg: fix error code in hugetlb_force_memcg_empty() KAMEZAWA Hiroyuki
@ 2012-04-27  5:53 ` KAMEZAWA Hiroyuki
  2012-04-27 18:18   ` Tejun Heo
       [not found]   ` <4F9AD28C.60508@parallels.com>
  2012-04-27  5:54 ` [RFC][PATCH 4/7 v2] memcg: use res_counter_uncharge_until in move_parent KAMEZAWA Hiroyuki
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-04-27  5:53 UTC (permalink / raw)
  To: Linux Kernel
  Cc: linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying,
	Aneesh Kumar K.V, Andrew Morton, kamezawa.hiroyuki

>From bb0168d5c85f62f36434956e4728a67d0cc41e55 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Thu, 26 Apr 2012 18:48:07 +0900
Subject: [PATCH 3/9] memcg: add res_counter_uncharge_until()

At killing res_counter which is a child of other counter,
we need to do
	res_counter_uncharge(child, xxx)
	res_counter_charge(parent, xxx)

This is not atomic and wasting cpu. This patch adds
res_counter_uncharge_until(). This function's uncharge propagates
to ancestors until specified res_counter.

	res_counter_uncharge_until(child, parent, xxx)

This ops is atomic and more efficient.

Originaly-written-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/cgroups/resource_counter.txt |    8 ++++++++
 include/linux/res_counter.h                |    3 +++
 kernel/res_counter.c                       |   13 +++++++++++--
 3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index 95b24d7..703103a 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -92,6 +92,14 @@ to work with it.
 
 	The _locked routines imply that the res_counter->lock is taken.
 
+ f. void res_counter_uncharge_until
+		(struct res_counter *rc, struct res_counter *top,
+		 unsinged long val)
+
+	Almost same as res_cunter_uncharge() but propagation of uncharge
+	stops when rc == top. This is useful when kill a res_coutner in
+	child cgroup.
+
  2.1 Other accounting routines
 
     There are more routines that may help you with common needs, like
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index da81af0..d11c1cd 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -135,6 +135,9 @@ int __must_check res_counter_charge_nofail(struct res_counter *counter,
 void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
 void res_counter_uncharge(struct res_counter *counter, unsigned long val);
 
+void res_counter_uncharge_until(struct res_counter *counter,
+				struct res_counter *top,
+				unsigned long val);
 /**
  * res_counter_margin - calculate chargeable space of a counter
  * @cnt: the counter
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index d508363..f4ec411 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -66,6 +66,8 @@ done:
 	return ret;
 }
 
+
+
 int res_counter_charge_nofail(struct res_counter *counter, unsigned long val,
 			      struct res_counter **limit_fail_at)
 {
@@ -99,13 +101,15 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
 	counter->usage -= val;
 }
 
-void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+void res_counter_uncharge_until(struct res_counter *counter,
+				struct res_counter *top,
+				unsigned long val)
 {
 	unsigned long flags;
 	struct res_counter *c;
 
 	local_irq_save(flags);
-	for (c = counter; c != NULL; c = c->parent) {
+	for (c = counter; c != top; c = c->parent) {
 		spin_lock(&c->lock);
 		res_counter_uncharge_locked(c, val);
 		spin_unlock(&c->lock);
@@ -113,6 +117,11 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val)
 	local_irq_restore(flags);
 }
 
+void res_counter_uncharge(struct res_counter *counter, unsigned long val)
+{
+	res_counter_uncharge_until(counter, NULL, val);
+}
+
 
 static inline unsigned long long *
 res_counter_member(struct res_counter *counter, int member)
-- 
1.7.4.1



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC][PATCH 4/7 v2] memcg: use res_counter_uncharge_until in move_parent
  2012-04-27  5:45 [RFC][PATCH 0/7 v2] memcg: prevent failure in pre_destroy() KAMEZAWA Hiroyuki
                   ` (2 preceding siblings ...)
  2012-04-27  5:53 ` [RFC][PATCH 3/7 v2] res_counter: add res_counter_uncharge_until() KAMEZAWA Hiroyuki
@ 2012-04-27  5:54 ` KAMEZAWA Hiroyuki
  2012-04-27 18:20   ` Tejun Heo
                     ` (2 more replies)
  2012-04-27  5:58 ` [RFC][PATCH 5/9 v2] move charges to root at rmdir if use_hierarchy is unset KAMEZAWA Hiroyuki
                   ` (5 subsequent siblings)
  9 siblings, 3 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-04-27  5:54 UTC (permalink / raw)
  To: Linux Kernel
  Cc: linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying,
	Aneesh Kumar K.V, Andrew Morton, kamezawa.hiroyuki

By using res_counter_uncharge_until(), we can avoid 
unnecessary charging.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   63 ++++++++++++++++++++++++++++++++++++------------------
 1 files changed, 42 insertions(+), 21 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 613bb15..ed53d64 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2420,6 +2420,24 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
 }
 
 /*
+ * Cancel chages in this cgroup....doesn't propagates to parent cgroup.
+ * This is useful when moving usage to parent cgroup.
+ */
+static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,
+					unsigned int nr_pages)
+{
+	if (!mem_cgroup_is_root(memcg)) {
+		unsigned long bytes = nr_pages * PAGE_SIZE;
+
+		res_counter_uncharge_until(&memcg->res,
+					memcg->res.parent, bytes);
+		if (do_swap_account)
+			res_counter_uncharge_until(&memcg->memsw,
+						memcg->memsw.parent, bytes);
+	}
+}
+
+/*
  * A helper function to get mem_cgroup from ID. must be called under
  * rcu_read_lock(). The caller must check css_is_removed() or some if
  * it's concern. (dropping refcnt from swap can be called against removed
@@ -2677,16 +2695,28 @@ static int mem_cgroup_move_parent(struct page *page,
 	nr_pages = hpage_nr_pages(page);
 
 	parent = mem_cgroup_from_cont(pcg);
-	ret = __mem_cgroup_try_charge(NULL, gfp_mask, nr_pages, &parent, false);
-	if (ret)
-		goto put_back;
+	if (!parent->use_hierarchy) {
+		ret = __mem_cgroup_try_charge(NULL,
+					gfp_mask, nr_pages, &parent, false);
+		if (ret)
+			goto put_back;
+	}
 
 	if (nr_pages > 1)
 		flags = compound_lock_irqsave(page);
 
-	ret = mem_cgroup_move_account(page, nr_pages, pc, child, parent, true);
-	if (ret)
-		__mem_cgroup_cancel_charge(parent, nr_pages);
+	if (parent->use_hierarchy) {
+		ret = mem_cgroup_move_account(page, nr_pages,
+					pc, child, parent, false);
+		if (!ret)
+			__mem_cgroup_cancel_local_charge(child, nr_pages);
+	} else {
+		ret = mem_cgroup_move_account(page, nr_pages,
+					pc, child, parent, true);
+
+		if (ret)
+			__mem_cgroup_cancel_charge(parent, nr_pages);
+	}
 
 	if (nr_pages > 1)
 		compound_unlock_irqrestore(page, flags);
@@ -3295,6 +3325,7 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
 	struct cgroup *pcgrp = cgroup->parent;
 	struct mem_cgroup *parent = mem_cgroup_from_cont(pcgrp);
 	struct mem_cgroup *memcg  = mem_cgroup_from_cont(cgroup);
+	struct res_counter *counter;
 
 	if (!get_page_unless_zero(page))
 		goto out;
@@ -3305,28 +3336,18 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
 		goto err_out;
 
 	csize = PAGE_SIZE << compound_order(page);
-	/*
-	 * If we have use_hierarchy set we can never fail here. So instead of
-	 * using res_counter_uncharge use the open-coded variant which just
-	 * uncharge the child res_counter. The parent will retain the charge.
-	 */
-	if (parent->use_hierarchy) {
-		unsigned long flags;
-		struct res_counter *counter;
-
-		counter = &memcg->hugepage[idx];
-		spin_lock_irqsave(&counter->lock, flags);
-		res_counter_uncharge_locked(counter, csize);
-		spin_unlock_irqrestore(&counter->lock, flags);
-	} else {
+	/* If parent->use_hierarchy == 0, we need to charge parent */
+	if (!parent->use_hierarchy) {
 		ret = res_counter_charge(&parent->hugepage[idx],
 					 csize, &fail_res);
 		if (ret) {
 			ret = -EBUSY;
 			goto err_out;
 		}
-		res_counter_uncharge(&memcg->hugepage[idx], csize);
 	}
+	counter = &memcg->hugepage[idx];
+	res_counter_uncharge_until(counter, counter->parent, csize);
+
 	pc->mem_cgroup = parent;
 err_out:
 	unlock_page_cgroup(pc);
-- 
1.7.4.1



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC][PATCH 5/9 v2] move charges to root at rmdir if use_hierarchy is unset
  2012-04-27  5:45 [RFC][PATCH 0/7 v2] memcg: prevent failure in pre_destroy() KAMEZAWA Hiroyuki
                   ` (3 preceding siblings ...)
  2012-04-27  5:54 ` [RFC][PATCH 4/7 v2] memcg: use res_counter_uncharge_until in move_parent KAMEZAWA Hiroyuki
@ 2012-04-27  5:58 ` KAMEZAWA Hiroyuki
  2012-04-27 19:12   ` Ying Han
  2012-04-30  9:07   ` Aneesh Kumar K.V
  2012-04-27  6:00 ` [RFC][PATCH 6/9 v2] memcg: don't uncharge in mem_cgroup_move_account KAMEZAWA Hiroyuki
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-04-27  5:58 UTC (permalink / raw)
  To: Linux Kernel
  Cc: linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying,
	Aneesh Kumar K.V, Andrew Morton, kamezawa.hiroyuki

Now, at removal of cgroup, ->pre_destroy() is called and move charges
to the parent cgroup. A major reason of -EBUSY returned by ->pre_destroy()
is that the 'moving' hits parent's resource limitation. It happens only
when use_hierarchy=0. This was a mistake of original design.(it's me...)

Considering use_hierarchy=0, all cgroups are treated as flat. So, no one
cannot justify moving charges to parent...parent and children are in
flat configuration, not hierarchical.

This patch modifes to move charges to root cgroup at rmdir/force_empty
if use_hierarchy==0. This will much simplify rmdir() and reduce error
in ->pre_destroy.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/cgroups/memory.txt |   12 ++++++----
 mm/memcontrol.c                  |   39 +++++++++++++------------------------
 2 files changed, 21 insertions(+), 30 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 54c338d..82ce1ef 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -393,14 +393,14 @@ cgroup might have some charge associated with it, even though all
 tasks have migrated away from it. (because we charge against pages, not
 against tasks.)
 
-Such charges are freed or moved to their parent. At moving, both of RSS
-and CACHES are moved to parent.
-rmdir() may return -EBUSY if freeing/moving fails. See 5.1 also.
+Such charges are freed or moved to their parent if use_hierarchy=1.
+if use_hierarchy=0, the charges will be moved to root cgroup.
 
 Charges recorded in swap information is not updated at removal of cgroup.
 Recorded information is discarded and a cgroup which uses swap (swapcache)
 will be charged as a new owner of it.
 
+About use_hierarchy, see Section 6.
 
 5. Misc. interfaces.
 
@@ -413,13 +413,15 @@ will be charged as a new owner of it.
 
   Almost all pages tracked by this memory cgroup will be unmapped and freed.
   Some pages cannot be freed because they are locked or in-use. Such pages are
-  moved to parent and this cgroup will be empty. This may return -EBUSY if
-  VM is too busy to free/move all pages immediately.
+  moved to parent(if use_hierarchy==1) or root (if use_hierarchy==0) and this
+  cgroup will be empty.
 
   Typical use case of this interface is that calling this before rmdir().
   Because rmdir() moves all pages to parent, some out-of-use page caches can be
   moved to the parent. If you want to avoid that, force_empty will be useful.
 
+  About use_hierarchy, see Section 6.
+
 5.2 stat file
 
 memory.stat file includes following statistics
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ed53d64..62200f1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2695,32 +2695,23 @@ static int mem_cgroup_move_parent(struct page *page,
 	nr_pages = hpage_nr_pages(page);
 
 	parent = mem_cgroup_from_cont(pcg);
-	if (!parent->use_hierarchy) {
-		ret = __mem_cgroup_try_charge(NULL,
-					gfp_mask, nr_pages, &parent, false);
-		if (ret)
-			goto put_back;
-	}
+	/*
+	 * if use_hierarchy==0, move charges to root cgroup.
+	 * in root cgroup, we don't touch res_counter
+	 */
+	if (!parent->use_hierarchy)
+		parent = root_mem_cgroup;
 
 	if (nr_pages > 1)
 		flags = compound_lock_irqsave(page);
 
-	if (parent->use_hierarchy) {
-		ret = mem_cgroup_move_account(page, nr_pages,
-					pc, child, parent, false);
-		if (!ret)
-			__mem_cgroup_cancel_local_charge(child, nr_pages);
-	} else {
-		ret = mem_cgroup_move_account(page, nr_pages,
-					pc, child, parent, true);
-
-		if (ret)
-			__mem_cgroup_cancel_charge(parent, nr_pages);
-	}
+	ret = mem_cgroup_move_account(page, nr_pages,
+				pc, child, parent, false);
+	if (!ret)
+		__mem_cgroup_cancel_local_charge(child, nr_pages);
 
 	if (nr_pages > 1)
 		compound_unlock_irqrestore(page, flags);
-put_back:
 	putback_lru_page(page);
 put:
 	put_page(page);
@@ -3338,12 +3329,10 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
 	csize = PAGE_SIZE << compound_order(page);
 	/* If parent->use_hierarchy == 0, we need to charge parent */
 	if (!parent->use_hierarchy) {
-		ret = res_counter_charge(&parent->hugepage[idx],
-					 csize, &fail_res);
-		if (ret) {
-			ret = -EBUSY;
-			goto err_out;
-		}
+		parent = root_mem_cgroup;
+		/* root has no limit */
+		res_counter_charge_nofail(&parent->hugepage[idx],
+				 csize, &fail_res);
 	}
 	counter = &memcg->hugepage[idx];
 	res_counter_uncharge_until(counter, counter->parent, csize);
-- 
1.7.4.1



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC][PATCH 6/9 v2] memcg: don't uncharge in mem_cgroup_move_account
  2012-04-27  5:45 [RFC][PATCH 0/7 v2] memcg: prevent failure in pre_destroy() KAMEZAWA Hiroyuki
                   ` (4 preceding siblings ...)
  2012-04-27  5:58 ` [RFC][PATCH 5/9 v2] move charges to root at rmdir if use_hierarchy is unset KAMEZAWA Hiroyuki
@ 2012-04-27  6:00 ` KAMEZAWA Hiroyuki
  2012-04-27  6:02 ` [RFC][PATCH 7/9 v2] cgroup: avoid attaching task to a cgroup under rmdir() KAMEZAWA Hiroyuki
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-04-27  6:00 UTC (permalink / raw)
  To: Linux Kernel
  Cc: linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying,
	Aneesh Kumar K.V, Andrew Morton, kamezawa.hiroyuki


Now, all caller passes 'false' for 'bool uncharge', remove the argument.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   20 ++++++--------------
 1 files changed, 6 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 62200f1..2715223 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2589,23 +2589,19 @@ void mem_cgroup_split_huge_fixup(struct page *head)
  * @pc:	page_cgroup of the page.
  * @from: mem_cgroup which the page is moved from.
  * @to:	mem_cgroup which the page is moved to. @from != @to.
- * @uncharge: whether we should call uncharge and css_put against @from.
  *
  * The caller must confirm following.
  * - page is not on LRU (isolate_page() is useful.)
  * - compound_lock is held when nr_pages > 1
  *
- * This function doesn't do "charge" nor css_get to new cgroup. It should be
- * done by a caller(__mem_cgroup_try_charge would be useful). If @uncharge is
- * true, this function does "uncharge" from old cgroup, but it doesn't if
- * @uncharge is false, so a caller should do "uncharge".
+ * This function doesn't do "charge" to new cgroup and doesn't do "uncharge"
+ * from old cgroup.
  */
 static int mem_cgroup_move_account(struct page *page,
 				   unsigned int nr_pages,
 				   struct page_cgroup *pc,
 				   struct mem_cgroup *from,
-				   struct mem_cgroup *to,
-				   bool uncharge)
+				   struct mem_cgroup *to)
 {
 	unsigned long flags;
 	int ret;
@@ -2639,9 +2635,6 @@ static int mem_cgroup_move_account(struct page *page,
 		preempt_enable();
 	}
 	mem_cgroup_charge_statistics(from, anon, -nr_pages);
-	if (uncharge)
-		/* This is not "cancel", but cancel_charge does all we need. */
-		__mem_cgroup_cancel_charge(from, nr_pages);
 
 	/* caller should have done css_get */
 	pc->mem_cgroup = to;
@@ -2706,7 +2699,7 @@ static int mem_cgroup_move_parent(struct page *page,
 		flags = compound_lock_irqsave(page);
 
 	ret = mem_cgroup_move_account(page, nr_pages,
-				pc, child, parent, false);
+				pc, child, parent);
 	if (!ret)
 		__mem_cgroup_cancel_local_charge(child, nr_pages);
 
@@ -5724,8 +5717,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 			if (!isolate_lru_page(page)) {
 				pc = lookup_page_cgroup(page);
 				if (!mem_cgroup_move_account(page, HPAGE_PMD_NR,
-							     pc, mc.from, mc.to,
-							     false)) {
+							pc, mc.from, mc.to)) {
 					mc.precharge -= HPAGE_PMD_NR;
 					mc.moved_charge += HPAGE_PMD_NR;
 				}
@@ -5755,7 +5747,7 @@ retry:
 				goto put;
 			pc = lookup_page_cgroup(page);
 			if (!mem_cgroup_move_account(page, 1, pc,
-						     mc.from, mc.to, false)) {
+						     mc.from, mc.to)) {
 				mc.precharge--;
 				/* we uncharge from mc.from later. */
 				mc.moved_charge++;
-- 
1.7.4.1



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC][PATCH 7/9 v2] cgroup: avoid attaching task to a cgroup under rmdir()
  2012-04-27  5:45 [RFC][PATCH 0/7 v2] memcg: prevent failure in pre_destroy() KAMEZAWA Hiroyuki
                   ` (5 preceding siblings ...)
  2012-04-27  6:00 ` [RFC][PATCH 6/9 v2] memcg: don't uncharge in mem_cgroup_move_account KAMEZAWA Hiroyuki
@ 2012-04-27  6:02 ` KAMEZAWA Hiroyuki
  2012-04-27 10:39   ` Frederic Weisbecker
  2012-04-27 20:31   ` Tejun Heo
  2012-04-27  6:04 ` [RFC][PATCH 8/9 v2] cgroup: avoid creating new cgroup under a cgroup being destroyed KAMEZAWA Hiroyuki
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-04-27  6:02 UTC (permalink / raw)
  To: Linux Kernel
  Cc: linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying,
	Aneesh Kumar K.V, Andrew Morton, kamezawa.hiroyuki

attach_task() is done under cgroup_mutex() but ->pre_destroy() callback
in rmdir() isn't called under cgroup_mutex().

It's better to avoid attaching a task to a cgroup which
is under pre_destroy(). Considering memcg, the attached task may
increase resource usage after memcg's pre_destroy() confirms that
memcg is empty. This is not good.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 kernel/cgroup.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index ad8eae5..7a3076b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1953,6 +1953,9 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	if (cgrp == oldcgrp)
 		return 0;
 
+	if (test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))
+		return -EBUSY;
+
 	tset.single.task = tsk;
 	tset.single.cgrp = oldcgrp;
 
@@ -4181,7 +4184,6 @@ again:
 		mutex_unlock(&cgroup_mutex);
 		return -EBUSY;
 	}
-	mutex_unlock(&cgroup_mutex);
 
 	/*
 	 * In general, subsystem has no css->refcnt after pre_destroy(). But
@@ -4193,6 +4195,7 @@ again:
 	 * and css_tryget() and cgroup_wakeup_rmdir_waiter() implementation.
 	 */
 	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
+	mutex_unlock(&cgroup_mutex);
 
 	/*
 	 * Call pre_destroy handlers of subsys. Notify subsystems
-- 
1.7.4.1



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC][PATCH 8/9 v2] cgroup: avoid creating new cgroup under a cgroup being destroyed
  2012-04-27  5:45 [RFC][PATCH 0/7 v2] memcg: prevent failure in pre_destroy() KAMEZAWA Hiroyuki
                   ` (6 preceding siblings ...)
  2012-04-27  6:02 ` [RFC][PATCH 7/9 v2] cgroup: avoid attaching task to a cgroup under rmdir() KAMEZAWA Hiroyuki
@ 2012-04-27  6:04 ` KAMEZAWA Hiroyuki
  2012-04-27 20:40   ` Tejun Heo
  2012-04-27  6:06 ` [RFC][PATCH 9/9 v2] memcg: never return error at pre_destroy() KAMEZAWA Hiroyuki
  2012-04-27 18:16 ` [RFC][PATCH 0/7 v2] memcg: prevent failure in pre_destroy() Tejun Heo
  9 siblings, 1 reply; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-04-27  6:04 UTC (permalink / raw)
  To: Linux Kernel
  Cc: linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying,
	Aneesh Kumar K.V, Andrew Morton, kamezawa.hiroyuki

When ->pre_destroy() is called, it should be guaranteed that
new child cgroup is not created under a cgroup, where pre_destroy()
is running. If not, ->pre_destroy() must check children and
return -EBUSY, which causes warning.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 kernel/cgroup.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7a3076b..003ceed 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3970,6 +3970,15 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
 
 	mutex_lock(&cgroup_mutex);
 
+	/*
+	 * prevent making child cgroup when ->pre_destroy() is running.
+	 */
+	if (test_bit(CGRP_WAIT_ON_RMDIR, &parent->flags)) {
+		mutex_unlock(&cgroup_mutex);
+		atomic_dec(&sb->s_active);
+		return -EBUSY;
+	}
+
 	init_cgroup_housekeeping(cgrp);
 
 	cgrp->parent = parent;
-- 
1.7.4.1



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC][PATCH 9/9 v2] memcg: never return error at pre_destroy()
  2012-04-27  5:45 [RFC][PATCH 0/7 v2] memcg: prevent failure in pre_destroy() KAMEZAWA Hiroyuki
                   ` (7 preceding siblings ...)
  2012-04-27  6:04 ` [RFC][PATCH 8/9 v2] cgroup: avoid creating new cgroup under a cgroup being destroyed KAMEZAWA Hiroyuki
@ 2012-04-27  6:06 ` KAMEZAWA Hiroyuki
  2012-04-27 21:28   ` Ying Han
  2012-05-01 22:28   ` Suleiman Souhlal
  2012-04-27 18:16 ` [RFC][PATCH 0/7 v2] memcg: prevent failure in pre_destroy() Tejun Heo
  9 siblings, 2 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-04-27  6:06 UTC (permalink / raw)
  To: Linux Kernel
  Cc: linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying,
	Aneesh Kumar K.V, Andrew Morton, kamezawa.hiroyuki

When force_empty() called by ->pre_destroy(), no memory reclaim happens
and it doesn't take very long time which requires signal_pending() check.
And if we return -EINTR from pre_destroy(), cgroup.c show warning.

This patch removes signal check in force_empty(). By this, ->pre_destroy()
returns success always.

Note: check for 'cgroup is empty' remains for force_empty interface.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/hugetlb.c    |   10 +---------
 mm/memcontrol.c |   14 +++++---------
 2 files changed, 6 insertions(+), 18 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4dd6b39..770f1642 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1922,20 +1922,12 @@ int hugetlb_force_memcg_empty(struct cgroup *cgroup)
 	int ret = 0, idx = 0;
 
 	do {
+		/* see memcontrol.c::mem_cgroup_force_empty() */
 		if (cgroup_task_count(cgroup)
 			|| !list_empty(&cgroup->children)) {
 			ret = -EBUSY;
 			goto out;
 		}
-		/*
-		 * If the task doing the cgroup_rmdir got a signal
-		 * we don't really need to loop till the hugetlb resource
-		 * usage become zero.
-		 */
-		if (signal_pending(current)) {
-			ret = -EINTR;
-			goto out;
-		}
 		for_each_hstate(h) {
 			spin_lock(&hugetlb_lock);
 			list_for_each_entry(page, &h->hugepage_activelist, lru) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2715223..ee350c5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3852,8 +3852,6 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
 		pc = lookup_page_cgroup(page);
 
 		ret = mem_cgroup_move_parent(page, pc, memcg, GFP_KERNEL);
-		if (ret == -ENOMEM || ret == -EINTR)
-			break;
 
 		if (ret == -EBUSY || ret == -EINVAL) {
 			/* found lock contention or "pc" is obsolete. */
@@ -3863,7 +3861,7 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
 			busy = NULL;
 	}
 
-	if (!ret && !list_empty(list))
+	if (!loop)
 		return -EBUSY;
 	return ret;
 }
@@ -3893,11 +3891,12 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
 move_account:
 	do {
 		ret = -EBUSY;
+		/*
+		 * This never happens when this is called by ->pre_destroy().
+		 * But we need to take care of force_empty interface.
+		 */
 		if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
 			goto out;
-		ret = -EINTR;
-		if (signal_pending(current))
-			goto out;
 		/* This is for making all *used* pages to be on LRU. */
 		lru_add_drain_all();
 		drain_all_stock_sync(memcg);
@@ -3918,9 +3917,6 @@ move_account:
 		}
 		mem_cgroup_end_move(memcg);
 		memcg_oom_recover(memcg);
-		/* it seems parent cgroup doesn't have enough mem */
-		if (ret == -ENOMEM)
-			goto try_to_free;
 		cond_resched();
 	/* "ret" should also be checked to ensure all lists are empty. */
 	} while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret);
-- 
1.7.4.1



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 7/9 v2] cgroup: avoid attaching task to a cgroup under rmdir()
  2012-04-27  6:02 ` [RFC][PATCH 7/9 v2] cgroup: avoid attaching task to a cgroup under rmdir() KAMEZAWA Hiroyuki
@ 2012-04-27 10:39   ` Frederic Weisbecker
  2012-04-28  0:06     ` Hiroyuki Kamezawa
  2012-04-27 20:31   ` Tejun Heo
  1 sibling, 1 reply; 42+ messages in thread
From: Frederic Weisbecker @ 2012-04-27 10:39 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Linux Kernel, linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Glauber Costa, Tejun Heo, Han Ying, Aneesh Kumar K.V,
	Andrew Morton, kamezawa.hiroyuki

On Fri, Apr 27, 2012 at 03:02:22PM +0900, KAMEZAWA Hiroyuki wrote:
> attach_task() is done under cgroup_mutex() but ->pre_destroy() callback
> in rmdir() isn't called under cgroup_mutex().
> 
> It's better to avoid attaching a task to a cgroup which
> is under pre_destroy(). Considering memcg, the attached task may
> increase resource usage after memcg's pre_destroy() confirms that
> memcg is empty. This is not good.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  kernel/cgroup.c |    5 ++++-
>  1 files changed, 4 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index ad8eae5..7a3076b 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1953,6 +1953,9 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  	if (cgrp == oldcgrp)
>  		return 0;
>  
> +	if (test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))
> +		return -EBUSY;
> +

You probably need to update cgroup_attach_proc() as well?

>  	tset.single.task = tsk;
>  	tset.single.cgrp = oldcgrp;
>  
> @@ -4181,7 +4184,6 @@ again:
>  		mutex_unlock(&cgroup_mutex);
>  		return -EBUSY;
>  	}
> -	mutex_unlock(&cgroup_mutex);
>  
>  	/*
>  	 * In general, subsystem has no css->refcnt after pre_destroy(). But
> @@ -4193,6 +4195,7 @@ again:
>  	 * and css_tryget() and cgroup_wakeup_rmdir_waiter() implementation.
>  	 */
>  	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> +	mutex_unlock(&cgroup_mutex);
>  
>  	/*
>  	 * Call pre_destroy handlers of subsys. Notify subsystems
> -- 
> 1.7.4.1
> 
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 0/7 v2] memcg: prevent failure in pre_destroy()
  2012-04-27  5:45 [RFC][PATCH 0/7 v2] memcg: prevent failure in pre_destroy() KAMEZAWA Hiroyuki
                   ` (8 preceding siblings ...)
  2012-04-27  6:06 ` [RFC][PATCH 9/9 v2] memcg: never return error at pre_destroy() KAMEZAWA Hiroyuki
@ 2012-04-27 18:16 ` Tejun Heo
  2012-04-27 23:48   ` Hiroyuki Kamezawa
  9 siblings, 1 reply; 42+ messages in thread
From: Tejun Heo @ 2012-04-27 18:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Linux Kernel, linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Han Ying, Aneesh Kumar K.V,
	Andrew Morton, kamezawa.hiroyuki

Hello,

On Fri, Apr 27, 2012 at 02:45:30PM +0900, KAMEZAWA Hiroyuki wrote:
> This is a v2 patch for preventing failure in memcg->pre_destroy().
> With this patch, ->pre_destroy() will never return error code and
> users will not see warning at rmdir(). And this work will simplify
> memcg->pre_destroy(), largely.
> 
> This patch is based on linux-next + hugetlb memory control patches.

Ergh... can you please set up a git branch somewhere for review
purposes?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 3/7 v2] res_counter: add res_counter_uncharge_until()
  2012-04-27  5:53 ` [RFC][PATCH 3/7 v2] res_counter: add res_counter_uncharge_until() KAMEZAWA Hiroyuki
@ 2012-04-27 18:18   ` Tejun Heo
  2012-04-27 23:51     ` Hiroyuki Kamezawa
       [not found]   ` <4F9AD28C.60508@parallels.com>
  1 sibling, 1 reply; 42+ messages in thread
From: Tejun Heo @ 2012-04-27 18:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Linux Kernel, linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Han Ying, Aneesh Kumar K.V,
	Andrew Morton, kamezawa.hiroyuki

On Fri, Apr 27, 2012 at 02:53:03PM +0900, KAMEZAWA Hiroyuki wrote:
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index d508363..f4ec411 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -66,6 +66,8 @@ done:
>  	return ret;
>  }
>  
> +
> +

Contamination?

-- 
tejun

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 4/7 v2] memcg: use res_counter_uncharge_until in move_parent
  2012-04-27  5:54 ` [RFC][PATCH 4/7 v2] memcg: use res_counter_uncharge_until in move_parent KAMEZAWA Hiroyuki
@ 2012-04-27 18:20   ` Tejun Heo
  2012-04-27 23:59     ` Hiroyuki Kamezawa
       [not found]   ` <4F9AD455.9030306@parallels.com>
  2012-04-30  9:00   ` Aneesh Kumar K.V
  2 siblings, 1 reply; 42+ messages in thread
From: Tejun Heo @ 2012-04-27 18:20 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Linux Kernel, linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Han Ying, Aneesh Kumar K.V,
	Andrew Morton, kamezawa.hiroyuki

On Fri, Apr 27, 2012 at 02:54:58PM +0900, KAMEZAWA Hiroyuki wrote:
> By using res_counter_uncharge_until(), we can avoid 
> unnecessary charging.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/memcontrol.c |   63 ++++++++++++++++++++++++++++++++++++------------------
>  1 files changed, 42 insertions(+), 21 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 613bb15..ed53d64 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2420,6 +2420,24 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
>  }
>  
>  /*
> + * Cancel chages in this cgroup....doesn't propagates to parent cgroup.

             ^typo                                     ^ unnecessary s

-- 
tejun

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 4/7 v2] memcg: use res_counter_uncharge_until in move_parent
       [not found]   ` <4F9AD455.9030306@parallels.com>
@ 2012-04-27 18:26     ` Ying Han
  2012-04-27 23:58     ` Hiroyuki Kamezawa
  1 sibling, 0 replies; 42+ messages in thread
From: Ying Han @ 2012-04-27 18:26 UTC (permalink / raw)
  To: Glauber Costa
  Cc: KAMEZAWA Hiroyuki, Linux Kernel, linux-mm, cgroups, Michal Hocko,
	Johannes Weiner, Frederic Weisbecker, Tejun Heo,
	Aneesh Kumar K.V, Andrew Morton, kamezawa.hiroyuki

On Fri, Apr 27, 2012 at 10:16 AM, Glauber Costa <glommer@parallels.com> wrote:
> On 04/27/2012 02:54 AM, KAMEZAWA Hiroyuki wrote:
>> By using res_counter_uncharge_until(), we can avoid
>> unnecessary charging.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com>
>> ---
>>   mm/memcontrol.c |   63 ++++++++++++++++++++++++++++++++++++------------------
>>   1 files changed, 42 insertions(+), 21 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 613bb15..ed53d64 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2420,6 +2420,24 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
>>   }
>>
>>   /*
>> + * Cancel chages in this cgroup....doesn't propagates to parent cgroup.
>> + * This is useful when moving usage to parent cgroup.
>> + */
>> +static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,
>> +                                     unsigned int nr_pages)
>> +{
>> +     if (!mem_cgroup_is_root(memcg)) {
>> +             unsigned long bytes = nr_pages * PAGE_SIZE;
>> +
>> +             res_counter_uncharge_until(&memcg->res,
>> +                                     memcg->res.parent, bytes);
>> +             if (do_swap_account)
>> +                     res_counter_uncharge_until(&memcg->memsw,
>> +                                             memcg->memsw.parent, bytes);
>> +     }
>> +}
>
> Kame, this is a nitpick, but I usually prefer to write this like:
>
> if (mem_cgroup_is_root(memcg))
>   return;
>
> res_counter...
>
> Specially with memcg, where function names are bigger than average, in
> comparison.
>
> the code itself seems fine.
>
>> +/*
>>    * A helper function to get mem_cgroup from ID. must be called under
>>    * rcu_read_lock(). The caller must check css_is_removed() or some if
>>    * it's concern. (dropping refcnt from swap can be called against removed
>> @@ -2677,16 +2695,28 @@ static int mem_cgroup_move_parent(struct page *page,
>>       nr_pages = hpage_nr_pages(page);
>>
>>       parent = mem_cgroup_from_cont(pcg);
>> -     ret = __mem_cgroup_try_charge(NULL, gfp_mask, nr_pages,&parent, false);
>> -     if (ret)
>> -             goto put_back;
>> +     if (!parent->use_hierarchy) {
> Can we avoid testing for use hierarchy ?
> Specially given this might go away.
>
> parent_mem_cgroup() already bundles this information. So maybe we can
> test for parent_mem_cgroup(parent) == NULL. It is the same thing after all.
>> +             ret = __mem_cgroup_try_charge(NULL,
>> +                                     gfp_mask, nr_pages,&parent, false);
>> +             if (ret)
>> +                     goto put_back;
>> +     }
>
> Why? If we are not hierarchical, we should not charge the parent, right?

This is how it is implemented today and I think he changed that to
move to root on the next patch.

>
>>       if (nr_pages>  1)
>>               flags = compound_lock_irqsave(page);
>>
>> -     ret = mem_cgroup_move_account(page, nr_pages, pc, child, parent, true);
>> -     if (ret)
>> -             __mem_cgroup_cancel_charge(parent, nr_pages);
>> +     if (parent->use_hierarchy) {
>> +             ret = mem_cgroup_move_account(page, nr_pages,
>> +                                     pc, child, parent, false);
>> +             if (!ret)
>> +                     __mem_cgroup_cancel_local_charge(child, nr_pages);
>> +     } else {
>> +             ret = mem_cgroup_move_account(page, nr_pages,
>> +                                     pc, child, parent, true);
>> +
>> +             if (ret)
>> +                     __mem_cgroup_cancel_charge(parent, nr_pages);
>> +     }
>
> Calling move account also seems not necessary to me. If we are not
> uncharging + charging, we won't even touch the parent.

Today for user_hierarchy = 0, the charge is moved to parent as well as
the stats. But that is changed on the following patches.

--Ying

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 5/9 v2] move charges to root at rmdir if use_hierarchy is unset
  2012-04-27  5:58 ` [RFC][PATCH 5/9 v2] move charges to root at rmdir if use_hierarchy is unset KAMEZAWA Hiroyuki
@ 2012-04-27 19:12   ` Ying Han
  2012-04-28  0:01     ` Hiroyuki Kamezawa
  2012-04-30  9:07   ` Aneesh Kumar K.V
  1 sibling, 1 reply; 42+ messages in thread
From: Ying Han @ 2012-04-27 19:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Linux Kernel, linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Tejun Heo, Aneesh Kumar K.V,
	Andrew Morton, kamezawa.hiroyuki

On Thu, Apr 26, 2012 at 10:58 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Now, at removal of cgroup, ->pre_destroy() is called and move charges
> to the parent cgroup. A major reason of -EBUSY returned by ->pre_destroy()
> is that the 'moving' hits parent's resource limitation. It happens only
> when use_hierarchy=0. This was a mistake of original design.(it's me...)

Nice patch, i can see how broken it is now with use_hierarchy=0...

nitpick on the documentation below:

>
> Considering use_hierarchy=0, all cgroups are treated as flat. So, no one
> cannot justify moving charges to parent...parent and children are in
> flat configuration, not hierarchical.
>
> This patch modifes to move charges to root cgroup at rmdir/force_empty
> if use_hierarchy==0. This will much simplify rmdir() and reduce error
> in ->pre_destroy.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  Documentation/cgroups/memory.txt |   12 ++++++----
>  mm/memcontrol.c                  |   39 +++++++++++++------------------------
>  2 files changed, 21 insertions(+), 30 deletions(-)
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index 54c338d..82ce1ef 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -393,14 +393,14 @@ cgroup might have some charge associated with it, even though all
>  tasks have migrated away from it. (because we charge against pages, not
>  against tasks.)
>
> -Such charges are freed or moved to their parent. At moving, both of RSS
> -and CACHES are moved to parent.
> -rmdir() may return -EBUSY if freeing/moving fails. See 5.1 also.
> +Such charges are freed or moved to their parent if use_hierarchy=1.
> +if use_hierarchy=0, the charges will be moved to root cgroup.

It is more clear that we move the stats to root (if use_hierarchy==0)
or parent (if use_hierarchy==1), and no change on the charge except
uncharging from the child.

--Ying

>
>  Charges recorded in swap information is not updated at removal of cgroup.
>  Recorded information is discarded and a cgroup which uses swap (swapcache)
>  will be charged as a new owner of it.
>
> +About use_hierarchy, see Section 6.
>
>  5. Misc. interfaces.
>
> @@ -413,13 +413,15 @@ will be charged as a new owner of it.
>
>   Almost all pages tracked by this memory cgroup will be unmapped and freed.
>   Some pages cannot be freed because they are locked or in-use. Such pages are
> -  moved to parent and this cgroup will be empty. This may return -EBUSY if
> -  VM is too busy to free/move all pages immediately.
> +  moved to parent(if use_hierarchy==1) or root (if use_hierarchy==0) and this
> +  cgroup will be empty.
>
>   Typical use case of this interface is that calling this before rmdir().
>   Because rmdir() moves all pages to parent, some out-of-use page caches can be
>   moved to the parent. If you want to avoid that, force_empty will be useful.
>
> +  About use_hierarchy, see Section 6.
> +
>  5.2 stat file
>
>  memory.stat file includes following statistics
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ed53d64..62200f1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2695,32 +2695,23 @@ static int mem_cgroup_move_parent(struct page *page,
>        nr_pages = hpage_nr_pages(page);
>
>        parent = mem_cgroup_from_cont(pcg);
> -       if (!parent->use_hierarchy) {
> -               ret = __mem_cgroup_try_charge(NULL,
> -                                       gfp_mask, nr_pages, &parent, false);
> -               if (ret)
> -                       goto put_back;
> -       }
> +       /*
> +        * if use_hierarchy==0, move charges to root cgroup.
> +        * in root cgroup, we don't touch res_counter
> +        */
> +       if (!parent->use_hierarchy)
> +               parent = root_mem_cgroup;
>
>        if (nr_pages > 1)
>                flags = compound_lock_irqsave(page);
>
> -       if (parent->use_hierarchy) {
> -               ret = mem_cgroup_move_account(page, nr_pages,
> -                                       pc, child, parent, false);
> -               if (!ret)
> -                       __mem_cgroup_cancel_local_charge(child, nr_pages);
> -       } else {
> -               ret = mem_cgroup_move_account(page, nr_pages,
> -                                       pc, child, parent, true);
> -
> -               if (ret)
> -                       __mem_cgroup_cancel_charge(parent, nr_pages);
> -       }
> +       ret = mem_cgroup_move_account(page, nr_pages,
> +                               pc, child, parent, false);
> +       if (!ret)
> +               __mem_cgroup_cancel_local_charge(child, nr_pages);
>
>        if (nr_pages > 1)
>                compound_unlock_irqrestore(page, flags);
> -put_back:
>        putback_lru_page(page);
>  put:
>        put_page(page);
> @@ -3338,12 +3329,10 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
>        csize = PAGE_SIZE << compound_order(page);
>        /* If parent->use_hierarchy == 0, we need to charge parent */
>        if (!parent->use_hierarchy) {
> -               ret = res_counter_charge(&parent->hugepage[idx],
> -                                        csize, &fail_res);
> -               if (ret) {
> -                       ret = -EBUSY;
> -                       goto err_out;
> -               }
> +               parent = root_mem_cgroup;
> +               /* root has no limit */
> +               res_counter_charge_nofail(&parent->hugepage[idx],
> +                                csize, &fail_res);
>        }
>        counter = &memcg->hugepage[idx];
>        res_counter_uncharge_until(counter, counter->parent, csize);
> --
> 1.7.4.1
>
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 7/9 v2] cgroup: avoid attaching task to a cgroup under rmdir()
  2012-04-27  6:02 ` [RFC][PATCH 7/9 v2] cgroup: avoid attaching task to a cgroup under rmdir() KAMEZAWA Hiroyuki
  2012-04-27 10:39   ` Frederic Weisbecker
@ 2012-04-27 20:31   ` Tejun Heo
  2012-04-27 20:33     ` Tejun Heo
  1 sibling, 1 reply; 42+ messages in thread
From: Tejun Heo @ 2012-04-27 20:31 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Linux Kernel, linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Han Ying, Aneesh Kumar K.V,
	Andrew Morton, kamezawa.hiroyuki

On Fri, Apr 27, 2012 at 03:02:22PM +0900, KAMEZAWA Hiroyuki wrote:
> attach_task() is done under cgroup_mutex() but ->pre_destroy() callback
> in rmdir() isn't called under cgroup_mutex().
> 
> It's better to avoid attaching a task to a cgroup which
> is under pre_destroy(). Considering memcg, the attached task may
> increase resource usage after memcg's pre_destroy() confirms that
> memcg is empty. This is not good.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Hmm... once memcg's pre_destroy() can't fail, I think what we should
do is marking a cgroup DEAD before calling pre_destroy() and the
existing cgroup_is_removed() check should be enough.  Patches upto
this point already make ->pre_destroy() not fail, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 7/9 v2] cgroup: avoid attaching task to a cgroup under rmdir()
  2012-04-27 20:31   ` Tejun Heo
@ 2012-04-27 20:33     ` Tejun Heo
  0 siblings, 0 replies; 42+ messages in thread
From: Tejun Heo @ 2012-04-27 20:33 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Linux Kernel, linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Han Ying, Aneesh Kumar K.V,
	Andrew Morton, kamezawa.hiroyuki

On Fri, Apr 27, 2012 at 01:31:59PM -0700, Tejun Heo wrote:
> On Fri, Apr 27, 2012 at 03:02:22PM +0900, KAMEZAWA Hiroyuki wrote:
> > attach_task() is done under cgroup_mutex() but ->pre_destroy() callback
> > in rmdir() isn't called under cgroup_mutex().
> > 
> > It's better to avoid attaching a task to a cgroup which
> > is under pre_destroy(). Considering memcg, the attached task may
> > increase resource usage after memcg's pre_destroy() confirms that
> > memcg is empty. This is not good.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Hmm... once memcg's pre_destroy() can't fail, I think what we should
> do is marking a cgroup DEAD before calling pre_destroy() and the
> existing cgroup_is_removed() check should be enough.  Patches upto
> this point already make ->pre_destroy() not fail, right?

Ooh, I was confused about patch order.  This patch, probably with the
update suggested by Frederic looks good as an interim solution for
now.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 8/9 v2] cgroup: avoid creating new cgroup under a cgroup being destroyed
  2012-04-27  6:04 ` [RFC][PATCH 8/9 v2] cgroup: avoid creating new cgroup under a cgroup being destroyed KAMEZAWA Hiroyuki
@ 2012-04-27 20:40   ` Tejun Heo
  2012-04-27 20:41     ` Tejun Heo
  2012-04-28  0:20     ` Hiroyuki Kamezawa
  0 siblings, 2 replies; 42+ messages in thread
From: Tejun Heo @ 2012-04-27 20:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Linux Kernel, linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Han Ying, Aneesh Kumar K.V,
	Andrew Morton, kamezawa.hiroyuki

On Fri, Apr 27, 2012 at 03:04:14PM +0900, KAMEZAWA Hiroyuki wrote:
> When ->pre_destroy() is called, it should be guaranteed that
> new child cgroup is not created under a cgroup, where pre_destroy()
> is running. If not, ->pre_destroy() must check children and
> return -EBUSY, which causes warning.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Hmm... I'm getting confused more.  Why do we need these cgroup changes
at all?  cgroup still has cgrp->count check and
cgroup_clear_css_refs() after pre_destroy() calls.  The order of
changes should be,

* Make memcg pre_destroy() not fail; however, pre_destroy() should
  still be ready to be retried.  That's the defined interface.

* cgroup core updated to drop pre_destroy() retrying and guarantee
  that pre_destroy() invocation will happen only once.

* memcg and other cgroups can update their pre_destroy() if the "won't
  be retried" part can simplify their implementations.

So, there's no reason to be updating cgroup pre_destroy() semantics at
this point and these updates actually break cgroup API as it currently
stands.  The only change necessary is memcg's pre_destroy() not
returning zero.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 8/9 v2] cgroup: avoid creating new cgroup under a cgroup being destroyed
  2012-04-27 20:40   ` Tejun Heo
@ 2012-04-27 20:41     ` Tejun Heo
  2012-04-28  0:20     ` Hiroyuki Kamezawa
  1 sibling, 0 replies; 42+ messages in thread
From: Tejun Heo @ 2012-04-27 20:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Linux Kernel, linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Han Ying, Aneesh Kumar K.V,
	Andrew Morton, kamezawa.hiroyuki

On Fri, Apr 27, 2012 at 01:40:35PM -0700, Tejun Heo wrote:
> stands.  The only change necessary is memcg's pre_destroy() not
> returning zero.

Umm.. that should have been "always returning zero". :)

-- 
tejun

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 9/9 v2] memcg: never return error at pre_destroy()
  2012-04-27  6:06 ` [RFC][PATCH 9/9 v2] memcg: never return error at pre_destroy() KAMEZAWA Hiroyuki
@ 2012-04-27 21:28   ` Ying Han
  2012-04-28  0:25     ` Hiroyuki Kamezawa
  2012-05-01 22:28   ` Suleiman Souhlal
  1 sibling, 1 reply; 42+ messages in thread
From: Ying Han @ 2012-04-27 21:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Linux Kernel, linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Tejun Heo, Aneesh Kumar K.V,
	Andrew Morton, kamezawa.hiroyuki

On Thu, Apr 26, 2012 at 11:06 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> When force_empty() called by ->pre_destroy(), no memory reclaim happens
> and it doesn't take very long time which requires signal_pending() check.
> And if we return -EINTR from pre_destroy(), cgroup.c show warning.
>
> This patch removes signal check in force_empty(). By this, ->pre_destroy()
> returns success always.
>
> Note: check for 'cgroup is empty' remains for force_empty interface.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/hugetlb.c    |   10 +---------
>  mm/memcontrol.c |   14 +++++---------
>  2 files changed, 6 insertions(+), 18 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 4dd6b39..770f1642 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1922,20 +1922,12 @@ int hugetlb_force_memcg_empty(struct cgroup *cgroup)
>        int ret = 0, idx = 0;
>
>        do {
> +               /* see memcontrol.c::mem_cgroup_force_empty() */
>                if (cgroup_task_count(cgroup)
>                        || !list_empty(&cgroup->children)) {
>                        ret = -EBUSY;
>                        goto out;
>                }
> -               /*
> -                * If the task doing the cgroup_rmdir got a signal
> -                * we don't really need to loop till the hugetlb resource
> -                * usage become zero.
> -                */
> -               if (signal_pending(current)) {
> -                       ret = -EINTR;
> -                       goto out;
> -               }
>                for_each_hstate(h) {
>                        spin_lock(&hugetlb_lock);
>                        list_for_each_entry(page, &h->hugepage_activelist, lru) {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2715223..ee350c5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3852,8 +3852,6 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
>                pc = lookup_page_cgroup(page);
>
>                ret = mem_cgroup_move_parent(page, pc, memcg, GFP_KERNEL);
> -               if (ret == -ENOMEM || ret == -EINTR)
> -                       break;
>
>                if (ret == -EBUSY || ret == -EINVAL) {
>                        /* found lock contention or "pc" is obsolete. */
> @@ -3863,7 +3861,7 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
>                        busy = NULL;
>        }
>
> -       if (!ret && !list_empty(list))
> +       if (!loop)

This looks a bit strange to me... why we make the change ?

--Ying

>                return -EBUSY;
>        return ret;
>  }
> @@ -3893,11 +3891,12 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
>  move_account:
>        do {
>                ret = -EBUSY;
> +               /*
> +                * This never happens when this is called by ->pre_destroy().
> +                * But we need to take care of force_empty interface.
> +                */
>                if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
>                        goto out;
> -               ret = -EINTR;
> -               if (signal_pending(current))
> -                       goto out;
>                /* This is for making all *used* pages to be on LRU. */
>                lru_add_drain_all();
>                drain_all_stock_sync(memcg);
> @@ -3918,9 +3917,6 @@ move_account:
>                }
>                mem_cgroup_end_move(memcg);
>                memcg_oom_recover(memcg);
> -               /* it seems parent cgroup doesn't have enough mem */
> -               if (ret == -ENOMEM)
> -                       goto try_to_free;
>                cond_resched();
>        /* "ret" should also be checked to ensure all lists are empty. */
>        } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret);
> --
> 1.7.4.1
>
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 0/7 v2] memcg: prevent failure in pre_destroy()
  2012-04-27 18:16 ` [RFC][PATCH 0/7 v2] memcg: prevent failure in pre_destroy() Tejun Heo
@ 2012-04-27 23:48   ` Hiroyuki Kamezawa
  2012-04-28 16:13     ` Michal Hocko
  0 siblings, 1 reply; 42+ messages in thread
From: Hiroyuki Kamezawa @ 2012-04-27 23:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: KAMEZAWA Hiroyuki, Linux Kernel, linux-mm, cgroups, Michal Hocko,
	Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying,
	Aneesh Kumar K.V, Andrew Morton

On Sat, Apr 28, 2012 at 3:16 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Fri, Apr 27, 2012 at 02:45:30PM +0900, KAMEZAWA Hiroyuki wrote:
>> This is a v2 patch for preventing failure in memcg->pre_destroy().
>> With this patch, ->pre_destroy() will never return error code and
>> users will not see warning at rmdir(). And this work will simplify
>> memcg->pre_destroy(), largely.
>>
>> This patch is based on linux-next + hugetlb memory control patches.
>
> Ergh... can you please set up a git branch somewhere for review
> purposes?
>
I'm sorry...I can't. (To do that, I need to pass many my company's check.)
I'll repost all a week later, hugetlb tree will be seen in memcg-devel or
linux-next.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 3/7 v2] res_counter: add res_counter_uncharge_until()
       [not found]   ` <4F9AD28C.60508@parallels.com>
@ 2012-04-27 23:51     ` Hiroyuki Kamezawa
  0 siblings, 0 replies; 42+ messages in thread
From: Hiroyuki Kamezawa @ 2012-04-27 23:51 UTC (permalink / raw)
  To: Glauber Costa
  Cc: KAMEZAWA Hiroyuki, Linux Kernel, linux-mm, cgroups, Michal Hocko,
	Johannes Weiner, Frederic Weisbecker, Tejun Heo, Han Ying,
	Aneesh Kumar K.V, Andrew Morton

On Sat, Apr 28, 2012 at 2:08 AM, Glauber Costa <glommer@parallels.com> wrote:
> On 04/27/2012 02:53 AM, KAMEZAWA Hiroyuki wrote:
>>  From bb0168d5c85f62f36434956e4728a67d0cc41e55 Mon Sep 17 00:00:00 2001
>> From: KAMEZAWA Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com>
>> Date: Thu, 26 Apr 2012 18:48:07 +0900
>> Subject: [PATCH 3/9] memcg: add res_counter_uncharge_until()
>>
>> At killing res_counter which is a child of other counter,
>> we need to do
>>       res_counter_uncharge(child, xxx)
>>       res_counter_charge(parent, xxx)
>>
>> This is not atomic and wasting cpu. This patch adds
>> res_counter_uncharge_until(). This function's uncharge propagates
>> to ancestors until specified res_counter.
>>
>>       res_counter_uncharge_until(child, parent, xxx)
>>
>> This ops is atomic and more efficient.
>>
>> Originaly-written-by: Frederic Weisbecker<fweisbec@gmail.com>
>> Signed-off-by: KAMEZAWA Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com>
> I have been carrying Frederic's patch itself in my code for a while now.
>
> Why not just use it? What are you doing differently to justify writing a
> patch yourself? It's a bit of credit giving as well

I don't need "charge" part for my purpose and have no justification to add it.
And task-limit cgroup has no justification, either.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 3/7 v2] res_counter: add res_counter_uncharge_until()
  2012-04-27 18:18   ` Tejun Heo
@ 2012-04-27 23:51     ` Hiroyuki Kamezawa
  0 siblings, 0 replies; 42+ messages in thread
From: Hiroyuki Kamezawa @ 2012-04-27 23:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: KAMEZAWA Hiroyuki, Linux Kernel, linux-mm, cgroups, Michal Hocko,
	Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying,
	Aneesh Kumar K.V, Andrew Morton

On Sat, Apr 28, 2012 at 3:18 AM, Tejun Heo <tj@kernel.org> wrote:
> On Fri, Apr 27, 2012 at 02:53:03PM +0900, KAMEZAWA Hiroyuki wrote:
>> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
>> index d508363..f4ec411 100644
>> --- a/kernel/res_counter.c
>> +++ b/kernel/res_counter.c
>> @@ -66,6 +66,8 @@ done:
>>       return ret;
>>  }
>>
>> +
>> +
>
> Contamination?

Ah, sorry. I'll fix.

-Kame

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 4/7 v2] memcg: use res_counter_uncharge_until in move_parent
       [not found]   ` <4F9AD455.9030306@parallels.com>
  2012-04-27 18:26     ` Ying Han
@ 2012-04-27 23:58     ` Hiroyuki Kamezawa
  1 sibling, 0 replies; 42+ messages in thread
From: Hiroyuki Kamezawa @ 2012-04-27 23:58 UTC (permalink / raw)
  To: Glauber Costa
  Cc: KAMEZAWA Hiroyuki, Linux Kernel, linux-mm, cgroups, Michal Hocko,
	Johannes Weiner, Frederic Weisbecker, Tejun Heo, Han Ying,
	Aneesh Kumar K.V, Andrew Morton

On Sat, Apr 28, 2012 at 2:16 AM, Glauber Costa <glommer@parallels.com> wrote:
> On 04/27/2012 02:54 AM, KAMEZAWA Hiroyuki wrote:
>> By using res_counter_uncharge_until(), we can avoid
>> unnecessary charging.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com>
>> ---
>>   mm/memcontrol.c |   63 ++++++++++++++++++++++++++++++++++++------------------
>>   1 files changed, 42 insertions(+), 21 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 613bb15..ed53d64 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2420,6 +2420,24 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
>>   }
>>
>>   /*
>> + * Cancel chages in this cgroup....doesn't propagates to parent cgroup.
>> + * This is useful when moving usage to parent cgroup.
>> + */
>> +static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,
>> +                                     unsigned int nr_pages)
>> +{
>> +     if (!mem_cgroup_is_root(memcg)) {
>> +             unsigned long bytes = nr_pages * PAGE_SIZE;
>> +
>> +             res_counter_uncharge_until(&memcg->res,
>> +                                     memcg->res.parent, bytes);
>> +             if (do_swap_account)
>> +                     res_counter_uncharge_until(&memcg->memsw,
>> +                                             memcg->memsw.parent, bytes);
>> +     }
>> +}
>
> Kame, this is a nitpick, but I usually prefer to write this like:
>
> if (mem_cgroup_is_root(memcg))
>   return;
>
> res_counter...
>
> Specially with memcg, where function names are bigger than average, in
> comparison.
>
> the code itself seems fine.
>
Ok, I'll use that style in the next post.

>> +/*
>>    * A helper function to get mem_cgroup from ID. must be called under
>>    * rcu_read_lock(). The caller must check css_is_removed() or some if
>>    * it's concern. (dropping refcnt from swap can be called against removed
>> @@ -2677,16 +2695,28 @@ static int mem_cgroup_move_parent(struct page *page,
>>       nr_pages = hpage_nr_pages(page);
>>
>>       parent = mem_cgroup_from_cont(pcg);
>> -     ret = __mem_cgroup_try_charge(NULL, gfp_mask, nr_pages,&parent, false);
>> -     if (ret)
>> -             goto put_back;
>> +     if (!parent->use_hierarchy) {
> Can we avoid testing for use hierarchy ?
> Specially given this might go away.
>
> parent_mem_cgroup() already bundles this information. So maybe we can
> test for parent_mem_cgroup(parent) == NULL. It is the same thing after all.

We need to find parent even if use_hierarchy==0 in this patch.
I'll consider to use it in later patch, thank you for pointing out.


>> +             ret = __mem_cgroup_try_charge(NULL,
>> +                                     gfp_mask, nr_pages,&parent, false);
>> +             if (ret)
>> +                     goto put_back;
>> +     }
>
> Why? If we are not hierarchical, we should not charge the parent, right?
Current implementation moves charges to parent regardless of use_hierarchy.
It's handled in  a following patch.

>
>>       if (nr_pages>  1)
>>               flags = compound_lock_irqsave(page);
>>
>> -     ret = mem_cgroup_move_account(page, nr_pages, pc, child, parent, true);
>> -     if (ret)
>> -             __mem_cgroup_cancel_charge(parent, nr_pages);
>> +     if (parent->use_hierarchy) {
>> +             ret = mem_cgroup_move_account(page, nr_pages,
>> +                                     pc, child, parent, false);
>> +             if (!ret)
>> +                     __mem_cgroup_cancel_local_charge(child, nr_pages);
>> +     } else {
>> +             ret = mem_cgroup_move_account(page, nr_pages,
>> +                                     pc, child, parent, true);
>> +
>> +             if (ret)
>> +                     __mem_cgroup_cancel_charge(parent, nr_pages);
>> +     }
>
> Calling move account also seems not necessary to me. If we are not
> uncharging + charging, we won't even touch the parent.

we need to overwrite pc->mem_cgroup and touch other statistics.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 4/7 v2] memcg: use res_counter_uncharge_until in move_parent
  2012-04-27 18:20   ` Tejun Heo
@ 2012-04-27 23:59     ` Hiroyuki Kamezawa
  0 siblings, 0 replies; 42+ messages in thread
From: Hiroyuki Kamezawa @ 2012-04-27 23:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: KAMEZAWA Hiroyuki, Linux Kernel, linux-mm, cgroups, Michal Hocko,
	Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying,
	Aneesh Kumar K.V, Andrew Morton

On Sat, Apr 28, 2012 at 3:20 AM, Tejun Heo <tj@kernel.org> wrote:
> On Fri, Apr 27, 2012 at 02:54:58PM +0900, KAMEZAWA Hiroyuki wrote:
>> By using res_counter_uncharge_until(), we can avoid
>> unnecessary charging.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> ---
>>  mm/memcontrol.c |   63 ++++++++++++++++++++++++++++++++++++------------------
>>  1 files changed, 42 insertions(+), 21 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 613bb15..ed53d64 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2420,6 +2420,24 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
>>  }
>>
>>  /*
>> + * Cancel chages in this cgroup....doesn't propagates to parent cgroup.
>
>             ^typo                                     ^ unnecessary s

Ya, thanks. I'll fix it.

-Kame

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 5/9 v2] move charges to root at rmdir if use_hierarchy is unset
  2012-04-27 19:12   ` Ying Han
@ 2012-04-28  0:01     ` Hiroyuki Kamezawa
  0 siblings, 0 replies; 42+ messages in thread
From: Hiroyuki Kamezawa @ 2012-04-28  0:01 UTC (permalink / raw)
  To: Ying Han
  Cc: KAMEZAWA Hiroyuki, Linux Kernel, linux-mm, cgroups, Michal Hocko,
	Johannes Weiner, Frederic Weisbecker, Glauber Costa, Tejun Heo,
	Aneesh Kumar K.V, Andrew Morton

On Sat, Apr 28, 2012 at 4:12 AM, Ying Han <yinghan@google.com> wrote:
> On Thu, Apr 26, 2012 at 10:58 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> Now, at removal of cgroup, ->pre_destroy() is called and move charges
>> to the parent cgroup. A major reason of -EBUSY returned by ->pre_destroy()
>> is that the 'moving' hits parent's resource limitation. It happens only
>> when use_hierarchy=0. This was a mistake of original design.(it's me...)
>
> Nice patch, i can see how broken it is now with use_hierarchy=0...
>
> nitpick on the documentation below:
>
>>
>> Considering use_hierarchy=0, all cgroups are treated as flat. So, no one
>> cannot justify moving charges to parent...parent and children are in
>> flat configuration, not hierarchical.
>>
>> This patch modifes to move charges to root cgroup at rmdir/force_empty
>> if use_hierarchy==0. This will much simplify rmdir() and reduce error
>> in ->pre_destroy.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> ---
>>  Documentation/cgroups/memory.txt |   12 ++++++----
>>  mm/memcontrol.c                  |   39 +++++++++++++------------------------
>>  2 files changed, 21 insertions(+), 30 deletions(-)
>>
>> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
>> index 54c338d..82ce1ef 100644
>> --- a/Documentation/cgroups/memory.txt
>> +++ b/Documentation/cgroups/memory.txt
>> @@ -393,14 +393,14 @@ cgroup might have some charge associated with it, even though all
>>  tasks have migrated away from it. (because we charge against pages, not
>>  against tasks.)
>>
>> -Such charges are freed or moved to their parent. At moving, both of RSS
>> -and CACHES are moved to parent.
>> -rmdir() may return -EBUSY if freeing/moving fails. See 5.1 also.
>> +Such charges are freed or moved to their parent if use_hierarchy=1.
>> +if use_hierarchy=0, the charges will be moved to root cgroup.
>
> It is more clear that we move the stats to root (if use_hierarchy==0)
> or parent (if use_hierarchy==1), and no change on the charge except
> uncharging from the child.
>

Seems nicer. I'll use your text in next ver.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 7/9 v2] cgroup: avoid attaching task to a cgroup under rmdir()
  2012-04-27 10:39   ` Frederic Weisbecker
@ 2012-04-28  0:06     ` Hiroyuki Kamezawa
  0 siblings, 0 replies; 42+ messages in thread
From: Hiroyuki Kamezawa @ 2012-04-28  0:06 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: KAMEZAWA Hiroyuki, Linux Kernel, linux-mm, cgroups, Michal Hocko,
	Johannes Weiner, Glauber Costa, Tejun Heo, Han Ying,
	Aneesh Kumar K.V, Andrew Morton

On Fri, Apr 27, 2012 at 7:39 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> On Fri, Apr 27, 2012 at 03:02:22PM +0900, KAMEZAWA Hiroyuki wrote:
>> attach_task() is done under cgroup_mutex() but ->pre_destroy() callback
>> in rmdir() isn't called under cgroup_mutex().
>>
>> It's better to avoid attaching a task to a cgroup which
>> is under pre_destroy(). Considering memcg, the attached task may
>> increase resource usage after memcg's pre_destroy() confirms that
>> memcg is empty. This is not good.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> ---
>>  kernel/cgroup.c |    5 ++++-
>>  1 files changed, 4 insertions(+), 1 deletions(-)
>>
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index ad8eae5..7a3076b 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -1953,6 +1953,9 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>>       if (cgrp == oldcgrp)
>>               return 0;
>>
>> +     if (test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))
>> +             return -EBUSY;
>> +
>
> You probably need to update cgroup_attach_proc() as well?
>
Ahh...I missed that. Thank you for pointing out !

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 8/9 v2] cgroup: avoid creating new cgroup under a cgroup being destroyed
  2012-04-27 20:40   ` Tejun Heo
  2012-04-27 20:41     ` Tejun Heo
@ 2012-04-28  0:20     ` Hiroyuki Kamezawa
  2012-04-28  2:00       ` Tejun Heo
  1 sibling, 1 reply; 42+ messages in thread
From: Hiroyuki Kamezawa @ 2012-04-28  0:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: KAMEZAWA Hiroyuki, Linux Kernel, linux-mm, cgroups, Michal Hocko,
	Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying,
	Aneesh Kumar K.V, Andrew Morton

On Sat, Apr 28, 2012 at 5:40 AM, Tejun Heo <tj@kernel.org> wrote:
> On Fri, Apr 27, 2012 at 03:04:14PM +0900, KAMEZAWA Hiroyuki wrote:
>> When ->pre_destroy() is called, it should be guaranteed that
>> new child cgroup is not created under a cgroup, where pre_destroy()
>> is running. If not, ->pre_destroy() must check children and
>> return -EBUSY, which causes warning.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Hmm... I'm getting confused more.  Why do we need these cgroup changes
> at all?  cgroup still has cgrp->count check and
> cgroup_clear_css_refs() after pre_destroy() calls.  The order of
> changes should be,
>
> * Make memcg pre_destroy() not fail; however, pre_destroy() should
>  still be ready to be retried.  That's the defined interface.
>
> * cgroup core updated to drop pre_destroy() retrying and guarantee
>  that pre_destroy() invocation will happen only once.
>
> * memcg and other cgroups can update their pre_destroy() if the "won't
>  be retried" part can simplify their implementations.
>

What I thought was...
Assume a memory cgoup A, with use_hierarchy==1.

1.  thread:0   start calling pre->destroy of cgroup A
2.  thread:0   it sometimes calls cond_resched or other sleep functions.
3.  thread:1   create a cgroup B under "A"
4.  thread:1   attach a thread X to cgroup A/B
5.  res_counter of A charged up. but pre_destroy() can't find what happens
    because it scans LRU of A.

So, we have -EBUSY now. I considered some options to fix this.

option 1) just return 0 instead of -EBUSY when pre_destroy() finds a
task or a child.

There is a race....even if we return 0 here and expects cgroup code
can catch it,
the thread or a child we found may be moved to other cgroup before we check it
in cgroup's final check.
In that case, the cgroup will be freed before full-ack of
pre_destory() and the charges
will be lost.

option 2) move all codes to ->destory()
That was previous version of this set.

This is option3 that preventing creation of new child.

If you don't like this, I'll move all codes to ->destroy() and use
asynchronous again.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 9/9 v2] memcg: never return error at pre_destroy()
  2012-04-27 21:28   ` Ying Han
@ 2012-04-28  0:25     ` Hiroyuki Kamezawa
  2012-04-30 17:02       ` Ying Han
  0 siblings, 1 reply; 42+ messages in thread
From: Hiroyuki Kamezawa @ 2012-04-28  0:25 UTC (permalink / raw)
  To: Ying Han
  Cc: KAMEZAWA Hiroyuki, Linux Kernel, linux-mm, cgroups, Michal Hocko,
	Johannes Weiner, Frederic Weisbecker, Glauber Costa, Tejun Heo,
	Aneesh Kumar K.V, Andrew Morton

On Sat, Apr 28, 2012 at 6:28 AM, Ying Han <yinghan@google.com> wrote:
> On Thu, Apr 26, 2012 at 11:06 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> When force_empty() called by ->pre_destroy(), no memory reclaim happens
>> and it doesn't take very long time which requires signal_pending() check.
>> And if we return -EINTR from pre_destroy(), cgroup.c show warning.
>>
>> This patch removes signal check in force_empty(). By this, ->pre_destroy()
>> returns success always.
>>
>> Note: check for 'cgroup is empty' remains for force_empty interface.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> ---
>>  mm/hugetlb.c    |   10 +---------
>>  mm/memcontrol.c |   14 +++++---------
>>  2 files changed, 6 insertions(+), 18 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 4dd6b39..770f1642 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1922,20 +1922,12 @@ int hugetlb_force_memcg_empty(struct cgroup *cgroup)
>>        int ret = 0, idx = 0;
>>
>>        do {
>> +               /* see memcontrol.c::mem_cgroup_force_empty() */
>>                if (cgroup_task_count(cgroup)
>>                        || !list_empty(&cgroup->children)) {
>>                        ret = -EBUSY;
>>                        goto out;
>>                }
>> -               /*
>> -                * If the task doing the cgroup_rmdir got a signal
>> -                * we don't really need to loop till the hugetlb resource
>> -                * usage become zero.
>> -                */
>> -               if (signal_pending(current)) {
>> -                       ret = -EINTR;
>> -                       goto out;
>> -               }
>>                for_each_hstate(h) {
>>                        spin_lock(&hugetlb_lock);
>>                        list_for_each_entry(page, &h->hugepage_activelist, lru) {
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 2715223..ee350c5 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3852,8 +3852,6 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
>>                pc = lookup_page_cgroup(page);
>>
>>                ret = mem_cgroup_move_parent(page, pc, memcg, GFP_KERNEL);
>> -               if (ret == -ENOMEM || ret == -EINTR)
>> -                       break;
>>
>>                if (ret == -EBUSY || ret == -EINVAL) {
>>                        /* found lock contention or "pc" is obsolete. */
>> @@ -3863,7 +3861,7 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
>>                        busy = NULL;
>>        }
>>
>> -       if (!ret && !list_empty(list))
>> +       if (!loop)
>
> This looks a bit strange to me... why we make the change ?
>
Ah, I should this move to an independet patch.
Because we don't have -ENOMEM path to exit loop, the return value of
this function
is
  0 (if loop !=0 this means lru is empty under the lru lock )
  -EBUSY (if loop== 0)

I'll move this part out as an independent clean up patch

thanks,
-kame

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 8/9 v2] cgroup: avoid creating new cgroup under a cgroup being destroyed
  2012-04-28  0:20     ` Hiroyuki Kamezawa
@ 2012-04-28  2:00       ` Tejun Heo
  2012-04-28  9:31         ` Hiroyuki Kamezawa
  0 siblings, 1 reply; 42+ messages in thread
From: Tejun Heo @ 2012-04-28  2:00 UTC (permalink / raw)
  To: Hiroyuki Kamezawa
  Cc: KAMEZAWA Hiroyuki, Linux Kernel, linux-mm, cgroups, Michal Hocko,
	Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying,
	Aneesh Kumar K.V, Andrew Morton

Hi, KAME.

On Sat, Apr 28, 2012 at 09:20:52AM +0900, Hiroyuki Kamezawa wrote:
> What I thought was...
> Assume a memory cgoup A, with use_hierarchy==1.
> 
> 1.  thread:0   start calling pre->destroy of cgroup A
> 2.  thread:0   it sometimes calls cond_resched or other sleep functions.
> 3.  thread:1   create a cgroup B under "A"
> 4.  thread:1   attach a thread X to cgroup A/B
> 5.  res_counter of A charged up. but pre_destroy() can't find what happens
>     because it scans LRU of A.
> 
> So, we have -EBUSY now. I considered some options to fix this.
> 
> option 1) just return 0 instead of -EBUSY when pre_destroy() finds a
> task or a child.
> 
> There is a race....even if we return 0 here and expects cgroup code
> can catch it,
> the thread or a child we found may be moved to other cgroup before we check it
> in cgroup's final check.
> In that case, the cgroup will be freed before full-ack of
> pre_destory() and the charges
> will be lost.

So, cgroup code won't proceed with rmdir if children are created
inbetween and note that the race condition of lost charge you
described above existed before this change - ie. new cgroup could be
created after pre_destroy() is complete.

The current cgroup rmdir code is transitional.  It has to support both
retrying and non-retrying pre_destroy()s and that means we can't mark
the cgroup DEAD before starting invoking pre_destroy(); however, we
can do that once memcg's pre_destroy() is converted which will also
remove all the WAIT_ON_RMDIR mechanism and the above described race.

There really isn't much point in trying to make the current cgroup
rmdir behave perfectly when the next step is removing all the fixed up
parts.

So, IMHO, just making pre_destroy() clean up its own charges and
always returning 0 is enough.  There's no need to fix up old
non-critical race condition at this point in the patch stream.  cgroup
rmdir simplification will make them disappear anyway.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 8/9 v2] cgroup: avoid creating new cgroup under a cgroup being destroyed
  2012-04-28  2:00       ` Tejun Heo
@ 2012-04-28  9:31         ` Hiroyuki Kamezawa
  2012-04-28 21:31           ` Tejun Heo
  0 siblings, 1 reply; 42+ messages in thread
From: Hiroyuki Kamezawa @ 2012-04-28  9:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: KAMEZAWA Hiroyuki, Linux Kernel, linux-mm, cgroups, Michal Hocko,
	Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying,
	Aneesh Kumar K.V, Andrew Morton

On Sat, Apr 28, 2012 at 11:00 AM, Tejun Heo <tj@kernel.org> wrote:
> Hi, KAME.
>
> On Sat, Apr 28, 2012 at 09:20:52AM +0900, Hiroyuki Kamezawa wrote:
>> What I thought was...
>> Assume a memory cgoup A, with use_hierarchy==1.
>>
>> 1.  thread:0   start calling pre->destroy of cgroup A
>> 2.  thread:0   it sometimes calls cond_resched or other sleep functions.
>> 3.  thread:1   create a cgroup B under "A"
>> 4.  thread:1   attach a thread X to cgroup A/B
>> 5.  res_counter of A charged up. but pre_destroy() can't find what happens
>>     because it scans LRU of A.
>>
>> So, we have -EBUSY now. I considered some options to fix this.
>>
>> option 1) just return 0 instead of -EBUSY when pre_destroy() finds a
>> task or a child.
>>
>> There is a race....even if we return 0 here and expects cgroup code
>> can catch it,
>> the thread or a child we found may be moved to other cgroup before we check it
>> in cgroup's final check.
>> In that case, the cgroup will be freed before full-ack of
>> pre_destory() and the charges
>> will be lost.
>
> So, cgroup code won't proceed with rmdir if children are created
> inbetween and note that the race condition of lost charge you
> described above existed before this change - ie. new cgroup could be
> created after pre_destroy() is complete.
>
> The current cgroup rmdir code is transitional.  It has to support both
> retrying and non-retrying pre_destroy()s and that means we can't mark
> the cgroup DEAD before starting invoking pre_destroy(); however, we
> can do that once memcg's pre_destroy() is converted which will also
> remove all the WAIT_ON_RMDIR mechanism and the above described race.
>
> There really isn't much point in trying to make the current cgroup
> rmdir behave perfectly when the next step is removing all the fixed up
> parts.
>
> So, IMHO, just making pre_destroy() clean up its own charges and
> always returning 0 is enough.  There's no need to fix up old
> non-critical race condition at this point in the patch stream.  cgroup
> rmdir simplification will make them disappear anyway.
>
So, hmm, ok. I'll drop patch 7 & 8. memcg may return -EBUSY in very very
race case but users will not see it in the most case.
I'll fix limit, move-charge and use_hierarchy problem first.
Thanks,
-Kame

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 0/7 v2] memcg: prevent failure in pre_destroy()
  2012-04-27 23:48   ` Hiroyuki Kamezawa
@ 2012-04-28 16:13     ` Michal Hocko
  2012-04-29  6:03       ` Michal Hocko
  0 siblings, 1 reply; 42+ messages in thread
From: Michal Hocko @ 2012-04-28 16:13 UTC (permalink / raw)
  To: Hiroyuki Kamezawa
  Cc: Tejun Heo, KAMEZAWA Hiroyuki, Linux Kernel, linux-mm, cgroups,
	Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying,
	Aneesh Kumar K.V, Andrew Morton

On Sat 28-04-12 08:48:18, Hiroyuki Kamezawa wrote:
> On Sat, Apr 28, 2012 at 3:16 AM, Tejun Heo <tj@kernel.org> wrote:
> > Hello,
> >
> > On Fri, Apr 27, 2012 at 02:45:30PM +0900, KAMEZAWA Hiroyuki wrote:
> >> This is a v2 patch for preventing failure in memcg->pre_destroy().
> >> With this patch, ->pre_destroy() will never return error code and
> >> users will not see warning at rmdir(). And this work will simplify
> >> memcg->pre_destroy(), largely.
> >>
> >> This patch is based on linux-next + hugetlb memory control patches.
> >
> > Ergh... can you please set up a git branch somewhere for review
> > purposes?
> >
> I'm sorry...I can't. (To do that, I need to pass many my company's check.)
> I'll repost all a week later, hugetlb tree will be seen in memcg-devel or
> linux-next.

I can push it to memcg-devel tree if you want.

> 
> Thanks,
> -Kame
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 8/9 v2] cgroup: avoid creating new cgroup under a cgroup being destroyed
  2012-04-28  9:31         ` Hiroyuki Kamezawa
@ 2012-04-28 21:31           ` Tejun Heo
  0 siblings, 0 replies; 42+ messages in thread
From: Tejun Heo @ 2012-04-28 21:31 UTC (permalink / raw)
  To: Hiroyuki Kamezawa
  Cc: KAMEZAWA Hiroyuki, Linux Kernel, linux-mm, cgroups, Michal Hocko,
	Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying,
	Aneesh Kumar K.V, Andrew Morton

Hello, KAME.

On Sat, Apr 28, 2012 at 06:31:38PM +0900, Hiroyuki Kamezawa wrote:
> > So, IMHO, just making pre_destroy() clean up its own charges and
> > always returning 0 is enough.  There's no need to fix up old
> > non-critical race condition at this point in the patch stream.  cgroup
> > rmdir simplification will make them disappear anyway.
>
> So, hmm, ok. I'll drop patch 7 & 8. memcg may return -EBUSY in very very
> race case but users will not see it in the most case.
> I'll fix limit, move-charge and use_hierarchy problem first.

IIUC, memcg can just return 0 when child creation races against
pre_destroy().  cgroup will retry if child exists after pre_destroy()
completion.  If child comes and goes before cgroup checks its
existence, some charges may be lost but that race already exists and
it will be gone once the retry logic is removed.  Also, returning
-errno will trigger WARN_ON() w/o the legacy behavior flag.

Thank you very much.

-- 
tejun

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 0/7 v2] memcg: prevent failure in pre_destroy()
  2012-04-28 16:13     ` Michal Hocko
@ 2012-04-29  6:03       ` Michal Hocko
  0 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2012-04-29  6:03 UTC (permalink / raw)
  To: Hiroyuki Kamezawa
  Cc: Tejun Heo, KAMEZAWA Hiroyuki, Linux Kernel, linux-mm, cgroups,
	Johannes Weiner, Frederic Weisbecker, Glauber Costa, Han Ying,
	Aneesh Kumar K.V, Andrew Morton

On Sat 28-04-12 18:13:58, Michal Hocko wrote:
> On Sat 28-04-12 08:48:18, Hiroyuki Kamezawa wrote:
> > On Sat, Apr 28, 2012 at 3:16 AM, Tejun Heo <tj@kernel.org> wrote:
> > > Hello,
> > >
> > > On Fri, Apr 27, 2012 at 02:45:30PM +0900, KAMEZAWA Hiroyuki wrote:
> > >> This is a v2 patch for preventing failure in memcg->pre_destroy().
> > >> With this patch, ->pre_destroy() will never return error code and
> > >> users will not see warning at rmdir(). And this work will simplify
> > >> memcg->pre_destroy(), largely.
> > >>
> > >> This patch is based on linux-next + hugetlb memory control patches.
> > >
> > > Ergh... can you please set up a git branch somewhere for review
> > > purposes?
> > >
> > I'm sorry...I can't. (To do that, I need to pass many my company's check.)
> > I'll repost all a week later, hugetlb tree will be seen in memcg-devel or
> > linux-next.
> 
> I can push it to memcg-devel tree if you want.

As a separate branch of course...

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 1/7 v2] temporal compile-fix in linux-next
  2012-04-27  5:49 ` [RFC][PATCH 1/7 v2] temporal compile-fix in linux-next KAMEZAWA Hiroyuki
@ 2012-04-30  8:47   ` Aneesh Kumar K.V
  0 siblings, 0 replies; 42+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-30  8:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Linux Kernel
  Cc: linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying,
	Andrew Morton, kamezawa.hiroyuki

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> Maybe Aneesh will post his own version. This is just for my work.
>

-next should have fixes for these errors now.

-aneesh


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 2/7 v2] memcg: fix error code in hugetlb_force_memcg_empty()
  2012-04-27  5:51 ` [RFC][PATCH 2/7 v2] memcg: fix error code in hugetlb_force_memcg_empty() KAMEZAWA Hiroyuki
@ 2012-04-30  8:49   ` Aneesh Kumar K.V
  0 siblings, 0 replies; 42+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-30  8:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Linux Kernel
  Cc: linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying,
	Andrew Morton, kamezawa.hiroyuki

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> EBUSY should be returned.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/hugetlb.c |    5 ++++-
>  1 files changed, 4 insertions(+), 1 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 17ae2e4..4dd6b39 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1922,8 +1922,11 @@ int hugetlb_force_memcg_empty(struct cgroup *cgroup)
>  	int ret = 0, idx = 0;
>
>  	do {
> -		if (cgroup_task_count(cgroup) || !list_empty(&cgroup->children))
> +		if (cgroup_task_count(cgroup)
> +			|| !list_empty(&cgroup->children)) {
> +			ret = -EBUSY;
>  			goto out;
> +		}
>  		/*
>  		 * If the task doing the cgroup_rmdir got a signal
>  		 * we don't really need to loop till the hugetlb resource

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

-aneesh


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 4/7 v2] memcg: use res_counter_uncharge_until in move_parent
  2012-04-27  5:54 ` [RFC][PATCH 4/7 v2] memcg: use res_counter_uncharge_until in move_parent KAMEZAWA Hiroyuki
  2012-04-27 18:20   ` Tejun Heo
       [not found]   ` <4F9AD455.9030306@parallels.com>
@ 2012-04-30  9:00   ` Aneesh Kumar K.V
  2 siblings, 0 replies; 42+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-30  9:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Linux Kernel
  Cc: linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying,
	Andrew Morton, kamezawa.hiroyuki

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> By using res_counter_uncharge_until(), we can avoid 
> unnecessary charging.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

> ---
>  mm/memcontrol.c |   63 ++++++++++++++++++++++++++++++++++++------------------
>  1 files changed, 42 insertions(+), 21 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 613bb15..ed53d64 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2420,6 +2420,24 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
>  }
>
>  /*
> + * Cancel chages in this cgroup....doesn't propagates to parent cgroup.
> + * This is useful when moving usage to parent cgroup.
> + */
> +static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,
> +					unsigned int nr_pages)
> +{
> +	if (!mem_cgroup_is_root(memcg)) {
> +		unsigned long bytes = nr_pages * PAGE_SIZE;
> +
> +		res_counter_uncharge_until(&memcg->res,
> +					memcg->res.parent, bytes);
> +		if (do_swap_account)
> +			res_counter_uncharge_until(&memcg->memsw,
> +						memcg->memsw.parent, bytes);
> +	}
> +}
> +
> +/*
>   * A helper function to get mem_cgroup from ID. must be called under
>   * rcu_read_lock(). The caller must check css_is_removed() or some if
>   * it's concern. (dropping refcnt from swap can be called against removed
> @@ -2677,16 +2695,28 @@ static int mem_cgroup_move_parent(struct page *page,
>  	nr_pages = hpage_nr_pages(page);
>
>  	parent = mem_cgroup_from_cont(pcg);
> -	ret = __mem_cgroup_try_charge(NULL, gfp_mask, nr_pages, &parent, false);
> -	if (ret)
> -		goto put_back;
> +	if (!parent->use_hierarchy) {
> +		ret = __mem_cgroup_try_charge(NULL,
> +					gfp_mask, nr_pages, &parent, false);
> +		if (ret)
> +			goto put_back;
> +	}
>
>  	if (nr_pages > 1)
>  		flags = compound_lock_irqsave(page);
>
> -	ret = mem_cgroup_move_account(page, nr_pages, pc, child, parent, true);
> -	if (ret)
> -		__mem_cgroup_cancel_charge(parent, nr_pages);
> +	if (parent->use_hierarchy) {
> +		ret = mem_cgroup_move_account(page, nr_pages,
> +					pc, child, parent, false);
> +		if (!ret)
> +			__mem_cgroup_cancel_local_charge(child, nr_pages);
> +	} else {
> +		ret = mem_cgroup_move_account(page, nr_pages,
> +					pc, child, parent, true);
> +
> +		if (ret)
> +			__mem_cgroup_cancel_charge(parent, nr_pages);
> +	}
>

May be a comment around this ? I had to look closer to find why
there is a if (!ret) and if (ret) difference. 


>  	if (nr_pages > 1)
>  		compound_unlock_irqrestore(page, flags);
> @@ -3295,6 +3325,7 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
>  	struct cgroup *pcgrp = cgroup->parent;
>  	struct mem_cgroup *parent = mem_cgroup_from_cont(pcgrp);
>  	struct mem_cgroup *memcg  = mem_cgroup_from_cont(cgroup);
> +	struct res_counter *counter;
>
>  	if (!get_page_unless_zero(page))
>  		goto out;
> @@ -3305,28 +3336,18 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
>  		goto err_out;
>
>  	csize = PAGE_SIZE << compound_order(page);
> -	/*
> -	 * If we have use_hierarchy set we can never fail here. So instead of
> -	 * using res_counter_uncharge use the open-coded variant which just
> -	 * uncharge the child res_counter. The parent will retain the charge.
> -	 */
> -	if (parent->use_hierarchy) {
> -		unsigned long flags;
> -		struct res_counter *counter;
> -
> -		counter = &memcg->hugepage[idx];
> -		spin_lock_irqsave(&counter->lock, flags);
> -		res_counter_uncharge_locked(counter, csize);
> -		spin_unlock_irqrestore(&counter->lock, flags);
> -	} else {
> +	/* If parent->use_hierarchy == 0, we need to charge parent */
> +	if (!parent->use_hierarchy) {
>  		ret = res_counter_charge(&parent->hugepage[idx],
>  					 csize, &fail_res);
>  		if (ret) {
>  			ret = -EBUSY;
>  			goto err_out;
>  		}
> -		res_counter_uncharge(&memcg->hugepage[idx], csize);
>  	}
> +	counter = &memcg->hugepage[idx];
> +	res_counter_uncharge_until(counter, counter->parent, csize);
> +
>  	pc->mem_cgroup = parent;
>  err_out:
>  	unlock_page_cgroup(pc);

-aneesh


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 5/9 v2] move charges to root at rmdir if use_hierarchy is unset
  2012-04-27  5:58 ` [RFC][PATCH 5/9 v2] move charges to root at rmdir if use_hierarchy is unset KAMEZAWA Hiroyuki
  2012-04-27 19:12   ` Ying Han
@ 2012-04-30  9:07   ` Aneesh Kumar K.V
  1 sibling, 0 replies; 42+ messages in thread
From: Aneesh Kumar K.V @ 2012-04-30  9:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Linux Kernel
  Cc: linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying,
	Andrew Morton, kamezawa.hiroyuki

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> Now, at removal of cgroup, ->pre_destroy() is called and move charges
> to the parent cgroup. A major reason of -EBUSY returned by ->pre_destroy()
> is that the 'moving' hits parent's resource limitation. It happens only
> when use_hierarchy=0. This was a mistake of original design.(it's me...)
>
> Considering use_hierarchy=0, all cgroups are treated as flat. So, no one
> cannot justify moving charges to parent...parent and children are in
> flat configuration, not hierarchical.
>
> This patch modifes to move charges to root cgroup at rmdir/force_empty
> if use_hierarchy==0. This will much simplify rmdir() and reduce error
> in ->pre_destroy.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Reviewed-by: Anees Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>


> ---
>  Documentation/cgroups/memory.txt |   12 ++++++----
>  mm/memcontrol.c                  |   39 +++++++++++++------------------------
>  2 files changed, 21 insertions(+), 30 deletions(-)
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index 54c338d..82ce1ef 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -393,14 +393,14 @@ cgroup might have some charge associated with it, even though all
>  tasks have migrated away from it. (because we charge against pages, not
>  against tasks.)
>
> -Such charges are freed or moved to their parent. At moving, both of RSS
> -and CACHES are moved to parent.
> -rmdir() may return -EBUSY if freeing/moving fails. See 5.1 also.
> +Such charges are freed or moved to their parent if use_hierarchy=1.
> +if use_hierarchy=0, the charges will be moved to root cgroup.
>
>  Charges recorded in swap information is not updated at removal of cgroup.
>  Recorded information is discarded and a cgroup which uses swap (swapcache)
>  will be charged as a new owner of it.
>
> +About use_hierarchy, see Section 6.
>
>  5. Misc. interfaces.
>
> @@ -413,13 +413,15 @@ will be charged as a new owner of it.
>
>    Almost all pages tracked by this memory cgroup will be unmapped and freed.
>    Some pages cannot be freed because they are locked or in-use. Such pages are
> -  moved to parent and this cgroup will be empty. This may return -EBUSY if
> -  VM is too busy to free/move all pages immediately.
> +  moved to parent(if use_hierarchy==1) or root (if use_hierarchy==0) and this
> +  cgroup will be empty.
>
>    Typical use case of this interface is that calling this before rmdir().
>    Because rmdir() moves all pages to parent, some out-of-use page caches can be
>    moved to the parent. If you want to avoid that, force_empty will be useful.
>
> +  About use_hierarchy, see Section 6.
> +
>  5.2 stat file
>
>  memory.stat file includes following statistics
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ed53d64..62200f1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2695,32 +2695,23 @@ static int mem_cgroup_move_parent(struct page *page,
>  	nr_pages = hpage_nr_pages(page);
>
>  	parent = mem_cgroup_from_cont(pcg);
> -	if (!parent->use_hierarchy) {
> -		ret = __mem_cgroup_try_charge(NULL,
> -					gfp_mask, nr_pages, &parent, false);
> -		if (ret)
> -			goto put_back;
> -	}
> +	/*
> +	 * if use_hierarchy==0, move charges to root cgroup.
> +	 * in root cgroup, we don't touch res_counter
> +	 */
> +	if (!parent->use_hierarchy)
> +		parent = root_mem_cgroup;
>
>  	if (nr_pages > 1)
>  		flags = compound_lock_irqsave(page);
>
> -	if (parent->use_hierarchy) {
> -		ret = mem_cgroup_move_account(page, nr_pages,
> -					pc, child, parent, false);
> -		if (!ret)
> -			__mem_cgroup_cancel_local_charge(child, nr_pages);
> -	} else {
> -		ret = mem_cgroup_move_account(page, nr_pages,
> -					pc, child, parent, true);
> -
> -		if (ret)
> -			__mem_cgroup_cancel_charge(parent, nr_pages);
> -	}
> +	ret = mem_cgroup_move_account(page, nr_pages,
> +				pc, child, parent, false);
> +	if (!ret)
> +		__mem_cgroup_cancel_local_charge(child, nr_pages);
>
>  	if (nr_pages > 1)
>  		compound_unlock_irqrestore(page, flags);
> -put_back:
>  	putback_lru_page(page);
>  put:
>  	put_page(page);
> @@ -3338,12 +3329,10 @@ int mem_cgroup_move_hugetlb_parent(int idx, struct cgroup *cgroup,
>  	csize = PAGE_SIZE << compound_order(page);
>  	/* If parent->use_hierarchy == 0, we need to charge parent */
>  	if (!parent->use_hierarchy) {
> -		ret = res_counter_charge(&parent->hugepage[idx],
> -					 csize, &fail_res);
> -		if (ret) {
> -			ret = -EBUSY;
> -			goto err_out;
> -		}
> +		parent = root_mem_cgroup;
> +		/* root has no limit */
> +		res_counter_charge_nofail(&parent->hugepage[idx],
> +				 csize, &fail_res);
>  	}
>  	counter = &memcg->hugepage[idx];
>  	res_counter_uncharge_until(counter, counter->parent, csize);
> -- 

-aneesh


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 9/9 v2] memcg: never return error at pre_destroy()
  2012-04-28  0:25     ` Hiroyuki Kamezawa
@ 2012-04-30 17:02       ` Ying Han
  0 siblings, 0 replies; 42+ messages in thread
From: Ying Han @ 2012-04-30 17:02 UTC (permalink / raw)
  To: Hiroyuki Kamezawa
  Cc: KAMEZAWA Hiroyuki, Linux Kernel, linux-mm, cgroups, Michal Hocko,
	Johannes Weiner, Frederic Weisbecker, Glauber Costa, Tejun Heo,
	Aneesh Kumar K.V, Andrew Morton

On Fri, Apr 27, 2012 at 5:25 PM, Hiroyuki Kamezawa
<kamezawa.hiroyuki@gmail.com> wrote:
> On Sat, Apr 28, 2012 at 6:28 AM, Ying Han <yinghan@google.com> wrote:
>> On Thu, Apr 26, 2012 at 11:06 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>> When force_empty() called by ->pre_destroy(), no memory reclaim happens
>>> and it doesn't take very long time which requires signal_pending() check.
>>> And if we return -EINTR from pre_destroy(), cgroup.c show warning.
>>>
>>> This patch removes signal check in force_empty(). By this, ->pre_destroy()
>>> returns success always.
>>>
>>> Note: check for 'cgroup is empty' remains for force_empty interface.
>>>
>>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>> ---
>>>  mm/hugetlb.c    |   10 +---------
>>>  mm/memcontrol.c |   14 +++++---------
>>>  2 files changed, 6 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 4dd6b39..770f1642 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -1922,20 +1922,12 @@ int hugetlb_force_memcg_empty(struct cgroup *cgroup)
>>>        int ret = 0, idx = 0;
>>>
>>>        do {
>>> +               /* see memcontrol.c::mem_cgroup_force_empty() */
>>>                if (cgroup_task_count(cgroup)
>>>                        || !list_empty(&cgroup->children)) {
>>>                        ret = -EBUSY;
>>>                        goto out;
>>>                }
>>> -               /*
>>> -                * If the task doing the cgroup_rmdir got a signal
>>> -                * we don't really need to loop till the hugetlb resource
>>> -                * usage become zero.
>>> -                */
>>> -               if (signal_pending(current)) {
>>> -                       ret = -EINTR;
>>> -                       goto out;
>>> -               }
>>>                for_each_hstate(h) {
>>>                        spin_lock(&hugetlb_lock);
>>>                        list_for_each_entry(page, &h->hugepage_activelist, lru) {
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 2715223..ee350c5 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -3852,8 +3852,6 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
>>>                pc = lookup_page_cgroup(page);
>>>
>>>                ret = mem_cgroup_move_parent(page, pc, memcg, GFP_KERNEL);
>>> -               if (ret == -ENOMEM || ret == -EINTR)
>>> -                       break;
>>>
>>>                if (ret == -EBUSY || ret == -EINVAL) {
>>>                        /* found lock contention or "pc" is obsolete. */
>>> @@ -3863,7 +3861,7 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
>>>                        busy = NULL;
>>>        }
>>>
>>> -       if (!ret && !list_empty(list))
>>> +       if (!loop)
>>
>> This looks a bit strange to me... why we make the change ?
>>
> Ah, I should this move to an independet patch.
> Because we don't have -ENOMEM path to exit loop, the return value of
> this function
> is
>  0 (if loop !=0 this means lru is empty under the lru lock )
>  -EBUSY (if loop== 0)

>
> I'll move this part out as an independent clean up patch

Thanks ~

--Ying

> thanks,
> -kame

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 9/9 v2] memcg: never return error at pre_destroy()
  2012-04-27  6:06 ` [RFC][PATCH 9/9 v2] memcg: never return error at pre_destroy() KAMEZAWA Hiroyuki
  2012-04-27 21:28   ` Ying Han
@ 2012-05-01 22:28   ` Suleiman Souhlal
  2012-05-02  3:34     ` Hiroyuki Kamezawa
  1 sibling, 1 reply; 42+ messages in thread
From: Suleiman Souhlal @ 2012-05-01 22:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Linux Kernel, linux-mm, cgroups, Michal Hocko, Johannes Weiner,
	Frederic Weisbecker, Glauber Costa, Tejun Heo, Han Ying,
	Aneesh Kumar K.V, Andrew Morton, kamezawa.hiroyuki

2012/4/26 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>:
> When force_empty() called by ->pre_destroy(), no memory reclaim happens
> and it doesn't take very long time which requires signal_pending() check.
> And if we return -EINTR from pre_destroy(), cgroup.c show warning.
>
> This patch removes signal check in force_empty(). By this, ->pre_destroy()
> returns success always.
>
> Note: check for 'cgroup is empty' remains for force_empty interface.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/hugetlb.c    |   10 +---------
>  mm/memcontrol.c |   14 +++++---------
>  2 files changed, 6 insertions(+), 18 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 4dd6b39..770f1642 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1922,20 +1922,12 @@ int hugetlb_force_memcg_empty(struct cgroup *cgroup)
>        int ret = 0, idx = 0;
>
>        do {
> +               /* see memcontrol.c::mem_cgroup_force_empty() */
>                if (cgroup_task_count(cgroup)
>                        || !list_empty(&cgroup->children)) {
>                        ret = -EBUSY;
>                        goto out;
>                }
> -               /*
> -                * If the task doing the cgroup_rmdir got a signal
> -                * we don't really need to loop till the hugetlb resource
> -                * usage become zero.
> -                */
> -               if (signal_pending(current)) {
> -                       ret = -EINTR;
> -                       goto out;
> -               }
>                for_each_hstate(h) {
>                        spin_lock(&hugetlb_lock);
>                        list_for_each_entry(page, &h->hugepage_activelist, lru) {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2715223..ee350c5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3852,8 +3852,6 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
>                pc = lookup_page_cgroup(page);
>
>                ret = mem_cgroup_move_parent(page, pc, memcg, GFP_KERNEL);
> -               if (ret == -ENOMEM || ret == -EINTR)
> -                       break;
>
>                if (ret == -EBUSY || ret == -EINVAL) {
>                        /* found lock contention or "pc" is obsolete. */
> @@ -3863,7 +3861,7 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
>                        busy = NULL;
>        }
>
> -       if (!ret && !list_empty(list))
> +       if (!loop)
>                return -EBUSY;
>        return ret;
>  }
> @@ -3893,11 +3891,12 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
>  move_account:
>        do {
>                ret = -EBUSY;
> +               /*
> +                * This never happens when this is called by ->pre_destroy().
> +                * But we need to take care of force_empty interface.
> +                */
>                if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
>                        goto out;

Are you sure this never happens when called by ->pre_destroy()?
Can't a task still get attached to the cgroup while ->pre_destroy() is running?

At least, I don't see anything in the cgroup code that prevents
someone from newly attaching a task at that point.
In fact, there is code that seems to handle the case when someone
attached to the cgroup after pre_destroy() has run: See the
cgroup_wakeup_rmdir_waiter() call in cgroup_attach_task().

-- Suleiman

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH 9/9 v2] memcg: never return error at pre_destroy()
  2012-05-01 22:28   ` Suleiman Souhlal
@ 2012-05-02  3:34     ` Hiroyuki Kamezawa
  0 siblings, 0 replies; 42+ messages in thread
From: Hiroyuki Kamezawa @ 2012-05-02  3:34 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: KAMEZAWA Hiroyuki, Linux Kernel, linux-mm, cgroups, Michal Hocko,
	Johannes Weiner, Frederic Weisbecker, Glauber Costa, Tejun Heo,
	Han Ying, Aneesh Kumar K.V, Andrew Morton

On Wed, May 2, 2012 at 7:28 AM, Suleiman Souhlal <suleiman@google.com> wrote:
> 2012/4/26 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>:
>> When force_empty() called by ->pre_destroy(), no memory reclaim happens
>> and it doesn't take very long time which requires signal_pending() check.
>> And if we return -EINTR from pre_destroy(), cgroup.c show warning.
>>
>> This patch removes signal check in force_empty(). By this, ->pre_destroy()
>> returns success always.
>>
>> Note: check for 'cgroup is empty' remains for force_empty interface.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> ---
>>  mm/hugetlb.c    |   10 +---------
>>  mm/memcontrol.c |   14 +++++---------
>>  2 files changed, 6 insertions(+), 18 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 4dd6b39..770f1642 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1922,20 +1922,12 @@ int hugetlb_force_memcg_empty(struct cgroup *cgroup)
>>        int ret = 0, idx = 0;
>>
>>        do {
>> +               /* see memcontrol.c::mem_cgroup_force_empty() */
>>                if (cgroup_task_count(cgroup)
>>                        || !list_empty(&cgroup->children)) {
>>                        ret = -EBUSY;
>>                        goto out;
>>                }
>> -               /*
>> -                * If the task doing the cgroup_rmdir got a signal
>> -                * we don't really need to loop till the hugetlb resource
>> -                * usage become zero.
>> -                */
>> -               if (signal_pending(current)) {
>> -                       ret = -EINTR;
>> -                       goto out;
>> -               }
>>                for_each_hstate(h) {
>>                        spin_lock(&hugetlb_lock);
>>                        list_for_each_entry(page, &h->hugepage_activelist, lru) {
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 2715223..ee350c5 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3852,8 +3852,6 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
>>                pc = lookup_page_cgroup(page);
>>
>>                ret = mem_cgroup_move_parent(page, pc, memcg, GFP_KERNEL);
>> -               if (ret == -ENOMEM || ret == -EINTR)
>> -                       break;
>>
>>                if (ret == -EBUSY || ret == -EINVAL) {
>>                        /* found lock contention or "pc" is obsolete. */
>> @@ -3863,7 +3861,7 @@ static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
>>                        busy = NULL;
>>        }
>>
>> -       if (!ret && !list_empty(list))
>> +       if (!loop)
>>                return -EBUSY;
>>        return ret;
>>  }
>> @@ -3893,11 +3891,12 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
>>  move_account:
>>        do {
>>                ret = -EBUSY;
>> +               /*
>> +                * This never happens when this is called by ->pre_destroy().
>> +                * But we need to take care of force_empty interface.
>> +                */
>>                if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
>>                        goto out;
>
> Are you sure this never happens when called by ->pre_destroy()?
> Can't a task still get attached to the cgroup while ->pre_destroy() is running?
>
see whole series of patch series, 7 & 8 is against that probelm.
But they will be dropped and this race will remain. And this patch's
title will be
changed to be "remove -EINTR" rather than "remove failure of pre_destroy*.
pre_destrou() will continue to fail until cgroup core is fixed.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2012-05-02  3:34 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-27  5:45 [RFC][PATCH 0/7 v2] memcg: prevent failure in pre_destroy() KAMEZAWA Hiroyuki
2012-04-27  5:49 ` [RFC][PATCH 1/7 v2] temporal compile-fix in linux-next KAMEZAWA Hiroyuki
2012-04-30  8:47   ` Aneesh Kumar K.V
2012-04-27  5:51 ` [RFC][PATCH 2/7 v2] memcg: fix error code in hugetlb_force_memcg_empty() KAMEZAWA Hiroyuki
2012-04-30  8:49   ` Aneesh Kumar K.V
2012-04-27  5:53 ` [RFC][PATCH 3/7 v2] res_counter: add res_counter_uncharge_until() KAMEZAWA Hiroyuki
2012-04-27 18:18   ` Tejun Heo
2012-04-27 23:51     ` Hiroyuki Kamezawa
     [not found]   ` <4F9AD28C.60508@parallels.com>
2012-04-27 23:51     ` Hiroyuki Kamezawa
2012-04-27  5:54 ` [RFC][PATCH 4/7 v2] memcg: use res_counter_uncharge_until in move_parent KAMEZAWA Hiroyuki
2012-04-27 18:20   ` Tejun Heo
2012-04-27 23:59     ` Hiroyuki Kamezawa
     [not found]   ` <4F9AD455.9030306@parallels.com>
2012-04-27 18:26     ` Ying Han
2012-04-27 23:58     ` Hiroyuki Kamezawa
2012-04-30  9:00   ` Aneesh Kumar K.V
2012-04-27  5:58 ` [RFC][PATCH 5/9 v2] move charges to root at rmdir if use_hierarchy is unset KAMEZAWA Hiroyuki
2012-04-27 19:12   ` Ying Han
2012-04-28  0:01     ` Hiroyuki Kamezawa
2012-04-30  9:07   ` Aneesh Kumar K.V
2012-04-27  6:00 ` [RFC][PATCH 6/9 v2] memcg: don't uncharge in mem_cgroup_move_account KAMEZAWA Hiroyuki
2012-04-27  6:02 ` [RFC][PATCH 7/9 v2] cgroup: avoid attaching task to a cgroup under rmdir() KAMEZAWA Hiroyuki
2012-04-27 10:39   ` Frederic Weisbecker
2012-04-28  0:06     ` Hiroyuki Kamezawa
2012-04-27 20:31   ` Tejun Heo
2012-04-27 20:33     ` Tejun Heo
2012-04-27  6:04 ` [RFC][PATCH 8/9 v2] cgroup: avoid creating new cgroup under a cgroup being destroyed KAMEZAWA Hiroyuki
2012-04-27 20:40   ` Tejun Heo
2012-04-27 20:41     ` Tejun Heo
2012-04-28  0:20     ` Hiroyuki Kamezawa
2012-04-28  2:00       ` Tejun Heo
2012-04-28  9:31         ` Hiroyuki Kamezawa
2012-04-28 21:31           ` Tejun Heo
2012-04-27  6:06 ` [RFC][PATCH 9/9 v2] memcg: never return error at pre_destroy() KAMEZAWA Hiroyuki
2012-04-27 21:28   ` Ying Han
2012-04-28  0:25     ` Hiroyuki Kamezawa
2012-04-30 17:02       ` Ying Han
2012-05-01 22:28   ` Suleiman Souhlal
2012-05-02  3:34     ` Hiroyuki Kamezawa
2012-04-27 18:16 ` [RFC][PATCH 0/7 v2] memcg: prevent failure in pre_destroy() Tejun Heo
2012-04-27 23:48   ` Hiroyuki Kamezawa
2012-04-28 16:13     ` Michal Hocko
2012-04-29  6:03       ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).