linux-kernel.vger.kernel.org archive mirror
* [RFC REPOST] cgroup: removing css reference drain wait during cgroup removal
       [not found] <20120312213155.GE23255@google.com>
@ 2012-03-12 21:33 ` Tejun Heo
  2012-03-12 23:23   ` Tejun Heo
  2012-03-13  6:11   ` KAMEZAWA Hiroyuki
       [not found] ` <20120313214526.GG19584@count0.beaverton.ibm.com>
  1 sibling, 2 replies; 12+ messages in thread
From: Tejun Heo @ 2012-03-12 21:33 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Michal Hocko, Johannes Weiner, gthelen, Hugh Dickins
  Cc: linux-mm, linux-kernel, Vivek Goyal, Jens Axboe, Li Zefan,
	containers, cgroups

(wrong mailing list addresses, reposting. sorry guys)

Hello, guys.

While working on blkcg, I learned that the cgroup removal path tries
to drain all internal references synchronously before proceeding with
removal, and may be aborted by pre_destroy() failing.

I find both quite unusual.  While there are some occasions where we
try to drain reference counts synchronously, the norm is deactivating
the target and then releasing it when the reference count hits zero;
exposing this synchronous behavior directly to userland makes it
worse.  It also requires allowing rmdir to be aborted from userland.
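
For comparison, the usual deactivate-then-release pattern looks
roughly like the following deliberately minimal kref-style sketch
(struct foo and its helpers are made up for illustration, this is not
cgroup code):

#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/slab.h>

struct foo {
	struct kref ref;
	bool dead;		/* set on removal, checked by lookup paths */
};

static void foo_release(struct kref *ref)
{
	kfree(container_of(ref, struct foo, ref));	/* last ref frees it */
}

static void foo_remove(struct foo *foo)
{
	foo->dead = true;	/* deactivate: no new lookups succeed */
	kref_put(&foo->ref, foo_release);	/* freed whenever refcnt hits zero */
}

Note that foo_remove() never waits for the remaining references to
drain; removal returns immediately and the final put does the cleanup.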

pre_destroy() being allowed to veto rmdir might be okay if there were
only one subsystem using it or if it implemented a proper
prepare-commit/cancel transaction, but as it currently stands,
pre_destroy() operations are not reversible and even within memcg
itself the state wouldn't be consistent after a failure (some charges
moved to the parent while the rest stay on the child).  Note that an
abort from userland has the same problem.

It also complicates and adds even more subtleties to cgroup code.  A
lot of it was just me being dumb, but making sense of the
cgroup_exclude_rmdir() and cgroup_release_and_wakeup_rmdir() usages in
memcontrol took me quite some time even with Hugh's help.

In general, IMHO, it's a bad idea to expose purely internal
implementation details to userland directly.  Internal ref counts can
be kept around for whatever reason (e.g. blkcg does it for lookup
caching); such details shouldn't be visible to userland.  Midlayers
like cgroup, which sit between userland and mechanism implementations,
should provide isolation between the two so that each mechanism
implementation doesn't have to worry about things like that.

It seems cgroup is going through a lot of the same growing pains that
sysfs went through years ago and would probably benefit from using
sysfs for userland interfacing rather than trying to replicate
features that sysfs already provides.  Well, that's another long-term
thing, I guess.  For now, I'd like to make the cgroup rmdir path more
conventional so that rmdir behaves like the following.

1. disallow further css_tryget() on all affected subsystems

2. call pre_destroy() on each subsystem.  pre_destroy() can't fail and
   each subsystem is responsible for guaranteeing that all reference
   counts to its css will be released in a finite amount of time.

3. destroy() will be called when all css refs drop to zero.
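
Concretely, step 1 amounts to biasing the refcount to a large negative
value so that further tryget attempts fail.  The following is a
condensed sketch of what the appended patch does (names simplified
here, see the patch below for the real code):

#include <linux/atomic.h>
#include <linux/kernel.h>

#define CSS_DEAD_BIAS		INT_MIN

/* rmdir marks the css dead by pushing the count negative */
static void css_mark_dead(atomic_t *refcnt)
{
	atomic_add(CSS_DEAD_BIAS, refcnt);
}

/* tryget fails once the count has gone negative */
static bool css_alive_tryget(atomic_t *refcnt)
{
	int v;

	do {
		v = atomic_read(refcnt);
		if (v < 0)
			return false;
	} while (atomic_cmpxchg(refcnt, v, v + 1) != v);

	return true;
}

/* the put which brings the count back down to the bias is the last one */
static bool css_dead_put(atomic_t *refcnt)
{
	return atomic_dec_return(refcnt) == CSS_DEAD_BIAS;
}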

To do this, memcg, the only current user of pre_destroy() callback,
should be modified such that

* pre_destroy() doesn't fail.  Greg helped me walk through the
  different failure paths.  Two of them seem to be from race
  conditions which just require retries (I think it's wrong to fail
  userland operation for things like this).  pre_destroy() itself may
  loop or may schedule a work item if it's expected to take a long
  time.

  The last one seems more tricky.  On destruction of a cgroup, the
  charges are transferred to its parent and the parent may not have
  enough room for them.  Greg told me that this should only be a
  problem for the !hierarchical case.  I think this can be dealt with
  by dumping what's left over to the root cgroup with a warning
  message (a rough sketch follows this list).

* cgroup_exclude_rmdir() and cgroup_release_and_wakeup_rmdir() usages
  need to be replaced.  The pair is used to re-trigger pre_destroy()
  for operations which may add css references while rmdir is in
  progress.  I think the right thing to do is to call (directly or
  through a work item) mem_cgroup_force_empty() if the css is dead
  after the operation.  I think such a change would make following the
  logic easier too.
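
As a rough sketch of where I think the memcg side could end up (pure
pseudocode; the reparenting helpers named below are illustrative only,
not existing functions):

static void mem_cgroup_pre_destroy(struct cgroup *cgrp)
{
	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);

	/* retry the racy failure cases instead of failing rmdir */
	while (mem_cgroup_reparent_charges(memcg) == -EBUSY)
		cond_resched();

	/* !hierarchical and the parent has no room: dump on root */
	if (mem_cgroup_usage(memcg)) {
		pr_warning("memcg: moving leftover charges to root\n");
		mem_cgroup_move_charges_to_root(memcg);
	}
}

Note that the return type is void, matching the new pre_destroy()
prototype in the patch below.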

I'm appending the patch for the cgroup part of the change.  It removes
a good amount of complication.  I think the cgroup_has_css_refs()
check should be dropped from check_for_release() but other than that
it seems to work fine with blkcg in linux-next.

If memcg's pre_destroy() can be updated as described above, I can
apply that first and then apply the following cgroup behavior change.

What do you think?  Li, how does the cgroup portion look?

Thank you.

RFC PATCH. DON'T APPLY.
---
 include/linux/cgroup.h |   41 +-------
 kernel/cgroup.c        |  225 +++++++++++++------------------------------------
 2 files changed, 69 insertions(+), 197 deletions(-)

Index: work/include/linux/cgroup.h
===================================================================
--- work.orig/include/linux/cgroup.h
+++ work/include/linux/cgroup.h
@@ -16,6 +16,7 @@
 #include <linux/prio_heap.h>
 #include <linux/rwsem.h>
 #include <linux/idr.h>
+#include <linux/workqueue.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -81,7 +82,6 @@ struct cgroup_subsys_state {
 /* bits in struct cgroup_subsys_state flags field */
 enum {
 	CSS_ROOT, /* This CSS is the root of the subsystem */
-	CSS_REMOVED, /* This CSS is dead */
 };
 
 /* Caller must verify that the css is not for root cgroup */
@@ -104,27 +104,18 @@ static inline void css_get(struct cgroup
 		__css_get(css, 1);
 }
 
-static inline bool css_is_removed(struct cgroup_subsys_state *css)
-{
-	return test_bit(CSS_REMOVED, &css->flags);
-}
-
 /*
  * Call css_tryget() to take a reference on a css if your existing
  * (known-valid) reference isn't already ref-counted. Returns false if
  * the css has been destroyed.
  */
 
+extern bool __css_tryget(struct cgroup_subsys_state *css);
 static inline bool css_tryget(struct cgroup_subsys_state *css)
 {
 	if (test_bit(CSS_ROOT, &css->flags))
 		return true;
-	while (!atomic_inc_not_zero(&css->refcnt)) {
-		if (test_bit(CSS_REMOVED, &css->flags))
-			return false;
-		cpu_relax();
-	}
-	return true;
+	return __css_tryget(css);
 }
 
 /*
@@ -132,11 +123,11 @@ static inline bool css_tryget(struct cgr
  * css_get() or css_tryget()
  */
 
-extern void __css_put(struct cgroup_subsys_state *css, int count);
+extern void __css_put(struct cgroup_subsys_state *css);
 static inline void css_put(struct cgroup_subsys_state *css)
 {
 	if (!test_bit(CSS_ROOT, &css->flags))
-		__css_put(css, 1);
+		__css_put(css);
 }
 
 /* bits in struct cgroup flags field */
@@ -211,6 +202,9 @@ struct cgroup {
 	/* List of events which userspace want to receive */
 	struct list_head event_list;
 	spinlock_t event_list_lock;
+
+	/* Used to dput @cgroup->dentry from css_put() */
+	struct work_struct dput_work;
 };
 
 /*
@@ -408,23 +402,6 @@ int cgroup_task_count(const struct cgrou
 int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task);
 
 /*
- * When the subsys has to access css and may add permanent refcnt to css,
- * it should take care of racy conditions with rmdir(). Following set of
- * functions, is for stop/restart rmdir if necessary.
- * Because these will call css_get/put, "css" should be alive css.
- *
- *  cgroup_exclude_rmdir();
- *  ...do some jobs which may access arbitrary empty cgroup
- *  cgroup_release_and_wakeup_rmdir();
- *
- *  When someone removes a cgroup while cgroup_exclude_rmdir() holds it,
- *  it sleeps and cgroup_release_and_wakeup_rmdir() will wake him up.
- */
-
-void cgroup_exclude_rmdir(struct cgroup_subsys_state *css);
-void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css);
-
-/*
  * Control Group taskset, used to pass around set of tasks to cgroup_subsys
  * methods.
  */
@@ -453,7 +430,7 @@ int cgroup_taskset_size(struct cgroup_ta
 
 struct cgroup_subsys {
 	struct cgroup_subsys_state *(*create)(struct cgroup *cgrp);
-	int (*pre_destroy)(struct cgroup *cgrp);
+	void (*pre_destroy)(struct cgroup *cgrp);
 	void (*destroy)(struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
 	void (*cancel_attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
Index: work/kernel/cgroup.c
===================================================================
--- work.orig/kernel/cgroup.c
+++ work/kernel/cgroup.c
@@ -63,6 +63,8 @@
 
 #include <linux/atomic.h>
 
+#define CSS_DEAD_BIAS		INT_MIN
+
 /*
  * cgroup_mutex is the master lock.  Any modification to cgroup or its
  * hierarchy must be performed while holding it.
@@ -154,8 +156,8 @@ struct css_id {
 	 * The css to which this ID points. This pointer is set to valid value
 	 * after cgroup is populated. If cgroup is removed, this will be NULL.
 	 * This pointer is expected to be RCU-safe because destroy()
-	 * is called after synchronize_rcu(). But for safe use, css_is_removed()
-	 * css_tryget() should be used for avoiding race.
+	 * is called after synchronize_rcu(). But for safe use, css_tryget()
+	 * should be used for avoiding race.
 	 */
 	struct cgroup_subsys_state __rcu *css;
 	/*
@@ -807,39 +809,14 @@ static struct inode *cgroup_new_inode(um
 	return inode;
 }
 
-/*
- * Call subsys's pre_destroy handler.
- * This is called before css refcnt check.
- */
-static int cgroup_call_pre_destroy(struct cgroup *cgrp)
-{
-	struct cgroup_subsys *ss;
-	int ret = 0;
-
-	for_each_subsys(cgrp->root, ss)
-		if (ss->pre_destroy) {
-			ret = ss->pre_destroy(cgrp);
-			if (ret)
-				break;
-		}
-
-	return ret;
-}
-
 static void cgroup_diput(struct dentry *dentry, struct inode *inode)
 {
 	/* is dentry a directory ? if so, kfree() associated cgroup */
 	if (S_ISDIR(inode->i_mode)) {
 		struct cgroup *cgrp = dentry->d_fsdata;
 		struct cgroup_subsys *ss;
+
 		BUG_ON(!(cgroup_is_removed(cgrp)));
-		/* It's possible for external users to be holding css
-		 * reference counts on a cgroup; css_put() needs to
-		 * be able to access the cgroup after decrementing
-		 * the reference count in order to know if it needs to
-		 * queue the cgroup to be handled by the release
-		 * agent */
-		synchronize_rcu();
 
 		mutex_lock(&cgroup_mutex);
 		/*
@@ -931,33 +908,6 @@ static void cgroup_d_remove_dir(struct d
 }
 
 /*
- * A queue for waiters to do rmdir() cgroup. A tasks will sleep when
- * cgroup->count == 0 && list_empty(&cgroup->children) && subsys has some
- * reference to css->refcnt. In general, this refcnt is expected to goes down
- * to zero, soon.
- *
- * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
- */
-static DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
-
-static void cgroup_wakeup_rmdir_waiter(struct cgroup *cgrp)
-{
-	if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
-		wake_up_all(&cgroup_rmdir_waitq);
-}
-
-void cgroup_exclude_rmdir(struct cgroup_subsys_state *css)
-{
-	css_get(css);
-}
-
-void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css)
-{
-	cgroup_wakeup_rmdir_waiter(css->cgroup);
-	css_put(css);
-}
-
-/*
  * Call with cgroup_mutex held. Drops reference counts on modules, including
  * any duplicate ones that parse_cgroupfs_options took. If this function
  * returns an error, no reference counts are touched.
@@ -1329,6 +1279,13 @@ static const struct super_operations cgr
 	.remount_fs = cgroup_remount,
 };
 
+static void cgroup_dput_fn(struct work_struct *work)
+{
+	struct cgroup *cgrp = container_of(work, struct cgroup, dput_work);
+
+	dput(cgrp->dentry);
+}
+
 static void init_cgroup_housekeeping(struct cgroup *cgrp)
 {
 	INIT_LIST_HEAD(&cgrp->sibling);
@@ -1339,6 +1296,7 @@ static void init_cgroup_housekeeping(str
 	mutex_init(&cgrp->pidlist_mutex);
 	INIT_LIST_HEAD(&cgrp->event_list);
 	spin_lock_init(&cgrp->event_list_lock);
+	INIT_WORK(&cgrp->dput_work, cgroup_dput_fn);
 }
 
 static void init_cgroup_root(struct cgroupfs_root *root)
@@ -1936,12 +1894,6 @@ int cgroup_attach_task(struct cgroup *cg
 	}
 
 	synchronize_rcu();
-
-	/*
-	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
-	 * is no longer empty.
-	 */
-	cgroup_wakeup_rmdir_waiter(cgrp);
 out:
 	if (retval) {
 		for_each_subsys(root, ss) {
@@ -2111,7 +2063,6 @@ static int cgroup_attach_proc(struct cgr
 	 * step 5: success! and cleanup
 	 */
 	synchronize_rcu();
-	cgroup_wakeup_rmdir_waiter(cgrp);
 	retval = 0;
 out_put_css_set_refs:
 	if (retval) {
@@ -3768,6 +3719,7 @@ static long cgroup_create(struct cgroup 
 			err = PTR_ERR(css);
 			goto err_destroy;
 		}
+		dget(dentry);		/* will be put on the last @css put */
 		init_cgroup_css(css, ss, cgrp);
 		if (ss->use_id) {
 			err = alloc_css_id(ss, parent, cgrp);
@@ -3809,8 +3761,10 @@ static long cgroup_create(struct cgroup 
  err_destroy:
 
 	for_each_subsys(root, ss) {
-		if (cgrp->subsys[ss->subsys_id])
+		if (cgrp->subsys[ss->subsys_id]) {
+			dput(dentry);
 			ss->destroy(cgrp);
+		}
 	}
 
 	mutex_unlock(&cgroup_mutex);
@@ -3866,70 +3820,15 @@ static int cgroup_has_css_refs(struct cg
 	return 0;
 }
 
-/*
- * Atomically mark all (or else none) of the cgroup's CSS objects as
- * CSS_REMOVED. Return true on success, or false if the cgroup has
- * busy subsystems. Call with cgroup_mutex held
- */
-
-static int cgroup_clear_css_refs(struct cgroup *cgrp)
-{
-	struct cgroup_subsys *ss;
-	unsigned long flags;
-	bool failed = false;
-	local_irq_save(flags);
-	for_each_subsys(cgrp->root, ss) {
-		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
-		int refcnt;
-		while (1) {
-			/* We can only remove a CSS with a refcnt==1 */
-			refcnt = atomic_read(&css->refcnt);
-			if (refcnt > 1) {
-				failed = true;
-				goto done;
-			}
-			BUG_ON(!refcnt);
-			/*
-			 * Drop the refcnt to 0 while we check other
-			 * subsystems. This will cause any racing
-			 * css_tryget() to spin until we set the
-			 * CSS_REMOVED bits or abort
-			 */
-			if (atomic_cmpxchg(&css->refcnt, refcnt, 0) == refcnt)
-				break;
-			cpu_relax();
-		}
-	}
- done:
-	for_each_subsys(cgrp->root, ss) {
-		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
-		if (failed) {
-			/*
-			 * Restore old refcnt if we previously managed
-			 * to clear it from 1 to 0
-			 */
-			if (!atomic_read(&css->refcnt))
-				atomic_set(&css->refcnt, 1);
-		} else {
-			/* Commit the fact that the CSS is removed */
-			set_bit(CSS_REMOVED, &css->flags);
-		}
-	}
-	local_irq_restore(flags);
-	return !failed;
-}
-
 static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
 {
 	struct cgroup *cgrp = dentry->d_fsdata;
 	struct dentry *d;
 	struct cgroup *parent;
-	DEFINE_WAIT(wait);
+	struct cgroup_subsys *ss;
 	struct cgroup_event *event, *tmp;
-	int ret;
 
 	/* the vfs holds both inode->i_mutex already */
-again:
 	mutex_lock(&cgroup_mutex);
 	if (atomic_read(&cgrp->count) != 0) {
 		mutex_unlock(&cgroup_mutex);
@@ -3941,52 +3840,34 @@ again:
 	}
 	mutex_unlock(&cgroup_mutex);
 
-	/*
-	 * In general, subsystem has no css->refcnt after pre_destroy(). But
-	 * in racy cases, subsystem may have to get css->refcnt after
-	 * pre_destroy() and it makes rmdir return with -EBUSY. This sometimes
-	 * make rmdir return -EBUSY too often. To avoid that, we use waitqueue
-	 * for cgroup's rmdir. CGRP_WAIT_ON_RMDIR is for synchronizing rmdir
-	 * and subsystem's reference count handling. Please see css_get/put
-	 * and css_tryget() and cgroup_wakeup_rmdir_waiter() implementation.
-	 */
-	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
-
-	/*
-	 * Call pre_destroy handlers of subsys. Notify subsystems
-	 * that rmdir() request comes.
-	 */
-	ret = cgroup_call_pre_destroy(cgrp);
-	if (ret) {
-		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
-		return ret;
-	}
-
 	mutex_lock(&cgroup_mutex);
 	parent = cgrp->parent;
 	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
-		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 		mutex_unlock(&cgroup_mutex);
 		return -EBUSY;
 	}
-	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
-	if (!cgroup_clear_css_refs(cgrp)) {
-		mutex_unlock(&cgroup_mutex);
-		/*
-		 * Because someone may call cgroup_wakeup_rmdir_waiter() before
-		 * prepare_to_wait(), we need to check this flag.
-		 */
-		if (test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))
-			schedule();
-		finish_wait(&cgroup_rmdir_waitq, &wait);
-		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
-		if (signal_pending(current))
-			return -EINTR;
-		goto again;
+
+	/* deny new css_tryget() attempts */
+	for_each_subsys(cgrp->root, ss) {
+		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
+
+		WARN_ON(atomic_read(&css->refcnt) < 0);
+		atomic_add(CSS_DEAD_BIAS, &css->refcnt);
+	}
+
+	/*
+	 * No new tryget reference will be handed out.  Tell subsystems to
+	 * drain the existing references.
+	 */
+	for_each_subsys(cgrp->root, ss)
+		if (ss->pre_destroy)
+			ss->pre_destroy(cgrp);
+
+	for_each_subsys(cgrp->root, ss) {
+		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
+
+		css_put(css);
 	}
-	/* NO css_tryget() can success after here. */
-	finish_wait(&cgroup_rmdir_waitq, &wait);
-	clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 
 	raw_spin_lock(&release_list_lock);
 	set_bit(CGRP_REMOVED, &cgrp->flags);
@@ -4689,21 +4570,35 @@ static void check_for_release(struct cgr
 }
 
 /* Caller must verify that the css is not for root cgroup */
-void __css_put(struct cgroup_subsys_state *css, int count)
+bool __css_tryget(struct cgroup_subsys_state *css)
+{
+	while (1) {
+		int v, t;
+
+		v = atomic_read(&css->refcnt);
+		if (unlikely(v < 0))
+			return false;
+
+		t = atomic_cmpxchg(&css->refcnt, v, v + 1);
+		if (likely(t == v))
+			return true;
+	}
+}
+
+/* Caller must verify that the css is not for root cgroup */
+void __css_put(struct cgroup_subsys_state *css)
 {
 	struct cgroup *cgrp = css->cgroup;
 	int val;
-	rcu_read_lock();
-	val = atomic_sub_return(count, &css->refcnt);
-	if (val == 1) {
+
+	val = atomic_dec_return(&css->refcnt);
+	if (val == CSS_DEAD_BIAS) {
 		if (notify_on_release(cgrp)) {
 			set_bit(CGRP_RELEASABLE, &cgrp->flags);
 			check_for_release(cgrp);
 		}
-		cgroup_wakeup_rmdir_waiter(cgrp);
+		schedule_work(&cgrp->dput_work);
 	}
-	rcu_read_unlock();
-	WARN_ON_ONCE(val < 1);
 }
 EXPORT_SYMBOL_GPL(__css_put);
 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC REPOST] cgroup: removing css reference drain wait during cgroup removal
  2012-03-12 21:33 ` [RFC REPOST] cgroup: removing css reference drain wait during cgroup removal Tejun Heo
@ 2012-03-12 23:23   ` Tejun Heo
  2012-03-13  6:11   ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2012-03-12 23:23 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Michal Hocko, Johannes Weiner, gthelen, Hugh Dickins
  Cc: linux-mm, linux-kernel, Vivek Goyal, Jens Axboe, Li Zefan,
	containers, cgroups

On Mon, Mar 12, 2012 at 02:33:43PM -0700, Tejun Heo wrote:
> I'm appending the patch for the cgroup part of the change.  It removes
> good amount of complication.  I think cgroup_has_css_refs() check
> should be dropped from check_for_release() but other than that it
> seems to work fine with blkcg in linux-next.

Here's a version w/ cgroup_has_css_refs() removed.  It removes 160
lines.  I'm liking it.

---
 include/linux/cgroup.h |   41 +------
 kernel/cgroup.c        |  265 +++++++++++--------------------------------------
 2 files changed, 71 insertions(+), 235 deletions(-)

Index: work/include/linux/cgroup.h
===================================================================
--- work.orig/include/linux/cgroup.h
+++ work/include/linux/cgroup.h
@@ -16,6 +16,7 @@
 #include <linux/prio_heap.h>
 #include <linux/rwsem.h>
 #include <linux/idr.h>
+#include <linux/workqueue.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -81,7 +82,6 @@ struct cgroup_subsys_state {
 /* bits in struct cgroup_subsys_state flags field */
 enum {
 	CSS_ROOT, /* This CSS is the root of the subsystem */
-	CSS_REMOVED, /* This CSS is dead */
 };
 
 /* Caller must verify that the css is not for root cgroup */
@@ -104,27 +104,18 @@ static inline void css_get(struct cgroup
 		__css_get(css, 1);
 }
 
-static inline bool css_is_removed(struct cgroup_subsys_state *css)
-{
-	return test_bit(CSS_REMOVED, &css->flags);
-}
-
 /*
  * Call css_tryget() to take a reference on a css if your existing
  * (known-valid) reference isn't already ref-counted. Returns false if
  * the css has been destroyed.
  */
 
+extern bool __css_tryget(struct cgroup_subsys_state *css);
 static inline bool css_tryget(struct cgroup_subsys_state *css)
 {
 	if (test_bit(CSS_ROOT, &css->flags))
 		return true;
-	while (!atomic_inc_not_zero(&css->refcnt)) {
-		if (test_bit(CSS_REMOVED, &css->flags))
-			return false;
-		cpu_relax();
-	}
-	return true;
+	return __css_tryget(css);
 }
 
 /*
@@ -132,11 +123,11 @@ static inline bool css_tryget(struct cgr
  * css_get() or css_tryget()
  */
 
-extern void __css_put(struct cgroup_subsys_state *css, int count);
+extern void __css_put(struct cgroup_subsys_state *css);
 static inline void css_put(struct cgroup_subsys_state *css)
 {
 	if (!test_bit(CSS_ROOT, &css->flags))
-		__css_put(css, 1);
+		__css_put(css);
 }
 
 /* bits in struct cgroup flags field */
@@ -211,6 +202,9 @@ struct cgroup {
 	/* List of events which userspace want to receive */
 	struct list_head event_list;
 	spinlock_t event_list_lock;
+
+	/* Used to dput @cgroup->dentry from css_put() */
+	struct work_struct dput_work;
 };
 
 /*
@@ -408,23 +402,6 @@ int cgroup_task_count(const struct cgrou
 int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task);
 
 /*
- * When the subsys has to access css and may add permanent refcnt to css,
- * it should take care of racy conditions with rmdir(). Following set of
- * functions, is for stop/restart rmdir if necessary.
- * Because these will call css_get/put, "css" should be alive css.
- *
- *  cgroup_exclude_rmdir();
- *  ...do some jobs which may access arbitrary empty cgroup
- *  cgroup_release_and_wakeup_rmdir();
- *
- *  When someone removes a cgroup while cgroup_exclude_rmdir() holds it,
- *  it sleeps and cgroup_release_and_wakeup_rmdir() will wake him up.
- */
-
-void cgroup_exclude_rmdir(struct cgroup_subsys_state *css);
-void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css);
-
-/*
  * Control Group taskset, used to pass around set of tasks to cgroup_subsys
  * methods.
  */
@@ -453,7 +430,7 @@ int cgroup_taskset_size(struct cgroup_ta
 
 struct cgroup_subsys {
 	struct cgroup_subsys_state *(*create)(struct cgroup *cgrp);
-	int (*pre_destroy)(struct cgroup *cgrp);
+	void (*pre_destroy)(struct cgroup *cgrp);
 	void (*destroy)(struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
 	void (*cancel_attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
Index: work/kernel/cgroup.c
===================================================================
--- work.orig/kernel/cgroup.c
+++ work/kernel/cgroup.c
@@ -63,6 +63,8 @@
 
 #include <linux/atomic.h>
 
+#define CSS_DEAD_BIAS		INT_MIN
+
 /*
  * cgroup_mutex is the master lock.  Any modification to cgroup or its
  * hierarchy must be performed while holding it.
@@ -154,8 +156,8 @@ struct css_id {
 	 * The css to which this ID points. This pointer is set to valid value
 	 * after cgroup is populated. If cgroup is removed, this will be NULL.
 	 * This pointer is expected to be RCU-safe because destroy()
-	 * is called after synchronize_rcu(). But for safe use, css_is_removed()
-	 * css_tryget() should be used for avoiding race.
+	 * is called after synchronize_rcu(). But for safe use, css_tryget()
+	 * should be used for avoiding race.
 	 */
 	struct cgroup_subsys_state __rcu *css;
 	/*
@@ -807,39 +809,14 @@ static struct inode *cgroup_new_inode(um
 	return inode;
 }
 
-/*
- * Call subsys's pre_destroy handler.
- * This is called before css refcnt check.
- */
-static int cgroup_call_pre_destroy(struct cgroup *cgrp)
-{
-	struct cgroup_subsys *ss;
-	int ret = 0;
-
-	for_each_subsys(cgrp->root, ss)
-		if (ss->pre_destroy) {
-			ret = ss->pre_destroy(cgrp);
-			if (ret)
-				break;
-		}
-
-	return ret;
-}
-
 static void cgroup_diput(struct dentry *dentry, struct inode *inode)
 {
 	/* is dentry a directory ? if so, kfree() associated cgroup */
 	if (S_ISDIR(inode->i_mode)) {
 		struct cgroup *cgrp = dentry->d_fsdata;
 		struct cgroup_subsys *ss;
+
 		BUG_ON(!(cgroup_is_removed(cgrp)));
-		/* It's possible for external users to be holding css
-		 * reference counts on a cgroup; css_put() needs to
-		 * be able to access the cgroup after decrementing
-		 * the reference count in order to know if it needs to
-		 * queue the cgroup to be handled by the release
-		 * agent */
-		synchronize_rcu();
 
 		mutex_lock(&cgroup_mutex);
 		/*
@@ -931,33 +908,6 @@ static void cgroup_d_remove_dir(struct d
 }
 
 /*
- * A queue for waiters to do rmdir() cgroup. A tasks will sleep when
- * cgroup->count == 0 && list_empty(&cgroup->children) && subsys has some
- * reference to css->refcnt. In general, this refcnt is expected to goes down
- * to zero, soon.
- *
- * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
- */
-static DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
-
-static void cgroup_wakeup_rmdir_waiter(struct cgroup *cgrp)
-{
-	if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
-		wake_up_all(&cgroup_rmdir_waitq);
-}
-
-void cgroup_exclude_rmdir(struct cgroup_subsys_state *css)
-{
-	css_get(css);
-}
-
-void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css)
-{
-	cgroup_wakeup_rmdir_waiter(css->cgroup);
-	css_put(css);
-}
-
-/*
  * Call with cgroup_mutex held. Drops reference counts on modules, including
  * any duplicate ones that parse_cgroupfs_options took. If this function
  * returns an error, no reference counts are touched.
@@ -1329,6 +1279,13 @@ static const struct super_operations cgr
 	.remount_fs = cgroup_remount,
 };
 
+static void cgroup_dput_fn(struct work_struct *work)
+{
+	struct cgroup *cgrp = container_of(work, struct cgroup, dput_work);
+
+	dput(cgrp->dentry);
+}
+
 static void init_cgroup_housekeeping(struct cgroup *cgrp)
 {
 	INIT_LIST_HEAD(&cgrp->sibling);
@@ -1339,6 +1296,7 @@ static void init_cgroup_housekeeping(str
 	mutex_init(&cgrp->pidlist_mutex);
 	INIT_LIST_HEAD(&cgrp->event_list);
 	spin_lock_init(&cgrp->event_list_lock);
+	INIT_WORK(&cgrp->dput_work, cgroup_dput_fn);
 }
 
 static void init_cgroup_root(struct cgroupfs_root *root)
@@ -1936,12 +1894,6 @@ int cgroup_attach_task(struct cgroup *cg
 	}
 
 	synchronize_rcu();
-
-	/*
-	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
-	 * is no longer empty.
-	 */
-	cgroup_wakeup_rmdir_waiter(cgrp);
 out:
 	if (retval) {
 		for_each_subsys(root, ss) {
@@ -2111,7 +2063,6 @@ static int cgroup_attach_proc(struct cgr
 	 * step 5: success! and cleanup
 	 */
 	synchronize_rcu();
-	cgroup_wakeup_rmdir_waiter(cgrp);
 	retval = 0;
 out_put_css_set_refs:
 	if (retval) {
@@ -3768,6 +3719,7 @@ static long cgroup_create(struct cgroup 
 			err = PTR_ERR(css);
 			goto err_destroy;
 		}
+		dget(dentry);		/* will be put on the last @css put */
 		init_cgroup_css(css, ss, cgrp);
 		if (ss->use_id) {
 			err = alloc_css_id(ss, parent, cgrp);
@@ -3809,8 +3761,10 @@ static long cgroup_create(struct cgroup 
  err_destroy:
 
 	for_each_subsys(root, ss) {
-		if (cgrp->subsys[ss->subsys_id])
+		if (cgrp->subsys[ss->subsys_id]) {
+			dput(dentry);
 			ss->destroy(cgrp);
+		}
 	}
 
 	mutex_unlock(&cgroup_mutex);
@@ -3830,106 +3784,15 @@ static int cgroup_mkdir(struct inode *di
 	return cgroup_create(c_parent, dentry, mode | S_IFDIR);
 }
 
-static int cgroup_has_css_refs(struct cgroup *cgrp)
-{
-	/* Check the reference count on each subsystem. Since we
-	 * already established that there are no tasks in the
-	 * cgroup, if the css refcount is also 1, then there should
-	 * be no outstanding references, so the subsystem is safe to
-	 * destroy. We scan across all subsystems rather than using
-	 * the per-hierarchy linked list of mounted subsystems since
-	 * we can be called via check_for_release() with no
-	 * synchronization other than RCU, and the subsystem linked
-	 * list isn't RCU-safe */
-	int i;
-	/*
-	 * We won't need to lock the subsys array, because the subsystems
-	 * we're concerned about aren't going anywhere since our cgroup root
-	 * has a reference on them.
-	 */
-	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
-		struct cgroup_subsys *ss = subsys[i];
-		struct cgroup_subsys_state *css;
-		/* Skip subsystems not present or not in this hierarchy */
-		if (ss == NULL || ss->root != cgrp->root)
-			continue;
-		css = cgrp->subsys[ss->subsys_id];
-		/* When called from check_for_release() it's possible
-		 * that by this point the cgroup has been removed
-		 * and the css deleted. But a false-positive doesn't
-		 * matter, since it can only happen if the cgroup
-		 * has been deleted and hence no longer needs the
-		 * release agent to be called anyway. */
-		if (css && (atomic_read(&css->refcnt) > 1))
-			return 1;
-	}
-	return 0;
-}
-
-/*
- * Atomically mark all (or else none) of the cgroup's CSS objects as
- * CSS_REMOVED. Return true on success, or false if the cgroup has
- * busy subsystems. Call with cgroup_mutex held
- */
-
-static int cgroup_clear_css_refs(struct cgroup *cgrp)
-{
-	struct cgroup_subsys *ss;
-	unsigned long flags;
-	bool failed = false;
-	local_irq_save(flags);
-	for_each_subsys(cgrp->root, ss) {
-		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
-		int refcnt;
-		while (1) {
-			/* We can only remove a CSS with a refcnt==1 */
-			refcnt = atomic_read(&css->refcnt);
-			if (refcnt > 1) {
-				failed = true;
-				goto done;
-			}
-			BUG_ON(!refcnt);
-			/*
-			 * Drop the refcnt to 0 while we check other
-			 * subsystems. This will cause any racing
-			 * css_tryget() to spin until we set the
-			 * CSS_REMOVED bits or abort
-			 */
-			if (atomic_cmpxchg(&css->refcnt, refcnt, 0) == refcnt)
-				break;
-			cpu_relax();
-		}
-	}
- done:
-	for_each_subsys(cgrp->root, ss) {
-		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
-		if (failed) {
-			/*
-			 * Restore old refcnt if we previously managed
-			 * to clear it from 1 to 0
-			 */
-			if (!atomic_read(&css->refcnt))
-				atomic_set(&css->refcnt, 1);
-		} else {
-			/* Commit the fact that the CSS is removed */
-			set_bit(CSS_REMOVED, &css->flags);
-		}
-	}
-	local_irq_restore(flags);
-	return !failed;
-}
-
 static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
 {
 	struct cgroup *cgrp = dentry->d_fsdata;
 	struct dentry *d;
 	struct cgroup *parent;
-	DEFINE_WAIT(wait);
+	struct cgroup_subsys *ss;
 	struct cgroup_event *event, *tmp;
-	int ret;
 
 	/* the vfs holds both inode->i_mutex already */
-again:
 	mutex_lock(&cgroup_mutex);
 	if (atomic_read(&cgrp->count) != 0) {
 		mutex_unlock(&cgroup_mutex);
@@ -3941,52 +3804,34 @@ again:
 	}
 	mutex_unlock(&cgroup_mutex);
 
-	/*
-	 * In general, subsystem has no css->refcnt after pre_destroy(). But
-	 * in racy cases, subsystem may have to get css->refcnt after
-	 * pre_destroy() and it makes rmdir return with -EBUSY. This sometimes
-	 * make rmdir return -EBUSY too often. To avoid that, we use waitqueue
-	 * for cgroup's rmdir. CGRP_WAIT_ON_RMDIR is for synchronizing rmdir
-	 * and subsystem's reference count handling. Please see css_get/put
-	 * and css_tryget() and cgroup_wakeup_rmdir_waiter() implementation.
-	 */
-	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
-
-	/*
-	 * Call pre_destroy handlers of subsys. Notify subsystems
-	 * that rmdir() request comes.
-	 */
-	ret = cgroup_call_pre_destroy(cgrp);
-	if (ret) {
-		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
-		return ret;
-	}
-
 	mutex_lock(&cgroup_mutex);
 	parent = cgrp->parent;
 	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
-		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 		mutex_unlock(&cgroup_mutex);
 		return -EBUSY;
 	}
-	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
-	if (!cgroup_clear_css_refs(cgrp)) {
-		mutex_unlock(&cgroup_mutex);
-		/*
-		 * Because someone may call cgroup_wakeup_rmdir_waiter() before
-		 * prepare_to_wait(), we need to check this flag.
-		 */
-		if (test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))
-			schedule();
-		finish_wait(&cgroup_rmdir_waitq, &wait);
-		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
-		if (signal_pending(current))
-			return -EINTR;
-		goto again;
+
+	/* deny new css_tryget() attempts */
+	for_each_subsys(cgrp->root, ss) {
+		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
+
+		WARN_ON(atomic_read(&css->refcnt) < 0);
+		atomic_add(CSS_DEAD_BIAS, &css->refcnt);
+	}
+
+	/*
+	 * No new tryget reference will be handed out.  Tell subsystems to
+	 * drain the existing references.
+	 */
+	for_each_subsys(cgrp->root, ss)
+		if (ss->pre_destroy)
+			ss->pre_destroy(cgrp);
+
+	for_each_subsys(cgrp->root, ss) {
+		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
+
+		css_put(css);
 	}
-	/* NO css_tryget() can success after here. */
-	finish_wait(&cgroup_rmdir_waitq, &wait);
-	clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 
 	raw_spin_lock(&release_list_lock);
 	set_bit(CGRP_REMOVED, &cgrp->flags);
@@ -4670,8 +4515,8 @@ static void check_for_release(struct cgr
 {
 	/* All of these checks rely on RCU to keep the cgroup
 	 * structure alive */
-	if (cgroup_is_releasable(cgrp) && !atomic_read(&cgrp->count)
-	    && list_empty(&cgrp->children) && !cgroup_has_css_refs(cgrp)) {
+	if (cgroup_is_releasable(cgrp) && !atomic_read(&cgrp->count) &&
+	    list_empty(&cgrp->children)) {
 		/* Control Group is currently removeable. If it's not
 		 * already queued for a userspace notification, queue
 		 * it now */
@@ -4689,21 +4534,35 @@ static void check_for_release(struct cgr
 }
 
 /* Caller must verify that the css is not for root cgroup */
-void __css_put(struct cgroup_subsys_state *css, int count)
+bool __css_tryget(struct cgroup_subsys_state *css)
+{
+	while (1) {
+		int v, t;
+
+		v = atomic_read(&css->refcnt);
+		if (unlikely(v < 0))
+			return false;
+
+		t = atomic_cmpxchg(&css->refcnt, v, v + 1);
+		if (likely(t == v))
+			return true;
+	}
+}
+
+/* Caller must verify that the css is not for root cgroup */
+void __css_put(struct cgroup_subsys_state *css)
 {
 	struct cgroup *cgrp = css->cgroup;
 	int val;
-	rcu_read_lock();
-	val = atomic_sub_return(count, &css->refcnt);
-	if (val == 1) {
+
+	val = atomic_dec_return(&css->refcnt);
+	if (val == CSS_DEAD_BIAS) {
 		if (notify_on_release(cgrp)) {
 			set_bit(CGRP_RELEASABLE, &cgrp->flags);
 			check_for_release(cgrp);
 		}
-		cgroup_wakeup_rmdir_waiter(cgrp);
+		schedule_work(&cgrp->dput_work);
 	}
-	rcu_read_unlock();
-	WARN_ON_ONCE(val < 1);
 }
 EXPORT_SYMBOL_GPL(__css_put);
 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC REPOST] cgroup: removing css reference drain wait during cgroup removal
  2012-03-12 21:33 ` [RFC REPOST] cgroup: removing css reference drain wait during cgroup removal Tejun Heo
  2012-03-12 23:23   ` Tejun Heo
@ 2012-03-13  6:11   ` KAMEZAWA Hiroyuki
  2012-03-13 16:39     ` Tejun Heo
  1 sibling, 1 reply; 12+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-03-13  6:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Johannes Weiner, gthelen, Hugh Dickins, linux-mm,
	linux-kernel, Vivek Goyal, Jens Axboe, Li Zefan, containers,
	cgroups

On Mon, 12 Mar 2012 14:33:43 -0700
Tejun Heo <tj@kernel.org> wrote:

> (wrong mailing list addresses, reposing. sorry guys)
> 
> Hello, guys.
> 
> While working on blkcg, I learned that cgroup removal path tries to
> drain all internal references synchronously before proceeding with
> removal, and may be aborted by by pre_destroy() failing.
> 
> I find both quite unusual.  While there are some occassions where we
> try to drain reference counts synchronously, the norm is deactivating
> the target and then releasing it when the reference count hits zero
> and exposing this synchronous behavior directly to userland makes it
> worse.  This also requires allowing rmdir to be aborted from userland.
> 
> pre_destroy() being allowed to veto rmdir might be okay if there's
> only one subsystem using it or it implemented proper
> prepare-commit/cancel transaction, but as it currently stands,
> pre_destroy() operations are not reversible and even within memcg
> itself the state wouldn't be consistent after failure (some moved to
> parent while the rest on child).  Note that abort from userland also
> has the same problem.
> 
> It also complicates and adds even more subtleties to cgroup code.  A
> lot of it was just me being dumb but making sense of
> cgroup_exclude_rmdir() and cgroup_release_and_wakeup_rmdir() usages in
> memcontrol took me quite some time even with Hugh's help.
> 
> In general, IMHO, it's a bad idea to expose purely internal
> implementation details to userland directly.  Internal ref counts can
> be kept around for whatever reason (e.g. blkcg does it for lookup
> caching), such details shouldn't be visible to userland.  Midlayers
> like cgroup which sit between userland and mechanism implementations
> should provide isolation between the two so that each mechanism
> implementation doesn't have to worry about things like that.
> 

The trouble for pre_destroy() is _not_ refcount; memory cgroup has its own refcnt
and uses it internally.  The problem is 'charges', which are not related to refcnt.
Cgroup is designed to exist with 'tasks', but memory may not be related to any
task...just to a cgroup.

But ok, pre_destroy() & rmdir() is complicated, I agree.

Now, we prevent rmdir() if we can't move charges to its parent.  If pre_destroy()
shouldn't fail, I can think of some alternatives.

 * move all charges to the parent and, if that fails...move all charges to
   the root cgroup.
   (drop_from_memory may not work well in a swapless system.)
Or
 * move all charges to 'root' without checking 'parent'.





> 1. disallow further css_tryget() on all affected subsystems
> 
> 2. call pre_destroy() on each subsystem.  pre_destroy() can't fail and
>    each subsystem is responsible for guaranteeing that all reference
>    counts to its css will be released in finite amount of time.
> 

I think... if pre_destroy() never fails, we don't need pre_destroy().


> 3. destroy() will be called when all css refs drop to zero.
> 

memcg doesn't handle refcnt in pre_destroy().


> To do this, memcg, the only current user of pre_destroy() callback,
> should be modified such that
> 
> * pre_destroy() doesn't fail.  Greg helped me walking through the
>   different failure paths.  Two of them seem to be from race
>   conditions which just require retries (I think it's wrong to fail
>   userland operation for things like this).  pre_destroy() itself may
>   loop or may schedule a work item if it's expected to take a long
>   time.
> 
>   The last one seems more tricky.  On destruction of cgroup, the
>   charges are transferred to its parent and the parent may not have
>   enough room for that.  Greg told me that this should only be a
>   problem for !hierarchical case.  I think this can be dealt with by
>   dumping what's left over to root cgroup with a warning message.
> 

I don't like warnings ;)


> * cgroup_exclude_rmdir() and cgroup_release_and_wakeup_rmdir() usages.
>   The pair is used to re-trigger pre_destroy() for operations which
>   may add css references while rmdir is in progress.  I think the
>   right thing to do is calling (directly or through a work item)
>   mem_cgroup_force_empty() if the css is dead after the operation.  I
>   think such change would make following the logic easier too.
> 
> I'm appending the patch for the cgroup part of the change.  It removes
> good amount of complication.  I think cgroup_has_css_refs() check
> should be dropped from check_for_release() but other than that it
> seems to work fine with blkcg in linux-next.
> 
> If memcg's pre_destroy() can be updated as described above, I can
> apply that first and then apply the following cgroup behavior change.
> 

I think we can do it all in 'destroy()'.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC REPOST] cgroup: removing css reference drain wait during cgroup removal
  2012-03-13  6:11   ` KAMEZAWA Hiroyuki
@ 2012-03-13 16:39     ` Tejun Heo
  2012-03-14  0:28       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2012-03-13 16:39 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Michal Hocko, Johannes Weiner, gthelen, Hugh Dickins, linux-mm,
	linux-kernel, Vivek Goyal, Jens Axboe, Li Zefan, containers,
	cgroups

Hello, KAMEZAWA.

On Tue, Mar 13, 2012 at 03:11:48PM +0900, KAMEZAWA Hiroyuki wrote:
> The trouble for pre_destroy() is _not_ refcount, Memory cgroup has its own refcnt
> and use it internally. The problem is 'charges'. It's not related to refcnt.

Hmmm.... yeah, I'm not familiar with memcg internals at all.  For
blkcg, refcnt matters but if it doesn't for memcg, great.

> Cgroup is designed to exists with 'tasks'. But memory may not be related to any
> task...just related to a cgroup.
> 
> But ok, pre_destory() & rmdir() is complicated, I agree.
> 
> Now, we prevent rmdir() if we can't move charges to its parent. If pre_destory()
> shouldn't fail, I can think of some alternatives.
> 
>  * move all charges to the parent and if it fails...move all charges to
>    root cgroup.
>    (drop_from_memory may not work well in swapless system.)

I think this one is better and this shouldn't fail if hierarchical
mode is in use, right?

> I think.. if pre_destory() never fails, we don't need pre_destroy().

For memcg maybe, blkcg still needs it.

> >   The last one seems more tricky.  On destruction of cgroup, the
> >   charges are transferred to its parent and the parent may not have
> >   enough room for that.  Greg told me that this should only be a
> >   problem for !hierarchical case.  I think this can be dealt with by
> >   dumping what's left over to root cgroup with a warning message.
> 
> I don't like warning ;) 

I agree this isn't perfect but then again failing rmdir isn't perfect
either and given that the condition can be wholly avoided in
hierarchical mode, which should be the default anyway (is there any
reason to keep flat mode except for backward compatibility?), I don't
think the trade off is too bad.

> I think we can do all in 'destroy()'.

That would be even better.  I tried it myself but that involved a lot of
code I didn't know much about.  If someone more familiar with memcg
can write up such a patch, I owe them a beer. :)

Thank you.

-- 
tejun


* Re: [RFC] cgroup: removing css reference drain wait during cgroup removal
       [not found]   ` <20120313220551.GF7349@google.com>
@ 2012-03-13 22:16     ` Tejun Heo
  0 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2012-03-13 22:16 UTC (permalink / raw)
  To: Matt Helsley
  Cc: KAMEZAWA Hiroyuki, Michal Hocko, Johannes Weiner, gthelen,
	Hugh Dickins, linux-mm, linux-kernel, Vivek Goyal, Jens Axboe,
	Li Zefan, containers, cgroups

(fixed up mailing list addresses)

On Tue, Mar 13, 2012 at 03:05:51PM -0700, Tejun Heo wrote:
> Hey, Matt.
> 
> On Tue, Mar 13, 2012 at 02:45:26PM -0700, Matt Helsley wrote:
> > If you want to spend your time doing archaeology there are some old threads
> > that touch on this idea (roughly around 2003-2005). One point against the
> > idea that I distinctly recall:
> > 
> > Somewhat like configfs, object lifetimes in cgroups are determined
> > primarily by the user whereas sysfs object lifetimes are primarily
> > determined by the kernel. I think the closest we come to user-determined
> > objects in sysfs is through debugfs and module loading/unloading.
> > However those involve mount/umount and modprobe/rmmod rather than
> > mkdir/rmdir to create and remove the objects.
> 
> The thing is that sysfs itself has been almost completely rewritten
> since that time to 1. decouple internal representation from vfs
> objects and 2. provide proper isolation between the userland and
> kernel code exposing data through sysfs.
> 
> #1 began mostly due to the large size of dentries and inodes but, with
> the benefit of hindsight, I think it was just a bad idea to piggyback
> on vfs objects for object life-cycle management and locking of stuff
> which is wholly described in memory with simplistic locking.
> 
> #2 was necessary to avoid hanging device detach due to an open sysfs file
> from userland.  sysfs now has a notion of "active access", encompassing
> only each show/store op invocation, and it only guarantees that the
> associated device doesn't go away while active accesses are in
> progress.
> 
> The sysfs heritage is still recognizable and unfortunately so is almost
> the same set of problems (nobody wants show/store ops to be called on an
> unlinked css that is waiting for its references to be drained).  As
> refactoring and sharing sysfs won't be a trivial task, my plan is to first
> augment cgroupfs as necessary, with the longer term goal of converging and
> later sharing the same code with sysfs.

Sorry, forgot to reply to the userland-determined object
creation/deletion part.

I don't think there are direct creation cases in sysfs, but there are
plenty of deletions going on, especially the kind where a file requests
deletion of its parent directly (*/device/delete).  While the use of
mkdir/rmdir is indeed different for cgroupfs, I don't think that would
make too much of a difference.  Both calls are essentially unused by
sysfs currently and there's nothing preventing the addition of callbacks
there.

Thanks.

-- 
tejun


* Re: [RFC REPOST] cgroup: removing css reference drain wait during cgroup removal
  2012-03-13 16:39     ` Tejun Heo
@ 2012-03-14  0:28       ` KAMEZAWA Hiroyuki
  2012-03-14  6:11         ` Tejun Heo
  2012-03-14  9:46         ` Glauber Costa
  0 siblings, 2 replies; 12+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-03-14  0:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Johannes Weiner, gthelen, Hugh Dickins, linux-mm,
	linux-kernel, Vivek Goyal, Jens Axboe, Li Zefan, containers,
	cgroups

On Tue, 13 Mar 2012 09:39:14 -0700
Tejun Heo <tj@kernel.org> wrote:

> Hello, KAMEZAWA.
> 
> On Tue, Mar 13, 2012 at 03:11:48PM +0900, KAMEZAWA Hiroyuki wrote:
> > The trouble for pre_destroy() is _not_ refcount; memory cgroup has its own refcnt
> > and uses it internally. The problem is 'charges'. It's not related to refcnt.
> 
> Hmmm.... yeah, I'm not familiar with memcg internals at all.  For
> blkcg, refcnt matters but if it doesn't for memcg, great.
> 
> > Cgroup is designed to exist with 'tasks'. But memory may not be related to any
> > task... just related to a cgroup.
> > 
> > But ok, pre_destroy() & rmdir() is complicated, I agree.
> > 
> > Now, we prevent rmdir() if we can't move charges to its parent. If pre_destroy()
> > shouldn't fail, I can think of some alternatives.
> > 
> >  * move all charges to the parent and if that fails... move all charges to
> >    the root cgroup.
> >    (drop_from_memory may not work well in a swapless system.)
> 
> I think this one is better and this shouldn't fail if hierarchical
> mode is in use, right?
> 

Right.


> > I think... if pre_destroy() never fails, we don't need pre_destroy().
> 
> For memcg maybe, blkcg still needs it.
> 
> > >   The last one seems more tricky.  On destruction of cgroup, the
> > >   charges are transferred to its parent and the parent may not have
> > >   enough room for that.  Greg told me that this should only be a
> > >   problem for !hierarchical case.  I think this can be dealt with by
> > >   dumping what's left over to root cgroup with a warning message.
> > 
> > I don't like warning ;) 
> 
> I agree this isn't perfect but then again failing rmdir isn't perfect
> either and given that the condition can be wholly avoided in
> hierarchical mode, which should be the default anyway (is there any
> reason to keep flat mode except for backward compatibility?), I don't
> think the trade off is too bad.
> 

One reason is 'performance'. You can see performance trouble when you
create a deep tree of memcgs in hierarchical mode. The deeper the memcg
tree, the more res_counters will be shared.

For example, libvirt creates a cgroup tree such as

	/cgroup/memory/libvirt/qemu/GuestXXX/....
        /cgroup/memory/libvirt/lxc/GuestXXX/...

No one wants to update 4 res_counters, which is very heavy, just to
handle the independent workloads of a "Guest".
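To make that cost concrete, here is a toy illustration (not the real
res_counter code) of the hierarchical charge walk being described: one
lock taken and one shared counter updated per ancestor, so the four-level
layout above pays four contended updates for every charge.

#include <pthread.h>

struct toy_res_counter {
	unsigned long long usage, limit;
	pthread_mutex_t lock;			/* stands in for the res_counter lock */
	struct toy_res_counter *parent;
};

/* Hierarchical charge: every level from 'Guest' up to 'memory' is touched. */
static int toy_charge_hierarchical(struct toy_res_counter *c,
				   unsigned long long val)
{
	struct toy_res_counter *p;

	for (p = c; p; p = p->parent) {
		pthread_mutex_lock(&p->lock);
		if (p->usage + val > p->limit) {
			pthread_mutex_unlock(&p->lock);
			return -1;	/* real code also unwinds the levels charged below */
		}
		p->usage += val;
		pthread_mutex_unlock(&p->lock);
	}
	return 0;
}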

IIUC, in general, even if the processes are in a tree, in the common
server case their workloads are independent.
I think FLAT mode is the default; 'hierarchical' is a crazy thing which
cannot be managed.


Thanks,
-Kame



* Re: [RFC REPOST] cgroup: removing css reference drain wait during cgroup removal
  2012-03-14  0:28       ` KAMEZAWA Hiroyuki
@ 2012-03-14  6:11         ` Tejun Heo
  2012-03-14  9:46         ` Glauber Costa
  1 sibling, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2012-03-14  6:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Michal Hocko, Johannes Weiner, gthelen, Hugh Dickins, linux-mm,
	linux-kernel, Vivek Goyal, Jens Axboe, Li Zefan, containers,
	cgroups

Hello,

On Wed, Mar 14, 2012 at 09:28:28AM +0900, KAMEZAWA Hiroyuki wrote:
> > I agree this isn't perfect but then again failing rmdir isn't perfect
> > either and given that the condition can be wholly avoided in
> > hierarchical mode, which should be the default anyway (is there any
> > reason to keep flat mode except for backward compatibility?), I don't
> > think the trade off is too bad.
> 
> One reason is 'performance'. You can see performance trouble when you
> create a deep tree of memcgs in hierarchical mode. The deeper the memcg
> tree, the more res_counters will be shared.
> 
> For example, libvirt creates a cgroup tree such as
> 
> 	/cgroup/memory/libvirt/qemu/GuestXXX/....
>         /cgroup/memory/libvirt/lxc/GuestXXX/...
> 
> No one wants to update 4 res_counters, which is very heavy, just to
> handle the independent workloads of a "Guest".

Yes, performance definitely is a concern but I think that it would be
better to either avoid building deep hierarchies or provide a generic
way to skip some levels rather than implementing a different behavior
mode per controller.  Per-controller behavior selection ends up
requiring highly specialized configuration which is very difficult to
generalize and automate.

> IIUC, in general, even if the processes are in a tree, in the common
> server case their workloads are independent.
> I think FLAT mode is the default; 'hierarchical' is a crazy thing which
> cannot be managed.

I'm currently hoping that cgroup core can provide a generic mechanism
to abbreviate, if you will, hierarchies so that controllers can be
used the same way, with hierarchy-unaware controllers using the same
mechanism to essentially achieve a flat view of the same hierarchy.  It
may well be a pipe dream, though.  I'll think more about it.

Anyways, I'm building up updates on top of the patch to strip out
pre_destroy waiting and failure handling.  Is anyone interested in
doing the memcg part, pretty please?

Thank you very much.

-- 
tejun


* Re: [RFC REPOST] cgroup: removing css reference drain wait during cgroup removal
  2012-03-14  0:28       ` KAMEZAWA Hiroyuki
  2012-03-14  6:11         ` Tejun Heo
@ 2012-03-14  9:46         ` Glauber Costa
  2012-03-15  0:16           ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 12+ messages in thread
From: Glauber Costa @ 2012-03-14  9:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Tejun Heo, Michal Hocko, Johannes Weiner, gthelen, Hugh Dickins,
	linux-mm, linux-kernel, Vivek Goyal, Jens Axboe, Li Zefan,
	containers, cgroups, Peter Zijlstra

On 03/14/2012 04:28 AM, KAMEZAWA Hiroyuki wrote:
> IIUC, in general, even if the processes are in a tree, in the common
> server case their workloads are independent.
> I think FLAT mode is the default; 'hierarchical' is a crazy thing which
> cannot be managed.

Better pay attention to the current overall cgroups discussions being 
held by Tejun then. ([RFD] cgroup: about multiple hierarchies)

The topic of whether to adapt all cgroups to be hierarchical by
default is a recurring one.

I personally think that it is not unachievable to make res_counters
cheaper, thereby making this less of a problem.



* Re: [RFC REPOST] cgroup: removing css reference drain wait during cgroup removal
  2012-03-14  9:46         ` Glauber Costa
@ 2012-03-15  0:16           ` KAMEZAWA Hiroyuki
  2012-03-15 11:24             ` Glauber Costa
  0 siblings, 1 reply; 12+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-03-15  0:16 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Tejun Heo, Michal Hocko, Johannes Weiner, gthelen, Hugh Dickins,
	linux-mm, linux-kernel, Vivek Goyal, Jens Axboe, Li Zefan,
	containers, cgroups, Peter Zijlstra

(2012/03/14 18:46), Glauber Costa wrote:

> On 03/14/2012 04:28 AM, KAMEZAWA Hiroyuki wrote:
>> IIUC, in general, even if the processes are in a tree, in the common
>> server case their workloads are independent.
>> I think FLAT mode is the default; 'hierarchical' is a crazy thing which
>> cannot be managed.
> 
> Better pay attention to the current overall cgroups discussions being 
> held by Tejun then. ([RFD] cgroup: about multiple hierarchies)
> 
> The topic of whether to adapt all cgroups to be hierarchical by
> default is a recurring one.
> 
> I personally think that it is not unachievable to make res_counters
> cheaper, thereby making this less of a problem.
> 


I thought about this a little yesterday. My current idea is to apply the
following rules to res_counter.

1. All res_counters are hierarchical, but the behavior should be optimized.

2. If the parent res_counter has an UNLIMITED limit, 'usage' will not be
   propagated to the parent at _charge_ time.

3. If a res_counter has an UNLIMITED limit, then at usage read time it must
   visit all children and return the sum of them.

Then,
	/cgroup/
		memory/                       (unlimited)
			libvirt/              (unlimited)
				 qemu/        (unlimited)
				        guest/(limited)

All directories can show hierarchical usage and the guest will not see
any lock contention at runtime.
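As an illustration of how rules 2 and 3 could fit together, here is a
rough sketch under one possible reading of them; the toy struct, the
UNLIMITED constant and the fixed child array are assumptions made only
for the sketch, and locking plus the failure-unwind path are left out.

#define UNLIMITED	(~0ULL)

struct toy_counter {
	unsigned long long usage, limit;
	struct toy_counter *parent;
	struct toy_counter *children[8];	/* fixed fan-out, for brevity */
	int nr_children;
};

/* Rule 2: unlimited ancestors are skipped entirely at charge time. */
static int toy_charge(struct toy_counter *c, unsigned long long val)
{
	struct toy_counter *p;

	for (p = c; p; p = p->parent) {
		if (p != c && p->limit == UNLIMITED)
			continue;	/* summed at read time instead */
		if (p->usage + val > p->limit)
			return -1;	/* real code must unwind the charges made so far */
		p->usage += val;
	}
	return 0;
}

/*
 * Rule 3: an unlimited counter answers reads by summing its children
 * (plus anything charged to it directly); a limited counter already has
 * everything below it propagated into its own usage.
 */
static unsigned long long toy_read_usage(const struct toy_counter *c)
{
	unsigned long long sum = c->usage;
	int i;

	if (c->limit != UNLIMITED)
		return sum;

	for (i = 0; i < c->nr_children; i++)
		sum += toy_read_usage(c->children[i]);
	return sum;
}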


With this:
 1. there is no runtime overhead if the parents have unlimited limits.
 2. all res_counters can show the aggregate resource usage of their children.

To do this:
 1. res_counter should keep a children list by itself.

Implementation problems:
 - What should happen when a user sets a new limit on a res_counter which has
   children? Should we simply disallow it? Or take all the children's locks and
   update atomically?
 - Should memory.use_hierarchy become obsolete?

Another problem I'm not sure about at all:
 - blkcg doesn't support hierarchy at all.

Hmm. 

Thanks,
-Kame










* Re: [RFC REPOST] cgroup: removing css reference drain wait during cgroup removal
  2012-03-15  0:16           ` KAMEZAWA Hiroyuki
@ 2012-03-15 11:24             ` Glauber Costa
  2012-03-16  0:02               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 12+ messages in thread
From: Glauber Costa @ 2012-03-15 11:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Tejun Heo, Michal Hocko, Johannes Weiner, gthelen, Hugh Dickins,
	linux-mm, linux-kernel, Vivek Goyal, Jens Axboe, Li Zefan,
	containers, cgroups, Peter Zijlstra

On 03/15/2012 04:16 AM, KAMEZAWA Hiroyuki wrote:
> (2012/03/14 18:46), Glauber Costa wrote:
>
>> On 03/14/2012 04:28 AM, KAMEZAWA Hiroyuki wrote:
>>>> IIUC, in general, even if the processes are in a tree, in the common
>>>> server case their workloads are independent.
>>>> I think FLAT mode is the default; 'hierarchical' is a crazy thing which
>>>> cannot be managed.
>>
>> Better pay attention to the current overall cgroups discussions being
>> held by Tejun then. ([RFD] cgroup: about multiple hierarchies)
>>
>> The topic of whether to adapt all cgroups to be hierarchical by
>> default is a recurring one.
>>
>> I personally think that it is not unachievable to make res_counters
>> cheaper, thereby making this less of a problem.
>>
>
>
> I thought about this a little yesterday. My current idea is to apply the
> following rules to res_counter.
>
> 1. All res_counters are hierarchical, but the behavior should be optimized.
>
> 2. If the parent res_counter has an UNLIMITED limit, 'usage' will not be
>    propagated to the parent at _charge_ time.

That doesn't seem to make much sense. If you are unlimited but your
parent is limited, it has a lot more interest in knowing about the charge
than you do. So the logic should rather be the opposite: don't go around
taking locks and all that if you are unlimited; your parent might have to,
though.

I am trying to experiment a bit with billing to percpu counters for 
unlimited res_counters. But their inexact nature is giving me quite a 
headache.

> 3. If a res_counter has an UNLIMITED limit, then at usage read time it must
>     visit all children and return the sum of them.
>
> Then,
> 	/cgroup/
> 		memory/                       (unlimited)
> 			libvirt/              (unlimited)
> 				 qemu/        (unlimited)
> 				        guest/(limited)
>
> All directories can show hierarchical usage and the guest will not see
> any lock contention at runtime.

If we are okay with summing it up at read time, we may as well
keep everything in percpu counters at all times.
>
> With this:
>   1. there is no runtime overhead if the parents have unlimited limits.
>   2. all res_counters can show the aggregate resource usage of their children.
>
> To do this:
>   1. res_counter should keep a children list by itself.
>
> Implementation problems:
>   - What should happen when a user sets a new limit on a res_counter which has
>     children? Should we simply disallow it? Or take all the children's locks and
>     update atomically?
Well, increasing the limit should always be possible.

As for the kids, how about:

-) Take their locks
-) scan through them, seeing if their usage is below the new allowance
     -) if it is, then ok
     -) if it is not, then try to reclaim (*). Fail if that is not possible.

(*) May be hard to implement, because we already have the res_counter
lock taken, and the code may get nasty. So maybe it is better to just fail
if any of your kids' usage is over the new allowance... (a rough sketch
of this follows below)
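Here is that sketch (illustration only: it reuses the toy counter layout
sketched earlier in the thread, simply fails instead of reclaiming, and
leaves the locking out).

static int toy_set_limit(struct toy_counter *c, unsigned long long new_limit)
{
	int i;

	/* raising (or keeping) the limit is always fine */
	if (new_limit >= c->limit) {
		c->limit = new_limit;
		return 0;
	}

	/* lowering: the counter itself must already fit ... */
	if (c->usage > new_limit)
		return -1;			/* would need reclaim */

	/* ... and so must every kid (locks elided, reclaim not attempted) */
	for (i = 0; i < c->nr_children; i++)
		if (toy_read_usage(c->children[i]) > new_limit)
			return -1;

	c->limit = new_limit;
	return 0;
}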



>   - memory.use_hierarchy should be obsolete ?
If we're going fully hierarchical, yes.


* Re: [RFC REPOST] cgroup: removing css reference drain wait during cgroup removal
  2012-03-15 11:24             ` Glauber Costa
@ 2012-03-16  0:02               ` KAMEZAWA Hiroyuki
  2012-03-16 10:21                 ` Glauber Costa
  0 siblings, 1 reply; 12+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-03-16  0:02 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Tejun Heo, Michal Hocko, Johannes Weiner, gthelen, Hugh Dickins,
	linux-mm, linux-kernel, Vivek Goyal, Jens Axboe, Li Zefan,
	containers, cgroups, Peter Zijlstra

(2012/03/15 20:24), Glauber Costa wrote:

> On 03/15/2012 04:16 AM, KAMEZAWA Hiroyuki wrote:
>> (2012/03/14 18:46), Glauber Costa wrote:
>>
>>> On 03/14/2012 04:28 AM, KAMEZAWA Hiroyuki wrote:
>>>> IIUC, in general, even if the processes are in a tree, in the common
>>>> server case their workloads are independent.
>>>> I think FLAT mode is the default; 'hierarchical' is a crazy thing which
>>>> cannot be managed.
>>>
>>> Better pay attention to the current overall cgroups discussions being
>>> held by Tejun then. ([RFD] cgroup: about multiple hierarchies)
>>>
>>> The topic of whether to adapt all cgroups to be hierarchical by
>>> default is a recurring one.
>>>
>>> I personally think that it is not unachievable to make res_counters
>>> cheaper, thereby making this less of a problem.
>>>
>>
>>
>> I thought about this a little yesterday. My current idea is to apply the
>> following rules to res_counter.
>>
>> 1. All res_counters are hierarchical, but the behavior should be optimized.
>>
>> 2. If the parent res_counter has an UNLIMITED limit, 'usage' will not be
>>    propagated to the parent at _charge_ time.
> 
> That doesn't seem to make much sense. If you are unlimited but your
> parent is limited, it has a lot more interest in knowing about the charge
> than you do.


Sorry, I should have written "if all ancestors are unlimited".
If the parent is limited, the children should be treated as limited.

> So the logic should rather be the opposite: don't go around taking locks
> and all that if you are unlimited; your parent might have to, though.
> 
> I am trying to experiment a bit with billing to percpu counters for 
> unlimited res_counters. But their inexact nature is giving me quite a 
> headache.
> 


Personally, I think a percpu counter is not the best fit. Yes, it will work, but...
because of the nature of its error range, it has a scalability problem. Consider
a tree like

	/A/B/Guest0/tasks
             Guest1/tasks
             Guest2/tasks
             Guest4/tasks
             Guest5/tasks
             ......

A percpu res_counter may scale well at the GuestX level but will contend at level B.
And I don't want to think about what happens on a 256-CPU system; the error in B
will be very big.

Another idea is to let the tasks borrow resources from the memcg, i.e. per-task
caching of charges. But it has two problems: draining unused resources is
difficult, and the precise usage becomes unknown.

IMHO, the hard-limited resource counter itself may be the problem ;)

So the idea 'if all ancestors are unlimited, don't propagate charges'
comes to my mind. With this, people who use resources in a FLAT manner (but
have a hierarchical cgroup tree) will not see any performance problem.



>> 3. If a res_counter has an UNLIMITED limit, then at usage read time it must
>>     visit all children and return the sum of them.
>>
>> Then,
>> 	/cgroup/
>> 		memory/                       (unlimited)
>> 			libvirt/              (unlimited)
>> 				 qemu/        (unlimited)
>> 				        guest/(limited)
>>
>> All directories can show hierarchical usage and the guest will not see
>> any lock contention at runtime.
> 
> If we are okay with summing it up at read time, we may as well
> keep everything in percpu counters at all times.
>


If all ancestors are unlimited, we don't need to propagate usage upwards
at charge time. If one of the ancestors is limited, we need to propagate and
check usage at charge time.



>> With this:
>>   1. there is no runtime overhead if the parents have unlimited limits.
>>   2. all res_counters can show the aggregate resource usage of their children.
>>
>> To do this:
>>   1. res_counter should keep a children list by itself.
>>
>> Implementation problems:
>>   - What should happen when a user sets a new limit on a res_counter which has
>>     children? Should we simply disallow it? Or take all the children's locks and
>>     update atomically?
> Well, increasing the limit should always be possible.
> 


> As for the kids, how about:
> 
> -) Take their locks
> -) scan through them, seeing if their usage is below the new allowance
>      -) if it is, then ok
>      -) if it is not, then try to reclaim (*). Fail if that is not possible.
> 
> (*) May be hard to implement, because we already have the res_counter
> lock taken, and the code may get nasty. So maybe it is better to just fail
> if any of your kids' usage is over the new allowance...
> 


Seems sufficient, and worth trying.


> 
> 
>>   - memory.use_hierarchy should be obsolete ?
> If we're going fully hierarchical, yes.
> 

Another big problem is 'when' we should make this change...
Maybe this 'hierarchical' problem will be a good topic for the MM summit.

Thanks,
-Kame




* Re: [RFC REPOST] cgroup: removing css reference drain wait during cgroup removal
  2012-03-16  0:02               ` KAMEZAWA Hiroyuki
@ 2012-03-16 10:21                 ` Glauber Costa
  0 siblings, 0 replies; 12+ messages in thread
From: Glauber Costa @ 2012-03-16 10:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Tejun Heo, Michal Hocko, Johannes Weiner, gthelen, Hugh Dickins,
	linux-mm, linux-kernel, Vivek Goyal, Jens Axboe, Li Zefan,
	containers, cgroups, Peter Zijlstra


>>>
>>> I thought about this a little yesterday. My current idea is to apply the
>>> following rules to res_counter.
>>>
>>> 1. All res_counters are hierarchical, but the behavior should be optimized.
>>>
>>> 2. If the parent res_counter has an UNLIMITED limit, 'usage' will not be
>>>    propagated to the parent at _charge_ time.
>>
>> That doesn't seem to make much sense. If you are unlimited but your
>> parent is limited, it has a lot more interest in knowing about the charge
>> than you do.
>
>
> Sorry, I should have written "if all ancestors are unlimited".
> If the parent is limited, the children should be treated as limited.
>
>> So the logic should rather be the opposite: don't go around taking locks
>> and all that if you are unlimited; your parent might have to, though.
>>
>> I am trying to experiment a bit with billing to percpu counters for
>> unlimited res_counters. But their inexact nature is giving me quite a
>> headache.
>>
>
>
> Personally, I think a percpu counter is not the best fit. Yes, it will work, but...
> because of the nature of its error range, it has a scalability problem. Consider
> a tree like
>
> 	/A/B/Guest0/tasks
>               Guest1/tasks
>               Guest2/tasks
>               Guest4/tasks
>               Guest5/tasks
>               ......
>
> A percpu res_counter may scale well at the GuestX level but will contend at level B.
> And I don't want to think about what happens on a 256-CPU system; the error in B
> will be very big.

Usually the recommendation in this case is to follow
percpu_counter_read() with percpu_counter_sum().

If we additionally wrap the updaters in rcu_read_lock(), we can sum it
up when needed with relative confidence. It does have performance
scalability problems on big systems, but no bigger than we have today.

The advantage of percpu counters is that we don't need to think in
terms of "unlimited", but rather in terms of "close enough to the limit".

Here is an excerpt of a patch I am experimenting with (note: the
rcu_read_lock() paired with the unlocks below is taken earlier in the
function, before this excerpt starts):

	usage = percpu_counter_read(&c->usage_pcp);

	/* far from the limit: charge the local percpu counter and return */
	if (percpu_counter_read(&c->usage_pcp) + val <
	    c->limit + num_online_cpus() * percpu_counter_batch) {
		percpu_counter_add(&c->usage_pcp, val);
		rcu_read_unlock();
		return 0;
	}
	rcu_read_unlock();

	/* close to the limit: fall back to an exact, locked sum */
	raw_spin_lock(&c->usage_pcp.lock);
	usage = __percpu_counter_sum_locked(&c->usage_pcp);

	if (usage + val > c->limit) {
		c->failcnt++;
		ret = -ENOMEM;
		goto out;
	}

	usage += val;
	c->usage_pcp.count = usage;
	if (usage > c->usage_pcp.max)
		c->usage_pcp.max = usage;

out:
	raw_spin_unlock(&c->usage_pcp.lock);
	return ret;


So we probe the current counter, and if we are unlimited, or not close 
to the limits, we update it per-cpu, and let it go.

If we believe we're in a suspicious area, we take the lock (as a quick
hack, I am holding the percpu lock directly), and then do everything
under the lock.

In the unlimited case, we'll always be writing to the percpu storage.

Note, also, that this always happens either under a spinlock or in an rcu-marked area.

This can also possibly be improved by an infrequent slow-path switch in
which, once we start reading from percpu_counter_sum, we flip a
res_counter bit to mark that, and then we start dealing with
usage_pcp.count directly, without any percpu work. This would work exactly
like res_counters do today, so it is kind of a fallback mode.

For readers, a combination of synchronize_rcu() + spin_lock() on the
reader side should be enough to guarantee that the result of
percpu_counter_sum() is reliable.
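As a sketch of that reader side (not actual kernel code: it reuses the
usage_pcp field and the __percpu_counter_sum_locked() helper, both of
which exist only in the experimental patch excerpted above):

static u64 res_counter_read_exact(struct res_counter *c)
{
	u64 usage;

	synchronize_rcu();			/* wait for updaters inside their rcu section */
	raw_spin_lock(&c->usage_pcp.lock);
	usage = __percpu_counter_sum_locked(&c->usage_pcp);
	raw_spin_unlock(&c->usage_pcp.lock);
	return usage;
}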

Also please note that our read side these days is not that good either:
because of the way we do caching in memcg, we can end up in the *exact*
same situation as with this proposal. If the same memcg is being updated
on all cpus, we can have up to 32 * nr_cpus() pages in the cache that
are not really used by memcg.

I actually have patches to force a drain_all_caches_sync() before reads,
but I am holding them while waiting for something to happen in this
discussion.


>
> Another idea is to let the tasks borrow resources from the memcg, i.e. per-task
> caching of charges. But it has two problems: draining unused resources is
> difficult, and the precise usage becomes unknown.
>
> IMHO, the hard-limited resource counter itself may be the problem ;)
Yes, it is.

Indeed, if percpu charging as I described above is still not acceptable,
our way out of this may be heavier caching. Maybe we need a per-level
cache, or something like that.
Per-task may get tricky, especially when we start having charges that
can't be directly related to a task, like slab objects.

> So the idea 'if all ancestors are unlimited, don't propagate charges'
> comes to my mind. With this, people who use resources in a FLAT manner (but
> have a hierarchical cgroup tree) will not see any performance problem.
>
>
>
>>> 3. If a res_counter has an UNLIMITED limit, then at usage read time it must
>>>      visit all children and return the sum of them.
>>>
>>> Then,
>>> 	/cgroup/
>>> 		memory/                       (unlimited)
>>> 			libvirt/              (unlimited)
>>> 				 qemu/        (unlimited)
>>> 				        guest/(limited)
>>>
>>> All directories can show hierarchical usage and the guest will not see
>>> any lock contention at runtime.
>>
>> If we are okay with summing it up at read time, we may as well
>> keep everything in percpu counters at all times.
>>
>
>
> If all ancestors are unlimited, we don't need to propagate usage upwards
> at charge time. If one of the ancestors is limited, we need to propagate and
> check usage at charge time.

The only problem is that "one of the ancestors is limited" is expected to be
quite a common case. So by optimizing for the unlimited case, I think we're
tricking ourselves into believing we're solving something, when in
reality we're not.

>
>>
>>
>>>    - memory.use_hierarchy should be obsolete ?
>> If we're going fully hierarchical, yes.
>>
>
> Another big problem is 'when' we should make this change...
> Maybe this 'hierarchical' problem will be a good topic for the MM summit.

we're in no shortage of topics! =)

