linux-kernel.vger.kernel.org archive mirror
* [PATCHSET cgroup/for-3.16] cgroup: implement cgroup.populated
@ 2014-04-09 15:07 Tejun Heo
  2014-04-09 15:07 ` [PATCH 1/3] kernfs: implement kernfs_root->supers list Tejun Heo
                   ` (4 more replies)
  0 siblings, 5 replies; 26+ messages in thread
From: Tejun Heo @ 2014-04-09 15:07 UTC (permalink / raw)
  To: gregkh
  Cc: linux-kernel, john, rlove, eparis, serge.hallyn, lennart, kay,
	lizefan, cgroups, containers

Hello,

(inotify folks, can you please review the second patch which hooks up
 kernfs_notify() with fsnotify?  It works but directly invoking
 fsnotify_parent() and fsnotify() feels a bit dirty.  Maybe something
 like fsnotify_notify_modify_by_dentry() would make more sense?)

cgroup users often need a way to determine when a cgroup's
subhierarchy becomes empty so that it can be cleaned up.  cgroup
currently provides release_agent for this; unfortunately, that
mechanism is riddled with issues.  This patchset implements a
replacement mechanism, "cgroup.subtree_populated", which can be used
to monitor whether the cgroup's subhierarchy has tasks in it or not,
and which triggers poll and [di]notify events when its content
changes.

This patchset contains the following three patches.

 0001-kernfs-implement-kernfs_root-supers-list.patch
 0002-kernfs-make-kernfs_notify-trigger-inotify-events-too.patch
 0003-cgroup-implement-cgroup.subtree_populated-for-the-de.patch

0001-0002 add [di]notify notification to kernfs_notify().

0003 implements cgroup.subtree_populated.

This patchset is on top of the "cgroup: implement unified hierarchy"
patchset[1] and available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-populated

diffstat follows.

 fs/kernfs/dir.c             |    1 
 fs/kernfs/file.c            |   41 +++++++++++++++++++++++----
 fs/kernfs/kernfs-internal.h |    5 +++
 fs/kernfs/mount.c           |   11 +++++++
 include/linux/cgroup.h      |   15 ++++++++++
 include/linux/kernfs.h      |    4 ++
 kernel/cgroup.c             |   65 +++++++++++++++++++++++++++++++++++++++++---
 7 files changed, 132 insertions(+), 10 deletions(-)

--
tejun

[1] http://lkml.kernel.org/g/1395974461-12735-1-git-send-email-tj@kernel.org

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/3] kernfs: implement kernfs_root->supers list
  2014-04-09 15:07 [PATCHSET cgroup/for-3.16] cgroup: implement cgroup.populated Tejun Heo
@ 2014-04-09 15:07 ` Tejun Heo
  2014-04-09 15:07 ` [PATCH 2/3] kernfs: make kernfs_notify() trigger inotify events too Tejun Heo
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2014-04-09 15:07 UTC (permalink / raw)
  To: gregkh
  Cc: linux-kernel, john, rlove, eparis, serge.hallyn, lennart, kay,
	lizefan, cgroups, containers, Tejun Heo

Currently, there's no way to find out which super_blocks are
associated with a given kernfs_root.  Let's implement it - the planned
inotify extension to kernfs_notify() needs it.

Make kernfs_super_info point back to the super_block and chain it at
kernfs_root->supers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 fs/kernfs/dir.c             |  1 +
 fs/kernfs/kernfs-internal.h |  5 +++++
 fs/kernfs/mount.c           | 11 +++++++++++
 include/linux/kernfs.h      |  4 ++++
 4 files changed, 21 insertions(+)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 939684e..a8bf7ac 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -711,6 +711,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 		return ERR_PTR(-ENOMEM);
 
 	ida_init(&root->ino_ida);
+	INIT_LIST_HEAD(&root->supers);
 
 	kn = __kernfs_new_node(root, "", S_IFDIR | S_IRUGO | S_IXUGO,
 			       KERNFS_DIR);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index a91d7a1..61e0920 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -49,6 +49,8 @@ static inline struct kernfs_root *kernfs_root(struct kernfs_node *kn)
  * mount.c
  */
 struct kernfs_super_info {
+	struct super_block	*sb;
+
 	/*
 	 * The root associated with this super_block.  Each super_block is
 	 * identified by the root and ns it's associated with.
@@ -62,6 +64,9 @@ struct kernfs_super_info {
 	 * an array and compare kernfs_node tag against every entry.
 	 */
 	const void		*ns;
+
+	/* anchored at kernfs_root->supers, protected by kernfs_mutex */
+	struct list_head	node;
 };
 #define kernfs_info(SB) ((struct kernfs_super_info *)(SB->s_fs_info))
 
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index e5b28b0..1f80295 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -68,6 +68,7 @@ static int kernfs_fill_super(struct super_block *sb)
 	struct inode *inode;
 	struct dentry *root;
 
+	info->sb = sb;
 	sb->s_blocksize = PAGE_CACHE_SIZE;
 	sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
 	sb->s_magic = SYSFS_MAGIC;
@@ -160,12 +161,18 @@ struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
 	if (IS_ERR(sb))
 		return ERR_CAST(sb);
 	if (!sb->s_root) {
+		struct kernfs_super_info *info = kernfs_info(sb);
+
 		error = kernfs_fill_super(sb);
 		if (error) {
 			deactivate_locked_super(sb);
 			return ERR_PTR(error);
 		}
 		sb->s_flags |= MS_ACTIVE;
+
+		mutex_lock(&kernfs_mutex);
+		list_add(&info->node, &root->supers);
+		mutex_unlock(&kernfs_mutex);
 	}
 
 	return dget(sb->s_root);
@@ -184,6 +191,10 @@ void kernfs_kill_sb(struct super_block *sb)
 	struct kernfs_super_info *info = kernfs_info(sb);
 	struct kernfs_node *root_kn = sb->s_root->d_fsdata;
 
+	mutex_lock(&kernfs_mutex);
+	list_del(&info->node);
+	mutex_unlock(&kernfs_mutex);
+
 	/*
 	 * Remove the superblock from fs_supers/s_instances
 	 * so we can't find it, before freeing kernfs_super_info.
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 649497a..d8ec4f2 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -144,6 +144,10 @@ struct kernfs_root {
 	/* private fields, do not use outside kernfs proper */
 	struct ida		ino_ida;
 	struct kernfs_syscall_ops *syscall_ops;
+
+	/* list of kernfs_super_info of this root, protected by kernfs_mutex */
+	struct list_head	supers;
+
 	wait_queue_head_t	deactivate_waitq;
 };
 
-- 
1.9.0



* [PATCH 2/3] kernfs: make kernfs_notify() trigger inotify events too
  2014-04-09 15:07 [PATCHSET cgroup/for-3.16] cgroup: implement cgroup.populated Tejun Heo
  2014-04-09 15:07 ` [PATCH 1/3] kernfs: implement kernfs_root->supers list Tejun Heo
@ 2014-04-09 15:07 ` Tejun Heo
  2014-04-09 15:07 ` [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy Tejun Heo
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2014-04-09 15:07 UTC (permalink / raw)
  To: gregkh
  Cc: linux-kernel, john, rlove, eparis, serge.hallyn, lennart, kay,
	lizefan, cgroups, containers, Tejun Heo

kernfs_notify() is used to indicate either that new data is available
or that the content of a file has changed.  It currently only triggers
poll, which may not be the most convenient interface to monitor,
especially when there are many files to watch.  Let's hook it up to
fsnotify too so that the events can also be monitored via inotify.

fsnotify_modify() requires a file *, but kernfs_notify() has no
specific file associated with it; however, we can walk all
super_blocks associated with a kernfs_root, and as kernfs always
associates a single inode with each ino and a single dentry with each
inode, it's trivial to look up the dentry associated with a given
kernfs_node.  As any active monitor would pin the dentry, looking up
existing dentries is enough.  This patch looks up the dentry
associated with the specified kernfs_node and generates events
equivalent to fsnotify_modify().

Note that as fsnotify doesn't provide fsnotify_modify() equivalent
which can be called with dentry, kernfs_notify() directly calls
fsnotify_parent() and fsnotify().  It might be better to add a wrapper
in fsnotify.h instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Cc: Eric Paris <eparis@parisplace.org>
---
 fs/kernfs/file.c | 41 +++++++++++++++++++++++++++++++++++------
 1 file changed, 35 insertions(+), 6 deletions(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index ddcb471..774af54 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -14,6 +14,7 @@
 #include <linux/poll.h>
 #include <linux/pagemap.h>
 #include <linux/sched.h>
+#include <linux/fsnotify.h>
 
 #include "kernfs-internal.h"
 
@@ -784,20 +785,48 @@ static unsigned int kernfs_fop_poll(struct file *filp, poll_table *wait)
  */
 void kernfs_notify(struct kernfs_node *kn)
 {
+	struct kernfs_root *root = kernfs_root(kn);
 	struct kernfs_open_node *on;
+	struct kernfs_super_info *info;
 	unsigned long flags;
 
+	if (WARN_ON(kernfs_type(kn) != KERNFS_FILE))
+		return;
+
+	/* kick poll */
 	spin_lock_irqsave(&kernfs_open_node_lock, flags);
 
-	if (!WARN_ON(kernfs_type(kn) != KERNFS_FILE)) {
-		on = kn->attr.open;
-		if (on) {
-			atomic_inc(&on->event);
-			wake_up_interruptible(&on->poll);
-		}
+	on = kn->attr.open;
+	if (on) {
+		atomic_inc(&on->event);
+		wake_up_interruptible(&on->poll);
 	}
 
 	spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
+
+	/* kick fsnotify */
+	mutex_lock(&kernfs_mutex);
+
+	list_for_each_entry(info, &root->supers, node) {
+		struct inode *inode;
+		struct dentry *dentry;
+
+		inode = ilookup(info->sb, kn->ino);
+		if (!inode)
+			continue;
+
+		dentry = d_find_any_alias(inode);
+		if (dentry) {
+			fsnotify_parent(NULL, dentry, FS_MODIFY);
+			fsnotify(inode, FS_MODIFY, inode, FSNOTIFY_EVENT_INODE,
+				 NULL, 0);
+			dput(dentry);
+		}
+
+		iput(inode);
+	}
+
+	mutex_unlock(&kernfs_mutex);
 }
 EXPORT_SYMBOL_GPL(kernfs_notify);
 
-- 
1.9.0



* [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy
  2014-04-09 15:07 [PATCHSET cgroup/for-3.16] cgroup: implement cgroup.populated Tejun Heo
  2014-04-09 15:07 ` [PATCH 1/3] kernfs: implement kernfs_root->supers list Tejun Heo
  2014-04-09 15:07 ` [PATCH 2/3] kernfs: make kernfs_notify() trigger inotify events too Tejun Heo
@ 2014-04-09 15:07 ` Tejun Heo
  2014-04-10  3:08   ` Serge E. Hallyn
  2014-04-14 21:31 ` [PATCHSET cgroup/for-3.16] cgroup: implement cgroup.populated Tejun Heo
  2014-04-25 22:30 ` Tejun Heo
  4 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2014-04-09 15:07 UTC (permalink / raw)
  To: gregkh
  Cc: linux-kernel, john, rlove, eparis, serge.hallyn, lennart, kay,
	lizefan, cgroups, containers, Tejun Heo

cgroup users often need a way to determine when a cgroup's
subhierarchy becomes empty so that it can be cleaned up.  cgroup
currently provides release_agent for it; unfortunately, this mechanism
is riddled with issues.

* It delivers events by forking and execing a userland binary
  specified as the release_agent.  This is a long deprecated method of
  notification delivery.  It's extremely heavy, slow and cumbersome to
  integrate with larger infrastructure.

* There is a single monitoring point at the root.  There's no way to
  delegate management of a subtree.

* The event isn't recursive.  It triggers when a cgroup doesn't have
  any tasks or child cgroups.  Events for internal nodes trigger only
  after all children are removed.  This again makes it impossible to
  delegate management of a subtree.

* Events are filtered from the kernel side.  The "notify_on_release"
  file is used to subscribe to or suppress the release event, and no
  event is generated if a cgroup becomes empty by moving the last task
  out of it; however, an event is generated if it becomes empty
  because the last child cgroup is removed.  This is inconsistent,
  awkward and unnecessarily complicated, and probably done this way
  because event delivery itself was expensive.

This patch implements interface file "cgroup.subtree_populated" which
can be used to monitor whether the cgroup's subhierarchy has tasks in
it or not.  Its value is 1 if there is no task in the cgroup and its
descendants; otherwise, 0, and a kernfs_notify() notification is
triggered when the value changes, which can be monitored through poll
and [di]notify.

This is a lot lighter and simpler, and it trivially allows delegating
management of a subhierarchy - a subhierarchy monitor can block
further propagation simply by putting itself or another process in the
root of the subhierarchy and monitoring the events it's interested in
from there, without interfering with monitoring higher in the tree.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Lennart Poettering <lennart@poettering.net>
---
 include/linux/cgroup.h | 15 ++++++++++++
 kernel/cgroup.c        | 65 ++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 76 insertions(+), 4 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index dee6f3c..e45d87f 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -154,6 +154,14 @@ struct cgroup {
 	/* the number of attached css's */
 	int nr_css;
 
+	/*
+	 * If this cgroup contains any tasks, it contributes one to
 	 * populated_cnt.  All children with non-zero populated_cnt of
+	 * their own contribute one.  The count is zero iff there's no task
+	 * in this cgroup or its subtree.
+	 */
+	int populated_cnt;
+
 	atomic_t refcnt;
 
 	/*
@@ -166,6 +174,7 @@ struct cgroup {
 	struct cgroup *parent;		/* my parent */
 	struct kernfs_node *kn;		/* cgroup kernfs entry */
 	struct kernfs_node *control_kn;	/* kn for "cgroup.subtree_control" */
+	struct kernfs_node *populated_kn; /* kn for "cgroup.subtree_populated" */
 
 	/*
 	 * Monotonically increasing unique serial number which defines a
@@ -264,6 +273,12 @@ enum {
 	 *
 	 * - "cgroup.clone_children" is removed.
 	 *
+	 * - "cgroup.subtree_populated" is available.  Its value is 0 if
+	 *   the cgroup and its descendants contain no task; otherwise, 1.
+	 *   The file also generates kernfs notification which can be
+	 *   monitored through poll and [di]notify when the value of the
+	 *   file changes.
+	 *
 	 * - If mount is requested with sane_behavior but without any
 	 *   subsystem, the default unified hierarchy is mounted.
 	 *
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 4e958c7..17f0a09 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -411,6 +411,43 @@ static struct css_set init_css_set = {
 
 static int css_set_count	= 1;	/* 1 for init_css_set */
 
+/**
 + * cgroup_update_populated - update the populated count of a cgroup
+ * @cgrp: the target cgroup
+ * @populated: inc or dec populated count
+ *
+ * @cgrp is either getting the first task (css_set) or losing the last.
+ * Update @cgrp->populated_cnt accordingly.  The count is propagated
+ * towards root so that a given cgroup's populated_cnt is zero iff the
+ * cgroup and all its descendants are empty.
+ *
+ * @cgrp's interface file "cgroup.subtree_populated" is zero if
+ * @cgrp->populated_cnt is zero and 1 otherwise.  When @cgrp->populated_cnt
+ * changes from or to zero, userland is notified that the content of the
+ * interface file has changed.  This can be used to detect when @cgrp and
+ * its descendants become populated or empty.
+ */
+static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
+{
+	lockdep_assert_held(&css_set_rwsem);
+
+	do {
+		bool trigger;
+
+		if (populated)
+			trigger = !cgrp->populated_cnt++;
+		else
+			trigger = !--cgrp->populated_cnt;
+
+		if (!trigger)
+			break;
+
+		if (cgrp->populated_kn)
+			kernfs_notify(cgrp->populated_kn);
+		cgrp = cgrp->parent;
+	} while (cgrp);
+}
+
 /*
  * hash table for cgroup groups. This improves the performance to find
  * an existing css_set. This hash doesn't (currently) take into
@@ -456,10 +493,13 @@ static void put_css_set_locked(struct css_set *cset, bool taskexit)
 		list_del(&link->cgrp_link);
 
 		/* @cgrp can't go away while we're holding css_set_rwsem */
-		if (list_empty(&cgrp->cset_links) && notify_on_release(cgrp)) {
-			if (taskexit)
-				set_bit(CGRP_RELEASABLE, &cgrp->flags);
-			check_for_release(cgrp);
+		if (list_empty(&cgrp->cset_links)) {
+			cgroup_update_populated(cgrp, false);
+			if (notify_on_release(cgrp)) {
+				if (taskexit)
+					set_bit(CGRP_RELEASABLE, &cgrp->flags);
+				check_for_release(cgrp);
+			}
 		}
 
 		kfree(link);
@@ -668,7 +708,11 @@ static void link_css_set(struct list_head *tmp_links, struct css_set *cset,
 	link = list_first_entry(tmp_links, struct cgrp_cset_link, cset_link);
 	link->cset = cset;
 	link->cgrp = cgrp;
+
+	if (list_empty(&cgrp->cset_links))
+		cgroup_update_populated(cgrp, true);
 	list_move(&link->cset_link, &cgrp->cset_links);
+
 	/*
 	 * Always add links to the tail of the list so that the list
 	 * is sorted by order of hierarchy creation
@@ -2633,6 +2677,12 @@ err_undo_css:
 	goto out_unlock;
 }
 
+static int cgroup_subtree_populated_show(struct seq_file *seq, void *v)
+{
+	seq_printf(seq, "%d\n", (bool)seq_css(seq)->cgroup->populated_cnt);
+	return 0;
+}
+
 static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
 				 size_t nbytes, loff_t off)
 {
@@ -2775,6 +2825,8 @@ static int cgroup_add_file(struct cgroup *cgrp, struct cftype *cft)
 				  NULL, false, key);
 	if (cft->seq_show == cgroup_subtree_control_show)
 		cgrp->control_kn = kn;
+	else if (cft->seq_show == cgroup_subtree_populated_show)
+		cgrp->populated_kn = kn;
 	return PTR_ERR_OR_ZERO(kn);
 }
 
@@ -3883,6 +3935,11 @@ static struct cftype cgroup_base_files[] = {
 		.seq_show = cgroup_subtree_control_show,
 		.write_string = cgroup_subtree_control_write,
 	},
+	{
+		.name = "cgroup.subtree_populated",
+		.flags = CFTYPE_ONLY_ON_DFL | CFTYPE_NOT_ON_ROOT,
+		.seq_show = cgroup_subtree_populated_show,
+	},
 
 	/*
 	 * Historical crazy stuff.  These don't have "cgroup."  prefix and
-- 
1.9.0



* Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy
  2014-04-09 15:07 ` [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy Tejun Heo
@ 2014-04-10  3:08   ` Serge E. Hallyn
  2014-04-10 13:08     ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: Serge E. Hallyn @ 2014-04-10  3:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: gregkh, rlove, containers, serge.hallyn, kay, linux-kernel,
	lennart, cgroups, eparis, john

Quoting Tejun Heo (tj@kernel.org):
> cgroup users often need a way to determine when a cgroup's
> subhierarchy becomes empty so that it can be cleaned up.  cgroup
> currently provides release_agent for it; unfortunately, this mechanism
> is riddled with issues.

Thanks, Tejun.

> * It delivers events by forking and execing a userland binary
>   specified as the release_agent.  This is a long deprecated method of
>   notification delivery.  It's extremely heavy, slow and cumbersome to
>   integrate with larger infrastructure.

(Not seriously worried about this, but it's a point worth considering)
It does have one advantage though:  if the userspace agent goes bad,
cgroups can still be removed on empty.

Do you plan on keeping release-on-empty around?  I assume only for a
while?

Do you think there is any value in having a simpler "remove-when-empty"
file?  Doesn't call out to userspace, just drops the cgroup when there
are no more tasks or sub-cgroups?

> * There is single monitoring point at the root.  There's no way to
>   delegate management of subtree.
> 
> * The event isn't recursive.  It triggers when a cgroup doesn't have
>   any tasks or child cgroups.  Events for internal nodes trigger only
>   after all children are removed.  This again makes it impossible to
>   delegate management of subtree.
> 
> * Events are filtered from the kernel side.  "notify_on_release" file
> >   is used to subscribe to or suppress release event and events are not
>   generated if a cgroup becomes empty by moving the last task out of
>   it; however, event is generated if it becomes empty because the last
>   child cgroup is removed.  This is inconsistent, awkward and

Hm, maybe I'm misreading but this doesn't seem right.  If I move
a task into x1 and kill the task, x1 goes away.  Likewise if I
create x1/y1, and rmdir y1, x1 goes away.  I suspect I'm misunderstanding
the case in which you say it doesn't happen?

>   unnecessarily complicated and probably done this way because event
>   delivery itself was expensive.
> 
> This patch implements interface file "cgroup.subtree_populated" which
> can be used to monitor whether the cgroup's subhierarchy has tasks in
> it or not.  Its value is 1 if there is no task in the cgroup and its

I think you meant this backward?  It's 1 if there is *any task in
the cgroup and its descendants, else 0?

> descendants; otherwise, 0, and a kernfs_notify() notification is
> triggered when the value changes, which can be monitored through poll
> and [di]notify.
> 
> This is a lot lighter and simpler, and it trivially allows delegating
> management of a subhierarchy - a subhierarchy monitor can block
> further propagation simply by putting itself or another process in the
> root of the subhierarchy and monitoring the events it's interested in
> from there, without interfering with monitoring higher in the tree.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Serge Hallyn <serge.hallyn@ubuntu.com>

Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>

> Cc: Lennart Poettering <lennart@poettering.net>
> ---
>  include/linux/cgroup.h | 15 ++++++++++++
>  kernel/cgroup.c        | 65 ++++++++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 76 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index dee6f3c..e45d87f 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -154,6 +154,14 @@ struct cgroup {
>  	/* the number of attached css's */
>  	int nr_css;
>  
> +	/*
> +	 * If this cgroup contains any tasks, it contributes one to
> > +	 * populated_cnt.  All children with non-zero populated_cnt of
> +	 * their own contribute one.  The count is zero iff there's no task
> +	 * in this cgroup or its subtree.
> +	 */
> +	int populated_cnt;
> +
>  	atomic_t refcnt;
>  
>  	/*
> @@ -166,6 +174,7 @@ struct cgroup {
>  	struct cgroup *parent;		/* my parent */
>  	struct kernfs_node *kn;		/* cgroup kernfs entry */
>  	struct kernfs_node *control_kn;	/* kn for "cgroup.subtree_control" */
> +	struct kernfs_node *populated_kn; /* kn for "cgroup.subtree_populated" */
>  
>  	/*
>  	 * Monotonically increasing unique serial number which defines a
> @@ -264,6 +273,12 @@ enum {
>  	 *
>  	 * - "cgroup.clone_children" is removed.
>  	 *
> +	 * - "cgroup.subtree_populated" is available.  Its value is 0 if
> +	 *   the cgroup and its descendants contain no task; otherwise, 1.
> +	 *   The file also generates kernfs notification which can be
> +	 *   monitored through poll and [di]notify when the value of the
> +	 *   file changes.
> +	 *
>  	 * - If mount is requested with sane_behavior but without any
>  	 *   subsystem, the default unified hierarchy is mounted.
>  	 *
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 4e958c7..17f0a09 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -411,6 +411,43 @@ static struct css_set init_css_set = {
>  
>  static int css_set_count	= 1;	/* 1 for init_css_set */
>  
> +/**
> > + * cgroup_update_populated - update the populated count of a cgroup
> + * @cgrp: the target cgroup
> + * @populated: inc or dec populated count
> + *
> + * @cgrp is either getting the first task (css_set) or losing the last.
> + * Update @cgrp->populated_cnt accordingly.  The count is propagated
> + * towards root so that a given cgroup's populated_cnt is zero iff the
> + * cgroup and all its descendants are empty.
> + *
> + * @cgrp's interface file "cgroup.subtree_populated" is zero if
> + * @cgrp->populated_cnt is zero and 1 otherwise.  When @cgrp->populated_cnt
> + * changes from or to zero, userland is notified that the content of the
> + * interface file has changed.  This can be used to detect when @cgrp and
> + * its descendants become populated or empty.
> + */
> +static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
> +{
> +	lockdep_assert_held(&css_set_rwsem);
> +
> +	do {
> +		bool trigger;
> +
> +		if (populated)
> +			trigger = !cgrp->populated_cnt++;
> +		else
> +			trigger = !--cgrp->populated_cnt;
> +
> +		if (!trigger)
> +			break;
> +
> +		if (cgrp->populated_kn)
> +			kernfs_notify(cgrp->populated_kn);
> +		cgrp = cgrp->parent;
> +	} while (cgrp);
> +}
> +
>  /*
>   * hash table for cgroup groups. This improves the performance to find
>   * an existing css_set. This hash doesn't (currently) take into
> @@ -456,10 +493,13 @@ static void put_css_set_locked(struct css_set *cset, bool taskexit)
>  		list_del(&link->cgrp_link);
>  
>  		/* @cgrp can't go away while we're holding css_set_rwsem */
> -		if (list_empty(&cgrp->cset_links) && notify_on_release(cgrp)) {
> -			if (taskexit)
> -				set_bit(CGRP_RELEASABLE, &cgrp->flags);
> -			check_for_release(cgrp);
> +		if (list_empty(&cgrp->cset_links)) {
> +			cgroup_update_populated(cgrp, false);
> +			if (notify_on_release(cgrp)) {
> +				if (taskexit)
> +					set_bit(CGRP_RELEASABLE, &cgrp->flags);
> +				check_for_release(cgrp);
> +			}
>  		}
>  
>  		kfree(link);
> @@ -668,7 +708,11 @@ static void link_css_set(struct list_head *tmp_links, struct css_set *cset,
>  	link = list_first_entry(tmp_links, struct cgrp_cset_link, cset_link);
>  	link->cset = cset;
>  	link->cgrp = cgrp;
> +
> +	if (list_empty(&cgrp->cset_links))
> +		cgroup_update_populated(cgrp, true);
>  	list_move(&link->cset_link, &cgrp->cset_links);
> +
>  	/*
>  	 * Always add links to the tail of the list so that the list
>  	 * is sorted by order of hierarchy creation
> @@ -2633,6 +2677,12 @@ err_undo_css:
>  	goto out_unlock;
>  }
>  
> +static int cgroup_subtree_populated_show(struct seq_file *seq, void *v)
> +{
> +	seq_printf(seq, "%d\n", (bool)seq_css(seq)->cgroup->populated_cnt);
> +	return 0;
> +}
> +
>  static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
>  				 size_t nbytes, loff_t off)
>  {
> @@ -2775,6 +2825,8 @@ static int cgroup_add_file(struct cgroup *cgrp, struct cftype *cft)
>  				  NULL, false, key);
>  	if (cft->seq_show == cgroup_subtree_control_show)
>  		cgrp->control_kn = kn;
> +	else if (cft->seq_show == cgroup_subtree_populated_show)
> +		cgrp->populated_kn = kn;
>  	return PTR_ERR_OR_ZERO(kn);
>  }
>  
> @@ -3883,6 +3935,11 @@ static struct cftype cgroup_base_files[] = {
>  		.seq_show = cgroup_subtree_control_show,
>  		.write_string = cgroup_subtree_control_write,
>  	},
> +	{
> +		.name = "cgroup.subtree_populated",
> +		.flags = CFTYPE_ONLY_ON_DFL | CFTYPE_NOT_ON_ROOT,
> +		.seq_show = cgroup_subtree_populated_show,
> +	},
>  
>  	/*
>  	 * Historical crazy stuff.  These don't have "cgroup."  prefix and
> -- 
> 1.9.0
> 
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers


* Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy
  2014-04-10  3:08   ` Serge E. Hallyn
@ 2014-04-10 13:08     ` Tejun Heo
  2014-04-10 14:04       ` Serge Hallyn
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2014-04-10 13:08 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: gregkh, rlove, containers, serge.hallyn, kay, linux-kernel,
	lennart, cgroups, eparis, john

Hey, Serge.

On Thu, Apr 10, 2014 at 05:08:55AM +0200, Serge E. Hallyn wrote:
> Quoting Tejun Heo (tj@kernel.org):
> > * It delivers events by forking and execing a userland binary
> >   specified as the release_agent.  This is a long deprecated method of
> >   notification delivery.  It's extremely heavy, slow and cumbersome to
> >   integrate with larger infrastructure.
> 
> (Not seriously worried about this, but it's a point worth considering)
> It does have one advantage though:  if the userspace agent goes bad,
> cgroups can still be removed on empty.
> 
> Do you plan on keeping release-on-empty around?  I assume only for a
> while?

The new mechanism is only for the unified hierarchy.  The old one will
be kept around for other hierarchies.

> Do you think there is any value in having a simpler "remove-when-empty"
> file?  Doesn't call out to userspace, just drops the cgroup when there
> are no more tasks or sub-cgroups?

I don't think so.  Implementing such a simplistic mechanism in userland
is trivial and even independent failover mechanisms can be easily
implemented from userland as multiple entities can set up watches.  I
don't think there's much value in providing another mechanism from
kernel side.  The only reason why release_agent thing got as complex
as it is is because the mechanism is fundamentally flawed - clumsy
delivery, no multiple watches, single watch point - so people tried to
work around it by adding event filtering from kernel side, which is
quite backwards IMHO.  With proper event mechanism, everything should
be easily achievable from userland side.

> > * Events are filtered from the kernel side.  "notify_on_release" file
> >   is used to subscribe to or suppres release event and events are not
> >   generated if a cgroup becomes empty by moving the last task out of
> >   it; however, event is generated if it becomes empty because the last
> >   child cgroup is removed.  This is inconsistent, awkward and
> 
> Hm, maybe I'm misreading but this doesn't seem right.  If I move
> a task into x1 and kill the task, x1 goes away.  Likewise if I
> create x1/y1, and rmdir y1, x1 goes away.  I suspect I'm misunderstanding
> the case in which you say it doesn't happen?

The case where you move a task out of x1/y1 to another cgroup doesn't
generate an event.  One could say that's unnecessary because the mover
knows that the cgroup is becoming empty; however, that excludes any
case where there is more than one actor, and the same can be said for
cases where the actor is removing a child.

> > This patch implements interface file "cgroup.subtree_populated" which
> > can be used to monitor whether the cgroup's subhierarchy has tasks in
> > it or not.  Its value is 1 if there is no task in the cgroup and its
> 
> I think you meant this backward?  It's 1 if there is *any task in
> the cgroup and its descendants, else 0?

Oops, yeap.  Will update.

Thanks!

-- 
tejun


* Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy
  2014-04-10 13:08     ` Tejun Heo
@ 2014-04-10 14:04       ` Serge Hallyn
  2014-04-10 14:19         ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: Serge Hallyn @ 2014-04-10 14:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Serge E. Hallyn, gregkh, rlove, containers, kay, linux-kernel,
	lennart, cgroups, eparis, john

Quoting Tejun Heo (tj@kernel.org):
> Hey, Serge.
> 
> On Thu, Apr 10, 2014 at 05:08:55AM +0200, Serge E. Hallyn wrote:
> > Quoting Tejun Heo (tj@kernel.org):
> > > * It delivers events by forking and execing a userland binary
> > >   specified as the release_agent.  This is a long deprecated method of
> > >   notification delivery.  It's extremely heavy, slow and cumbersome to
> > >   integrate with larger infrastructure.
> > 
> > (Not seriously worried about this, but it's a point worth considering)
> > It does have one advantage though:  if the userspace agent goes bad,
> > cgroups can still be removed on empty.
> > 
> > Do you plan on keeping release-on-empty around?  I assume only for a
> > while?
> 
> The new mechanism is only for the unified hierarchy.  The old one will
> be kept around for other hierarchies.
> 
> > Do you think there is any value in having a simpler "remove-when-empty"
> > file?  Doesn't call out to userspace, just drops the cgroup when there
> > are no more tasks or sub-cgroups?
> 
> I don't think so.  Implementing such simplistic mechanism in userland
> is trivial and even independent failover mechanisms can be easily
> implemented from userland as multiple entities can set up watches.  I
> don't think there's much value in providing another mechanism from
> kernel side.  The only reason why release_agent thing got as complex
> as it is is because the mechanism is fundamentally flawed - clumsy
> delivery, no multiple watches, single watch point - so people tried to
> work around it by adding event filtering from kernel side, which is
> quite backwards IMHO.  With proper event mechanism, everything should
> be easily achievable from userland side.

Except for keeping state.  If the userspace agent crashes when it
was meant to drop 100 cgroups when they become empty, then when it
restarts those 100 cgroups may never be freed.  Of course userspace
can do things about this, but it just seems like it would be so
trivial to handle this case in the kernel.  Like you say there is
no need for all the fancy agent spawning - just notice that the
cgroup became empty, that releaseonempty was true, and so unlink the
thing.  It'll be freed when anyone who has it held open closes it.

> > > * Events are filtered from the kernel side.  "notify_on_release" file
> > >   is used to subscribe to or suppress release events, and events are not
> > >   generated if a cgroup becomes empty by moving the last task out of
> > >   it; however, an event is generated if it becomes empty because the last
> > >   child cgroup is removed.  This is inconsistent, awkward and
> > 
> > Hm, maybe I'm misreading but this doesn't seem right.  If I move
> > a task into x1 and kill the task, x1 goes away.  Likewise if I
> > create x1/y1, and rmdir y1, x1 goes away.  I suspect I'm misunderstanding
> > the case in which you say it doesn't happen?
> 
> The case where you move a task out of x1/y1 to another cgroup doesn't
> generate an event.  One could say that that's unnecessary because the

Still confused.

If I create x1/y1 and x3/y3, set notify_on_release on x1 and x1/y1,
move a task into x1/y1, then move it to x3/y3, then x1/y1 and x1 both
do get removed.

> mover knows that the cgroup is becoming empty; however, that excludes
> any case where there is more than one actor, and the same can be
> said for cases when the actor is removing a child.
> 
> > > This patch implements interface file "cgroup.subtree_populated" which
> > > can be used to monitor whether the cgroup's subhierarchy has tasks in
> > > it or not.  Its value is 1 if there is no task in the cgroup and its
> > 
> > I think you meant this backward?  It's 1 if there is *any task in
> > the cgroup and its descendants, else 0?
> 
> Oops, yeap.  Will update.
> 
> Thanks!
> 
> -- 
> tejun


* Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy
  2014-04-10 14:04       ` Serge Hallyn
@ 2014-04-10 14:19         ` Tejun Heo
  2014-04-11  9:00           ` Li Zefan
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2014-04-10 14:19 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Serge E. Hallyn, gregkh, rlove, containers, kay, linux-kernel,
	lennart, cgroups, eparis, john

Hello,

On Thu, Apr 10, 2014 at 09:04:24AM -0500, Serge Hallyn wrote:
> Except for keeping state.  If the userspace agent crashes when it
> was meant to drop 100 cgroups when they become empty, then when it
> restarts those 100 cgroups may never be freed.  Of course userspace
> can do things about this, but it just seems like it would be so
> trivial to handle this case in the kernel.  Like you say there is

But it's also trivial from userland.  You can write up a separate
backup agent or just make the main agent perform cleanup on restart,
which it should probably be doing anyway.

> no need for all the fancy agent spawning - just notice that the
> cgroup became empty, that releaseonempty was true, and so unlink the
> thing.  It'll be freed when anyone who has it held open closes it.

It'd be a superfluous feature which requires a separate control knob on
each cgroup and can lead to confusion as there would then be two
entities acting on it by default.  The manager has to worry about
keeping those knobs in sync and deal with cases where cgroups
disappear unexpectedly.  Think about it.  What should the default
value of the knob be?  What if the manager crashes before setting up
the knob to its desired state?  It needs to perform a cleanup pass
after crash *anyway*.  How is it gonna synchronize with the current
state of cgroup hierarchy otherwise?
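Such a cleanup pass is just a short bottom-up walk of the hierarchy.  A
rough dry-run sketch in Python, with the layout assumed from this
patchset (every cgroup directory exposes a "cgroup.subtree_populated"
file reading 0 or 1; the mount point is hypothetical):

```python
import os

def empty_cgroups(root):
    """Collect cgroups whose subtree holds no tasks, deepest first.

    Dry-run sketch: a real manager would os.rmdir() each returned path
    (cgroupfs permits rmdir even though interface files still exist;
    file name and semantics are assumed from this patchset).
    """
    candidates = []
    # Bottom-up walk so children are listed before their parents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        if "cgroup.subtree_populated" not in filenames:
            continue
        with open(os.path.join(dirpath, "cgroup.subtree_populated")) as f:
            if f.read().strip() == "0":
                candidates.append(dirpath)
    return candidates
```

Running this once at manager startup recovers whatever state was lost in
a crash, which is why a kernel-side knob buys little.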

It's adding complexity without any actual benefits.  This is kernel
API.  It should be concise and orthogonal where it can be.  There of
course are cases where convenience should be considered but what
you're suggesting doesn't seem beneficial in any substantial way.

> > The case where you move a task out of x1/y1 to another cgroup doesn't
> > generate an event.  One could say that that's unnecessary because the
> 
> Still confused.
> 
> If I create x1/y1 and x3/y3, set notify_on_release on x1 and x1/y1,
> move a task into x1/y1, then move it to x3/y3, then x1/y1 and x1 both
> do get removed.

Ah, you're right, cgroup_task_migrate() sets CGRP_RELEASABLE
explicitly.  I was confused because put_css_set_locked() sets
CGRP_RELEASABLE only if @taskexit is set.  Will drop that part from
the description.

Thanks.

-- 
tejun


* Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy
  2014-04-10 14:19         ` Tejun Heo
@ 2014-04-11  9:00           ` Li Zefan
  0 siblings, 0 replies; 26+ messages in thread
From: Li Zefan @ 2014-04-11  9:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Serge Hallyn, Serge E. Hallyn, gregkh, rlove, containers, kay,
	linux-kernel, lennart, cgroups, eparis, john

> Ah, you're right, cgroup_task_migrate() sets CGRP_RELEASABLE
> explicitly.  I was confused because put_css_set_locked() sets
> CGRP_RELEASABLE only if @taskexit is set.  Will drop that part from
> the description.
> 

"If the notify_on_release flag is enabled (1) in a cgroup, then
whenever the last task in the cgroup leaves (exits or attaches to
some other cgroup) and the last child cgroup of that cgroup
is removed, then the kernel runs the command specified by the contents
of the "release_agent" file"

says cgroups.txt. :)



* Re: [PATCHSET cgroup/for-3.16] cgroup: implement cgroup.populated
  2014-04-09 15:07 [PATCHSET cgroup/for-3.16] cgroup: implement cgroup.populated Tejun Heo
                   ` (2 preceding siblings ...)
  2014-04-09 15:07 ` [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy Tejun Heo
@ 2014-04-14 21:31 ` Tejun Heo
  2014-04-14 22:26   ` Greg KH
  2014-04-25 22:30 ` Tejun Heo
  4 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2014-04-14 21:31 UTC (permalink / raw)
  To: gregkh
  Cc: linux-kernel, john, rlove, eparis, serge.hallyn, lennart, kay,
	lizefan, cgroups, containers

On Wed, Apr 09, 2014 at 11:07:29AM -0400, Tejun Heo wrote:
> This patchset is on top of "cgroup: implement unified hierarchy"
> patchset[1] and available in the following git branch.
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-populated

Branch renamed to review-populated-v1.

Thanks.

-- 
tejun


* Re: [PATCHSET cgroup/for-3.16] cgroup: implement cgroup.populated
  2014-04-14 21:31 ` [PATCHSET cgroup/for-3.16] cgroup: implement cgroup.populated Tejun Heo
@ 2014-04-14 22:26   ` Greg KH
  2014-04-15 16:18     ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: Greg KH @ 2014-04-14 22:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, john, rlove, eparis, serge.hallyn, lennart, kay,
	lizefan, cgroups, containers

On Mon, Apr 14, 2014 at 05:31:00PM -0400, Tejun Heo wrote:
> On Wed, Apr 09, 2014 at 11:07:29AM -0400, Tejun Heo wrote:
> > This patchset is on top of "cgroup: implement unified hierarchy"
> > patchset[1] and available in the following git branch.
> > 
> >  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-populated
> 
> Branch renamed to review-populated-v1.

Do you want me to pull this branch into my tree for 3.16 now?  I have no
objection to you taking it, if that's easier for you; whichever is fine
with me.

thanks,

greg k-h


* Re: [PATCHSET cgroup/for-3.16] cgroup: implement cgroup.populated
  2014-04-14 22:26   ` Greg KH
@ 2014-04-15 16:18     ` Tejun Heo
  2014-04-23 15:16       ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2014-04-15 16:18 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-kernel, john, rlove, eparis, serge.hallyn, lennart, kay,
	lizefan, cgroups, containers

Hello, Greg.

On Mon, Apr 14, 2014 at 03:26:58PM -0700, Greg KH wrote:
> Do you want me to pull this branch into my tree for 3.16 now?  I have no
> objection to you taking it, if that's easier for you; whichever is fine
> with me.

I think the best way to proceed would be applying 1-2 to
driver-core-next and then base cgroup/for-3.16 on top of it, but I'd
like to get some feedback from inotify people before proceeding.

Thanks.

-- 
tejun


* Re: [PATCHSET cgroup/for-3.16] cgroup: implement cgroup.populated
  2014-04-15 16:18     ` Tejun Heo
@ 2014-04-23 15:16       ` Tejun Heo
  2014-04-25 18:57         ` Greg KH
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2014-04-23 15:16 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-kernel, john, rlove, eparis, serge.hallyn, lennart, kay,
	lizefan, cgroups, containers

On Tue, Apr 15, 2014 at 12:18:28PM -0400, Tejun Heo wrote:
> Hello, Greg.
> 
> On Mon, Apr 14, 2014 at 03:26:58PM -0700, Greg KH wrote:
> > Do you want me to pull this branch into my tree for 3.16 now?  I have no
> > objection to you taking it, if that's easier for you; whichever is fine
> > with me.
> 
> I think the best way to proceed would be applying 1-2 to
> driver-core-next and then base cgroup/for-3.16 on top of it, but I'd
> like to get some feedback from inotify people before proceeding.

Greg, can you please apply the first two patches to driver-core-next
(or a separate branch which is pulled into driver-core-next, either
way works for me as long as the branch is stable)?  I'll pull in the
branch and route the third patch through cgroup/for-3.16.

Thanks!

-- 
tejun


* Re: [PATCHSET cgroup/for-3.16] cgroup: implement cgroup.populated
  2014-04-23 15:16       ` Tejun Heo
@ 2014-04-25 18:57         ` Greg KH
  0 siblings, 0 replies; 26+ messages in thread
From: Greg KH @ 2014-04-25 18:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, john, rlove, eparis, serge.hallyn, lennart, kay,
	lizefan, cgroups, containers

On Wed, Apr 23, 2014 at 11:16:38AM -0400, Tejun Heo wrote:
> On Tue, Apr 15, 2014 at 12:18:28PM -0400, Tejun Heo wrote:
> > Hello, Greg.
> > 
> > On Mon, Apr 14, 2014 at 03:26:58PM -0700, Greg KH wrote:
> > > Do you want me to pull this branch into my tree for 3.16 now?  I have no
> > > objection to you taking it, if that's easier for you; whichever is fine
> > > with me.
> > 
> > I think the best way to proceed would be applying 1-2 to
> > driver-core-next and then base cgroup/for-3.16 on top of it, but I'd
> > like to get some feedback from inotify people before proceeding.
> 
> Greg, can you please apply the first two patches to driver-core-next
> (or a separate branch which is pulled into driver-core-next, either
> way works for me as long as the branch is stable)?  I'll pull in the
> branch and route the third patch through cgroup/for-3.16.

Now done.

greg k-h


* Re: [PATCHSET cgroup/for-3.16] cgroup: implement cgroup.populated
  2014-04-09 15:07 [PATCHSET cgroup/for-3.16] cgroup: implement cgroup.populated Tejun Heo
                   ` (3 preceding siblings ...)
  2014-04-14 21:31 ` [PATCHSET cgroup/for-3.16] cgroup: implement cgroup.populated Tejun Heo
@ 2014-04-25 22:30 ` Tejun Heo
  4 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2014-04-25 22:30 UTC (permalink / raw)
  To: gregkh
  Cc: linux-kernel, john, rlove, eparis, serge.hallyn, lennart, kay,
	lizefan, cgroups, containers

On Wed, Apr 09, 2014 at 11:07:29AM -0400, Tejun Heo wrote:
> Hello,
> 
> (inotify folks, can you please review the second patch which hooks up
>  kernfs_notify() with fsnotify?  It works but directly invoking
>  fsnotify_parent() and fsnotify() feels a bit dirty.  Maybe something
>  like fsnotify_notify_modify_by_dentry() would make more sense?)
> 
> cgroup users often need a way to determine when a cgroup's
> subhierarchy becomes empty so that it can be cleaned up.  cgroup
> currently provides release_agent for it; unfortunately, this mechanism
> is riddled with issues.  This patchset implements a replacement
> mechanism "cgroup.subtree_populated" which can be used to monitor
> whether the cgroup's subhierarchy has tasks in it or not and triggers
> poll and [di]notify events when its content changes.
> 
> This patchset contains the following three patches.
> 
>  0001-kernfs-implement-kernfs_root-supers-list.patch
>  0002-kernfs-make-kernfs_notify-trigger-inotify-events-too.patch
>  0003-cgroup-implement-cgroup.subtree_populated-for-the-de.patch

0001-0002 are applied to driver-core-next, which was pulled into
cgroup/for-3.16.  0003 is applied to cgroup/for-3.16.

Thanks.

-- 
tejun


* Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy
  2014-04-16  3:50       ` Eric W. Biederman
  2014-04-16  4:15         ` Kay Sievers
@ 2014-04-16  4:20         ` Li Zefan
  1 sibling, 0 replies; 26+ messages in thread
From: Li Zefan @ 2014-04-16  4:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kay Sievers, rlove, Greg Kroah-Hartman, containers, serge.hallyn,
	LKML, Lennart Poettering, eparis, Tejun Heo, cgroups, john

On 2014/4/16 11:50, Eric W. Biederman wrote:
> Kay Sievers <kay@vrfy.org> writes:
> 
>> On Tue, Apr 15, 2014 at 7:48 PM, Li Zefan <lizefan@huawei.com> wrote:
>>> On 2014/4/15 5:44, Tejun Heo wrote:
>>>> cgroup users often need a way to determine when a cgroup's
>>>> subhierarchy becomes empty so that it can be cleaned up.  cgroup
>>>> currently provides release_agent for it; unfortunately, this mechanism
>>>> is riddled with issues.
>>>>
>>>> * It delivers events by forking and execing a userland binary
>>>>   specified as the release_agent.  This is a long deprecated method of
>>>>   notification delivery.  It's extremely heavy, slow and cumbersome to
>>>>   integrate with larger infrastructure.
>>>>
>>>> * There is single monitoring point at the root.  There's no way to
>>>>   delegate management of subtree.
>>>>
>>>> * The event isn't recursive.  It triggers when a cgroup doesn't have
>>>>   any tasks or child cgroups.  Events for internal nodes trigger only
>>>>   after all children are removed.  This again makes it impossible to
>>>>   delegate management of subtree.
>>>>
>>>> * Events are filtered from the kernel side.  "notify_on_release" file
>>>>   is used to subscribe to or suppress release event.  This is
>>>>   unnecessarily complicated and probably done this way because event
>>>>   delivery itself was expensive.
>>>>
>>>> This patch implements interface file "cgroup.subtree_populated" which
>>>> can be used to monitor whether the cgroup's subhierarchy has tasks in
>>>> it or not.  Its value is 0 if there is no task in the cgroup and its
>>>> descendants; otherwise, 1, and a kernfs_notify() notification is
>>>> triggered when the value changes, which can be monitored through poll
>>>> and [di]notify.
>>>>
>>>
>>> For the old notification mechanism, the path of the cgroup that becomes
>>> empty will be passed to the user specified release agent. Like this:
>>>
>>> # cat /sbin/cpuset_release_agent
>>> #!/bin/sh
>>> rmdir /dev/cpuset/$1
>>>
>>> How do we achieve this using inotify?
>>>
>>> - monitor all the cgroups, or
>>> - monitor all the leaf cgroups, and walk cgrp->parent to delete all
>>>   empty cgroups.
>>> - monitor the root cgroup only, and walk the whole hierarchy to find
>>>   empty cgroups when it gets an fs event.
>>>
>>> None of them seems scalable.
>>
>> The manager would add all cgroups as watches to one inotify file
>> descriptor, it should not be problem to do that.
> 
> inotify won't work on cgroupfs.
> 

Tejun's working on inotify support for cgroupfs, and I believe this patchset
has been tested, so it works.

So what do you mean by saying it won't work? Could you be more specific?



* Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy
  2014-04-16  3:33     ` Kay Sievers
  2014-04-16  3:50       ` Eric W. Biederman
@ 2014-04-16  4:16       ` Li Zefan
  1 sibling, 0 replies; 26+ messages in thread
From: Li Zefan @ 2014-04-16  4:16 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Tejun Heo, containers, cgroups, LKML, john, rlove, eparis,
	Greg Kroah-Hartman, serge.hallyn, Lennart Poettering

On 2014/4/16 11:33, Kay Sievers wrote:
> On Tue, Apr 15, 2014 at 7:48 PM, Li Zefan <lizefan@huawei.com> wrote:
>> On 2014/4/15 5:44, Tejun Heo wrote:
>>> cgroup users often need a way to determine when a cgroup's
>>> subhierarchy becomes empty so that it can be cleaned up.  cgroup
>>> currently provides release_agent for it; unfortunately, this mechanism
>>> is riddled with issues.
>>>
>>> * It delivers events by forking and execing a userland binary
>>>   specified as the release_agent.  This is a long deprecated method of
>>>   notification delivery.  It's extremely heavy, slow and cumbersome to
>>>   integrate with larger infrastructure.
>>>
>>> * There is single monitoring point at the root.  There's no way to
>>>   delegate management of subtree.
>>>
>>> * The event isn't recursive.  It triggers when a cgroup doesn't have
>>>   any tasks or child cgroups.  Events for internal nodes trigger only
>>>   after all children are removed.  This again makes it impossible to
>>>   delegate management of subtree.
>>>
>>> * Events are filtered from the kernel side.  "notify_on_release" file
>>>   is used to subscribe to or suppress release event.  This is
>>>   unnecessarily complicated and probably done this way because event
>>>   delivery itself was expensive.
>>>
>>> This patch implements interface file "cgroup.subtree_populated" which
>>> can be used to monitor whether the cgroup's subhierarchy has tasks in
>>> it or not.  Its value is 0 if there is no task in the cgroup and its
>>> descendants; otherwise, 1, and a kernfs_notify() notification is
>>> triggered when the value changes, which can be monitored through poll
>>> and [di]notify.
>>>
>>
>> For the old notification mechanism, the path of the cgroup that becomes
>> empty will be passed to the user specified release agent. Like this:
>>
>> # cat /sbin/cpuset_release_agent
>> #!/bin/sh
>> rmdir /dev/cpuset/$1
>>
>> How do we achieve this using inotify?
>>
>> - monitor all the cgroups, or
>> - monitor all the leaf cgroups, and walk cgrp->parent to delete all
>>   empty cgroups.
>> - monitor the root cgroup only, and walk the whole hierarchy to find
>>   empty cgroups when it gets an fs event.
>>
>> None of them seems scalable.
> 
> The manager would add all cgroups as watches to one inotify file
> descriptor, it should not be problem to do that.
> 

I've never used inotify.  Thanks for the explanation; after googling a
bit, I think inotify can scale to thousands of cgroups.



* Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy
  2014-04-16  3:50       ` Eric W. Biederman
@ 2014-04-16  4:15         ` Kay Sievers
  2014-04-16  4:20         ` Li Zefan
  1 sibling, 0 replies; 26+ messages in thread
From: Kay Sievers @ 2014-04-16  4:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Li Zefan, rlove, Greg Kroah-Hartman, containers, serge.hallyn,
	LKML, Lennart Poettering, eparis, Tejun Heo, cgroups, john

On Tue, Apr 15, 2014 at 8:50 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Kay Sievers <kay@vrfy.org> writes:
>
>> On Tue, Apr 15, 2014 at 7:48 PM, Li Zefan <lizefan@huawei.com> wrote:
>>> On 2014/4/15 5:44, Tejun Heo wrote:
>>>> cgroup users often need a way to determine when a cgroup's
>>>> subhierarchy becomes empty so that it can be cleaned up.  cgroup
>>>> currently provides release_agent for it; unfortunately, this mechanism
>>>> is riddled with issues.
>>>>
>>>> * It delivers events by forking and execing a userland binary
>>>>   specified as the release_agent.  This is a long deprecated method of
>>>>   notification delivery.  It's extremely heavy, slow and cumbersome to
>>>>   integrate with larger infrastructure.
>>>>
>>>> * There is single monitoring point at the root.  There's no way to
>>>>   delegate management of subtree.
>>>>
>>>> * The event isn't recursive.  It triggers when a cgroup doesn't have
>>>>   any tasks or child cgroups.  Events for internal nodes trigger only
>>>>   after all children are removed.  This again makes it impossible to
>>>>   delegate management of subtree.
>>>>
>>>> * Events are filtered from the kernel side.  "notify_on_release" file
>>>>   is used to subscribe to or suppress release event.  This is
>>>>   unnecessarily complicated and probably done this way because event
>>>>   delivery itself was expensive.
>>>>
>>>> This patch implements interface file "cgroup.subtree_populated" which
>>>> can be used to monitor whether the cgroup's subhierarchy has tasks in
>>>> it or not.  Its value is 0 if there is no task in the cgroup and its
>>>> descendants; otherwise, 1, and a kernfs_notify() notification is
>>>> triggered when the value changes, which can be monitored through poll
>>>> and [di]notify.
>>>>
>>>
>>> For the old notification mechanism, the path of the cgroup that becomes
>>> empty will be passed to the user specified release agent. Like this:
>>>
>>> # cat /sbin/cpuset_release_agent
>>> #!/bin/sh
>>> rmdir /dev/cpuset/$1
>>>
>>> How do we achieve this using inotify?
>>>
>>> - monitor all the cgroups, or
>>> - monitor all the leaf cgroups, and walk cgrp->parent to delete all
>>>   empty cgroups.
>>> - monitor the root cgroup only, and walk the whole hierarchy to find
>>>   empty cgroups when it gets an fs event.
>>>
>>> None of them seems scalable.
>>
>> The manager would add all cgroups as watches to one inotify file
>> descriptor, it should not be problem to do that.
>
> inotify won't work on cgroupfs.

Inotify on kernfs will work.

Kay


* Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy
  2014-04-16  3:33     ` Kay Sievers
@ 2014-04-16  3:50       ` Eric W. Biederman
  2014-04-16  4:15         ` Kay Sievers
  2014-04-16  4:20         ` Li Zefan
  2014-04-16  4:16       ` Li Zefan
  1 sibling, 2 replies; 26+ messages in thread
From: Eric W. Biederman @ 2014-04-16  3:50 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Li Zefan, rlove, Greg Kroah-Hartman, containers, serge.hallyn,
	LKML, Lennart Poettering, eparis, Tejun Heo, cgroups, john

Kay Sievers <kay@vrfy.org> writes:

> On Tue, Apr 15, 2014 at 7:48 PM, Li Zefan <lizefan@huawei.com> wrote:
>> On 2014/4/15 5:44, Tejun Heo wrote:
>>> cgroup users often need a way to determine when a cgroup's
>>> subhierarchy becomes empty so that it can be cleaned up.  cgroup
>>> currently provides release_agent for it; unfortunately, this mechanism
>>> is riddled with issues.
>>>
>>> * It delivers events by forking and execing a userland binary
>>>   specified as the release_agent.  This is a long deprecated method of
>>>   notification delivery.  It's extremely heavy, slow and cumbersome to
>>>   integrate with larger infrastructure.
>>>
>>> * There is single monitoring point at the root.  There's no way to
>>>   delegate management of subtree.
>>>
>>> * The event isn't recursive.  It triggers when a cgroup doesn't have
>>>   any tasks or child cgroups.  Events for internal nodes trigger only
>>>   after all children are removed.  This again makes it impossible to
>>>   delegate management of subtree.
>>>
>>> * Events are filtered from the kernel side.  "notify_on_release" file
>>>   is used to subscribe to or suppress release event.  This is
>>>   unnecessarily complicated and probably done this way because event
>>>   delivery itself was expensive.
>>>
>>> This patch implements interface file "cgroup.subtree_populated" which
>>> can be used to monitor whether the cgroup's subhierarchy has tasks in
>>> it or not.  Its value is 0 if there is no task in the cgroup and its
>>> descendants; otherwise, 1, and a kernfs_notify() notification is
>>> triggered when the value changes, which can be monitored through poll
>>> and [di]notify.
>>>
>>
>> For the old notification mechanism, the path of the cgroup that becomes
>> empty will be passed to the user specified release agent. Like this:
>>
>> # cat /sbin/cpuset_release_agent
>> #!/bin/sh
>> rmdir /dev/cpuset/$1
>>
>> How do we achieve this using inotify?
>>
>> - monitor all the cgroups, or
>> - monitor all the leaf cgroups, and walk cgrp->parent to delete all
>>   empty cgroups.
>> - monitor the root cgroup only, and walk the whole hierarchy to find
>>   empty cgroups when it gets an fs event.
>>
>> None of them seems scalable.
>
> The manager would add all cgroups as watches to one inotify file
> descriptor, it should not be problem to do that.

inotify won't work on cgroupfs.

Eric


* Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy
  2014-04-16  2:48   ` Li Zefan
@ 2014-04-16  3:33     ` Kay Sievers
  2014-04-16  3:50       ` Eric W. Biederman
  2014-04-16  4:16       ` Li Zefan
  0 siblings, 2 replies; 26+ messages in thread
From: Kay Sievers @ 2014-04-16  3:33 UTC (permalink / raw)
  To: Li Zefan
  Cc: Tejun Heo, containers, cgroups, LKML, john, rlove, eparis,
	Greg Kroah-Hartman, serge.hallyn, Lennart Poettering

On Tue, Apr 15, 2014 at 7:48 PM, Li Zefan <lizefan@huawei.com> wrote:
> On 2014/4/15 5:44, Tejun Heo wrote:
>> cgroup users often need a way to determine when a cgroup's
>> subhierarchy becomes empty so that it can be cleaned up.  cgroup
>> currently provides release_agent for it; unfortunately, this mechanism
>> is riddled with issues.
>>
>> * It delivers events by forking and execing a userland binary
>>   specified as the release_agent.  This is a long deprecated method of
>>   notification delivery.  It's extremely heavy, slow and cumbersome to
>>   integrate with larger infrastructure.
>>
>> * There is single monitoring point at the root.  There's no way to
>>   delegate management of subtree.
>>
>> * The event isn't recursive.  It triggers when a cgroup doesn't have
>>   any tasks or child cgroups.  Events for internal nodes trigger only
>>   after all children are removed.  This again makes it impossible to
>>   delegate management of subtree.
>>
>> * Events are filtered from the kernel side.  "notify_on_release" file
>>   is used to subscribe to or suppress release event.  This is
>>   unnecessarily complicated and probably done this way because event
>>   delivery itself was expensive.
>>
>> This patch implements interface file "cgroup.subtree_populated" which
>> can be used to monitor whether the cgroup's subhierarchy has tasks in
>> it or not.  Its value is 0 if there is no task in the cgroup and its
>> descendants; otherwise, 1, and a kernfs_notify() notification is
>> triggered when the value changes, which can be monitored through poll
>> and [di]notify.
>>
>
> For the old notification mechanism, the path of the cgroup that becomes
> empty will be passed to the user specified release agent. Like this:
>
> # cat /sbin/cpuset_release_agent
> #!/bin/sh
> rmdir /dev/cpuset/$1
>
> How do we achieve this using inotify?
>
> - monitor all the cgroups, or
> - monitor all the leaf cgroups, and walk cgrp->parent to delete all
>   empty cgroups.
> - monitor the root cgroup only, and walk the whole hierarchy to find
>   empty cgroups when it gets an fs event.
>
> None of them seems scalable.

The manager would add all cgroups as watches to one inotify file
descriptor, it should not be problem to do that.

Kay
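That pattern, one inotify descriptor fanned out over many watched
directories, is a sketch like the following.  Python's stdlib has no
inotify binding, so this goes through raw libc calls; the constants are
copied from <sys/inotify.h>, the paths are hypothetical, and whether
cgroupfs actually delivers these events is exactly what patches 1-2 of
this series add:

```python
import ctypes
import os
import struct

libc = ctypes.CDLL("libc.so.6", use_errno=True)

IN_MODIFY = 0x00000002             # from <sys/inotify.h>
EVENT_HDR = struct.Struct("iIII")  # struct inotify_event: wd, mask, cookie, len

def add_watches(paths, mask=IN_MODIFY):
    """Create one inotify fd and register every path on it."""
    fd = libc.inotify_init1(os.O_CLOEXEC)  # IN_CLOEXEC == O_CLOEXEC
    if fd < 0:
        raise OSError(ctypes.get_errno(), os.strerror(ctypes.get_errno()))
    wds = {}
    for p in paths:
        wd = libc.inotify_add_watch(fd, p.encode(), mask)
        if wd < 0:
            raise OSError(ctypes.get_errno(), os.strerror(ctypes.get_errno()))
        wds[wd] = p
    return fd, wds

def next_event(fd, wds):
    """Block for one event; return (watched path, changed entry name)."""
    buf = os.read(fd, 4096)
    wd, mask, cookie, namelen = EVENT_HDR.unpack_from(buf)
    name = buf[EVENT_HDR.size:EVENT_HDR.size + namelen].rstrip(b"\0").decode()
    return wds[wd], name
```

A manager would call add_watches() over every cgroup directory it owns
and service the single descriptor from its event loop.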


* Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy
  2014-04-14 21:44 ` [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy Tejun Heo
  2014-04-15  0:57   ` Li Zefan
@ 2014-04-16  2:48   ` Li Zefan
  2014-04-16  3:33     ` Kay Sievers
  1 sibling, 1 reply; 26+ messages in thread
From: Li Zefan @ 2014-04-16  2:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: containers, cgroups, linux-kernel, john, rlove, eparis, gregkh,
	serge.hallyn, lennart, kay

Hi Tejun,

On 2014/4/15 5:44, Tejun Heo wrote:
> cgroup users often need a way to determine when a cgroup's
> subhierarchy becomes empty so that it can be cleaned up.  cgroup
> currently provides release_agent for it; unfortunately, this mechanism
> is riddled with issues.
> 
> * It delivers events by forking and execing a userland binary
>   specified as the release_agent.  This is a long deprecated method of
>   notification delivery.  It's extremely heavy, slow and cumbersome to
>   integrate with larger infrastructure.
> 
> * There is single monitoring point at the root.  There's no way to
>   delegate management of subtree.
> 
> * The event isn't recursive.  It triggers when a cgroup doesn't have
>   any tasks or child cgroups.  Events for internal nodes trigger only
>   after all children are removed.  This again makes it impossible to
>   delegate management of subtree.
> 
> * Events are filtered from the kernel side.  "notify_on_release" file
>   is used to subscribe to or suppress release event.  This is
>   unnecessarily complicated and probably done this way because event
>   delivery itself was expensive.
> 
> This patch implements interface file "cgroup.subtree_populated" which
> can be used to monitor whether the cgroup's subhierarchy has tasks in
> it or not.  Its value is 0 if there is no task in the cgroup and its
> descendants; otherwise, 1, and a kernfs_notify() notification is
> triggered when the value changes, which can be monitored through poll
> and [di]notify.
> 

For the old notification mechanism, the path of the cgroup that becomes
empty will be passed to the user specified release agent. Like this:

# cat /sbin/cpuset_release_agent
#!/bin/sh
rmdir /dev/cpuset/$1

How do we achieve this using inotify?

- monitor all the cgroups, or
- monitor all the leaf cgroups, and walk cgrp->parent to delete all
  empty cgroups.
- monitor the root cgroup only, and walk the whole hierarchy to find
  empty cgroups when it gets an fs event.

None of them seems scalable.
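Another option the patch enables is to skip inotify entirely and poll(2)
the populated file of each delegated subtree root.  A minimal sketch,
with hypothetical paths, assuming kernfs_notify() wakes pollers with
POLLERR|POLLPRI as sysfs attributes do:

```python
import os
import select

def read_flag(path):
    """Read a 0/1 cgroup interface file such as cgroup.subtree_populated."""
    with open(path) as f:
        return int(f.read().strip())

def wait_for_change(path):
    """Block until the kernel signals a change on path, then reread it.

    kernfs wakes pollers with POLLERR|POLLPRI rather than POLLIN, so
    that's what we register for.  Requires a real cgroupfs mount to
    actually wake up; the path below is an assumption for illustration.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        p = select.poll()
        p.register(fd, select.POLLPRI | select.POLLERR)
        p.poll()  # returns when kernfs_notify() fires on the file
        return read_flag(path)
    finally:
        os.close(fd)

# Hypothetical usage: remove a delegated subtree root once it drains.
#   while wait_for_change("/sys/fs/cgroup/x1/cgroup.subtree_populated"):
#       pass
#   os.rmdir("/sys/fs/cgroup/x1")
```

Because the event propagates up the hierarchy, one watcher per delegated
subtree root suffices instead of one per leaf.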

> This is a lot lighter and simpler and trivially allows delegating
> management of a subhierarchy - a subhierarchy monitor can block further
> propagation simply by putting itself or another process in the root of
> the subhierarchy and monitoring events that it's interested in from there
> without interfering with monitoring higher in the tree.
> 
> v2: Patch description updated as per Serge.
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy
  2014-04-15 16:52       ` Tejun Heo
@ 2014-04-16  1:30         ` Li Zefan
  0 siblings, 0 replies; 26+ messages in thread
From: Li Zefan @ 2014-04-16  1:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: containers, cgroups, linux-kernel, john, rlove, eparis, gregkh,
	serge.hallyn, lennart, kay

On 2014/4/16 0:52, Tejun Heo wrote:
> On Tue, Apr 15, 2014 at 10:54:50AM -0400, Tejun Heo wrote:
>> Hello,
>>
>> On Tue, Apr 15, 2014 at 08:57:21AM +0800, Li Zefan wrote:
>>> Is cgroup.tree_populated a better name?
>>>
>>> cgroup.subtree_control controls child cgroups only, but .subtree_populated
>>> shows 1 if there're tasks in the cgroup or its children, so the two
>>> are a bit inconsistent to me.
>>
>> Yes, good catch.  subtree_control affects subtree proper.
>> subtree_populated covers self too.  The difference is subtle and the
>> trade-off between shared pattern in names and clarifying the subtlety
>> didn't seem clear-cut to me.  Hmmm....
> 
> How about cgroup.populated?
> 

Yeah, fine for me.



* Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy
  2014-04-15 14:54     ` Tejun Heo
@ 2014-04-15 16:52       ` Tejun Heo
  2014-04-16  1:30         ` Li Zefan
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2014-04-15 16:52 UTC (permalink / raw)
  To: Li Zefan
  Cc: containers, cgroups, linux-kernel, john, rlove, eparis, gregkh,
	serge.hallyn, lennart, kay

On Tue, Apr 15, 2014 at 10:54:50AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Apr 15, 2014 at 08:57:21AM +0800, Li Zefan wrote:
> > Is cgroup.tree_populated a better name?
> > 
> > cgroup.subtree_control controls child cgroups only, but .subtree_populated
> > shows 1 if there're tasks in the cgroup or its children, so the two
> > are a bit inconsistent to me.
> 
> Yes, good catch.  subtree_control affects subtree proper.
> subtree_populated covers self too.  The difference is subtle and the
> trade-off between shared pattern in names and clarifying the subtlety
> didn't seem clear-cut to me.  Hmmm....

How about cgroup.populated?

-- 
tejun


* Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy
  2014-04-15  0:57   ` Li Zefan
@ 2014-04-15 14:54     ` Tejun Heo
  2014-04-15 16:52       ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2014-04-15 14:54 UTC (permalink / raw)
  To: Li Zefan
  Cc: containers, cgroups, linux-kernel, john, rlove, eparis, gregkh,
	serge.hallyn, lennart, kay

Hello,

On Tue, Apr 15, 2014 at 08:57:21AM +0800, Li Zefan wrote:
> Is cgroup.tree_populated a better name?
> 
> cgroup.subtree_control controls child cgroups only, but .subtree_populated
> shows 1 if there're tasks in the cgroup or its children, so the two
> are a bit inconsistent to me.

Yes, good catch.  subtree_control affects subtree proper.
subtree_populated covers self too.  The difference is subtle and the
trade-off between shared pattern in names and clarifying the subtlety
didn't seem clear-cut to me.  Hmmm....

-- 
tejun


* Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy
  2014-04-14 21:44 ` [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy Tejun Heo
@ 2014-04-15  0:57   ` Li Zefan
  2014-04-15 14:54     ` Tejun Heo
  2014-04-16  2:48   ` Li Zefan
  1 sibling, 1 reply; 26+ messages in thread
From: Li Zefan @ 2014-04-15  0:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: containers, cgroups, linux-kernel, john, rlove, eparis, gregkh,
	serge.hallyn, lennart, kay

On 2014/4/15 5:44, Tejun Heo wrote:
> cgroup users often need a way to determine when a cgroup's
> subhierarchy becomes empty so that it can be cleaned up.  cgroup
> currently provides release_agent for it; unfortunately, this mechanism
> is riddled with issues.
> 
> * It delivers events by forking and execing a userland binary
>   specified as the release_agent.  This is a long-deprecated method of
>   notification delivery.  It's extremely heavy, slow, and cumbersome to
>   integrate with larger infrastructure.
> 
> * There is a single monitoring point at the root.  There's no way to
>   delegate management of a subtree.
> 
> * The event isn't recursive.  It triggers when a cgroup doesn't have
>   any tasks or child cgroups.  Events for internal nodes trigger only
>   after all children are removed.  This again makes it impossible to
>   delegate management of a subtree.
> 
> * Events are filtered on the kernel side.  The "notify_on_release" file
>   is used to subscribe to or suppress release events.  This is
>   unnecessarily complicated and probably done this way because event
>   delivery itself was expensive.
> 
> This patch implements interface file "cgroup.subtree_populated" which
> can be used to monitor whether the cgroup's subhierarchy has tasks in
> it or not.  Its value is 0 if there is no task in the cgroup and its
> descendants; otherwise, 1, 

Is cgroup.tree_populated a better name?

cgroup.subtree_control controls child cgroups only, but .subtree_populated
shows 1 if there're tasks in the cgroup or its children, so the two
are a bit inconsistent to me.

> and a kernfs_notify() notification is
> triggered when the value changes, which can be monitored through poll
> and [di]notify.
> 
> This is a lot lighter and simpler and trivially allows delegating
> management of a subhierarchy - a subhierarchy monitor can block further
> propagation simply by putting itself or another process in the root of
> the subhierarchy and monitoring events that it's interested in from there
> without interfering with monitoring higher in the tree.
> 
> v2: Patch description updated as per Serge.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
> Cc: Lennart Poettering <lennart@poettering.net>



* [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy
  2014-04-14 21:44 [PATCHSET cgroup/for-3.16] cgroup: implement cgroup.populated, v2 Tejun Heo
@ 2014-04-14 21:44 ` Tejun Heo
  2014-04-15  0:57   ` Li Zefan
  2014-04-16  2:48   ` Li Zefan
  0 siblings, 2 replies; 26+ messages in thread
From: Tejun Heo @ 2014-04-14 21:44 UTC (permalink / raw)
  To: lizefan
  Cc: containers, cgroups, linux-kernel, john, rlove, eparis, gregkh,
	serge.hallyn, lennart, kay, Tejun Heo

cgroup users often need a way to determine when a cgroup's
subhierarchy becomes empty so that it can be cleaned up.  cgroup
currently provides release_agent for it; unfortunately, this mechanism
is riddled with issues.

* It delivers events by forking and execing a userland binary
  specified as the release_agent.  This is a long-deprecated method of
  notification delivery.  It's extremely heavy, slow, and cumbersome to
  integrate with larger infrastructure.

* There is a single monitoring point at the root.  There's no way to
  delegate management of a subtree.

* The event isn't recursive.  It triggers when a cgroup doesn't have
  any tasks or child cgroups.  Events for internal nodes trigger only
  after all children are removed.  This again makes it impossible to
  delegate management of a subtree.

* Events are filtered on the kernel side.  The "notify_on_release" file
  is used to subscribe to or suppress release events.  This is
  unnecessarily complicated and probably done this way because event
  delivery itself was expensive.

This patch implements interface file "cgroup.subtree_populated" which
can be used to monitor whether the cgroup's subhierarchy has tasks in
it or not.  Its value is 0 if there is no task in the cgroup and its
descendants; otherwise, 1.  A kernfs_notify() notification is
triggered when the value changes, which can be monitored through poll
and [di]notify.

This is a lot lighter and simpler and trivially allows delegating
management of a subhierarchy - a subhierarchy monitor can block further
propagation simply by putting itself or another process in the root of
the subhierarchy and monitoring events that it's interested in from there
without interfering with monitoring higher in the tree.

v2: Patch description updated as per Serge.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Lennart Poettering <lennart@poettering.net>
---
 include/linux/cgroup.h | 15 ++++++++++++
 kernel/cgroup.c        | 65 ++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 76 insertions(+), 4 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ada2392..4b38e2d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -154,6 +154,14 @@ struct cgroup {
 	/* the number of attached css's */
 	int nr_css;
 
+	/*
+	 * If this cgroup contains any tasks, it contributes one to
+	 * populated_cnt.  All children with non-zero populated_cnt of
+	 * their own contribute one.  The count is zero iff there's no task
+	 * in this cgroup or its subtree.
+	 */
+	int populated_cnt;
+
 	atomic_t refcnt;
 
 	/*
@@ -166,6 +174,7 @@ struct cgroup {
 	struct cgroup *parent;		/* my parent */
 	struct kernfs_node *kn;		/* cgroup kernfs entry */
 	struct kernfs_node *control_kn;	/* kn for "cgroup.subtree_control" */
+	struct kernfs_node *populated_kn; /* kn for "cgroup.subtree_populated" */
 
 	/*
 	 * Monotonically increasing unique serial number which defines a
@@ -264,6 +273,12 @@ enum {
 	 *
 	 * - "cgroup.clone_children" is removed.
 	 *
+	 * - "cgroup.subtree_populated" is available.  Its value is 0 if
+	 *   the cgroup and its descendants contain no task; otherwise, 1.
+	 *   The file also generates kernfs notification which can be
+	 *   monitored through poll and [di]notify when the value of the
+	 *   file changes.
+	 *
 	 * - If mount is requested with sane_behavior but without any
 	 *   subsystem, the default unified hierarchy is mounted.
 	 *
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 08c4439..e379e83 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -411,6 +411,43 @@ static struct css_set init_css_set = {
 
 static int css_set_count	= 1;	/* 1 for init_css_set */
 
+/**
+ * cgroup_update_populated - update populated count of a cgroup
+ * @cgrp: the target cgroup
+ * @populated: inc or dec populated count
+ *
+ * @cgrp is either getting the first task (css_set) or losing the last.
+ * Update @cgrp->populated_cnt accordingly.  The count is propagated
+ * towards root so that a given cgroup's populated_cnt is zero iff the
+ * cgroup and all its descendants are empty.
+ *
+ * @cgrp's interface file "cgroup.subtree_populated" is zero if
+ * @cgrp->populated_cnt is zero and 1 otherwise.  When @cgrp->populated_cnt
+ * changes from or to zero, userland is notified that the content of the
+ * interface file has changed.  This can be used to detect when @cgrp and
+ * its descendants become populated or empty.
+ */
+static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
+{
+	lockdep_assert_held(&css_set_rwsem);
+
+	do {
+		bool trigger;
+
+		if (populated)
+			trigger = !cgrp->populated_cnt++;
+		else
+			trigger = !--cgrp->populated_cnt;
+
+		if (!trigger)
+			break;
+
+		if (cgrp->populated_kn)
+			kernfs_notify(cgrp->populated_kn);
+		cgrp = cgrp->parent;
+	} while (cgrp);
+}
+
 /*
  * hash table for cgroup groups. This improves the performance to find
  * an existing css_set. This hash doesn't (currently) take into
@@ -456,10 +493,13 @@ static void put_css_set_locked(struct css_set *cset, bool taskexit)
 		list_del(&link->cgrp_link);
 
 		/* @cgrp can't go away while we're holding css_set_rwsem */
-		if (list_empty(&cgrp->cset_links) && notify_on_release(cgrp)) {
-			if (taskexit)
-				set_bit(CGRP_RELEASABLE, &cgrp->flags);
-			check_for_release(cgrp);
+		if (list_empty(&cgrp->cset_links)) {
+			cgroup_update_populated(cgrp, false);
+			if (notify_on_release(cgrp)) {
+				if (taskexit)
+					set_bit(CGRP_RELEASABLE, &cgrp->flags);
+				check_for_release(cgrp);
+			}
 		}
 
 		kfree(link);
@@ -668,7 +708,11 @@ static void link_css_set(struct list_head *tmp_links, struct css_set *cset,
 	link = list_first_entry(tmp_links, struct cgrp_cset_link, cset_link);
 	link->cset = cset;
 	link->cgrp = cgrp;
+
+	if (list_empty(&cgrp->cset_links))
+		cgroup_update_populated(cgrp, true);
 	list_move(&link->cset_link, &cgrp->cset_links);
+
 	/*
 	 * Always add links to the tail of the list so that the list
 	 * is sorted by order of hierarchy creation
@@ -2643,6 +2687,12 @@ err_undo_css:
 	goto out_unlock;
 }
 
+static int cgroup_subtree_populated_show(struct seq_file *seq, void *v)
+{
+	seq_printf(seq, "%d\n", (bool)seq_css(seq)->cgroup->populated_cnt);
+	return 0;
+}
+
 static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
 				 size_t nbytes, loff_t off)
 {
@@ -2809,6 +2859,8 @@ static int cgroup_add_file(struct cgroup *cgrp, struct cftype *cft)
 
 	if (cft->seq_show == cgroup_subtree_control_show)
 		cgrp->control_kn = kn;
+	else if (cft->seq_show == cgroup_subtree_populated_show)
+		cgrp->populated_kn = kn;
 	return 0;
 }
 
@@ -3922,6 +3974,11 @@ static struct cftype cgroup_base_files[] = {
 		.seq_show = cgroup_subtree_control_show,
 		.write_string = cgroup_subtree_control_write,
 	},
+	{
+		.name = "cgroup.subtree_populated",
+		.flags = CFTYPE_ONLY_ON_DFL | CFTYPE_NOT_ON_ROOT,
+		.seq_show = cgroup_subtree_populated_show,
+	},
 
 	/*
 	 * Historical crazy stuff.  These don't have "cgroup."  prefix and
-- 
1.9.0



end of thread, other threads:[~2014-04-25 22:30 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-09 15:07 [PATCHSET cgroup/for-3.16] cgroup: implement cgroup.populated Tejun Heo
2014-04-09 15:07 ` [PATCH 1/3] kernfs: implement kernfs_root->supers list Tejun Heo
2014-04-09 15:07 ` [PATCH 2/3] kernfs: make kernfs_notify() trigger inotify events too Tejun Heo
2014-04-09 15:07 ` [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy Tejun Heo
2014-04-10  3:08   ` Serge E. Hallyn
2014-04-10 13:08     ` Tejun Heo
2014-04-10 14:04       ` Serge Hallyn
2014-04-10 14:19         ` Tejun Heo
2014-04-11  9:00           ` Li Zefan
2014-04-14 21:31 ` [PATCHSET cgroup/for-3.16] cgroup: implement cgroup.populated Tejun Heo
2014-04-14 22:26   ` Greg KH
2014-04-15 16:18     ` Tejun Heo
2014-04-23 15:16       ` Tejun Heo
2014-04-25 18:57         ` Greg KH
2014-04-25 22:30 ` Tejun Heo
2014-04-14 21:44 [PATCHSET cgroup/for-3.16] cgroup: implement cgroup.populated, v2 Tejun Heo
2014-04-14 21:44 ` [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the default hierarchy Tejun Heo
2014-04-15  0:57   ` Li Zefan
2014-04-15 14:54     ` Tejun Heo
2014-04-15 16:52       ` Tejun Heo
2014-04-16  1:30         ` Li Zefan
2014-04-16  2:48   ` Li Zefan
2014-04-16  3:33     ` Kay Sievers
2014-04-16  3:50       ` Eric W. Biederman
2014-04-16  4:15         ` Kay Sievers
2014-04-16  4:20         ` Li Zefan
2014-04-16  4:16       ` Li Zefan
