* [PATCH v4 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
@ 2010-07-30 23:56 Ben Blum
  2010-07-30 23:59 ` [PATCH v4 2/2] cgroups: make procs file writable Ben Blum
                   ` (2 more replies)
  0 siblings, 3 replies; 185+ messages in thread
From: Ben Blum @ 2010-07-30 23:56 UTC (permalink / raw)
  To: linux-kernel, containers
  Cc: akpm, bblum, ebiederm, lizf, matthltc, menage, oleg

This patch series is a revision of http://lkml.org/lkml/2010/6/25/11 .

This patch series implements a write function for the 'cgroup.procs'
per-cgroup file, which enables atomic movement of multithreaded
applications between cgroups. Writing the thread-ID of any thread in a
threadgroup to a cgroup's procs file causes all threads in the group to
be moved to that cgroup safely with respect to threads forking/exiting.
(Possible usage scenario: If running a multithreaded build system that
sucks up system resources, this lets you restrict it all at once into a
new cgroup to keep it under control.)

Example: Suppose pid 31337 clones new threads 31338 and 31339.

# cat /dev/cgroup/tasks
...
31337
31338
31339
# mkdir /dev/cgroup/foo
# echo 31337 > /dev/cgroup/foo/cgroup.procs
# cat /dev/cgroup/foo/tasks
31337
31338
31339

A new lock, called threadgroup_fork_lock and living in signal_struct, is
introduced to ensure atomicity when moving threads between cgroups. It's
taken for writing during the operation, and taken for reading in fork()
around the calls to cgroup_fork() and cgroup_post_fork(). I put calls to
down_read/up_read directly in copy_process(), since new inline functions
seemed like overkill.
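
In condensed form, the locking pattern looks like this (a simplified sketch
of the code in the patches below; the CONFIG_CGROUPS ifdefs are omitted here
for brevity):

	/* read side: kernel/fork.c, copy_process() */
	if (clone_flags & CLONE_THREAD)
		down_read(&current->signal->threadgroup_fork_lock);
	cgroup_fork(p);
	/* ... the rest of copy_process() ... */
	cgroup_post_fork(p);
	if (clone_flags & CLONE_THREAD)
		up_read(&current->signal->threadgroup_fork_lock);

	/* write side: kernel/cgroup.c, cgroup_attach_proc() */
	down_write(&leader->signal->threadgroup_fork_lock);
	/* ... migrate the leader and every thread in its group ... */
	up_write(&leader->signal->threadgroup_fork_lock);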

-- Ben

---
 Documentation/cgroups/cgroups.txt |   13 -
 include/linux/init_task.h         |    9
 include/linux/sched.h             |   10
 kernel/cgroup.c                   |  426 +++++++++++++++++++++++++++++++++-----
 kernel/cgroup_freezer.c           |    4
 kernel/cpuset.c                   |    4
 kernel/fork.c                     |   16 +
 kernel/ns_cgroup.c                |    4
 kernel/sched.c                    |    4
 9 files changed, 440 insertions(+), 50 deletions(-)

^ permalink raw reply	[flat|nested] 185+ messages in thread

* [PATCH v4 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup
@ 2010-07-30 23:57     ` Ben Blum
  0 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-07-30 23:57 UTC (permalink / raw)
  To: linux-kernel, containers
  Cc: akpm, bblum, ebiederm, lizf, matthltc, menage, oleg

[-- Attachment #1: cgroup-threadgroup-fork-lock.patch --]
[-- Type: text/plain, Size: 3687 bytes --]

Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup

From: Ben Blum <bblum@andrew.cmu.edu>

This patch adds an rwsem that lives in a threadgroup's signal_struct that's
taken for reading in the fork path, under CONFIG_CGROUPS. If another part of
the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
ifdefs should be changed to a higher-up flag that CGROUPS and the other system
would both depend on.

This is a pre-patch for cgroups-procs-write.patch.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
---
 include/linux/init_task.h |    9 +++++++++
 include/linux/sched.h     |   10 ++++++++++
 kernel/fork.c             |   16 ++++++++++++++++
 3 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 1f43fa5..ca46711 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -15,6 +15,14 @@
 extern struct files_struct init_files;
 extern struct fs_struct init_fs;
 
+#ifdef CONFIG_CGROUPS
+#define INIT_THREADGROUP_FORK_LOCK(sig)					\
+	.threadgroup_fork_lock =					\
+		__RWSEM_INITIALIZER(sig.threadgroup_fork_lock),
+#else
+#define INIT_THREADGROUP_FORK_LOCK(sig)
+#endif
+
 #define INIT_SIGNALS(sig) {						\
 	.nr_threads	= 1,						\
 	.wait_chldexit	= __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
@@ -29,6 +37,7 @@ extern struct fs_struct init_fs;
 		.running = 0,						\
 		.lock = __SPIN_LOCK_UNLOCKED(sig.cputimer.lock),	\
 	},								\
+	INIT_THREADGROUP_FORK_LOCK(sig)					\
 }
 
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ae69716..82b0bcf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -619,6 +619,16 @@ struct signal_struct {
 	unsigned audit_tty;
 	struct tty_audit_buf *tty_audit_buf;
 #endif
+#ifdef CONFIG_CGROUPS
+	/*
+	 * The threadgroup_fork_lock prevents threads from forking with
+	 * CLONE_THREAD while held for writing. Use this for fork-sensitive
+	 * threadgroup-wide operations. It's taken for reading in fork.c in
+	 * copy_process().
+	 * Currently only needed write-side by cgroups.
+	 */
+	struct rw_semaphore threadgroup_fork_lock;
+#endif
 
 	int oom_adj;	/* OOM kill score adjustment (bit shift) */
 };
diff --git a/kernel/fork.c b/kernel/fork.c
index a82a65c..a9bce89 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -898,6 +898,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 
 	tty_audit_fork(sig);
 
+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->threadgroup_fork_lock);
+#endif
+
 	sig->oom_adj = current->signal->oom_adj;
 
 	return 0;
@@ -1076,6 +1080,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	monotonic_to_bootbased(&p->real_start_time);
 	p->io_context = NULL;
 	p->audit_context = NULL;
+#ifdef CONFIG_CGROUPS
+	if (clone_flags & CLONE_THREAD)
+		down_read(&current->signal->threadgroup_fork_lock);
+#endif
 	cgroup_fork(p);
 #ifdef CONFIG_NUMA
 	p->mempolicy = mpol_dup(p->mempolicy);
@@ -1283,6 +1291,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	write_unlock_irq(&tasklist_lock);
 	proc_fork_connector(p);
 	cgroup_post_fork(p);
+#ifdef CONFIG_CGROUPS
+	if (clone_flags & CLONE_THREAD)
+		up_read(&current->signal->threadgroup_fork_lock);
+#endif
 	perf_event_fork(p);
 	return p;
 
@@ -1316,6 +1328,10 @@ bad_fork_cleanup_policy:
 	mpol_put(p->mempolicy);
 bad_fork_cleanup_cgroup:
 #endif
+#ifdef CONFIG_CGROUPS
+	if (clone_flags & CLONE_THREAD)
+		up_read(&current->signal->threadgroup_fork_lock);
+#endif
 	cgroup_exit(p, cgroup_callbacks_done);
 	delayacct_tsk_free(p);
 	module_put(task_thread_info(p)->exec_domain->module);

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v4 2/2] cgroups: make procs file writable
  2010-07-30 23:56 [PATCH v4 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Ben Blum
@ 2010-07-30 23:59 ` Ben Blum
  2010-08-04  1:08   ` KAMEZAWA Hiroyuki
       [not found]   ` <20100730235902.GC22644-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
       [not found] ` <20100730235649.GA22644-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2010-08-03 19:58 ` [PATCH v4 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Andrew Morton
  2 siblings, 2 replies; 185+ messages in thread
From: Ben Blum @ 2010-07-30 23:59 UTC (permalink / raw)
  To: linux-kernel, containers
  Cc: akpm, bblum, ebiederm, lizf, matthltc, menage, oleg

[-- Attachment #1: cgroup-procs-writable.patch --]
[-- Type: text/plain, Size: 19712 bytes --]

Makes procs file writable to move all threads by tgid at once

From: Ben Blum <bblum@andrew.cmu.edu>

This patch adds functionality that enables users to move all threads in a
threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
file. This current implementation makes use of a per-threadgroup rwsem that's
taken for reading in the fork() path to prevent newly forking threads within
the threadgroup from "escaping" while the move is in progress.
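
For illustration, a minimal userspace sketch of the new interface (the
/dev/cgroup/foo mount point and cgroup name are assumptions, mirroring the
cover letter's example; per the documentation hunk below, writing 0 moves the
writer's own threadgroup):

	#include <stdio.h>
	#include <stdlib.h>

	int main(int argc, char **argv)
	{
		/* assumed mount point and cgroup */
		const char *procs = "/dev/cgroup/foo/cgroup.procs";
		/* tgid of any thread in the target group; 0 = caller's own group */
		long tgid = (argc > 1) ? strtol(argv[1], NULL, 10) : 0;
		FILE *f = fopen(procs, "w");

		if (!f) {
			perror("fopen");
			return 1;
		}
		fprintf(f, "%ld\n", tgid);
		if (fclose(f)) {	/* stdio flushes the write on close */
			perror("write cgroup.procs");
			return 1;
		}
		return 0;
	}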

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
---
 Documentation/cgroups/cgroups.txt |   13 +
 kernel/cgroup.c                   |  426 +++++++++++++++++++++++++++++++++----
 kernel/cgroup_freezer.c           |    4 
 kernel/cpuset.c                   |    4 
 kernel/ns_cgroup.c                |    4 
 kernel/sched.c                    |    4 
 6 files changed, 405 insertions(+), 50 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index b34823f..5f3c707 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -235,7 +235,8 @@ containing the following files describing that cgroup:
  - cgroup.procs: list of tgids in the cgroup.  This list is not
    guaranteed to be sorted or free of duplicate tgids, and userspace
    should sort/uniquify the list if this property is required.
-   This is a read-only file, for now.
+   Writing a thread group id into this file moves all threads in that
+   group into this cgroup.
  - notify_on_release flag: run the release agent on exit?
  - release_agent: the path to use for release notifications (this file
    exists in the top cgroup only)
@@ -416,6 +417,12 @@ You can attach the current shell task by echoing 0:
 
 # echo 0 > tasks
 
+You can use the cgroup.procs file instead of the tasks file to move all
+threads in a threadgroup at once. Echoing the pid of any task in a
+threadgroup to cgroup.procs causes all tasks in that threadgroup to be
+attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
+in the writing task's threadgroup.
+
 2.3 Mounting hierarchies by name
 --------------------------------
 
@@ -564,7 +571,9 @@ called on a fork. If this method returns 0 (success) then this should
 remain valid while the caller holds cgroup_mutex and it is ensured that either
 attach() or cancel_attach() will be called in future. If threadgroup is
 true, then a successful result indicates that all threads in the given
-thread's threadgroup can be moved together.
+thread's threadgroup can be moved together. If the subsystem wants to
+iterate over task->thread_group, it must take rcu_read_lock then check
+if thread_group_leader(task), returning -EAGAIN if that fails.
 
 void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	       struct task_struct *task, bool threadgroup)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f91d7dd..fab8c87 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1688,6 +1688,76 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 }
 EXPORT_SYMBOL_GPL(cgroup_path);
 
+/*
+ * cgroup_task_migrate - move a task from one cgroup to another.
+ *
+ * 'guarantee' is set if the caller promises that a new css_set for the task
+ * will already exist. If not set, this function might sleep, and can fail with
+ * -ENOMEM. Otherwise, it can only fail with -ESRCH.
+ */
+static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
+			       struct task_struct *tsk, bool guarantee)
+{
+	struct css_set *oldcg;
+	struct css_set *newcg;
+
+	/*
+	 * get old css_set. we need to take task_lock and refcount it, because
+	 * an exiting task can change its css_set to init_css_set and drop its
+	 * old one without taking cgroup_mutex.
+	 */
+	task_lock(tsk);
+	oldcg = tsk->cgroups;
+	get_css_set(oldcg);
+	task_unlock(tsk);
+
+	/* locate or allocate a new css_set for this task. */
+	if (guarantee) {
+		/* we know the css_set we want already exists. */
+		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+		read_lock(&css_set_lock);
+		newcg = find_existing_css_set(oldcg, cgrp, template);
+		BUG_ON(!newcg);
+		get_css_set(newcg);
+		read_unlock(&css_set_lock);
+	} else {
+		might_sleep();
+		/* find_css_set will give us newcg already referenced. */
+		newcg = find_css_set(oldcg, cgrp);
+		if (!newcg) {
+			put_css_set(oldcg);
+			return -ENOMEM;
+		}
+	}
+	put_css_set(oldcg);
+
+	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
+	task_lock(tsk);
+	if (tsk->flags & PF_EXITING) {
+		task_unlock(tsk);
+		put_css_set(newcg);
+		return -ESRCH;
+	}
+	rcu_assign_pointer(tsk->cgroups, newcg);
+	task_unlock(tsk);
+
+	/* Update the css_set linked lists if we're using them */
+	write_lock(&css_set_lock);
+	if (!list_empty(&tsk->cg_list))
+		list_move(&tsk->cg_list, &newcg->tasks);
+	write_unlock(&css_set_lock);
+
+	/*
+	 * We just gained a reference on oldcg by taking it from the task. As
+	 * trading it for newcg is protected by cgroup_mutex, we're safe to drop
+	 * it here; it will be freed under RCU.
+	 */
+	put_css_set(oldcg);
+
+	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+	return 0;
+}
+
 /**
  * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
  * @cgrp: the cgroup the task is attaching to
@@ -1698,11 +1768,9 @@ EXPORT_SYMBOL_GPL(cgroup_path);
  */
 int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
-	int retval = 0;
+	int retval;
 	struct cgroup_subsys *ss, *failed_ss = NULL;
 	struct cgroup *oldcgrp;
-	struct css_set *cg;
-	struct css_set *newcg;
 	struct cgroupfs_root *root = cgrp->root;
 
 	/* Nothing to do if the task is already in that cgroup */
@@ -1726,46 +1794,16 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		}
 	}
 
-	task_lock(tsk);
-	cg = tsk->cgroups;
-	get_css_set(cg);
-	task_unlock(tsk);
-	/*
-	 * Locate or allocate a new css_set for this task,
-	 * based on its final set of cgroups
-	 */
-	newcg = find_css_set(cg, cgrp);
-	put_css_set(cg);
-	if (!newcg) {
-		retval = -ENOMEM;
-		goto out;
-	}
-
-	task_lock(tsk);
-	if (tsk->flags & PF_EXITING) {
-		task_unlock(tsk);
-		put_css_set(newcg);
-		retval = -ESRCH;
+	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, false);
+	if (retval)
 		goto out;
-	}
-	rcu_assign_pointer(tsk->cgroups, newcg);
-	task_unlock(tsk);
-
-	/* Update the css_set linked lists if we're using them */
-	write_lock(&css_set_lock);
-	if (!list_empty(&tsk->cg_list)) {
-		list_del(&tsk->cg_list);
-		list_add(&tsk->cg_list, &newcg->tasks);
-	}
-	write_unlock(&css_set_lock);
 
 	for_each_subsys(root, ss) {
 		if (ss->attach)
 			ss->attach(ss, cgrp, oldcgrp, tsk, false);
 	}
-	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+
 	synchronize_rcu();
-	put_css_set(cg);
 
 	/*
 	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
@@ -1791,49 +1829,341 @@ out:
 }
 
 /*
- * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
- * held. May take task_lock of task
+ * cgroup_attach_proc works in two stages, the first of which prefetches all
+ * new css_sets needed (to make sure we have enough memory before committing
+ * to the move) and stores them in a list of entries of the following type.
+ * TODO: possible optimization: use css_set->rcu_head for chaining instead
+ */
+struct cg_list_entry {
+	struct css_set *cg;
+	struct list_head links;
+};
+
+static bool css_set_check_fetched(struct cgroup *cgrp,
+				  struct task_struct *tsk, struct css_set *cg,
+				  struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+
+	read_lock(&css_set_lock);
+	newcg = find_existing_css_set(cg, cgrp, template);
+	if (newcg)
+		get_css_set(newcg);
+	read_unlock(&css_set_lock);
+
+	/* doesn't exist at all? */
+	if (!newcg)
+		return false;
+	/* see if it's already in the list */
+	list_for_each_entry(cg_entry, newcg_list, links) {
+		if (cg_entry->cg == newcg) {
+			put_css_set(newcg);
+			return true;
+		}
+	}
+
+	/* not found */
+	put_css_set(newcg);
+	return false;
+}
+
+/*
+ * Find the new css_set and store it in the list in preparation for moving the
+ * given task to the given cgroup. Returns 0 or -ENOMEM.
  */
-static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
+static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
+			    struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+
+	/* ensure a new css_set will exist for this thread */
+	newcg = find_css_set(cg, cgrp);
+	if (!newcg)
+		return -ENOMEM;
+	/* add it to the list */
+	cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
+	if (!cg_entry) {
+		put_css_set(newcg);
+		return -ENOMEM;
+	}
+	cg_entry->cg = newcg;
+	list_add(&cg_entry->links, newcg_list);
+	return 0;
+}
+
+/**
+ * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
+ * @cgrp: the cgroup to attach to
+ * @leader: the threadgroup leader task_struct of the group to be attached
+ *
+ * Call holding cgroup_mutex. Will take task_lock of each thread in leader's
+ * threadgroup individually in turn.
+ */
+int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
+{
+	int retval;
+	struct cgroup_subsys *ss, *failed_ss = NULL;
+	struct cgroup *oldcgrp;
+	struct css_set *oldcg;
+	struct cgroupfs_root *root = cgrp->root;
+	/* threadgroup list cursor */
+	struct task_struct *tsk;
+	/*
+	 * we need to make sure we have css_sets for all the tasks we're
+	 * going to move -before- we actually start moving them, so that in
+	 * case we get an ENOMEM we can bail out before making any changes.
+	 */
+	struct list_head newcg_list;
+	struct cg_list_entry *cg_entry, *temp_nobe;
+
+	/* check that we can legitimately attach to the cgroup. */
+	for_each_subsys(root, ss) {
+		if (ss->can_attach) {
+			retval = ss->can_attach(ss, cgrp, leader, true);
+			if (retval) {
+				failed_ss = ss;
+				goto out;
+			}
+		}
+	}
+
+	/*
+	 * step 1: make sure css_sets exist for all threads to be migrated.
+	 * we use find_css_set, which allocates a new one if necessary.
+	 */
+	INIT_LIST_HEAD(&newcg_list);
+	oldcgrp = task_cgroup_from_root(leader, root);
+	if (cgrp != oldcgrp) {
+		/* get old css_set */
+		task_lock(leader);
+		if (leader->flags & PF_EXITING) {
+			task_unlock(leader);
+			goto prefetch_loop;
+		}
+		oldcg = leader->cgroups;
+		get_css_set(oldcg);
+		task_unlock(leader);
+		/* acquire new one */
+		retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+		put_css_set(oldcg);
+		if (retval)
+			goto list_teardown;
+	}
+prefetch_loop:
+	rcu_read_lock();
+	/* sanity check - if we raced with de_thread, we must abort */
+	if (!thread_group_leader(leader)) {
+		retval = -EAGAIN;
+		goto list_teardown;
+	}
+	/*
+	 * if we need to fetch a new css_set for this task, we must exit the
+	 * rcu_read section because allocating it can sleep. afterwards, we'll
+	 * need to restart iteration on the threadgroup list - the whole thing
+	 * will be O(nm) in the number of threads and css_sets; as the typical
+	 * case has only one css_set for all of them, usually O(n). which ones
+	 * we need allocated won't change as long as we hold cgroup_mutex.
+	 */
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* nothing to do if this task is already in the cgroup */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* get old css_set pointer */
+		task_lock(tsk);
+		if (tsk->flags & PF_EXITING) {
+			/* ignore this task if it's going away */
+			task_unlock(tsk);
+			continue;
+		}
+		oldcg = tsk->cgroups;
+		get_css_set(oldcg);
+		task_unlock(tsk);
+		/* see if the new one for us is already in the list? */
+		if (css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list)) {
+			/* was already there, nothing to do. */
+			put_css_set(oldcg);
+		} else {
+			/* we don't already have it. get new one. */
+			rcu_read_unlock();
+			retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+			put_css_set(oldcg);
+			if (retval)
+				goto list_teardown;
+			/* begin iteration again. */
+			goto prefetch_loop;
+		}
+	}
+	rcu_read_unlock();
+
+	/*
+	 * step 2: now that we're guaranteed success wrt the css_sets, proceed
+	 * to move all tasks to the new cgroup. we need to lock against possible
+	 * races with fork(). note: we can safely access leader->signal because
+	 * attach_task_by_pid takes a reference on leader, which guarantees that
+	 * the signal_struct will stick around. threadgroup_fork_lock must be
+	 * taken outside of tasklist_lock to match the order in the fork path.
+	 */
+	BUG_ON(!leader->signal);
+	down_write(&leader->signal->threadgroup_fork_lock);
+	read_lock(&tasklist_lock);
+	/* sanity check - if we raced with de_thread, we must abort */
+	if (!thread_group_leader(leader)) {
+		retval = -EAGAIN;
+		read_unlock(&tasklist_lock);
+		up_write(&leader->signal->threadgroup_fork_lock);
+		goto list_teardown;
+	}
+	/*
+	 * No failure cases left, so this is the commit point.
+	 *
+	 * If the leader is already there, skip moving him. Note: even if the
+	 * leader is PF_EXITING, we still move all other threads; if everybody
+	 * is PF_EXITING, we end up doing nothing, which is ok.
+	 */
+	oldcgrp = task_cgroup_from_root(leader, root);
+	if (cgrp != oldcgrp) {
+		retval = cgroup_task_migrate(cgrp, oldcgrp, leader, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+	/* Now iterate over each thread in the group. */
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		BUG_ON(tsk->signal != leader->signal);
+		/* leave current thread as it is if it's already there */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* we don't care whether these threads are exiting */
+		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+
+	/*
+	 * step 3: attach whole threadgroup to each subsystem
+	 * TODO: if ever a subsystem needs to know the oldcgrp for each task
+	 * being moved, this call will need to be reworked to communicate that.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->attach)
+			ss->attach(ss, cgrp, oldcgrp, leader, true);
+	}
+	/* holding these until here keeps us safe from exec() and fork(). */
+	read_unlock(&tasklist_lock);
+	up_write(&leader->signal->threadgroup_fork_lock);
+
+	/*
+	 * step 4: success! and cleanup
+	 */
+	synchronize_rcu();
+	cgroup_wakeup_rmdir_waiter(cgrp);
+	retval = 0;
+list_teardown:
+	/* clean up the list of prefetched css_sets. */
+	list_for_each_entry_safe(cg_entry, temp_nobe, &newcg_list, links) {
+		list_del(&cg_entry->links);
+		put_css_set(cg_entry->cg);
+		kfree(cg_entry);
+	}
+out:
+	if (retval) {
+		/* same deal as in cgroup_attach_task, with threadgroup=true */
+		for_each_subsys(root, ss) {
+			if (ss == failed_ss)
+				break;
+			if (ss->cancel_attach)
+				ss->cancel_attach(ss, cgrp, leader, true);
+		}
+	}
+	return retval;
+}
+
+/*
+ * Find the task_struct of the task to attach by vpid and pass it along to the
+ * function to attach either it or all tasks in its threadgroup. Will take
+ * cgroup_mutex; may take task_lock of task.
+ */
+static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 {
 	struct task_struct *tsk;
 	const struct cred *cred = current_cred(), *tcred;
 	int ret;
 
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+
 	if (pid) {
 		rcu_read_lock();
 		tsk = find_task_by_vpid(pid);
-		if (!tsk || tsk->flags & PF_EXITING) {
+		if (!tsk) {
+			rcu_read_unlock();
+			cgroup_unlock();
+			return -ESRCH;
+		}
+		if (threadgroup) {
+			/*
+			 * it is safe to find group_leader because tsk was found
+			 * in the tid map, meaning it can't have been unhashed
+			 * by someone in de_thread changing the leadership.
+			 */
+			tsk = tsk->group_leader;
+			BUG_ON(!thread_group_leader(tsk));
+		} else if (tsk->flags & PF_EXITING) {
+			/* optimization for the single-task-only case */
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -ESRCH;
 		}
 
+		/*
+		 * even if we're attaching all tasks in the thread group, we
+		 * only need to check permissions on one of them.
+		 */
 		tcred = __task_cred(tsk);
 		if (cred->euid &&
 		    cred->euid != tcred->uid &&
 		    cred->euid != tcred->suid) {
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -EACCES;
 		}
 		get_task_struct(tsk);
 		rcu_read_unlock();
 	} else {
-		tsk = current;
+		if (threadgroup)
+			tsk = current->group_leader;
+		else
+			tsk = current;
 		get_task_struct(tsk);
 	}
 
-	ret = cgroup_attach_task(cgrp, tsk);
+	if (threadgroup)
+		ret = cgroup_attach_proc(cgrp, tsk);
+	else
+		ret = cgroup_attach_task(cgrp, tsk);
 	put_task_struct(tsk);
+	cgroup_unlock();
 	return ret;
 }
 
 static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
 {
+	return attach_task_by_pid(cgrp, pid, false);
+}
+
+static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
+{
 	int ret;
-	if (!cgroup_lock_live_group(cgrp))
-		return -ENODEV;
-	ret = attach_task_by_pid(cgrp, pid);
-	cgroup_unlock();
+	do {
+		/*
+		 * attach_proc fails with -EAGAIN if threadgroup leadership
+		 * changes in the middle of the operation, in which case we need
+		 * to find the task_struct for the new leader and start over.
+		 */
+		ret = attach_task_by_pid(cgrp, tgid, true);
+	} while (ret == -EAGAIN);
 	return ret;
 }
 
@@ -3168,9 +3498,9 @@ static struct cftype files[] = {
 	{
 		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
 		.open = cgroup_procs_open,
-		/* .write_u64 = cgroup_procs_write, TODO */
+		.write_u64 = cgroup_procs_write,
 		.release = cgroup_pidlist_release,
-		.mode = S_IRUGO,
+		.mode = S_IRUGO | S_IWUSR,
 	},
 	{
 		.name = "notify_on_release",
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index ce71ed5..daf0249 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -190,6 +190,10 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 		struct task_struct *c;
 
 		rcu_read_lock();
+		if (!thread_group_leader(task)) {
+			rcu_read_unlock();
+			return -EAGAIN;
+		}
 		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
 			if (is_task_frozen_enough(c)) {
 				rcu_read_unlock();
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index b23c097..3d7c978 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1404,6 +1404,10 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
 		struct task_struct *c;
 
 		rcu_read_lock();
+		if (!thread_group_leader(tsk)) {
+			rcu_read_unlock();
+			return -EAGAIN;
+		}
 		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
 			ret = security_task_setscheduler(c, 0, NULL);
 			if (ret) {
diff --git a/kernel/ns_cgroup.c b/kernel/ns_cgroup.c
index 2a5dfec..ecd15d2 100644
--- a/kernel/ns_cgroup.c
+++ b/kernel/ns_cgroup.c
@@ -59,6 +59,10 @@ static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
 	if (threadgroup) {
 		struct task_struct *c;
 		rcu_read_lock();
+		if (!thread_group_leader(task)) {
+			rcu_read_unlock();
+			return -EAGAIN;
+		}
 		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
 			if (!cgroup_is_descendant(new_cgroup, c)) {
 				rcu_read_unlock();
diff --git a/kernel/sched.c b/kernel/sched.c
index 70fa78d..df53f53 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8721,6 +8721,10 @@ cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	if (threadgroup) {
 		struct task_struct *c;
 		rcu_read_lock();
+		if (!thread_group_leader(tsk)) {
+			rcu_read_unlock();
+			return -EAGAIN;
+		}
 		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
 			retval = cpu_cgroup_can_attach_task(cgrp, c);
 			if (retval) {

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* Re: [PATCH v4 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
  2010-07-30 23:56 [PATCH v4 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Ben Blum
  2010-07-30 23:59 ` [PATCH v4 2/2] cgroups: make procs file writable Ben Blum
       [not found] ` <20100730235649.GA22644-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-08-03 19:58 ` Andrew Morton
       [not found]   ` <20100803125827.0822e6ab.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  2010-08-03 23:45   ` KAMEZAWA Hiroyuki
  2 siblings, 2 replies; 185+ messages in thread
From: Andrew Morton @ 2010-08-03 19:58 UTC (permalink / raw)
  To: Ben Blum; +Cc: linux-kernel, containers, ebiederm, lizf, matthltc, menage, oleg

On Fri, 30 Jul 2010 19:56:49 -0400
Ben Blum <bblum@andrew.cmu.edu> wrote:

> This patch series implements a write function for the 'cgroup.procs'
> per-cgroup file, which enables atomic movement of multithreaded
> applications between cgroups. Writing the thread-ID of any thread in a
> threadgroup to a cgroup's procs file causes all threads in the group to
> be moved to that cgroup safely with respect to threads forking/exiting.
> (Possible usage scenario: If running a multithreaded build system that
> sucks up system resources, this lets you restrict it all at once into a
> new cgroup to keep it under control.)

I can see how that would be useful.  No comments from anyone else?

patch 1/2 makes me cry with all those ifdefs.  Maybe helper functions
would help, but not a lot.

patch 2/2 looks very complicated.

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v4 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
  2010-08-03 19:58 ` [PATCH v4 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Andrew Morton
       [not found]   ` <20100803125827.0822e6ab.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2010-08-03 23:45   ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 185+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-03 23:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ben Blum, linux-kernel, containers, ebiederm, lizf, matthltc,
	menage, oleg

On Tue, 3 Aug 2010 12:58:27 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 30 Jul 2010 19:56:49 -0400
> Ben Blum <bblum@andrew.cmu.edu> wrote:
> 
> > This patch series implements a write function for the 'cgroup.procs'
> > per-cgroup file, which enables atomic movement of multithreaded
> > applications between cgroups. Writing the thread-ID of any thread in a
> > threadgroup to a cgroup's procs file causes all threads in the group to
> > be moved to that cgroup safely with respect to threads forking/exiting.
> > (Possible usage scenario: If running a multithreaded build system that
> > sucks up system resources, this lets you restrict it all at once into a
> > new cgroup to keep it under control.)
> 
> I can see how that would be useful.  No comments from anyone else?
> 

I think the feature itself is good and useful. I welcome this.

> patch 1/2 makes me cry with all those ifdefs.  Maybe helper functions
> would help, but not a lot.
> 
Add static inline functions ?
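
For instance, wrappers along these lines would keep the #ifdefs out of fork.c
entirely (just a sketch -- the helper names here are made up, not taken from
the patch):

#ifdef CONFIG_CGROUPS
/* Take/release the threadgroup_fork_lock in read mode around a fork. */
static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
{
	down_read(&tsk->signal->threadgroup_fork_lock);
}
static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
{
	up_read(&tsk->signal->threadgroup_fork_lock);
}
#else
/* Without cgroups there is no lock, so these compile away to nothing. */
static inline void threadgroup_fork_read_lock(struct task_struct *tsk) {}
static inline void threadgroup_fork_read_unlock(struct task_struct *tsk) {}
#endif

copy_process() could then call threadgroup_fork_read_lock(current) and
threadgroup_fork_read_unlock(current) unconditionally, with no #ifdef at the
call sites.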


> patch 2/2 looks very complicated.

Yes, that's a concern.
I'd like to look deeper today.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v4 2/2] cgroups: make procs file writable
  2010-07-30 23:59 ` [PATCH v4 2/2] cgroups: make procs file writable Ben Blum
@ 2010-08-04  1:08   ` KAMEZAWA Hiroyuki
  2010-08-04  4:28     ` Ben Blum
                       ` (2 more replies)
       [not found]   ` <20100730235902.GC22644-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  1 sibling, 3 replies; 185+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-04  1:08 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage, oleg

On Fri, 30 Jul 2010 19:59:02 -0400
Ben Blum <bblum@andrew.cmu.edu> wrote:

> Makes procs file writable to move all threads by tgid at once
> 
> From: Ben Blum <bblum@andrew.cmu.edu>
> 
> This patch adds functionality that enables users to move all threads in a
> threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
> file. This current implementation makes use of a per-threadgroup rwsem that's
> taken for reading in the fork() path to prevent newly forking threads within
> the threadgroup from "escaping" while the move is in progress.
> 
> Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
> ---
>  Documentation/cgroups/cgroups.txt |   13 +
>  kernel/cgroup.c                   |  426 +++++++++++++++++++++++++++++++++----
>  kernel/cgroup_freezer.c           |    4 
>  kernel/cpuset.c                   |    4 
>  kernel/ns_cgroup.c                |    4 
>  kernel/sched.c                    |    4 
>  6 files changed, 405 insertions(+), 50 deletions(-)
> 
> diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
> index b34823f..5f3c707 100644
> --- a/Documentation/cgroups/cgroups.txt
> +++ b/Documentation/cgroups/cgroups.txt
> @@ -235,7 +235,8 @@ containing the following files describing that cgroup:
>   - cgroup.procs: list of tgids in the cgroup.  This list is not
>     guaranteed to be sorted or free of duplicate tgids, and userspace
>     should sort/uniquify the list if this property is required.
> -   This is a read-only file, for now.
> +   Writing a thread group id into this file moves all threads in that
> +   group into this cgroup.
>   - notify_on_release flag: run the release agent on exit?
>   - release_agent: the path to use for release notifications (this file
>     exists in the top cgroup only)
> @@ -416,6 +417,12 @@ You can attach the current shell task by echoing 0:
>  
>  # echo 0 > tasks
>  
> +You can use the cgroup.procs file instead of the tasks file to move all
> +threads in a threadgroup at once. Echoing the pid of any task in a
> +threadgroup to cgroup.procs causes all tasks in that threadgroup to be
> +be attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
> +in the writing task's threadgroup.
> +
>  2.3 Mounting hierarchies by name
>  --------------------------------
>  
> @@ -564,7 +571,9 @@ called on a fork. If this method returns 0 (success) then this should
>  remain valid while the caller holds cgroup_mutex and it is ensured that either
>  attach() or cancel_attach() will be called in future. If threadgroup is
>  true, then a successful result indicates that all threads in the given
> -thread's threadgroup can be moved together.
> +thread's threadgroup can be moved together. If the subsystem wants to
> +iterate over task->thread_group, it must take rcu_read_lock then check
> +if thread_group_leader(task), returning -EAGAIN if that fails.
>  
>  void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
>  	       struct task_struct *task, bool threadgroup)
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index f91d7dd..fab8c87 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1688,6 +1688,76 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
>  }
>  EXPORT_SYMBOL_GPL(cgroup_path);
>  
> +/*
> + * cgroup_task_migrate - move a task from one cgroup to another.
> + *
> + * 'guarantee' is set if the caller promises that a new css_set for the task
> + * will already exit. If not set, this function might sleep, and can fail with
           
           already exist ?

> + * -ENOMEM. Otherwise, it can only fail with -ESRCH.
> + */
> +static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
> +			       struct task_struct *tsk, bool guarantee)
> +{
> +	struct css_set *oldcg;
> +	struct css_set *newcg;
> +
> +	/*
> +	 * get old css_set. we need to take task_lock and refcount it, because
> +	 * an exiting task can change its css_set to init_css_set and drop its
> +	 * old one without taking cgroup_mutex.
> +	 */
> +	task_lock(tsk);
> +	oldcg = tsk->cgroups;
> +	get_css_set(oldcg);
> +	task_unlock(tsk);
> +
> +	/* locate or allocate a new css_set for this task. */
> +	if (guarantee) {
> +		/* we know the css_set we want already exists. */
> +		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
> +		read_lock(&css_set_lock);
> +		newcg = find_existing_css_set(oldcg, cgrp, template);
> +		BUG_ON(!newcg);
> +		get_css_set(newcg);
> +		read_unlock(&css_set_lock);
> +	} else {
> +		might_sleep();
> +		/* find_css_set will give us newcg already referenced. */
> +		newcg = find_css_set(oldcg, cgrp);
> +		if (!newcg) {
> +			put_css_set(oldcg);
> +			return -ENOMEM;
> +		}
> +	}
> +	put_css_set(oldcg);
> +
> +	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
> +	task_lock(tsk);
> +	if (tsk->flags & PF_EXITING) {
> +		task_unlock(tsk);
> +		put_css_set(newcg);
> +		return -ESRCH;
> +	}
> +	rcu_assign_pointer(tsk->cgroups, newcg);
> +	task_unlock(tsk);
> +
> +	/* Update the css_set linked lists if we're using them */
> +	write_lock(&css_set_lock);
> +	if (!list_empty(&tsk->cg_list))
> +		list_move(&tsk->cg_list, &newcg->tasks);
> +	write_unlock(&css_set_lock);
> +
> +	/*
> +	 * We just gained a reference on oldcg by taking it from the task. As
> +	 * trading it for newcg is protected by cgroup_mutex, we're safe to drop
> +	 * it here; it will be freed under RCU.
> +	 */
> +	put_css_set(oldcg);
> +
> +	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
> +	return 0;
> +}
> +
>  /**
>   * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
>   * @cgrp: the cgroup the task is attaching to
> @@ -1698,11 +1768,9 @@ EXPORT_SYMBOL_GPL(cgroup_path);
>   */
>  int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  {
> -	int retval = 0;
> +	int retval;
>  	struct cgroup_subsys *ss, *failed_ss = NULL;
>  	struct cgroup *oldcgrp;
> -	struct css_set *cg;
> -	struct css_set *newcg;
>  	struct cgroupfs_root *root = cgrp->root;
>  
>  	/* Nothing to do if the task is already in that cgroup */
> @@ -1726,46 +1794,16 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  		}
>  	}
>  
> -	task_lock(tsk);
> -	cg = tsk->cgroups;
> -	get_css_set(cg);
> -	task_unlock(tsk);
> -	/*
> -	 * Locate or allocate a new css_set for this task,
> -	 * based on its final set of cgroups
> -	 */
> -	newcg = find_css_set(cg, cgrp);
> -	put_css_set(cg);
> -	if (!newcg) {
> -		retval = -ENOMEM;
> -		goto out;
> -	}
> -
> -	task_lock(tsk);
> -	if (tsk->flags & PF_EXITING) {
> -		task_unlock(tsk);
> -		put_css_set(newcg);
> -		retval = -ESRCH;
> +	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, false);
> +	if (retval)
>  		goto out;
> -	}
> -	rcu_assign_pointer(tsk->cgroups, newcg);
> -	task_unlock(tsk);
> -
> -	/* Update the css_set linked lists if we're using them */
> -	write_lock(&css_set_lock);
> -	if (!list_empty(&tsk->cg_list)) {
> -		list_del(&tsk->cg_list);
> -		list_add(&tsk->cg_list, &newcg->tasks);
> -	}
> -	write_unlock(&css_set_lock);
>  
>  	for_each_subsys(root, ss) {
>  		if (ss->attach)
>  			ss->attach(ss, cgrp, oldcgrp, tsk, false);
>  	}
> -	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
> +

Hmm. By this, we call ss->attach(ss, cgrp, oldcgrp, tsk, false) after
marking CGRP_RELEASABLE + synchronize_rcu() on the old cgroup... is that safe?
And why move it before attach()?



>  	synchronize_rcu();
> -	put_css_set(cg);
>  
>  	/*
>  	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
> @@ -1791,49 +1829,341 @@ out:
>  }
>  
>  /*
> - * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
> - * held. May take task_lock of task
> + * cgroup_attach_proc works in two stages, the first of which prefetches all
> + * new css_sets needed (to make sure we have enough memory before committing
> + * to the move) and stores them in a list of entries of the following type.
> + * TODO: possible optimization: use css_set->rcu_head for chaining instead
> + */
> +struct cg_list_entry {
> +	struct css_set *cg;
> +	struct list_head links;
> +};
> +
> +static bool css_set_check_fetched(struct cgroup *cgrp,
> +				  struct task_struct *tsk, struct css_set *cg,
> +				  struct list_head *newcg_list)
> +{
> +	struct css_set *newcg;
> +	struct cg_list_entry *cg_entry;
> +	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
> +
> +	read_lock(&css_set_lock);
> +	newcg = find_existing_css_set(cg, cgrp, template);
> +	if (newcg)
> +		get_css_set(newcg);
> +	read_unlock(&css_set_lock);
> +
> +	/* doesn't exist at all? */
> +	if (!newcg)
> +		return false;
> +	/* see if it's already in the list */
> +	list_for_each_entry(cg_entry, newcg_list, links) {
> +		if (cg_entry->cg == newcg) {
> +			put_css_set(newcg);
> +			return true;
> +		}
> +	}
> +
> +	/* not found */
> +	put_css_set(newcg);
> +	return false;
> +}
> +
> +/*
> + * Find the new css_set and store it in the list in preparation for moving the
> + * given task to the given cgroup. Returns 0 or -ENOMEM.
>   */
> -static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
> +static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
> +			    struct list_head *newcg_list)
> +{
> +	struct css_set *newcg;
> +	struct cg_list_entry *cg_entry;
> +
> +	/* ensure a new css_set will exist for this thread */
> +	newcg = find_css_set(cg, cgrp);
> +	if (!newcg)
> +		return -ENOMEM;
> +	/* add it to the list */
> +	cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
> +	if (!cg_entry) {
> +		put_css_set(newcg);
> +		return -ENOMEM;
> +	}
> +	cg_entry->cg = newcg;
> +	list_add(&cg_entry->links, newcg_list);
> +	return 0;
> +}
> +
> +/**
> + * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
> + * @cgrp: the cgroup to attach to
> + * @leader: the threadgroup leader task_struct of the group to be attached
> + *
> + * Call holding cgroup_mutex. Will take task_lock of each thread in leader's
> + * threadgroup individually in turn.
> + */
> +int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
> +{
> +	int retval;
> +	struct cgroup_subsys *ss, *failed_ss = NULL;
> +	struct cgroup *oldcgrp;
> +	struct css_set *oldcg;
> +	struct cgroupfs_root *root = cgrp->root;
> +	/* threadgroup list cursor */
> +	struct task_struct *tsk;
> +	/*
> +	 * we need to make sure we have css_sets for all the tasks we're
> +	 * going to move -before- we actually start moving them, so that in
> +	 * case we get an ENOMEM we can bail out before making any changes.
> +	 */
> +	struct list_head newcg_list;
> +	struct cg_list_entry *cg_entry, *temp_nobe;
> +
> +	/* check that we can legitimately attach to the cgroup. */
> +	for_each_subsys(root, ss) {
> +		if (ss->can_attach) {
> +			retval = ss->can_attach(ss, cgrp, leader, true);
> +			if (retval) {
> +				failed_ss = ss;
> +				goto out;
> +			}
> +		}
> +	}

Then, we cannot do attach limitation control per thread? (This only checks the leader.)
Is it ok for all subsystems?


> +
> +	/*
> +	 * step 1: make sure css_sets exist for all threads to be migrated.
> +	 * we use find_css_set, which allocates a new one if necessary.
> +	 */
> +	INIT_LIST_HEAD(&newcg_list);
> +	oldcgrp = task_cgroup_from_root(leader, root);
> +	if (cgrp != oldcgrp) {
> +		/* get old css_set */
> +		task_lock(leader);
> +		if (leader->flags & PF_EXITING) {
> +			task_unlock(leader);
> +			goto prefetch_loop;
> +		}
Why do we continue here? Why not return -ESRCH?


> +		oldcg = leader->cgroups;
> +		get_css_set(oldcg);
> +		task_unlock(leader);
> +		/* acquire new one */
> +		retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
> +		put_css_set(oldcg);
> +		if (retval)
> +			goto list_teardown;
> +	}
> +prefetch_loop:
> +	rcu_read_lock();
> +	/* sanity check - if we raced with de_thread, we must abort */
> +	if (!thread_group_leader(leader)) {
> +		retval = -EAGAIN;
> +		goto list_teardown;
> +	}

EAGAIN ? ESRCH ? or EBUSY ?

> +	/*
> +	 * if we need to fetch a new css_set for this task, we must exit the
> +	 * rcu_read section because allocating it can sleep. afterwards, we'll
> +	 * need to restart iteration on the threadgroup list - the whole thing
> +	 * will be O(nm) in the number of threads and css_sets; as the typical
> +	 * case has only one css_set for all of them, usually O(n). which ones
> +	 * we need allocated won't change as long as we hold cgroup_mutex.
> +	 */
> +	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
> +		/* nothing to do if this task is already in the cgroup */
> +		oldcgrp = task_cgroup_from_root(tsk, root);
> +		if (cgrp == oldcgrp)
> +			continue;
> +		/* get old css_set pointer */
> +		task_lock(tsk);
> +		if (tsk->flags & PF_EXITING) {
> +			/* ignore this task if it's going away */
> +			task_unlock(tsk);

It's going away but may still exist for a while... so is "continue" safe
for keeping consistency?


> +			continue;
> +		}
> +		oldcg = tsk->cgroups;
> +		get_css_set(oldcg);
> +		task_unlock(tsk);
> +		/* see if the new one for us is already in the list? */
> +		if (css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list)) {
> +			/* was already there, nothing to do. */
> +			put_css_set(oldcg);
> +		} else {
> +			/* we don't already have it. get new one. */
> +			rcu_read_unlock();
> +			retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
> +			put_css_set(oldcg);
> +			if (retval)
> +				goto list_teardown;
> +			/* begin iteration again. */
> +			goto prefetch_loop;

Hmm ? Why do we need to restart from the 1st entry ?
(maybe because of rcu_read_unlock() ?)
Does this function work well if the process has 10000+ threads ?

How about this logic ?
==

	/* At first, find out necessary things */
	rcu_read_lock();
	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
		oldcgrp = task_cgroup_from_root(tsk, root);
		if (oldcgrp == cgrp)
			continue;
		task_lock(tsk);
		if (tsk->flags & PF_EXITING) {
			task_unlock(tsk);
			continue;
		}
		oldcg = tsk->cgroups;
		get_css_set(oldcg);
		task_unlock(tsk);
		read_lock(&css_set_lock);
		newcg = find_existing_css_set(oldcg, cgrp, template);
		if (newcg)
			get_css_set(newcg);
		read_unlock(&css_set_lock);
		if (newcg) {
			remember_this_newcg(newcg, &found_cg_array);
			put_css_set(oldcg);
		} else
			remember_need_to_allocate(oldcg, &need_to_allocate_array);
	}
	rcu_read_unlock();
	/* Sort all cg_list found and drop doubly counted ones, drop refcnt if necessary */
	sort_and_unique(found_cg_array);
	/* Sort all cg_list not found and drop doubly counted ones, drop refcnt if necessary */
	sort_and_unique(need_to_allocate_array);
	/* Allocate new ones */
	newly_allocated_array = allocate_new_cg_lists(need_to_allocate_array);
	drop_refcnt_of_old_cgs(need_to_allocate_array);

	/* Now we have all necessary cg_list */

> +		}
> +	}
> +	rcu_read_unlock();
> +
> +	/*
> +	 * step 2: now that we're guaranteed success wrt the css_sets, proceed
> +	 * to move all tasks to the new cgroup. we need to lock against possible
> +	 * races with fork(). note: we can safely access leader->signal because
> +	 * attach_task_by_pid takes a reference on leader, which guarantees that
> +	 * the signal_struct will stick around. threadgroup_fork_lock must be
> +	 * taken outside of tasklist_lock to match the order in the fork path.
> +	 */
> +	BUG_ON(!leader->signal);
> +	down_write(&leader->signal->threadgroup_fork_lock);
> +	read_lock(&tasklist_lock);
> +	/* sanity check - if we raced with de_thread, we must abort */
> +	if (!thread_group_leader(leader)) {
> +		retval = -EAGAIN;
> +		read_unlock(&tasklist_lock);
> +		up_write(&leader->signal->threadgroup_fork_lock);
> +		goto list_teardown;
> +	}
> +	/*
> +	 * No failure cases left, so this is the commit point.
> +	 *
> +	 * If the leader is already there, skip moving him. Note: even if the
> +	 * leader is PF_EXITING, we still move all other threads; if everybody
> +	 * is PF_EXITING, we end up doing nothing, which is ok.
> +	 */
> +	oldcgrp = task_cgroup_from_root(leader, root);
> +	if (cgrp != oldcgrp) {
> +		retval = cgroup_task_migrate(cgrp, oldcgrp, leader, true);
> +		BUG_ON(retval != 0 && retval != -ESRCH);
> +	}
> +	/* Now iterate over each thread in the group. */
> +	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
> +		BUG_ON(tsk->signal != leader->signal);
> +		/* leave current thread as it is if it's already there */
> +		oldcgrp = task_cgroup_from_root(tsk, root);
> +		if (cgrp == oldcgrp)
> +			continue;
> +		/* we don't care whether these threads are exiting */
> +		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
> +		BUG_ON(retval != 0 && retval != -ESRCH);
> +	}
> +
> +	/*
> +	 * step 3: attach whole threadgroup to each subsystem
> +	 * TODO: if ever a subsystem needs to know the oldcgrp for each task
> +	 * being moved, this call will need to be reworked to communicate that.
> +	 */
> +	for_each_subsys(root, ss) {
> +		if (ss->attach)
> +			ss->attach(ss, cgrp, oldcgrp, leader, true);
> +	}
> +	/* holding these until here keeps us safe from exec() and fork(). */
> +	read_unlock(&tasklist_lock);
> +	up_write(&leader->signal->threadgroup_fork_lock);
> +
> +	/*
> +	 * step 4: success! and cleanup
> +	 */
> +	synchronize_rcu();
> +	cgroup_wakeup_rmdir_waiter(cgrp);
> +	retval = 0;
> +list_teardown:
> +	/* clean up the list of prefetched css_sets. */
> +	list_for_each_entry_safe(cg_entry, temp_nobe, &newcg_list, links) {
> +		list_del(&cg_entry->links);
> +		put_css_set(cg_entry->cg);
> +		kfree(cg_entry);
> +	}
> +out:
> +	if (retval) {
> +		/* same deal as in cgroup_attach_task, with threadgroup=true */
> +		for_each_subsys(root, ss) {
> +			if (ss == failed_ss)
> +				break;
> +			if (ss->cancel_attach)
> +				ss->cancel_attach(ss, cgrp, leader, true);
> +		}
> +	}
> +	return retval;
> +}
> +
> +/*
> + * Find the task_struct of the task to attach by vpid and pass it along to the
> + * function to attach either it or all tasks in its threadgroup. Will take
> + * cgroup_mutex; may take task_lock of task.
> + */
> +static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
>  {
>  	struct task_struct *tsk;
>  	const struct cred *cred = current_cred(), *tcred;
>  	int ret;
>  
> +	if (!cgroup_lock_live_group(cgrp))
> +		return -ENODEV;
> +
>  	if (pid) {
>  		rcu_read_lock();
>  		tsk = find_task_by_vpid(pid);
> -		if (!tsk || tsk->flags & PF_EXITING) {
> +		if (!tsk) {
> +			rcu_read_unlock();
> +			cgroup_unlock();
> +			return -ESRCH;
> +		}
> +		if (threadgroup) {
> +			/*
> +			 * it is safe to find group_leader because tsk was found
> +			 * in the tid map, meaning it can't have been unhashed
> +			 * by someone in de_thread changing the leadership.
> +			 */
> +			tsk = tsk->group_leader;
> +			BUG_ON(!thread_group_leader(tsk));
> +		} else if (tsk->flags & PF_EXITING) {
> +			/* optimization for the single-task-only case */
>  			rcu_read_unlock();
> +			cgroup_unlock();
>  			return -ESRCH;
>  		}
>  
> +		/*
> +		 * even if we're attaching all tasks in the thread group, we
> +		 * only need to check permissions on one of them.
> +		 */
>  		tcred = __task_cred(tsk);
>  		if (cred->euid &&
>  		    cred->euid != tcred->uid &&
>  		    cred->euid != tcred->suid) {
>  			rcu_read_unlock();
> +			cgroup_unlock();
>  			return -EACCES;
>  		}
>  		get_task_struct(tsk);
>  		rcu_read_unlock();
>  	} else {
> -		tsk = current;
> +		if (threadgroup)
> +			tsk = current->group_leader;
> +		else

I'm not sure but "group_leader" is safe to access here ?

> +			tsk = current;
>  		get_task_struct(tsk);
>  	}
>  
> -	ret = cgroup_attach_task(cgrp, tsk);
> +	if (threadgroup)
> +		ret = cgroup_attach_proc(cgrp, tsk);
> +	else
> +		ret = cgroup_attach_task(cgrp, tsk);
>  	put_task_struct(tsk);
> +	cgroup_unlock();
>  	return ret;
>  }
>  
>  static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
>  {
> +	return attach_task_by_pid(cgrp, pid, false);
> +}
> +
> +static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
> +{
>  	int ret;
> -	if (!cgroup_lock_live_group(cgrp))
> -		return -ENODEV;
> -	ret = attach_task_by_pid(cgrp, pid);
> -	cgroup_unlock();
> +	do {
> +		/*
> +		 * attach_proc fails with -EAGAIN if threadgroup leadership
> +		 * changes in the middle of the operation, in which case we need
> +		 * to find the task_struct for the new leader and start over.
> +		 */
> +		ret = attach_task_by_pid(cgrp, tgid, true);
> +	} while (ret == -EAGAIN);
>  	return ret;
>  }
>  
> @@ -3168,9 +3498,9 @@ static struct cftype files[] = {
>  	{
>  		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
>  		.open = cgroup_procs_open,
> -		/* .write_u64 = cgroup_procs_write, TODO */
> +		.write_u64 = cgroup_procs_write,
>  		.release = cgroup_pidlist_release,
> -		.mode = S_IRUGO,
> +		.mode = S_IRUGO | S_IWUSR,
>  	},
>  	{
>  		.name = "notify_on_release",
> diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
> index ce71ed5..daf0249 100644
> --- a/kernel/cgroup_freezer.c
> +++ b/kernel/cgroup_freezer.c
> @@ -190,6 +190,10 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
>  		struct task_struct *c;
>  
>  		rcu_read_lock();
> +		if (!thread_group_leader(task)) {
> +			rcu_read_unlock();
> +			return -EAGAIN;
> +		}
>  		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
>  			if (is_task_frozen_enough(c)) {
>  				rcu_read_unlock();
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index b23c097..3d7c978 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -1404,6 +1404,10 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
>  		struct task_struct *c;
>  
>  		rcu_read_lock();
> +		if (!thread_group_leader(tsk)) {
> +			rcu_read_unlock();
> +			return -EAGAIN;
> +		}
>  		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
>  			ret = security_task_setscheduler(c, 0, NULL);
>  			if (ret) {
> diff --git a/kernel/ns_cgroup.c b/kernel/ns_cgroup.c
> index 2a5dfec..ecd15d2 100644
> --- a/kernel/ns_cgroup.c
> +++ b/kernel/ns_cgroup.c
> @@ -59,6 +59,10 @@ static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
>  	if (threadgroup) {
>  		struct task_struct *c;
>  		rcu_read_lock();
> +		if (!thread_group_leader(task)) {
> +			rcu_read_unlock();
> +			return -EAGAIN;
> +		}
>  		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
>  			if (!cgroup_is_descendant(new_cgroup, c)) {
>  				rcu_read_unlock();
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 70fa78d..df53f53 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -8721,6 +8721,10 @@ cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
>  	if (threadgroup) {
>  		struct task_struct *c;
>  		rcu_read_lock();
> +		if (!thread_group_leader(tsk)) {
> +			rcu_read_unlock();
> +			return -EAGAIN;
> +		}
>  		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
>  			retval = cpu_cgroup_can_attach_task(cgrp, c);
>  			if (retval) {


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v4 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
@ 2010-08-04  2:00       ` Li Zefan
  0 siblings, 0 replies; 185+ messages in thread
From: Li Zefan @ 2010-08-04  2:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ben Blum, linux-kernel, containers, ebiederm, matthltc, menage, oleg

Andrew Morton wrote:
> On Fri, 30 Jul 2010 19:56:49 -0400
> Ben Blum <bblum@andrew.cmu.edu> wrote:
> 
>> This patch series implements a write function for the 'cgroup.procs'
>> per-cgroup file, which enables atomic movement of multithreaded
>> applications between cgroups. Writing the thread-ID of any thread in a
>> threadgroup to a cgroup's procs file causes all threads in the group to
>> be moved to that cgroup safely with respect to threads forking/exiting.
>> (Possible usage scenario: If running a multithreaded build system that
>> sucks up system resources, this lets you restrict it all at once into a
>> new cgroup to keep it under control.)
> 
> I can see how that would be useful.  No comments from anyone else?
> 

Oleg had been commenting on this patchset, so it would be nice to know
if he's comfortable with the changes in this version.


^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v4 1/2] cgroups: read-write lock CLONE_THREAD forking per  threadgroup
  2010-07-30 23:57     ` Ben Blum
  (?)
@ 2010-08-04  3:44     ` Paul Menage
       [not found]       ` <AANLkTikpNG2Y3S3AyxAbCkMynKu1u5yKPrw=bh+uy=9R-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  -1 siblings, 1 reply; 185+ messages in thread
From: Paul Menage @ 2010-08-04  3:44 UTC (permalink / raw)
  To: Ben Blum, Andrew Morton
  Cc: linux-kernel, containers, ebiederm, lizf, matthltc, oleg

 On Fri, Jul 30, 2010 at 4:57 PM, Ben Blum <bblum@andrew.cmu.edu> wrote:
> +	 * The threadgroup_fork_lock prevents threads from forking with
> +	 * CLONE_THREAD while held for writing. Use this for fork-sensitive
> +	 * threadgroup-wide operations. It's taken for reading in fork.c in
> +	 * copy_process().
> +	 * Currently only needed write-side by cgroups.
> +	 */
> +	struct rw_semaphore threadgroup_fork_lock;
> +#endif

I'm not sure how best to word this comment, but I'd prefer something like:

"The threadgroup_fork_lock is taken in read mode during a CLONE_THREAD
fork operation; taking it in write mode prevents the owning
threadgroup from adding any new threads and thus allows you to
synchronize against the addition of unseen threads when performing
threadgroup-wide operations. New-process forks (without CLONE_THREAD)
are not affected."

As far as the #ifdef mess goes, it's true that some people don't have
CONFIG_CGROUPS defined. I'd imagine that these are likely to be
embedded systems with a fairly small number of processes and threads
per process. Are there really any such platforms where the cost of a
single extra rwsem per process is going to make a difference either in
terms of memory or lock contention? I think you should consider making
these additions unconditional.
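
Concretely -- a sketch only, not the actual patch -- that would mean keeping
the field but dropping the guards in signal_struct:

	struct signal_struct {
		...
		/* always present, not only under CONFIG_CGROUPS */
		struct rw_semaphore threadgroup_fork_lock;
		...
	};

and likewise dropping the matching #ifdef CONFIG_CGROUPS/#endif pairs around
its initialization (presumably in fork.c and init_task.h).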

Paul

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v4 2/2] cgroups: make procs file writable
       [not found]     ` <20100804100811.199d73ba.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
@ 2010-08-04  4:28       ` Ben Blum
  2010-08-04  4:30       ` Paul Menage
  1 sibling, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-08-04  4:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA

On Wed, Aug 04, 2010 at 10:08:11AM +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 30 Jul 2010 19:59:02 -0400
> Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> 
> > Makes procs file writable to move all threads by tgid at once
> > 
> > From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
> > 
> > This patch adds functionality that enables users to move all threads in a
> > threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
> > file. This current implementation makes use of a per-threadgroup rwsem that's
> > taken for reading in the fork() path to prevent newly forking threads within
> > the threadgroup from "escaping" while the move is in progress.
> > 
> > Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
> > ---
> >  Documentation/cgroups/cgroups.txt |   13 +
> >  kernel/cgroup.c                   |  426 +++++++++++++++++++++++++++++++++----
> >  kernel/cgroup_freezer.c           |    4 
> >  kernel/cpuset.c                   |    4 
> >  kernel/ns_cgroup.c                |    4 
> >  kernel/sched.c                    |    4 
> >  6 files changed, 405 insertions(+), 50 deletions(-)
> > 
> > diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
> > index b34823f..5f3c707 100644
> > --- a/Documentation/cgroups/cgroups.txt
> > +++ b/Documentation/cgroups/cgroups.txt
> > @@ -235,7 +235,8 @@ containing the following files describing that cgroup:
> >   - cgroup.procs: list of tgids in the cgroup.  This list is not
> >     guaranteed to be sorted or free of duplicate tgids, and userspace
> >     should sort/uniquify the list if this property is required.
> > -   This is a read-only file, for now.
> > +   Writing a thread group id into this file moves all threads in that
> > +   group into this cgroup.
> >   - notify_on_release flag: run the release agent on exit?
> >   - release_agent: the path to use for release notifications (this file
> >     exists in the top cgroup only)
> > @@ -416,6 +417,12 @@ You can attach the current shell task by echoing 0:
> >  
> >  # echo 0 > tasks
> >  
> > +You can use the cgroup.procs file instead of the tasks file to move all
> > +threads in a threadgroup at once. Echoing the pid of any task in a
> > +threadgroup to cgroup.procs causes all tasks in that threadgroup to be
> > +be attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
> > +in the writing task's threadgroup.
> > +
> >  2.3 Mounting hierarchies by name
> >  --------------------------------
> >  
> > @@ -564,7 +571,9 @@ called on a fork. If this method returns 0 (success) then this should
> >  remain valid while the caller holds cgroup_mutex and it is ensured that either
> >  attach() or cancel_attach() will be called in future. If threadgroup is
> >  true, then a successful result indicates that all threads in the given
> > -thread's threadgroup can be moved together.
> > +thread's threadgroup can be moved together. If the subsystem wants to
> > +iterate over task->thread_group, it must take rcu_read_lock then check
> > +if thread_group_leader(task), returning -EAGAIN if that fails.
> >  
> >  void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
> >  	       struct task_struct *task, bool threadgroup)
> > diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> > index f91d7dd..fab8c87 100644
> > --- a/kernel/cgroup.c
> > +++ b/kernel/cgroup.c
> > @@ -1688,6 +1688,76 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
> >  }
> >  EXPORT_SYMBOL_GPL(cgroup_path);
> >  
> > +/*
> > + * cgroup_task_migrate - move a task from one cgroup to another.
> > + *
> > + * 'guarantee' is set if the caller promises that a new css_set for the task
> > + * will already exit. If not set, this function might sleep, and can fail with
>            
>            already exist ?

oops, yes. good catch.

> > + * -ENOMEM. Otherwise, it can only fail with -ESRCH.
> > + */
> > +static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
> > +			       struct task_struct *tsk, bool guarantee)
> > +{
> > +	struct css_set *oldcg;
> > +	struct css_set *newcg;
> > +
> > +	/*
> > +	 * get old css_set. we need to take task_lock and refcount it, because
> > +	 * an exiting task can change its css_set to init_css_set and drop its
> > +	 * old one without taking cgroup_mutex.
> > +	 */
> > +	task_lock(tsk);
> > +	oldcg = tsk->cgroups;
> > +	get_css_set(oldcg);
> > +	task_unlock(tsk);
> > +
> > +	/* locate or allocate a new css_set for this task. */
> > +	if (guarantee) {
> > +		/* we know the css_set we want already exists. */
> > +		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
> > +		read_lock(&css_set_lock);
> > +		newcg = find_existing_css_set(oldcg, cgrp, template);
> > +		BUG_ON(!newcg);
> > +		get_css_set(newcg);
> > +		read_unlock(&css_set_lock);
> > +	} else {
> > +		might_sleep();
> > +		/* find_css_set will give us newcg already referenced. */
> > +		newcg = find_css_set(oldcg, cgrp);
> > +		if (!newcg) {
> > +			put_css_set(oldcg);
> > +			return -ENOMEM;
> > +		}
> > +	}
> > +	put_css_set(oldcg);
> > +
> > +	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
> > +	task_lock(tsk);
> > +	if (tsk->flags & PF_EXITING) {
> > +		task_unlock(tsk);
> > +		put_css_set(newcg);
> > +		return -ESRCH;
> > +	}
> > +	rcu_assign_pointer(tsk->cgroups, newcg);
> > +	task_unlock(tsk);
> > +
> > +	/* Update the css_set linked lists if we're using them */
> > +	write_lock(&css_set_lock);
> > +	if (!list_empty(&tsk->cg_list))
> > +		list_move(&tsk->cg_list, &newcg->tasks);
> > +	write_unlock(&css_set_lock);
> > +
> > +	/*
> > +	 * We just gained a reference on oldcg by taking it from the task. As
> > +	 * trading it for newcg is protected by cgroup_mutex, we're safe to drop
> > +	 * it here; it will be freed under RCU.
> > +	 */
> > +	put_css_set(oldcg);
> > +
> > +	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
> > +	return 0;
> > +}
> > +
> >  /**
> >   * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
> >   * @cgrp: the cgroup the task is attaching to
> > @@ -1698,11 +1768,9 @@ EXPORT_SYMBOL_GPL(cgroup_path);
> >   */
> >  int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
> >  {
> > -	int retval = 0;
> > +	int retval;
> >  	struct cgroup_subsys *ss, *failed_ss = NULL;
> >  	struct cgroup *oldcgrp;
> > -	struct css_set *cg;
> > -	struct css_set *newcg;
> >  	struct cgroupfs_root *root = cgrp->root;
> >  
> >  	/* Nothing to do if the task is already in that cgroup */
> > @@ -1726,46 +1794,16 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
> >  		}
> >  	}
> >  
> > -	task_lock(tsk);
> > -	cg = tsk->cgroups;
> > -	get_css_set(cg);
> > -	task_unlock(tsk);
> > -	/*
> > -	 * Locate or allocate a new css_set for this task,
> > -	 * based on its final set of cgroups
> > -	 */
> > -	newcg = find_css_set(cg, cgrp);
> > -	put_css_set(cg);
> > -	if (!newcg) {
> > -		retval = -ENOMEM;
> > -		goto out;
> > -	}
> > -
> > -	task_lock(tsk);
> > -	if (tsk->flags & PF_EXITING) {
> > -		task_unlock(tsk);
> > -		put_css_set(newcg);
> > -		retval = -ESRCH;
> > +	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, false);
> > +	if (retval)
> >  		goto out;
> > -	}
> > -	rcu_assign_pointer(tsk->cgroups, newcg);
> > -	task_unlock(tsk);
> > -
> > -	/* Update the css_set linked lists if we're using them */
> > -	write_lock(&css_set_lock);
> > -	if (!list_empty(&tsk->cg_list)) {
> > -		list_del(&tsk->cg_list);
> > -		list_add(&tsk->cg_list, &newcg->tasks);
> > -	}
> > -	write_unlock(&css_set_lock);
> >  
> >  	for_each_subsys(root, ss) {
> >  		if (ss->attach)
> >  			ss->attach(ss, cgrp, oldcgrp, tsk, false);
> >  	}
> > -	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
> > +
> 
> Hmm. By this, we call ss->attach(ss, cgrp, oldcgrp, tsk, false) after
> marking CGRP_RELEASABLE and doing synchronize_rcu() on oldcgroup... is it safe?

I honestly don't remember (that logic was written like a year ago), but
I remember Paul confirming that it was ok. But things may have changed
around - I don't recall any "cgroup_release_and_wakeup_rmdir" semantics.

> And why move it before attach() ?

Makes it easier when there are arbitrarily many "oldcgrp"s - once you
migrate each task, you won't have its old cgroup to set the bit on by
the time you call attach().

> >  	synchronize_rcu();
> > -	put_css_set(cg);
> >  
> >  	/*
> >  	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
> > @@ -1791,49 +1829,341 @@ out:
> >  }
> >  
> >  /*
> > - * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
> > - * held. May take task_lock of task
> > + * cgroup_attach_proc works in two stages, the first of which prefetches all
> > + * new css_sets needed (to make sure we have enough memory before committing
> > + * to the move) and stores them in a list of entries of the following type.
> > + * TODO: possible optimization: use css_set->rcu_head for chaining instead
> > + */
> > +struct cg_list_entry {
> > +	struct css_set *cg;
> > +	struct list_head links;
> > +};
> > +
> > +static bool css_set_check_fetched(struct cgroup *cgrp,
> > +				  struct task_struct *tsk, struct css_set *cg,
> > +				  struct list_head *newcg_list)
> > +{
> > +	struct css_set *newcg;
> > +	struct cg_list_entry *cg_entry;
> > +	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
> > +
> > +	read_lock(&css_set_lock);
> > +	newcg = find_existing_css_set(cg, cgrp, template);
> > +	if (newcg)
> > +		get_css_set(newcg);
> > +	read_unlock(&css_set_lock);
> > +
> > +	/* doesn't exist at all? */
> > +	if (!newcg)
> > +		return false;
> > +	/* see if it's already in the list */
> > +	list_for_each_entry(cg_entry, newcg_list, links) {
> > +		if (cg_entry->cg == newcg) {
> > +			put_css_set(newcg);
> > +			return true;
> > +		}
> > +	}
> > +
> > +	/* not found */
> > +	put_css_set(newcg);
> > +	return false;
> > +}
> > +
> > +/*
> > + * Find the new css_set and store it in the list in preparation for moving the
> > + * given task to the given cgroup. Returns 0 or -ENOMEM.
> >   */
> > -static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
> > +static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
> > +			    struct list_head *newcg_list)
> > +{
> > +	struct css_set *newcg;
> > +	struct cg_list_entry *cg_entry;
> > +
> > +	/* ensure a new css_set will exist for this thread */
> > +	newcg = find_css_set(cg, cgrp);
> > +	if (!newcg)
> > +		return -ENOMEM;
> > +	/* add it to the list */
> > +	cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
> > +	if (!cg_entry) {
> > +		put_css_set(newcg);
> > +		return -ENOMEM;
> > +	}
> > +	cg_entry->cg = newcg;
> > +	list_add(&cg_entry->links, newcg_list);
> > +	return 0;
> > +}
> > +
> > +/**
> > + * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
> > + * @cgrp: the cgroup to attach to
> > + * @leader: the threadgroup leader task_struct of the group to be attached
> > + *
> > + * Call holding cgroup_mutex. Will take task_lock of each thread in leader's
> > + * threadgroup individually in turn.
> > + */
> > +int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
> > +{
> > +	int retval;
> > +	struct cgroup_subsys *ss, *failed_ss = NULL;
> > +	struct cgroup *oldcgrp;
> > +	struct css_set *oldcg;
> > +	struct cgroupfs_root *root = cgrp->root;
> > +	/* threadgroup list cursor */
> > +	struct task_struct *tsk;
> > +	/*
> > +	 * we need to make sure we have css_sets for all the tasks we're
> > +	 * going to move -before- we actually start moving them, so that in
> > +	 * case we get an ENOMEM we can bail out before making any changes.
> > +	 */
> > +	struct list_head newcg_list;
> > +	struct cg_list_entry *cg_entry, *temp_nobe;
> > +
> > +	/* check that we can legitimately attach to the cgroup. */
> > +	for_each_subsys(root, ss) {
> > +		if (ss->can_attach) {
> > +			retval = ss->can_attach(ss, cgrp, leader, true);
> > +			if (retval) {
> > +				failed_ss = ss;
> > +				goto out;
> > +			}
> > +		}
> > +	}
> 
> Then, we cannot do attach limitation control per thread? (This just checks the leader.)
> Is it ok for all subsys ?

I believe it should be. At least for memory, there's no point in checking
multiple threads that all share the same VM. :)

> > +
> > +	/*
> > +	 * step 1: make sure css_sets exist for all threads to be migrated.
> > +	 * we use find_css_set, which allocates a new one if necessary.
> > +	 */
> > +	INIT_LIST_HEAD(&newcg_list);
> > +	oldcgrp = task_cgroup_from_root(leader, root);
> > +	if (cgrp != oldcgrp) {
> > +		/* get old css_set */
> > +		task_lock(leader);
> > +		if (leader->flags & PF_EXITING) {
> > +			task_unlock(leader);
> > +			goto prefetch_loop;
> > +		}
> Why do we continue here ? not -ESRCH ?

The leader can exit and still have other threads going in its
threadgroup; in this case, we still want to move the rest of the
threads.

> > +		oldcg = leader->cgroups;
> > +		get_css_set(oldcg);
> > +		task_unlock(leader);
> > +		/* acquire new one */
> > +		retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
> > +		put_css_set(oldcg);
> > +		if (retval)
> > +			goto list_teardown;
> > +	}
> > +prefetch_loop:
> > +	rcu_read_lock();
> > +	/* sanity check - if we raced with de_thread, we must abort */
> > +	if (!thread_group_leader(leader)) {
> > +		retval = -EAGAIN;
> > +		goto list_teardown;
> > +	}
> 
> EAGAIN ? ESRCH ? or EBUSY ?

This happens in the following case: we have a pointer to A, the leader; A
forks B, B execs, and B becomes the new leader. (It's dangerous if, after
that, this happens: B forks C, then B exits. Now A->group_leader and
A->thread_group.next both point to nowhere. Thanks Oleg :) )

EBUSY might also be ok; I picked EAGAIN because it fits meaning-wise -
it's handled higher-up in the VFS write handler, so userspace doesn't
see it.

> > +	/*
> > +	 * if we need to fetch a new css_set for this task, we must exit the
> > +	 * rcu_read section because allocating it can sleep. afterwards, we'll
> > +	 * need to restart iteration on the threadgroup list - the whole thing
> > +	 * will be O(nm) in the number of threads and css_sets; as the typical
> > +	 * case has only one css_set for all of them, usually O(n). which ones
> > +	 * we need allocated won't change as long as we hold cgroup_mutex.
> > +	 */
> > +	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
> > +		/* nothing to do if this task is already in the cgroup */
> > +		oldcgrp = task_cgroup_from_root(tsk, root);
> > +		if (cgrp == oldcgrp)
> > +			continue;
> > +		/* get old css_set pointer */
> > +		task_lock(tsk);
> > +		if (tsk->flags & PF_EXITING) {
> > +			/* ignore this task if it's going away */
> > +			task_unlock(tsk);
> 
> It's going away but seems to exist for a while....then, "continue" is safe
> for keeping consistency ?

Yes, it's going away but hasn't been unhashed yet. Since it's on the
thread_group list (and we have rcu_read), of course its next pointer is
sane.

> > +			continue;
> > +		}
> > +		oldcg = tsk->cgroups;
> > +		get_css_set(oldcg);
> > +		task_unlock(tsk);
> > +		/* see if the new one for us is already in the list? */
> > +		if (css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list)) {
> > +			/* was already there, nothing to do. */
> > +			put_css_set(oldcg);
> > +		} else {
> > +			/* we don't already have it. get new one. */
> > +			rcu_read_unlock();
> > +			retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
> > +			put_css_set(oldcg);
> > +			if (retval)
> > +				goto list_teardown;
> > +			/* begin iteration again. */
> > +			goto prefetch_loop;
> 
> Hmm ? Why do we need to restart from the 1st entry ?
> (maybe because of rcu_read_unlock() ?)

Need to allocate (prefetch), can't do it while rcu_read is held.

> Does this function work well if the process has 10000+ threads ?

Depends on the css_sets - it is pretty unlikely that the threads will be
diversified enough that runtime will actually approach quadratic, and in
that case I'd rather have bad runtime (this is already an expensive
operation) than more complicated logic (which you proposed). Lord knows
it's already complicated enough :X

> 
> How about this logic ?
> ==
> 
> 	/* At first, find out necessary things */
> 	rcu_read_lock();
> 	list_for_each_entry_rcu() {
> 		oldcgrp = task_cgroup_from_root(tsk, root);
> 		if (oldcgrp == cgrp)
> 			continue;
> 		task_lock(task);
> 		if (task->flags & PF_EXITING) {
> 			task_unlock(task);
> 			continue;
> 		}
> 		oldcg = tsk->cgroups;
> 		get_css_set(oldcg);
> 		task_unlock(task);
> 		read_lock(&css_set_lock);
> 		newcg = find_existing_css_set(oldcg, cgrp, template);
> 		read_unlock(&css_set_lock);
> 		if (newcg) {
> 			remember_this_newcg(newcg, &found_cg_array);
> 			put_css_set(oldcg);
> 		} else
> 			remember_need_to_allocate(oldcg, &need_to_allocate_array);
> 	}
> 	rcu_read_unlock();
> 	/* Sort all cg_list found and drop doubly counted ones, drop refcnt if necessary */
> 	sort_and_unique(found_cg_array);
> 	/* Sort all cg_list not found and drop doubly counted ones, drop refcnt if necessary */
> 	sort_and_unique(need_to_allocate_array);
> 	/* Allocate new ones */
> 	newly_allocated_array = allocate_new_cg_lists(need_to_allocate_array);
> 	drop_refcnt_of_old_cgs(need_to_allocate_array);
> 
> 	/* Now we have all necessary cg_list */
> 
> > +		}
> > +	}
> > +	rcu_read_unlock();
> > +
> > +	/*
> > +	 * step 2: now that we're guaranteed success wrt the css_sets, proceed
> > +	 * to move all tasks to the new cgroup. we need to lock against possible
> > +	 * races with fork(). note: we can safely access leader->signal because
> > +	 * attach_task_by_pid takes a reference on leader, which guarantees that
> > +	 * the signal_struct will stick around. threadgroup_fork_lock must be
> > +	 * taken outside of tasklist_lock to match the order in the fork path.
> > +	 */
> > +	BUG_ON(!leader->signal);
> > +	down_write(&leader->signal->threadgroup_fork_lock);
> > +	read_lock(&tasklist_lock);
> > +	/* sanity check - if we raced with de_thread, we must abort */
> > +	if (!thread_group_leader(leader)) {
> > +		retval = -EAGAIN;
> > +		read_unlock(&tasklist_lock);
> > +		up_write(&leader->signal->threadgroup_fork_lock);
> > +		goto list_teardown;
> > +	}
> > +	/*
> > +	 * No failure cases left, so this is the commit point.
> > +	 *
> > +	 * If the leader is already there, skip moving him. Note: even if the
> > +	 * leader is PF_EXITING, we still move all other threads; if everybody
> > +	 * is PF_EXITING, we end up doing nothing, which is ok.
> > +	 */
> > +	oldcgrp = task_cgroup_from_root(leader, root);
> > +	if (cgrp != oldcgrp) {
> > +		retval = cgroup_task_migrate(cgrp, oldcgrp, leader, true);
> > +		BUG_ON(retval != 0 && retval != -ESRCH);
> > +	}
> > +	/* Now iterate over each thread in the group. */
> > +	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
> > +		BUG_ON(tsk->signal != leader->signal);
> > +		/* leave current thread as it is if it's already there */
> > +		oldcgrp = task_cgroup_from_root(tsk, root);
> > +		if (cgrp == oldcgrp)
> > +			continue;
> > +		/* we don't care whether these threads are exiting */
> > +		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
> > +		BUG_ON(retval != 0 && retval != -ESRCH);
> > +	}
> > +
> > +	/*
> > +	 * step 3: attach whole threadgroup to each subsystem
> > +	 * TODO: if ever a subsystem needs to know the oldcgrp for each task
> > +	 * being moved, this call will need to be reworked to communicate that.
> > +	 */
> > +	for_each_subsys(root, ss) {
> > +		if (ss->attach)
> > +			ss->attach(ss, cgrp, oldcgrp, leader, true);
> > +	}
> > +	/* holding these until here keeps us safe from exec() and fork(). */
> > +	read_unlock(&tasklist_lock);
> > +	up_write(&leader->signal->threadgroup_fork_lock);
> > +
> > +	/*
> > +	 * step 4: success! and cleanup
> > +	 */
> > +	synchronize_rcu();
> > +	cgroup_wakeup_rmdir_waiter(cgrp);
> > +	retval = 0;
> > +list_teardown:
> > +	/* clean up the list of prefetched css_sets. */
> > +	list_for_each_entry_safe(cg_entry, temp_nobe, &newcg_list, links) {
> > +		list_del(&cg_entry->links);
> > +		put_css_set(cg_entry->cg);
> > +		kfree(cg_entry);
> > +	}
> > +out:
> > +	if (retval) {
> > +		/* same deal as in cgroup_attach_task, with threadgroup=true */
> > +		for_each_subsys(root, ss) {
> > +			if (ss == failed_ss)
> > +				break;
> > +			if (ss->cancel_attach)
> > +				ss->cancel_attach(ss, cgrp, leader, true);
> > +		}
> > +	}
> > +	return retval;
> > +}
> > +
> > +/*
> > + * Find the task_struct of the task to attach by vpid and pass it along to the
> > + * function to attach either it or all tasks in its threadgroup. Will take
> > + * cgroup_mutex; may take task_lock of task.
> > + */
> > +static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
> >  {
> >  	struct task_struct *tsk;
> >  	const struct cred *cred = current_cred(), *tcred;
> >  	int ret;
> >  
> > +	if (!cgroup_lock_live_group(cgrp))
> > +		return -ENODEV;
> > +
> >  	if (pid) {
> >  		rcu_read_lock();
> >  		tsk = find_task_by_vpid(pid);
> > -		if (!tsk || tsk->flags & PF_EXITING) {
> > +		if (!tsk) {
> > +			rcu_read_unlock();
> > +			cgroup_unlock();
> > +			return -ESRCH;
> > +		}
> > +		if (threadgroup) {
> > +			/*
> > +			 * it is safe to find group_leader because tsk was found
> > +			 * in the tid map, meaning it can't have been unhashed
> > +			 * by someone in de_thread changing the leadership.
> > +			 */
> > +			tsk = tsk->group_leader;
> > +			BUG_ON(!thread_group_leader(tsk));
> > +		} else if (tsk->flags & PF_EXITING) {
> > +			/* optimization for the single-task-only case */
> >  			rcu_read_unlock();
> > +			cgroup_unlock();
> >  			return -ESRCH;
> >  		}
> >  
> > +		/*
> > +		 * even if we're attaching all tasks in the thread group, we
> > +		 * only need to check permissions on one of them.
> > +		 */
> >  		tcred = __task_cred(tsk);
> >  		if (cred->euid &&
> >  		    cred->euid != tcred->uid &&
> >  		    cred->euid != tcred->suid) {
> >  			rcu_read_unlock();
> > +			cgroup_unlock();
> >  			return -EACCES;
> >  		}
> >  		get_task_struct(tsk);
> >  		rcu_read_unlock();
> >  	} else {
> > -		tsk = current;
> > +		if (threadgroup)
> > +			tsk = current->group_leader;
> > +		else
> 
> I'm not sure but "group_leader" is safe to access here ?

current->group_leader is always safe, since current should have a
refcount on its leader.

> 
> > +			tsk = current;
> >  		get_task_struct(tsk);
> >  	}
> >  
> > -	ret = cgroup_attach_task(cgrp, tsk);
> > +	if (threadgroup)
> > +		ret = cgroup_attach_proc(cgrp, tsk);
> > +	else
> > +		ret = cgroup_attach_task(cgrp, tsk);
> >  	put_task_struct(tsk);
> > +	cgroup_unlock();
> >  	return ret;
> >  }
> >  
> >  static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
> >  {
> > +	return attach_task_by_pid(cgrp, pid, false);
> > +}
> > +
> > +static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
> > +{
> >  	int ret;
> > -	if (!cgroup_lock_live_group(cgrp))
> > -		return -ENODEV;
> > -	ret = attach_task_by_pid(cgrp, pid);
> > -	cgroup_unlock();
> > +	do {
> > +		/*
> > +		 * attach_proc fails with -EAGAIN if threadgroup leadership
> > +		 * changes in the middle of the operation, in which case we need
> > +		 * to find the task_struct for the new leader and start over.
> > +		 */
> > +		ret = attach_task_by_pid(cgrp, tgid, true);
> > +	} while (ret == -EAGAIN);
> >  	return ret;
> >  }
> >  
> > @@ -3168,9 +3498,9 @@ static struct cftype files[] = {
> >  	{
> >  		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
> >  		.open = cgroup_procs_open,
> > -		/* .write_u64 = cgroup_procs_write, TODO */
> > +		.write_u64 = cgroup_procs_write,
> >  		.release = cgroup_pidlist_release,
> > -		.mode = S_IRUGO,
> > +		.mode = S_IRUGO | S_IWUSR,
> >  	},
> >  	{
> >  		.name = "notify_on_release",
> > diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
> > index ce71ed5..daf0249 100644
> > --- a/kernel/cgroup_freezer.c
> > +++ b/kernel/cgroup_freezer.c
> > @@ -190,6 +190,10 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
> >  		struct task_struct *c;
> >  
> >  		rcu_read_lock();
> > +		if (!thread_group_leader(task)) {
> > +			rcu_read_unlock();
> > +			return -EAGAIN;
> > +		}
> >  		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
> >  			if (is_task_frozen_enough(c)) {
> >  				rcu_read_unlock();
> > diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> > index b23c097..3d7c978 100644
> > --- a/kernel/cpuset.c
> > +++ b/kernel/cpuset.c
> > @@ -1404,6 +1404,10 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
> >  		struct task_struct *c;
> >  
> >  		rcu_read_lock();
> > +		if (!thread_group_leader(tsk)) {
> > +			rcu_read_unlock();
> > +			return -EAGAIN;
> > +		}
> >  		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
> >  			ret = security_task_setscheduler(c, 0, NULL);
> >  			if (ret) {
> > diff --git a/kernel/ns_cgroup.c b/kernel/ns_cgroup.c
> > index 2a5dfec..ecd15d2 100644
> > --- a/kernel/ns_cgroup.c
> > +++ b/kernel/ns_cgroup.c
> > @@ -59,6 +59,10 @@ static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
> >  	if (threadgroup) {
> >  		struct task_struct *c;
> >  		rcu_read_lock();
> > +		if (!thread_group_leader(task)) {
> > +			rcu_read_unlock();
> > +			return -EAGAIN;
> > +		}
> >  		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
> >  			if (!cgroup_is_descendant(new_cgroup, c)) {
> >  				rcu_read_unlock();
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 70fa78d..df53f53 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -8721,6 +8721,10 @@ cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
> >  	if (threadgroup) {
> >  		struct task_struct *c;
> >  		rcu_read_lock();
> > +		if (!thread_group_leader(tsk)) {
> > +			rcu_read_unlock();
> > +			return -EAGAIN;
> > +		}
> >  		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
> >  			retval = cpu_cgroup_can_attach_task(cgrp, c);
> >  			if (retval) {
> 
> 
> Thanks,
> -Kame
> 
> 

Thanks for having a full look over!

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v4 2/2] cgroups: make procs file writable
       [not found]     ` <20100804100811.199d73ba.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
  2010-08-04  4:28       ` Ben Blum
@ 2010-08-04  4:30       ` Paul Menage
  1 sibling, 0 replies; 185+ messages in thread
From: Paul Menage @ 2010-08-04  4:30 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Ben, Kame,

Sorry for the delay in getting to look at this,

On Tue, Aug 3, 2010 at 6:08 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> wrote:
>>
>>       for_each_subsys(root, ss) {
>>               if (ss->attach)
>>                       ss->attach(ss, cgrp, oldcgrp, tsk, false);
>>       }
>> -     set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
>> +
>
> Hmm. By this, we call ss->attach(ss, cgrp, oldcgrp, tsk, false) after
> marking CGRP_RELEASABLE and doing synchronize_rcu() on oldcgroup... is it safe?
> And why move it before attach() ?
>

Marking as releasable should be fine - the only time this is cleared
is when you write to notify_on_release.

I think that the put_css_set(oldcg) and synchronize_rcu() is safe, by
the following logic:

- we took cgroup_lock in cgroup_tasks_write()

- after this point, oldcgrp was initialized from the task's cgroup;
therefore oldcgrp still existed at this point

- even if all the threads (including the one being operated on) exit
(and hence leave oldcgrp) while we're doing the attach, we're holding
cgroup_lock so no-one else can delete oldcgrp

So it's certainly possible that the task in question has exited by the
time we call the subsys attach methods, but oldcgrp should still be
alive.

Whether we need an additional synchronize_rcu() after the attach()
calls is harder to determine - I guess it's better to be safe than
sorry, unless people are seeing specific performance issues with this.

I think the css_set_check_fetched() function needs more comments
explaining its behaviour and what its return value indicates.
>> +     /*
>> +      * we need to make sure we have css_sets for all the tasks we're
>> +      * going to move -before- we actually start moving them, so that in
>> +      * case we get an ENOMEM we can bail out before making any changes.

More than that - even if we don't get an ENOMEM, we can't safely sleep
in the RCU section, so we'd either have to do all memory allocations
atomically (which would be bad and unreliable) or else we avoid the
need to allocate in the RCU section (which is the choice taken here).

>> +      */
>> +     struct list_head newcg_list;
>> +     struct cg_list_entry *cg_entry, *temp_nobe;
>> +
>> +     /* check that we can legitimately attach to the cgroup. */
>> +     for_each_subsys(root, ss) {
>> +             if (ss->can_attach) {
>> +                     retval = ss->can_attach(ss, cgrp, leader, true);
>> +                     if (retval) {
>> +                             failed_ss = ss;
>> +                             goto out;
>> +                     }
>> +             }
>> +     }
>
> Then, we cannot do attach limitation control per thread? (This just checks the leader.)
> Is it ok for all subsys ?

By passing "true" as the "threadgroup" parameter to can_attach(),
we're letting the subsystem decide if it needs to do per-thread
checks. For most subsystems, calling them once for each thread would
be unnecessary.

>
>
>> +
>> +     /*
>> +      * step 1: make sure css_sets exist for all threads to be migrated.
>> +      * we use find_css_set, which allocates a new one if necessary.
>> +      */
>> +     INIT_LIST_HEAD(&newcg_list);
>> +     oldcgrp = task_cgroup_from_root(leader, root);
>> +     if (cgrp != oldcgrp) {
>> +             /* get old css_set */
>> +             task_lock(leader);
>> +             if (leader->flags & PF_EXITING) {
>> +                     task_unlock(leader);
>> +                     goto prefetch_loop;
>> +             }
> Why do we continue here ? not -ESRCH ?
>

It's possible that some threads from the process are exiting while
we're trying to move the entire process. As long as we move at least
one thread, we shouldn't care if some of its threads are exiting.

Which means that after we've done the prefetch loop, we should
probably check that the newcg_list isn't empty, and return -ESRCH in
that case.
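
Concretely, the check being suggested might look like this right after
the prefetch loop (a sketch; the label and list names follow the quoted
patch, and a real implementation may also want to distinguish the case
where every thread was already in the destination cgroup, since that
also leaves the list empty):

	if (list_empty(&newcg_list)) {
		/* every thread raced with exit; nothing left to move */
		retval = -ESRCH;
		goto list_teardown;
	}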

>
>> +             oldcg = leader->cgroups;
>> +             get_css_set(oldcg);
>> +             task_unlock(leader);
>> +             /* acquire new one */

/* acquire a new css_set for the leader */

>> +      * if we need to fetch a new css_set for this task, we must exit the
>> +      * rcu_read section because allocating it can sleep. afterwards, we'll
>> +      * need to restart iteration on the threadgroup list - the whole thing
>> +      * will be O(nm) in the number of threads and css_sets; as the typical
>> +      * case has only one css_set for all of them, usually O(n). which ones

Maybe better to say "in the worst case this is O(n^2) in the number of
threads; however, in the vast majority of cases all the threads will
be in the same cgroups as the leader and we'll make just a single pass
through the list with no additional allocations needed".

>
> It's going away but seems to exist for a while... then, is "continue" safe
> for keeping consistency ?

Yes, because we don't sleep so the RCU list is still valid.

>> +             /* see if the new one for us is already in the list? */

/* See if we already have an appropriate css_set for this thread */

>> +                     /* begin iteration again. */

/* Since we may have slept in css_set_prefetch(), the RCU list is no
longer valid, so we must begin the iteration again; any threads that
we've previously processed will pass the css_set_check_fetched() test
on subsequent iterations since we hold cgroup_lock, so we're
guaranteed to make progress. */

> Does this function work well if the process has 10000+ threads ?

In general there'll only be one cgroup so it'll be a single pass
through the list.
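
Putting the pieces above together, the prefetch loop being described is
roughly the following shape (a simplified sketch, not the patch itself:
css_set_prefetch() is the name used in the comments quoted above, the
leader is handled before this loop in the patch, and task_lock() /
PF_EXITING handling is elided):

prefetch_loop:
	rcu_read_lock();
	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
		/* threads handled on an earlier pass are skipped cheaply */
		if (css_set_check_fetched(cgrp, tsk, tsk->cgroups, &newcg_list))
			continue;
		/* need a new css_set, and we can't allocate under RCU ... */
		oldcg = tsk->cgroups;
		get_css_set(oldcg);
		rcu_read_unlock();
		/* ... so allocate outside the RCU section ... */
		retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
		if (retval)
			goto list_teardown;
		/* ... and restart, since the thread list may have changed */
		goto prefetch_loop;
	}
	rcu_read_unlock();
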

>
> How about this logic ?
> ==
>
>        /* At first, find out necessary things */
>        rcu_read_lock();
>        list_for_each_entry_rcu() {
>                oldcgrp = task_cgroup_from_root(tsk, root);
>                if (oldcgrp == cgrp)
>                        continue;
>                task_lock(task);
>                if (task->flags & PF_EXITING) {
>                        task_unlock(task);
>                        continue;
>                }
>                oldcg = tsk->cgroups;
>                get_css_set(oldcg);
>                task_unlock(task);
>                read_lock(&css_set_lock);
>                newcg = find_existing_css_set(oldcg, cgrp, template);
>                if (newcg) {
>                        remember_this_newcg(newcg, &found_cg_array);
>                        put_css_set(oldcg);
>                } else
>                        remember_need_to_allocate(oldcg, &need_to_allocate_array);

The problem with this is that remember_need_to_allocate() will itself
need to allocate memory in order to allow need_to_allocate_array to
expand arbitrarily. Which can't be done without GFP_ATOMIC or else
sleeping in the RCU section, neither of which are good.
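
In other words, any dynamically-growing "remember for later" structure
runs into the same dilemma (illustrative only; need_to_allocate and
new_size are placeholder names):

	/* may sleep - not allowed while rcu_read_lock() is held */
	bigger = krealloc(need_to_allocate, new_size, GFP_KERNEL);
	/* doesn't sleep, but can fail under memory pressure */
	bigger = krealloc(need_to_allocate, new_size, GFP_ATOMIC);
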
>> +list_teardown:
>> +     /* clean up the list of prefetched css_sets. */
>> +     list_for_each_entry_safe(cg_entry, temp_nobe, &newcg_list, links) {
>> +             list_del(&cg_entry->links);
>> +             put_css_set(cg_entry->cg);
>> +             kfree(cg_entry);
>> +     }

I wonder if we might need a synchronize_rcu() here?

>> --- a/kernel/cpuset.c
>> +++ b/kernel/cpuset.c
>> @@ -1404,6 +1404,10 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
>>               struct task_struct *c;
>>
>>               rcu_read_lock();
>> +             if (!thread_group_leader(tsk)) {
>> +                     rcu_read_unlock();
>> +                     return -EAGAIN;
>> +             }

Why are you adding this requirement, here and in sched.c? (ns_cgroup.c
doesn't matter since it's being deleted).

Paul

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v4 2/2] cgroups: make procs file writable
  2010-08-04  1:08   ` KAMEZAWA Hiroyuki
  2010-08-04  4:28     ` Ben Blum
       [not found]     ` <20100804100811.199d73ba.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
@ 2010-08-04  4:30     ` Paul Menage
       [not found]       ` <AANLkTikMofFGHSwF2QrdcAsit+hU6ihndhK5cod8duwS-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2 siblings, 1 reply; 185+ messages in thread
From: Paul Menage @ 2010-08-04  4:30 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ben Blum, linux-kernel, containers, akpm, ebiederm, lizf, matthltc, oleg

Hi Ben, Kame,

Sorry for the delay in getting to look at this,

On Tue, Aug 3, 2010 at 6:08 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>
>>       for_each_subsys(root, ss) {
>>               if (ss->attach)
>>                       ss->attach(ss, cgrp, oldcgrp, tsk, false);
>>       }
>> -     set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
>> +
>
> Hmm. By this, we call ss->attach(ss, cgrp, oldcgrp, tsk, false) after
> marking CGRP_RELEASABLE+synchronize_rcu() on oldcgroup...is it safe ?
> And why move it before attach() ?
>

Marking as releasable should be fine - the only time this is cleared
is when you write to notify_on_release.

I think that the put_css_set(oldcg) and synchronize_rcu() is safe, by
the following logic:

- we took cgroup_lock in cgroup_tasks_write()

- after this point, oldcgrp was initialized from the task's cgroup;
therefore oldcgrp still existed at this point

- even if all the threads (including the one being operated on) exit
(and hence leave oldcgrp) while we're doing the attach, we're holding
cgroup_lock so no-one else can delete oldcgrp

So it's certainly possible that the task in question has exited by the
time we call the subsys attach methods, but oldcgrp should still be
alive.

Whether we need an additional synchronize_rcu() after the attach()
calls is harder to determine - I guess it's better to be safe than
sorry, unless people are seeing specific performance issues with this.

I think the css_set_check_fetched() function needs more comments
explaining its behaviour and what its return value indicates.
>> +     /*
>> +      * we need to make sure we have css_sets for all the tasks we're
>> +      * going to move -before- we actually start moving them, so that in
>> +      * case we get an ENOMEM we can bail out before making any changes.

More than that - even if we don't get an ENOMEM, we can't safely sleep
in the RCU section, so we'd either have to do all memory allocations
atomically (which would be bad and unreliable) or else we avoid the
need to allocate in the RCU section (which is the choice taken here).

>> +      */
>> +     struct list_head newcg_list;
>> +     struct cg_list_entry *cg_entry, *temp_nobe;
>> +
>> +     /* check that we can legitimately attach to the cgroup. */
>> +     for_each_subsys(root, ss) {
>> +             if (ss->can_attach) {
>> +                     retval = ss->can_attach(ss, cgrp, leader, true);
>> +                     if (retval) {
>> +                             failed_ss = ss;
>> +                             goto out;
>> +                     }
>> +             }
>> +     }
>
> Then, we cannot do attach limitation control per thread ? (This just checks the leader.)
> Is it ok for all subsys ?

By passing "true" as the "threadgroup" parameter to can_attach(),
we're letting the subsystem decide if it needs to do per-thread
checks. For most subsystems, calling them once for each thread would
be unnecessary.

>
>
>> +
>> +     /*
>> +      * step 1: make sure css_sets exist for all threads to be migrated.
>> +      * we use find_css_set, which allocates a new one if necessary.
>> +      */
>> +     INIT_LIST_HEAD(&newcg_list);
>> +     oldcgrp = task_cgroup_from_root(leader, root);
>> +     if (cgrp != oldcgrp) {
>> +             /* get old css_set */
>> +             task_lock(leader);
>> +             if (leader->flags & PF_EXITING) {
>> +                     task_unlock(leader);
>> +                     goto prefetch_loop;
>> +             }
> Why do we continue here ? not -ESRCH ?
>

It's possible that some threads from the process are exiting while
we're trying to move the entire process. As long as we move at least
one thread, we shouldn't care if some of its threads are exiting.

Which means that after we've done the prefetch loop, we should
probably check that the newcg_list isn't empty, and return -ESRCH in
that case.

>
>> +             oldcg = leader->cgroups;
>> +             get_css_set(oldcg);
>> +             task_unlock(leader);
>> +             /* acquire new one */

/* acquire a new css_set for the leader */

>> +      * if we need to fetch a new css_set for this task, we must exit the
>> +      * rcu_read section because allocating it can sleep. afterwards, we'll
>> +      * need to restart iteration on the threadgroup list - the whole thing
>> +      * will be O(nm) in the number of threads and css_sets; as the typical
>> +      * case has only one css_set for all of them, usually O(n). which ones

Maybe better to say "in the worst case this is O(n^2) in the number of
threads; however, in the vast majority of cases all the threads will
be in the same cgroups as the leader and we'll make just a single pass
through the list with no additional allocations needed".

>
> It's going away but seems to exist for a while... then, is "continue" safe
> for keeping consistency ?

Yes, because we don't sleep so the RCU list is still valid.

>> +             /* see if the new one for us is already in the list? */

/* See if we already have an appropriate css_set for this thread */

>> +                     /* begin iteration again. */

/* Since we may have slept in css_set_prefetch(), the RCU list is no
longer valid, so we must begin the iteration again; any threads that
we've previously processed will pass the css_set_check_fetched() test
on subsequent iterations since we hold cgroup_lock, so we're
guaranteed to make progress. */

> Does this function work well if the process has 10000+ threads ?

In general there'll only be one cgroup so it'll be a single pass
through the list.

>
> How about this logic ?
> ==
>
>        /* At first, find out necessary things */
>        rcu_read_lock();
>        list_for_each_entry_rcu() {
>                oldcgrp = task_cgroup_from_root(tsk, root);
>                if (oldcgrp == cgrp)
>                        continue;
>                task_lock(task);
>                if (task->flags & PF_EXITING) {
>                        task_unlock(task);
>                        continue;
>                }
>                oldcg = tsk->cgroups;
>                get_css_set(oldcg);
>                task_unlock(task);
>                read_lock(&css_set_lock);
>                newcg = find_existing_css_set(oldcg, cgrp, template);
>                if (newcg) {
>                        remember_this_newcg(newcg, &found_cg_array);
>                        put_css_set(oldcg);
>                } else
>                        remember_need_to_allocate(oldcg, &need_to_allocate_array);

The problem with this is that remember_need_to_allocate() will itself
need to allocate memory in order to allow need_to_allocate_array to
expand arbitrarily. Which can't be done without GFP_ATOMIC or else
sleeping in the RCU section, neither of which are good.
>> +list_teardown:
>> +     /* clean up the list of prefetched css_sets. */
>> +     list_for_each_entry_safe(cg_entry, temp_nobe, &newcg_list, links) {
>> +             list_del(&cg_entry->links);
>> +             put_css_set(cg_entry->cg);
>> +             kfree(cg_entry);
>> +     }

I wonder if we might need a synchronize_rcu() here?

>> --- a/kernel/cpuset.c
>> +++ b/kernel/cpuset.c
>> @@ -1404,6 +1404,10 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
>>               struct task_struct *c;
>>
>>               rcu_read_lock();
>> +             if (!thread_group_leader(tsk)) {
>> +                     rcu_read_unlock();
>> +                     return -EAGAIN;
>> +             }

Why are you adding this requirement, here and in sched.c? (ns_cgroup.c
doesn't matter since it's being deleted).

Paul

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v4 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup
  2010-08-04  3:44     ` Paul Menage
@ 2010-08-04  4:33           ` Ben Blum
  0 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-08-04  4:33 UTC (permalink / raw)
  To: Paul Menage
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, Andrew Morton

On Tue, Aug 03, 2010 at 08:44:01PM -0700, Paul Menage wrote:
>  On Fri, Jul 30, 2010 at 4:57 PM, Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> > +	 * The threadgroup_fork_lock prevents threads from forking with
> > +	 * CLONE_THREAD while held for writing. Use this for fork-sensitive
> > +	 * threadgroup-wide operations. It's taken for reading in fork.c in
> > +	 * copy_process().
> > +	 * Currently only needed write-side by cgroups.
> > +	 */
> > +	struct rw_semaphore threadgroup_fork_lock;
> > +#endif
> 
> I'm not sure how best to word this comment, but I'd prefer something like:
> 
> "The threadgroup_fork_lock is taken in read mode during a CLONE_THREAD
> fork operation; taking it in write mode prevents the owning
> threadgroup from adding any new threads and thus allows you to
> synchronize against the addition of unseen threads when performing
> threadgroup-wide operations. New-process forks (without CLONE_THREAD)
> are not affected."

That sounds good.

> As far as the #ifdef mess goes, it's true that some people don't have
> CONFIG_CGROUPS defined. I'd imagine that these are likely to be
> embedded systems with a fairly small number of processes and threads
> per process. Are there really any such platforms where the cost of a
> single extra rwsem per process is going to make a difference either in
> terms of memory or lock contention? I think you should consider making
> these additions unconditional.

That's certainly an option, but I think it would be clean enough to put
static inline functions just under the signal_struct definition.
Thoughts?

> 
> Paul
> 

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v4 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup
@ 2010-08-04  4:33           ` Ben Blum
  0 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-08-04  4:33 UTC (permalink / raw)
  To: Paul Menage
  Cc: Ben Blum, Andrew Morton, linux-kernel, containers, ebiederm,
	lizf, matthltc, oleg

On Tue, Aug 03, 2010 at 08:44:01PM -0700, Paul Menage wrote:
>  On Fri, Jul 30, 2010 at 4:57 PM, Ben Blum <bblum@andrew.cmu.edu> wrote:
> > +	 * The threadgroup_fork_lock prevents threads from forking with
> > +	 * CLONE_THREAD while held for writing. Use this for fork-sensitive
> > +	 * threadgroup-wide operations. It's taken for reading in fork.c in
> > +	 * copy_process().
> > +	 * Currently only needed write-side by cgroups.
> > +	 */
> > +	struct rw_semaphore threadgroup_fork_lock;
> > +#endif
> 
> I'm not sure how best to word this comment, but I'd prefer something like:
> 
> "The threadgroup_fork_lock is taken in read mode during a CLONE_THREAD
> fork operation; taking it in write mode prevents the owning
> threadgroup from adding any new threads and thus allows you to
> synchronize against the addition of unseen threads when performing
> threadgroup-wide operations. New-process forks (without CLONE_THREAD)
> are not affected."

That sounds good.

> As far as the #ifdef mess goes, it's true that some people don't have
> CONFIG_CGROUPS defined. I'd imagine that these are likely to be
> embedded systems with a fairly small number of processes and threads
> per process. Are there really any such platforms where the cost of a
> single extra rwsem per process is going to make a difference either in
> terms of memory or lock contention? I think you should consider making
> these additions unconditional.

That's certainly an option, but I think it would be clean enough to put
static inline functions just under the signal_struct definition.
Thoughts?

> 
> Paul
> 

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v4 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup
  2010-08-04  4:33           ` Ben Blum
@ 2010-08-04  4:34               ` Paul Menage
  -1 siblings, 0 replies; 185+ messages in thread
From: Paul Menage @ 2010-08-04  4:34 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Andrew Morton, ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Tue, Aug 3, 2010 at 9:33 PM, Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
>> As far as the #ifdef mess goes, it's true that some people don't have
>> CONFIG_CGROUPS defined. I'd imagine that these are likely to be
>> embedded systems with a fairly small number of processes and threads
>> per process. Are there really any such platforms where the cost of a
>> single extra rwsem per process is going to make a difference either in
>> terms of memory or lock contention? I think you should consider making
>> these additions unconditional.
>
> That's certainly an option, but I think it would be clean enough to put
> static inline functions just under the signal_struct definition.

Either sounds fine to me. I suspect others have a stronger opinion.

Paul

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v4 1/2] cgroups: read-write lock CLONE_THREAD forking per  threadgroup
@ 2010-08-04  4:34               ` Paul Menage
  0 siblings, 0 replies; 185+ messages in thread
From: Paul Menage @ 2010-08-04  4:34 UTC (permalink / raw)
  To: Ben Blum
  Cc: Andrew Morton, linux-kernel, containers, ebiederm, lizf, matthltc, oleg

On Tue, Aug 3, 2010 at 9:33 PM, Ben Blum <bblum@andrew.cmu.edu> wrote:
>> As far as the #ifdef mess goes, it's true that some people don't have
>> CONFIG_CGROUPS defined. I'd imagine that these are likely to be
>> embedded systems with a fairly small number of processes and threads
>> per process. Are there really any such platforms where the cost of a
>> single extra rwsem per process is going to make a difference either in
>> terms of memory or lock contention? I think you should consider making
>> these additions unconditional.
>
> That's certainly an option, but I think it would be clean enough to put
> static inline functions just under the signal_struct definition.

Either sounds fine to me. I suspect others have a stronger opinion.

Paul

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v4 2/2] cgroups: make procs file writable
  2010-08-04  4:30     ` Paul Menage
@ 2010-08-04  4:38           ` Ben Blum
  0 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-08-04  4:38 UTC (permalink / raw)
  To: Paul Menage
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Tue, Aug 03, 2010 at 09:30:00PM -0700, Paul Menage wrote:
> >> --- a/kernel/cpuset.c
> >> +++ b/kernel/cpuset.c
> >> @@ -1404,6 +1404,10 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
> >>               struct task_struct *c;
> >>
> >>               rcu_read_lock();
> >> +             if (!thread_group_leader(tsk)) {
> >> +                     rcu_read_unlock();
> >> +                     return -EAGAIN;
> >> +             }
> 
> Why are you adding this requirement, here and in sched.c? (ns_cgroup.c
> doesn't matter since it's being deleted).
> 
> Paul

It was either this or:

rcu_read_lock();
for_each_subsys(...) {
	can_attach(...);
}
rcu_read_unlock();

Which forces all can_attaches to not sleep. So by dropping
rcu_read_lock(), we allow the possibility of the exec race I described
in my last email, and therefore we have to check each time we re-acquire
rcu_read to iterate thread_group.

Yeah, it is not pretty. I call it "double-double-toil-and-trouble-check
locking". But it is safe.

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v4 2/2] cgroups: make procs file writable
@ 2010-08-04  4:38           ` Ben Blum
  0 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-08-04  4:38 UTC (permalink / raw)
  To: Paul Menage
  Cc: KAMEZAWA Hiroyuki, Ben Blum, linux-kernel, containers, akpm,
	ebiederm, lizf, matthltc, oleg

On Tue, Aug 03, 2010 at 09:30:00PM -0700, Paul Menage wrote:
> >> --- a/kernel/cpuset.c
> >> +++ b/kernel/cpuset.c
> >> @@ -1404,6 +1404,10 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
> >>               struct task_struct *c;
> >>
> >>               rcu_read_lock();
> >> +             if (!thread_group_leader(tsk)) {
> >> +                     rcu_read_unlock();
> >> +                     return -EAGAIN;
> >> +             }
> 
> Why are you adding this requirement, here and in sched.c? (ns_cgroup.c
> doesn't matter since it's being deleted).
> 
> Paul

It was either this or:

rcu_read_lock();
for_each_subsys(...) {
	can_attach(...);
}
rcu_read_unlock();

Which forces all can_attaches to not sleep. So by dropping
rcu_read_lock(), we allow the possibility of the exec race I described
in my last email, and therefore we have to check each time we re-acquire
rcu_read to iterate thread_group.

Yeah, it is not pretty. I call it "double-double-toil-and-trouble-check
locking". But it is safe.

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v4 2/2] cgroups: make procs file writable
       [not found]           ` <20100804043849.GC11950-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-08-04  4:46             ` Paul Menage
  0 siblings, 0 replies; 185+ messages in thread
From: Paul Menage @ 2010-08-04  4:46 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Tue, Aug 3, 2010 at 9:38 PM, Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
>
> rcu_read_lock();
> for_each_subsys(...) {
>        can_attach(...);
> }
> rcu_read_unlock();

Sorry, I was misreading this, and didn't notice that it was already
inside an "if (threadgroup) {}" test.

>
> Which forces all can_attaches to not sleep. So by dropping
> rcu_read_lock(), we allow the possibility of the exec race I described
> in my last email, and therefore we have to check each time we re-acquire
> rcu_read to iterate thread_group.

Agreed.

>
> Yeah, it is not pretty. I call it "double-double-toil-and-trouble-check
> locking". But it is safe.

As a cleanup, I'd be inclined to have a wrapper in cgroup.c, something like

cgroup_can_attach_threadgroup(struct cgroup_subsys *ss, struct cgroup
*cg, struct task_struct *leader, int (*cb)(struct task_struct *t,
struct cgroup *cg))

which handles the RCU section, checking thread_group_leader(), and
looping through each thread. The subsystem just has to define a
callback which will be called for each thread.
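
For illustration, with a wrapper along those lines the freezer check
could collapse to something like this (a sketch against the signature
proposed above, not the eventual v5 code, which names the helper
cgroup_can_attach_per_thread and passes the cgroup to the callback
first):

static int freezer_can_attach_cb(struct task_struct *t, struct cgroup *cg)
{
	/* is_task_frozen_enough() as in the existing freezer code */
	return is_task_frozen_enough(t) ? -EBUSY : 0;
}

static int freezer_can_attach(struct cgroup_subsys *ss,
			      struct cgroup *new_cgroup,
			      struct task_struct *task, bool threadgroup)
{
	if (cgroup_freezer(new_cgroup)->state == CGROUP_FROZEN)
		return -EBUSY;
	if (!threadgroup)
		return freezer_can_attach_cb(task, new_cgroup);
	return cgroup_can_attach_threadgroup(ss, new_cgroup, task,
					     freezer_can_attach_cb);
}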

Paul

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v4 2/2] cgroups: make procs file writable
  2010-08-04  4:38           ` Ben Blum
  (?)
@ 2010-08-04  4:46           ` Paul Menage
  -1 siblings, 0 replies; 185+ messages in thread
From: Paul Menage @ 2010-08-04  4:46 UTC (permalink / raw)
  To: Ben Blum
  Cc: KAMEZAWA Hiroyuki, linux-kernel, containers, akpm, ebiederm,
	lizf, matthltc, oleg

On Tue, Aug 3, 2010 at 9:38 PM, Ben Blum <bblum@andrew.cmu.edu> wrote:
>
> rcu_read_lock();
> for_each_subsys(...) {
>        can_attach(...);
> }
> rcu_read_unlock();

Sorry, I was misreading this, and didn't notice that it was already
inside an "if (threadgroup) {}" test.

>
> Which forces all can_attaches to not sleep. So by dropping
> rcu_read_lock(), we allow the possibility of the exec race I described
> in my last email, and therefore we have to check each time we re-acquire
> rcu_read to iterate thread_group.

Agreed.

>
> Yeah, it is not pretty. I call it "double-double-toil-and-trouble-check
> locking". But it is safe.

As a cleanup, I'd be inclined to have a wrapper in cgroup.c, something like

cgroup_can_attach_threadgroup(struct cgroup_subsys *ss, struct cgroup
*cg, struct task_struct *leader, int (*cb)(struct task_struct *t,
struct cgroup *cg))

which handles the RCU section, checking thread_group_leader(), and
looping through each thread. The subsystem just has to define a
callback which will be called for each thread.

Paul

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v4 1/2] cgroups: read-write lock CLONE_THREAD forking per  threadgroup
  2010-08-04  3:44     ` Paul Menage
@ 2010-08-04 16:34           ` Brian K. White
  0 siblings, 0 replies; 185+ messages in thread
From: Brian K. White @ 2010-08-04 16:34 UTC (permalink / raw)
  To: Paul Menage
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oleg-H+wXaHxf7aLQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, Andrew Morton


> As far as the #ifdef mess goes, it's true that some people don't have
> CONFIG_CGROUPS defined. I'd imagine that these are likely to be
> embedded systems with a fairly small number of processes and threads
> per process. Are there really any such platforms where the cost of a
> single extra rwsem per process is going to make a difference either in
> terms of memory or lock contention? I think you should consider making
> these additions unconditional.

openSUSE's default kernel* doesn't have CONFIG_CGROUPS

Personally I think it's a silly mistake also, since the argument was 
performance: ubuntu's desktop kernel has it and actually outperforms 
openSUSE's, and the feature is perfectly likely to be needed by 
"desktop" users. But I wasn't and still am not consulted on this, and so 
it's a fact that must be lived with at least for a while. ;) (at least 
until the majority of 11.2 and 11.3 installations are replaced due to 
age, and that's if they reverted the decision today, which so far they 
haven't)

*("kernel-desktop" the one installed by default, not "kernel-default", 
which exists but is not installed by default since openSUSE 11.2)

-- 
bkw

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v4 1/2] cgroups: read-write lock CLONE_THREAD forking per  threadgroup
@ 2010-08-04 16:34           ` Brian K. White
  0 siblings, 0 replies; 185+ messages in thread
From: Brian K. White @ 2010-08-04 16:34 UTC (permalink / raw)
  To: Paul Menage
  Cc: Ben Blum, Andrew Morton, containers, linux-kernel, oleg, ebiederm


> As far as the #ifdef mess goes, it's true that some people don't have
> CONFIG_CGROUPS defined. I'd imagine that these are likely to be
> embedded systems with a fairly small number of processes and threads
> per process. Are there really any such platforms where the cost of a
> single extra rwsem per process is going to make a difference either in
> terms of memory or lock contention? I think you should consider making
> these additions unconditional.

openSUSE's default kernel* doesn't have CONFIG_CGROUPS

Personally I think it's a silly mistake also, since the argument was 
performance: ubuntu's desktop kernel has it and actually outperforms 
openSUSE's, and the feature is perfectly likely to be needed by 
"desktop" users. But I wasn't and still am not consulted on this, and so 
it's a fact that must be lived with at least for a while. ;) (at least 
until the majority of 11.2 and 11.3 installations are replaced due to 
age, and that's if they reverted the decision today, which so far they 
haven't)

*("kernel-desktop" the one installed by default, not "kernel-default", 
which exists but is not installed by default since openSUSE 11.2)

-- 
bkw

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v4 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup
       [not found]               ` <AANLkTi=dhym3c+XJVjoObROcw=mz2Y+a2R5oMdePK3Ng-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-08-06  6:02                 ` Ben Blum
  0 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-08-06  6:02 UTC (permalink / raw)
  To: Paul Menage
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, Andrew Morton

On Tue, Aug 03, 2010 at 09:34:22PM -0700, Paul Menage wrote:
> On Tue, Aug 3, 2010 at 9:33 PM, Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> >> As far as the #ifdef mess goes, it's true that some people don't have
> >> CONFIG_CGROUPS defined. I'd imagine that these are likely to be
> >> embedded systems with a fairly small number of processes and threads
> >> per process. Are there really any such platforms where the cost of a
> >> single extra rwsem per process is going to make a difference either in
> >> terms of memory or lock contention? I think you should consider making
> >> these additions unconditional.
> >
> > That's certainly an option, but I think it would be clean enough to put
> > static inline functions just under the signal_struct definition.
> 
> Either sounds fine to me. I suspect others have a stronger opinion.
> 
> Paul
> 

Any other votes? One set of static inline functions (I'd call them
threadgroup_fork_{read,write}_{un,}lock) or just remove the ifdefs
entirely? I'm inclined to go with the former.

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v4 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup
  2010-08-04  4:34               ` Paul Menage
  (?)
@ 2010-08-06  6:02               ` Ben Blum
  2010-08-06  7:08                 ` KAMEZAWA Hiroyuki
       [not found]                 ` <20100806060224.GA1351-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  -1 siblings, 2 replies; 185+ messages in thread
From: Ben Blum @ 2010-08-06  6:02 UTC (permalink / raw)
  To: Paul Menage
  Cc: Ben Blum, Andrew Morton, linux-kernel, containers, ebiederm,
	lizf, matthltc, oleg

On Tue, Aug 03, 2010 at 09:34:22PM -0700, Paul Menage wrote:
> On Tue, Aug 3, 2010 at 9:33 PM, Ben Blum <bblum@andrew.cmu.edu> wrote:
> >> As far as the #ifdef mess goes, it's true that some people don't have
> >> CONFIG_CGROUPS defined. I'd imagine that these are likely to be
> >> embedded systems with a fairly small number of processes and threads
> >> per process. Are there really any such platforms where the cost of a
> >> single extra rwsem per process is going to make a difference either in
> >> terms of memory or lock contention? I think you should consider making
> >> these additions unconditional.
> >
> > That's certainly an option, but I think it would be clean enough to put
> > static inline functions just under the signal_struct definition.
> 
> Either sounds fine to me. I suspect others have a stronger opinion.
> 
> Paul
> 

Any other votes? One set of static inline functions (I'd call them
threadgroup_fork_{read,write}_{un,}lock) or just remove the ifdefs
entirely? I'm inclined to go with the former.

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v4 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup
       [not found]                 ` <20100806060224.GA1351-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-08-06  7:08                   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 185+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-06  7:08 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, Paul Menage, Andrew Morton

On Fri, 6 Aug 2010 02:02:24 -0400
Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:

> On Tue, Aug 03, 2010 at 09:34:22PM -0700, Paul Menage wrote:
> > On Tue, Aug 3, 2010 at 9:33 PM, Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> > >> As far as the #ifdef mess goes, it's true that some people don't have
> > >> CONFIG_CGROUPS defined. I'd imagine that these are likely to be
> > >> embedded systems with a fairly small number of processes and threads
> > >> per process. Are there really any such platforms where the cost of a
> > >> single extra rwsem per process is going to make a difference either in
> > >> terms of memory or lock contention? I think you should consider making
> > >> these additions unconditional.
> > >
> > > That's certainly an option, but I think it would be clean enough to put
> > > static inline functions just under the signal_struct definition.
> > 
> > Either sounds fine to me. I suspect others have a stronger opinion.
> > 
> > Paul
> > 
> 
> Any other votes? One set of static inline functions (I'd call them
> threadgroup_fork_{read,write}_{un,}lock) or just remove the ifdefs
> entirely? I'm inclined to go with the former.
> 

I vote for the former. The #ifdef can easily be removed if someone finds it useful
for another purpose... and a static inline function is the usual way.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v4 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup
  2010-08-06  6:02               ` Ben Blum
@ 2010-08-06  7:08                 ` KAMEZAWA Hiroyuki
       [not found]                 ` <20100806060224.GA1351-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  1 sibling, 0 replies; 185+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-08-06  7:08 UTC (permalink / raw)
  To: Ben Blum
  Cc: Paul Menage, Andrew Morton, linux-kernel, containers, ebiederm,
	lizf, matthltc, oleg

On Fri, 6 Aug 2010 02:02:24 -0400
Ben Blum <bblum@andrew.cmu.edu> wrote:

> On Tue, Aug 03, 2010 at 09:34:22PM -0700, Paul Menage wrote:
> > On Tue, Aug 3, 2010 at 9:33 PM, Ben Blum <bblum@andrew.cmu.edu> wrote:
> > >> As far as the #ifdef mess goes, it's true that some people don't have
> > >> CONFIG_CGROUPS defined. I'd imagine that these are likely to be
> > >> embedded systems with a fairly small number of processes and threads
> > >> per process. Are there really any such platforms where the cost of a
> > >> single extra rwsem per process is going to make a difference either in
> > >> terms of memory or lock contention? I think you should consider making
> > >> these additions unconditional.
> > >
> > > That's certainly an option, but I think it would be clean enough to put
> > > static inline functions just under the signal_struct definition.
> > 
> > Either sounds fine to me. I suspect others have a stronger opinion.
> > 
> > Paul
> > 
> 
> Any other votes? One set of static inline functions (I'd call them
> threadgroup_fork_{read,write}_{un,}lock) or just remove the ifdefs
> entirely? I'm inclined to go with the former.
> 

I vote for the former. The #ifdef can easily be removed if someone finds it useful
for another purpose... and a static inline function is the usual way.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 185+ messages in thread

* [PATCH v5 0/3] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
  2010-07-30 23:56 [PATCH v4 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Ben Blum
@ 2010-08-11  5:46     ` Ben Blum
       [not found] ` <20100730235649.GA22644-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2010-08-03 19:58 ` [PATCH v4 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Andrew Morton
  2 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-08-11  5:46 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Fri, Jul 30, 2010 at 07:56:49PM -0400, Ben Blum wrote:
> This patch series is a revision of http://lkml.org/lkml/2010/6/25/11 .
> 
> This patch series implements a write function for the 'cgroup.procs'
> per-cgroup file, which enables atomic movement of multithreaded
> applications between cgroups. Writing the thread-ID of any thread in a
> threadgroup to a cgroup's procs file causes all threads in the group to
> be moved to that cgroup safely with respect to threads forking/exiting.
> (Possible usage scenario: If running a multithreaded build system that
> sucks up system resources, this lets you restrict it all at once into a
> new cgroup to keep it under control.)
> 
> Example: Suppose pid 31337 clones new threads 31338 and 31339.
> 
> # cat /dev/cgroup/tasks
> ...
> 31337
> 31338
> 31339
> # mkdir /dev/cgroup/foo
> # echo 31337 > /dev/cgroup/foo/cgroup.procs
> # cat /dev/cgroup/foo/tasks
> 31337
> 31338
> 31339
> 
> A new lock, called threadgroup_fork_lock and living in signal_struct, is
> introduced to ensure atomicity when moving threads between cgroups. It's
> taken for writing during the operation, and taking for reading in fork()
> around the calls to cgroup_fork() and cgroup_post_fork(). I put calls to
> down_read/up_read directly in copy_process(), since new inline functions
> seemed like overkill.
> 
> -- Ben
> 
> ---
>  Documentation/cgroups/cgroups.txt |   13 -
>  include/linux/init_task.h         |    9
>  include/linux/sched.h             |   10
>  kernel/cgroup.c                   |  426 +++++++++++++++++++++++++++++++++-----
>  kernel/cgroup_freezer.c           |    4
>  kernel/cpuset.c                   |    4
>  kernel/fork.c                     |   16 +
>  kernel/ns_cgroup.c                |    4
>  kernel/sched.c                    |    4
>  9 files changed, 440 insertions(+), 50 deletions(-)

Here's an updated patchset. I've added an extra patch to implement the 
callback scheme Paul suggested (note how there are twice as many deleted
lines of code as before :) ), and also moved the up_read/down_read calls
to static inline functions in sched.h near the other threadgroup-related
calls.

---
 Documentation/cgroups/cgroups.txt |   13 -
 include/linux/cgroup.h            |   12  
 include/linux/init_task.h         |    9   
 include/linux/sched.h             |   35 ++
 kernel/cgroup.c                   |  459 ++++++++++++++++++++++++++++++++++----
 kernel/cgroup_freezer.c           |   27 --
 kernel/cpuset.c                   |   20 -
 kernel/fork.c                     |   10  
 kernel/ns_cgroup.c                |   27 +-
 kernel/sched.c                    |   21 -
 10 files changed, 526 insertions(+), 107 deletions(-)

^ permalink raw reply	[flat|nested] 185+ messages in thread

* [PATCH v5 0/3] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
@ 2010-08-11  5:46     ` Ben Blum
  0 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-08-11  5:46 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage, oleg

On Fri, Jul 30, 2010 at 07:56:49PM -0400, Ben Blum wrote:
> This patch series is a revision of http://lkml.org/lkml/2010/6/25/11 .
> 
> This patch series implements a write function for the 'cgroup.procs'
> per-cgroup file, which enables atomic movement of multithreaded
> applications between cgroups. Writing the thread-ID of any thread in a
> threadgroup to a cgroup's procs file causes all threads in the group to
> be moved to that cgroup safely with respect to threads forking/exiting.
> (Possible usage scenario: If running a multithreaded build system that
> sucks up system resources, this lets you restrict it all at once into a
> new cgroup to keep it under control.)
> 
> Example: Suppose pid 31337 clones new threads 31338 and 31339.
> 
> # cat /dev/cgroup/tasks
> ...
> 31337
> 31338
> 31339
> # mkdir /dev/cgroup/foo
> # echo 31337 > /dev/cgroup/foo/cgroup.procs
> # cat /dev/cgroup/foo/tasks
> 31337
> 31338
> 31339
> 
> A new lock, called threadgroup_fork_lock and living in signal_struct, is
> introduced to ensure atomicity when moving threads between cgroups. It's
> taken for writing during the operation, and taking for reading in fork()
> around the calls to cgroup_fork() and cgroup_post_fork(). I put calls to
> down_read/up_read directly in copy_process(), since new inline functions
> seemed like overkill.
> 
> -- Ben
> 
> ---
>  Documentation/cgroups/cgroups.txt |   13 -
>  include/linux/init_task.h         |    9
>  include/linux/sched.h             |   10
>  kernel/cgroup.c                   |  426 +++++++++++++++++++++++++++++++++-----
>  kernel/cgroup_freezer.c           |    4
>  kernel/cpuset.c                   |    4
>  kernel/fork.c                     |   16 +
>  kernel/ns_cgroup.c                |    4
>  kernel/sched.c                    |    4
>  9 files changed, 440 insertions(+), 50 deletions(-)

Here's an updated patchset. I've added an extra patch to implement the 
callback scheme Paul suggested (note how there are twice as many deleted
lines of code as before :) ), and also moved the up_read/down_read calls
to static inline functions in sched.h near the other threadgroup-related
calls.

---
 Documentation/cgroups/cgroups.txt |   13 -
 include/linux/cgroup.h            |   12  
 include/linux/init_task.h         |    9   
 include/linux/sched.h             |   35 ++
 kernel/cgroup.c                   |  459 ++++++++++++++++++++++++++++++++++----
 kernel/cgroup_freezer.c           |   27 --
 kernel/cpuset.c                   |   20 -
 kernel/fork.c                     |   10  
 kernel/ns_cgroup.c                |   27 +-
 kernel/sched.c                    |   21 -
 10 files changed, 526 insertions(+), 107 deletions(-)

^ permalink raw reply	[flat|nested] 185+ messages in thread

* [PATCH v5 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup
       [not found]     ` <20100811054604.GA8743-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-08-11  5:47       ` Ben Blum
  2010-08-11  5:48         ` Ben Blum
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-08-11  5:47 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

[-- Attachment #1: cgroup-threadgroup-fork-lock.patch --]
[-- Type: text/plain, Size: 4809 bytes --]

Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup

From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

This patch adds an rwsem that lives in a threadgroup's signal_struct that's
taken for reading in the fork path, under CONFIG_CGROUPS. If another part of
the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
ifdefs should be changed to a higher-up flag that CGROUPS and the other system
would both depend on.

This is a pre-patch for cgroup-procs-write.patch.

Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
---
 include/linux/init_task.h |    9 +++++++++
 include/linux/sched.h     |   35 +++++++++++++++++++++++++++++++++++
 kernel/fork.c             |   10 ++++++++++
 3 files changed, 54 insertions(+), 0 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 1f43fa5..ca46711 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -15,6 +15,14 @@
 extern struct files_struct init_files;
 extern struct fs_struct init_fs;
 
+#ifdef CONFIG_CGROUPS
+#define INIT_THREADGROUP_FORK_LOCK(sig)					\
+	.threadgroup_fork_lock =					\
+		__RWSEM_INITIALIZER(sig.threadgroup_fork_lock),
+#else
+#define INIT_THREADGROUP_FORK_LOCK(sig)
+#endif
+
 #define INIT_SIGNALS(sig) {						\
 	.nr_threads	= 1,						\
 	.wait_chldexit	= __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
@@ -29,6 +37,7 @@ extern struct fs_struct init_fs;
 		.running = 0,						\
 		.lock = __SPIN_LOCK_UNLOCKED(sig.cputimer.lock),	\
 	},								\
+	INIT_THREADGROUP_FORK_LOCK(sig)					\
 }
 
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ae69716..ebd4af2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -619,6 +619,16 @@ struct signal_struct {
 	unsigned audit_tty;
 	struct tty_audit_buf *tty_audit_buf;
 #endif
+#ifdef CONFIG_CGROUPS
+	/*
+	 * The threadgroup_fork_lock prevents threads from forking with
+	 * CLONE_THREAD while held for writing. Use this for fork-sensitive
+	 * threadgroup-wide operations. It's taken for reading in fork.c in
+	 * copy_process().
+	 * Currently only needed write-side by cgroups.
+	 */
+	struct rw_semaphore threadgroup_fork_lock;
+#endif
 
 	int oom_adj;	/* OOM kill score adjustment (bit shift) */
 };
@@ -2216,6 +2226,31 @@ static inline void unlock_task_sighand(struct task_struct *tsk,
 	spin_unlock_irqrestore(&tsk->sighand->siglock, *flags);
 }
 
+/* See the declaration of threadgroup_fork_lock in signal_struct. */
+#ifdef CONFIG_CGROUPS
+static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
+{
+	down_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
+{
+	up_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_lock(struct task_struct *tsk)
+{
+	down_write(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_unlock(struct task_struct *tsk)
+{
+	up_write(&tsk->signal->threadgroup_fork_lock);
+}
+#else
+static inline void threadgroup_fork_read_lock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_read_unlock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_write_lock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_write_unlock(struct task_struct *tsk) {}
+#endif
+
 #ifndef __HAVE_THREAD_FUNCTIONS
 
 #define task_thread_info(task)	((struct thread_info *)(task)->stack)
diff --git a/kernel/fork.c b/kernel/fork.c
index a82a65c..41df253 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -898,6 +898,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 
 	tty_audit_fork(sig);
 
+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->threadgroup_fork_lock);
+#endif
+
 	sig->oom_adj = current->signal->oom_adj;
 
 	return 0;
@@ -1076,6 +1080,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	monotonic_to_bootbased(&p->real_start_time);
 	p->io_context = NULL;
 	p->audit_context = NULL;
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_lock(current);
 	cgroup_fork(p);
 #ifdef CONFIG_NUMA
 	p->mempolicy = mpol_dup(p->mempolicy);
@@ -1283,6 +1289,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	write_unlock_irq(&tasklist_lock);
 	proc_fork_connector(p);
 	cgroup_post_fork(p);
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_unlock(current);
 	perf_event_fork(p);
 	return p;
 
@@ -1316,6 +1324,8 @@ bad_fork_cleanup_policy:
 	mpol_put(p->mempolicy);
 bad_fork_cleanup_cgroup:
 #endif
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_unlock(current);
 	cgroup_exit(p, cgroup_callbacks_done);
 	delayacct_tsk_free(p);
 	module_put(task_thread_info(p)->exec_domain->module);

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v5 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup
  2010-08-11  5:46     ` Ben Blum
  (?)
@ 2010-08-11  5:47     ` Ben Blum
  2010-08-23 23:35       ` Paul Menage
       [not found]       ` <20100811054711.GB8743-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  -1 siblings, 2 replies; 185+ messages in thread
From: Ben Blum @ 2010-08-11  5:47 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage, oleg

[-- Attachment #1: cgroup-threadgroup-fork-lock.patch --]
[-- Type: text/plain, Size: 4759 bytes --]

Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup

From: Ben Blum <bblum@andrew.cmu.edu>

This patch adds an rwsem that lives in a threadgroup's signal_struct that's
taken for reading in the fork path, under CONFIG_CGROUPS. If another part of
the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
ifdefs should be changed to a higher-up flag that CGROUPS and the other system
would both depend on.

This is a pre-patch for cgroup-procs-write.patch.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
---
 include/linux/init_task.h |    9 +++++++++
 include/linux/sched.h     |   35 +++++++++++++++++++++++++++++++++++
 kernel/fork.c             |   10 ++++++++++
 3 files changed, 54 insertions(+), 0 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 1f43fa5..ca46711 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -15,6 +15,14 @@
 extern struct files_struct init_files;
 extern struct fs_struct init_fs;
 
+#ifdef CONFIG_CGROUPS
+#define INIT_THREADGROUP_FORK_LOCK(sig)					\
+	.threadgroup_fork_lock =					\
+		__RWSEM_INITIALIZER(sig.threadgroup_fork_lock),
+#else
+#define INIT_THREADGROUP_FORK_LOCK(sig)
+#endif
+
 #define INIT_SIGNALS(sig) {						\
 	.nr_threads	= 1,						\
 	.wait_chldexit	= __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
@@ -29,6 +37,7 @@ extern struct fs_struct init_fs;
 		.running = 0,						\
 		.lock = __SPIN_LOCK_UNLOCKED(sig.cputimer.lock),	\
 	},								\
+	INIT_THREADGROUP_FORK_LOCK(sig)					\
 }
 
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ae69716..ebd4af2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -619,6 +619,16 @@ struct signal_struct {
 	unsigned audit_tty;
 	struct tty_audit_buf *tty_audit_buf;
 #endif
+#ifdef CONFIG_CGROUPS
+	/*
+	 * The threadgroup_fork_lock prevents threads from forking with
+	 * CLONE_THREAD while held for writing. Use this for fork-sensitive
+	 * threadgroup-wide operations. It's taken for reading in fork.c in
+	 * copy_process().
+	 * Currently only needed write-side by cgroups.
+	 */
+	struct rw_semaphore threadgroup_fork_lock;
+#endif
 
 	int oom_adj;	/* OOM kill score adjustment (bit shift) */
 };
@@ -2216,6 +2226,31 @@ static inline void unlock_task_sighand(struct task_struct *tsk,
 	spin_unlock_irqrestore(&tsk->sighand->siglock, *flags);
 }
 
+/* See the declaration of threadgroup_fork_lock in signal_struct. */
+#ifdef CONFIG_CGROUPS
+static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
+{
+	down_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
+{
+	up_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_lock(struct task_struct *tsk)
+{
+	down_write(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_unlock(struct task_struct *tsk)
+{
+	up_write(&tsk->signal->threadgroup_fork_lock);
+}
+#else
+static inline void threadgroup_fork_read_lock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_read_unlock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_write_lock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_write_unlock(struct task_struct *tsk) {}
+#endif
+
 #ifndef __HAVE_THREAD_FUNCTIONS
 
 #define task_thread_info(task)	((struct thread_info *)(task)->stack)
diff --git a/kernel/fork.c b/kernel/fork.c
index a82a65c..41df253 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -898,6 +898,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 
 	tty_audit_fork(sig);
 
+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->threadgroup_fork_lock);
+#endif
+
 	sig->oom_adj = current->signal->oom_adj;
 
 	return 0;
@@ -1076,6 +1080,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	monotonic_to_bootbased(&p->real_start_time);
 	p->io_context = NULL;
 	p->audit_context = NULL;
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_lock(current);
 	cgroup_fork(p);
 #ifdef CONFIG_NUMA
 	p->mempolicy = mpol_dup(p->mempolicy);
@@ -1283,6 +1289,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	write_unlock_irq(&tasklist_lock);
 	proc_fork_connector(p);
 	cgroup_post_fork(p);
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_unlock(current);
 	perf_event_fork(p);
 	return p;
 
@@ -1316,6 +1324,8 @@ bad_fork_cleanup_policy:
 	mpol_put(p->mempolicy);
 bad_fork_cleanup_cgroup:
 #endif
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_unlock(current);
 	cgroup_exit(p, cgroup_callbacks_done);
 	delayacct_tsk_free(p);
 	module_put(task_thread_info(p)->exec_domain->module);

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v5 2/3] cgroups: add can_attach callback for checking all threads in a group
  2010-08-11  5:46     ` Ben Blum
@ 2010-08-11  5:48         ` Ben Blum
  -1 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-08-11  5:48 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

[-- Attachment #1: cgroup-threadgroup-callback.patch --]
[-- Type: text/plain, Size: 8211 bytes --]

Add cgroup wrapper for safely calling can_attach on all threads in a threadgroup

From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

This patch adds a function cgroup_can_attach_per_thread which handles iterating
over each thread in a threadgroup safely with respect to the invariants that
will be used in cgroup_attach_proc. Also, subsystems whose can_attach calls
require per-thread validation are modified to use the per_thread wrapper to
avoid duplicating cgroup-internal code.

This is a pre-patch for cgroup-procs-writable.patch.

Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
---
 include/linux/cgroup.h  |   12 ++++++++++++
 kernel/cgroup.c         |   35 +++++++++++++++++++++++++++++++++++
 kernel/cgroup_freezer.c |   27 ++++++++++++---------------
 kernel/cpuset.c         |   20 +++++++-------------
 kernel/ns_cgroup.c      |   27 +++++++++++++--------------
 kernel/sched.c          |   21 ++++++---------------
 6 files changed, 85 insertions(+), 57 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index e3d00fd..f040d66 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -580,6 +580,18 @@ int cgroup_scan_tasks(struct cgroup_scanner *scan);
 int cgroup_attach_task(struct cgroup *, struct task_struct *);
 
 /*
+ * For use in subsystems whose can_attach functions need to run an operation
+ * on every task in the threadgroup. Calls the given callback once if the
+ * 'threadgroup' flag is false, or once per thread in the group if true.
+ * The callback should return 0/-ERR; this will return 0/-ERR.
+ * The callback will run within an rcu_read section, so must not sleep.
+ */
+int cgroup_can_attach_per_thread(struct cgroup *cgrp, struct task_struct *task,
+				 int (*cb)(struct cgroup *cgrp,
+					   struct task_struct *task),
+				 bool threadgroup);
+
+/*
  * CSS ID is ID for cgroup_subsys_state structs under subsys. This only works
  * if cgroup_subsys.use_id == true. It can be used for looking up and scanning.
  * CSS ID is assigned at cgroup allocation (create) automatically
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f91d7dd..e8b8f71 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1688,6 +1688,41 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 }
 EXPORT_SYMBOL_GPL(cgroup_path);
 
+int cgroup_can_attach_per_thread(struct cgroup *cgrp, struct task_struct *task,
+				 int (*cb)(struct cgroup *cgrp,
+					   struct task_struct *task),
+				 bool threadgroup)
+{
+	/* Start by running on the leader, in all cases. */
+	int ret = cb(cgrp, task);
+	if (ret < 0)
+		return ret;
+
+	if (threadgroup) {
+		/* Run on each task in the threadgroup. */
+		struct task_struct *c;
+		rcu_read_lock();
+		/*
+		 * It is necessary for the given task to still be the leader
+		 * to safely traverse thread_group. See cgroup_attach_proc.
+		 */
+		if (!thread_group_leader(task)) {
+			rcu_read_unlock();
+			return -EAGAIN;
+		}
+		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
+			ret = cb(cgrp, c);
+			if (ret < 0) {
+				rcu_read_unlock();
+				return ret;
+			}
+		}
+		rcu_read_unlock();
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(cgroup_can_attach_per_thread);
+
 /**
  * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
  * @cgrp: the cgroup the task is attaching to
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index ce71ed5..677b24e 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -161,6 +161,13 @@ static bool is_task_frozen_enough(struct task_struct *task)
 		(task_is_stopped_or_traced(task) && freezing(task));
 }
 
+static int freezer_can_attach_cb(struct cgroup *cgrp, struct task_struct *task)
+{
+	if (is_task_frozen_enough(task))
+		return -EBUSY;
+	return 0;
+}
+
 /*
  * The call to cgroup_lock() in the freezer.state write method prevents
  * a write to that file racing against an attach, and hence the
@@ -171,6 +178,7 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 			      struct task_struct *task, bool threadgroup)
 {
 	struct freezer *freezer;
+	int ret;
 
 	/*
 	 * Anything frozen can't move or be moved to/from.
@@ -179,26 +187,15 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 	 * frozen, so it's sufficient to check the latter condition.
 	 */
 
-	if (is_task_frozen_enough(task))
-		return -EBUSY;
+	ret = cgroup_can_attach_per_thread(new_cgroup, task,
+					   freezer_can_attach_cb, threadgroup);
+	if (ret < 0)
+		return ret;
 
 	freezer = cgroup_freezer(new_cgroup);
 	if (freezer->state == CGROUP_FROZEN)
 		return -EBUSY;
 
-	if (threadgroup) {
-		struct task_struct *c;
-
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
-			if (is_task_frozen_enough(c)) {
-				rcu_read_unlock();
-				return -EBUSY;
-			}
-		}
-		rcu_read_unlock();
-	}
-
 	return 0;
 }
 
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index b23c097..cc4b1f7 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1376,6 +1376,11 @@ static int fmeter_getrate(struct fmeter *fmp)
 /* Protected by cgroup_lock */
 static cpumask_var_t cpus_attach;
 
+static int cpuset_can_attach_cb(struct cgroup *cgrp, struct task_struct *task)
+{
+	return security_task_setscheduler(task, 0, NULL);
+}
+
 /* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
 static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
 			     struct task_struct *tsk, bool threadgroup)
@@ -1397,22 +1402,11 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
 	if (tsk->flags & PF_THREAD_BOUND)
 		return -EINVAL;
 
-	ret = security_task_setscheduler(tsk, 0, NULL);
+	ret = cgroup_can_attach_per_thread(cont, tsk, cpuset_can_attach_cb,
+					   threadgroup);
 	if (ret)
 		return ret;
-	if (threadgroup) {
-		struct task_struct *c;
 
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			ret = security_task_setscheduler(c, 0, NULL);
-			if (ret) {
-				rcu_read_unlock();
-				return ret;
-			}
-		}
-		rcu_read_unlock();
-	}
 	return 0;
 }
 
diff --git a/kernel/ns_cgroup.c b/kernel/ns_cgroup.c
index 2a5dfec..af0accf 100644
--- a/kernel/ns_cgroup.c
+++ b/kernel/ns_cgroup.c
@@ -42,9 +42,18 @@ int ns_cgroup_clone(struct task_struct *task, struct pid *pid)
  *       (hence either you are in the same cgroup as task, or in an
  *        ancestor cgroup thereof)
  */
+static int ns_can_attach_cb(struct cgroup *new_cgroup, struct task_struct *task)
+{
+	if (!cgroup_is_descendant(new_cgroup, task))
+		return -EPERM;
+	return 0;
+}
+
 static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
 			 struct task_struct *task, bool threadgroup)
 {
+	int ret;
+
 	if (current != task) {
 		if (!capable(CAP_SYS_ADMIN))
 			return -EPERM;
@@ -53,20 +62,10 @@ static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
 			return -EPERM;
 	}
 
-	if (!cgroup_is_descendant(new_cgroup, task))
-		return -EPERM;
-
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
-			if (!cgroup_is_descendant(new_cgroup, c)) {
-				rcu_read_unlock();
-				return -EPERM;
-			}
-		}
-		rcu_read_unlock();
-	}
+	ret = cgroup_can_attach_per_thread(new_cgroup, task, ns_can_attach_cb,
+					   threadgroup);
+	if (ret < 0)
+		return ret;
 
 	return 0;
 }
diff --git a/kernel/sched.c b/kernel/sched.c
index 70fa78d..8330e6f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8715,21 +8715,12 @@ static int
 cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 		      struct task_struct *tsk, bool threadgroup)
 {
-	int retval = cpu_cgroup_can_attach_task(cgrp, tsk);
-	if (retval)
-		return retval;
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			retval = cpu_cgroup_can_attach_task(cgrp, c);
-			if (retval) {
-				rcu_read_unlock();
-				return retval;
-			}
-		}
-		rcu_read_unlock();
-	}
+	int ret = cgroup_can_attach_per_thread(cgrp, tsk,
+					       cpu_cgroup_can_attach_task,
+					       threadgroup);
+	if (ret)
+		return ret;
+
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v5 2/3] cgroups: add can_attach callback for checking all threads in a group
@ 2010-08-11  5:48         ` Ben Blum
  0 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-08-11  5:48 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage, oleg

[-- Attachment #1: cgroup-threadgroup-callback.patch --]
[-- Type: text/plain, Size: 8163 bytes --]

Add cgroup wrapper for safely calling can_attach on all threads in a threadgroup

From: Ben Blum <bblum@andrew.cmu.edu>

This patch adds a function cgroup_can_attach_per_thread which handles iterating
over each thread in a threadgroup safely with respect to the invariants that
will be used in cgroup_attach_proc. Also, subsystems whose can_attach calls
require per-thread validation are modified to use the per_thread wrapper to
avoid duplicating cgroup-internal code.

This is a pre-patch for cgroup-procs-writable.patch.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
---
 include/linux/cgroup.h  |   12 ++++++++++++
 kernel/cgroup.c         |   35 +++++++++++++++++++++++++++++++++++
 kernel/cgroup_freezer.c |   27 ++++++++++++---------------
 kernel/cpuset.c         |   20 +++++++-------------
 kernel/ns_cgroup.c      |   27 +++++++++++++--------------
 kernel/sched.c          |   21 ++++++---------------
 6 files changed, 85 insertions(+), 57 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index e3d00fd..f040d66 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -580,6 +580,18 @@ int cgroup_scan_tasks(struct cgroup_scanner *scan);
 int cgroup_attach_task(struct cgroup *, struct task_struct *);
 
 /*
+ * For use in subsystems whose can_attach functions need to run an operation
+ * on every task in the threadgroup. Calls the given callback once if the
+ * 'threadgroup' flag is false, or once per thread in the group if true.
+ * The callback should return 0/-ERR; this will return 0/-ERR.
+ * The callback will run within an rcu_read section, so must not sleep.
+ */
+int cgroup_can_attach_per_thread(struct cgroup *cgrp, struct task_struct *task,
+				 int (*cb)(struct cgroup *cgrp,
+					   struct task_struct *task),
+				 bool threadgroup);
+
+/*
  * CSS ID is ID for cgroup_subsys_state structs under subsys. This only works
  * if cgroup_subsys.use_id == true. It can be used for looking up and scanning.
  * CSS ID is assigned at cgroup allocation (create) automatically
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f91d7dd..e8b8f71 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1688,6 +1688,41 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 }
 EXPORT_SYMBOL_GPL(cgroup_path);
 
+int cgroup_can_attach_per_thread(struct cgroup *cgrp, struct task_struct *task,
+				 int (*cb)(struct cgroup *cgrp,
+					   struct task_struct *task),
+				 bool threadgroup)
+{
+	/* Start by running on the leader, in all cases. */
+	int ret = cb(cgrp, task);
+	if (ret < 0)
+		return ret;
+
+	if (threadgroup) {
+		/* Run on each task in the threadgroup. */
+		struct task_struct *c;
+		rcu_read_lock();
+		/*
+		 * It is necessary for the given task to still be the leader
+		 * to safely traverse thread_group. See cgroup_attach_proc.
+		 */
+		if (!thread_group_leader(task)) {
+			rcu_read_unlock();
+			return -EAGAIN;
+		}
+		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
+			ret = cb(cgrp, c);
+			if (ret < 0) {
+				rcu_read_unlock();
+				return ret;
+			}
+		}
+		rcu_read_unlock();
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(cgroup_can_attach_per_thread);
+
 /**
  * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
  * @cgrp: the cgroup the task is attaching to
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index ce71ed5..677b24e 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -161,6 +161,13 @@ static bool is_task_frozen_enough(struct task_struct *task)
 		(task_is_stopped_or_traced(task) && freezing(task));
 }
 
+static int freezer_can_attach_cb(struct cgroup *cgrp, struct task_struct *task)
+{
+	if (is_task_frozen_enough(task))
+		return -EBUSY;
+	return 0;
+}
+
 /*
  * The call to cgroup_lock() in the freezer.state write method prevents
  * a write to that file racing against an attach, and hence the
@@ -171,6 +178,7 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 			      struct task_struct *task, bool threadgroup)
 {
 	struct freezer *freezer;
+	int ret;
 
 	/*
 	 * Anything frozen can't move or be moved to/from.
@@ -179,26 +187,15 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 	 * frozen, so it's sufficient to check the latter condition.
 	 */
 
-	if (is_task_frozen_enough(task))
-		return -EBUSY;
+	ret = cgroup_can_attach_per_thread(new_cgroup, task,
+					   freezer_can_attach_cb, threadgroup);
+	if (ret < 0)
+		return ret;
 
 	freezer = cgroup_freezer(new_cgroup);
 	if (freezer->state == CGROUP_FROZEN)
 		return -EBUSY;
 
-	if (threadgroup) {
-		struct task_struct *c;
-
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
-			if (is_task_frozen_enough(c)) {
-				rcu_read_unlock();
-				return -EBUSY;
-			}
-		}
-		rcu_read_unlock();
-	}
-
 	return 0;
 }
 
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index b23c097..cc4b1f7 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1376,6 +1376,11 @@ static int fmeter_getrate(struct fmeter *fmp)
 /* Protected by cgroup_lock */
 static cpumask_var_t cpus_attach;
 
+static int cpuset_can_attach_cb(struct cgroup *cgrp, struct task_struct *task)
+{
+	return security_task_setscheduler(task, 0, NULL);
+}
+
 /* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
 static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
 			     struct task_struct *tsk, bool threadgroup)
@@ -1397,22 +1402,11 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
 	if (tsk->flags & PF_THREAD_BOUND)
 		return -EINVAL;
 
-	ret = security_task_setscheduler(tsk, 0, NULL);
+	ret = cgroup_can_attach_per_thread(cont, tsk, cpuset_can_attach_cb,
+					   threadgroup);
 	if (ret)
 		return ret;
-	if (threadgroup) {
-		struct task_struct *c;
 
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			ret = security_task_setscheduler(c, 0, NULL);
-			if (ret) {
-				rcu_read_unlock();
-				return ret;
-			}
-		}
-		rcu_read_unlock();
-	}
 	return 0;
 }
 
diff --git a/kernel/ns_cgroup.c b/kernel/ns_cgroup.c
index 2a5dfec..af0accf 100644
--- a/kernel/ns_cgroup.c
+++ b/kernel/ns_cgroup.c
@@ -42,9 +42,18 @@ int ns_cgroup_clone(struct task_struct *task, struct pid *pid)
  *       (hence either you are in the same cgroup as task, or in an
  *        ancestor cgroup thereof)
  */
+static int ns_can_attach_cb(struct cgroup *new_cgroup, struct task_struct *task)
+{
+	if (!cgroup_is_descendant(new_cgroup, task))
+		return -EPERM;
+	return 0;
+}
+
 static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
 			 struct task_struct *task, bool threadgroup)
 {
+	int ret;
+
 	if (current != task) {
 		if (!capable(CAP_SYS_ADMIN))
 			return -EPERM;
@@ -53,20 +62,10 @@ static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
 			return -EPERM;
 	}
 
-	if (!cgroup_is_descendant(new_cgroup, task))
-		return -EPERM;
-
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
-			if (!cgroup_is_descendant(new_cgroup, c)) {
-				rcu_read_unlock();
-				return -EPERM;
-			}
-		}
-		rcu_read_unlock();
-	}
+	ret = cgroup_can_attach_per_thread(new_cgroup, task, ns_can_attach_cb,
+					   threadgroup);
+	if (ret < 0)
+		return ret;
 
 	return 0;
 }
diff --git a/kernel/sched.c b/kernel/sched.c
index 70fa78d..8330e6f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8715,21 +8715,12 @@ static int
 cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 		      struct task_struct *tsk, bool threadgroup)
 {
-	int retval = cpu_cgroup_can_attach_task(cgrp, tsk);
-	if (retval)
-		return retval;
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			retval = cpu_cgroup_can_attach_task(cgrp, c);
-			if (retval) {
-				rcu_read_unlock();
-				return retval;
-			}
-		}
-		rcu_read_unlock();
-	}
+	int ret = cgroup_can_attach_per_thread(cgrp, tsk,
+					       cpu_cgroup_can_attach_task,
+					       threadgroup);
+	if (ret)
+		return ret;
+
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v5 3/3] cgroups: make procs file writable
       [not found]     ` <20100811054604.GA8743-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2010-08-11  5:47       ` Ben Blum
  2010-08-11  5:48         ` Ben Blum
@ 2010-08-11  5:48       ` Ben Blum
  2010-12-24  8:22         ` Ben Blum
  3 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-08-11  5:48 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

[-- Attachment #1: cgroup-procs-writable.patch --]
[-- Type: text/plain, Size: 17491 bytes --]

Makes procs file writable to move all threads by tgid at once

From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

This patch adds functionality that enables users to move all threads in a
threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
file. This current implementation makes use of a per-threadgroup rwsem that's
taken for reading in the fork() path to prevent newly forking threads within
the threadgroup from "escaping" while the move is in progress.
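
In outline, the commit phase of cgroup_attach_proc below nests its locks as
follows (a simplified, non-compilable sketch of the code in this patch; the
css_set prefetch and all error handling are elided):

	/* outline only; see cgroup_attach_proc in the diff for the real code */
	threadgroup_fork_write_lock(leader);
	read_lock(&tasklist_lock);
	/* bail out with -EAGAIN if we raced with de_thread() */
	if (!thread_group_leader(leader))
		goto abort;
	/* commit point: migrate the leader, then every other thread */
	cgroup_task_migrate(cgrp, task_cgroup_from_root(leader, root),
			    leader, true);
	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group)
		cgroup_task_migrate(cgrp, task_cgroup_from_root(tsk, root),
				    tsk, true);
	/* each subsystem's attach callback runs once for the whole group */
	for_each_subsys(root, ss)
		if (ss->attach)
			ss->attach(ss, cgrp, oldcgrp, leader, true);
	read_unlock(&tasklist_lock);
	threadgroup_fork_write_unlock(leader);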

Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
---
 Documentation/cgroups/cgroups.txt |   13 +
 kernel/cgroup.c                   |  424 +++++++++++++++++++++++++++++++++----
 2 files changed, 387 insertions(+), 50 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index b34823f..5f3c707 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -235,7 +235,8 @@ containing the following files describing that cgroup:
  - cgroup.procs: list of tgids in the cgroup.  This list is not
    guaranteed to be sorted or free of duplicate tgids, and userspace
    should sort/uniquify the list if this property is required.
-   This is a read-only file, for now.
+   Writing a thread group id into this file moves all threads in that
+   group into this cgroup.
  - notify_on_release flag: run the release agent on exit?
  - release_agent: the path to use for release notifications (this file
    exists in the top cgroup only)
@@ -416,6 +417,12 @@ You can attach the current shell task by echoing 0:
 
 # echo 0 > tasks
 
+You can use the cgroup.procs file instead of the tasks file to move all
+threads in a threadgroup at once. Echoing the pid of any task in a
+threadgroup to cgroup.procs causes all tasks in that threadgroup to be
+attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
+in the writing task's threadgroup.
+
 2.3 Mounting hierarchies by name
 --------------------------------
 
@@ -564,7 +571,9 @@ called on a fork. If this method returns 0 (success) then this should
 remain valid while the caller holds cgroup_mutex and it is ensured that either
 attach() or cancel_attach() will be called in future. If threadgroup is
 true, then a successful result indicates that all threads in the given
-thread's threadgroup can be moved together.
+thread's threadgroup can be moved together. If the subsystem wants to
+iterate over task->thread_group, it must take rcu_read_lock then check
+if thread_group_leader(task), returning -EAGAIN if that fails.
 
 void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	       struct task_struct *task, bool threadgroup)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index e8b8f71..586dbb7 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1723,6 +1723,76 @@ int cgroup_can_attach_per_thread(struct cgroup *cgrp, struct task_struct *task,
 }
 EXPORT_SYMBOL_GPL(cgroup_can_attach_per_thread);
 
+/*
+ * cgroup_task_migrate - move a task from one cgroup to another.
+ *
+ * 'guarantee' is set if the caller promises that a new css_set for the task
+ * will already exist. If not set, this function might sleep, and can fail with
+ * -ENOMEM. Otherwise, it can only fail with -ESRCH.
+ */
+static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
+			       struct task_struct *tsk, bool guarantee)
+{
+	struct css_set *oldcg;
+	struct css_set *newcg;
+
+	/*
+	 * get old css_set. we need to take task_lock and refcount it, because
+	 * an exiting task can change its css_set to init_css_set and drop its
+	 * old one without taking cgroup_mutex.
+	 */
+	task_lock(tsk);
+	oldcg = tsk->cgroups;
+	get_css_set(oldcg);
+	task_unlock(tsk);
+
+	/* locate or allocate a new css_set for this task. */
+	if (guarantee) {
+		/* we know the css_set we want already exists. */
+		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+		read_lock(&css_set_lock);
+		newcg = find_existing_css_set(oldcg, cgrp, template);
+		BUG_ON(!newcg);
+		get_css_set(newcg);
+		read_unlock(&css_set_lock);
+	} else {
+		might_sleep();
+		/* find_css_set will give us newcg already referenced. */
+		newcg = find_css_set(oldcg, cgrp);
+		if (!newcg) {
+			put_css_set(oldcg);
+			return -ENOMEM;
+		}
+	}
+	put_css_set(oldcg);
+
+	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
+	task_lock(tsk);
+	if (tsk->flags & PF_EXITING) {
+		task_unlock(tsk);
+		put_css_set(newcg);
+		return -ESRCH;
+	}
+	rcu_assign_pointer(tsk->cgroups, newcg);
+	task_unlock(tsk);
+
+	/* Update the css_set linked lists if we're using them */
+	write_lock(&css_set_lock);
+	if (!list_empty(&tsk->cg_list))
+		list_move(&tsk->cg_list, &newcg->tasks);
+	write_unlock(&css_set_lock);
+
+	/*
+	 * We just gained a reference on oldcg by taking it from the task. As
+	 * trading it for newcg is protected by cgroup_mutex, we're safe to drop
+	 * it here; it will be freed under RCU.
+	 */
+	put_css_set(oldcg);
+
+	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+	return 0;
+}
+
 /**
  * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
  * @cgrp: the cgroup the task is attaching to
@@ -1733,11 +1803,9 @@ EXPORT_SYMBOL_GPL(cgroup_can_attach_per_thread);
  */
 int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
-	int retval = 0;
+	int retval;
 	struct cgroup_subsys *ss, *failed_ss = NULL;
 	struct cgroup *oldcgrp;
-	struct css_set *cg;
-	struct css_set *newcg;
 	struct cgroupfs_root *root = cgrp->root;
 
 	/* Nothing to do if the task is already in that cgroup */
@@ -1761,46 +1829,16 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		}
 	}
 
-	task_lock(tsk);
-	cg = tsk->cgroups;
-	get_css_set(cg);
-	task_unlock(tsk);
-	/*
-	 * Locate or allocate a new css_set for this task,
-	 * based on its final set of cgroups
-	 */
-	newcg = find_css_set(cg, cgrp);
-	put_css_set(cg);
-	if (!newcg) {
-		retval = -ENOMEM;
-		goto out;
-	}
-
-	task_lock(tsk);
-	if (tsk->flags & PF_EXITING) {
-		task_unlock(tsk);
-		put_css_set(newcg);
-		retval = -ESRCH;
+	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, false);
+	if (retval)
 		goto out;
-	}
-	rcu_assign_pointer(tsk->cgroups, newcg);
-	task_unlock(tsk);
-
-	/* Update the css_set linked lists if we're using them */
-	write_lock(&css_set_lock);
-	if (!list_empty(&tsk->cg_list)) {
-		list_del(&tsk->cg_list);
-		list_add(&tsk->cg_list, &newcg->tasks);
-	}
-	write_unlock(&css_set_lock);
 
 	for_each_subsys(root, ss) {
 		if (ss->attach)
 			ss->attach(ss, cgrp, oldcgrp, tsk, false);
 	}
-	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+
 	synchronize_rcu();
-	put_css_set(cg);
 
 	/*
 	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
@@ -1826,49 +1864,339 @@ out:
 }
 
 /*
- * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
- * held. May take task_lock of task
+ * cgroup_attach_proc works in two stages, the first of which prefetches all
+ * new css_sets needed (to make sure we have enough memory before committing
+ * to the move) and stores them in a list of entries of the following type.
+ * TODO: possible optimization: use css_set->rcu_head for chaining instead
+ */
+struct cg_list_entry {
+	struct css_set *cg;
+	struct list_head links;
+};
+
+static bool css_set_check_fetched(struct cgroup *cgrp,
+				  struct task_struct *tsk, struct css_set *cg,
+				  struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+
+	read_lock(&css_set_lock);
+	newcg = find_existing_css_set(cg, cgrp, template);
+	if (newcg)
+		get_css_set(newcg);
+	read_unlock(&css_set_lock);
+
+	/* doesn't exist at all? */
+	if (!newcg)
+		return false;
+	/* see if it's already in the list */
+	list_for_each_entry(cg_entry, newcg_list, links) {
+		if (cg_entry->cg == newcg) {
+			put_css_set(newcg);
+			return true;
+		}
+	}
+
+	/* not found */
+	put_css_set(newcg);
+	return false;
+}
+
+/*
+ * Find the new css_set and store it in the list in preparation for moving the
+ * given task to the given cgroup. Returns 0 or -ENOMEM.
  */
-static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
+static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
+			    struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+
+	/* ensure a new css_set will exist for this thread */
+	newcg = find_css_set(cg, cgrp);
+	if (!newcg)
+		return -ENOMEM;
+	/* add it to the list */
+	cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
+	if (!cg_entry) {
+		put_css_set(newcg);
+		return -ENOMEM;
+	}
+	cg_entry->cg = newcg;
+	list_add(&cg_entry->links, newcg_list);
+	return 0;
+}
+
+/**
+ * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
+ * @cgrp: the cgroup to attach to
+ * @leader: the threadgroup leader task_struct of the group to be attached
+ *
+ * Call holding cgroup_mutex. Will take task_lock of each thread in leader's
+ * threadgroup individually in turn.
+ */
+int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
+{
+	int retval;
+	struct cgroup_subsys *ss, *failed_ss = NULL;
+	struct cgroup *oldcgrp;
+	struct css_set *oldcg;
+	struct cgroupfs_root *root = cgrp->root;
+	/* threadgroup list cursor */
+	struct task_struct *tsk;
+	/*
+	 * we need to make sure we have css_sets for all the tasks we're
+	 * going to move -before- we actually start moving them, so that in
+	 * case we get an ENOMEM we can bail out before making any changes.
+	 */
+	struct list_head newcg_list;
+	struct cg_list_entry *cg_entry, *temp_nobe;
+
+	/* check that we can legitimately attach to the cgroup. */
+	for_each_subsys(root, ss) {
+		if (ss->can_attach) {
+			retval = ss->can_attach(ss, cgrp, leader, true);
+			if (retval) {
+				failed_ss = ss;
+				goto out;
+			}
+		}
+	}
+
+	/*
+	 * step 1: make sure css_sets exist for all threads to be migrated.
+	 * we use find_css_set, which allocates a new one if necessary.
+	 */
+	INIT_LIST_HEAD(&newcg_list);
+	oldcgrp = task_cgroup_from_root(leader, root);
+	if (cgrp != oldcgrp) {
+		/* get old css_set */
+		task_lock(leader);
+		if (leader->flags & PF_EXITING) {
+			task_unlock(leader);
+			goto prefetch_loop;
+		}
+		oldcg = leader->cgroups;
+		get_css_set(oldcg);
+		task_unlock(leader);
+		/* acquire new one */
+		retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+		put_css_set(oldcg);
+		if (retval)
+			goto list_teardown;
+	}
+prefetch_loop:
+	rcu_read_lock();
+	/* sanity check - if we raced with de_thread, we must abort */
+	if (!thread_group_leader(leader)) {
+		retval = -EAGAIN;
+		goto list_teardown;
+	}
+	/*
+	 * if we need to fetch a new css_set for this task, we must exit the
+	 * rcu_read section because allocating it can sleep. afterwards, we'll
+	 * need to restart iteration on the threadgroup list - the whole thing
+	 * will be O(nm) in the number of threads and css_sets; as the typical
+	 * case has only one css_set for all of them, usually O(n). which ones
+	 * we need allocated won't change as long as we hold cgroup_mutex.
+	 */
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* nothing to do if this task is already in the cgroup */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* get old css_set pointer */
+		task_lock(tsk);
+		if (tsk->flags & PF_EXITING) {
+			/* ignore this task if it's going away */
+			task_unlock(tsk);
+			continue;
+		}
+		oldcg = tsk->cgroups;
+		get_css_set(oldcg);
+		task_unlock(tsk);
+		/* see if the new one for us is already in the list? */
+		if (css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list)) {
+			/* was already there, nothing to do. */
+			put_css_set(oldcg);
+		} else {
+			/* we don't already have it. get new one. */
+			rcu_read_unlock();
+			retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+			put_css_set(oldcg);
+			if (retval)
+				goto list_teardown;
+			/* begin iteration again. */
+			goto prefetch_loop;
+		}
+	}
+	rcu_read_unlock();
+
+	/*
+	 * step 2: now that we're guaranteed success wrt the css_sets, proceed
+	 * to move all tasks to the new cgroup. we need to lock against possible
+	 * races with fork(). note: we can safely take the threadgroup_fork_lock
+	 * of leader since attach_task_by_pid took a reference.
+	 * threadgroup_fork_lock must be taken outside of tasklist_lock to match
+	 * the order in the fork path.
+	 */
+	threadgroup_fork_write_lock(leader);
+	read_lock(&tasklist_lock);
+	/* sanity check - if we raced with de_thread, we must abort */
+	if (!thread_group_leader(leader)) {
+		retval = -EAGAIN;
+		read_unlock(&tasklist_lock);
+		threadgroup_fork_write_unlock(leader);
+		goto list_teardown;
+	}
+	/*
+	 * No failure cases left, so this is the commit point.
+	 *
+	 * If the leader is already there, skip moving him. Note: even if the
+	 * leader is PF_EXITING, we still move all other threads; if everybody
+	 * is PF_EXITING, we end up doing nothing, which is ok.
+	 */
+	oldcgrp = task_cgroup_from_root(leader, root);
+	if (cgrp != oldcgrp) {
+		retval = cgroup_task_migrate(cgrp, oldcgrp, leader, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+	/* Now iterate over each thread in the group. */
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* leave current thread as it is if it's already there */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* we don't care whether these threads are exiting */
+		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+
+	/*
+	 * step 3: attach whole threadgroup to each subsystem
+	 * TODO: if ever a subsystem needs to know the oldcgrp for each task
+	 * being moved, this call will need to be reworked to communicate that.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->attach)
+			ss->attach(ss, cgrp, oldcgrp, leader, true);
+	}
+	/* holding these until here keeps us safe from exec() and fork(). */
+	read_unlock(&tasklist_lock);
+	threadgroup_fork_write_unlock(leader);
+
+	/*
+	 * step 4: success! and cleanup
+	 */
+	synchronize_rcu();
+	cgroup_wakeup_rmdir_waiter(cgrp);
+	retval = 0;
+list_teardown:
+	/* clean up the list of prefetched css_sets. */
+	list_for_each_entry_safe(cg_entry, temp_nobe, &newcg_list, links) {
+		list_del(&cg_entry->links);
+		put_css_set(cg_entry->cg);
+		kfree(cg_entry);
+	}
+out:
+	if (retval) {
+		/* same deal as in cgroup_attach_task, with threadgroup=true */
+		for_each_subsys(root, ss) {
+			if (ss == failed_ss)
+				break;
+			if (ss->cancel_attach)
+				ss->cancel_attach(ss, cgrp, leader, true);
+		}
+	}
+	return retval;
+}
+
+/*
+ * Find the task_struct of the task to attach by vpid and pass it along to the
+ * function to attach either it or all tasks in its threadgroup. Will take
+ * cgroup_mutex; may take task_lock of task.
+ */
+static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 {
 	struct task_struct *tsk;
 	const struct cred *cred = current_cred(), *tcred;
 	int ret;
 
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+
 	if (pid) {
 		rcu_read_lock();
 		tsk = find_task_by_vpid(pid);
-		if (!tsk || tsk->flags & PF_EXITING) {
+		if (!tsk) {
+			rcu_read_unlock();
+			cgroup_unlock();
+			return -ESRCH;
+		}
+		if (threadgroup) {
+			/*
+			 * it is safe to find group_leader because tsk was found
+			 * in the tid map, meaning it can't have been unhashed
+			 * by someone in de_thread changing the leadership.
+			 */
+			tsk = tsk->group_leader;
+			BUG_ON(!thread_group_leader(tsk));
+		} else if (tsk->flags & PF_EXITING) {
+			/* optimization for the single-task-only case */
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -ESRCH;
 		}
 
+		/*
+		 * even if we're attaching all tasks in the thread group, we
+		 * only need to check permissions on one of them.
+		 */
 		tcred = __task_cred(tsk);
 		if (cred->euid &&
 		    cred->euid != tcred->uid &&
 		    cred->euid != tcred->suid) {
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -EACCES;
 		}
 		get_task_struct(tsk);
 		rcu_read_unlock();
 	} else {
-		tsk = current;
+		if (threadgroup)
+			tsk = current->group_leader;
+		else
+			tsk = current;
 		get_task_struct(tsk);
 	}
 
-	ret = cgroup_attach_task(cgrp, tsk);
+	if (threadgroup)
+		ret = cgroup_attach_proc(cgrp, tsk);
+	else
+		ret = cgroup_attach_task(cgrp, tsk);
 	put_task_struct(tsk);
+	cgroup_unlock();
 	return ret;
 }
 
 static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
 {
+	return attach_task_by_pid(cgrp, pid, false);
+}
+
+static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
+{
 	int ret;
-	if (!cgroup_lock_live_group(cgrp))
-		return -ENODEV;
-	ret = attach_task_by_pid(cgrp, pid);
-	cgroup_unlock();
+	do {
+		/*
+		 * attach_proc fails with -EAGAIN if threadgroup leadership
+		 * changes in the middle of the operation, in which case we need
+		 * to find the task_struct for the new leader and start over.
+		 */
+		ret = attach_task_by_pid(cgrp, tgid, true);
+	} while (ret == -EAGAIN);
 	return ret;
 }
 
@@ -3203,9 +3531,9 @@ static struct cftype files[] = {
 	{
 		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
 		.open = cgroup_procs_open,
-		/* .write_u64 = cgroup_procs_write, TODO */
+		.write_u64 = cgroup_procs_write,
 		.release = cgroup_pidlist_release,
-		.mode = S_IRUGO,
+		.mode = S_IRUGO | S_IWUSR,
 	},
 	{
 		.name = "notify_on_release",

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v5 3/3] cgroups: make procs file writable
  2010-08-11  5:46     ` Ben Blum
                       ` (2 preceding siblings ...)
  (?)
@ 2010-08-11  5:48     ` Ben Blum
       [not found]       ` <20100811054851.GD8743-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2010-08-24 18:08       ` Paul Menage
  -1 siblings, 2 replies; 185+ messages in thread
From: Ben Blum @ 2010-08-11  5:48 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage, oleg

[-- Attachment #1: cgroup-procs-writable.patch --]
[-- Type: text/plain, Size: 17441 bytes --]

Makes procs file writable to move all threads by tgid at once

From: Ben Blum <bblum@andrew.cmu.edu>

This patch adds functionality that enables users to move all threads in a
threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
file. This current implementation makes use of a per-threadgroup rwsem that's
taken for reading in the fork() path to prevent newly forking threads within
the threadgroup from "escaping" while the move is in progress.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
---
 Documentation/cgroups/cgroups.txt |   13 +
 kernel/cgroup.c                   |  424 +++++++++++++++++++++++++++++++++----
 2 files changed, 387 insertions(+), 50 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index b34823f..5f3c707 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -235,7 +235,8 @@ containing the following files describing that cgroup:
  - cgroup.procs: list of tgids in the cgroup.  This list is not
    guaranteed to be sorted or free of duplicate tgids, and userspace
    should sort/uniquify the list if this property is required.
-   This is a read-only file, for now.
+   Writing a thread group id into this file moves all threads in that
+   group into this cgroup.
  - notify_on_release flag: run the release agent on exit?
  - release_agent: the path to use for release notifications (this file
    exists in the top cgroup only)
@@ -416,6 +417,12 @@ You can attach the current shell task by echoing 0:
 
 # echo 0 > tasks
 
+You can use the cgroup.procs file instead of the tasks file to move all
+threads in a threadgroup at once. Echoing the pid of any task in a
+threadgroup to cgroup.procs causes all tasks in that threadgroup to be
+attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
+in the writing task's threadgroup.
+
 2.3 Mounting hierarchies by name
 --------------------------------
 
@@ -564,7 +571,9 @@ called on a fork. If this method returns 0 (success) then this should
 remain valid while the caller holds cgroup_mutex and it is ensured that either
 attach() or cancel_attach() will be called in future. If threadgroup is
 true, then a successful result indicates that all threads in the given
-thread's threadgroup can be moved together.
+thread's threadgroup can be moved together. If the subsystem wants to
+iterate over task->thread_group, it must take rcu_read_lock then check
+if thread_group_leader(task), returning -EAGAIN if that fails.
 
 void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	       struct task_struct *task, bool threadgroup)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index e8b8f71..586dbb7 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1723,6 +1723,76 @@ int cgroup_can_attach_per_thread(struct cgroup *cgrp, struct task_struct *task,
 }
 EXPORT_SYMBOL_GPL(cgroup_can_attach_per_thread);
 
+/*
+ * cgroup_task_migrate - move a task from one cgroup to another.
+ *
+ * 'guarantee' is set if the caller promises that a new css_set for the task
+ * will already exist. If not set, this function might sleep, and can fail with
+ * -ENOMEM. Otherwise, it can only fail with -ESRCH.
+ */
+static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
+			       struct task_struct *tsk, bool guarantee)
+{
+	struct css_set *oldcg;
+	struct css_set *newcg;
+
+	/*
+	 * get old css_set. we need to take task_lock and refcount it, because
+	 * an exiting task can change its css_set to init_css_set and drop its
+	 * old one without taking cgroup_mutex.
+	 */
+	task_lock(tsk);
+	oldcg = tsk->cgroups;
+	get_css_set(oldcg);
+	task_unlock(tsk);
+
+	/* locate or allocate a new css_set for this task. */
+	if (guarantee) {
+		/* we know the css_set we want already exists. */
+		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+		read_lock(&css_set_lock);
+		newcg = find_existing_css_set(oldcg, cgrp, template);
+		BUG_ON(!newcg);
+		get_css_set(newcg);
+		read_unlock(&css_set_lock);
+	} else {
+		might_sleep();
+		/* find_css_set will give us newcg already referenced. */
+		newcg = find_css_set(oldcg, cgrp);
+		if (!newcg) {
+			put_css_set(oldcg);
+			return -ENOMEM;
+		}
+	}
+	put_css_set(oldcg);
+
+	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
+	task_lock(tsk);
+	if (tsk->flags & PF_EXITING) {
+		task_unlock(tsk);
+		put_css_set(newcg);
+		return -ESRCH;
+	}
+	rcu_assign_pointer(tsk->cgroups, newcg);
+	task_unlock(tsk);
+
+	/* Update the css_set linked lists if we're using them */
+	write_lock(&css_set_lock);
+	if (!list_empty(&tsk->cg_list))
+		list_move(&tsk->cg_list, &newcg->tasks);
+	write_unlock(&css_set_lock);
+
+	/*
+	 * We just gained a reference on oldcg by taking it from the task. As
+	 * trading it for newcg is protected by cgroup_mutex, we're safe to drop
+	 * it here; it will be freed under RCU.
+	 */
+	put_css_set(oldcg);
+
+	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+	return 0;
+}
+
 /**
  * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
  * @cgrp: the cgroup the task is attaching to
@@ -1733,11 +1803,9 @@ EXPORT_SYMBOL_GPL(cgroup_can_attach_per_thread);
  */
 int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
-	int retval = 0;
+	int retval;
 	struct cgroup_subsys *ss, *failed_ss = NULL;
 	struct cgroup *oldcgrp;
-	struct css_set *cg;
-	struct css_set *newcg;
 	struct cgroupfs_root *root = cgrp->root;
 
 	/* Nothing to do if the task is already in that cgroup */
@@ -1761,46 +1829,16 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		}
 	}
 
-	task_lock(tsk);
-	cg = tsk->cgroups;
-	get_css_set(cg);
-	task_unlock(tsk);
-	/*
-	 * Locate or allocate a new css_set for this task,
-	 * based on its final set of cgroups
-	 */
-	newcg = find_css_set(cg, cgrp);
-	put_css_set(cg);
-	if (!newcg) {
-		retval = -ENOMEM;
-		goto out;
-	}
-
-	task_lock(tsk);
-	if (tsk->flags & PF_EXITING) {
-		task_unlock(tsk);
-		put_css_set(newcg);
-		retval = -ESRCH;
+	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, false);
+	if (retval)
 		goto out;
-	}
-	rcu_assign_pointer(tsk->cgroups, newcg);
-	task_unlock(tsk);
-
-	/* Update the css_set linked lists if we're using them */
-	write_lock(&css_set_lock);
-	if (!list_empty(&tsk->cg_list)) {
-		list_del(&tsk->cg_list);
-		list_add(&tsk->cg_list, &newcg->tasks);
-	}
-	write_unlock(&css_set_lock);
 
 	for_each_subsys(root, ss) {
 		if (ss->attach)
 			ss->attach(ss, cgrp, oldcgrp, tsk, false);
 	}
-	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+
 	synchronize_rcu();
-	put_css_set(cg);
 
 	/*
 	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
@@ -1826,49 +1864,339 @@ out:
 }
 
 /*
- * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
- * held. May take task_lock of task
+ * cgroup_attach_proc works in two stages, the first of which prefetches all
+ * new css_sets needed (to make sure we have enough memory before committing
+ * to the move) and stores them in a list of entries of the following type.
+ * TODO: possible optimization: use css_set->rcu_head for chaining instead
+ */
+struct cg_list_entry {
+	struct css_set *cg;
+	struct list_head links;
+};
+
+static bool css_set_check_fetched(struct cgroup *cgrp,
+				  struct task_struct *tsk, struct css_set *cg,
+				  struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+
+	read_lock(&css_set_lock);
+	newcg = find_existing_css_set(cg, cgrp, template);
+	if (newcg)
+		get_css_set(newcg);
+	read_unlock(&css_set_lock);
+
+	/* doesn't exist at all? */
+	if (!newcg)
+		return false;
+	/* see if it's already in the list */
+	list_for_each_entry(cg_entry, newcg_list, links) {
+		if (cg_entry->cg == newcg) {
+			put_css_set(newcg);
+			return true;
+		}
+	}
+
+	/* not found */
+	put_css_set(newcg);
+	return false;
+}
+
+/*
+ * Find the new css_set and store it in the list in preparation for moving the
+ * given task to the given cgroup. Returns 0 or -ENOMEM.
  */
-static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
+static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
+			    struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+
+	/* ensure a new css_set will exist for this thread */
+	newcg = find_css_set(cg, cgrp);
+	if (!newcg)
+		return -ENOMEM;
+	/* add it to the list */
+	cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
+	if (!cg_entry) {
+		put_css_set(newcg);
+		return -ENOMEM;
+	}
+	cg_entry->cg = newcg;
+	list_add(&cg_entry->links, newcg_list);
+	return 0;
+}
+
+/**
+ * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
+ * @cgrp: the cgroup to attach to
+ * @leader: the threadgroup leader task_struct of the group to be attached
+ *
+ * Call holding cgroup_mutex. Will take task_lock of each thread in leader's
+ * threadgroup individually in turn.
+ */
+int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
+{
+	int retval;
+	struct cgroup_subsys *ss, *failed_ss = NULL;
+	struct cgroup *oldcgrp;
+	struct css_set *oldcg;
+	struct cgroupfs_root *root = cgrp->root;
+	/* threadgroup list cursor */
+	struct task_struct *tsk;
+	/*
+	 * we need to make sure we have css_sets for all the tasks we're
+	 * going to move -before- we actually start moving them, so that in
+	 * case we get an ENOMEM we can bail out before making any changes.
+	 */
+	struct list_head newcg_list;
+	struct cg_list_entry *cg_entry, *temp_nobe;
+
+	/* check that we can legitimately attach to the cgroup. */
+	for_each_subsys(root, ss) {
+		if (ss->can_attach) {
+			retval = ss->can_attach(ss, cgrp, leader, true);
+			if (retval) {
+				failed_ss = ss;
+				goto out;
+			}
+		}
+	}
+
+	/*
+	 * step 1: make sure css_sets exist for all threads to be migrated.
+	 * we use find_css_set, which allocates a new one if necessary.
+	 */
+	INIT_LIST_HEAD(&newcg_list);
+	oldcgrp = task_cgroup_from_root(leader, root);
+	if (cgrp != oldcgrp) {
+		/* get old css_set */
+		task_lock(leader);
+		if (leader->flags & PF_EXITING) {
+			task_unlock(leader);
+			goto prefetch_loop;
+		}
+		oldcg = leader->cgroups;
+		get_css_set(oldcg);
+		task_unlock(leader);
+		/* acquire new one */
+		retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+		put_css_set(oldcg);
+		if (retval)
+			goto list_teardown;
+	}
+prefetch_loop:
+	rcu_read_lock();
+	/* sanity check - if we raced with de_thread, we must abort */
+	if (!thread_group_leader(leader)) {
+		retval = -EAGAIN;
+		goto list_teardown;
+	}
+	/*
+	 * if we need to fetch a new css_set for this task, we must exit the
+	 * rcu_read section because allocating it can sleep. afterwards, we'll
+	 * need to restart iteration on the threadgroup list - the whole thing
+	 * will be O(nm) in the number of threads and css_sets; as the typical
+	 * case has only one css_set for all of them, usually O(n). which ones
+	 * we need allocated won't change as long as we hold cgroup_mutex.
+	 */
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* nothing to do if this task is already in the cgroup */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* get old css_set pointer */
+		task_lock(tsk);
+		if (tsk->flags & PF_EXITING) {
+			/* ignore this task if it's going away */
+			task_unlock(tsk);
+			continue;
+		}
+		oldcg = tsk->cgroups;
+		get_css_set(oldcg);
+		task_unlock(tsk);
+		/* see if the new one for us is already in the list? */
+		if (css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list)) {
+			/* was already there, nothing to do. */
+			put_css_set(oldcg);
+		} else {
+			/* we don't already have it. get new one. */
+			rcu_read_unlock();
+			retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+			put_css_set(oldcg);
+			if (retval)
+				goto list_teardown;
+			/* begin iteration again. */
+			goto prefetch_loop;
+		}
+	}
+	rcu_read_unlock();
+
+	/*
+	 * step 2: now that we're guaranteed success wrt the css_sets, proceed
+	 * to move all tasks to the new cgroup. we need to lock against possible
+	 * races with fork(). note: we can safely take the threadgroup_fork_lock
+	 * of leader since attach_task_by_pid took a reference.
+	 * threadgroup_fork_lock must be taken outside of tasklist_lock to match
+	 * the order in the fork path.
+	 */
+	threadgroup_fork_write_lock(leader);
+	read_lock(&tasklist_lock);
+	/* sanity check - if we raced with de_thread, we must abort */
+	if (!thread_group_leader(leader)) {
+		retval = -EAGAIN;
+		read_unlock(&tasklist_lock);
+		threadgroup_fork_write_unlock(leader);
+		goto list_teardown;
+	}
+	/*
+	 * No failure cases left, so this is the commit point.
+	 *
+	 * If the leader is already there, skip moving him. Note: even if the
+	 * leader is PF_EXITING, we still move all other threads; if everybody
+	 * is PF_EXITING, we end up doing nothing, which is ok.
+	 */
+	oldcgrp = task_cgroup_from_root(leader, root);
+	if (cgrp != oldcgrp) {
+		retval = cgroup_task_migrate(cgrp, oldcgrp, leader, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+	/* Now iterate over each thread in the group. */
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* leave current thread as it is if it's already there */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* we don't care whether these threads are exiting */
+		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+
+	/*
+	 * step 3: attach whole threadgroup to each subsystem
+	 * TODO: if ever a subsystem needs to know the oldcgrp for each task
+	 * being moved, this call will need to be reworked to communicate that.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->attach)
+			ss->attach(ss, cgrp, oldcgrp, leader, true);
+	}
+	/* holding these until here keeps us safe from exec() and fork(). */
+	read_unlock(&tasklist_lock);
+	threadgroup_fork_write_unlock(leader);
+
+	/*
+	 * step 4: success! and cleanup
+	 */
+	synchronize_rcu();
+	cgroup_wakeup_rmdir_waiter(cgrp);
+	retval = 0;
+list_teardown:
+	/* clean up the list of prefetched css_sets. */
+	list_for_each_entry_safe(cg_entry, temp_nobe, &newcg_list, links) {
+		list_del(&cg_entry->links);
+		put_css_set(cg_entry->cg);
+		kfree(cg_entry);
+	}
+out:
+	if (retval) {
+		/* same deal as in cgroup_attach_task, with threadgroup=true */
+		for_each_subsys(root, ss) {
+			if (ss == failed_ss)
+				break;
+			if (ss->cancel_attach)
+				ss->cancel_attach(ss, cgrp, leader, true);
+		}
+	}
+	return retval;
+}
+
+/*
+ * Find the task_struct of the task to attach by vpid and pass it along to the
+ * function to attach either it or all tasks in its threadgroup. Will take
+ * cgroup_mutex; may take task_lock of task.
+ */
+static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 {
 	struct task_struct *tsk;
 	const struct cred *cred = current_cred(), *tcred;
 	int ret;
 
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+
 	if (pid) {
 		rcu_read_lock();
 		tsk = find_task_by_vpid(pid);
-		if (!tsk || tsk->flags & PF_EXITING) {
+		if (!tsk) {
+			rcu_read_unlock();
+			cgroup_unlock();
+			return -ESRCH;
+		}
+		if (threadgroup) {
+			/*
+			 * it is safe to find group_leader because tsk was found
+			 * in the tid map, meaning it can't have been unhashed
+			 * by someone in de_thread changing the leadership.
+			 */
+			tsk = tsk->group_leader;
+			BUG_ON(!thread_group_leader(tsk));
+		} else if (tsk->flags & PF_EXITING) {
+			/* optimization for the single-task-only case */
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -ESRCH;
 		}
 
+		/*
+		 * even if we're attaching all tasks in the thread group, we
+		 * only need to check permissions on one of them.
+		 */
 		tcred = __task_cred(tsk);
 		if (cred->euid &&
 		    cred->euid != tcred->uid &&
 		    cred->euid != tcred->suid) {
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -EACCES;
 		}
 		get_task_struct(tsk);
 		rcu_read_unlock();
 	} else {
-		tsk = current;
+		if (threadgroup)
+			tsk = current->group_leader;
+		else
+			tsk = current;
 		get_task_struct(tsk);
 	}
 
-	ret = cgroup_attach_task(cgrp, tsk);
+	if (threadgroup)
+		ret = cgroup_attach_proc(cgrp, tsk);
+	else
+		ret = cgroup_attach_task(cgrp, tsk);
 	put_task_struct(tsk);
+	cgroup_unlock();
 	return ret;
 }
 
 static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
 {
+	return attach_task_by_pid(cgrp, pid, false);
+}
+
+static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
+{
 	int ret;
-	if (!cgroup_lock_live_group(cgrp))
-		return -ENODEV;
-	ret = attach_task_by_pid(cgrp, pid);
-	cgroup_unlock();
+	do {
+		/*
+		 * attach_proc fails with -EAGAIN if threadgroup leadership
+		 * changes in the middle of the operation, in which case we need
+		 * to find the task_struct for the new leader and start over.
+		 */
+		ret = attach_task_by_pid(cgrp, tgid, true);
+	} while (ret == -EAGAIN);
 	return ret;
 }
 
@@ -3203,9 +3531,9 @@ static struct cftype files[] = {
 	{
 		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
 		.open = cgroup_procs_open,
-		/* .write_u64 = cgroup_procs_write, TODO */
+		.write_u64 = cgroup_procs_write,
 		.release = cgroup_pidlist_release,
-		.mode = S_IRUGO,
+		.mode = S_IRUGO | S_IWUSR,
 	},
 	{
 		.name = "notify_on_release",

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 2/3] cgroups: add can_attach callback for checking all threads in a group
       [not found]         ` <20100811054814.GC8743-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-08-23 23:31           ` Paul Menage
  0 siblings, 0 replies; 185+ messages in thread
From: Paul Menage @ 2010-08-23 23:31 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Tue, Aug 10, 2010 at 10:48 PM, Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
>
> Add cgroup wrapper for safely calling can_attach on all threads in a threadgroup
>
> From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
>
> This patch adds a function cgroup_can_attach_per_thread which handles iterating
> over each thread in a threadgroup safely with respect to the invariants that
> will be used in cgroup_attach_proc. Also, subsystems whose can_attach calls
> require per-thread validation are modified to use the per_thread wrapper to
> avoid duplicating cgroup-internal code.
>
> This is a pre-patch for cgroup-procs-writable.patch.
>
> Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

Acked-by: Paul Menage <menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Some of the can_attach() methods could be simplified slightly by
directly returning the result of cgroup_can_attach_per_thread()
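
e.g. ns_can_attach() in patch 2/3 could just end with (untested sketch):

	return cgroup_can_attach_per_thread(new_cgroup, task, ns_can_attach_cb,
					    threadgroup);

instead of assigning to ret, testing it, and then returning 0 separately.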

Paul

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 2/3] cgroups: add can_attach callback for checking all threads in a group
  2010-08-11  5:48         ` Ben Blum
  (?)
@ 2010-08-23 23:31         ` Paul Menage
  -1 siblings, 0 replies; 185+ messages in thread
From: Paul Menage @ 2010-08-23 23:31 UTC (permalink / raw)
  To: Ben Blum; +Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, oleg

On Tue, Aug 10, 2010 at 10:48 PM, Ben Blum <bblum@andrew.cmu.edu> wrote:
>
> Add cgroup wrapper for safely calling can_attach on all threads in a threadgroup
>
> From: Ben Blum <bblum@andrew.cmu.edu>
>
> This patch adds a function cgroup_can_attach_per_thread which handles iterating
> over each thread in a threadgroup safely with respect to the invariants that
> will be used in cgroup_attach_proc. Also, subsystems whose can_attach calls
> require per-thread validation are modified to use the per_thread wrapper to
> avoid duplicating cgroup-internal code.
>
> This is a pre-patch for cgroup-procs-writable.patch.
>
> Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>

Acked-by: Paul Menage <menage@google.com>

Some of the can_attach() methods could be simplified slightly by
directly returning the result of cgroup_can_attach_per_thread()

Paul

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup
       [not found]       ` <20100811054711.GB8743-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-08-23 23:35         ` Paul Menage
  0 siblings, 0 replies; 185+ messages in thread
From: Paul Menage @ 2010-08-23 23:35 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Tue, Aug 10, 2010 at 10:47 PM, Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
>
>
> Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup
>
> From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
>
> This patch adds an rwsem that lives in a threadgroup's signal_struct that's
> taken for reading in the fork path, under CONFIG_CGROUPS. If another part of
> the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
> ifdefs should be changed to a higher-up flag that CGROUPS and the other system
> would both depend on.
>
> This is a pre-patch for cgroup-procs-write.patch.
>
> Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

Acked-by: Paul Menage <menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Paul

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup
  2010-08-11  5:47     ` [PATCH v5 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup Ben Blum
@ 2010-08-23 23:35       ` Paul Menage
       [not found]       ` <20100811054711.GB8743-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  1 sibling, 0 replies; 185+ messages in thread
From: Paul Menage @ 2010-08-23 23:35 UTC (permalink / raw)
  To: Ben Blum; +Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, oleg

On Tue, Aug 10, 2010 at 10:47 PM, Ben Blum <bblum@andrew.cmu.edu> wrote:
>
>
> Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup
>
> From: Ben Blum <bblum@andrew.cmu.edu>
>
> This patch adds an rwsem that lives in a threadgroup's signal_struct that's
> taken for reading in the fork path, under CONFIG_CGROUPS. If another part of
> the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
> ifdefs should be changed to a higher-up flag that CGROUPS and the other system
> would both depend on.
>
> This is a pre-patch for cgroup-procs-write.patch.
>
> Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>

Acked-by: Paul Menage <menage@google.com>

Paul

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]       ` <20100811054851.GD8743-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-08-24 18:08         ` Paul Menage
  0 siblings, 0 replies; 185+ messages in thread
From: Paul Menage @ 2010-08-24 18:08 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Tue, Aug 10, 2010 at 10:48 PM, Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
>
>
> Makes procs file writable to move all threads by tgid at once
>
> From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
>
> This patch adds functionality that enables users to move all threads in a
> threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
> file. This current implementation makes use of a per-threadgroup rwsem that's
> taken for reading in the fork() path to prevent newly forking threads within
> the threadgroup from "escaping" while the move is in progress.
>
> Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

Reviewed-by: Paul Menage <menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
  2010-08-11  5:48     ` [PATCH v5 3/3] cgroups: make procs file writable Ben Blum
       [not found]       ` <20100811054851.GD8743-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-08-24 18:08       ` Paul Menage
       [not found]         ` <AANLkTimRM8rDe+u7fTy853RK=1mnLJMK57Tci2OLPR7L-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 185+ messages in thread
From: Paul Menage @ 2010-08-24 18:08 UTC (permalink / raw)
  To: Ben Blum; +Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, oleg

On Tue, Aug 10, 2010 at 10:48 PM, Ben Blum <bblum@andrew.cmu.edu> wrote:
>
>
> Makes procs file writable to move all threads by tgid at once
>
> From: Ben Blum <bblum@andrew.cmu.edu>
>
> This patch adds functionality that enables users to move all threads in a
> threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
> file. This current implementation makes use of a per-threadgroup rwsem that's
> taken for reading in the fork() path to prevent newly forking threads within
> the threadgroup from "escaping" while the move is in progress.
>
> Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>

Reviewed-by: Paul Menage <menage@google.com>

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]         ` <AANLkTimRM8rDe+u7fTy853RK=1mnLJMK57Tci2OLPR7L-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-10-08 21:57           ` Paul Menage
       [not found]             ` <AANLkTim7HW0wNyqOPePFXmEMV8hx_fMKNMTAsSwkRzZX-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: Paul Menage @ 2010-10-08 21:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oleg-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

Hi Andrew,

Do you see any road-blockers for including this patch set in -mm ?

Thanks,
Paul

On Tue, Aug 24, 2010 at 11:08 AM, Paul Menage <menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Tue, Aug 10, 2010 at 10:48 PM, Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
>>
>>
>> Makes procs file writable to move all threads by tgid at once
>>
>> From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
>>
>> This patch adds functionality that enables users to move all threads in a
>> threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
>> file. This current implementation makes use of a per-threadgroup rwsem that's
>> taken for reading in the fork() path to prevent newly forking threads within
>> the threadgroup from "escaping" while the move is in progress.
>>
>> Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
>
> Reviewed-by: Paul Menage <menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]             ` <AANLkTim7HW0wNyqOPePFXmEMV8hx_fMKNMTAsSwkRzZX-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-12-16  6:34               ` Paul Menage
       [not found]                 ` <AANLkTin7aK5uEFi0U+iU_9=cbfRTHfDzKsbWupn73fSL-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: Paul Menage @ 2010-12-16  6:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oleg-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

Ping akpm?

On Fri, Oct 8, 2010 at 2:57 PM, Paul Menage <menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> Hi Andrew,
>
> Do you see any road-blockers for including this patch set in -mm ?
>
> Thanks,
> Paul
>
> On Tue, Aug 24, 2010 at 11:08 AM, Paul Menage <menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>> On Tue, Aug 10, 2010 at 10:48 PM, Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
>>>
>>>
>>> Makes procs file writable to move all threads by tgid at once
>>>
>>> From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
>>>
>>> This patch adds functionality that enables users to move all threads in a
>>> threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
>>> file. This current implementation makes use of a per-threadgroup rwsem that's
>>> taken for reading in the fork() path to prevent newly forking threads within
>>> the threadgroup from "escaping" while the move is in progress.
>>>
>>> Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
>>
>> Reviewed-by: Paul Menage <menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>>
>

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                 ` <AANLkTin7aK5uEFi0U+iU_9=cbfRTHfDzKsbWupn73fSL-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-12-16  8:26                   ` Andrew Morton
       [not found]                     ` <20101216002603.6741874a.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: Andrew Morton @ 2010-12-16  8:26 UTC (permalink / raw)
  To: Paul Menage
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oleg-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Wed, 15 Dec 2010 22:34:39 -0800 Paul Menage <menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:

> Ping akpm?

Patches have gone a bit stale, sorry.  Refactoring in
kernel/cgroup_freezer.c necessitates a refresh and retest please.


> On Fri, Oct 8, 2010 at 2:57 PM, Paul Menage <menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> > Hi Andrew,
> >
> > Do you see any road-blockers for including this patch set in -mm ?
> >
> > Thanks,
> > Paul
> >
> > On Tue, Aug 24, 2010 at 11:08 AM, Paul Menage <menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> >> On Tue, Aug 10, 2010 at 10:48 PM, Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> >>>
> >>>
> >>> Makes procs file writable to move all threads by tgid at once
> >>>
> >>> From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
> >>>
> >>> This patch adds functionality that enables users to move all threads in a
> >>> threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
> >>> file. This current implementation makes use of a per-threadgroup rwsem that's
> >>> taken for reading in the fork() path to prevent newly forking threads within
> >>> the threadgroup from "escaping" while the move is in progress.
> >>>
> >>> Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
> >>
> >> Reviewed-by: Paul Menage <menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> >>
> >

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                     ` <20101216002603.6741874a.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2010-12-24  3:33                       ` Ben Blum
       [not found]                         ` <20101224033352.GA7804-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: Ben Blum @ 2010-12-24  3:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oleg-H+wXaHxf7aLQT0dZR+AlfA, Miao Xie, David Rientjes,
	Paul Menage, ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Thu, Dec 16, 2010 at 12:26:03AM -0800, Andrew Morton wrote:
> Patches have gone a bit stale, sorry.  Refactoring in
> kernel/cgroup_freezer.c necessitates a refresh and retest please.

commit 53feb29767c29c877f9d47dcfe14211b5b0f7ebd changed a bunch of stuff
in kernel/cpuset.c to allocate nodemasks with NODEMASK_ALLOC (which
wraps kmalloc) instead of on the stack.

1. All these functions have 'void' return values, indicating that
   calling them must not fail. Sure there are bailout cases, but no
   semblance of cross-function error propagation. Most importantly,
   cpuset_attach is a subsystem callback, which MUST not fail given the
   way it's used in cgroups, so relying on kmalloc is not safe.

2. I'm working on a patch series which needs to hold tasklist_lock
   across ss->attach callbacks (in cpuset_attach's "if (threadgroup)"
   case, this is how safe traversal of tsk->thread_group will be
   ensured), and kmalloc isn't allowed while holding a spin-lock. 

Why do we need heap-allocation here at all? In each case their scope is
exactly the function's scope, and neither the commit nor the surrounding
   patch series gives any explanation. I'd like to revert the patch if
possible.
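
To make the comparison concrete, here's a minimal sketch of the two forms.
The function and variable names are made up for illustration; only
NODEMASK_ALLOC/NODEMASK_FREE and NODE_MASK_NONE are the real interfaces from
<linux/nodemask.h>:

	/* on-stack form: cannot fail, needs no error path */
	static void example_apply_mems(struct task_struct *tsk)
	{
		nodemask_t to = NODE_MASK_NONE;
		/* ... fill 'to' and apply it to tsk ... */
	}

	/* NODEMASK_ALLOC form: wraps kmalloc, so it can sleep (GFP_KERNEL)
	 * and can return NULL -- and a void attach callback has no way to
	 * report that failure back to cgroups */
	static void example_apply_mems_alloc(struct task_struct *tsk)
	{
		NODEMASK_ALLOC(nodemask_t, to, GFP_KERNEL);

		if (!to)
			return;	/* silently give up */
		/* ... fill '*to' and apply it to tsk ... */
		NODEMASK_FREE(to);
	}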

cc'ing Miao Xie (author) and David Rientjes (acker).

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* [PATCH v6 0/3] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
  2010-08-11  5:46     ` Ben Blum
@ 2010-12-24  8:22         ` Ben Blum
  -1 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-12-24  8:22 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Wed, Aug 11, 2010 at 01:46:04AM -0400, Ben Blum wrote:
> On Fri, Jul 30, 2010 at 07:56:49PM -0400, Ben Blum wrote:
> > This patch series is a revision of http://lkml.org/lkml/2010/6/25/11 .
> > 
> > This patch series implements a write function for the 'cgroup.procs'
> > per-cgroup file, which enables atomic movement of multithreaded
> > applications between cgroups. Writing the thread-ID of any thread in a
> > threadgroup to a cgroup's procs file causes all threads in the group to
> > be moved to that cgroup safely with respect to threads forking/exiting.
> > (Possible usage scenario: If running a multithreaded build system that
> > sucks up system resources, this lets you restrict it all at once into a
> > new cgroup to keep it under control.)
> > 
> > Example: Suppose pid 31337 clones new threads 31338 and 31339.
> > 
> > # cat /dev/cgroup/tasks
> > ...
> > 31337
> > 31338
> > 31339
> > # mkdir /dev/cgroup/foo
> > # echo 31337 > /dev/cgroup/foo/cgroup.procs
> > # cat /dev/cgroup/foo/tasks
> > 31337
> > 31338
> > 31339
> > 
> > A new lock, called threadgroup_fork_lock and living in signal_struct, is
> > introduced to ensure atomicity when moving threads between cgroups. It's
> > taken for writing during the operation, and taken for reading in fork()
> > around the calls to cgroup_fork() and cgroup_post_fork(). I put calls to
> > down_read/up_read directly in copy_process(), since new inline functions
> > seemed like overkill.
> > 
> > -- Ben
> > 
> > ---
> >  Documentation/cgroups/cgroups.txt |   13 -
> >  include/linux/init_task.h         |    9
> >  include/linux/sched.h             |   10
> >  kernel/cgroup.c                   |  426 +++++++++++++++++++++++++++++++++-----
> >  kernel/cgroup_freezer.c           |    4
> >  kernel/cpuset.c                   |    4
> >  kernel/fork.c                     |   16 +
> >  kernel/ns_cgroup.c                |    4
> >  kernel/sched.c                    |    4
> >  9 files changed, 440 insertions(+), 50 deletions(-)
> 
> Here's an updated patchset. I've added an extra patch to implement the 
> callback scheme Paul suggested (note how there are twice as many deleted
> lines of code as before :) ), and also moved the up_read/down_read calls
> to static inline functions in sched.h near the other threadgroup-related
> calls.

One more go at this. I've refreshed the patches for some conflicts in
cgroup_freezer.c, by adding an extra argument to the per_thread() call,
"need_rcu", which makes the function take rcu_read_lock even around the
single-task case (like freezer now requires). So no semantics have been
changed.
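
For reference, the helper's signature with the new flag, and what the freezer
call site then becomes in patch 2/3 below (the trailing 'true' is need_rcu;
the comment here is added just to mark the new argument):

	int cgroup_can_attach_per_thread(struct cgroup *cgrp,
					 struct task_struct *task,
					 int (*cb)(struct cgroup *cgrp,
						   struct task_struct *task),
					 bool threadgroup, bool need_rcu);

	/* from freezer_can_attach(): need_rcu == true makes even the
	 * leader-only call run under rcu_read_lock, as the freezer's
	 * __cgroup_freezing_or_frozen() check requires */
	ret = cgroup_can_attach_per_thread(new_cgroup, task,
					   freezer_can_attach_cb, threadgroup,
					   /* need_rcu */ true);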

I also poked around at some attach() calls which also iterate over the
threadgroup (blkiocg_attach, cpuset_attach, cpu_cgroup_attach). I was
borderline about making another function, cgroup_attach_per_thread(),
but decided against.
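
For the curious, roughly the shape such a helper could have taken -- this is a
hypothetical sketch, not something in the patchset; the attach() callbacks
keep open-coding the loop instead:

	/* hypothetical: call 'cb' on the leader and, if 'threadgroup', on each
	 * thread in its group; the caller holds tasklist_lock, so the list is
	 * stable and the leader check mirrors the open-coded attach loops */
	static void cgroup_attach_per_thread(struct task_struct *leader,
					     void (*cb)(struct task_struct *task),
					     bool threadgroup)
	{
		struct task_struct *c;

		cb(leader);
		if (!threadgroup)
			return;
		BUG_ON(!thread_group_leader(leader));
		list_for_each_entry_rcu(c, &leader->thread_group, thread_group)
			cb(c);
	}

Presumably one reason it wasn't worth it: the per-task attach helpers want
different extra arguments (e.g. cpuset_attach_task() also takes the nodemask
and cpuset), so a single callback signature doesn't fit them all.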

There is a big issue in cpuset_attach, as explained in this email:
http://www.spinics.net/lists/linux-containers/msg22223.html
but the actual code/diffs for this patchset are independent of that
getting fixed, so I'm putting this up for consideration now.

-- Ben

---
 Documentation/cgroups/cgroups.txt |   13 -
 block/blk-cgroup.c                |   31 ++
 include/linux/cgroup.h            |   14 +
 include/linux/init_task.h         |    9 
 include/linux/sched.h             |   35 ++
 kernel/cgroup.c                   |  469 ++++++++++++++++++++++++++++++++++----
 kernel/cgroup_freezer.c           |   33 +-
 kernel/cpuset.c                   |   30 --
 kernel/fork.c                     |   10 
 kernel/ns_cgroup.c                |   25 --
 kernel/sched.c                    |   24 -
 11 files changed, 565 insertions(+), 128 deletions(-)

^ permalink raw reply	[flat|nested] 185+ messages in thread

* [PATCH v6 0/3] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
@ 2010-12-24  8:22         ` Ben Blum
  0 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-12-24  8:22 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage, oleg

On Wed, Aug 11, 2010 at 01:46:04AM -0400, Ben Blum wrote:
> On Fri, Jul 30, 2010 at 07:56:49PM -0400, Ben Blum wrote:
> > This patch series is a revision of http://lkml.org/lkml/2010/6/25/11 .
> > 
> > This patch series implements a write function for the 'cgroup.procs'
> > per-cgroup file, which enables atomic movement of multithreaded
> > applications between cgroups. Writing the thread-ID of any thread in a
> > threadgroup to a cgroup's procs file causes all threads in the group to
> > be moved to that cgroup safely with respect to threads forking/exiting.
> > (Possible usage scenario: If running a multithreaded build system that
> > sucks up system resources, this lets you restrict it all at once into a
> > new cgroup to keep it under control.)
> > 
> > Example: Suppose pid 31337 clones new threads 31338 and 31339.
> > 
> > # cat /dev/cgroup/tasks
> > ...
> > 31337
> > 31338
> > 31339
> > # mkdir /dev/cgroup/foo
> > # echo 31337 > /dev/cgroup/foo/cgroup.procs
> > # cat /dev/cgroup/foo/tasks
> > 31337
> > 31338
> > 31339
> > 
> > A new lock, called threadgroup_fork_lock and living in signal_struct, is
> > introduced to ensure atomicity when moving threads between cgroups. It's
> > taken for writing during the operation, and taken for reading in fork()
> > around the calls to cgroup_fork() and cgroup_post_fork(). I put calls to
> > down_read/up_read directly in copy_process(), since new inline functions
> > seemed like overkill.
> > 
> > -- Ben
> > 
> > ---
> >  Documentation/cgroups/cgroups.txt |   13 -
> >  include/linux/init_task.h         |    9
> >  include/linux/sched.h             |   10
> >  kernel/cgroup.c                   |  426 +++++++++++++++++++++++++++++++++-----
> >  kernel/cgroup_freezer.c           |    4
> >  kernel/cpuset.c                   |    4
> >  kernel/fork.c                     |   16 +
> >  kernel/ns_cgroup.c                |    4
> >  kernel/sched.c                    |    4
> >  9 files changed, 440 insertions(+), 50 deletions(-)
> 
> Here's an updated patchset. I've added an extra patch to implement the 
> callback scheme Paul suggested (note how there are twice as many deleted
> lines of code as before :) ), and also moved the up_read/down_read calls
> to static inline functions in sched.h near the other threadgroup-related
> calls.

One more go at this. I've refreshed the patches for some conflicts in
cgroup_freezer.c, by adding an extra argument to the per_thread() call,
"need_rcu", which makes the function take rcu_read_lock even around the
single-task case (like freezer now requires). So no semantics have been
changed.

I also poked around at some attach() calls which also iterate over the
threadgroup (blkiocg_attach, cpuset_attach, cpu_cgroup_attach). I was
borderline about making another function, cgroup_attach_per_thread(),
but decided against.

There is a big issue in cpuset_attach, as explained in this email:
http://www.spinics.net/lists/linux-containers/msg22223.html
but the actual code/diffs for this patchset are independent of that
getting fixed, so I'm putting this up for consideration now.

-- Ben

---
 Documentation/cgroups/cgroups.txt |   13 -
 block/blk-cgroup.c                |   31 ++
 include/linux/cgroup.h            |   14 +
 include/linux/init_task.h         |    9 
 include/linux/sched.h             |   35 ++
 kernel/cgroup.c                   |  469 ++++++++++++++++++++++++++++++++++----
 kernel/cgroup_freezer.c           |   33 +-
 kernel/cpuset.c                   |   30 --
 kernel/fork.c                     |   10 
 kernel/ns_cgroup.c                |   25 --
 kernel/sched.c                    |   24 -
 11 files changed, 565 insertions(+), 128 deletions(-)

^ permalink raw reply	[flat|nested] 185+ messages in thread

* [PATCH v6 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup
  2010-12-24  8:22         ` Ben Blum
@ 2010-12-24  8:23             ` Ben Blum
  -1 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-12-24  8:23 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

[-- Attachment #1: cgroup-threadgroup-fork-lock.patch --]
[-- Type: text/plain, Size: 4964 bytes --]

Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup

From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

This patch adds an rwsem that lives in a threadgroup's signal_struct that's
taken for reading in the fork path, under CONFIG_CGROUPS. If another part of
the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
ifdefs should be changed to a higher-up flag that CGROUPS and the other system
would both depend on.

This is a pre-patch for cgroup-procs-write.patch.

Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
---
 include/linux/init_task.h |    9 +++++++++
 include/linux/sched.h     |   35 +++++++++++++++++++++++++++++++++++
 kernel/fork.c             |   10 ++++++++++
 3 files changed, 54 insertions(+), 0 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6b281fa..b560381 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -15,6 +15,14 @@
 extern struct files_struct init_files;
 extern struct fs_struct init_fs;
 
+#ifdef CONFIG_CGROUPS
+#define INIT_THREADGROUP_FORK_LOCK(sig)					\
+	.threadgroup_fork_lock =					\
+		__RWSEM_INITIALIZER(sig.threadgroup_fork_lock),
+#else
+#define INIT_THREADGROUP_FORK_LOCK(sig)
+#endif
+
 #define INIT_SIGNALS(sig) {						\
 	.nr_threads	= 1,						\
 	.wait_chldexit	= __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
@@ -31,6 +39,7 @@ extern struct fs_struct init_fs;
 	},								\
 	.cred_guard_mutex =						\
 		 __MUTEX_INITIALIZER(sig.cred_guard_mutex),		\
+	INIT_THREADGROUP_FORK_LOCK(sig)					\
 }
 
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8580dc6..213a0b9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -623,6 +623,16 @@ struct signal_struct {
 	unsigned audit_tty;
 	struct tty_audit_buf *tty_audit_buf;
 #endif
+#ifdef CONFIG_CGROUPS
+	/*
+	 * The threadgroup_fork_lock prevents threads from forking with
+	 * CLONE_THREAD while held for writing. Use this for fork-sensitive
+	 * threadgroup-wide operations. It's taken for reading in fork.c in
+	 * copy_process().
+	 * Currently only needed write-side by cgroups.
+	 */
+	struct rw_semaphore threadgroup_fork_lock;
+#endif
 
 	int oom_adj;		/* OOM kill score adjustment (bit shift) */
 	int oom_score_adj;	/* OOM kill score adjustment */
@@ -2270,6 +2280,31 @@ static inline void unlock_task_sighand(struct task_struct *tsk,
 	spin_unlock_irqrestore(&tsk->sighand->siglock, *flags);
 }
 
+/* See the declaration of threadgroup_fork_lock in signal_struct. */
+#ifdef CONFIG_CGROUPS
+static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
+{
+	down_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
+{
+	up_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_lock(struct task_struct *tsk)
+{
+	down_write(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_unlock(struct task_struct *tsk)
+{
+	up_write(&tsk->signal->threadgroup_fork_lock);
+}
+#else
+static inline void threadgroup_fork_read_lock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_read_unlock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_write_lock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_write_unlock(struct task_struct *tsk) {}
+#endif
+
 #ifndef __HAVE_THREAD_FUNCTIONS
 
 #define task_thread_info(task)	((struct thread_info *)(task)->stack)
diff --git a/kernel/fork.c b/kernel/fork.c
index 0979527..aefe61f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -905,6 +905,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 
 	tty_audit_fork(sig);
 
+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->threadgroup_fork_lock);
+#endif
+
 	sig->oom_adj = current->signal->oom_adj;
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
@@ -1087,6 +1091,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	monotonic_to_bootbased(&p->real_start_time);
 	p->io_context = NULL;
 	p->audit_context = NULL;
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_lock(current);
 	cgroup_fork(p);
 #ifdef CONFIG_NUMA
 	p->mempolicy = mpol_dup(p->mempolicy);
@@ -1294,6 +1300,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	write_unlock_irq(&tasklist_lock);
 	proc_fork_connector(p);
 	cgroup_post_fork(p);
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_unlock(current);
 	perf_event_fork(p);
 	return p;
 
@@ -1332,6 +1340,8 @@ bad_fork_cleanup_policy:
 	mpol_put(p->mempolicy);
 bad_fork_cleanup_cgroup:
 #endif
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_unlock(current);
 	cgroup_exit(p, cgroup_callbacks_done);
 	delayacct_tsk_free(p);
 	module_put(task_thread_info(p)->exec_domain->module);

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v6 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup
@ 2010-12-24  8:23             ` Ben Blum
  0 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-12-24  8:23 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage, oleg

[-- Attachment #1: cgroup-threadgroup-fork-lock.patch --]
[-- Type: text/plain, Size: 4914 bytes --]

Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup

From: Ben Blum <bblum@andrew.cmu.edu>

This patch adds an rwsem that lives in a threadgroup's signal_struct that's
taken for reading in the fork path, under CONFIG_CGROUPS. If another part of
the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
ifdefs should be changed to a higher-up flag that CGROUPS and the other system
would both depend on.

This is a pre-patch for cgroup-procs-write.patch.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
---
 include/linux/init_task.h |    9 +++++++++
 include/linux/sched.h     |   35 +++++++++++++++++++++++++++++++++++
 kernel/fork.c             |   10 ++++++++++
 3 files changed, 54 insertions(+), 0 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6b281fa..b560381 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -15,6 +15,14 @@
 extern struct files_struct init_files;
 extern struct fs_struct init_fs;
 
+#ifdef CONFIG_CGROUPS
+#define INIT_THREADGROUP_FORK_LOCK(sig)					\
+	.threadgroup_fork_lock =					\
+		__RWSEM_INITIALIZER(sig.threadgroup_fork_lock),
+#else
+#define INIT_THREADGROUP_FORK_LOCK(sig)
+#endif
+
 #define INIT_SIGNALS(sig) {						\
 	.nr_threads	= 1,						\
 	.wait_chldexit	= __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
@@ -31,6 +39,7 @@ extern struct fs_struct init_fs;
 	},								\
 	.cred_guard_mutex =						\
 		 __MUTEX_INITIALIZER(sig.cred_guard_mutex),		\
+	INIT_THREADGROUP_FORK_LOCK(sig)					\
 }
 
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8580dc6..213a0b9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -623,6 +623,16 @@ struct signal_struct {
 	unsigned audit_tty;
 	struct tty_audit_buf *tty_audit_buf;
 #endif
+#ifdef CONFIG_CGROUPS
+	/*
+	 * The threadgroup_fork_lock prevents threads from forking with
+	 * CLONE_THREAD while held for writing. Use this for fork-sensitive
+	 * threadgroup-wide operations. It's taken for reading in fork.c in
+	 * copy_process().
+	 * Currently only needed write-side by cgroups.
+	 */
+	struct rw_semaphore threadgroup_fork_lock;
+#endif
 
 	int oom_adj;		/* OOM kill score adjustment (bit shift) */
 	int oom_score_adj;	/* OOM kill score adjustment */
@@ -2270,6 +2280,31 @@ static inline void unlock_task_sighand(struct task_struct *tsk,
 	spin_unlock_irqrestore(&tsk->sighand->siglock, *flags);
 }
 
+/* See the declaration of threadgroup_fork_lock in signal_struct. */
+#ifdef CONFIG_CGROUPS
+static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
+{
+	down_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
+{
+	up_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_lock(struct task_struct *tsk)
+{
+	down_write(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_unlock(struct task_struct *tsk)
+{
+	up_write(&tsk->signal->threadgroup_fork_lock);
+}
+#else
+static inline void threadgroup_fork_read_lock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_read_unlock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_write_lock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_write_unlock(struct task_struct *tsk) {}
+#endif
+
 #ifndef __HAVE_THREAD_FUNCTIONS
 
 #define task_thread_info(task)	((struct thread_info *)(task)->stack)
diff --git a/kernel/fork.c b/kernel/fork.c
index 0979527..aefe61f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -905,6 +905,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 
 	tty_audit_fork(sig);
 
+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->threadgroup_fork_lock);
+#endif
+
 	sig->oom_adj = current->signal->oom_adj;
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
@@ -1087,6 +1091,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	monotonic_to_bootbased(&p->real_start_time);
 	p->io_context = NULL;
 	p->audit_context = NULL;
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_lock(current);
 	cgroup_fork(p);
 #ifdef CONFIG_NUMA
 	p->mempolicy = mpol_dup(p->mempolicy);
@@ -1294,6 +1300,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	write_unlock_irq(&tasklist_lock);
 	proc_fork_connector(p);
 	cgroup_post_fork(p);
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_unlock(current);
 	perf_event_fork(p);
 	return p;
 
@@ -1332,6 +1340,8 @@ bad_fork_cleanup_policy:
 	mpol_put(p->mempolicy);
 bad_fork_cleanup_cgroup:
 #endif
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_unlock(current);
 	cgroup_exit(p, cgroup_callbacks_done);
 	delayacct_tsk_free(p);
 	module_put(task_thread_info(p)->exec_domain->module);

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v6 2/3] cgroups: add can_attach callback for checking all threads in a group
  2010-12-24  8:22         ` Ben Blum
@ 2010-12-24  8:24             ` Ben Blum
  -1 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-12-24  8:24 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

[-- Attachment #1: cgroup-threadgroup-callback.patch --]
[-- Type: text/plain, Size: 11276 bytes --]

Add cgroup wrapper for safely calling can_attach on all threads in a threadgroup

From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

This patch adds a function cgroup_can_attach_per_thread which handles iterating
over each thread in a threadgroup safely with respect to the invariants that
will be used in cgroup_attach_proc. Also, subsystems whose can_attach calls
require per-thread validation are modified to use the per_thread wrapper to
avoid duplicating cgroup-internal code.

This is a pre-patch for cgroup-procs-writable.patch.

Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
---
 block/blk-cgroup.c      |   31 ++++++++++++++++++++++++++-----
 include/linux/cgroup.h  |   14 ++++++++++++++
 kernel/cgroup.c         |   45 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/cgroup_freezer.c |   33 ++++++++++++++-------------------
 kernel/cpuset.c         |   30 ++++++++++--------------------
 kernel/ns_cgroup.c      |   25 +++++++++----------------
 kernel/sched.c          |   24 ++++++------------------
 7 files changed, 124 insertions(+), 78 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b1febd0..865e208 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1475,9 +1475,7 @@ done:
  * of the main cic data structures.  For now we allow a task to change
  * its cgroup only if it's the only owner of its ioc.
  */
-static int blkiocg_can_attach(struct cgroup_subsys *subsys,
-				struct cgroup *cgroup, struct task_struct *tsk,
-				bool threadgroup)
+static int blkiocg_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 	struct io_context *ioc;
 	int ret = 0;
@@ -1492,10 +1490,17 @@ static int blkiocg_can_attach(struct cgroup_subsys *subsys,
 	return ret;
 }
 
-static void blkiocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
-				struct cgroup *prev, struct task_struct *tsk,
+static int blkiocg_can_attach(struct cgroup_subsys *subsys,
+				struct cgroup *cgroup, struct task_struct *tsk,
 				bool threadgroup)
 {
+	return cgroup_can_attach_per_thread(cgroup, tsk,
+					    blkiocg_can_attach_task,
+					    threadgroup, false);
+}
+
+static void blkiocg_attach_task(struct task_struct *tsk)
+{
 	struct io_context *ioc;
 
 	task_lock(tsk);
@@ -1505,6 +1510,22 @@ static void blkiocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
 	task_unlock(tsk);
 }
 
+static void blkiocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+				struct cgroup *prev, struct task_struct *tsk,
+				bool threadgroup)
+{
+	blkiocg_attach_task(tsk);
+	if (threadgroup) {
+		struct task_struct *c;
+
+		/* tasklist_lock will be held. */
+		BUG_ON(!thread_group_leader(tsk));
+		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
+			blkiocg_attach_task(c);
+		}
+	}
+}
+
 void blkio_policy_register(struct blkio_policy_type *blkiop)
 {
 	spin_lock(&blkio_list_lock);
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ce104e3..96898e6 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -590,6 +590,20 @@ static inline int cgroup_attach_task_current_cg(struct task_struct *tsk)
 }
 
 /*
+ * For use in subsystems whose can_attach functions need to run an operation
+ * on every task in the threadgroup. Calls the given callback once if the
+ * 'threadgroup' flag is false, or once per thread in the group if true.
+ * The callback should return 0/-ERR; this will return 0/-ERR.
+ * The callback will run within an rcu_read section, so must not sleep.
+ * 'need_rcu' should specify whether the callback needs to run in an rcu_read
+ * section even in the single-task case.
+ */
+int cgroup_can_attach_per_thread(struct cgroup *cgrp, struct task_struct *task,
+				 int (*cb)(struct cgroup *cgrp,
+					   struct task_struct *task),
+				 bool threadgroup, bool need_rcu);
+
+/*
  * CSS ID is ID for cgroup_subsys_state structs under subsys. This only works
  * if cgroup_subsys.use_id == true. It can be used for looking up and scanning.
  * CSS ID is assigned at cgroup allocation (create) automatically
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 66a416b..f86dd9c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1726,6 +1726,51 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 }
 EXPORT_SYMBOL_GPL(cgroup_path);
 
+int cgroup_can_attach_per_thread(struct cgroup *cgrp, struct task_struct *task,
+				 int (*cb)(struct cgroup *cgrp,
+					   struct task_struct *task),
+				 bool threadgroup, bool need_rcu)
+{
+	int ret;
+
+	/* Run callback on the leader first, taking rcu_read_lock if needed. */
+	if (need_rcu)
+		rcu_read_lock();
+
+	ret = cb(cgrp, task);
+
+	if (need_rcu)
+		rcu_read_unlock();
+
+	if (ret < 0)
+		return ret;
+
+	/* Run on each task in the threadgroup. */
+	if (threadgroup) {
+		struct task_struct *c;
+
+		rcu_read_lock();
+		/*
+		 * It is necessary for the given task to still be the leader
+		 * to safely traverse thread_group. See cgroup_attach_proc.
+		 */
+		if (!thread_group_leader(task)) {
+			rcu_read_unlock();
+			return -EAGAIN;
+		}
+		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
+			ret = cb(cgrp, c);
+			if (ret < 0) {
+				rcu_read_unlock();
+				return ret;
+			}
+		}
+		rcu_read_unlock();
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(cgroup_can_attach_per_thread);
+
 /**
  * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
  * @cgrp: the cgroup the task is attaching to
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index e7bebb7..1f5ac8f 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -153,6 +153,13 @@ static void freezer_destroy(struct cgroup_subsys *ss,
 	kfree(cgroup_freezer(cgroup));
 }
 
+static int freezer_can_attach_cb(struct cgroup *cgrp, struct task_struct *task)
+{
+	if (__cgroup_freezing_or_frozen(task))
+		return -EBUSY;
+	return 0;
+}
+
 /*
  * The call to cgroup_lock() in the freezer.state write method prevents
  * a write to that file racing against an attach, and hence the
@@ -163,6 +170,7 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 			      struct task_struct *task, bool threadgroup)
 {
 	struct freezer *freezer;
+	int ret;
 
 	/*
 	 * Anything frozen can't move or be moved to/from.
@@ -172,25 +180,12 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 	if (freezer->state != CGROUP_THAWED)
 		return -EBUSY;
 
-	rcu_read_lock();
-	if (__cgroup_freezing_or_frozen(task)) {
-		rcu_read_unlock();
-		return -EBUSY;
-	}
-	rcu_read_unlock();
-
-	if (threadgroup) {
-		struct task_struct *c;
-
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
-			if (__cgroup_freezing_or_frozen(c)) {
-				rcu_read_unlock();
-				return -EBUSY;
-			}
-		}
-		rcu_read_unlock();
-	}
+	/* Need to take rcu_read_lock even around the call on the leader. */
+	ret = cgroup_can_attach_per_thread(new_cgroup, task,
+					   freezer_can_attach_cb, threadgroup,
+					   true);
+	if (ret < 0)
+		return ret;
 
 	return 0;
 }
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 4349935..8fbe1e3 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1375,11 +1375,15 @@ static int fmeter_getrate(struct fmeter *fmp)
 /* Protected by cgroup_lock */
 static cpumask_var_t cpus_attach;
 
+static int cpuset_can_attach_cb(struct cgroup *cgrp, struct task_struct *task)
+{
+	return security_task_setscheduler(task);
+}
+
 /* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
 static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
 			     struct task_struct *tsk, bool threadgroup)
 {
-	int ret;
 	struct cpuset *cs = cgroup_cs(cont);
 
 	if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
@@ -1396,23 +1400,8 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
 	if (tsk->flags & PF_THREAD_BOUND)
 		return -EINVAL;
 
-	ret = security_task_setscheduler(tsk);
-	if (ret)
-		return ret;
-	if (threadgroup) {
-		struct task_struct *c;
-
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			ret = security_task_setscheduler(c);
-			if (ret) {
-				rcu_read_unlock();
-				return ret;
-			}
-		}
-		rcu_read_unlock();
-	}
-	return 0;
+	return cgroup_can_attach_per_thread(cont, tsk, cpuset_can_attach_cb,
+					    threadgroup, false);
 }
 
 static void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
@@ -1455,11 +1444,12 @@ static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
 	cpuset_attach_task(tsk, to, cs);
 	if (threadgroup) {
 		struct task_struct *c;
-		rcu_read_lock();
+
+		/* tasklist_lock will be held. */
+		BUG_ON(!thread_group_leader(tsk));
 		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
 			cpuset_attach_task(c, to, cs);
 		}
-		rcu_read_unlock();
 	}
 
 	/* change mm; only needs to be done once even if threadgroup */
diff --git a/kernel/ns_cgroup.c b/kernel/ns_cgroup.c
index 2c98ad9..66ba860 100644
--- a/kernel/ns_cgroup.c
+++ b/kernel/ns_cgroup.c
@@ -42,6 +42,13 @@ int ns_cgroup_clone(struct task_struct *task, struct pid *pid)
  *       (hence either you are in the same cgroup as task, or in an
  *        ancestor cgroup thereof)
  */
+static int ns_can_attach_cb(struct cgroup *new_cgroup, struct task_struct *task)
+{
+	if (!cgroup_is_descendant(new_cgroup, task))
+		return -EPERM;
+	return 0;
+}
+
 static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
 			 struct task_struct *task, bool threadgroup)
 {
@@ -53,22 +60,8 @@ static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
 			return -EPERM;
 	}
 
-	if (!cgroup_is_descendant(new_cgroup, task))
-		return -EPERM;
-
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
-			if (!cgroup_is_descendant(new_cgroup, c)) {
-				rcu_read_unlock();
-				return -EPERM;
-			}
-		}
-		rcu_read_unlock();
-	}
-
-	return 0;
+	return cgroup_can_attach_per_thread(new_cgroup, task, ns_can_attach_cb,
+					    threadgroup, false);
 }
 
 /*
diff --git a/kernel/sched.c b/kernel/sched.c
index 218ef20..8e89bf9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8659,22 +8659,9 @@ static int
 cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 		      struct task_struct *tsk, bool threadgroup)
 {
-	int retval = cpu_cgroup_can_attach_task(cgrp, tsk);
-	if (retval)
-		return retval;
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			retval = cpu_cgroup_can_attach_task(cgrp, c);
-			if (retval) {
-				rcu_read_unlock();
-				return retval;
-			}
-		}
-		rcu_read_unlock();
-	}
-	return 0;
+	return cgroup_can_attach_per_thread(cgrp, tsk,
+					    cpu_cgroup_can_attach_task,
+					    threadgroup, false);
 }
 
 static void
@@ -8685,11 +8672,12 @@ cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	sched_move_task(tsk);
 	if (threadgroup) {
 		struct task_struct *c;
-		rcu_read_lock();
+
+		/* tasklist_lock will be held. */
+		BUG_ON(!thread_group_leader(tsk));
 		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
 			sched_move_task(c);
 		}
-		rcu_read_unlock();
 	}
 }

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v6 2/3] cgroups: add can_attach callback for checking all threads in a group
@ 2010-12-24  8:24             ` Ben Blum
  0 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-12-24  8:24 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage, oleg

[-- Attachment #1: cgroup-threadgroup-callback.patch --]
[-- Type: text/plain, Size: 11228 bytes --]

Add cgroup wrapper for safely calling can_attach on all threads in a threadgroup

From: Ben Blum <bblum@andrew.cmu.edu>

This patch adds a function cgroup_can_attach_per_thread which handles iterating
over each thread in a threadgroup safely with respect to the invariants that
will be used in cgroup_attach_proc. Also, subsystems whose can_attach calls
require per-thread validation are modified to use the per_thread wrapper to
avoid duplicating cgroup-internal code.

This is a pre-patch for cgroup-procs-writable.patch.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
---
 block/blk-cgroup.c      |   31 ++++++++++++++++++++++++++-----
 include/linux/cgroup.h  |   14 ++++++++++++++
 kernel/cgroup.c         |   45 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/cgroup_freezer.c |   33 ++++++++++++++-------------------
 kernel/cpuset.c         |   30 ++++++++++--------------------
 kernel/ns_cgroup.c      |   25 +++++++++----------------
 kernel/sched.c          |   24 ++++++------------------
 7 files changed, 124 insertions(+), 78 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b1febd0..865e208 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1475,9 +1475,7 @@ done:
  * of the main cic data structures.  For now we allow a task to change
  * its cgroup only if it's the only owner of its ioc.
  */
-static int blkiocg_can_attach(struct cgroup_subsys *subsys,
-				struct cgroup *cgroup, struct task_struct *tsk,
-				bool threadgroup)
+static int blkiocg_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 	struct io_context *ioc;
 	int ret = 0;
@@ -1492,10 +1490,17 @@ static int blkiocg_can_attach(struct cgroup_subsys *subsys,
 	return ret;
 }
 
-static void blkiocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
-				struct cgroup *prev, struct task_struct *tsk,
+static int blkiocg_can_attach(struct cgroup_subsys *subsys,
+				struct cgroup *cgroup, struct task_struct *tsk,
 				bool threadgroup)
 {
+	return cgroup_can_attach_per_thread(cgroup, tsk,
+					    blkiocg_can_attach_task,
+					    threadgroup, false);
+}
+
+static void blkiocg_attach_task(struct task_struct *tsk)
+{
 	struct io_context *ioc;
 
 	task_lock(tsk);
@@ -1505,6 +1510,22 @@ static void blkiocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
 	task_unlock(tsk);
 }
 
+static void blkiocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+				struct cgroup *prev, struct task_struct *tsk,
+				bool threadgroup)
+{
+	blkiocg_attach_task(tsk);
+	if (threadgroup) {
+		struct task_struct *c;
+
+		/* tasklist_lock will be held. */
+		BUG_ON(!thread_group_leader(tsk));
+		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
+			blkiocg_attach_task(c);
+		}
+	}
+}
+
 void blkio_policy_register(struct blkio_policy_type *blkiop)
 {
 	spin_lock(&blkio_list_lock);
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ce104e3..96898e6 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -590,6 +590,20 @@ static inline int cgroup_attach_task_current_cg(struct task_struct *tsk)
 }
 
 /*
+ * For use in subsystems whose can_attach functions need to run an operation
+ * on every task in the threadgroup. Calls the given callback once if the
+ * 'threadgroup' flag is false, or once per thread in the group if true.
+ * The callback should return 0/-ERR; this will return 0/-ERR.
+ * The callback will run within an rcu_read section, so must not sleep.
+ * 'need_rcu' should specify whether the callback needs to run in an rcu_read
+ * section even in the single-task case.
+ */
+int cgroup_can_attach_per_thread(struct cgroup *cgrp, struct task_struct *task,
+				 int (*cb)(struct cgroup *cgrp,
+					   struct task_struct *task),
+				 bool threadgroup, bool need_rcu);
+
+/*
  * CSS ID is ID for cgroup_subsys_state structs under subsys. This only works
  * if cgroup_subsys.use_id == true. It can be used for looking up and scanning.
  * CSS ID is assigned at cgroup allocation (create) automatically
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 66a416b..f86dd9c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1726,6 +1726,51 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 }
 EXPORT_SYMBOL_GPL(cgroup_path);
 
+int cgroup_can_attach_per_thread(struct cgroup *cgrp, struct task_struct *task,
+				 int (*cb)(struct cgroup *cgrp,
+					   struct task_struct *task),
+				 bool threadgroup, bool need_rcu)
+{
+	int ret;
+
+	/* Run callback on the leader first, taking rcu_read_lock if needed. */
+	if (need_rcu)
+		rcu_read_lock();
+
+	ret = cb(cgrp, task);
+
+	if (need_rcu)
+		rcu_read_unlock();
+
+	if (ret < 0)
+		return ret;
+
+	/* Run on each task in the threadgroup. */
+	if (threadgroup) {
+		struct task_struct *c;
+
+		rcu_read_lock();
+		/*
+		 * It is necessary for the given task to still be the leader
+		 * to safely traverse thread_group. See cgroup_attach_proc.
+		 */
+		if (!thread_group_leader(task)) {
+			rcu_read_unlock();
+			return -EAGAIN;
+		}
+		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
+			ret = cb(cgrp, c);
+			if (ret < 0) {
+				rcu_read_unlock();
+				return ret;
+			}
+		}
+		rcu_read_unlock();
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(cgroup_can_attach_per_thread);
+
 /**
  * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
  * @cgrp: the cgroup the task is attaching to
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index e7bebb7..1f5ac8f 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -153,6 +153,13 @@ static void freezer_destroy(struct cgroup_subsys *ss,
 	kfree(cgroup_freezer(cgroup));
 }
 
+static int freezer_can_attach_cb(struct cgroup *cgrp, struct task_struct *task)
+{
+	if (__cgroup_freezing_or_frozen(task))
+		return -EBUSY;
+	return 0;
+}
+
 /*
  * The call to cgroup_lock() in the freezer.state write method prevents
  * a write to that file racing against an attach, and hence the
@@ -163,6 +170,7 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 			      struct task_struct *task, bool threadgroup)
 {
 	struct freezer *freezer;
+	int ret;
 
 	/*
 	 * Anything frozen can't move or be moved to/from.
@@ -172,25 +180,12 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 	if (freezer->state != CGROUP_THAWED)
 		return -EBUSY;
 
-	rcu_read_lock();
-	if (__cgroup_freezing_or_frozen(task)) {
-		rcu_read_unlock();
-		return -EBUSY;
-	}
-	rcu_read_unlock();
-
-	if (threadgroup) {
-		struct task_struct *c;
-
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
-			if (__cgroup_freezing_or_frozen(c)) {
-				rcu_read_unlock();
-				return -EBUSY;
-			}
-		}
-		rcu_read_unlock();
-	}
+	/* Need to take rcu_read_lock even around the call on the leader. */
+	ret = cgroup_can_attach_per_thread(new_cgroup, task,
+					   freezer_can_attach_cb, threadgroup,
+					   true);
+	if (ret < 0)
+		return ret;
 
 	return 0;
 }
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 4349935..8fbe1e3 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1375,11 +1375,15 @@ static int fmeter_getrate(struct fmeter *fmp)
 /* Protected by cgroup_lock */
 static cpumask_var_t cpus_attach;
 
+static int cpuset_can_attach_cb(struct cgroup *cgrp, struct task_struct *task)
+{
+	return security_task_setscheduler(task);
+}
+
 /* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
 static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
 			     struct task_struct *tsk, bool threadgroup)
 {
-	int ret;
 	struct cpuset *cs = cgroup_cs(cont);
 
 	if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
@@ -1396,23 +1400,8 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
 	if (tsk->flags & PF_THREAD_BOUND)
 		return -EINVAL;
 
-	ret = security_task_setscheduler(tsk);
-	if (ret)
-		return ret;
-	if (threadgroup) {
-		struct task_struct *c;
-
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			ret = security_task_setscheduler(c);
-			if (ret) {
-				rcu_read_unlock();
-				return ret;
-			}
-		}
-		rcu_read_unlock();
-	}
-	return 0;
+	return cgroup_can_attach_per_thread(cont, tsk, cpuset_can_attach_cb,
+					    threadgroup, false);
 }
 
 static void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
@@ -1455,11 +1444,12 @@ static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
 	cpuset_attach_task(tsk, to, cs);
 	if (threadgroup) {
 		struct task_struct *c;
-		rcu_read_lock();
+
+		/* tasklist_lock will be held. */
+		BUG_ON(!thread_group_leader(tsk));
 		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
 			cpuset_attach_task(c, to, cs);
 		}
-		rcu_read_unlock();
 	}
 
 	/* change mm; only needs to be done once even if threadgroup */
diff --git a/kernel/ns_cgroup.c b/kernel/ns_cgroup.c
index 2c98ad9..66ba860 100644
--- a/kernel/ns_cgroup.c
+++ b/kernel/ns_cgroup.c
@@ -42,6 +42,13 @@ int ns_cgroup_clone(struct task_struct *task, struct pid *pid)
  *       (hence either you are in the same cgroup as task, or in an
  *        ancestor cgroup thereof)
  */
+static int ns_can_attach_cb(struct cgroup *new_cgroup, struct task_struct *task)
+{
+	if (!cgroup_is_descendant(new_cgroup, task))
+		return -EPERM;
+	return 0;
+}
+
 static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
 			 struct task_struct *task, bool threadgroup)
 {
@@ -53,22 +60,8 @@ static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
 			return -EPERM;
 	}
 
-	if (!cgroup_is_descendant(new_cgroup, task))
-		return -EPERM;
-
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
-			if (!cgroup_is_descendant(new_cgroup, c)) {
-				rcu_read_unlock();
-				return -EPERM;
-			}
-		}
-		rcu_read_unlock();
-	}
-
-	return 0;
+	return cgroup_can_attach_per_thread(new_cgroup, task, ns_can_attach_cb,
+					    threadgroup, false);
 }
 
 /*
diff --git a/kernel/sched.c b/kernel/sched.c
index 218ef20..8e89bf9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8659,22 +8659,9 @@ static int
 cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 		      struct task_struct *tsk, bool threadgroup)
 {
-	int retval = cpu_cgroup_can_attach_task(cgrp, tsk);
-	if (retval)
-		return retval;
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			retval = cpu_cgroup_can_attach_task(cgrp, c);
-			if (retval) {
-				rcu_read_unlock();
-				return retval;
-			}
-		}
-		rcu_read_unlock();
-	}
-	return 0;
+	return cgroup_can_attach_per_thread(cgrp, tsk,
+					    cpu_cgroup_can_attach_task,
+					    threadgroup, false);
 }
 
 static void
@@ -8685,11 +8672,12 @@ cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	sched_move_task(tsk);
 	if (threadgroup) {
 		struct task_struct *c;
-		rcu_read_lock();
+
+		/* tasklist_lock will be held. */
+		BUG_ON(!thread_group_leader(tsk));
 		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
 			sched_move_task(c);
 		}
-		rcu_read_unlock();
 	}
 }
 

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v6 3/3] cgroups: make procs file writable
  2010-12-24  8:22         ` Ben Blum
@ 2010-12-24  8:24             ` Ben Blum
  -1 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-12-24  8:24 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

[-- Attachment #1: cgroup-procs-writable.patch --]
[-- Type: text/plain, Size: 17605 bytes --]

Makes procs file writable to move all threads by tgid at once

From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

This patch adds functionality that enables users to move all threads in a
threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
file. This current implementation makes use of a per-threadgroup rwsem that's
taken for reading in the fork() path to prevent newly forking threads within
the threadgroup from "escaping" while the move is in progress.

Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
---
 Documentation/cgroups/cgroups.txt |   13 +
 kernel/cgroup.c                   |  424 +++++++++++++++++++++++++++++++++----
 2 files changed, 387 insertions(+), 50 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 190018b..07674e5 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -236,7 +236,8 @@ containing the following files describing that cgroup:
  - cgroup.procs: list of tgids in the cgroup.  This list is not
    guaranteed to be sorted or free of duplicate tgids, and userspace
    should sort/uniquify the list if this property is required.
-   This is a read-only file, for now.
+   Writing a thread group id into this file moves all threads in that
+   group into this cgroup.
  - notify_on_release flag: run the release agent on exit?
  - release_agent: the path to use for release notifications (this file
    exists in the top cgroup only)
@@ -426,6 +427,12 @@ You can attach the current shell task by echoing 0:
 
 # echo 0 > tasks
 
+You can use the cgroup.procs file instead of the tasks file to move all
+threads in a threadgroup at once. Echoing the pid of any task in a
+threadgroup to cgroup.procs causes all tasks in that threadgroup to be
+attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
+in the writing task's threadgroup.
+
 2.3 Mounting hierarchies by name
 --------------------------------
 
@@ -574,7 +581,9 @@ called on a fork. If this method returns 0 (success) then this should
 remain valid while the caller holds cgroup_mutex and it is ensured that either
 attach() or cancel_attach() will be called in future. If threadgroup is
 true, then a successful result indicates that all threads in the given
-thread's threadgroup can be moved together.
+thread's threadgroup can be moved together. If the subsystem wants to
+iterate over task->thread_group, it must take rcu_read_lock then check
+if thread_group_leader(task), returning -EAGAIN if that fails.
 
 void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	       struct task_struct *task, bool threadgroup)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f86dd9c..74be02c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1771,6 +1771,76 @@ int cgroup_can_attach_per_thread(struct cgroup *cgrp, struct task_struct *task,
 }
 EXPORT_SYMBOL_GPL(cgroup_can_attach_per_thread);
 
+/*
+ * cgroup_task_migrate - move a task from one cgroup to another.
+ *
+ * 'guarantee' is set if the caller promises that a new css_set for the task
+ * will already exit. If not set, this function might sleep, and can fail with
+ * -ENOMEM. Otherwise, it can only fail with -ESRCH.
+ */
+static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
+			       struct task_struct *tsk, bool guarantee)
+{
+	struct css_set *oldcg;
+	struct css_set *newcg;
+
+	/*
+	 * get old css_set. we need to take task_lock and refcount it, because
+	 * an exiting task can change its css_set to init_css_set and drop its
+	 * old one without taking cgroup_mutex.
+	 */
+	task_lock(tsk);
+	oldcg = tsk->cgroups;
+	get_css_set(oldcg);
+	task_unlock(tsk);
+
+	/* locate or allocate a new css_set for this task. */
+	if (guarantee) {
+		/* we know the css_set we want already exists. */
+		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+		read_lock(&css_set_lock);
+		newcg = find_existing_css_set(oldcg, cgrp, template);
+		BUG_ON(!newcg);
+		get_css_set(newcg);
+		read_unlock(&css_set_lock);
+	} else {
+		might_sleep();
+		/* find_css_set will give us newcg already referenced. */
+		newcg = find_css_set(oldcg, cgrp);
+		if (!newcg) {
+			put_css_set(oldcg);
+			return -ENOMEM;
+		}
+	}
+	put_css_set(oldcg);
+
+	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
+	task_lock(tsk);
+	if (tsk->flags & PF_EXITING) {
+		task_unlock(tsk);
+		put_css_set(newcg);
+		return -ESRCH;
+	}
+	rcu_assign_pointer(tsk->cgroups, newcg);
+	task_unlock(tsk);
+
+	/* Update the css_set linked lists if we're using them */
+	write_lock(&css_set_lock);
+	if (!list_empty(&tsk->cg_list))
+		list_move(&tsk->cg_list, &newcg->tasks);
+	write_unlock(&css_set_lock);
+
+	/*
+	 * We just gained a reference on oldcg by taking it from the task. As
+	 * trading it for newcg is protected by cgroup_mutex, we're safe to drop
+	 * it here; it will be freed under RCU.
+	 */
+	put_css_set(oldcg);
+
+	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+	return 0;
+}
+
 /**
  * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
  * @cgrp: the cgroup the task is attaching to
@@ -1781,11 +1851,9 @@ EXPORT_SYMBOL_GPL(cgroup_can_attach_per_thread);
  */
 int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
-	int retval = 0;
+	int retval;
 	struct cgroup_subsys *ss, *failed_ss = NULL;
 	struct cgroup *oldcgrp;
-	struct css_set *cg;
-	struct css_set *newcg;
 	struct cgroupfs_root *root = cgrp->root;
 
 	/* Nothing to do if the task is already in that cgroup */
@@ -1809,46 +1877,16 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		}
 	}
 
-	task_lock(tsk);
-	cg = tsk->cgroups;
-	get_css_set(cg);
-	task_unlock(tsk);
-	/*
-	 * Locate or allocate a new css_set for this task,
-	 * based on its final set of cgroups
-	 */
-	newcg = find_css_set(cg, cgrp);
-	put_css_set(cg);
-	if (!newcg) {
-		retval = -ENOMEM;
-		goto out;
-	}
-
-	task_lock(tsk);
-	if (tsk->flags & PF_EXITING) {
-		task_unlock(tsk);
-		put_css_set(newcg);
-		retval = -ESRCH;
+	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, false);
+	if (retval)
 		goto out;
-	}
-	rcu_assign_pointer(tsk->cgroups, newcg);
-	task_unlock(tsk);
-
-	/* Update the css_set linked lists if we're using them */
-	write_lock(&css_set_lock);
-	if (!list_empty(&tsk->cg_list)) {
-		list_del(&tsk->cg_list);
-		list_add(&tsk->cg_list, &newcg->tasks);
-	}
-	write_unlock(&css_set_lock);
 
 	for_each_subsys(root, ss) {
 		if (ss->attach)
 			ss->attach(ss, cgrp, oldcgrp, tsk, false);
 	}
-	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+
 	synchronize_rcu();
-	put_css_set(cg);
 
 	/*
 	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
@@ -1898,49 +1936,339 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 EXPORT_SYMBOL_GPL(cgroup_attach_task_all);
 
 /*
- * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
- * held. May take task_lock of task
+ * cgroup_attach_proc works in two stages, the first of which prefetches all
+ * new css_sets needed (to make sure we have enough memory before committing
+ * to the move) and stores them in a list of entries of the following type.
+ * TODO: possible optimization: use css_set->rcu_head for chaining instead
+ */
+struct cg_list_entry {
+	struct css_set *cg;
+	struct list_head links;
+};
+
+static bool css_set_check_fetched(struct cgroup *cgrp,
+				  struct task_struct *tsk, struct css_set *cg,
+				  struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+
+	read_lock(&css_set_lock);
+	newcg = find_existing_css_set(cg, cgrp, template);
+	if (newcg)
+		get_css_set(newcg);
+	read_unlock(&css_set_lock);
+
+	/* doesn't exist at all? */
+	if (!newcg)
+		return false;
+	/* see if it's already in the list */
+	list_for_each_entry(cg_entry, newcg_list, links) {
+		if (cg_entry->cg == newcg) {
+			put_css_set(newcg);
+			return true;
+		}
+	}
+
+	/* not found */
+	put_css_set(newcg);
+	return false;
+}
+
+/*
+ * Find the new css_set and store it in the list in preparation for moving the
+ * given task to the given cgroup. Returns 0 or -ENOMEM.
  */
-static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
+static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
+			    struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+
+	/* ensure a new css_set will exist for this thread */
+	newcg = find_css_set(cg, cgrp);
+	if (!newcg)
+		return -ENOMEM;
+	/* add it to the list */
+	cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
+	if (!cg_entry) {
+		put_css_set(newcg);
+		return -ENOMEM;
+	}
+	cg_entry->cg = newcg;
+	list_add(&cg_entry->links, newcg_list);
+	return 0;
+}
+
+/**
+ * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
+ * @cgrp: the cgroup to attach to
+ * @leader: the threadgroup leader task_struct of the group to be attached
+ *
+ * Call holding cgroup_mutex. Will take task_lock of each thread in leader's
+ * threadgroup individually in turn.
+ */
+int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
+{
+	int retval;
+	struct cgroup_subsys *ss, *failed_ss = NULL;
+	struct cgroup *oldcgrp;
+	struct css_set *oldcg;
+	struct cgroupfs_root *root = cgrp->root;
+	/* threadgroup list cursor */
+	struct task_struct *tsk;
+	/*
+	 * we need to make sure we have css_sets for all the tasks we're
+	 * going to move -before- we actually start moving them, so that in
+	 * case we get an ENOMEM we can bail out before making any changes.
+	 */
+	struct list_head newcg_list;
+	struct cg_list_entry *cg_entry, *temp_nobe;
+
+	/* check that we can legitimately attach to the cgroup. */
+	for_each_subsys(root, ss) {
+		if (ss->can_attach) {
+			retval = ss->can_attach(ss, cgrp, leader, true);
+			if (retval) {
+				failed_ss = ss;
+				goto out;
+			}
+		}
+	}
+
+	/*
+	 * step 1: make sure css_sets exist for all threads to be migrated.
+	 * we use find_css_set, which allocates a new one if necessary.
+	 */
+	INIT_LIST_HEAD(&newcg_list);
+	oldcgrp = task_cgroup_from_root(leader, root);
+	if (cgrp != oldcgrp) {
+		/* get old css_set */
+		task_lock(leader);
+		if (leader->flags & PF_EXITING) {
+			task_unlock(leader);
+			goto prefetch_loop;
+		}
+		oldcg = leader->cgroups;
+		get_css_set(oldcg);
+		task_unlock(leader);
+		/* acquire new one */
+		retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+		put_css_set(oldcg);
+		if (retval)
+			goto list_teardown;
+	}
+prefetch_loop:
+	rcu_read_lock();
+	/* sanity check - if we raced with de_thread, we must abort */
+	if (!thread_group_leader(leader)) {
+		retval = -EAGAIN;
+		goto list_teardown;
+	}
+	/*
+	 * if we need to fetch a new css_set for this task, we must exit the
+	 * rcu_read section because allocating it can sleep. afterwards, we'll
+	 * need to restart iteration on the threadgroup list - the whole thing
+	 * will be O(nm) in the number of threads and css_sets; as the typical
+	 * case has only one css_set for all of them, usually O(n). which ones
+	 * we need allocated won't change as long as we hold cgroup_mutex.
+	 */
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* nothing to do if this task is already in the cgroup */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* get old css_set pointer */
+		task_lock(tsk);
+		if (tsk->flags & PF_EXITING) {
+			/* ignore this task if it's going away */
+			task_unlock(tsk);
+			continue;
+		}
+		oldcg = tsk->cgroups;
+		get_css_set(oldcg);
+		task_unlock(tsk);
+		/* see if the new one for us is already in the list? */
+		if (css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list)) {
+			/* was already there, nothing to do. */
+			put_css_set(oldcg);
+		} else {
+			/* we don't already have it. get new one. */
+			rcu_read_unlock();
+			retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+			put_css_set(oldcg);
+			if (retval)
+				goto list_teardown;
+			/* begin iteration again. */
+			goto prefetch_loop;
+		}
+	}
+	rcu_read_unlock();
+
+	/*
+	 * step 2: now that we're guaranteed success wrt the css_sets, proceed
+	 * to move all tasks to the new cgroup. we need to lock against possible
+	 * races with fork(). note: we can safely take the threadgroup_fork_lock
+	 * of leader since attach_task_by_pid took a reference.
+	 * threadgroup_fork_lock must be taken outside of tasklist_lock to match
+	 * the order in the fork path.
+	 */
+	threadgroup_fork_write_lock(leader);
+	read_lock(&tasklist_lock);
+	/* sanity check - if we raced with de_thread, we must abort */
+	if (!thread_group_leader(leader)) {
+		retval = -EAGAIN;
+		read_unlock(&tasklist_lock);
+		threadgroup_fork_write_unlock(leader);
+		goto list_teardown;
+	}
+	/*
+	 * No failure cases left, so this is the commit point.
+	 *
+	 * If the leader is already there, skip moving him. Note: even if the
+	 * leader is PF_EXITING, we still move all other threads; if everybody
+	 * is PF_EXITING, we end up doing nothing, which is ok.
+	 */
+	oldcgrp = task_cgroup_from_root(leader, root);
+	if (cgrp != oldcgrp) {
+		retval = cgroup_task_migrate(cgrp, oldcgrp, leader, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+	/* Now iterate over each thread in the group. */
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* leave current thread as it is if it's already there */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* we don't care whether these threads are exiting */
+		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+
+	/*
+	 * step 3: attach whole threadgroup to each subsystem
+	 * TODO: if ever a subsystem needs to know the oldcgrp for each task
+	 * being moved, this call will need to be reworked to communicate that.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->attach)
+			ss->attach(ss, cgrp, oldcgrp, leader, true);
+	}
+	/* holding these until here keeps us safe from exec() and fork(). */
+	read_unlock(&tasklist_lock);
+	threadgroup_fork_write_unlock(leader);
+
+	/*
+	 * step 4: success! and cleanup
+	 */
+	synchronize_rcu();
+	cgroup_wakeup_rmdir_waiter(cgrp);
+	retval = 0;
+list_teardown:
+	/* clean up the list of prefetched css_sets. */
+	list_for_each_entry_safe(cg_entry, temp_nobe, &newcg_list, links) {
+		list_del(&cg_entry->links);
+		put_css_set(cg_entry->cg);
+		kfree(cg_entry);
+	}
+out:
+	if (retval) {
+		/* same deal as in cgroup_attach_task, with threadgroup=true */
+		for_each_subsys(root, ss) {
+			if (ss == failed_ss)
+				break;
+			if (ss->cancel_attach)
+				ss->cancel_attach(ss, cgrp, leader, true);
+		}
+	}
+	return retval;
+}
+
+/*
+ * Find the task_struct of the task to attach by vpid and pass it along to the
+ * function to attach either it or all tasks in its threadgroup. Will take
+ * cgroup_mutex; may take task_lock of task.
+ */
+static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 {
 	struct task_struct *tsk;
 	const struct cred *cred = current_cred(), *tcred;
 	int ret;
 
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+
 	if (pid) {
 		rcu_read_lock();
 		tsk = find_task_by_vpid(pid);
-		if (!tsk || tsk->flags & PF_EXITING) {
+		if (!tsk) {
+			rcu_read_unlock();
+			cgroup_unlock();
+			return -ESRCH;
+		}
+		if (threadgroup) {
+			/*
+			 * it is safe to find group_leader because tsk was found
+			 * in the tid map, meaning it can't have been unhashed
+			 * by someone in de_thread changing the leadership.
+			 */
+			tsk = tsk->group_leader;
+			BUG_ON(!thread_group_leader(tsk));
+		} else if (tsk->flags & PF_EXITING) {
+			/* optimization for the single-task-only case */
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -ESRCH;
 		}
 
+		/*
+		 * even if we're attaching all tasks in the thread group, we
+		 * only need to check permissions on one of them.
+		 */
 		tcred = __task_cred(tsk);
 		if (cred->euid &&
 		    cred->euid != tcred->uid &&
 		    cred->euid != tcred->suid) {
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -EACCES;
 		}
 		get_task_struct(tsk);
 		rcu_read_unlock();
 	} else {
-		tsk = current;
+		if (threadgroup)
+			tsk = current->group_leader;
+		else
+			tsk = current;
 		get_task_struct(tsk);
 	}
 
-	ret = cgroup_attach_task(cgrp, tsk);
+	if (threadgroup)
+		ret = cgroup_attach_proc(cgrp, tsk);
+	else
+		ret = cgroup_attach_task(cgrp, tsk);
 	put_task_struct(tsk);
+	cgroup_unlock();
 	return ret;
 }
 
 static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
 {
+	return attach_task_by_pid(cgrp, pid, false);
+}
+
+static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
+{
 	int ret;
-	if (!cgroup_lock_live_group(cgrp))
-		return -ENODEV;
-	ret = attach_task_by_pid(cgrp, pid);
-	cgroup_unlock();
+	do {
+		/*
+		 * attach_proc fails with -EAGAIN if threadgroup leadership
+		 * changes in the middle of the operation, in which case we need
+		 * to find the task_struct for the new leader and start over.
+		 */
+		ret = attach_task_by_pid(cgrp, tgid, true);
+	} while (ret == -EAGAIN);
 	return ret;
 }
 
@@ -3294,9 +3622,9 @@ static struct cftype files[] = {
 	{
 		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
 		.open = cgroup_procs_open,
-		/* .write_u64 = cgroup_procs_write, TODO */
+		.write_u64 = cgroup_procs_write,
 		.release = cgroup_pidlist_release,
-		.mode = S_IRUGO,
+		.mode = S_IRUGO | S_IWUSR,
 	},
 	{
 		.name = "notify_on_release",

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v6 3/3] cgroups: make procs file writable
@ 2010-12-24  8:24             ` Ben Blum
  0 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-12-24  8:24 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage, oleg

[-- Attachment #1: cgroup-procs-writable.patch --]
[-- Type: text/plain, Size: 17555 bytes --]

Makes procs file writable to move all threads by tgid at once

From: Ben Blum <bblum@andrew.cmu.edu>

This patch adds functionality that enables users to move all threads in a
threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
file. This current implementation makes use of a per-threadgroup rwsem that's
taken for reading in the fork() path to prevent newly forking threads within
the threadgroup from "escaping" while the move is in progress.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
---
 Documentation/cgroups/cgroups.txt |   13 +
 kernel/cgroup.c                   |  424 +++++++++++++++++++++++++++++++++----
 2 files changed, 387 insertions(+), 50 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 190018b..07674e5 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -236,7 +236,8 @@ containing the following files describing that cgroup:
  - cgroup.procs: list of tgids in the cgroup.  This list is not
    guaranteed to be sorted or free of duplicate tgids, and userspace
    should sort/uniquify the list if this property is required.
-   This is a read-only file, for now.
+   Writing a thread group id into this file moves all threads in that
+   group into this cgroup.
  - notify_on_release flag: run the release agent on exit?
  - release_agent: the path to use for release notifications (this file
    exists in the top cgroup only)
@@ -426,6 +427,12 @@ You can attach the current shell task by echoing 0:
 
 # echo 0 > tasks
 
+You can use the cgroup.procs file instead of the tasks file to move all
+threads in a threadgroup at once. Echoing the pid of any task in a
+threadgroup to cgroup.procs causes all tasks in that threadgroup to be
+attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
+in the writing task's threadgroup.
+
 2.3 Mounting hierarchies by name
 --------------------------------
 
@@ -574,7 +581,9 @@ called on a fork. If this method returns 0 (success) then this should
 remain valid while the caller holds cgroup_mutex and it is ensured that either
 attach() or cancel_attach() will be called in future. If threadgroup is
 true, then a successful result indicates that all threads in the given
-thread's threadgroup can be moved together.
+thread's threadgroup can be moved together. If the subsystem wants to
+iterate over task->thread_group, it must take rcu_read_lock then check
+if thread_group_leader(task), returning -EAGAIN if that fails.
 
 void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	       struct task_struct *task, bool threadgroup)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f86dd9c..74be02c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1771,6 +1771,76 @@ int cgroup_can_attach_per_thread(struct cgroup *cgrp, struct task_struct *task,
 }
 EXPORT_SYMBOL_GPL(cgroup_can_attach_per_thread);
 
+/*
+ * cgroup_task_migrate - move a task from one cgroup to another.
+ *
+ * 'guarantee' is set if the caller promises that a new css_set for the task
+ * will already exist. If not set, this function might sleep, and can fail with
+ * -ENOMEM. Otherwise, it can only fail with -ESRCH.
+ */
+static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
+			       struct task_struct *tsk, bool guarantee)
+{
+	struct css_set *oldcg;
+	struct css_set *newcg;
+
+	/*
+	 * get old css_set. we need to take task_lock and refcount it, because
+	 * an exiting task can change its css_set to init_css_set and drop its
+	 * old one without taking cgroup_mutex.
+	 */
+	task_lock(tsk);
+	oldcg = tsk->cgroups;
+	get_css_set(oldcg);
+	task_unlock(tsk);
+
+	/* locate or allocate a new css_set for this task. */
+	if (guarantee) {
+		/* we know the css_set we want already exists. */
+		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+		read_lock(&css_set_lock);
+		newcg = find_existing_css_set(oldcg, cgrp, template);
+		BUG_ON(!newcg);
+		get_css_set(newcg);
+		read_unlock(&css_set_lock);
+	} else {
+		might_sleep();
+		/* find_css_set will give us newcg already referenced. */
+		newcg = find_css_set(oldcg, cgrp);
+		if (!newcg) {
+			put_css_set(oldcg);
+			return -ENOMEM;
+		}
+	}
+	put_css_set(oldcg);
+
+	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
+	task_lock(tsk);
+	if (tsk->flags & PF_EXITING) {
+		task_unlock(tsk);
+		put_css_set(newcg);
+		return -ESRCH;
+	}
+	rcu_assign_pointer(tsk->cgroups, newcg);
+	task_unlock(tsk);
+
+	/* Update the css_set linked lists if we're using them */
+	write_lock(&css_set_lock);
+	if (!list_empty(&tsk->cg_list))
+		list_move(&tsk->cg_list, &newcg->tasks);
+	write_unlock(&css_set_lock);
+
+	/*
+	 * We just gained a reference on oldcg by taking it from the task. As
+	 * trading it for newcg is protected by cgroup_mutex, we're safe to drop
+	 * it here; it will be freed under RCU.
+	 */
+	put_css_set(oldcg);
+
+	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+	return 0;
+}
+
 /**
  * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
  * @cgrp: the cgroup the task is attaching to
@@ -1781,11 +1851,9 @@ EXPORT_SYMBOL_GPL(cgroup_can_attach_per_thread);
  */
 int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
-	int retval = 0;
+	int retval;
 	struct cgroup_subsys *ss, *failed_ss = NULL;
 	struct cgroup *oldcgrp;
-	struct css_set *cg;
-	struct css_set *newcg;
 	struct cgroupfs_root *root = cgrp->root;
 
 	/* Nothing to do if the task is already in that cgroup */
@@ -1809,46 +1877,16 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		}
 	}
 
-	task_lock(tsk);
-	cg = tsk->cgroups;
-	get_css_set(cg);
-	task_unlock(tsk);
-	/*
-	 * Locate or allocate a new css_set for this task,
-	 * based on its final set of cgroups
-	 */
-	newcg = find_css_set(cg, cgrp);
-	put_css_set(cg);
-	if (!newcg) {
-		retval = -ENOMEM;
-		goto out;
-	}
-
-	task_lock(tsk);
-	if (tsk->flags & PF_EXITING) {
-		task_unlock(tsk);
-		put_css_set(newcg);
-		retval = -ESRCH;
+	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, false);
+	if (retval)
 		goto out;
-	}
-	rcu_assign_pointer(tsk->cgroups, newcg);
-	task_unlock(tsk);
-
-	/* Update the css_set linked lists if we're using them */
-	write_lock(&css_set_lock);
-	if (!list_empty(&tsk->cg_list)) {
-		list_del(&tsk->cg_list);
-		list_add(&tsk->cg_list, &newcg->tasks);
-	}
-	write_unlock(&css_set_lock);
 
 	for_each_subsys(root, ss) {
 		if (ss->attach)
 			ss->attach(ss, cgrp, oldcgrp, tsk, false);
 	}
-	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+
 	synchronize_rcu();
-	put_css_set(cg);
 
 	/*
 	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
@@ -1898,49 +1936,339 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 EXPORT_SYMBOL_GPL(cgroup_attach_task_all);
 
 /*
- * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
- * held. May take task_lock of task
+ * cgroup_attach_proc works in two stages, the first of which prefetches all
+ * new css_sets needed (to make sure we have enough memory before committing
+ * to the move) and stores them in a list of entries of the following type.
+ * TODO: possible optimization: use css_set->rcu_head for chaining instead
+ */
+struct cg_list_entry {
+	struct css_set *cg;
+	struct list_head links;
+};
+
+static bool css_set_check_fetched(struct cgroup *cgrp,
+				  struct task_struct *tsk, struct css_set *cg,
+				  struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+
+	read_lock(&css_set_lock);
+	newcg = find_existing_css_set(cg, cgrp, template);
+	if (newcg)
+		get_css_set(newcg);
+	read_unlock(&css_set_lock);
+
+	/* doesn't exist at all? */
+	if (!newcg)
+		return false;
+	/* see if it's already in the list */
+	list_for_each_entry(cg_entry, newcg_list, links) {
+		if (cg_entry->cg == newcg) {
+			put_css_set(newcg);
+			return true;
+		}
+	}
+
+	/* not found */
+	put_css_set(newcg);
+	return false;
+}
+
+/*
+ * Find the new css_set and store it in the list in preparation for moving the
+ * given task to the given cgroup. Returns 0 or -ENOMEM.
  */
-static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
+static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
+			    struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+
+	/* ensure a new css_set will exist for this thread */
+	newcg = find_css_set(cg, cgrp);
+	if (!newcg)
+		return -ENOMEM;
+	/* add it to the list */
+	cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
+	if (!cg_entry) {
+		put_css_set(newcg);
+		return -ENOMEM;
+	}
+	cg_entry->cg = newcg;
+	list_add(&cg_entry->links, newcg_list);
+	return 0;
+}
+
+/**
+ * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
+ * @cgrp: the cgroup to attach to
+ * @leader: the threadgroup leader task_struct of the group to be attached
+ *
+ * Call holding cgroup_mutex. Will take task_lock of each thread in leader's
+ * threadgroup individually in turn.
+ */
+int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
+{
+	int retval;
+	struct cgroup_subsys *ss, *failed_ss = NULL;
+	struct cgroup *oldcgrp;
+	struct css_set *oldcg;
+	struct cgroupfs_root *root = cgrp->root;
+	/* threadgroup list cursor */
+	struct task_struct *tsk;
+	/*
+	 * we need to make sure we have css_sets for all the tasks we're
+	 * going to move -before- we actually start moving them, so that in
+	 * case we get an ENOMEM we can bail out before making any changes.
+	 */
+	struct list_head newcg_list;
+	struct cg_list_entry *cg_entry, *temp_nobe;
+
+	/* check that we can legitimately attach to the cgroup. */
+	for_each_subsys(root, ss) {
+		if (ss->can_attach) {
+			retval = ss->can_attach(ss, cgrp, leader, true);
+			if (retval) {
+				failed_ss = ss;
+				goto out;
+			}
+		}
+	}
+
+	/*
+	 * step 1: make sure css_sets exist for all threads to be migrated.
+	 * we use find_css_set, which allocates a new one if necessary.
+	 */
+	INIT_LIST_HEAD(&newcg_list);
+	oldcgrp = task_cgroup_from_root(leader, root);
+	if (cgrp != oldcgrp) {
+		/* get old css_set */
+		task_lock(leader);
+		if (leader->flags & PF_EXITING) {
+			task_unlock(leader);
+			goto prefetch_loop;
+		}
+		oldcg = leader->cgroups;
+		get_css_set(oldcg);
+		task_unlock(leader);
+		/* acquire new one */
+		retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+		put_css_set(oldcg);
+		if (retval)
+			goto list_teardown;
+	}
+prefetch_loop:
+	rcu_read_lock();
+	/* sanity check - if we raced with de_thread, we must abort */
+	if (!thread_group_leader(leader)) {
+		retval = -EAGAIN;
+		goto list_teardown;
+	}
+	/*
+	 * if we need to fetch a new css_set for this task, we must exit the
+	 * rcu_read section because allocating it can sleep. afterwards, we'll
+	 * need to restart iteration on the threadgroup list - the whole thing
+	 * will be O(nm) in the number of threads and css_sets; as the typical
+	 * case has only one css_set for all of them, usually O(n). which ones
+	 * we need allocated won't change as long as we hold cgroup_mutex.
+	 */
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* nothing to do if this task is already in the cgroup */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* get old css_set pointer */
+		task_lock(tsk);
+		if (tsk->flags & PF_EXITING) {
+			/* ignore this task if it's going away */
+			task_unlock(tsk);
+			continue;
+		}
+		oldcg = tsk->cgroups;
+		get_css_set(oldcg);
+		task_unlock(tsk);
+		/* see if the new one for us is already in the list? */
+		if (css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list)) {
+			/* was already there, nothing to do. */
+			put_css_set(oldcg);
+		} else {
+			/* we don't already have it. get new one. */
+			rcu_read_unlock();
+			retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+			put_css_set(oldcg);
+			if (retval)
+				goto list_teardown;
+			/* begin iteration again. */
+			goto prefetch_loop;
+		}
+	}
+	rcu_read_unlock();
+
+	/*
+	 * step 2: now that we're guaranteed success wrt the css_sets, proceed
+	 * to move all tasks to the new cgroup. we need to lock against possible
+	 * races with fork(). note: we can safely take the threadgroup_fork_lock
+	 * of leader since attach_task_by_pid took a reference.
+	 * threadgroup_fork_lock must be taken outside of tasklist_lock to match
+	 * the order in the fork path.
+	 */
+	threadgroup_fork_write_lock(leader);
+	read_lock(&tasklist_lock);
+	/* sanity check - if we raced with de_thread, we must abort */
+	if (!thread_group_leader(leader)) {
+		retval = -EAGAIN;
+		read_unlock(&tasklist_lock);
+		threadgroup_fork_write_unlock(leader);
+		goto list_teardown;
+	}
+	/*
+	 * No failure cases left, so this is the commit point.
+	 *
+	 * If the leader is already there, skip moving him. Note: even if the
+	 * leader is PF_EXITING, we still move all other threads; if everybody
+	 * is PF_EXITING, we end up doing nothing, which is ok.
+	 */
+	oldcgrp = task_cgroup_from_root(leader, root);
+	if (cgrp != oldcgrp) {
+		retval = cgroup_task_migrate(cgrp, oldcgrp, leader, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+	/* Now iterate over each thread in the group. */
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* leave current thread as it is if it's already there */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* we don't care whether these threads are exiting */
+		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+
+	/*
+	 * step 3: attach whole threadgroup to each subsystem
+	 * TODO: if ever a subsystem needs to know the oldcgrp for each task
+	 * being moved, this call will need to be reworked to communicate that.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->attach)
+			ss->attach(ss, cgrp, oldcgrp, leader, true);
+	}
+	/* holding these until here keeps us safe from exec() and fork(). */
+	read_unlock(&tasklist_lock);
+	threadgroup_fork_write_unlock(leader);
+
+	/*
+	 * step 4: success! and cleanup
+	 */
+	synchronize_rcu();
+	cgroup_wakeup_rmdir_waiter(cgrp);
+	retval = 0;
+list_teardown:
+	/* clean up the list of prefetched css_sets. */
+	list_for_each_entry_safe(cg_entry, temp_nobe, &newcg_list, links) {
+		list_del(&cg_entry->links);
+		put_css_set(cg_entry->cg);
+		kfree(cg_entry);
+	}
+out:
+	if (retval) {
+		/* same deal as in cgroup_attach_task, with threadgroup=true */
+		for_each_subsys(root, ss) {
+			if (ss == failed_ss)
+				break;
+			if (ss->cancel_attach)
+				ss->cancel_attach(ss, cgrp, leader, true);
+		}
+	}
+	return retval;
+}
+
+/*
+ * Find the task_struct of the task to attach by vpid and pass it along to the
+ * function to attach either it or all tasks in its threadgroup. Will take
+ * cgroup_mutex; may take task_lock of task.
+ */
+static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 {
 	struct task_struct *tsk;
 	const struct cred *cred = current_cred(), *tcred;
 	int ret;
 
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+
 	if (pid) {
 		rcu_read_lock();
 		tsk = find_task_by_vpid(pid);
-		if (!tsk || tsk->flags & PF_EXITING) {
+		if (!tsk) {
+			rcu_read_unlock();
+			cgroup_unlock();
+			return -ESRCH;
+		}
+		if (threadgroup) {
+			/*
+			 * it is safe to find group_leader because tsk was found
+			 * in the tid map, meaning it can't have been unhashed
+			 * by someone in de_thread changing the leadership.
+			 */
+			tsk = tsk->group_leader;
+			BUG_ON(!thread_group_leader(tsk));
+		} else if (tsk->flags & PF_EXITING) {
+			/* optimization for the single-task-only case */
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -ESRCH;
 		}
 
+		/*
+		 * even if we're attaching all tasks in the thread group, we
+		 * only need to check permissions on one of them.
+		 */
 		tcred = __task_cred(tsk);
 		if (cred->euid &&
 		    cred->euid != tcred->uid &&
 		    cred->euid != tcred->suid) {
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -EACCES;
 		}
 		get_task_struct(tsk);
 		rcu_read_unlock();
 	} else {
-		tsk = current;
+		if (threadgroup)
+			tsk = current->group_leader;
+		else
+			tsk = current;
 		get_task_struct(tsk);
 	}
 
-	ret = cgroup_attach_task(cgrp, tsk);
+	if (threadgroup)
+		ret = cgroup_attach_proc(cgrp, tsk);
+	else
+		ret = cgroup_attach_task(cgrp, tsk);
 	put_task_struct(tsk);
+	cgroup_unlock();
 	return ret;
 }
 
 static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
 {
+	return attach_task_by_pid(cgrp, pid, false);
+}
+
+static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
+{
 	int ret;
-	if (!cgroup_lock_live_group(cgrp))
-		return -ENODEV;
-	ret = attach_task_by_pid(cgrp, pid);
-	cgroup_unlock();
+	do {
+		/*
+		 * attach_proc fails with -EAGAIN if threadgroup leadership
+		 * changes in the middle of the operation, in which case we need
+		 * to find the task_struct for the new leader and start over.
+		 */
+		ret = attach_task_by_pid(cgrp, tgid, true);
+	} while (ret == -EAGAIN);
 	return ret;
 }
 
@@ -3294,9 +3622,9 @@ static struct cftype files[] = {
 	{
 		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
 		.open = cgroup_procs_open,
-		/* .write_u64 = cgroup_procs_write, TODO */
+		.write_u64 = cgroup_procs_write,
 		.release = cgroup_pidlist_release,
-		.mode = S_IRUGO,
+		.mode = S_IRUGO | S_IWUSR,
 	},
 	{
 		.name = "notify_on_release",

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                         ` <20101224033352.GA7804-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-12-24 10:49                           ` David Rientjes
       [not found]                             ` <alpine.DEB.2.00.1012240245040.775-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
  2010-12-25  2:55                           ` Ben Blum
  2010-12-25  4:24                           ` [PATCH v5 3/3] cgroups: make procs file writable Ben Blum
  2 siblings, 1 reply; 185+ messages in thread
From: David Rientjes @ 2010-12-24 10:49 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, Miao Xie, Andrew Morton, Paul Menage,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Thu, 23 Dec 2010, Ben Blum wrote:

> On Thu, Dec 16, 2010 at 12:26:03AM -0800, Andrew Morton wrote:
> > Patches have gone a bit stale, sorry.  Refactoring in
> > kernel/cgroup_freezer.c necessitates a refresh and retest please.
> 
> commit 53feb29767c29c877f9d47dcfe14211b5b0f7ebd changed a bunch of stuff
> in kernel/cpuset.c to allocate nodemasks with NODEMASK_ALLOC (which
> wraps kmalloc) instead of on the stack.
> 

It only wraps kmalloc for CONFIG_NODES_SHIFT > 8.

> 1. All these functions have 'void' return values, indicating that
>    calling them must not fail. Sure there are bailout cases, but no
>    semblance of cross-function error propagation. Most importantly,
>    cpuset_attach is a subsystem callback, which MUST not fail given the
>    way it's used in cgroups, so relying on kmalloc is not safe.
> 

Yes, that's a valid concern that should be addressed.

> 2. I'm working on a patch series which needs to hold tasklist_lock
>    across ss->attach callbacks (in cpuset_attach's "if (threadgroup)"
>    case, this is how safe traversal of tsk->thread_group will be
>    ensured), and kmalloc isn't allowed while holding a spin-lock. 
> 

kmalloc() is allowed while holding a spinlock and NODEMASK_ALLOC() takes a 
gfp_flags argument for that reason.

> Why do we need heap-allocation here at all? In each case their scope is
> exactly the function's scope, and neither the commit nor the surrounding
> patch series give any explanation. I'd like to revert the patch if
> possible.
> 

Because some kernels, such as those with CONFIG_NODES_SHIFT > 8, cause 
stack overflows with the large nodemasks.

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                             ` <alpine.DEB.2.00.1012240245040.775-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
@ 2010-12-24 11:45                               ` Ben Blum
       [not found]                                 ` <20101224114500.GA18036-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: Ben Blum @ 2010-12-24 11:45 UTC (permalink / raw)
  To: David Rientjes
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, ebiederm-aS9lmoZGLiVWk0Htik3J/w, Miao Xie,
	Andrew Morton, Paul Menage

On Fri, Dec 24, 2010 at 02:49:43AM -0800, David Rientjes wrote:
> On Thu, 23 Dec 2010, Ben Blum wrote:
> 
> > On Thu, Dec 16, 2010 at 12:26:03AM -0800, Andrew Morton wrote:
> > > Patches have gone a bit stale, sorry.  Refactoring in
> > > kernel/cgroup_freezer.c necessitates a refresh and retest please.
> > 
> > commit 53feb29767c29c877f9d47dcfe14211b5b0f7ebd changed a bunch of stuff
> > in kernel/cpuset.c to allocate nodemasks with NODEMASK_ALLOC (which
> > wraps kmalloc) instead of on the stack.
> > 
> 
> It only wraps kmalloc for CONFIG_NODES_SHIFT > 8.
> 
> > 1. All these functions have 'void' return values, indicating that
> >    calling them them must not fail. Sure there are bailout cases, but no
> >    semblance of cross-function error propagation. Most importantly,
> >    cpuset_attach is a subsystem callback, which MUST not fail given the
> >    way it's used in cgroups, so relying on kmalloc is not safe.
> > 
> 
> Yes, that's a valid concern that should be addressed.

depending on the circumstances that would cause such a failure, and the
consequences of "ignoring" it, would doing a cop-out with WARN_ON() be
appropriate?

> > 2. I'm working on a patch series which needs to hold tasklist_lock
> >    across ss->attach callbacks (in cpuset_attach's "if (threadgroup)"
> >    case, this is how safe traversal of tsk->thread_group will be
> >    ensured), and kmalloc isn't allowed while holding a spin-lock. 
> > 
> 
> kmalloc() is allowed while holding a spinlock and NODEMASK_ALLOC() takes a 
> gfp_flags argument for that reason.

Ah, it's only with GFP_KERNEL and friends. So changing the uses in
cpuset_can_attach to GFP_ATOMIC would solve this concern, then?

Then there are no extra issues with my patchset, but note: the use cases
for calling ->attach are going to be hairy enough that it won't be
reasonable to require the caller to back-track if an allocation fails...

> > Why do we need heap-allocation here at all? In each case their scope is
> > exactly the function's scope, and neither the commit nor the surrounding
> > patch series give any explanation. I'd like to revert the patch if
> > possible.
> > 
> 
> Because some kernels, such as those with CONFIG_NODES_SHIFT > 8, cause 
> stack overflows with the large nodemasks.

Ah. I couldn't find any documentation for what the maximum number of
memory controllers could be. This makes sense.

Thanks!

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                 ` <20101224114500.GA18036-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-12-24 11:53                                   ` Andrew Morton
       [not found]                                     ` <20101224035331.b907b410.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: Andrew Morton @ 2010-12-24 11:53 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, Miao Xie, David Rientjes, Paul Menage,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Fri, 24 Dec 2010 06:45:00 -0500 Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:

> > kmalloc() is allowed while holding a spinlock and NODEMASK_ALLOC() takes a 
> > gfp_flags argument for that reason.
> 
> Ah, it's only with GFP_KERNEL and friends. So changing the uses in
> cpuset_can_attach to GFP_ATOMIC would solve this concern, then?

It would introduce an additional concern.  GFP_ATOMIC is bad, for a
number of reasons.  The main one of which is that it is vastly less
reliable than GFP_KERNEL.  And making the cpuset code less reliable
is a regression, no?

Please try to find a better solution.

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                     ` <20101224035331.b907b410.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2010-12-24 12:08                                       ` Ben Blum
       [not found]                                         ` <20101224120853.GA18518-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: Ben Blum @ 2010-12-24 12:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, ebiederm-aS9lmoZGLiVWk0Htik3J/w, Miao Xie,
	David Rientjes, Paul Menage

On Fri, Dec 24, 2010 at 03:53:31AM -0800, Andrew Morton wrote:
> On Fri, 24 Dec 2010 06:45:00 -0500 Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> 
> > > kmalloc() is allowed while holding a spinlock and NODEMASK_ALLOC() takes a 
> > > gfp_flags argument for that reason.
> > 
> > Ah, it's only with GFP_KERNEL and friends. So changing the uses in
> > cpuset_can_attach to GFP_ATOMIC would solve this concern, then?
> 
> It would introduce an additional concern.  GFP_ATOMIC is bad, for a
> number of reasons.  The main one of which is that it is vastly less
> reliable than GFP_KERNEL.  And making the cpuset code less reliable
> is a regression, no?
> 
> Please try to find a better solution.

Good point. How about pre-allocating the nodemasks in cpuset_can_attach,
and having a cpuset_cancel_attach function which can free them up?

They could be stored in the struct cpuset (protected by cgroup_mutex)
after being pre-allocated - but also only if a heap-allocation was
required, so there might need to be an extra interface, like
"NODEMASK_PREALLOC" (a no-op if heap not required, otherwise allocates
and stores in the struct cpuset) and "NODEMASK_RETRIEVE"?
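
Roughly, I'm imagining something like the sketch below, where the
attach_mems field and the overall flow are hypothetical and only meant to
illustrate the prealloc/cancel pairing:

	/*
	 * Sketch only: preallocate in can_attach, where failing is still
	 * allowed, and free in cancel_attach if the operation is aborted.
	 * The attach_mems field is made up for illustration.
	 */
	static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
				     struct task_struct *tsk, bool threadgroup)
	{
		struct cpuset *cs = cgroup_cs(cgrp);

		cs->attach_mems = kmalloc(sizeof(nodemask_t), GFP_KERNEL);
		if (!cs->attach_mems)
			return -ENOMEM;
		/* ... existing checks would follow here ... */
		return 0;
	}

	static void cpuset_cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
					 struct task_struct *tsk, bool threadgroup)
	{
		struct cpuset *cs = cgroup_cs(cgrp);

		kfree(cs->attach_mems);	/* attach aborted; drop the preallocation */
		cs->attach_mems = NULL;
	}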

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                         ` <20101224120853.GA18518-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-12-24 21:24                                           ` Ben Blum
       [not found]                                             ` <20101224212452.GA27275-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2010-12-24 21:32                                           ` David Rientjes
  1 sibling, 1 reply; 185+ messages in thread
From: Ben Blum @ 2010-12-24 21:24 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, ebiederm-aS9lmoZGLiVWk0Htik3J/w, Miao Xie,
	David Rientjes, Andrew Morton, Paul Menage

On Fri, Dec 24, 2010 at 07:08:53AM -0500, Ben Blum wrote:
> On Fri, Dec 24, 2010 at 03:53:31AM -0800, Andrew Morton wrote:
> > On Fri, 24 Dec 2010 06:45:00 -0500 Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> > 
> > > > kmalloc() is allowed while holding a spinlock and NODEMASK_ALLOC() takes a 
> > > > gfp_flags argument for that reason.
> > > 
> > > Ah, it's only with GFP_KERNEL and friends. So changing the uses in
> > > cpuset_can_attach to GFP_ATOMIC would solve this concern, then?
> > 
> > It would introduce an additional concern.  GFP_ATOMIC is bad, for a
> > number of reasons.  The main one of which is that it is vastly less
> > reliable than GFP_KERNEL.  And making the cpuset code less reliable
> > is a regression, no?
> > 
> > Please try to find a better solution.
> 
> Good point. How about pre-allocating the nodemasks in cpuset_can_attach,
> and having a cpuset_cancel_attach function which can free them up?
> 
> They could be stored in the struct cpuset (protected by cgroup_mutex)
> after being pre-allocated - but also only if a heap-allocation was
> required, so there might need to be an extra interface, like
> "NODEMASK_PREALLOC" (a no-op if heap not required, otherwise allocates
> and stores in the struct cpuset) and "NODEMASK_RETRIEVE"?
> 
> -- Ben

Oh, also, most (not all) times that NODEMASK_ALLOC is used in cpusets,
cgroup_mutex is also held. So how about just using static storage for
them? (There could be a new macro "NODEMASK_ALLOC_STATIC", for use when
the caller can never race against itself.) As long as the call-graph
isn't recursive, there shouldn't be a problem...
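
As a sketch (the name and form are hypothetical, and this is only safe when
every caller is serialized, e.g. by cgroup_mutex):

	/*
	 * Hypothetical static variant of NODEMASK_ALLOC: no kmalloc, usable
	 * only where callers cannot race against themselves.
	 */
	#define NODEMASK_ALLOC_STATIC(type, name)	\
		static type _##name;			\
		type *name = &_##name
	#define NODEMASK_FREE_STATIC(m)		do { } while (0)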

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                         ` <20101224120853.GA18518-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2010-12-24 21:24                                           ` Ben Blum
@ 2010-12-24 21:32                                           ` David Rientjes
  1 sibling, 0 replies; 185+ messages in thread
From: David Rientjes @ 2010-12-24 21:32 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, Miao Xie, Andrew Morton, Paul Menage,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Fri, 24 Dec 2010, Ben Blum wrote:

> Good point. How about pre-allocating the nodemasks in cpuset_can_attach,
> and having a cpuset_cancel_attach function which can free them up?
> 
> They could be stored in the struct cpuset (protected by cgroup_mutex)
> after being pre-allocated - but also only if a heap-allocation was
> required, so there might need to be an extra interface, like
> "NODEMASK_PREALLOC" (a no-op if heap not required, otherwise allocates
> and stores in the struct cpuset) and "NODEMASK_RETRIEVE"?
> 

I don't think it should be limited to cpusets only, since what's being
described is a characteristic of cgroups: other subsystems may need to
allocate nodemasks in their attach functions now or in the future.

If you're protecting the attach function with cgroup_mutex (or can protect 
it with a lock), then you should be able to statically allocate the 
nodemasks.  Such an implementation could complement NODEMASK_ALLOC() but 
be done statically regardless of CONFIG_NODES_SHIFT.

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                             ` <20101224212452.GA27275-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-12-24 21:34                                               ` David Rientjes
       [not found]                                                 ` <alpine.DEB.2.00.1012241333010.13509-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: David Rientjes @ 2010-12-24 21:34 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, Miao Xie, Andrew Morton, Paul Menage,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Fri, 24 Dec 2010, Ben Blum wrote:

> Oh, also, most (not all) times that NODEMASK_ALLOC is used in cpusets,
> cgroup_mutex is also held. So how about just using static storage for
> them? (There could be a new macro "NODEMASK_ALLOC_STATIC", for use when
> the caller can never race against itself.) As long as the call-graph
> isn't recursive, there shouldn't be a problem...
> 

Yes, that sounds good but I'd suggest using it only when dynamic 
allocation cannot be done with GFP_KERNEL to avoid the waste.

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                                 ` <alpine.DEB.2.00.1012241333010.13509-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
@ 2010-12-24 23:09                                                   ` Ben Blum
       [not found]                                                     ` <20101224230901.GA30136-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: Ben Blum @ 2010-12-24 23:09 UTC (permalink / raw)
  To: David Rientjes
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, ebiederm-aS9lmoZGLiVWk0Htik3J/w, Miao Xie,
	Andrew Morton, Paul Menage

On Fri, Dec 24, 2010 at 01:34:06PM -0800, David Rientjes wrote:
> On Fri, 24 Dec 2010, Ben Blum wrote:
> 
> > Oh, also, most (not all) times that NODEMASK_ALLOC is used in cpusets,
> > cgroup_mutex is also held. So how about just using static storage for
> > them? (There could be a new macro "NODEMASK_ALLOC_STATIC", for use when
> > the caller can never race against itself.) As long as the call-graph
> > isn't recursive, there shouldn't be a problem...
> > 
> 
> Yes, that sounds good but I'd suggest using it only when dynamic 
> allocation cannot be done with GFP_KERNEL to avoid the waste.

I'll add a patch to my current series to do this. Should I leave alone
the other cases where an out-of-memory causes a silent failure?
(cpuset_change_nodemask, scan_for_empty_cpusets)

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                         ` <20101224033352.GA7804-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2010-12-24 10:49                           ` David Rientjes
@ 2010-12-25  2:55                           ` Ben Blum
       [not found]                             ` <20101225025508.GA649-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2010-12-25  4:24                           ` [PATCH v5 3/3] cgroups: make procs file writable Ben Blum
  2 siblings, 1 reply; 185+ messages in thread
From: Ben Blum @ 2010-12-25  2:55 UTC (permalink / raw)
  To: Ben Blum
  Cc: Daisuke Nishimura,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oleg-H+wXaHxf7aLQT0dZR+AlfA, Miao Xie, David Rientjes,
	Andrew Morton, Paul Menage, ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Thu, Dec 23, 2010 at 10:33:52PM -0500, Ben Blum wrote:
> On Thu, Dec 16, 2010 at 12:26:03AM -0800, Andrew Morton wrote:
> > Patches have gone a bit stale, sorry.  Refactoring in
> > kernel/cgroup_freezer.c necessitates a refresh and retest please.
> 
> commit 53feb29767c29c877f9d47dcfe14211b5b0f7ebd changed a bunch of stuff
> in kernel/cpuset.c to allocate nodemasks with NODEMASK_ALLOC (which
> wraps kmalloc) instead of on the stack.
> 
> 1. All these functions have 'void' return values, indicating that
>    calling them must not fail. Sure there are bailout cases, but no
>    semblance of cross-function error propagation. Most importantly,
>    cpuset_attach is a subsystem callback, which MUST not fail given the
>    way it's used in cgroups, so relying on kmalloc is not safe.
> 
> 2. I'm working on a patch series which needs to hold tasklist_lock
>    across ss->attach callbacks (in cpuset_attach's "if (threadgroup)"
>    case, this is how safe traversal of tsk->thread_group will be
>    ensured), and kmalloc isn't allowed while holding a spin-lock. 
> 
> Why do we need heap-allocation here at all? In each case their scope is
> exactly the function's scope, and neither the commit nor the surrounding
> patch series give any explanation. I'd like to revert the patch if
> possible.
> 
> cc'ing Miao Xie (author) and David Rientjes (acker).
> 
> -- Ben

Well even with the proposed solution to this there is still another
problem that I see - that of mmap_sem. cpuset_attach() calls into
mpol_rebind_mm() and do_migrate_pages(), which take mm->mmap_sem for
writing and reading respectively. This is going to conflict with
tasklist_lock... but moreover, the memcontrol subsys also touches the
task's mm->mmap_sem, holding onto it between mem_cgroup_can_attach() and
mem_cgroup_move_task() - as of b1dd693e5b9348bd68a80e679e03cf9c0973b01b.

So we have (currently, even without my patches):

cgroup_attach_task
(1) cpuset_can_attach
(2) mem_cgroup_can_attach
     - down_read(&mm->mmap_sem);
(3) cpuset_attach
     - mpol_rebind_mm
        - down_write(&mm->mmap_sem);
        - up_write(&mm->mmap_sem);
     - cpuset_migrate_mm
        - do_migrate_pages
           - down_read(&mm->mmap_sem);
           - up_read(&mm->mmap_sem);
(4) mem_cgroup_move_task
     - mem_cgroup_clear_mc
        - up_read(...);

Is there some interdependency I'm missing here that guarantees recursive
locking/deadlock will be avoided? It all looks like typical-case code.

I think we should move taking the mmap_sem all the way up into
cgroup_attach_task and cgroup_attach_proc; it will be held for writing
the whole time. I don't quite understand the mempolicy stuff, but maybe
there are ways to use mpol_rebind_mm and do_migrate_pages when the lock
is already held.

Adding Daisuke Nishimura and Kamezawa Hiroyuki from the commit mentioned
above.

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                         ` <20101224033352.GA7804-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2010-12-24 10:49                           ` David Rientjes
  2010-12-25  2:55                           ` Ben Blum
@ 2010-12-25  4:24                           ` Ben Blum
  2 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-12-25  4:24 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oleg-H+wXaHxf7aLQT0dZR+AlfA, Miao Xie, David Rientjes,
	Andrew Morton, Paul Menage, ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Thu, Dec 23, 2010 at 10:33:52PM -0500, Ben Blum wrote:
> On Thu, Dec 16, 2010 at 12:26:03AM -0800, Andrew Morton wrote:
> > Patches have gone a bit stale, sorry.  Refactoring in
> > kernel/cgroup_freezer.c necessitates a refresh and retest please.
> 
> commit 53feb29767c29c877f9d47dcfe14211b5b0f7ebd changed a bunch of stuff
> in kernel/cpuset.c to allocate nodemasks with NODEMASK_ALLOC (which
> wraps kmalloc) instead of on the stack.
> 
> 1. All these functions have 'void' return values, indicating that
>    calling them must not fail. Sure there are bailout cases, but no
>    semblance of cross-function error propagation. Most importantly,
>    cpuset_attach is a subsystem callback, which MUST not fail given the
>    way it's used in cgroups, so relying on kmalloc is not safe.
> 
> 2. I'm working on a patch series which needs to hold tasklist_lock
>    across ss->attach callbacks (in cpuset_attach's "if (threadgroup)"
>    case, this is how safe traversal of tsk->thread_group will be
>    ensured), and kmalloc isn't allowed while holding a spin-lock. 
> 
> Why do we need heap-allocation here at all? In each case their scope is
> exactly the function's scope, and neither the commit nor the surrounding
> patch series give any explanation. I'd like to revert the patch if
> possible.
> 
> cc'ing Miao Xie (author) and David Rientjes (acker).
> 
> -- Ben
> 

By the way, there is still another issue on top of this.

cpuset_attach
-> do_migrate_pages
-> migrate_prep
-> lru_add_drain_all
-> __alloc_percpu, which does GFP_KERNEL, which can sleep.

This will be no good if we call ss->attach with tasklist_lock held. A
friend points out that even without the sleeping/etc you still don't
want to be doing a memory migration while holding a spinlock.

I think the best solution for this would be to add another subsystem
callback, "attach_thread", which would do the cheap once-per-thread
operations and be called under tasklist_lock. It'd look like this:

read_lock(tasklist_lock)
for each thread (c) {
    cgroup_task_migrate(c)
    for each subsys (ss) {
        ss->attach_thread(c)
    }
}
read_unlock(tasklist_lock)
for each subsys (ss) {
    ss->attach(leader)
}

For cpuset, this will need to keep the nodemask data between the
attach_thread and attach functions, so the nodemasks will need to be
global instead of local.
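
A minimal sketch of what I mean for cpuset (variable names are illustrative;
this relies on all attach callbacks being serialized by cgroup_mutex):

	/*
	 * Scratch nodemasks shared between cpuset's attach callbacks; static
	 * storage is safe because cgroup_mutex serializes attach operations.
	 */
	static nodemask_t cpuset_attach_nodemask_from;
	static nodemask_t cpuset_attach_nodemask_to;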

-- Ben

P.S. When I'm iterating over tsk->thread_group while using tasklist_lock
to protect it instead of rcu_read_lock, should I use list_for_each_entry
or list_for_each_entry_rcu? It has been a while since I wrote that bit.

^ permalink raw reply	[flat|nested] 185+ messages in thread

* [PATCH v7 0/3] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
       [not found]         ` <20101224082226.GA13872-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
                             ` (2 preceding siblings ...)
  2010-12-24  8:24             ` Ben Blum
@ 2010-12-26 12:09           ` Ben Blum
  3 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-12-26 12:09 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Miao Xie, David Rientjes, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Fri, Dec 24, 2010 at 03:22:26AM -0500, Ben Blum wrote:
> On Wed, Aug 11, 2010 at 01:46:04AM -0400, Ben Blum wrote:
> > On Fri, Jul 30, 2010 at 07:56:49PM -0400, Ben Blum wrote:
> > > This patch series is a revision of http://lkml.org/lkml/2010/6/25/11 .
> > > 
> > > This patch series implements a write function for the 'cgroup.procs'
> > > per-cgroup file, which enables atomic movement of multithreaded
> > > applications between cgroups. Writing the thread-ID of any thread in a
> > > threadgroup to a cgroup's procs file causes all threads in the group to
> > > be moved to that cgroup safely with respect to threads forking/exiting.
> > > (Possible usage scenario: If running a multithreaded build system that
> > > sucks up system resources, this lets you restrict it all at once into a
> > > new cgroup to keep it under control.)
> > > 
> > > Example: Suppose pid 31337 clones new threads 31338 and 31339.
> > > 
> > > # cat /dev/cgroup/tasks
> > > ...
> > > 31337
> > > 31338
> > > 31339
> > > # mkdir /dev/cgroup/foo
> > > # echo 31337 > /dev/cgroup/foo/cgroup.procs
> > > # cat /dev/cgroup/foo/tasks
> > > 31337
> > > 31338
> > > 31339
> > > 
> > > A new lock, called threadgroup_fork_lock and living in signal_struct, is
> > > introduced to ensure atomicity when moving threads between cgroups. It's
> > > taken for writing during the operation, and taking for reading in fork()
> > > around the calls to cgroup_fork() and cgroup_post_fork(). I put calls to
> > > down_read/up_read directly in copy_process(), since new inline functions
> > > seemed like overkill.
> > > 
> > > -- Ben
> > > 
> > > ---
> > >  Documentation/cgroups/cgroups.txt |   13 -
> > >  include/linux/init_task.h         |    9
> > >  include/linux/sched.h             |   10
> > >  kernel/cgroup.c                   |  426 +++++++++++++++++++++++++++++++++-----
> > >  kernel/cgroup_freezer.c           |    4
> > >  kernel/cpuset.c                   |    4
> > >  kernel/fork.c                     |   16 +
> > >  kernel/ns_cgroup.c                |    4
> > >  kernel/sched.c                    |    4
> > >  9 files changed, 440 insertions(+), 50 deletions(-)
> > 
> > Here's an updated patchset. I've added an extra patch to implement the 
> > callback scheme Paul suggested (note how there are twice as many deleted
> > lines of code as before :) ), and also moved the up_read/down_read calls
> > to static inline functions in sched.h near the other threadgroup-related
> > calls.
> 
> One more go at this. I've refreshed the patches for some conflicts in
> cgroup_freezer.c, by adding an extra argument to the per_thread() call,
> "need_rcu", which makes the function take rcu_read_lock even around the
> single-task case (like freezer now requires). So no semantics have been
> changed.
> 
> I also poked around at some attach() calls which also iterate over the
> threadgroup (blkiocg_attach, cpuset_attach, cpu_cgroup_attach). I was
> borderline about making another function, cgroup_attach_per_thread(),
> but decided against.
> 
> There is a big issue in cpuset_attach, as explained in this email:
> http://www.spinics.net/lists/linux-containers/msg22223.html
> but the actual code/diffs for this patchset are independent of that
> getting fixed, so I'm putting this up for consideration now.
> 
> -- Ben

Well this time everything here is actually safe and correct, as far as
my best efforts and keen eyes can tell. I dropped the per_thread call
from the last series in favour of revising the subsystem callback
interface. It now looks like this:

ss->can_attach()
 - Thread-independent, possibly expensive/sleeping.

ss->can_attach_task()
 - Called per-thread, runs under rcu_read_lock, so must not sleep.

ss->pre_attach()
 - Thread-independent, must be atomic, happens before attach_task.

ss->attach_task()
 - Called per-thread, runs with tasklist_lock held, so must not sleep.

ss->attach()
 - Thread-independent, possibly expensive/sleeping, called last.
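
Concretely, the intended calling sequence when attaching a whole
threadgroup is roughly the following. This is pseudocode only, not the
actual cgroup_attach_proc code from patch 3/3, so the exact nesting and
error handling may differ:

for each subsys (ss) {
    ss->can_attach(leader)          /* may sleep */
}
rcu_read_lock()
for each thread (c) {
    for each subsys (ss) {
        ss->can_attach_task(c)      /* must not sleep */
    }
}
rcu_read_unlock()
read_lock(tasklist_lock)
for each subsys (ss) {
    ss->pre_attach(cgrp)            /* atomic set-up */
}
for each thread (c) {
    cgroup_task_migrate(c)
    for each subsys (ss) {
        ss->attach_task(c)          /* must not sleep */
    }
}
read_unlock(tasklist_lock)
for each subsys (ss) {
    ss->attach(leader)              /* may sleep, called last */
}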

I think this makes the most sense, since it keeps all of the threadgroup
logic confined to cgroup_attach_proc, and, by splitting up the callbacks,
many subsystems end up with less code for cases they don't need to worry
about. It also decouples the issue mentioned here:
http://www.spinics.net/lists/linux-containers/msg22236.html
from this patchset (since the mmap_sem work is done in thread-independent
callbacks), and it fixes this particular case of this problem:
http://www.spinics.net/lists/linux-containers/msg22223.html
(by using global nodemasks for the three attach callbacks).

One final bullet to dodge: cpuset_change_task_nodemask() is implemented
using a loop around yield() to synchronize the mems_allowed, so it can't
be used in the atomic attach_task(). (It looks like a total mess to me -
can anybody justify why it was done that way, instead of using a better
concurrency primitive?) Rather than dirty my hands by changing any of
it, I just moved it out of the per-thread function - explained more in
the second patch. If it gets rewritten to avoid yielding, it can be
moved back to attach_task (I left a TODO).

Other than that, a quick review of why everything here is safe:
 - Iterating over the thread_group is done only under rcu_read_lock or
   tasklist_lock, always checking first that thread_group_leader(task)
   holds, and a reference is held on that task the whole time (a sketch
   of this pattern follows below).
 - All allocation is done outside of rcu_read_lock/tasklist_lock.
 - The subsystem callbacks for can_attach_task() and attach_task() never
   call any function that can block or otherwise yield.
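
To make the first point concrete, the iteration pattern relied on looks
roughly like this (a sketch of the pattern only, not code lifted from
the patches):

rcu_read_lock();
if (thread_group_leader(leader)) {
	struct task_struct *c;
	/* leader is pinned by a reference taken earlier, and its
	 * thread_group list is only walked while rcu_read_lock (or
	 * tasklist_lock) is held */
	list_for_each_entry_rcu(c, &leader->thread_group, thread_group) {
		/* per-thread work that must not block or yield */
	}
}
rcu_read_unlock();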

(It would be really nice if the functions that might sleep, and the
regions of code that must not sleep, could be checked automatically at
build time.)

-- Ben

---
 Documentation/cgroups/cgroups.txt |   44 ++-
 Documentation/cgroups/cpusets.txt |    9 
 block/blk-cgroup.c                |   18 -
 include/linux/cgroup.h            |   10 
 include/linux/init_task.h         |    9 
 include/linux/sched.h             |   35 ++
 kernel/cgroup.c                   |  489 ++++++++++++++++++++++++++++++++++----
 kernel/cgroup_freezer.c           |   27 --
 kernel/cpuset.c                   |  116 ++++-----
 kernel/fork.c                     |   10 
 kernel/ns_cgroup.c                |   23 -
 kernel/sched.c                    |   38 --
 mm/memcontrol.c                   |   18 -
 security/device_cgroup.c          |    3 
 14 files changed, 635 insertions(+), 214 deletions(-)

^ permalink raw reply	[flat|nested] 185+ messages in thread

* [PATCH v7 0/3] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
  2010-12-24  8:22         ` Ben Blum
  (?)
  (?)
@ 2010-12-26 12:09         ` Ben Blum
  2010-12-26 12:09           ` [PATCH v7 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup Ben Blum
                             ` (4 more replies)
  -1 siblings, 5 replies; 185+ messages in thread
From: Ben Blum @ 2010-12-26 12:09 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage,
	oleg, David Rientjes, Miao Xie

On Fri, Dec 24, 2010 at 03:22:26AM -0500, Ben Blum wrote:
> On Wed, Aug 11, 2010 at 01:46:04AM -0400, Ben Blum wrote:
> > On Fri, Jul 30, 2010 at 07:56:49PM -0400, Ben Blum wrote:
> > > This patch series is a revision of http://lkml.org/lkml/2010/6/25/11 .
> > > 
> > > This patch series implements a write function for the 'cgroup.procs'
> > > per-cgroup file, which enables atomic movement of multithreaded
> > > applications between cgroups. Writing the thread-ID of any thread in a
> > > threadgroup to a cgroup's procs file causes all threads in the group to
> > > be moved to that cgroup safely with respect to threads forking/exiting.
> > > (Possible usage scenario: If running a multithreaded build system that
> > > sucks up system resources, this lets you restrict it all at once into a
> > > new cgroup to keep it under control.)
> > > 
> > > Example: Suppose pid 31337 clones new threads 31338 and 31339.
> > > 
> > > # cat /dev/cgroup/tasks
> > > ...
> > > 31337
> > > 31338
> > > 31339
> > > # mkdir /dev/cgroup/foo
> > > # echo 31337 > /dev/cgroup/foo/cgroup.procs
> > > # cat /dev/cgroup/foo/tasks
> > > 31337
> > > 31338
> > > 31339
> > > 
> > > A new lock, called threadgroup_fork_lock and living in signal_struct, is
> > > introduced to ensure atomicity when moving threads between cgroups. It's
> > > taken for writing during the operation, and taking for reading in fork()
> > > around the calls to cgroup_fork() and cgroup_post_fork(). I put calls to
> > > down_read/up_read directly in copy_process(), since new inline functions
> > > seemed like overkill.
> > > 
> > > -- Ben
> > > 
> > > ---
> > >  Documentation/cgroups/cgroups.txt |   13 -
> > >  include/linux/init_task.h         |    9
> > >  include/linux/sched.h             |   10
> > >  kernel/cgroup.c                   |  426 +++++++++++++++++++++++++++++++++-----
> > >  kernel/cgroup_freezer.c           |    4
> > >  kernel/cpuset.c                   |    4
> > >  kernel/fork.c                     |   16 +
> > >  kernel/ns_cgroup.c                |    4
> > >  kernel/sched.c                    |    4
> > >  9 files changed, 440 insertions(+), 50 deletions(-)
> > 
> > Here's an updated patchset. I've added an extra patch to implement the 
> > callback scheme Paul suggested (note how there are twice as many deleted
> > lines of code as before :) ), and also moved the up_read/down_read calls
> > to static inline functions in sched.h near the other threadgroup-related
> > calls.
> 
> One more go at this. I've refreshed the patches for some conflicts in
> cgroup_freezer.c, by adding an extra argument to the per_thread() call,
> "need_rcu", which makes the function take rcu_read_lock even around the
> single-task case (like freezer now requires). So no semantics have been
> changed.
> 
> I also poked around at some attach() calls which also iterate over the
> threadgroup (blkiocg_attach, cpuset_attach, cpu_cgroup_attach). I was
> borderline about making another function, cgroup_attach_per_thread(),
> but decided against.
> 
> There is a big issue in cpuset_attach, as explained in this email:
> http://www.spinics.net/lists/linux-containers/msg22223.html
> but the actual code/diffs for this patchset are independent of that
> getting fixed, so I'm putting this up for consideration now.
> 
> -- Ben

Well this time everything here is actually safe and correct, as far as
my best efforts and keen eyes can tell. I dropped the per_thread call
from the last series in favour of revising the subsystem callback
interface. It now looks like this:

ss->can_attach()
 - Thread-independent, possibly expensive/sleeping.

ss->can_attach_task()
 - Called per-thread, runs under rcu_read_lock, so must not sleep.

ss->pre_attach()
 - Thread-independent, must be atomic, happens before attach_task.

ss->attach_task()
 - Called per-thread, runs with tasklist_lock held, so must not sleep.

ss->attach()
 - Thread-independent, possibly expensive/sleeping, called last.

I think this makes the most sense, since it keeps all of the threadgroup
logic confined to cgroup_attach_proc, and, by splitting up the callbacks,
many subsystems end up with less code for cases they don't need to worry
about. It also decouples the issue mentioned here:
http://www.spinics.net/lists/linux-containers/msg22236.html
from this patchset (since the mmap_sem work is done in thread-independent
callbacks), and it fixes this particular case of this problem:
http://www.spinics.net/lists/linux-containers/msg22223.html
(by using global nodemasks for the three attach callbacks).

One final bullet to dodge: cpuset_change_task_nodemask() is implemented
using a loop around yield() to synchronize the mems_allowed, so it can't
be used in the atomic attach_task(). (It looks like a total mess to me -
can anybody justify why it was done that way, instead of using a better
concurrency primitive?) Rather than dirty my hands by changing any of
it, I just moved it out of the per-thread function - explained more in
the second patch. If it gets rewritten to avoid yielding, it can be
moved back to attach_task (I left a TODO).

Other than that, a quick review of why everything here is safe:
 - Iterating over the thread_group is done only under rcu_read_lock or
   tasklist_lock, always checking first that thread_group_leader(task)
   holds, and a reference is held on that task the whole time.
 - All allocation is done outside of rcu_read_lock/tasklist_lock.
 - The subsystem callbacks for can_attach_task() and attach_task() never
   call any function that can block or otherwise yield.

(It would be really nice if the functions that might sleep, and the
regions of code that must not sleep, could be checked automatically at
build time.)

-- Ben

---
 Documentation/cgroups/cgroups.txt |   44 ++-
 Documentation/cgroups/cpusets.txt |    9 
 block/blk-cgroup.c                |   18 -
 include/linux/cgroup.h            |   10 
 include/linux/init_task.h         |    9 
 include/linux/sched.h             |   35 ++
 kernel/cgroup.c                   |  489 ++++++++++++++++++++++++++++++++++----
 kernel/cgroup_freezer.c           |   27 --
 kernel/cpuset.c                   |  116 ++++-----
 kernel/fork.c                     |   10 
 kernel/ns_cgroup.c                |   23 -
 kernel/sched.c                    |   38 --
 mm/memcontrol.c                   |   18 -
 security/device_cgroup.c          |    3 
 14 files changed, 635 insertions(+), 214 deletions(-)

^ permalink raw reply	[flat|nested] 185+ messages in thread

* [PATCH v7 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup
       [not found]           ` <20101226120919.GA28529-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-12-26 12:09             ` Ben Blum
  2010-12-26 12:11             ` [PATCH v7 2/3] cgroups: add atomic-context per-thread subsystem callbacks Ben Blum
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-12-26 12:09 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Miao Xie, David Rientjes, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

[-- Attachment #1: cgroup-threadgroup-fork-lock.patch --]
[-- Type: text/plain, Size: 4964 bytes --]

Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup

From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

This patch adds an rwsem that lives in a threadgroup's signal_struct that's
taken for reading in the fork path, under CONFIG_CGROUPS. If another part of
the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
ifdefs should be changed to a higher-up flag that CGROUPS and the other system
would both depend on.

This is a pre-patch for cgroup-procs-write.patch.

Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
---
 include/linux/init_task.h |    9 +++++++++
 include/linux/sched.h     |   35 +++++++++++++++++++++++++++++++++++
 kernel/fork.c             |   10 ++++++++++
 3 files changed, 54 insertions(+), 0 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6b281fa..b560381 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -15,6 +15,14 @@
 extern struct files_struct init_files;
 extern struct fs_struct init_fs;
 
+#ifdef CONFIG_CGROUPS
+#define INIT_THREADGROUP_FORK_LOCK(sig)					\
+	.threadgroup_fork_lock =					\
+		__RWSEM_INITIALIZER(sig.threadgroup_fork_lock),
+#else
+#define INIT_THREADGROUP_FORK_LOCK(sig)
+#endif
+
 #define INIT_SIGNALS(sig) {						\
 	.nr_threads	= 1,						\
 	.wait_chldexit	= __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
@@ -31,6 +39,7 @@ extern struct fs_struct init_fs;
 	},								\
 	.cred_guard_mutex =						\
 		 __MUTEX_INITIALIZER(sig.cred_guard_mutex),		\
+	INIT_THREADGROUP_FORK_LOCK(sig)					\
 }
 
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8580dc6..213a0b9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -623,6 +623,16 @@ struct signal_struct {
 	unsigned audit_tty;
 	struct tty_audit_buf *tty_audit_buf;
 #endif
+#ifdef CONFIG_CGROUPS
+	/*
+	 * The threadgroup_fork_lock prevents threads from forking with
+	 * CLONE_THREAD while held for writing. Use this for fork-sensitive
+	 * threadgroup-wide operations. It's taken for reading in fork.c in
+	 * copy_process().
+	 * Currently only needed write-side by cgroups.
+	 */
+	struct rw_semaphore threadgroup_fork_lock;
+#endif
 
 	int oom_adj;		/* OOM kill score adjustment (bit shift) */
 	int oom_score_adj;	/* OOM kill score adjustment */
@@ -2270,6 +2280,31 @@ static inline void unlock_task_sighand(struct task_struct *tsk,
 	spin_unlock_irqrestore(&tsk->sighand->siglock, *flags);
 }
 
+/* See the declaration of threadgroup_fork_lock in signal_struct. */
+#ifdef CONFIG_CGROUPS
+static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
+{
+	down_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
+{
+	up_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_lock(struct task_struct *tsk)
+{
+	down_write(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_unlock(struct task_struct *tsk)
+{
+	up_write(&tsk->signal->threadgroup_fork_lock);
+}
+#else
+static inline void threadgroup_fork_read_lock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_read_unlock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_write_lock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_write_unlock(struct task_struct *tsk) {}
+#endif
+
 #ifndef __HAVE_THREAD_FUNCTIONS
 
 #define task_thread_info(task)	((struct thread_info *)(task)->stack)
diff --git a/kernel/fork.c b/kernel/fork.c
index 0979527..aefe61f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -905,6 +905,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 
 	tty_audit_fork(sig);
 
+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->threadgroup_fork_lock);
+#endif
+
 	sig->oom_adj = current->signal->oom_adj;
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
@@ -1087,6 +1091,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	monotonic_to_bootbased(&p->real_start_time);
 	p->io_context = NULL;
 	p->audit_context = NULL;
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_lock(current);
 	cgroup_fork(p);
 #ifdef CONFIG_NUMA
 	p->mempolicy = mpol_dup(p->mempolicy);
@@ -1294,6 +1300,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	write_unlock_irq(&tasklist_lock);
 	proc_fork_connector(p);
 	cgroup_post_fork(p);
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_unlock(current);
 	perf_event_fork(p);
 	return p;
 
@@ -1332,6 +1340,8 @@ bad_fork_cleanup_policy:
 	mpol_put(p->mempolicy);
 bad_fork_cleanup_cgroup:
 #endif
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_unlock(current);
 	cgroup_exit(p, cgroup_callbacks_done);
 	delayacct_tsk_free(p);
 	module_put(task_thread_info(p)->exec_domain->module);
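
For reference, the write side of this rwsem is taken by the cgroup.procs
writer added in patch 3/3 (not shown in this message); schematically,
and only as a sketch, the usage there is:

	threadgroup_fork_write_lock(leader);
	/*
	 * No CLONE_THREAD fork in leader's threadgroup can now get through
	 * the cgroup_fork()/cgroup_post_fork() section of copy_process(),
	 * so no new threads appear while the group is being migrated.
	 */
	/* ... migrate every thread in the group to the new cgroup ... */
	threadgroup_fork_write_unlock(leader);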

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v7 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup
  2010-12-26 12:09         ` Ben Blum
@ 2010-12-26 12:09           ` Ben Blum
  2011-01-24  8:38             ` Paul Menage
                               ` (2 more replies)
  2010-12-26 12:11           ` [PATCH v7 2/3] cgroups: add atomic-context per-thread subsystem callbacks Ben Blum
                             ` (3 subsequent siblings)
  4 siblings, 3 replies; 185+ messages in thread
From: Ben Blum @ 2010-12-26 12:09 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage,
	oleg, David Rientjes, Miao Xie

[-- Attachment #1: cgroup-threadgroup-fork-lock.patch --]
[-- Type: text/plain, Size: 4914 bytes --]

Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup

From: Ben Blum <bblum@andrew.cmu.edu>

This patch adds an rwsem that lives in a threadgroup's signal_struct that's
taken for reading in the fork path, under CONFIG_CGROUPS. If another part of
the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
ifdefs should be changed to a higher-up flag that CGROUPS and the other system
would both depend on.

This is a pre-patch for cgroup-procs-write.patch.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
---
 include/linux/init_task.h |    9 +++++++++
 include/linux/sched.h     |   35 +++++++++++++++++++++++++++++++++++
 kernel/fork.c             |   10 ++++++++++
 3 files changed, 54 insertions(+), 0 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6b281fa..b560381 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -15,6 +15,14 @@
 extern struct files_struct init_files;
 extern struct fs_struct init_fs;
 
+#ifdef CONFIG_CGROUPS
+#define INIT_THREADGROUP_FORK_LOCK(sig)					\
+	.threadgroup_fork_lock =					\
+		__RWSEM_INITIALIZER(sig.threadgroup_fork_lock),
+#else
+#define INIT_THREADGROUP_FORK_LOCK(sig)
+#endif
+
 #define INIT_SIGNALS(sig) {						\
 	.nr_threads	= 1,						\
 	.wait_chldexit	= __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
@@ -31,6 +39,7 @@ extern struct fs_struct init_fs;
 	},								\
 	.cred_guard_mutex =						\
 		 __MUTEX_INITIALIZER(sig.cred_guard_mutex),		\
+	INIT_THREADGROUP_FORK_LOCK(sig)					\
 }
 
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8580dc6..213a0b9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -623,6 +623,16 @@ struct signal_struct {
 	unsigned audit_tty;
 	struct tty_audit_buf *tty_audit_buf;
 #endif
+#ifdef CONFIG_CGROUPS
+	/*
+	 * The threadgroup_fork_lock prevents threads from forking with
+	 * CLONE_THREAD while held for writing. Use this for fork-sensitive
+	 * threadgroup-wide operations. It's taken for reading in fork.c in
+	 * copy_process().
+	 * Currently only needed write-side by cgroups.
+	 */
+	struct rw_semaphore threadgroup_fork_lock;
+#endif
 
 	int oom_adj;		/* OOM kill score adjustment (bit shift) */
 	int oom_score_adj;	/* OOM kill score adjustment */
@@ -2270,6 +2280,31 @@ static inline void unlock_task_sighand(struct task_struct *tsk,
 	spin_unlock_irqrestore(&tsk->sighand->siglock, *flags);
 }
 
+/* See the declaration of threadgroup_fork_lock in signal_struct. */
+#ifdef CONFIG_CGROUPS
+static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
+{
+	down_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
+{
+	up_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_lock(struct task_struct *tsk)
+{
+	down_write(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_unlock(struct task_struct *tsk)
+{
+	up_write(&tsk->signal->threadgroup_fork_lock);
+}
+#else
+static inline void threadgroup_fork_read_lock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_read_unlock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_write_lock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_write_unlock(struct task_struct *tsk) {}
+#endif
+
 #ifndef __HAVE_THREAD_FUNCTIONS
 
 #define task_thread_info(task)	((struct thread_info *)(task)->stack)
diff --git a/kernel/fork.c b/kernel/fork.c
index 0979527..aefe61f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -905,6 +905,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 
 	tty_audit_fork(sig);
 
+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->threadgroup_fork_lock);
+#endif
+
 	sig->oom_adj = current->signal->oom_adj;
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
@@ -1087,6 +1091,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	monotonic_to_bootbased(&p->real_start_time);
 	p->io_context = NULL;
 	p->audit_context = NULL;
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_lock(current);
 	cgroup_fork(p);
 #ifdef CONFIG_NUMA
 	p->mempolicy = mpol_dup(p->mempolicy);
@@ -1294,6 +1300,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	write_unlock_irq(&tasklist_lock);
 	proc_fork_connector(p);
 	cgroup_post_fork(p);
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_unlock(current);
 	perf_event_fork(p);
 	return p;
 
@@ -1332,6 +1340,8 @@ bad_fork_cleanup_policy:
 	mpol_put(p->mempolicy);
 bad_fork_cleanup_cgroup:
 #endif
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_unlock(current);
 	cgroup_exit(p, cgroup_callbacks_done);
 	delayacct_tsk_free(p);
 	module_put(task_thread_info(p)->exec_domain->module);

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v7 2/3] cgroups: add atomic-context per-thread subsystem callbacks
       [not found]           ` <20101226120919.GA28529-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2010-12-26 12:09             ` [PATCH v7 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup Ben Blum
@ 2010-12-26 12:11             ` Ben Blum
  2010-12-26 12:12             ` [PATCH v7 3/3] cgroups: make procs file writable Ben Blum
  2011-02-08  1:35             ` [PATCH v8 0/3] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Ben Blum
  3 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-12-26 12:11 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Miao Xie, David Rientjes, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

[-- Attachment #1: cgroup-subsys-task-callbacks.patch --]
[-- Type: text/plain, Size: 22402 bytes --]

Add cgroup subsystem callbacks for per-thread attachment in atomic contexts

From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

This patch adds can_attach_task, pre_attach, and attach_task as new callbacks
for cgroups's subsystem interface. Unlike can_attach and attach, these are for
per-thread operations, to be called potentially many times when attaching an
entire threadgroup, and may run under rcu_read/tasklist_lock, so are for quick
operations only.

Also, the old "bool threadgroup" interface is removed, as replaced by this.
All subsystems are modified for the new interface - of note is cpuset, which
requires from/to nodemasks for attach to be globally scoped (though per-cpuset
would work too) to persist from its pre_attach to attach_task and attach.

This is a pre-patch for cgroup-procs-writable.patch.

Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
---
 Documentation/cgroups/cgroups.txt |   35 ++++++++---
 Documentation/cgroups/cpusets.txt |    9 +++
 block/blk-cgroup.c                |   18 ++----
 include/linux/cgroup.h            |   10 ++-
 kernel/cgroup.c                   |   17 ++++-
 kernel/cgroup_freezer.c           |   27 ++++-----
 kernel/cpuset.c                   |  116 +++++++++++++++++++------------------
 kernel/ns_cgroup.c                |   23 +++----
 kernel/sched.c                    |   38 +-----------
 mm/memcontrol.c                   |   18 ++----
 security/device_cgroup.c          |    3 -
 11 files changed, 149 insertions(+), 165 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 190018b..341ed44 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -563,7 +563,7 @@ rmdir() will fail with it. From this behavior, pre_destroy() can be
 called multiple times against a cgroup.
 
 int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-	       struct task_struct *task, bool threadgroup)
+	       struct task_struct *task)
 (cgroup_mutex held by caller)
 
 Called prior to moving a task into a cgroup; if the subsystem
@@ -572,9 +572,15 @@ task is passed, then a successful result indicates that *any*
 unspecified task can be moved into the cgroup. Note that this isn't
 called on a fork. If this method returns 0 (success) then this should
 remain valid while the caller holds cgroup_mutex and it is ensured that either
-attach() or cancel_attach() will be called in future. If threadgroup is
-true, then a successful result indicates that all threads in the given
-thread's threadgroup can be moved together.
+attach() or cancel_attach() will be called in future.
+
+int can_attach_task(struct cgroup *cgrp, struct task_struct *tsk);
+(cgroup_mutex and rcu_read_lock held by caller)
+
+As can_attach, but for operations that must be run once per task to be
+attached (possibly many when using cgroup_attach_proc). This may run in
+rcu_read-side, so sleeping is not permitted. Expensive operations, such as
+dealing with the shared mm, should run in can_attach.
 
 void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	       struct task_struct *task, bool threadgroup)
@@ -587,15 +593,26 @@ This will be called only about subsystems whose can_attach() operation have
 succeeded.
 
 void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-	    struct cgroup *old_cgrp, struct task_struct *task,
-	    bool threadgroup)
+	    struct cgroup *old_cgrp, struct task_struct *task)
 (cgroup_mutex held by caller)
 
 Called after the task has been attached to the cgroup, to allow any
 post-attachment activity that requires memory allocations or blocking.
-If threadgroup is true, the subsystem should take care of all threads
-in the specified thread's threadgroup. Currently does not support any
-subsystem that might need the old_cgrp for every thread in the group.
+
+void pre_attach(struct cgroup *cgrp);
+(cgroup_mutex and tasklist_lock held by caller)
+
+See description of attach_task.
+
+void attach_task(struct cgroup *cgrp, struct task_struct *tsk);
+(cgroup_mutex and possibly tasklist_lock held by caller)
+
+As attach, but for operations that must be run once per task to be attached,
+like can_attach_task. Sometimes called with tasklist_lock taken for reading,
+so may not sleep. Currently does not support any subsystem that might need the
+old_cgrp for every thread in the group. Note: unlike can_attach_task, this
+runs before attach, so use pre_attach for non-per-thread operations that must
+happen before attach_task.
 
 void fork(struct cgroup_subsy *ss, struct task_struct *task)
 
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index 5d0d569..1f0868d 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -659,6 +659,15 @@ the current task's cpuset, then we relax the cpuset, and look for
 memory anywhere we can find it.  It's better to violate the cpuset
 than stress the kernel.
 
+There is a third exception to the above.  When using the cgroup.procs file
+to move all tasks in a threadgroup at once, the per-task attachment code
+must run in an atomic context, but as currently implemented, changing the
+nodemasks for a task's memory policy may need to deschedule.  So, in this
+case, the best cpusets can do is change the nodemask for the threadgroup
+leader when attaching.  Thus, a multithreaded mempolicy user should first
+use cgroup.procs (for correctness), but also next use the tasks file for
+each thread in the group to ensure updating the nodemasks for all of them.
+
 To start a new job that is to be contained within a cpuset, the steps are:
 
  1) mkdir /dev/cpuset
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b1febd0..45b3809 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -30,10 +30,8 @@ EXPORT_SYMBOL_GPL(blkio_root_cgroup);
 
 static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
 						  struct cgroup *);
-static int blkiocg_can_attach(struct cgroup_subsys *, struct cgroup *,
-			      struct task_struct *, bool);
-static void blkiocg_attach(struct cgroup_subsys *, struct cgroup *,
-			   struct cgroup *, struct task_struct *, bool);
+static int blkiocg_can_attach_task(struct cgroup *, struct task_struct *);
+static void blkiocg_attach_task(struct cgroup *, struct task_struct *);
 static void blkiocg_destroy(struct cgroup_subsys *, struct cgroup *);
 static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
 
@@ -46,8 +44,8 @@ static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
 struct cgroup_subsys blkio_subsys = {
 	.name = "blkio",
 	.create = blkiocg_create,
-	.can_attach = blkiocg_can_attach,
-	.attach = blkiocg_attach,
+	.can_attach_task = blkiocg_can_attach_task,
+	.attach_task = blkiocg_attach_task,
 	.destroy = blkiocg_destroy,
 	.populate = blkiocg_populate,
 #ifdef CONFIG_BLK_CGROUP
@@ -1475,9 +1473,7 @@ done:
  * of the main cic data structures.  For now we allow a task to change
  * its cgroup only if it's the only owner of its ioc.
  */
-static int blkiocg_can_attach(struct cgroup_subsys *subsys,
-				struct cgroup *cgroup, struct task_struct *tsk,
-				bool threadgroup)
+static int blkiocg_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 	struct io_context *ioc;
 	int ret = 0;
@@ -1492,9 +1488,7 @@ static int blkiocg_can_attach(struct cgroup_subsys *subsys,
 	return ret;
 }
 
-static void blkiocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
-				struct cgroup *prev, struct task_struct *tsk,
-				bool threadgroup)
+static void blkiocg_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 	struct io_context *ioc;
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ce104e3..35b69b4 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -467,12 +467,14 @@ struct cgroup_subsys {
 	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
-			  struct task_struct *tsk, bool threadgroup);
+			  struct task_struct *tsk);
+	int (*can_attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
 	void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
-			  struct task_struct *tsk, bool threadgroup);
+			      struct task_struct *tsk);
+	void (*pre_attach)(struct cgroup *cgrp);
+	void (*attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
-			struct cgroup *old_cgrp, struct task_struct *tsk,
-			bool threadgroup);
+		       struct cgroup *old_cgrp, struct task_struct *tsk);
 	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
 	void (*exit)(struct cgroup_subsys *ss, struct task_struct *task);
 	int (*populate)(struct cgroup_subsys *ss,
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 66a416b..616f27a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1750,7 +1750,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 
 	for_each_subsys(root, ss) {
 		if (ss->can_attach) {
-			retval = ss->can_attach(ss, cgrp, tsk, false);
+			retval = ss->can_attach(ss, cgrp, tsk);
 			if (retval) {
 				/*
 				 * Remember on which subsystem the can_attach()
@@ -1762,6 +1762,13 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 				goto out;
 			}
 		}
+		if (ss->can_attach_task) {
+			retval = ss->can_attach_task(cgrp, tsk);
+			if (retval) {
+				failed_ss = ss;
+				goto out;
+			}
+		}
 	}
 
 	task_lock(tsk);
@@ -1798,8 +1805,12 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	write_unlock(&css_set_lock);
 
 	for_each_subsys(root, ss) {
+		if (ss->pre_attach)
+			ss->pre_attach(cgrp);
+		if (ss->attach_task)
+			ss->attach_task(cgrp, tsk);
 		if (ss->attach)
-			ss->attach(ss, cgrp, oldcgrp, tsk, false);
+			ss->attach(ss, cgrp, oldcgrp, tsk);
 	}
 	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
 	synchronize_rcu();
@@ -1822,7 +1833,7 @@ out:
 				 */
 				break;
 			if (ss->cancel_attach)
-				ss->cancel_attach(ss, cgrp, tsk, false);
+				ss->cancel_attach(ss, cgrp, tsk);
 		}
 	}
 	return retval;
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index e7bebb7..e6ee70c 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -160,7 +160,7 @@ static void freezer_destroy(struct cgroup_subsys *ss,
  */
 static int freezer_can_attach(struct cgroup_subsys *ss,
 			      struct cgroup *new_cgroup,
-			      struct task_struct *task, bool threadgroup)
+			      struct task_struct *task)
 {
 	struct freezer *freezer;
 
@@ -172,26 +172,18 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 	if (freezer->state != CGROUP_THAWED)
 		return -EBUSY;
 
+	return 0;
+}
+
+static int freezer_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+{
+	/* rcu_read_lock allows recursive locking */
 	rcu_read_lock();
-	if (__cgroup_freezing_or_frozen(task)) {
+	if (__cgroup_freezing_or_frozen(tsk)) {
 		rcu_read_unlock();
 		return -EBUSY;
 	}
 	rcu_read_unlock();
-
-	if (threadgroup) {
-		struct task_struct *c;
-
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
-			if (__cgroup_freezing_or_frozen(c)) {
-				rcu_read_unlock();
-				return -EBUSY;
-			}
-		}
-		rcu_read_unlock();
-	}
-
 	return 0;
 }
 
@@ -390,6 +382,9 @@ struct cgroup_subsys freezer_subsys = {
 	.populate	= freezer_populate,
 	.subsys_id	= freezer_subsys_id,
 	.can_attach	= freezer_can_attach,
+	.can_attach_task = freezer_can_attach_task,
+	.pre_attach	= NULL,
+	.attach_task	= NULL,
 	.attach		= NULL,
 	.fork		= freezer_fork,
 	.exit		= NULL,
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 4349935..b9fce80 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1372,14 +1372,10 @@ static int fmeter_getrate(struct fmeter *fmp)
 	return val;
 }
 
-/* Protected by cgroup_lock */
-static cpumask_var_t cpus_attach;
-
 /* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
 static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
-			     struct task_struct *tsk, bool threadgroup)
+			     struct task_struct *tsk)
 {
-	int ret;
 	struct cpuset *cs = cgroup_cs(cont);
 
 	if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
@@ -1396,29 +1392,42 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
 	if (tsk->flags & PF_THREAD_BOUND)
 		return -EINVAL;
 
-	ret = security_task_setscheduler(tsk);
-	if (ret)
-		return ret;
-	if (threadgroup) {
-		struct task_struct *c;
-
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			ret = security_task_setscheduler(c);
-			if (ret) {
-				rcu_read_unlock();
-				return ret;
-			}
-		}
-		rcu_read_unlock();
-	}
 	return 0;
 }
 
-static void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
-			       struct cpuset *cs)
+static int cpuset_can_attach_task(struct cgroup *cgrp, struct task_struct *task)
+{
+	return security_task_setscheduler(task);
+}
+
+/*
+ * Protected by cgroup_lock. The nodemasks must be stored globally because
+ * dynamically allocating them is not allowed in pre_attach, and they must
+ * persist among pre_attach, attach_task, and attach.
+ */
+static cpumask_var_t cpus_attach;
+static nodemask_t cpuset_attach_nodemask_from;
+static nodemask_t cpuset_attach_nodemask_to;
+
+/* Do quick set-up work for before attaching each task. */
+static void cpuset_pre_attach(struct cgroup *cont)
+{
+	struct cpuset *cs = cgroup_cs(cont);
+
+	if (cs == &top_cpuset)
+		cpumask_copy(cpus_attach, cpu_possible_mask);
+	else
+		guarantee_online_cpus(cs, cpus_attach);
+
+	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+}
+
+/* Per-thread attachment work. */
+static void cpuset_attach_task(struct cgroup *cont, struct task_struct *tsk)
 {
 	int err;
+	struct cpuset *cs = cgroup_cs(cont);
+
 	/*
 	 * can_attach beforehand should guarantee that this doesn't fail.
 	 * TODO: have a better way to handle failure here
@@ -1426,56 +1435,46 @@ static void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
 	err = set_cpus_allowed_ptr(tsk, cpus_attach);
 	WARN_ON_ONCE(err);
 
-	cpuset_change_task_nodemask(tsk, to);
 	cpuset_update_task_spread_flag(cs, tsk);
 
 }
 
 static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
-			  struct cgroup *oldcont, struct task_struct *tsk,
-			  bool threadgroup)
+			  struct cgroup *oldcont, struct task_struct *tsk)
 {
 	struct mm_struct *mm;
 	struct cpuset *cs = cgroup_cs(cont);
 	struct cpuset *oldcs = cgroup_cs(oldcont);
-	NODEMASK_ALLOC(nodemask_t, from, GFP_KERNEL);
-	NODEMASK_ALLOC(nodemask_t, to, GFP_KERNEL);
 
-	if (from == NULL || to == NULL)
-		goto alloc_fail;
-
-	if (cs == &top_cpuset) {
-		cpumask_copy(cpus_attach, cpu_possible_mask);
-	} else {
-		guarantee_online_cpus(cs, cpus_attach);
-	}
-	guarantee_online_mems(cs, to);
-
-	/* do per-task migration stuff possibly for each in the threadgroup */
-	cpuset_attach_task(tsk, to, cs);
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			cpuset_attach_task(c, to, cs);
-		}
-		rcu_read_unlock();
-	}
+	/*
+	 * TODO: As implemented, change_task_nodemask uses yield() to
+	 * synchronize with other users of the mems_allowed, which is not
+	 * allowed in the atomic attach_task callback, so we can't do this for
+	 * each thread in the multithreaded case. This is a performance issue,
+	 * but not a correctness one.
+	 *
+	 * As long as change_task_nodemask can yield, a multithreaded mempolicy
+	 * user should attach to a cgroup by threadgroup first (for
+	 * correctness) then poke each task to get its mempolicy right.
+	 *
+	 * This is the "third exception" in Documentation/cgroups/cpusets.txt.
+	 */
+	cpuset_change_task_nodemask(tsk, &cpuset_attach_nodemask_to);
 
-	/* change mm; only needs to be done once even if threadgroup */
-	*from = oldcs->mems_allowed;
-	*to = cs->mems_allowed;
+	/*
+	 * Change mm, possibly for multiple threads in a threadgroup. This is
+	 * expensive and may sleep.
+	 */
+	cpuset_attach_nodemask_from = oldcs->mems_allowed;
+	cpuset_attach_nodemask_to = cs->mems_allowed;
 	mm = get_task_mm(tsk);
 	if (mm) {
-		mpol_rebind_mm(mm, to);
+		mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
 		if (is_memory_migrate(cs))
-			cpuset_migrate_mm(mm, from, to);
+			cpuset_migrate_mm(mm, &cpuset_attach_nodemask_from,
+					  &cpuset_attach_nodemask_to);
 		mmput(mm);
 	}
-
-alloc_fail:
-	NODEMASK_FREE(from);
-	NODEMASK_FREE(to);
 }
 
 /* The various types of files and directories in a cpuset file system */
@@ -1928,6 +1927,9 @@ struct cgroup_subsys cpuset_subsys = {
 	.create = cpuset_create,
 	.destroy = cpuset_destroy,
 	.can_attach = cpuset_can_attach,
+	.can_attach_task = cpuset_can_attach_task,
+	.pre_attach = cpuset_pre_attach,
+	.attach_task = cpuset_attach_task,
 	.attach = cpuset_attach,
 	.populate = cpuset_populate,
 	.post_clone = cpuset_post_clone,
diff --git a/kernel/ns_cgroup.c b/kernel/ns_cgroup.c
index 2c98ad9..1fc2b1b 100644
--- a/kernel/ns_cgroup.c
+++ b/kernel/ns_cgroup.c
@@ -43,7 +43,7 @@ int ns_cgroup_clone(struct task_struct *task, struct pid *pid)
  *        ancestor cgroup thereof)
  */
 static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
-			 struct task_struct *task, bool threadgroup)
+			 struct task_struct *task)
 {
 	if (current != task) {
 		if (!capable(CAP_SYS_ADMIN))
@@ -53,21 +53,13 @@ static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
 			return -EPERM;
 	}
 
-	if (!cgroup_is_descendant(new_cgroup, task))
-		return -EPERM;
-
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
-			if (!cgroup_is_descendant(new_cgroup, c)) {
-				rcu_read_unlock();
-				return -EPERM;
-			}
-		}
-		rcu_read_unlock();
-	}
+	return 0;
+}
 
+static int ns_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+{
+	if (!cgroup_is_descendant(cgrp, tsk))
+		return -EPERM;
 	return 0;
 }
 
@@ -112,6 +104,7 @@ static void ns_destroy(struct cgroup_subsys *ss,
 struct cgroup_subsys ns_subsys = {
 	.name = "ns",
 	.can_attach = ns_can_attach,
+	.can_attach_task = ns_can_attach_task,
 	.create = ns_create,
 	.destroy  = ns_destroy,
 	.subsys_id = ns_subsys_id,
diff --git a/kernel/sched.c b/kernel/sched.c
index 218ef20..d619f1d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8655,42 +8655,10 @@ cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	return 0;
 }
 
-static int
-cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-		      struct task_struct *tsk, bool threadgroup)
-{
-	int retval = cpu_cgroup_can_attach_task(cgrp, tsk);
-	if (retval)
-		return retval;
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			retval = cpu_cgroup_can_attach_task(cgrp, c);
-			if (retval) {
-				rcu_read_unlock();
-				return retval;
-			}
-		}
-		rcu_read_unlock();
-	}
-	return 0;
-}
-
 static void
-cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-		  struct cgroup *old_cont, struct task_struct *tsk,
-		  bool threadgroup)
+cpu_cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 	sched_move_task(tsk);
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			sched_move_task(c);
-		}
-		rcu_read_unlock();
-	}
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -8763,8 +8731,8 @@ struct cgroup_subsys cpu_cgroup_subsys = {
 	.name		= "cpu",
 	.create		= cpu_cgroup_create,
 	.destroy	= cpu_cgroup_destroy,
-	.can_attach	= cpu_cgroup_can_attach,
-	.attach		= cpu_cgroup_attach,
+	.can_attach_task = cpu_cgroup_can_attach_task,
+	.attach_task	= cpu_cgroup_attach_task,
 	.populate	= cpu_cgroup_populate,
 	.subsys_id	= cpu_cgroup_subsys_id,
 	.early_init	= 1,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 729beb7..995f0b9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4720,8 +4720,7 @@ static void mem_cgroup_clear_mc(void)
 
 static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 	int ret = 0;
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
@@ -4775,8 +4774,7 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 
 static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 	mem_cgroup_clear_mc();
 }
@@ -4880,8 +4878,7 @@ static void mem_cgroup_move_charge(struct mm_struct *mm)
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct cgroup *cont,
 				struct cgroup *old_cont,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 	if (!mc.mm)
 		/* no need to move charge */
@@ -4893,22 +4890,19 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 #else	/* !CONFIG_MMU */
 static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 	return 0;
 }
 static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 }
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct cgroup *cont,
 				struct cgroup *old_cont,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 }
 #endif
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 8d9c48f..cd1f779 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -62,8 +62,7 @@ static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
 struct cgroup_subsys devices_subsys;
 
 static int devcgroup_can_attach(struct cgroup_subsys *ss,
-		struct cgroup *new_cgroup, struct task_struct *task,
-		bool threadgroup)
+		struct cgroup *new_cgroup, struct task_struct *task)
 {
 	if (current != task && !capable(CAP_SYS_ADMIN))
 			return -EPERM;

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v7 2/3] cgroups: add atomic-context per-thread subsystem callbacks
  2010-12-26 12:09         ` Ben Blum
  2010-12-26 12:09           ` [PATCH v7 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup Ben Blum
@ 2010-12-26 12:11           ` Ben Blum
       [not found]             ` <20101226121100.GC28529-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2011-01-24  8:38             ` Paul Menage
  2010-12-26 12:12           ` [PATCH v7 3/3] cgroups: make procs file writable Ben Blum
                             ` (2 subsequent siblings)
  4 siblings, 2 replies; 185+ messages in thread
From: Ben Blum @ 2010-12-26 12:11 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage,
	oleg, David Rientjes, Miao Xie

[-- Attachment #1: cgroup-subsys-task-callbacks.patch --]
[-- Type: text/plain, Size: 22352 bytes --]

Add cgroup subsystem callbacks for per-thread attachment in atomic contexts

From: Ben Blum <bblum@andrew.cmu.edu>

This patch adds can_attach_task, pre_attach, and attach_task as new callbacks
for cgroups's subsystem interface. Unlike can_attach and attach, these are for
per-thread operations, to be called potentially many times when attaching an
entire threadgroup, and may run under rcu_read/tasklist_lock, so are for quick
operations only.

Also, the old "bool threadgroup" interface is removed, as replaced by this.
All subsystems are modified for the new interface - of note is cpuset, which
requires from/to nodemasks for attach to be globally scoped (though per-cpuset
would work too) to persist from its pre_attach to attach_task and attach.

This is a pre-patch for cgroup-procs-writable.patch.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
---
 Documentation/cgroups/cgroups.txt |   35 ++++++++---
 Documentation/cgroups/cpusets.txt |    9 +++
 block/blk-cgroup.c                |   18 ++----
 include/linux/cgroup.h            |   10 ++-
 kernel/cgroup.c                   |   17 ++++-
 kernel/cgroup_freezer.c           |   27 ++++-----
 kernel/cpuset.c                   |  116 +++++++++++++++++++------------------
 kernel/ns_cgroup.c                |   23 +++----
 kernel/sched.c                    |   38 +-----------
 mm/memcontrol.c                   |   18 ++----
 security/device_cgroup.c          |    3 -
 11 files changed, 149 insertions(+), 165 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 190018b..341ed44 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -563,7 +563,7 @@ rmdir() will fail with it. From this behavior, pre_destroy() can be
 called multiple times against a cgroup.
 
 int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-	       struct task_struct *task, bool threadgroup)
+	       struct task_struct *task)
 (cgroup_mutex held by caller)
 
 Called prior to moving a task into a cgroup; if the subsystem
@@ -572,9 +572,15 @@ task is passed, then a successful result indicates that *any*
 unspecified task can be moved into the cgroup. Note that this isn't
 called on a fork. If this method returns 0 (success) then this should
 remain valid while the caller holds cgroup_mutex and it is ensured that either
-attach() or cancel_attach() will be called in future. If threadgroup is
-true, then a successful result indicates that all threads in the given
-thread's threadgroup can be moved together.
+attach() or cancel_attach() will be called in future.
+
+int can_attach_task(struct cgroup *cgrp, struct task_struct *tsk);
+(cgroup_mutex and rcu_read_lock held by caller)
+
+As can_attach, but for operations that must be run once per task to be
+attached (possibly many when using cgroup_attach_proc). This may run in
+rcu_read-side, so sleeping is not permitted. Expensive operations, such as
+dealing with the shared mm, should run in can_attach.
 
 void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	       struct task_struct *task, bool threadgroup)
@@ -587,15 +593,26 @@ This will be called only about subsystems whose can_attach() operation have
 succeeded.
 
 void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-	    struct cgroup *old_cgrp, struct task_struct *task,
-	    bool threadgroup)
+	    struct cgroup *old_cgrp, struct task_struct *task)
 (cgroup_mutex held by caller)
 
 Called after the task has been attached to the cgroup, to allow any
 post-attachment activity that requires memory allocations or blocking.
-If threadgroup is true, the subsystem should take care of all threads
-in the specified thread's threadgroup. Currently does not support any
-subsystem that might need the old_cgrp for every thread in the group.
+
+void pre_attach(struct cgroup *cgrp);
+(cgroup_mutex and tasklist_lock held by caller)
+
+See description of attach_task.
+
+void attach_task(struct cgroup *cgrp, struct task_struct *tsk);
+(cgroup_mutex and possibly tasklist_lock held by caller)
+
+As attach, but for operations that must be run once per task to be attached,
+like can_attach_task. Sometimes called with tasklist_lock taken for reading,
+so may not sleep. Currently does not support any subsystem that might need the
+old_cgrp for every thread in the group. Note: unlike can_attach_task, this
+runs before attach, so use pre_attach for non-per-thread operations that must
+happen before attach_task.
 
 void fork(struct cgroup_subsy *ss, struct task_struct *task)
 
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index 5d0d569..1f0868d 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -659,6 +659,15 @@ the current task's cpuset, then we relax the cpuset, and look for
 memory anywhere we can find it.  It's better to violate the cpuset
 than stress the kernel.
 
+There is a third exception to the above.  When using the cgroup.procs file
+to move all tasks in a threadgroup at once, the per-task attachment code
+must run in an atomic context, but as currently implemented, changing the
+nodemasks for a task's memory policy may need to deschedule.  So, in this
+case, the best cpusets can do is change the nodemask for the threadgroup
+leader when attaching.  Thus, a multithreaded mempolicy user should first
+use cgroup.procs (for correctness), but also next use the tasks file for
+each thread in the group to ensure updating the nodemasks for all of them.
+
 To start a new job that is to be contained within a cpuset, the steps are:
 
  1) mkdir /dev/cpuset
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b1febd0..45b3809 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -30,10 +30,8 @@ EXPORT_SYMBOL_GPL(blkio_root_cgroup);
 
 static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
 						  struct cgroup *);
-static int blkiocg_can_attach(struct cgroup_subsys *, struct cgroup *,
-			      struct task_struct *, bool);
-static void blkiocg_attach(struct cgroup_subsys *, struct cgroup *,
-			   struct cgroup *, struct task_struct *, bool);
+static int blkiocg_can_attach_task(struct cgroup *, struct task_struct *);
+static void blkiocg_attach_task(struct cgroup *, struct task_struct *);
 static void blkiocg_destroy(struct cgroup_subsys *, struct cgroup *);
 static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
 
@@ -46,8 +44,8 @@ static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
 struct cgroup_subsys blkio_subsys = {
 	.name = "blkio",
 	.create = blkiocg_create,
-	.can_attach = blkiocg_can_attach,
-	.attach = blkiocg_attach,
+	.can_attach_task = blkiocg_can_attach_task,
+	.attach_task = blkiocg_attach_task,
 	.destroy = blkiocg_destroy,
 	.populate = blkiocg_populate,
 #ifdef CONFIG_BLK_CGROUP
@@ -1475,9 +1473,7 @@ done:
  * of the main cic data structures.  For now we allow a task to change
  * its cgroup only if it's the only owner of its ioc.
  */
-static int blkiocg_can_attach(struct cgroup_subsys *subsys,
-				struct cgroup *cgroup, struct task_struct *tsk,
-				bool threadgroup)
+static int blkiocg_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 	struct io_context *ioc;
 	int ret = 0;
@@ -1492,9 +1488,7 @@ static int blkiocg_can_attach(struct cgroup_subsys *subsys,
 	return ret;
 }
 
-static void blkiocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
-				struct cgroup *prev, struct task_struct *tsk,
-				bool threadgroup)
+static void blkiocg_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 	struct io_context *ioc;
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ce104e3..35b69b4 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -467,12 +467,14 @@ struct cgroup_subsys {
 	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
-			  struct task_struct *tsk, bool threadgroup);
+			  struct task_struct *tsk);
+	int (*can_attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
 	void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
-			  struct task_struct *tsk, bool threadgroup);
+			      struct task_struct *tsk);
+	void (*pre_attach)(struct cgroup *cgrp);
+	void (*attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
-			struct cgroup *old_cgrp, struct task_struct *tsk,
-			bool threadgroup);
+		       struct cgroup *old_cgrp, struct task_struct *tsk);
 	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
 	void (*exit)(struct cgroup_subsys *ss, struct task_struct *task);
 	int (*populate)(struct cgroup_subsys *ss,
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 66a416b..616f27a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1750,7 +1750,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 
 	for_each_subsys(root, ss) {
 		if (ss->can_attach) {
-			retval = ss->can_attach(ss, cgrp, tsk, false);
+			retval = ss->can_attach(ss, cgrp, tsk);
 			if (retval) {
 				/*
 				 * Remember on which subsystem the can_attach()
@@ -1762,6 +1762,13 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 				goto out;
 			}
 		}
+		if (ss->can_attach_task) {
+			retval = ss->can_attach_task(cgrp, tsk);
+			if (retval) {
+				failed_ss = ss;
+				goto out;
+			}
+		}
 	}
 
 	task_lock(tsk);
@@ -1798,8 +1805,12 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	write_unlock(&css_set_lock);
 
 	for_each_subsys(root, ss) {
+		if (ss->pre_attach)
+			ss->pre_attach(cgrp);
+		if (ss->attach_task)
+			ss->attach_task(cgrp, tsk);
 		if (ss->attach)
-			ss->attach(ss, cgrp, oldcgrp, tsk, false);
+			ss->attach(ss, cgrp, oldcgrp, tsk);
 	}
 	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
 	synchronize_rcu();
@@ -1822,7 +1833,7 @@ out:
 				 */
 				break;
 			if (ss->cancel_attach)
-				ss->cancel_attach(ss, cgrp, tsk, false);
+				ss->cancel_attach(ss, cgrp, tsk);
 		}
 	}
 	return retval;
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index e7bebb7..e6ee70c 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -160,7 +160,7 @@ static void freezer_destroy(struct cgroup_subsys *ss,
  */
 static int freezer_can_attach(struct cgroup_subsys *ss,
 			      struct cgroup *new_cgroup,
-			      struct task_struct *task, bool threadgroup)
+			      struct task_struct *task)
 {
 	struct freezer *freezer;
 
@@ -172,26 +172,18 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 	if (freezer->state != CGROUP_THAWED)
 		return -EBUSY;
 
+	return 0;
+}
+
+static int freezer_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+{
+	/* rcu_read_lock allows recursive locking */
 	rcu_read_lock();
-	if (__cgroup_freezing_or_frozen(task)) {
+	if (__cgroup_freezing_or_frozen(tsk)) {
 		rcu_read_unlock();
 		return -EBUSY;
 	}
 	rcu_read_unlock();
-
-	if (threadgroup) {
-		struct task_struct *c;
-
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
-			if (__cgroup_freezing_or_frozen(c)) {
-				rcu_read_unlock();
-				return -EBUSY;
-			}
-		}
-		rcu_read_unlock();
-	}
-
 	return 0;
 }
 
@@ -390,6 +382,9 @@ struct cgroup_subsys freezer_subsys = {
 	.populate	= freezer_populate,
 	.subsys_id	= freezer_subsys_id,
 	.can_attach	= freezer_can_attach,
+	.can_attach_task = freezer_can_attach_task,
+	.pre_attach	= NULL,
+	.attach_task	= NULL,
 	.attach		= NULL,
 	.fork		= freezer_fork,
 	.exit		= NULL,
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 4349935..b9fce80 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1372,14 +1372,10 @@ static int fmeter_getrate(struct fmeter *fmp)
 	return val;
 }
 
-/* Protected by cgroup_lock */
-static cpumask_var_t cpus_attach;
-
 /* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
 static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
-			     struct task_struct *tsk, bool threadgroup)
+			     struct task_struct *tsk)
 {
-	int ret;
 	struct cpuset *cs = cgroup_cs(cont);
 
 	if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
@@ -1396,29 +1392,42 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
 	if (tsk->flags & PF_THREAD_BOUND)
 		return -EINVAL;
 
-	ret = security_task_setscheduler(tsk);
-	if (ret)
-		return ret;
-	if (threadgroup) {
-		struct task_struct *c;
-
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			ret = security_task_setscheduler(c);
-			if (ret) {
-				rcu_read_unlock();
-				return ret;
-			}
-		}
-		rcu_read_unlock();
-	}
 	return 0;
 }
 
-static void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
-			       struct cpuset *cs)
+static int cpuset_can_attach_task(struct cgroup *cgrp, struct task_struct *task)
+{
+	return security_task_setscheduler(task);
+}
+
+/*
+ * Protected by cgroup_lock. The nodemasks must be stored globally because
+ * dynamically allocating them is not allowed in pre_attach, and they must
+ * persist among pre_attach, attach_task, and attach.
+ */
+static cpumask_var_t cpus_attach;
+static nodemask_t cpuset_attach_nodemask_from;
+static nodemask_t cpuset_attach_nodemask_to;
+
+/* Do quick set-up work before attaching each task. */
+static void cpuset_pre_attach(struct cgroup *cont)
+{
+	struct cpuset *cs = cgroup_cs(cont);
+
+	if (cs == &top_cpuset)
+		cpumask_copy(cpus_attach, cpu_possible_mask);
+	else
+		guarantee_online_cpus(cs, cpus_attach);
+
+	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+}
+
+/* Per-thread attachment work. */
+static void cpuset_attach_task(struct cgroup *cont, struct task_struct *tsk)
 {
 	int err;
+	struct cpuset *cs = cgroup_cs(cont);
+
 	/*
 	 * can_attach beforehand should guarantee that this doesn't fail.
 	 * TODO: have a better way to handle failure here
@@ -1426,56 +1435,46 @@ static void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
 	err = set_cpus_allowed_ptr(tsk, cpus_attach);
 	WARN_ON_ONCE(err);
 
-	cpuset_change_task_nodemask(tsk, to);
 	cpuset_update_task_spread_flag(cs, tsk);
 
 }
 
 static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
-			  struct cgroup *oldcont, struct task_struct *tsk,
-			  bool threadgroup)
+			  struct cgroup *oldcont, struct task_struct *tsk)
 {
 	struct mm_struct *mm;
 	struct cpuset *cs = cgroup_cs(cont);
 	struct cpuset *oldcs = cgroup_cs(oldcont);
-	NODEMASK_ALLOC(nodemask_t, from, GFP_KERNEL);
-	NODEMASK_ALLOC(nodemask_t, to, GFP_KERNEL);
 
-	if (from == NULL || to == NULL)
-		goto alloc_fail;
-
-	if (cs == &top_cpuset) {
-		cpumask_copy(cpus_attach, cpu_possible_mask);
-	} else {
-		guarantee_online_cpus(cs, cpus_attach);
-	}
-	guarantee_online_mems(cs, to);
-
-	/* do per-task migration stuff possibly for each in the threadgroup */
-	cpuset_attach_task(tsk, to, cs);
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			cpuset_attach_task(c, to, cs);
-		}
-		rcu_read_unlock();
-	}
+	/*
+	 * TODO: As implemented, change_task_nodemask uses yield() to
+	 * synchronize with other users of the mems_allowed, which is not
+	 * allowed in the atomic attach_task callback, so we can't do this for
+	 * each thread in the multithreaded case. This is a performance issue,
+	 * but not a correctness one.
+	 *
+	 * As long as change_task_nodemask can yield, a multithreaded mempolicy
+	 * user should attach to a cgroup by threadgroup first (for
+	 * correctness) then poke each task to get its mempolicy right.
+	 *
+	 * This is the "third exception" in Documentation/cgroups/cpusets.txt.
+	 */
+	cpuset_change_task_nodemask(tsk, &cpuset_attach_nodemask_to);
 
-	/* change mm; only needs to be done once even if threadgroup */
-	*from = oldcs->mems_allowed;
-	*to = cs->mems_allowed;
+	/*
+	 * Change mm, possibly for multiple threads in a threadgroup. This is
+	 * expensive and may sleep.
+	 */
+	cpuset_attach_nodemask_from = oldcs->mems_allowed;
+	cpuset_attach_nodemask_to = cs->mems_allowed;
 	mm = get_task_mm(tsk);
 	if (mm) {
-		mpol_rebind_mm(mm, to);
+		mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
 		if (is_memory_migrate(cs))
-			cpuset_migrate_mm(mm, from, to);
+			cpuset_migrate_mm(mm, &cpuset_attach_nodemask_from,
+					  &cpuset_attach_nodemask_to);
 		mmput(mm);
 	}
-
-alloc_fail:
-	NODEMASK_FREE(from);
-	NODEMASK_FREE(to);
 }
 
 /* The various types of files and directories in a cpuset file system */
@@ -1928,6 +1927,9 @@ struct cgroup_subsys cpuset_subsys = {
 	.create = cpuset_create,
 	.destroy = cpuset_destroy,
 	.can_attach = cpuset_can_attach,
+	.can_attach_task = cpuset_can_attach_task,
+	.pre_attach = cpuset_pre_attach,
+	.attach_task = cpuset_attach_task,
 	.attach = cpuset_attach,
 	.populate = cpuset_populate,
 	.post_clone = cpuset_post_clone,
diff --git a/kernel/ns_cgroup.c b/kernel/ns_cgroup.c
index 2c98ad9..1fc2b1b 100644
--- a/kernel/ns_cgroup.c
+++ b/kernel/ns_cgroup.c
@@ -43,7 +43,7 @@ int ns_cgroup_clone(struct task_struct *task, struct pid *pid)
  *        ancestor cgroup thereof)
  */
 static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
-			 struct task_struct *task, bool threadgroup)
+			 struct task_struct *task)
 {
 	if (current != task) {
 		if (!capable(CAP_SYS_ADMIN))
@@ -53,21 +53,13 @@ static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
 			return -EPERM;
 	}
 
-	if (!cgroup_is_descendant(new_cgroup, task))
-		return -EPERM;
-
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
-			if (!cgroup_is_descendant(new_cgroup, c)) {
-				rcu_read_unlock();
-				return -EPERM;
-			}
-		}
-		rcu_read_unlock();
-	}
+	return 0;
+}
 
+static int ns_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+{
+	if (!cgroup_is_descendant(cgrp, tsk))
+		return -EPERM;
 	return 0;
 }
 
@@ -112,6 +104,7 @@ static void ns_destroy(struct cgroup_subsys *ss,
 struct cgroup_subsys ns_subsys = {
 	.name = "ns",
 	.can_attach = ns_can_attach,
+	.can_attach_task = ns_can_attach_task,
 	.create = ns_create,
 	.destroy  = ns_destroy,
 	.subsys_id = ns_subsys_id,
diff --git a/kernel/sched.c b/kernel/sched.c
index 218ef20..d619f1d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8655,42 +8655,10 @@ cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	return 0;
 }
 
-static int
-cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-		      struct task_struct *tsk, bool threadgroup)
-{
-	int retval = cpu_cgroup_can_attach_task(cgrp, tsk);
-	if (retval)
-		return retval;
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			retval = cpu_cgroup_can_attach_task(cgrp, c);
-			if (retval) {
-				rcu_read_unlock();
-				return retval;
-			}
-		}
-		rcu_read_unlock();
-	}
-	return 0;
-}
-
 static void
-cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-		  struct cgroup *old_cont, struct task_struct *tsk,
-		  bool threadgroup)
+cpu_cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 	sched_move_task(tsk);
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			sched_move_task(c);
-		}
-		rcu_read_unlock();
-	}
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -8763,8 +8731,8 @@ struct cgroup_subsys cpu_cgroup_subsys = {
 	.name		= "cpu",
 	.create		= cpu_cgroup_create,
 	.destroy	= cpu_cgroup_destroy,
-	.can_attach	= cpu_cgroup_can_attach,
-	.attach		= cpu_cgroup_attach,
+	.can_attach_task = cpu_cgroup_can_attach_task,
+	.attach_task	= cpu_cgroup_attach_task,
 	.populate	= cpu_cgroup_populate,
 	.subsys_id	= cpu_cgroup_subsys_id,
 	.early_init	= 1,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 729beb7..995f0b9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4720,8 +4720,7 @@ static void mem_cgroup_clear_mc(void)
 
 static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 	int ret = 0;
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
@@ -4775,8 +4774,7 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 
 static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 	mem_cgroup_clear_mc();
 }
@@ -4880,8 +4878,7 @@ static void mem_cgroup_move_charge(struct mm_struct *mm)
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct cgroup *cont,
 				struct cgroup *old_cont,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 	if (!mc.mm)
 		/* no need to move charge */
@@ -4893,22 +4890,19 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 #else	/* !CONFIG_MMU */
 static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 	return 0;
 }
 static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 }
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct cgroup *cont,
 				struct cgroup *old_cont,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 }
 #endif
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 8d9c48f..cd1f779 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -62,8 +62,7 @@ static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
 struct cgroup_subsys devices_subsys;
 
 static int devcgroup_can_attach(struct cgroup_subsys *ss,
-		struct cgroup *new_cgroup, struct task_struct *task,
-		bool threadgroup)
+		struct cgroup *new_cgroup, struct task_struct *task)
 {
 	if (current != task && !capable(CAP_SYS_ADMIN))
 			return -EPERM;

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v7 3/3] cgroups: make procs file writable
       [not found]           ` <20101226120919.GA28529-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2010-12-26 12:09             ` [PATCH v7 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup Ben Blum
  2010-12-26 12:11             ` [PATCH v7 2/3] cgroups: add atomic-context per-thread subsystem callbacks Ben Blum
@ 2010-12-26 12:12             ` Ben Blum
  2011-02-08  1:35             ` [PATCH v8 0/3] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Ben Blum
  3 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-12-26 12:12 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Miao Xie, David Rientjes, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

[-- Attachment #1: cgroup-procs-writable.patch --]
[-- Type: text/plain, Size: 18183 bytes --]

Makes procs file writable to move all threads by tgid at once

From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

This patch adds functionality that enables users to move all threads in a
threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
file. The current implementation uses a per-threadgroup rwsem that's
taken for reading in the fork() path to prevent newly forking threads within
the threadgroup from "escaping" while the move is in progress.
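
A minimal user-space sketch of the locking pattern, for illustration only
(the kernel patch uses an rw_semaphore embedded in signal_struct rather than
pthreads, and the function names below are stand-ins, not the patch's API):

    #include <pthread.h>

    /* stands in for signal_struct->threadgroup_fork_lock */
    static pthread_rwlock_t threadgroup_fork_lock = PTHREAD_RWLOCK_INITIALIZER;

    /* fork()/clone(CLONE_THREAD) path: take the read side, so any number of
     * forks may run concurrently, but none while a procs-write holds the
     * write side. */
    static void fork_path(void)
    {
        pthread_rwlock_rdlock(&threadgroup_fork_lock);
        /* ... cgroup_fork() ... copy_process() ... cgroup_post_fork() ... */
        pthread_rwlock_unlock(&threadgroup_fork_lock);
    }

    /* cgroup.procs write path: take the write side, so the set of threads in
     * the group cannot grow while they are being migrated. */
    static void attach_proc_path(void)
    {
        pthread_rwlock_wrlock(&threadgroup_fork_lock);
        /* ... migrate every thread in the threadgroup ... */
        pthread_rwlock_unlock(&threadgroup_fork_lock);
    }

    int main(void)
    {
        fork_path();
        attach_proc_path();
        return 0;
    }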

Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
---
 Documentation/cgroups/cgroups.txt |    9 +
 kernel/cgroup.c                   |  472 +++++++++++++++++++++++++++++++++----
 2 files changed, 432 insertions(+), 49 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 341ed44..9157e75 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -236,7 +236,8 @@ containing the following files describing that cgroup:
  - cgroup.procs: list of tgids in the cgroup.  This list is not
    guaranteed to be sorted or free of duplicate tgids, and userspace
    should sort/uniquify the list if this property is required.
-   This is a read-only file, for now.
+   Writing a thread group id into this file moves all threads in that
+   group into this cgroup.
  - notify_on_release flag: run the release agent on exit?
  - release_agent: the path to use for release notifications (this file
    exists in the top cgroup only)
@@ -426,6 +427,12 @@ You can attach the current shell task by echoing 0:
 
 # echo 0 > tasks
 
+You can use the cgroup.procs file instead of the tasks file to move all
+threads in a threadgroup at once. Echoing the pid of any task in a
+threadgroup to cgroup.procs causes all tasks in that threadgroup to
+be attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
+in the writing task's threadgroup.
+
 2.3 Mounting hierarchies by name
 --------------------------------
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 616f27a..9361c44 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1726,6 +1726,76 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 }
 EXPORT_SYMBOL_GPL(cgroup_path);
 
+/*
+ * cgroup_task_migrate - move a task from one cgroup to another.
+ *
+ * 'guarantee' is set if the caller promises that a new css_set for the task
+ * will already exist. If not set, this function might sleep, and can fail with
+ * -ENOMEM. Otherwise, it can only fail with -ESRCH.
+ */
+static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
+			       struct task_struct *tsk, bool guarantee)
+{
+	struct css_set *oldcg;
+	struct css_set *newcg;
+
+	/*
+	 * get old css_set. we need to take task_lock and refcount it, because
+	 * an exiting task can change its css_set to init_css_set and drop its
+	 * old one without taking cgroup_mutex.
+	 */
+	task_lock(tsk);
+	oldcg = tsk->cgroups;
+	get_css_set(oldcg);
+	task_unlock(tsk);
+
+	/* locate or allocate a new css_set for this task. */
+	if (guarantee) {
+		/* we know the css_set we want already exists. */
+		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+		read_lock(&css_set_lock);
+		newcg = find_existing_css_set(oldcg, cgrp, template);
+		BUG_ON(!newcg);
+		get_css_set(newcg);
+		read_unlock(&css_set_lock);
+	} else {
+		might_sleep();
+		/* find_css_set will give us newcg already referenced. */
+		newcg = find_css_set(oldcg, cgrp);
+		if (!newcg) {
+			put_css_set(oldcg);
+			return -ENOMEM;
+		}
+	}
+	put_css_set(oldcg);
+
+	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
+	task_lock(tsk);
+	if (tsk->flags & PF_EXITING) {
+		task_unlock(tsk);
+		put_css_set(newcg);
+		return -ESRCH;
+	}
+	rcu_assign_pointer(tsk->cgroups, newcg);
+	task_unlock(tsk);
+
+	/* Update the css_set linked lists if we're using them */
+	write_lock(&css_set_lock);
+	if (!list_empty(&tsk->cg_list))
+		list_move(&tsk->cg_list, &newcg->tasks);
+	write_unlock(&css_set_lock);
+
+	/*
+	 * We just gained a reference on oldcg by taking it from the task. As
+	 * trading it for newcg is protected by cgroup_mutex, we're safe to drop
+	 * it here; it will be freed under RCU.
+	 */
+	put_css_set(oldcg);
+
+	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+	return 0;
+}
+
 /**
  * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
  * @cgrp: the cgroup the task is attaching to
@@ -1736,11 +1806,9 @@ EXPORT_SYMBOL_GPL(cgroup_path);
  */
 int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
-	int retval = 0;
+	int retval;
 	struct cgroup_subsys *ss, *failed_ss = NULL;
 	struct cgroup *oldcgrp;
-	struct css_set *cg;
-	struct css_set *newcg;
 	struct cgroupfs_root *root = cgrp->root;
 
 	/* Nothing to do if the task is already in that cgroup */
@@ -1771,38 +1839,9 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		}
 	}
 
-	task_lock(tsk);
-	cg = tsk->cgroups;
-	get_css_set(cg);
-	task_unlock(tsk);
-	/*
-	 * Locate or allocate a new css_set for this task,
-	 * based on its final set of cgroups
-	 */
-	newcg = find_css_set(cg, cgrp);
-	put_css_set(cg);
-	if (!newcg) {
-		retval = -ENOMEM;
-		goto out;
-	}
-
-	task_lock(tsk);
-	if (tsk->flags & PF_EXITING) {
-		task_unlock(tsk);
-		put_css_set(newcg);
-		retval = -ESRCH;
+	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, false);
+	if (retval)
 		goto out;
-	}
-	rcu_assign_pointer(tsk->cgroups, newcg);
-	task_unlock(tsk);
-
-	/* Update the css_set linked lists if we're using them */
-	write_lock(&css_set_lock);
-	if (!list_empty(&tsk->cg_list)) {
-		list_del(&tsk->cg_list);
-		list_add(&tsk->cg_list, &newcg->tasks);
-	}
-	write_unlock(&css_set_lock);
 
 	for_each_subsys(root, ss) {
 		if (ss->pre_attach)
@@ -1812,9 +1851,8 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		if (ss->attach)
 			ss->attach(ss, cgrp, oldcgrp, tsk);
 	}
-	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+
 	synchronize_rcu();
-	put_css_set(cg);
 
 	/*
 	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
@@ -1864,49 +1902,387 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 EXPORT_SYMBOL_GPL(cgroup_attach_task_all);
 
 /*
- * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
- * held. May take task_lock of task
+ * cgroup_attach_proc works in two stages, the first of which prefetches all
+ * new css_sets needed (to make sure we have enough memory before committing
+ * to the move) and stores them in a list of entries of the following type.
+ * TODO: possible optimization: use css_set->rcu_head for chaining instead
+ */
+struct cg_list_entry {
+	struct css_set *cg;
+	struct list_head links;
+};
+
+static bool css_set_check_fetched(struct cgroup *cgrp,
+				  struct task_struct *tsk, struct css_set *cg,
+				  struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+
+	read_lock(&css_set_lock);
+	newcg = find_existing_css_set(cg, cgrp, template);
+	if (newcg)
+		get_css_set(newcg);
+	read_unlock(&css_set_lock);
+
+	/* doesn't exist at all? */
+	if (!newcg)
+		return false;
+	/* see if it's already in the list */
+	list_for_each_entry(cg_entry, newcg_list, links) {
+		if (cg_entry->cg == newcg) {
+			put_css_set(newcg);
+			return true;
+		}
+	}
+
+	/* not found */
+	put_css_set(newcg);
+	return false;
+}
+
+/*
+ * Find the new css_set and store it in the list in preparation for moving the
+ * given task to the given cgroup. Returns 0 or -ENOMEM.
+ */
+static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
+			    struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+
+	/* ensure a new css_set will exist for this thread */
+	newcg = find_css_set(cg, cgrp);
+	if (!newcg)
+		return -ENOMEM;
+	/* add it to the list */
+	cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
+	if (!cg_entry) {
+		put_css_set(newcg);
+		return -ENOMEM;
+	}
+	cg_entry->cg = newcg;
+	list_add(&cg_entry->links, newcg_list);
+	return 0;
+}
+
+/**
+ * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
+ * @cgrp: the cgroup to attach to
+ * @leader: the threadgroup leader task_struct of the group to be attached
+ *
+ * Call holding cgroup_mutex. Will take task_lock of each thread in leader's
+ * threadgroup individually in turn.
+ */
+int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
+{
+	int retval;
+	struct cgroup_subsys *ss, *failed_ss = NULL;
+	struct cgroup *oldcgrp;
+	struct css_set *oldcg;
+	struct cgroupfs_root *root = cgrp->root;
+	/* threadgroup list cursor */
+	struct task_struct *tsk;
+	/*
+	 * we need to make sure we have css_sets for all the tasks we're
+	 * going to move -before- we actually start moving them, so that in
+	 * case we get an ENOMEM we can bail out before making any changes.
+	 */
+	struct list_head newcg_list;
+	struct cg_list_entry *cg_entry, *temp_nobe;
+
+	/* check that we can legitimately attach to the cgroup. */
+	for_each_subsys(root, ss) {
+		if (ss->can_attach) {
+			retval = ss->can_attach(ss, cgrp, leader);
+			if (retval) {
+				failed_ss = ss;
+				goto out;
+			}
+		}
+		/* a callback to be run on every thread in the threadgroup. */
+		if (ss->can_attach_task) {
+			/* run callback on the leader first. */
+			retval = ss->can_attach_task(cgrp, leader);
+			if (retval) {
+				failed_ss = ss;
+				goto out;
+			}
+
+			/* run on each task in the threadgroup. */
+			rcu_read_lock();
+			/* sanity check - racing de_thread may cause this. */
+			if (!thread_group_leader(leader)) {
+				rcu_read_unlock();
+				retval = -EAGAIN;
+				failed_ss = ss;
+				goto out;
+			}
+			list_for_each_entry_rcu(tsk, &leader->thread_group,
+						thread_group) {
+				retval = ss->can_attach_task(cgrp, tsk);
+				if (retval) {
+					rcu_read_unlock();
+					failed_ss = ss;
+					goto out;
+				}
+			}
+			rcu_read_unlock();
+		}
+	}
+
+	/*
+	 * step 1: make sure css_sets exist for all threads to be migrated.
+	 * we use find_css_set, which allocates a new one if necessary.
+	 */
+	INIT_LIST_HEAD(&newcg_list);
+	oldcgrp = task_cgroup_from_root(leader, root);
+	if (cgrp != oldcgrp) {
+		/* get old css_set */
+		task_lock(leader);
+		if (leader->flags & PF_EXITING) {
+			task_unlock(leader);
+			goto prefetch_loop;
+		}
+		oldcg = leader->cgroups;
+		get_css_set(oldcg);
+		task_unlock(leader);
+		/* acquire new one */
+		retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+		put_css_set(oldcg);
+		if (retval)
+			goto list_teardown;
+	}
+prefetch_loop:
+	rcu_read_lock();
+	/* sanity check - if we raced with de_thread, we must abort */
+	if (!thread_group_leader(leader)) {
+		retval = -EAGAIN;
+		goto list_teardown;
+	}
+	/*
+	 * if we need to fetch a new css_set for this task, we must exit the
+	 * rcu_read section because allocating it can sleep. afterwards, we'll
+	 * need to restart iteration on the threadgroup list - the whole thing
+	 * will be O(nm) in the number of threads and css_sets; as the typical
+	 * case has only one css_set for all of them, usually O(n). which ones
+	 * we need allocated won't change as long as we hold cgroup_mutex.
+	 */
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* nothing to do if this task is already in the cgroup */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* get old css_set pointer */
+		task_lock(tsk);
+		if (tsk->flags & PF_EXITING) {
+			/* ignore this task if it's going away */
+			task_unlock(tsk);
+			continue;
+		}
+		oldcg = tsk->cgroups;
+		get_css_set(oldcg);
+		task_unlock(tsk);
+		/* see if the new one for us is already in the list? */
+		if (css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list)) {
+			/* was already there, nothing to do. */
+			put_css_set(oldcg);
+		} else {
+			/* we don't already have it. get new one. */
+			rcu_read_unlock();
+			retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+			put_css_set(oldcg);
+			if (retval)
+				goto list_teardown;
+			/* begin iteration again. */
+			goto prefetch_loop;
+		}
+	}
+	rcu_read_unlock();
+
+	/*
+	 * step 2: now that we're guaranteed success wrt the css_sets, proceed
+	 * to move all tasks to the new cgroup. we need to lock against possible
+	 * races with fork(). note: we can safely take the threadgroup_fork_lock
+	 * of leader since attach_task_by_pid took a reference.
+	 * threadgroup_fork_lock must be taken outside of tasklist_lock to match
+	 * the order in the fork path.
+	 */
+	threadgroup_fork_write_lock(leader);
+	read_lock(&tasklist_lock);
+	/* sanity check - if we raced with de_thread, we must abort */
+	if (!thread_group_leader(leader)) {
+		retval = -EAGAIN;
+		read_unlock(&tasklist_lock);
+		threadgroup_fork_write_unlock(leader);
+		goto list_teardown;
+	}
+	/*
+	 * No failure cases left, so this is the commit point.
+	 *
+	 * Start by calling pre_attach for each subsystem.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->pre_attach)
+			ss->pre_attach(cgrp);
+	}
+	/*
+	 * Move each thread, calling ss->attach_task for each one along the way.
+	 *
+	 * If the leader is already there, skip moving him. Note: even if the
+	 * leader is PF_EXITING, we still move all other threads; if everybody
+	 * is PF_EXITING, we end up doing nothing, which is ok.
+	 */
+	oldcgrp = task_cgroup_from_root(leader, root);
+	if (cgrp != oldcgrp) {
+		/* attach the leader */
+		for_each_subsys(root, ss) {
+			if (ss->attach_task)
+				ss->attach_task(cgrp, leader);
+		}
+		retval = cgroup_task_migrate(cgrp, oldcgrp, leader, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+	/* Now iterate over each thread in the group. */
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* leave current thread as it is if it's already there */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* attach each task to each subsystem */
+		for_each_subsys(root, ss) {
+			if (ss->attach_task)
+				ss->attach_task(cgrp, tsk);
+		}
+		/* we don't care whether these threads are exiting */
+		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+	/* nothing is sensitive to fork() or exec() after this point. */
+	read_unlock(&tasklist_lock);
+	threadgroup_fork_write_unlock(leader);
+
+	/*
+	 * step 3: do expensive, non-thread-specific subsystem callbacks.
+	 * TODO: if ever a subsystem needs to know the oldcgrp for each task
+	 * being moved, this call will need to be reworked to communicate that.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->attach)
+			ss->attach(ss, cgrp, oldcgrp, leader);
+	}
+
+	/*
+	 * step 4: success! and cleanup
+	 */
+	synchronize_rcu();
+	cgroup_wakeup_rmdir_waiter(cgrp);
+	retval = 0;
+list_teardown:
+	/* clean up the list of prefetched css_sets. */
+	list_for_each_entry_safe(cg_entry, temp_nobe, &newcg_list, links) {
+		list_del(&cg_entry->links);
+		put_css_set(cg_entry->cg);
+		kfree(cg_entry);
+	}
+out:
+	if (retval) {
+		/* same deal as in cgroup_attach_task */
+		for_each_subsys(root, ss) {
+			if (ss == failed_ss)
+				break;
+			if (ss->cancel_attach)
+				ss->cancel_attach(ss, cgrp, leader);
+		}
+	}
+	return retval;
+}
+
+/*
+ * Find the task_struct of the task to attach by vpid and pass it along to the
+ * function to attach either it or all tasks in its threadgroup. Will take
+ * cgroup_mutex; may take task_lock of task.
  */
-static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
+static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 {
 	struct task_struct *tsk;
 	const struct cred *cred = current_cred(), *tcred;
 	int ret;
 
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+
 	if (pid) {
 		rcu_read_lock();
 		tsk = find_task_by_vpid(pid);
-		if (!tsk || tsk->flags & PF_EXITING) {
+		if (!tsk) {
 			rcu_read_unlock();
+			cgroup_unlock();
+			return -ESRCH;
+		}
+		if (threadgroup) {
+			/*
+			 * it is safe to find group_leader because tsk was found
+			 * in the tid map, meaning it can't have been unhashed
+			 * by someone in de_thread changing the leadership.
+			 */
+			tsk = tsk->group_leader;
+			BUG_ON(!thread_group_leader(tsk));
+		} else if (tsk->flags & PF_EXITING) {
+			/* optimization for the single-task-only case */
+			rcu_read_unlock();
+			cgroup_unlock();
 			return -ESRCH;
 		}
 
+		/*
+		 * even if we're attaching all tasks in the thread group, we
+		 * only need to check permissions on one of them.
+		 */
 		tcred = __task_cred(tsk);
 		if (cred->euid &&
 		    cred->euid != tcred->uid &&
 		    cred->euid != tcred->suid) {
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -EACCES;
 		}
 		get_task_struct(tsk);
 		rcu_read_unlock();
 	} else {
-		tsk = current;
+		if (threadgroup)
+			tsk = current->group_leader;
+		else
+			tsk = current;
 		get_task_struct(tsk);
 	}
 
-	ret = cgroup_attach_task(cgrp, tsk);
+	if (threadgroup)
+		ret = cgroup_attach_proc(cgrp, tsk);
+	else
+		ret = cgroup_attach_task(cgrp, tsk);
 	put_task_struct(tsk);
+	cgroup_unlock();
 	return ret;
 }
 
 static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
 {
+	return attach_task_by_pid(cgrp, pid, false);
+}
+
+static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
+{
 	int ret;
-	if (!cgroup_lock_live_group(cgrp))
-		return -ENODEV;
-	ret = attach_task_by_pid(cgrp, pid);
-	cgroup_unlock();
+	do {
+		/*
+		 * attach_proc fails with -EAGAIN if threadgroup leadership
+		 * changes in the middle of the operation, in which case we need
+		 * to find the task_struct for the new leader and start over.
+		 */
+		ret = attach_task_by_pid(cgrp, tgid, true);
+	} while (ret == -EAGAIN);
 	return ret;
 }
 
@@ -3260,9 +3636,9 @@ static struct cftype files[] = {
 	{
 		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
 		.open = cgroup_procs_open,
-		/* .write_u64 = cgroup_procs_write, TODO */
+		.write_u64 = cgroup_procs_write,
 		.release = cgroup_pidlist_release,
-		.mode = S_IRUGO,
+		.mode = S_IRUGO | S_IWUSR,
 	},
 	{
 		.name = "notify_on_release",

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v7 3/3] cgroups: make procs file writable
  2010-12-26 12:09         ` Ben Blum
  2010-12-26 12:09           ` [PATCH v7 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup Ben Blum
  2010-12-26 12:11           ` [PATCH v7 2/3] cgroups: add atomic-context per-thread subsystem callbacks Ben Blum
@ 2010-12-26 12:12           ` Ben Blum
       [not found]           ` <20101226120919.GA28529-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2011-02-08  1:35           ` Ben Blum
  4 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2010-12-26 12:12 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage,
	oleg, David Rientjes, Miao Xie

[-- Attachment #1: cgroup-procs-writable.patch --]
[-- Type: text/plain, Size: 18133 bytes --]

Makes procs file writable to move all threads by tgid at once

From: Ben Blum <bblum@andrew.cmu.edu>

This patch adds functionality that enables users to move all threads in a
threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
file. The current implementation uses a per-threadgroup rwsem that's
taken for reading in the fork() path to prevent newly forking threads within
the threadgroup from "escaping" while the move is in progress.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
---
 Documentation/cgroups/cgroups.txt |    9 +
 kernel/cgroup.c                   |  472 +++++++++++++++++++++++++++++++++----
 2 files changed, 432 insertions(+), 49 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 341ed44..9157e75 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -236,7 +236,8 @@ containing the following files describing that cgroup:
  - cgroup.procs: list of tgids in the cgroup.  This list is not
    guaranteed to be sorted or free of duplicate tgids, and userspace
    should sort/uniquify the list if this property is required.
-   This is a read-only file, for now.
+   Writing a thread group id into this file moves all threads in that
+   group into this cgroup.
  - notify_on_release flag: run the release agent on exit?
  - release_agent: the path to use for release notifications (this file
    exists in the top cgroup only)
@@ -426,6 +427,12 @@ You can attach the current shell task by echoing 0:
 
 # echo 0 > tasks
 
+You can use the cgroup.procs file instead of the tasks file to move all
+threads in a threadgroup at once. Echoing the pid of any task in a
+threadgroup to cgroup.procs causes all tasks in that threadgroup to
+be attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
+in the writing task's threadgroup.
+
 2.3 Mounting hierarchies by name
 --------------------------------
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 616f27a..9361c44 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1726,6 +1726,76 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 }
 EXPORT_SYMBOL_GPL(cgroup_path);
 
+/*
+ * cgroup_task_migrate - move a task from one cgroup to another.
+ *
+ * 'guarantee' is set if the caller promises that a new css_set for the task
+ * will already exist. If not set, this function might sleep, and can fail with
+ * -ENOMEM. Otherwise, it can only fail with -ESRCH.
+ */
+static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
+			       struct task_struct *tsk, bool guarantee)
+{
+	struct css_set *oldcg;
+	struct css_set *newcg;
+
+	/*
+	 * get old css_set. we need to take task_lock and refcount it, because
+	 * an exiting task can change its css_set to init_css_set and drop its
+	 * old one without taking cgroup_mutex.
+	 */
+	task_lock(tsk);
+	oldcg = tsk->cgroups;
+	get_css_set(oldcg);
+	task_unlock(tsk);
+
+	/* locate or allocate a new css_set for this task. */
+	if (guarantee) {
+		/* we know the css_set we want already exists. */
+		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+		read_lock(&css_set_lock);
+		newcg = find_existing_css_set(oldcg, cgrp, template);
+		BUG_ON(!newcg);
+		get_css_set(newcg);
+		read_unlock(&css_set_lock);
+	} else {
+		might_sleep();
+		/* find_css_set will give us newcg already referenced. */
+		newcg = find_css_set(oldcg, cgrp);
+		if (!newcg) {
+			put_css_set(oldcg);
+			return -ENOMEM;
+		}
+	}
+	put_css_set(oldcg);
+
+	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
+	task_lock(tsk);
+	if (tsk->flags & PF_EXITING) {
+		task_unlock(tsk);
+		put_css_set(newcg);
+		return -ESRCH;
+	}
+	rcu_assign_pointer(tsk->cgroups, newcg);
+	task_unlock(tsk);
+
+	/* Update the css_set linked lists if we're using them */
+	write_lock(&css_set_lock);
+	if (!list_empty(&tsk->cg_list))
+		list_move(&tsk->cg_list, &newcg->tasks);
+	write_unlock(&css_set_lock);
+
+	/*
+	 * We just gained a reference on oldcg by taking it from the task. As
+	 * trading it for newcg is protected by cgroup_mutex, we're safe to drop
+	 * it here; it will be freed under RCU.
+	 */
+	put_css_set(oldcg);
+
+	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+	return 0;
+}
+
 /**
  * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
  * @cgrp: the cgroup the task is attaching to
@@ -1736,11 +1806,9 @@ EXPORT_SYMBOL_GPL(cgroup_path);
  */
 int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
-	int retval = 0;
+	int retval;
 	struct cgroup_subsys *ss, *failed_ss = NULL;
 	struct cgroup *oldcgrp;
-	struct css_set *cg;
-	struct css_set *newcg;
 	struct cgroupfs_root *root = cgrp->root;
 
 	/* Nothing to do if the task is already in that cgroup */
@@ -1771,38 +1839,9 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		}
 	}
 
-	task_lock(tsk);
-	cg = tsk->cgroups;
-	get_css_set(cg);
-	task_unlock(tsk);
-	/*
-	 * Locate or allocate a new css_set for this task,
-	 * based on its final set of cgroups
-	 */
-	newcg = find_css_set(cg, cgrp);
-	put_css_set(cg);
-	if (!newcg) {
-		retval = -ENOMEM;
-		goto out;
-	}
-
-	task_lock(tsk);
-	if (tsk->flags & PF_EXITING) {
-		task_unlock(tsk);
-		put_css_set(newcg);
-		retval = -ESRCH;
+	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, false);
+	if (retval)
 		goto out;
-	}
-	rcu_assign_pointer(tsk->cgroups, newcg);
-	task_unlock(tsk);
-
-	/* Update the css_set linked lists if we're using them */
-	write_lock(&css_set_lock);
-	if (!list_empty(&tsk->cg_list)) {
-		list_del(&tsk->cg_list);
-		list_add(&tsk->cg_list, &newcg->tasks);
-	}
-	write_unlock(&css_set_lock);
 
 	for_each_subsys(root, ss) {
 		if (ss->pre_attach)
@@ -1812,9 +1851,8 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		if (ss->attach)
 			ss->attach(ss, cgrp, oldcgrp, tsk);
 	}
-	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+
 	synchronize_rcu();
-	put_css_set(cg);
 
 	/*
 	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
@@ -1864,49 +1902,387 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 EXPORT_SYMBOL_GPL(cgroup_attach_task_all);
 
 /*
- * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
- * held. May take task_lock of task
+ * cgroup_attach_proc works in two stages, the first of which prefetches all
+ * new css_sets needed (to make sure we have enough memory before committing
+ * to the move) and stores them in a list of entries of the following type.
+ * TODO: possible optimization: use css_set->rcu_head for chaining instead
+ */
+struct cg_list_entry {
+	struct css_set *cg;
+	struct list_head links;
+};
+
+static bool css_set_check_fetched(struct cgroup *cgrp,
+				  struct task_struct *tsk, struct css_set *cg,
+				  struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+
+	read_lock(&css_set_lock);
+	newcg = find_existing_css_set(cg, cgrp, template);
+	if (newcg)
+		get_css_set(newcg);
+	read_unlock(&css_set_lock);
+
+	/* doesn't exist at all? */
+	if (!newcg)
+		return false;
+	/* see if it's already in the list */
+	list_for_each_entry(cg_entry, newcg_list, links) {
+		if (cg_entry->cg == newcg) {
+			put_css_set(newcg);
+			return true;
+		}
+	}
+
+	/* not found */
+	put_css_set(newcg);
+	return false;
+}
+
+/*
+ * Find the new css_set and store it in the list in preparation for moving the
+ * given task to the given cgroup. Returns 0 or -ENOMEM.
+ */
+static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
+			    struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+
+	/* ensure a new css_set will exist for this thread */
+	newcg = find_css_set(cg, cgrp);
+	if (!newcg)
+		return -ENOMEM;
+	/* add it to the list */
+	cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
+	if (!cg_entry) {
+		put_css_set(newcg);
+		return -ENOMEM;
+	}
+	cg_entry->cg = newcg;
+	list_add(&cg_entry->links, newcg_list);
+	return 0;
+}
+
+/**
+ * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
+ * @cgrp: the cgroup to attach to
+ * @leader: the threadgroup leader task_struct of the group to be attached
+ *
+ * Call holding cgroup_mutex. Will take task_lock of each thread in leader's
+ * threadgroup individually in turn.
+ */
+int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
+{
+	int retval;
+	struct cgroup_subsys *ss, *failed_ss = NULL;
+	struct cgroup *oldcgrp;
+	struct css_set *oldcg;
+	struct cgroupfs_root *root = cgrp->root;
+	/* threadgroup list cursor */
+	struct task_struct *tsk;
+	/*
+	 * we need to make sure we have css_sets for all the tasks we're
+	 * going to move -before- we actually start moving them, so that in
+	 * case we get an ENOMEM we can bail out before making any changes.
+	 */
+	struct list_head newcg_list;
+	struct cg_list_entry *cg_entry, *temp_nobe;
+
+	/* check that we can legitimately attach to the cgroup. */
+	for_each_subsys(root, ss) {
+		if (ss->can_attach) {
+			retval = ss->can_attach(ss, cgrp, leader);
+			if (retval) {
+				failed_ss = ss;
+				goto out;
+			}
+		}
+		/* a callback to be run on every thread in the threadgroup. */
+		if (ss->can_attach_task) {
+			/* run callback on the leader first. */
+			retval = ss->can_attach_task(cgrp, leader);
+			if (retval) {
+				failed_ss = ss;
+				goto out;
+			}
+
+			/* run on each task in the threadgroup. */
+			rcu_read_lock();
+			/* sanity check - racing de_thread may cause this. */
+			if (!thread_group_leader(leader)) {
+				rcu_read_unlock();
+				retval = -EAGAIN;
+				failed_ss = ss;
+				goto out;
+			}
+			list_for_each_entry_rcu(tsk, &leader->thread_group,
+						thread_group) {
+				retval = ss->can_attach_task(cgrp, tsk);
+				if (retval) {
+					rcu_read_unlock();
+					failed_ss = ss;
+					goto out;
+				}
+			}
+			rcu_read_unlock();
+		}
+	}
+
+	/*
+	 * step 1: make sure css_sets exist for all threads to be migrated.
+	 * we use find_css_set, which allocates a new one if necessary.
+	 */
+	INIT_LIST_HEAD(&newcg_list);
+	oldcgrp = task_cgroup_from_root(leader, root);
+	if (cgrp != oldcgrp) {
+		/* get old css_set */
+		task_lock(leader);
+		if (leader->flags & PF_EXITING) {
+			task_unlock(leader);
+			goto prefetch_loop;
+		}
+		oldcg = leader->cgroups;
+		get_css_set(oldcg);
+		task_unlock(leader);
+		/* acquire new one */
+		retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+		put_css_set(oldcg);
+		if (retval)
+			goto list_teardown;
+	}
+prefetch_loop:
+	rcu_read_lock();
+	/* sanity check - if we raced with de_thread, we must abort */
+	if (!thread_group_leader(leader)) {
+		retval = -EAGAIN;
+		goto list_teardown;
+	}
+	/*
+	 * if we need to fetch a new css_set for this task, we must exit the
+	 * rcu_read section because allocating it can sleep. afterwards, we'll
+	 * need to restart iteration on the threadgroup list - the whole thing
+	 * will be O(nm) in the number of threads and css_sets; as the typical
+	 * case has only one css_set for all of them, usually O(n). which ones
+	 * we need allocated won't change as long as we hold cgroup_mutex.
+	 */
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* nothing to do if this task is already in the cgroup */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* get old css_set pointer */
+		task_lock(tsk);
+		if (tsk->flags & PF_EXITING) {
+			/* ignore this task if it's going away */
+			task_unlock(tsk);
+			continue;
+		}
+		oldcg = tsk->cgroups;
+		get_css_set(oldcg);
+		task_unlock(tsk);
+		/* see if the new one for us is already in the list? */
+		if (css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list)) {
+			/* was already there, nothing to do. */
+			put_css_set(oldcg);
+		} else {
+			/* we don't already have it. get new one. */
+			rcu_read_unlock();
+			retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+			put_css_set(oldcg);
+			if (retval)
+				goto list_teardown;
+			/* begin iteration again. */
+			goto prefetch_loop;
+		}
+	}
+	rcu_read_unlock();
+
+	/*
+	 * step 2: now that we're guaranteed success wrt the css_sets, proceed
+	 * to move all tasks to the new cgroup. we need to lock against possible
+	 * races with fork(). note: we can safely take the threadgroup_fork_lock
+	 * of leader since attach_task_by_pid took a reference.
+	 * threadgroup_fork_lock must be taken outside of tasklist_lock to match
+	 * the order in the fork path.
+	 */
+	threadgroup_fork_write_lock(leader);
+	read_lock(&tasklist_lock);
+	/* sanity check - if we raced with de_thread, we must abort */
+	if (!thread_group_leader(leader)) {
+		retval = -EAGAIN;
+		read_unlock(&tasklist_lock);
+		threadgroup_fork_write_unlock(leader);
+		goto list_teardown;
+	}
+	/*
+	 * No failure cases left, so this is the commit point.
+	 *
+	 * Start by calling pre_attach for each subsystem.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->pre_attach)
+			ss->pre_attach(cgrp);
+	}
+	/*
+	 * Move each thread, calling ss->attach_task for each one along the way.
+	 *
+	 * If the leader is already there, skip moving him. Note: even if the
+	 * leader is PF_EXITING, we still move all other threads; if everybody
+	 * is PF_EXITING, we end up doing nothing, which is ok.
+	 */
+	oldcgrp = task_cgroup_from_root(leader, root);
+	if (cgrp != oldcgrp) {
+		/* attach the leader */
+		for_each_subsys(root, ss) {
+			if (ss->attach_task)
+				ss->attach_task(cgrp, leader);
+		}
+		retval = cgroup_task_migrate(cgrp, oldcgrp, leader, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+	/* Now iterate over each thread in the group. */
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* leave current thread as it is if it's already there */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* attach each task to each subsystem */
+		for_each_subsys(root, ss) {
+			if (ss->attach_task)
+				ss->attach_task(cgrp, tsk);
+		}
+		/* we don't care whether these threads are exiting */
+		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+	/* nothing is sensitive to fork() or exec() after this point. */
+	read_unlock(&tasklist_lock);
+	threadgroup_fork_write_unlock(leader);
+
+	/*
+	 * step 3: do expensive, non-thread-specific subsystem callbacks.
+	 * TODO: if ever a subsystem needs to know the oldcgrp for each task
+	 * being moved, this call will need to be reworked to communicate that.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->attach)
+			ss->attach(ss, cgrp, oldcgrp, leader);
+	}
+
+	/*
+	 * step 4: success! and cleanup
+	 */
+	synchronize_rcu();
+	cgroup_wakeup_rmdir_waiter(cgrp);
+	retval = 0;
+list_teardown:
+	/* clean up the list of prefetched css_sets. */
+	list_for_each_entry_safe(cg_entry, temp_nobe, &newcg_list, links) {
+		list_del(&cg_entry->links);
+		put_css_set(cg_entry->cg);
+		kfree(cg_entry);
+	}
+out:
+	if (retval) {
+		/* same deal as in cgroup_attach_task */
+		for_each_subsys(root, ss) {
+			if (ss == failed_ss)
+				break;
+			if (ss->cancel_attach)
+				ss->cancel_attach(ss, cgrp, leader);
+		}
+	}
+	return retval;
+}
+
+/*
+ * Find the task_struct of the task to attach by vpid and pass it along to the
+ * function to attach either it or all tasks in its threadgroup. Will take
+ * cgroup_mutex; may take task_lock of task.
  */
-static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
+static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 {
 	struct task_struct *tsk;
 	const struct cred *cred = current_cred(), *tcred;
 	int ret;
 
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+
 	if (pid) {
 		rcu_read_lock();
 		tsk = find_task_by_vpid(pid);
-		if (!tsk || tsk->flags & PF_EXITING) {
+		if (!tsk) {
 			rcu_read_unlock();
+			cgroup_unlock();
+			return -ESRCH;
+		}
+		if (threadgroup) {
+			/*
+			 * it is safe to find group_leader because tsk was found
+			 * in the tid map, meaning it can't have been unhashed
+			 * by someone in de_thread changing the leadership.
+			 */
+			tsk = tsk->group_leader;
+			BUG_ON(!thread_group_leader(tsk));
+		} else if (tsk->flags & PF_EXITING) {
+			/* optimization for the single-task-only case */
+			rcu_read_unlock();
+			cgroup_unlock();
 			return -ESRCH;
 		}
 
+		/*
+		 * even if we're attaching all tasks in the thread group, we
+		 * only need to check permissions on one of them.
+		 */
 		tcred = __task_cred(tsk);
 		if (cred->euid &&
 		    cred->euid != tcred->uid &&
 		    cred->euid != tcred->suid) {
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -EACCES;
 		}
 		get_task_struct(tsk);
 		rcu_read_unlock();
 	} else {
-		tsk = current;
+		if (threadgroup)
+			tsk = current->group_leader;
+		else
+			tsk = current;
 		get_task_struct(tsk);
 	}
 
-	ret = cgroup_attach_task(cgrp, tsk);
+	if (threadgroup)
+		ret = cgroup_attach_proc(cgrp, tsk);
+	else
+		ret = cgroup_attach_task(cgrp, tsk);
 	put_task_struct(tsk);
+	cgroup_unlock();
 	return ret;
 }
 
 static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
 {
+	return attach_task_by_pid(cgrp, pid, false);
+}
+
+static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
+{
 	int ret;
-	if (!cgroup_lock_live_group(cgrp))
-		return -ENODEV;
-	ret = attach_task_by_pid(cgrp, pid);
-	cgroup_unlock();
+	do {
+		/*
+		 * attach_proc fails with -EAGAIN if threadgroup leadership
+		 * changes in the middle of the operation, in which case we need
+		 * to find the task_struct for the new leader and start over.
+		 */
+		ret = attach_task_by_pid(cgrp, tgid, true);
+	} while (ret == -EAGAIN);
 	return ret;
 }
 
@@ -3260,9 +3636,9 @@ static struct cftype files[] = {
 	{
 		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
 		.open = cgroup_procs_open,
-		/* .write_u64 = cgroup_procs_write, TODO */
+		.write_u64 = cgroup_procs_write,
 		.release = cgroup_pidlist_release,
-		.mode = S_IRUGO,
+		.mode = S_IRUGO | S_IWUSR,
 	},
 	{
 		.name = "notify_on_release",

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                                     ` <20101224230901.GA30136-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-12-26 21:48                                                       ` David Rientjes
       [not found]                                                         ` <alpine.DEB.2.00.1012261345340.23173-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: David Rientjes @ 2010-12-26 21:48 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, Miao Xie, Andrew Morton, Paul Menage,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Fri, 24 Dec 2010, Ben Blum wrote:

> I'll add a patch to my current series to do this. Should I leave alone
> the other cases where an out-of-memory causes a silent failure?
> (cpuset_change_nodemask, scan_for_empty_cpusets)
> 

Both are protected by cgroup_lock, so I think it should be a pretty simple 
change.  cpuset_change_nodemask() is interesting because a task within an 
oom cpuset may be changing its own nodemask for more memory and that could 
easily allow the NODEMASK_ALLOC() to fail for large CONFIG_NODES_SHIFT.  
scan_for_empty_cpusets() is interesting to avoid leaving the hierarchy in 
an inconsistent state.  So I think both of these would benefit from having 
a statically allocated nodemask protected by cgroup_lock().
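
A toy user-space sketch of the trade-off being discussed, for illustration only
(the mutex and the fixed-size mask below are stand-ins for cgroup_lock() and
nodemask_t, not the kernel API):

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_NODES 1024   /* stands in for MAX_NUMNODES */
    typedef struct {
        unsigned long bits[MAX_NODES / (8 * sizeof(unsigned long))];
    } nodemask;

    /* Heap allocation can fail, which a void callback has no way to report. */
    static int rebind_with_alloc(void)
    {
        nodemask *to = malloc(sizeof(*to));
        if (!to)
            return -1;       /* the silent-failure case */
        memset(to, 0, sizeof(*to));
        /* ... compute and apply the new mask ... */
        free(to);
        return 0;
    }

    /* A static mask serialized by the big lock: no allocation, cannot fail. */
    static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
    static nodemask attach_to;   /* shared, protected by big_lock */

    static void rebind_with_static(void)
    {
        pthread_mutex_lock(&big_lock);
        memset(&attach_to, 0, sizeof(attach_to));
        /* ... compute and apply the new mask ... */
        pthread_mutex_unlock(&big_lock);
    }

    int main(void)
    {
        if (rebind_with_alloc())
            fprintf(stderr, "allocation failed\n");
        rebind_with_static();
        return 0;
    }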

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                                         ` <alpine.DEB.2.00.1012261345340.23173-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
@ 2010-12-27  0:12                                                           ` Ben Blum
       [not found]                                                             ` <20101227001233.GA10951-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: Ben Blum @ 2010-12-27  0:12 UTC (permalink / raw)
  To: David Rientjes
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, ebiederm-aS9lmoZGLiVWk0Htik3J/w, Miao Xie,
	Andrew Morton, Paul Menage

On Sun, Dec 26, 2010 at 01:48:58PM -0800, David Rientjes wrote:
> On Fri, 24 Dec 2010, Ben Blum wrote:
> 
> > I'll add a patch to my current series to do this. Should I leave alone
> > the other cases where an out-of-memory causes a silent failure?
> > (cpuset_change_nodemask, scan_for_empty_cpusets)
> > 
> 
> Both are protected by cgroup_lock, so I think it should be a pretty simple 
> change.  cpuset_change_nodemask() is interesting because a task within an 
> oom cpuset may be changing its own nodemask for more memory and that could 
> easily allow the NODEMASK_ALLOC() to fail for large CONFIG_NODES_SHIFT.  
> scan_for_empty_cpusets() is interesting to avoid leaving the hierarchy in 
> an inconsistent state.  So I think both of these would benefit from having 
> a statically allocated nodemask protected by cgroup_lock().

I was going to make a macro like NODEMASK_STATIC, but it turned out that
can_attach() needed the to/from nodemasks to be shared among three
functions in the attach path, so I defined them globally without making a
macro for it. I can make a separate patch for fixing the other cases,
but I'd like to see my current patches through first. (Or should I make
a bugfix patch first and send my other ones on top of that?)

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                             ` <20101225025508.GA649-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-12-27  0:53                               ` Daisuke Nishimura
       [not found]                                 ` <20101227095353.48d95687.nishimura-YQH0OdQVrdy45+QrQBaojngSJqDPrsil@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: Daisuke Nishimura @ 2010-12-27  0:53 UTC (permalink / raw)
  To: Ben Blum
  Cc: Daisuke Nishimura,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oleg-H+wXaHxf7aLQT0dZR+AlfA, Miao Xie, David Rientjes,
	Andrew Morton, Paul Menage, ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Fri, 24 Dec 2010 21:55:08 -0500
Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:

> On Thu, Dec 23, 2010 at 10:33:52PM -0500, Ben Blum wrote:
> > On Thu, Dec 16, 2010 at 12:26:03AM -0800, Andrew Morton wrote:
> > > Patches have gone a bit stale, sorry.  Refactoring in
> > > kernel/cgroup_freezer.c necessitates a refresh and retest please.
> > 
> > commit 53feb29767c29c877f9d47dcfe14211b5b0f7ebd changed a bunch of stuff
> > in kernel/cpuset.c to allocate nodemasks with NODEMASK_ALLOC (which
> > wraps kmalloc) instead of on the stack.
> > 
> > 1. All these functions have 'void' return values, indicating that
> >    calling them them must not fail. Sure there are bailout cases, but no
> >    semblance of cross-function error propagation. Most importantly,
> >    cpuset_attach is a subsystem callback, which MUST not fail given the
> >    way it's used in cgroups, so relying on kmalloc is not safe.
> > 
> > 2. I'm working on a patch series which needs to hold tasklist_lock
> >    across ss->attach callbacks (in cpuset_attach's "if (threadgroup)"
> >    case, this is how safe traversal of tsk->thread_group will be
> >    ensured), and kmalloc isn't allowed while holding a spin-lock. 
> > 
> > Why do we need heap-allocation here at all? In each case their scope is
> > exactly the function's scope, and neither the commit nor the surrounding
> > patch series give any explanation. I'd like to revert the patch if
> > possible.
> > 
> > cc'ing Miao Xie (author) and David Rientjes (acker).
> > 
> > -- Ben
> 
> Well even with the proposed solution to this there is still another
> problem that I see - that of mmap_sem. cpuset_attach() calls into
> mpol_rebind_mm() and do_migrate_pages(), which take mm->mmap_sem for
> writing and reading respectively. This is going to conflict with
> tasklist_lock... but moreover, the memcontrol subsys also touches the
> task's mm->mmap_sem, holding onto it between mem_cgroup_can_attach() and
> mem_cgroup_move_task() - as of b1dd693e5b9348bd68a80e679e03cf9c0973b01b.
> 
> So we have (currently, even without my patches):
> 
> cgroup_attach_task
> (1) cpuset_can_attach
> (2) mem_cgroup_can_attach
>      - down_read(&mm->mmap_sem);
> (3) cpuset_attach
>      - mpol_rebind_mm
>         - down_write(&mm->mmap_sem);
>         - up_write(&mm->mmap_sem);
>      - cpuset_migrate_mm
>         - do_migrate_pages
>            - down_read(&mm->mmap_sem);
>            - up_read(&mm->mmap_sem);
> (4) mem_cgroup_move_task
>      - mem_cgroup_clear_mc
>         - up_read(...);
> 
hmm, nice catch.

> Is there some interdependency I'm missing here that guarantees recursive
> locking/deadlock will be avoided? It all looks like typical-case code.
> 
Unfortunately, no.
I couldn't hit this because I mount all the subsystems on different
mount points in my environment.

> I think we should move taking the mmap_sem all the way up into
> cgroup_attach_task and cgroup_attach_proc; it will be held for writing
> the whole time. I don't quite understand the mempolicy stuff but maybe
> there can be ways to use mpol_rebind_mm and do_migrate_pages when the
> lock is already held.
> 
I agree.
Another solution (just an idea): avoid enabling both the "move_charge" feature of
memcg and the "memory_migrate" feature of cpuset at the same time when they are
mounted under the same mount point. But, hmm... it's not a good idea to make one
subsystem take account of another, IMHO.


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                 ` <20101227095353.48d95687.nishimura-YQH0OdQVrdy45+QrQBaojngSJqDPrsil@public.gmane.org>
@ 2010-12-27  1:15                                   ` KAMEZAWA Hiroyuki
  2010-12-27  4:22                                   ` Ben Blum
  1 sibling, 0 replies; 185+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-27  1:15 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oleg-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	Miao Xie, David Rientjes, Andrew Morton, Paul Menage

On Mon, 27 Dec 2010 09:53:53 +0900
Daisuke Nishimura <nishimura-YQH0OdQVrdy45+QrQBaojngSJqDPrsil@public.gmane.org> wrote:

> On Fri, 24 Dec 2010 21:55:08 -0500
> Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> 
> > On Thu, Dec 23, 2010 at 10:33:52PM -0500, Ben Blum wrote:
> > > On Thu, Dec 16, 2010 at 12:26:03AM -0800, Andrew Morton wrote:
> > > > Patches have gone a bit stale, sorry.  Refactoring in
> > > > kernel/cgroup_freezer.c necessitates a refresh and retest please.
> > > 
> > > commit 53feb29767c29c877f9d47dcfe14211b5b0f7ebd changed a bunch of stuff
> > > in kernel/cpuset.c to allocate nodemasks with NODEMASK_ALLOC (which
> > > wraps kmalloc) instead of on the stack.
> > > 
> > > 1. All these functions have 'void' return values, indicating that
> > >    calling them them must not fail. Sure there are bailout cases, but no
> > >    semblance of cross-function error propagation. Most importantly,
> > >    cpuset_attach is a subsystem callback, which MUST not fail given the
> > >    way it's used in cgroups, so relying on kmalloc is not safe.
> > > 
> > > 2. I'm working on a patch series which needs to hold tasklist_lock
> > >    across ss->attach callbacks (in cpuset_attach's "if (threadgroup)"
> > >    case, this is how safe traversal of tsk->thread_group will be
> > >    ensured), and kmalloc isn't allowed while holding a spin-lock. 
> > > 
> > > Why do we need heap-allocation here at all? In each case their scope is
> > > exactly the function's scope, and neither the commit nor the surrounding
> > > patch series give any explanation. I'd like to revert the patch if
> > > possible.
> > > 
> > > cc'ing Miao Xie (author) and David Rientjes (acker).
> > > 
> > > -- Ben
> > 
> > Well even with the proposed solution to this there is still another
> > problem that I see - that of mmap_sem. cpuset_attach() calls into
> > mpol_rebind_mm() and do_migrate_pages(), which take mm->mmap_sem for
> > writing and reading respectively. This is going to conflict with
> > tasklist_lock... but moreover, the memcontrol subsys also touches the
> > task's mm->mmap_sem, holding onto it between mem_cgroup_can_attach() and
> > mem_cgroup_move_task() - as of b1dd693e5b9348bd68a80e679e03cf9c0973b01b.
> > 
> > So we have (currently, even without my patches):
> > 
> > cgroup_attach_task
> > (1) cpuset_can_attach
> > (2) mem_cgroup_can_attach
> >      - down_read(&mm->mmap_sem);
> > (3) cpuset_attach
> >      - mpol_rebind_mm
> >         - down_write(&mm->mmap_sem);
> >         - up_write(&mm->mmap_sem);
> >      - cpuset_migrate_mm
> >         - do_migrate_pages
> >            - down_read(&mm->mmap_sem);
> >            - up_read(&mm->mmap_sem);
> > (4) mem_cgroup_move_task
> >      - mem_cgroup_clear_mc
> >         - up_read(...);
> > 
> hmm, nice catch.
> 
> > Is there some interdependency I'm missing here that guarantees recursive
> > locking/deadlock will be avoided? It all looks like typical-case code.
> > 
> Unfortunately, not.
> I couldn't hit this because I mount all subsystems onto different
> mount points in my environment.
> 
> > I think we should move taking the mmap_sem all the way up into
> > cgroup_attach_task and cgroup_attach_proc; it will be held for writing
> > the whole time. I don't quite understand the mempolicy stuff but maybe
> > there can be ways to use mpol_rebind_mm and do_migrate_pages when the
> > lock is already held.
> > 
> I agree.
> Another solution(just an idea): avoid enabling both "move_charge" feature of memcg
> and "memory_migrate" of cpuset at the same time iff they are mounted
> under the same mount point. But, hmm... it's not a good idea to make a subsystem
> take account of another subsystem, IMHO.
> 

Taking tasklist_lock is bad, I think.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                 ` <20101227095353.48d95687.nishimura-YQH0OdQVrdy45+QrQBaojngSJqDPrsil@public.gmane.org>
  2010-12-27  1:15                                   ` KAMEZAWA Hiroyuki
@ 2010-12-27  4:22                                   ` Ben Blum
       [not found]                                     ` <20101227042254.GA15417-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  1 sibling, 1 reply; 185+ messages in thread
From: Ben Blum @ 2010-12-27  4:22 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oleg-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	Miao Xie, David Rientjes, Andrew Morton, Paul Menage

On Mon, Dec 27, 2010 at 09:53:53AM +0900, Daisuke Nishimura wrote:
> On Fri, 24 Dec 2010 21:55:08 -0500
> Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> 
> > On Thu, Dec 23, 2010 at 10:33:52PM -0500, Ben Blum wrote:
> > > On Thu, Dec 16, 2010 at 12:26:03AM -0800, Andrew Morton wrote:
> > > > Patches have gone a bit stale, sorry.  Refactoring in
> > > > kernel/cgroup_freezer.c necessitates a refresh and retest please.
> > > 
> > > commit 53feb29767c29c877f9d47dcfe14211b5b0f7ebd changed a bunch of stuff
> > > in kernel/cpuset.c to allocate nodemasks with NODEMASK_ALLOC (which
> > > wraps kmalloc) instead of on the stack.
> > > 
> > > 1. All these functions have 'void' return values, indicating that
> > >    calling them them must not fail. Sure there are bailout cases, but no
> > >    semblance of cross-function error propagation. Most importantly,
> > >    cpuset_attach is a subsystem callback, which MUST not fail given the
> > >    way it's used in cgroups, so relying on kmalloc is not safe.
> > > 
> > > 2. I'm working on a patch series which needs to hold tasklist_lock
> > >    across ss->attach callbacks (in cpuset_attach's "if (threadgroup)"
> > >    case, this is how safe traversal of tsk->thread_group will be
> > >    ensured), and kmalloc isn't allowed while holding a spin-lock. 
> > > 
> > > Why do we need heap-allocation here at all? In each case their scope is
> > > exactly the function's scope, and neither the commit nor the surrounding
> > > patch series give any explanation. I'd like to revert the patch if
> > > possible.
> > > 
> > > cc'ing Miao Xie (author) and David Rientjes (acker).
> > > 
> > > -- Ben
> > 
> > Well even with the proposed solution to this there is still another
> > problem that I see - that of mmap_sem. cpuset_attach() calls into
> > mpol_rebind_mm() and do_migrate_pages(), which take mm->mmap_sem for
> > writing and reading respectively. This is going to conflict with
> > tasklist_lock... but moreover, the memcontrol subsys also touches the
> > task's mm->mmap_sem, holding onto it between mem_cgroup_can_attach() and
> > mem_cgroup_move_task() - as of b1dd693e5b9348bd68a80e679e03cf9c0973b01b.
> > 
> > So we have (currently, even without my patches):
> > 
> > cgroup_attach_task
> > (1) cpuset_can_attach
> > (2) mem_cgroup_can_attach
> >      - down_read(&mm->mmap_sem);
> > (3) cpuset_attach
> >      - mpol_rebind_mm
> >         - down_write(&mm->mmap_sem);
> >         - up_write(&mm->mmap_sem);
> >      - cpuset_migrate_mm
> >         - do_migrate_pages
> >            - down_read(&mm->mmap_sem);
> >            - up_read(&mm->mmap_sem);
> > (4) mem_cgroup_move_task
> >      - mem_cgroup_clear_mc
> >         - up_read(...);
> > 
> hmm, nice catch.
> 
> > Is there some interdependency I'm missing here that guarantees recursive
> > locking/deadlock will be avoided? It all looks like typical-case code.
> > 
> Unfortunately, not.
> I couldn't hit this because I mount all subsystems onto different
> mount points in my environment.
> 
> > I think we should move taking the mmap_sem all the way up into
> > cgroup_attach_task and cgroup_attach_proc; it will be held for writing
> > the whole time. I don't quite understand the mempolicy stuff but maybe
> > there can be ways to use mpol_rebind_mm and do_migrate_pages when the
> > lock is already held.
> > 
> I agree.
> Another solution(just an idea): avoid enabling both "move_charge" feature of memcg
> and "memory_migrate" of cpuset at the same time iff they are mounted
> under the same mount point. But, hmm... it's not a good idea to make a subsystem
> take account of another subsystem, IMHO.
> 
> 
> Thanks,
> Daisuke Nishimura.

It looks to me like when memcg holds the mmap_sem the whole time, it's
just to avoid the deadlock, not because there's some need for the state
under mmap_sem to stay unchanged between can_attach and attach. But if
there is such a need, then the write-side locking in mpol_rebind_mm may
conflict even with my proposed solution.

Regardless, the best way would be to avoid holding the mmap_sem across
the whole window, possibly by solving the move_charge deadlock in some
other internal way, if at all possible?

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                     ` <20101227042254.GA15417-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-12-27  7:00                                       ` KAMEZAWA Hiroyuki
       [not found]                                         ` <20101227160041.07bff52a.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
  2010-12-28  2:43                                       ` [RFC][BUGFIX] memcg: fix dead lock between cpuset and memcg (Re: [PATCH v5 3/3] cgroups: make procs file writable) Daisuke Nishimura
  1 sibling, 1 reply; 185+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-27  7:00 UTC (permalink / raw)
  To: Ben Blum
  Cc: Daisuke Nishimura,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oleg-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	Miao Xie, David Rientjes, Andrew Morton, Paul Menage

On Sun, 26 Dec 2010 23:22:54 -0500
Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:

> On Mon, Dec 27, 2010 at 09:53:53AM +0900, Daisuke Nishimura wrote:
> > On Fri, 24 Dec 2010 21:55:08 -0500
> > Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> > 
> > > On Thu, Dec 23, 2010 at 10:33:52PM -0500, Ben Blum wrote:
> > > > On Thu, Dec 16, 2010 at 12:26:03AM -0800, Andrew Morton wrote:
> > > > > Patches have gone a bit stale, sorry.  Refactoring in
> > > > > kernel/cgroup_freezer.c necessitates a refresh and retest please.
> > > > 
> > > > commit 53feb29767c29c877f9d47dcfe14211b5b0f7ebd changed a bunch of stuff
> > > > in kernel/cpuset.c to allocate nodemasks with NODEMASK_ALLOC (which
> > > > wraps kmalloc) instead of on the stack.
> > > > 
> > > > 1. All these functions have 'void' return values, indicating that
> > > >    calling them them must not fail. Sure there are bailout cases, but no
> > > >    semblance of cross-function error propagation. Most importantly,
> > > >    cpuset_attach is a subsystem callback, which MUST not fail given the
> > > >    way it's used in cgroups, so relying on kmalloc is not safe.
> > > > 
> > > > 2. I'm working on a patch series which needs to hold tasklist_lock
> > > >    across ss->attach callbacks (in cpuset_attach's "if (threadgroup)"
> > > >    case, this is how safe traversal of tsk->thread_group will be
> > > >    ensured), and kmalloc isn't allowed while holding a spin-lock. 
> > > > 
> > > > Why do we need heap-allocation here at all? In each case their scope is
> > > > exactly the function's scope, and neither the commit nor the surrounding
> > > > patch series give any explanation. I'd like to revert the patch if
> > > > possible.
> > > > 
> > > > cc'ing Miao Xie (author) and David Rientjes (acker).
> > > > 
> > > > -- Ben
> > > 
> > > Well even with the proposed solution to this there is still another
> > > problem that I see - that of mmap_sem. cpuset_attach() calls into
> > > mpol_rebind_mm() and do_migrate_pages(), which take mm->mmap_sem for
> > > writing and reading respectively. This is going to conflict with
> > > tasklist_lock... but moreover, the memcontrol subsys also touches the
> > > task's mm->mmap_sem, holding onto it between mem_cgroup_can_attach() and
> > > mem_cgroup_move_task() - as of b1dd693e5b9348bd68a80e679e03cf9c0973b01b.
> > > 
> > > So we have (currently, even without my patches):
> > > 
> > > cgroup_attach_task
> > > (1) cpuset_can_attach
> > > (2) mem_cgroup_can_attach
> > >      - down_read(&mm->mmap_sem);
> > > (3) cpuset_attach
> > >      - mpol_rebind_mm
> > >         - down_write(&mm->mmap_sem);
> > >         - up_write(&mm->mmap_sem);
> > >      - cpuset_migrate_mm
> > >         - do_migrate_pages
> > >            - down_read(&mm->mmap_sem);
> > >            - up_read(&mm->mmap_sem);
> > > (4) mem_cgroup_move_task
> > >      - mem_cgroup_clear_mc
> > >         - up_read(...);
> > > 
> > hmm, nice catch.
> > 
> > > Is there some interdependency I'm missing here that guarantees recursive
> > > locking/deadlock will be avoided? It all looks like typical-case code.
> > > 
> > Unfortunately, not.
> > I couldn't hit this because I mount all subsystems onto different
> > mount points in my environment.
> > 
> > > I think we should move taking the mmap_sem all the way up into
> > > cgroup_attach_task and cgroup_attach_proc; it will be held for writing
> > > the whole time. I don't quite understand the mempolicy stuff but maybe
> > > there can be ways to use mpol_rebind_mm and do_migrate_pages when the
> > > lock is already held.
> > > 
> > I agree.
> > Another solution(just an idea): avoid enabling both "move_charge" feature of memcg
> > and "memory_migrate" of cpuset at the same time iff they are mounted
> > under the same mount point. But, hmm... it's not a good idea to make a subsystem
> > take account of another subsystem, IMHO.
> > 
> > 
> > Thanks,
> > Daisuke Nishimura.
> 
> It looks to me like when memcg holds the mmap_sem the whole time, it's
> just to avoid the deadlock, not that there's there some need for the
> stuff under mmap_sem not to change between can_attach and attach. But if
> there is such a need, then the write-side in mpol_rebind_mm may conflict
> even with my proposed solution.
> 
> Regardless, the best way would be to avoid holding the mmap_sem across
> the whole window, possibly by solving the move_charge deadlock some
> other internal way, if at all possible?
> 

IIUC, what you request is 'don't call any kind of function that may sleep'
in the can_attach() and attach() callbacks. That's impossible.
Memory cgroup will go to sleep because it calls into memory reclaim.

Is tasklist_lock the only way? Can you catch the "new thread" event in cgroup_fork()?

For example,

==
void cgroup_fork(struct task_struct *child)
{
	if (current->in_moving) {
		add_to_waitqueue somewhere.
	}
        task_lock(current);
        child->cgroups = current->cgroups;
        get_css_set(child->cgroups);
        task_unlock(current);
        INIT_LIST_HEAD(&child->cg_list);
}
==

==
 read_lock(&tasklist_lock);
 list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
	tsk->in_moving = true;
 }
 read_unlock(&tasklist_lock);

 ->pre_attach()
 ->attach()

 read_lock(&tasklist_lock);
 list_for_each_.....
	tsk->in_moving = false;
 read_unlock(&tasklist_lock);

 wakeup threads in waitq.
==

Ah yes, this will have some bugs. But please avoid taking tasklist_lock() around
all of pre_attach() and attach().

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                         ` <20101227160041.07bff52a.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
@ 2010-12-27  7:21                                           ` Ben Blum
       [not found]                                             ` <20101227072123.GA19652-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: Ben Blum @ 2010-12-27  7:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, Miao Xie, David Rientjes,
	Andrew Morton, Paul Menage

On Mon, Dec 27, 2010 at 04:00:41PM +0900, KAMEZAWA Hiroyuki wrote:
> On Sun, 26 Dec 2010 23:22:54 -0500
> Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> 
> > On Mon, Dec 27, 2010 at 09:53:53AM +0900, Daisuke Nishimura wrote:
> > > On Fri, 24 Dec 2010 21:55:08 -0500
> > > Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> > > 
> > > > On Thu, Dec 23, 2010 at 10:33:52PM -0500, Ben Blum wrote:
> > > > > On Thu, Dec 16, 2010 at 12:26:03AM -0800, Andrew Morton wrote:
> > > > > > Patches have gone a bit stale, sorry.  Refactoring in
> > > > > > kernel/cgroup_freezer.c necessitates a refresh and retest please.
> > > > > 
> > > > > commit 53feb29767c29c877f9d47dcfe14211b5b0f7ebd changed a bunch of stuff
> > > > > in kernel/cpuset.c to allocate nodemasks with NODEMASK_ALLOC (which
> > > > > wraps kmalloc) instead of on the stack.
> > > > > 
> > > > > 1. All these functions have 'void' return values, indicating that
> > > > >    calling them them must not fail. Sure there are bailout cases, but no
> > > > >    semblance of cross-function error propagation. Most importantly,
> > > > >    cpuset_attach is a subsystem callback, which MUST not fail given the
> > > > >    way it's used in cgroups, so relying on kmalloc is not safe.
> > > > > 
> > > > > 2. I'm working on a patch series which needs to hold tasklist_lock
> > > > >    across ss->attach callbacks (in cpuset_attach's "if (threadgroup)"
> > > > >    case, this is how safe traversal of tsk->thread_group will be
> > > > >    ensured), and kmalloc isn't allowed while holding a spin-lock. 
> > > > > 
> > > > > Why do we need heap-allocation here at all? In each case their scope is
> > > > > exactly the function's scope, and neither the commit nor the surrounding
> > > > > patch series give any explanation. I'd like to revert the patch if
> > > > > possible.
> > > > > 
> > > > > cc'ing Miao Xie (author) and David Rientjes (acker).
> > > > > 
> > > > > -- Ben
> > > > 
> > > > Well even with the proposed solution to this there is still another
> > > > problem that I see - that of mmap_sem. cpuset_attach() calls into
> > > > mpol_rebind_mm() and do_migrate_pages(), which take mm->mmap_sem for
> > > > writing and reading respectively. This is going to conflict with
> > > > tasklist_lock... but moreover, the memcontrol subsys also touches the
> > > > task's mm->mmap_sem, holding onto it between mem_cgroup_can_attach() and
> > > > mem_cgroup_move_task() - as of b1dd693e5b9348bd68a80e679e03cf9c0973b01b.
> > > > 
> > > > So we have (currently, even without my patches):
> > > > 
> > > > cgroup_attach_task
> > > > (1) cpuset_can_attach
> > > > (2) mem_cgroup_can_attach
> > > >      - down_read(&mm->mmap_sem);
> > > > (3) cpuset_attach
> > > >      - mpol_rebind_mm
> > > >         - down_write(&mm->mmap_sem);
> > > >         - up_write(&mm->mmap_sem);
> > > >      - cpuset_migrate_mm
> > > >         - do_migrate_pages
> > > >            - down_read(&mm->mmap_sem);
> > > >            - up_read(&mm->mmap_sem);
> > > > (4) mem_cgroup_move_task
> > > >      - mem_cgroup_clear_mc
> > > >         - up_read(...);
> > > > 
> > > hmm, nice catch.
> > > 
> > > > Is there some interdependency I'm missing here that guarantees recursive
> > > > locking/deadlock will be avoided? It all looks like typical-case code.
> > > > 
> > > Unfortunately, not.
> > > I couldn't hit this because I mount all subsystems onto different
> > > mount points in my environment.
> > > 
> > > > I think we should move taking the mmap_sem all the way up into
> > > > cgroup_attach_task and cgroup_attach_proc; it will be held for writing
> > > > the whole time. I don't quite understand the mempolicy stuff but maybe
> > > > there can be ways to use mpol_rebind_mm and do_migrate_pages when the
> > > > lock is already held.
> > > > 
> > > I agree.
> > > Another solution(just an idea): avoid enabling both "move_charge" feature of memcg
> > > and "memory_migrate" of cpuset at the same time iff they are mounted
> > > under the same mount point. But, hmm... it's not a good idea to make a subsystem
> > > take account of another subsystem, IMHO.
> > > 
> > > 
> > > Thanks,
> > > Daisuke Nishimura.
> > 
> > It looks to me like when memcg holds the mmap_sem the whole time, it's
> > just to avoid the deadlock, not that there's there some need for the
> > stuff under mmap_sem not to change between can_attach and attach. But if
> > there is such a need, then the write-side in mpol_rebind_mm may conflict
> > even with my proposed solution.
> > 
> > Regardless, the best way would be to avoid holding the mmap_sem across
> > the whole window, possibly by solving the move_charge deadlock some
> > other internal way, if at all possible?
> > 
> 
> IIUC, what you request is 'don't call any kind of function may sleep'.
> in can_attach() and attach() callback. That's impossible.
> Memory cgroup will go to sleep because of calling memory reclaim.
> 
> Is tasklist_lock the only way ? Can you catch "new thread" event in cgroup_fork() ?
> 
> For example,
> 
> ==
> void cgroup_fork(struct task_struct *child)
> {
> 	if (current->in_moving) {
> 		add_to_waitqueue somewhere.
> 	}
>         task_lock(current);
>         child->cgroups = current->cgroups;
>         get_css_set(child->cgroups);
>         task_unlock(current);
>         INIT_LIST_HEAD(&child->cg_list);
> }
> ==
> 
> ==
>  read_lock(&tasklist_lock);
>  list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
> 	tsk->in_moving = true;
>  }
>  read_unlock(&tasklist_lock);
> 
>  ->pre_attach()
>  ->attach()
> 
>  read_unlock(&tasklist_lock);
>  list_for_each_.....
> 	tsk->in_moving_cgroup = false;
> 
>  wakeup threads in waitq.
> ==
> 
> Ah yes, this will have some bug. But please avoid to take tasklist_lock() around
> all pre_attach() and attach().
> 
> Thanks,
> -Kame
> 
> 
> 
> 
> 
> Thanks,
> -Kame

You misunderstand slightly: the callbacks are split into a
once-per-thread function and a thread-independent function, and only the
former is not allowed to sleep. All of memcg's attachment is thread-
independent, so sleeping there is fine.

Also, the tasklist_lock isn't used to synchronize fork events; that's
what threadgroup_fork_lock (an rwsem) is for. The tasklist_lock is for
protecting the thread-group list while iterating over it - so that a race
with exec() (de_thread(), specifically) doesn't change the threadgroup
state. It's basically never safe to iterate over leader->thread_group
unless you're in a no-sleep section.
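
(To be concrete, the split looks roughly like this - argument lists trimmed,
see the patch itself for the real prototypes:

	struct cgroup_subsys {
		/* ... */
		/* whole-group callbacks: called once per migration,
		 * outside the no-sleep section, so they may sleep
		 * (memcg's charge work happens here) */
		int  (*can_attach)(...);
		void (*attach)(...);
		/* callbacks that run inside the no-sleep section and
		 * therefore must not block; can_attach_task() and
		 * attach_task() are called once per thread */
		int  (*can_attach_task)(struct cgroup *cgrp,
					struct task_struct *tsk);
		void (*pre_attach)(struct cgroup *cgrp);
		void (*attach_task)(struct cgroup *cgrp,
				    struct task_struct *tsk);
		/* ... */
	};
)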

-- Ben

(P.S. Paul, do you remember why we decided to use tasklist_lock in the
middle there instead of rcu_read_lock like the other places where it
iterates over ->thread_group?)

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                             ` <20101227072123.GA19652-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-12-27  7:42                                               ` KAMEZAWA Hiroyuki
       [not found]                                                 ` <20101227164207.b09318be.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-27  7:42 UTC (permalink / raw)
  To: Ben Blum
  Cc: Daisuke Nishimura,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oleg-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	Miao Xie, David Rientjes, Andrew Morton, Paul Menage

On Mon, 27 Dec 2010 02:21:23 -0500
Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
 
> > IIUC, what you request is 'don't call any kind of function may sleep'.
> > in can_attach() and attach() callback. That's impossible.
> > Memory cgroup will go to sleep because of calling memory reclaim.
> > 
> > Is tasklist_lock the only way ? Can you catch "new thread" event in cgroup_fork() ?
> > 
> > For example,
> > 
> > ==
> > void cgroup_fork(struct task_struct *child)
> > {
> > 	if (current->in_moving) {
> > 		add_to_waitqueue somewhere.
> > 	}
> >         task_lock(current);
> >         child->cgroups = current->cgroups;
> >         get_css_set(child->cgroups);
> >         task_unlock(current);
> >         INIT_LIST_HEAD(&child->cg_list);
> > }
> > ==
> > 
> > ==
> >  read_lock(&tasklist_lock);
> >  list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
> > 	tsk->in_moving = true;
> >  }
> >  read_unlock(&tasklist_lock);
> > 
> >  ->pre_attach()
> >  ->attach()
> > 
> >  read_unlock(&tasklist_lock);
> >  list_for_each_.....
> > 	tsk->in_moving_cgroup = false;
> > 
> >  wakeup threads in waitq.
> > ==
> > 
> > Ah yes, this will have some bug. But please avoid to take tasklist_lock() around
> > all pre_attach() and attach().
> > 
> > Thanks,
> > -Kame
> > 
> > 
> > 
> > 
> > 
> > Thanks,
> > -Kame
> 
> You misunderstand slightly: the callbacks are split into a
> once-per-thread function and a thread-independent function, and only the
> former is not allowed to sleep. All of memcg's attachment is thread-
> independent, so sleeping there is fine.
> 
Okay, then the problem is cpuset, which allocates memory.
(But I feel that the never-sleep limitation is not very good.)

BTW, memcg checks mm->owner == p to decide whether to take mmap_sem.
Your code moves the thread_group_leader(). IIUC, these aren't guaranteed to
be the same task. I wonder whether we can see a problem with some stupid userland.


> Also, the tasklist_lock isn't used to synchronize fork events; that's
> what threadgroup_fork_lock (an rwsem) is for. It's for protecting the
> thread-group list while iterating over it - so that a race with exec()
> (de_thread(), specifically) doesn't change the threadgroup state. It's
> basically never safe to iterate over leader->thread_group unless you're
> in a no-sleep section.
> 

Thank you for the explanation.

First, cpuset should modify settings around the 'mm' only when the thread == mm->owner.
...is there any reason to allow all threads to affect the 'mm'?

About the nodemask, I feel what's required is a pre_pre_attach() to cache the
required memory before attaching, like radix_tree_preload() does. Then a
subsystem can set up its working area in advance.
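
Something like this is what I imagine (pre_pre_attach() and the other names
here are made up, only to illustrate the idea):

==
static nodemask_t *cpuset_attach_to, *cpuset_attach_from;

static int cpuset_pre_pre_attach(struct cgroup *cgrp)
{
	/* may sleep and may fail, like radix_tree_preload() */
	cpuset_attach_to = kmalloc(sizeof(nodemask_t), GFP_KERNEL);
	cpuset_attach_from = kmalloc(sizeof(nodemask_t), GFP_KERNEL);
	if (!cpuset_attach_to || !cpuset_attach_from) {
		kfree(cpuset_attach_to);
		kfree(cpuset_attach_from);
		return -ENOMEM;
	}
	return 0;
}
==

The later, non-sleeping callbacks then only use (and eventually free) the
cached masks.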

Hmm... but I wonder whether de_thread() should take the threadgroup fork write lock.

I may be missing something important, but I feel taking tasklist_lock() is overkill.

Thanks, 
-Kame

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                                 ` <20101227164207.b09318be.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
@ 2010-12-27  8:42                                                   ` Ben Blum
       [not found]                                                     ` <20101227084257.GA20986-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: Ben Blum @ 2010-12-27  8:42 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, Miao Xie, David Rientjes,
	Andrew Morton, Paul Menage

On Mon, Dec 27, 2010 at 04:42:07PM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 27 Dec 2010 02:21:23 -0500
> Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
>  
> > > IIUC, what you request is 'don't call any kind of function may sleep'.
> > > in can_attach() and attach() callback. That's impossible.
> > > Memory cgroup will go to sleep because of calling memory reclaim.
> > > 
> > > Is tasklist_lock the only way ? Can you catch "new thread" event in cgroup_fork() ?
> > > 
> > > For example,
> > > 
> > > ==
> > > void cgroup_fork(struct task_struct *child)
> > > {
> > > 	if (current->in_moving) {
> > > 		add_to_waitqueue somewhere.
> > > 	}
> > >         task_lock(current);
> > >         child->cgroups = current->cgroups;
> > >         get_css_set(child->cgroups);
> > >         task_unlock(current);
> > >         INIT_LIST_HEAD(&child->cg_list);
> > > }
> > > ==
> > > 
> > > ==
> > >  read_lock(&tasklist_lock);
> > >  list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
> > > 	tsk->in_moving = true;
> > >  }
> > >  read_unlock(&tasklist_lock);
> > > 
> > >  ->pre_attach()
> > >  ->attach()
> > > 
> > >  read_unlock(&tasklist_lock);
> > >  list_for_each_.....
> > > 	tsk->in_moving_cgroup = false;
> > > 
> > >  wakeup threads in waitq.
> > > ==
> > > 
> > > Ah yes, this will have some bug. But please avoid to take tasklist_lock() around
> > > all pre_attach() and attach().
> > > 
> > > Thanks,
> > > -Kame
> > > 
> > > 
> > > 
> > > 
> > > 
> > > Thanks,
> > > -Kame
> > 
> > You misunderstand slightly: the callbacks are split into a
> > once-per-thread function and a thread-independent function, and only the
> > former is not allowed to sleep. All of memcg's attachment is thread-
> > independent, so sleeping there is fine.
> > 
> Okay, then, problem is cpuset, which allocates memory.
> (But I feel that limitation -never_sleep- is not very good.)
> 
> BTW, mem cgroup chesk mm_owner(mm) == p to decide to take mmap_sem().
> Your code does moving "thread_group_leader()". IIUC, this isn't guaranteed to
> be the same. I wonder we can see problem with some stupid userland.

I'm not sure I understand the point of looking at mm->owner? My code
does thread_group_leader() because if a task that's not the leader calls
exec(), the thread_group list will be in an indeterminate state from the
perspective of the task we hold on to (who was originally the leader).

> > Also, the tasklist_lock isn't used to synchronize fork events; that's
> > what threadgroup_fork_lock (an rwsem) is for. It's for protecting the
> > thread-group list while iterating over it - so that a race with exec()
> > (de_thread(), specifically) doesn't change the threadgroup state. It's
> > basically never safe to iterate over leader->thread_group unless you're
> > in a no-sleep section.
> > 
> 
> Thank you for explanation.
> 
> At first, cpuset should modify settings around 'mm' only when the thread == mm->owner.
> ....is there any reason to allow all threads can affect the 'mm' ?

I don't follow... is this any different from the cgroup_attach_task()
case?

> About nodemask, I feel what's required is pre_pre_attach() to cache required memory
> before attaching as radix_tree_preload(). Then, subsys can prepare for creating
> working area.

To save on global memory footprint, we could pre-allocate two nodemasks,
but I feel like it's not worth the increase in code complexity. This
would need to be done in the other cases that unsafely do NODEMASK_ALLOC
too... too much to keep track of for little gain.

> Hmm...but I wonder de_thread() should take threadgroup_fork_write_unlock().
> 
> I may not understand anything important but I feel taking tasklist_lock() is overkill.

Would rcu_read_lock() be any better? Everywhere else in the kernel that
iterates over all threads in a group uses either rcu_read_lock or
tasklist_lock to walk the group safely.
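
(The usual idiom being referred to is roughly:

	struct task_struct *t = leader;

	rcu_read_lock();
	do {
		/* per-thread work; nothing in here may sleep */
	} while_each_thread(leader, t);
	rcu_read_unlock();

with read_lock(&tasklist_lock)/read_unlock() as the alternative when the
group must not change at all during the walk.)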

-- Ben

> 
> Thanks, 
> -Kame
> 
> 

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                                     ` <20101227084257.GA20986-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-12-27  9:18                                                       ` KAMEZAWA Hiroyuki
       [not found]                                                         ` <20101227181801.095e9a23.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-27  9:18 UTC (permalink / raw)
  To: Ben Blum
  Cc: Daisuke Nishimura,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oleg-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	Miao Xie, David Rientjes, Andrew Morton, Paul Menage

On Mon, 27 Dec 2010 03:42:57 -0500
Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:

> On Mon, Dec 27, 2010 at 04:42:07PM +0900, KAMEZAWA Hiroyuki wrote:

> > Okay, then, problem is cpuset, which allocates memory.
> > (But I feel that limitation -never_sleep- is not very good.)
> > 
> > BTW, mem cgroup chesk mm_owner(mm) == p to decide to take mmap_sem().
> > Your code does moving "thread_group_leader()". IIUC, this isn't guaranteed to
> > be the same. I wonder we can see problem with some stupid userland.
> 
> I'm not sure I understand the point of looking at mm->owner? My code
> does thread_group_leader() because if a task that's not the leader calls
> exec(), the thread_group list will be in an indeterminate state from the
> perspective of the task we hold on to (who was originally the leader).
> 

Hm. With the following, only the thread-group leader can go ahead.

+	threadgroup_fork_write_lock(leader);
+	read_lock(&tasklist_lock);
+	/* sanity check - if we raced with de_thread, we must abort */
+	if (!thread_group_leader(leader)) {
+		retval = -EAGAIN;
+		read_unlock(&tasklist_lock);
+		threadgroup_fork_write_unlock(leader);
+		goto list_teardown;
+	}

This moves each task under tasklist_lock().

+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* leave current thread as it is if it's already there */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* attach each task to each subsystem */
+		for_each_subsys(root, ss) {
+			if (ss->attach_task)
+				ss->attach_task(cgrp, tsk);
+		}
+		/* we don't care whether these threads are exiting */
+		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}

Hm. I'm not sure whether there can be a task with tsk == mm->owner here.
If tsk == mm->owner, attach_task will take mmap_sem and do something more.

In the usual case mm->owner is the thread group leader, but not in special cases.



> > > Also, the tasklist_lock isn't used to synchronize fork events; that's
> > > what threadgroup_fork_lock (an rwsem) is for. It's for protecting the
> > > thread-group list while iterating over it - so that a race with exec()
> > > (de_thread(), specifically) doesn't change the threadgroup state. It's
> > > basically never safe to iterate over leader->thread_group unless you're
> > > in a no-sleep section.
> > > 
> > 
> > Thank you for explanation.
> > 
> > At first, cpuset should modify settings around 'mm' only when the thread == mm->owner.
> > ....is there any reason to allow all threads can affect the 'mm' ?
> 
> I don't follow... is this any different from the cgroup_attach_task()
> case?
> 

cpuset_attach() takes mmap_sem to modify the mm's settings. Because all threads
share the mm, doing page migration whenever a thread moves seems wrong.
I think only one thread per process, the thread-group leader or mm->owner, should be
allowed to migrate pages. Then cgroup_attach_proc() can avoid taking mmap_sem
under tasklist_lock.



> > About nodemask, I feel what's required is pre_pre_attach() to cache required memory
> > before attaching as radix_tree_preload(). Then, subsys can prepare for creating
> > working area.
> 
> To save on global memory footprint, we could pre-allocate two nodemasks,
> but I feel like it's not worth the increase in code complexity. This
> would need to be done in the other cases that unsafely do NODEMASK_ALLOC
> too... too much to keep track of for little gain.
> 

But a nodemask cannot be on the stack when we consider 4096-node systems.


> > Hmm...but I wonder de_thread() should take threadgroup_fork_write_unlock().
> > 
> > I may not understand anything important but I feel taking tasklist_lock() is overkill.
> 
> Would rcu_read_lock() be any better? Everywhere else in the kernel that
> iterates over all threads in a group uses either rcu_read_lock or
> tasklist_lock.
> 

Because it means that can_attach()/attach() cannot sleep, it seems to make no
difference.

I wonder... if you stop all clone()/fork() in the process being moved, can you use
find_ge_pid()? Please see next_tgid() and next_tid() in fs/proc/base.c, which
implement scanning the task list while being allowed to sleep. Can't you use next_tid()?

Thanks,
-Kame

P.S. I'll be offline until 2011/01/04.

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                                         ` <20101227181801.095e9a23.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
@ 2010-12-27 10:12                                                           ` Ben Blum
       [not found]                                                             ` <20101227101228.GB20986-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: Ben Blum @ 2010-12-27 10:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, Miao Xie, David Rientjes,
	Andrew Morton, Paul Menage

On Mon, Dec 27, 2010 at 06:18:01PM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 27 Dec 2010 03:42:57 -0500
> Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> 
> > On Mon, Dec 27, 2010 at 04:42:07PM +0900, KAMEZAWA Hiroyuki wrote:
> 
> > > Okay, then, problem is cpuset, which allocates memory.
> > > (But I feel that limitation -never_sleep- is not very good.)
> > > 
> > > BTW, mem cgroup chesk mm_owner(mm) == p to decide to take mmap_sem().
> > > Your code does moving "thread_group_leader()". IIUC, this isn't guaranteed to
> > > be the same. I wonder we can see problem with some stupid userland.
> > 
> > I'm not sure I understand the point of looking at mm->owner? My code
> > does thread_group_leader() because if a task that's not the leader calls
> > exec(), the thread_group list will be in an indeterminate state from the
> > perspective of the task we hold on to (who was originally the leader).
> > 
> 
> Hm. By following, only 'thread-group-leader' can go ahead.
> 
> +	threadgroup_fork_write_lock(leader);
> +	read_lock(&tasklist_lock);
> +	/* sanity check - if we raced with de_thread, we must abort */
> +	if (!thread_group_leader(leader)) {
> +		retval = -EAGAIN;
> +		read_unlock(&tasklist_lock);
> +		threadgroup_fork_write_unlock(leader);
> +		goto list_teardown;
> +	}
> 
> This moves each tasks under tasklist_lock().
> 
> +	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
> +		/* leave current thread as it is if it's already there */
> +		oldcgrp = task_cgroup_from_root(tsk, root);
> +		if (cgrp == oldcgrp)
> +			continue;
> +		/* attach each task to each subsystem */
> +		for_each_subsys(root, ss) {
> +			if (ss->attach_task)
> +				ss->attach_task(cgrp, tsk);
> +		}
> +		/* we don't care whether these threads are exiting */
> +		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
> +		BUG_ON(retval != 0 && retval != -ESRCH);
> +	}
> 
> Hm. I'm not sure there can be a task, tsk == mm->owner here.
> If tsk == mm->owner, attach task will take mmap_sem() and do something more.
> 
> In usual case, mm->owner = thread_group_leader, but it's not in special cases.
> 
> 
> 
> > > > Also, the tasklist_lock isn't used to synchronize fork events; that's
> > > > what threadgroup_fork_lock (an rwsem) is for. It's for protecting the
> > > > thread-group list while iterating over it - so that a race with exec()
> > > > (de_thread(), specifically) doesn't change the threadgroup state. It's
> > > > basically never safe to iterate over leader->thread_group unless you're
> > > > in a no-sleep section.
> > > > 
> > > 
> > > Thank you for explanation.
> > > 
> > > At first, cpuset should modify settings around 'mm' only when the thread == mm->owner.
> > > ....is there any reason to allow all threads can affect the 'mm' ?
> > 
> > I don't follow... is this any different from the cgroup_attach_task()
> > case?
> > 
> 
> cpuset_attach() takes mmap_sem() to modify mm's settings. Because all threads
> share mm, doing page migration whenever a thread moves seems to be wrong.
> I think only a thread, thread-group-leader or mm->owner, in process should be
> allowed to migrate pages. Then, cgroup_attach_proc() can avoid taking mmap_sem
> under tasklist_lock.

Yes, but note: cpuset_migrate_mm already isn't a per-thread operation.
By keeping it in cpuset_attach() (called once, not under tasklist_lock)
instead of putting it in cpuset_attach_task() (called many times, under
tasklist_lock), there is no problem.

> > > About nodemask, I feel what's required is pre_pre_attach() to cache required memory
> > > before attaching as radix_tree_preload(). Then, subsys can prepare for creating
> > > working area.
> > 
> > To save on global memory footprint, we could pre-allocate two nodemasks,
> > but I feel like it's not worth the increase in code complexity. This
> > would need to be done in the other cases that unsafely do NODEMASK_ALLOC
> > too... too much to keep track of for little gain.
> > 
> 
> But NODEMASK_ALLOC cannot be on stack when we consider 4096node systems.

Hence global allocation, instead of on-stack. (Also, for this particular
case, the state of the nodemasks needs to persist from pre_attach to
attach_task to attach, so it can't be static inside the function
either.)

> > > Hmm...but I wonder de_thread() should take threadgroup_fork_write_unlock().
> > > 
> > > I may not understand anything important but I feel taking tasklist_lock() is overkill.
> > 
> > Would rcu_read_lock() be any better? Everywhere else in the kernel that
> > iterates over all threads in a group uses either rcu_read_lock or
> > tasklist_lock.
> > 
> 
> Because it means that can_attach()/attach() cannot sleep, it seems to make no
> difference.

can_attach_task(), pre_attach(), and attach_task() cannot sleep, but
can_attach() and attach() may. Careful not to confuse them. :P

> I wonder.... if you stops all clone()/fork() of the proc in moving, you can use
> find_ge_pid(). Please see next_tgid() or next_tid() in fs/proc/base.c which
> implements  scanning tasklist with sleep. Can't you use next_tid() ?

If I'm not mistaken, that approach is vulnerable to an exit() race -
next_tid() may return NULL if pid_alive() fails, and then we stop
iterating and miss some threads. (The relevant code is __unhash_process()
in kernel/exit.c, which is protected by tasklist_lock and sighand->siglock,
but nothing else.)

I wonder if a judicious use of threadgroup_fork_lock would solve this,
but I'm not sure where I could put it that would be safe. (When is
signal_struct freed?)

-- Ben

> 
> Thanks,
> -Kame
> 
> P.S. I'll be offlined until 2010/01/04.
> 
> 

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                                             ` <20101227001233.GA10951-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-12-27 10:31                                                               ` David Rientjes
       [not found]                                                                 ` <alpine.DEB.2.00.1012270227010.3960-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: David Rientjes @ 2010-12-27 10:31 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, Miao Xie, Andrew Morton, Paul Menage,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Sun, 26 Dec 2010, Ben Blum wrote:

> I was going to make a macro like NODEMASK_STATIC, but it turned out that
> can_attach() needed the to/from nodemasks to be shared among three
> functions for the attaching, so I defined them globally without making a
> macro for it.

I'm not sure what the benefit of defining it as a macro would be.  You're 
defining these statically allocated nodemasks so they have file scope, I 
hope (so they can be shared amongst the users who synchronize on 
cgroup_lock() already).

> I can make a separate patch for fixing the other cases,
> but I'd like to see my current patches through first. (Or should I make
> a bugfix patch first and send my other ones on top of that?)
> 

I don't think the fix is urgent: the NODEMASK_ALLOC()s have been around
since March and nobody has complained about the failures, so I personally
wouldn't delay your own development over something you've found only through
code inspection.  I think it's safe to defer it until afterwards.

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                                                 ` <alpine.DEB.2.00.1012270227010.3960-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
@ 2010-12-27 10:37                                                                   ` Ben Blum
       [not found]                                                                     ` <20101227103701.GC20986-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: Ben Blum @ 2010-12-27 10:37 UTC (permalink / raw)
  To: David Rientjes
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, ebiederm-aS9lmoZGLiVWk0Htik3J/w, Miao Xie,
	Andrew Morton, Paul Menage

On Mon, Dec 27, 2010 at 02:31:21AM -0800, David Rientjes wrote:
> On Sun, 26 Dec 2010, Ben Blum wrote:
> 
> > I was going to make a macro like NODEMASK_STATIC, but it turned out that
> > can_attach() needed the to/from nodemasks to be shared among three
> > functions for the attaching, so I defined them globally without making a
> > macro for it.
> 
> I'm not sure what the benefit of defining it as a macro would be.  You're 
> defining these statically allocated nodemasks so they have file scope, I 
> hope (so they can be shared amongst the users who synchronize on 
> cgroup_lock() already).

In the attach() case, yes, but in other cases I was thinking they could
be put on the stack if CONFIG_NODES_SHIFT < 8, and static but still
per-function otherwise. Or should all the functions share the same
global nodemask?

> > I can make a separate patch for fixing the other cases,
> > but I'd like to see my current patches through first. (Or should I make
> > a bugfix patch first and send my other ones on top of that?)
> > 
> 
> I don't think the fix is urgent since the NODEMASK_ALLOC()'s have been 
> around since March and nobody has complained about the failures, I 
> personally wouldn't delay your own development over something you've found 
> only through code inspection.  I think it's safe to defer to afterwards.

Thanks,
Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                                                     ` <20101227103701.GC20986-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-12-27 10:53                                                                       ` David Rientjes
       [not found]                                                                         ` <alpine.DEB.2.00.1012270240400.3960-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: David Rientjes @ 2010-12-27 10:53 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, Miao Xie, Andrew Morton, Paul Menage,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Mon, 27 Dec 2010, Ben Blum wrote:

> > I'm not sure what the benefit of defining it as a macro would be.  You're 
> > defining these statically allocated nodemasks so they have file scope, I 
> > hope (so they can be shared amongst the users who synchronize on 
> > cgroup_lock() already).
> 
> In the attach() case, yes, but in other cases I was thinking they could
> be put on the stack if CONFIG_NODES_SHIFT < 8, and static but still
> per-function otherwise. Or should all the functions share the same
> global nodemask?
> 

I think it would be appropriate to use a shared nodemask with file scope 
whenever you have cgroup_lock() to avoid the unnecessary kmalloc() even 
with GFP_KERNEL.  Cpusets are traditionally used on very large machines in 
the first place, so there is a higher likelihood that 
CONFIG_NODES_SHIFT > 8 whenever CONFIG_CPUSETS is enabled.

All users of NODEMASK_ALLOC() should be protected by cgroup_lock() other 
than cpuset_sprintf_memlist(), right?  That should be the only remaining 
user of NODEMASK_ALLOC() and works well since it can return -ENOMEM.

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                                                         ` <alpine.DEB.2.00.1012270240400.3960-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
@ 2010-12-27 11:00                                                                           ` Ben Blum
       [not found]                                                                             ` <20101227110050.GF20986-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2010-12-29  1:39                                                                           ` Li Zefan
  1 sibling, 1 reply; 185+ messages in thread
From: Ben Blum @ 2010-12-27 11:00 UTC (permalink / raw)
  To: David Rientjes
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, ebiederm-aS9lmoZGLiVWk0Htik3J/w, Miao Xie,
	Andrew Morton, Paul Menage

On Mon, Dec 27, 2010 at 02:53:47AM -0800, David Rientjes wrote:
> On Mon, 27 Dec 2010, Ben Blum wrote:
> 
> > > I'm not sure what the benefit of defining it as a macro would be.  You're 
> > > defining these statically allocated nodemasks so they have file scope, I 
> > > hope (so they can be shared amongst the users who synchronize on 
> > > cgroup_lock() already).
> > 
> > In the attach() case, yes, but in other cases I was thinking they could
> > be put on the stack if CONFIG_NODES_SHIFT < 8, and static but still
> > per-function otherwise. Or should all the functions share the same
> > global nodemask?
> > 
> 
> I think it would be appropriate to use a shared nodemask with file scope 
> whenever you have cgroup_lock() to avoid the unnecessary kmalloc() even 
> with GFP_KERNEL.  Cpusets are traditionally used on very large machines in 
> the first place, so there is a higher likelihood that 
> CONFIG_NODES_SHIFT > 8 whenever CONFIG_CPUSETS is enabled.
> 
> All users of NODEMASK_ALLOC() should be protected by cgroup_lock() other 
> than cpuset_sprintf_memlist(), right?  That should be the only remaining 
> user of NODEMASK_ALLOC() and works well since it can return -ENOMEM.

Just checked; that looks right. Perhaps I should add cgroup_is_locked()
in cgroup.c and BUG_ON() checks for it in those functions, too?
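
(Something as small as this is what I have in mind - only a sketch, not a
real patch:

	/* kernel/cgroup.c */
	bool cgroup_is_locked(void)
	{
		return mutex_is_locked(&cgroup_mutex);
	}

and then each cpuset function that touches the shared nodemasks would start
with BUG_ON(!cgroup_is_locked());.)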

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                                                             ` <20101227110050.GF20986-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-12-27 11:03                                                                               ` David Rientjes
  0 siblings, 0 replies; 185+ messages in thread
From: David Rientjes @ 2010-12-27 11:03 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, Miao Xie, Andrew Morton, Paul Menage,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Mon, 27 Dec 2010, Ben Blum wrote:

> > I think it would be appropriate to use a shared nodemask with file scope 
> > whenever you have cgroup_lock() to avoid the unnecessary kmalloc() even 
> > with GFP_KERNEL.  Cpusets are traditionally used on very large machines in 
> > the first place, so there is a higher likelihood that 
> > CONFIG_NODES_SHIFT > 8 whenever CONFIG_CPUSETS is enabled.
> > 
> > All users of NODEMASK_ALLOC() should be protected by cgroup_lock() other 
> > than cpuset_sprintf_memlist(), right?  That should be the only remaining 
> > user of NODEMASK_ALLOC() and works well since it can return -ENOMEM.
> 
> Just checked; that looks right. Perhaps I should add cgroup_is_locked()
> in cgroup.c and BUG_ON() checks for it in those functions, too?
> 

Sounds good, especially if it's coupled with a comment where the nodemasks
are declared specifying that they are protected by the lock.

^ permalink raw reply	[flat|nested] 185+ messages in thread

* [RFC][BUGFIX] memcg: fix dead lock between cpuset and memcg (Re: [PATCH v5 3/3] cgroups: make procs file writable)
       [not found]                                     ` <20101227042254.GA15417-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2010-12-27  7:00                                       ` KAMEZAWA Hiroyuki
@ 2010-12-28  2:43                                       ` Daisuke Nishimura
  1 sibling, 0 replies; 185+ messages in thread
From: Daisuke Nishimura @ 2010-12-28  2:43 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, Miao Xie, David Rientjes,
	Andrew Morton, Paul Menage

> It looks to me like when memcg holds the mmap_sem the whole time, it's
> just to avoid the deadlock, not that there's there some need for the
> stuff under mmap_sem not to change between can_attach and attach. But if
> there is such a need, then the write-side in mpol_rebind_mm may conflict
> even with my proposed solution.
> 
> Regardless, the best way would be to avoid holding the mmap_sem across
> the whole window, possibly by solving the move_charge deadlock some
> other internal way, if at all possible?
> 
I made a patch to fix these problems (the deadlock between cpuset and memcg
which commit b1dd693e introduces, and the deadlock which that commit fixed).
I'll test and resend this after the new year holidays in Japan.

===
From: Daisuke Nishimura <nishimura-YQH0OdQVrdy45+QrQBaojngSJqDPrsil@public.gmane.org>

Commit b1dd693e (memcg: avoid deadlock between move charge and try_charge())
can cause another deadlock on mmap_sem during task migration if cpuset and
memcg are mounted onto the same mount point.

After the commit, cgroup_attach_task() has sequence like:

cgroup_attach_task()
  ss->can_attach()
    cpuset_can_attach()
    mem_cgroup_can_attach()
      down_read(&mmap_sem)        (1)
  ss->attach()
    cpuset_attach()
      mpol_rebind_mm()
        down_write(&mmap_sem)     (2)
        up_write(&mmap_sem)
      cpuset_migrate_mm()
        do_migrate_pages()
          down_read(&mmap_sem)
          up_read(&mmap_sem)
    mem_cgroup_move_task()
      mem_cgroup_clear_mc()
        up_read(&mmap_sem)

We can deadlock at (2) because we have already acquired the mmap_sem at (1).

But the commit itself is necessary to fix deadlocks which existed before the
commit, such as:

Ex.1)
                move charge             |        try charge
  --------------------------------------+------------------------------
    mem_cgroup_can_attach()             |  down_write(&mmap_sem)
      mc.moving_task = current          |    ..
      mem_cgroup_precharge_mc()         |  __mem_cgroup_try_charge()
        mem_cgroup_count_precharge()    |    prepare_to_wait()
          down_read(&mmap_sem)          |    if (mc.moving_task)
          -> cannot acquire the lock    |    -> true
                                        |      schedule()
                                        |      -> move charge should wake it up

Ex.2)
                move charge             |        try charge
  --------------------------------------+------------------------------
    mem_cgroup_can_attach()             |
      mc.moving_task = current          |
      mem_cgroup_precharge_mc()         |
        mem_cgroup_count_precharge()    |
          down_read(&mmap_sem)          |
          ..                            |
          up_read(&mmap_sem)            |
                                        |  down_write(&mmap_sem)
    mem_cgroup_move_task()              |    ..
      mem_cgroup_move_charge()          |  __mem_cgroup_try_charge()
        down_read(&mmap_sem)            |    prepare_to_wait()
        -> cannot acquire the lock      |    if (mc.moving_task)
                                        |    -> true
                                        |      schedule()
                                        |      -> move charge should wake it up

This patch fixes all of these problems by:
1. Reverting the commit.
2. To fix Ex.1, setting mc.moving_task only after mem_cgroup_count_precharge()
   has released the mmap_sem.
3. To fix Ex.2, using down_read_trylock() instead of down_read() in
   mem_cgroup_move_charge() and, if it fails to acquire the lock, cancelling
   all extra charges, waking up all waiters, and retrying the trylock.

Reported-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
Signed-off-by: Daisuke Nishimura <nishimura-YQH0OdQVrdy45+QrQBaojngSJqDPrsil@public.gmane.org>
---
 mm/memcontrol.c |   78 +++++++++++++++++++++++++++++++------------------------
 1 files changed, 44 insertions(+), 34 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7a22b41..b108b30 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -292,7 +292,6 @@ static struct move_charge_struct {
 	unsigned long moved_charge;
 	unsigned long moved_swap;
 	struct task_struct *moving_task;	/* a task moving charges */
-	struct mm_struct *mm;
 	wait_queue_head_t waitq;		/* a waitq for other context */
 } mc = {
 	.lock = __SPIN_LOCK_UNLOCKED(mc.lock),
@@ -4639,7 +4638,7 @@ static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
 	unsigned long precharge;
 	struct vm_area_struct *vma;
 
-	/* We've already held the mmap_sem */
+	down_read(&mm->mmap_sem);
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		struct mm_walk mem_cgroup_count_precharge_walk = {
 			.pmd_entry = mem_cgroup_count_precharge_pte_range,
@@ -4651,6 +4650,7 @@ static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
 		walk_page_range(vma->vm_start, vma->vm_end,
 					&mem_cgroup_count_precharge_walk);
 	}
+	up_read(&mm->mmap_sem);
 
 	precharge = mc.precharge;
 	mc.precharge = 0;
@@ -4660,10 +4660,15 @@ static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
 
 static int mem_cgroup_precharge_mc(struct mm_struct *mm)
 {
-	return mem_cgroup_do_precharge(mem_cgroup_count_precharge(mm));
+	unsigned long precharge = mem_cgroup_count_precharge(mm);
+
+	VM_BUG_ON(mc.moving_task);
+	mc.moving_task = current;
+	return mem_cgroup_do_precharge(precharge);
 }
 
-static void mem_cgroup_clear_mc(void)
+/* cancels all extra charges on mc.from and mc.to, and wakes up all waiters. */
+static void __mem_cgroup_clear_mc(void)
 {
 	struct mem_cgroup *from = mc.from;
 	struct mem_cgroup *to = mc.to;
@@ -4698,23 +4703,24 @@ static void mem_cgroup_clear_mc(void)
 						PAGE_SIZE * mc.moved_swap);
 		}
 		/* we've already done mem_cgroup_get(mc.to) */
-
 		mc.moved_swap = 0;
 	}
-	if (mc.mm) {
-		up_read(&mc.mm->mmap_sem);
-		mmput(mc.mm);
-	}
+	memcg_oom_recover(from);
+	memcg_oom_recover(to);
+	wake_up_all(&mc.waitq);
+}
+
+static void mem_cgroup_clear_mc(void)
+{
+	struct mem_cgroup *from = mc.from;
+
+	__mem_cgroup_clear_mc();
 	spin_lock(&mc.lock);
 	mc.from = NULL;
 	mc.to = NULL;
 	spin_unlock(&mc.lock);
 	mc.moving_task = NULL;
-	mc.mm = NULL;
 	mem_cgroup_end_move(from);
-	memcg_oom_recover(from);
-	memcg_oom_recover(to);
-	wake_up_all(&mc.waitq);
 }
 
 static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
@@ -4736,38 +4742,23 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 			return 0;
 		/* We move charges only when we move a owner of the mm */
 		if (mm->owner == p) {
-			/*
-			 * We do all the move charge works under one mmap_sem to
-			 * avoid deadlock with down_write(&mmap_sem)
-			 * -> try_charge() -> if (mc.moving_task) -> sleep.
-			 */
-			down_read(&mm->mmap_sem);
-
 			VM_BUG_ON(mc.from);
 			VM_BUG_ON(mc.to);
 			VM_BUG_ON(mc.precharge);
 			VM_BUG_ON(mc.moved_charge);
 			VM_BUG_ON(mc.moved_swap);
-			VM_BUG_ON(mc.moving_task);
-			VM_BUG_ON(mc.mm);
-
 			mem_cgroup_start_move(from);
 			spin_lock(&mc.lock);
 			mc.from = from;
 			mc.to = mem;
-			mc.precharge = 0;
-			mc.moved_charge = 0;
-			mc.moved_swap = 0;
 			spin_unlock(&mc.lock);
-			mc.moving_task = current;
-			mc.mm = mm;
+			/* We set mc.moving_task later */
 
 			ret = mem_cgroup_precharge_mc(mm);
 			if (ret)
 				mem_cgroup_clear_mc();
-			/* We call up_read() and mmput() in clear_mc(). */
-		} else
-			mmput(mm);
+		}
+		mmput(mm);
 	}
 	return ret;
 }
@@ -4855,7 +4846,19 @@ static void mem_cgroup_move_charge(struct mm_struct *mm)
 	struct vm_area_struct *vma;
 
 	lru_add_drain_all();
-	/* We've already held the mmap_sem */
+retry:
+	if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
+		/*
+		 * Someone who is holding the mmap_sem might be waiting on the
+		 * waitq. So we cancel all extra charges, wake up all waiters,
+		 * and retry. Because we cancel precharges, we might not be able
+		 * to move enough charges, but moving charge is a best-effort
+		 * feature anyway, so it wouldn't be a big problem.
+		 */
+		__mem_cgroup_clear_mc();
+		cond_resched();
+		goto retry;
+	}
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		int ret;
 		struct mm_walk mem_cgroup_move_charge_walk = {
@@ -4874,6 +4877,7 @@ static void mem_cgroup_move_charge(struct mm_struct *mm)
 			 */
 			break;
 	}
+	up_read(&mm->mmap_sem);
 }
 
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
@@ -4882,11 +4886,17 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct task_struct *p,
 				bool threadgroup)
 {
-	if (!mc.mm)
+	struct mm_struct *mm;
+
+	if (!mc.to)
 		/* no need to move charge */
 		return;
 
-	mem_cgroup_move_charge(mc.mm);
+	mm = get_task_mm(p);
+	if (mm) {
+		mem_cgroup_move_charge(mm);
+		mmput(mm);
+	}
 	mem_cgroup_clear_mc();
 }
 #else	/* !CONFIG_MMU */
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                                                         ` <alpine.DEB.2.00.1012270240400.3960-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
  2010-12-27 11:00                                                                           ` Ben Blum
@ 2010-12-29  1:39                                                                           ` Li Zefan
       [not found]                                                                             ` <4D1A913C.5080702-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  1 sibling, 1 reply; 185+ messages in thread
From: Li Zefan @ 2010-12-29  1:39 UTC (permalink / raw)
  To: David Rientjes
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, Miao Xie, Andrew Morton, Paul Menage,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

David Rientjes wrote:
> On Mon, 27 Dec 2010, Ben Blum wrote:
> 
>>> I'm not sure what the benefit of defining it as a macro would be.  You're 
>>> defining these statically allocated nodemasks so they have file scope, I 
>>> hope (so they can be shared amongst the users who synchronize on 
>>> cgroup_lock() already).
>> In the attach() case, yes, but in other cases I was thinking they could
>> be put on the stack if CONFIG_NODES_SHIFT < 8, and static but still
>> per-function otherwise. Or should all the functions share the same
>> global nodemask?
>>
> 
> I think it would be appropriate to use a shared nodemask with file scope 
> whenever you have cgroup_lock() to avoid the unnecessary kmalloc() even 
> with GFP_KERNEL.  Cpusets are traditionally used on very large machines in 
> the first place, so there is a higher likelihood that 
> CONFIG_NODES_SHIFT > 8 whenever CONFIG_CPUSETS is enabled.
> 
> All users of NODEMASK_ALLOC() should be protected by cgroup_lock() other 
> than cpuset_sprintf_memlist(), right?  That should be the only remaining 
> user of NODEMASK_ALLOC() and works well since it can return -ENOMEM.
> 

Changing cpuset->mems_allowed is protected by both cgroup_mutex and the
cpuset-specific lock (callback_mutex), so you can read it under either
lock, and NODEMASK_ALLOC() is not needed. See cpuset_sprintf_cpulist().

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                                                             ` <4D1A913C.5080702-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2010-12-30  0:26                                                                               ` David Rientjes
       [not found]                                                                                 ` <alpine.DEB.2.00.1012291624210.6040-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: David Rientjes @ 2010-12-30  0:26 UTC (permalink / raw)
  To: Li Zefan
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, Miao Xie, Andrew Morton, Paul Menage,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Wed, 29 Dec 2010, Li Zefan wrote:

> > I think it would be appropriate to use a shared nodemask with file scope 
> > whenever you have cgroup_lock() to avoid the unnecessary kmalloc() even 
> > with GFP_KERNEL.  Cpusets are traditionally used on very large machines in 
> > the first place, so there is a higher likelihood that 
> > CONFIG_NODES_SHIFT > 8 whenever CONFIG_CPUSETS is enabled.
> > 
> > All users of NODEMASK_ALLOC() should be protected by cgroup_lock() other 
> > than cpuset_sprintf_memlist(), right?  That should be the only remaining 
> > user of NODEMASK_ALLOC() and works well since it can return -ENOMEM.
> > 
> 
> Changing cpuset->mems_allowed is protected by both cgroup_mutex and
> cpuset-specific lock (callback_mutex), so you can read it under either
> lock, so NODEMASK_ALLOC() is not needed. See cpuset_sprintf_cpulist().
> 

I'm not sure what you're saying.  Cpusets needs to allocate nodemasks for 
certain functions, and putting them on the stack can be problematic if 
CONFIG_NODES_SHIFT is large because of stack overflow.  Thus, we can't use 
temporary on-stack nodemasks in functions like cpuset_attach(), 
update_nodemask(), etc. that require them.  The suggestion was to use a 
statically allocated "scratch" nodemask, since these functions are all 
protected by cgroup_lock().
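
Concretely, I'm thinking of something like this sketch (illustrative only,
not a tested patch; "cpuset_scratch_mems" is a made-up name):

	/*
	 * Shared scratch nodemask for update_nodemask(), cpuset_attach(), etc.
	 * Only ever touched with cgroup_lock() held.
	 */
	static nodemask_t cpuset_scratch_mems;

	static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
				   const char *buf)
	{
		int retval;

		/* caller holds cgroup_lock(), so the static mask is exclusively ours */
		retval = nodelist_parse(buf, cpuset_scratch_mems);
		if (retval < 0)
			return retval;
		if (!nodes_subset(cpuset_scratch_mems, node_states[N_HIGH_MEMORY]))
			return -EINVAL;
		/* ... proceed as today, using cpuset_scratch_mems in place of
		 * the NODEMASK_ALLOC()'d mask ... */
		return 0;
	}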

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                                                                 ` <alpine.DEB.2.00.1012291624210.6040-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
@ 2010-12-30  4:02                                                                                   ` Li Zefan
       [not found]                                                                                     ` <4D1C0464.5090801-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: Li Zefan @ 2010-12-30  4:02 UTC (permalink / raw)
  To: David Rientjes
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, Miao Xie, Andrew Morton, Paul Menage,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

David Rientjes wrote:
> On Wed, 29 Dec 2010, Li Zefan wrote:
> 
>>> I think it would be appropriate to use a shared nodemask with file scope 
>>> whenever you have cgroup_lock() to avoid the unnecessary kmalloc() even 
>>> with GFP_KERNEL.  Cpusets are traditionally used on very large machines in 
>>> the first place, so there is a higher likelihood that 
>>> CONFIG_NODES_SHIFT > 8 whenever CONFIG_CPUSETS is enabled.
>>>
>>> All users of NODEMASK_ALLOC() should be protected by cgroup_lock() other 
>>> than cpuset_sprintf_memlist(), right?  That should be the only remaining 
>>> user of NODEMASK_ALLOC() and works well since it can return -ENOMEM.
>>>
>> Changing cpuset->mems_allowed is protected by both cgroup_mutex and
>> cpuset-specific lock (callback_mutex), so you can read it under either
>> lock, so NODEMASK_ALLOC() is not needed. See cpuset_sprintf_cpulist().
>>
> 
> I'm not sure what you're saying.  Cpusets needs to allocate nodemasks for 
> certain functions and doing on the stack can be problemantic if 
> CONFIG_NODES_SHIFT is large because of overflow.  Thus, we can't have 
> temporary nodemasks available on the stack where necessary in functions 
> like cpuset_attach(), update_nodemask(), etc. that require them.  The 
> suggestion was to use a statically allocated "scratch" nodemask since 
> these functions are all protected by cgroup_lock().
> 

That's what we did for cpu masks :). See commit
2341d1b6598c7146d64a5050b53a72a5a819617f.

I made a patchset to remove on-stack cpu masks.

What I meant is that we don't have to allocate a nodemask in cpuset_sprintf_memlist().
This is sufficient:

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 4349935..a159612 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1620,20 +1620,12 @@ static int cpuset_sprintf_cpulist(char *page, struct cpu

 static int cpuset_sprintf_memlist(char *page, struct cpuset *cs)
 {
-       NODEMASK_ALLOC(nodemask_t, mask, GFP_KERNEL);
        int retval;

-       if (mask == NULL)
-               return -ENOMEM;
-
        mutex_lock(&callback_mutex);
-       *mask = cs->mems_allowed;
+       retval = nodelist_scnprintf(page, PAGE_SIZE, cs->mems_allowed);
        mutex_unlock(&callback_mutex);

-       retval = nodelist_scnprintf(page, PAGE_SIZE, *mask);
-
-       NODEMASK_FREE(mask);
-
        return retval;
 }

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                                                                     ` <4D1C0464.5090801-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2010-12-30  4:24                                                                                       ` David Rientjes
       [not found]                                                                                         ` <alpine.DEB.2.00.1012292019540.27634-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: David Rientjes @ 2010-12-30  4:24 UTC (permalink / raw)
  To: Li Zefan
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, Miao Xie, Andrew Morton, Paul Menage,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Thu, 30 Dec 2010, Li Zefan wrote:

> That's what we did for cpu masks :). See commit
> 2341d1b6598c7146d64a5050b53a72a5a819617f.
> 
> I made a patchset to remove on stack cpu masks.
> 
> What I meant is we don't have to allocate nodemasks in cpuset_sprintf_memlist().
> This is sufficient:
> 
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index 4349935..a159612 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -1620,20 +1620,12 @@ static int cpuset_sprintf_cpulist(char *page, struct cpu
> 
>  static int cpuset_sprintf_memlist(char *page, struct cpuset *cs)
>  {
> -       NODEMASK_ALLOC(nodemask_t, mask, GFP_KERNEL);
>         int retval;
> 
> -       if (mask == NULL)
> -               return -ENOMEM;
> -
>         mutex_lock(&callback_mutex);
> -       *mask = cs->mems_allowed;
> +       retval = nodelist_scnprintf(page, PAGE_SIZE, cs->mems_allowed);
>         mutex_unlock(&callback_mutex);
> 
> -       retval = nodelist_scnprintf(page, PAGE_SIZE, *mask);
> -
> -       NODEMASK_FREE(mask);
> -
>         return retval;
>  }
> 

This needs to be done with cgroup_lock() instead of callback_mutex since 
the post_clone() callback will store to cs->mems_allowed on 
cgroup_clone().

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                                                                         ` <alpine.DEB.2.00.1012292019540.27634-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
@ 2010-12-30  4:38                                                                                           ` Li Zefan
       [not found]                                                                                             ` <4D1C0CC6.4090107-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: Li Zefan @ 2010-12-30  4:38 UTC (permalink / raw)
  To: David Rientjes
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, Miao Xie, Andrew Morton, Paul Menage,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

David Rientjes wrote:
> On Thu, 30 Dec 2010, Li Zefan wrote:
> 
>> That's what we did for cpu masks :). See commit
>> 2341d1b6598c7146d64a5050b53a72a5a819617f.
>>
>> I made a patchset to remove on stack cpu masks.
>>
>> What I meant is we don't have to allocate nodemasks in cpuset_sprintf_memlist().
>> This is sufficient:
>>
>> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
>> index 4349935..a159612 100644
>> --- a/kernel/cpuset.c
>> +++ b/kernel/cpuset.c
>> @@ -1620,20 +1620,12 @@ static int cpuset_sprintf_cpulist(char *page, struct cpu
>>
>>  static int cpuset_sprintf_memlist(char *page, struct cpuset *cs)
>>  {
>> -       NODEMASK_ALLOC(nodemask_t, mask, GFP_KERNEL);
>>         int retval;
>>
>> -       if (mask == NULL)
>> -               return -ENOMEM;
>> -
>>         mutex_lock(&callback_mutex);
>> -       *mask = cs->mems_allowed;
>> +       retval = nodelist_scnprintf(page, PAGE_SIZE, cs->mems_allowed);
>>         mutex_unlock(&callback_mutex);
>>
>> -       retval = nodelist_scnprintf(page, PAGE_SIZE, *mask);
>> -
>> -       NODEMASK_FREE(mask);
>> -
>>         return retval;
>>  }
>>
> 
> This needs to be done with cgroup_lock() instead of callback_mutex since 
> the post_clone() callback will store to cs->mems_allowed on 
> cgroup_clone().
> 

Then cpuset_post_clone() breaks the lock rule:

 * A task must hold both mutexes to modify cpusets...
 ...
 * If a task is only holding callback_mutex, then it has read-only
 * access to cpusets.

But that's OK, because cgroup_clone() is called during the creation of
the new cgroup, so no one can access the cpuset at that time.

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                                                                             ` <4D1C0CC6.4090107-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2010-12-30  5:49                                                                                               ` David Rientjes
       [not found]                                                                                                 ` <alpine.DEB.2.00.1012292149000.29486-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: David Rientjes @ 2010-12-30  5:49 UTC (permalink / raw)
  To: Li Zefan
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, Miao Xie, Andrew Morton, Paul Menage,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Thu, 30 Dec 2010, Li Zefan wrote:

> > This needs to be done with cgroup_lock() instead of callback_mutex since 
> > the post_clone() callback will store to cs->mems_allowed on 
> > cgroup_clone().
> > 
> 
> Then cpuset_post_clone() breaks the lock rule:
> 
>  * A task must hold both mutexes to modify cpusets...
>  ...
>  * If a task is only holding callback_mutex, then it has read-only
>  * access to cpusets.
> 
> But that's Ok, because cgroup_clone() is called during the creation of
> the new cgroup, so no one can access the cpuset at that time.
> 

I'm saying that if cpusets implements a cgroup_clone() handler, the 
locking will break with only callback_mutex here: the only synchronization 
after the new cgroup dentry is added is cgroup_lock(), which is always 
held when a post_clone() callback is invoked, and reading a mems file may 
race since the file is accessible before the task is attached in the 
cgroup_clone() case.  It's not a problem right now, but it may subtly 
break if cpusets were to use cgroup_clone().

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                                                                                 ` <alpine.DEB.2.00.1012292149000.29486-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
@ 2010-12-30  6:12                                                                                                   ` Li Zefan
       [not found]                                                                                                     ` <4D1C22D2.9090007-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  0 siblings, 1 reply; 185+ messages in thread
From: Li Zefan @ 2010-12-30  6:12 UTC (permalink / raw)
  To: David Rientjes
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, Miao Xie, Andrew Morton, Paul Menage,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

David Rientjes wrote:
> On Thu, 30 Dec 2010, Li Zefan wrote:
> 
>>> This needs to be done with cgroup_lock() instead of callback_mutex since 
>>> the post_clone() callback will store to cs->mems_allowed on 
>>> cgroup_clone().
>>>
>> Then cpuset_post_clone() breaks the lock rule:
>>
>>  * A task must hold both mutexes to modify cpusets...
>>  ...
>>  * If a task is only holding callback_mutex, then it has read-only
>>  * access to cpusets.
>>
>> But that's Ok, because cgroup_clone() is called during the creation of
>> the new cgroup, so no one can access the cpuset at that time.
>>
> 
> I'm saying that if cpusets implements a cgroup_clone() handler that the 
> locking will break with only callback_mutex here because the only 
> synchronization after the new cgroup dentry is added is cgroup_lock() that 
> is always held when a post_clone callback is invoked and reading a mems 
> file may race since it is accessible before the task is attached in the 
> cgroup_clone() case.  It's not a problem right now but may subtly break if 
> cpusets were to use cgroup_clone().
> 

cgroup_clone() was implemented for ns_cgroup, and ns_cgroup is scheduled to be
removed in the coming 2.6.38, and so will cgroup_clone().

As a replacement we've added cgroup.clone_children, and the post_clone() callback
will be called in cgroup_create() if the clone_children flag is set.

If we want to avoid a subtle breakage in case post_clone() is used for something
other than cgroup creation in the future, we can just hold callback_mutex in
cpuset_post_clone().
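
I.e. something along these lines (untested sketch, with the surrounding
code elided):

	static void cpuset_post_clone(struct cgroup_subsys *ss,
				      struct cgroup *cgroup)
	{
		...
		mutex_lock(&callback_mutex);
		cs->mems_allowed = parent_cs->mems_allowed;
		cpumask_copy(cs->cpus_allowed, parent_cs->cpus_allowed);
		mutex_unlock(&callback_mutex);
		...
	}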

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                                                                                     ` <4D1C22D2.9090007-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2010-12-30 18:25                                                                                                       ` David Rientjes
  0 siblings, 0 replies; 185+ messages in thread
From: David Rientjes @ 2010-12-30 18:25 UTC (permalink / raw)
  To: Li Zefan
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Oleg Nesterov, Miao Xie, Andrew Morton, Paul Menage,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Thu, 30 Dec 2010, Li Zefan wrote:

> If we want to avoid subtle break in case post_clone() is used in other cases
> than cgroup creation in the future, we can just hold callback_mutex() in
> cpuset_post_clone().
> 

Sounds good!  I'd suggest doing this in the same patch in which you remove the 
NODEMASK_ALLOC() in cpuset_sprintf_memlist(), so it's not forgotten when the 
transition you explained above is done.

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v5 3/3] cgroups: make procs file writable
       [not found]                                                             ` <20101227101228.GB20986-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2011-01-04  0:57                                                               ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 185+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-04  0:57 UTC (permalink / raw)
  To: Ben Blum
  Cc: Daisuke Nishimura,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oleg-H+wXaHxf7aLQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	Miao Xie, David Rientjes, Andrew Morton, Paul Menage

On Mon, 27 Dec 2010 05:12:28 -0500
Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:

> > > > About nodemask, I feel what's required is pre_pre_attach() to cache required memory
> > > > before attaching as radix_tree_preload(). Then, subsys can prepare for creating
> > > > working area.
> > > 
> > > To save on global memory footprint, we could pre-allocate two nodemasks,
> > > but I feel like it's not worth the increase in code complexity. This
> > > would need to be done in the other cases that unsafely do NODEMASK_ALLOC
> > > too... too much to keep track of for little gain.
> > > 
> > 
> > But NODEMASK_ALLOC cannot be on stack when we consider 4096node systems.
> 
> Hence global allocation, instead of on-stack. (Also, for this particular
> case, the state of the nodemasks needs to persist from pre_attach to
> attach_task to attach, so it can't be static inside the function
> either.)
> 

Yes, it can be global/static.



> > > > Hmm...but I wonder de_thread() should take threadgroup_fork_write_unlock().
> > > > 
> > > > I may not understand anything important but I feel taking tasklist_lock() is overkill.
> > > 
> > > Would rcu_read_lock() be any better? Everywhere else in the kernel that
> > > iterates over all threads in a group uses either rcu_read_lock or
> > > tasklist_lock.
> > > 
> > 
> > Because it means that can_attach()/attach() cannot sleep, it seems to make no
> > difference.
> 
> can_attach_task(), pre_attach(), and attach_task() cannot sleep, but
> can_attach() and attach() may. Careful not to confuse them. :P
> 
> > I wonder.... if you stops all clone()/fork() of the proc in moving, you can use
> > find_ge_pid(). Please see next_tgid() or next_tid() in fs/proc/base.c which
> > implements  scanning tasklist with sleep. Can't you use next_tid() ?
> 
> If I'm not mistaken, that approach is vulnerable to an exit() race -
> next_tid() may return NULL if pid_alive() fails, and then we stop
> iterating and miss some threads. (The relevant code is kernel/exit.c,
> __unhash_process, which is protected by tasklist_lock and sighand->lock,
> but nothing else.)
>

You can use a pid array, find_ge_pid(), or something similar if the
linked list is not dependable. If CLONE_THREAD is blocked, that works
to catch all threads without races.
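
As a rough (and completely untested) sketch of what I mean -- all the names
here are invented for illustration:

	static void scan_threadgroup(struct task_struct *leader,
				     void (*visit)(struct task_struct *))
	{
		struct pid_namespace *ns = task_active_pid_ns(leader);
		struct task_struct *tsk;
		struct pid *pid;
		int nr = 0;

		for (;;) {
			rcu_read_lock();
			pid = find_ge_pid(nr, ns);
			if (!pid) {
				rcu_read_unlock();
				break;
			}
			nr = pid_nr_ns(pid, ns) + 1;
			tsk = pid_task(pid, PIDTYPE_PID);
			if (tsk && same_thread_group(tsk, leader))
				get_task_struct(tsk);
			else
				tsk = NULL;
			rcu_read_unlock();
			if (tsk) {
				visit(tsk);	/* may sleep; we resume the scan at nr */
				put_task_struct(tsk);
			}
		}
	}

Threads that exit during the scan are simply skipped, and with CLONE_THREAD
blocked no new thread can appear behind the cursor, so nothing is missed.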

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v6 3/3] cgroups: make procs file writable
       [not found]             ` <20101224082445.GD13872-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2011-01-12 23:26               ` Paul E. McKenney
  0 siblings, 0 replies; 185+ messages in thread
From: Paul E. McKenney @ 2011-01-12 23:26 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Fri, Dec 24, 2010 at 03:24:45AM -0500, Ben Blum wrote:
> Makes procs file writable to move all threads by tgid at once
> 
> From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
> 
> This patch adds functionality that enables users to move all threads in a
> threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
> file. This current implementation makes use of a per-threadgroup rwsem that's
> taken for reading in the fork() path to prevent newly forking threads within
> the threadgroup from "escaping" while the move is in progress.

One minor nit below.

							Thanx, Paul

> Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
> ---
>  Documentation/cgroups/cgroups.txt |   13 +
>  kernel/cgroup.c                   |  424 +++++++++++++++++++++++++++++++++----
>  2 files changed, 387 insertions(+), 50 deletions(-)
> 
> diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
> index 190018b..07674e5 100644
> --- a/Documentation/cgroups/cgroups.txt
> +++ b/Documentation/cgroups/cgroups.txt
> @@ -236,7 +236,8 @@ containing the following files describing that cgroup:
>   - cgroup.procs: list of tgids in the cgroup.  This list is not
>     guaranteed to be sorted or free of duplicate tgids, and userspace
>     should sort/uniquify the list if this property is required.
> -   This is a read-only file, for now.
> +   Writing a thread group id into this file moves all threads in that
> +   group into this cgroup.
>   - notify_on_release flag: run the release agent on exit?
>   - release_agent: the path to use for release notifications (this file
>     exists in the top cgroup only)
> @@ -426,6 +427,12 @@ You can attach the current shell task by echoing 0:
> 
>  # echo 0 > tasks
> 
> +You can use the cgroup.procs file instead of the tasks file to move all
> +threads in a threadgroup at once. Echoing the pid of any task in a
> +threadgroup to cgroup.procs causes all tasks in that threadgroup to be
> +attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
> +in the writing task's threadgroup.
> +
>  2.3 Mounting hierarchies by name
>  --------------------------------
> 
> @@ -574,7 +581,9 @@ called on a fork. If this method returns 0 (success) then this should
>  remain valid while the caller holds cgroup_mutex and it is ensured that either
>  attach() or cancel_attach() will be called in future. If threadgroup is
>  true, then a successful result indicates that all threads in the given
> -thread's threadgroup can be moved together.
> +thread's threadgroup can be moved together. If the subsystem wants to
> +iterate over task->thread_group, it must take rcu_read_lock then check
> +if thread_group_leader(task), returning -EAGAIN if that fails.
> 
>  void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
>  	       struct task_struct *task, bool threadgroup)
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index f86dd9c..74be02c 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1771,6 +1771,76 @@ int cgroup_can_attach_per_thread(struct cgroup *cgrp, struct task_struct *task,
>  }
>  EXPORT_SYMBOL_GPL(cgroup_can_attach_per_thread);
> 
> +/*
> + * cgroup_task_migrate - move a task from one cgroup to another.
> + *
> + * 'guarantee' is set if the caller promises that a new css_set for the task
> + * will already exit. If not set, this function might sleep, and can fail with
> + * -ENOMEM. Otherwise, it can only fail with -ESRCH.
> + */
> +static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
> +			       struct task_struct *tsk, bool guarantee)
> +{
> +	struct css_set *oldcg;
> +	struct css_set *newcg;
> +
> +	/*
> +	 * get old css_set. we need to take task_lock and refcount it, because
> +	 * an exiting task can change its css_set to init_css_set and drop its
> +	 * old one without taking cgroup_mutex.
> +	 */
> +	task_lock(tsk);
> +	oldcg = tsk->cgroups;
> +	get_css_set(oldcg);
> +	task_unlock(tsk);
> +
> +	/* locate or allocate a new css_set for this task. */
> +	if (guarantee) {
> +		/* we know the css_set we want already exists. */
> +		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
> +		read_lock(&css_set_lock);
> +		newcg = find_existing_css_set(oldcg, cgrp, template);
> +		BUG_ON(!newcg);
> +		get_css_set(newcg);
> +		read_unlock(&css_set_lock);
> +	} else {
> +		might_sleep();
> +		/* find_css_set will give us newcg already referenced. */
> +		newcg = find_css_set(oldcg, cgrp);
> +		if (!newcg) {
> +			put_css_set(oldcg);
> +			return -ENOMEM;
> +		}
> +	}
> +	put_css_set(oldcg);
> +
> +	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
> +	task_lock(tsk);
> +	if (tsk->flags & PF_EXITING) {
> +		task_unlock(tsk);
> +		put_css_set(newcg);
> +		return -ESRCH;
> +	}
> +	rcu_assign_pointer(tsk->cgroups, newcg);
> +	task_unlock(tsk);
> +
> +	/* Update the css_set linked lists if we're using them */
> +	write_lock(&css_set_lock);
> +	if (!list_empty(&tsk->cg_list))
> +		list_move(&tsk->cg_list, &newcg->tasks);
> +	write_unlock(&css_set_lock);
> +
> +	/*
> +	 * We just gained a reference on oldcg by taking it from the task. As
> +	 * trading it for newcg is protected by cgroup_mutex, we're safe to drop
> +	 * it here; it will be freed under RCU.
> +	 */
> +	put_css_set(oldcg);
> +
> +	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
> +	return 0;
> +}
> +
>  /**
>   * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
>   * @cgrp: the cgroup the task is attaching to
> @@ -1781,11 +1851,9 @@ EXPORT_SYMBOL_GPL(cgroup_can_attach_per_thread);
>   */
>  int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  {
> -	int retval = 0;
> +	int retval;
>  	struct cgroup_subsys *ss, *failed_ss = NULL;
>  	struct cgroup *oldcgrp;
> -	struct css_set *cg;
> -	struct css_set *newcg;
>  	struct cgroupfs_root *root = cgrp->root;
> 
>  	/* Nothing to do if the task is already in that cgroup */
> @@ -1809,46 +1877,16 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  		}
>  	}
> 
> -	task_lock(tsk);
> -	cg = tsk->cgroups;
> -	get_css_set(cg);
> -	task_unlock(tsk);
> -	/*
> -	 * Locate or allocate a new css_set for this task,
> -	 * based on its final set of cgroups
> -	 */
> -	newcg = find_css_set(cg, cgrp);
> -	put_css_set(cg);
> -	if (!newcg) {
> -		retval = -ENOMEM;
> -		goto out;
> -	}
> -
> -	task_lock(tsk);
> -	if (tsk->flags & PF_EXITING) {
> -		task_unlock(tsk);
> -		put_css_set(newcg);
> -		retval = -ESRCH;
> +	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, false);
> +	if (retval)
>  		goto out;
> -	}
> -	rcu_assign_pointer(tsk->cgroups, newcg);
> -	task_unlock(tsk);
> -
> -	/* Update the css_set linked lists if we're using them */
> -	write_lock(&css_set_lock);
> -	if (!list_empty(&tsk->cg_list)) {
> -		list_del(&tsk->cg_list);
> -		list_add(&tsk->cg_list, &newcg->tasks);
> -	}
> -	write_unlock(&css_set_lock);
> 
>  	for_each_subsys(root, ss) {
>  		if (ss->attach)
>  			ss->attach(ss, cgrp, oldcgrp, tsk, false);
>  	}
> -	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
> +
>  	synchronize_rcu();
> -	put_css_set(cg);
> 
>  	/*
>  	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
> @@ -1898,49 +1936,339 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
>  EXPORT_SYMBOL_GPL(cgroup_attach_task_all);
> 
>  /*
> - * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
> - * held. May take task_lock of task
> + * cgroup_attach_proc works in two stages, the first of which prefetches all
> + * new css_sets needed (to make sure we have enough memory before committing
> + * to the move) and stores them in a list of entries of the following type.
> + * TODO: possible optimization: use css_set->rcu_head for chaining instead
> + */
> +struct cg_list_entry {
> +	struct css_set *cg;
> +	struct list_head links;
> +};
> +
> +static bool css_set_check_fetched(struct cgroup *cgrp,
> +				  struct task_struct *tsk, struct css_set *cg,
> +				  struct list_head *newcg_list)
> +{
> +	struct css_set *newcg;
> +	struct cg_list_entry *cg_entry;
> +	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
> +
> +	read_lock(&css_set_lock);
> +	newcg = find_existing_css_set(cg, cgrp, template);
> +	if (newcg)
> +		get_css_set(newcg);
> +	read_unlock(&css_set_lock);
> +
> +	/* doesn't exist at all? */
> +	if (!newcg)
> +		return false;
> +	/* see if it's already in the list */
> +	list_for_each_entry(cg_entry, newcg_list, links) {
> +		if (cg_entry->cg == newcg) {
> +			put_css_set(newcg);
> +			return true;
> +		}
> +	}
> +
> +	/* not found */
> +	put_css_set(newcg);
> +	return false;
> +}
> +
> +/*
> + * Find the new css_set and store it in the list in preparation for moving the
> + * given task to the given cgroup. Returns 0 or -ENOMEM.
>   */
> -static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
> +static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
> +			    struct list_head *newcg_list)
> +{
> +	struct css_set *newcg;
> +	struct cg_list_entry *cg_entry;
> +
> +	/* ensure a new css_set will exist for this thread */
> +	newcg = find_css_set(cg, cgrp);
> +	if (!newcg)
> +		return -ENOMEM;
> +	/* add it to the list */
> +	cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
> +	if (!cg_entry) {
> +		put_css_set(newcg);
> +		return -ENOMEM;
> +	}
> +	cg_entry->cg = newcg;
> +	list_add(&cg_entry->links, newcg_list);
> +	return 0;
> +}
> +
> +/**
> + * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
> + * @cgrp: the cgroup to attach to
> + * @leader: the threadgroup leader task_struct of the group to be attached
> + *
> + * Call holding cgroup_mutex. Will take task_lock of each thread in leader's
> + * threadgroup individually in turn.
> + */
> +int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
> +{
> +	int retval;
> +	struct cgroup_subsys *ss, *failed_ss = NULL;
> +	struct cgroup *oldcgrp;
> +	struct css_set *oldcg;
> +	struct cgroupfs_root *root = cgrp->root;
> +	/* threadgroup list cursor */
> +	struct task_struct *tsk;
> +	/*
> +	 * we need to make sure we have css_sets for all the tasks we're
> +	 * going to move -before- we actually start moving them, so that in
> +	 * case we get an ENOMEM we can bail out before making any changes.
> +	 */
> +	struct list_head newcg_list;
> +	struct cg_list_entry *cg_entry, *temp_nobe;
> +
> +	/* check that we can legitimately attach to the cgroup. */
> +	for_each_subsys(root, ss) {
> +		if (ss->can_attach) {
> +			retval = ss->can_attach(ss, cgrp, leader, true);
> +			if (retval) {
> +				failed_ss = ss;
> +				goto out;
> +			}
> +		}
> +	}
> +
> +	/*
> +	 * step 1: make sure css_sets exist for all threads to be migrated.
> +	 * we use find_css_set, which allocates a new one if necessary.
> +	 */
> +	INIT_LIST_HEAD(&newcg_list);
> +	oldcgrp = task_cgroup_from_root(leader, root);
> +	if (cgrp != oldcgrp) {
> +		/* get old css_set */
> +		task_lock(leader);
> +		if (leader->flags & PF_EXITING) {
> +			task_unlock(leader);
> +			goto prefetch_loop;
> +		}
> +		oldcg = leader->cgroups;
> +		get_css_set(oldcg);
> +		task_unlock(leader);
> +		/* acquire new one */
> +		retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
> +		put_css_set(oldcg);
> +		if (retval)
> +			goto list_teardown;
> +	}
> +prefetch_loop:
> +	rcu_read_lock();
> +	/* sanity check - if we raced with de_thread, we must abort */
> +	if (!thread_group_leader(leader)) {
> +		retval = -EAGAIN;
> +		goto list_teardown;
> +	}
> +	/*
> +	 * if we need to fetch a new css_set for this task, we must exit the
> +	 * rcu_read section because allocating it can sleep. afterwards, we'll
> +	 * need to restart iteration on the threadgroup list - the whole thing
> +	 * will be O(nm) in the number of threads and css_sets; as the typical
> +	 * case has only one css_set for all of them, usually O(n). which ones
> +	 * we need allocated won't change as long as we hold cgroup_mutex.
> +	 */
> +	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
> +		/* nothing to do if this task is already in the cgroup */
> +		oldcgrp = task_cgroup_from_root(tsk, root);
> +		if (cgrp == oldcgrp)
> +			continue;
> +		/* get old css_set pointer */
> +		task_lock(tsk);
> +		if (tsk->flags & PF_EXITING) {
> +			/* ignore this task if it's going away */
> +			task_unlock(tsk);
> +			continue;
> +		}
> +		oldcg = tsk->cgroups;
> +		get_css_set(oldcg);
> +		task_unlock(tsk);
> +		/* see if the new one for us is already in the list? */
> +		if (css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list)) {
> +			/* was already there, nothing to do. */
> +			put_css_set(oldcg);
> +		} else {
> +			/* we don't already have it. get new one. */
> +			rcu_read_unlock();
> +			retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
> +			put_css_set(oldcg);
> +			if (retval)
> +				goto list_teardown;
> +			/* begin iteration again. */
> +			goto prefetch_loop;
> +		}
> +	}
> +	rcu_read_unlock();
> +
> +	/*
> +	 * step 2: now that we're guaranteed success wrt the css_sets, proceed
> +	 * to move all tasks to the new cgroup. we need to lock against possible
> +	 * races with fork(). note: we can safely take the threadgroup_fork_lock
> +	 * of leader since attach_task_by_pid took a reference.
> +	 * threadgroup_fork_lock must be taken outside of tasklist_lock to match
> +	 * the order in the fork path.
> +	 */
> +	threadgroup_fork_write_lock(leader);
> +	read_lock(&tasklist_lock);
> +	/* sanity check - if we raced with de_thread, we must abort */
> +	if (!thread_group_leader(leader)) {
> +		retval = -EAGAIN;
> +		read_unlock(&tasklist_lock);
> +		threadgroup_fork_write_unlock(leader);
> +		goto list_teardown;
> +	}
> +	/*
> +	 * No failure cases left, so this is the commit point.
> +	 *
> +	 * If the leader is already there, skip moving him. Note: even if the
> +	 * leader is PF_EXITING, we still move all other threads; if everybody
> +	 * is PF_EXITING, we end up doing nothing, which is ok.
> +	 */
> +	oldcgrp = task_cgroup_from_root(leader, root);
> +	if (cgrp != oldcgrp) {
> +		retval = cgroup_task_migrate(cgrp, oldcgrp, leader, true);
> +		BUG_ON(retval != 0 && retval != -ESRCH);
> +	}
> +	/* Now iterate over each thread in the group. */
> +	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {

Why can't we just use list_for_each_entry() here?  Unless I am confused
(quite possible!), we are not in an RCU read-side critical section.

> +		/* leave current thread as it is if it's already there */
> +		oldcgrp = task_cgroup_from_root(tsk, root);
> +		if (cgrp == oldcgrp)
> +			continue;
> +		/* we don't care whether these threads are exiting */
> +		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
> +		BUG_ON(retval != 0 && retval != -ESRCH);
> +	}
> +
> +	/*
> +	 * step 3: attach whole threadgroup to each subsystem
> +	 * TODO: if ever a subsystem needs to know the oldcgrp for each task
> +	 * being moved, this call will need to be reworked to communicate that.
> +	 */
> +	for_each_subsys(root, ss) {
> +		if (ss->attach)
> +			ss->attach(ss, cgrp, oldcgrp, leader, true);
> +	}
> +	/* holding these until here keeps us safe from exec() and fork(). */
> +	read_unlock(&tasklist_lock);
> +	threadgroup_fork_write_unlock(leader);
> +
> +	/*
> +	 * step 4: success! and cleanup
> +	 */
> +	synchronize_rcu();
> +	cgroup_wakeup_rmdir_waiter(cgrp);
> +	retval = 0;
> +list_teardown:
> +	/* clean up the list of prefetched css_sets. */
> +	list_for_each_entry_safe(cg_entry, temp_nobe, &newcg_list, links) {
> +		list_del(&cg_entry->links);
> +		put_css_set(cg_entry->cg);
> +		kfree(cg_entry);
> +	}
> +out:
> +	if (retval) {
> +		/* same deal as in cgroup_attach_task, with threadgroup=true */
> +		for_each_subsys(root, ss) {
> +			if (ss == failed_ss)
> +				break;
> +			if (ss->cancel_attach)
> +				ss->cancel_attach(ss, cgrp, leader, true);
> +		}
> +	}
> +	return retval;
> +}
> +
> +/*
> + * Find the task_struct of the task to attach by vpid and pass it along to the
> + * function to attach either it or all tasks in its threadgroup. Will take
> + * cgroup_mutex; may take task_lock of task.
> + */
> +static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
>  {
>  	struct task_struct *tsk;
>  	const struct cred *cred = current_cred(), *tcred;
>  	int ret;
> 
> +	if (!cgroup_lock_live_group(cgrp))
> +		return -ENODEV;
> +
>  	if (pid) {
>  		rcu_read_lock();
>  		tsk = find_task_by_vpid(pid);
> -		if (!tsk || tsk->flags & PF_EXITING) {
> +		if (!tsk) {
> +			rcu_read_unlock();
> +			cgroup_unlock();
> +			return -ESRCH;
> +		}
> +		if (threadgroup) {
> +			/*
> +			 * it is safe to find group_leader because tsk was found
> +			 * in the tid map, meaning it can't have been unhashed
> +			 * by someone in de_thread changing the leadership.
> +			 */
> +			tsk = tsk->group_leader;
> +			BUG_ON(!thread_group_leader(tsk));
> +		} else if (tsk->flags & PF_EXITING) {
> +			/* optimization for the single-task-only case */
>  			rcu_read_unlock();
> +			cgroup_unlock();
>  			return -ESRCH;
>  		}
> 
> +		/*
> +		 * even if we're attaching all tasks in the thread group, we
> +		 * only need to check permissions on one of them.
> +		 */
>  		tcred = __task_cred(tsk);
>  		if (cred->euid &&
>  		    cred->euid != tcred->uid &&
>  		    cred->euid != tcred->suid) {
>  			rcu_read_unlock();
> +			cgroup_unlock();
>  			return -EACCES;
>  		}
>  		get_task_struct(tsk);
>  		rcu_read_unlock();
>  	} else {
> -		tsk = current;
> +		if (threadgroup)
> +			tsk = current->group_leader;
> +		else
> +			tsk = current;
>  		get_task_struct(tsk);
>  	}
> 
> -	ret = cgroup_attach_task(cgrp, tsk);
> +	if (threadgroup)
> +		ret = cgroup_attach_proc(cgrp, tsk);
> +	else
> +		ret = cgroup_attach_task(cgrp, tsk);
>  	put_task_struct(tsk);
> +	cgroup_unlock();
>  	return ret;
>  }
> 
>  static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
>  {
> +	return attach_task_by_pid(cgrp, pid, false);
> +}
> +
> +static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
> +{
>  	int ret;
> -	if (!cgroup_lock_live_group(cgrp))
> -		return -ENODEV;
> -	ret = attach_task_by_pid(cgrp, pid);
> -	cgroup_unlock();
> +	do {
> +		/*
> +		 * attach_proc fails with -EAGAIN if threadgroup leadership
> +		 * changes in the middle of the operation, in which case we need
> +		 * to find the task_struct for the new leader and start over.
> +		 */
> +		ret = attach_task_by_pid(cgrp, tgid, true);
> +	} while (ret == -EAGAIN);
>  	return ret;
>  }
> 
> @@ -3294,9 +3622,9 @@ static struct cftype files[] = {
>  	{
>  		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
>  		.open = cgroup_procs_open,
> -		/* .write_u64 = cgroup_procs_write, TODO */
> +		.write_u64 = cgroup_procs_write,
>  		.release = cgroup_pidlist_release,
> -		.mode = S_IRUGO,
> +		.mode = S_IRUGO | S_IWUSR,
>  	},
>  	{
>  		.name = "notify_on_release",
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v6 3/3] cgroups: make procs file writable
  2010-12-24  8:24             ` Ben Blum
  (?)
@ 2011-01-12 23:26             ` Paul E. McKenney
  -1 siblings, 0 replies; 185+ messages in thread
From: Paul E. McKenney @ 2011-01-12 23:26 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage, oleg

On Fri, Dec 24, 2010 at 03:24:45AM -0500, Ben Blum wrote:
> Makes procs file writable to move all threads by tgid at once
> 
> From: Ben Blum <bblum@andrew.cmu.edu>
> 
> This patch adds functionality that enables users to move all threads in a
> threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
> file. This current implementation makes use of a per-threadgroup rwsem that's
> taken for reading in the fork() path to prevent newly forking threads within
> the threadgroup from "escaping" while the move is in progress.

One minor nit below.

							Thanx, Paul

> Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
> ---
>  Documentation/cgroups/cgroups.txt |   13 +
>  kernel/cgroup.c                   |  424 +++++++++++++++++++++++++++++++++----
>  2 files changed, 387 insertions(+), 50 deletions(-)
> 
> diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
> index 190018b..07674e5 100644
> --- a/Documentation/cgroups/cgroups.txt
> +++ b/Documentation/cgroups/cgroups.txt
> @@ -236,7 +236,8 @@ containing the following files describing that cgroup:
>   - cgroup.procs: list of tgids in the cgroup.  This list is not
>     guaranteed to be sorted or free of duplicate tgids, and userspace
>     should sort/uniquify the list if this property is required.
> -   This is a read-only file, for now.
> +   Writing a thread group id into this file moves all threads in that
> +   group into this cgroup.
>   - notify_on_release flag: run the release agent on exit?
>   - release_agent: the path to use for release notifications (this file
>     exists in the top cgroup only)
> @@ -426,6 +427,12 @@ You can attach the current shell task by echoing 0:
> 
>  # echo 0 > tasks
> 
> +You can use the cgroup.procs file instead of the tasks file to move all
> +threads in a threadgroup at once. Echoing the pid of any task in a
> +threadgroup to cgroup.procs causes all tasks in that threadgroup to be
> +attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
> +in the writing task's threadgroup.
> +
>  2.3 Mounting hierarchies by name
>  --------------------------------
> 
> @@ -574,7 +581,9 @@ called on a fork. If this method returns 0 (success) then this should
>  remain valid while the caller holds cgroup_mutex and it is ensured that either
>  attach() or cancel_attach() will be called in future. If threadgroup is
>  true, then a successful result indicates that all threads in the given
> -thread's threadgroup can be moved together.
> +thread's threadgroup can be moved together. If the subsystem wants to
> +iterate over task->thread_group, it must take rcu_read_lock then check
> +if thread_group_leader(task), returning -EAGAIN if that fails.
> 
>  void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
>  	       struct task_struct *task, bool threadgroup)
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index f86dd9c..74be02c 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1771,6 +1771,76 @@ int cgroup_can_attach_per_thread(struct cgroup *cgrp, struct task_struct *task,
>  }
>  EXPORT_SYMBOL_GPL(cgroup_can_attach_per_thread);
> 
> +/*
> + * cgroup_task_migrate - move a task from one cgroup to another.
> + *
> + * 'guarantee' is set if the caller promises that a new css_set for the task
> + * will already exit. If not set, this function might sleep, and can fail with
> + * -ENOMEM. Otherwise, it can only fail with -ESRCH.
> + */
> +static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
> +			       struct task_struct *tsk, bool guarantee)
> +{
> +	struct css_set *oldcg;
> +	struct css_set *newcg;
> +
> +	/*
> +	 * get old css_set. we need to take task_lock and refcount it, because
> +	 * an exiting task can change its css_set to init_css_set and drop its
> +	 * old one without taking cgroup_mutex.
> +	 */
> +	task_lock(tsk);
> +	oldcg = tsk->cgroups;
> +	get_css_set(oldcg);
> +	task_unlock(tsk);
> +
> +	/* locate or allocate a new css_set for this task. */
> +	if (guarantee) {
> +		/* we know the css_set we want already exists. */
> +		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
> +		read_lock(&css_set_lock);
> +		newcg = find_existing_css_set(oldcg, cgrp, template);
> +		BUG_ON(!newcg);
> +		get_css_set(newcg);
> +		read_unlock(&css_set_lock);
> +	} else {
> +		might_sleep();
> +		/* find_css_set will give us newcg already referenced. */
> +		newcg = find_css_set(oldcg, cgrp);
> +		if (!newcg) {
> +			put_css_set(oldcg);
> +			return -ENOMEM;
> +		}
> +	}
> +	put_css_set(oldcg);
> +
> +	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
> +	task_lock(tsk);
> +	if (tsk->flags & PF_EXITING) {
> +		task_unlock(tsk);
> +		put_css_set(newcg);
> +		return -ESRCH;
> +	}
> +	rcu_assign_pointer(tsk->cgroups, newcg);
> +	task_unlock(tsk);
> +
> +	/* Update the css_set linked lists if we're using them */
> +	write_lock(&css_set_lock);
> +	if (!list_empty(&tsk->cg_list))
> +		list_move(&tsk->cg_list, &newcg->tasks);
> +	write_unlock(&css_set_lock);
> +
> +	/*
> +	 * We just gained a reference on oldcg by taking it from the task. As
> +	 * trading it for newcg is protected by cgroup_mutex, we're safe to drop
> +	 * it here; it will be freed under RCU.
> +	 */
> +	put_css_set(oldcg);
> +
> +	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
> +	return 0;
> +}
> +
>  /**
>   * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
>   * @cgrp: the cgroup the task is attaching to
> @@ -1781,11 +1851,9 @@ EXPORT_SYMBOL_GPL(cgroup_can_attach_per_thread);
>   */
>  int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  {
> -	int retval = 0;
> +	int retval;
>  	struct cgroup_subsys *ss, *failed_ss = NULL;
>  	struct cgroup *oldcgrp;
> -	struct css_set *cg;
> -	struct css_set *newcg;
>  	struct cgroupfs_root *root = cgrp->root;
> 
>  	/* Nothing to do if the task is already in that cgroup */
> @@ -1809,46 +1877,16 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  		}
>  	}
> 
> -	task_lock(tsk);
> -	cg = tsk->cgroups;
> -	get_css_set(cg);
> -	task_unlock(tsk);
> -	/*
> -	 * Locate or allocate a new css_set for this task,
> -	 * based on its final set of cgroups
> -	 */
> -	newcg = find_css_set(cg, cgrp);
> -	put_css_set(cg);
> -	if (!newcg) {
> -		retval = -ENOMEM;
> -		goto out;
> -	}
> -
> -	task_lock(tsk);
> -	if (tsk->flags & PF_EXITING) {
> -		task_unlock(tsk);
> -		put_css_set(newcg);
> -		retval = -ESRCH;
> +	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, false);
> +	if (retval)
>  		goto out;
> -	}
> -	rcu_assign_pointer(tsk->cgroups, newcg);
> -	task_unlock(tsk);
> -
> -	/* Update the css_set linked lists if we're using them */
> -	write_lock(&css_set_lock);
> -	if (!list_empty(&tsk->cg_list)) {
> -		list_del(&tsk->cg_list);
> -		list_add(&tsk->cg_list, &newcg->tasks);
> -	}
> -	write_unlock(&css_set_lock);
> 
>  	for_each_subsys(root, ss) {
>  		if (ss->attach)
>  			ss->attach(ss, cgrp, oldcgrp, tsk, false);
>  	}
> -	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
> +
>  	synchronize_rcu();
> -	put_css_set(cg);
> 
>  	/*
>  	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
> @@ -1898,49 +1936,339 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
>  EXPORT_SYMBOL_GPL(cgroup_attach_task_all);
> 
>  /*
> - * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
> - * held. May take task_lock of task
> + * cgroup_attach_proc works in two stages, the first of which prefetches all
> + * new css_sets needed (to make sure we have enough memory before committing
> + * to the move) and stores them in a list of entries of the following type.
> + * TODO: possible optimization: use css_set->rcu_head for chaining instead
> + */
> +struct cg_list_entry {
> +	struct css_set *cg;
> +	struct list_head links;
> +};
> +
> +static bool css_set_check_fetched(struct cgroup *cgrp,
> +				  struct task_struct *tsk, struct css_set *cg,
> +				  struct list_head *newcg_list)
> +{
> +	struct css_set *newcg;
> +	struct cg_list_entry *cg_entry;
> +	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
> +
> +	read_lock(&css_set_lock);
> +	newcg = find_existing_css_set(cg, cgrp, template);
> +	if (newcg)
> +		get_css_set(newcg);
> +	read_unlock(&css_set_lock);
> +
> +	/* doesn't exist at all? */
> +	if (!newcg)
> +		return false;
> +	/* see if it's already in the list */
> +	list_for_each_entry(cg_entry, newcg_list, links) {
> +		if (cg_entry->cg == newcg) {
> +			put_css_set(newcg);
> +			return true;
> +		}
> +	}
> +
> +	/* not found */
> +	put_css_set(newcg);
> +	return false;
> +}
> +
> +/*
> + * Find the new css_set and store it in the list in preparation for moving the
> + * given task to the given cgroup. Returns 0 or -ENOMEM.
>   */
> -static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
> +static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
> +			    struct list_head *newcg_list)
> +{
> +	struct css_set *newcg;
> +	struct cg_list_entry *cg_entry;
> +
> +	/* ensure a new css_set will exist for this thread */
> +	newcg = find_css_set(cg, cgrp);
> +	if (!newcg)
> +		return -ENOMEM;
> +	/* add it to the list */
> +	cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
> +	if (!cg_entry) {
> +		put_css_set(newcg);
> +		return -ENOMEM;
> +	}
> +	cg_entry->cg = newcg;
> +	list_add(&cg_entry->links, newcg_list);
> +	return 0;
> +}
> +
> +/**
> + * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
> + * @cgrp: the cgroup to attach to
> + * @leader: the threadgroup leader task_struct of the group to be attached
> + *
> + * Call holding cgroup_mutex. Will take task_lock of each thread in leader's
> + * threadgroup individually in turn.
> + */
> +int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
> +{
> +	int retval;
> +	struct cgroup_subsys *ss, *failed_ss = NULL;
> +	struct cgroup *oldcgrp;
> +	struct css_set *oldcg;
> +	struct cgroupfs_root *root = cgrp->root;
> +	/* threadgroup list cursor */
> +	struct task_struct *tsk;
> +	/*
> +	 * we need to make sure we have css_sets for all the tasks we're
> +	 * going to move -before- we actually start moving them, so that in
> +	 * case we get an ENOMEM we can bail out before making any changes.
> +	 */
> +	struct list_head newcg_list;
> +	struct cg_list_entry *cg_entry, *temp_nobe;
> +
> +	/* check that we can legitimately attach to the cgroup. */
> +	for_each_subsys(root, ss) {
> +		if (ss->can_attach) {
> +			retval = ss->can_attach(ss, cgrp, leader, true);
> +			if (retval) {
> +				failed_ss = ss;
> +				goto out;
> +			}
> +		}
> +	}
> +
> +	/*
> +	 * step 1: make sure css_sets exist for all threads to be migrated.
> +	 * we use find_css_set, which allocates a new one if necessary.
> +	 */
> +	INIT_LIST_HEAD(&newcg_list);
> +	oldcgrp = task_cgroup_from_root(leader, root);
> +	if (cgrp != oldcgrp) {
> +		/* get old css_set */
> +		task_lock(leader);
> +		if (leader->flags & PF_EXITING) {
> +			task_unlock(leader);
> +			goto prefetch_loop;
> +		}
> +		oldcg = leader->cgroups;
> +		get_css_set(oldcg);
> +		task_unlock(leader);
> +		/* acquire new one */
> +		retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
> +		put_css_set(oldcg);
> +		if (retval)
> +			goto list_teardown;
> +	}
> +prefetch_loop:
> +	rcu_read_lock();
> +	/* sanity check - if we raced with de_thread, we must abort */
> +	if (!thread_group_leader(leader)) {
> +		retval = -EAGAIN;
> +		goto list_teardown;
> +	}
> +	/*
> +	 * if we need to fetch a new css_set for this task, we must exit the
> +	 * rcu_read section because allocating it can sleep. afterwards, we'll
> +	 * need to restart iteration on the threadgroup list - the whole thing
> +	 * will be O(nm) in the number of threads and css_sets; as the typical
> +	 * case has only one css_set for all of them, usually O(n). which ones
> +	 * we need allocated won't change as long as we hold cgroup_mutex.
> +	 */
> +	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
> +		/* nothing to do if this task is already in the cgroup */
> +		oldcgrp = task_cgroup_from_root(tsk, root);
> +		if (cgrp == oldcgrp)
> +			continue;
> +		/* get old css_set pointer */
> +		task_lock(tsk);
> +		if (tsk->flags & PF_EXITING) {
> +			/* ignore this task if it's going away */
> +			task_unlock(tsk);
> +			continue;
> +		}
> +		oldcg = tsk->cgroups;
> +		get_css_set(oldcg);
> +		task_unlock(tsk);
> +		/* see if the css_set we need is already in the prefetched list */
> +		if (css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list)) {
> +			/* was already there, nothing to do. */
> +			put_css_set(oldcg);
> +		} else {
> +			/* we don't already have it. get new one. */
> +			rcu_read_unlock();
> +			retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
> +			put_css_set(oldcg);
> +			if (retval)
> +				goto list_teardown;
> +			/* begin iteration again. */
> +			goto prefetch_loop;
> +		}
> +	}
> +	rcu_read_unlock();
> +
> +	/*
> +	 * step 2: now that we're guaranteed success wrt the css_sets, proceed
> +	 * to move all tasks to the new cgroup. we need to lock against possible
> +	 * races with fork(). note: we can safely take the threadgroup_fork_lock
> +	 * of leader since attach_task_by_pid took a reference.
> +	 * threadgroup_fork_lock must be taken outside of tasklist_lock to match
> +	 * the order in the fork path.
> +	 */
> +	threadgroup_fork_write_lock(leader);
> +	read_lock(&tasklist_lock);
> +	/* sanity check - if we raced with de_thread, we must abort */
> +	if (!thread_group_leader(leader)) {
> +		retval = -EAGAIN;
> +		read_unlock(&tasklist_lock);
> +		threadgroup_fork_write_unlock(leader);
> +		goto list_teardown;
> +	}
> +	/*
> +	 * No failure cases left, so this is the commit point.
> +	 *
> +	 * If the leader is already there, skip moving him. Note: even if the
> +	 * leader is PF_EXITING, we still move all other threads; if everybody
> +	 * is PF_EXITING, we end up doing nothing, which is ok.
> +	 */
> +	oldcgrp = task_cgroup_from_root(leader, root);
> +	if (cgrp != oldcgrp) {
> +		retval = cgroup_task_migrate(cgrp, oldcgrp, leader, true);
> +		BUG_ON(retval != 0 && retval != -ESRCH);
> +	}
> +	/* Now iterate over each thread in the group. */
> +	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {

Why can't we just use list_for_each_entry() here?  Unless I am confused
(quite possible!), we are not in an RCU read-side critical section.
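
If so, the plain variant would presumably be enough here, since
read_lock(&tasklist_lock) is held around this loop; just a sketch of the
suggestion, not a tested change:

        list_for_each_entry(tsk, &leader->thread_group, thread_group) {
                /* ... same loop body as below ... */
        }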

> +		/* leave current thread as it is if it's already there */
> +		oldcgrp = task_cgroup_from_root(tsk, root);
> +		if (cgrp == oldcgrp)
> +			continue;
> +		/* we don't care whether these threads are exiting */
> +		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
> +		BUG_ON(retval != 0 && retval != -ESRCH);
> +	}
> +
> +	/*
> +	 * step 3: attach whole threadgroup to each subsystem
> +	 * TODO: if ever a subsystem needs to know the oldcgrp for each task
> +	 * being moved, this call will need to be reworked to communicate that.
> +	 */
> +	for_each_subsys(root, ss) {
> +		if (ss->attach)
> +			ss->attach(ss, cgrp, oldcgrp, leader, true);
> +	}
> +	/* holding these until here keeps us safe from exec() and fork(). */
> +	read_unlock(&tasklist_lock);
> +	threadgroup_fork_write_unlock(leader);
> +
> +	/*
> +	 * step 4: success! and cleanup
> +	 */
> +	synchronize_rcu();
> +	cgroup_wakeup_rmdir_waiter(cgrp);
> +	retval = 0;
> +list_teardown:
> +	/* clean up the list of prefetched css_sets. */
> +	list_for_each_entry_safe(cg_entry, temp_nobe, &newcg_list, links) {
> +		list_del(&cg_entry->links);
> +		put_css_set(cg_entry->cg);
> +		kfree(cg_entry);
> +	}
> +out:
> +	if (retval) {
> +		/* same deal as in cgroup_attach_task, with threadgroup=true */
> +		for_each_subsys(root, ss) {
> +			if (ss == failed_ss)
> +				break;
> +			if (ss->cancel_attach)
> +				ss->cancel_attach(ss, cgrp, leader, true);
> +		}
> +	}
> +	return retval;
> +}
> +
> +/*
> + * Find the task_struct of the task to attach by vpid and pass it along to the
> + * function to attach either it or all tasks in its threadgroup. Will take
> + * cgroup_mutex; may take task_lock of task.
> + */
> +static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
>  {
>  	struct task_struct *tsk;
>  	const struct cred *cred = current_cred(), *tcred;
>  	int ret;
> 
> +	if (!cgroup_lock_live_group(cgrp))
> +		return -ENODEV;
> +
>  	if (pid) {
>  		rcu_read_lock();
>  		tsk = find_task_by_vpid(pid);
> -		if (!tsk || tsk->flags & PF_EXITING) {
> +		if (!tsk) {
> +			rcu_read_unlock();
> +			cgroup_unlock();
> +			return -ESRCH;
> +		}
> +		if (threadgroup) {
> +			/*
> +			 * it is safe to find group_leader because tsk was found
> +			 * in the tid map, meaning it can't have been unhashed
> +			 * by someone in de_thread changing the leadership.
> +			 */
> +			tsk = tsk->group_leader;
> +			BUG_ON(!thread_group_leader(tsk));
> +		} else if (tsk->flags & PF_EXITING) {
> +			/* optimization for the single-task-only case */
>  			rcu_read_unlock();
> +			cgroup_unlock();
>  			return -ESRCH;
>  		}
> 
> +		/*
> +		 * even if we're attaching all tasks in the thread group, we
> +		 * only need to check permissions on one of them.
> +		 */
>  		tcred = __task_cred(tsk);
>  		if (cred->euid &&
>  		    cred->euid != tcred->uid &&
>  		    cred->euid != tcred->suid) {
>  			rcu_read_unlock();
> +			cgroup_unlock();
>  			return -EACCES;
>  		}
>  		get_task_struct(tsk);
>  		rcu_read_unlock();
>  	} else {
> -		tsk = current;
> +		if (threadgroup)
> +			tsk = current->group_leader;
> +		else
> +			tsk = current;
>  		get_task_struct(tsk);
>  	}
> 
> -	ret = cgroup_attach_task(cgrp, tsk);
> +	if (threadgroup)
> +		ret = cgroup_attach_proc(cgrp, tsk);
> +	else
> +		ret = cgroup_attach_task(cgrp, tsk);
>  	put_task_struct(tsk);
> +	cgroup_unlock();
>  	return ret;
>  }
> 
>  static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
>  {
> +	return attach_task_by_pid(cgrp, pid, false);
> +}
> +
> +static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
> +{
>  	int ret;
> -	if (!cgroup_lock_live_group(cgrp))
> -		return -ENODEV;
> -	ret = attach_task_by_pid(cgrp, pid);
> -	cgroup_unlock();
> +	do {
> +		/*
> +		 * attach_proc fails with -EAGAIN if threadgroup leadership
> +		 * changes in the middle of the operation, in which case we need
> +		 * to find the task_struct for the new leader and start over.
> +		 */
> +		ret = attach_task_by_pid(cgrp, tgid, true);
> +	} while (ret == -EAGAIN);
>  	return ret;
>  }
> 
> @@ -3294,9 +3622,9 @@ static struct cftype files[] = {
>  	{
>  		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
>  		.open = cgroup_procs_open,
> -		/* .write_u64 = cgroup_procs_write, TODO */
> +		.write_u64 = cgroup_procs_write,
>  		.release = cgroup_pidlist_release,
> -		.mode = S_IRUGO,
> +		.mode = S_IRUGO | S_IWUSR,
>  	},
>  	{
>  		.name = "notify_on_release",
> --

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v7 2/3] cgroups: add atomic-context per-thread subsystem callbacks
  2010-12-26 12:11           ` [PATCH v7 2/3] cgroups: add atomic-context per-thread subsystem callbacks Ben Blum
       [not found]             ` <20101226121100.GC28529-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2011-01-24  8:38             ` Paul Menage
  2011-01-24 15:32               ` Ben Blum
       [not found]               ` <AANLkTimytfrDnr_5SzBUFQu0SaGdAWDC0p38hiFiHrtU-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 2 replies; 185+ messages in thread
From: Paul Menage @ 2011-01-24  8:38 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, oleg,
	David Rientjes, Miao Xie

[-- Attachment #1: Type: text/plain, Size: 905 bytes --]

Hi Ben,

Finally finding a moment to actually look at these patches. Sorry it's
been a while. Can you send the patches inline rather than as
attachments in future?

Reviewed-by: Paul Menage <menage@google.com>

This patch looks fine, although I think that freezer_can_attach_task()
could be simplified to:

static int freezer_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
{
  if (__cgroup_freezing_or_frozen(tsk))
    return -EBUSY;
  return 0;
}

since we guarantee that rcu_read_lock() is held across this call.

There appears to be a tiny bit of rot in kernel/cpu.c (due to the
addition of the exit() callback) and memcontrol.c (due to some changes
at the start of mem_cgroup_move_task()), but neither impacts actual
code.

I think that before actually pushing to mainline, we'll need to sort
out the cpuset mempolicy yielding issue, since that could be a
user-visible API change.


Paul
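
For anyone skimming the interface change, here is a minimal sketch of how a
subsystem would wire up the new per-thread callbacks (purely illustrative; the
"foo" subsystem and its policy are made up, and the usual create/destroy/
subsys_id boilerplate is omitted):

        /* May sleep; called once per attach operation, cgroup_mutex held. */
        static int foo_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
                                  struct task_struct *task)
        {
                return 0;
        }

        /* Called once per thread, possibly under rcu_read_lock; must not sleep. */
        static int foo_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
        {
                return 0;       /* quick, non-sleeping checks only */
        }

        /* Per-thread commit work; possibly under tasklist_lock, must not sleep. */
        static void foo_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
        {
        }

        struct cgroup_subsys foo_subsys = {
                .name            = "foo",
                .can_attach      = foo_can_attach,
                .can_attach_task = foo_can_attach_task,
                .attach_task     = foo_attach_task,
                /* .pre_attach / .attach for sleeping, once-per-attach work */
        };

The split mirrors the subsystems converted in the patch below (e.g. blkio and
cpu), which now do their per-thread work in attach_task.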

[-- Attachment #2: cgroup-subsys-task-callbacks.patch --]
[-- Type: text/plain, Size: 22352 bytes --]

Add cgroup subsystem callbacks for per-thread attachment in atomic contexts

From: Ben Blum <bblum@andrew.cmu.edu>

This patch adds can_attach_task, pre_attach, and attach_task as new callbacks
for the cgroups subsystem interface. Unlike can_attach and attach, these are
per-thread operations, potentially called many times when attaching an entire
threadgroup, and they may run under rcu_read_lock or tasklist_lock, so they
are for quick, non-sleeping operations only.

Also, the old "bool threadgroup" interface is removed, as replaced by this.
All subsystems are modified for the new interface - of note is cpuset, which
requires from/to nodemasks for attach to be globally scoped (though per-cpuset
would work too) to persist from its pre_attach to attach_task and attach.

This is a pre-patch for cgroup-procs-writable.patch.
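
For reference, the intended callback order during an attach, as described by
the documentation changes below (a sketch of the expected sequence for one
subsystem, not literal code):

        can_attach(ss, cgrp, task)              /* once per attach; may sleep */
        can_attach_task(cgrp, tsk)              /* per thread; must not sleep */
        /* ...no more failure cases past this point... */
        pre_attach(cgrp)                        /* once, before per-thread work */
        attach_task(cgrp, tsk)                  /* per thread; must not sleep */
        attach(ss, cgrp, old_cgrp, task)        /* once per attach; may sleep */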

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
---
 Documentation/cgroups/cgroups.txt |   35 ++++++++---
 Documentation/cgroups/cpusets.txt |    9 +++
 block/blk-cgroup.c                |   18 ++----
 include/linux/cgroup.h            |   10 ++-
 kernel/cgroup.c                   |   17 ++++-
 kernel/cgroup_freezer.c           |   27 ++++-----
 kernel/cpuset.c                   |  116 +++++++++++++++++++------------------
 kernel/ns_cgroup.c                |   23 +++----
 kernel/sched.c                    |   38 +-----------
 mm/memcontrol.c                   |   18 ++----
 security/device_cgroup.c          |    3 -
 11 files changed, 149 insertions(+), 165 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 190018b..341ed44 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -563,7 +563,7 @@ rmdir() will fail with it. From this behavior, pre_destroy() can be
 called multiple times against a cgroup.
 
 int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-	       struct task_struct *task, bool threadgroup)
+	       struct task_struct *task)
 (cgroup_mutex held by caller)
 
 Called prior to moving a task into a cgroup; if the subsystem
@@ -572,9 +572,15 @@ task is passed, then a successful result indicates that *any*
 unspecified task can be moved into the cgroup. Note that this isn't
 called on a fork. If this method returns 0 (success) then this should
 remain valid while the caller holds cgroup_mutex and it is ensured that either
-attach() or cancel_attach() will be called in future. If threadgroup is
-true, then a successful result indicates that all threads in the given
-thread's threadgroup can be moved together.
+attach() or cancel_attach() will be called in future.
+
+int can_attach_task(struct cgroup *cgrp, struct task_struct *tsk);
+(cgroup_mutex and rcu_read_lock held by caller)
+
+As can_attach, but for operations that must be run once per task to be
+attached (possibly many times when using cgroup_attach_proc). This may run
+in an rcu_read-side critical section, so sleeping is not permitted. Expensive
+operations, such as dealing with the shared mm, should run in can_attach.
 
 void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	       struct task_struct *task, bool threadgroup)
@@ -587,15 +593,26 @@ This will be called only about subsystems whose can_attach() operation have
 succeeded.
 
 void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-	    struct cgroup *old_cgrp, struct task_struct *task,
-	    bool threadgroup)
+	    struct cgroup *old_cgrp, struct task_struct *task)
 (cgroup_mutex held by caller)
 
 Called after the task has been attached to the cgroup, to allow any
 post-attachment activity that requires memory allocations or blocking.
-If threadgroup is true, the subsystem should take care of all threads
-in the specified thread's threadgroup. Currently does not support any
-subsystem that might need the old_cgrp for every thread in the group.
+
+void pre_attach(struct cgroup *cgrp);
+(cgroup_mutex and tasklist_lock held by caller)
+
+See description of attach_task.
+
+void attach_task(struct cgroup *cgrp, struct task_struct *tsk);
+(cgroup_mutex and possibly tasklist_lock held by caller)
+
+As attach, but for operations that must be run once per task to be attached,
+like can_attach_task. Sometimes called with tasklist_lock taken for reading,
+so may not sleep. Currently does not support any subsystem that might need the
+old_cgrp for every thread in the group. Note: unlike can_attach_task, this
+runs before attach, so use pre_attach for non-per-thread operations that must
+happen before attach_task.
 
 void fork(struct cgroup_subsy *ss, struct task_struct *task)
 
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index 5d0d569..1f0868d 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -659,6 +659,15 @@ the current task's cpuset, then we relax the cpuset, and look for
 memory anywhere we can find it.  It's better to violate the cpuset
 than stress the kernel.
 
+There is a third exception to the above.  When using the cgroup.procs file
+to move all tasks in a threadgroup at once, the per-task attachment code
+must run in an atomic context, but as currently implemented, changing the
+nodemasks for a task's memory policy may need to deschedule.  So, in this
+case, the best cpusets can do is change the nodemask for the threadgroup
+leader when attaching.  Thus, a multithreaded mempolicy user should first
+use cgroup.procs (for correctness), and then also use the tasks file for
+each thread in the group to ensure the nodemasks of all of them get updated.
+
 To start a new job that is to be contained within a cpuset, the steps are:
 
  1) mkdir /dev/cpuset
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b1febd0..45b3809 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -30,10 +30,8 @@ EXPORT_SYMBOL_GPL(blkio_root_cgroup);
 
 static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
 						  struct cgroup *);
-static int blkiocg_can_attach(struct cgroup_subsys *, struct cgroup *,
-			      struct task_struct *, bool);
-static void blkiocg_attach(struct cgroup_subsys *, struct cgroup *,
-			   struct cgroup *, struct task_struct *, bool);
+static int blkiocg_can_attach_task(struct cgroup *, struct task_struct *);
+static void blkiocg_attach_task(struct cgroup *, struct task_struct *);
 static void blkiocg_destroy(struct cgroup_subsys *, struct cgroup *);
 static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
 
@@ -46,8 +44,8 @@ static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
 struct cgroup_subsys blkio_subsys = {
 	.name = "blkio",
 	.create = blkiocg_create,
-	.can_attach = blkiocg_can_attach,
-	.attach = blkiocg_attach,
+	.can_attach_task = blkiocg_can_attach_task,
+	.attach_task = blkiocg_attach_task,
 	.destroy = blkiocg_destroy,
 	.populate = blkiocg_populate,
 #ifdef CONFIG_BLK_CGROUP
@@ -1475,9 +1473,7 @@ done:
  * of the main cic data structures.  For now we allow a task to change
  * its cgroup only if it's the only owner of its ioc.
  */
-static int blkiocg_can_attach(struct cgroup_subsys *subsys,
-				struct cgroup *cgroup, struct task_struct *tsk,
-				bool threadgroup)
+static int blkiocg_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 	struct io_context *ioc;
 	int ret = 0;
@@ -1492,9 +1488,7 @@ static int blkiocg_can_attach(struct cgroup_subsys *subsys,
 	return ret;
 }
 
-static void blkiocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
-				struct cgroup *prev, struct task_struct *tsk,
-				bool threadgroup)
+static void blkiocg_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 	struct io_context *ioc;
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ce104e3..35b69b4 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -467,12 +467,14 @@ struct cgroup_subsys {
 	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
-			  struct task_struct *tsk, bool threadgroup);
+			  struct task_struct *tsk);
+	int (*can_attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
 	void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
-			  struct task_struct *tsk, bool threadgroup);
+			      struct task_struct *tsk);
+	void (*pre_attach)(struct cgroup *cgrp);
+	void (*attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
-			struct cgroup *old_cgrp, struct task_struct *tsk,
-			bool threadgroup);
+		       struct cgroup *old_cgrp, struct task_struct *tsk);
 	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
 	void (*exit)(struct cgroup_subsys *ss, struct task_struct *task);
 	int (*populate)(struct cgroup_subsys *ss,
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 66a416b..616f27a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1750,7 +1750,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 
 	for_each_subsys(root, ss) {
 		if (ss->can_attach) {
-			retval = ss->can_attach(ss, cgrp, tsk, false);
+			retval = ss->can_attach(ss, cgrp, tsk);
 			if (retval) {
 				/*
 				 * Remember on which subsystem the can_attach()
@@ -1762,6 +1762,13 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 				goto out;
 			}
 		}
+		if (ss->can_attach_task) {
+			retval = ss->can_attach_task(cgrp, tsk);
+			if (retval) {
+				failed_ss = ss;
+				goto out;
+			}
+		}
 	}
 
 	task_lock(tsk);
@@ -1798,8 +1805,12 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	write_unlock(&css_set_lock);
 
 	for_each_subsys(root, ss) {
+		if (ss->pre_attach)
+			ss->pre_attach(cgrp);
+		if (ss->attach_task)
+			ss->attach_task(cgrp, tsk);
 		if (ss->attach)
-			ss->attach(ss, cgrp, oldcgrp, tsk, false);
+			ss->attach(ss, cgrp, oldcgrp, tsk);
 	}
 	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
 	synchronize_rcu();
@@ -1822,7 +1833,7 @@ out:
 				 */
 				break;
 			if (ss->cancel_attach)
-				ss->cancel_attach(ss, cgrp, tsk, false);
+				ss->cancel_attach(ss, cgrp, tsk);
 		}
 	}
 	return retval;
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index e7bebb7..e6ee70c 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -160,7 +160,7 @@ static void freezer_destroy(struct cgroup_subsys *ss,
  */
 static int freezer_can_attach(struct cgroup_subsys *ss,
 			      struct cgroup *new_cgroup,
-			      struct task_struct *task, bool threadgroup)
+			      struct task_struct *task)
 {
 	struct freezer *freezer;
 
@@ -172,26 +172,18 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 	if (freezer->state != CGROUP_THAWED)
 		return -EBUSY;
 
+	return 0;
+}
+
+static int freezer_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+{
+	/* rcu_read_lock allows recursive locking */
 	rcu_read_lock();
-	if (__cgroup_freezing_or_frozen(task)) {
+	if (__cgroup_freezing_or_frozen(tsk)) {
 		rcu_read_unlock();
 		return -EBUSY;
 	}
 	rcu_read_unlock();
-
-	if (threadgroup) {
-		struct task_struct *c;
-
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
-			if (__cgroup_freezing_or_frozen(c)) {
-				rcu_read_unlock();
-				return -EBUSY;
-			}
-		}
-		rcu_read_unlock();
-	}
-
 	return 0;
 }
 
@@ -390,6 +382,9 @@ struct cgroup_subsys freezer_subsys = {
 	.populate	= freezer_populate,
 	.subsys_id	= freezer_subsys_id,
 	.can_attach	= freezer_can_attach,
+	.can_attach_task = freezer_can_attach_task,
+	.pre_attach	= NULL,
+	.attach_task	= NULL,
 	.attach		= NULL,
 	.fork		= freezer_fork,
 	.exit		= NULL,
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 4349935..b9fce80 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1372,14 +1372,10 @@ static int fmeter_getrate(struct fmeter *fmp)
 	return val;
 }
 
-/* Protected by cgroup_lock */
-static cpumask_var_t cpus_attach;
-
 /* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
 static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
-			     struct task_struct *tsk, bool threadgroup)
+			     struct task_struct *tsk)
 {
-	int ret;
 	struct cpuset *cs = cgroup_cs(cont);
 
 	if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
@@ -1396,29 +1392,42 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
 	if (tsk->flags & PF_THREAD_BOUND)
 		return -EINVAL;
 
-	ret = security_task_setscheduler(tsk);
-	if (ret)
-		return ret;
-	if (threadgroup) {
-		struct task_struct *c;
-
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			ret = security_task_setscheduler(c);
-			if (ret) {
-				rcu_read_unlock();
-				return ret;
-			}
-		}
-		rcu_read_unlock();
-	}
 	return 0;
 }
 
-static void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
-			       struct cpuset *cs)
+static int cpuset_can_attach_task(struct cgroup *cgrp, struct task_struct *task)
+{
+	return security_task_setscheduler(task);
+}
+
+/*
+ * Protected by cgroup_lock. The nodemasks must be stored globally because
+ * dynamically allocating them is not allowed in pre_attach, and they must
+ * persist among pre_attach, attach_task, and attach.
+ */
+static cpumask_var_t cpus_attach;
+static nodemask_t cpuset_attach_nodemask_from;
+static nodemask_t cpuset_attach_nodemask_to;
+
+/* Do quick set-up work for before attaching each task. */
+static void cpuset_pre_attach(struct cgroup *cont)
+{
+	struct cpuset *cs = cgroup_cs(cont);
+
+	if (cs == &top_cpuset)
+		cpumask_copy(cpus_attach, cpu_possible_mask);
+	else
+		guarantee_online_cpus(cs, cpus_attach);
+
+	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+}
+
+/* Per-thread attachment work. */
+static void cpuset_attach_task(struct cgroup *cont, struct task_struct *tsk)
 {
 	int err;
+	struct cpuset *cs = cgroup_cs(cont);
+
 	/*
 	 * can_attach beforehand should guarantee that this doesn't fail.
 	 * TODO: have a better way to handle failure here
@@ -1426,56 +1435,46 @@ static void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
 	err = set_cpus_allowed_ptr(tsk, cpus_attach);
 	WARN_ON_ONCE(err);
 
-	cpuset_change_task_nodemask(tsk, to);
 	cpuset_update_task_spread_flag(cs, tsk);
 
 }
 
 static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
-			  struct cgroup *oldcont, struct task_struct *tsk,
-			  bool threadgroup)
+			  struct cgroup *oldcont, struct task_struct *tsk)
 {
 	struct mm_struct *mm;
 	struct cpuset *cs = cgroup_cs(cont);
 	struct cpuset *oldcs = cgroup_cs(oldcont);
-	NODEMASK_ALLOC(nodemask_t, from, GFP_KERNEL);
-	NODEMASK_ALLOC(nodemask_t, to, GFP_KERNEL);
 
-	if (from == NULL || to == NULL)
-		goto alloc_fail;
-
-	if (cs == &top_cpuset) {
-		cpumask_copy(cpus_attach, cpu_possible_mask);
-	} else {
-		guarantee_online_cpus(cs, cpus_attach);
-	}
-	guarantee_online_mems(cs, to);
-
-	/* do per-task migration stuff possibly for each in the threadgroup */
-	cpuset_attach_task(tsk, to, cs);
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			cpuset_attach_task(c, to, cs);
-		}
-		rcu_read_unlock();
-	}
+	/*
+	 * TODO: As implemented, change_task_nodemask uses yield() to
+	 * synchronize with other users of the mems_allowed, which is not
+	 * allowed in the atomic attach_task callback, so we can't do this for
+	 * each thread in the multithreaded case. This is a performance issue,
+	 * but not a correctness one.
+	 *
+	 * As long as change_task_nodemask can yield, a multithreaded mempolicy
+	 * user should attach to a cgroup by threadgroup first (for
+	 * correctness) then poke each task to get its mempolicy right.
+	 *
+	 * This is the "third exception" in Documentation/cgroups/cpusets.txt.
+	 */
+	cpuset_change_task_nodemask(tsk, &cpuset_attach_nodemask_to);
 
-	/* change mm; only needs to be done once even if threadgroup */
-	*from = oldcs->mems_allowed;
-	*to = cs->mems_allowed;
+	/*
+	 * Change mm, possibly for multiple threads in a threadgroup. This is
+	 * expensive and may sleep.
+	 */
+	cpuset_attach_nodemask_from = oldcs->mems_allowed;
+	cpuset_attach_nodemask_to = cs->mems_allowed;
 	mm = get_task_mm(tsk);
 	if (mm) {
-		mpol_rebind_mm(mm, to);
+		mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
 		if (is_memory_migrate(cs))
-			cpuset_migrate_mm(mm, from, to);
+			cpuset_migrate_mm(mm, &cpuset_attach_nodemask_from,
+					  &cpuset_attach_nodemask_to);
 		mmput(mm);
 	}
-
-alloc_fail:
-	NODEMASK_FREE(from);
-	NODEMASK_FREE(to);
 }
 
 /* The various types of files and directories in a cpuset file system */
@@ -1928,6 +1927,9 @@ struct cgroup_subsys cpuset_subsys = {
 	.create = cpuset_create,
 	.destroy = cpuset_destroy,
 	.can_attach = cpuset_can_attach,
+	.can_attach_task = cpuset_can_attach_task,
+	.pre_attach = cpuset_pre_attach,
+	.attach_task = cpuset_attach_task,
 	.attach = cpuset_attach,
 	.populate = cpuset_populate,
 	.post_clone = cpuset_post_clone,
diff --git a/kernel/ns_cgroup.c b/kernel/ns_cgroup.c
index 2c98ad9..1fc2b1b 100644
--- a/kernel/ns_cgroup.c
+++ b/kernel/ns_cgroup.c
@@ -43,7 +43,7 @@ int ns_cgroup_clone(struct task_struct *task, struct pid *pid)
  *        ancestor cgroup thereof)
  */
 static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
-			 struct task_struct *task, bool threadgroup)
+			 struct task_struct *task)
 {
 	if (current != task) {
 		if (!capable(CAP_SYS_ADMIN))
@@ -53,21 +53,13 @@ static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
 			return -EPERM;
 	}
 
-	if (!cgroup_is_descendant(new_cgroup, task))
-		return -EPERM;
-
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
-			if (!cgroup_is_descendant(new_cgroup, c)) {
-				rcu_read_unlock();
-				return -EPERM;
-			}
-		}
-		rcu_read_unlock();
-	}
+	return 0;
+}
 
+static int ns_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+{
+	if (!cgroup_is_descendant(cgrp, tsk))
+		return -EPERM;
 	return 0;
 }
 
@@ -112,6 +104,7 @@ static void ns_destroy(struct cgroup_subsys *ss,
 struct cgroup_subsys ns_subsys = {
 	.name = "ns",
 	.can_attach = ns_can_attach,
+	.can_attach_task = ns_can_attach_task,
 	.create = ns_create,
 	.destroy  = ns_destroy,
 	.subsys_id = ns_subsys_id,
diff --git a/kernel/sched.c b/kernel/sched.c
index 218ef20..d619f1d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8655,42 +8655,10 @@ cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	return 0;
 }
 
-static int
-cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-		      struct task_struct *tsk, bool threadgroup)
-{
-	int retval = cpu_cgroup_can_attach_task(cgrp, tsk);
-	if (retval)
-		return retval;
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			retval = cpu_cgroup_can_attach_task(cgrp, c);
-			if (retval) {
-				rcu_read_unlock();
-				return retval;
-			}
-		}
-		rcu_read_unlock();
-	}
-	return 0;
-}
-
 static void
-cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-		  struct cgroup *old_cont, struct task_struct *tsk,
-		  bool threadgroup)
+cpu_cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 	sched_move_task(tsk);
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			sched_move_task(c);
-		}
-		rcu_read_unlock();
-	}
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -8763,8 +8731,8 @@ struct cgroup_subsys cpu_cgroup_subsys = {
 	.name		= "cpu",
 	.create		= cpu_cgroup_create,
 	.destroy	= cpu_cgroup_destroy,
-	.can_attach	= cpu_cgroup_can_attach,
-	.attach		= cpu_cgroup_attach,
+	.can_attach_task = cpu_cgroup_can_attach_task,
+	.attach_task	= cpu_cgroup_attach_task,
 	.populate	= cpu_cgroup_populate,
 	.subsys_id	= cpu_cgroup_subsys_id,
 	.early_init	= 1,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 729beb7..995f0b9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4720,8 +4720,7 @@ static void mem_cgroup_clear_mc(void)
 
 static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 	int ret = 0;
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
@@ -4775,8 +4774,7 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 
 static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 	mem_cgroup_clear_mc();
 }
@@ -4880,8 +4878,7 @@ static void mem_cgroup_move_charge(struct mm_struct *mm)
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct cgroup *cont,
 				struct cgroup *old_cont,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 	if (!mc.mm)
 		/* no need to move charge */
@@ -4893,22 +4890,19 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 #else	/* !CONFIG_MMU */
 static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 	return 0;
 }
 static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 }
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct cgroup *cont,
 				struct cgroup *old_cont,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 }
 #endif
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 8d9c48f..cd1f779 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -62,8 +62,7 @@ static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
 struct cgroup_subsys devices_subsys;
 
 static int devcgroup_can_attach(struct cgroup_subsys *ss,
-		struct cgroup *new_cgroup, struct task_struct *task,
-		bool threadgroup)
+		struct cgroup *new_cgroup, struct task_struct *task)
 {
 	if (current != task && !capable(CAP_SYS_ADMIN))
 			return -EPERM;

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* Re: [PATCH v7 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup
  2010-12-26 12:09           ` [PATCH v7 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup Ben Blum
@ 2011-01-24  8:38             ` Paul Menage
  2011-01-24 21:05             ` Andrew Morton
       [not found]             ` <20101226120951.GB28529-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2 siblings, 0 replies; 185+ messages in thread
From: Paul Menage @ 2011-01-24  8:38 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, oleg,
	David Rientjes, Miao Xie

On Sun, Dec 26, 2010 at 4:09 AM, Ben Blum <bblum@andrew.cmu.edu> wrote:
> (...please include patches inline...)

Reviewed-by: Paul Menage <menage@google.com>

Thanks,
Paul

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v7 2/3] cgroups: add atomic-context per-thread subsystem callbacks
  2011-01-24  8:38             ` Paul Menage
@ 2011-01-24 15:32               ` Ben Blum
       [not found]               ` <AANLkTimytfrDnr_5SzBUFQu0SaGdAWDC0p38hiFiHrtU-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-01-24 15:32 UTC (permalink / raw)
  To: Paul Menage
  Cc: Ben Blum, linux-kernel, containers, akpm, ebiederm, lizf,
	matthltc, oleg, David Rientjes, Miao Xie

On Mon, Jan 24, 2011 at 12:38:06AM -0800, Paul Menage wrote:
> Hi Ben,
> 
> Finally finding a moment to actually look at these patches. Sorry it's
> been a while. Can you send the patches inline rather than as
> attachments in future?

Whoops, sure thing.

> 
> Reviewed-by: Paul Menage <menage@google.com>
> 
> This patch looks fine, although I think that freezer_can_attach_task()
> could be simplified to:
> 
> static int freezer_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
> {
>   if (__cgroup_freezing_or_frozen(tsk))
>     return -EBUSY;
>   return 0;
> }
> 
> since we guarantee that rcu_read_lock() is held across this call.

I put a note there that "rcu_read_lock allows recursive locking", to
denote that it's okay to double-lock when it's called from
cgroup_attach_proc. I guess this isn't very clear: the reason the lock
is there is that cgroup_attach_task calls it without holding
rcu_read_lock (not necessary in most cases), but freezer needs RCU there
in either case. I wrote in the documentation: "This may run in
rcu_read-side", which I guess isn't very clear either.

> 
> There appears to be a tiny bit of rot in kernel/cpu.c (due to the
> addition of the exit() callback) and memcontrol.c (due to some changes
> at the start of mem_cgroup_move_task()) but neither impact actual
> code.
> 
> I think that before actually pushing to mainline, we'll need to sort
> out the cpuset mempolicy yielding issue, since that could be a
> user-visible API change.
> 
> 
> Paul

Hmm. The quirks caused by this are specific to using cgroup.procs, and
since cgroup.procs is new, I wouldn't say this is an API "change"?

Thanks,
Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v7 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup
  2010-12-26 12:09           ` [PATCH v7 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup Ben Blum
  2011-01-24  8:38             ` Paul Menage
@ 2011-01-24 21:05             ` Andrew Morton
  2011-02-04 21:25               ` Ben Blum
                                 ` (2 more replies)
       [not found]             ` <20101226120951.GB28529-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2 siblings, 3 replies; 185+ messages in thread
From: Andrew Morton @ 2011-01-24 21:05 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, ebiederm, lizf, matthltc, menage, oleg,
	David Rientjes, Miao Xie

On Sun, 26 Dec 2010 07:09:51 -0500
Ben Blum <bblum@andrew.cmu.edu> wrote:

> Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup
> 
> From: Ben Blum <bblum@andrew.cmu.edu>
> 
> This patch adds an rwsem that lives in a threadgroup's signal_struct that's
> taken for reading in the fork path, under CONFIG_CGROUPS. If another part of
> the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
> ifdefs should be changed to a higher-up flag that CGROUPS and the other system
> would both depend on.
> 
> This is a pre-patch for cgroup-procs-write.patch.
> 
> ...
>
> +/* See the declaration of threadgroup_fork_lock in signal_struct. */
> +#ifdef CONFIG_CGROUPS
> +static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
> +{
> +	down_read(&tsk->signal->threadgroup_fork_lock);
> +}
> +static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
> +{
> +	up_read(&tsk->signal->threadgroup_fork_lock);
> +}
> +static inline void threadgroup_fork_write_lock(struct task_struct *tsk)
> +{
> +	down_write(&tsk->signal->threadgroup_fork_lock);
> +}
> +static inline void threadgroup_fork_write_unlock(struct task_struct *tsk)
> +{
> +	up_write(&tsk->signal->threadgroup_fork_lock);
> +}
> +#else

Risky. sched.h doesn't include rwsem.h.

We could make it do so, but almost every compilation unit in the kernel
includes sched.h.  It would be nicer to make the kernel build
finer-grained, rather than blunter-grained.  Don't be afraid to add new
header files if that is one way of doing this!



^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v7 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup
  2011-01-24 21:05             ` Andrew Morton
@ 2011-02-04 21:25               ` Ben Blum
       [not found]                 ` <20110204212515.GA5916-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2011-02-04 21:36                 ` Andrew Morton
  2011-02-14  5:31               ` Paul Menage
       [not found]               ` <20110124130529.903d9832.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  2 siblings, 2 replies; 185+ messages in thread
From: Ben Blum @ 2011-02-04 21:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ben Blum, linux-kernel, containers, ebiederm, lizf, matthltc,
	menage, oleg, David Rientjes, Miao Xie

On Mon, Jan 24, 2011 at 01:05:29PM -0800, Andrew Morton wrote:
> On Sun, 26 Dec 2010 07:09:51 -0500
> Ben Blum <bblum@andrew.cmu.edu> wrote:
> 
> > Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup
> > 
> > From: Ben Blum <bblum@andrew.cmu.edu>
> > 
> > This patch adds an rwsem that lives in a threadgroup's signal_struct that's
> > taken for reading in the fork path, under CONFIG_CGROUPS. If another part of
> > the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
> > ifdefs should be changed to a higher-up flag that CGROUPS and the other system
> > would both depend on.
> > 
> > This is a pre-patch for cgroup-procs-write.patch.
> > 
> > ...
> >
> > +/* See the declaration of threadgroup_fork_lock in signal_struct. */
> > +#ifdef CONFIG_CGROUPS
> > +static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
> > +{
> > +	down_read(&tsk->signal->threadgroup_fork_lock);
> > +}
> > +static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
> > +{
> > +	up_read(&tsk->signal->threadgroup_fork_lock);
> > +}
> > +static inline void threadgroup_fork_write_lock(struct task_struct *tsk)
> > +{
> > +	down_write(&tsk->signal->threadgroup_fork_lock);
> > +}
> > +static inline void threadgroup_fork_write_unlock(struct task_struct *tsk)
> > +{
> > +	up_write(&tsk->signal->threadgroup_fork_lock);
> > +}
> > +#else
> 
> Risky. sched.h doesn't include rwsem.h.
> 
> We could make it do so, but almost every compilation unit in the kernel
> includes sched.h.  It would be nicer to make the kernel build
> finer-grained, rather than blunter-grained.  Don't be afraid to add new
> header files if that is one way of doing this!

Hmm, good point. But there's also:

+#ifdef CONFIG_CGROUPS
+       struct rw_semaphore threadgroup_fork_lock;
+#endif

in the signal_struct, also in sched.h, which needs to be there. Or I
could change it to a struct pointer with a forward incomplete
declaration above, and kmalloc/kfree it? I don't like adding more
alloc/free calls but don't know if it's more or less important than
header granularity.
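
For reference, the pointer version I have in mind would look roughly
like this (sketch only, not what the current patch does):

/* sched.h: no rwsem.h needed, a forward declaration would do */
struct rw_semaphore;

struct signal_struct {
	/* ... */
#ifdef CONFIG_CGROUPS
	/* allocated in copy_signal(), freed along with the signal_struct */
	struct rw_semaphore *threadgroup_fork_lock;
#endif
	/* ... */
};

and copy_signal() would grow something like:

	sig->threadgroup_fork_lock =
		kmalloc(sizeof(*sig->threadgroup_fork_lock), GFP_KERNEL);
	if (!sig->threadgroup_fork_lock)
		return -ENOMEM;
	init_rwsem(sig->threadgroup_fork_lock);

plus the matching kfree() when the signal_struct goes away.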

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v7 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup
  2011-02-04 21:25               ` Ben Blum
       [not found]                 ` <20110204212515.GA5916-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2011-02-04 21:36                 ` Andrew Morton
       [not found]                   ` <20110204133657.78aeebe3.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  1 sibling, 1 reply; 185+ messages in thread
From: Andrew Morton @ 2011-02-04 21:36 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, ebiederm, lizf, matthltc, menage, oleg,
	David Rientjes, Miao Xie

On Fri, 4 Feb 2011 16:25:15 -0500
Ben Blum <bblum@andrew.cmu.edu> wrote:

> On Mon, Jan 24, 2011 at 01:05:29PM -0800, Andrew Morton wrote:
> > On Sun, 26 Dec 2010 07:09:51 -0500
> > Ben Blum <bblum@andrew.cmu.edu> wrote:
> > 
> > > Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup
> > > 
> > > From: Ben Blum <bblum@andrew.cmu.edu>
> > > 
> > > This patch adds an rwsem that lives in a threadgroup's signal_struct that's
> > > taken for reading in the fork path, under CONFIG_CGROUPS. If another part of
> > > the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
> > > ifdefs should be changed to a higher-up flag that CGROUPS and the other system
> > > would both depend on.
> > > 
> > > This is a pre-patch for cgroup-procs-write.patch.
> > > 
> > > ...
> > >
> > > +/* See the declaration of threadgroup_fork_lock in signal_struct. */
> > > +#ifdef CONFIG_CGROUPS
> > > +static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
> > > +{
> > > +	down_read(&tsk->signal->threadgroup_fork_lock);
> > > +}
> > > +static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
> > > +{
> > > +	up_read(&tsk->signal->threadgroup_fork_lock);
> > > +}
> > > +static inline void threadgroup_fork_write_lock(struct task_struct *tsk)
> > > +{
> > > +	down_write(&tsk->signal->threadgroup_fork_lock);
> > > +}
> > > +static inline void threadgroup_fork_write_unlock(struct task_struct *tsk)
> > > +{
> > > +	up_write(&tsk->signal->threadgroup_fork_lock);
> > > +}
> > > +#else
> > 
> > Risky. sched.h doesn't include rwsem.h.
> > 
> > We could make it do so, but almost every compilation unit in the kernel
> > includes sched.h.  It would be nicer to make the kernel build
> > finer-grained, rather than blunter-grained.  Don't be afraid to add new
> > header files if that is one way of doing this!
> 
> Hmm, good point. But there's also:
> 
> +#ifdef CONFIG_CGROUPS
> +       struct rw_semaphore threadgroup_fork_lock;
> +#endif
> 
> in the signal_struct, also in sched.h, which needs to be there. Or I
> could change it to a struct pointer with a forward incomplete
> declaration above, and kmalloc/kfree it? I don't like adding more
> alloc/free calls but don't know if it's more or less important than
> header granularity.

What about adding a new header file which includes rwsem.h and sched.h
and then defines the new interfaces?
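
Something along these lines, say (file name made up for illustration):

/* include/linux/threadgroup_fork.h */
#ifndef _LINUX_THREADGROUP_FORK_H
#define _LINUX_THREADGROUP_FORK_H

#include <linux/rwsem.h>
#include <linux/sched.h>

#ifdef CONFIG_CGROUPS
static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
{
	down_read(&tsk->signal->threadgroup_fork_lock);
}
/*
 * ...and threadgroup_fork_read_unlock()/write_lock()/write_unlock() the
 * same way, plus empty stubs for !CONFIG_CGROUPS, exactly as in the
 * patch -- only the file they live in changes.
 */
#endif

#endif /* _LINUX_THREADGROUP_FORK_H */

Then fork.c and cgroup.c include that instead of picking the helpers up
from sched.h.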

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v7 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup
@ 2011-02-04 21:43                       ` Ben Blum
  0 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-02-04 21:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ben Blum, linux-kernel, containers, ebiederm, lizf, matthltc,
	menage, oleg, David Rientjes, Miao Xie

On Fri, Feb 04, 2011 at 01:36:57PM -0800, Andrew Morton wrote:
> On Fri, 4 Feb 2011 16:25:15 -0500
> Ben Blum <bblum@andrew.cmu.edu> wrote:
> 
> > On Mon, Jan 24, 2011 at 01:05:29PM -0800, Andrew Morton wrote:
> > > On Sun, 26 Dec 2010 07:09:51 -0500
> > > Ben Blum <bblum@andrew.cmu.edu> wrote:
> > > 
> > > > Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup
> > > > 
> > > > From: Ben Blum <bblum@andrew.cmu.edu>
> > > > 
> > > > This patch adds an rwsem that lives in a threadgroup's signal_struct that's
> > > > taken for reading in the fork path, under CONFIG_CGROUPS. If another part of
> > > > the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
> > > > ifdefs should be changed to a higher-up flag that CGROUPS and the other system
> > > > would both depend on.
> > > > 
> > > > This is a pre-patch for cgroup-procs-write.patch.
> > > > 
> > > > ...
> > > >
> > > > +/* See the declaration of threadgroup_fork_lock in signal_struct. */
> > > > +#ifdef CONFIG_CGROUPS
> > > > +static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
> > > > +{
> > > > +	down_read(&tsk->signal->threadgroup_fork_lock);
> > > > +}
> > > > +static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
> > > > +{
> > > > +	up_read(&tsk->signal->threadgroup_fork_lock);
> > > > +}
> > > > +static inline void threadgroup_fork_write_lock(struct task_struct *tsk)
> > > > +{
> > > > +	down_write(&tsk->signal->threadgroup_fork_lock);
> > > > +}
> > > > +static inline void threadgroup_fork_write_unlock(struct task_struct *tsk)
> > > > +{
> > > > +	up_write(&tsk->signal->threadgroup_fork_lock);
> > > > +}
> > > > +#else
> > > 
> > > Risky. sched.h doesn't include rwsem.h.
> > > 
> > > We could make it do so, but almost every compilation unit in the kernel
> > > includes sched.h.  It would be nicer to make the kernel build
> > > finer-grained, rather than blunter-grained.  Don't be afraid to add new
> > > header files if that is one way of doing this!
> > 
> > Hmm, good point. But there's also:
> > 
> > +#ifdef CONFIG_CGROUPS
> > +       struct rw_semaphore threadgroup_fork_lock;
> > +#endif
> > 
> > in the signal_struct, also in sched.h, which needs to be there. Or I
> > could change it to a struct pointer with a forward incomplete
> > declaration above, and kmalloc/kfree it? I don't like adding more
> > alloc/free calls but don't know if it's more or less important than
> > header granularity.
> 
> What about adding a new header file which includes rwsem.h and sched.h
> and then defines the new interfaces?

Er, I mean the definition of signal_struct needs rwsem.h as well, not
just the threadgroup_fork_* functions. (And I suspect moving
signal_struct somewhere else would give bigger problems...)
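
The constraint in miniature (plain C, nothing cgroup-specific):

/* A pointer member only needs a forward declaration... */
struct rw_semaphore;

struct uses_pointer {
	struct rw_semaphore *sem;	/* fine without <linux/rwsem.h> */
};

/* ...but an embedded member needs the complete type, hence the header. */
struct uses_member {
	struct rw_semaphore sem;	/* won't compile without <linux/rwsem.h> */
};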

^ permalink raw reply	[flat|nested] 185+ messages in thread

* [PATCH v8 0/3] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
  2010-12-26 12:09         ` Ben Blum
                             ` (3 preceding siblings ...)
       [not found]           ` <20101226120919.GA28529-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2011-02-08  1:35           ` Ben Blum
  2011-02-08  1:37             ` [PATCH v8 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup Ben Blum
                               ` (3 more replies)
  4 siblings, 4 replies; 185+ messages in thread
From: Ben Blum @ 2011-02-08  1:35 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage,
	oleg, David Rientjes, Miao Xie

On Sun, Dec 26, 2010 at 07:09:19AM -0500, Ben Blum wrote:
> On Fri, Dec 24, 2010 at 03:22:26AM -0500, Ben Blum wrote:
> > On Wed, Aug 11, 2010 at 01:46:04AM -0400, Ben Blum wrote:
> > > On Fri, Jul 30, 2010 at 07:56:49PM -0400, Ben Blum wrote:
> > > > This patch series is a revision of http://lkml.org/lkml/2010/6/25/11 .
> > > > 
> > > > This patch series implements a write function for the 'cgroup.procs'
> > > > per-cgroup file, which enables atomic movement of multithreaded
> > > > applications between cgroups. Writing the thread-ID of any thread in a
> > > > threadgroup to a cgroup's procs file causes all threads in the group to
> > > > be moved to that cgroup safely with respect to threads forking/exiting.
> > > > (Possible usage scenario: If running a multithreaded build system that
> > > > sucks up system resources, this lets you restrict it all at once into a
> > > > new cgroup to keep it under control.)
> > > > 
> > > > Example: Suppose pid 31337 clones new threads 31338 and 31339.
> > > > 
> > > > # cat /dev/cgroup/tasks
> > > > ...
> > > > 31337
> > > > 31338
> > > > 31339
> > > > # mkdir /dev/cgroup/foo
> > > > # echo 31337 > /dev/cgroup/foo/cgroup.procs
> > > > # cat /dev/cgroup/foo/tasks
> > > > 31337
> > > > 31338
> > > > 31339
> > > > 
> > > > A new lock, called threadgroup_fork_lock and living in signal_struct, is
> > > > introduced to ensure atomicity when moving threads between cgroups. It's
> > > > taken for writing during the operation, and taking for reading in fork()
> > > > around the calls to cgroup_fork() and cgroup_post_fork().
> 
> Well this time everything here is actually safe and correct, as far as
> my best efforts and keen eyes can tell. I dropped the per_thread call
> from the last series in favour of revising the subsystem callback
> interface. It now looks like this:
> 
> ss->can_attach()
>  - Thread-independent, possibly expensive/sleeping.
> 
> ss->can_attach_task()
>  - Called per-thread, run with rcu_read so must not sleep.
> 
> ss->pre_attach()
>  - Thread independent, must be atomic, happens before attach_task.
> 
> ss->attach_task()
>  - Called per-thread, run with tasklist_lock so must not sleep.
> 
> ss->attach()
>  - Thread independent, possibly expensive/sleeping, called last.

Okay, so.

I've revamped the cgroup_attach_proc implementation a bunch and this
version should be a lot easier on the eyes (and brains). Issues that are
addressed:

1) cgroup_attach_proc now iterates over leader->thread_group once, at
   the very beginning, and puts each task_struct that we want to move
   into an array, using get_task_struct to make sure they stick around
   (see the rough sketch below).
    - threadgroup_fork_lock ensures no threads not in the array can
      appear, and allows us to use signal->nr_threads to determine the
      size of the array when kmallocing it.
    - This simplifies the rest of the function a bunch, since now we
      never need to do rcu_read_lock after building the array. All the
      subsystem callbacks are the same as described just above, but the
      "can't sleep" restriction is gone, so it's nice and clean.
    - Checking for a race with de_thread (the manoeuvre I refer to as
      "double-double-toil-and-trouble-check locking") now needs to be
      done only once, at the beginning (before building the array).

2) The nodemask allocation problem in cpuset is fixed the same way as
   before - the masks are shared between the three attach callbacks, so
   are made as static global variables.

3) The introduction of threadgroup_fork_lock in sched.h (specifically,
   in signal_struct) requires rwsem.h; the new include appears in the
   first patch. (An alternate plan would be to make it a struct pointer
   with an incomplete forward declaration and kmalloc/kfree it during
   housekeeping, but adding an include seems better than that particular
   complication.) In light of this, the definitions for
   threadgroup_fork_{read,write}_{un,}lock are also in sched.h.
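
To make (1) concrete, the walk looks roughly like this (error handling
and the subsystem callbacks elided; the real cgroup_attach_proc has to
do rather more):

	struct task_struct *tsk, **group;
	int i = 0, j, group_size;

	threadgroup_fork_write_lock(leader);
	/* no CLONE_THREAD fork can race us now, so nr_threads is stable */
	group_size = leader->signal->nr_threads;
	group = kmalloc(group_size * sizeof(*group), GFP_KERNEL);
	if (!group)
		goto out_unlock;

	rcu_read_lock();
	get_task_struct(leader);
	group[i++] = leader;
	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
		get_task_struct(tsk);	/* keep tsk valid after rcu unlock */
		group[i++] = tsk;
	}
	rcu_read_unlock();

	/* ... per-subsystem can_attach/pre_attach/attach_task/attach ... */

	for (j = 0; j < i; j++)
		put_task_struct(group[j]);
	kfree(group);
out_unlock:
	threadgroup_fork_write_unlock(leader);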

-- Ben

---
 Documentation/cgroups/cgroups.txt |   39 ++-
 block/blk-cgroup.c                |   18 -
 include/linux/cgroup.h            |   10 
 include/linux/init_task.h         |    9 
 include/linux/sched.h             |   37 +++
 kernel/cgroup.c                   |  454 +++++++++++++++++++++++++++++++++-----
 kernel/cgroup_freezer.c           |   26 --
 kernel/cpuset.c                   |  105 +++-----
 kernel/fork.c                     |   10 
 kernel/ns_cgroup.c                |   23 -
 kernel/sched.c                    |   38 ---
 mm/memcontrol.c                   |   18 -
 security/device_cgroup.c          |    3 
 13 files changed, 575 insertions(+), 215 deletions(-)

^ permalink raw reply	[flat|nested] 185+ messages in thread

* [PATCH v8 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup
  2011-02-08  1:35           ` Ben Blum
@ 2011-02-08  1:37             ` Ben Blum
  2011-03-03 17:54               ` Paul Menage
       [not found]               ` <20110208013741.GD31569-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
       [not found]             ` <20110208013542.GC31569-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
                               ` (2 subsequent siblings)
  3 siblings, 2 replies; 185+ messages in thread
From: Ben Blum @ 2011-02-08  1:37 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage,
	oleg, David Rientjes, Miao Xie

Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup

From: Ben Blum <bblum@andrew.cmu.edu>

This patch adds an rwsem that lives in a threadgroup's signal_struct that's
taken for reading in the fork path, under CONFIG_CGROUPS. If another part of
the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
ifdefs should be changed to a higher-up flag that CGROUPS and the other system
would both depend on.

This is a pre-patch for cgroup-procs-write.patch.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
---
 include/linux/init_task.h |    9 +++++++++
 include/linux/sched.h     |   37 +++++++++++++++++++++++++++++++++++++
 kernel/fork.c             |   10 ++++++++++
 3 files changed, 56 insertions(+), 0 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6b281fa..b560381 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -15,6 +15,14 @@
 extern struct files_struct init_files;
 extern struct fs_struct init_fs;
 
+#ifdef CONFIG_CGROUPS
+#define INIT_THREADGROUP_FORK_LOCK(sig)					\
+	.threadgroup_fork_lock =					\
+		__RWSEM_INITIALIZER(sig.threadgroup_fork_lock),
+#else
+#define INIT_THREADGROUP_FORK_LOCK(sig)
+#endif
+
 #define INIT_SIGNALS(sig) {						\
 	.nr_threads	= 1,						\
 	.wait_chldexit	= __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
@@ -31,6 +39,7 @@ extern struct fs_struct init_fs;
 	},								\
 	.cred_guard_mutex =						\
 		 __MUTEX_INITIALIZER(sig.cred_guard_mutex),		\
+	INIT_THREADGROUP_FORK_LOCK(sig)					\
 }
 
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8580dc6..2fdbeb1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -509,6 +509,8 @@ struct thread_group_cputimer {
 	spinlock_t lock;
 };
 
+#include <linux/rwsem.h>
+
 /*
  * NOTE! "signal_struct" does not have it's own
  * locking, because a shared signal_struct always
@@ -623,6 +625,16 @@ struct signal_struct {
 	unsigned audit_tty;
 	struct tty_audit_buf *tty_audit_buf;
 #endif
+#ifdef CONFIG_CGROUPS
+	/*
+	 * The threadgroup_fork_lock prevents threads from forking with
+	 * CLONE_THREAD while held for writing. Use this for fork-sensitive
+	 * threadgroup-wide operations. It's taken for reading in fork.c in
+	 * copy_process().
+	 * Currently only needed write-side by cgroups.
+	 */
+	struct rw_semaphore threadgroup_fork_lock;
+#endif
 
 	int oom_adj;		/* OOM kill score adjustment (bit shift) */
 	int oom_score_adj;	/* OOM kill score adjustment */
@@ -2270,6 +2282,31 @@ static inline void unlock_task_sighand(struct task_struct *tsk,
 	spin_unlock_irqrestore(&tsk->sighand->siglock, *flags);
 }
 
+/* See the declaration of threadgroup_fork_lock in signal_struct. */
+#ifdef CONFIG_CGROUPS
+static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
+{
+	down_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
+{
+	up_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_lock(struct task_struct *tsk)
+{
+	down_write(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_unlock(struct task_struct *tsk)
+{
+	up_write(&tsk->signal->threadgroup_fork_lock);
+}
+#else
+static inline void threadgroup_fork_read_lock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_read_unlock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_write_lock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_write_unlock(struct task_struct *tsk) {}
+#endif
+
 #ifndef __HAVE_THREAD_FUNCTIONS
 
 #define task_thread_info(task)	((struct thread_info *)(task)->stack)
diff --git a/kernel/fork.c b/kernel/fork.c
index 0979527..aefe61f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -905,6 +905,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 
 	tty_audit_fork(sig);
 
+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->threadgroup_fork_lock);
+#endif
+
 	sig->oom_adj = current->signal->oom_adj;
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
@@ -1087,6 +1091,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	monotonic_to_bootbased(&p->real_start_time);
 	p->io_context = NULL;
 	p->audit_context = NULL;
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_lock(current);
 	cgroup_fork(p);
 #ifdef CONFIG_NUMA
 	p->mempolicy = mpol_dup(p->mempolicy);
@@ -1294,6 +1300,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	write_unlock_irq(&tasklist_lock);
 	proc_fork_connector(p);
 	cgroup_post_fork(p);
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_unlock(current);
 	perf_event_fork(p);
 	return p;
 
@@ -1332,6 +1340,8 @@ bad_fork_cleanup_policy:
 	mpol_put(p->mempolicy);
 bad_fork_cleanup_cgroup:
 #endif
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_unlock(current);
 	cgroup_exit(p, cgroup_callbacks_done);
 	delayacct_tsk_free(p);
 	module_put(task_thread_info(p)->exec_domain->module);

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v8 2/3] cgroups: add per-thread subsystem callbacks
  2011-02-08  1:35           ` Ben Blum
@ 2011-02-08  1:39                 ` Ben Blum
       [not found]             ` <20110208013542.GC31569-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-02-08  1:39 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Miao Xie, David Rientjes, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

Add cgroup subsystem callbacks for per-thread attachment

From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

This patch adds can_attach_task, pre_attach, and attach_task as new callbacks
in the cgroup subsystem interface. Unlike can_attach and attach, these are
per-thread operations, potentially called many times when attaching an entire
threadgroup.

Also, the old "bool threadgroup" interface is removed, replaced by the new
callbacks. All subsystems are converted to the new interface - of note is
cpuset, which needs the from/to nodemasks used during attach to be globally
scoped (per-cpuset storage would also work) so that they persist from
pre_attach through attach_task to attach.

This is a pre-patch for cgroup-procs-writable.patch.
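
To illustrate the shape of the new hooks, here is a minimal sketch of a
hypothetical subsystem wired up to the per-thread callbacks (example_subsys
and the example_* functions are made-up names; only the callback signatures
are taken from this patch, and create/destroy/subsys_id plumbing is omitted):

static int example_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
{
	/* per-thread veto; called once for every thread being attached */
	return 0;
}

static void example_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
{
	/* per-thread commit work; runs after all can_attach_task calls passed */
}

struct cgroup_subsys example_subsys = {
	.name		 = "example",
	.can_attach_task = example_can_attach_task,
	.attach_task	 = example_attach_task,
	/* can_attach/pre_attach/attach remain available for group-wide work */
};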

Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
---
 Documentation/cgroups/cgroups.txt |   30 ++++++++---
 block/blk-cgroup.c                |   18 ++----
 include/linux/cgroup.h            |   10 ++--
 kernel/cgroup.c                   |   17 +++++-
 kernel/cgroup_freezer.c           |   26 ++++-----
 kernel/cpuset.c                   |  105 ++++++++++++++++---------------------
 kernel/ns_cgroup.c                |   23 +++-----
 kernel/sched.c                    |   38 +------------
 mm/memcontrol.c                   |   18 ++----
 security/device_cgroup.c          |    3 -
 10 files changed, 122 insertions(+), 166 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 190018b..d3c9a24 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -563,7 +563,7 @@ rmdir() will fail with it. From this behavior, pre_destroy() can be
 called multiple times against a cgroup.
 
 int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-	       struct task_struct *task, bool threadgroup)
+	       struct task_struct *task)
 (cgroup_mutex held by caller)
 
 Called prior to moving a task into a cgroup; if the subsystem
@@ -572,9 +572,14 @@ task is passed, then a successful result indicates that *any*
 unspecified task can be moved into the cgroup. Note that this isn't
 called on a fork. If this method returns 0 (success) then this should
 remain valid while the caller holds cgroup_mutex and it is ensured that either
-attach() or cancel_attach() will be called in future. If threadgroup is
-true, then a successful result indicates that all threads in the given
-thread's threadgroup can be moved together.
+attach() or cancel_attach() will be called in future.
+
+int can_attach_task(struct cgroup *cgrp, struct task_struct *tsk);
+(cgroup_mutex held by caller)
+
+As can_attach, but for operations that must be run once per task to be
+attached (possibly many when using cgroup_attach_proc). Called after
+can_attach.
 
 void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	       struct task_struct *task, bool threadgroup)
@@ -586,15 +591,24 @@ function, so that the subsystem can implement a rollback. If not, not necessary.
 This will be called only for subsystems whose can_attach() operation has
 succeeded.
 
+void pre_attach(struct cgroup *cgrp);
+(cgroup_mutex held by caller)
+
+For any non-per-thread attachment work that needs to happen before
+attach_task. Needed by cpuset.
+
 void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-	    struct cgroup *old_cgrp, struct task_struct *task,
-	    bool threadgroup)
+	    struct cgroup *old_cgrp, struct task_struct *task)
 (cgroup_mutex held by caller)
 
 Called after the task has been attached to the cgroup, to allow any
 post-attachment activity that requires memory allocations or blocking.
-If threadgroup is true, the subsystem should take care of all threads
-in the specified thread's threadgroup. Currently does not support any
+
+void attach_task(struct cgroup *cgrp, struct task_struct *tsk);
+(cgroup_mutex held by caller)
+
+As attach, but for operations that must be run once per task to be attached,
+like can_attach_task. Called before attach. Currently does not support any
 subsystem that might need the old_cgrp for every thread in the group.
 
 void fork(struct cgroup_subsys *ss, struct task_struct *task)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b1febd0..45b3809 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -30,10 +30,8 @@ EXPORT_SYMBOL_GPL(blkio_root_cgroup);
 
 static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
 						  struct cgroup *);
-static int blkiocg_can_attach(struct cgroup_subsys *, struct cgroup *,
-			      struct task_struct *, bool);
-static void blkiocg_attach(struct cgroup_subsys *, struct cgroup *,
-			   struct cgroup *, struct task_struct *, bool);
+static int blkiocg_can_attach_task(struct cgroup *, struct task_struct *);
+static void blkiocg_attach_task(struct cgroup *, struct task_struct *);
 static void blkiocg_destroy(struct cgroup_subsys *, struct cgroup *);
 static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
 
@@ -46,8 +44,8 @@ static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
 struct cgroup_subsys blkio_subsys = {
 	.name = "blkio",
 	.create = blkiocg_create,
-	.can_attach = blkiocg_can_attach,
-	.attach = blkiocg_attach,
+	.can_attach_task = blkiocg_can_attach_task,
+	.attach_task = blkiocg_attach_task,
 	.destroy = blkiocg_destroy,
 	.populate = blkiocg_populate,
 #ifdef CONFIG_BLK_CGROUP
@@ -1475,9 +1473,7 @@ done:
  * of the main cic data structures.  For now we allow a task to change
  * its cgroup only if it's the only owner of its ioc.
  */
-static int blkiocg_can_attach(struct cgroup_subsys *subsys,
-				struct cgroup *cgroup, struct task_struct *tsk,
-				bool threadgroup)
+static int blkiocg_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 	struct io_context *ioc;
 	int ret = 0;
@@ -1492,9 +1488,7 @@ static int blkiocg_can_attach(struct cgroup_subsys *subsys,
 	return ret;
 }
 
-static void blkiocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
-				struct cgroup *prev, struct task_struct *tsk,
-				bool threadgroup)
+static void blkiocg_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 	struct io_context *ioc;
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ce104e3..35b69b4 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -467,12 +467,14 @@ struct cgroup_subsys {
 	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
-			  struct task_struct *tsk, bool threadgroup);
+			  struct task_struct *tsk);
+	int (*can_attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
 	void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
-			  struct task_struct *tsk, bool threadgroup);
+			      struct task_struct *tsk);
+	void (*pre_attach)(struct cgroup *cgrp);
+	void (*attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
-			struct cgroup *old_cgrp, struct task_struct *tsk,
-			bool threadgroup);
+		       struct cgroup *old_cgrp, struct task_struct *tsk);
 	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
 	void (*exit)(struct cgroup_subsys *ss, struct task_struct *task);
 	int (*populate)(struct cgroup_subsys *ss,
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 66a416b..616f27a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1750,7 +1750,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 
 	for_each_subsys(root, ss) {
 		if (ss->can_attach) {
-			retval = ss->can_attach(ss, cgrp, tsk, false);
+			retval = ss->can_attach(ss, cgrp, tsk);
 			if (retval) {
 				/*
 				 * Remember on which subsystem the can_attach()
@@ -1762,6 +1762,13 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 				goto out;
 			}
 		}
+		if (ss->can_attach_task) {
+			retval = ss->can_attach_task(cgrp, tsk);
+			if (retval) {
+				failed_ss = ss;
+				goto out;
+			}
+		}
 	}
 
 	task_lock(tsk);
@@ -1798,8 +1805,12 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	write_unlock(&css_set_lock);
 
 	for_each_subsys(root, ss) {
+		if (ss->pre_attach)
+			ss->pre_attach(cgrp);
+		if (ss->attach_task)
+			ss->attach_task(cgrp, tsk);
 		if (ss->attach)
-			ss->attach(ss, cgrp, oldcgrp, tsk, false);
+			ss->attach(ss, cgrp, oldcgrp, tsk);
 	}
 	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
 	synchronize_rcu();
@@ -1822,7 +1833,7 @@ out:
 				 */
 				break;
 			if (ss->cancel_attach)
-				ss->cancel_attach(ss, cgrp, tsk, false);
+				ss->cancel_attach(ss, cgrp, tsk);
 		}
 	}
 	return retval;
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index e7bebb7..e691818 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -160,7 +160,7 @@ static void freezer_destroy(struct cgroup_subsys *ss,
  */
 static int freezer_can_attach(struct cgroup_subsys *ss,
 			      struct cgroup *new_cgroup,
-			      struct task_struct *task, bool threadgroup)
+			      struct task_struct *task)
 {
 	struct freezer *freezer;
 
@@ -172,26 +172,17 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 	if (freezer->state != CGROUP_THAWED)
 		return -EBUSY;
 
+	return 0;
+}
+
+static int freezer_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+{
 	rcu_read_lock();
-	if (__cgroup_freezing_or_frozen(task)) {
+	if (__cgroup_freezing_or_frozen(tsk)) {
 		rcu_read_unlock();
 		return -EBUSY;
 	}
 	rcu_read_unlock();
-
-	if (threadgroup) {
-		struct task_struct *c;
-
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
-			if (__cgroup_freezing_or_frozen(c)) {
-				rcu_read_unlock();
-				return -EBUSY;
-			}
-		}
-		rcu_read_unlock();
-	}
-
 	return 0;
 }
 
@@ -390,6 +381,9 @@ struct cgroup_subsys freezer_subsys = {
 	.populate	= freezer_populate,
 	.subsys_id	= freezer_subsys_id,
 	.can_attach	= freezer_can_attach,
+	.can_attach_task = freezer_can_attach_task,
+	.pre_attach	= NULL,
+	.attach_task	= NULL,
 	.attach		= NULL,
 	.fork		= freezer_fork,
 	.exit		= NULL,
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 4349935..5f71ca2 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1372,14 +1372,10 @@ static int fmeter_getrate(struct fmeter *fmp)
 	return val;
 }
 
-/* Protected by cgroup_lock */
-static cpumask_var_t cpus_attach;
-
 /* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
 static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
-			     struct task_struct *tsk, bool threadgroup)
+			     struct task_struct *tsk)
 {
-	int ret;
 	struct cpuset *cs = cgroup_cs(cont);
 
 	if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
@@ -1396,29 +1392,42 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
 	if (tsk->flags & PF_THREAD_BOUND)
 		return -EINVAL;
 
-	ret = security_task_setscheduler(tsk);
-	if (ret)
-		return ret;
-	if (threadgroup) {
-		struct task_struct *c;
-
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			ret = security_task_setscheduler(c);
-			if (ret) {
-				rcu_read_unlock();
-				return ret;
-			}
-		}
-		rcu_read_unlock();
-	}
 	return 0;
 }
 
-static void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
-			       struct cpuset *cs)
+static int cpuset_can_attach_task(struct cgroup *cgrp, struct task_struct *task)
+{
+	return security_task_setscheduler(task);
+}
+
+/*
+ * Protected by cgroup_lock. The nodemasks must be stored globally because
+ * dynamically allocating them is not allowed in pre_attach, and they must
+ * persist among pre_attach, attach_task, and attach.
+ */
+static cpumask_var_t cpus_attach;
+static nodemask_t cpuset_attach_nodemask_from;
+static nodemask_t cpuset_attach_nodemask_to;
+
+/* Set-up work done once per attach, before the per-task attach_task calls. */
+static void cpuset_pre_attach(struct cgroup *cont)
+{
+	struct cpuset *cs = cgroup_cs(cont);
+
+	if (cs == &top_cpuset)
+		cpumask_copy(cpus_attach, cpu_possible_mask);
+	else
+		guarantee_online_cpus(cs, cpus_attach);
+
+	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+}
+
+/* Per-thread attachment work. */
+static void cpuset_attach_task(struct cgroup *cont, struct task_struct *tsk)
 {
 	int err;
+	struct cpuset *cs = cgroup_cs(cont);
+
 	/*
 	 * can_attach beforehand should guarantee that this doesn't fail.
 	 * TODO: have a better way to handle failure here
@@ -1426,56 +1435,31 @@ static void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
 	err = set_cpus_allowed_ptr(tsk, cpus_attach);
 	WARN_ON_ONCE(err);
 
-	cpuset_change_task_nodemask(tsk, to);
+	cpuset_change_task_nodemask(tsk, &cpuset_attach_nodemask_to);
 	cpuset_update_task_spread_flag(cs, tsk);
-
 }
 
 static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
-			  struct cgroup *oldcont, struct task_struct *tsk,
-			  bool threadgroup)
+			  struct cgroup *oldcont, struct task_struct *tsk)
 {
 	struct mm_struct *mm;
 	struct cpuset *cs = cgroup_cs(cont);
 	struct cpuset *oldcs = cgroup_cs(oldcont);
-	NODEMASK_ALLOC(nodemask_t, from, GFP_KERNEL);
-	NODEMASK_ALLOC(nodemask_t, to, GFP_KERNEL);
-
-	if (from == NULL || to == NULL)
-		goto alloc_fail;
 
-	if (cs == &top_cpuset) {
-		cpumask_copy(cpus_attach, cpu_possible_mask);
-	} else {
-		guarantee_online_cpus(cs, cpus_attach);
-	}
-	guarantee_online_mems(cs, to);
-
-	/* do per-task migration stuff possibly for each in the threadgroup */
-	cpuset_attach_task(tsk, to, cs);
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			cpuset_attach_task(c, to, cs);
-		}
-		rcu_read_unlock();
-	}
-
-	/* change mm; only needs to be done once even if threadgroup */
-	*from = oldcs->mems_allowed;
-	*to = cs->mems_allowed;
+	/*
+	 * Change mm, possibly for multiple threads in a threadgroup. This is
+	 * expensive and may sleep.
+	 */
+	cpuset_attach_nodemask_from = oldcs->mems_allowed;
+	cpuset_attach_nodemask_to = cs->mems_allowed;
 	mm = get_task_mm(tsk);
 	if (mm) {
-		mpol_rebind_mm(mm, to);
+		mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
 		if (is_memory_migrate(cs))
-			cpuset_migrate_mm(mm, from, to);
+			cpuset_migrate_mm(mm, &cpuset_attach_nodemask_from,
+					  &cpuset_attach_nodemask_to);
 		mmput(mm);
 	}
-
-alloc_fail:
-	NODEMASK_FREE(from);
-	NODEMASK_FREE(to);
 }
 
 /* The various types of files and directories in a cpuset file system */
@@ -1928,6 +1912,9 @@ struct cgroup_subsys cpuset_subsys = {
 	.create = cpuset_create,
 	.destroy = cpuset_destroy,
 	.can_attach = cpuset_can_attach,
+	.can_attach_task = cpuset_can_attach_task,
+	.pre_attach = cpuset_pre_attach,
+	.attach_task = cpuset_attach_task,
 	.attach = cpuset_attach,
 	.populate = cpuset_populate,
 	.post_clone = cpuset_post_clone,
diff --git a/kernel/ns_cgroup.c b/kernel/ns_cgroup.c
index 2c98ad9..1fc2b1b 100644
--- a/kernel/ns_cgroup.c
+++ b/kernel/ns_cgroup.c
@@ -43,7 +43,7 @@ int ns_cgroup_clone(struct task_struct *task, struct pid *pid)
  *        ancestor cgroup thereof)
  */
 static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
-			 struct task_struct *task, bool threadgroup)
+			 struct task_struct *task)
 {
 	if (current != task) {
 		if (!capable(CAP_SYS_ADMIN))
@@ -53,21 +53,13 @@ static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
 			return -EPERM;
 	}
 
-	if (!cgroup_is_descendant(new_cgroup, task))
-		return -EPERM;
-
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
-			if (!cgroup_is_descendant(new_cgroup, c)) {
-				rcu_read_unlock();
-				return -EPERM;
-			}
-		}
-		rcu_read_unlock();
-	}
+	return 0;
+}
 
+static int ns_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+{
+	if (!cgroup_is_descendant(cgrp, tsk))
+		return -EPERM;
 	return 0;
 }
 
@@ -112,6 +104,7 @@ static void ns_destroy(struct cgroup_subsys *ss,
 struct cgroup_subsys ns_subsys = {
 	.name = "ns",
 	.can_attach = ns_can_attach,
+	.can_attach_task = ns_can_attach_task,
 	.create = ns_create,
 	.destroy  = ns_destroy,
 	.subsys_id = ns_subsys_id,
diff --git a/kernel/sched.c b/kernel/sched.c
index 218ef20..d619f1d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8655,42 +8655,10 @@ cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	return 0;
 }
 
-static int
-cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-		      struct task_struct *tsk, bool threadgroup)
-{
-	int retval = cpu_cgroup_can_attach_task(cgrp, tsk);
-	if (retval)
-		return retval;
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			retval = cpu_cgroup_can_attach_task(cgrp, c);
-			if (retval) {
-				rcu_read_unlock();
-				return retval;
-			}
-		}
-		rcu_read_unlock();
-	}
-	return 0;
-}
-
 static void
-cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-		  struct cgroup *old_cont, struct task_struct *tsk,
-		  bool threadgroup)
+cpu_cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 	sched_move_task(tsk);
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			sched_move_task(c);
-		}
-		rcu_read_unlock();
-	}
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -8763,8 +8731,8 @@ struct cgroup_subsys cpu_cgroup_subsys = {
 	.name		= "cpu",
 	.create		= cpu_cgroup_create,
 	.destroy	= cpu_cgroup_destroy,
-	.can_attach	= cpu_cgroup_can_attach,
-	.attach		= cpu_cgroup_attach,
+	.can_attach_task = cpu_cgroup_can_attach_task,
+	.attach_task	= cpu_cgroup_attach_task,
 	.populate	= cpu_cgroup_populate,
 	.subsys_id	= cpu_cgroup_subsys_id,
 	.early_init	= 1,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 729beb7..995f0b9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4720,8 +4720,7 @@ static void mem_cgroup_clear_mc(void)
 
 static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 	int ret = 0;
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
@@ -4775,8 +4774,7 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 
 static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 	mem_cgroup_clear_mc();
 }
@@ -4880,8 +4878,7 @@ static void mem_cgroup_move_charge(struct mm_struct *mm)
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct cgroup *cont,
 				struct cgroup *old_cont,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 	if (!mc.mm)
 		/* no need to move charge */
@@ -4893,22 +4890,19 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 #else	/* !CONFIG_MMU */
 static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 	return 0;
 }
 static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 }
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct cgroup *cont,
 				struct cgroup *old_cont,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 }
 #endif
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 8d9c48f..cd1f779 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -62,8 +62,7 @@ static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
 struct cgroup_subsys devices_subsys;
 
 static int devcgroup_can_attach(struct cgroup_subsys *ss,
-		struct cgroup *new_cgroup, struct task_struct *task,
-		bool threadgroup)
+		struct cgroup *new_cgroup, struct task_struct *task)
 {
 	if (current != task && !capable(CAP_SYS_ADMIN))
 			return -EPERM;

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v8 3/3] cgroups: make procs file writable
  2011-02-08  1:35           ` Ben Blum
@ 2011-02-08  1:39                 ` Ben Blum
       [not found]             ` <20110208013542.GC31569-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-02-08  1:39 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Miao Xie, David Rientjes, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

Makes procs file writable to move all threads by tgid at once

From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

This patch adds functionality that enables users to move all threads in a
threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
file. The current implementation uses a per-threadgroup rwsem that is taken
for reading in the fork() path to prevent newly forking threads within the
threadgroup from "escaping" while the move is in progress.
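
As a condensed sketch of the control flow added below (the names are the ones
in the diff; cgroup_mutex handling, permission checks and the "write 0" case
are trimmed), writing a tgid resolves the group leader, holds its fork rwsem
for writing across the whole migration, and retries if a de_thread() race
changes leadership:

/* Sketch of the cgroup.procs write path; not the literal code below. */
static int procs_write_sketch(struct cgroup *cgrp, u64 tgid)
{
	struct task_struct *leader;
	int ret;

	do {
		rcu_read_lock();
		leader = find_task_by_vpid(tgid);
		if (!leader) {
			rcu_read_unlock();
			return -ESRCH;
		}
		leader = leader->group_leader;
		get_task_struct(leader);
		rcu_read_unlock();

		threadgroup_fork_write_lock(leader);	/* no new threads now */
		ret = cgroup_attach_proc(cgrp, leader);
		threadgroup_fork_write_unlock(leader);

		put_task_struct(leader);
		/* -EAGAIN: lost a race with de_thread(); retry with new leader */
	} while (ret == -EAGAIN);

	return ret;
}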

Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
---
 Documentation/cgroups/cgroups.txt |    9 +
 kernel/cgroup.c                   |  437 +++++++++++++++++++++++++++++++++----
 2 files changed, 397 insertions(+), 49 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index d3c9a24..92d93d6 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -236,7 +236,8 @@ containing the following files describing that cgroup:
  - cgroup.procs: list of tgids in the cgroup.  This list is not
    guaranteed to be sorted or free of duplicate tgids, and userspace
    should sort/uniquify the list if this property is required.
-   This is a read-only file, for now.
+   Writing a thread group id into this file moves all threads in that
+   group into this cgroup.
  - notify_on_release flag: run the release agent on exit?
  - release_agent: the path to use for release notifications (this file
    exists in the top cgroup only)
@@ -426,6 +427,12 @@ You can attach the current shell task by echoing 0:
 
 # echo 0 > tasks
 
+You can use the cgroup.procs file instead of the tasks file to move all
+threads in a threadgroup at once. Echoing the pid of any task in a
+threadgroup to cgroup.procs causes all tasks in that threadgroup to be
+attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
+in the writing task's threadgroup.
+
 2.3 Mounting hierarchies by name
 --------------------------------
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 616f27a..58b364a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1726,6 +1726,76 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 }
 EXPORT_SYMBOL_GPL(cgroup_path);
 
+/*
+ * cgroup_task_migrate - move a task from one cgroup to another.
+ *
+ * If 'guarantee' is set, the caller promises that a new css_set for the task
+ * already exists, and this function can only fail with -ESRCH. If not set, it
+ * might sleep and can also fail with -ENOMEM.
+ */
+static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
+			       struct task_struct *tsk, bool guarantee)
+{
+	struct css_set *oldcg;
+	struct css_set *newcg;
+
+	/*
+	 * get old css_set. we need to take task_lock and refcount it, because
+	 * an exiting task can change its css_set to init_css_set and drop its
+	 * old one without taking cgroup_mutex.
+	 */
+	task_lock(tsk);
+	oldcg = tsk->cgroups;
+	get_css_set(oldcg);
+	task_unlock(tsk);
+
+	/* locate or allocate a new css_set for this task. */
+	if (guarantee) {
+		/* we know the css_set we want already exists. */
+		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+		read_lock(&css_set_lock);
+		newcg = find_existing_css_set(oldcg, cgrp, template);
+		BUG_ON(!newcg);
+		get_css_set(newcg);
+		read_unlock(&css_set_lock);
+	} else {
+		might_sleep();
+		/* find_css_set will give us newcg already referenced. */
+		newcg = find_css_set(oldcg, cgrp);
+		if (!newcg) {
+			put_css_set(oldcg);
+			return -ENOMEM;
+		}
+	}
+	put_css_set(oldcg);
+
+	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
+	task_lock(tsk);
+	if (tsk->flags & PF_EXITING) {
+		task_unlock(tsk);
+		put_css_set(newcg);
+		return -ESRCH;
+	}
+	rcu_assign_pointer(tsk->cgroups, newcg);
+	task_unlock(tsk);
+
+	/* Update the css_set linked lists if we're using them */
+	write_lock(&css_set_lock);
+	if (!list_empty(&tsk->cg_list))
+		list_move(&tsk->cg_list, &newcg->tasks);
+	write_unlock(&css_set_lock);
+
+	/*
+	 * We just gained a reference on oldcg by taking it from the task. As
+	 * trading it for newcg is protected by cgroup_mutex, we're safe to drop
+	 * it here; it will be freed under RCU.
+	 */
+	put_css_set(oldcg);
+
+	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+	return 0;
+}
+
 /**
  * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
  * @cgrp: the cgroup the task is attaching to
@@ -1736,11 +1806,9 @@ EXPORT_SYMBOL_GPL(cgroup_path);
  */
 int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
-	int retval = 0;
+	int retval;
 	struct cgroup_subsys *ss, *failed_ss = NULL;
 	struct cgroup *oldcgrp;
-	struct css_set *cg;
-	struct css_set *newcg;
 	struct cgroupfs_root *root = cgrp->root;
 
 	/* Nothing to do if the task is already in that cgroup */
@@ -1771,38 +1839,9 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		}
 	}
 
-	task_lock(tsk);
-	cg = tsk->cgroups;
-	get_css_set(cg);
-	task_unlock(tsk);
-	/*
-	 * Locate or allocate a new css_set for this task,
-	 * based on its final set of cgroups
-	 */
-	newcg = find_css_set(cg, cgrp);
-	put_css_set(cg);
-	if (!newcg) {
-		retval = -ENOMEM;
-		goto out;
-	}
-
-	task_lock(tsk);
-	if (tsk->flags & PF_EXITING) {
-		task_unlock(tsk);
-		put_css_set(newcg);
-		retval = -ESRCH;
+	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, false);
+	if (retval)
 		goto out;
-	}
-	rcu_assign_pointer(tsk->cgroups, newcg);
-	task_unlock(tsk);
-
-	/* Update the css_set linked lists if we're using them */
-	write_lock(&css_set_lock);
-	if (!list_empty(&tsk->cg_list)) {
-		list_del(&tsk->cg_list);
-		list_add(&tsk->cg_list, &newcg->tasks);
-	}
-	write_unlock(&css_set_lock);
 
 	for_each_subsys(root, ss) {
 		if (ss->pre_attach)
@@ -1812,9 +1851,8 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		if (ss->attach)
 			ss->attach(ss, cgrp, oldcgrp, tsk);
 	}
-	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+
 	synchronize_rcu();
-	put_css_set(cg);
 
 	/*
 	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
@@ -1864,49 +1902,352 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 EXPORT_SYMBOL_GPL(cgroup_attach_task_all);
 
 /*
- * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
- * held. May take task_lock of task
+ * cgroup_attach_proc works in two stages, the first of which prefetches all
+ * new css_sets needed (to make sure we have enough memory before committing
+ * to the move) and stores them in a list of entries of the following type.
+ * TODO: possible optimization: use css_set->rcu_head for chaining instead
+ */
+struct cg_list_entry {
+	struct css_set *cg;
+	struct list_head links;
+};
+
+static bool css_set_check_fetched(struct cgroup *cgrp,
+				  struct task_struct *tsk, struct css_set *cg,
+				  struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+
+	read_lock(&css_set_lock);
+	newcg = find_existing_css_set(cg, cgrp, template);
+	if (newcg)
+		get_css_set(newcg);
+	read_unlock(&css_set_lock);
+
+	/* doesn't exist at all? */
+	if (!newcg)
+		return false;
+	/* see if it's already in the list */
+	list_for_each_entry(cg_entry, newcg_list, links) {
+		if (cg_entry->cg == newcg) {
+			put_css_set(newcg);
+			return true;
+		}
+	}
+
+	/* not found */
+	put_css_set(newcg);
+	return false;
+}
+
+/*
+ * Find the new css_set and store it in the list in preparation for moving the
+ * given task to the given cgroup. Returns 0 or -ENOMEM.
+ */
+static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
+			    struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+
+	/* ensure a new css_set will exist for this thread */
+	newcg = find_css_set(cg, cgrp);
+	if (!newcg)
+		return -ENOMEM;
+	/* add it to the list */
+	cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
+	if (!cg_entry) {
+		put_css_set(newcg);
+		return -ENOMEM;
+	}
+	cg_entry->cg = newcg;
+	list_add(&cg_entry->links, newcg_list);
+	return 0;
+}
+
+/**
+ * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
+ * @cgrp: the cgroup to attach to
+ * @leader: the threadgroup leader task_struct of the group to be attached
+ *
+ * Call holding cgroup_mutex and the threadgroup_fork_lock of the leader. Will
+ * take task_lock of each thread in leader's threadgroup individually in turn.
+ */
+int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
+{
+	int retval, i, group_size;
+	struct cgroup_subsys *ss, *failed_ss = NULL;
+	/* guaranteed to be initialized later, but the compiler needs this */
+	struct cgroup *oldcgrp = NULL;
+	struct css_set *oldcg;
+	struct cgroupfs_root *root = cgrp->root;
+	/* threadgroup list cursor and array */
+	struct task_struct *tsk;
+	struct task_struct **group;
+	/*
+	 * we need to make sure we have css_sets for all the tasks we're
+	 * going to move -before- we actually start moving them, so that in
+	 * case we get an ENOMEM we can bail out before making any changes.
+	 */
+	struct list_head newcg_list;
+	struct cg_list_entry *cg_entry, *temp_nobe;
+
+	/*
+	 * step 0: in order to do expensive, possibly blocking operations for
+	 * every thread, we cannot iterate the thread group list, since it needs
+	 * rcu or tasklist locked. instead, build an array of all threads in the
+	 * group - threadgroup_fork_lock prevents new threads from appearing,
+	 * and if threads exit, this will just be an over-estimate.
+	 */
+	group_size = get_nr_threads(leader);
+	group = kmalloc(group_size * sizeof(*group), GFP_KERNEL);
+	if (!group)
+		return -ENOMEM;
+
+	/* prevent changes to the threadgroup list while we take a snapshot. */
+	rcu_read_lock();
+	if (!thread_group_leader(leader)) {
+		/*
+		 * a race with de_thread from another thread's exec() may strip
+		 * us of our leadership, making while_each_thread unsafe to use
+		 * on this task. if this happens, there is no choice but to
+		 * throw this task away and try again (from cgroup_procs_write);
+		 * this is "double-double-toil-and-trouble-check locking".
+		 */
+		rcu_read_unlock();
+		retval = -EAGAIN;
+		goto out_free_group_list;
+	}
+	/* take a reference on each task in the group to go in the array. */
+	tsk = leader;
+	i = 0;
+	do {
+		/* as per above, nr_threads may decrease, but not increase. */
+		BUG_ON(i >= group_size);
+		get_task_struct(tsk);
+		group[i] = tsk;
+		i++;
+	} while_each_thread(leader, tsk);
+	/* remember the number of threads in the array for later. */
+	BUG_ON(i == 0);
+	group_size = i;
+	rcu_read_unlock();
+
+	/*
+	 * step 1: check that we can legitimately attach to the cgroup.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->can_attach) {
+			retval = ss->can_attach(ss, cgrp, leader);
+			if (retval) {
+				failed_ss = ss;
+				goto out_cancel_attach;
+			}
+		}
+		/* a callback to be run on every thread in the threadgroup. */
+		if (ss->can_attach_task) {
+			/* run on each task in the threadgroup. */
+			for (i = 0; i < group_size; i++) {
+				retval = ss->can_attach_task(cgrp, group[i]);
+				if (retval) {
+					failed_ss = ss;
+					goto out_cancel_attach;
+				}
+			}
+		}
+	}
+
+	/*
+	 * step 2: make sure css_sets exist for all threads to be migrated.
+	 * we use find_css_set, which allocates a new one if necessary.
+	 */
+	INIT_LIST_HEAD(&newcg_list);
+	for (i = 0; i < group_size; i++) {
+		tsk = group[i];
+		/* nothing to do if this task is already in the cgroup */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* get old css_set pointer */
+		task_lock(tsk);
+		if (tsk->flags & PF_EXITING) {
+			/* ignore this task if it's going away */
+			task_unlock(tsk);
+			continue;
+		}
+		oldcg = tsk->cgroups;
+		get_css_set(oldcg);
+		task_unlock(tsk);
+		/* see if the new one for us is already in the list? */
+		if (css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list)) {
+			/* was already there, nothing to do. */
+			put_css_set(oldcg);
+		} else {
+			/* we don't already have it. get new one. */
+			retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+			put_css_set(oldcg);
+			if (retval)
+				goto out_list_teardown;
+		}
+	}
+
+	/*
+	 * step 3: now that we're guaranteed success wrt the css_sets, proceed
+	 * to move all tasks to the new cgroup, calling ss->attach_task for each
+	 * one along the way. there are no failure cases after here, so this is
+	 * the commit point.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->pre_attach)
+			ss->pre_attach(cgrp);
+	}
+	for (i = 0; i < group_size; i++) {
+		tsk = group[i];
+		/* leave current thread as it is if it's already there */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* attach each task to each subsystem */
+		for_each_subsys(root, ss) {
+			if (ss->attach_task)
+				ss->attach_task(cgrp, tsk);
+		}
+		/* if the thread is PF_EXITING, it can just get skipped. */
+		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+	/* nothing is sensitive to fork() after this point. */
+
+	/*
+	 * step 4: do expensive, non-thread-specific subsystem callbacks.
+	 * TODO: if ever a subsystem needs to know the oldcgrp for each task
+	 * being moved, this call will need to be reworked to communicate that.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->attach)
+			ss->attach(ss, cgrp, oldcgrp, leader);
+	}
+
+	/*
+	 * step 5: success! and cleanup
+	 */
+	synchronize_rcu();
+	cgroup_wakeup_rmdir_waiter(cgrp);
+	retval = 0;
+out_list_teardown:
+	/* clean up the list of prefetched css_sets. */
+	list_for_each_entry_safe(cg_entry, temp_nobe, &newcg_list, links) {
+		list_del(&cg_entry->links);
+		put_css_set(cg_entry->cg);
+		kfree(cg_entry);
+	}
+out_cancel_attach:
+	/* same deal as in cgroup_attach_task */
+	if (retval) {
+		for_each_subsys(root, ss) {
+			if (ss == failed_ss)
+				break;
+			if (ss->cancel_attach)
+				ss->cancel_attach(ss, cgrp, leader);
+		}
+	}
+	/* clean up the array of referenced threads in the group. */
+	for (i = 0; i < group_size; i++)
+		put_task_struct(group[i]);
+out_free_group_list:
+	kfree(group);
+	return retval;
+}
+
+/*
+ * Find the task_struct of the task to attach by vpid and pass it along to the
+ * function to attach either it or all tasks in its threadgroup. Will take
+ * cgroup_mutex; may take task_lock of task.
  */
-static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
+static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 {
 	struct task_struct *tsk;
 	const struct cred *cred = current_cred(), *tcred;
 	int ret;
 
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+
 	if (pid) {
 		rcu_read_lock();
 		tsk = find_task_by_vpid(pid);
-		if (!tsk || tsk->flags & PF_EXITING) {
+		if (!tsk) {
 			rcu_read_unlock();
+			cgroup_unlock();
+			return -ESRCH;
+		}
+		if (threadgroup) {
+			/*
+			 * it is safe to find group_leader because tsk was found
+			 * in the tid map, meaning it can't have been unhashed
+			 * by someone in de_thread changing the leadership.
+			 */
+			tsk = tsk->group_leader;
+			BUG_ON(!thread_group_leader(tsk));
+		} else if (tsk->flags & PF_EXITING) {
+			/* optimization for the single-task-only case */
+			rcu_read_unlock();
+			cgroup_unlock();
 			return -ESRCH;
 		}
 
+		/*
+		 * even if we're attaching all tasks in the thread group, we
+		 * only need to check permissions on one of them.
+		 */
 		tcred = __task_cred(tsk);
 		if (cred->euid &&
 		    cred->euid != tcred->uid &&
 		    cred->euid != tcred->suid) {
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -EACCES;
 		}
 		get_task_struct(tsk);
 		rcu_read_unlock();
 	} else {
-		tsk = current;
+		if (threadgroup)
+			tsk = current->group_leader;
+		else
+			tsk = current;
 		get_task_struct(tsk);
 	}
 
-	ret = cgroup_attach_task(cgrp, tsk);
+	if (threadgroup) {
+		threadgroup_fork_write_lock(tsk);
+		ret = cgroup_attach_proc(cgrp, tsk);
+		threadgroup_fork_write_unlock(tsk);
+	} else {
+		ret = cgroup_attach_task(cgrp, tsk);
+	}
 	put_task_struct(tsk);
+	cgroup_unlock();
 	return ret;
 }
 
 static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
 {
+	return attach_task_by_pid(cgrp, pid, false);
+}
+
+static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
+{
 	int ret;
-	if (!cgroup_lock_live_group(cgrp))
-		return -ENODEV;
-	ret = attach_task_by_pid(cgrp, pid);
-	cgroup_unlock();
+	do {
+		/*
+		 * attach_proc fails with -EAGAIN if threadgroup leadership
+		 * changes in the middle of the operation, in which case we need
+		 * to find the task_struct for the new leader and start over.
+		 */
+		ret = attach_task_by_pid(cgrp, tgid, true);
+	} while (ret == -EAGAIN);
 	return ret;
 }
 
@@ -3260,9 +3601,9 @@ static struct cftype files[] = {
 	{
 		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
 		.open = cgroup_procs_open,
-		/* .write_u64 = cgroup_procs_write, TODO */
+		.write_u64 = cgroup_procs_write,
 		.release = cgroup_pidlist_release,
-		.mode = S_IRUGO,
+		.mode = S_IRUGO | S_IWUSR,
 	},
 	{
 		.name = "notify_on_release",

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v8 3/3] cgroups: make procs file writable
@ 2011-02-08  1:39                 ` Ben Blum
  0 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-02-08  1:39 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage,
	oleg, David Rientjes, Miao Xie

Makes procs file writable to move all threads by tgid at once

From: Ben Blum <bblum@andrew.cmu.edu>

This patch adds functionality that enables users to move all threads in a
threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
file. This current implementation makes use of a per-threadgroup rwsem that's
taken for reading in the fork() path to prevent newly forking threads within
the threadgroup from "escaping" while the move is in progress.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
---
 Documentation/cgroups/cgroups.txt |    9 +
 kernel/cgroup.c                   |  437 +++++++++++++++++++++++++++++++++----
 2 files changed, 397 insertions(+), 49 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index d3c9a24..92d93d6 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -236,7 +236,8 @@ containing the following files describing that cgroup:
  - cgroup.procs: list of tgids in the cgroup.  This list is not
    guaranteed to be sorted or free of duplicate tgids, and userspace
    should sort/uniquify the list if this property is required.
-   This is a read-only file, for now.
+   Writing a thread group id into this file moves all threads in that
+   group into this cgroup.
  - notify_on_release flag: run the release agent on exit?
  - release_agent: the path to use for release notifications (this file
    exists in the top cgroup only)
@@ -426,6 +427,12 @@ You can attach the current shell task by echoing 0:
 
 # echo 0 > tasks
 
+You can use the cgroup.procs file instead of the tasks file to move all
+threads in a threadgroup at once. Echoing the pid of any task in a
+threadgroup to cgroup.procs causes all tasks in that threadgroup to be
+attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
+in the writing task's threadgroup.
+
 2.3 Mounting hierarchies by name
 --------------------------------
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 616f27a..58b364a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1726,6 +1726,76 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 }
 EXPORT_SYMBOL_GPL(cgroup_path);
 
+/*
+ * cgroup_task_migrate - move a task from one cgroup to another.
+ *
+ * 'guarantee' is set if the caller promises that a new css_set for the task
+ * will already exist. If not set, this function might sleep, and can fail with
+ * -ENOMEM. Otherwise, it can only fail with -ESRCH.
+ */
+static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
+			       struct task_struct *tsk, bool guarantee)
+{
+	struct css_set *oldcg;
+	struct css_set *newcg;
+
+	/*
+	 * get old css_set. we need to take task_lock and refcount it, because
+	 * an exiting task can change its css_set to init_css_set and drop its
+	 * old one without taking cgroup_mutex.
+	 */
+	task_lock(tsk);
+	oldcg = tsk->cgroups;
+	get_css_set(oldcg);
+	task_unlock(tsk);
+
+	/* locate or allocate a new css_set for this task. */
+	if (guarantee) {
+		/* we know the css_set we want already exists. */
+		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+		read_lock(&css_set_lock);
+		newcg = find_existing_css_set(oldcg, cgrp, template);
+		BUG_ON(!newcg);
+		get_css_set(newcg);
+		read_unlock(&css_set_lock);
+	} else {
+		might_sleep();
+		/* find_css_set will give us newcg already referenced. */
+		newcg = find_css_set(oldcg, cgrp);
+		if (!newcg) {
+			put_css_set(oldcg);
+			return -ENOMEM;
+		}
+	}
+	put_css_set(oldcg);
+
+	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
+	task_lock(tsk);
+	if (tsk->flags & PF_EXITING) {
+		task_unlock(tsk);
+		put_css_set(newcg);
+		return -ESRCH;
+	}
+	rcu_assign_pointer(tsk->cgroups, newcg);
+	task_unlock(tsk);
+
+	/* Update the css_set linked lists if we're using them */
+	write_lock(&css_set_lock);
+	if (!list_empty(&tsk->cg_list))
+		list_move(&tsk->cg_list, &newcg->tasks);
+	write_unlock(&css_set_lock);
+
+	/*
+	 * We just gained a reference on oldcg by taking it from the task. As
+	 * trading it for newcg is protected by cgroup_mutex, we're safe to drop
+	 * it here; it will be freed under RCU.
+	 */
+	put_css_set(oldcg);
+
+	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+	return 0;
+}
+
 /**
  * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
  * @cgrp: the cgroup the task is attaching to
@@ -1736,11 +1806,9 @@ EXPORT_SYMBOL_GPL(cgroup_path);
  */
 int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
-	int retval = 0;
+	int retval;
 	struct cgroup_subsys *ss, *failed_ss = NULL;
 	struct cgroup *oldcgrp;
-	struct css_set *cg;
-	struct css_set *newcg;
 	struct cgroupfs_root *root = cgrp->root;
 
 	/* Nothing to do if the task is already in that cgroup */
@@ -1771,38 +1839,9 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		}
 	}
 
-	task_lock(tsk);
-	cg = tsk->cgroups;
-	get_css_set(cg);
-	task_unlock(tsk);
-	/*
-	 * Locate or allocate a new css_set for this task,
-	 * based on its final set of cgroups
-	 */
-	newcg = find_css_set(cg, cgrp);
-	put_css_set(cg);
-	if (!newcg) {
-		retval = -ENOMEM;
-		goto out;
-	}
-
-	task_lock(tsk);
-	if (tsk->flags & PF_EXITING) {
-		task_unlock(tsk);
-		put_css_set(newcg);
-		retval = -ESRCH;
+	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, false);
+	if (retval)
 		goto out;
-	}
-	rcu_assign_pointer(tsk->cgroups, newcg);
-	task_unlock(tsk);
-
-	/* Update the css_set linked lists if we're using them */
-	write_lock(&css_set_lock);
-	if (!list_empty(&tsk->cg_list)) {
-		list_del(&tsk->cg_list);
-		list_add(&tsk->cg_list, &newcg->tasks);
-	}
-	write_unlock(&css_set_lock);
 
 	for_each_subsys(root, ss) {
 		if (ss->pre_attach)
@@ -1812,9 +1851,8 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		if (ss->attach)
 			ss->attach(ss, cgrp, oldcgrp, tsk);
 	}
-	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+
 	synchronize_rcu();
-	put_css_set(cg);
 
 	/*
 	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
@@ -1864,49 +1902,352 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 EXPORT_SYMBOL_GPL(cgroup_attach_task_all);
 
 /*
- * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
- * held. May take task_lock of task
+ * cgroup_attach_proc works in two stages, the first of which prefetches all
+ * new css_sets needed (to make sure we have enough memory before committing
+ * to the move) and stores them in a list of entries of the following type.
+ * TODO: possible optimization: use css_set->rcu_head for chaining instead
+ */
+struct cg_list_entry {
+	struct css_set *cg;
+	struct list_head links;
+};
+
+static bool css_set_check_fetched(struct cgroup *cgrp,
+				  struct task_struct *tsk, struct css_set *cg,
+				  struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+
+	read_lock(&css_set_lock);
+	newcg = find_existing_css_set(cg, cgrp, template);
+	if (newcg)
+		get_css_set(newcg);
+	read_unlock(&css_set_lock);
+
+	/* doesn't exist at all? */
+	if (!newcg)
+		return false;
+	/* see if it's already in the list */
+	list_for_each_entry(cg_entry, newcg_list, links) {
+		if (cg_entry->cg == newcg) {
+			put_css_set(newcg);
+			return true;
+		}
+	}
+
+	/* not found */
+	put_css_set(newcg);
+	return false;
+}
+
+/*
+ * Find the new css_set and store it in the list in preparation for moving the
+ * given task to the given cgroup. Returns 0 or -ENOMEM.
+ */
+static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
+			    struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+
+	/* ensure a new css_set will exist for this thread */
+	newcg = find_css_set(cg, cgrp);
+	if (!newcg)
+		return -ENOMEM;
+	/* add it to the list */
+	cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
+	if (!cg_entry) {
+		put_css_set(newcg);
+		return -ENOMEM;
+	}
+	cg_entry->cg = newcg;
+	list_add(&cg_entry->links, newcg_list);
+	return 0;
+}
+
+/**
+ * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
+ * @cgrp: the cgroup to attach to
+ * @leader: the threadgroup leader task_struct of the group to be attached
+ *
+ * Call holding cgroup_mutex and the threadgroup_fork_lock of the leader. Will
+ * take task_lock of each thread in leader's threadgroup individually in turn.
+ */
+int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
+{
+	int retval, i, group_size;
+	struct cgroup_subsys *ss, *failed_ss = NULL;
+	/* guaranteed to be initialized later, but the compiler needs this */
+	struct cgroup *oldcgrp = NULL;
+	struct css_set *oldcg;
+	struct cgroupfs_root *root = cgrp->root;
+	/* threadgroup list cursor and array */
+	struct task_struct *tsk;
+	struct task_struct **group;
+	/*
+	 * we need to make sure we have css_sets for all the tasks we're
+	 * going to move -before- we actually start moving them, so that in
+	 * case we get an ENOMEM we can bail out before making any changes.
+	 */
+	struct list_head newcg_list;
+	struct cg_list_entry *cg_entry, *temp_nobe;
+
+	/*
+	 * step 0: in order to do expensive, possibly blocking operations for
+	 * every thread, we cannot iterate the thread group list, since it needs
+	 * rcu or tasklist locked. instead, build an array of all threads in the
+	 * group - threadgroup_fork_lock prevents new threads from appearing,
+	 * and if threads exit, this will just be an over-estimate.
+	 */
+	group_size = get_nr_threads(leader);
+	group = kmalloc(group_size * sizeof(*group), GFP_KERNEL);
+	if (!group)
+		return -ENOMEM;
+
+	/* prevent changes to the threadgroup list while we take a snapshot. */
+	rcu_read_lock();
+	if (!thread_group_leader(leader)) {
+		/*
+		 * a race with de_thread from another thread's exec() may strip
+		 * us of our leadership, making while_each_thread unsafe to use
+		 * on this task. if this happens, there is no choice but to
+		 * throw this task away and try again (from cgroup_procs_write);
+		 * this is "double-double-toil-and-trouble-check locking".
+		 */
+		rcu_read_unlock();
+		retval = -EAGAIN;
+		goto out_free_group_list;
+	}
+	/* take a reference on each task in the group to go in the array. */
+	tsk = leader;
+	i = 0;
+	do {
+		/* as per above, nr_threads may decrease, but not increase. */
+		BUG_ON(i >= group_size);
+		get_task_struct(tsk);
+		group[i] = tsk;
+		i++;
+	} while_each_thread(leader, tsk);
+	/* remember the number of threads in the array for later. */
+	BUG_ON(i == 0);
+	group_size = i;
+	rcu_read_unlock();
+
+	/*
+	 * step 1: check that we can legitimately attach to the cgroup.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->can_attach) {
+			retval = ss->can_attach(ss, cgrp, leader);
+			if (retval) {
+				failed_ss = ss;
+				goto out_cancel_attach;
+			}
+		}
+		/* a callback to be run on every thread in the threadgroup. */
+		if (ss->can_attach_task) {
+			/* run on each task in the threadgroup. */
+			for (i = 0; i < group_size; i++) {
+				retval = ss->can_attach_task(cgrp, group[i]);
+				if (retval) {
+					failed_ss = ss;
+					goto out_cancel_attach;
+				}
+			}
+		}
+	}
+
+	/*
+	 * step 2: make sure css_sets exist for all threads to be migrated.
+	 * we use find_css_set, which allocates a new one if necessary.
+	 */
+	INIT_LIST_HEAD(&newcg_list);
+	for (i = 0; i < group_size; i++) {
+		tsk = group[i];
+		/* nothing to do if this task is already in the cgroup */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* get old css_set pointer */
+		task_lock(tsk);
+		if (tsk->flags & PF_EXITING) {
+			/* ignore this task if it's going away */
+			task_unlock(tsk);
+			continue;
+		}
+		oldcg = tsk->cgroups;
+		get_css_set(oldcg);
+		task_unlock(tsk);
+		/* see if the new one for us is already in the list? */
+		if (css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list)) {
+			/* was already there, nothing to do. */
+			put_css_set(oldcg);
+		} else {
+			/* we don't already have it. get new one. */
+			retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+			put_css_set(oldcg);
+			if (retval)
+				goto out_list_teardown;
+		}
+	}
+
+	/*
+	 * step 3: now that we're guaranteed success wrt the css_sets, proceed
+	 * to move all tasks to the new cgroup, calling ss->attach_task for each
+	 * one along the way. there are no failure cases after here, so this is
+	 * the commit point.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->pre_attach)
+			ss->pre_attach(cgrp);
+	}
+	for (i = 0; i < group_size; i++) {
+		tsk = group[i];
+		/* leave current thread as it is if it's already there */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* attach each task to each subsystem */
+		for_each_subsys(root, ss) {
+			if (ss->attach_task)
+				ss->attach_task(cgrp, tsk);
+		}
+		/* if the thread is PF_EXITING, it can just get skipped. */
+		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+	/* nothing is sensitive to fork() after this point. */
+
+	/*
+	 * step 4: do expensive, non-thread-specific subsystem callbacks.
+	 * TODO: if ever a subsystem needs to know the oldcgrp for each task
+	 * being moved, this call will need to be reworked to communicate that.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->attach)
+			ss->attach(ss, cgrp, oldcgrp, leader);
+	}
+
+	/*
+	 * step 5: success! and cleanup
+	 */
+	synchronize_rcu();
+	cgroup_wakeup_rmdir_waiter(cgrp);
+	retval = 0;
+out_list_teardown:
+	/* clean up the list of prefetched css_sets. */
+	list_for_each_entry_safe(cg_entry, temp_nobe, &newcg_list, links) {
+		list_del(&cg_entry->links);
+		put_css_set(cg_entry->cg);
+		kfree(cg_entry);
+	}
+out_cancel_attach:
+	/* same deal as in cgroup_attach_task */
+	if (retval) {
+		for_each_subsys(root, ss) {
+			if (ss == failed_ss)
+				break;
+			if (ss->cancel_attach)
+				ss->cancel_attach(ss, cgrp, leader);
+		}
+	}
+	/* clean up the array of referenced threads in the group. */
+	for (i = 0; i < group_size; i++)
+		put_task_struct(group[i]);
+out_free_group_list:
+	kfree(group);
+	return retval;
+}
+
+/*
+ * Find the task_struct of the task to attach by vpid and pass it along to the
+ * function to attach either it or all tasks in its threadgroup. Will take
+ * cgroup_mutex; may take task_lock of task.
  */
-static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
+static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 {
 	struct task_struct *tsk;
 	const struct cred *cred = current_cred(), *tcred;
 	int ret;
 
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+
 	if (pid) {
 		rcu_read_lock();
 		tsk = find_task_by_vpid(pid);
-		if (!tsk || tsk->flags & PF_EXITING) {
+		if (!tsk) {
 			rcu_read_unlock();
+			cgroup_unlock();
+			return -ESRCH;
+		}
+		if (threadgroup) {
+			/*
+			 * it is safe to find group_leader because tsk was found
+			 * in the tid map, meaning it can't have been unhashed
+			 * by someone in de_thread changing the leadership.
+			 */
+			tsk = tsk->group_leader;
+			BUG_ON(!thread_group_leader(tsk));
+		} else if (tsk->flags & PF_EXITING) {
+			/* optimization for the single-task-only case */
+			rcu_read_unlock();
+			cgroup_unlock();
 			return -ESRCH;
 		}
 
+		/*
+		 * even if we're attaching all tasks in the thread group, we
+		 * only need to check permissions on one of them.
+		 */
 		tcred = __task_cred(tsk);
 		if (cred->euid &&
 		    cred->euid != tcred->uid &&
 		    cred->euid != tcred->suid) {
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -EACCES;
 		}
 		get_task_struct(tsk);
 		rcu_read_unlock();
 	} else {
-		tsk = current;
+		if (threadgroup)
+			tsk = current->group_leader;
+		else
+			tsk = current;
 		get_task_struct(tsk);
 	}
 
-	ret = cgroup_attach_task(cgrp, tsk);
+	if (threadgroup) {
+		threadgroup_fork_write_lock(tsk);
+		ret = cgroup_attach_proc(cgrp, tsk);
+		threadgroup_fork_write_unlock(tsk);
+	} else {
+		ret = cgroup_attach_task(cgrp, tsk);
+	}
 	put_task_struct(tsk);
+	cgroup_unlock();
 	return ret;
 }
 
 static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
 {
+	return attach_task_by_pid(cgrp, pid, false);
+}
+
+static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
+{
 	int ret;
-	if (!cgroup_lock_live_group(cgrp))
-		return -ENODEV;
-	ret = attach_task_by_pid(cgrp, pid);
-	cgroup_unlock();
+	do {
+		/*
+		 * attach_proc fails with -EAGAIN if threadgroup leadership
+		 * changes in the middle of the operation, in which case we need
+		 * to find the task_struct for the new leader and start over.
+		 */
+		ret = attach_task_by_pid(cgrp, tgid, true);
+	} while (ret == -EAGAIN);
 	return ret;
 }
 
@@ -3260,9 +3601,9 @@ static struct cftype files[] = {
 	{
 		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
 		.open = cgroup_procs_open,
-		/* .write_u64 = cgroup_procs_write, TODO */
+		.write_u64 = cgroup_procs_write,
 		.release = cgroup_pidlist_release,
-		.mode = S_IRUGO,
+		.mode = S_IRUGO | S_IWUSR,
 	},
 	{
 		.name = "notify_on_release",

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* Re: [PATCH v8 0/3] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
  2011-02-08  1:35           ` Ben Blum
  2011-02-08  1:37             ` [PATCH v8 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup Ben Blum
       [not found]             ` <20110208013542.GC31569-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2011-02-09 23:10             ` Andrew Morton
  2011-02-10  1:02               ` KAMEZAWA Hiroyuki
                                 ` (2 more replies)
  2011-04-06 19:44             ` [PATCH v8.75 0/4] " Ben Blum
  3 siblings, 3 replies; 185+ messages in thread
From: Andrew Morton @ 2011-02-09 23:10 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, ebiederm, lizf, matthltc, menage, oleg,
	David Rientjes, Miao Xie

On Mon, 7 Feb 2011 20:35:42 -0500
Ben Blum <bblum@andrew.cmu.edu> wrote:

> On Sun, Dec 26, 2010 at 07:09:19AM -0500, Ben Blum wrote:
> > On Fri, Dec 24, 2010 at 03:22:26AM -0500, Ben Blum wrote:
> > > On Wed, Aug 11, 2010 at 01:46:04AM -0400, Ben Blum wrote:
> > > > On Fri, Jul 30, 2010 at 07:56:49PM -0400, Ben Blum wrote:
> > > > > This patch series is a revision of http://lkml.org/lkml/2010/6/25/11 .
> > > > > 
> > > > > This patch series implements a write function for the 'cgroup.procs'
> > > > > per-cgroup file, which enables atomic movement of multithreaded
> > > > > applications between cgroups. Writing the thread-ID of any thread in a
> > > > > threadgroup to a cgroup's procs file causes all threads in the group to
> > > > > be moved to that cgroup safely with respect to threads forking/exiting.
> > > > > (Possible usage scenario: If running a multithreaded build system that
> > > > > sucks up system resources, this lets you restrict it all at once into a
> > > > > new cgroup to keep it under control.)
> > > > > 
> > > > > Example: Suppose pid 31337 clones new threads 31338 and 31339.
> > > > > 
> > > > > # cat /dev/cgroup/tasks
> > > > > ...
> > > > > 31337
> > > > > 31338
> > > > > 31339
> > > > > # mkdir /dev/cgroup/foo
> > > > > # echo 31337 > /dev/cgroup/foo/cgroup.procs
> > > > > # cat /dev/cgroup/foo/tasks
> > > > > 31337
> > > > > 31338
> > > > > 31339
> > > > > 
> > > > > A new lock, called threadgroup_fork_lock and living in signal_struct, is
> > > > > introduced to ensure atomicity when moving threads between cgroups. It's
> > > > > taken for writing during the operation, and taking for reading in fork()
> > > > > around the calls to cgroup_fork() and cgroup_post_fork().

The above six month old text is the best (and almost the only)
explanation of the rationale for the entire patch series.  Is
it still correct and complete?


Assuming "yes", then...  how do we determine whether the feature is
sufficiently useful to justify merging and maintaining it?  Will people
actually use it?

Was there some particular operational situation which led you to think
that the kernel should have this capability?  If so, please help us out here
and lavishly describe it.



^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8 0/3] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
  2011-02-09 23:10             ` [PATCH v8 0/3] " Andrew Morton
@ 2011-02-10  1:02               ` KAMEZAWA Hiroyuki
       [not found]                 ` <20110210100210.adf09c49.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
                                   ` (2 more replies)
  2011-02-14  6:12               ` Paul Menage
       [not found]               ` <20110209151046.89e03dcd.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  2 siblings, 3 replies; 185+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-02-10  1:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ben Blum, containers, linux-kernel, oleg, Miao Xie,
	David Rientjes, menage, ebiederm

On Wed, 9 Feb 2011 15:10:46 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 7 Feb 2011 20:35:42 -0500
> Ben Blum <bblum@andrew.cmu.edu> wrote:
> 
> > On Sun, Dec 26, 2010 at 07:09:19AM -0500, Ben Blum wrote:
> > > On Fri, Dec 24, 2010 at 03:22:26AM -0500, Ben Blum wrote:
> > > > On Wed, Aug 11, 2010 at 01:46:04AM -0400, Ben Blum wrote:
> > > > > On Fri, Jul 30, 2010 at 07:56:49PM -0400, Ben Blum wrote:
> > > > > > This patch series is a revision of http://lkml.org/lkml/2010/6/25/11 .
> > > > > > 
> > > > > > This patch series implements a write function for the 'cgroup.procs'
> > > > > > per-cgroup file, which enables atomic movement of multithreaded
> > > > > > applications between cgroups. Writing the thread-ID of any thread in a
> > > > > > threadgroup to a cgroup's procs file causes all threads in the group to
> > > > > > be moved to that cgroup safely with respect to threads forking/exiting.
> > > > > > (Possible usage scenario: If running a multithreaded build system that
> > > > > > sucks up system resources, this lets you restrict it all at once into a
> > > > > > new cgroup to keep it under control.)
> > > > > > 
> > > > > > Example: Suppose pid 31337 clones new threads 31338 and 31339.
> > > > > > 
> > > > > > # cat /dev/cgroup/tasks
> > > > > > ...
> > > > > > 31337
> > > > > > 31338
> > > > > > 31339
> > > > > > # mkdir /dev/cgroup/foo
> > > > > > # echo 31337 > /dev/cgroup/foo/cgroup.procs
> > > > > > # cat /dev/cgroup/foo/tasks
> > > > > > 31337
> > > > > > 31338
> > > > > > 31339
> > > > > > 
> > > > > > A new lock, called threadgroup_fork_lock and living in signal_struct, is
> > > > > > introduced to ensure atomicity when moving threads between cgroups. It's
> > > > > > taken for writing during the operation, and taking for reading in fork()
> > > > > > around the calls to cgroup_fork() and cgroup_post_fork().
> 
> The above six month old text is the best (and almost the only)
> explanation of the rationale for the entire patch series.  Is
> it still correct and complete?
> 
> 
> Assuming "yes", then...  how do we determine whether the feature is
> sufficiently useful to justify merging and maintaining it?  Will people
> actually use it?
> 
> Was there some particular operational situation which led you to think
> that the kernel should have this capability?  If so, please help us out here
> and lavishly describe it.
> 

Over the past few months, I have seen questions like the following:
==
Q. I think I put qemu to xxxx cgroup but it never works!
A. You need to put all threads in qemu to cgroup.
==
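
With only the 'tasks' file, that answer means something like the
following by hand (12345 standing in for qemu's pid, /dev/cgroup/xxxx
for the target cgroup), and it races with any thread qemu creates in
the meantime:

  # for t in /proc/12345/task/*; do echo ${t##*/} > /dev/cgroup/xxxx/tasks; done

With a writable 'procs' file that whole loop becomes a single write of
the pid.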

The 'tasks' file is not a useful interface for users, I think.
(Even if users tend to use the put-task-before-exec scheme.)


IMHO, from the user's point of view, the 'tasks' file is a mystery.

A TID (thread ID) is one of the secrets of the Linux + pthread library
combination. For example, on RHEL6, users have to call syscall()
directly to use gettid(), and end users may not even know about the
thread IDs hidden underneath pthreads.

IIRC, there is no interface other than /proc/<pid>/tasks which shows
all thread IDs of a process. But it's not atomic.

So, I think it's OK to have a 'procs' interface for cgroups if the
overhead/impact of the patch is not heavy.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8 0/3] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
  2011-02-10  1:02               ` KAMEZAWA Hiroyuki
       [not found]                 ` <20110210100210.adf09c49.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
@ 2011-02-10  1:36                 ` Ben Blum
  2011-02-14  6:12                 ` Paul Menage
  2 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-02-10  1:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Ben Blum, containers, linux-kernel, oleg,
	Miao Xie, David Rientjes, menage, ebiederm

On Thu, Feb 10, 2011 at 10:02:10AM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 9 Feb 2011 15:10:46 -0800
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Mon, 7 Feb 2011 20:35:42 -0500
> > Ben Blum <bblum@andrew.cmu.edu> wrote:
> > 
> > > On Sun, Dec 26, 2010 at 07:09:19AM -0500, Ben Blum wrote:
> > > > On Fri, Dec 24, 2010 at 03:22:26AM -0500, Ben Blum wrote:
> > > > > On Wed, Aug 11, 2010 at 01:46:04AM -0400, Ben Blum wrote:
> > > > > > On Fri, Jul 30, 2010 at 07:56:49PM -0400, Ben Blum wrote:
> > > > > > > This patch series is a revision of http://lkml.org/lkml/2010/6/25/11 .
> > > > > > > 
> > > > > > > This patch series implements a write function for the 'cgroup.procs'
> > > > > > > per-cgroup file, which enables atomic movement of multithreaded
> > > > > > > applications between cgroups. Writing the thread-ID of any thread in a
> > > > > > > threadgroup to a cgroup's procs file causes all threads in the group to
> > > > > > > be moved to that cgroup safely with respect to threads forking/exiting.
> > > > > > > (Possible usage scenario: If running a multithreaded build system that
> > > > > > > sucks up system resources, this lets you restrict it all at once into a
> > > > > > > new cgroup to keep it under control.)
> > > > > > > 
> > > > > > > Example: Suppose pid 31337 clones new threads 31338 and 31339.
> > > > > > > 
> > > > > > > # cat /dev/cgroup/tasks
> > > > > > > ...
> > > > > > > 31337
> > > > > > > 31338
> > > > > > > 31339
> > > > > > > # mkdir /dev/cgroup/foo
> > > > > > > # echo 31337 > /dev/cgroup/foo/cgroup.procs
> > > > > > > # cat /dev/cgroup/foo/tasks
> > > > > > > 31337
> > > > > > > 31338
> > > > > > > 31339
> > > > > > > 
> > > > > > > A new lock, called threadgroup_fork_lock and living in signal_struct, is
> > > > > > > introduced to ensure atomicity when moving threads between cgroups. It's
> > > > > > > taken for writing during the operation, and taking for reading in fork()
> > > > > > > around the calls to cgroup_fork() and cgroup_post_fork().
> > 
> > The above six month old text is the best (and almost the only)
> > explanation of the rationale for the entire patch series.  Is
> > it still correct and complete?

Yep, it's still fresh. (That's why I kept it around!)

> > 
> > 
> > Assuming "yes", then...  how do we determine whether the feature is
> > sufficiently useful to justify merging and maintaining it?  Will people
> > actually use it?
> > 
> > Was there some particular operational situation which led you to think
> > that the kernel should have this capability?  If so, please help us out here
> > and lavishly describe it.
> > 
> 
> In these months, I saw following questions as 
> ==
> Q. I think I put qemu to xxxx cgroup but it never works!
> A. You need to put all threads in qemu to cgroup.
> ==
> 
> 'tasks' file is not useful interface for users, I think.
> (Even if users tend to use put-task-before-exec scheme.)
> 
> 
> IMHO, from user's side of view, 'tasks' file is a mystery.
> 
> TID(thread-ID) is one of secrets in Linux + pthread library. For example,
> on RHEL6, to use gettid(), users has to use syscall() directly. And end-user
> may not know about thread-ID which is hidden under pthreads. 

I think glibc in general is to blame for the fact that you need to
syscall(__NR_gettid)? Regardless - yes, exposing an interface dealing
with task_structs can be less than perfect for a world that deals in
userland applications.

> IIRC, there are no interface other than /proc/<pid>/tasks which shows all
> thread IDs of a process. But it's not atomic.

I tend to use pgrep, which is a bit of a hassle.
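(Concretely, something like

  # ls /proc/$(pgrep -o firefox)/task

with firefox only as an example - neither obvious nor atomic.)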

Also, as in the six-month-old text, many resource-sucking programs
nowadays (web browsers) are multithreaded.

> So, I think it's ok to have 'procs' interface for cgroup if
> overhead/impact of patch is not heavy.
> 
> Thanks,
> -Kame

Thanks for the reasoning. ;)

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v7 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup
  2011-01-24 21:05             ` Andrew Morton
  2011-02-04 21:25               ` Ben Blum
@ 2011-02-14  5:31               ` Paul Menage
       [not found]               ` <20110124130529.903d9832.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  2 siblings, 0 replies; 185+ messages in thread
From: Paul Menage @ 2011-02-14  5:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ben Blum, linux-kernel, containers, ebiederm, lizf, matthltc,
	oleg, David Rientjes, Miao Xie

On Mon, Jan 24, 2011 at 1:05 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> Risky. sched.h doesn't include rwsem.h.
>
> We could make it do so, but almost every compilation unit in the kernel
> includes sched.h.  It would be nicer to make the kernel build
> finer-grained, rather than blunter-grained.  Don't be afraid to add new
> header files if that is one way of doing this!
>

The only header files included by rwsem.h that aren't directly
included in sched.h already are linux/linkage.h and asm/atomic.h.
Since sighand_struct in sched.h has an atomic_t field, sched.h is
clearly including atomic.h somewhere indirectly. And there are mutex
fields in sched.h, which means it's indirectly including
linux/mutex.h, which includes linux/linkage.h. So I think that it's
hard to argue that this change would make the kernel build any more
heavyweight.

Paul

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8 0/3] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
  2011-02-10  1:02               ` KAMEZAWA Hiroyuki
       [not found]                 ` <20110210100210.adf09c49.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
  2011-02-10  1:36                 ` Ben Blum
@ 2011-02-14  6:12                 ` Paul Menage
  2 siblings, 0 replies; 185+ messages in thread
From: Paul Menage @ 2011-02-14  6:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Ben Blum, containers, linux-kernel, oleg,
	Miao Xie, David Rientjes, ebiederm

On Wed, Feb 9, 2011 at 5:02 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> So, I think it's ok to have 'procs' interface for cgroup if
> overhead/impact of patch is not heavy.
>

Agreed - it's definitely an operation that comes up as either
confusing or annoying for users, depending on whether or not they
understand how threads and cgroups interact. (We've been getting
people wanting to do this internally at Google, and I'm guessing that
we're one of the bigger users of cgroups.)

In theory it's something that could be handled in userspace, in one of two ways:

- repeatedly scan the old cgroup's tasks file and sweep any threads
from the given process into the destination cgroup, until you complete
a clean sweep finding none. (Possibly even this is racy if a thread is
being slow to fork)

- use a process event notifier to catch thread fork events and keep
track of any newly created threads that appear after your first sweep
of threads, and be prepared to handle them for some reasonable length
of time (tens of milliseconds?) after the last thread has been
apparently moved.

(The alternative approach, of course, is to give up and never try to
move a process into a cgroup except right when you're in the middle of
forking it, before the exec(), when you know that it has only a single
thread and you're in control of it.)
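
(From a shell wrapper, that's the familiar idiom of, say,

  # echo $$ > /dev/cgroup/foo/tasks && exec some-multithreaded-app

which only helps when you control how the program is launched.)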

These are both painful procedures, compared to the very simple
approach of letting the kernel move the entire process atomically.
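
For concreteness, the first of those sweeps looks roughly like this
(cgroup paths purely illustrative), and it still loses the race against
a sufficiently fast forker:

  pid=$1                                 # the process being herded
  while :; do
          moved=0
          for tid in $(cat /dev/cgroup/old/tasks); do
                  # sweep any thread of $pid still charged to the old cgroup
                  if [ -d /proc/$pid/task/$tid ]; then
                          echo $tid > /dev/cgroup/new/tasks && moved=1
                  fi
          done
          [ $moved -eq 0 ] && break       # clean sweep, call it done
  done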

It's true that it's a pretty heavyweight operation, but that weight is
only paid when you actually use it on a very large process (and doing
that in userspace would be even more expensive). For the rest of the
kernel, it's just an extra read lock in the fork path on a semaphore
in a structure that's pretty much guaranteed to be in cache.

Paul

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8 0/3] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
  2011-02-09 23:10             ` [PATCH v8 0/3] " Andrew Morton
  2011-02-10  1:02               ` KAMEZAWA Hiroyuki
@ 2011-02-14  6:12               ` Paul Menage
       [not found]               ` <20110209151046.89e03dcd.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  2 siblings, 0 replies; 185+ messages in thread
From: Paul Menage @ 2011-02-14  6:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ben Blum, linux-kernel, containers, ebiederm, lizf, matthltc,
	oleg, David Rientjes, Miao Xie

On Wed, Feb 9, 2011 at 3:10 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Mon, 7 Feb 2011 20:35:42 -0500
> Ben Blum <bblum@andrew.cmu.edu> wrote:
>
>> On Sun, Dec 26, 2010 at 07:09:19AM -0500, Ben Blum wrote:
>> > On Fri, Dec 24, 2010 at 03:22:26AM -0500, Ben Blum wrote:
>> > > On Wed, Aug 11, 2010 at 01:46:04AM -0400, Ben Blum wrote:
>> > > > On Fri, Jul 30, 2010 at 07:56:49PM -0400, Ben Blum wrote:
>> > > > > This patch series is a revision of http://lkml.org/lkml/2010/6/25/11 .
>> > > > >
>> > > > > This patch series implements a write function for the 'cgroup.procs'
>> > > > > per-cgroup file, which enables atomic movement of multithreaded
>> > > > > applications between cgroups. Writing the thread-ID of any thread in a
>> > > > > threadgroup to a cgroup's procs file causes all threads in the group to
>> > > > > be moved to that cgroup safely with respect to threads forking/exiting.
>> > > > > (Possible usage scenario: If running a multithreaded build system that
>> > > > > sucks up system resources, this lets you restrict it all at once into a
>> > > > > new cgroup to keep it under control.)
>> > > > >
>
> The above six month old text is the best (and almost the only)
> explanation of the rationale for the entire patch series.  Is
> it still correct and complete?
>

It's still correct, but I'm sure we could come up with a more detailed
justification if necessary.

Paul

^ permalink raw reply	[flat|nested] 185+ messages in thread

* [PATCH v8 4/3] cgroups: use flex_array in attach_proc
       [not found]                 ` <20110208013950.GF31569-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2011-02-16 19:22                   ` Ben Blum
  2011-03-03 18:38                   ` [PATCH v8 3/3] cgroups: make procs file writable Paul Menage
  1 sibling, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-02-16 19:22 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Miao Xie, David Rientjes, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

Convert cgroup_attach_proc to use flex_array.

From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

The cgroup_attach_proc implementation requires a pre-allocated array to store
task pointers to atomically move a thread-group, but asking for a monolithic
array with kmalloc() may be unreliable for very large groups. Using flex_array
provides the same functionality with less risk of failure.

This is a post-patch for cgroup-procs-write.patch.

Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
---
 kernel/cgroup.c |   37 ++++++++++++++++++++++++++++---------
 1 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 58b364a..feba784 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,7 @@
 #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 #include <linux/eventfd.h>
 #include <linux/poll.h>
+#include <linux/flex_array.h> /* used in cgroup_attach_proc */
 
 #include <asm/atomic.h>
 
@@ -1985,7 +1986,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	struct cgroupfs_root *root = cgrp->root;
 	/* threadgroup list cursor and array */
 	struct task_struct *tsk;
-	struct task_struct **group;
+	struct flex_array *group;
 	/*
 	 * we need to make sure we have css_sets for all the tasks we're
 	 * going to move -before- we actually start moving them, so that in
@@ -2002,9 +2003,15 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	 * and if threads exit, this will just be an over-estimate.
 	 */
 	group_size = get_nr_threads(leader);
-	group = kmalloc(group_size * sizeof(*group), GFP_KERNEL);
+	/* flex_array supports very large thread-groups better than kmalloc. */
+	group = flex_array_alloc(sizeof(struct task_struct *), group_size,
+				 GFP_KERNEL);
 	if (!group)
 		return -ENOMEM;
+	/* pre-allocate to guarantee space while iterating in rcu read-side. */
+	retval = flex_array_prealloc(group, 0, group_size - 1, GFP_KERNEL);
+	if (retval)
+		goto out_free_group_list;
 
 	/* prevent changes to the threadgroup list while we take a snapshot. */
 	rcu_read_lock();
@@ -2027,7 +2034,12 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 		/* as per above, nr_threads may decrease, but not increase. */
 		BUG_ON(i >= group_size);
 		get_task_struct(tsk);
-		group[i] = tsk;
+		/*
+		 * saying GFP_ATOMIC has no effect here because we did prealloc
+		 * earlier, but it's good form to communicate our expectations.
+		 */
+		retval = flex_array_put_ptr(group, i, tsk, GFP_ATOMIC);
+		BUG_ON(retval != 0);
 		i++;
 	} while_each_thread(leader, tsk);
 	/* remember the number of threads in the array for later. */
@@ -2050,7 +2062,9 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 		if (ss->can_attach_task) {
 			/* run on each task in the threadgroup. */
 			for (i = 0; i < group_size; i++) {
-				retval = ss->can_attach_task(cgrp, group[i]);
+				tsk = flex_array_get_ptr(group, i);
+				BUG_ON(tsk == NULL);
+				retval = ss->can_attach_task(cgrp, tsk);
 				if (retval) {
 					failed_ss = ss;
 					goto out_cancel_attach;
@@ -2065,7 +2079,8 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	 */
 	INIT_LIST_HEAD(&newcg_list);
 	for (i = 0; i < group_size; i++) {
-		tsk = group[i];
+		tsk = flex_array_get_ptr(group, i);
+		BUG_ON(tsk == NULL);
 		/* nothing to do if this task is already in the cgroup */
 		oldcgrp = task_cgroup_from_root(tsk, root);
 		if (cgrp == oldcgrp)
@@ -2104,7 +2119,8 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 			ss->pre_attach(cgrp);
 	}
 	for (i = 0; i < group_size; i++) {
-		tsk = group[i];
+		tsk = flex_array_get_ptr(group, i);
+		BUG_ON(tsk == NULL);
 		/* leave current thread as it is if it's already there */
 		oldcgrp = task_cgroup_from_root(tsk, root);
 		if (cgrp == oldcgrp)
@@ -2154,10 +2170,13 @@ out_cancel_attach:
 		}
 	}
 	/* clean up the array of referenced threads in the group. */
-	for (i = 0; i < group_size; i++)
-		put_task_struct(group[i]);
+	for (i = 0; i < group_size; i++) {
+		tsk = flex_array_get_ptr(group, i);
+		BUG_ON(tsk == NULL);
+		put_task_struct(tsk);
+	}
 out_free_group_list:
-	kfree(group);
+	flex_array_free(group);
 	return retval;
 }

^ permalink raw reply related	[flat|nested] 185+ messages in thread
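
For readers unfamiliar with the flex_array interface the patch above leans
on, here is a minimal sketch of the alloc/prealloc/put/get/free pattern as
it existed in the 2011-era <linux/flex_array.h>. The helper name
store_tasks() is purely illustrative, and the inclusive-end convention of
the flex_array_prealloc() call mirrors the "group_size - 1" usage in the
patch; this is a sketch, not part of the series.

#include <linux/flex_array.h>
#include <linux/gfp.h>
#include <linux/sched.h>
#include <linux/bug.h>

/* Illustrative only: stash nr task pointers without one big kmalloc(). */
static struct flex_array *store_tasks(struct task_struct **tasks, int nr)
{
        struct flex_array *fa;
        int i, err;

        /* Elements are pointers; the backing pages are allocated below. */
        fa = flex_array_alloc(sizeof(struct task_struct *), nr, GFP_KERNEL);
        if (!fa)
                return NULL;
        /* Pre-fault every part so later puts cannot fail or allocate. */
        err = flex_array_prealloc(fa, 0, nr - 1, GFP_KERNEL);
        if (err) {
                flex_array_free(fa);
                return NULL;
        }
        for (i = 0; i < nr; i++) {
                /* GFP_ATOMIC is moot after prealloc, as the patch notes. */
                err = flex_array_put_ptr(fa, i, tasks[i], GFP_ATOMIC);
                BUG_ON(err);
        }
        return fa;
}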

* [PATCH v8 4/3] cgroups: use flex_array in attach_proc
  2011-02-08  1:39                 ` Ben Blum
  (?)
@ 2011-02-16 19:22                 ` Ben Blum
       [not found]                   ` <20110216192200.GA11980-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2011-03-03 17:48                   ` Paul Menage
  -1 siblings, 2 replies; 185+ messages in thread
From: Ben Blum @ 2011-02-16 19:22 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage,
	oleg, David Rientjes, Miao Xie

Convert cgroup_attach_proc to use flex_array.

From: Ben Blum <bblum@andrew.cmu.edu>

The cgroup_attach_proc implementation requires a pre-allocated array to store
task pointers to atomically move a thread-group, but asking for a monolithic
array with kmalloc() may be unreliable for very large groups. Using flex_array
provides the same functionality with less risk of failure.

This is a post-patch for cgroup-procs-write.patch.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
---
 kernel/cgroup.c |   37 ++++++++++++++++++++++++++++---------
 1 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 58b364a..feba784 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,7 @@
 #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 #include <linux/eventfd.h>
 #include <linux/poll.h>
+#include <linux/flex_array.h> /* used in cgroup_attach_proc */
 
 #include <asm/atomic.h>
 
@@ -1985,7 +1986,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	struct cgroupfs_root *root = cgrp->root;
 	/* threadgroup list cursor and array */
 	struct task_struct *tsk;
-	struct task_struct **group;
+	struct flex_array *group;
 	/*
 	 * we need to make sure we have css_sets for all the tasks we're
 	 * going to move -before- we actually start moving them, so that in
@@ -2002,9 +2003,15 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	 * and if threads exit, this will just be an over-estimate.
 	 */
 	group_size = get_nr_threads(leader);
-	group = kmalloc(group_size * sizeof(*group), GFP_KERNEL);
+	/* flex_array supports very large thread-groups better than kmalloc. */
+	group = flex_array_alloc(sizeof(struct task_struct *), group_size,
+				 GFP_KERNEL);
 	if (!group)
 		return -ENOMEM;
+	/* pre-allocate to guarantee space while iterating in rcu read-side. */
+	retval = flex_array_prealloc(group, 0, group_size - 1, GFP_KERNEL);
+	if (retval)
+		goto out_free_group_list;
 
 	/* prevent changes to the threadgroup list while we take a snapshot. */
 	rcu_read_lock();
@@ -2027,7 +2034,12 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 		/* as per above, nr_threads may decrease, but not increase. */
 		BUG_ON(i >= group_size);
 		get_task_struct(tsk);
-		group[i] = tsk;
+		/*
+		 * saying GFP_ATOMIC has no effect here because we did prealloc
+		 * earlier, but it's good form to communicate our expectations.
+		 */
+		retval = flex_array_put_ptr(group, i, tsk, GFP_ATOMIC);
+		BUG_ON(retval != 0);
 		i++;
 	} while_each_thread(leader, tsk);
 	/* remember the number of threads in the array for later. */
@@ -2050,7 +2062,9 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 		if (ss->can_attach_task) {
 			/* run on each task in the threadgroup. */
 			for (i = 0; i < group_size; i++) {
-				retval = ss->can_attach_task(cgrp, group[i]);
+				tsk = flex_array_get_ptr(group, i);
+				BUG_ON(tsk == NULL);
+				retval = ss->can_attach_task(cgrp, tsk);
 				if (retval) {
 					failed_ss = ss;
 					goto out_cancel_attach;
@@ -2065,7 +2079,8 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	 */
 	INIT_LIST_HEAD(&newcg_list);
 	for (i = 0; i < group_size; i++) {
-		tsk = group[i];
+		tsk = flex_array_get_ptr(group, i);
+		BUG_ON(tsk == NULL);
 		/* nothing to do if this task is already in the cgroup */
 		oldcgrp = task_cgroup_from_root(tsk, root);
 		if (cgrp == oldcgrp)
@@ -2104,7 +2119,8 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 			ss->pre_attach(cgrp);
 	}
 	for (i = 0; i < group_size; i++) {
-		tsk = group[i];
+		tsk = flex_array_get_ptr(group, i);
+		BUG_ON(tsk == NULL);
 		/* leave current thread as it is if it's already there */
 		oldcgrp = task_cgroup_from_root(tsk, root);
 		if (cgrp == oldcgrp)
@@ -2154,10 +2170,13 @@ out_cancel_attach:
 		}
 	}
 	/* clean up the array of referenced threads in the group. */
-	for (i = 0; i < group_size; i++)
-		put_task_struct(group[i]);
+	for (i = 0; i < group_size; i++) {
+		tsk = flex_array_get_ptr(group, i);
+		BUG_ON(tsk == NULL);
+		put_task_struct(tsk);
+	}
 out_free_group_list:
-	kfree(group);
+	flex_array_free(group);
 	return retval;
 }
 

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* Re: [PATCH v8 4/3] cgroups: use flex_array in attach_proc
       [not found]                   ` <20110216192200.GA11980-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2011-03-03 17:48                     ` Paul Menage
  0 siblings, 0 replies; 185+ messages in thread
From: Paul Menage @ 2011-03-03 17:48 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Miao Xie, David Rientjes, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Wed, Feb 16, 2011 at 11:22 AM, Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> Convert cgroup_attach_proc to use flex_array.
>
> From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
>
> The cgroup_attach_proc implementation requires a pre-allocated array to store
> task pointers to atomically move a thread-group, but asking for a monolithic
> array with kmalloc() may be unreliable for very large groups. Using flex_array
> provides the same functionality with less risk of failure.
>
> This is a post-patch for cgroup-procs-write.patch.
>
> Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

Reviewed-by: Paul Menage <menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Looks fine from a correctness point of view, but I'd be inclined to
reduce the verbosity - rather than

tsk = flex_array_get_ptr(group, i);
BUG_ON(tsk == NULL);
retval = ss->can_attach_task(cgrp, tsk);

I'd just have

retval = ss->can_attach_task(cgrp, flex_array_get_ptr(group, i));

I don't think you need to be so defensive about flex_array's behaviour.

Paul

^ permalink raw reply	[flat|nested] 185+ messages in thread
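
For illustration, if that suggestion were applied throughout
cgroup_attach_proc, each access site collapses to a single call; e.g. the
cleanup loop at the end of the patch might then read as follows (a sketch
only, trusting flex_array_get_ptr() to hand back whatever was stored after
the prealloc):

        /* clean up the array of referenced threads in the group. */
        for (i = 0; i < group_size; i++)
                put_task_struct(flex_array_get_ptr(group, i));
out_free_group_list:
        flex_array_free(group);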

* Re: [PATCH v8 4/3] cgroups: use flex_array in attach_proc
  2011-02-16 19:22                 ` [PATCH v8 4/3] cgroups: use flex_array in attach_proc Ben Blum
       [not found]                   ` <20110216192200.GA11980-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2011-03-03 17:48                   ` Paul Menage
  2011-03-22  5:15                     ` Ben Blum
       [not found]                     ` <AANLkTinKTqBnjLKkv93UxyWoPL-2vyXP=LUvRz8JTC2K-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 2 replies; 185+ messages in thread
From: Paul Menage @ 2011-03-03 17:48 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, oleg,
	David Rientjes, Miao Xie

On Wed, Feb 16, 2011 at 11:22 AM, Ben Blum <bblum@andrew.cmu.edu> wrote:
> Convert cgroup_attach_proc to use flex_array.
>
> From: Ben Blum <bblum@andrew.cmu.edu>
>
> The cgroup_attach_proc implementation requires a pre-allocated array to store
> task pointers to atomically move a thread-group, but asking for a monolithic
> array with kmalloc() may be unreliable for very large groups. Using flex_array
> provides the same functionality with less risk of failure.
>
> This is a post-patch for cgroup-procs-write.patch.
>
> Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>

Reviewed-by: Paul Menage <menage@google.com>

Looks fine from a correctness point of view, but I'd be inclined to
reduce the verbosity - rather than

tsk = flex_array_get_ptr(group, i);
BUG_ON(tsk == NULL);
retval = ss->can_attach_task(cgrp, tsk);

I'd just have

retval = ss->can_attach_task(cgrp, flex_array_get_ptr(group, i));

I don't think you need to be so defensive about flex_array's behaviour.

Paul

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup
       [not found]               ` <20110208013741.GD31569-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2011-03-03 17:54                 ` Paul Menage
  0 siblings, 0 replies; 185+ messages in thread
From: Paul Menage @ 2011-03-03 17:54 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Miao Xie, David Rientjes, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Mon, Feb 7, 2011 at 5:37 PM, Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup
>
> From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
>
> This patch adds an rwsem that lives in a threadgroup's signal_struct that's
> taken for reading in the fork path, under CONFIG_CGROUPS. If another part of
> the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
> ifdefs should be changed to a higher-up flag that CGROUPS and the other system
> would both depend on.
>
> This is a pre-patch for cgroup-procs-write.patch.
>
> Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

Reviewed-by: Paul Menage <menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

AFAICS, the only change from the previous version of this patch is the
addition of including linux/rwsem.h in sched.h, so I think it's fair
to assume my previous Reviewed-by: tag still holds.

(Incidentally, does anyone have any handy tools for tracking diffs
between things you've previously tagged as Acked or Reviewed-by, and
newer versions?)

Paul

> ---
>  include/linux/init_task.h |    9 +++++++++
>  include/linux/sched.h     |   37 +++++++++++++++++++++++++++++++++++++
>  kernel/fork.c             |   10 ++++++++++
>  3 files changed, 56 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index 6b281fa..b560381 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -15,6 +15,14 @@
>  extern struct files_struct init_files;
>  extern struct fs_struct init_fs;
>
> +#ifdef CONFIG_CGROUPS
> +#define INIT_THREADGROUP_FORK_LOCK(sig)                                        \
> +       .threadgroup_fork_lock =                                        \
> +               __RWSEM_INITIALIZER(sig.threadgroup_fork_lock),
> +#else
> +#define INIT_THREADGROUP_FORK_LOCK(sig)
> +#endif
> +
>  #define INIT_SIGNALS(sig) {                                            \
>        .nr_threads     = 1,                                            \
>        .wait_chldexit  = __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
> @@ -31,6 +39,7 @@ extern struct fs_struct init_fs;
>        },                                                              \
>        .cred_guard_mutex =                                             \
>                 __MUTEX_INITIALIZER(sig.cred_guard_mutex),             \
> +       INIT_THREADGROUP_FORK_LOCK(sig)                                 \
>  }
>
>  extern struct nsproxy init_nsproxy;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8580dc6..2fdbeb1 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -509,6 +509,8 @@ struct thread_group_cputimer {
>        spinlock_t lock;
>  };
>
> +#include <linux/rwsem.h>
> +
>  /*
>  * NOTE! "signal_struct" does not have it's own
>  * locking, because a shared signal_struct always
> @@ -623,6 +625,16 @@ struct signal_struct {
>        unsigned audit_tty;
>        struct tty_audit_buf *tty_audit_buf;
>  #endif
> +#ifdef CONFIG_CGROUPS
> +       /*
> +        * The threadgroup_fork_lock prevents threads from forking with
> +        * CLONE_THREAD while held for writing. Use this for fork-sensitive
> +        * threadgroup-wide operations. It's taken for reading in fork.c in
> +        * copy_process().
> +        * Currently only needed write-side by cgroups.
> +        */
> +       struct rw_semaphore threadgroup_fork_lock;
> +#endif
>
>        int oom_adj;            /* OOM kill score adjustment (bit shift) */
>        int oom_score_adj;      /* OOM kill score adjustment */
> @@ -2270,6 +2282,31 @@ static inline void unlock_task_sighand(struct task_struct *tsk,
>        spin_unlock_irqrestore(&tsk->sighand->siglock, *flags);
>  }
>
> +/* See the declaration of threadgroup_fork_lock in signal_struct. */
> +#ifdef CONFIG_CGROUPS
> +static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
> +{
> +       down_read(&tsk->signal->threadgroup_fork_lock);
> +}
> +static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
> +{
> +       up_read(&tsk->signal->threadgroup_fork_lock);
> +}
> +static inline void threadgroup_fork_write_lock(struct task_struct *tsk)
> +{
> +       down_write(&tsk->signal->threadgroup_fork_lock);
> +}
> +static inline void threadgroup_fork_write_unlock(struct task_struct *tsk)
> +{
> +       up_write(&tsk->signal->threadgroup_fork_lock);
> +}
> +#else
> +static inline void threadgroup_fork_read_lock(struct task_struct *tsk) {}
> +static inline void threadgroup_fork_read_unlock(struct task_struct *tsk) {}
> +static inline void threadgroup_fork_write_lock(struct task_struct *tsk) {}
> +static inline void threadgroup_fork_write_unlock(struct task_struct *tsk) {}
> +#endif
> +
>  #ifndef __HAVE_THREAD_FUNCTIONS
>
>  #define task_thread_info(task) ((struct thread_info *)(task)->stack)
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0979527..aefe61f 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -905,6 +905,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
>
>        tty_audit_fork(sig);
>
> +#ifdef CONFIG_CGROUPS
> +       init_rwsem(&sig->threadgroup_fork_lock);
> +#endif
> +
>        sig->oom_adj = current->signal->oom_adj;
>        sig->oom_score_adj = current->signal->oom_score_adj;
>        sig->oom_score_adj_min = current->signal->oom_score_adj_min;
> @@ -1087,6 +1091,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>        monotonic_to_bootbased(&p->real_start_time);
>        p->io_context = NULL;
>        p->audit_context = NULL;
> +       if (clone_flags & CLONE_THREAD)
> +               threadgroup_fork_read_lock(current);
>        cgroup_fork(p);
>  #ifdef CONFIG_NUMA
>        p->mempolicy = mpol_dup(p->mempolicy);
> @@ -1294,6 +1300,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>        write_unlock_irq(&tasklist_lock);
>        proc_fork_connector(p);
>        cgroup_post_fork(p);
> +       if (clone_flags & CLONE_THREAD)
> +               threadgroup_fork_read_unlock(current);
>        perf_event_fork(p);
>        return p;
>
> @@ -1332,6 +1340,8 @@ bad_fork_cleanup_policy:
>        mpol_put(p->mempolicy);
>  bad_fork_cleanup_cgroup:
>  #endif
> +       if (clone_flags & CLONE_THREAD)
> +               threadgroup_fork_read_unlock(current);
>        cgroup_exit(p, cgroup_callbacks_done);
>        delayacct_tsk_free(p);
>        module_put(task_thread_info(p)->exec_domain->module);
>

^ permalink raw reply	[flat|nested] 185+ messages in thread
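
The write-side pairing for these helpers lives in the cgroup.procs writer
(patch 3/3, not quoted in this message). Roughly, and purely as a sketch of
how the interface above is meant to be used (variable names illustrative),
the attach path brackets the whole threadgroup move like so:

        /* Exclude CLONE_THREAD forks for the duration of the group move. */
        threadgroup_fork_write_lock(leader);
        retval = cgroup_attach_proc(cgrp, leader);
        threadgroup_fork_write_unlock(leader);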

* Re: [PATCH v8 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup
  2011-02-08  1:37             ` [PATCH v8 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup Ben Blum
@ 2011-03-03 17:54               ` Paul Menage
       [not found]               ` <20110208013741.GD31569-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  1 sibling, 0 replies; 185+ messages in thread
From: Paul Menage @ 2011-03-03 17:54 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, oleg,
	David Rientjes, Miao Xie

On Mon, Feb 7, 2011 at 5:37 PM, Ben Blum <bblum@andrew.cmu.edu> wrote:
> Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup
>
> From: Ben Blum <bblum@andrew.cmu.edu>
>
> This patch adds an rwsem that lives in a threadgroup's signal_struct that's
> taken for reading in the fork path, under CONFIG_CGROUPS. If another part of
> the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
> ifdefs should be changed to a higher-up flag that CGROUPS and the other system
> would both depend on.
>
> This is a pre-patch for cgroup-procs-write.patch.
>
> Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>

Reviewed-by: Paul Menage <menage@google.com>

AFAICS, the only change from the previous version of this patch is the
addition of including linux/rwsem.h in sched.h, so I think it's fair
to assume my previous Reviewed-by: tag still holds.

(Incidentally, does anyone have any handy tools for tracking diffs
between things you've previously tagged as Acked or Reviewed-by, and
newer versions?)

Paul

> ---
>  include/linux/init_task.h |    9 +++++++++
>  include/linux/sched.h     |   37 +++++++++++++++++++++++++++++++++++++
>  kernel/fork.c             |   10 ++++++++++
>  3 files changed, 56 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index 6b281fa..b560381 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -15,6 +15,14 @@
>  extern struct files_struct init_files;
>  extern struct fs_struct init_fs;
>
> +#ifdef CONFIG_CGROUPS
> +#define INIT_THREADGROUP_FORK_LOCK(sig)                                        \
> +       .threadgroup_fork_lock =                                        \
> +               __RWSEM_INITIALIZER(sig.threadgroup_fork_lock),
> +#else
> +#define INIT_THREADGROUP_FORK_LOCK(sig)
> +#endif
> +
>  #define INIT_SIGNALS(sig) {                                            \
>        .nr_threads     = 1,                                            \
>        .wait_chldexit  = __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
> @@ -31,6 +39,7 @@ extern struct fs_struct init_fs;
>        },                                                              \
>        .cred_guard_mutex =                                             \
>                 __MUTEX_INITIALIZER(sig.cred_guard_mutex),             \
> +       INIT_THREADGROUP_FORK_LOCK(sig)                                 \
>  }
>
>  extern struct nsproxy init_nsproxy;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8580dc6..2fdbeb1 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -509,6 +509,8 @@ struct thread_group_cputimer {
>        spinlock_t lock;
>  };
>
> +#include <linux/rwsem.h>
> +
>  /*
>  * NOTE! "signal_struct" does not have it's own
>  * locking, because a shared signal_struct always
> @@ -623,6 +625,16 @@ struct signal_struct {
>        unsigned audit_tty;
>        struct tty_audit_buf *tty_audit_buf;
>  #endif
> +#ifdef CONFIG_CGROUPS
> +       /*
> +        * The threadgroup_fork_lock prevents threads from forking with
> +        * CLONE_THREAD while held for writing. Use this for fork-sensitive
> +        * threadgroup-wide operations. It's taken for reading in fork.c in
> +        * copy_process().
> +        * Currently only needed write-side by cgroups.
> +        */
> +       struct rw_semaphore threadgroup_fork_lock;
> +#endif
>
>        int oom_adj;            /* OOM kill score adjustment (bit shift) */
>        int oom_score_adj;      /* OOM kill score adjustment */
> @@ -2270,6 +2282,31 @@ static inline void unlock_task_sighand(struct task_struct *tsk,
>        spin_unlock_irqrestore(&tsk->sighand->siglock, *flags);
>  }
>
> +/* See the declaration of threadgroup_fork_lock in signal_struct. */
> +#ifdef CONFIG_CGROUPS
> +static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
> +{
> +       down_read(&tsk->signal->threadgroup_fork_lock);
> +}
> +static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
> +{
> +       up_read(&tsk->signal->threadgroup_fork_lock);
> +}
> +static inline void threadgroup_fork_write_lock(struct task_struct *tsk)
> +{
> +       down_write(&tsk->signal->threadgroup_fork_lock);
> +}
> +static inline void threadgroup_fork_write_unlock(struct task_struct *tsk)
> +{
> +       up_write(&tsk->signal->threadgroup_fork_lock);
> +}
> +#else
> +static inline void threadgroup_fork_read_lock(struct task_struct *tsk) {}
> +static inline void threadgroup_fork_read_unlock(struct task_struct *tsk) {}
> +static inline void threadgroup_fork_write_lock(struct task_struct *tsk) {}
> +static inline void threadgroup_fork_write_unlock(struct task_struct *tsk) {}
> +#endif
> +
>  #ifndef __HAVE_THREAD_FUNCTIONS
>
>  #define task_thread_info(task) ((struct thread_info *)(task)->stack)
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0979527..aefe61f 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -905,6 +905,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
>
>        tty_audit_fork(sig);
>
> +#ifdef CONFIG_CGROUPS
> +       init_rwsem(&sig->threadgroup_fork_lock);
> +#endif
> +
>        sig->oom_adj = current->signal->oom_adj;
>        sig->oom_score_adj = current->signal->oom_score_adj;
>        sig->oom_score_adj_min = current->signal->oom_score_adj_min;
> @@ -1087,6 +1091,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>        monotonic_to_bootbased(&p->real_start_time);
>        p->io_context = NULL;
>        p->audit_context = NULL;
> +       if (clone_flags & CLONE_THREAD)
> +               threadgroup_fork_read_lock(current);
>        cgroup_fork(p);
>  #ifdef CONFIG_NUMA
>        p->mempolicy = mpol_dup(p->mempolicy);
> @@ -1294,6 +1300,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>        write_unlock_irq(&tasklist_lock);
>        proc_fork_connector(p);
>        cgroup_post_fork(p);
> +       if (clone_flags & CLONE_THREAD)
> +               threadgroup_fork_read_unlock(current);
>        perf_event_fork(p);
>        return p;
>
> @@ -1332,6 +1340,8 @@ bad_fork_cleanup_policy:
>        mpol_put(p->mempolicy);
>  bad_fork_cleanup_cgroup:
>  #endif
> +       if (clone_flags & CLONE_THREAD)
> +               threadgroup_fork_read_unlock(current);
>        cgroup_exit(p, cgroup_callbacks_done);
>        delayacct_tsk_free(p);
>        module_put(task_thread_info(p)->exec_domain->module);
>

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8 2/3] cgroups: add per-thread subsystem callbacks
       [not found]                 ` <20110208013915.GE31569-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2011-03-03 17:59                   ` Paul Menage
  0 siblings, 0 replies; 185+ messages in thread
From: Paul Menage @ 2011-03-03 17:59 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Miao Xie, David Rientjes, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Mon, Feb 7, 2011 at 5:39 PM, Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> Add cgroup subsystem callbacks for per-thread attachment
>
> From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
>
> This patch adds can_attach_task, pre_attach, and attach_task as new callbacks
> for cgroups' subsystem interface. Unlike can_attach and attach, these are for
> per-thread operations, to be called potentially many times when attaching an
> entire threadgroup.
>
> Also, the old "bool threadgroup" interface is removed, as replaced by this.
> All subsystems are modified for the new interface - of note is cpuset, which
> requires from/to nodemasks for attach to be globally scoped (though per-cpuset
> would work too) to persist from its pre_attach to attach_task and attach.
>
> This is a pre-patch for cgroup-procs-writable.patch.
>
> Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

Reviewed-by: Paul Menage <menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Paul

> ---
>  Documentation/cgroups/cgroups.txt |   30 ++++++++---
>  block/blk-cgroup.c                |   18 ++----
>  include/linux/cgroup.h            |   10 ++--
>  kernel/cgroup.c                   |   17 +++++-
>  kernel/cgroup_freezer.c           |   26 ++++-----
>  kernel/cpuset.c                   |  105 ++++++++++++++++---------------------
>  kernel/ns_cgroup.c                |   23 +++-----
>  kernel/sched.c                    |   38 +------------
>  mm/memcontrol.c                   |   18 ++----
>  security/device_cgroup.c          |    3 -
>  10 files changed, 122 insertions(+), 166 deletions(-)
>
> diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
> index 190018b..d3c9a24 100644
> --- a/Documentation/cgroups/cgroups.txt
> +++ b/Documentation/cgroups/cgroups.txt
> @@ -563,7 +563,7 @@ rmdir() will fail with it. From this behavior, pre_destroy() can be
>  called multiple times against a cgroup.
>
>  int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
> -              struct task_struct *task, bool threadgroup)
> +              struct task_struct *task)
>  (cgroup_mutex held by caller)
>
>  Called prior to moving a task into a cgroup; if the subsystem
> @@ -572,9 +572,14 @@ task is passed, then a successful result indicates that *any*
>  unspecified task can be moved into the cgroup. Note that this isn't
>  called on a fork. If this method returns 0 (success) then this should
>  remain valid while the caller holds cgroup_mutex and it is ensured that either
> -attach() or cancel_attach() will be called in future. If threadgroup is
> -true, then a successful result indicates that all threads in the given
> -thread's threadgroup can be moved together.
> +attach() or cancel_attach() will be called in future.
> +
> +int can_attach_task(struct cgroup *cgrp, struct task_struct *tsk);
> +(cgroup_mutex held by caller)
> +
> +As can_attach, but for operations that must be run once per task to be
> +attached (possibly many when using cgroup_attach_proc). Called after
> +can_attach.
>
>  void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
>               struct task_struct *task, bool threadgroup)
> @@ -586,15 +591,24 @@ function, so that the subsystem can implement a rollback. If not, not necessary.
>  This will be called only about subsystems whose can_attach() operation have
>  succeeded.
>
> +void pre_attach(struct cgroup *cgrp);
> +(cgroup_mutex held by caller)
> +
> +For any non-per-thread attachment work that needs to happen before
> +attach_task. Needed by cpuset.
> +
>  void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
> -           struct cgroup *old_cgrp, struct task_struct *task,
> -           bool threadgroup)
> +           struct cgroup *old_cgrp, struct task_struct *task)
>  (cgroup_mutex held by caller)
>
>  Called after the task has been attached to the cgroup, to allow any
>  post-attachment activity that requires memory allocations or blocking.
> -If threadgroup is true, the subsystem should take care of all threads
> -in the specified thread's threadgroup. Currently does not support any
> +
> +void attach_task(struct cgroup *cgrp, struct task_struct *tsk);
> +(cgroup_mutex held by caller)
> +
> +As attach, but for operations that must be run once per task to be attached,
> +like can_attach_task. Called before attach. Currently does not support any
>  subsystem that might need the old_cgrp for every thread in the group.
>
>  void fork(struct cgroup_subsy *ss, struct task_struct *task)
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index b1febd0..45b3809 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -30,10 +30,8 @@ EXPORT_SYMBOL_GPL(blkio_root_cgroup);
>
>  static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
>                                                  struct cgroup *);
> -static int blkiocg_can_attach(struct cgroup_subsys *, struct cgroup *,
> -                             struct task_struct *, bool);
> -static void blkiocg_attach(struct cgroup_subsys *, struct cgroup *,
> -                          struct cgroup *, struct task_struct *, bool);
> +static int blkiocg_can_attach_task(struct cgroup *, struct task_struct *);
> +static void blkiocg_attach_task(struct cgroup *, struct task_struct *);
>  static void blkiocg_destroy(struct cgroup_subsys *, struct cgroup *);
>  static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
>
> @@ -46,8 +44,8 @@ static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
>  struct cgroup_subsys blkio_subsys = {
>        .name = "blkio",
>        .create = blkiocg_create,
> -       .can_attach = blkiocg_can_attach,
> -       .attach = blkiocg_attach,
> +       .can_attach_task = blkiocg_can_attach_task,
> +       .attach_task = blkiocg_attach_task,
>        .destroy = blkiocg_destroy,
>        .populate = blkiocg_populate,
>  #ifdef CONFIG_BLK_CGROUP
> @@ -1475,9 +1473,7 @@ done:
>  * of the main cic data structures.  For now we allow a task to change
>  * its cgroup only if it's the only owner of its ioc.
>  */
> -static int blkiocg_can_attach(struct cgroup_subsys *subsys,
> -                               struct cgroup *cgroup, struct task_struct *tsk,
> -                               bool threadgroup)
> +static int blkiocg_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  {
>        struct io_context *ioc;
>        int ret = 0;
> @@ -1492,9 +1488,7 @@ static int blkiocg_can_attach(struct cgroup_subsys *subsys,
>        return ret;
>  }
>
> -static void blkiocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
> -                               struct cgroup *prev, struct task_struct *tsk,
> -                               bool threadgroup)
> +static void blkiocg_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  {
>        struct io_context *ioc;
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index ce104e3..35b69b4 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -467,12 +467,14 @@ struct cgroup_subsys {
>        int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>        void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>        int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
> -                         struct task_struct *tsk, bool threadgroup);
> +                         struct task_struct *tsk);
> +       int (*can_attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
>        void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
> -                         struct task_struct *tsk, bool threadgroup);
> +                             struct task_struct *tsk);
> +       void (*pre_attach)(struct cgroup *cgrp);
> +       void (*attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
>        void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
> -                       struct cgroup *old_cgrp, struct task_struct *tsk,
> -                       bool threadgroup);
> +                      struct cgroup *old_cgrp, struct task_struct *tsk);
>        void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
>        void (*exit)(struct cgroup_subsys *ss, struct task_struct *task);
>        int (*populate)(struct cgroup_subsys *ss,
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 66a416b..616f27a 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1750,7 +1750,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>
>        for_each_subsys(root, ss) {
>                if (ss->can_attach) {
> -                       retval = ss->can_attach(ss, cgrp, tsk, false);
> +                       retval = ss->can_attach(ss, cgrp, tsk);
>                        if (retval) {
>                                /*
>                                 * Remember on which subsystem the can_attach()
> @@ -1762,6 +1762,13 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>                                goto out;
>                        }
>                }
> +               if (ss->can_attach_task) {
> +                       retval = ss->can_attach_task(cgrp, tsk);
> +                       if (retval) {
> +                               failed_ss = ss;
> +                               goto out;
> +                       }
> +               }
>        }
>
>        task_lock(tsk);
> @@ -1798,8 +1805,12 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>        write_unlock(&css_set_lock);
>
>        for_each_subsys(root, ss) {
> +               if (ss->pre_attach)
> +                       ss->pre_attach(cgrp);
> +               if (ss->attach_task)
> +                       ss->attach_task(cgrp, tsk);
>                if (ss->attach)
> -                       ss->attach(ss, cgrp, oldcgrp, tsk, false);
> +                       ss->attach(ss, cgrp, oldcgrp, tsk);
>        }
>        set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
>        synchronize_rcu();
> @@ -1822,7 +1833,7 @@ out:
>                                 */
>                                break;
>                        if (ss->cancel_attach)
> -                               ss->cancel_attach(ss, cgrp, tsk, false);
> +                               ss->cancel_attach(ss, cgrp, tsk);
>                }
>        }
>        return retval;
> diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
> index e7bebb7..e691818 100644
> --- a/kernel/cgroup_freezer.c
> +++ b/kernel/cgroup_freezer.c
> @@ -160,7 +160,7 @@ static void freezer_destroy(struct cgroup_subsys *ss,
>  */
>  static int freezer_can_attach(struct cgroup_subsys *ss,
>                              struct cgroup *new_cgroup,
> -                             struct task_struct *task, bool threadgroup)
> +                             struct task_struct *task)
>  {
>        struct freezer *freezer;
>
> @@ -172,26 +172,17 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
>        if (freezer->state != CGROUP_THAWED)
>                return -EBUSY;
>
> +       return 0;
> +}
> +
> +static int freezer_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
> +{
>        rcu_read_lock();
> -       if (__cgroup_freezing_or_frozen(task)) {
> +       if (__cgroup_freezing_or_frozen(tsk)) {
>                rcu_read_unlock();
>                return -EBUSY;
>        }
>        rcu_read_unlock();
> -
> -       if (threadgroup) {
> -               struct task_struct *c;
> -
> -               rcu_read_lock();
> -               list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
> -                       if (__cgroup_freezing_or_frozen(c)) {
> -                               rcu_read_unlock();
> -                               return -EBUSY;
> -                       }
> -               }
> -               rcu_read_unlock();
> -       }
> -
>        return 0;
>  }
>
> @@ -390,6 +381,9 @@ struct cgroup_subsys freezer_subsys = {
>        .populate       = freezer_populate,
>        .subsys_id      = freezer_subsys_id,
>        .can_attach     = freezer_can_attach,
> +       .can_attach_task = freezer_can_attach_task,
> +       .pre_attach     = NULL,
> +       .attach_task    = NULL,
>        .attach         = NULL,
>        .fork           = freezer_fork,
>        .exit           = NULL,
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index 4349935..5f71ca2 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -1372,14 +1372,10 @@ static int fmeter_getrate(struct fmeter *fmp)
>        return val;
>  }
>
> -/* Protected by cgroup_lock */
> -static cpumask_var_t cpus_attach;
> -
>  /* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
>  static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
> -                            struct task_struct *tsk, bool threadgroup)
> +                            struct task_struct *tsk)
>  {
> -       int ret;
>        struct cpuset *cs = cgroup_cs(cont);
>
>        if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
> @@ -1396,29 +1392,42 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
>        if (tsk->flags & PF_THREAD_BOUND)
>                return -EINVAL;
>
> -       ret = security_task_setscheduler(tsk);
> -       if (ret)
> -               return ret;
> -       if (threadgroup) {
> -               struct task_struct *c;
> -
> -               rcu_read_lock();
> -               list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
> -                       ret = security_task_setscheduler(c);
> -                       if (ret) {
> -                               rcu_read_unlock();
> -                               return ret;
> -                       }
> -               }
> -               rcu_read_unlock();
> -       }
>        return 0;
>  }
>
> -static void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
> -                              struct cpuset *cs)
> +static int cpuset_can_attach_task(struct cgroup *cgrp, struct task_struct *task)
> +{
> +       return security_task_setscheduler(task);
> +}
> +
> +/*
> + * Protected by cgroup_lock. The nodemasks must be stored globally because
> + * dynamically allocating them is not allowed in pre_attach, and they must
> + * persist among pre_attach, attach_task, and attach.
> + */
> +static cpumask_var_t cpus_attach;
> +static nodemask_t cpuset_attach_nodemask_from;
> +static nodemask_t cpuset_attach_nodemask_to;
> +
> +/* Set-up work for before attaching each task. */
> +static void cpuset_pre_attach(struct cgroup *cont)
> +{
> +       struct cpuset *cs = cgroup_cs(cont);
> +
> +       if (cs == &top_cpuset)
> +               cpumask_copy(cpus_attach, cpu_possible_mask);
> +       else
> +               guarantee_online_cpus(cs, cpus_attach);
> +
> +       guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
> +}
> +
> +/* Per-thread attachment work. */
> +static void cpuset_attach_task(struct cgroup *cont, struct task_struct *tsk)
>  {
>        int err;
> +       struct cpuset *cs = cgroup_cs(cont);
> +
>        /*
>         * can_attach beforehand should guarantee that this doesn't fail.
>         * TODO: have a better way to handle failure here
> @@ -1426,56 +1435,31 @@ static void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
>        err = set_cpus_allowed_ptr(tsk, cpus_attach);
>        WARN_ON_ONCE(err);
>
> -       cpuset_change_task_nodemask(tsk, to);
> +       cpuset_change_task_nodemask(tsk, &cpuset_attach_nodemask_to);
>        cpuset_update_task_spread_flag(cs, tsk);
> -
>  }
>
>  static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
> -                         struct cgroup *oldcont, struct task_struct *tsk,
> -                         bool threadgroup)
> +                         struct cgroup *oldcont, struct task_struct *tsk)
>  {
>        struct mm_struct *mm;
>        struct cpuset *cs = cgroup_cs(cont);
>        struct cpuset *oldcs = cgroup_cs(oldcont);
> -       NODEMASK_ALLOC(nodemask_t, from, GFP_KERNEL);
> -       NODEMASK_ALLOC(nodemask_t, to, GFP_KERNEL);
> -
> -       if (from == NULL || to == NULL)
> -               goto alloc_fail;
>
> -       if (cs == &top_cpuset) {
> -               cpumask_copy(cpus_attach, cpu_possible_mask);
> -       } else {
> -               guarantee_online_cpus(cs, cpus_attach);
> -       }
> -       guarantee_online_mems(cs, to);
> -
> -       /* do per-task migration stuff possibly for each in the threadgroup */
> -       cpuset_attach_task(tsk, to, cs);
> -       if (threadgroup) {
> -               struct task_struct *c;
> -               rcu_read_lock();
> -               list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
> -                       cpuset_attach_task(c, to, cs);
> -               }
> -               rcu_read_unlock();
> -       }
> -
> -       /* change mm; only needs to be done once even if threadgroup */
> -       *from = oldcs->mems_allowed;
> -       *to = cs->mems_allowed;
> +       /*
> +        * Change mm, possibly for multiple threads in a threadgroup. This is
> +        * expensive and may sleep.
> +        */
> +       cpuset_attach_nodemask_from = oldcs->mems_allowed;
> +       cpuset_attach_nodemask_to = cs->mems_allowed;
>        mm = get_task_mm(tsk);
>        if (mm) {
> -               mpol_rebind_mm(mm, to);
> +               mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
>                if (is_memory_migrate(cs))
> -                       cpuset_migrate_mm(mm, from, to);
> +                       cpuset_migrate_mm(mm, &cpuset_attach_nodemask_from,
> +                                         &cpuset_attach_nodemask_to);
>                mmput(mm);
>        }
> -
> -alloc_fail:
> -       NODEMASK_FREE(from);
> -       NODEMASK_FREE(to);
>  }
>
>  /* The various types of files and directories in a cpuset file system */
> @@ -1928,6 +1912,9 @@ struct cgroup_subsys cpuset_subsys = {
>        .create = cpuset_create,
>        .destroy = cpuset_destroy,
>        .can_attach = cpuset_can_attach,
> +       .can_attach_task = cpuset_can_attach_task,
> +       .pre_attach = cpuset_pre_attach,
> +       .attach_task = cpuset_attach_task,
>        .attach = cpuset_attach,
>        .populate = cpuset_populate,
>        .post_clone = cpuset_post_clone,
> diff --git a/kernel/ns_cgroup.c b/kernel/ns_cgroup.c
> index 2c98ad9..1fc2b1b 100644
> --- a/kernel/ns_cgroup.c
> +++ b/kernel/ns_cgroup.c
> @@ -43,7 +43,7 @@ int ns_cgroup_clone(struct task_struct *task, struct pid *pid)
>  *        ancestor cgroup thereof)
>  */
>  static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
> -                        struct task_struct *task, bool threadgroup)
> +                        struct task_struct *task)
>  {
>        if (current != task) {
>                if (!capable(CAP_SYS_ADMIN))
> @@ -53,21 +53,13 @@ static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
>                        return -EPERM;
>        }
>
> -       if (!cgroup_is_descendant(new_cgroup, task))
> -               return -EPERM;
> -
> -       if (threadgroup) {
> -               struct task_struct *c;
> -               rcu_read_lock();
> -               list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
> -                       if (!cgroup_is_descendant(new_cgroup, c)) {
> -                               rcu_read_unlock();
> -                               return -EPERM;
> -                       }
> -               }
> -               rcu_read_unlock();
> -       }
> +       return 0;
> +}
>
> +static int ns_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
> +{
> +       if (!cgroup_is_descendant(cgrp, tsk))
> +               return -EPERM;
>        return 0;
>  }
>
> @@ -112,6 +104,7 @@ static void ns_destroy(struct cgroup_subsys *ss,
>  struct cgroup_subsys ns_subsys = {
>        .name = "ns",
>        .can_attach = ns_can_attach,
> +       .can_attach_task = ns_can_attach_task,
>        .create = ns_create,
>        .destroy  = ns_destroy,
>        .subsys_id = ns_subsys_id,
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 218ef20..d619f1d 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -8655,42 +8655,10 @@ cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>        return 0;
>  }
>
> -static int
> -cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
> -                     struct task_struct *tsk, bool threadgroup)
> -{
> -       int retval = cpu_cgroup_can_attach_task(cgrp, tsk);
> -       if (retval)
> -               return retval;
> -       if (threadgroup) {
> -               struct task_struct *c;
> -               rcu_read_lock();
> -               list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
> -                       retval = cpu_cgroup_can_attach_task(cgrp, c);
> -                       if (retval) {
> -                               rcu_read_unlock();
> -                               return retval;
> -                       }
> -               }
> -               rcu_read_unlock();
> -       }
> -       return 0;
> -}
> -
>  static void
> -cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
> -                 struct cgroup *old_cont, struct task_struct *tsk,
> -                 bool threadgroup)
> +cpu_cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  {
>        sched_move_task(tsk);
> -       if (threadgroup) {
> -               struct task_struct *c;
> -               rcu_read_lock();
> -               list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
> -                       sched_move_task(c);
> -               }
> -               rcu_read_unlock();
> -       }
>  }
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> @@ -8763,8 +8731,8 @@ struct cgroup_subsys cpu_cgroup_subsys = {
>        .name           = "cpu",
>        .create         = cpu_cgroup_create,
>        .destroy        = cpu_cgroup_destroy,
> -       .can_attach     = cpu_cgroup_can_attach,
> -       .attach         = cpu_cgroup_attach,
> +       .can_attach_task = cpu_cgroup_can_attach_task,
> +       .attach_task    = cpu_cgroup_attach_task,
>        .populate       = cpu_cgroup_populate,
>        .subsys_id      = cpu_cgroup_subsys_id,
>        .early_init     = 1,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 729beb7..995f0b9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4720,8 +4720,7 @@ static void mem_cgroup_clear_mc(void)
>
>  static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
>                                struct cgroup *cgroup,
> -                               struct task_struct *p,
> -                               bool threadgroup)
> +                               struct task_struct *p)
>  {
>        int ret = 0;
>        struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
> @@ -4775,8 +4774,7 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
>
>  static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
>                                struct cgroup *cgroup,
> -                               struct task_struct *p,
> -                               bool threadgroup)
> +                               struct task_struct *p)
>  {
>        mem_cgroup_clear_mc();
>  }
> @@ -4880,8 +4878,7 @@ static void mem_cgroup_move_charge(struct mm_struct *mm)
>  static void mem_cgroup_move_task(struct cgroup_subsys *ss,
>                                struct cgroup *cont,
>                                struct cgroup *old_cont,
> -                               struct task_struct *p,
> -                               bool threadgroup)
> +                               struct task_struct *p)
>  {
>        if (!mc.mm)
>                /* no need to move charge */
> @@ -4893,22 +4890,19 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
>  #else  /* !CONFIG_MMU */
>  static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
>                                struct cgroup *cgroup,
> -                               struct task_struct *p,
> -                               bool threadgroup)
> +                               struct task_struct *p)
>  {
>        return 0;
>  }
>  static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
>                                struct cgroup *cgroup,
> -                               struct task_struct *p,
> -                               bool threadgroup)
> +                               struct task_struct *p)
>  {
>  }
>  static void mem_cgroup_move_task(struct cgroup_subsys *ss,
>                                struct cgroup *cont,
>                                struct cgroup *old_cont,
> -                               struct task_struct *p,
> -                               bool threadgroup)
> +                               struct task_struct *p)
>  {
>  }
>  #endif
> diff --git a/security/device_cgroup.c b/security/device_cgroup.c
> index 8d9c48f..cd1f779 100644
> --- a/security/device_cgroup.c
> +++ b/security/device_cgroup.c
> @@ -62,8 +62,7 @@ static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
>  struct cgroup_subsys devices_subsys;
>
>  static int devcgroup_can_attach(struct cgroup_subsys *ss,
> -               struct cgroup *new_cgroup, struct task_struct *task,
> -               bool threadgroup)
> +               struct cgroup *new_cgroup, struct task_struct *task)
>  {
>        if (current != task && !capable(CAP_SYS_ADMIN))
>                        return -EPERM;
>

^ permalink raw reply	[flat|nested] 185+ messages in thread
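
To make the new hooks concrete, a minimal, hypothetical subsystem that only
needs per-thread work might register the callbacks as sketched below;
"example" and example_subsys_id are placeholders, and the mandatory
create/destroy/populate handlers are omitted for brevity. Per the
documentation above, pre_attach() runs once per attach operation, then
attach_task() once per thread, then attach() once at the end.

static int example_can_attach_task(struct cgroup *cgrp,
                                   struct task_struct *tsk)
{
        /* Per-thread admission check; called once for every thread moved. */
        return 0;
}

static void example_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
{
        /* Per-thread attach work; runs after pre_attach(), before attach(). */
}

struct cgroup_subsys example_subsys = {
        .name            = "example",
        .can_attach_task = example_can_attach_task,
        .attach_task     = example_attach_task,
        .subsys_id       = example_subsys_id,   /* placeholder */
};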

* Re: [PATCH v8 2/3] cgroups: add per-thread subsystem callbacks
  2011-02-08  1:39                 ` Ben Blum
  (?)
@ 2011-03-03 17:59                 ` Paul Menage
  -1 siblings, 0 replies; 185+ messages in thread
From: Paul Menage @ 2011-03-03 17:59 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, oleg,
	David Rientjes, Miao Xie

On Mon, Feb 7, 2011 at 5:39 PM, Ben Blum <bblum@andrew.cmu.edu> wrote:
> Add cgroup subsystem callbacks for per-thread attachment
>
> From: Ben Blum <bblum@andrew.cmu.edu>
>
> This patch adds can_attach_task, pre_attach, and attach_task as new callbacks
> for cgroups' subsystem interface. Unlike can_attach and attach, these are for
> per-thread operations, to be called potentially many times when attaching an
> entire threadgroup.
>
> Also, the old "bool threadgroup" interface is removed, as replaced by this.
> All subsystems are modified for the new interface - of note is cpuset, which
> requires from/to nodemasks for attach to be globally scoped (though per-cpuset
> would work too) to persist from its pre_attach to attach_task and attach.
>
> This is a pre-patch for cgroup-procs-writable.patch.
>
> Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>

Reviewed-by: Paul Menage <menage@google.com>

Paul

> ---
>  Documentation/cgroups/cgroups.txt |   30 ++++++++---
>  block/blk-cgroup.c                |   18 ++----
>  include/linux/cgroup.h            |   10 ++--
>  kernel/cgroup.c                   |   17 +++++-
>  kernel/cgroup_freezer.c           |   26 ++++-----
>  kernel/cpuset.c                   |  105 ++++++++++++++++---------------------
>  kernel/ns_cgroup.c                |   23 +++-----
>  kernel/sched.c                    |   38 +------------
>  mm/memcontrol.c                   |   18 ++----
>  security/device_cgroup.c          |    3 -
>  10 files changed, 122 insertions(+), 166 deletions(-)
>
> diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
> index 190018b..d3c9a24 100644
> --- a/Documentation/cgroups/cgroups.txt
> +++ b/Documentation/cgroups/cgroups.txt
> @@ -563,7 +563,7 @@ rmdir() will fail with it. From this behavior, pre_destroy() can be
>  called multiple times against a cgroup.
>
>  int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
> -              struct task_struct *task, bool threadgroup)
> +              struct task_struct *task)
>  (cgroup_mutex held by caller)
>
>  Called prior to moving a task into a cgroup; if the subsystem
> @@ -572,9 +572,14 @@ task is passed, then a successful result indicates that *any*
>  unspecified task can be moved into the cgroup. Note that this isn't
>  called on a fork. If this method returns 0 (success) then this should
>  remain valid while the caller holds cgroup_mutex and it is ensured that either
> -attach() or cancel_attach() will be called in future. If threadgroup is
> -true, then a successful result indicates that all threads in the given
> -thread's threadgroup can be moved together.
> +attach() or cancel_attach() will be called in future.
> +
> +int can_attach_task(struct cgroup *cgrp, struct task_struct *tsk);
> +(cgroup_mutex held by caller)
> +
> +As can_attach, but for operations that must be run once per task to be
> +attached (possibly many when using cgroup_attach_proc). Called after
> +can_attach.
>
>  void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
>               struct task_struct *task, bool threadgroup)
> @@ -586,15 +591,24 @@ function, so that the subsystem can implement a rollback. If not, not necessary.
>  This will be called only about subsystems whose can_attach() operation have
>  succeeded.
>
> +void pre_attach(struct cgroup *cgrp);
> +(cgroup_mutex held by caller)
> +
> +For any non-per-thread attachment work that needs to happen before
> +attach_task. Needed by cpuset.
> +
>  void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
> -           struct cgroup *old_cgrp, struct task_struct *task,
> -           bool threadgroup)
> +           struct cgroup *old_cgrp, struct task_struct *task)
>  (cgroup_mutex held by caller)
>
>  Called after the task has been attached to the cgroup, to allow any
>  post-attachment activity that requires memory allocations or blocking.
> -If threadgroup is true, the subsystem should take care of all threads
> -in the specified thread's threadgroup. Currently does not support any
> +
> +void attach_task(struct cgroup *cgrp, struct task_struct *tsk);
> +(cgroup_mutex held by caller)
> +
> +As attach, but for operations that must be run once per task to be attached,
> +like can_attach_task. Called before attach. Currently does not support any
>  subsystem that might need the old_cgrp for every thread in the group.
>
>  void fork(struct cgroup_subsy *ss, struct task_struct *task)
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index b1febd0..45b3809 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -30,10 +30,8 @@ EXPORT_SYMBOL_GPL(blkio_root_cgroup);
>
>  static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
>                                                  struct cgroup *);
> -static int blkiocg_can_attach(struct cgroup_subsys *, struct cgroup *,
> -                             struct task_struct *, bool);
> -static void blkiocg_attach(struct cgroup_subsys *, struct cgroup *,
> -                          struct cgroup *, struct task_struct *, bool);
> +static int blkiocg_can_attach_task(struct cgroup *, struct task_struct *);
> +static void blkiocg_attach_task(struct cgroup *, struct task_struct *);
>  static void blkiocg_destroy(struct cgroup_subsys *, struct cgroup *);
>  static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
>
> @@ -46,8 +44,8 @@ static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
>  struct cgroup_subsys blkio_subsys = {
>        .name = "blkio",
>        .create = blkiocg_create,
> -       .can_attach = blkiocg_can_attach,
> -       .attach = blkiocg_attach,
> +       .can_attach_task = blkiocg_can_attach_task,
> +       .attach_task = blkiocg_attach_task,
>        .destroy = blkiocg_destroy,
>        .populate = blkiocg_populate,
>  #ifdef CONFIG_BLK_CGROUP
> @@ -1475,9 +1473,7 @@ done:
>  * of the main cic data structures.  For now we allow a task to change
>  * its cgroup only if it's the only owner of its ioc.
>  */
> -static int blkiocg_can_attach(struct cgroup_subsys *subsys,
> -                               struct cgroup *cgroup, struct task_struct *tsk,
> -                               bool threadgroup)
> +static int blkiocg_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  {
>        struct io_context *ioc;
>        int ret = 0;
> @@ -1492,9 +1488,7 @@ static int blkiocg_can_attach(struct cgroup_subsys *subsys,
>        return ret;
>  }
>
> -static void blkiocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
> -                               struct cgroup *prev, struct task_struct *tsk,
> -                               bool threadgroup)
> +static void blkiocg_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  {
>        struct io_context *ioc;
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index ce104e3..35b69b4 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -467,12 +467,14 @@ struct cgroup_subsys {
>        int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>        void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>        int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
> -                         struct task_struct *tsk, bool threadgroup);
> +                         struct task_struct *tsk);
> +       int (*can_attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
>        void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
> -                         struct task_struct *tsk, bool threadgroup);
> +                             struct task_struct *tsk);
> +       void (*pre_attach)(struct cgroup *cgrp);
> +       void (*attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
>        void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
> -                       struct cgroup *old_cgrp, struct task_struct *tsk,
> -                       bool threadgroup);
> +                      struct cgroup *old_cgrp, struct task_struct *tsk);
>        void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
>        void (*exit)(struct cgroup_subsys *ss, struct task_struct *task);
>        int (*populate)(struct cgroup_subsys *ss,
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 66a416b..616f27a 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1750,7 +1750,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>
>        for_each_subsys(root, ss) {
>                if (ss->can_attach) {
> -                       retval = ss->can_attach(ss, cgrp, tsk, false);
> +                       retval = ss->can_attach(ss, cgrp, tsk);
>                        if (retval) {
>                                /*
>                                 * Remember on which subsystem the can_attach()
> @@ -1762,6 +1762,13 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>                                goto out;
>                        }
>                }
> +               if (ss->can_attach_task) {
> +                       retval = ss->can_attach_task(cgrp, tsk);
> +                       if (retval) {
> +                               failed_ss = ss;
> +                               goto out;
> +                       }
> +               }
>        }
>
>        task_lock(tsk);
> @@ -1798,8 +1805,12 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>        write_unlock(&css_set_lock);
>
>        for_each_subsys(root, ss) {
> +               if (ss->pre_attach)
> +                       ss->pre_attach(cgrp);
> +               if (ss->attach_task)
> +                       ss->attach_task(cgrp, tsk);
>                if (ss->attach)
> -                       ss->attach(ss, cgrp, oldcgrp, tsk, false);
> +                       ss->attach(ss, cgrp, oldcgrp, tsk);
>        }
>        set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
>        synchronize_rcu();
> @@ -1822,7 +1833,7 @@ out:
>                                 */
>                                break;
>                        if (ss->cancel_attach)
> -                               ss->cancel_attach(ss, cgrp, tsk, false);
> +                               ss->cancel_attach(ss, cgrp, tsk);
>                }
>        }
>        return retval;
> diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
> index e7bebb7..e691818 100644
> --- a/kernel/cgroup_freezer.c
> +++ b/kernel/cgroup_freezer.c
> @@ -160,7 +160,7 @@ static void freezer_destroy(struct cgroup_subsys *ss,
>  */
>  static int freezer_can_attach(struct cgroup_subsys *ss,
>                              struct cgroup *new_cgroup,
> -                             struct task_struct *task, bool threadgroup)
> +                             struct task_struct *task)
>  {
>        struct freezer *freezer;
>
> @@ -172,26 +172,17 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
>        if (freezer->state != CGROUP_THAWED)
>                return -EBUSY;
>
> +       return 0;
> +}
> +
> +static int freezer_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
> +{
>        rcu_read_lock();
> -       if (__cgroup_freezing_or_frozen(task)) {
> +       if (__cgroup_freezing_or_frozen(tsk)) {
>                rcu_read_unlock();
>                return -EBUSY;
>        }
>        rcu_read_unlock();
> -
> -       if (threadgroup) {
> -               struct task_struct *c;
> -
> -               rcu_read_lock();
> -               list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
> -                       if (__cgroup_freezing_or_frozen(c)) {
> -                               rcu_read_unlock();
> -                               return -EBUSY;
> -                       }
> -               }
> -               rcu_read_unlock();
> -       }
> -
>        return 0;
>  }
>
> @@ -390,6 +381,9 @@ struct cgroup_subsys freezer_subsys = {
>        .populate       = freezer_populate,
>        .subsys_id      = freezer_subsys_id,
>        .can_attach     = freezer_can_attach,
> +       .can_attach_task = freezer_can_attach_task,
> +       .pre_attach     = NULL,
> +       .attach_task    = NULL,
>        .attach         = NULL,
>        .fork           = freezer_fork,
>        .exit           = NULL,
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index 4349935..5f71ca2 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -1372,14 +1372,10 @@ static int fmeter_getrate(struct fmeter *fmp)
>        return val;
>  }
>
> -/* Protected by cgroup_lock */
> -static cpumask_var_t cpus_attach;
> -
>  /* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
>  static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
> -                            struct task_struct *tsk, bool threadgroup)
> +                            struct task_struct *tsk)
>  {
> -       int ret;
>        struct cpuset *cs = cgroup_cs(cont);
>
>        if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
> @@ -1396,29 +1392,42 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
>        if (tsk->flags & PF_THREAD_BOUND)
>                return -EINVAL;
>
> -       ret = security_task_setscheduler(tsk);
> -       if (ret)
> -               return ret;
> -       if (threadgroup) {
> -               struct task_struct *c;
> -
> -               rcu_read_lock();
> -               list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
> -                       ret = security_task_setscheduler(c);
> -                       if (ret) {
> -                               rcu_read_unlock();
> -                               return ret;
> -                       }
> -               }
> -               rcu_read_unlock();
> -       }
>        return 0;
>  }
>
> -static void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
> -                              struct cpuset *cs)
> +static int cpuset_can_attach_task(struct cgroup *cgrp, struct task_struct *task)
> +{
> +       return security_task_setscheduler(task);
> +}
> +
> +/*
> + * Protected by cgroup_lock. The nodemasks must be stored globally because
> + * dynamically allocating them is not allowed in pre_attach, and they must
> + * persist among pre_attach, attach_task, and attach.
> + */
> +static cpumask_var_t cpus_attach;
> +static nodemask_t cpuset_attach_nodemask_from;
> +static nodemask_t cpuset_attach_nodemask_to;
> +
> +/* Set-up work for before attaching each task. */
> +static void cpuset_pre_attach(struct cgroup *cont)
> +{
> +       struct cpuset *cs = cgroup_cs(cont);
> +
> +       if (cs == &top_cpuset)
> +               cpumask_copy(cpus_attach, cpu_possible_mask);
> +       else
> +               guarantee_online_cpus(cs, cpus_attach);
> +
> +       guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
> +}
> +
> +/* Per-thread attachment work. */
> +static void cpuset_attach_task(struct cgroup *cont, struct task_struct *tsk)
>  {
>        int err;
> +       struct cpuset *cs = cgroup_cs(cont);
> +
>        /*
>         * can_attach beforehand should guarantee that this doesn't fail.
>         * TODO: have a better way to handle failure here
> @@ -1426,56 +1435,31 @@ static void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
>        err = set_cpus_allowed_ptr(tsk, cpus_attach);
>        WARN_ON_ONCE(err);
>
> -       cpuset_change_task_nodemask(tsk, to);
> +       cpuset_change_task_nodemask(tsk, &cpuset_attach_nodemask_to);
>        cpuset_update_task_spread_flag(cs, tsk);
> -
>  }
>
>  static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
> -                         struct cgroup *oldcont, struct task_struct *tsk,
> -                         bool threadgroup)
> +                         struct cgroup *oldcont, struct task_struct *tsk)
>  {
>        struct mm_struct *mm;
>        struct cpuset *cs = cgroup_cs(cont);
>        struct cpuset *oldcs = cgroup_cs(oldcont);
> -       NODEMASK_ALLOC(nodemask_t, from, GFP_KERNEL);
> -       NODEMASK_ALLOC(nodemask_t, to, GFP_KERNEL);
> -
> -       if (from == NULL || to == NULL)
> -               goto alloc_fail;
>
> -       if (cs == &top_cpuset) {
> -               cpumask_copy(cpus_attach, cpu_possible_mask);
> -       } else {
> -               guarantee_online_cpus(cs, cpus_attach);
> -       }
> -       guarantee_online_mems(cs, to);
> -
> -       /* do per-task migration stuff possibly for each in the threadgroup */
> -       cpuset_attach_task(tsk, to, cs);
> -       if (threadgroup) {
> -               struct task_struct *c;
> -               rcu_read_lock();
> -               list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
> -                       cpuset_attach_task(c, to, cs);
> -               }
> -               rcu_read_unlock();
> -       }
> -
> -       /* change mm; only needs to be done once even if threadgroup */
> -       *from = oldcs->mems_allowed;
> -       *to = cs->mems_allowed;
> +       /*
> +        * Change mm, possibly for multiple threads in a threadgroup. This is
> +        * expensive and may sleep.
> +        */
> +       cpuset_attach_nodemask_from = oldcs->mems_allowed;
> +       cpuset_attach_nodemask_to = cs->mems_allowed;
>        mm = get_task_mm(tsk);
>        if (mm) {
> -               mpol_rebind_mm(mm, to);
> +               mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
>                if (is_memory_migrate(cs))
> -                       cpuset_migrate_mm(mm, from, to);
> +                       cpuset_migrate_mm(mm, &cpuset_attach_nodemask_from,
> +                                         &cpuset_attach_nodemask_to);
>                mmput(mm);
>        }
> -
> -alloc_fail:
> -       NODEMASK_FREE(from);
> -       NODEMASK_FREE(to);
>  }
>
>  /* The various types of files and directories in a cpuset file system */
> @@ -1928,6 +1912,9 @@ struct cgroup_subsys cpuset_subsys = {
>        .create = cpuset_create,
>        .destroy = cpuset_destroy,
>        .can_attach = cpuset_can_attach,
> +       .can_attach_task = cpuset_can_attach_task,
> +       .pre_attach = cpuset_pre_attach,
> +       .attach_task = cpuset_attach_task,
>        .attach = cpuset_attach,
>        .populate = cpuset_populate,
>        .post_clone = cpuset_post_clone,
> diff --git a/kernel/ns_cgroup.c b/kernel/ns_cgroup.c
> index 2c98ad9..1fc2b1b 100644
> --- a/kernel/ns_cgroup.c
> +++ b/kernel/ns_cgroup.c
> @@ -43,7 +43,7 @@ int ns_cgroup_clone(struct task_struct *task, struct pid *pid)
>  *        ancestor cgroup thereof)
>  */
>  static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
> -                        struct task_struct *task, bool threadgroup)
> +                        struct task_struct *task)
>  {
>        if (current != task) {
>                if (!capable(CAP_SYS_ADMIN))
> @@ -53,21 +53,13 @@ static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
>                        return -EPERM;
>        }
>
> -       if (!cgroup_is_descendant(new_cgroup, task))
> -               return -EPERM;
> -
> -       if (threadgroup) {
> -               struct task_struct *c;
> -               rcu_read_lock();
> -               list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
> -                       if (!cgroup_is_descendant(new_cgroup, c)) {
> -                               rcu_read_unlock();
> -                               return -EPERM;
> -                       }
> -               }
> -               rcu_read_unlock();
> -       }
> +       return 0;
> +}
>
> +static int ns_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
> +{
> +       if (!cgroup_is_descendant(cgrp, tsk))
> +               return -EPERM;
>        return 0;
>  }
>
> @@ -112,6 +104,7 @@ static void ns_destroy(struct cgroup_subsys *ss,
>  struct cgroup_subsys ns_subsys = {
>        .name = "ns",
>        .can_attach = ns_can_attach,
> +       .can_attach_task = ns_can_attach_task,
>        .create = ns_create,
>        .destroy  = ns_destroy,
>        .subsys_id = ns_subsys_id,
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 218ef20..d619f1d 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -8655,42 +8655,10 @@ cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>        return 0;
>  }
>
> -static int
> -cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
> -                     struct task_struct *tsk, bool threadgroup)
> -{
> -       int retval = cpu_cgroup_can_attach_task(cgrp, tsk);
> -       if (retval)
> -               return retval;
> -       if (threadgroup) {
> -               struct task_struct *c;
> -               rcu_read_lock();
> -               list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
> -                       retval = cpu_cgroup_can_attach_task(cgrp, c);
> -                       if (retval) {
> -                               rcu_read_unlock();
> -                               return retval;
> -                       }
> -               }
> -               rcu_read_unlock();
> -       }
> -       return 0;
> -}
> -
>  static void
> -cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
> -                 struct cgroup *old_cont, struct task_struct *tsk,
> -                 bool threadgroup)
> +cpu_cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  {
>        sched_move_task(tsk);
> -       if (threadgroup) {
> -               struct task_struct *c;
> -               rcu_read_lock();
> -               list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
> -                       sched_move_task(c);
> -               }
> -               rcu_read_unlock();
> -       }
>  }
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> @@ -8763,8 +8731,8 @@ struct cgroup_subsys cpu_cgroup_subsys = {
>        .name           = "cpu",
>        .create         = cpu_cgroup_create,
>        .destroy        = cpu_cgroup_destroy,
> -       .can_attach     = cpu_cgroup_can_attach,
> -       .attach         = cpu_cgroup_attach,
> +       .can_attach_task = cpu_cgroup_can_attach_task,
> +       .attach_task    = cpu_cgroup_attach_task,
>        .populate       = cpu_cgroup_populate,
>        .subsys_id      = cpu_cgroup_subsys_id,
>        .early_init     = 1,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 729beb7..995f0b9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4720,8 +4720,7 @@ static void mem_cgroup_clear_mc(void)
>
>  static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
>                                struct cgroup *cgroup,
> -                               struct task_struct *p,
> -                               bool threadgroup)
> +                               struct task_struct *p)
>  {
>        int ret = 0;
>        struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
> @@ -4775,8 +4774,7 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
>
>  static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
>                                struct cgroup *cgroup,
> -                               struct task_struct *p,
> -                               bool threadgroup)
> +                               struct task_struct *p)
>  {
>        mem_cgroup_clear_mc();
>  }
> @@ -4880,8 +4878,7 @@ static void mem_cgroup_move_charge(struct mm_struct *mm)
>  static void mem_cgroup_move_task(struct cgroup_subsys *ss,
>                                struct cgroup *cont,
>                                struct cgroup *old_cont,
> -                               struct task_struct *p,
> -                               bool threadgroup)
> +                               struct task_struct *p)
>  {
>        if (!mc.mm)
>                /* no need to move charge */
> @@ -4893,22 +4890,19 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
>  #else  /* !CONFIG_MMU */
>  static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
>                                struct cgroup *cgroup,
> -                               struct task_struct *p,
> -                               bool threadgroup)
> +                               struct task_struct *p)
>  {
>        return 0;
>  }
>  static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
>                                struct cgroup *cgroup,
> -                               struct task_struct *p,
> -                               bool threadgroup)
> +                               struct task_struct *p)
>  {
>  }
>  static void mem_cgroup_move_task(struct cgroup_subsys *ss,
>                                struct cgroup *cont,
>                                struct cgroup *old_cont,
> -                               struct task_struct *p,
> -                               bool threadgroup)
> +                               struct task_struct *p)
>  {
>  }
>  #endif
> diff --git a/security/device_cgroup.c b/security/device_cgroup.c
> index 8d9c48f..cd1f779 100644
> --- a/security/device_cgroup.c
> +++ b/security/device_cgroup.c
> @@ -62,8 +62,7 @@ static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
>  struct cgroup_subsys devices_subsys;
>
>  static int devcgroup_can_attach(struct cgroup_subsys *ss,
> -               struct cgroup *new_cgroup, struct task_struct *task,
> -               bool threadgroup)
> +               struct cgroup *new_cgroup, struct task_struct *task)
>  {
>        if (current != task && !capable(CAP_SYS_ADMIN))
>                        return -EPERM;
>
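
As an illustrative sketch of the resulting interface (not taken from
the patch itself - the "foo" subsystem, its helper functions and
foo_subsys_id are made-up placeholder names), a subsystem opting into
the split callbacks would wire them up roughly like this:

    /* whole-group check, run once per attach operation */
    static int foo_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
                              struct task_struct *tsk)
    {
            return 0;
    }

    /* per-thread check, run once for every thread being moved */
    static int foo_can_attach_task(struct cgroup *cgrp,
                                   struct task_struct *tsk)
    {
            return 0;
    }

    /* non-per-thread setup that must happen before any attach_task call */
    static void foo_pre_attach(struct cgroup *cgrp)
    {
    }

    /* per-thread attachment work */
    static void foo_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
    {
    }

    /* whole-group post-attachment work; may allocate or sleep */
    static void foo_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
                           struct cgroup *old_cgrp, struct task_struct *tsk)
    {
    }

    struct cgroup_subsys foo_subsys = {
            .name            = "foo",
            .can_attach      = foo_can_attach,
            .can_attach_task = foo_can_attach_task,
            .pre_attach      = foo_pre_attach,
            .attach_task     = foo_attach_task,
            .attach          = foo_attach,
            .subsys_id       = foo_subsys_id,
    };

Any of these hooks may be left NULL when a subsystem has no work to do
at that point, as freezer and cpu do above for the hooks they don't use.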

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8 3/3] cgroups: make procs file writable
       [not found]                 ` <20110208013950.GF31569-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2011-02-16 19:22                   ` [PATCH v8 4/3] cgroups: use flex_array in attach_proc Ben Blum
@ 2011-03-03 18:38                   ` Paul Menage
  1 sibling, 0 replies; 185+ messages in thread
From: Paul Menage @ 2011-03-03 18:38 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Miao Xie, David Rientjes, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Mon, Feb 7, 2011 at 5:39 PM, Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> Makes procs file writable to move all threads by tgid at once
>
> From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
>
> This patch adds functionality that enables users to move all threads in a
> threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
> file. This current implementation makes use of a per-threadgroup rwsem that's
> taken for reading in the fork() path to prevent newly forking threads within
> the threadgroup from "escaping" while the move is in progress.
>
> Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
> ---
> +       /* remember the number of threads in the array for later. */
> +       BUG_ON(i == 0);

This BUG_ON() seems unnecessary, given the i++ directly above it.

> +       group_size = i;
> +       rcu_read_unlock();
> +
> +       /*
> +        * step 1: check that we can legitimately attach to the cgroup.
> +        */
> +       for_each_subsys(root, ss) {
> +               if (ss->can_attach) {
> +                       retval = ss->can_attach(ss, cgrp, leader);
> +                       if (retval) {
> +                               failed_ss = ss;
> +                               goto out_cancel_attach;
> +                       }
> +               }
> +               /* a callback to be run on every thread in the threadgroup. */
> +               if (ss->can_attach_task) {
> +                       /* run on each task in the threadgroup. */
> +                       for (i = 0; i < group_size; i++) {
> +                               retval = ss->can_attach_task(cgrp, group[i]);
> +                               if (retval) {
> +                                       failed_ss = ss;

Should we be setting failed_ss here? Doesn't that mean that if all
subsystems pass the can_attach() check but the first one fails a
can_attach_task() check, we don't call any cancel_attach() methods?

What are the rollback semantics for failing a can_attach_task() check?

> +               if (threadgroup) {
> +                       /*
> +                        * it is safe to find group_leader because tsk was found
> +                        * in the tid map, meaning it can't have been unhashed
> +                        * by someone in de_thread changing the leadership.
> +                        */
> +                       tsk = tsk->group_leader;
> +                       BUG_ON(!thread_group_leader(tsk));

Can this race with an exiting/execing group leader?

> +               } else if (tsk->flags & PF_EXITING) {

The check for PF_EXITING doesn't apply to group leaders?

> +                       /* optimization for the single-task-only case */
> +                       rcu_read_unlock();
> +                       cgroup_unlock();
>                        return -ESRCH;
>                }
>
> +               /*
> +                * even if we're attaching all tasks in the thread group, we
> +                * only need to check permissions on one of them.
> +                */
>                tcred = __task_cred(tsk);
>                if (cred->euid &&
>                    cred->euid != tcred->uid &&
>                    cred->euid != tcred->suid) {
>                        rcu_read_unlock();
> +                       cgroup_unlock();
>                        return -EACCES;

Maybe turn these returns into "goto out;" statements and put the
unlock after the out: label?
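
Roughly something like this (just a sketch of the shape, ignoring the
early rcu_read_unlock() on the success path; the 'out' label and 'ret'
names are arbitrary):

    tcred = __task_cred(tsk);
    if (cred->euid &&
        cred->euid != tcred->uid &&
        cred->euid != tcred->suid) {
        ret = -EACCES;
        goto out;
    }
    ...
    out:
    rcu_read_unlock();
    cgroup_unlock();
    return ret;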

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8 3/3] cgroups: make procs file writable
  2011-02-08  1:39                 ` Ben Blum
  (?)
  (?)
@ 2011-03-03 18:38                 ` Paul Menage
       [not found]                   ` <AANLkTinEnNsu8=PEktXL_EECzGYqsgdf+uogGxe7k4W+-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2011-03-10  6:18                   ` Ben Blum
  -1 siblings, 2 replies; 185+ messages in thread
From: Paul Menage @ 2011-03-03 18:38 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, oleg,
	David Rientjes, Miao Xie

On Mon, Feb 7, 2011 at 5:39 PM, Ben Blum <bblum@andrew.cmu.edu> wrote:
> Makes procs file writable to move all threads by tgid at once
>
> From: Ben Blum <bblum@andrew.cmu.edu>
>
> This patch adds functionality that enables users to move all threads in a
> threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
> file. This current implementation makes use of a per-threadgroup rwsem that's
> taken for reading in the fork() path to prevent newly forking threads within
> the threadgroup from "escaping" while the move is in progress.
>
> Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
> ---
> +       /* remember the number of threads in the array for later. */
> +       BUG_ON(i == 0);

This BUG_ON() seems unnecessary, given the i++ directly above it.

> +       group_size = i;
> +       rcu_read_unlock();
> +
> +       /*
> +        * step 1: check that we can legitimately attach to the cgroup.
> +        */
> +       for_each_subsys(root, ss) {
> +               if (ss->can_attach) {
> +                       retval = ss->can_attach(ss, cgrp, leader);
> +                       if (retval) {
> +                               failed_ss = ss;
> +                               goto out_cancel_attach;
> +                       }
> +               }
> +               /* a callback to be run on every thread in the threadgroup. */
> +               if (ss->can_attach_task) {
> +                       /* run on each task in the threadgroup. */
> +                       for (i = 0; i < group_size; i++) {
> +                               retval = ss->can_attach_task(cgrp, group[i]);
> +                               if (retval) {
> +                                       failed_ss = ss;

Should we be setting failed_ss here? Doesn't that mean that if all
subsystems pass the can_attach() check but the first one fails a
can_attach_task() check, we don't call any cancel_attach() methods?

What are the rollback semantics for failing a can_attach_task() check?

> +               if (threadgroup) {
> +                       /*
> +                        * it is safe to find group_leader because tsk was found
> +                        * in the tid map, meaning it can't have been unhashed
> +                        * by someone in de_thread changing the leadership.
> +                        */
> +                       tsk = tsk->group_leader;
> +                       BUG_ON(!thread_group_leader(tsk));

Can this race with an exiting/execing group leader?

> +               } else if (tsk->flags & PF_EXITING) {

The check for PF_EXITING doesn't apply to group leaders?

> +                       /* optimization for the single-task-only case */
> +                       rcu_read_unlock();
> +                       cgroup_unlock();
>                        return -ESRCH;
>                }
>
> +               /*
> +                * even if we're attaching all tasks in the thread group, we
> +                * only need to check permissions on one of them.
> +                */
>                tcred = __task_cred(tsk);
>                if (cred->euid &&
>                    cred->euid != tcred->uid &&
>                    cred->euid != tcred->suid) {
>                        rcu_read_unlock();
> +                       cgroup_unlock();
>                        return -EACCES;

Maybe turn these returns into "goto out;" statements and put the
unlock after the out: label?

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8 3/3] cgroups: make procs file writable
       [not found]                   ` <AANLkTinEnNsu8=PEktXL_EECzGYqsgdf+uogGxe7k4W+-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-03-10  6:18                     ` Ben Blum
  0 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-03-10  6:18 UTC (permalink / raw)
  To: Paul Menage
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, Miao Xie, David Rientjes,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Thu, Mar 03, 2011 at 10:38:58AM -0800, Paul Menage wrote:
> On Mon, Feb 7, 2011 at 5:39 PM, Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> > Makes procs file writable to move all threads by tgid at once
> >
> > From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
> >
> > This patch adds functionality that enables users to move all threads in a
> > threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
> > file. This current implementation makes use of a per-threadgroup rwsem that's
> > taken for reading in the fork() path to prevent newly forking threads within
> > the threadgroup from "escaping" while the move is in progress.
> >
> > Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
> > ---
> > +       /* remember the number of threads in the array for later. */
> > +       BUG_ON(i == 0);
> 
> This BUG_ON() seems unnecessary, given the i++ directly above it.

It's meant to communicate that the loop must go through at least once,
so that 'struct cgroup *oldcgrp' will be initialised within a loop later
(setting it to NULL in the beginning is just to shut up the compiler.)

> 
> > +       group_size = i;
> > +       rcu_read_unlock();
> > +
> > +       /*
> > +        * step 1: check that we can legitimately attach to the cgroup.
> > +        */
> > +       for_each_subsys(root, ss) {
> > +               if (ss->can_attach) {
> > +                       retval = ss->can_attach(ss, cgrp, leader);
> > +                       if (retval) {
> > +                               failed_ss = ss;
> > +                               goto out_cancel_attach;
> > +                       }
> > +               }
> > +               /* a callback to be run on every thread in the threadgroup. */
> > +               if (ss->can_attach_task) {
> > +                       /* run on each task in the threadgroup. */
> > +                       for (i = 0; i < group_size; i++) {
> > +                               retval = ss->can_attach_task(cgrp, group[i]);
> > +                               if (retval) {
> > +                                       failed_ss = ss;
> 
> Should we be setting failed_ss here? Doesn't that mean that if all
> subsystems pass the can_attach() check but the first one fails a
> can_attach_task() check, we don't call any cancel_attach() methods?
> 
> What are the rollback semantics for failing a can_attach_task() check?

They are not called in that order - it's for_each_subsys { can_attach();
can_attach_task(); }. Although if the deal is that cancel_attach reverts
the things that can_attach does (and can_attach_task is separate) (is
this the case? it should probably go in the documentation), then passing
a can_attach and failing a can_attach_task should cause cancel_attach to
get called for that subsystem, which in this code it doesn't. Something
like:

    retval = ss->can_attach();
    if (retval) {
        failed_ss = ss;
        goto out_cancel_attach;
    }
    retval = ss->can_attach_task();
    if (retval) {
        failed_ss = ss;
        cancel_extra_ss = true;
        goto out_cancel_attach;
    }
    ...
    out_cancel_attach:
    if (retval) {
        for_each_subsys(root, ss) {
            if (ss == failed_ss) {
                if (cancel_extra_ss)
                    ss->cancel_attach();
                break;
            }
            ss->cancel_attach();
        }
    }

> 
> > +               if (threadgroup) {
> > +                       /*
> > +                        * it is safe to find group_leader because tsk was found
> > +                        * in the tid map, meaning it can't have been unhashed
> > +                        * by someone in de_thread changing the leadership.
> > +                        */
> > +                       tsk = tsk->group_leader;
> > +                       BUG_ON(!thread_group_leader(tsk));
> 
> Can this race with an exiting/execing group leader?

No, rcu_read_lock() is held.

> 
> > +               } else if (tsk->flags & PF_EXITING) {
> 
> The check for PF_EXITING doesn't apply to group leaders?

I remember discussing this bit a while back - the point being that if
the leader is PF_EXITING, we should still iterate over its group list.
(However, I did try to test it, and it looks like if a leader calls
sys_exit() then the whole group goes away; is this actually guaranteed?)

> 
> > +                       /* optimization for the single-task-only case */
> > +                       rcu_read_unlock();
> > +                       cgroup_unlock();
> >                        return -ESRCH;
> >                }
> >
> > +               /*
> > +                * even if we're attaching all tasks in the thread group, we
> > +                * only need to check permissions on one of them.
> > +                */
> >                tcred = __task_cred(tsk);
> >                if (cred->euid &&
> >                    cred->euid != tcred->uid &&
> >                    cred->euid != tcred->suid) {
> >                        rcu_read_unlock();
> > +                       cgroup_unlock();
> >                        return -EACCES;
> 
> Maybe turn these returns into "goto out;" statements and put the
> unlock after the out: label?
> 

Maybe; I didn't look too hard at that function. If I revise the patch I
can do this, though.

Thanks,
Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8 3/3] cgroups: make procs file writable
  2011-03-03 18:38                 ` [PATCH v8 3/3] cgroups: make procs file writable Paul Menage
       [not found]                   ` <AANLkTinEnNsu8=PEktXL_EECzGYqsgdf+uogGxe7k4W+-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-03-10  6:18                   ` Ben Blum
  2011-03-10 20:01                     ` Paul Menage
                                       ` (2 more replies)
  1 sibling, 3 replies; 185+ messages in thread
From: Ben Blum @ 2011-03-10  6:18 UTC (permalink / raw)
  To: Paul Menage
  Cc: Ben Blum, linux-kernel, containers, akpm, ebiederm, lizf,
	matthltc, oleg, David Rientjes, Miao Xie

On Thu, Mar 03, 2011 at 10:38:58AM -0800, Paul Menage wrote:
> On Mon, Feb 7, 2011 at 5:39 PM, Ben Blum <bblum@andrew.cmu.edu> wrote:
> > Makes procs file writable to move all threads by tgid at once
> >
> > From: Ben Blum <bblum@andrew.cmu.edu>
> >
> > This patch adds functionality that enables users to move all threads in a
> > threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
> > file. This current implementation makes use of a per-threadgroup rwsem that's
> > taken for reading in the fork() path to prevent newly forking threads within
> > the threadgroup from "escaping" while the move is in progress.
> >
> > Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
> > ---
> > +       /* remember the number of threads in the array for later. */
> > +       BUG_ON(i == 0);
> 
> This BUG_ON() seems unnecessary, given the i++ directly above it.

It's meant to communicate that the loop must go through at least once,
so that 'struct cgroup *oldcgrp' will be initialised within a loop later
(setting it to NULL in the beginning is just to shut up the compiler.)

> 
> > +       group_size = i;
> > +       rcu_read_unlock();
> > +
> > +       /*
> > +        * step 1: check that we can legitimately attach to the cgroup.
> > +        */
> > +       for_each_subsys(root, ss) {
> > +               if (ss->can_attach) {
> > +                       retval = ss->can_attach(ss, cgrp, leader);
> > +                       if (retval) {
> > +                               failed_ss = ss;
> > +                               goto out_cancel_attach;
> > +                       }
> > +               }
> > +               /* a callback to be run on every thread in the threadgroup. */
> > +               if (ss->can_attach_task) {
> > +                       /* run on each task in the threadgroup. */
> > +                       for (i = 0; i < group_size; i++) {
> > +                               retval = ss->can_attach_task(cgrp, group[i]);
> > +                               if (retval) {
> > +                                       failed_ss = ss;
> 
> Should we be setting failed_ss here? Doesn't that mean that if all
> subsystems pass the can_attach() check but the first one fails a
> can_attach_task() check, we don't call any cancel_attach() methods?
> 
> What are the rollback semantics for failing a can_attach_task() check?

They are not called in that order - it's for_each_subsys { can_attach();
can_attach_task(); }. Although if the deal is that cancel_attach reverts
the things that can_attach does (and can_attach_task is separate) (is
this the case? it should probably go in the documentation), then passing
a can_attach and failing a can_attach_task should cause cancel_attach to
get called for that subsystem, which in this code it doesn't. Something
like:

    retval = ss->can_attach();
    if (retval) {
        failed_ss = ss;
        goto out_cancel_attach;
    }
    retval = ss->can_attach_task();
    if (retval) {
        failed_ss = ss;
        cancel_extra_ss = true;
        goto out_cancel_attach;
    }
    ...
    out_cancel_attach:
    if (retval) {
        for_each_subsys(root, ss) {
            if (ss == failed_ss) {
                if (cancel_extra_ss)
                    ss->cancel_attach();
                break;
            }
            ss->cancel_attach();
        }
    }

> 
> > +               if (threadgroup) {
> > +                       /*
> > +                        * it is safe to find group_leader because tsk was found
> > +                        * in the tid map, meaning it can't have been unhashed
> > +                        * by someone in de_thread changing the leadership.
> > +                        */
> > +                       tsk = tsk->group_leader;
> > +                       BUG_ON(!thread_group_leader(tsk));
> 
> Can this race with an exiting/execing group leader?

No, rcu_read_lock() is held.

> 
> > +               } else if (tsk->flags & PF_EXITING) {
> 
> The check for PF_EXITING doesn't apply to group leaders?

I remember discussing this bit a while back - the point being that if
the leader is PF_EXITING, we should still iterate over its group list.
(However, I did try to test it, and it looks like if a leader calls
sys_exit() then the whole group goes away; is this actually guaranteed?)

> 
> > +                       /* optimization for the single-task-only case */
> > +                       rcu_read_unlock();
> > +                       cgroup_unlock();
> >                        return -ESRCH;
> >                }
> >
> > +               /*
> > +                * even if we're attaching all tasks in the thread group, we
> > +                * only need to check permissions on one of them.
> > +                */
> >                tcred = __task_cred(tsk);
> >                if (cred->euid &&
> >                    cred->euid != tcred->uid &&
> >                    cred->euid != tcred->suid) {
> >                        rcu_read_unlock();
> > +                       cgroup_unlock();
> >                        return -EACCES;
> 
> Maybe turn these returns into "goto out;" statements and put the
> unlock after the out: label?
> 

Maybe; I didn't look too hard at that function. If I revise the patch I
can do this, though.

Thanks,
Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8 3/3] cgroups: make procs file writable
       [not found]                     ` <20110310061831.GA23736-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2011-03-10 20:01                       ` Paul Menage
  2011-03-22  5:08                       ` Ben Blum
  1 sibling, 0 replies; 185+ messages in thread
From: Paul Menage @ 2011-03-10 20:01 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Miao Xie, David Rientjes, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Wed, Mar 9, 2011 at 10:18 PM, Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
>> This BUG_ON() seems unnecessary, given the i++ directly above it.
>
> It's meant to communicate that the loop must go through at least once,
> so that 'struct cgroup *oldcgrp' will be initialised within a loop later
> (setting it to NULL in the beginning is just to shut up the compiler.)

Right, but it's a do {} while() loop with no break in it - it's
impossible to not go through at least once...

>> Should we be setting failed_ss here? Doesn't that mean that if all
>> subsystems pass the can_attach() check but the first one fails a
>> can_attach_task() check, we don't call any cancel_attach() methods?
>>
>> What are the rollback semantics for failing a can_attach_task() check?
>
> They are not called in that order - it's for_each_subsys { can_attach();
> can_attach_task(); }.

Oh, fair point - I misread that.

> Although if the deal is that cancel_attach reverts
> the things that can_attach does (and can_attach_task is separate) (is
> this the case? it should probably go in the documentation), then passing
> a can_attach and failing a can_attach_task should cause cancel_attach to
> get called for that subsystem, which in this code it doesn't. Something
> like:
>
>    retval = ss->can_attach();
>    if (retval) {
>        failed_ss = ss;
>        goto out_cancel_attach;
>    }
>    retval = ss->can_attach_task();
>    if (retval) {
>        failed_ss = ss;
>        cancel_extra_ss = true;
>        goto out_cancel_attach;
>    }

Yes, but maybe call the flag cancel_failed_ss? Slightly more obvious,
to me at least.
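
So the rollback loop at the end would become something like (a sketch
only, with the NULL checks and the post-patch cancel_attach()
signature):

    out_cancel_attach:
        if (retval) {
            for_each_subsys(root, ss) {
                if (ss == failed_ss) {
                    if (cancel_failed_ss && ss->cancel_attach)
                        ss->cancel_attach(ss, cgrp, leader);
                    break;
                }
                if (ss->cancel_attach)
                    ss->cancel_attach(ss, cgrp, leader);
            }
        }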

>> > + ? ? ? ? ? ? ? ? ? ? ? BUG_ON(!thread_group_leader(tsk));
>>
>> Can this race with an exiting/execing group leader?
>
> No, rcu_read_lock() is held.
>

But rcu_read_lock() doesn't stop any actions - it just stops the data
structures from going away. Can't leadership change during an
execve()?

> (However, I did try to test it, and it looks like if a leader calls
> sys_exit() then the whole group goes away; is this actually guaranteed?)

I think so, but maybe not instantaneously.

Paul

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8 3/3] cgroups: make procs file writable
  2011-03-10  6:18                   ` Ben Blum
@ 2011-03-10 20:01                     ` Paul Menage
       [not found]                       ` <AANLkTikkmfwk0nV0p=omz2ddrw+ZqWF1Lx3EfO6dTjEQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2011-03-22  5:08                     ` Ben Blum
       [not found]                     ` <20110310061831.GA23736-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2 siblings, 1 reply; 185+ messages in thread
From: Paul Menage @ 2011-03-10 20:01 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, oleg,
	David Rientjes, Miao Xie

On Wed, Mar 9, 2011 at 10:18 PM, Ben Blum <bblum@andrew.cmu.edu> wrote:
>> This BUG_ON() seems unnecessary, given the i++ directly above it.
>
> It's meant to communicate that the loop must go through at least once,
> so that 'struct cgroup *oldcgrp' will be initialised within a loop later
> (setting it to NULL in the beginning is just to shut up the compiler.)

Right, but it's a do {} while() loop with no break in it - it's
impossible to not go through at least once...

>> Should we be setting failed_ss here? Doesn't that mean that if all
>> subsystems pass the can_attach() check but the first one fails a
>> can_attach_task() check, we don't call any cancel_attach() methods?
>>
>> What are the rollback semantics for failing a can_attach_task() check?
>
> They are not called in that order - it's for_each_subsys { can_attach();
> can_attach_task(); }.

Oh, fair point - I misread that.

> Although if the deal is that cancel_attach reverts
> the things that can_attach does (and can_attach_task is separate) (is
> this the case? it should probably go in the documentation), then passing
> a can_attach and failing a can_attach_task should cause cancel_attach to
> get called for that subsystem, which in this code it doesn't. Something
> like:
>
>    retval = ss->can_attach();
>    if (retval) {
>        failed_ss = ss;
>        goto out_cancel_attach;
>    }
>    retval = ss->can_attach_task();
>    if (retval) {
>        failed_ss = ss;
>        cancel_extra_ss = true;
>        goto out_cancel_attach;
>    }

Yes, but maybe call the flag cancel_failed_ss? Slightly more obvious,
to me at least.

>> > + ? ? ? ? ? ? ? ? ? ? ? BUG_ON(!thread_group_leader(tsk));
>>
>> Can this race with an exiting/execing group leader?
>
> No, rcu_read_lock() is held.
>

But rcu_read_lock() doesn't stop any actions - it just stops the data
structures from going away. Can't leadership change during an
execve()?

> (However, I did try to test it, and it looks like if a leader calls
> sys_exit() then the whole group goes away; is this actually guaranteed?)

I think so, but maybe not instantaneously.

Paul

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8 3/3] cgroups: make procs file writable
  2011-03-10 20:01                     ` Paul Menage
@ 2011-03-15 21:13                           ` Ben Blum
  0 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-03-15 21:13 UTC (permalink / raw)
  To: Paul Menage
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, Miao Xie, David Rientjes,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Thu, Mar 10, 2011 at 12:01:29PM -0800, Paul Menage wrote:
> On Wed, Mar 9, 2011 at 10:18 PM, Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> >> This BUG_ON() seems unnecessary, given the i++ directly above it.
> >
> > It's meant to communicate that the loop must go through at least once,
> > so that 'struct cgroup *oldcgrp' will be initialised within a loop later
> > (setting it to NULL in the beginning is just to shut up the compiler.)
> 
> Right, but it's a do {} while() loop with no break in it - it's
> impossible to not go through at least once...

OK; I guess it can go.

> > Although if the deal is that cancel_attach reverts
> > the things that can_attach does (and can_attach_task is separate) (is
> > this the case? it should probably go in the documentation), then passing
> > a can_attach and failing a can_attach_task should cause cancel_attach to
> > get called for that subsystem, which in this code it doesn't. Something
> > like:
> >
> >    retval = ss->can_attach();
> >    if (retval) {
> >        failed_ss = ss;
> >        goto out_cancel_attach;
> >    }
> >    retval = ss->can_attach_task();
> >    if (retval) {
> >        failed_ss = ss;
> >        cancel_extra_ss = true;
> >        goto out_cancel_attach;
> >    }
> 
> Yes, but maybe call the flag cancel_failed_ss? Slightly more obvious,
> to me at least.

Sounds good.

> >> > +                       BUG_ON(!thread_group_leader(tsk));
> >>
> >> Can this race with an exiting/execing group leader?
> >
> > No, rcu_read_lock() is held.
> >
> 
> But rcu_read_lock() doesn't stop any actions - it just stops the data
> structures from going away. Can't leadership change during an
> execve()?

Hmm, you may be right; my understanding of RCU is not complete. But
actually I think the BUG_ON should just be removed, since we're about to
drop locks before handing off to cgroup_attach_proc anyway (so at no
important part is the assertion guaranteed), which will detect and
EAGAIN if such a race happened.
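
Something in this spirit, as a rough sketch - the exact placement and
error path inside cgroup_attach_proc would differ:

    rcu_read_lock();
    if (!thread_group_leader(leader)) {
        /* raced with de_thread() changing leadership; caller may retry */
        rcu_read_unlock();
        retval = -EAGAIN;
        goto out;
    }
    rcu_read_unlock();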

> > (However, I did try to test it, and it looks like if a leader calls
> > sys_exit() then the whole group goes away; is this actually guaranteed?)
> 
> I think so, but maybe not instantaneously.
> 
> Paul
> 

Hmm, well, should I make this assumption, then? The code would not be
more complicated either way, really. I kind of prefer it as it is...

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8 3/3] cgroups: make procs file writable
@ 2011-03-15 21:13                           ` Ben Blum
  0 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-03-15 21:13 UTC (permalink / raw)
  To: Paul Menage
  Cc: Ben Blum, linux-kernel, containers, akpm, ebiederm, lizf,
	matthltc, oleg, David Rientjes, Miao Xie

On Thu, Mar 10, 2011 at 12:01:29PM -0800, Paul Menage wrote:
> On Wed, Mar 9, 2011 at 10:18 PM, Ben Blum <bblum@andrew.cmu.edu> wrote:
> >> This BUG_ON() seems unnecessary, given the i++ directly above it.
> >
> > It's meant to communicate that the loop must go through at least once,
> > so that 'struct cgroup *oldcgrp' will be initialised within a loop later
> > (setting it to NULL in the beginning is just to shut up the compiler.)
> 
> Right, but it's a do {} while() loop with no break in it - it's
> impossible to not go through at least once...

OK; I guess it can go.

> > Although if the deal is that cancel_attach reverts
> > the things that can_attach does (and can_attach_task is separate) (is
> > this the case? it should probably go in the documentation), then passing
> > a can_attach and failing a can_attach_task should cause cancel_attach to
> > get called for that subsystem, which in this code it doesn't. Something
> > like:
> >
> >    retval = ss->can_attach();
> >    if (retval) {
> >        failed_ss = ss;
> >        goto out_cancel_attach;
> >    }
> >    retval = ss->can_attach_task();
> >    if (retval) {
> >        failed_ss = ss;
> >        cancel_extra_ss = true;
> >        goto out_cancel_attach;
> >    }
> 
> Yes, but maybe call the flag cancel_failed_ss? Slightly more obvious,
> to me at least.

Sounds good.

> >> > +                       BUG_ON(!thread_group_leader(tsk));
> >>
> >> Can this race with an exiting/execing group leader?
> >
> > No, rcu_read_lock() is held.
> >
> 
> But rcu_read_lock() doesn't stop any actions - it just stops the data
> structures from going away. Can't leadership change during an
> execve()?

Hmm, you may be right; my understanding of RCU is not complete. But
actually I think the BUG_ON should just be removed, since we're about to
drop locks before handing off to cgroup_attach_proc anyway (so at no
important part is the assertion guaranteed), which will detect and
EAGAIN if such a race happened.

> > (However, I did try to test it, and it looks like if a leader calls
> > sys_exit() then the whole group goes away; is this actually guaranteed?)
> 
> I think so, but maybe not instantaneously.
> 
> Paul
> 

Hmm, well, should I make this assumption, then? The code would not be
more complicated either way, really. I kind of prefer it as it is...

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8 3/3] cgroups: make procs file writable
       [not found]                           ` <20110315211353.GA9992-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2011-03-18 16:54                             ` Paul Menage
  0 siblings, 0 replies; 185+ messages in thread
From: Paul Menage @ 2011-03-18 16:54 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Miao Xie, David Rientjes, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Tue, Mar 15, 2011 at 2:13 PM, Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
>
> Hmm, you may be right; my understanding of RCU is not complete. But
> actually I think the BUG_ON should just be removed, since we're about to
> drop locks before handing off to cgroup_attach_proc anyway (so at no
> important part is the assertion guaranteed), which will detect and
> EAGAIN if such a race happened.

Sounds good.

>
> Hmm, well, should I make this assumption, then? The code would not be
> more complicated either way, really. I kind of prefer it as it is...
>

OK, I guess either way is OK until we can prove otherwise :-)

Paul

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8 3/3] cgroups: make procs file writable
  2011-03-15 21:13                           ` Ben Blum
  (?)
@ 2011-03-18 16:54                           ` Paul Menage
       [not found]                             ` <AANLkTim4z_x_UQE__f5t73Dimja8PTTXTKKgj2phv6FY-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  -1 siblings, 1 reply; 185+ messages in thread
From: Paul Menage @ 2011-03-18 16:54 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, oleg,
	David Rientjes, Miao Xie

On Tue, Mar 15, 2011 at 2:13 PM, Ben Blum <bblum@andrew.cmu.edu> wrote:
>
> Hmm, you may be right; my understanding of RCU is not complete. But
> actually I think the BUG_ON should just be removed, since we're about to
> drop locks before handing off to cgroup_attach_proc anyway (so at no
> important part is the assertion guaranteed), which will detect and
> EAGAIN if such a race happened.

Sounds good.

>
> Hmm, well, should I make this assumption, then? The code would not be
> more complicated either way, really. I kind of prefer it as it is...
>

OK, I guess either way is OK until we can prove otherwise :-)

Paul

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8 3/3] cgroups: make procs file writable
       [not found]                     ` <20110310061831.GA23736-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2011-03-10 20:01                       ` Paul Menage
@ 2011-03-22  5:08                       ` Ben Blum
  1 sibling, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-03-22  5:08 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, Miao Xie, David Rientjes,
	Paul Menage, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Thu, Mar 10, 2011 at 01:18:31AM -0500, Ben Blum wrote:
> > > +                       /* optimization for the single-task-only case */
> > > +                       rcu_read_unlock();
> > > +                       cgroup_unlock();
> > >                        return -ESRCH;
> > >                }
> > >
> > > +               /*
> > > +                * even if we're attaching all tasks in the thread group, we
> > > +                * only need to check permissions on one of them.
> > > +                */
> > >                tcred = __task_cred(tsk);
> > >                if (cred->euid &&
> > >                    cred->euid != tcred->uid &&
> > >                    cred->euid != tcred->suid) {
> > >                        rcu_read_unlock();
> > > +                       cgroup_unlock();
> > >                        return -EACCES;
> > 
> > Maybe turn these returns into "goto out;" statements and put the
> > unlock after the out: label?
> > 
> 
> Maybe; I didn't look too hard at that function. If I revise the patch I
> can do this, though.

Looking back, I think I like it the way it is. Coalescing those unlock
paths would make it less clear: the RCU read lock is dropped in the middle
of the function (on the success path), so a shared bailout label would put
the failure handling far away from the point where it matters.
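
For reference, the two shapes being weighed look roughly like this (a
sketch of the error paths only, not the actual attach_task_by_pid() code):

	/* style A, as in the patch: unlock inline at each failure return */
	if (!tsk) {
		rcu_read_unlock();
		cgroup_unlock();
		return -ESRCH;
	}
	/* ... the success path then calls rcu_read_unlock() mid-function ... */

	/* style B, as suggested: one coalesced bailout label */
	if (!tsk) {
		ret = -ESRCH;
		goto out_unlock;
	}
	/* ... */
out_unlock:
	rcu_read_unlock();	/* far from the failure site, and already
				 * dropped by the success path by this point */
	cgroup_unlock();
	return ret;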

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8 3/3] cgroups: make procs file writable
  2011-03-10  6:18                   ` Ben Blum
  2011-03-10 20:01                     ` Paul Menage
@ 2011-03-22  5:08                     ` Ben Blum
       [not found]                     ` <20110310061831.GA23736-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-03-22  5:08 UTC (permalink / raw)
  To: Ben Blum
  Cc: Paul Menage, linux-kernel, containers, akpm, ebiederm, lizf,
	matthltc, oleg, David Rientjes, Miao Xie

On Thu, Mar 10, 2011 at 01:18:31AM -0500, Ben Blum wrote:
> > > +                       /* optimization for the single-task-only case */
> > > +                       rcu_read_unlock();
> > > +                       cgroup_unlock();
> > >                         return -ESRCH;
> > >                 }
> > >
> > > +               /*
> > > +                * even if we're attaching all tasks in the thread group, we
> > > +                * only need to check permissions on one of them.
> > > +                */
> > >                 tcred = __task_cred(tsk);
> > >                 if (cred->euid &&
> > >                     cred->euid != tcred->uid &&
> > >                     cred->euid != tcred->suid) {
> > >                         rcu_read_unlock();
> > > +                       cgroup_unlock();
> > >                         return -EACCES;
> > 
> > Maybe turn these returns into "goto out;" statements and put the
> > unlock after the out: label?
> > 
> 
> Maybe; I didn't look too hard at that function. If I revise the patch I
> can do this, though.

Looking back, I think I like it the way it is. Coalescing those unlock
paths would make it less clear: the RCU read lock is dropped in the middle
of the function (on the success path), so a shared bailout label would put
the failure handling far away from the point where it matters.

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8 4/3] cgroups: use flex_array in attach_proc
       [not found]                     ` <AANLkTinKTqBnjLKkv93UxyWoPL-2vyXP=LUvRz8JTC2K-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-03-22  5:15                       ` Ben Blum
  0 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-03-22  5:15 UTC (permalink / raw)
  To: Paul Menage
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, Miao Xie, David Rientjes,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Thu, Mar 03, 2011 at 09:48:09AM -0800, Paul Menage wrote:
> On Wed, Feb 16, 2011 at 11:22 AM, Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> > Convert cgroup_attach_proc to use flex_array.
> >
> > From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
> >
> > The cgroup_attach_proc implementation requires a pre-allocated array to store
> > task pointers to atomically move a thread-group, but asking for a monolithic
> > array with kmalloc() may be unreliable for very large groups. Using flex_array
> > provides the same functionality with less risk of failure.
> >
> > This is a post-patch for cgroup-procs-write.patch.
> >
> > Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
> 
> Reviewed-by: Paul Menage <menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> 
> Looks fine from a correctness point of view, but I'd be inclined to
> reduce the verbosity - rather than
> 
> tsk = flex_array_get_ptr(group, i);
> BUG_ON(tsk == NULL);
> retval = ss->can_attach_task(cgrp, tsk);
> 
> I'd just have
> 
> retval = ss->can_attach_task(cgrp, flex_array_get_ptr(group, i));
> 
> I don't think you need to be so defensive about flex_array's behaviour.
> 
> Paul
> 

hmm, in this case that change would make it cross 80 columns (and I
liked consistency). ;)

I've removed the BUG_ONs, though.

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8 4/3] cgroups: use flex_array in attach_proc
  2011-03-03 17:48                   ` Paul Menage
@ 2011-03-22  5:15                     ` Ben Blum
       [not found]                       ` <20110322051553.GB11447-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2011-03-22  5:19                       ` Ben Blum
       [not found]                     ` <AANLkTinKTqBnjLKkv93UxyWoPL-2vyXP=LUvRz8JTC2K-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 2 replies; 185+ messages in thread
From: Ben Blum @ 2011-03-22  5:15 UTC (permalink / raw)
  To: Paul Menage
  Cc: Ben Blum, linux-kernel, containers, akpm, ebiederm, lizf,
	matthltc, oleg, David Rientjes, Miao Xie

On Thu, Mar 03, 2011 at 09:48:09AM -0800, Paul Menage wrote:
> On Wed, Feb 16, 2011 at 11:22 AM, Ben Blum <bblum@andrew.cmu.edu> wrote:
> > Convert cgroup_attach_proc to use flex_array.
> >
> > From: Ben Blum <bblum@andrew.cmu.edu>
> >
> > The cgroup_attach_proc implementation requires a pre-allocated array to store
> > task pointers to atomically move a thread-group, but asking for a monolithic
> > array with kmalloc() may be unreliable for very large groups. Using flex_array
> > provides the same functionality with less risk of failure.
> >
> > This is a post-patch for cgroup-procs-write.patch.
> >
> > Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
> 
> Reviewed-by: Paul Menage <menage@google.com>
> 
> Looks fine from a correctness point of view, but I'd be inclined to
> reduce the verbosity - rather than
> 
> tsk = flex_array_get_ptr(group, i);
> BUG_ON(tsk == NULL);
> retval = ss->can_attach_task(cgrp, tsk);
> 
> I'd just have
> 
> retval = ss->can_attach_task(cgrp, flex_array_get_ptr(group, i));
> 
> I don't think you need to be so defensive about flex_array's behaviour.
> 
> Paul
> 

hmm, in this case that change would make it cross 80 columns (and I
liked consistency). ;)

I've removed the BUG_ONs, though.

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* [PATCH v8.5 3/3] cgroups: make procs file writable
  2011-03-18 16:54                           ` Paul Menage
@ 2011-03-22  5:18                                 ` Ben Blum
  0 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-03-22  5:18 UTC (permalink / raw)
  To: Paul Menage
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, Miao Xie, David Rientjes,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Makes procs file writable to move all threads by tgid at once

From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

This patch adds functionality that enables users to move all threads in a
threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
file. This current implementation makes use of a per-threadgroup rwsem that's
taken for reading in the fork() path to prevent newly forking threads within
the threadgroup from "escaping" while the move is in progress.
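
In outline, the synchronization works like the sketch below (illustrative
only, not the kernel code itself: the write side corresponds to the
threadgroup_fork_write_lock()/threadgroup_fork_write_unlock() calls further
down, and the read side to the rwsem being taken for reading around fork()):

/* illustrative sketch of the per-threadgroup rwsem -- not the real code */
static DECLARE_RWSEM(threadgroup_fork_sem);	/* per-threadgroup in reality */

/* cgroup.procs write path: exclude forks while every thread is moved */
static void move_whole_threadgroup(void)
{
	down_write(&threadgroup_fork_sem);
	/* snapshot the thread list and migrate each thread */
	up_write(&threadgroup_fork_sem);
}

/* CLONE_THREAD fork path: a new thread cannot appear mid-migration */
static void fork_new_thread(void)
{
	down_read(&threadgroup_fork_sem);
	/* set up the new thread's cgroup state */
	up_read(&threadgroup_fork_sem);
}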

Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
---
 Documentation/cgroups/cgroups.txt |    9 +
 kernel/cgroup.c                   |  441 +++++++++++++++++++++++++++++++++----
 2 files changed, 401 insertions(+), 49 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index d3c9a24..92d93d6 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -236,7 +236,8 @@ containing the following files describing that cgroup:
  - cgroup.procs: list of tgids in the cgroup.  This list is not
    guaranteed to be sorted or free of duplicate tgids, and userspace
    should sort/uniquify the list if this property is required.
-   This is a read-only file, for now.
+   Writing a thread group id into this file moves all threads in that
+   group into this cgroup.
  - notify_on_release flag: run the release agent on exit?
  - release_agent: the path to use for release notifications (this file
    exists in the top cgroup only)
@@ -426,6 +427,12 @@ You can attach the current shell task by echoing 0:
 
 # echo 0 > tasks
 
+You can use the cgroup.procs file instead of the tasks file to move all
+threads in a threadgroup at once. Echoing the pid of any task in a
+threadgroup to cgroup.procs causes all tasks in that threadgroup to be
+attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
+in the writing task's threadgroup.
+
 2.3 Mounting hierarchies by name
 --------------------------------
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 616f27a..273633c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1726,6 +1726,76 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 }
 EXPORT_SYMBOL_GPL(cgroup_path);
 
+/*
+ * cgroup_task_migrate - move a task from one cgroup to another.
+ *
+ * 'guarantee' is set if the caller promises that a new css_set for the task
+ * will already exist. If not set, this function might sleep, and can fail with
+ * -ENOMEM. Otherwise, it can only fail with -ESRCH.
+ */
+static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
+			       struct task_struct *tsk, bool guarantee)
+{
+	struct css_set *oldcg;
+	struct css_set *newcg;
+
+	/*
+	 * get old css_set. we need to take task_lock and refcount it, because
+	 * an exiting task can change its css_set to init_css_set and drop its
+	 * old one without taking cgroup_mutex.
+	 */
+	task_lock(tsk);
+	oldcg = tsk->cgroups;
+	get_css_set(oldcg);
+	task_unlock(tsk);
+
+	/* locate or allocate a new css_set for this task. */
+	if (guarantee) {
+		/* we know the css_set we want already exists. */
+		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+		read_lock(&css_set_lock);
+		newcg = find_existing_css_set(oldcg, cgrp, template);
+		BUG_ON(!newcg);
+		get_css_set(newcg);
+		read_unlock(&css_set_lock);
+	} else {
+		might_sleep();
+		/* find_css_set will give us newcg already referenced. */
+		newcg = find_css_set(oldcg, cgrp);
+		if (!newcg) {
+			put_css_set(oldcg);
+			return -ENOMEM;
+		}
+	}
+	put_css_set(oldcg);
+
+	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
+	task_lock(tsk);
+	if (tsk->flags & PF_EXITING) {
+		task_unlock(tsk);
+		put_css_set(newcg);
+		return -ESRCH;
+	}
+	rcu_assign_pointer(tsk->cgroups, newcg);
+	task_unlock(tsk);
+
+	/* Update the css_set linked lists if we're using them */
+	write_lock(&css_set_lock);
+	if (!list_empty(&tsk->cg_list))
+		list_move(&tsk->cg_list, &newcg->tasks);
+	write_unlock(&css_set_lock);
+
+	/*
+	 * We just gained a reference on oldcg by taking it from the task. As
+	 * trading it for newcg is protected by cgroup_mutex, we're safe to drop
+	 * it here; it will be freed under RCU.
+	 */
+	put_css_set(oldcg);
+
+	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+	return 0;
+}
+
 /**
  * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
  * @cgrp: the cgroup the task is attaching to
@@ -1736,11 +1806,9 @@ EXPORT_SYMBOL_GPL(cgroup_path);
  */
 int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
-	int retval = 0;
+	int retval;
 	struct cgroup_subsys *ss, *failed_ss = NULL;
 	struct cgroup *oldcgrp;
-	struct css_set *cg;
-	struct css_set *newcg;
 	struct cgroupfs_root *root = cgrp->root;
 
 	/* Nothing to do if the task is already in that cgroup */
@@ -1771,38 +1839,9 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		}
 	}
 
-	task_lock(tsk);
-	cg = tsk->cgroups;
-	get_css_set(cg);
-	task_unlock(tsk);
-	/*
-	 * Locate or allocate a new css_set for this task,
-	 * based on its final set of cgroups
-	 */
-	newcg = find_css_set(cg, cgrp);
-	put_css_set(cg);
-	if (!newcg) {
-		retval = -ENOMEM;
-		goto out;
-	}
-
-	task_lock(tsk);
-	if (tsk->flags & PF_EXITING) {
-		task_unlock(tsk);
-		put_css_set(newcg);
-		retval = -ESRCH;
+	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, false);
+	if (retval)
 		goto out;
-	}
-	rcu_assign_pointer(tsk->cgroups, newcg);
-	task_unlock(tsk);
-
-	/* Update the css_set linked lists if we're using them */
-	write_lock(&css_set_lock);
-	if (!list_empty(&tsk->cg_list)) {
-		list_del(&tsk->cg_list);
-		list_add(&tsk->cg_list, &newcg->tasks);
-	}
-	write_unlock(&css_set_lock);
 
 	for_each_subsys(root, ss) {
 		if (ss->pre_attach)
@@ -1812,9 +1851,8 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		if (ss->attach)
 			ss->attach(ss, cgrp, oldcgrp, tsk);
 	}
-	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+
 	synchronize_rcu();
-	put_css_set(cg);
 
 	/*
 	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
@@ -1864,49 +1902,356 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 EXPORT_SYMBOL_GPL(cgroup_attach_task_all);
 
 /*
- * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
- * held. May take task_lock of task
+ * cgroup_attach_proc works in two stages, the first of which prefetches all
+ * new css_sets needed (to make sure we have enough memory before committing
+ * to the move) and stores them in a list of entries of the following type.
+ * TODO: possible optimization: use css_set->rcu_head for chaining instead
+ */
+struct cg_list_entry {
+	struct css_set *cg;
+	struct list_head links;
+};
+
+static bool css_set_check_fetched(struct cgroup *cgrp,
+				  struct task_struct *tsk, struct css_set *cg,
+				  struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+
+	read_lock(&css_set_lock);
+	newcg = find_existing_css_set(cg, cgrp, template);
+	if (newcg)
+		get_css_set(newcg);
+	read_unlock(&css_set_lock);
+
+	/* doesn't exist at all? */
+	if (!newcg)
+		return false;
+	/* see if it's already in the list */
+	list_for_each_entry(cg_entry, newcg_list, links) {
+		if (cg_entry->cg == newcg) {
+			put_css_set(newcg);
+			return true;
+		}
+	}
+
+	/* not found */
+	put_css_set(newcg);
+	return false;
+}
+
+/*
+ * Find the new css_set and store it in the list in preparation for moving the
+ * given task to the given cgroup. Returns 0 or -ENOMEM.
+ */
+static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
+			    struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+
+	/* ensure a new css_set will exist for this thread */
+	newcg = find_css_set(cg, cgrp);
+	if (!newcg)
+		return -ENOMEM;
+	/* add it to the list */
+	cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
+	if (!cg_entry) {
+		put_css_set(newcg);
+		return -ENOMEM;
+	}
+	cg_entry->cg = newcg;
+	list_add(&cg_entry->links, newcg_list);
+	return 0;
+}
+
+/**
+ * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
+ * @cgrp: the cgroup to attach to
+ * @leader: the threadgroup leader task_struct of the group to be attached
+ *
+ * Call holding cgroup_mutex and the threadgroup_fork_lock of the leader. Will
+ * take task_lock of each thread in leader's threadgroup individually in turn.
+ */
+int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
+{
+	int retval, i, group_size;
+	struct cgroup_subsys *ss, *failed_ss = NULL;
+	bool cancel_failed_ss = false;
+	/* guaranteed to be initialized later, but the compiler needs this */
+	struct cgroup *oldcgrp = NULL;
+	struct css_set *oldcg;
+	struct cgroupfs_root *root = cgrp->root;
+	/* threadgroup list cursor and array */
+	struct task_struct *tsk;
+	struct task_struct **group;
+	/*
+	 * we need to make sure we have css_sets for all the tasks we're
+	 * going to move -before- we actually start moving them, so that in
+	 * case we get an ENOMEM we can bail out before making any changes.
+	 */
+	struct list_head newcg_list;
+	struct cg_list_entry *cg_entry, *temp_nobe;
+
+	/*
+	 * step 0: in order to do expensive, possibly blocking operations for
+	 * every thread, we cannot iterate the thread group list, since it needs
+	 * rcu or tasklist locked. instead, build an array of all threads in the
+	 * group - threadgroup_fork_lock prevents new threads from appearing,
+	 * and if threads exit, this will just be an over-estimate.
+	 */
+	group_size = get_nr_threads(leader);
+	group = kmalloc(group_size * sizeof(*group), GFP_KERNEL);
+	if (!group)
+		return -ENOMEM;
+
+	/* prevent changes to the threadgroup list while we take a snapshot. */
+	rcu_read_lock();
+	if (!thread_group_leader(leader)) {
+		/*
+		 * a race with de_thread from another thread's exec() may strip
+		 * us of our leadership, making while_each_thread unsafe to use
+		 * on this task. if this happens, there is no choice but to
+		 * throw this task away and try again (from cgroup_procs_write);
+		 * this is "double-double-toil-and-trouble-check locking".
+		 */
+		rcu_read_unlock();
+		retval = -EAGAIN;
+		goto out_free_group_list;
+	}
+	/* take a reference on each task in the group to go in the array. */
+	tsk = leader;
+	i = 0;
+	do {
+		/* as per above, nr_threads may decrease, but not increase. */
+		BUG_ON(i >= group_size);
+		get_task_struct(tsk);
+		group[i] = tsk;
+		i++;
+	} while_each_thread(leader, tsk);
+	/* remember the number of threads in the array for later. */
+	group_size = i;
+	rcu_read_unlock();
+
+	/*
+	 * step 1: check that we can legitimately attach to the cgroup.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->can_attach) {
+			retval = ss->can_attach(ss, cgrp, leader);
+			if (retval) {
+				failed_ss = ss;
+				goto out_cancel_attach;
+			}
+		}
+		/* a callback to be run on every thread in the threadgroup. */
+		if (ss->can_attach_task) {
+			/* run on each task in the threadgroup. */
+			for (i = 0; i < group_size; i++) {
+				retval = ss->can_attach_task(cgrp, group[i]);
+				if (retval) {
+					failed_ss = ss;
+					cancel_failed_ss = true;
+					goto out_cancel_attach;
+				}
+			}
+		}
+	}
+
+	/*
+	 * step 2: make sure css_sets exist for all threads to be migrated.
+	 * we use find_css_set, which allocates a new one if necessary.
+	 */
+	INIT_LIST_HEAD(&newcg_list);
+	for (i = 0; i < group_size; i++) {
+		tsk = group[i];
+		/* nothing to do if this task is already in the cgroup */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* get old css_set pointer */
+		task_lock(tsk);
+		if (tsk->flags & PF_EXITING) {
+			/* ignore this task if it's going away */
+			task_unlock(tsk);
+			continue;
+		}
+		oldcg = tsk->cgroups;
+		get_css_set(oldcg);
+		task_unlock(tsk);
+		/* see if the new one for us is already in the list? */
+		if (css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list)) {
+			/* was already there, nothing to do. */
+			put_css_set(oldcg);
+		} else {
+			/* we don't already have it. get new one. */
+			retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+			put_css_set(oldcg);
+			if (retval)
+				goto out_list_teardown;
+		}
+	}
+
+	/*
+	 * step 3: now that we're guaranteed success wrt the css_sets, proceed
+	 * to move all tasks to the new cgroup, calling ss->attach_task for each
+	 * one along the way. there are no failure cases after here, so this is
+	 * the commit point.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->pre_attach)
+			ss->pre_attach(cgrp);
+	}
+	for (i = 0; i < group_size; i++) {
+		tsk = group[i];
+		/* leave current thread as it is if it's already there */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* attach each task to each subsystem */
+		for_each_subsys(root, ss) {
+			if (ss->attach_task)
+				ss->attach_task(cgrp, tsk);
+		}
+		/* if the thread is PF_EXITING, it can just get skipped. */
+		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+	/* nothing is sensitive to fork() after this point. */
+
+	/*
+	 * step 4: do expensive, non-thread-specific subsystem callbacks.
+	 * TODO: if ever a subsystem needs to know the oldcgrp for each task
+	 * being moved, this call will need to be reworked to communicate that.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->attach)
+			ss->attach(ss, cgrp, oldcgrp, leader);
+	}
+
+	/*
+	 * step 5: success! and cleanup
+	 */
+	synchronize_rcu();
+	cgroup_wakeup_rmdir_waiter(cgrp);
+	retval = 0;
+out_list_teardown:
+	/* clean up the list of prefetched css_sets. */
+	list_for_each_entry_safe(cg_entry, temp_nobe, &newcg_list, links) {
+		list_del(&cg_entry->links);
+		put_css_set(cg_entry->cg);
+		kfree(cg_entry);
+	}
+out_cancel_attach:
+	/* same deal as in cgroup_attach_task */
+	if (retval) {
+		for_each_subsys(root, ss) {
+			if (ss == failed_ss) {
+				if (cancel_failed_ss && ss->cancel_attach)
+					ss->cancel_attach(ss, cgrp, leader);
+				break;
+			}
+			if (ss->cancel_attach)
+				ss->cancel_attach(ss, cgrp, leader);
+		}
+	}
+	/* clean up the array of referenced threads in the group. */
+	for (i = 0; i < group_size; i++)
+		put_task_struct(group[i]);
+out_free_group_list:
+	kfree(group);
+	return retval;
+}
+
+/*
+ * Find the task_struct of the task to attach by vpid and pass it along to the
+ * function to attach either it or all tasks in its threadgroup. Will take
+ * cgroup_mutex; may take task_lock of task.
  */
-static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
+static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 {
 	struct task_struct *tsk;
 	const struct cred *cred = current_cred(), *tcred;
 	int ret;
 
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+
 	if (pid) {
 		rcu_read_lock();
 		tsk = find_task_by_vpid(pid);
-		if (!tsk || tsk->flags & PF_EXITING) {
+		if (!tsk) {
 			rcu_read_unlock();
+			cgroup_unlock();
+			return -ESRCH;
+		}
+		if (threadgroup) {
+			/*
+			 * RCU protects this access, since tsk was found in the
+			 * tid map. a race with de_thread may cause group_leader
+			 * to stop being the leader, but cgroup_attach_proc will
+			 * detect it later.
+			 */
+			tsk = tsk->group_leader;
+		} else if (tsk->flags & PF_EXITING) {
+			/* optimization for the single-task-only case */
+			rcu_read_unlock();
+			cgroup_unlock();
 			return -ESRCH;
 		}
 
+		/*
+		 * even if we're attaching all tasks in the thread group, we
+		 * only need to check permissions on one of them.
+		 */
 		tcred = __task_cred(tsk);
 		if (cred->euid &&
 		    cred->euid != tcred->uid &&
 		    cred->euid != tcred->suid) {
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -EACCES;
 		}
 		get_task_struct(tsk);
 		rcu_read_unlock();
 	} else {
-		tsk = current;
+		if (threadgroup)
+			tsk = current->group_leader;
+		else
+			tsk = current;
 		get_task_struct(tsk);
 	}
 
-	ret = cgroup_attach_task(cgrp, tsk);
+	if (threadgroup) {
+		threadgroup_fork_write_lock(tsk);
+		ret = cgroup_attach_proc(cgrp, tsk);
+		threadgroup_fork_write_unlock(tsk);
+	} else {
+		ret = cgroup_attach_task(cgrp, tsk);
+	}
 	put_task_struct(tsk);
+	cgroup_unlock();
 	return ret;
 }
 
 static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
 {
+	return attach_task_by_pid(cgrp, pid, false);
+}
+
+static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
+{
 	int ret;
-	if (!cgroup_lock_live_group(cgrp))
-		return -ENODEV;
-	ret = attach_task_by_pid(cgrp, pid);
-	cgroup_unlock();
+	do {
+		/*
+		 * attach_proc fails with -EAGAIN if threadgroup leadership
+		 * changes in the middle of the operation, in which case we need
+		 * to find the task_struct for the new leader and start over.
+		 */
+		ret = attach_task_by_pid(cgrp, tgid, true);
+	} while (ret == -EAGAIN);
 	return ret;
 }
 
@@ -3260,9 +3605,9 @@ static struct cftype files[] = {
 	{
 		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
 		.open = cgroup_procs_open,
-		/* .write_u64 = cgroup_procs_write, TODO */
+		.write_u64 = cgroup_procs_write,
 		.release = cgroup_pidlist_release,
-		.mode = S_IRUGO,
+		.mode = S_IRUGO | S_IWUSR,
 	},
 	{
 		.name = "notify_on_release",

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v8.5 3/3] cgroups: make procs file writable
@ 2011-03-22  5:18                                 ` Ben Blum
  0 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-03-22  5:18 UTC (permalink / raw)
  To: Paul Menage
  Cc: Ben Blum, linux-kernel, containers, akpm, ebiederm, lizf,
	matthltc, oleg, David Rientjes, Miao Xie

Makes procs file writable to move all threads by tgid at once

From: Ben Blum <bblum@andrew.cmu.edu>

This patch adds functionality that enables users to move all threads in a
threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
file. This current implementation makes use of a per-threadgroup rwsem that's
taken for reading in the fork() path to prevent newly forking threads within
the threadgroup from "escaping" while the move is in progress.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
---
 Documentation/cgroups/cgroups.txt |    9 +
 kernel/cgroup.c                   |  441 +++++++++++++++++++++++++++++++++----
 2 files changed, 401 insertions(+), 49 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index d3c9a24..92d93d6 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -236,7 +236,8 @@ containing the following files describing that cgroup:
  - cgroup.procs: list of tgids in the cgroup.  This list is not
    guaranteed to be sorted or free of duplicate tgids, and userspace
    should sort/uniquify the list if this property is required.
-   This is a read-only file, for now.
+   Writing a thread group id into this file moves all threads in that
+   group into this cgroup.
  - notify_on_release flag: run the release agent on exit?
  - release_agent: the path to use for release notifications (this file
    exists in the top cgroup only)
@@ -426,6 +427,12 @@ You can attach the current shell task by echoing 0:
 
 # echo 0 > tasks
 
+You can use the cgroup.procs file instead of the tasks file to move all
+threads in a threadgroup at once. Echoing the pid of any task in a
+threadgroup to cgroup.procs causes all tasks in that threadgroup to be
+attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
+in the writing task's threadgroup.
+
 2.3 Mounting hierarchies by name
 --------------------------------
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 616f27a..273633c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1726,6 +1726,76 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 }
 EXPORT_SYMBOL_GPL(cgroup_path);
 
+/*
+ * cgroup_task_migrate - move a task from one cgroup to another.
+ *
+ * 'guarantee' is set if the caller promises that a new css_set for the task
+ * will already exist. If not set, this function might sleep, and can fail with
+ * -ENOMEM. Otherwise, it can only fail with -ESRCH.
+ */
+static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
+			       struct task_struct *tsk, bool guarantee)
+{
+	struct css_set *oldcg;
+	struct css_set *newcg;
+
+	/*
+	 * get old css_set. we need to take task_lock and refcount it, because
+	 * an exiting task can change its css_set to init_css_set and drop its
+	 * old one without taking cgroup_mutex.
+	 */
+	task_lock(tsk);
+	oldcg = tsk->cgroups;
+	get_css_set(oldcg);
+	task_unlock(tsk);
+
+	/* locate or allocate a new css_set for this task. */
+	if (guarantee) {
+		/* we know the css_set we want already exists. */
+		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+		read_lock(&css_set_lock);
+		newcg = find_existing_css_set(oldcg, cgrp, template);
+		BUG_ON(!newcg);
+		get_css_set(newcg);
+		read_unlock(&css_set_lock);
+	} else {
+		might_sleep();
+		/* find_css_set will give us newcg already referenced. */
+		newcg = find_css_set(oldcg, cgrp);
+		if (!newcg) {
+			put_css_set(oldcg);
+			return -ENOMEM;
+		}
+	}
+	put_css_set(oldcg);
+
+	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
+	task_lock(tsk);
+	if (tsk->flags & PF_EXITING) {
+		task_unlock(tsk);
+		put_css_set(newcg);
+		return -ESRCH;
+	}
+	rcu_assign_pointer(tsk->cgroups, newcg);
+	task_unlock(tsk);
+
+	/* Update the css_set linked lists if we're using them */
+	write_lock(&css_set_lock);
+	if (!list_empty(&tsk->cg_list))
+		list_move(&tsk->cg_list, &newcg->tasks);
+	write_unlock(&css_set_lock);
+
+	/*
+	 * We just gained a reference on oldcg by taking it from the task. As
+	 * trading it for newcg is protected by cgroup_mutex, we're safe to drop
+	 * it here; it will be freed under RCU.
+	 */
+	put_css_set(oldcg);
+
+	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+	return 0;
+}
+
 /**
  * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
  * @cgrp: the cgroup the task is attaching to
@@ -1736,11 +1806,9 @@ EXPORT_SYMBOL_GPL(cgroup_path);
  */
 int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
-	int retval = 0;
+	int retval;
 	struct cgroup_subsys *ss, *failed_ss = NULL;
 	struct cgroup *oldcgrp;
-	struct css_set *cg;
-	struct css_set *newcg;
 	struct cgroupfs_root *root = cgrp->root;
 
 	/* Nothing to do if the task is already in that cgroup */
@@ -1771,38 +1839,9 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		}
 	}
 
-	task_lock(tsk);
-	cg = tsk->cgroups;
-	get_css_set(cg);
-	task_unlock(tsk);
-	/*
-	 * Locate or allocate a new css_set for this task,
-	 * based on its final set of cgroups
-	 */
-	newcg = find_css_set(cg, cgrp);
-	put_css_set(cg);
-	if (!newcg) {
-		retval = -ENOMEM;
-		goto out;
-	}
-
-	task_lock(tsk);
-	if (tsk->flags & PF_EXITING) {
-		task_unlock(tsk);
-		put_css_set(newcg);
-		retval = -ESRCH;
+	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, false);
+	if (retval)
 		goto out;
-	}
-	rcu_assign_pointer(tsk->cgroups, newcg);
-	task_unlock(tsk);
-
-	/* Update the css_set linked lists if we're using them */
-	write_lock(&css_set_lock);
-	if (!list_empty(&tsk->cg_list)) {
-		list_del(&tsk->cg_list);
-		list_add(&tsk->cg_list, &newcg->tasks);
-	}
-	write_unlock(&css_set_lock);
 
 	for_each_subsys(root, ss) {
 		if (ss->pre_attach)
@@ -1812,9 +1851,8 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		if (ss->attach)
 			ss->attach(ss, cgrp, oldcgrp, tsk);
 	}
-	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+
 	synchronize_rcu();
-	put_css_set(cg);
 
 	/*
 	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
@@ -1864,49 +1902,356 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 EXPORT_SYMBOL_GPL(cgroup_attach_task_all);
 
 /*
- * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
- * held. May take task_lock of task
+ * cgroup_attach_proc works in two stages, the first of which prefetches all
+ * new css_sets needed (to make sure we have enough memory before committing
+ * to the move) and stores them in a list of entries of the following type.
+ * TODO: possible optimization: use css_set->rcu_head for chaining instead
+ */
+struct cg_list_entry {
+	struct css_set *cg;
+	struct list_head links;
+};
+
+static bool css_set_check_fetched(struct cgroup *cgrp,
+				  struct task_struct *tsk, struct css_set *cg,
+				  struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+
+	read_lock(&css_set_lock);
+	newcg = find_existing_css_set(cg, cgrp, template);
+	if (newcg)
+		get_css_set(newcg);
+	read_unlock(&css_set_lock);
+
+	/* doesn't exist at all? */
+	if (!newcg)
+		return false;
+	/* see if it's already in the list */
+	list_for_each_entry(cg_entry, newcg_list, links) {
+		if (cg_entry->cg == newcg) {
+			put_css_set(newcg);
+			return true;
+		}
+	}
+
+	/* not found */
+	put_css_set(newcg);
+	return false;
+}
+
+/*
+ * Find the new css_set and store it in the list in preparation for moving the
+ * given task to the given cgroup. Returns 0 or -ENOMEM.
+ */
+static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
+			    struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+
+	/* ensure a new css_set will exist for this thread */
+	newcg = find_css_set(cg, cgrp);
+	if (!newcg)
+		return -ENOMEM;
+	/* add it to the list */
+	cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
+	if (!cg_entry) {
+		put_css_set(newcg);
+		return -ENOMEM;
+	}
+	cg_entry->cg = newcg;
+	list_add(&cg_entry->links, newcg_list);
+	return 0;
+}
+
+/**
+ * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
+ * @cgrp: the cgroup to attach to
+ * @leader: the threadgroup leader task_struct of the group to be attached
+ *
+ * Call holding cgroup_mutex and the threadgroup_fork_lock of the leader. Will
+ * take task_lock of each thread in leader's threadgroup individually in turn.
+ */
+int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
+{
+	int retval, i, group_size;
+	struct cgroup_subsys *ss, *failed_ss = NULL;
+	bool cancel_failed_ss = false;
+	/* guaranteed to be initialized later, but the compiler needs this */
+	struct cgroup *oldcgrp = NULL;
+	struct css_set *oldcg;
+	struct cgroupfs_root *root = cgrp->root;
+	/* threadgroup list cursor and array */
+	struct task_struct *tsk;
+	struct task_struct **group;
+	/*
+	 * we need to make sure we have css_sets for all the tasks we're
+	 * going to move -before- we actually start moving them, so that in
+	 * case we get an ENOMEM we can bail out before making any changes.
+	 */
+	struct list_head newcg_list;
+	struct cg_list_entry *cg_entry, *temp_nobe;
+
+	/*
+	 * step 0: in order to do expensive, possibly blocking operations for
+	 * every thread, we cannot iterate the thread group list, since it needs
+	 * rcu or tasklist locked. instead, build an array of all threads in the
+	 * group - threadgroup_fork_lock prevents new threads from appearing,
+	 * and if threads exit, this will just be an over-estimate.
+	 */
+	group_size = get_nr_threads(leader);
+	group = kmalloc(group_size * sizeof(*group), GFP_KERNEL);
+	if (!group)
+		return -ENOMEM;
+
+	/* prevent changes to the threadgroup list while we take a snapshot. */
+	rcu_read_lock();
+	if (!thread_group_leader(leader)) {
+		/*
+		 * a race with de_thread from another thread's exec() may strip
+		 * us of our leadership, making while_each_thread unsafe to use
+		 * on this task. if this happens, there is no choice but to
+		 * throw this task away and try again (from cgroup_procs_write);
+		 * this is "double-double-toil-and-trouble-check locking".
+		 */
+		rcu_read_unlock();
+		retval = -EAGAIN;
+		goto out_free_group_list;
+	}
+	/* take a reference on each task in the group to go in the array. */
+	tsk = leader;
+	i = 0;
+	do {
+		/* as per above, nr_threads may decrease, but not increase. */
+		BUG_ON(i >= group_size);
+		get_task_struct(tsk);
+		group[i] = tsk;
+		i++;
+	} while_each_thread(leader, tsk);
+	/* remember the number of threads in the array for later. */
+	group_size = i;
+	rcu_read_unlock();
+
+	/*
+	 * step 1: check that we can legitimately attach to the cgroup.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->can_attach) {
+			retval = ss->can_attach(ss, cgrp, leader);
+			if (retval) {
+				failed_ss = ss;
+				goto out_cancel_attach;
+			}
+		}
+		/* a callback to be run on every thread in the threadgroup. */
+		if (ss->can_attach_task) {
+			/* run on each task in the threadgroup. */
+			for (i = 0; i < group_size; i++) {
+				retval = ss->can_attach_task(cgrp, group[i]);
+				if (retval) {
+					failed_ss = ss;
+					cancel_failed_ss = true;
+					goto out_cancel_attach;
+				}
+			}
+		}
+	}
+
+	/*
+	 * step 2: make sure css_sets exist for all threads to be migrated.
+	 * we use find_css_set, which allocates a new one if necessary.
+	 */
+	INIT_LIST_HEAD(&newcg_list);
+	for (i = 0; i < group_size; i++) {
+		tsk = group[i];
+		/* nothing to do if this task is already in the cgroup */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* get old css_set pointer */
+		task_lock(tsk);
+		if (tsk->flags & PF_EXITING) {
+			/* ignore this task if it's going away */
+			task_unlock(tsk);
+			continue;
+		}
+		oldcg = tsk->cgroups;
+		get_css_set(oldcg);
+		task_unlock(tsk);
+		/* see if the new one for us is already in the list? */
+		if (css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list)) {
+			/* was already there, nothing to do. */
+			put_css_set(oldcg);
+		} else {
+			/* we don't already have it. get new one. */
+			retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+			put_css_set(oldcg);
+			if (retval)
+				goto out_list_teardown;
+		}
+	}
+
+	/*
+	 * step 3: now that we're guaranteed success wrt the css_sets, proceed
+	 * to move all tasks to the new cgroup, calling ss->attach_task for each
+	 * one along the way. there are no failure cases after here, so this is
+	 * the commit point.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->pre_attach)
+			ss->pre_attach(cgrp);
+	}
+	for (i = 0; i < group_size; i++) {
+		tsk = group[i];
+		/* leave current thread as it is if it's already there */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* attach each task to each subsystem */
+		for_each_subsys(root, ss) {
+			if (ss->attach_task)
+				ss->attach_task(cgrp, tsk);
+		}
+		/* if the thread is PF_EXITING, it can just get skipped. */
+		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+	/* nothing is sensitive to fork() after this point. */
+
+	/*
+	 * step 4: do expensive, non-thread-specific subsystem callbacks.
+	 * TODO: if ever a subsystem needs to know the oldcgrp for each task
+	 * being moved, this call will need to be reworked to communicate that.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->attach)
+			ss->attach(ss, cgrp, oldcgrp, leader);
+	}
+
+	/*
+	 * step 5: success! and cleanup
+	 */
+	synchronize_rcu();
+	cgroup_wakeup_rmdir_waiter(cgrp);
+	retval = 0;
+out_list_teardown:
+	/* clean up the list of prefetched css_sets. */
+	list_for_each_entry_safe(cg_entry, temp_nobe, &newcg_list, links) {
+		list_del(&cg_entry->links);
+		put_css_set(cg_entry->cg);
+		kfree(cg_entry);
+	}
+out_cancel_attach:
+	/* same deal as in cgroup_attach_task */
+	if (retval) {
+		for_each_subsys(root, ss) {
+			if (ss == failed_ss) {
+				if (cancel_failed_ss && ss->cancel_attach)
+					ss->cancel_attach(ss, cgrp, leader);
+				break;
+			}
+			if (ss->cancel_attach)
+				ss->cancel_attach(ss, cgrp, leader);
+		}
+	}
+	/* clean up the array of referenced threads in the group. */
+	for (i = 0; i < group_size; i++)
+		put_task_struct(group[i]);
+out_free_group_list:
+	kfree(group);
+	return retval;
+}
+
+/*
+ * Find the task_struct of the task to attach by vpid and pass it along to the
+ * function to attach either it or all tasks in its threadgroup. Will take
+ * cgroup_mutex; may take task_lock of task.
  */
-static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
+static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 {
 	struct task_struct *tsk;
 	const struct cred *cred = current_cred(), *tcred;
 	int ret;
 
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+
 	if (pid) {
 		rcu_read_lock();
 		tsk = find_task_by_vpid(pid);
-		if (!tsk || tsk->flags & PF_EXITING) {
+		if (!tsk) {
 			rcu_read_unlock();
+			cgroup_unlock();
+			return -ESRCH;
+		}
+		if (threadgroup) {
+			/*
+			 * RCU protects this access, since tsk was found in the
+			 * tid map. a race with de_thread may cause group_leader
+			 * to stop being the leader, but cgroup_attach_proc will
+			 * detect it later.
+			 */
+			tsk = tsk->group_leader;
+		} else if (tsk->flags & PF_EXITING) {
+			/* optimization for the single-task-only case */
+			rcu_read_unlock();
+			cgroup_unlock();
 			return -ESRCH;
 		}
 
+		/*
+		 * even if we're attaching all tasks in the thread group, we
+		 * only need to check permissions on one of them.
+		 */
 		tcred = __task_cred(tsk);
 		if (cred->euid &&
 		    cred->euid != tcred->uid &&
 		    cred->euid != tcred->suid) {
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -EACCES;
 		}
 		get_task_struct(tsk);
 		rcu_read_unlock();
 	} else {
-		tsk = current;
+		if (threadgroup)
+			tsk = current->group_leader;
+		else
+			tsk = current;
 		get_task_struct(tsk);
 	}
 
-	ret = cgroup_attach_task(cgrp, tsk);
+	if (threadgroup) {
+		threadgroup_fork_write_lock(tsk);
+		ret = cgroup_attach_proc(cgrp, tsk);
+		threadgroup_fork_write_unlock(tsk);
+	} else {
+		ret = cgroup_attach_task(cgrp, tsk);
+	}
 	put_task_struct(tsk);
+	cgroup_unlock();
 	return ret;
 }
 
 static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
 {
+	return attach_task_by_pid(cgrp, pid, false);
+}
+
+static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
+{
 	int ret;
-	if (!cgroup_lock_live_group(cgrp))
-		return -ENODEV;
-	ret = attach_task_by_pid(cgrp, pid);
-	cgroup_unlock();
+	do {
+		/*
+		 * attach_proc fails with -EAGAIN if threadgroup leadership
+		 * changes in the middle of the operation, in which case we need
+		 * to find the task_struct for the new leader and start over.
+		 */
+		ret = attach_task_by_pid(cgrp, tgid, true);
+	} while (ret == -EAGAIN);
 	return ret;
 }
 
@@ -3260,9 +3605,9 @@ static struct cftype files[] = {
 	{
 		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
 		.open = cgroup_procs_open,
-		/* .write_u64 = cgroup_procs_write, TODO */
+		.write_u64 = cgroup_procs_write,
 		.release = cgroup_pidlist_release,
-		.mode = S_IRUGO,
+		.mode = S_IRUGO | S_IWUSR,
 	},
 	{
 		.name = "notify_on_release",

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v8.5 4/3] cgroups: use flex_array in attach_proc
       [not found]                       ` <20110322051553.GB11447-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2011-03-22  5:19                         ` Ben Blum
  0 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-03-22  5:19 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, Miao Xie, David Rientjes,
	Paul Menage, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Convert cgroup_attach_proc to use flex_array.

From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

The cgroup_attach_proc implementation requires a pre-allocated array to store
task pointers to atomically move a thread-group, but asking for a monolithic
array with kmalloc() may be unreliable for very large groups. Using flex_array
provides the same functionality with less risk of failure.

This is a post-patch for cgroup-procs-write.patch.
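
For reference, the calling pattern reduces to roughly the sketch below
(only mirroring the flex_array calls that appear in the diff; the function
name example_fill() is made up for illustration):

#include <linux/flex_array.h>

/* sketch: allocate, prealloc, fill, read back, free */
static int example_fill(void **ptrs, int n)
{
	struct flex_array *fa;
	int i, ret;

	fa = flex_array_alloc(sizeof(void *), n, GFP_KERNEL);
	if (!fa)
		return -ENOMEM;
	/* back every slot with memory up front so later puts cannot fail */
	ret = flex_array_prealloc(fa, 0, n - 1, GFP_KERNEL);
	if (ret)
		goto out;
	for (i = 0; i < n; i++)
		flex_array_put_ptr(fa, i, ptrs[i], GFP_ATOMIC);
	for (i = 0; i < n; i++)
		ptrs[i] = flex_array_get_ptr(fa, i);	/* read each one back */
out:
	flex_array_free(fa);
	return ret;
}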

Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
---
 kernel/cgroup.c |   33 ++++++++++++++++++++++++---------
 1 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 273633c..92aa794 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,7 @@
 #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 #include <linux/eventfd.h>
 #include <linux/poll.h>
+#include <linux/flex_array.h> /* used in cgroup_attach_proc */
 
 #include <asm/atomic.h>
 
@@ -1986,7 +1987,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	struct cgroupfs_root *root = cgrp->root;
 	/* threadgroup list cursor and array */
 	struct task_struct *tsk;
-	struct task_struct **group;
+	struct flex_array *group;
 	/*
 	 * we need to make sure we have css_sets for all the tasks we're
 	 * going to move -before- we actually start moving them, so that in
@@ -2003,9 +2004,15 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	 * and if threads exit, this will just be an over-estimate.
 	 */
 	group_size = get_nr_threads(leader);
-	group = kmalloc(group_size * sizeof(*group), GFP_KERNEL);
+	/* flex_array supports very large thread-groups better than kmalloc. */
+	group = flex_array_alloc(sizeof(struct task_struct *), group_size,
+				 GFP_KERNEL);
 	if (!group)
 		return -ENOMEM;
+	/* pre-allocate to guarantee space while iterating in rcu read-side. */
+	retval = flex_array_prealloc(group, 0, group_size - 1, GFP_KERNEL);
+	if (retval)
+		goto out_free_group_list;
 
 	/* prevent changes to the threadgroup list while we take a snapshot. */
 	rcu_read_lock();
@@ -2028,7 +2035,12 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 		/* as per above, nr_threads may decrease, but not increase. */
 		BUG_ON(i >= group_size);
 		get_task_struct(tsk);
-		group[i] = tsk;
+		/*
+		 * saying GFP_ATOMIC has no effect here because we did prealloc
+		 * earlier, but it's good form to communicate our expectations.
+		 */
+		retval = flex_array_put_ptr(group, i, tsk, GFP_ATOMIC);
+		BUG_ON(retval != 0);
 		i++;
 	} while_each_thread(leader, tsk);
 	/* remember the number of threads in the array for later. */
@@ -2050,7 +2062,8 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 		if (ss->can_attach_task) {
 			/* run on each task in the threadgroup. */
 			for (i = 0; i < group_size; i++) {
-				retval = ss->can_attach_task(cgrp, group[i]);
+				tsk = flex_array_get_ptr(group, i);
+				retval = ss->can_attach_task(cgrp, tsk);
 				if (retval) {
 					failed_ss = ss;
 					cancel_failed_ss = true;
@@ -2066,7 +2079,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	 */
 	INIT_LIST_HEAD(&newcg_list);
 	for (i = 0; i < group_size; i++) {
-		tsk = group[i];
+		tsk = flex_array_get_ptr(group, i);
 		/* nothing to do if this task is already in the cgroup */
 		oldcgrp = task_cgroup_from_root(tsk, root);
 		if (cgrp == oldcgrp)
@@ -2105,7 +2118,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 			ss->pre_attach(cgrp);
 	}
 	for (i = 0; i < group_size; i++) {
-		tsk = group[i];
+		tsk = flex_array_get_ptr(group, i);
 		/* leave current thread as it is if it's already there */
 		oldcgrp = task_cgroup_from_root(tsk, root);
 		if (cgrp == oldcgrp)
@@ -2158,10 +2171,12 @@ out_cancel_attach:
 		}
 	}
 	/* clean up the array of referenced threads in the group. */
-	for (i = 0; i < group_size; i++)
-		put_task_struct(group[i]);
+	for (i = 0; i < group_size; i++) {
+		tsk = flex_array_get_ptr(group, i);
+		put_task_struct(tsk);
+	}
 out_free_group_list:
-	kfree(group);
+	flex_array_free(group);
 	return retval;
 }

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v8.5 4/3] cgroups: use flex_array in attach_proc
  2011-03-22  5:15                     ` Ben Blum
       [not found]                       ` <20110322051553.GB11447-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2011-03-22  5:19                       ` Ben Blum
  1 sibling, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-03-22  5:19 UTC (permalink / raw)
  To: Ben Blum
  Cc: Paul Menage, linux-kernel, containers, akpm, ebiederm, lizf,
	matthltc, oleg, David Rientjes, Miao Xie

Convert cgroup_attach_proc to use flex_array.

From: Ben Blum <bblum@andrew.cmu.edu>

The cgroup_attach_proc implementation requires a pre-allocated array to store
task pointers to atomically move a thread-group, but asking for a monolithic
array with kmalloc() may be unreliable for very large groups. Using flex_array
provides the same functionality with less risk of failure.

This is a post-patch for cgroup-procs-write.patch.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
---
 kernel/cgroup.c |   33 ++++++++++++++++++++++++---------
 1 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 273633c..92aa794 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,7 @@
 #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 #include <linux/eventfd.h>
 #include <linux/poll.h>
+#include <linux/flex_array.h> /* used in cgroup_attach_proc */
 
 #include <asm/atomic.h>
 
@@ -1986,7 +1987,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	struct cgroupfs_root *root = cgrp->root;
 	/* threadgroup list cursor and array */
 	struct task_struct *tsk;
-	struct task_struct **group;
+	struct flex_array *group;
 	/*
 	 * we need to make sure we have css_sets for all the tasks we're
 	 * going to move -before- we actually start moving them, so that in
@@ -2003,9 +2004,15 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	 * and if threads exit, this will just be an over-estimate.
 	 */
 	group_size = get_nr_threads(leader);
-	group = kmalloc(group_size * sizeof(*group), GFP_KERNEL);
+	/* flex_array supports very large thread-groups better than kmalloc. */
+	group = flex_array_alloc(sizeof(struct task_struct *), group_size,
+				 GFP_KERNEL);
 	if (!group)
 		return -ENOMEM;
+	/* pre-allocate to guarantee space while iterating in rcu read-side. */
+	retval = flex_array_prealloc(group, 0, group_size - 1, GFP_KERNEL);
+	if (retval)
+		goto out_free_group_list;
 
 	/* prevent changes to the threadgroup list while we take a snapshot. */
 	rcu_read_lock();
@@ -2028,7 +2035,12 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 		/* as per above, nr_threads may decrease, but not increase. */
 		BUG_ON(i >= group_size);
 		get_task_struct(tsk);
-		group[i] = tsk;
+		/*
+		 * saying GFP_ATOMIC has no effect here because we did prealloc
+		 * earlier, but it's good form to communicate our expectations.
+		 */
+		retval = flex_array_put_ptr(group, i, tsk, GFP_ATOMIC);
+		BUG_ON(retval != 0);
 		i++;
 	} while_each_thread(leader, tsk);
 	/* remember the number of threads in the array for later. */
@@ -2050,7 +2062,8 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 		if (ss->can_attach_task) {
 			/* run on each task in the threadgroup. */
 			for (i = 0; i < group_size; i++) {
-				retval = ss->can_attach_task(cgrp, group[i]);
+				tsk = flex_array_get_ptr(group, i);
+				retval = ss->can_attach_task(cgrp, tsk);
 				if (retval) {
 					failed_ss = ss;
 					cancel_failed_ss = true;
@@ -2066,7 +2079,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	 */
 	INIT_LIST_HEAD(&newcg_list);
 	for (i = 0; i < group_size; i++) {
-		tsk = group[i];
+		tsk = flex_array_get_ptr(group, i);
 		/* nothing to do if this task is already in the cgroup */
 		oldcgrp = task_cgroup_from_root(tsk, root);
 		if (cgrp == oldcgrp)
@@ -2105,7 +2118,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 			ss->pre_attach(cgrp);
 	}
 	for (i = 0; i < group_size; i++) {
-		tsk = group[i];
+		tsk = flex_array_get_ptr(group, i);
 		/* leave current thread as it is if it's already there */
 		oldcgrp = task_cgroup_from_root(tsk, root);
 		if (cgrp == oldcgrp)
@@ -2158,10 +2171,12 @@ out_cancel_attach:
 		}
 	}
 	/* clean up the array of referenced threads in the group. */
-	for (i = 0; i < group_size; i++)
-		put_task_struct(group[i]);
+	for (i = 0; i < group_size; i++) {
+		tsk = flex_array_get_ptr(group, i);
+		put_task_struct(tsk);
+	}
 out_free_group_list:
-	kfree(group);
+	flex_array_free(group);
 	return retval;
 }
 

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* Re: [PATCH v8.5 3/3] cgroups: make procs file writable
       [not found]                                 ` <20110322051841.GA12055-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2011-03-29 23:27                                   ` Paul Menage
  0 siblings, 0 replies; 185+ messages in thread
From: Paul Menage @ 2011-03-29 23:27 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Miao Xie, David Rientjes, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Mon, Mar 21, 2011 at 10:18 PM, Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> Makes procs file writable to move all threads by tgid at once
>
> From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
>
> This patch adds functionality that enables users to move all threads in a
> threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
> file. This current implementation makes use of a per-threadgroup rwsem that's
> taken for reading in the fork() path to prevent newly forking threads within
> the threadgroup from "escaping" while the move is in progress.
>
> Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

Reviewed-by: Paul Menage <menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

OK, I guess this is ready to go in :-)

Paul

> ---
>  Documentation/cgroups/cgroups.txt |    9 +
>  kernel/cgroup.c                   |  441 +++++++++++++++++++++++++++++++++----
>  2 files changed, 401 insertions(+), 49 deletions(-)
>
> diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
> index d3c9a24..92d93d6 100644
> --- a/Documentation/cgroups/cgroups.txt
> +++ b/Documentation/cgroups/cgroups.txt
> @@ -236,7 +236,8 @@ containing the following files describing that cgroup:
>  - cgroup.procs: list of tgids in the cgroup.  This list is not
>    guaranteed to be sorted or free of duplicate tgids, and userspace
>    should sort/uniquify the list if this property is required.
> -   This is a read-only file, for now.
> +   Writing a thread group id into this file moves all threads in that
> +   group into this cgroup.
>  - notify_on_release flag: run the release agent on exit?
>  - release_agent: the path to use for release notifications (this file
>    exists in the top cgroup only)
> @@ -426,6 +427,12 @@ You can attach the current shell task by echoing 0:
>
>  # echo 0 > tasks
>
> +You can use the cgroup.procs file instead of the tasks file to move all
> +threads in a threadgroup at once. Echoing the pid of any task in a
> +threadgroup to cgroup.procs causes all tasks in that threadgroup to be
> +attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
> +in the writing task's threadgroup.
> +
>  2.3 Mounting hierarchies by name
>  --------------------------------
>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 616f27a..273633c 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1726,6 +1726,76 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
>  }
>  EXPORT_SYMBOL_GPL(cgroup_path);
>
> +/*
> + * cgroup_task_migrate - move a task from one cgroup to another.
> + *
> + * 'guarantee' is set if the caller promises that a new css_set for the task
> + * will already exist. If not set, this function might sleep, and can fail with
> + * -ENOMEM. Otherwise, it can only fail with -ESRCH.
> + */
> +static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
> +                              struct task_struct *tsk, bool guarantee)
> +{
> +       struct css_set *oldcg;
> +       struct css_set *newcg;
> +
> +       /*
> +        * get old css_set. we need to take task_lock and refcount it, because
> +        * an exiting task can change its css_set to init_css_set and drop its
> +        * old one without taking cgroup_mutex.
> +        */
> +       task_lock(tsk);
> +       oldcg = tsk->cgroups;
> +       get_css_set(oldcg);
> +       task_unlock(tsk);
> +
> +       /* locate or allocate a new css_set for this task. */
> +       if (guarantee) {
> +               /* we know the css_set we want already exists. */
> +               struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
> +               read_lock(&css_set_lock);
> +               newcg = find_existing_css_set(oldcg, cgrp, template);
> +               BUG_ON(!newcg);
> +               get_css_set(newcg);
> +               read_unlock(&css_set_lock);
> +       } else {
> +               might_sleep();
> +               /* find_css_set will give us newcg already referenced. */
> +               newcg = find_css_set(oldcg, cgrp);
> +               if (!newcg) {
> +                       put_css_set(oldcg);
> +                       return -ENOMEM;
> +               }
> +       }
> +       put_css_set(oldcg);
> +
> +       /* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
> +       task_lock(tsk);
> +       if (tsk->flags & PF_EXITING) {
> +               task_unlock(tsk);
> +               put_css_set(newcg);
> +               return -ESRCH;
> +       }
> +       rcu_assign_pointer(tsk->cgroups, newcg);
> +       task_unlock(tsk);
> +
> +       /* Update the css_set linked lists if we're using them */
> +       write_lock(&css_set_lock);
> +       if (!list_empty(&tsk->cg_list))
> +               list_move(&tsk->cg_list, &newcg->tasks);
> +       write_unlock(&css_set_lock);
> +
> +       /*
> +        * We just gained a reference on oldcg by taking it from the task. As
> +        * trading it for newcg is protected by cgroup_mutex, we're safe to drop
> +        * it here; it will be freed under RCU.
> +        */
> +       put_css_set(oldcg);
> +
> +       set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
> +       return 0;
> +}
> +
>  /**
>  * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
>  * @cgrp: the cgroup the task is attaching to
> @@ -1736,11 +1806,9 @@ EXPORT_SYMBOL_GPL(cgroup_path);
>  */
>  int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  {
> -       int retval = 0;
> +       int retval;
>        struct cgroup_subsys *ss, *failed_ss = NULL;
>        struct cgroup *oldcgrp;
> -       struct css_set *cg;
> -       struct css_set *newcg;
>        struct cgroupfs_root *root = cgrp->root;
>
>        /* Nothing to do if the task is already in that cgroup */
> @@ -1771,38 +1839,9 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>                }
>        }
>
> -       task_lock(tsk);
> -       cg = tsk->cgroups;
> -       get_css_set(cg);
> -       task_unlock(tsk);
> -       /*
> -        * Locate or allocate a new css_set for this task,
> -        * based on its final set of cgroups
> -        */
> -       newcg = find_css_set(cg, cgrp);
> -       put_css_set(cg);
> -       if (!newcg) {
> -               retval = -ENOMEM;
> -               goto out;
> -       }
> -
> -       task_lock(tsk);
> -       if (tsk->flags & PF_EXITING) {
> -               task_unlock(tsk);
> -               put_css_set(newcg);
> -               retval = -ESRCH;
> +       retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, false);
> +       if (retval)
>                goto out;
> -       }
> -       rcu_assign_pointer(tsk->cgroups, newcg);
> -       task_unlock(tsk);
> -
> -       /* Update the css_set linked lists if we're using them */
> -       write_lock(&css_set_lock);
> -       if (!list_empty(&tsk->cg_list)) {
> -               list_del(&tsk->cg_list);
> -               list_add(&tsk->cg_list, &newcg->tasks);
> -       }
> -       write_unlock(&css_set_lock);
>
>        for_each_subsys(root, ss) {
>                if (ss->pre_attach)
> @@ -1812,9 +1851,8 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>                if (ss->attach)
>                        ss->attach(ss, cgrp, oldcgrp, tsk);
>        }
> -       set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
> +
>        synchronize_rcu();
> -       put_css_set(cg);
>
>        /*
>         * wake up rmdir() waiter. the rmdir should fail since the cgroup
> @@ -1864,49 +1902,356 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
>  EXPORT_SYMBOL_GPL(cgroup_attach_task_all);
>
>  /*
> - * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
> - * held. May take task_lock of task
> + * cgroup_attach_proc works in two stages, the first of which prefetches all
> + * new css_sets needed (to make sure we have enough memory before committing
> + * to the move) and stores them in a list of entries of the following type.
> + * TODO: possible optimization: use css_set->rcu_head for chaining instead
> + */
> +struct cg_list_entry {
> +       struct css_set *cg;
> +       struct list_head links;
> +};
> +
> +static bool css_set_check_fetched(struct cgroup *cgrp,
> +                                 struct task_struct *tsk, struct css_set *cg,
> +                                 struct list_head *newcg_list)
> +{
> +       struct css_set *newcg;
> +       struct cg_list_entry *cg_entry;
> +       struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
> +
> +       read_lock(&css_set_lock);
> +       newcg = find_existing_css_set(cg, cgrp, template);
> +       if (newcg)
> +               get_css_set(newcg);
> +       read_unlock(&css_set_lock);
> +
> +       /* doesn't exist at all? */
> +       if (!newcg)
> +               return false;
> +       /* see if it's already in the list */
> +       list_for_each_entry(cg_entry, newcg_list, links) {
> +               if (cg_entry->cg == newcg) {
> +                       put_css_set(newcg);
> +                       return true;
> +               }
> +       }
> +
> +       /* not found */
> +       put_css_set(newcg);
> +       return false;
> +}
> +
> +/*
> + * Find the new css_set and store it in the list in preparation for moving the
> + * given task to the given cgroup. Returns 0 or -ENOMEM.
> + */
> +static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
> +                           struct list_head *newcg_list)
> +{
> +       struct css_set *newcg;
> +       struct cg_list_entry *cg_entry;
> +
> +       /* ensure a new css_set will exist for this thread */
> +       newcg = find_css_set(cg, cgrp);
> +       if (!newcg)
> +               return -ENOMEM;
> +       /* add it to the list */
> +       cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
> +       if (!cg_entry) {
> +               put_css_set(newcg);
> +               return -ENOMEM;
> +       }
> +       cg_entry->cg = newcg;
> +       list_add(&cg_entry->links, newcg_list);
> +       return 0;
> +}
> +
> +/**
> + * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
> + * @cgrp: the cgroup to attach to
> + * @leader: the threadgroup leader task_struct of the group to be attached
> + *
> + * Call holding cgroup_mutex and the threadgroup_fork_lock of the leader. Will
> + * take task_lock of each thread in leader's threadgroup individually in turn.
> + */
> +int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
> +{
> +       int retval, i, group_size;
> +       struct cgroup_subsys *ss, *failed_ss = NULL;
> +       bool cancel_failed_ss = false;
> +       /* guaranteed to be initialized later, but the compiler needs this */
> +       struct cgroup *oldcgrp = NULL;
> +       struct css_set *oldcg;
> +       struct cgroupfs_root *root = cgrp->root;
> +       /* threadgroup list cursor and array */
> +       struct task_struct *tsk;
> +       struct task_struct **group;
> +       /*
> +        * we need to make sure we have css_sets for all the tasks we're
> +        * going to move -before- we actually start moving them, so that in
> +        * case we get an ENOMEM we can bail out before making any changes.
> +        */
> +       struct list_head newcg_list;
> +       struct cg_list_entry *cg_entry, *temp_nobe;
> +
> +       /*
> +        * step 0: in order to do expensive, possibly blocking operations for
> +        * every thread, we cannot iterate the thread group list, since it needs
> +        * rcu or tasklist locked. instead, build an array of all threads in the
> +        * group - threadgroup_fork_lock prevents new threads from appearing,
> +        * and if threads exit, this will just be an over-estimate.
> +        */
> +       group_size = get_nr_threads(leader);
> +       group = kmalloc(group_size * sizeof(*group), GFP_KERNEL);
> +       if (!group)
> +               return -ENOMEM;
> +
> +       /* prevent changes to the threadgroup list while we take a snapshot. */
> +       rcu_read_lock();
> +       if (!thread_group_leader(leader)) {
> +               /*
> +                * a race with de_thread from another thread's exec() may strip
> +                * us of our leadership, making while_each_thread unsafe to use
> +                * on this task. if this happens, there is no choice but to
> +                * throw this task away and try again (from cgroup_procs_write);
> +                * this is "double-double-toil-and-trouble-check locking".
> +                */
> +               rcu_read_unlock();
> +               retval = -EAGAIN;
> +               goto out_free_group_list;
> +       }
> +       /* take a reference on each task in the group to go in the array. */
> +       tsk = leader;
> +       i = 0;
> +       do {
> +               /* as per above, nr_threads may decrease, but not increase. */
> +               BUG_ON(i >= group_size);
> +               get_task_struct(tsk);
> +               group[i] = tsk;
> +               i++;
> +       } while_each_thread(leader, tsk);
> +       /* remember the number of threads in the array for later. */
> +       group_size = i;
> +       rcu_read_unlock();
> +
> +       /*
> +        * step 1: check that we can legitimately attach to the cgroup.
> +        */
> +       for_each_subsys(root, ss) {
> +               if (ss->can_attach) {
> +                       retval = ss->can_attach(ss, cgrp, leader);
> +                       if (retval) {
> +                               failed_ss = ss;
> +                               goto out_cancel_attach;
> +                       }
> +               }
> +               /* a callback to be run on every thread in the threadgroup. */
> +               if (ss->can_attach_task) {
> +                       /* run on each task in the threadgroup. */
> +                       for (i = 0; i < group_size; i++) {
> +                               retval = ss->can_attach_task(cgrp, group[i]);
> +                               if (retval) {
> +                                       failed_ss = ss;
> +                                       cancel_failed_ss = true;
> +                                       goto out_cancel_attach;
> +                               }
> +                       }
> +               }
> +       }
> +
> +       /*
> +        * step 2: make sure css_sets exist for all threads to be migrated.
> +        * we use find_css_set, which allocates a new one if necessary.
> +        */
> +       INIT_LIST_HEAD(&newcg_list);
> +       for (i = 0; i < group_size; i++) {
> +               tsk = group[i];
> +               /* nothing to do if this task is already in the cgroup */
> +               oldcgrp = task_cgroup_from_root(tsk, root);
> +               if (cgrp == oldcgrp)
> +                       continue;
> +               /* get old css_set pointer */
> +               task_lock(tsk);
> +               if (tsk->flags & PF_EXITING) {
> +                       /* ignore this task if it's going away */
> +                       task_unlock(tsk);
> +                       continue;
> +               }
> +               oldcg = tsk->cgroups;
> +               get_css_set(oldcg);
> +               task_unlock(tsk);
> +               /* see if the new one for us is already in the list? */
> +               if (css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list)) {
> +                       /* was already there, nothing to do. */
> +                       put_css_set(oldcg);
> +               } else {
> +                       /* we don't already have it. get new one. */
> +                       retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
> +                       put_css_set(oldcg);
> +                       if (retval)
> +                               goto out_list_teardown;
> +               }
> +       }
> +
> +       /*
> +        * step 3: now that we're guaranteed success wrt the css_sets, proceed
> +        * to move all tasks to the new cgroup, calling ss->attach_task for each
> +        * one along the way. there are no failure cases after here, so this is
> +        * the commit point.
> +        */
> +       for_each_subsys(root, ss) {
> +               if (ss->pre_attach)
> +                       ss->pre_attach(cgrp);
> +       }
> +       for (i = 0; i < group_size; i++) {
> +               tsk = group[i];
> +               /* leave current thread as it is if it's already there */
> +               oldcgrp = task_cgroup_from_root(tsk, root);
> +               if (cgrp == oldcgrp)
> +                       continue;
> +               /* attach each task to each subsystem */
> +               for_each_subsys(root, ss) {
> +                       if (ss->attach_task)
> +                               ss->attach_task(cgrp, tsk);
> +               }
> +               /* if the thread is PF_EXITING, it can just get skipped. */
> +               retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
> +               BUG_ON(retval != 0 && retval != -ESRCH);
> +       }
> +       /* nothing is sensitive to fork() after this point. */
> +
> +       /*
> +        * step 4: do expensive, non-thread-specific subsystem callbacks.
> +        * TODO: if ever a subsystem needs to know the oldcgrp for each task
> +        * being moved, this call will need to be reworked to communicate that.
> +        */
> +       for_each_subsys(root, ss) {
> +               if (ss->attach)
> +                       ss->attach(ss, cgrp, oldcgrp, leader);
> +       }
> +
> +       /*
> +        * step 5: success! and cleanup
> +        */
> +       synchronize_rcu();
> +       cgroup_wakeup_rmdir_waiter(cgrp);
> +       retval = 0;
> +out_list_teardown:
> +       /* clean up the list of prefetched css_sets. */
> +       list_for_each_entry_safe(cg_entry, temp_nobe, &newcg_list, links) {
> +               list_del(&cg_entry->links);
> +               put_css_set(cg_entry->cg);
> +               kfree(cg_entry);
> +       }
> +out_cancel_attach:
> +       /* same deal as in cgroup_attach_task */
> +       if (retval) {
> +               for_each_subsys(root, ss) {
> +                       if (ss == failed_ss) {
> +                               if (cancel_failed_ss && ss->cancel_attach)
> +                                       ss->cancel_attach(ss, cgrp, leader);
> +                               break;
> +                       }
> +                       if (ss->cancel_attach)
> +                               ss->cancel_attach(ss, cgrp, leader);
> +               }
> +       }
> +       /* clean up the array of referenced threads in the group. */
> +       for (i = 0; i < group_size; i++)
> +               put_task_struct(group[i]);
> +out_free_group_list:
> +       kfree(group);
> +       return retval;
> +}
> +
> +/*
> + * Find the task_struct of the task to attach by vpid and pass it along to the
> + * function to attach either it or all tasks in its threadgroup. Will take
> + * cgroup_mutex; may take task_lock of task.
>  */
> -static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
> +static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
>  {
>        struct task_struct *tsk;
>        const struct cred *cred = current_cred(), *tcred;
>        int ret;
>
> +       if (!cgroup_lock_live_group(cgrp))
> +               return -ENODEV;
> +
>        if (pid) {
>                rcu_read_lock();
>                tsk = find_task_by_vpid(pid);
> -               if (!tsk || tsk->flags & PF_EXITING) {
> +               if (!tsk) {
>                        rcu_read_unlock();
> +                       cgroup_unlock();
> +                       return -ESRCH;
> +               }
> +               if (threadgroup) {
> +                       /*
> +                        * RCU protects this access, since tsk was found in the
> +                        * tid map. a race with de_thread may cause group_leader
> +                        * to stop being the leader, but cgroup_attach_proc will
> +                        * detect it later.
> +                        */
> +                       tsk = tsk->group_leader;
> +               } else if (tsk->flags & PF_EXITING) {
> +                       /* optimization for the single-task-only case */
> +                       rcu_read_unlock();
> +                       cgroup_unlock();
>                        return -ESRCH;
>                }
>
> +               /*
> +                * even if we're attaching all tasks in the thread group, we
> +                * only need to check permissions on one of them.
> +                */
>                tcred = __task_cred(tsk);
>                if (cred->euid &&
>                    cred->euid != tcred->uid &&
>                    cred->euid != tcred->suid) {
>                        rcu_read_unlock();
> +                       cgroup_unlock();
>                        return -EACCES;
>                }
>                get_task_struct(tsk);
>                rcu_read_unlock();
>        } else {
> -               tsk = current;
> +               if (threadgroup)
> +                       tsk = current->group_leader;
> +               else
> +                       tsk = current;
>                get_task_struct(tsk);
>        }
>
> -       ret = cgroup_attach_task(cgrp, tsk);
> +       if (threadgroup) {
> +               threadgroup_fork_write_lock(tsk);
> +               ret = cgroup_attach_proc(cgrp, tsk);
> +               threadgroup_fork_write_unlock(tsk);
> +       } else {
> +               ret = cgroup_attach_task(cgrp, tsk);
> +       }
>        put_task_struct(tsk);
> +       cgroup_unlock();
>        return ret;
>  }
>
>  static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
>  {
> +       return attach_task_by_pid(cgrp, pid, false);
> +}
> +
> +static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
> +{
>        int ret;
> -       if (!cgroup_lock_live_group(cgrp))
> -               return -ENODEV;
> -       ret = attach_task_by_pid(cgrp, pid);
> -       cgroup_unlock();
> +       do {
> +               /*
> +                * attach_proc fails with -EAGAIN if threadgroup leadership
> +                * changes in the middle of the operation, in which case we need
> +                * to find the task_struct for the new leader and start over.
> +                */
> +               ret = attach_task_by_pid(cgrp, tgid, true);
> +       } while (ret == -EAGAIN);
>        return ret;
>  }
>
> @@ -3260,9 +3605,9 @@ static struct cftype files[] = {
>        {
>                .name = CGROUP_FILE_GENERIC_PREFIX "procs",
>                .open = cgroup_procs_open,
> -               /* .write_u64 = cgroup_procs_write, TODO */
> +               .write_u64 = cgroup_procs_write,
>                .release = cgroup_pidlist_release,
> -               .mode = S_IRUGO,
> +               .mode = S_IRUGO | S_IWUSR,
>        },
>        {
>                .name = "notify_on_release",
>
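
For a sense of how the new interface is driven from userspace, here is a
small illustrative program; it is not part of the patch, the /dev/cgroup/foo
path just matches the example hierarchy used earlier in the thread, and
error handling is kept minimal:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* move every thread in the threadgroup of <tgid> into /dev/cgroup/foo */
int main(int argc, char **argv)
{
        char buf[32];
        int fd, len;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <tgid>\n", argv[0]);
                return 1;
        }
        fd = open("/dev/cgroup/foo/cgroup.procs", O_WRONLY);
        if (fd < 0) {
                perror("open cgroup.procs");
                return 1;
        }
        len = snprintf(buf, sizeof(buf), "%d", atoi(argv[1]));
        /* a single write of the tgid moves the whole threadgroup at once */
        if (write(fd, buf, len) != len) {
                perror("write cgroup.procs");
                close(fd);
                return 1;
        }
        close(fd);
        return 0;
}

Per the documentation hunk above, writing 0 instead of a tgid moves the
writing task's own threadgroup.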

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8.5 3/3] cgroups: make procs file writable
       [not found]                                   ` <BANLkTikMgd5HvMyC1BTGzAtj_=Jk=wZm+A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-03-29 23:39                                     ` Andrew Morton
  0 siblings, 0 replies; 185+ messages in thread
From: Andrew Morton @ 2011-03-29 23:39 UTC (permalink / raw)
  To: Paul Menage
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Miao Xie, David Rientjes, ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Tue, 29 Mar 2011 16:27:19 -0700
Paul Menage <menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:

> On Mon, Mar 21, 2011 at 10:18 PM, Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> > Makes procs file writable to move all threads by tgid at once
> >
> > From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
> >
> > This patch adds functionality that enables users to move all threads in a
> > threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
> > file. This current implementation makes use of a per-threadgroup rwsem that's
> > taken for reading in the fork() path to prevent newly forking threads within
> > the threadgroup from "escaping" while the move is in progress.
> >
> > Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
> 
> Reviewed-by: Paul Menage <menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> 
> OK, I guess this is ready to go in :-)

It all needs a refresh, retest and resend please.

^ permalink raw reply	[flat|nested] 185+ messages in thread

* [PATCH v8.75 0/4] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
       [not found]             ` <20110208013542.GC31569-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
                                 ` (3 preceding siblings ...)
  2011-02-09 23:10               ` [PATCH v8 0/3] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Andrew Morton
@ 2011-04-06 19:44               ` Ben Blum
  4 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-04-06 19:44 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Miao Xie, David Rientjes, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Mon, Feb 07, 2011 at 08:35:42PM -0500, Ben Blum wrote:
> On Sun, Dec 26, 2010 at 07:09:19AM -0500, Ben Blum wrote:
> > On Fri, Dec 24, 2010 at 03:22:26AM -0500, Ben Blum wrote:
> > > On Wed, Aug 11, 2010 at 01:46:04AM -0400, Ben Blum wrote:
> > > > On Fri, Jul 30, 2010 at 07:56:49PM -0400, Ben Blum wrote:
> > > > > This patch series is a revision of http://lkml.org/lkml/2010/6/25/11 .
> > > > > 
> > > > > This patch series implements a write function for the 'cgroup.procs'
> > > > > per-cgroup file, which enables atomic movement of multithreaded
> > > > > applications between cgroups. Writing the thread-ID of any thread in a
> > > > > threadgroup to a cgroup's procs file causes all threads in the group to
> > > > > be moved to that cgroup safely with respect to threads forking/exiting.
> > > > > (Possible usage scenario: If running a multithreaded build system that
> > > > > sucks up system resources, this lets you restrict it all at once into a
> > > > > new cgroup to keep it under control.)
> > > > > 
> > > > > Example: Suppose pid 31337 clones new threads 31338 and 31339.
> > > > > 
> > > > > # cat /dev/cgroup/tasks
> > > > > ...
> > > > > 31337
> > > > > 31338
> > > > > 31339
> > > > > # mkdir /dev/cgroup/foo
> > > > > # echo 31337 > /dev/cgroup/foo/cgroup.procs
> > > > > # cat /dev/cgroup/foo/tasks
> > > > > 31337
> > > > > 31338
> > > > > 31339
> > > > > 
> > > > > A new lock, called threadgroup_fork_lock and living in signal_struct, is
> > > > > introduced to ensure atomicity when moving threads between cgroups. It's
> > > > > taken for writing during the operation, and taken for reading in fork()
> > > > > around the calls to cgroup_fork() and cgroup_post_fork().
> > 
> > Well this time everything here is actually safe and correct, as far as
> > my best efforts and keen eyes can tell. I dropped the per_thread call
> > from the last series in favour of revising the subsystem callback
> > interface. It now looks like this:
> > 
> > ss->can_attach()
> >  - Thread-independent, possibly expensive/sleeping.
> > 
> > ss->can_attach_task()
> >  - Called per-thread, run with rcu_read so must not sleep.
> > 
> > ss->pre_attach()
> >  - Thread independent, must be atomic, happens before attach_task.
> > 
> > ss->attach_task()
> >  - Called per-thread, run with tasklist_lock so must not sleep.
> > 
> > ss->attach()
> >  - Thread independent, possibly expensive/sleeping, called last.
> 
> Okay, so.
> 
> I've revamped the cgroup_attach_proc implementation a bunch and this
> version should be a lot easier on the eyes (and brains). Issues that are
> addressed:
> 
> 1) cgroup_attach_proc now iterates over leader->thread_group once, at
>    the very beginning, and puts each task_struct that we want to move
>    into an array, using get_task_struct to make sure they stick around.
>     - threadgroup_fork_lock ensures no threads not in the array can
>       appear, and allows us to use signal->nr_threads to determine the
>       size of the array when kmallocing it.
>     - This simplifies the rest of the function a bunch, since now we
>       never need to do rcu_read_lock after building the array. All the
>       subsystem callbacks are the same as described just above, but the
>       "can't sleep" restriction is gone, so it's nice and clean.
>     - Checking for a race with de_thread (the manoeuvre I refer to as
>       "double-double-toil-and-trouble-check locking") now needs to be
>       done only once, at the beginning (before building the array).
> 
> 2) The nodemask allocation problem in cpuset is fixed the same way as
>    before - the masks are shared between the three attach callbacks, so
>    are made as static global variables.
> 
> 3) The introduction of threadgroup_fork_lock in sched.h (specifically,
>    in signal_struct) requires rwsem.h; the new include appears in the
>    first patch. (An alternate plan would be to make it a struct pointer
>    with an incomplete forward declaration and kmalloc/kfree it during
>    housekeeping, but adding an include seems better than that particular
>    complication.) In light of this, the definitions for
>    threadgroup_fork_{read,write}_{un,}lock are also in sched.h.

Same as before; using flex_array in attach_proc (thanks Kame).
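
To make the callback split quoted above concrete, here is a purely
illustrative subsystem ("foo" is a made-up name; none of this code is from
the series), with signatures matching the calls made from
cgroup_attach_task() and cgroup_attach_proc():

/*
 * Illustrative only -- not code from the series. Registration details
 * (subsys_id, create/destroy) are omitted.
 */
#include <linux/cgroup.h>

static int foo_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
                          struct task_struct *leader)
{
        /* thread-independent check; may sleep or allocate */
        return 0;
}

static int foo_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
{
        /* called once per thread being moved; may veto the whole attach */
        return 0;
}

static void foo_pre_attach(struct cgroup *cgrp)
{
        /* thread-independent setup, runs once before the per-task work */
}

static void foo_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
{
        /* per-thread commit work for each task being moved */
}

static void foo_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
                       struct cgroup *oldcgrp, struct task_struct *leader)
{
        /* thread-independent, possibly expensive/sleeping, runs last */
}

struct cgroup_subsys foo_subsys = {
        .name            = "foo",
        .can_attach      = foo_can_attach,
        .can_attach_task = foo_can_attach_task,
        .pre_attach      = foo_pre_attach,
        .attach_task     = foo_attach_task,
        .attach          = foo_attach,
};

The real conversions are the cgroup_freezer, cpuset, sched, blkio, memcg
and device_cgroup changes visible in the diffstat below.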

-- Ben

---
 Documentation/cgroups/cgroups.txt |   39 ++-
 block/blk-cgroup.c                |   18 -
 include/linux/cgroup.h            |   10 
 include/linux/init_task.h         |    9 
 include/linux/sched.h             |   36 ++
 kernel/cgroup.c                   |  489 +++++++++++++++++++++++++++++++++-----
 kernel/cgroup_freezer.c           |   26 --
 kernel/cpuset.c                   |   96 +++----
 kernel/fork.c                     |   10 
 kernel/sched.c                    |   38 --
 mm/memcontrol.c                   |   18 -
 security/device_cgroup.c          |    3 
 12 files changed, 594 insertions(+), 198 deletions(-)
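
A note for readers jumping straight to the diffs: the
threadgroup_fork_{read,write}_{un,}lock helpers named in item (3) above are
thin wrappers around the rwsem in signal_struct. The sketch below shows what
they presumably reduce to, as they would sit in sched.h; the authoritative
definitions are in patch 1/4:

#ifdef CONFIG_CGROUPS
static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
{
        down_read(&tsk->signal->threadgroup_fork_lock);
}
static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
{
        up_read(&tsk->signal->threadgroup_fork_lock);
}
static inline void threadgroup_fork_write_lock(struct task_struct *tsk)
{
        down_write(&tsk->signal->threadgroup_fork_lock);
}
static inline void threadgroup_fork_write_unlock(struct task_struct *tsk)
{
        up_write(&tsk->signal->threadgroup_fork_lock);
}
#else
/* no-op stubs so callers need no #ifdefs when cgroups are configured out */
static inline void threadgroup_fork_read_lock(struct task_struct *tsk) {}
static inline void threadgroup_fork_read_unlock(struct task_struct *tsk) {}
static inline void threadgroup_fork_write_lock(struct task_struct *tsk) {}
static inline void threadgroup_fork_write_unlock(struct task_struct *tsk) {}
#endif
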

^ permalink raw reply	[flat|nested] 185+ messages in thread

* [PATCH v8.75 1/4] cgroups: read-write lock CLONE_THREAD forking per threadgroup
       [not found]               ` <20110406194420.GC10792-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2011-04-06 19:45                 ` Ben Blum
  2011-04-06 19:46                 ` [PATCH v8.75 2/4] cgroups: add per-thread subsystem callbacks Ben Blum
                                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-04-06 19:45 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Miao Xie, David Rientjes, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup

From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

This patch adds an rwsem that lives in a threadgroup's signal_struct that's
taken for reading in the fork path, under CONFIG_CGROUPS. If another part of
the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
ifdefs should be changed to a higher-up flag that CGROUPS and the other system
would both depend on.

This is a pre-patch for cgroup-procs-write.patch.
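
As a usage sketch (the caller below is hypothetical and not part of this
patch), a threadgroup-wide operation brackets its work with the write side,
pairing with the read side taken in copy_process(), so that no new
CLONE_THREAD child can appear part-way through:

static void for_whole_threadgroup_sketch(struct task_struct *leader)
{
	struct task_struct *t = leader;

	threadgroup_fork_write_lock(leader);
	rcu_read_lock();
	do {
		/* ... per-thread work on t; membership cannot grow here ... */
	} while_each_thread(leader, t);
	rcu_read_unlock();
	threadgroup_fork_write_unlock(leader);
}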

Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
---
 include/linux/init_task.h |    9 +++++++++
 include/linux/sched.h     |   36 ++++++++++++++++++++++++++++++++++++
 kernel/fork.c             |   10 ++++++++++
 3 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index caa151f..7bf5257 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -22,6 +22,14 @@
 extern struct files_struct init_files;
 extern struct fs_struct init_fs;
 
+#ifdef CONFIG_CGROUPS
+#define INIT_THREADGROUP_FORK_LOCK(sig)					\
+	.threadgroup_fork_lock =					\
+		__RWSEM_INITIALIZER(sig.threadgroup_fork_lock),
+#else
+#define INIT_THREADGROUP_FORK_LOCK(sig)
+#endif
+
 #define INIT_SIGNALS(sig) {						\
 	.nr_threads	= 1,						\
 	.wait_chldexit	= __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
@@ -38,6 +46,7 @@ extern struct fs_struct init_fs;
 	},								\
 	.cred_guard_mutex =						\
 		 __MUTEX_INITIALIZER(sig.cred_guard_mutex),		\
+	INIT_THREADGROUP_FORK_LOCK(sig)					\
 }
 
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3509d00..a219c69 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -514,6 +514,7 @@ struct thread_group_cputimer {
 	spinlock_t lock;
 };
 
+#include <linux/rwsem.h>
 struct autogroup;
 
 /*
@@ -633,6 +634,16 @@ struct signal_struct {
 	unsigned audit_tty;
 	struct tty_audit_buf *tty_audit_buf;
 #endif
+#ifdef CONFIG_CGROUPS
+	/*
+	 * The threadgroup_fork_lock prevents threads from forking with
+	 * CLONE_THREAD while held for writing. Use this for fork-sensitive
+	 * threadgroup-wide operations. It's taken for reading in fork.c in
+	 * copy_process().
+	 * Currently only needed write-side by cgroups.
+	 */
+	struct rw_semaphore threadgroup_fork_lock;
+#endif
 
 	int oom_adj;		/* OOM kill score adjustment (bit shift) */
 	int oom_score_adj;	/* OOM kill score adjustment */
@@ -2307,6 +2318,31 @@ static inline void unlock_task_sighand(struct task_struct *tsk,
 	spin_unlock_irqrestore(&tsk->sighand->siglock, *flags);
 }
 
+/* See the declaration of threadgroup_fork_lock in signal_struct. */
+#ifdef CONFIG_CGROUPS
+static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
+{
+	down_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
+{
+	up_read(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_lock(struct task_struct *tsk)
+{
+	down_write(&tsk->signal->threadgroup_fork_lock);
+}
+static inline void threadgroup_fork_write_unlock(struct task_struct *tsk)
+{
+	up_write(&tsk->signal->threadgroup_fork_lock);
+}
+#else
+static inline void threadgroup_fork_read_lock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_read_unlock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_write_lock(struct task_struct *tsk) {}
+static inline void threadgroup_fork_write_unlock(struct task_struct *tsk) {}
+#endif
+
 #ifndef __HAVE_THREAD_FUNCTIONS
 
 #define task_thread_info(task)	((struct thread_info *)(task)->stack)
diff --git a/kernel/fork.c b/kernel/fork.c
index 41d2062..aef33ac 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -927,6 +927,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	tty_audit_fork(sig);
 	sched_autogroup_fork(sig);
 
+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->threadgroup_fork_lock);
+#endif
+
 	sig->oom_adj = current->signal->oom_adj;
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
@@ -1109,6 +1113,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	monotonic_to_bootbased(&p->real_start_time);
 	p->io_context = NULL;
 	p->audit_context = NULL;
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_lock(current);
 	cgroup_fork(p);
 #ifdef CONFIG_NUMA
 	p->mempolicy = mpol_dup(p->mempolicy);
@@ -1307,6 +1313,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	write_unlock_irq(&tasklist_lock);
 	proc_fork_connector(p);
 	cgroup_post_fork(p);
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_unlock(current);
 	perf_event_fork(p);
 	return p;
 
@@ -1345,6 +1353,8 @@ bad_fork_cleanup_policy:
 	mpol_put(p->mempolicy);
 bad_fork_cleanup_cgroup:
 #endif
+	if (clone_flags & CLONE_THREAD)
+		threadgroup_fork_read_unlock(current);
 	cgroup_exit(p, cgroup_callbacks_done);
 	delayacct_tsk_free(p);
 	module_put(task_thread_info(p)->exec_domain->module);

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v8.75 2/4] cgroups: add per-thread subsystem callbacks
       [not found]               ` <20110406194420.GC10792-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2011-04-06 19:45                 ` Ben Blum
@ 2011-04-06 19:46                 ` Ben Blum
  2011-04-06 19:46                 ` [PATCH v8.75 3/4] cgroups: make procs file writable Ben Blum
                                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-04-06 19:46 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Miao Xie, David Rientjes, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

Add cgroup subsystem callbacks for per-thread attachment in atomic contexts

From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

This patch adds can_attach_task, pre_attach, and attach_task as new callbacks
for the cgroups subsystem interface. Unlike can_attach and attach, these are
for per-thread operations, to be called potentially many times when attaching
an entire threadgroup.

Also, the old "bool threadgroup" interface is removed, since these callbacks
replace it. All subsystems are modified for the new interface - of note is
cpuset, which requires the from/to nodemasks used for attach to be globally
scoped (though per-cpuset would work too) so that they persist from its
pre_attach through attach_task to attach.

This is a pre-patch for cgroup-procs-writable.patch.
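
For orientation, the cgroup core ends up invoking these callbacks for a
whole-threadgroup move roughly in the order sketched below. This is a
simplified illustration, not the exact kernel/cgroup.c code: the function
name is made up and rollback via cancel_attach on failure is elided.

static int run_attach_callbacks_sketch(struct cgroupfs_root *root,
				       struct cgroup *cgrp,
				       struct cgroup *oldcgrp,
				       struct task_struct *leader,
				       struct task_struct **group,
				       int group_size)
{
	struct cgroup_subsys *ss;
	int i, retval;

	/* phase 1: every subsystem gets a chance to refuse the move */
	for_each_subsys(root, ss) {
		if (ss->can_attach) {
			retval = ss->can_attach(ss, cgrp, leader);
			if (retval)
				return retval;	/* real code calls cancel_attach */
		}
		if (ss->can_attach_task) {
			for (i = 0; i < group_size; i++) {
				retval = ss->can_attach_task(cgrp, group[i]);
				if (retval)
					return retval;
			}
		}
	}

	/* ... css_set migration for every thread happens in between ... */

	/* phase 2: commit, per-subsystem and then per-thread */
	for_each_subsys(root, ss) {
		if (ss->pre_attach)
			ss->pre_attach(cgrp);
		if (ss->attach_task)
			for (i = 0; i < group_size; i++)
				ss->attach_task(cgrp, group[i]);
		if (ss->attach)
			ss->attach(ss, cgrp, oldcgrp, leader);
	}
	return 0;
}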

Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
---
 Documentation/cgroups/cgroups.txt |   30 ++++++++----
 block/blk-cgroup.c                |   18 ++-----
 include/linux/cgroup.h            |   10 ++--
 kernel/cgroup.c                   |   17 +++++--
 kernel/cgroup_freezer.c           |   26 ++++------
 kernel/cpuset.c                   |   96 ++++++++++++++++++-------------------
 kernel/sched.c                    |   38 +--------------
 mm/memcontrol.c                   |   18 ++-----
 security/device_cgroup.c          |    3 -
 9 files changed, 114 insertions(+), 142 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 2a5d137..4b0377c 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -575,7 +575,7 @@ rmdir() will fail with it. From this behavior, pre_destroy() can be
 called multiple times against a cgroup.
 
 int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-	       struct task_struct *task, bool threadgroup)
+	       struct task_struct *task)
 (cgroup_mutex held by caller)
 
 Called prior to moving a task into a cgroup; if the subsystem
@@ -584,9 +584,14 @@ task is passed, then a successful result indicates that *any*
 unspecified task can be moved into the cgroup. Note that this isn't
 called on a fork. If this method returns 0 (success) then this should
 remain valid while the caller holds cgroup_mutex and it is ensured that either
-attach() or cancel_attach() will be called in future. If threadgroup is
-true, then a successful result indicates that all threads in the given
-thread's threadgroup can be moved together.
+attach() or cancel_attach() will be called in future.
+
+int can_attach_task(struct cgroup *cgrp, struct task_struct *tsk);
+(cgroup_mutex held by caller)
+
+As can_attach, but for operations that must be run once per task to be
+attached (possibly many when using cgroup_attach_proc). Called after
+can_attach.
 
 void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	       struct task_struct *task, bool threadgroup)
@@ -598,15 +603,24 @@ function, so that the subsystem can implement a rollback. If not, not necessary.
 This will be called only about subsystems whose can_attach() operation have
 succeeded.
 
+void pre_attach(struct cgroup *cgrp);
+(cgroup_mutex held by caller)
+
+For any non-per-thread attachment work that needs to happen before
+attach_task. Needed by cpuset.
+
 void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-	    struct cgroup *old_cgrp, struct task_struct *task,
-	    bool threadgroup)
+	    struct cgroup *old_cgrp, struct task_struct *task)
 (cgroup_mutex held by caller)
 
 Called after the task has been attached to the cgroup, to allow any
 post-attachment activity that requires memory allocations or blocking.
-If threadgroup is true, the subsystem should take care of all threads
-in the specified thread's threadgroup. Currently does not support any
+
+void attach_task(struct cgroup *cgrp, struct task_struct *tsk);
+(cgroup_mutex held by caller)
+
+As attach, but for operations that must be run once per task to be attached,
+like can_attach_task. Called before attach. Currently does not support any
 subsystem that might need the old_cgrp for every thread in the group.
 
 void fork(struct cgroup_subsy *ss, struct task_struct *task)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 2bef570..23d03fb 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -30,10 +30,8 @@ EXPORT_SYMBOL_GPL(blkio_root_cgroup);
 
 static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
 						  struct cgroup *);
-static int blkiocg_can_attach(struct cgroup_subsys *, struct cgroup *,
-			      struct task_struct *, bool);
-static void blkiocg_attach(struct cgroup_subsys *, struct cgroup *,
-			   struct cgroup *, struct task_struct *, bool);
+static int blkiocg_can_attach_task(struct cgroup *, struct task_struct *);
+static void blkiocg_attach_task(struct cgroup *, struct task_struct *);
 static void blkiocg_destroy(struct cgroup_subsys *, struct cgroup *);
 static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
 
@@ -46,8 +44,8 @@ static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
 struct cgroup_subsys blkio_subsys = {
 	.name = "blkio",
 	.create = blkiocg_create,
-	.can_attach = blkiocg_can_attach,
-	.attach = blkiocg_attach,
+	.can_attach_task = blkiocg_can_attach_task,
+	.attach_task = blkiocg_attach_task,
 	.destroy = blkiocg_destroy,
 	.populate = blkiocg_populate,
 #ifdef CONFIG_BLK_CGROUP
@@ -1485,9 +1483,7 @@ done:
  * of the main cic data structures.  For now we allow a task to change
  * its cgroup only if it's the only owner of its ioc.
  */
-static int blkiocg_can_attach(struct cgroup_subsys *subsys,
-				struct cgroup *cgroup, struct task_struct *tsk,
-				bool threadgroup)
+static int blkiocg_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 	struct io_context *ioc;
 	int ret = 0;
@@ -1502,9 +1498,7 @@ static int blkiocg_can_attach(struct cgroup_subsys *subsys,
 	return ret;
 }
 
-static void blkiocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
-				struct cgroup *prev, struct task_struct *tsk,
-				bool threadgroup)
+static void blkiocg_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 	struct io_context *ioc;
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index f9f7e3a..919c32c 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -467,12 +467,14 @@ struct cgroup_subsys {
 	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
-			  struct task_struct *tsk, bool threadgroup);
+			  struct task_struct *tsk);
+	int (*can_attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
 	void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
-			  struct task_struct *tsk, bool threadgroup);
+			      struct task_struct *tsk);
+	void (*pre_attach)(struct cgroup *cgrp);
+	void (*attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
-			struct cgroup *old_cgrp, struct task_struct *tsk,
-			bool threadgroup);
+		       struct cgroup *old_cgrp, struct task_struct *tsk);
 	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
 	void (*exit)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			struct cgroup *old_cgrp, struct task_struct *task);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index be1ebeb..1f4037f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1772,7 +1772,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 
 	for_each_subsys(root, ss) {
 		if (ss->can_attach) {
-			retval = ss->can_attach(ss, cgrp, tsk, false);
+			retval = ss->can_attach(ss, cgrp, tsk);
 			if (retval) {
 				/*
 				 * Remember on which subsystem the can_attach()
@@ -1784,6 +1784,13 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 				goto out;
 			}
 		}
+		if (ss->can_attach_task) {
+			retval = ss->can_attach_task(cgrp, tsk);
+			if (retval) {
+				failed_ss = ss;
+				goto out;
+			}
+		}
 	}
 
 	task_lock(tsk);
@@ -1818,8 +1825,12 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	write_unlock(&css_set_lock);
 
 	for_each_subsys(root, ss) {
+		if (ss->pre_attach)
+			ss->pre_attach(cgrp);
+		if (ss->attach_task)
+			ss->attach_task(cgrp, tsk);
 		if (ss->attach)
-			ss->attach(ss, cgrp, oldcgrp, tsk, false);
+			ss->attach(ss, cgrp, oldcgrp, tsk);
 	}
 	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
 	synchronize_rcu();
@@ -1842,7 +1853,7 @@ out:
 				 */
 				break;
 			if (ss->cancel_attach)
-				ss->cancel_attach(ss, cgrp, tsk, false);
+				ss->cancel_attach(ss, cgrp, tsk);
 		}
 	}
 	return retval;
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index e7bebb7..e691818 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -160,7 +160,7 @@ static void freezer_destroy(struct cgroup_subsys *ss,
  */
 static int freezer_can_attach(struct cgroup_subsys *ss,
 			      struct cgroup *new_cgroup,
-			      struct task_struct *task, bool threadgroup)
+			      struct task_struct *task)
 {
 	struct freezer *freezer;
 
@@ -172,26 +172,17 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 	if (freezer->state != CGROUP_THAWED)
 		return -EBUSY;
 
+	return 0;
+}
+
+static int freezer_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+{
 	rcu_read_lock();
-	if (__cgroup_freezing_or_frozen(task)) {
+	if (__cgroup_freezing_or_frozen(tsk)) {
 		rcu_read_unlock();
 		return -EBUSY;
 	}
 	rcu_read_unlock();
-
-	if (threadgroup) {
-		struct task_struct *c;
-
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
-			if (__cgroup_freezing_or_frozen(c)) {
-				rcu_read_unlock();
-				return -EBUSY;
-			}
-		}
-		rcu_read_unlock();
-	}
-
 	return 0;
 }
 
@@ -390,6 +381,9 @@ struct cgroup_subsys freezer_subsys = {
 	.populate	= freezer_populate,
 	.subsys_id	= freezer_subsys_id,
 	.can_attach	= freezer_can_attach,
+	.can_attach_task = freezer_can_attach_task,
+	.pre_attach	= NULL,
+	.attach_task	= NULL,
 	.attach		= NULL,
 	.fork		= freezer_fork,
 	.exit		= NULL,
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 236a3d3..c1e1e1d 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1367,14 +1367,10 @@ static int fmeter_getrate(struct fmeter *fmp)
 	return val;
 }
 
-/* Protected by cgroup_lock */
-static cpumask_var_t cpus_attach;
-
 /* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
 static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
-			     struct task_struct *tsk, bool threadgroup)
+			     struct task_struct *tsk)
 {
-	int ret;
 	struct cpuset *cs = cgroup_cs(cont);
 
 	if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
@@ -1391,29 +1387,42 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
 	if (tsk->flags & PF_THREAD_BOUND)
 		return -EINVAL;
 
-	ret = security_task_setscheduler(tsk);
-	if (ret)
-		return ret;
-	if (threadgroup) {
-		struct task_struct *c;
-
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			ret = security_task_setscheduler(c);
-			if (ret) {
-				rcu_read_unlock();
-				return ret;
-			}
-		}
-		rcu_read_unlock();
-	}
 	return 0;
 }
 
-static void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
-			       struct cpuset *cs)
+static int cpuset_can_attach_task(struct cgroup *cgrp, struct task_struct *task)
+{
+	return security_task_setscheduler(task);
+}
+
+/*
+ * Protected by cgroup_lock. The nodemasks must be stored globally because
+ * dynamically allocating them is not allowed in pre_attach, and they must
+ * persist among pre_attach, attach_task, and attach.
+ */
+static cpumask_var_t cpus_attach;
+static nodemask_t cpuset_attach_nodemask_from;
+static nodemask_t cpuset_attach_nodemask_to;
+
+/* Set-up work for before attaching each task. */
+static void cpuset_pre_attach(struct cgroup *cont)
+{
+	struct cpuset *cs = cgroup_cs(cont);
+
+	if (cs == &top_cpuset)
+		cpumask_copy(cpus_attach, cpu_possible_mask);
+	else
+		guarantee_online_cpus(cs, cpus_attach);
+
+	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+}
+
+/* Per-thread attachment work. */
+static void cpuset_attach_task(struct cgroup *cont, struct task_struct *tsk)
 {
 	int err;
+	struct cpuset *cs = cgroup_cs(cont);
+
 	/*
 	 * can_attach beforehand should guarantee that this doesn't fail.
 	 * TODO: have a better way to handle failure here
@@ -1421,45 +1430,29 @@ static void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
 	err = set_cpus_allowed_ptr(tsk, cpus_attach);
 	WARN_ON_ONCE(err);
 
-	cpuset_change_task_nodemask(tsk, to);
+	cpuset_change_task_nodemask(tsk, &cpuset_attach_nodemask_to);
 	cpuset_update_task_spread_flag(cs, tsk);
-
 }
 
 static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
-			  struct cgroup *oldcont, struct task_struct *tsk,
-			  bool threadgroup)
+			  struct cgroup *oldcont, struct task_struct *tsk)
 {
 	struct mm_struct *mm;
 	struct cpuset *cs = cgroup_cs(cont);
 	struct cpuset *oldcs = cgroup_cs(oldcont);
-	static nodemask_t to;		/* protected by cgroup_mutex */
 
-	if (cs == &top_cpuset) {
-		cpumask_copy(cpus_attach, cpu_possible_mask);
-	} else {
-		guarantee_online_cpus(cs, cpus_attach);
-	}
-	guarantee_online_mems(cs, &to);
-
-	/* do per-task migration stuff possibly for each in the threadgroup */
-	cpuset_attach_task(tsk, &to, cs);
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			cpuset_attach_task(c, &to, cs);
-		}
-		rcu_read_unlock();
-	}
-
-	/* change mm; only needs to be done once even if threadgroup */
-	to = cs->mems_allowed;
+	/*
+	 * Change mm, possibly for multiple threads in a threadgroup. This is
+	 * expensive and may sleep.
+	 */
+	cpuset_attach_nodemask_from = oldcs->mems_allowed;
+	cpuset_attach_nodemask_to = cs->mems_allowed;
 	mm = get_task_mm(tsk);
 	if (mm) {
-		mpol_rebind_mm(mm, &to);
+		mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
 		if (is_memory_migrate(cs))
-			cpuset_migrate_mm(mm, &oldcs->mems_allowed, &to);
+			cpuset_migrate_mm(mm, &cpuset_attach_nodemask_from,
+					  &cpuset_attach_nodemask_to);
 		mmput(mm);
 	}
 }
@@ -1910,6 +1903,9 @@ struct cgroup_subsys cpuset_subsys = {
 	.create = cpuset_create,
 	.destroy = cpuset_destroy,
 	.can_attach = cpuset_can_attach,
+	.can_attach_task = cpuset_can_attach_task,
+	.pre_attach = cpuset_pre_attach,
+	.attach_task = cpuset_attach_task,
 	.attach = cpuset_attach,
 	.populate = cpuset_populate,
 	.post_clone = cpuset_post_clone,
diff --git a/kernel/sched.c b/kernel/sched.c
index f592ce6..28aa791 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9059,42 +9059,10 @@ cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	return 0;
 }
 
-static int
-cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-		      struct task_struct *tsk, bool threadgroup)
-{
-	int retval = cpu_cgroup_can_attach_task(cgrp, tsk);
-	if (retval)
-		return retval;
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			retval = cpu_cgroup_can_attach_task(cgrp, c);
-			if (retval) {
-				rcu_read_unlock();
-				return retval;
-			}
-		}
-		rcu_read_unlock();
-	}
-	return 0;
-}
-
 static void
-cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-		  struct cgroup *old_cont, struct task_struct *tsk,
-		  bool threadgroup)
+cpu_cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 	sched_move_task(tsk);
-	if (threadgroup) {
-		struct task_struct *c;
-		rcu_read_lock();
-		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
-			sched_move_task(c);
-		}
-		rcu_read_unlock();
-	}
 }
 
 static void
@@ -9182,8 +9150,8 @@ struct cgroup_subsys cpu_cgroup_subsys = {
 	.name		= "cpu",
 	.create		= cpu_cgroup_create,
 	.destroy	= cpu_cgroup_destroy,
-	.can_attach	= cpu_cgroup_can_attach,
-	.attach		= cpu_cgroup_attach,
+	.can_attach_task = cpu_cgroup_can_attach_task,
+	.attach_task	= cpu_cgroup_attach_task,
 	.exit		= cpu_cgroup_exit,
 	.populate	= cpu_cgroup_populate,
 	.subsys_id	= cpu_cgroup_subsys_id,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bd689f2..d5202d1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5035,8 +5035,7 @@ static void mem_cgroup_clear_mc(void)
 
 static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 	int ret = 0;
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
@@ -5075,8 +5074,7 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 
 static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 	mem_cgroup_clear_mc();
 }
@@ -5194,8 +5192,7 @@ retry:
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct cgroup *cont,
 				struct cgroup *old_cont,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 	struct mm_struct *mm;
 
@@ -5213,22 +5210,19 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 #else	/* !CONFIG_MMU */
 static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 	return 0;
 }
 static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 }
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct cgroup *cont,
 				struct cgroup *old_cont,
-				struct task_struct *p,
-				bool threadgroup)
+				struct task_struct *p)
 {
 }
 #endif
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 8d9c48f..cd1f779 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -62,8 +62,7 @@ static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
 struct cgroup_subsys devices_subsys;
 
 static int devcgroup_can_attach(struct cgroup_subsys *ss,
-		struct cgroup *new_cgroup, struct task_struct *task,
-		bool threadgroup)
+		struct cgroup *new_cgroup, struct task_struct *task)
 {
 	if (current != task && !capable(CAP_SYS_ADMIN))
 			return -EPERM;

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v8.75 3/4] cgroups: make procs file writable
       [not found]               ` <20110406194420.GC10792-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2011-04-06 19:45                 ` Ben Blum
  2011-04-06 19:46                 ` [PATCH v8.75 2/4] cgroups: add per-thread subsystem callbacks Ben Blum
@ 2011-04-06 19:46                 ` Ben Blum
  2011-04-06 19:47                 ` [PATCH v8.75 4/4] cgroups: use flex_array in attach_proc Ben Blum
  2011-04-12 23:25                 ` [PATCH v8.75 0/4] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Andrew Morton
  4 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-04-06 19:46 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Miao Xie, David Rientjes, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

Makes procs file writable to move all threads by tgid at once

From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

This patch adds functionality that enables users to move all threads in a
threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
file. The current implementation uses a per-threadgroup rwsem that is
taken for reading in the fork() path to prevent newly forking threads within
the threadgroup from "escaping" while the move is in progress.

Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
---
 Documentation/cgroups/cgroups.txt |    9 +
 kernel/cgroup.c                   |  439 +++++++++++++++++++++++++++++++++----
 2 files changed, 401 insertions(+), 47 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 4b0377c..166f6e3 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -236,7 +236,8 @@ containing the following files describing that cgroup:
  - cgroup.procs: list of tgids in the cgroup.  This list is not
    guaranteed to be sorted or free of duplicate tgids, and userspace
    should sort/uniquify the list if this property is required.
-   This is a read-only file, for now.
+   Writing a thread group id into this file moves all threads in that
+   group into this cgroup.
  - notify_on_release flag: run the release agent on exit?
  - release_agent: the path to use for release notifications (this file
    exists in the top cgroup only)
@@ -430,6 +431,12 @@ You can attach the current shell task by echoing 0:
 
 # echo 0 > tasks
 
+You can use the cgroup.procs file instead of the tasks file to move all
+threads in a threadgroup at once. Echoing the pid of any task in a
+threadgroup to cgroup.procs causes all tasks in that threadgroup to be
+be attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
+in the writing task's threadgroup.
+
 Note: Since every task is always a member of exactly one cgroup in each
 mounted hierarchy, to remove a task from its current cgroup you must
 move it into a new cgroup (possibly the root cgroup) by writing to the
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1f4037f..52dfb33 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1748,6 +1748,76 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 }
 EXPORT_SYMBOL_GPL(cgroup_path);
 
+/*
+ * cgroup_task_migrate - move a task from one cgroup to another.
+ *
+ * 'guarantee' is set if the caller promises that a new css_set for the task
+ * will already exist. If not set, this function might sleep, and can fail with
+ * -ENOMEM. Otherwise, it can only fail with -ESRCH.
+ */
+static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
+			       struct task_struct *tsk, bool guarantee)
+{
+	struct css_set *oldcg;
+	struct css_set *newcg;
+
+	/*
+	 * get old css_set. we need to take task_lock and refcount it, because
+	 * an exiting task can change its css_set to init_css_set and drop its
+	 * old one without taking cgroup_mutex.
+	 */
+	task_lock(tsk);
+	oldcg = tsk->cgroups;
+	get_css_set(oldcg);
+	task_unlock(tsk);
+
+	/* locate or allocate a new css_set for this task. */
+	if (guarantee) {
+		/* we know the css_set we want already exists. */
+		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+		read_lock(&css_set_lock);
+		newcg = find_existing_css_set(oldcg, cgrp, template);
+		BUG_ON(!newcg);
+		get_css_set(newcg);
+		read_unlock(&css_set_lock);
+	} else {
+		might_sleep();
+		/* find_css_set will give us newcg already referenced. */
+		newcg = find_css_set(oldcg, cgrp);
+		if (!newcg) {
+			put_css_set(oldcg);
+			return -ENOMEM;
+		}
+	}
+	put_css_set(oldcg);
+
+	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
+	task_lock(tsk);
+	if (tsk->flags & PF_EXITING) {
+		task_unlock(tsk);
+		put_css_set(newcg);
+		return -ESRCH;
+	}
+	rcu_assign_pointer(tsk->cgroups, newcg);
+	task_unlock(tsk);
+
+	/* Update the css_set linked lists if we're using them */
+	write_lock(&css_set_lock);
+	if (!list_empty(&tsk->cg_list))
+		list_move(&tsk->cg_list, &newcg->tasks);
+	write_unlock(&css_set_lock);
+
+	/*
+	 * We just gained a reference on oldcg by taking it from the task. As
+	 * trading it for newcg is protected by cgroup_mutex, we're safe to drop
+	 * it here; it will be freed under RCU.
+	 */
+	put_css_set(oldcg);
+
+	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+	return 0;
+}
+
 /**
  * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
  * @cgrp: the cgroup the task is attaching to
@@ -1758,11 +1828,9 @@ EXPORT_SYMBOL_GPL(cgroup_path);
  */
 int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
-	int retval = 0;
+	int retval;
 	struct cgroup_subsys *ss, *failed_ss = NULL;
 	struct cgroup *oldcgrp;
-	struct css_set *cg;
-	struct css_set *newcg;
 	struct cgroupfs_root *root = cgrp->root;
 
 	/* Nothing to do if the task is already in that cgroup */
@@ -1793,36 +1861,9 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		}
 	}
 
-	task_lock(tsk);
-	cg = tsk->cgroups;
-	get_css_set(cg);
-	task_unlock(tsk);
-	/*
-	 * Locate or allocate a new css_set for this task,
-	 * based on its final set of cgroups
-	 */
-	newcg = find_css_set(cg, cgrp);
-	put_css_set(cg);
-	if (!newcg) {
-		retval = -ENOMEM;
-		goto out;
-	}
-
-	task_lock(tsk);
-	if (tsk->flags & PF_EXITING) {
-		task_unlock(tsk);
-		put_css_set(newcg);
-		retval = -ESRCH;
+	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, false);
+	if (retval)
 		goto out;
-	}
-	rcu_assign_pointer(tsk->cgroups, newcg);
-	task_unlock(tsk);
-
-	/* Update the css_set linked lists if we're using them */
-	write_lock(&css_set_lock);
-	if (!list_empty(&tsk->cg_list))
-		list_move(&tsk->cg_list, &newcg->tasks);
-	write_unlock(&css_set_lock);
 
 	for_each_subsys(root, ss) {
 		if (ss->pre_attach)
@@ -1832,9 +1873,8 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		if (ss->attach)
 			ss->attach(ss, cgrp, oldcgrp, tsk);
 	}
-	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+
 	synchronize_rcu();
-	put_css_set(cg);
 
 	/*
 	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
@@ -1884,49 +1924,356 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 EXPORT_SYMBOL_GPL(cgroup_attach_task_all);
 
 /*
- * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
- * held. May take task_lock of task
+ * cgroup_attach_proc works in two stages, the first of which prefetches all
+ * new css_sets needed (to make sure we have enough memory before committing
+ * to the move) and stores them in a list of entries of the following type.
+ * TODO: possible optimization: use css_set->rcu_head for chaining instead
+ */
+struct cg_list_entry {
+	struct css_set *cg;
+	struct list_head links;
+};
+
+static bool css_set_check_fetched(struct cgroup *cgrp,
+				  struct task_struct *tsk, struct css_set *cg,
+				  struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+
+	read_lock(&css_set_lock);
+	newcg = find_existing_css_set(cg, cgrp, template);
+	if (newcg)
+		get_css_set(newcg);
+	read_unlock(&css_set_lock);
+
+	/* doesn't exist at all? */
+	if (!newcg)
+		return false;
+	/* see if it's already in the list */
+	list_for_each_entry(cg_entry, newcg_list, links) {
+		if (cg_entry->cg == newcg) {
+			put_css_set(newcg);
+			return true;
+		}
+	}
+
+	/* not found */
+	put_css_set(newcg);
+	return false;
+}
+
+/*
+ * Find the new css_set and store it in the list in preparation for moving the
+ * given task to the given cgroup. Returns 0 or -ENOMEM.
  */
-static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
+static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
+			    struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+
+	/* ensure a new css_set will exist for this thread */
+	newcg = find_css_set(cg, cgrp);
+	if (!newcg)
+		return -ENOMEM;
+	/* add it to the list */
+	cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
+	if (!cg_entry) {
+		put_css_set(newcg);
+		return -ENOMEM;
+	}
+	cg_entry->cg = newcg;
+	list_add(&cg_entry->links, newcg_list);
+	return 0;
+}
+
+/**
+ * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
+ * @cgrp: the cgroup to attach to
+ * @leader: the threadgroup leader task_struct of the group to be attached
+ *
+ * Call holding cgroup_mutex and the threadgroup_fork_lock of the leader. Will
+ * take task_lock of each thread in leader's threadgroup individually in turn.
+ */
+int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
+{
+	int retval, i, group_size;
+	struct cgroup_subsys *ss, *failed_ss = NULL;
+	bool cancel_failed_ss = false;
+	/* guaranteed to be initialized later, but the compiler needs this */
+	struct cgroup *oldcgrp = NULL;
+	struct css_set *oldcg;
+	struct cgroupfs_root *root = cgrp->root;
+	/* threadgroup list cursor and array */
+	struct task_struct *tsk;
+	struct task_struct **group;
+	/*
+	 * we need to make sure we have css_sets for all the tasks we're
+	 * going to move -before- we actually start moving them, so that in
+	 * case we get an ENOMEM we can bail out before making any changes.
+	 */
+	struct list_head newcg_list;
+	struct cg_list_entry *cg_entry, *temp_nobe;
+
+	/*
+	 * step 0: in order to do expensive, possibly blocking operations for
+	 * every thread, we cannot iterate the thread group list, since it needs
+	 * rcu or tasklist locked. instead, build an array of all threads in the
+	 * group - threadgroup_fork_lock prevents new threads from appearing,
+	 * and if threads exit, this will just be an over-estimate.
+	 */
+	group_size = get_nr_threads(leader);
+	group = kmalloc(group_size * sizeof(*group), GFP_KERNEL);
+	if (!group)
+		return -ENOMEM;
+
+	/* prevent changes to the threadgroup list while we take a snapshot. */
+	rcu_read_lock();
+	if (!thread_group_leader(leader)) {
+		/*
+		 * a race with de_thread from another thread's exec() may strip
+		 * us of our leadership, making while_each_thread unsafe to use
+		 * on this task. if this happens, there is no choice but to
+		 * throw this task away and try again (from cgroup_procs_write);
+		 * this is "double-double-toil-and-trouble-check locking".
+		 */
+		rcu_read_unlock();
+		retval = -EAGAIN;
+		goto out_free_group_list;
+	}
+	/* take a reference on each task in the group to go in the array. */
+	tsk = leader;
+	i = 0;
+	do {
+		/* as per above, nr_threads may decrease, but not increase. */
+		BUG_ON(i >= group_size);
+		get_task_struct(tsk);
+		group[i] = tsk;
+		i++;
+	} while_each_thread(leader, tsk);
+	/* remember the number of threads in the array for later. */
+	group_size = i;
+	rcu_read_unlock();
+
+	/*
+	 * step 1: check that we can legitimately attach to the cgroup.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->can_attach) {
+			retval = ss->can_attach(ss, cgrp, leader);
+			if (retval) {
+				failed_ss = ss;
+				goto out_cancel_attach;
+			}
+		}
+		/* a callback to be run on every thread in the threadgroup. */
+		if (ss->can_attach_task) {
+			/* run on each task in the threadgroup. */
+			for (i = 0; i < group_size; i++) {
+				retval = ss->can_attach_task(cgrp, group[i]);
+				if (retval) {
+					failed_ss = ss;
+					cancel_failed_ss = true;
+					goto out_cancel_attach;
+				}
+			}
+		}
+	}
+
+	/*
+	 * step 2: make sure css_sets exist for all threads to be migrated.
+	 * we use find_css_set, which allocates a new one if necessary.
+	 */
+	INIT_LIST_HEAD(&newcg_list);
+	for (i = 0; i < group_size; i++) {
+		tsk = group[i];
+		/* nothing to do if this task is already in the cgroup */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* get old css_set pointer */
+		task_lock(tsk);
+		if (tsk->flags & PF_EXITING) {
+			/* ignore this task if it's going away */
+			task_unlock(tsk);
+			continue;
+		}
+		oldcg = tsk->cgroups;
+		get_css_set(oldcg);
+		task_unlock(tsk);
+		/* see if the new one for us is already in the list? */
+		if (css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list)) {
+			/* was already there, nothing to do. */
+			put_css_set(oldcg);
+		} else {
+			/* we don't already have it. get new one. */
+			retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+			put_css_set(oldcg);
+			if (retval)
+				goto out_list_teardown;
+		}
+	}
+
+	/*
+	 * step 3: now that we're guaranteed success wrt the css_sets, proceed
+	 * to move all tasks to the new cgroup, calling ss->attach_task for each
+	 * one along the way. there are no failure cases after here, so this is
+	 * the commit point.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->pre_attach)
+			ss->pre_attach(cgrp);
+	}
+	for (i = 0; i < group_size; i++) {
+		tsk = group[i];
+		/* leave current thread as it is if it's already there */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* attach each task to each subsystem */
+		for_each_subsys(root, ss) {
+			if (ss->attach_task)
+				ss->attach_task(cgrp, tsk);
+		}
+		/* if the thread is PF_EXITING, it can just get skipped. */
+		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+	/* nothing is sensitive to fork() after this point. */
+
+	/*
+	 * step 4: do expensive, non-thread-specific subsystem callbacks.
+	 * TODO: if ever a subsystem needs to know the oldcgrp for each task
+	 * being moved, this call will need to be reworked to communicate that.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->attach)
+			ss->attach(ss, cgrp, oldcgrp, leader);
+	}
+
+	/*
+	 * step 5: success! and cleanup
+	 */
+	synchronize_rcu();
+	cgroup_wakeup_rmdir_waiter(cgrp);
+	retval = 0;
+out_list_teardown:
+	/* clean up the list of prefetched css_sets. */
+	list_for_each_entry_safe(cg_entry, temp_nobe, &newcg_list, links) {
+		list_del(&cg_entry->links);
+		put_css_set(cg_entry->cg);
+		kfree(cg_entry);
+	}
+out_cancel_attach:
+	/* same deal as in cgroup_attach_task */
+	if (retval) {
+		for_each_subsys(root, ss) {
+			if (ss == failed_ss) {
+				if (cancel_failed_ss && ss->cancel_attach)
+					ss->cancel_attach(ss, cgrp, leader);
+				break;
+			}
+			if (ss->cancel_attach)
+				ss->cancel_attach(ss, cgrp, leader);
+		}
+	}
+	/* clean up the array of referenced threads in the group. */
+	for (i = 0; i < group_size; i++)
+		put_task_struct(group[i]);
+out_free_group_list:
+	kfree(group);
+	return retval;
+}
+
+/*
+ * Find the task_struct of the task to attach by vpid and pass it along to the
+ * function to attach either it or all tasks in its threadgroup. Will take
+ * cgroup_mutex; may take task_lock of task.
+ */
+static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 {
 	struct task_struct *tsk;
 	const struct cred *cred = current_cred(), *tcred;
 	int ret;
 
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+
 	if (pid) {
 		rcu_read_lock();
 		tsk = find_task_by_vpid(pid);
-		if (!tsk || tsk->flags & PF_EXITING) {
+		if (!tsk) {
 			rcu_read_unlock();
+			cgroup_unlock();
+			return -ESRCH;
+		}
+		if (threadgroup) {
+			/*
+			 * RCU protects this access, since tsk was found in the
+			 * tid map. a race with de_thread may cause group_leader
+			 * to stop being the leader, but cgroup_attach_proc will
+			 * detect it later.
+			 */
+			tsk = tsk->group_leader;
+		} else if (tsk->flags & PF_EXITING) {
+			/* optimization for the single-task-only case */
+			rcu_read_unlock();
+			cgroup_unlock();
 			return -ESRCH;
 		}
 
+		/*
+		 * even if we're attaching all tasks in the thread group, we
+		 * only need to check permissions on one of them.
+		 */
 		tcred = __task_cred(tsk);
 		if (cred->euid &&
 		    cred->euid != tcred->uid &&
 		    cred->euid != tcred->suid) {
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -EACCES;
 		}
 		get_task_struct(tsk);
 		rcu_read_unlock();
 	} else {
-		tsk = current;
+		if (threadgroup)
+			tsk = current->group_leader;
+		else
+			tsk = current;
 		get_task_struct(tsk);
 	}
 
-	ret = cgroup_attach_task(cgrp, tsk);
+	if (threadgroup) {
+		threadgroup_fork_write_lock(tsk);
+		ret = cgroup_attach_proc(cgrp, tsk);
+		threadgroup_fork_write_unlock(tsk);
+	} else {
+		ret = cgroup_attach_task(cgrp, tsk);
+	}
 	put_task_struct(tsk);
+	cgroup_unlock();
 	return ret;
 }
 
 static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
 {
+	return attach_task_by_pid(cgrp, pid, false);
+}
+
+static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
+{
 	int ret;
-	if (!cgroup_lock_live_group(cgrp))
-		return -ENODEV;
-	ret = attach_task_by_pid(cgrp, pid);
-	cgroup_unlock();
+	do {
+		/*
+		 * attach_proc fails with -EAGAIN if threadgroup leadership
+		 * changes in the middle of the operation, in which case we need
+		 * to find the task_struct for the new leader and start over.
+		 */
+		ret = attach_task_by_pid(cgrp, tgid, true);
+	} while (ret == -EAGAIN);
 	return ret;
 }
 
@@ -3283,9 +3630,9 @@ static struct cftype files[] = {
 	{
 		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
 		.open = cgroup_procs_open,
-		/* .write_u64 = cgroup_procs_write, TODO */
+		.write_u64 = cgroup_procs_write,
 		.release = cgroup_pidlist_release,
-		.mode = S_IRUGO,
+		.mode = S_IRUGO | S_IWUSR,
 	},
 	{
 		.name = "notify_on_release",

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v8.75 3/4] cgroups: make procs file writable
  2011-04-06 19:44             ` [PATCH v8.75 0/4] " Ben Blum
                                 ` (2 preceding siblings ...)
  2011-04-06 19:46               ` [PATCH v8.75 2/4] cgroups: add per-thread subsystem callbacks Ben Blum
@ 2011-04-06 19:46               ` Ben Blum
  2011-04-06 19:47               ` [PATCH v8.75 4/4] cgroups: use flex_array in attach_proc Ben Blum
  2011-04-12 23:25               ` [PATCH v8.75 0/4] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Andrew Morton
  5 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-04-06 19:46 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage,
	oleg, David Rientjes, Miao Xie

Makes procs file writable to move all threads by tgid at once

From: Ben Blum <bblum@andrew.cmu.edu>

This patch adds functionality that enables users to move all threads in a
threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
file. The current implementation uses a per-threadgroup rwsem that's
taken for reading in the fork() path to prevent newly forking threads within
the threadgroup from "escaping" while the move is in progress.
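
A minimal userspace sketch of the resulting interface, assuming the hierarchy
is mounted at /dev/cgroup; the helper name, path, and tgid below are
illustrative only, not part of the kernel change:

/* Hypothetical sketch: move tgid's whole threadgroup into the cgroup whose
 * directory is cgroup_dir (e.g. an assumed /dev/cgroup/foo). */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

static int move_threadgroup(const char *cgroup_dir, pid_t tgid)
{
	char path[256], buf[32];
	int fd, len, ok;

	snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	len = snprintf(buf, sizeof(buf), "%d", (int)tgid);
	/* the kernel attaches every thread in tgid's group, or none on error */
	ok = (write(fd, buf, len) == len);
	close(fd);
	return ok ? 0 : -1;
}

Calling move_threadgroup("/dev/cgroup/foo", 1234) would then be equivalent to
echoing 1234 into that cgroup's cgroup.procs file.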

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
---
 Documentation/cgroups/cgroups.txt |    9 +
 kernel/cgroup.c                   |  439 +++++++++++++++++++++++++++++++++----
 2 files changed, 401 insertions(+), 47 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 4b0377c..166f6e3 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -236,7 +236,8 @@ containing the following files describing that cgroup:
  - cgroup.procs: list of tgids in the cgroup.  This list is not
    guaranteed to be sorted or free of duplicate tgids, and userspace
    should sort/uniquify the list if this property is required.
-   This is a read-only file, for now.
+   Writing a thread group id into this file moves all threads in that
+   group into this cgroup.
  - notify_on_release flag: run the release agent on exit?
  - release_agent: the path to use for release notifications (this file
    exists in the top cgroup only)
@@ -430,6 +431,12 @@ You can attach the current shell task by echoing 0:
 
 # echo 0 > tasks
 
+You can use the cgroup.procs file instead of the tasks file to move all
+threads in a threadgroup at once. Echoing the pid of any task in a
+threadgroup to cgroup.procs causes all tasks in that threadgroup to be
+attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
+in the writing task's threadgroup.
+
 Note: Since every task is always a member of exactly one cgroup in each
 mounted hierarchy, to remove a task from its current cgroup you must
 move it into a new cgroup (possibly the root cgroup) by writing to the
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1f4037f..52dfb33 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1748,6 +1748,76 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 }
 EXPORT_SYMBOL_GPL(cgroup_path);
 
+/*
+ * cgroup_task_migrate - move a task from one cgroup to another.
+ *
+ * 'guarantee' is set if the caller promises that a new css_set for the task
+ * will already exist. If not set, this function might sleep, and can fail with
+ * -ENOMEM. Otherwise, it can only fail with -ESRCH.
+ */
+static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
+			       struct task_struct *tsk, bool guarantee)
+{
+	struct css_set *oldcg;
+	struct css_set *newcg;
+
+	/*
+	 * get old css_set. we need to take task_lock and refcount it, because
+	 * an exiting task can change its css_set to init_css_set and drop its
+	 * old one without taking cgroup_mutex.
+	 */
+	task_lock(tsk);
+	oldcg = tsk->cgroups;
+	get_css_set(oldcg);
+	task_unlock(tsk);
+
+	/* locate or allocate a new css_set for this task. */
+	if (guarantee) {
+		/* we know the css_set we want already exists. */
+		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+		read_lock(&css_set_lock);
+		newcg = find_existing_css_set(oldcg, cgrp, template);
+		BUG_ON(!newcg);
+		get_css_set(newcg);
+		read_unlock(&css_set_lock);
+	} else {
+		might_sleep();
+		/* find_css_set will give us newcg already referenced. */
+		newcg = find_css_set(oldcg, cgrp);
+		if (!newcg) {
+			put_css_set(oldcg);
+			return -ENOMEM;
+		}
+	}
+	put_css_set(oldcg);
+
+	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
+	task_lock(tsk);
+	if (tsk->flags & PF_EXITING) {
+		task_unlock(tsk);
+		put_css_set(newcg);
+		return -ESRCH;
+	}
+	rcu_assign_pointer(tsk->cgroups, newcg);
+	task_unlock(tsk);
+
+	/* Update the css_set linked lists if we're using them */
+	write_lock(&css_set_lock);
+	if (!list_empty(&tsk->cg_list))
+		list_move(&tsk->cg_list, &newcg->tasks);
+	write_unlock(&css_set_lock);
+
+	/*
+	 * We just gained a reference on oldcg by taking it from the task. As
+	 * trading it for newcg is protected by cgroup_mutex, we're safe to drop
+	 * it here; it will be freed under RCU.
+	 */
+	put_css_set(oldcg);
+
+	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+	return 0;
+}
+
 /**
  * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
  * @cgrp: the cgroup the task is attaching to
@@ -1758,11 +1828,9 @@ EXPORT_SYMBOL_GPL(cgroup_path);
  */
 int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
-	int retval = 0;
+	int retval;
 	struct cgroup_subsys *ss, *failed_ss = NULL;
 	struct cgroup *oldcgrp;
-	struct css_set *cg;
-	struct css_set *newcg;
 	struct cgroupfs_root *root = cgrp->root;
 
 	/* Nothing to do if the task is already in that cgroup */
@@ -1793,36 +1861,9 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		}
 	}
 
-	task_lock(tsk);
-	cg = tsk->cgroups;
-	get_css_set(cg);
-	task_unlock(tsk);
-	/*
-	 * Locate or allocate a new css_set for this task,
-	 * based on its final set of cgroups
-	 */
-	newcg = find_css_set(cg, cgrp);
-	put_css_set(cg);
-	if (!newcg) {
-		retval = -ENOMEM;
-		goto out;
-	}
-
-	task_lock(tsk);
-	if (tsk->flags & PF_EXITING) {
-		task_unlock(tsk);
-		put_css_set(newcg);
-		retval = -ESRCH;
+	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, false);
+	if (retval)
 		goto out;
-	}
-	rcu_assign_pointer(tsk->cgroups, newcg);
-	task_unlock(tsk);
-
-	/* Update the css_set linked lists if we're using them */
-	write_lock(&css_set_lock);
-	if (!list_empty(&tsk->cg_list))
-		list_move(&tsk->cg_list, &newcg->tasks);
-	write_unlock(&css_set_lock);
 
 	for_each_subsys(root, ss) {
 		if (ss->pre_attach)
@@ -1832,9 +1873,8 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		if (ss->attach)
 			ss->attach(ss, cgrp, oldcgrp, tsk);
 	}
-	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+
 	synchronize_rcu();
-	put_css_set(cg);
 
 	/*
 	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
@@ -1884,49 +1924,356 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 EXPORT_SYMBOL_GPL(cgroup_attach_task_all);
 
 /*
- * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
- * held. May take task_lock of task
+ * cgroup_attach_proc works in two stages, the first of which prefetches all
+ * new css_sets needed (to make sure we have enough memory before committing
+ * to the move) and stores them in a list of entries of the following type.
+ * TODO: possible optimization: use css_set->rcu_head for chaining instead
+ */
+struct cg_list_entry {
+	struct css_set *cg;
+	struct list_head links;
+};
+
+static bool css_set_check_fetched(struct cgroup *cgrp,
+				  struct task_struct *tsk, struct css_set *cg,
+				  struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+
+	read_lock(&css_set_lock);
+	newcg = find_existing_css_set(cg, cgrp, template);
+	if (newcg)
+		get_css_set(newcg);
+	read_unlock(&css_set_lock);
+
+	/* doesn't exist at all? */
+	if (!newcg)
+		return false;
+	/* see if it's already in the list */
+	list_for_each_entry(cg_entry, newcg_list, links) {
+		if (cg_entry->cg == newcg) {
+			put_css_set(newcg);
+			return true;
+		}
+	}
+
+	/* not found */
+	put_css_set(newcg);
+	return false;
+}
+
+/*
+ * Find the new css_set and store it in the list in preparation for moving the
+ * given task to the given cgroup. Returns 0 or -ENOMEM.
  */
-static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
+static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
+			    struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+
+	/* ensure a new css_set will exist for this thread */
+	newcg = find_css_set(cg, cgrp);
+	if (!newcg)
+		return -ENOMEM;
+	/* add it to the list */
+	cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
+	if (!cg_entry) {
+		put_css_set(newcg);
+		return -ENOMEM;
+	}
+	cg_entry->cg = newcg;
+	list_add(&cg_entry->links, newcg_list);
+	return 0;
+}
+
+/**
+ * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
+ * @cgrp: the cgroup to attach to
+ * @leader: the threadgroup leader task_struct of the group to be attached
+ *
+ * Call holding cgroup_mutex and the threadgroup_fork_lock of the leader. Will
+ * take task_lock of each thread in leader's threadgroup individually in turn.
+ */
+int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
+{
+	int retval, i, group_size;
+	struct cgroup_subsys *ss, *failed_ss = NULL;
+	bool cancel_failed_ss = false;
+	/* guaranteed to be initialized later, but the compiler needs this */
+	struct cgroup *oldcgrp = NULL;
+	struct css_set *oldcg;
+	struct cgroupfs_root *root = cgrp->root;
+	/* threadgroup list cursor and array */
+	struct task_struct *tsk;
+	struct task_struct **group;
+	/*
+	 * we need to make sure we have css_sets for all the tasks we're
+	 * going to move -before- we actually start moving them, so that in
+	 * case we get an ENOMEM we can bail out before making any changes.
+	 */
+	struct list_head newcg_list;
+	struct cg_list_entry *cg_entry, *temp_nobe;
+
+	/*
+	 * step 0: in order to do expensive, possibly blocking operations for
+	 * every thread, we cannot iterate the thread group list, since it needs
+	 * rcu or tasklist locked. instead, build an array of all threads in the
+	 * group - threadgroup_fork_lock prevents new threads from appearing,
+	 * and if threads exit, this will just be an over-estimate.
+	 */
+	group_size = get_nr_threads(leader);
+	group = kmalloc(group_size * sizeof(*group), GFP_KERNEL);
+	if (!group)
+		return -ENOMEM;
+
+	/* prevent changes to the threadgroup list while we take a snapshot. */
+	rcu_read_lock();
+	if (!thread_group_leader(leader)) {
+		/*
+		 * a race with de_thread from another thread's exec() may strip
+		 * us of our leadership, making while_each_thread unsafe to use
+		 * on this task. if this happens, there is no choice but to
+		 * throw this task away and try again (from cgroup_procs_write);
+		 * this is "double-double-toil-and-trouble-check locking".
+		 */
+		rcu_read_unlock();
+		retval = -EAGAIN;
+		goto out_free_group_list;
+	}
+	/* take a reference on each task in the group to go in the array. */
+	tsk = leader;
+	i = 0;
+	do {
+		/* as per above, nr_threads may decrease, but not increase. */
+		BUG_ON(i >= group_size);
+		get_task_struct(tsk);
+		group[i] = tsk;
+		i++;
+	} while_each_thread(leader, tsk);
+	/* remember the number of threads in the array for later. */
+	group_size = i;
+	rcu_read_unlock();
+
+	/*
+	 * step 1: check that we can legitimately attach to the cgroup.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->can_attach) {
+			retval = ss->can_attach(ss, cgrp, leader);
+			if (retval) {
+				failed_ss = ss;
+				goto out_cancel_attach;
+			}
+		}
+		/* a callback to be run on every thread in the threadgroup. */
+		if (ss->can_attach_task) {
+			/* run on each task in the threadgroup. */
+			for (i = 0; i < group_size; i++) {
+				retval = ss->can_attach_task(cgrp, group[i]);
+				if (retval) {
+					failed_ss = ss;
+					cancel_failed_ss = true;
+					goto out_cancel_attach;
+				}
+			}
+		}
+	}
+
+	/*
+	 * step 2: make sure css_sets exist for all threads to be migrated.
+	 * we use find_css_set, which allocates a new one if necessary.
+	 */
+	INIT_LIST_HEAD(&newcg_list);
+	for (i = 0; i < group_size; i++) {
+		tsk = group[i];
+		/* nothing to do if this task is already in the cgroup */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* get old css_set pointer */
+		task_lock(tsk);
+		if (tsk->flags & PF_EXITING) {
+			/* ignore this task if it's going away */
+			task_unlock(tsk);
+			continue;
+		}
+		oldcg = tsk->cgroups;
+		get_css_set(oldcg);
+		task_unlock(tsk);
+		/* see if the new one for us is already in the list? */
+		if (css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list)) {
+			/* was already there, nothing to do. */
+			put_css_set(oldcg);
+		} else {
+			/* we don't already have it. get new one. */
+			retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+			put_css_set(oldcg);
+			if (retval)
+				goto out_list_teardown;
+		}
+	}
+
+	/*
+	 * step 3: now that we're guaranteed success wrt the css_sets, proceed
+	 * to move all tasks to the new cgroup, calling ss->attach_task for each
+	 * one along the way. there are no failure cases after here, so this is
+	 * the commit point.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->pre_attach)
+			ss->pre_attach(cgrp);
+	}
+	for (i = 0; i < group_size; i++) {
+		tsk = group[i];
+		/* leave current thread as it is if it's already there */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* attach each task to each subsystem */
+		for_each_subsys(root, ss) {
+			if (ss->attach_task)
+				ss->attach_task(cgrp, tsk);
+		}
+		/* if the thread is PF_EXITING, it can just get skipped. */
+		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+	/* nothing is sensitive to fork() after this point. */
+
+	/*
+	 * step 4: do expensive, non-thread-specific subsystem callbacks.
+	 * TODO: if ever a subsystem needs to know the oldcgrp for each task
+	 * being moved, this call will need to be reworked to communicate that.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->attach)
+			ss->attach(ss, cgrp, oldcgrp, leader);
+	}
+
+	/*
+	 * step 5: success! and cleanup
+	 */
+	synchronize_rcu();
+	cgroup_wakeup_rmdir_waiter(cgrp);
+	retval = 0;
+out_list_teardown:
+	/* clean up the list of prefetched css_sets. */
+	list_for_each_entry_safe(cg_entry, temp_nobe, &newcg_list, links) {
+		list_del(&cg_entry->links);
+		put_css_set(cg_entry->cg);
+		kfree(cg_entry);
+	}
+out_cancel_attach:
+	/* same deal as in cgroup_attach_task */
+	if (retval) {
+		for_each_subsys(root, ss) {
+			if (ss == failed_ss) {
+				if (cancel_failed_ss && ss->cancel_attach)
+					ss->cancel_attach(ss, cgrp, leader);
+				break;
+			}
+			if (ss->cancel_attach)
+				ss->cancel_attach(ss, cgrp, leader);
+		}
+	}
+	/* clean up the array of referenced threads in the group. */
+	for (i = 0; i < group_size; i++)
+		put_task_struct(group[i]);
+out_free_group_list:
+	kfree(group);
+	return retval;
+}
+
+/*
+ * Find the task_struct of the task to attach by vpid and pass it along to the
+ * function to attach either it or all tasks in its threadgroup. Will take
+ * cgroup_mutex; may take task_lock of task.
+ */
+static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 {
 	struct task_struct *tsk;
 	const struct cred *cred = current_cred(), *tcred;
 	int ret;
 
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+
 	if (pid) {
 		rcu_read_lock();
 		tsk = find_task_by_vpid(pid);
-		if (!tsk || tsk->flags & PF_EXITING) {
+		if (!tsk) {
 			rcu_read_unlock();
+			cgroup_unlock();
+			return -ESRCH;
+		}
+		if (threadgroup) {
+			/*
+			 * RCU protects this access, since tsk was found in the
+			 * tid map. a race with de_thread may cause group_leader
+			 * to stop being the leader, but cgroup_attach_proc will
+			 * detect it later.
+			 */
+			tsk = tsk->group_leader;
+		} else if (tsk->flags & PF_EXITING) {
+			/* optimization for the single-task-only case */
+			rcu_read_unlock();
+			cgroup_unlock();
 			return -ESRCH;
 		}
 
+		/*
+		 * even if we're attaching all tasks in the thread group, we
+		 * only need to check permissions on one of them.
+		 */
 		tcred = __task_cred(tsk);
 		if (cred->euid &&
 		    cred->euid != tcred->uid &&
 		    cred->euid != tcred->suid) {
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -EACCES;
 		}
 		get_task_struct(tsk);
 		rcu_read_unlock();
 	} else {
-		tsk = current;
+		if (threadgroup)
+			tsk = current->group_leader;
+		else
+			tsk = current;
 		get_task_struct(tsk);
 	}
 
-	ret = cgroup_attach_task(cgrp, tsk);
+	if (threadgroup) {
+		threadgroup_fork_write_lock(tsk);
+		ret = cgroup_attach_proc(cgrp, tsk);
+		threadgroup_fork_write_unlock(tsk);
+	} else {
+		ret = cgroup_attach_task(cgrp, tsk);
+	}
 	put_task_struct(tsk);
+	cgroup_unlock();
 	return ret;
 }
 
 static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
 {
+	return attach_task_by_pid(cgrp, pid, false);
+}
+
+static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
+{
 	int ret;
-	if (!cgroup_lock_live_group(cgrp))
-		return -ENODEV;
-	ret = attach_task_by_pid(cgrp, pid);
-	cgroup_unlock();
+	do {
+		/*
+		 * attach_proc fails with -EAGAIN if threadgroup leadership
+		 * changes in the middle of the operation, in which case we need
+		 * to find the task_struct for the new leader and start over.
+		 */
+		ret = attach_task_by_pid(cgrp, tgid, true);
+	} while (ret == -EAGAIN);
 	return ret;
 }
 
@@ -3283,9 +3630,9 @@ static struct cftype files[] = {
 	{
 		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
 		.open = cgroup_procs_open,
-		/* .write_u64 = cgroup_procs_write, TODO */
+		.write_u64 = cgroup_procs_write,
 		.release = cgroup_pidlist_release,
-		.mode = S_IRUGO,
+		.mode = S_IRUGO | S_IWUSR,
 	},
 	{
 		.name = "notify_on_release",

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v8.75 4/4] cgroups: use flex_array in attach_proc
       [not found]               ` <20110406194420.GC10792-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
                                   ` (2 preceding siblings ...)
  2011-04-06 19:46                 ` [PATCH v8.75 3/4] cgroups: make procs file writable Ben Blum
@ 2011-04-06 19:47                 ` Ben Blum
  2011-04-12 23:25                 ` [PATCH v8.75 0/4] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Andrew Morton
  4 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-04-06 19:47 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Miao Xie, David Rientjes, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

Convert cgroup_attach_proc to use flex_array.

From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

The cgroup_attach_proc implementation requires a pre-allocated array to store
task pointers to atomically move a thread-group, but asking for a monolithic
array with kmalloc() may be unreliable for very large groups. Using flex_array
provides the same functionality with less risk of failure.

This is a post-patch for cgroup-procs-write.patch.
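
For readers unfamiliar with flex_array, here is a minimal kernel-context
sketch of the same calls the patch relies on (alloc, prealloc, put_ptr,
get_ptr, free), shown in isolation. The function name and element count are
assumptions for illustration, and the error handling is condensed:

/* Illustrative sketch only; mirrors the flex_array calls used below. */
#include <linux/flex_array.h>
#include <linux/sched.h>
#include <linux/gfp.h>
#include <linux/errno.h>

static int flex_array_demo(struct task_struct *tsk, int nr)
{
	struct flex_array *fa;
	int ret;

	/* an array of 'nr' task_struct pointers */
	fa = flex_array_alloc(sizeof(struct task_struct *), nr, GFP_KERNEL);
	if (!fa)
		return -ENOMEM;
	/* pre-allocate so later puts cannot fail, as the patch does before
	 * entering its rcu read-side section (same call as in the patch) */
	ret = flex_array_prealloc(fa, 0, nr - 1, GFP_KERNEL);
	if (ret)
		goto out;
	/* store and fetch one pointer element */
	ret = flex_array_put_ptr(fa, 0, tsk, GFP_KERNEL);
	if (!ret && flex_array_get_ptr(fa, 0) != tsk)
		ret = -EINVAL;
out:
	flex_array_free(fa);
	return ret;
}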

Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
---
 kernel/cgroup.c |   33 ++++++++++++++++++++++++---------
 1 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 52dfb33..8236895 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,7 @@
 #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 #include <linux/eventfd.h>
 #include <linux/poll.h>
+#include <linux/flex_array.h> /* used in cgroup_attach_proc */
 
 #include <asm/atomic.h>
 
@@ -2008,7 +2009,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	struct cgroupfs_root *root = cgrp->root;
 	/* threadgroup list cursor and array */
 	struct task_struct *tsk;
-	struct task_struct **group;
+	struct flex_array *group;
 	/*
 	 * we need to make sure we have css_sets for all the tasks we're
 	 * going to move -before- we actually start moving them, so that in
@@ -2025,9 +2026,15 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	 * and if threads exit, this will just be an over-estimate.
 	 */
 	group_size = get_nr_threads(leader);
-	group = kmalloc(group_size * sizeof(*group), GFP_KERNEL);
+	/* flex_array supports very large thread-groups better than kmalloc. */
+	group = flex_array_alloc(sizeof(struct task_struct *), group_size,
+				 GFP_KERNEL);
 	if (!group)
 		return -ENOMEM;
+	/* pre-allocate to guarantee space while iterating in rcu read-side. */
+	retval = flex_array_prealloc(group, 0, group_size - 1, GFP_KERNEL);
+	if (retval)
+		goto out_free_group_list;
 
 	/* prevent changes to the threadgroup list while we take a snapshot. */
 	rcu_read_lock();
@@ -2050,7 +2057,12 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 		/* as per above, nr_threads may decrease, but not increase. */
 		BUG_ON(i >= group_size);
 		get_task_struct(tsk);
-		group[i] = tsk;
+		/*
+		 * saying GFP_ATOMIC has no effect here because we did prealloc
+		 * earlier, but it's good form to communicate our expectations.
+		 */
+		retval = flex_array_put_ptr(group, i, tsk, GFP_ATOMIC);
+		BUG_ON(retval != 0);
 		i++;
 	} while_each_thread(leader, tsk);
 	/* remember the number of threads in the array for later. */
@@ -2072,7 +2084,8 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 		if (ss->can_attach_task) {
 			/* run on each task in the threadgroup. */
 			for (i = 0; i < group_size; i++) {
-				retval = ss->can_attach_task(cgrp, group[i]);
+				tsk = flex_array_get_ptr(group, i);
+				retval = ss->can_attach_task(cgrp, tsk);
 				if (retval) {
 					failed_ss = ss;
 					cancel_failed_ss = true;
@@ -2088,7 +2101,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	 */
 	INIT_LIST_HEAD(&newcg_list);
 	for (i = 0; i < group_size; i++) {
-		tsk = group[i];
+		tsk = flex_array_get_ptr(group, i);
 		/* nothing to do if this task is already in the cgroup */
 		oldcgrp = task_cgroup_from_root(tsk, root);
 		if (cgrp == oldcgrp)
@@ -2127,7 +2140,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 			ss->pre_attach(cgrp);
 	}
 	for (i = 0; i < group_size; i++) {
-		tsk = group[i];
+		tsk = flex_array_get_ptr(group, i);
 		/* leave current thread as it is if it's already there */
 		oldcgrp = task_cgroup_from_root(tsk, root);
 		if (cgrp == oldcgrp)
@@ -2180,10 +2193,12 @@ out_cancel_attach:
 		}
 	}
 	/* clean up the array of referenced threads in the group. */
-	for (i = 0; i < group_size; i++)
-		put_task_struct(group[i]);
+	for (i = 0; i < group_size; i++) {
+		tsk = flex_array_get_ptr(group, i);
+		put_task_struct(tsk);
+	}
 out_free_group_list:
-	kfree(group);
+	flex_array_free(group);
 	return retval;
 }

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* [PATCH v8.75 4/4] cgroups: use flex_array in attach_proc
  2011-04-06 19:44             ` [PATCH v8.75 0/4] " Ben Blum
                                 ` (3 preceding siblings ...)
  2011-04-06 19:46               ` [PATCH v8.75 3/4] cgroups: make procs file writable Ben Blum
@ 2011-04-06 19:47               ` Ben Blum
  2011-04-12 23:25               ` [PATCH v8.75 0/4] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Andrew Morton
  5 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-04-06 19:47 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage,
	oleg, David Rientjes, Miao Xie

Convert cgroup_attach_proc to use flex_array.

From: Ben Blum <bblum@andrew.cmu.edu>

The cgroup_attach_proc implementation requires a pre-allocated array to store
task pointers to atomically move a thread-group, but asking for a monolithic
array with kmalloc() may be unreliable for very large groups. Using flex_array
provides the same functionality with less risk of failure.

This is a post-patch for cgroup-procs-write.patch.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
---
 kernel/cgroup.c |   33 ++++++++++++++++++++++++---------
 1 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 52dfb33..8236895 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,7 @@
 #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 #include <linux/eventfd.h>
 #include <linux/poll.h>
+#include <linux/flex_array.h> /* used in cgroup_attach_proc */
 
 #include <asm/atomic.h>
 
@@ -2008,7 +2009,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	struct cgroupfs_root *root = cgrp->root;
 	/* threadgroup list cursor and array */
 	struct task_struct *tsk;
-	struct task_struct **group;
+	struct flex_array *group;
 	/*
 	 * we need to make sure we have css_sets for all the tasks we're
 	 * going to move -before- we actually start moving them, so that in
@@ -2025,9 +2026,15 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	 * and if threads exit, this will just be an over-estimate.
 	 */
 	group_size = get_nr_threads(leader);
-	group = kmalloc(group_size * sizeof(*group), GFP_KERNEL);
+	/* flex_array supports very large thread-groups better than kmalloc. */
+	group = flex_array_alloc(sizeof(struct task_struct *), group_size,
+				 GFP_KERNEL);
 	if (!group)
 		return -ENOMEM;
+	/* pre-allocate to guarantee space while iterating in rcu read-side. */
+	retval = flex_array_prealloc(group, 0, group_size - 1, GFP_KERNEL);
+	if (retval)
+		goto out_free_group_list;
 
 	/* prevent changes to the threadgroup list while we take a snapshot. */
 	rcu_read_lock();
@@ -2050,7 +2057,12 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 		/* as per above, nr_threads may decrease, but not increase. */
 		BUG_ON(i >= group_size);
 		get_task_struct(tsk);
-		group[i] = tsk;
+		/*
+		 * saying GFP_ATOMIC has no effect here because we did prealloc
+		 * earlier, but it's good form to communicate our expectations.
+		 */
+		retval = flex_array_put_ptr(group, i, tsk, GFP_ATOMIC);
+		BUG_ON(retval != 0);
 		i++;
 	} while_each_thread(leader, tsk);
 	/* remember the number of threads in the array for later. */
@@ -2072,7 +2084,8 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 		if (ss->can_attach_task) {
 			/* run on each task in the threadgroup. */
 			for (i = 0; i < group_size; i++) {
-				retval = ss->can_attach_task(cgrp, group[i]);
+				tsk = flex_array_get_ptr(group, i);
+				retval = ss->can_attach_task(cgrp, tsk);
 				if (retval) {
 					failed_ss = ss;
 					cancel_failed_ss = true;
@@ -2088,7 +2101,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	 */
 	INIT_LIST_HEAD(&newcg_list);
 	for (i = 0; i < group_size; i++) {
-		tsk = group[i];
+		tsk = flex_array_get_ptr(group, i);
 		/* nothing to do if this task is already in the cgroup */
 		oldcgrp = task_cgroup_from_root(tsk, root);
 		if (cgrp == oldcgrp)
@@ -2127,7 +2140,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 			ss->pre_attach(cgrp);
 	}
 	for (i = 0; i < group_size; i++) {
-		tsk = group[i];
+		tsk = flex_array_get_ptr(group, i);
 		/* leave current thread as it is if it's already there */
 		oldcgrp = task_cgroup_from_root(tsk, root);
 		if (cgrp == oldcgrp)
@@ -2180,10 +2193,12 @@ out_cancel_attach:
 		}
 	}
 	/* clean up the array of referenced threads in the group. */
-	for (i = 0; i < group_size; i++)
-		put_task_struct(group[i]);
+	for (i = 0; i < group_size; i++) {
+		tsk = flex_array_get_ptr(group, i);
+		put_task_struct(tsk);
+	}
 out_free_group_list:
-	kfree(group);
+	flex_array_free(group);
 	return retval;
 }
 

^ permalink raw reply related	[flat|nested] 185+ messages in thread

* Re: [PATCH v8.75 0/4] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
       [not found]               ` <20110406194420.GC10792-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
                                   ` (3 preceding siblings ...)
  2011-04-06 19:47                 ` [PATCH v8.75 4/4] cgroups: use flex_array in attach_proc Ben Blum
@ 2011-04-12 23:25                 ` Andrew Morton
  4 siblings, 0 replies; 185+ messages in thread
From: Andrew Morton @ 2011-04-12 23:25 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Miao Xie, David Rientjes, menage-hpIqsD4AKlfQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Wed, 6 Apr 2011 15:44:20 -0400
Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:

> Same as before; using flex_array in attach_proc (thanks Kame).
> 
> -- Ben
> 
> ---
>  Documentation/cgroups/cgroups.txt |   39 ++-
>  block/blk-cgroup.c                |   18 -
>  include/linux/cgroup.h            |   10 
>  include/linux/init_task.h         |    9 
>  include/linux/sched.h             |   36 ++
>  kernel/cgroup.c                   |  489 +++++++++++++++++++++++++++++++++-----
>  kernel/cgroup_freezer.c           |   26 --
>  kernel/cpuset.c                   |   96 +++----
>  kernel/fork.c                     |   10 
>  kernel/sched.c                    |   38 --
>  mm/memcontrol.c                   |   18 -
>  security/device_cgroup.c          |    3 
>  12 files changed, 594 insertions(+), 198 deletions(-)

So where are we up to with all this.

I'm surprised that none of the patches had anyone's Acked-by: or
Reviewed-by:.  Were they really that mean to you, or have you not been
tracking these?

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8.75 0/4] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
  2011-04-06 19:44             ` [PATCH v8.75 0/4] " Ben Blum
                                 ` (4 preceding siblings ...)
  2011-04-06 19:47               ` [PATCH v8.75 4/4] cgroups: use flex_array in attach_proc Ben Blum
@ 2011-04-12 23:25               ` Andrew Morton
  2011-04-12 23:59                 ` Ben Blum
                                   ` (2 more replies)
  5 siblings, 3 replies; 185+ messages in thread
From: Andrew Morton @ 2011-04-12 23:25 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, ebiederm, lizf, matthltc, menage, oleg,
	David Rientjes, Miao Xie

On Wed, 6 Apr 2011 15:44:20 -0400
Ben Blum <bblum@andrew.cmu.edu> wrote:

> Same as before; using flex_array in attach_proc (thanks Kame).
> 
> -- Ben
> 
> ---
>  Documentation/cgroups/cgroups.txt |   39 ++-
>  block/blk-cgroup.c                |   18 -
>  include/linux/cgroup.h            |   10 
>  include/linux/init_task.h         |    9 
>  include/linux/sched.h             |   36 ++
>  kernel/cgroup.c                   |  489 +++++++++++++++++++++++++++++++++-----
>  kernel/cgroup_freezer.c           |   26 --
>  kernel/cpuset.c                   |   96 +++----
>  kernel/fork.c                     |   10 
>  kernel/sched.c                    |   38 --
>  mm/memcontrol.c                   |   18 -
>  security/device_cgroup.c          |    3 
>  12 files changed, 594 insertions(+), 198 deletions(-)

So where are we up to with all this.

I'm surprised that none of the patches had anyone's Acked-by: or
Reviewed-by:.  Were they really that mean to you, or have you not been
tracking these?


^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8.75 0/4] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
       [not found]                 ` <20110412162516.4120c441.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2011-04-12 23:59                   ` Ben Blum
  2011-04-13  2:07                   ` Li Zefan
  1 sibling, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-04-12 23:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Miao Xie, David Rientjes, menage-hpIqsD4AKlfQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Tue, Apr 12, 2011 at 04:25:16PM -0700, Andrew Morton wrote:
> On Wed, 6 Apr 2011 15:44:20 -0400
> Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> 
> > Same as before; using flex_array in attach_proc (thanks Kame).
> > 
> > -- Ben
> > 
> > ---
> >  Documentation/cgroups/cgroups.txt |   39 ++-
> >  block/blk-cgroup.c                |   18 -
> >  include/linux/cgroup.h            |   10 
> >  include/linux/init_task.h         |    9 
> >  include/linux/sched.h             |   36 ++
> >  kernel/cgroup.c                   |  489 +++++++++++++++++++++++++++++++++-----
> >  kernel/cgroup_freezer.c           |   26 --
> >  kernel/cpuset.c                   |   96 +++----
> >  kernel/fork.c                     |   10 
> >  kernel/sched.c                    |   38 --
> >  mm/memcontrol.c                   |   18 -
> >  security/device_cgroup.c          |    3 
> >  12 files changed, 594 insertions(+), 198 deletions(-)
> 
> So where are we up to with all this.

done and good to go, hopefully? :O

> 
> I'm surprised that none of the patches had anyone's Acked-by: or
> Reviewed-by:.  Were they really that mean to you, or have you not been
> tracking these?
> 
> 

Oh, eep. I didn't think to put them there myself; I guess I was assuming
they'd either be implicit or that my reviewers would have something more
to say.

Thanks!

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8.75 0/4] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
  2011-04-12 23:25               ` [PATCH v8.75 0/4] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Andrew Morton
@ 2011-04-12 23:59                 ` Ben Blum
  2011-04-13  2:07                 ` Li Zefan
       [not found]                 ` <20110412162516.4120c441.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  2 siblings, 0 replies; 185+ messages in thread
From: Ben Blum @ 2011-04-12 23:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ben Blum, linux-kernel, containers, ebiederm, lizf, matthltc,
	menage, oleg, David Rientjes, Miao Xie

On Tue, Apr 12, 2011 at 04:25:16PM -0700, Andrew Morton wrote:
> On Wed, 6 Apr 2011 15:44:20 -0400
> Ben Blum <bblum@andrew.cmu.edu> wrote:
> 
> > Same as before; using flex_array in attach_proc (thanks Kame).
> > 
> > -- Ben
> > 
> > ---
> >  Documentation/cgroups/cgroups.txt |   39 ++-
> >  block/blk-cgroup.c                |   18 -
> >  include/linux/cgroup.h            |   10 
> >  include/linux/init_task.h         |    9 
> >  include/linux/sched.h             |   36 ++
> >  kernel/cgroup.c                   |  489 +++++++++++++++++++++++++++++++++-----
> >  kernel/cgroup_freezer.c           |   26 --
> >  kernel/cpuset.c                   |   96 +++----
> >  kernel/fork.c                     |   10 
> >  kernel/sched.c                    |   38 --
> >  mm/memcontrol.c                   |   18 -
> >  security/device_cgroup.c          |    3 
> >  12 files changed, 594 insertions(+), 198 deletions(-)
> 
> So where are we up to with all this.

done and good to go, hopefully? :O

> 
> I'm surprised that none of the patches had anyone's Acked-by: or
> Reviewed-by:.  Were they really that mean to you, or have you not been
> tracking these?
> 
> 

Oh, eep. I didn't think to put them there myself; I guess I was assuming
they'd either be implicit or that my reviewers would have something more
to say.

Thanks!

-- Ben

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8.75 0/4] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
       [not found]                 ` <20110412162516.4120c441.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  2011-04-12 23:59                   ` Ben Blum
@ 2011-04-13  2:07                   ` Li Zefan
  1 sibling, 0 replies; 185+ messages in thread
From: Li Zefan @ 2011-04-13  2:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	Miao Xie, David Rientjes, menage-hpIqsD4AKlfQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w

Andrew Morton wrote:
> On Wed, 6 Apr 2011 15:44:20 -0400
> Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> 
>> Same as before; using flex_array in attach_proc (thanks Kame).
>>
>> -- Ben
>>
>> ---
>>  Documentation/cgroups/cgroups.txt |   39 ++-
>>  block/blk-cgroup.c                |   18 -
>>  include/linux/cgroup.h            |   10 
>>  include/linux/init_task.h         |    9 
>>  include/linux/sched.h             |   36 ++
>>  kernel/cgroup.c                   |  489 +++++++++++++++++++++++++++++++++-----
>>  kernel/cgroup_freezer.c           |   26 --
>>  kernel/cpuset.c                   |   96 +++----
>>  kernel/fork.c                     |   10 
>>  kernel/sched.c                    |   38 --
>>  mm/memcontrol.c                   |   18 -
>>  security/device_cgroup.c          |    3 
>>  12 files changed, 594 insertions(+), 198 deletions(-)
> 
> So where are we up to with all this.
> 
> I'm surprised that none of the patches had anyone's Acked-by: or
> Reviewed-by:.  Were they really that mean to you, or have you not been
> tracking these?
> 
> 

Paul reviewed the patchset and explicitly gave his reviewed-by tag for all
the 3 patches.

And I'm going to do some testing for it.

^ permalink raw reply	[flat|nested] 185+ messages in thread

* Re: [PATCH v8.75 0/4] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
  2011-04-12 23:25               ` [PATCH v8.75 0/4] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Andrew Morton
  2011-04-12 23:59                 ` Ben Blum
@ 2011-04-13  2:07                 ` Li Zefan
       [not found]                 ` <20110412162516.4120c441.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  2 siblings, 0 replies; 185+ messages in thread
From: Li Zefan @ 2011-04-13  2:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ben Blum, linux-kernel, containers, ebiederm, matthltc, menage,
	oleg, David Rientjes, Miao Xie

Andrew Morton wrote:
> On Wed, 6 Apr 2011 15:44:20 -0400
> Ben Blum <bblum@andrew.cmu.edu> wrote:
> 
>> Same as before; using flex_array in attach_proc (thanks Kame).
>>
>> -- Ben
>>
>> ---
>>  Documentation/cgroups/cgroups.txt |   39 ++-
>>  block/blk-cgroup.c                |   18 -
>>  include/linux/cgroup.h            |   10 
>>  include/linux/init_task.h         |    9 
>>  include/linux/sched.h             |   36 ++
>>  kernel/cgroup.c                   |  489 +++++++++++++++++++++++++++++++++-----
>>  kernel/cgroup_freezer.c           |   26 --
>>  kernel/cpuset.c                   |   96 +++----
>>  kernel/fork.c                     |   10 
>>  kernel/sched.c                    |   38 --
>>  mm/memcontrol.c                   |   18 -
>>  security/device_cgroup.c          |    3 
>>  12 files changed, 594 insertions(+), 198 deletions(-)
> 
> So where are we up to with all this.
> 
> I'm surprised that none of the patches had anyone's Acked-by: or
> Reviewed-by:.  Were they really that mean to you, or have you not been
> tracking these?
> 
> 

Paul reviewed the patchset and explicitly gave his reviewed-by tag for all
the 3 patches.

And I'm going to do some testing for it.

^ permalink raw reply	[flat|nested] 185+ messages in thread

end of thread, other threads:[~2011-04-13  2:07 UTC | newest]

Thread overview: 185+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-07-30 23:56 [PATCH v4 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Ben Blum
2010-07-30 23:59 ` [PATCH v4 2/2] cgroups: make procs file writable Ben Blum
2010-08-04  1:08   ` KAMEZAWA Hiroyuki
2010-08-04  4:28     ` Ben Blum
     [not found]     ` <20100804100811.199d73ba.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2010-08-04  4:28       ` Ben Blum
2010-08-04  4:30       ` Paul Menage
2010-08-04  4:30     ` Paul Menage
     [not found]       ` <AANLkTikMofFGHSwF2QrdcAsit+hU6ihndhK5cod8duwS-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-08-04  4:38         ` Ben Blum
2010-08-04  4:38           ` Ben Blum
2010-08-04  4:46           ` Paul Menage
     [not found]           ` <20100804043849.GC11950-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-08-04  4:46             ` Paul Menage
     [not found]   ` <20100730235902.GC22644-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-08-04  1:08     ` KAMEZAWA Hiroyuki
     [not found] ` <20100730235649.GA22644-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-07-30 23:57   ` [PATCH v4 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup Ben Blum
2010-07-30 23:57     ` Ben Blum
2010-08-04  3:44     ` Paul Menage
     [not found]       ` <AANLkTikpNG2Y3S3AyxAbCkMynKu1u5yKPrw=bh+uy=9R-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-08-04  4:33         ` Ben Blum
2010-08-04  4:33           ` Ben Blum
     [not found]           ` <20100804043328.GB11950-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-08-04  4:34             ` Paul Menage
2010-08-04  4:34               ` Paul Menage
2010-08-06  6:02               ` Ben Blum
2010-08-06  7:08                 ` KAMEZAWA Hiroyuki
     [not found]                 ` <20100806060224.GA1351-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-08-06  7:08                   ` KAMEZAWA Hiroyuki
     [not found]               ` <AANLkTi=dhym3c+XJVjoObROcw=mz2Y+a2R5oMdePK3Ng-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-08-06  6:02                 ` Ben Blum
2010-08-04 16:34         ` Brian K. White
2010-08-04 16:34           ` Brian K. White
     [not found]     ` <20100730235754.GB22644-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-08-04  3:44       ` Paul Menage
2010-07-30 23:59   ` [PATCH v4 2/2] cgroups: make procs file writable Ben Blum
2010-08-03 19:58   ` [PATCH v4 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Andrew Morton
2010-08-11  5:46   ` [PATCH v5 0/3] " Ben Blum
2010-08-11  5:46     ` Ben Blum
2010-08-11  5:47     ` [PATCH v5 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup Ben Blum
2010-08-23 23:35       ` Paul Menage
     [not found]       ` <20100811054711.GB8743-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-08-23 23:35         ` Paul Menage
     [not found]     ` <20100811054604.GA8743-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-08-11  5:47       ` Ben Blum
2010-08-11  5:48       ` [PATCH v5 2/3] cgroups: add can_attach callback for checking all threads in a group Ben Blum
2010-08-11  5:48         ` Ben Blum
2010-08-23 23:31         ` Paul Menage
     [not found]         ` <20100811054814.GC8743-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-08-23 23:31           ` Paul Menage
2010-08-11  5:48       ` [PATCH v5 3/3] cgroups: make procs file writable Ben Blum
2010-12-24  8:22       ` [PATCH v6 0/3] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Ben Blum
2010-12-24  8:22         ` Ben Blum
     [not found]         ` <20101224082226.GA13872-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-12-24  8:23           ` [PATCH v6 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup Ben Blum
2010-12-24  8:23             ` Ben Blum
2010-12-24  8:24           ` [PATCH v6 2/3] cgroups: add can_attach callback for checking all threads in a group Ben Blum
2010-12-24  8:24             ` Ben Blum
2010-12-24  8:24           ` [PATCH v6 3/3] cgroups: make procs file writable Ben Blum
2010-12-24  8:24             ` Ben Blum
2011-01-12 23:26             ` Paul E. McKenney
     [not found]             ` <20101224082445.GD13872-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2011-01-12 23:26               ` Paul E. McKenney
2010-12-26 12:09           ` [PATCH v7 0/3] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Ben Blum
2010-12-26 12:09         ` Ben Blum
2010-12-26 12:09           ` [PATCH v7 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup Ben Blum
2011-01-24  8:38             ` Paul Menage
2011-01-24 21:05             ` Andrew Morton
2011-02-04 21:25               ` Ben Blum
     [not found]                 ` <20110204212515.GA5916-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2011-02-04 21:36                   ` Andrew Morton
2011-02-04 21:36                 ` Andrew Morton
     [not found]                   ` <20110204133657.78aeebe3.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2011-02-04 21:43                     ` Ben Blum
2011-02-04 21:43                       ` Ben Blum
2011-02-14  5:31               ` Paul Menage
     [not found]               ` <20110124130529.903d9832.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2011-02-04 21:25                 ` Ben Blum
2011-02-14  5:31                 ` Paul Menage
     [not found]             ` <20101226120951.GB28529-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2011-01-24  8:38               ` Paul Menage
2011-01-24 21:05               ` Andrew Morton
2010-12-26 12:11           ` [PATCH v7 2/3] cgroups: add atomic-context per-thread subsystem callbacks Ben Blum
     [not found]             ` <20101226121100.GC28529-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2011-01-24  8:38               ` Paul Menage
2011-01-24  8:38             ` Paul Menage
2011-01-24 15:32               ` Ben Blum
     [not found]               ` <AANLkTimytfrDnr_5SzBUFQu0SaGdAWDC0p38hiFiHrtU-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-01-24 15:32                 ` Ben Blum
2010-12-26 12:12           ` [PATCH v7 3/3] cgroups: make procs file writable Ben Blum
     [not found]           ` <20101226120919.GA28529-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-12-26 12:09             ` [PATCH v7 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup Ben Blum
2010-12-26 12:11             ` [PATCH v7 2/3] cgroups: add atomic-context per-thread subsystem callbacks Ben Blum
2010-12-26 12:12             ` [PATCH v7 3/3] cgroups: make procs file writable Ben Blum
2011-02-08  1:35             ` [PATCH v8 0/3] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Ben Blum
2011-02-08  1:35           ` Ben Blum
2011-02-08  1:37             ` [PATCH v8 1/3] cgroups: read-write lock CLONE_THREAD forking per threadgroup Ben Blum
2011-03-03 17:54               ` Paul Menage
     [not found]               ` <20110208013741.GD31569-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2011-03-03 17:54                 ` Paul Menage
     [not found]             ` <20110208013542.GC31569-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2011-02-08  1:37               ` Ben Blum
2011-02-08  1:39               ` [PATCH v8 2/3] cgroups: add per-thread subsystem callbacks Ben Blum
2011-02-08  1:39                 ` Ben Blum
2011-03-03 17:59                 ` Paul Menage
     [not found]                 ` <20110208013915.GE31569-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2011-03-03 17:59                   ` Paul Menage
2011-02-08  1:39               ` [PATCH v8 3/3] cgroups: make procs file writable Ben Blum
2011-02-08  1:39                 ` Ben Blum
2011-02-16 19:22                 ` [PATCH v8 4/3] cgroups: use flex_array in attach_proc Ben Blum
     [not found]                   ` <20110216192200.GA11980-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2011-03-03 17:48                     ` Paul Menage
2011-03-03 17:48                   ` Paul Menage
2011-03-22  5:15                     ` Ben Blum
     [not found]                       ` <20110322051553.GB11447-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2011-03-22  5:19                         ` [PATCH v8.5 " Ben Blum
2011-03-22  5:19                       ` Ben Blum
     [not found]                     ` <AANLkTinKTqBnjLKkv93UxyWoPL-2vyXP=LUvRz8JTC2K-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-03-22  5:15                       ` [PATCH v8 " Ben Blum
2011-03-03 18:38                 ` [PATCH v8 3/3] cgroups: make procs file writable Paul Menage
     [not found]                   ` <AANLkTinEnNsu8=PEktXL_EECzGYqsgdf+uogGxe7k4W+-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-03-10  6:18                     ` Ben Blum
2011-03-10  6:18                   ` Ben Blum
2011-03-10 20:01                     ` Paul Menage
     [not found]                       ` <AANLkTikkmfwk0nV0p=omz2ddrw+ZqWF1Lx3EfO6dTjEQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-03-15 21:13                         ` Ben Blum
2011-03-15 21:13                           ` Ben Blum
2011-03-18 16:54                           ` Paul Menage
     [not found]                             ` <AANLkTim4z_x_UQE__f5t73Dimja8PTTXTKKgj2phv6FY-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-03-22  5:18                               ` [PATCH v8.5 " Ben Blum
2011-03-22  5:18                                 ` Ben Blum
     [not found]                                 ` <20110322051841.GA12055-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2011-03-29 23:27                                   ` Paul Menage
2011-03-29 23:27                                 ` Paul Menage
2011-03-29 23:39                                   ` Andrew Morton
     [not found]                                   ` <BANLkTikMgd5HvMyC1BTGzAtj_=Jk=wZm+A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-03-29 23:39                                     ` Andrew Morton
     [not found]                           ` <20110315211353.GA9992-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2011-03-18 16:54                             ` [PATCH v8 " Paul Menage
2011-03-22  5:08                     ` Ben Blum
     [not found]                     ` <20110310061831.GA23736-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2011-03-10 20:01                       ` Paul Menage
2011-03-22  5:08                       ` Ben Blum
     [not found]                 ` <20110208013950.GF31569-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2011-02-16 19:22                   ` [PATCH v8 4/3] cgroups: use flex_array in attach_proc Ben Blum
2011-03-03 18:38                   ` [PATCH v8 3/3] cgroups: make procs file writable Paul Menage
2011-02-09 23:10               ` [PATCH v8 0/3] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Andrew Morton
2011-04-06 19:44               ` [PATCH v8.75 0/4] " Ben Blum
2011-02-09 23:10             ` [PATCH v8 0/3] " Andrew Morton
2011-02-10  1:02               ` KAMEZAWA Hiroyuki
     [not found]                 ` <20110210100210.adf09c49.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2011-02-10  1:36                   ` Ben Blum
2011-02-14  6:12                   ` Paul Menage
2011-02-10  1:36                 ` Ben Blum
2011-02-14  6:12                 ` Paul Menage
2011-02-14  6:12               ` Paul Menage
     [not found]               ` <20110209151046.89e03dcd.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2011-02-10  1:02                 ` KAMEZAWA Hiroyuki
2011-02-14  6:12                 ` Paul Menage
2011-04-06 19:44             ` [PATCH v8.75 0/4] " Ben Blum
2011-04-06 19:45               ` [PATCH v8.75 1/4] cgroups: read-write lock CLONE_THREAD forking per threadgroup Ben Blum
     [not found]               ` <20110406194420.GC10792-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2011-04-06 19:45                 ` Ben Blum
2011-04-06 19:46                 ` [PATCH v8.75 2/4] cgroups: add per-thread subsystem callbacks Ben Blum
2011-04-06 19:46                 ` [PATCH v8.75 3/4] cgroups: make procs file writable Ben Blum
2011-04-06 19:47                 ` [PATCH v8.75 4/4] cgroups: use flex_array in attach_proc Ben Blum
2011-04-12 23:25                 ` [PATCH v8.75 0/4] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Andrew Morton
2011-04-06 19:46               ` [PATCH v8.75 2/4] cgroups: add per-thread subsystem callbacks Ben Blum
2011-04-06 19:46               ` [PATCH v8.75 3/4] cgroups: make procs file writable Ben Blum
2011-04-06 19:47               ` [PATCH v8.75 4/4] cgroups: use flex_array in attach_proc Ben Blum
2011-04-12 23:25               ` [PATCH v8.75 0/4] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Andrew Morton
2011-04-12 23:59                 ` Ben Blum
2011-04-13  2:07                 ` Li Zefan
     [not found]                 ` <20110412162516.4120c441.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2011-04-12 23:59                   ` Ben Blum
2011-04-13  2:07                   ` Li Zefan
2010-08-11  5:48     ` [PATCH v5 3/3] cgroups: make procs file writable Ben Blum
     [not found]       ` <20100811054851.GD8743-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-08-24 18:08         ` Paul Menage
2010-08-24 18:08       ` Paul Menage
     [not found]         ` <AANLkTimRM8rDe+u7fTy853RK=1mnLJMK57Tci2OLPR7L-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-10-08 21:57           ` Paul Menage
     [not found]             ` <AANLkTim7HW0wNyqOPePFXmEMV8hx_fMKNMTAsSwkRzZX-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-12-16  6:34               ` Paul Menage
     [not found]                 ` <AANLkTin7aK5uEFi0U+iU_9=cbfRTHfDzKsbWupn73fSL-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-12-16  8:26                   ` Andrew Morton
     [not found]                     ` <20101216002603.6741874a.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2010-12-24  3:33                       ` Ben Blum
     [not found]                         ` <20101224033352.GA7804-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-12-24 10:49                           ` David Rientjes
     [not found]                             ` <alpine.DEB.2.00.1012240245040.775-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2010-12-24 11:45                               ` Ben Blum
     [not found]                                 ` <20101224114500.GA18036-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-12-24 11:53                                   ` Andrew Morton
     [not found]                                     ` <20101224035331.b907b410.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2010-12-24 12:08                                       ` Ben Blum
     [not found]                                         ` <20101224120853.GA18518-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-12-24 21:24                                           ` Ben Blum
     [not found]                                             ` <20101224212452.GA27275-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-12-24 21:34                                               ` David Rientjes
     [not found]                                                 ` <alpine.DEB.2.00.1012241333010.13509-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2010-12-24 23:09                                                   ` Ben Blum
     [not found]                                                     ` <20101224230901.GA30136-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-12-26 21:48                                                       ` David Rientjes
     [not found]                                                         ` <alpine.DEB.2.00.1012261345340.23173-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2010-12-27  0:12                                                           ` Ben Blum
     [not found]                                                             ` <20101227001233.GA10951-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-12-27 10:31                                                               ` David Rientjes
     [not found]                                                                 ` <alpine.DEB.2.00.1012270227010.3960-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2010-12-27 10:37                                                                   ` Ben Blum
     [not found]                                                                     ` <20101227103701.GC20986-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-12-27 10:53                                                                       ` David Rientjes
     [not found]                                                                         ` <alpine.DEB.2.00.1012270240400.3960-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2010-12-27 11:00                                                                           ` Ben Blum
     [not found]                                                                             ` <20101227110050.GF20986-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-12-27 11:03                                                                               ` David Rientjes
2010-12-29  1:39                                                                           ` Li Zefan
     [not found]                                                                             ` <4D1A913C.5080702-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2010-12-30  0:26                                                                               ` David Rientjes
     [not found]                                                                                 ` <alpine.DEB.2.00.1012291624210.6040-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2010-12-30  4:02                                                                                   ` Li Zefan
     [not found]                                                                                     ` <4D1C0464.5090801-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2010-12-30  4:24                                                                                       ` David Rientjes
     [not found]                                                                                         ` <alpine.DEB.2.00.1012292019540.27634-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2010-12-30  4:38                                                                                           ` Li Zefan
     [not found]                                                                                             ` <4D1C0CC6.4090107-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2010-12-30  5:49                                                                                               ` David Rientjes
     [not found]                                                                                                 ` <alpine.DEB.2.00.1012292149000.29486-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>
2010-12-30  6:12                                                                                                   ` Li Zefan
     [not found]                                                                                                     ` <4D1C22D2.9090007-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2010-12-30 18:25                                                                                                       ` David Rientjes
2010-12-24 21:32                                           ` David Rientjes
2010-12-25  2:55                           ` Ben Blum
     [not found]                             ` <20101225025508.GA649-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-12-27  0:53                               ` Daisuke Nishimura
     [not found]                                 ` <20101227095353.48d95687.nishimura-YQH0OdQVrdy45+QrQBaojngSJqDPrsil@public.gmane.org>
2010-12-27  1:15                                   ` KAMEZAWA Hiroyuki
2010-12-27  4:22                                   ` Ben Blum
     [not found]                                     ` <20101227042254.GA15417-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-12-27  7:00                                       ` KAMEZAWA Hiroyuki
     [not found]                                         ` <20101227160041.07bff52a.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2010-12-27  7:21                                           ` Ben Blum
     [not found]                                             ` <20101227072123.GA19652-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-12-27  7:42                                               ` KAMEZAWA Hiroyuki
     [not found]                                                 ` <20101227164207.b09318be.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2010-12-27  8:42                                                   ` Ben Blum
     [not found]                                                     ` <20101227084257.GA20986-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-12-27  9:18                                                       ` KAMEZAWA Hiroyuki
     [not found]                                                         ` <20101227181801.095e9a23.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2010-12-27 10:12                                                           ` Ben Blum
     [not found]                                                             ` <20101227101228.GB20986-dJQ2lsn+DImqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2011-01-04  0:57                                                               ` KAMEZAWA Hiroyuki
2010-12-28  2:43                                       ` [RFC][BUGFIX] memcg: fix dead lock between cpuset and memcg (Re: [PATCH v5 3/3] cgroups: make procs file writable) Daisuke Nishimura
2010-12-25  4:24                           ` [PATCH v5 3/3] cgroups: make procs file writable Ben Blum
2010-08-03 19:58 ` [PATCH v4 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Andrew Morton
     [not found]   ` <20100803125827.0822e6ab.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2010-08-03 23:45     ` KAMEZAWA Hiroyuki
2010-08-04  2:00     ` Li Zefan
2010-08-04  2:00       ` Li Zefan
2010-08-03 23:45   ` KAMEZAWA Hiroyuki
