* [RFC] [PATCH v3 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
@ 2010-06-25  5:45 Ben Blum
       [not found] ` <20100625054541.GA20610-kwnxxEB+oiWqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2010-06-25  5:47 ` Ben Blum
  0 siblings, 2 replies; 16+ messages in thread
From: Ben Blum @ 2010-06-25  5:45 UTC (permalink / raw)
  To: linux-kernel, containers
  Cc: akpm, bblum, ebiederm, lizf, matthltc, menage, oleg

This patch series is a revision of http://lkml.org/lkml/2010/5/29/126 .

These patches use an rwsem in signal_struct, access to which depends
on Oleg's recent changes to signal_struct's lifetime rules.

It is okay to write the tid of any task in the threadgroup (not just the
tgid). This is implemented by taking task->group_leader right after
find_task_by_vpid, while still inside the rcu_read_lock() section. As a
consequence, we have to verify thread_group_leader(leader) every time we
want to iterate over ->thread_group, and each of these checks can fail
with -EAGAIN. Unfortunately this also means the check has to appear in
can_attach for each subsystem that needs to examine every thread in the
group.

I handle -EAGAIN by retrying in the file's write handler - the operation
is already expensive, so making it potentially unbounded-time as well
seems acceptable. This retry is optional and would work just as well with
-EAGAIN returned to userspace.
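
For reference, the retry in the write handler amounts to the following
(this mirrors cgroup_procs_write in patch 2/2):

static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
{
	int ret;

	do {
		/*
		 * -EAGAIN means we raced with de_thread() and the threadgroup
		 * leadership changed; look the task up again and retry.
		 */
		ret = attach_task_by_pid(cgrp, tgid, true);
	} while (ret == -EAGAIN);
	return ret;
}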

-- bblum

---
 Documentation/cgroups/cgroups.txt |   13 -
 include/linux/cgroup.h            |   15 -
 include/linux/init_task.h         |    9
 include/linux/sched.h             |   10
 kernel/cgroup.c                   |  449 +++++++++++++++++++++++++++++++++-----
 kernel/cgroup_freezer.c           |    4
 kernel/cpuset.c                   |    4
 kernel/fork.c                     |   10
 kernel/ns_cgroup.c                |    4
 kernel/sched.c                    |    4
 10 files changed, 462 insertions(+), 60 deletions(-)

* [RFC] [PATCH v3 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup
  2010-06-25  5:45 [RFC] [PATCH v3 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Ben Blum
@ 2010-06-25  5:46     ` Ben Blum
  2010-06-25  5:47 ` Ben Blum
  1 sibling, 0 replies; 16+ messages in thread
From: Ben Blum @ 2010-06-25  5:46 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

[-- Attachment #1: cgroup-threadgroup-fork-lock.patch --]
[-- Type: text/plain, Size: 7609 bytes --]

Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup

From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

This patch adds an rwsem that lives in a threadgroup's signal_struct that's
taken for reading in the fork path, under CONFIG_CGROUPS. If another part of
the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
ifdefs should be changed to a higher-up flag that CGROUPS and the other system
would both depend on.
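
As a rough illustration of the intended protocol (a simplified sketch, not
the literal code in this patch; the helper names below are made up for the
example):

/* fork path, CLONE_THREAD only - see cgroup_fork()/cgroup_post_fork() */
static void threadgroup_fork_read_lock(struct task_struct *tsk)
{
	down_read(&tsk->signal->threadgroup_fork_lock);
}

static void threadgroup_fork_read_unlock(struct task_struct *tsk)
{
	up_read(&tsk->signal->threadgroup_fork_lock);
}

/*
 * threadgroup-wide operation (the cgroup.procs writer in patch 2/2):
 * holding the rwsem for writing keeps new CLONE_THREAD children from
 * appearing while every existing thread in the group is examined and
 * migrated.
 */
static void threadgroup_fork_write_lock(struct task_struct *leader)
{
	down_write(&leader->signal->threadgroup_fork_lock);
}

static void threadgroup_fork_write_unlock(struct task_struct *leader)
{
	up_write(&leader->signal->threadgroup_fork_lock);
}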

This is a pre-patch for cgroups-procs-write.patch.

Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
---
 include/linux/cgroup.h    |   15 ++++++++++-----
 include/linux/init_task.h |    9 +++++++++
 include/linux/sched.h     |   10 ++++++++++
 kernel/cgroup.c           |   23 +++++++++++++++++++++--
 kernel/fork.c             |   10 +++++++---
 5 files changed, 57 insertions(+), 10 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 8f78073..196a703 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -31,10 +31,12 @@ extern void cgroup_lock(void);
 extern int cgroup_lock_is_held(void);
 extern bool cgroup_lock_live_group(struct cgroup *cgrp);
 extern void cgroup_unlock(void);
-extern void cgroup_fork(struct task_struct *p);
+extern void cgroup_fork(struct task_struct *p, unsigned long clone_flags);
 extern void cgroup_fork_callbacks(struct task_struct *p);
-extern void cgroup_post_fork(struct task_struct *p);
+extern void cgroup_post_fork(struct task_struct *p, unsigned long clone_flags);
 extern void cgroup_exit(struct task_struct *p, int run_callbacks);
+extern void cgroup_fork_failed(struct task_struct *p, int run_callbacks,
+			       unsigned long clone_flags);
 extern int cgroupstats_build(struct cgroupstats *stats,
 				struct dentry *dentry);
 extern int cgroup_load_subsys(struct cgroup_subsys *ss);
@@ -613,11 +615,14 @@ unsigned short css_depth(struct cgroup_subsys_state *css);
 
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
-static inline void cgroup_fork(struct task_struct *p) {}
+static inline void cgroup_fork(struct task_struct *p,
+			       unsigned long clone_flags) {}
 static inline void cgroup_fork_callbacks(struct task_struct *p) {}
-static inline void cgroup_post_fork(struct task_struct *p) {}
+static inline void cgroup_post_fork(struct task_struct *p,
+				    unsigned long clone_flags) {}
 static inline void cgroup_exit(struct task_struct *p, int callbacks) {}
-
+static inline void cgroup_fork_failed(struct task_struct *p, int callbacks,
+				      unsigned long clone_flags) {}
 static inline void cgroup_lock(void) {}
 static inline void cgroup_unlock(void) {}
 static inline int cgroupstats_build(struct cgroupstats *stats,
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index b1ed1cd..cfb2bc8 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -15,6 +15,14 @@
 extern struct files_struct init_files;
 extern struct fs_struct init_fs;
 
+#ifdef CONFIG_CGROUPS
+#define INIT_THREADGROUP_FORK_LOCK(sig)					\
+	.threadgroup_fork_lock =					\
+		__RWSEM_INITIALIZER(sig.threadgroup_fork_lock),
+#else
+#define INIT_THREADGROUP_FORK_LOCK(sig)
+#endif
+
 #define INIT_SIGNALS(sig) {						\
 	.count		= ATOMIC_INIT(1), 				\
 	.wait_chldexit	= __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
@@ -29,6 +37,7 @@ extern struct fs_struct init_fs;
 		.running = 0,						\
 		.lock = __SPIN_LOCK_UNLOCKED(sig.cputimer.lock),	\
 	},								\
+	INIT_THREADGROUP_FORK_LOCK(sig)					\
 }
 
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2b7b81d..2bbcbd2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -627,6 +627,16 @@ struct signal_struct {
 	unsigned audit_tty;
 	struct tty_audit_buf *tty_audit_buf;
 #endif
+#ifdef CONFIG_CGROUPS
+	/*
+	 * The threadgroup_fork_lock prevents threads from forking with
+	 * CLONE_THREAD while held for writing. Use this for fork-sensitive
+	 * threadgroup-wide operations.
+	 * Currently only needed by cgroups, and the fork-side readlock happens
+	 * in cgroup_{fork,post_fork,fork_failed}.
+	 */
+	struct rw_semaphore threadgroup_fork_lock;
+#endif
 
 	int oom_adj;	/* OOM kill score adjustment (bit shift) */
 };
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 6d870f2..6c8e46f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4015,8 +4015,10 @@ static const struct file_operations proc_cgroupstats_operations = {
  * At the point that cgroup_fork() is called, 'current' is the parent
  * task, and the passed argument 'child' points to the child task.
  */
-void cgroup_fork(struct task_struct *child)
+void cgroup_fork(struct task_struct *child, unsigned long clone_flags)
 {
+	if (clone_flags & CLONE_THREAD)
+		down_read(&current->signal->threadgroup_fork_lock);
 	task_lock(current);
 	child->cgroups = current->cgroups;
 	get_css_set(child->cgroups);
@@ -4058,7 +4060,7 @@ void cgroup_fork_callbacks(struct task_struct *child)
  * with the first call to cgroup_iter_start() - to guarantee that the
  * new task ends up on its list.
  */
-void cgroup_post_fork(struct task_struct *child)
+void cgroup_post_fork(struct task_struct *child, unsigned long clone_flags)
 {
 	if (use_task_css_set_links) {
 		write_lock(&css_set_lock);
@@ -4068,6 +4070,8 @@ void cgroup_post_fork(struct task_struct *child)
 		task_unlock(child);
 		write_unlock(&css_set_lock);
 	}
+	if (clone_flags & CLONE_THREAD)
+		up_read(&current->signal->threadgroup_fork_lock);
 }
 /**
  * cgroup_exit - detach cgroup from exiting task
@@ -4143,6 +4147,21 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
 }
 
 /**
+ * cgroup_fork_failed - undo operations for fork failure
+ * @tsk: pointer to the task_struct of the task whose fork failed
+ * @run_callbacks: run exit callbacks?
+ *
+ * Wrapper for cgroup_exit() that also drops the threadgroup fork lock.
+ */
+void cgroup_fork_failed(struct task_struct *tsk, int run_callbacks,
+			unsigned long clone_flags)
+{
+	if (clone_flags & CLONE_THREAD)
+		up_read(&current->signal->threadgroup_fork_lock);
+	cgroup_exit(tsk, run_callbacks);
+}
+
+/**
  * cgroup_clone - clone the cgroup the given subsystem is attached to
  * @tsk: the task to be moved
  * @subsys: the given subsystem
diff --git a/kernel/fork.c b/kernel/fork.c
index 4c14942..e2b04ac 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -884,6 +884,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 
 	tty_audit_fork(sig);
 
+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->threadgroup_fork_lock);
+#endif
+
 	sig->oom_adj = current->signal->oom_adj;
 
 	return 0;
@@ -1069,7 +1073,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	monotonic_to_bootbased(&p->real_start_time);
 	p->io_context = NULL;
 	p->audit_context = NULL;
-	cgroup_fork(p);
+	cgroup_fork(p, clone_flags);
 #ifdef CONFIG_NUMA
 	p->mempolicy = mpol_dup(p->mempolicy);
  	if (IS_ERR(p->mempolicy)) {
@@ -1277,7 +1281,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	spin_unlock(&current->sighand->siglock);
 	write_unlock_irq(&tasklist_lock);
 	proc_fork_connector(p);
-	cgroup_post_fork(p);
+	cgroup_post_fork(p, clone_flags);
 	perf_event_fork(p);
 	return p;
 
@@ -1311,7 +1315,7 @@ bad_fork_cleanup_policy:
 	mpol_put(p->mempolicy);
 bad_fork_cleanup_cgroup:
 #endif
-	cgroup_exit(p, cgroup_callbacks_done);
+	cgroup_fork_failed(p, cgroup_callbacks_done, clone_flags);
 	delayacct_tsk_free(p);
 	module_put(task_thread_info(p)->exec_domain->module);
 bad_fork_cleanup_count:

* [RFC] [PATCH v3 2/2] cgroups: make procs file writable
       [not found] ` <20100625054541.GA20610-kwnxxEB+oiWqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  2010-06-25  5:46     ` Ben Blum
@ 2010-06-25  5:47   ` Ben Blum
  1 sibling, 0 replies; 16+ messages in thread
From: Ben Blum @ 2010-06-25  5:47 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

[-- Attachment #1: cgroup-procs-writable.patch --]
[-- Type: text/plain, Size: 19762 bytes --]

Makes procs file writable to move all threads by tgid at once

From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>

This patch adds functionality that enables users to move all threads in a
threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
file. This current implementation makes use of a per-threadgroup rwsem that's
taken for reading in the fork() path to prevent newly forking threads within
the threadgroup from "escaping" while the move is in progress.
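
For subsystem authors, the per-thread can_attach check that this requires
(when the threadgroup argument is true) follows the pattern below. It
mirrors the freezer/cpuset/ns/sched hunks in this patch; the per-thread
check name is made up for the example:

static int example_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
			      struct task_struct *task, bool threadgroup)
{
	struct task_struct *c;
	int ret;

	/* per-thread check; example_check_one() is illustrative only */
	ret = example_check_one(task);
	if (ret || !threadgroup)
		return ret;

	rcu_read_lock();
	/* if we raced with de_thread, the caller retries with the new leader */
	if (!thread_group_leader(task)) {
		rcu_read_unlock();
		return -EAGAIN;
	}
	list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
		ret = example_check_one(c);
		if (ret) {
			rcu_read_unlock();
			return ret;
		}
	}
	rcu_read_unlock();
	return 0;
}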

Signed-off-by: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
---
 Documentation/cgroups/cgroups.txt |   13 +
 kernel/cgroup.c                   |  426 +++++++++++++++++++++++++++++++++----
 kernel/cgroup_freezer.c           |    4 
 kernel/cpuset.c                   |    4 
 kernel/ns_cgroup.c                |    4 
 kernel/sched.c                    |    4 
 6 files changed, 405 insertions(+), 50 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index a1ca592..1d9c81e 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -235,7 +235,8 @@ containing the following files describing that cgroup:
  - cgroup.procs: list of tgids in the cgroup.  This list is not
    guaranteed to be sorted or free of duplicate tgids, and userspace
    should sort/uniquify the list if this property is required.
-   This is a read-only file, for now.
+   Writing a thread group id into this file moves all threads in that
+   group into this cgroup.
  - notify_on_release flag: run the release agent on exit?
  - release_agent: the path to use for release notifications (this file
    exists in the top cgroup only)
@@ -416,6 +417,12 @@ You can attach the current shell task by echoing 0:
 
 # echo 0 > tasks
 
+You can use the cgroup.procs file instead of the tasks file to move all
+threads in a threadgroup at once. Echoing the pid of any task in a
+threadgroup to cgroup.procs causes all tasks in that threadgroup to
+be attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
+in the writing task's threadgroup.
+
 2.3 Mounting hierarchies by name
 --------------------------------
 
@@ -564,7 +571,9 @@ called on a fork. If this method returns 0 (success) then this should
 remain valid while the caller holds cgroup_mutex and it is ensured that either
 attach() or cancel_attach() will be called in future. If threadgroup is
 true, then a successful result indicates that all threads in the given
-thread's threadgroup can be moved together.
+thread's threadgroup can be moved together. If the subsystem wants to
+iterate over task->thread_group, it must take rcu_read_lock(), then check
+thread_group_leader(task), returning -EAGAIN if that check fails.
 
 void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	       struct task_struct *task, bool threadgroup)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 6c8e46f..526f44b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1686,6 +1686,76 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 }
 EXPORT_SYMBOL_GPL(cgroup_path);
 
+/*
+ * cgroup_task_migrate - move a task from one cgroup to another.
+ *
+ * 'guarantee' is set if the caller promises that a new css_set for the task
+ * will already exist. If not set, this function might sleep, and can fail with
+ * -ENOMEM. Otherwise, it can only fail with -ESRCH.
+ */
+static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
+			       struct task_struct *tsk, bool guarantee)
+{
+	struct css_set *oldcg;
+	struct css_set *newcg;
+
+	/*
+	 * get old css_set. we need to take task_lock and refcount it, because
+	 * an exiting task can change its css_set to init_css_set and drop its
+	 * old one without taking cgroup_mutex.
+	 */
+	task_lock(tsk);
+	oldcg = tsk->cgroups;
+	get_css_set(oldcg);
+	task_unlock(tsk);
+
+	/* locate or allocate a new css_set for this task. */
+	if (guarantee) {
+		/* we know the css_set we want already exists. */
+		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+		read_lock(&css_set_lock);
+		newcg = find_existing_css_set(oldcg, cgrp, template);
+		BUG_ON(!newcg);
+		get_css_set(newcg);
+		read_unlock(&css_set_lock);
+	} else {
+		might_sleep();
+		/* find_css_set will give us newcg already referenced. */
+		newcg = find_css_set(oldcg, cgrp);
+		if (!newcg) {
+			put_css_set(oldcg);
+			return -ENOMEM;
+		}
+	}
+	put_css_set(oldcg);
+
+	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
+	task_lock(tsk);
+	if (tsk->flags & PF_EXITING) {
+		task_unlock(tsk);
+		put_css_set(newcg);
+		return -ESRCH;
+	}
+	rcu_assign_pointer(tsk->cgroups, newcg);
+	task_unlock(tsk);
+
+	/* Update the css_set linked lists if we're using them */
+	write_lock(&css_set_lock);
+	if (!list_empty(&tsk->cg_list))
+		list_move(&tsk->cg_list, &newcg->tasks);
+	write_unlock(&css_set_lock);
+
+	/*
+	 * We just gained a reference on oldcg by taking it from the task. As
+	 * trading it for newcg is protected by cgroup_mutex, we're safe to drop
+	 * it here; it will be freed under RCU.
+	 */
+	put_css_set(oldcg);
+
+	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+	return 0;
+}
+
 /**
  * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
  * @cgrp: the cgroup the task is attaching to
@@ -1696,11 +1766,9 @@ EXPORT_SYMBOL_GPL(cgroup_path);
  */
 int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
-	int retval = 0;
+	int retval;
 	struct cgroup_subsys *ss, *failed_ss = NULL;
 	struct cgroup *oldcgrp;
-	struct css_set *cg;
-	struct css_set *newcg;
 	struct cgroupfs_root *root = cgrp->root;
 
 	/* Nothing to do if the task is already in that cgroup */
@@ -1724,46 +1792,16 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		}
 	}
 
-	task_lock(tsk);
-	cg = tsk->cgroups;
-	get_css_set(cg);
-	task_unlock(tsk);
-	/*
-	 * Locate or allocate a new css_set for this task,
-	 * based on its final set of cgroups
-	 */
-	newcg = find_css_set(cg, cgrp);
-	put_css_set(cg);
-	if (!newcg) {
-		retval = -ENOMEM;
-		goto out;
-	}
-
-	task_lock(tsk);
-	if (tsk->flags & PF_EXITING) {
-		task_unlock(tsk);
-		put_css_set(newcg);
-		retval = -ESRCH;
+	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, false);
+	if (retval)
 		goto out;
-	}
-	rcu_assign_pointer(tsk->cgroups, newcg);
-	task_unlock(tsk);
-
-	/* Update the css_set linked lists if we're using them */
-	write_lock(&css_set_lock);
-	if (!list_empty(&tsk->cg_list)) {
-		list_del(&tsk->cg_list);
-		list_add(&tsk->cg_list, &newcg->tasks);
-	}
-	write_unlock(&css_set_lock);
 
 	for_each_subsys(root, ss) {
 		if (ss->attach)
 			ss->attach(ss, cgrp, oldcgrp, tsk, false);
 	}
-	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+
 	synchronize_rcu();
-	put_css_set(cg);
 
 	/*
 	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
@@ -1789,49 +1827,341 @@ out:
 }
 
 /*
- * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
- * held. May take task_lock of task
+ * cgroup_attach_proc works in two stages, the first of which prefetches all
+ * new css_sets needed (to make sure we have enough memory before committing
+ * to the move) and stores them in a list of entries of the following type.
+ * TODO: possible optimization: use css_set->rcu_head for chaining instead
+ */
+struct cg_list_entry {
+	struct css_set *cg;
+	struct list_head links;
+};
+
+static bool css_set_check_fetched(struct cgroup *cgrp,
+				  struct task_struct *tsk, struct css_set *cg,
+				  struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+
+	read_lock(&css_set_lock);
+	newcg = find_existing_css_set(cg, cgrp, template);
+	if (newcg)
+		get_css_set(newcg);
+	read_unlock(&css_set_lock);
+
+	/* doesn't exist at all? */
+	if (!newcg)
+		return false;
+	/* see if it's already in the list */
+	list_for_each_entry(cg_entry, newcg_list, links) {
+		if (cg_entry->cg == newcg) {
+			put_css_set(newcg);
+			return true;
+		}
+	}
+
+	/* not found */
+	put_css_set(newcg);
+	return false;
+}
+
+/*
+ * Find the new css_set and store it in the list in preparation for moving the
+ * given task to the given cgroup. Returns 0 or -ENOMEM.
  */
-static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
+static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
+			    struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+
+	/* ensure a new css_set will exist for this thread */
+	newcg = find_css_set(cg, cgrp);
+	if (!newcg)
+		return -ENOMEM;
+	/* add it to the list */
+	cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
+	if (!cg_entry) {
+		put_css_set(newcg);
+		return -ENOMEM;
+	}
+	cg_entry->cg = newcg;
+	list_add(&cg_entry->links, newcg_list);
+	return 0;
+}
+
+/**
+ * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
+ * @cgrp: the cgroup to attach to
+ * @leader: the threadgroup leader task_struct of the group to be attached
+ *
+ * Call holding cgroup_mutex. Will take task_lock of each thread in leader's
+ * threadgroup individually in turn.
+ */
+int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
+{
+	int retval;
+	struct cgroup_subsys *ss, *failed_ss = NULL;
+	struct cgroup *oldcgrp;
+	struct css_set *oldcg;
+	struct cgroupfs_root *root = cgrp->root;
+	/* threadgroup list cursor */
+	struct task_struct *tsk;
+	/*
+	 * we need to make sure we have css_sets for all the tasks we're
+	 * going to move -before- we actually start moving them, so that in
+	 * case we get an ENOMEM we can bail out before making any changes.
+	 */
+	struct list_head newcg_list;
+	struct cg_list_entry *cg_entry, *temp_nobe;
+
+	/* check that we can legitimately attach to the cgroup. */
+	for_each_subsys(root, ss) {
+		if (ss->can_attach) {
+			retval = ss->can_attach(ss, cgrp, leader, true);
+			if (retval) {
+				failed_ss = ss;
+				goto out;
+			}
+		}
+	}
+
+	/*
+	 * step 1: make sure css_sets exist for all threads to be migrated.
+	 * we use find_css_set, which allocates a new one if necessary.
+	 */
+	INIT_LIST_HEAD(&newcg_list);
+	oldcgrp = task_cgroup_from_root(leader, root);
+	if (cgrp != oldcgrp) {
+		/* get old css_set */
+		task_lock(leader);
+		if (leader->flags & PF_EXITING) {
+			task_unlock(leader);
+			goto prefetch_loop;
+		}
+		oldcg = leader->cgroups;
+		get_css_set(oldcg);
+		task_unlock(leader);
+		/* acquire new one */
+		retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+		put_css_set(oldcg);
+		if (retval)
+			goto list_teardown;
+	}
+prefetch_loop:
+	rcu_read_lock();
+	/* sanity check - if we raced with de_thread, we must abort */
+	if (!thread_group_leader(leader)) {
+		retval = -EAGAIN;
+		goto list_teardown;
+	}
+	/*
+	 * if we need to fetch a new css_set for this task, we must exit the
+	 * rcu_read section because allocating it can sleep. afterwards, we'll
+	 * need to restart iteration on the threadgroup list - the whole thing
+	 * will be O(nm) in the number of threads and css_sets; as the typical
+	 * case has only one css_set for all of them, usually O(n). which ones
+	 * we need allocated won't change as long as we hold cgroup_mutex.
+	 */
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* nothing to do if this task is already in the cgroup */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* get old css_set pointer */
+		task_lock(tsk);
+		if (tsk->flags & PF_EXITING) {
+			/* ignore this task if it's going away */
+			task_unlock(tsk);
+			continue;
+		}
+		oldcg = tsk->cgroups;
+		get_css_set(oldcg);
+		task_unlock(tsk);
+		/* see if the new one for us is already in the list? */
+		if (css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list)) {
+			/* was already there, nothing to do. */
+			put_css_set(oldcg);
+		} else {
+			/* we don't already have it. get new one. */
+			rcu_read_unlock();
+			retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+			put_css_set(oldcg);
+			if (retval)
+				goto list_teardown;
+			/* begin iteration again. */
+			goto prefetch_loop;
+		}
+	}
+	rcu_read_unlock();
+
+	/*
+	 * step 2: now that we're guaranteed success wrt the css_sets, proceed
+	 * to move all tasks to the new cgroup. we need to lock against possible
+	 * races with fork(). note: we can safely access leader->signal because
+	 * attach_task_by_pid takes a reference on leader, which guarantees that
+	 * the signal_struct will stick around. threadgroup_fork_lock must be
+	 * taken outside of tasklist_lock to match the order in the fork path.
+	 */
+	BUG_ON(!leader->signal);
+	down_write(&leader->signal->threadgroup_fork_lock);
+	read_lock(&tasklist_lock);
+	/* sanity check - if we raced with de_thread, we must abort */
+	if (!thread_group_leader(leader)) {
+		retval = -EAGAIN;
+		read_unlock(&tasklist_lock);
+		up_write(&leader->signal->threadgroup_fork_lock);
+		goto list_teardown;
+	}
+	/*
+	 * No failure cases left, so this is the commit point.
+	 *
+	 * If the leader is already there, skip moving him. Note: even if the
+	 * leader is PF_EXITING, we still move all other threads; if everybody
+	 * is PF_EXITING, we end up doing nothing, which is ok.
+	 */
+	oldcgrp = task_cgroup_from_root(leader, root);
+	if (cgrp != oldcgrp) {
+		retval = cgroup_task_migrate(cgrp, oldcgrp, leader, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+	/* Now iterate over each thread in the group. */
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		BUG_ON(tsk->signal != leader->signal);
+		/* leave current thread as it is if it's already there */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* we don't care whether these threads are exiting */
+		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+
+	/*
+	 * step 3: attach whole threadgroup to each subsystem
+	 * TODO: if ever a subsystem needs to know the oldcgrp for each task
+	 * being moved, this call will need to be reworked to communicate that.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->attach)
+			ss->attach(ss, cgrp, oldcgrp, leader, true);
+	}
+	/* holding these until here keeps us safe from exec() and fork(). */
+	read_unlock(&tasklist_lock);
+	up_write(&leader->signal->threadgroup_fork_lock);
+
+	/*
+	 * step 4: success! and cleanup
+	 */
+	synchronize_rcu();
+	cgroup_wakeup_rmdir_waiter(cgrp);
+	retval = 0;
+list_teardown:
+	/* clean up the list of prefetched css_sets. */
+	list_for_each_entry_safe(cg_entry, temp_nobe, &newcg_list, links) {
+		list_del(&cg_entry->links);
+		put_css_set(cg_entry->cg);
+		kfree(cg_entry);
+	}
+out:
+	if (retval) {
+		/* same deal as in cgroup_attach_task, with threadgroup=true */
+		for_each_subsys(root, ss) {
+			if (ss == failed_ss)
+				break;
+			if (ss->cancel_attach)
+				ss->cancel_attach(ss, cgrp, leader, true);
+		}
+	}
+	return retval;
+}
+
+/*
+ * Find the task_struct of the task to attach by vpid and pass it along to the
+ * function to attach either it or all tasks in its threadgroup. Will take
+ * cgroup_mutex; may take task_lock of task.
+ */
+static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 {
 	struct task_struct *tsk;
 	const struct cred *cred = current_cred(), *tcred;
 	int ret;
 
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+
 	if (pid) {
 		rcu_read_lock();
 		tsk = find_task_by_vpid(pid);
-		if (!tsk || tsk->flags & PF_EXITING) {
+		if (!tsk) {
+			rcu_read_unlock();
+			cgroup_unlock();
+			return -ESRCH;
+		}
+		if (threadgroup) {
+			/*
+			 * it is safe to find group_leader because tsk was found
+			 * in the tid map, meaning it can't have been unhashed
+			 * by someone in de_thread changing the leadership.
+			 */
+			tsk = tsk->group_leader;
+			BUG_ON(!thread_group_leader(tsk));
+		} else if (tsk->flags & PF_EXITING) {
+			/* optimization for the single-task-only case */
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -ESRCH;
 		}
 
+		/*
+		 * even if we're attaching all tasks in the thread group, we
+		 * only need to check permissions on one of them.
+		 */
 		tcred = __task_cred(tsk);
 		if (cred->euid &&
 		    cred->euid != tcred->uid &&
 		    cred->euid != tcred->suid) {
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -EACCES;
 		}
 		get_task_struct(tsk);
 		rcu_read_unlock();
 	} else {
-		tsk = current;
+		if (threadgroup)
+			tsk = current->group_leader;
+		else
+			tsk = current;
 		get_task_struct(tsk);
 	}
 
-	ret = cgroup_attach_task(cgrp, tsk);
+	if (threadgroup)
+		ret = cgroup_attach_proc(cgrp, tsk);
+	else
+		ret = cgroup_attach_task(cgrp, tsk);
 	put_task_struct(tsk);
+	cgroup_unlock();
 	return ret;
 }
 
 static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
 {
+	return attach_task_by_pid(cgrp, pid, false);
+}
+
+static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
+{
 	int ret;
-	if (!cgroup_lock_live_group(cgrp))
-		return -ENODEV;
-	ret = attach_task_by_pid(cgrp, pid);
-	cgroup_unlock();
+	do {
+		/*
+		 * attach_proc fails with -EAGAIN if threadgroup leadership
+		 * changes in the middle of the operation, in which case we need
+		 * to find the task_struct for the new leader and start over.
+		 */
+		ret = attach_task_by_pid(cgrp, tgid, true);
+	} while (ret == -EAGAIN);
 	return ret;
 }
 
@@ -3167,9 +3497,9 @@ static struct cftype files[] = {
 	{
 		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
 		.open = cgroup_procs_open,
-		/* .write_u64 = cgroup_procs_write, TODO */
+		.write_u64 = cgroup_procs_write,
 		.release = cgroup_pidlist_release,
-		.mode = S_IRUGO,
+		.mode = S_IRUGO | S_IWUSR,
 	},
 	{
 		.name = "notify_on_release",
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index e5c0244..dfb9f24 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -185,6 +185,10 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 		struct task_struct *c;
 
 		rcu_read_lock();
+		if (!thread_group_leader(task)) {
+			rcu_read_unlock();
+			return -EAGAIN;
+		}
 		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
 			if (is_task_frozen_enough(c)) {
 				rcu_read_unlock();
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index d109467..edc7c01 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1360,6 +1360,10 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
 		struct task_struct *c;
 
 		rcu_read_lock();
+		if (!thread_group_leader(tsk)) {
+			rcu_read_unlock();
+			return -EAGAIN;
+		}
 		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
 			ret = security_task_setscheduler(c, 0, NULL);
 			if (ret) {
diff --git a/kernel/ns_cgroup.c b/kernel/ns_cgroup.c
index 2a5dfec..ecd15d2 100644
--- a/kernel/ns_cgroup.c
+++ b/kernel/ns_cgroup.c
@@ -59,6 +59,10 @@ static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
 	if (threadgroup) {
 		struct task_struct *c;
 		rcu_read_lock();
+		if (!thread_group_leader(task)) {
+			rcu_read_unlock();
+			return -EAGAIN;
+		}
 		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
 			if (!cgroup_is_descendant(new_cgroup, c)) {
 				rcu_read_unlock();
diff --git a/kernel/sched.c b/kernel/sched.c
index 3c2a54f..1b36bf0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8712,6 +8712,10 @@ cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	if (threadgroup) {
 		struct task_struct *c;
 		rcu_read_lock();
+		if (!thread_group_leader(tsk)) {
+			rcu_read_unlock();
+			return -EAGAIN;
+		}
 		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
 			retval = cpu_cgroup_can_attach_task(cgrp, c);
 			if (retval) {

* Re: [RFC] [PATCH v3 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup
       [not found]     ` <20100625054654.GB20610-kwnxxEB+oiWqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-06-28  7:10       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-28  7:10 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Fri, 25 Jun 2010 01:46:54 -0400
Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:

> Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup
> 
> From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
> 
> This patch adds an rwsem that lives in a threadgroup's signal_struct that's
> taken for reading in the fork path, under CONFIG_CGROUPS. If another part of
> the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
> ifdefs should be changed to a higher-up flag that CGROUPS and the other system
> would both depend on.
> 
> This is a pre-patch for cgroups-procs-write.patch.
> 

Hmm, at adding a new lock, please describe its semantics.
Following is my understanding, right ?

Lock range:
This rwsem is read-locked from cgroup_fork() to cgroup_post_fork(). 
Most of works for fork() are between cgroup_fork() and cgroup_post_for().
This means if sig->threadgroup_fork_lock is held, no new do_work() can
make progress in this process groups. This rwsem is held only when
CLONE_THREAD is in clone_flags. IOW, this rwsem is taken only at creating
new thread.

What we can do with this:
By locking sig->threadgroup_fork_lock, a code can visit _all_ threads
in a process group witout any races. So, if you want to implement an
atomic operation against a process, taking this lock is an idea.

For what:
To implement an atomic process move in cgroup, we need this lock.

Why this implemantation:
Considering cgroup, threads in a cgroup can be under several different
cgroup. So, we can't implement lock in cgroup-internal, we use signal
struct.


By the way, IMHO, hiding lock in cgroup_fork() and cgroup_post_fork() doesn't
seem good idea. How about a code like this ?

  read_lock_thread_clone(current);
  cgroup_fork();
 .....
  cgroup_post_fork();
  read_unlock_thrad_clone(current);

We may have chances to move these lock to better position if cgroup is
an only user.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] [PATCH v3 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup
  2010-06-25  5:46     ` Ben Blum
  (?)
@ 2010-06-28  7:10     ` KAMEZAWA Hiroyuki
  2010-06-28 15:43       ` Ben Blum
       [not found]       ` <20100628161031.a8c71fce.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
  -1 siblings, 2 replies; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-28  7:10 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage, oleg

On Fri, 25 Jun 2010 01:46:54 -0400
Ben Blum <bblum@andrew.cmu.edu> wrote:

> Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup
> 
> From: Ben Blum <bblum@andrew.cmu.edu>
> 
> This patch adds an rwsem that lives in a threadgroup's signal_struct that's
> taken for reading in the fork path, under CONFIG_CGROUPS. If another part of
> the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
> ifdefs should be changed to a higher-up flag that CGROUPS and the other system
> would both depend on.
> 
> This is a pre-patch for cgroups-procs-write.patch.
> 

Hmm, at adding a new lock, please describe its semantics.
Following is my understanding, right ?

Lock range:
This rwsem is read-locked from cgroup_fork() to cgroup_post_fork(). 
Most of works for fork() are between cgroup_fork() and cgroup_post_for().
This means if sig->threadgroup_fork_lock is held, no new do_work() can
make progress in this process groups. This rwsem is held only when
CLONE_THREAD is in clone_flags. IOW, this rwsem is taken only at creating
new thread.

What we can do with this:
By locking sig->threadgroup_fork_lock, a code can visit _all_ threads
in a process group witout any races. So, if you want to implement an
atomic operation against a process, taking this lock is an idea.

For what:
To implement an atomic process move in cgroup, we need this lock.

Why this implemantation:
Considering cgroup, threads in a cgroup can be under several different
cgroup. So, we can't implement lock in cgroup-internal, we use signal
struct.


By the way, IMHO, hiding lock in cgroup_fork() and cgroup_post_fork() doesn't
seem good idea. How about a code like this ?

  read_lock_thread_clone(current);
  cgroup_fork();
 .....
  cgroup_post_fork();
  read_unlock_thrad_clone(current);

We may have chances to move these lock to better position if cgroup is
an only user.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] [PATCH v3 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup
       [not found]       ` <20100628161031.a8c71fce.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
@ 2010-06-28 15:43         ` Ben Blum
  0 siblings, 0 replies; 16+ messages in thread
From: Ben Blum @ 2010-06-28 15:43 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA

On Mon, Jun 28, 2010 at 04:10:31PM +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 25 Jun 2010 01:46:54 -0400
> Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> 
> > Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup
> > 
> > From: Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org>
> > 
> > This patch adds an rwsem that lives in a threadgroup's signal_struct that's
> > taken for reading in the fork path, under CONFIG_CGROUPS. If another part of
> > the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
> > ifdefs should be changed to a higher-up flag that CGROUPS and the other system
> > would both depend on.
> > 
> > This is a pre-patch for cgroups-procs-write.patch.
> > 
> 
> Hmm, at adding a new lock, please describe its semantics.
> Following is my understanding, right ?
> 
> Lock range:
> This rwsem is read-locked from cgroup_fork() to cgroup_post_fork(). 
> Most of works for fork() are between cgroup_fork() and cgroup_post_for().
> This means if sig->threadgroup_fork_lock is held, no new do_work() can
> make progress in this process groups. This rwsem is held only when
> CLONE_THREAD is in clone_flags. IOW, this rwsem is taken only at creating
> new thread.
> 
> What we can do with this:
> By locking sig->threadgroup_fork_lock, a code can visit _all_ threads
> in a process group witout any races. So, if you want to implement an
> atomic operation against a process, taking this lock is an idea.
> 
> For what:
> To implement an atomic process move in cgroup, we need this lock.

All good. Should a description like this go in Documentation/ somewhere,
or above the declaration of the lock?

> Why this implemantation:
> Considering cgroup, threads in a cgroup can be under several different
> cgroup. So, we can't implement lock in cgroup-internal, we use signal
> struct.

Not entirely, though that's an additional restriction... The reason it's
in signal_struct: signal_struct is per-threadgroup and has exactly the
lifetime rules we want. Putting the lock in task_struct and taking
current->group_leader->signal... seems like it would also work, but
introduces cacheline conflicts that weren't there before, since fork
doesn't look at group_leader (but it does bump signal's count).

> By the way, IMHO, hiding lock in cgroup_fork() and cgroup_post_fork() doesn't
> seem good idea. How about a code like this ?
> 
>   read_lock_thread_clone(current);
>   cgroup_fork();
>  .....
>   cgroup_post_fork();
>   read_unlock_thrad_clone(current);
> 
> We may have chances to move these lock to better position if cgroup is
> an only user.

I didn't do that out of a desire to change fork.c as little as possible,
but that does look better than what I've got. Those two functions should
be in fork.c under #ifdef CONFIG_CGROUPS.

> 
> Thanks,
> -Kame

Thanks,
-- Ben

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] [PATCH v3 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup
  2010-06-28  7:10     ` KAMEZAWA Hiroyuki
@ 2010-06-28 15:43       ` Ben Blum
  2010-07-28  8:29         ` Ben Blum
       [not found]         ` <20100628154359.GA24629-kwnxxEB+oiWqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
       [not found]       ` <20100628161031.a8c71fce.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
  1 sibling, 2 replies; 16+ messages in thread
From: Ben Blum @ 2010-06-28 15:43 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ben Blum, linux-kernel, containers, akpm, ebiederm, lizf,
	matthltc, menage, oleg

On Mon, Jun 28, 2010 at 04:10:31PM +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 25 Jun 2010 01:46:54 -0400
> Ben Blum <bblum@andrew.cmu.edu> wrote:
> 
> > Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup
> > 
> > From: Ben Blum <bblum@andrew.cmu.edu>
> > 
> > This patch adds an rwsem that lives in a threadgroup's signal_struct that's
> > taken for reading in the fork path, under CONFIG_CGROUPS. If another part of
> > the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
> > ifdefs should be changed to a higher-up flag that CGROUPS and the other system
> > would both depend on.
> > 
> > This is a pre-patch for cgroups-procs-write.patch.
> > 
> 
> Hmm, at adding a new lock, please describe its semantics.
> Following is my understanding, right ?
> 
> Lock range:
> This rwsem is read-locked from cgroup_fork() to cgroup_post_fork(). 
> Most of works for fork() are between cgroup_fork() and cgroup_post_for().
> This means if sig->threadgroup_fork_lock is held, no new do_work() can
> make progress in this process groups. This rwsem is held only when
> CLONE_THREAD is in clone_flags. IOW, this rwsem is taken only at creating
> new thread.
> 
> What we can do with this:
> By locking sig->threadgroup_fork_lock, a code can visit _all_ threads
> in a process group witout any races. So, if you want to implement an
> atomic operation against a process, taking this lock is an idea.
> 
> For what:
> To implement an atomic process move in cgroup, we need this lock.

All good. Should a description like this go in Documentation/ somewhere,
or above the declaration of the lock?

> Why this implemantation:
> Considering cgroup, threads in a cgroup can be under several different
> cgroup. So, we can't implement lock in cgroup-internal, we use signal
> struct.

Not entirely, though that's an additional restriction... The reason it's
in signal_struct: signal_struct is per-threadgroup and has exactly the
lifetime rules we want. Putting the lock in task_struct and taking
current->group_leader->signal... seems like it would also work, but
introduces cacheline conflicts that weren't there before, since fork
doesn't look at group_leader (but it does bump signal's count).

> By the way, IMHO, hiding lock in cgroup_fork() and cgroup_post_fork() doesn't
> seem good idea. How about a code like this ?
> 
>   read_lock_thread_clone(current);
>   cgroup_fork();
>  .....
>   cgroup_post_fork();
>   read_unlock_thrad_clone(current);
> 
> We may have chances to move these lock to better position if cgroup is
> an only user.

I didn't do that out of a desire to change fork.c as little as possible,
but that does look better than what I've got. Those two functions should
be in fork.c under #ifdef CONFIG_CGROUPS.

> 
> Thanks,
> -Kame

Thanks,
-- Ben

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] [PATCH v3 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup
       [not found]         ` <20100628154359.GA24629-kwnxxEB+oiWqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-07-28  8:29           ` Ben Blum
  0 siblings, 0 replies; 16+ messages in thread
From: Ben Blum @ 2010-07-28  8:29 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA

On Mon, Jun 28, 2010 at 11:43:59AM -0400, Ben Blum wrote:
> On Mon, Jun 28, 2010 at 04:10:31PM +0900, KAMEZAWA Hiroyuki wrote:
> > By the way, IMHO, hiding lock in cgroup_fork() and cgroup_post_fork() doesn't
> > seem good idea. How about a code like this ?
> > 
> >   read_lock_thread_clone(current);
> >   cgroup_fork();
> >  .....
> >   cgroup_post_fork();
> >   read_unlock_thrad_clone(current);
> > 
> > We may have chances to move these lock to better position if cgroup is
> > an only user.
> 
> I didn't do that out of a desire to change fork.c as little as possible,
> but that does look better than what I've got. Those two functions should
> be in fork.c under #ifdef CONFIG_CGROUPS.

I'm looking at this now and am not sure where the best place to put
these is:

1) Don't make new functions, just put:

    #ifdef CONFIG_CGROUPS
        if (clone_flags & CLONE_THREAD)
            down/up_read(...);
    #endif

directly in copy_process() in fork.c. Simplest, but uglifies the code.

2) Make static helper functions in fork.c. Good, but not consistent with
directly using the lock in write-side (attach_proc).

3) Define inline functions under #ifdef CONFIG_CGROUPS in sched.h, just
under the declaration of the lock. Most robust, but I'm hesitant to add
unneeded stuff to such a popular header file.

Any opinions?

-- Ben

> 
> > 
> > Thanks,
> > -Kame
> 
> Thanks,
> -- Ben
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] [PATCH v3 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup
  2010-06-28 15:43       ` Ben Blum
@ 2010-07-28  8:29         ` Ben Blum
  2010-07-28  9:28           ` KAMEZAWA Hiroyuki
       [not found]           ` <20100728082953.GA15455-kwnxxEB+oiWqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
       [not found]         ` <20100628154359.GA24629-kwnxxEB+oiWqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  1 sibling, 2 replies; 16+ messages in thread
From: Ben Blum @ 2010-07-28  8:29 UTC (permalink / raw)
  To: Ben Blum
  Cc: KAMEZAWA Hiroyuki, linux-kernel, containers, akpm, ebiederm,
	lizf, matthltc, menage, oleg

On Mon, Jun 28, 2010 at 11:43:59AM -0400, Ben Blum wrote:
> On Mon, Jun 28, 2010 at 04:10:31PM +0900, KAMEZAWA Hiroyuki wrote:
> > By the way, IMHO, hiding lock in cgroup_fork() and cgroup_post_fork() doesn't
> > seem good idea. How about a code like this ?
> > 
> >   read_lock_thread_clone(current);
> >   cgroup_fork();
> >  .....
> >   cgroup_post_fork();
> >   read_unlock_thrad_clone(current);
> > 
> > We may have chances to move these lock to better position if cgroup is
> > an only user.
> 
> I didn't do that out of a desire to change fork.c as little as possible,
> but that does look better than what I've got. Those two functions should
> be in fork.c under #ifdef CONFIG_CGROUPS.

I'm looking at this now and am not sure where the best place to put
these is:

1) Don't make new functions, just put:

    #ifdef CONFIG_CGROUPS
        if (clone_flags & CLONE_THREAD)
            down/up_read(...);
    #endif

directly in copy_process() in fork.c. Simplest, but uglifies the code.

2) Make static helper functions in fork.c. Good, but not consistent with
directly using the lock in write-side (attach_proc).

3) Define inline functions under #ifdef CONFIG_CGROUPS in sched.h, just
under the declaration of the lock. Most robust, but I'm hesitant to add
unneeded stuff to such a popular header file.

Any opinions?

-- Ben

> 
> > 
> > Thanks,
> > -Kame
> 
> Thanks,
> -- Ben
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] [PATCH v3 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup
       [not found]           ` <20100728082953.GA15455-kwnxxEB+oiWqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
@ 2010-07-28  9:28             ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-28  9:28 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA, ebiederm-aS9lmoZGLiVWk0Htik3J/w

On Wed, 28 Jul 2010 04:29:53 -0400
Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:

> On Mon, Jun 28, 2010 at 11:43:59AM -0400, Ben Blum wrote:
> > On Mon, Jun 28, 2010 at 04:10:31PM +0900, KAMEZAWA Hiroyuki wrote:
> > > By the way, IMHO, hiding lock in cgroup_fork() and cgroup_post_fork() doesn't
> > > seem good idea. How about a code like this ?
> > > 
> > >   read_lock_thread_clone(current);
> > >   cgroup_fork();
> > >  .....
> > >   cgroup_post_fork();
> > >   read_unlock_thrad_clone(current);
> > > 
> > > We may have chances to move these lock to better position if cgroup is
> > > an only user.
> > 
> > I didn't do that out of a desire to change fork.c as little as possible,
> > but that does look better than what I've got. Those two functions should
> > be in fork.c under #ifdef CONFIG_CGROUPS.
> 
> I'm looking at this now and am not sure where the best place to put
> these is:
> 
> 1) Don't make new functions, just put:
> 
>     #ifdef CONFIG_CGROUPS
>         if (clone_flags & CLONE_THREAD)
>             down/up_read(...);
>     #endif
> 
> directly in copy_process() in fork.c. Simplest, but uglifies the code.
> 
> 2) Make static helper functions in fork.c. Good, but not consistent with
> directly using the lock in write-side (attach_proc).
> 
> 3) Define inline functions under #ifdef CONFIG_CGROUPS in sched.h, just
> under the declaration of the lock. Most robust, but I'm hesitant to add
> unneeded stuff to such a popular header file.
> 
> Any opinions?
> 

My point was simple. Because copy_process() is very important path,
the new lock should be visible in copy_process() or kernek/fork.c.
"If the lock is visible in copy_process(), the reader can notice it".

Then, I offer you 2 options.

rename cgroup_fork() and cgroup_post_fork() as
       cgroup_fork_lock() and cgroup_post_fork_unlock() 

Now, the lock is visible and the change is minimum.

Or
       add the definition of lock/unlock to cgroup.h and include it 
       from kernel/fork.c

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] [PATCH v3 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup
  2010-07-28  8:29         ` Ben Blum
@ 2010-07-28  9:28           ` KAMEZAWA Hiroyuki
       [not found]             ` <20100728182813.7c2826ae.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
       [not found]           ` <20100728082953.GA15455-kwnxxEB+oiWqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
  1 sibling, 1 reply; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-28  9:28 UTC (permalink / raw)
  To: Ben Blum
  Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage, oleg

On Wed, 28 Jul 2010 04:29:53 -0400
Ben Blum <bblum@andrew.cmu.edu> wrote:

> On Mon, Jun 28, 2010 at 11:43:59AM -0400, Ben Blum wrote:
> > On Mon, Jun 28, 2010 at 04:10:31PM +0900, KAMEZAWA Hiroyuki wrote:
> > > By the way, IMHO, hiding lock in cgroup_fork() and cgroup_post_fork() doesn't
> > > seem good idea. How about a code like this ?
> > > 
> > >   read_lock_thread_clone(current);
> > >   cgroup_fork();
> > >  .....
> > >   cgroup_post_fork();
> > >   read_unlock_thrad_clone(current);
> > > 
> > > We may have chances to move these lock to better position if cgroup is
> > > an only user.
> > 
> > I didn't do that out of a desire to change fork.c as little as possible,
> > but that does look better than what I've got. Those two functions should
> > be in fork.c under #ifdef CONFIG_CGROUPS.
> 
> I'm looking at this now and am not sure where the best place to put
> these is:
> 
> 1) Don't make new functions, just put:
> 
>     #ifdef CONFIG_CGROUPS
>         if (clone_flags & CLONE_THREAD)
>             down/up_read(...);
>     #endif
> 
> directly in copy_process() in fork.c. Simplest, but uglifies the code.
> 
> 2) Make static helper functions in fork.c. Good, but not consistent with
> directly using the lock in write-side (attach_proc).
> 
> 3) Define inline functions under #ifdef CONFIG_CGROUPS in sched.h, just
> under the declaration of the lock. Most robust, but I'm hesitant to add
> unneeded stuff to such a popular header file.
> 
> Any opinions?
> 

My point was simple. Because copy_process() is very important path,
the new lock should be visible in copy_process() or kernek/fork.c.
"If the lock is visible in copy_process(), the reader can notice it".

Then, I offer you 2 options.

rename cgroup_fork() and cgroup_post_fork() as
       cgroup_fork_lock() and cgroup_post_fork_unlock() 

Now, the lock is visible and the change is minimum.

Or
       add the definition of lock/unlock to cgroup.h and include it 
       from kernel/fork.c

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] [PATCH v3 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup
  2010-07-28  9:28           ` KAMEZAWA Hiroyuki
@ 2010-07-28 15:41                 ` Ben Blum
  0 siblings, 0 replies; 16+ messages in thread
From: Ben Blum @ 2010-07-28 15:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ben Blum, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	menage-hpIqsD4AKlfQT0dZR+AlfA

On Wed, Jul 28, 2010 at 06:28:13PM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 28 Jul 2010 04:29:53 -0400
> Ben Blum <bblum-OM76b2Iv3yLQjUSlxSEPGw@public.gmane.org> wrote:
> 
> > On Mon, Jun 28, 2010 at 11:43:59AM -0400, Ben Blum wrote:
> > > On Mon, Jun 28, 2010 at 04:10:31PM +0900, KAMEZAWA Hiroyuki wrote:
> > > > By the way, IMHO, hiding lock in cgroup_fork() and cgroup_post_fork() doesn't
> > > > seem good idea. How about a code like this ?
> > > > 
> > > >   read_lock_thread_clone(current);
> > > >   cgroup_fork();
> > > >  .....
> > > >   cgroup_post_fork();
> > > >   read_unlock_thrad_clone(current);
> > > > 
> > > > We may have chances to move these lock to better position if cgroup is
> > > > an only user.
> > > 
> > > I didn't do that out of a desire to change fork.c as little as possible,
> > > but that does look better than what I've got. Those two functions should
> > > be in fork.c under #ifdef CONFIG_CGROUPS.
> > 
> > I'm looking at this now and am not sure where the best place to put
> > these is:
> > 
> > 1) Don't make new functions, just put:
> > 
> >     #ifdef CONFIG_CGROUPS
> >         if (clone_flags & CLONE_THREAD)
> >             down/up_read(...);
> >     #endif
> > 
> > directly in copy_process() in fork.c. Simplest, but uglifies the code.
> > 
> > 2) Make static helper functions in fork.c. Good, but not consistent with
> > directly using the lock in write-side (attach_proc).
> > 
> > 3) Define inline functions under #ifdef CONFIG_CGROUPS in sched.h, just
> > under the declaration of the lock. Most robust, but I'm hesitant to add
> > unneeded stuff to such a popular header file.
> > 
> > Any opinions?
> > 
> 
> My point was simple. Because copy_process() is very important path,
> the new lock should be visible in copy_process() or kernek/fork.c.
> "If the lock is visible in copy_process(), the reader can notice it".
> 
> Then, I offer you 2 options.
> 
> rename cgroup_fork() and cgroup_post_fork() as
>        cgroup_fork_lock() and cgroup_post_fork_unlock() 
> 
> Now, the lock is visible and the change is minimum.
> 
> Or
>        add the definition of lock/unlock to cgroup.h and include it 
>        from kernel/fork.c
> 
> Thanks,
> -Kame

I don't like either of these. Renaming to cgroup_fork_lock not only
conveys the sense that a cgroup-specific lock is taken, but also hides
the real purpose of these functions, which is to manipulate cgroup
pointers. And it's not a cgroup-specific lock - only write-side is
*currently* used by cgroups - so it shouldn't go in cgroup.h.

-- Ben

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] [PATCH v3 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup
@ 2010-07-28 15:41                 ` Ben Blum
  0 siblings, 0 replies; 16+ messages in thread
From: Ben Blum @ 2010-07-28 15:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ben Blum, linux-kernel, containers, akpm, ebiederm, lizf,
	matthltc, menage, oleg

On Wed, Jul 28, 2010 at 06:28:13PM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 28 Jul 2010 04:29:53 -0400
> Ben Blum <bblum@andrew.cmu.edu> wrote:
> 
> > On Mon, Jun 28, 2010 at 11:43:59AM -0400, Ben Blum wrote:
> > > On Mon, Jun 28, 2010 at 04:10:31PM +0900, KAMEZAWA Hiroyuki wrote:
> > > > By the way, IMHO, hiding lock in cgroup_fork() and cgroup_post_fork() doesn't
> > > > seem good idea. How about a code like this ?
> > > > 
> > > >   read_lock_thread_clone(current);
> > > >   cgroup_fork();
> > > >  .....
> > > >   cgroup_post_fork();
> > > >   read_unlock_thrad_clone(current);
> > > > 
> > > > We may have chances to move these lock to better position if cgroup is
> > > > an only user.
> > > 
> > > I didn't do that out of a desire to change fork.c as little as possible,
> > > but that does look better than what I've got. Those two functions should
> > > be in fork.c under #ifdef CONFIG_CGROUPS.
> > 
> > I'm looking at this now and am not sure where the best place to put
> > these is:
> > 
> > 1) Don't make new functions, just put:
> > 
> >     #ifdef CONFIG_CGROUPS
> >         if (clone_flags & CLONE_THREAD)
> >             down/up_read(...);
> >     #endif
> > 
> > directly in copy_process() in fork.c. Simplest, but uglifies the code.
> > 
> > 2) Make static helper functions in fork.c. Good, but not consistent with
> > directly using the lock in write-side (attach_proc).
> > 
> > 3) Define inline functions under #ifdef CONFIG_CGROUPS in sched.h, just
> > under the declaration of the lock. Most robust, but I'm hesitant to add
> > unneeded stuff to such a popular header file.
> > 
> > Any opinions?
> > 
> 
> My point was simple. Because copy_process() is very important path,
> the new lock should be visible in copy_process() or kernek/fork.c.
> "If the lock is visible in copy_process(), the reader can notice it".
> 
> Then, I offer you 2 options.
> 
> rename cgroup_fork() and cgroup_post_fork() as
>        cgroup_fork_lock() and cgroup_post_fork_unlock() 
> 
> Now, the lock is visible and the change is minimum.
> 
> Or
>        add the definition of lock/unlock to cgroup.h and include it 
>        from kernel/fork.c
> 
> Thanks,
> -Kame

I don't like either of these. Renaming to cgroup_fork_lock not only
conveys the sense that a cgroup-specific lock is taken, but also hides
the real purpose of these functions, which is to manipulate cgroup
pointers. And it's not a cgroup-specific lock - only write-side is
*currently* used by cgroups - so it shouldn't go in cgroup.h.

-- Ben

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC] [PATCH v3 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
@ 2010-06-25  5:45 Ben Blum
  0 siblings, 0 replies; 16+ messages in thread
From: Ben Blum @ 2010-06-25  5:45 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: bblum-OM76b2Iv3yLQjUSlxSEPGw, oleg-H+wXaHxf7aLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, menage-hpIqsD4AKlfQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

This patch series is a revision of http://lkml.org/lkml/2010/5/29/126 .

These patches use an rwlock in signal_struct which access is dependent
on Oleg's recent changes to signal_struct's lifetime rules.

It is okay to write the tid of any task in the threadgroup. This is
implemented by taking task->group_leader right after find_task_by_vpid
while still rcu_read-side. This makes it necessary to check if
thread_group_leader(leader) every time we want to iterate over
->thread_group; each of these checks can fail with -EAGAIN.
Unfortunately this also means I needed to put these checks in can_attach
for each subsystem that needs to check each thread in the group.

I handle EAGAIN in the file's write handler, since hey, it's a super
expensive operation, might as well make it unbounded-time to boot - this
is optional and would work just as well with -EAGAIN sent to userspace.

-- bblum

---
 Documentation/cgroups/cgroups.txt |   13 -
 include/linux/cgroup.h            |   15 -
 include/linux/init_task.h         |    9
 include/linux/sched.h             |   10
 kernel/cgroup.c                   |  449 +++++++++++++++++++++++++++++++++-----
 kernel/cgroup_freezer.c           |    4
 kernel/cpuset.c                   |    4
 kernel/fork.c                     |   10
 kernel/ns_cgroup.c                |    4
 kernel/sched.c                    |    4
 10 files changed, 462 insertions(+), 60 deletions(-)

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2010-07-28 15:42 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-06-25  5:45 [RFC] [PATCH v3 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Ben Blum
     [not found] ` <20100625054541.GA20610-kwnxxEB+oiWqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-06-25  5:46   ` [RFC] [PATCH v3 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup Ben Blum
2010-06-25  5:46     ` Ben Blum
2010-06-28  7:10     ` KAMEZAWA Hiroyuki
2010-06-28 15:43       ` Ben Blum
2010-07-28  8:29         ` Ben Blum
2010-07-28  9:28           ` KAMEZAWA Hiroyuki
     [not found]             ` <20100728182813.7c2826ae.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2010-07-28 15:41               ` Ben Blum
2010-07-28 15:41                 ` Ben Blum
     [not found]           ` <20100728082953.GA15455-kwnxxEB+oiWqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-07-28  9:28             ` KAMEZAWA Hiroyuki
     [not found]         ` <20100628154359.GA24629-kwnxxEB+oiWqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-07-28  8:29           ` Ben Blum
     [not found]       ` <20100628161031.a8c71fce.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2010-06-28 15:43         ` Ben Blum
     [not found]     ` <20100625054654.GB20610-kwnxxEB+oiWqwBT9kiuFm8WGCVk0P7UB@public.gmane.org>
2010-06-28  7:10       ` KAMEZAWA Hiroyuki
2010-06-25  5:47   ` [RFC] [PATCH v3 2/2] cgroups: make procs file writable Ben Blum
2010-06-25  5:47 ` Ben Blum
  -- strict thread matches above, loose matches on Subject: below --
2010-06-25  5:45 [RFC] [PATCH v3 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Ben Blum

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.