* [PATCHSET] cgroup: stable threadgroup during attach & subsys methods consolidation
@ 2011-11-01 23:46 Tejun Heo
  2011-11-01 23:46 ` [PATCH 01/10] cgroup: add cgroup_root_mutex Tejun Heo
From: Tejun Heo @ 2011-11-01 23:46 UTC (permalink / raw)
  To: paul, rjw, lizf
  Cc: linux-pm, linux-kernel, containers, fweisbec, matthltc, akpm,
	oleg, kamezawa.hiroyu

Hello,

NOT FOR THIS MERGE WINDOW.

This patchset is a combination of the following two patchsets.

 [1] cgroup: extend threadgroup locking
 [2] cgroup: introduce cgroup_taskset and consolidate subsys methods, take#2

Changes from the last postings are

* 0001-cgroup-add-cgroup_root_mutex.patch replaces the mutex ordering
  reversal patch, which Oleg found to be broken.  Instead, a new
  sub-mutex, cgroup_root_mutex, is introduced to break the circular
  dependency.

* Rebased on top of the current linus/master.

* Other minor changes to reflect comments from reviews.

* Reviewed/Acked-by's added.

This patchset addresses the following two issues.

1. cgroup currently only blocks new threads from joining the target
   threadgroup during migration, and an on-going migration could race
   against exec and exit, leading to interesting problems - breaking
   the symmetry between various attach methods, tasks exiting during
   method execution, ->exit() racing against attach methods, the
   migrating task switching basic properties during exec and so on.

   This is resolved by extending threadgroup locking such that it
   covers all operations which can alter the threadgroup - fork, exit
   and exec - and updating cgroup to take advantage of it.  rwsem read
   ops are added to the exit path, while exec is excluded by grabbing
   the existing cred_guard_mutex from the threadgroup locking helper.

   This makes threadgroup locking complete and resolves cgroup issues
   stemming from the target taskset being unstable.
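
   In sketch form, the intended usage from cgroup's side looks like
   the following (condensed from the later patches, not a complete
   implementation):

	/* blocks fork, exit and exec for @leader's whole group */
	threadgroup_lock(leader);
	ret = cgroup_attach_proc(cgrp, leader);
	threadgroup_unlock(leader);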

2. cgroup has grown quite a number of subsys methods.  Some of them
   overlap, are inconsistent with each other and are called under
   different conditions depending on whether they're invoked for a
   single task or a whole process.  Unfortunately, these callbacks are
   both complicated and incomplete.

   * ->attach_task() is called after migration for task attach but
     before migration for process attach.

   * Ditto for ->pre_attach().

   * ->can_attach_task() is called for every task in the thread group
     but ->attach_task() skips the ones which don't actually change
     cgroups.

   * Task attach becomes noop if the task isn't actually moving.
     Process attach is always performed.

   * ->attach_task() doesn't (or at least isn't supposed to) have
     access to the old cgroup.

   * During cancel, there's no way to access the affected tasks.

   This patchset introduces cgroup_taskset along with some accessors
   and an iterator, updates methods to use it, consolidates usages and
   drops superfluous methods.
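
   A minimal sketch of how a converted method consumes the new taskset
   (foo_can_attach is a made-up callback; the iterator and accessors
   are introduced in 0007):

	static int foo_can_attach(struct cgroup_subsys *ss,
				  struct cgroup *cgrp,
				  struct cgroup_taskset *tset)
	{
		struct task_struct *task;

		/* passing @cgrp skips tasks which aren't actually moving */
		cgroup_taskset_for_each(task, cgrp, tset) {
			if (task->flags & PF_THREAD_BOUND)
				return -EINVAL;
		}
		return 0;
	}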

 0001-cgroup-add-cgroup_root_mutex.patch
 0002-threadgroup-rename-signal-threadgroup_fork_lock-to-g.patch
 0003-threadgroup-extend-threadgroup_lock-to-cover-exit-an.patch
 0004-cgroup-always-lock-threadgroup-during-migration.patch
 0005-cgroup-subsys-attach_task-should-be-called-after-mig.patch
 0006-cgroup-improve-old-cgroup-handling-in-cgroup_attach_.patch
 0007-cgroup-introduce-cgroup_taskset-and-use-it-in-subsys.patch
 0008-cgroup-don-t-use-subsys-can_attach_task-or-attach_ta.patch
 0009-cgroup-cpuset-don-t-use-ss-pre_attach.patch
 0010-cgroup-kill-subsys-can_attach_task-pre_attach-and-at.patch

0001-0004 implement stable thread group.

0005-0010 introduce taskset and consolidate subsys callbacks.

This patchset is on top of the current linus/master 839d881074 "Merge
branch 'i2c-for-linus' of ..." and also available in the following git
branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git cgroup-cleanup

If this looks okay, I think it would be best to route this through the
pm tree as there are and will be intersecting changes (mostly around
cgroup_freezer).

diffstat follows.

 Documentation/cgroups/cgroups.txt |   51 ++----
 block/blk-cgroup.c                |   45 +++--
 include/linux/cgroup.h            |   31 ++-
 include/linux/init_task.h         |    9 -
 include/linux/sched.h             |   62 +++++--
 kernel/cgroup.c                   |  320 +++++++++++++++++++++++---------------
 kernel/cgroup_freezer.c           |   27 +--
 kernel/cpuset.c                   |  105 +++++-------
 kernel/events/core.c              |   13 -
 kernel/exit.c                     |   17 +-
 kernel/fork.c                     |    8 
 kernel/sched.c                    |   31 ++-
 mm/memcontrol.c                   |   16 -
 security/device_cgroup.c          |    7 
 14 files changed, 430 insertions(+), 312 deletions(-)

Thanks.

--
tejun

[1] http://thread.gmane.org/gmane.linux.kernel/1187853
[2] http://thread.gmane.org/gmane.linux.kernel/1184375


* [PATCH 01/10] cgroup: add cgroup_root_mutex
  2011-11-01 23:46 [PATCHSET] cgroup: stable threadgroup during attach & subsys methods consolidation Tejun Heo
@ 2011-11-01 23:46 ` Tejun Heo
  2011-11-04  8:38   ` KAMEZAWA Hiroyuki
  2011-11-01 23:46 ` [PATCH 02/10] threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem Tejun Heo
From: Tejun Heo @ 2011-11-01 23:46 UTC (permalink / raw)
  To: paul, rjw, lizf
  Cc: linux-pm, linux-kernel, containers, fweisbec, matthltc, akpm,
	oleg, kamezawa.hiroyu, Tejun Heo

cgroup wants to make the threadgroup stable while modifying cgroup
hierarchies, which will introduce a locking dependency on
cred_guard_mutex from cgroup_mutex.  This unfortunately completes a
circular dependency.

 A. cgroup_mutex -> cred_guard_mutex -> s_type->i_mutex_key -> namespace_sem
 B. namespace_sem -> cgroup_mutex

B is from cgroup_show_options() and this patch breaks it by
introducing another mutex, cgroup_root_mutex, which nests inside
cgroup_mutex and protects cgroupfs_root.
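
In sketch form (using the helpers from the diff below), writers take
both mutexes in nesting order while a reader such as
cgroup_show_options() only needs the inner one and thus never enters
the A-B cycle:

	/* writer: modifying cgroupfs_root */
	mutex_lock(&cgroup_mutex);
	mutex_lock(&cgroup_root_mutex);
	/* change root->subsys_list, ->flags, ->release_agent_path ... */
	mutex_unlock(&cgroup_root_mutex);
	mutex_unlock(&cgroup_mutex);

	/* reader: either mutex suffices, cgroup_root_mutex here */
	mutex_lock(&cgroup_root_mutex);
	/* read root state */
	mutex_unlock(&cgroup_root_mutex);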

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 kernel/cgroup.c |   64 ++++++++++++++++++++++++++++++++++++-------------------
 1 files changed, 42 insertions(+), 22 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 453100a..efa5886 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -63,7 +63,24 @@
 
 #include <linux/atomic.h>
 
+/*
+ * cgroup_mutex is the master lock.  Any modification to cgroup or its
+ * hierarchy must be performed while holding it.
+ *
+ * cgroup_root_mutex nests inside cgroup_mutex and should be held to modify
+ * cgroupfs_root of any cgroup hierarchy - subsys list, flags,
+ * release_agent_path and so on.  Modifying requires both cgroup_mutex and
+ * cgroup_root_mutex.  Readers can acquire either of the two.  This is to
+ * break the following locking order cycle.
+ *
+ *  A. cgroup_mutex -> cred_guard_mutex -> s_type->i_mutex_key -> namespace_sem
+ *  B. namespace_sem -> cgroup_mutex
+ *
+ * B happens only through cgroup_show_options() and using cgroup_root_mutex
+ * breaks it.
+ */
 static DEFINE_MUTEX(cgroup_mutex);
+static DEFINE_MUTEX(cgroup_root_mutex);
 
 /*
  * Generate an array of cgroup subsystem pointers. At boot time, this is
@@ -953,6 +970,7 @@ static int rebind_subsystems(struct cgroupfs_root *root,
 	int i;
 
 	BUG_ON(!mutex_is_locked(&cgroup_mutex));
+	BUG_ON(!mutex_is_locked(&cgroup_root_mutex));
 
 	removed_bits = root->actual_subsys_bits & ~final_bits;
 	added_bits = final_bits & ~root->actual_subsys_bits;
@@ -1043,7 +1061,7 @@ static int cgroup_show_options(struct seq_file *seq, struct vfsmount *vfs)
 	struct cgroupfs_root *root = vfs->mnt_sb->s_fs_info;
 	struct cgroup_subsys *ss;
 
-	mutex_lock(&cgroup_mutex);
+	mutex_lock(&cgroup_root_mutex);
 	for_each_subsys(root, ss)
 		seq_printf(seq, ",%s", ss->name);
 	if (test_bit(ROOT_NOPREFIX, &root->flags))
@@ -1054,7 +1072,7 @@ static int cgroup_show_options(struct seq_file *seq, struct vfsmount *vfs)
 		seq_puts(seq, ",clone_children");
 	if (strlen(root->name))
 		seq_printf(seq, ",name=%s", root->name);
-	mutex_unlock(&cgroup_mutex);
+	mutex_unlock(&cgroup_root_mutex);
 	return 0;
 }
 
@@ -1269,6 +1287,7 @@ static int cgroup_remount(struct super_block *sb, int *flags, char *data)
 
 	mutex_lock(&cgrp->dentry->d_inode->i_mutex);
 	mutex_lock(&cgroup_mutex);
+	mutex_lock(&cgroup_root_mutex);
 
 	/* See what subsystems are wanted */
 	ret = parse_cgroupfs_options(data, &opts);
@@ -1297,6 +1316,7 @@ static int cgroup_remount(struct super_block *sb, int *flags, char *data)
  out_unlock:
 	kfree(opts.release_agent);
 	kfree(opts.name);
+	mutex_unlock(&cgroup_root_mutex);
 	mutex_unlock(&cgroup_mutex);
 	mutex_unlock(&cgrp->dentry->d_inode->i_mutex);
 	return ret;
@@ -1481,6 +1501,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 	int ret = 0;
 	struct super_block *sb;
 	struct cgroupfs_root *new_root;
+	struct inode *inode;
 
 	/* First find the desired set of subsystems */
 	mutex_lock(&cgroup_mutex);
@@ -1514,7 +1535,6 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 		/* We used the new root structure, so this is a new hierarchy */
 		struct list_head tmp_cg_links;
 		struct cgroup *root_cgrp = &root->top_cgroup;
-		struct inode *inode;
 		struct cgroupfs_root *existing_root;
 		const struct cred *cred;
 		int i;
@@ -1528,18 +1548,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 
 		mutex_lock(&inode->i_mutex);
 		mutex_lock(&cgroup_mutex);
+		mutex_lock(&cgroup_root_mutex);
 
-		if (strlen(root->name)) {
-			/* Check for name clashes with existing mounts */
-			for_each_active_root(existing_root) {
-				if (!strcmp(existing_root->name, root->name)) {
-					ret = -EBUSY;
-					mutex_unlock(&cgroup_mutex);
-					mutex_unlock(&inode->i_mutex);
-					goto drop_new_super;
-				}
-			}
-		}
+		/* Check for name clashes with existing mounts */
+		ret = -EBUSY;
+		if (strlen(root->name))
+			for_each_active_root(existing_root)
+				if (!strcmp(existing_root->name, root->name))
+					goto unlock_drop;
 
 		/*
 		 * We're accessing css_set_count without locking
@@ -1549,18 +1565,13 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 		 * have some link structures left over
 		 */
 		ret = allocate_cg_links(css_set_count, &tmp_cg_links);
-		if (ret) {
-			mutex_unlock(&cgroup_mutex);
-			mutex_unlock(&inode->i_mutex);
-			goto drop_new_super;
-		}
+		if (ret)
+			goto unlock_drop;
 
 		ret = rebind_subsystems(root, root->subsys_bits);
 		if (ret == -EBUSY) {
-			mutex_unlock(&cgroup_mutex);
-			mutex_unlock(&inode->i_mutex);
 			free_cg_links(&tmp_cg_links);
-			goto drop_new_super;
+			goto unlock_drop;
 		}
 		/*
 		 * There must be no failure case after here, since rebinding
@@ -1599,6 +1610,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 		cred = override_creds(&init_cred);
 		cgroup_populate_dir(root_cgrp);
 		revert_creds(cred);
+		mutex_unlock(&cgroup_root_mutex);
 		mutex_unlock(&cgroup_mutex);
 		mutex_unlock(&inode->i_mutex);
 	} else {
@@ -1615,6 +1627,10 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 	kfree(opts.name);
 	return dget(sb->s_root);
 
+ unlock_drop:
+	mutex_unlock(&cgroup_root_mutex);
+	mutex_unlock(&cgroup_mutex);
+	mutex_unlock(&inode->i_mutex);
  drop_new_super:
 	deactivate_locked_super(sb);
  drop_modules:
@@ -1639,6 +1655,7 @@ static void cgroup_kill_sb(struct super_block *sb) {
 	BUG_ON(!list_empty(&cgrp->sibling));
 
 	mutex_lock(&cgroup_mutex);
+	mutex_lock(&cgroup_root_mutex);
 
 	/* Rebind all subsystems back to the default hierarchy */
 	ret = rebind_subsystems(root, 0);
@@ -1664,6 +1681,7 @@ static void cgroup_kill_sb(struct super_block *sb) {
 		root_count--;
 	}
 
+	mutex_unlock(&cgroup_root_mutex);
 	mutex_unlock(&cgroup_mutex);
 
 	kill_litter_super(sb);
@@ -2308,7 +2326,9 @@ static int cgroup_release_agent_write(struct cgroup *cgrp, struct cftype *cft,
 		return -EINVAL;
 	if (!cgroup_lock_live_group(cgrp))
 		return -ENODEV;
+	mutex_lock(&cgroup_root_mutex);
 	strcpy(cgrp->root->release_agent_path, buffer);
+	mutex_unlock(&cgroup_root_mutex);
 	cgroup_unlock();
 	return 0;
 }
-- 
1.7.3.1



* [PATCH 02/10] threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem
  2011-11-01 23:46 [PATCHSET] cgroup: stable threadgroup during attach & subsys methods consolidation Tejun Heo
  2011-11-01 23:46 ` [PATCH 01/10] cgroup: add cgroup_root_mutex Tejun Heo
@ 2011-11-01 23:46 ` Tejun Heo
  2011-11-04  8:40   ` KAMEZAWA Hiroyuki
  2011-11-01 23:46 ` [PATCH 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec Tejun Heo
From: Tejun Heo @ 2011-11-01 23:46 UTC (permalink / raw)
  To: paul, rjw, lizf
  Cc: linux-pm, linux-kernel, containers, fweisbec, matthltc, akpm,
	oleg, kamezawa.hiroyu, Tejun Heo

Make the following renames to prepare for extension of threadgroup
locking.

* s/signal->threadgroup_fork_lock/signal->group_rwsem/
* s/threadgroup_fork_read_lock()/threadgroup_change_begin()/
* s/threadgroup_fork_read_unlock()/threadgroup_change_done()/
* s/threadgroup_fork_write_lock()/threadgroup_lock()/
* s/threadgroup_fork_write_unlock()/threadgroup_unlock()/

This patch doesn't cause any behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
---
 include/linux/init_task.h |    9 ++++-----
 include/linux/sched.h     |   30 +++++++++++++++---------------
 kernel/cgroup.c           |   13 ++++++-------
 kernel/fork.c             |    8 ++++----
 4 files changed, 29 insertions(+), 31 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 08ffab0..ef20cbe 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -23,11 +23,10 @@ extern struct files_struct init_files;
 extern struct fs_struct init_fs;
 
 #ifdef CONFIG_CGROUPS
-#define INIT_THREADGROUP_FORK_LOCK(sig)					\
-	.threadgroup_fork_lock =					\
-		__RWSEM_INITIALIZER(sig.threadgroup_fork_lock),
+#define INIT_GROUP_RWSEM(sig)						\
+	.group_rwsem = __RWSEM_INITIALIZER(sig.group_rwsem),
 #else
-#define INIT_THREADGROUP_FORK_LOCK(sig)
+#define INIT_GROUP_RWSEM(sig)
 #endif
 
 #define INIT_SIGNALS(sig) {						\
@@ -46,7 +45,7 @@ extern struct fs_struct init_fs;
 	},								\
 	.cred_guard_mutex =						\
 		 __MUTEX_INITIALIZER(sig.cred_guard_mutex),		\
-	INIT_THREADGROUP_FORK_LOCK(sig)					\
+	INIT_GROUP_RWSEM(sig)						\
 }
 
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e8acce7..aa47d0f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -635,13 +635,13 @@ struct signal_struct {
 #endif
 #ifdef CONFIG_CGROUPS
 	/*
-	 * The threadgroup_fork_lock prevents threads from forking with
+	 * The group_rwsem prevents threads from forking with
 	 * CLONE_THREAD while held for writing. Use this for fork-sensitive
 	 * threadgroup-wide operations. It's taken for reading in fork.c in
 	 * copy_process().
 	 * Currently only needed write-side by cgroups.
 	 */
-	struct rw_semaphore threadgroup_fork_lock;
+	struct rw_semaphore group_rwsem;
 #endif
 
 	int oom_adj;		/* OOM kill score adjustment (bit shift) */
@@ -2367,29 +2367,29 @@ static inline void unlock_task_sighand(struct task_struct *tsk,
 	spin_unlock_irqrestore(&tsk->sighand->siglock, *flags);
 }
 
-/* See the declaration of threadgroup_fork_lock in signal_struct. */
+/* See the declaration of group_rwsem in signal_struct. */
 #ifdef CONFIG_CGROUPS
-static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
+static inline void threadgroup_change_begin(struct task_struct *tsk)
 {
-	down_read(&tsk->signal->threadgroup_fork_lock);
+	down_read(&tsk->signal->group_rwsem);
 }
-static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
+static inline void threadgroup_change_done(struct task_struct *tsk)
 {
-	up_read(&tsk->signal->threadgroup_fork_lock);
+	up_read(&tsk->signal->group_rwsem);
 }
-static inline void threadgroup_fork_write_lock(struct task_struct *tsk)
+static inline void threadgroup_lock(struct task_struct *tsk)
 {
-	down_write(&tsk->signal->threadgroup_fork_lock);
+	down_write(&tsk->signal->group_rwsem);
 }
-static inline void threadgroup_fork_write_unlock(struct task_struct *tsk)
+static inline void threadgroup_unlock(struct task_struct *tsk)
 {
-	up_write(&tsk->signal->threadgroup_fork_lock);
+	up_write(&tsk->signal->group_rwsem);
 }
 #else
-static inline void threadgroup_fork_read_lock(struct task_struct *tsk) {}
-static inline void threadgroup_fork_read_unlock(struct task_struct *tsk) {}
-static inline void threadgroup_fork_write_lock(struct task_struct *tsk) {}
-static inline void threadgroup_fork_write_unlock(struct task_struct *tsk) {}
+static inline void threadgroup_change_begin(struct task_struct *tsk) {}
+static inline void threadgroup_change_done(struct task_struct *tsk) {}
+static inline void threadgroup_lock(struct task_struct *tsk) {}
+static inline void threadgroup_unlock(struct task_struct *tsk) {}
 #endif
 
 #ifndef __HAVE_THREAD_FUNCTIONS
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index efa5886..f0e099f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2003,8 +2003,8 @@ static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
  * @cgrp: the cgroup to attach to
  * @leader: the threadgroup leader task_struct of the group to be attached
  *
- * Call holding cgroup_mutex and the threadgroup_fork_lock of the leader. Will
- * take task_lock of each thread in leader's threadgroup individually in turn.
+ * Call holding cgroup_mutex and the group_rwsem of the leader. Will take
+ * task_lock of each thread in leader's threadgroup individually in turn.
  */
 int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 {
@@ -2030,8 +2030,8 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	 * step 0: in order to do expensive, possibly blocking operations for
 	 * every thread, we cannot iterate the thread group list, since it needs
 	 * rcu or tasklist locked. instead, build an array of all threads in the
-	 * group - threadgroup_fork_lock prevents new threads from appearing,
-	 * and if threads exit, this will just be an over-estimate.
+	 * group - group_rwsem prevents new threads from appearing, and if
+	 * threads exit, this will just be an over-estimate.
 	 */
 	group_size = get_nr_threads(leader);
 	/* flex_array supports very large thread-groups better than kmalloc. */
@@ -2246,7 +2246,6 @@ static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 			cgroup_unlock();
 			return -ESRCH;
 		}
-
 		/*
 		 * even if we're attaching all tasks in the thread group, we
 		 * only need to check permissions on one of them.
@@ -2270,9 +2269,9 @@ static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 	}
 
 	if (threadgroup) {
-		threadgroup_fork_write_lock(tsk);
+		threadgroup_lock(tsk);
 		ret = cgroup_attach_proc(cgrp, tsk);
-		threadgroup_fork_write_unlock(tsk);
+		threadgroup_unlock(tsk);
 	} else {
 		ret = cgroup_attach_task(cgrp, tsk);
 	}
diff --git a/kernel/fork.c b/kernel/fork.c
index 8e6b6f4..c2af839 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -980,7 +980,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sched_autogroup_fork(sig);
 
 #ifdef CONFIG_CGROUPS
-	init_rwsem(&sig->threadgroup_fork_lock);
+	init_rwsem(&sig->group_rwsem);
 #endif
 
 	sig->oom_adj = current->signal->oom_adj;
@@ -1166,7 +1166,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->io_context = NULL;
 	p->audit_context = NULL;
 	if (clone_flags & CLONE_THREAD)
-		threadgroup_fork_read_lock(current);
+		threadgroup_change_begin(current);
 	cgroup_fork(p);
 #ifdef CONFIG_NUMA
 	p->mempolicy = mpol_dup(p->mempolicy);
@@ -1378,7 +1378,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	proc_fork_connector(p);
 	cgroup_post_fork(p);
 	if (clone_flags & CLONE_THREAD)
-		threadgroup_fork_read_unlock(current);
+		threadgroup_change_done(current);
 	perf_event_fork(p);
 	return p;
 
@@ -1418,7 +1418,7 @@ bad_fork_cleanup_policy:
 bad_fork_cleanup_cgroup:
 #endif
 	if (clone_flags & CLONE_THREAD)
-		threadgroup_fork_read_unlock(current);
+		threadgroup_change_done(current);
 	cgroup_exit(p, cgroup_callbacks_done);
 	delayacct_tsk_free(p);
 	module_put(task_thread_info(p)->exec_domain->module);
-- 
1.7.3.1



* [PATCH 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-11-01 23:46 [PATCHSET] cgroup: stable threadgroup during attach & subsys methods consolidation Tejun Heo
  2011-11-01 23:46 ` [PATCH 01/10] cgroup: add cgroup_root_mutex Tejun Heo
  2011-11-01 23:46 ` [PATCH 02/10] threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem Tejun Heo
@ 2011-11-01 23:46 ` Tejun Heo
  2011-11-04  8:45   ` KAMEZAWA Hiroyuki
  2011-11-01 23:46 ` [PATCH 04/10] cgroup: always lock threadgroup during migration Tejun Heo
From: Tejun Heo @ 2011-11-01 23:46 UTC (permalink / raw)
  To: paul, rjw, lizf
  Cc: linux-pm, linux-kernel, containers, fweisbec, matthltc, akpm,
	oleg, kamezawa.hiroyu, Tejun Heo

threadgroup_lock() only protected against new additions to the
threadgroup, which was inherently somewhat incomplete and problematic
for its only user, cgroup.  An on-going migration could race against
exec and exit, leading to interesting problems - breaking the symmetry
between various attach methods, tasks exiting during method execution,
->exit() racing against attach methods, the migrating task switching
basic properties during exec and so on.

This patch extends threadgroup_lock() such that it protects against
all three threadgroup altering operations - fork, exit and exec.  For
exit, threadgroup_change_begin/end() calls are added to exit path.
For exec, threadgroup_[un]lock() are updated to also grab and release
cred_guard_mutex.

With this change, threadgroup_lock() guarantees that the target
threadgroup will remain stable - no new task will be added, no new
PF_EXITING will be set and exec won't happen.
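
For example, a hypothetical caller can walk the group knowing nothing
changes underneath (RCU is still needed for the thread list traversal
itself, as in cgroup_attach_proc()):

	threadgroup_lock(leader);

	rcu_read_lock();
	t = leader;
	do {
		/* @t stays alive, !PF_EXITING and exec-free till unlock */
	} while_each_thread(leader, t);
	rcu_read_unlock();

	threadgroup_unlock(leader);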

The next patch will update cgroup so that it can take full advantage
of this change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
---
 include/linux/sched.h |   36 ++++++++++++++++++++++++++++++------
 kernel/exit.c         |   17 +++++++++++++----
 2 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index aa47d0f..6fb546e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -635,11 +635,12 @@ struct signal_struct {
 #endif
 #ifdef CONFIG_CGROUPS
 	/*
-	 * The group_rwsem prevents threads from forking with
-	 * CLONE_THREAD while held for writing. Use this for fork-sensitive
-	 * threadgroup-wide operations. It's taken for reading in fork.c in
-	 * copy_process().
-	 * Currently only needed write-side by cgroups.
+	 * group_rwsem prevents new tasks from entering the threadgroup and
+	 * member tasks from exiting.  fork and exit paths are protected
+	 * with this rwsem using threadgroup_change_begin/end().  Users
+	 * which require threadgroup to remain stable should use
+	 * threadgroup_[un]lock() which also takes care of exec path.
+	 * Currently, cgroup is the only user.
 	 */
 	struct rw_semaphore group_rwsem;
 #endif
@@ -2367,7 +2368,6 @@ static inline void unlock_task_sighand(struct task_struct *tsk,
 	spin_unlock_irqrestore(&tsk->sighand->siglock, *flags);
 }
 
-/* See the declaration of group_rwsem in signal_struct. */
 #ifdef CONFIG_CGROUPS
 static inline void threadgroup_change_begin(struct task_struct *tsk)
 {
@@ -2377,13 +2377,37 @@ static inline void threadgroup_change_done(struct task_struct *tsk)
 {
 	up_read(&tsk->signal->group_rwsem);
 }
+
+/**
+ * threadgroup_lock - lock threadgroup
+ * @tsk: member task of the threadgroup to lock
+ *
+ * Lock the threadgroup @tsk belongs to.  No new task is allowed to enter
+ * and member tasks aren't allowed to exit (as indicated by PF_EXITING) or
+ * perform exec.  This is useful for cases where the threadgroup needs to
+ * stay stable across blockable operations.
+ *
+ * fork and exit explicitly call threadgroup_change_{begin|end}() for
+ * synchronization.  exec is already synchronized with cred_guard_mutex
+ * which we grab here.
+ */
 static inline void threadgroup_lock(struct task_struct *tsk)
 {
+	/* exec uses exit for de-threading, grab cred_guard_mutex first */
+	mutex_lock(&tsk->signal->cred_guard_mutex);
 	down_write(&tsk->signal->group_rwsem);
 }
+
+/**
+ * threadgroup_unlock - unlock threadgroup
+ * @tsk: member task of the threadgroup to unlock
+ *
+ * Reverse threadgroup_lock().
+ */
 static inline void threadgroup_unlock(struct task_struct *tsk)
 {
 	up_write(&tsk->signal->group_rwsem);
+	mutex_unlock(&tsk->signal->cred_guard_mutex);
 }
 #else
 static inline void threadgroup_change_begin(struct task_struct *tsk) {}
diff --git a/kernel/exit.c b/kernel/exit.c
index 2913b35..3565f00 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -938,6 +938,12 @@ NORET_TYPE void do_exit(long code)
 		schedule();
 	}
 
+	/*
+	 * @tsk's threadgroup is going through changes - lock out users
+	 * which expect stable threadgroup.
+	 */
+	threadgroup_change_begin(tsk);
+
 	exit_irq_thread();
 
 	exit_signals(tsk);  /* sets PF_EXITING */
@@ -1020,10 +1026,6 @@ NORET_TYPE void do_exit(long code)
 		kfree(current->pi_state_cache);
 #endif
 	/*
-	 * Make sure we are holding no locks:
-	 */
-	debug_check_no_locks_held(tsk);
-	/*
 	 * We can do this unlocked here. The futex code uses this flag
 	 * just to verify whether the pi state cleanup has been done
 	 * or not. In the worst case it loops once more.
@@ -1040,6 +1042,13 @@ NORET_TYPE void do_exit(long code)
 
 	preempt_disable();
 	exit_rcu();
+
+	/*
+	 * Release threadgroup and make sure we are holding no locks.
+	 */
+	threadgroup_change_done(tsk);
+	debug_check_no_locks_held(tsk);
+
 	/* causes final put_task_struct in finish_task_switch(). */
 	tsk->state = TASK_DEAD;
 	schedule();
-- 
1.7.3.1



* [PATCH 04/10] cgroup: always lock threadgroup during migration
  2011-11-01 23:46 [PATCHSET] cgroup: stable threadgroup during attach & subsys methods consolidation Tejun Heo
  2011-11-01 23:46 ` [PATCH 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec Tejun Heo
@ 2011-11-01 23:46 ` Tejun Heo
  2011-11-04  8:54   ` KAMEZAWA Hiroyuki
  2011-11-14 18:46   ` Frederic Weisbecker
  2011-11-01 23:46 ` [PATCH 05/10] cgroup: subsys->attach_task() should be called after migration Tejun Heo
From: Tejun Heo @ 2011-11-01 23:46 UTC (permalink / raw)
  To: paul, rjw, lizf
  Cc: linux-pm, linux-kernel, containers, fweisbec, matthltc, akpm,
	oleg, kamezawa.hiroyu, Tejun Heo

Update cgroup to take advantage of the fact that threadgroup_lock()
guarantees a stable threadgroup.

* Lock threadgroup even if the target is a single task.  This
  guarantees that the target tasks stay stable during migration
  regardless of the target type.

* Remove the PF_EXITING early exit optimization from
  attach_task_by_pid() and check it in cgroup_task_migrate() instead.
  The optimization was for a rather cold path to begin with, and the
  PF_EXITING state can be trusted throughout migration by checking it
  after locking the threadgroup.

* Don't add PF_EXITING tasks to target task array in
  cgroup_attach_proc().  This ensures that task migration is performed
  only for live tasks.

* Remove -ESRCH failure path from cgroup_task_migrate().  With the
  above changes, it's guaranteed to be called only for live tasks.

After the changes, only live tasks are migrated and they're guaranteed
to stay alive until migration is complete.  This removes problems
caused by exec and exit racing against cgroup migration, including
asymmetry among cgroup attach methods and different cgroup methods
racing each other.
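
The resulting attach path, condensed from the diff below:

	threadgroup_lock(tsk);		/* now taken for both target types */

	if (threadgroup)
		ret = cgroup_attach_proc(cgrp, tsk);
	else
		ret = cgroup_attach_task(cgrp, tsk);	/* -ESRCH if exiting */

	threadgroup_unlock(tsk);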

v2: Oleg pointed out that one more PF_EXITING check can be removed
    from cgroup_attach_proc().  Removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
---
 kernel/cgroup.c |   51 +++++++++++++++++++++++----------------------------
 1 files changed, 23 insertions(+), 28 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f0e099f..83e10f9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1762,7 +1762,7 @@ EXPORT_SYMBOL_GPL(cgroup_path);
  *
  * 'guarantee' is set if the caller promises that a new css_set for the task
  * will already exist. If not set, this function might sleep, and can fail with
- * -ENOMEM. Otherwise, it can only fail with -ESRCH.
+ * -ENOMEM. Must be called with cgroup_mutex and threadgroup locked.
  */
 static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
 			       struct task_struct *tsk, bool guarantee)
@@ -1800,13 +1800,9 @@ static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
 	}
 	put_css_set(oldcg);
 
-	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
+	/* @tsk can't exit as its threadgroup is locked */
 	task_lock(tsk);
-	if (tsk->flags & PF_EXITING) {
-		task_unlock(tsk);
-		put_css_set(newcg);
-		return -ESRCH;
-	}
+	WARN_ON_ONCE(tsk->flags & PF_EXITING);
 	rcu_assign_pointer(tsk->cgroups, newcg);
 	task_unlock(tsk);
 
@@ -1832,8 +1828,8 @@ static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
  * @cgrp: the cgroup the task is attaching to
  * @tsk: the task to be attached
  *
- * Call holding cgroup_mutex. May take task_lock of
- * the task 'tsk' during call.
+ * Call with cgroup_mutex and threadgroup locked. May take task_lock of
+ * @tsk during call.
  */
 int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
@@ -1842,6 +1838,10 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	struct cgroup *oldcgrp;
 	struct cgroupfs_root *root = cgrp->root;
 
+	/* @tsk either already exited or can't exit until the end */
+	if (tsk->flags & PF_EXITING)
+		return -ESRCH;
+
 	/* Nothing to do if the task is already in that cgroup */
 	oldcgrp = task_cgroup_from_root(tsk, root);
 	if (cgrp == oldcgrp)
@@ -2062,6 +2062,10 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	tsk = leader;
 	i = 0;
 	do {
+		/* @tsk either already exited or can't exit until the end */
+		if (tsk->flags & PF_EXITING)
+			continue;
+
 		/* as per above, nr_threads may decrease, but not increase. */
 		BUG_ON(i >= group_size);
 		get_task_struct(tsk);
@@ -2116,11 +2120,6 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 			continue;
 		/* get old css_set pointer */
 		task_lock(tsk);
-		if (tsk->flags & PF_EXITING) {
-			/* ignore this task if it's going away */
-			task_unlock(tsk);
-			continue;
-		}
 		oldcg = tsk->cgroups;
 		get_css_set(oldcg);
 		task_unlock(tsk);
@@ -2158,9 +2157,8 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 			if (ss->attach_task)
 				ss->attach_task(cgrp, tsk);
 		}
-		/* if the thread is PF_EXITING, it can just get skipped. */
 		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
-		BUG_ON(retval != 0 && retval != -ESRCH);
+		BUG_ON(retval != 0);
 	}
 	/* nothing is sensitive to fork() after this point. */
 
@@ -2212,8 +2210,8 @@ out_free_group_list:
 
 /*
  * Find the task_struct of the task to attach by vpid and pass it along to the
- * function to attach either it or all tasks in its threadgroup. Will take
- * cgroup_mutex; may take task_lock of task.
+ * function to attach either it or all tasks in its threadgroup. Will lock
+ * cgroup_mutex and threadgroup; may take task_lock of task.
  */
 static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 {
@@ -2240,11 +2238,6 @@ static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 			 * detect it later.
 			 */
 			tsk = tsk->group_leader;
-		} else if (tsk->flags & PF_EXITING) {
-			/* optimization for the single-task-only case */
-			rcu_read_unlock();
-			cgroup_unlock();
-			return -ESRCH;
 		}
 		/*
 		 * even if we're attaching all tasks in the thread group, we
@@ -2268,13 +2261,15 @@ static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 		get_task_struct(tsk);
 	}
 
-	if (threadgroup) {
-		threadgroup_lock(tsk);
+	threadgroup_lock(tsk);
+
+	if (threadgroup)
 		ret = cgroup_attach_proc(cgrp, tsk);
-		threadgroup_unlock(tsk);
-	} else {
+	else
 		ret = cgroup_attach_task(cgrp, tsk);
-	}
+
+	threadgroup_unlock(tsk);
+
 	put_task_struct(tsk);
 	cgroup_unlock();
 	return ret;
-- 
1.7.3.1



* [PATCH 05/10] cgroup: subsys->attach_task() should be called after migration
  2011-11-01 23:46 [PATCHSET] cgroup: stable threadgroup during attach & subsys methods consolidation Tejun Heo
  2011-11-01 23:46 ` [PATCH 04/10] cgroup: always lock threadgroup during migration Tejun Heo
@ 2011-11-01 23:46 ` Tejun Heo
  2011-11-14 20:06   ` Frederic Weisbecker
  2011-11-01 23:46 ` [PATCH 06/10] cgroup: improve old cgroup handling in cgroup_attach_proc() Tejun Heo
From: Tejun Heo @ 2011-11-01 23:46 UTC (permalink / raw)
  To: paul, rjw, lizf
  Cc: linux-pm, linux-kernel, containers, fweisbec, matthltc, akpm,
	oleg, kamezawa.hiroyu, Tejun Heo

cgroup_attach_task() calls subsys->attach_task() after
cgroup_task_migrate(); however, cgroup_attach_proc() calls it before
migration.  This actually affects some of the users.  Update
cgroup_attach_proc() such that ->attach_task() is called after
migration.
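
With this, both attach paths follow the same order (sketch):

	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
	BUG_ON(retval != 0);

	/* notify subsystems only after the task has actually moved */
	for_each_subsys(root, ss)
		if (ss->attach_task)
			ss->attach_task(cgrp, tsk);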

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@Jp.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
---
 kernel/cgroup.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 83e10f9..3679fb6 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2152,13 +2152,15 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 		oldcgrp = task_cgroup_from_root(tsk, root);
 		if (cgrp == oldcgrp)
 			continue;
+
+		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
+		BUG_ON(retval != 0);
+
 		/* attach each task to each subsystem */
 		for_each_subsys(root, ss) {
 			if (ss->attach_task)
 				ss->attach_task(cgrp, tsk);
 		}
-		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
-		BUG_ON(retval != 0);
 	}
 	/* nothing is sensitive to fork() after this point. */
 
-- 
1.7.3.1



* [PATCH 06/10] cgroup: improve old cgroup handling in cgroup_attach_proc()
  2011-11-01 23:46 [PATCHSET] cgroup: stable threadgroup during attach & subsys methods consolidation Tejun Heo
  2011-11-01 23:46 ` [PATCH 05/10] cgroup: subsys->attach_task() should be called after migration Tejun Heo
@ 2011-11-01 23:46 ` Tejun Heo
  2011-11-14 20:37   ` Frederic Weisbecker
  2011-11-01 23:46 ` [PATCH 07/10] cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach() Tejun Heo
From: Tejun Heo @ 2011-11-01 23:46 UTC (permalink / raw)
  To: paul, rjw, lizf
  Cc: linux-pm, linux-kernel, containers, fweisbec, matthltc, akpm,
	oleg, kamezawa.hiroyu, Tejun Heo

cgroup_attach_proc() behaves differently from cgroup_attach_task() in
the following aspects.

* All hooks are invoked even if no task is actually being moved.

* ->can_attach_task() is called for all tasks in the group whether the
  new cgrp is different from the current cgrp or not; however,
  ->attach_task() is skipped if the new cgrp equals the current one.
  This makes the calls asymmetric.

This patch improves old cgroup handling in cgroup_attach_proc() by
looking up the current cgroup up front, recording it in the flex
array along with the task itself, and using it to remove the above two
differences.  This will also ease further changes.
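
The recorded pair is a small struct (from the diff below), filled once
per thread while building the array:

	struct task_and_cgroup {
		struct task_struct	*task;
		struct cgroup		*cgrp;	/* cgroup before the move */
	};

	ent.task = tsk;
	ent.cgrp = task_cgroup_from_root(tsk, root);
	retval = flex_array_put(group, i, &ent, GFP_ATOMIC);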

-v2: nr_todo renamed to nr_migrating_tasks as per Paul Menage's
     suggestion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Paul Menage <paul@paulmenage.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
---
 kernel/cgroup.c |   66 +++++++++++++++++++++++++++++++++++--------------------
 1 files changed, 42 insertions(+), 24 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3679fb6..19396d6 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1757,6 +1757,11 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 }
 EXPORT_SYMBOL_GPL(cgroup_path);
 
+struct task_and_cgroup {
+	struct task_struct	*task;
+	struct cgroup		*cgrp;
+};
+
 /*
  * cgroup_task_migrate - move a task from one cgroup to another.
  *
@@ -2008,15 +2013,15 @@ static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
  */
 int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 {
-	int retval, i, group_size;
+	int retval, i, group_size, nr_migrating_tasks;
 	struct cgroup_subsys *ss, *failed_ss = NULL;
 	bool cancel_failed_ss = false;
 	/* guaranteed to be initialized later, but the compiler needs this */
-	struct cgroup *oldcgrp = NULL;
 	struct css_set *oldcg;
 	struct cgroupfs_root *root = cgrp->root;
 	/* threadgroup list cursor and array */
 	struct task_struct *tsk;
+	struct task_and_cgroup *tc;
 	struct flex_array *group;
 	/*
 	 * we need to make sure we have css_sets for all the tasks we're
@@ -2035,8 +2040,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	 */
 	group_size = get_nr_threads(leader);
 	/* flex_array supports very large thread-groups better than kmalloc. */
-	group = flex_array_alloc(sizeof(struct task_struct *), group_size,
-				 GFP_KERNEL);
+	group = flex_array_alloc(sizeof(*tc), group_size, GFP_KERNEL);
 	if (!group)
 		return -ENOMEM;
 	/* pre-allocate to guarantee space while iterating in rcu read-side. */
@@ -2060,8 +2064,10 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	}
 	/* take a reference on each task in the group to go in the array. */
 	tsk = leader;
-	i = 0;
+	i = nr_migrating_tasks = 0;
 	do {
+		struct task_and_cgroup ent;
+
 		/* @tsk either already exited or can't exit until the end */
 		if (tsk->flags & PF_EXITING)
 			continue;
@@ -2073,14 +2079,23 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 		 * saying GFP_ATOMIC has no effect here because we did prealloc
 		 * earlier, but it's good form to communicate our expectations.
 		 */
-		retval = flex_array_put_ptr(group, i, tsk, GFP_ATOMIC);
+		ent.task = tsk;
+		ent.cgrp = task_cgroup_from_root(tsk, root);
+		retval = flex_array_put(group, i, &ent, GFP_ATOMIC);
 		BUG_ON(retval != 0);
 		i++;
+		if (ent.cgrp != cgrp)
+			nr_migrating_tasks++;
 	} while_each_thread(leader, tsk);
 	/* remember the number of threads in the array for later. */
 	group_size = i;
 	rcu_read_unlock();
 
+	/* methods shouldn't be called if no task is actually migrating */
+	retval = 0;
+	if (!nr_migrating_tasks)
+		goto out_put_tasks;
+
 	/*
 	 * step 1: check that we can legitimately attach to the cgroup.
 	 */
@@ -2096,8 +2111,10 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 		if (ss->can_attach_task) {
 			/* run on each task in the threadgroup. */
 			for (i = 0; i < group_size; i++) {
-				tsk = flex_array_get_ptr(group, i);
-				retval = ss->can_attach_task(cgrp, tsk);
+				tc = flex_array_get(group, i);
+				if (tc->cgrp == cgrp)
+					continue;
+				retval = ss->can_attach_task(cgrp, tc->task);
 				if (retval) {
 					failed_ss = ss;
 					cancel_failed_ss = true;
@@ -2113,18 +2130,17 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	 */
 	INIT_LIST_HEAD(&newcg_list);
 	for (i = 0; i < group_size; i++) {
-		tsk = flex_array_get_ptr(group, i);
+		tc = flex_array_get(group, i);
 		/* nothing to do if this task is already in the cgroup */
-		oldcgrp = task_cgroup_from_root(tsk, root);
-		if (cgrp == oldcgrp)
+		if (tc->cgrp == cgrp)
 			continue;
 		/* get old css_set pointer */
-		task_lock(tsk);
-		oldcg = tsk->cgroups;
+		task_lock(tc->task);
+		oldcg = tc->task->cgroups;
 		get_css_set(oldcg);
-		task_unlock(tsk);
+		task_unlock(tc->task);
 		/* see if the new one for us is already in the list? */
-		if (css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list)) {
+		if (css_set_check_fetched(cgrp, tc->task, oldcg, &newcg_list)) {
 			/* was already there, nothing to do. */
 			put_css_set(oldcg);
 		} else {
@@ -2147,19 +2163,18 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 			ss->pre_attach(cgrp);
 	}
 	for (i = 0; i < group_size; i++) {
-		tsk = flex_array_get_ptr(group, i);
+		tc = flex_array_get(group, i);
 		/* leave current thread as it is if it's already there */
-		oldcgrp = task_cgroup_from_root(tsk, root);
-		if (cgrp == oldcgrp)
+		if (tc->cgrp == cgrp)
 			continue;
 
-		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
+		retval = cgroup_task_migrate(cgrp, tc->cgrp, tc->task, true);
 		BUG_ON(retval != 0);
 
 		/* attach each task to each subsystem */
 		for_each_subsys(root, ss) {
 			if (ss->attach_task)
-				ss->attach_task(cgrp, tsk);
+				ss->attach_task(cgrp, tc->task);
 		}
 	}
 	/* nothing is sensitive to fork() after this point. */
@@ -2170,8 +2185,10 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	 * being moved, this call will need to be reworked to communicate that.
 	 */
 	for_each_subsys(root, ss) {
-		if (ss->attach)
-			ss->attach(ss, cgrp, oldcgrp, leader);
+		if (ss->attach) {
+			tc = flex_array_get(group, 0);
+			ss->attach(ss, cgrp, tc->cgrp, tc->task);
+		}
 	}
 
 	/*
@@ -2200,10 +2217,11 @@ out_cancel_attach:
 				ss->cancel_attach(ss, cgrp, leader);
 		}
 	}
+out_put_tasks:
 	/* clean up the array of referenced threads in the group. */
 	for (i = 0; i < group_size; i++) {
-		tsk = flex_array_get_ptr(group, i);
-		put_task_struct(tsk);
+		tc = flex_array_get(group, i);
+		put_task_struct(tc->task);
 	}
 out_free_group_list:
 	flex_array_free(group);
-- 
1.7.3.1



* [PATCH 07/10] cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()
  2011-11-01 23:46 [PATCHSET] cgroup: stable threadgroup during attach & subsys methods consolidation Tejun Heo
  2011-11-01 23:46 ` [PATCH 06/10] cgroup: improve old cgroup handling in cgroup_attach_proc() Tejun Heo
@ 2011-11-01 23:46 ` Tejun Heo
  2011-11-14 21:16   ` Frederic Weisbecker
  2011-11-01 23:46 ` [PATCH 08/10] cgroup: don't use subsys->can_attach_task() or ->attach_task() Tejun Heo
From: Tejun Heo @ 2011-11-01 23:46 UTC (permalink / raw)
  To: paul, rjw, lizf
  Cc: linux-pm, linux-kernel, containers, fweisbec, matthltc, akpm,
	oleg, kamezawa.hiroyu, Tejun Heo, Balbir Singh,
	Daisuke Nishimura, KAMEZAWA Hiroyuki, James Morris

Currently, there's no way to pass multiple tasks to cgroup_subsys
methods, which necessitates separate per-process and per-task methods.
This patch introduces cgroup_taskset, which can be used to pass
multiple tasks and their associated cgroups to cgroup_subsys methods.

Three methods - can_attach(), cancel_attach() and attach() - are
converted to use cgroup_taskset.  This unifies passed parameters so
that all methods have access to all information.  Conversions in this
patchset are identical and don't introduce any behavior change.
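
For a subsystem which only ever cared about a single task, the
conversion is mechanical - pull the task back out of the taskset
(foo_attach is a made-up example; compare the cpuset and memcg hunks
below):

	static void foo_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
			       struct cgroup_taskset *tset)
	{
		/* for process attach, the first task is the leader */
		struct task_struct *task = cgroup_taskset_first(tset);
		struct cgroup *old_cgrp = cgroup_taskset_cur_cgroup(tset);

		/* ... previous body, using @task and @old_cgrp ... */
	}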

-v2: documentation updated as per Paul Menage's suggestion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Paul Menage <paul@paulmenage.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: James Morris <jmorris@namei.org>
---
 Documentation/cgroups/cgroups.txt |   31 ++++++++----
 include/linux/cgroup.h            |   28 +++++++++-
 kernel/cgroup.c                   |   99 +++++++++++++++++++++++++++++++++----
 kernel/cgroup_freezer.c           |    2 +-
 kernel/cpuset.c                   |   18 ++++---
 mm/memcontrol.c                   |   16 +++---
 security/device_cgroup.c          |    7 ++-
 7 files changed, 158 insertions(+), 43 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index cd67e90..bf5d6c9 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -594,15 +594,25 @@ rmdir() will fail with it. From this behavior, pre_destroy() can be
 called multiple times against a cgroup.
 
 int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-	       struct task_struct *task)
+	       struct cgroup_taskset *tset)
 (cgroup_mutex held by caller)
 
-Called prior to moving a task into a cgroup; if the subsystem
-returns an error, this will abort the attach operation.  If a NULL
-task is passed, then a successful result indicates that *any*
-unspecified task can be moved into the cgroup. Note that this isn't
-called on a fork. If this method returns 0 (success) then this should
-remain valid while the caller holds cgroup_mutex and it is ensured that either
+Called prior to moving one or more tasks into a cgroup; if the
+subsystem returns an error, this will abort the attach operation.
+@tset contains the tasks to be attached and is guaranteed to have at
+least one task in it.
+
+If there are multiple tasks in the taskset, then:
+  - it's guaranteed that all are from the same thread group
+  - @tset contains all tasks from the thread group whether or not
+    they're switching cgroups
+  - the first task is the leader
+
+Each @tset entry also contains the task's old cgroup and tasks which
+aren't switching cgroup can be skipped easily using the
+cgroup_taskset_for_each() iterator. Note that this isn't called on a
+fork. If this method returns 0 (success) then this should remain valid
+while the caller holds cgroup_mutex and it is ensured that either
 attach() or cancel_attach() will be called in future.
 
 int can_attach_task(struct cgroup *cgrp, struct task_struct *tsk);
@@ -613,14 +623,14 @@ attached (possibly many when using cgroup_attach_proc). Called after
 can_attach.
 
 void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-	       struct task_struct *task, bool threadgroup)
+		   struct cgroup_taskset *tset)
 (cgroup_mutex held by caller)
 
 Called when a task attach operation has failed after can_attach() has succeeded.
 A subsystem whose can_attach() has some side-effects should provide this
 function, so that the subsystem can implement a rollback. If not, not necessary.
 This will be called only about subsystems whose can_attach() operation have
-succeeded.
+succeeded. The parameters are identical to can_attach().
 
 void pre_attach(struct cgroup *cgrp);
 (cgroup_mutex held by caller)
@@ -629,11 +639,12 @@ For any non-per-thread attachment work that needs to happen before
 attach_task. Needed by cpuset.
 
 void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-	    struct cgroup *old_cgrp, struct task_struct *task)
+	    struct cgroup_taskset *tset)
 (cgroup_mutex held by caller)
 
 Called after the task has been attached to the cgroup, to allow any
 post-attachment activity that requires memory allocations or blocking.
+The parameters are identical to can_attach().
 
 void attach_task(struct cgroup *cgrp, struct task_struct *tsk);
 (cgroup_mutex held by caller)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index da7e4bc..2470c8e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -457,6 +457,28 @@ void cgroup_exclude_rmdir(struct cgroup_subsys_state *css);
 void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css);
 
 /*
+ * Control Group taskset, used to pass around set of tasks to cgroup_subsys
+ * methods.
+ */
+struct cgroup_taskset;
+struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset);
+struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset);
+struct cgroup *cgroup_taskset_cur_cgroup(struct cgroup_taskset *tset);
+int cgroup_taskset_size(struct cgroup_taskset *tset);
+
+/**
+ * cgroup_taskset_for_each - iterate cgroup_taskset
+ * @task: the loop cursor
+ * @skip_cgrp: skip if task's cgroup matches this, %NULL to iterate through all
+ * @tset: taskset to iterate
+ */
+#define cgroup_taskset_for_each(task, skip_cgrp, tset)			\
+	for ((task) = cgroup_taskset_first((tset)); (task);		\
+	     (task) = cgroup_taskset_next((tset)))			\
+		if (!(skip_cgrp) ||					\
+		    cgroup_taskset_cur_cgroup((tset)) != (skip_cgrp))
+
+/*
  * Control Group subsystem type.
  * See Documentation/cgroups/cgroups.txt for details
  */
@@ -467,14 +489,14 @@ struct cgroup_subsys {
 	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
-			  struct task_struct *tsk);
+			  struct cgroup_taskset *tset);
 	int (*can_attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
 	void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
-			      struct task_struct *tsk);
+			      struct cgroup_taskset *tset);
 	void (*pre_attach)(struct cgroup *cgrp);
 	void (*attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
-		       struct cgroup *old_cgrp, struct task_struct *tsk);
+		       struct cgroup_taskset *tset);
 	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
 	void (*exit)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			struct cgroup *old_cgrp, struct task_struct *task);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 19396d6..0bb7927 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1757,11 +1757,85 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 }
 EXPORT_SYMBOL_GPL(cgroup_path);
 
+/*
+ * Control Group taskset
+ */
 struct task_and_cgroup {
 	struct task_struct	*task;
 	struct cgroup		*cgrp;
 };
 
+struct cgroup_taskset {
+	struct task_and_cgroup	single;
+	struct flex_array	*tc_array;
+	int			tc_array_len;
+	int			idx;
+	struct cgroup		*cur_cgrp;
+};
+
+/**
+ * cgroup_taskset_first - reset taskset and return the first task
+ * @tset: taskset of interest
+ *
+ * @tset iteration is initialized and the first task is returned.
+ */
+struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset)
+{
+	if (tset->tc_array) {
+		tset->idx = 0;
+		return cgroup_taskset_next(tset);
+	} else {
+		tset->cur_cgrp = tset->single.cgrp;
+		return tset->single.task;
+	}
+}
+EXPORT_SYMBOL_GPL(cgroup_taskset_first);
+
+/**
+ * cgroup_taskset_next - iterate to the next task in taskset
+ * @tset: taskset of interest
+ *
+ * Return the next task in @tset.  Iteration must have been initialized
+ * with cgroup_taskset_first().
+ */
+struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset)
+{
+	struct task_and_cgroup *tc;
+
+	if (!tset->tc_array || tset->idx >= tset->tc_array_len)
+		return NULL;
+
+	tc = flex_array_get(tset->tc_array, tset->idx++);
+	tset->cur_cgrp = tc->cgrp;
+	return tc->task;
+}
+EXPORT_SYMBOL_GPL(cgroup_taskset_next);
+
+/**
+ * cgroup_taskset_cur_cgroup - return the matching cgroup for the current task
+ * @tset: taskset of interest
+ *
+ * Return the cgroup for the current (last returned) task of @tset.  This
+ * function must be preceded by either cgroup_taskset_first() or
+ * cgroup_taskset_next().
+ */
+struct cgroup *cgroup_taskset_cur_cgroup(struct cgroup_taskset *tset)
+{
+	return tset->cur_cgrp;
+}
+EXPORT_SYMBOL_GPL(cgroup_taskset_cur_cgroup);
+
+/**
+ * cgroup_taskset_size - return the number of tasks in taskset
+ * @tset: taskset of interest
+ */
+int cgroup_taskset_size(struct cgroup_taskset *tset)
+{
+	return tset->tc_array ? tset->tc_array_len : 1;
+}
+EXPORT_SYMBOL_GPL(cgroup_taskset_size);
+
+
 /*
  * cgroup_task_migrate - move a task from one cgroup to another.
  *
@@ -1842,6 +1916,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	struct cgroup_subsys *ss, *failed_ss = NULL;
 	struct cgroup *oldcgrp;
 	struct cgroupfs_root *root = cgrp->root;
+	struct cgroup_taskset tset = { };
 
 	/* @tsk either already exited or can't exit until the end */
 	if (tsk->flags & PF_EXITING)
@@ -1852,9 +1927,12 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	if (cgrp == oldcgrp)
 		return 0;
 
+	tset.single.task = tsk;
+	tset.single.cgrp = oldcgrp;
+
 	for_each_subsys(root, ss) {
 		if (ss->can_attach) {
-			retval = ss->can_attach(ss, cgrp, tsk);
+			retval = ss->can_attach(ss, cgrp, &tset);
 			if (retval) {
 				/*
 				 * Remember on which subsystem the can_attach()
@@ -1885,7 +1963,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		if (ss->attach_task)
 			ss->attach_task(cgrp, tsk);
 		if (ss->attach)
-			ss->attach(ss, cgrp, oldcgrp, tsk);
+			ss->attach(ss, cgrp, &tset);
 	}
 
 	synchronize_rcu();
@@ -1907,7 +1985,7 @@ out:
 				 */
 				break;
 			if (ss->cancel_attach)
-				ss->cancel_attach(ss, cgrp, tsk);
+				ss->cancel_attach(ss, cgrp, &tset);
 		}
 	}
 	return retval;
@@ -2023,6 +2101,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	struct task_struct *tsk;
 	struct task_and_cgroup *tc;
 	struct flex_array *group;
+	struct cgroup_taskset tset = { };
 	/*
 	 * we need to make sure we have css_sets for all the tasks we're
 	 * going to move -before- we actually start moving them, so that in
@@ -2089,6 +2168,8 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	} while_each_thread(leader, tsk);
 	/* remember the number of threads in the array for later. */
 	group_size = i;
+	tset.tc_array = group;
+	tset.tc_array_len = group_size;
 	rcu_read_unlock();
 
 	/* methods shouldn't be called if no task is actually migrating */
@@ -2101,7 +2182,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	 */
 	for_each_subsys(root, ss) {
 		if (ss->can_attach) {
-			retval = ss->can_attach(ss, cgrp, leader);
+			retval = ss->can_attach(ss, cgrp, &tset);
 			if (retval) {
 				failed_ss = ss;
 				goto out_cancel_attach;
@@ -2185,10 +2266,8 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	 * being moved, this call will need to be reworked to communicate that.
 	 */
 	for_each_subsys(root, ss) {
-		if (ss->attach) {
-			tc = flex_array_get(group, 0);
-			ss->attach(ss, cgrp, tc->cgrp, tc->task);
-		}
+		if (ss->attach)
+			ss->attach(ss, cgrp, &tset);
 	}
 
 	/*
@@ -2210,11 +2289,11 @@ out_cancel_attach:
 		for_each_subsys(root, ss) {
 			if (ss == failed_ss) {
 				if (cancel_failed_ss && ss->cancel_attach)
-					ss->cancel_attach(ss, cgrp, leader);
+					ss->cancel_attach(ss, cgrp, &tset);
 				break;
 			}
 			if (ss->cancel_attach)
-				ss->cancel_attach(ss, cgrp, leader);
+				ss->cancel_attach(ss, cgrp, &tset);
 		}
 	}
 out_put_tasks:
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index e691818..a72d5fa 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -160,7 +160,7 @@ static void freezer_destroy(struct cgroup_subsys *ss,
  */
 static int freezer_can_attach(struct cgroup_subsys *ss,
 			      struct cgroup *new_cgroup,
-			      struct task_struct *task)
+			      struct cgroup_taskset *tset)
 {
 	struct freezer *freezer;
 
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 10131fd..2e5825b 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1368,10 +1368,10 @@ static int fmeter_getrate(struct fmeter *fmp)
 }
 
 /* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
-static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
-			     struct task_struct *tsk)
+static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+			     struct cgroup_taskset *tset)
 {
-	struct cpuset *cs = cgroup_cs(cont);
+	struct cpuset *cs = cgroup_cs(cgrp);
 
 	if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
 		return -ENOSPC;
@@ -1384,7 +1384,7 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
 	 * set_cpus_allowed_ptr() on all attached tasks before cpus_allowed may
 	 * be changed.
 	 */
-	if (tsk->flags & PF_THREAD_BOUND)
+	if (cgroup_taskset_first(tset)->flags & PF_THREAD_BOUND)
 		return -EINVAL;
 
 	return 0;
@@ -1434,12 +1434,14 @@ static void cpuset_attach_task(struct cgroup *cont, struct task_struct *tsk)
 	cpuset_update_task_spread_flag(cs, tsk);
 }
 
-static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
-			  struct cgroup *oldcont, struct task_struct *tsk)
+static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+			  struct cgroup_taskset *tset)
 {
 	struct mm_struct *mm;
-	struct cpuset *cs = cgroup_cs(cont);
-	struct cpuset *oldcs = cgroup_cs(oldcont);
+	struct task_struct *tsk = cgroup_taskset_first(tset);
+	struct cgroup *oldcgrp = cgroup_taskset_cur_cgroup(tset);
+	struct cpuset *cs = cgroup_cs(cgrp);
+	struct cpuset *oldcs = cgroup_cs(oldcgrp);
 
 	/*
 	 * Change mm, possibly for multiple threads in a threadgroup. This is
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3508777..5019787 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5288,8 +5288,9 @@ static void mem_cgroup_clear_mc(void)
 
 static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p)
+				struct cgroup_taskset *tset)
 {
+	struct task_struct *p = cgroup_taskset_first(tset);
 	int ret = 0;
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
 
@@ -5327,7 +5328,7 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 
 static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p)
+				struct cgroup_taskset *tset)
 {
 	mem_cgroup_clear_mc();
 }
@@ -5444,9 +5445,9 @@ retry:
 
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct cgroup *cont,
-				struct cgroup *old_cont,
-				struct task_struct *p)
+				struct cgroup_taskset *tset)
 {
+	struct task_struct *p = cgroup_taskset_first(tset);
 	struct mm_struct *mm = get_task_mm(p);
 
 	if (mm) {
@@ -5461,19 +5462,18 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 #else	/* !CONFIG_MMU */
 static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p)
+				struct cgroup_taskset *tset)
 {
 	return 0;
 }
 static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
-				struct task_struct *p)
+				struct cgroup_taskset *tset)
 {
 }
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct cgroup *cont,
-				struct cgroup *old_cont,
-				struct task_struct *p)
+				struct cgroup_taskset *tset)
 {
 }
 #endif
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 4450fbe..8b5b5d8 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -62,11 +62,12 @@ static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
 struct cgroup_subsys devices_subsys;
 
 static int devcgroup_can_attach(struct cgroup_subsys *ss,
-		struct cgroup *new_cgroup, struct task_struct *task)
+			struct cgroup *new_cgrp, struct cgroup_taskset *set)
 {
-	if (current != task && !capable(CAP_SYS_ADMIN))
-			return -EPERM;
+	struct task_struct *task = cgroup_taskset_first(set);
 
+	if (current != task && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
 	return 0;
 }
 
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread
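
A quick illustrative sketch of how a subsystem method consumes the
taskset API added above (not part of the patch; the my_* names are
hypothetical, and cgroup_taskset_next() is assumed here as the iterator
counterpart of cgroup_taskset_first() that the series describes):

    /* hypothetical callback exercising the taskset accessors */
    static int my_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
                             struct cgroup_taskset *tset)
    {
            struct task_struct *task;

            /* single-task and whole-process attach now look identical;
             * cgroup_taskset_size() is 1 for the former */
            for (task = cgroup_taskset_first(tset); task;
                 task = cgroup_taskset_next(tset)) {
                    /* each task is paired with its old cgroup, see
                     * cgroup_taskset_cur_cgroup() in the cpuset hunk above */
                    if (task->flags & PF_EXITING)
                            return -ESRCH;  /* nothing migrated yet, safe to fail */
            }
            return 0;
    }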

* [PATCH 08/10] cgroup: don't use subsys->can_attach_task() or ->attach_task()
  2011-11-01 23:46 [PATCHSET] cgroup: stable threadgroup during attach & subsys methods consolidation Tejun Heo
                   ` (6 preceding siblings ...)
  2011-11-01 23:46 ` [PATCH 07/10] cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach() Tejun Heo
@ 2011-11-01 23:46 ` Tejun Heo
  2011-11-04  9:08   ` KAMEZAWA Hiroyuki
  2011-11-14 23:54   ` Frederic Weisbecker
  2011-11-01 23:46 ` [PATCH 09/10] cgroup, cpuset: don't use ss->pre_attach() Tejun Heo
                   ` (3 subsequent siblings)
  11 siblings, 2 replies; 56+ messages in thread
From: Tejun Heo @ 2011-11-01 23:46 UTC (permalink / raw)
  To: paul, rjw, lizf
  Cc: linux-pm, linux-kernel, containers, fweisbec, matthltc, akpm,
	oleg, kamezawa.hiroyu, Tejun Heo, Balbir Singh,
	Daisuke Nishimura, KAMEZAWA Hiroyuki, James Morris, Ingo Molnar,
	Peter Zijlstra

Now that subsys->can_attach() and attach() take @tset instead of
@task, they can handle per-task operations.  Convert
->can_attach_task() and ->attach_task() users to use ->can_attach()
and attach() instead.  Most conversions are straightforward.
Noteworthy changes are:

* In cgroup_freezer, remove unnecessary NULL assignments to unused
  methods.  It's useless and very prone to get out of sync, which
  already happened.

* In cpuset, PF_THREAD_BOUND test is checked for each task.  This
  doesn't make any practical difference but is conceptually cleaner.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: James Morris <jmorris@namei.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 block/blk-cgroup.c      |   45 +++++++++++++++++++-----------
 kernel/cgroup_freezer.c |   25 ++++++-----------
 kernel/cpuset.c         |   70 +++++++++++++++++++++-------------------------
 kernel/events/core.c    |   13 +++++---
 kernel/sched.c          |   31 +++++++++++++--------
 5 files changed, 96 insertions(+), 88 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b596e54..b5e3fef 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -30,8 +30,10 @@ EXPORT_SYMBOL_GPL(blkio_root_cgroup);
 
 static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
 						  struct cgroup *);
-static int blkiocg_can_attach_task(struct cgroup *, struct task_struct *);
-static void blkiocg_attach_task(struct cgroup *, struct task_struct *);
+static int blkiocg_can_attach(struct cgroup_subsys *, struct cgroup *,
+			      struct cgroup_taskset *);
+static void blkiocg_attach(struct cgroup_subsys *, struct cgroup *,
+			   struct cgroup_taskset *);
 static void blkiocg_destroy(struct cgroup_subsys *, struct cgroup *);
 static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
 
@@ -44,8 +46,8 @@ static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
 struct cgroup_subsys blkio_subsys = {
 	.name = "blkio",
 	.create = blkiocg_create,
-	.can_attach_task = blkiocg_can_attach_task,
-	.attach_task = blkiocg_attach_task,
+	.can_attach = blkiocg_can_attach,
+	.attach = blkiocg_attach,
 	.destroy = blkiocg_destroy,
 	.populate = blkiocg_populate,
 #ifdef CONFIG_BLK_CGROUP
@@ -1609,30 +1611,39 @@ done:
  * of the main cic data structures.  For now we allow a task to change
  * its cgroup only if it's the only owner of its ioc.
  */
-static int blkiocg_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static int blkiocg_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+			      struct cgroup_taskset *tset)
 {
+	struct task_struct *task;
 	struct io_context *ioc;
 	int ret = 0;
 
 	/* task_lock() is needed to avoid races with exit_io_context() */
-	task_lock(tsk);
-	ioc = tsk->io_context;
-	if (ioc && atomic_read(&ioc->nr_tasks) > 1)
-		ret = -EINVAL;
-	task_unlock(tsk);
-
+	cgroup_taskset_for_each(task, cgrp, tset) {
+		task_lock(task);
+		ioc = task->io_context;
+		if (ioc && atomic_read(&ioc->nr_tasks) > 1)
+			ret = -EINVAL;
+		task_unlock(task);
+		if (ret)
+			break;
+	}
 	return ret;
 }
 
-static void blkiocg_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static void blkiocg_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+			   struct cgroup_taskset *tset)
 {
+	struct task_struct *task;
 	struct io_context *ioc;
 
-	task_lock(tsk);
-	ioc = tsk->io_context;
-	if (ioc)
-		ioc->cgroup_changed = 1;
-	task_unlock(tsk);
+	cgroup_taskset_for_each(task, cgrp, tset) {
+		task_lock(task);
+		ioc = task->io_context;
+		if (ioc)
+			ioc->cgroup_changed = 1;
+		task_unlock(task);
+	}
 }
 
 void blkio_policy_register(struct blkio_policy_type *blkiop)
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index a72d5fa..68ef861 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -163,10 +163,19 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 			      struct cgroup_taskset *tset)
 {
 	struct freezer *freezer;
+	struct task_struct *task;
 
 	/*
 	 * Anything frozen can't move or be moved to/from.
 	 */
+	rcu_read_lock();
+	cgroup_taskset_for_each(task, new_cgroup, tset) {
+		if (__cgroup_freezing_or_frozen(task)) {
+			rcu_read_unlock();
+			return -EBUSY;
+		}
+	}
+	rcu_read_unlock();
 
 	freezer = cgroup_freezer(new_cgroup);
 	if (freezer->state != CGROUP_THAWED)
@@ -175,17 +184,6 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 	return 0;
 }
 
-static int freezer_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
-{
-	rcu_read_lock();
-	if (__cgroup_freezing_or_frozen(tsk)) {
-		rcu_read_unlock();
-		return -EBUSY;
-	}
-	rcu_read_unlock();
-	return 0;
-}
-
 static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 {
 	struct freezer *freezer;
@@ -381,10 +379,5 @@ struct cgroup_subsys freezer_subsys = {
 	.populate	= freezer_populate,
 	.subsys_id	= freezer_subsys_id,
 	.can_attach	= freezer_can_attach,
-	.can_attach_task = freezer_can_attach_task,
-	.pre_attach	= NULL,
-	.attach_task	= NULL,
-	.attach		= NULL,
 	.fork		= freezer_fork,
-	.exit		= NULL,
 };
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 2e5825b..472ddd6 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1372,33 +1372,34 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			     struct cgroup_taskset *tset)
 {
 	struct cpuset *cs = cgroup_cs(cgrp);
+	struct task_struct *task;
+	int ret;
 
 	if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
 		return -ENOSPC;
 
-	/*
-	 * Kthreads bound to specific cpus cannot be moved to a new cpuset; we
-	 * cannot change their cpu affinity and isolating such threads by their
-	 * set of allowed nodes is unnecessary.  Thus, cpusets are not
-	 * applicable for such threads.  This prevents checking for success of
-	 * set_cpus_allowed_ptr() on all attached tasks before cpus_allowed may
-	 * be changed.
-	 */
-	if (cgroup_taskset_first(tset)->flags & PF_THREAD_BOUND)
-		return -EINVAL;
-
+	cgroup_taskset_for_each(task, cgrp, tset) {
+		/*
+		 * Kthreads bound to specific cpus cannot be moved to a new
+		 * cpuset; we cannot change their cpu affinity and
+		 * isolating such threads by their set of allowed nodes is
+		 * unnecessary.  Thus, cpusets are not applicable for such
+		 * threads.  This prevents checking for success of
+		 * set_cpus_allowed_ptr() on all attached tasks before
+		 * cpus_allowed may be changed.
+		 */
+		if (task->flags & PF_THREAD_BOUND)
+			return -EINVAL;
+		if ((ret = security_task_setscheduler(task)))
+			return ret;
+	}
 	return 0;
 }
 
-static int cpuset_can_attach_task(struct cgroup *cgrp, struct task_struct *task)
-{
-	return security_task_setscheduler(task);
-}
-
 /*
  * Protected by cgroup_lock. The nodemasks must be stored globally because
  * dynamically allocating them is not allowed in pre_attach, and they must
- * persist among pre_attach, attach_task, and attach.
+ * persist among pre_attach, and attach.
  */
 static cpumask_var_t cpus_attach;
 static nodemask_t cpuset_attach_nodemask_from;
@@ -1417,39 +1418,34 @@ static void cpuset_pre_attach(struct cgroup *cont)
 	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
 }
 
-/* Per-thread attachment work. */
-static void cpuset_attach_task(struct cgroup *cont, struct task_struct *tsk)
-{
-	int err;
-	struct cpuset *cs = cgroup_cs(cont);
-
-	/*
-	 * can_attach beforehand should guarantee that this doesn't fail.
-	 * TODO: have a better way to handle failure here
-	 */
-	err = set_cpus_allowed_ptr(tsk, cpus_attach);
-	WARN_ON_ONCE(err);
-
-	cpuset_change_task_nodemask(tsk, &cpuset_attach_nodemask_to);
-	cpuset_update_task_spread_flag(cs, tsk);
-}
-
 static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			  struct cgroup_taskset *tset)
 {
 	struct mm_struct *mm;
-	struct task_struct *tsk = cgroup_taskset_first(tset);
+	struct task_struct *task;
+	struct task_struct *leader = cgroup_taskset_first(tset);
 	struct cgroup *oldcgrp = cgroup_taskset_cur_cgroup(tset);
 	struct cpuset *cs = cgroup_cs(cgrp);
 	struct cpuset *oldcs = cgroup_cs(oldcgrp);
 
+	cgroup_taskset_for_each(task, cgrp, tset) {
+		/*
+		 * can_attach beforehand should guarantee that this doesn't
+		 * fail.  TODO: have a better way to handle failure here
+		 */
+		WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
+
+		cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to);
+		cpuset_update_task_spread_flag(cs, task);
+	}
+
 	/*
 	 * Change mm, possibly for multiple threads in a threadgroup. This is
 	 * expensive and may sleep.
 	 */
 	cpuset_attach_nodemask_from = oldcs->mems_allowed;
 	cpuset_attach_nodemask_to = cs->mems_allowed;
-	mm = get_task_mm(tsk);
+	mm = get_task_mm(leader);
 	if (mm) {
 		mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
 		if (is_memory_migrate(cs))
@@ -1905,9 +1901,7 @@ struct cgroup_subsys cpuset_subsys = {
 	.create = cpuset_create,
 	.destroy = cpuset_destroy,
 	.can_attach = cpuset_can_attach,
-	.can_attach_task = cpuset_can_attach_task,
 	.pre_attach = cpuset_pre_attach,
-	.attach_task = cpuset_attach_task,
 	.attach = cpuset_attach,
 	.populate = cpuset_populate,
 	.post_clone = cpuset_post_clone,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d1a1bee..ed35211 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7136,10 +7136,13 @@ static int __perf_cgroup_move(void *info)
 	return 0;
 }
 
-static void
-perf_cgroup_attach_task(struct cgroup *cgrp, struct task_struct *task)
+static void perf_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+			       struct cgroup_taskset *tset)
 {
-	task_function_call(task, __perf_cgroup_move, task);
+	struct task_struct *task;
+
+	cgroup_taskset_for_each(task, cgrp, tset)
+		task_function_call(task, __perf_cgroup_move, task);
 }
 
 static void perf_cgroup_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
@@ -7153,7 +7156,7 @@ static void perf_cgroup_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	if (!(task->flags & PF_EXITING))
 		return;
 
-	perf_cgroup_attach_task(cgrp, task);
+	task_function_call(task, __perf_cgroup_move, task);
 }
 
 struct cgroup_subsys perf_subsys = {
@@ -7162,6 +7165,6 @@ struct cgroup_subsys perf_subsys = {
 	.create		= perf_cgroup_create,
 	.destroy	= perf_cgroup_destroy,
 	.exit		= perf_cgroup_exit,
-	.attach_task	= perf_cgroup_attach_task,
+	.attach		= perf_cgroup_attach,
 };
 #endif /* CONFIG_CGROUP_PERF */
diff --git a/kernel/sched.c b/kernel/sched.c
index d87c6e5..49e3605 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9129,24 +9129,31 @@ cpu_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 	sched_destroy_group(tg);
 }
 
-static int
-cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static int cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+				 struct cgroup_taskset *tset)
 {
+	struct task_struct *task;
+
+	cgroup_taskset_for_each(task, cgrp, tset) {
 #ifdef CONFIG_RT_GROUP_SCHED
-	if (!sched_rt_can_attach(cgroup_tg(cgrp), tsk))
-		return -EINVAL;
+		if (!sched_rt_can_attach(cgroup_tg(cgrp), task))
+			return -EINVAL;
 #else
-	/* We don't support RT-tasks being in separate groups */
-	if (tsk->sched_class != &fair_sched_class)
-		return -EINVAL;
+		/* We don't support RT-tasks being in separate groups */
+		if (task->sched_class != &fair_sched_class)
+			return -EINVAL;
 #endif
+	}
 	return 0;
 }
 
-static void
-cpu_cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static void cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+			      struct cgroup_taskset *tset)
 {
-	sched_move_task(tsk);
+	struct task_struct *task;
+
+	cgroup_taskset_for_each(task, cgrp, tset)
+		sched_move_task(task);
 }
 
 static void
@@ -9482,8 +9489,8 @@ struct cgroup_subsys cpu_cgroup_subsys = {
 	.name		= "cpu",
 	.create		= cpu_cgroup_create,
 	.destroy	= cpu_cgroup_destroy,
-	.can_attach_task = cpu_cgroup_can_attach_task,
-	.attach_task	= cpu_cgroup_attach_task,
+	.can_attach	= cpu_cgroup_can_attach,
+	.attach		= cpu_cgroup_attach,
 	.exit		= cpu_cgroup_exit,
 	.populate	= cpu_cgroup_populate,
 	.subsys_id	= cpu_cgroup_subsys_id,
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread
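
The conversion pattern applied throughout the patch above is mechanical;
schematically, with hypothetical my_* names, using the
cgroup_taskset_for_each() iterator exactly as the diffs above do (its
second argument skips tasks already in the destination cgroup):

    /* before: the cgroup core called this once per task */
    static int my_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
    {
            return my_check_one(tsk);
    }

    /* after: the core calls this once per migration; we iterate ourselves */
    static int my_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
                             struct cgroup_taskset *tset)
    {
            struct task_struct *task;
            int ret;

            cgroup_taskset_for_each(task, cgrp, tset) {
                    ret = my_check_one(task);
                    if (ret)
                            return ret;
            }
            return 0;
    }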

* [PATCH 09/10] cgroup, cpuset: don't use ss->pre_attach()
  2011-11-01 23:46 [PATCHSET] cgroup: stable threadgroup during attach & subsys methods consolidation Tejun Heo
                   ` (7 preceding siblings ...)
  2011-11-01 23:46 ` [PATCH 08/10] cgroup: don't use subsys->can_attach_task() or ->attach_task() Tejun Heo
@ 2011-11-01 23:46 ` Tejun Heo
  2011-11-15  0:51   ` Frederic Weisbecker
  2011-11-01 23:46 ` [PATCH 10/10] cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task() Tejun Heo
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2011-11-01 23:46 UTC (permalink / raw)
  To: paul, rjw, lizf
  Cc: linux-pm, linux-kernel, containers, fweisbec, matthltc, akpm,
	oleg, kamezawa.hiroyu, Tejun Heo

->pre_attach() is supposed to be called before migration; process
migration honors that, but task migration does it the other way
around.  The only ->pre_attach() user is cpuset, which can do the
same operations in ->can_attach().  Collapse cpuset_pre_attach() into
cpuset_can_attach().

-v2: Patch contamination from later patch removed.  Spotted by Paul
     Menage.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
---
 kernel/cpuset.c |   29 ++++++++++++-----------------
 1 files changed, 12 insertions(+), 17 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 472ddd6..f0b8df3 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1367,6 +1367,15 @@ static int fmeter_getrate(struct fmeter *fmp)
 	return val;
 }
 
+/*
+ * Protected by cgroup_lock. The nodemasks must be stored globally because
+ * dynamically allocating them is not allowed in can_attach, and they must
+ * persist until attach.
+ */
+static cpumask_var_t cpus_attach;
+static nodemask_t cpuset_attach_nodemask_from;
+static nodemask_t cpuset_attach_nodemask_to;
+
 /* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
 static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			     struct cgroup_taskset *tset)
@@ -1393,29 +1402,16 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 		if ((ret = security_task_setscheduler(task)))
 			return ret;
 	}
-	return 0;
-}
-
-/*
- * Protected by cgroup_lock. The nodemasks must be stored globally because
- * dynamically allocating them is not allowed in pre_attach, and they must
- * persist among pre_attach, and attach.
- */
-static cpumask_var_t cpus_attach;
-static nodemask_t cpuset_attach_nodemask_from;
-static nodemask_t cpuset_attach_nodemask_to;
-
-/* Set-up work for before attaching each task. */
-static void cpuset_pre_attach(struct cgroup *cont)
-{
-	struct cpuset *cs = cgroup_cs(cont);
 
+	/* prepare for attach */
 	if (cs == &top_cpuset)
 		cpumask_copy(cpus_attach, cpu_possible_mask);
 	else
 		guarantee_online_cpus(cs, cpus_attach);
 
 	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+
+	return 0;
 }
 
 static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
@@ -1901,7 +1897,6 @@ struct cgroup_subsys cpuset_subsys = {
 	.create = cpuset_create,
 	.destroy = cpuset_destroy,
 	.can_attach = cpuset_can_attach,
-	.pre_attach = cpuset_pre_attach,
 	.attach = cpuset_attach,
 	.populate = cpuset_populate,
 	.post_clone = cpuset_post_clone,
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread
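
One design point worth spelling out for the patch above: with
->pre_attach() gone, per-attach scratch state must be computed inside
->can_attach() and parked somewhere that survives until ->attach().
Globals are safe for that because the caller holds cgroup_mutex across
the whole operation, as the moved comment notes.  A compressed sketch
(hypothetical my_* subsystem; allocation of the scratch mask at init
time omitted):

    static cpumask_var_t my_cpus_scratch;      /* protected by cgroup_mutex */

    static int my_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
                             struct cgroup_taskset *tset)
    {
            struct task_struct *task;

            /* step 1: validate -- the only place failure is allowed */
            cgroup_taskset_for_each(task, cgrp, tset)
                    if (task->flags & PF_THREAD_BOUND)
                            return -EINVAL;

            /* step 2: prepare for attach (formerly ->pre_attach()) */
            cpumask_copy(my_cpus_scratch, cpu_possible_mask);
            return 0;
    }

    static void my_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
                          struct cgroup_taskset *tset)
    {
            struct task_struct *task;

            /* step 3: commit using the stashed state; cannot fail */
            cgroup_taskset_for_each(task, cgrp, tset)
                    WARN_ON_ONCE(set_cpus_allowed_ptr(task, my_cpus_scratch));
    }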

* [PATCH 10/10] cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
  2011-11-01 23:46 [PATCHSET] cgroup: stable threadgroup during attach & subsys methods consolidation Tejun Heo
                   ` (8 preceding siblings ...)
  2011-11-01 23:46 ` [PATCH 09/10] cgroup, cpuset: don't use ss->pre_attach() Tejun Heo
@ 2011-11-01 23:46 ` Tejun Heo
  2011-11-04  9:10   ` KAMEZAWA Hiroyuki
  2011-11-15  0:54   ` Frederic Weisbecker
  2011-11-21 22:07 ` [PATCHSET] cgroup: stable threadgroup during attach & subsys methods consolidation Tejun Heo
  2011-11-24 22:51 ` Tejun Heo
  11 siblings, 2 replies; 56+ messages in thread
From: Tejun Heo @ 2011-11-01 23:46 UTC (permalink / raw)
  To: paul, rjw, lizf
  Cc: linux-pm, linux-kernel, containers, fweisbec, matthltc, akpm,
	oleg, kamezawa.hiroyu, Tejun Heo

These three methods are no longer used.  Kill them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
---
 Documentation/cgroups/cgroups.txt |   20 --------------
 include/linux/cgroup.h            |    3 --
 kernel/cgroup.c                   |   53 +++---------------------------------
 3 files changed, 5 insertions(+), 71 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index bf5d6c9..eb1b609 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -615,13 +615,6 @@ fork. If this method returns 0 (success) then this should remain valid
 while the caller holds cgroup_mutex and it is ensured that either
 attach() or cancel_attach() will be called in future.
 
-int can_attach_task(struct cgroup *cgrp, struct task_struct *tsk);
-(cgroup_mutex held by caller)
-
-As can_attach, but for operations that must be run once per task to be
-attached (possibly many when using cgroup_attach_proc). Called after
-can_attach.
-
 void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 		   struct cgroup_taskset *tset)
 (cgroup_mutex held by caller)
@@ -632,12 +625,6 @@ function, so that the subsystem can implement a rollback. If not, not necessary.
 This will be called only for subsystems whose can_attach() operation has
 succeeded. The parameters are identical to can_attach().
 
-void pre_attach(struct cgroup *cgrp);
-(cgroup_mutex held by caller)
-
-For any non-per-thread attachment work that needs to happen before
-attach_task. Needed by cpuset.
-
 void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	    struct cgroup_taskset *tset)
 (cgroup_mutex held by caller)
@@ -646,13 +633,6 @@ Called after the task has been attached to the cgroup, to allow any
 post-attachment activity that requires memory allocations or blocking.
 The parameters are identical to can_attach().
 
-void attach_task(struct cgroup *cgrp, struct task_struct *tsk);
-(cgroup_mutex held by caller)
-
-As attach, but for operations that must be run once per task to be attached,
-like can_attach_task. Called before attach. Currently does not support any
-subsystem that might need the old_cgrp for every thread in the group.
-
 void fork(struct cgroup_subsys *ss, struct task_struct *task)
 
 Called when a task is forked into a cgroup.
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 2470c8e..5659d37 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -490,11 +490,8 @@ struct cgroup_subsys {
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			  struct cgroup_taskset *tset);
-	int (*can_attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
 	void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			      struct cgroup_taskset *tset);
-	void (*pre_attach)(struct cgroup *cgrp);
-	void (*attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 		       struct cgroup_taskset *tset);
 	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0bb7927..d2b3697 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1944,13 +1944,6 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 				goto out;
 			}
 		}
-		if (ss->can_attach_task) {
-			retval = ss->can_attach_task(cgrp, tsk);
-			if (retval) {
-				failed_ss = ss;
-				goto out;
-			}
-		}
 	}
 
 	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, false);
@@ -1958,10 +1951,6 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		goto out;
 
 	for_each_subsys(root, ss) {
-		if (ss->pre_attach)
-			ss->pre_attach(cgrp);
-		if (ss->attach_task)
-			ss->attach_task(cgrp, tsk);
 		if (ss->attach)
 			ss->attach(ss, cgrp, &tset);
 	}
@@ -2093,7 +2082,6 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 {
 	int retval, i, group_size, nr_migrating_tasks;
 	struct cgroup_subsys *ss, *failed_ss = NULL;
-	bool cancel_failed_ss = false;
 	/* guaranteed to be initialized later, but the compiler needs this */
 	struct css_set *oldcg;
 	struct cgroupfs_root *root = cgrp->root;
@@ -2188,21 +2176,6 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 				goto out_cancel_attach;
 			}
 		}
-		/* a callback to be run on every thread in the threadgroup. */
-		if (ss->can_attach_task) {
-			/* run on each task in the threadgroup. */
-			for (i = 0; i < group_size; i++) {
-				tc = flex_array_get(group, i);
-				if (tc->cgrp == cgrp)
-					continue;
-				retval = ss->can_attach_task(cgrp, tc->task);
-				if (retval) {
-					failed_ss = ss;
-					cancel_failed_ss = true;
-					goto out_cancel_attach;
-				}
-			}
-		}
 	}
 
 	/*
@@ -2234,15 +2207,10 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	}
 
 	/*
-	 * step 3: now that we're guaranteed success wrt the css_sets, proceed
-	 * to move all tasks to the new cgroup, calling ss->attach_task for each
-	 * one along the way. there are no failure cases after here, so this is
-	 * the commit point.
+	 * step 3: now that we're guaranteed success wrt the css_sets,
+	 * proceed to move all tasks to the new cgroup.  There are no
+	 * failure cases after here, so this is the commit point.
 	 */
-	for_each_subsys(root, ss) {
-		if (ss->pre_attach)
-			ss->pre_attach(cgrp);
-	}
 	for (i = 0; i < group_size; i++) {
 		tc = flex_array_get(group, i);
 		/* leave current thread as it is if it's already there */
@@ -2251,19 +2219,11 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 
 		retval = cgroup_task_migrate(cgrp, tc->cgrp, tc->task, true);
 		BUG_ON(retval != 0);
-
-		/* attach each task to each subsystem */
-		for_each_subsys(root, ss) {
-			if (ss->attach_task)
-				ss->attach_task(cgrp, tc->task);
-		}
 	}
 	/* nothing is sensitive to fork() after this point. */
 
 	/*
-	 * step 4: do expensive, non-thread-specific subsystem callbacks.
-	 * TODO: if ever a subsystem needs to know the oldcgrp for each task
-	 * being moved, this call will need to be reworked to communicate that.
+	 * step 4: do subsystem attach callbacks.
 	 */
 	for_each_subsys(root, ss) {
 		if (ss->attach)
@@ -2287,11 +2247,8 @@ out_cancel_attach:
 	/* same deal as in cgroup_attach_task */
 	if (retval) {
 		for_each_subsys(root, ss) {
-			if (ss == failed_ss) {
-				if (cancel_failed_ss && ss->cancel_attach)
-					ss->cancel_attach(ss, cgrp, &tset);
+			if (ss == failed_ss)
 				break;
-			}
 			if (ss->cancel_attach)
 				ss->cancel_attach(ss, cgrp, &tset);
 		}
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread
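
With the three methods above gone, a subsystem's attach surface reduces
to the three tset-based callbacks.  As a rough sketch (hypothetical
my_* names, not from the patch), a minimal registration now looks like:

    struct cgroup_subsys my_subsys = {
            .name           = "my",
            .subsys_id      = my_subsys_id,
            .create         = my_create,
            .destroy        = my_destroy,
            .can_attach     = my_can_attach,        /* validate; may fail, may sleep */
            .cancel_attach  = my_cancel_attach,     /* roll back a failed attach */
            .attach         = my_attach,            /* commit; must not fail */
    };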

* Re: [PATCH 01/10] cgroup: add cgroup_root_mutex
  2011-11-01 23:46 ` [PATCH 01/10] cgroup: add cgroup_root_mutex Tejun Heo
@ 2011-11-04  8:38   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 56+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-11-04  8:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, fweisbec,
	matthltc, akpm, oleg, kamezawa.hiroyu

On Tue,  1 Nov 2011 16:46:24 -0700
Tejun Heo <tj@kernel.org> wrote:

> cgroup wants to make threadgroup stable while modifying cgroup
> hierarchies, which will introduce a locking dependency on
> cred_guard_mutex from cgroup_mutex.  This unfortunately completes a
> circular dependency.
> 
>  A. cgroup_mutex -> cred_guard_mutex -> s_type->i_mutex_key -> namespace_sem
>  B. namespace_sem -> cgroup_mutex
> 
> B is from cgroup_show_options() and this patch breaks it by
> introducing another mutex cgroup_root_mutex which nests inside
> cgroup_mutex and protects cgroupfs_root.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Oleg Nesterov <oleg@redhat.com>

Thanks,
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 56+ messages in thread
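
The dependency chains in the quoted changelog are easier to see laid out
as lock order.  A simplified sketch (editorial illustration; in the
patch, cgroup_show_options() switches to the new sub-mutex so chain B no
longer reaches cgroup_mutex):

    /*
     * A: cgroup_mutex -> cred_guard_mutex -> i_mutex -> namespace_sem
     * B: namespace_sem -> cgroup_mutex        (via cgroup_show_options)
     *
     * A + B form a cycle.  cgroup_root_mutex nests inside cgroup_mutex:
     */
    mutex_lock(&cgroup_mutex);
    mutex_lock(&cgroup_root_mutex);     /* protects cgroupfs_root fields */
    /* modify the root here */
    mutex_unlock(&cgroup_root_mutex);
    mutex_unlock(&cgroup_mutex);

    /* show_options takes only the sub-mutex, breaking chain B */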

* Re: [PATCH 02/10] threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem
  2011-11-01 23:46 ` [PATCH 02/10] threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem Tejun Heo
@ 2011-11-04  8:40   ` KAMEZAWA Hiroyuki
  2011-11-04 15:16     ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-11-04  8:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, fweisbec,
	matthltc, akpm, oleg, kamezawa.hiroyu

On Tue,  1 Nov 2011 16:46:25 -0700
Tejun Heo <tj@kernel.org> wrote:

> Make the following renames to prepare for extension of threadgroup
> locking.
> 
> * s/signal->threadgroup_fork_lock/signal->group_rwsem/
> * s/threadgroup_fork_read_lock()/threadgroup_change_begin()/
> * s/threadgroup_fork_read_unlock()/threadgroup_change_done()/

I personally like begin <-> end to see the relationship.

> * s/threadgroup_fork_write_lock()/threadgroup_lock()/
> * s/threadgroup_fork_write_unlock()/threadgroup_unlock()/
> 
> This patch doesn't cause any behavior change.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Paul Menage <paul@paulmenage.org>
> Cc: Li Zefan <lizf@cn.fujitsu.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-11-01 23:46 ` [PATCH 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec Tejun Heo
@ 2011-11-04  8:45   ` KAMEZAWA Hiroyuki
  2011-11-13 16:44   ` Frederic Weisbecker
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 56+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-11-04  8:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, fweisbec,
	matthltc, akpm, oleg, kamezawa.hiroyu

On Tue,  1 Nov 2011 16:46:26 -0700
Tejun Heo <tj@kernel.org> wrote:

> threadgroup_lock() only protected against new additions to
> the threadgroup, which was inherently somewhat incomplete and
> problematic for its only user, cgroup.  Ongoing migration could race
> against exec and exit leading to interesting problems - the symmetry
> between various attach methods, task exiting during method execution,
> ->exit() racing against attach methods, migrating task switching basic
> properties during exec and so on.
> 
> This patch extends threadgroup_lock() such that it protects against
> all three threadgroup altering operations - fork, exit and exec.  For
> exit, threadgroup_change_begin/end() calls are added to exit path.
> For exec, threadgroup_[un]lock() are updated to also grab and release
> cred_guard_mutex.
> 
> With this change, threadgroup_lock() guarantees that the target
> threadgroup will remain stable - no new task will be added, no new
> PF_EXITING will be set and exec won't happen.
> 
> The next patch will update cgroup so that it can take full advantage
> of this change.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Paul Menage <paul@paulmenage.org>
> Cc: Li Zefan <lizf@cn.fujitsu.com>


Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 56+ messages in thread
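
For concreteness, the guarantee described in the quoted changelog reads
like this at a call site (editorial sketch; the actual cgroup call sites
arrive in the next patch):

    struct task_struct *t;

    threadgroup_lock(leader);
    /*
     * Stable from here to threadgroup_unlock(): no fork into the
     * group, no member setting PF_EXITING, no exec (so the leader
     * cannot change either).
     */
    t = leader;
    do {
            /* visit each member; the thread list cannot change */
    } while_each_thread(leader, t);
    threadgroup_unlock(leader);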

* Re: [PATCH 04/10] cgroup: always lock threadgroup during migration
  2011-11-01 23:46 ` [PATCH 04/10] cgroup: always lock threadgroup during migration Tejun Heo
@ 2011-11-04  8:54   ` KAMEZAWA Hiroyuki
  2011-11-04 15:21     ` Tejun Heo
  2011-11-14 18:46   ` Frederic Weisbecker
  1 sibling, 1 reply; 56+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-11-04  8:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, fweisbec,
	matthltc, akpm, oleg, kamezawa.hiroyu

On Tue,  1 Nov 2011 16:46:27 -0700
Tejun Heo <tj@kernel.org> wrote:

> Update cgroup to take advantage of the fact that threadgroup_lock()
> guarantees stable threadgroup.
> 
> * Lock threadgroup even if the target is a single task.  This
>   guarantees that the target tasks stay stable during migration
>   regardless of the target type.
> 
> * Remove PF_EXITING early exit optimization from attach_task_by_pid()
>   and check it in cgroup_task_migrate() instead.  The optimization was
>   for rather cold path to begin with and PF_EXITING state can be
>   trusted throughout migration by checking it after locking
>   threadgroup.
> 
> * Don't add PF_EXITING tasks to target task array in
>   cgroup_attach_proc().  This ensures that task migration is performed
>   only for live tasks.
> 
> * Remove -ESRCH failure path from cgroup_task_migrate().  With the
>   above changes, it's guaranteed to be called only for live tasks.
> 
> After the changes, only live tasks are migrated and they're guaranteed
> to stay alive until migration is complete.  This removes problems
> caused by exec and exit racing against cgroup migration including
> symmetry among cgroup attach methods and different cgroup methods
> racing each other.
> 
> v2: Oleg pointed out that one more PF_EXITING check can be removed
>     from cgroup_attach_proc().  Removed.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Paul Menage <paul@paulmenage.org>
> Cc: Li Zefan <lizf@cn.fujitsu.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
A few questions below.


> ---
>  kernel/cgroup.c |   51 +++++++++++++++++++++++----------------------------
>  1 files changed, 23 insertions(+), 28 deletions(-)
> 
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index f0e099f..83e10f9 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1762,7 +1762,7 @@ EXPORT_SYMBOL_GPL(cgroup_path);
>   *
>   * 'guarantee' is set if the caller promises that a new css_set for the task
>   * will already exist. If not set, this function might sleep, and can fail with
> - * -ENOMEM. Otherwise, it can only fail with -ESRCH.
> + * -ENOMEM. Must be called with cgroup_mutex and threadgroup locked.
>   */
>  static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
>  			       struct task_struct *tsk, bool guarantee)
> @@ -1800,13 +1800,9 @@ static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
>  	}
>  	put_css_set(oldcg);
>  
> -	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
> +	/* @tsk can't exit as its threadgroup is locked */
>  	task_lock(tsk);
> -	if (tsk->flags & PF_EXITING) {
> -		task_unlock(tsk);
> -		put_css_set(newcg);
> -		return -ESRCH;
> -	}
> +	WARN_ON_ONCE(tsk->flags & PF_EXITING);
>  	rcu_assign_pointer(tsk->cgroups, newcg);
>  	task_unlock(tsk);

Is this task_lock/unlock required?



>
> @@ -2116,11 +2120,6 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
>  			continue;
>  		/* get old css_set pointer */
>  		task_lock(tsk);
> -		if (tsk->flags & PF_EXITING) {
> -			/* ignore this task if it's going away */
> -			task_unlock(tsk);
> -			continue;
> -		}
>  		oldcg = tsk->cgroups;
>  		get_css_set(oldcg);
>  		task_unlock(tsk);

ditto.


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 08/10] cgroup: don't use subsys->can_attach_task() or ->attach_task()
  2011-11-01 23:46 ` [PATCH 08/10] cgroup: don't use subsys->can_attach_task() or ->attach_task() Tejun Heo
@ 2011-11-04  9:08   ` KAMEZAWA Hiroyuki
  2011-11-14 23:54   ` Frederic Weisbecker
  1 sibling, 0 replies; 56+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-11-04  9:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, fweisbec,
	matthltc, akpm, oleg, kamezawa.hiroyu, Balbir Singh,
	Daisuke Nishimura, James Morris, Ingo Molnar, Peter Zijlstra

On Tue,  1 Nov 2011 16:46:31 -0700
Tejun Heo <tj@kernel.org> wrote:

> Now that subsys->can_attach() and attach() take @tset instead of
> @task, they can handle per-task operations.  Convert
> ->can_attach_task() and ->attach_task() users to use ->can_attach()
> and attach() instead.  Most conversions are straightforward.
> Noteworthy changes are:
> 
> * In cgroup_freezer, remove unnecessary NULL assignments to unused
>   methods.  It's useless and very prone to get out of sync, which
>   already happened.
> 
> * In cpuset, PF_THREAD_BOUND test is checked for each task.  This
>   doesn't make any practical difference but is conceptually cleaner.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Paul Menage <paul@paulmenage.org>
> Cc: Li Zefan <lizf@cn.fujitsu.com>
> Cc: Balbir Singh <bsingharora@gmail.com>
> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: James Morris <jmorris@namei.org>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Peter Zijlstra <peterz@infradead.org>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 10/10] cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
  2011-11-01 23:46 ` [PATCH 10/10] cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task() Tejun Heo
@ 2011-11-04  9:10   ` KAMEZAWA Hiroyuki
  2011-11-15  0:54   ` Frederic Weisbecker
  1 sibling, 0 replies; 56+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-11-04  9:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, fweisbec,
	matthltc, akpm, oleg, kamezawa.hiroyu

On Tue,  1 Nov 2011 16:46:33 -0700
Tejun Heo <tj@kernel.org> wrote:

> These three methods are no longer used.  Kill them.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Acked-by: Paul Menage <paul@paulmenage.org>
> Cc: Li Zefan <lizf@cn.fujitsu.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 02/10] threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem
  2011-11-04  8:40   ` KAMEZAWA Hiroyuki
@ 2011-11-04 15:16     ` Tejun Heo
  0 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2011-11-04 15:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, fweisbec,
	matthltc, akpm, oleg

On Fri, Nov 04, 2011 at 05:40:32PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue,  1 Nov 2011 16:46:25 -0700
> Tejun Heo <tj@kernel.org> wrote:
> 
> > Make the following renames to prepare for extension of threadgroup
> > locking.
> > 
> > * s/signal->threadgroup_fork_lock/signal->group_rwsem/
> > * s/threadgroup_fork_read_lock()/threadgroup_change_begin()/
> > * s/threadgroup_fork_read_unlock()/threadgroup_change_done()/
> 
> I personally like begin <-> end to see the relationship.

Hmmm... yeah, I like that better too.  Renaming.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 04/10] cgroup: always lock threadgroup during migration
  2011-11-04  8:54   ` KAMEZAWA Hiroyuki
@ 2011-11-04 15:21     ` Tejun Heo
  0 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2011-11-04 15:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, fweisbec,
	matthltc, akpm, oleg

Hello,

On Fri, Nov 04, 2011 at 05:54:13PM +0900, KAMEZAWA Hiroyuki wrote:
> > -	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
> > +	/* @tsk can't exit as its threadgroup is locked */
> >  	task_lock(tsk);
> > -	if (tsk->flags & PF_EXITING) {
> > -		task_unlock(tsk);
> > -		put_css_set(newcg);
> > -		return -ESRCH;
> > -	}
> > +	WARN_ON_ONCE(tsk->flags & PF_EXITING);
> >  	rcu_assign_pointer(tsk->cgroups, newcg);
> >  	task_unlock(tsk);
> 
> Is this task_lock/unlock required?

For put, I don't think it's necessary.

> > @@ -2116,11 +2120,6 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
> >  			continue;
> >  		/* get old css_set pointer */
> >  		task_lock(tsk);
> > -		if (tsk->flags & PF_EXITING) {
> > -			/* ignore this task if it's going away */
> > -			task_unlock(tsk);
> > -			continue;
> > -		}
> >  		oldcg = tsk->cgroups;
> >  		get_css_set(oldcg);
> >  		task_unlock(tsk);

For get, I think it is; otherwise, nothing prevents someone else from
changing tsk->cgroups between load and get, and destroying it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread
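
The distinction drawn above, as a sketch: a bare load-then-get races
with a concurrent cgroup change, while task_lock() pins tsk->cgroups
across both steps (mirroring the hunk under discussion):

    /* racy: tsk->cgroups may be swapped and freed between the lines */
    oldcg = tsk->cgroups;
    get_css_set(oldcg);

    /* safe: the pointer cannot change while task_lock() is held */
    task_lock(tsk);
    oldcg = tsk->cgroups;
    get_css_set(oldcg);
    task_unlock(tsk);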

* Re: [PATCH 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-11-01 23:46 ` [PATCH 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec Tejun Heo
  2011-11-04  8:45   ` KAMEZAWA Hiroyuki
@ 2011-11-13 16:44   ` Frederic Weisbecker
  2011-11-14 13:54     ` Frederic Weisbecker
  2011-11-21 21:58     ` Tejun Heo
  2011-11-13 18:20   ` Frederic Weisbecker
  2011-11-24 22:50   ` [PATCH UPDATED " Tejun Heo
  3 siblings, 2 replies; 56+ messages in thread
From: Frederic Weisbecker @ 2011-11-13 16:44 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, matthltc,
	akpm, oleg, kamezawa.hiroyu

On Tue, Nov 01, 2011 at 04:46:26PM -0700, Tejun Heo wrote:
> threadgroup_lock() only protected against new additions to
> the threadgroup, which was inherently somewhat incomplete and
> problematic for its only user, cgroup.  Ongoing migration could race
> against exec and exit leading to interesting problems - the symmetry
> between various attach methods, task exiting during method execution,
> ->exit() racing against attach methods, migrating task switching basic
> properties during exec and so on.
> 
> This patch extends threadgroup_lock() such that it protects against
> all three threadgroup altering operations - fork, exit and exec.  For
> exit, threadgroup_change_begin/end() calls are added to exit path.
> For exec, threadgroup_[un]lock() are updated to also grab and release
> cred_guard_mutex.
> 
> With this change, threadgroup_lock() guarantees that the target
> threadgroup will remain stable - no new task will be added, no new
> PF_EXITING will be set and exec won't happen.
> 
> The next patch will update cgroup so that it can take full advantage
> of this change.

I don't want to nitpick really, but IMHO the races involved in exit and exec
are too different, specific and complicated on their own to be solved in a
single patch. This should be split in two.

The specific problems and their fixes need to be described in more detail
in the changelog because the issues are very tricky.

The exec case:

IIUC, the race in exec is about the group leader that can be changed
to become the exec'ing thread, making while_each_thread() unsafe.
We also have other things happening there like all the other threads
in the group that get killed, but that should be handled by the threadgroup_change_begin()
you put in the exit path.
The old leader is also killed but release_task() -> __unhash_process() is called
for it manually from the exec path. However this thread too should be covered by your
synchronisation in exit().

So after your protection in the exit path, the only thing to protect against in exec
is the group leader, which can change concurrently. But I may be missing something in the picture.
Also note this is currently protected by the tasklist readlock. Cred guard mutex is
certainly better, I just don't remember if you remove the tasklist lock in a
further patch.

The exit case:

There the issue is about racing against cgroup_exit() where the task can be
assigned to the root cgroup. Is this really a problem in fact? I don't know
if subsystems care about that. Perhaps some of them are buggy in that
regard. At least the task counter handles that and it needs a
->cancel_attach_task() for this purpose in the case the migration fails due to exit.
Not sure about others. A real synchronization against exit is less error prone for sure.
In the end that's probably a win.

^ permalink raw reply	[flat|nested] 56+ messages in thread
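
The exec hazard described above in miniature: de_thread() can promote
the exec'ing thread to group leader, and while_each_thread() terminates
by comparing against the leader it started from, so a concurrent leader
change can break the walk (editorial sketch):

    struct task_struct *t = leader;

    rcu_read_lock();
    do {
            /*
             * If exec replaces 'leader' here, the termination test
             * below chases a stale pointer and the walk is unsafe.
             */
    } while_each_thread(leader, t);
    rcu_read_unlock();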

* Re: [PATCH 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-11-01 23:46 ` [PATCH 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec Tejun Heo
  2011-11-04  8:45   ` KAMEZAWA Hiroyuki
  2011-11-13 16:44   ` Frederic Weisbecker
@ 2011-11-13 18:20   ` Frederic Weisbecker
  2011-11-24 22:50   ` [PATCH UPDATED " Tejun Heo
  3 siblings, 0 replies; 56+ messages in thread
From: Frederic Weisbecker @ 2011-11-13 18:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, matthltc,
	akpm, oleg, kamezawa.hiroyu

On Tue, Nov 01, 2011 at 04:46:26PM -0700, Tejun Heo wrote:
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index aa47d0f..6fb546e 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2377,13 +2377,37 @@ static inline void threadgroup_change_done(struct task_struct *tsk)
>  {
>  	up_read(&tsk->signal->group_rwsem);
>  }
> +
> +/**
> + * threadgroup_lock - lock threadgroup
> + * @tsk: member task of the threadgroup to lock
> + *
> + * Lock the threadgroup @tsk belongs to.  No new task is allowed to enter
> + * and member tasks aren't allowed to exit (as indicated by PF_EXITING) or
> + * perform exec.  This is useful for cases where the threadgroup needs to
> + * stay stable across blockable operations.
> + *
> + * fork and exit explicitly call threadgroup_change_{begin|end}() for
> + * synchronization.  exec is already synchronized with cred_guard_mutex
> + * which we grab here.
> + */
>  static inline void threadgroup_lock(struct task_struct *tsk)
>  {
> +	/* exec uses exit for de-threading, grab cred_guard_mutex first */
> +	mutex_lock(&tsk->signal->cred_guard_mutex);

May be worth explaining what in exec is synchronized through this, i.e. the leader.

>  	down_write(&tsk->signal->group_rwsem);
>  }
> +
> +/**
> + * threadgroup_unlock - unlock threadgroup
> + * @tsk: member task of the threadgroup to unlock
> + *
> + * Reverse threadgroup_lock().
> + */
>  static inline void threadgroup_unlock(struct task_struct *tsk)
>  {
>  	up_write(&tsk->signal->group_rwsem);
> +	mutex_unlock(&tsk->signal->cred_guard_mutex);
>  }
>  #else
>  static inline void threadgroup_change_begin(struct task_struct *tsk) {}
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 2913b35..3565f00 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -938,6 +938,12 @@ NORET_TYPE void do_exit(long code)
>  		schedule();
>  	}
>  
> +	/*
> +	 * @tsk's threadgroup is going through changes - lock out users
> +	 * which expect stable threadgroup.
> +	 */
> +	threadgroup_change_begin(tsk);

Why so early? You actually want to synchronize against cgroup_exit(). Or
am I missing something?

Yeah there may be a problem with reading PF_EXITING inside the lock if it
is updated outside the lock; maybe we need a pair of memory barriers. But
other than that, this should be enough to lock around cgroup_exit().

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-11-13 16:44   ` Frederic Weisbecker
@ 2011-11-14 13:54     ` Frederic Weisbecker
  2011-11-21 22:03       ` Tejun Heo
  2011-11-21 21:58     ` Tejun Heo
  1 sibling, 1 reply; 56+ messages in thread
From: Frederic Weisbecker @ 2011-11-14 13:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, matthltc,
	akpm, oleg, kamezawa.hiroyu

On Sun, Nov 13, 2011 at 05:44:32PM +0100, Frederic Weisbecker wrote:
> On Tue, Nov 01, 2011 at 04:46:26PM -0700, Tejun Heo wrote:
> > threadgroup_lock() only protected against new additions to
> > the threadgroup, which was inherently somewhat incomplete and
> > problematic for its only user, cgroup.  Ongoing migration could race
> > against exec and exit leading to interesting problems - the symmetry
> > between various attach methods, task exiting during method execution,
> > ->exit() racing against attach methods, migrating task switching basic
> > properties during exec and so on.
> > 
> > This patch extends threadgroup_lock() such that it protects against
> > all three threadgroup altering operations - fork, exit and exec.  For
> > exit, threadgroup_change_begin/end() calls are added to exit path.
> > For exec, threadgroup_[un]lock() are updated to also grab and release
> > cred_guard_mutex.
> > 
> > With this change, threadgroup_lock() guarantees that the target
> > threadgroup will remain stable - no new task will be added, no new
> > PF_EXITING will be set and exec won't happen.
> > 
> > The next patch will update cgroup so that it can take full advantage
> > of this change.
> 
> I don't want to nitpick really, but IMHO the races involved in exit and exec
> are too different, specific and complicated on their own to be solved in a
> single patch. This should be split in two.
> 
> The specific problems and their fixes need to be described in more detail
> in the changelog because the issues are very tricky.
> 
> The exec case:
> 
> IIUC, the race in exec is about the group leader that can be changed
> to become the exec'ing thread, making while_each_thread() unsafe.
> We also have other things happening there like all the other threads
> in the group that get killed, but that should be handled by the threadgroup_change_begin()
> you put in the exit path.
> The old leader is also killed but release_task() -> __unhash_process() is called
> for it manually from the exec path. However this thread too should be covered by your
> synchronisation in exit().
> 
> So after your protection in the exit path, the only thing to protect against in exec
> is the group leader, which can change concurrently. But I may be missing something in the picture.
> Also note this is currently protected by the tasklist readlock. Cred guard mutex is
> certainly better, I just don't remember if you remove the tasklist lock in a
> further patch.

Ah, recalling what Ben Blum said, we also need the leader to stay stable because it
is expected to be passed in ->can_attach(), ->attach(), ->cancel_attach(), ...
Although that's going to change after your patches that pass a flex array.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 04/10] cgroup: always lock threadgroup during migration
  2011-11-01 23:46 ` [PATCH 04/10] cgroup: always lock threadgroup during migration Tejun Heo
  2011-11-04  8:54   ` KAMEZAWA Hiroyuki
@ 2011-11-14 18:46   ` Frederic Weisbecker
  2011-11-14 18:52     ` Frederic Weisbecker
  1 sibling, 1 reply; 56+ messages in thread
From: Frederic Weisbecker @ 2011-11-14 18:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, matthltc,
	akpm, oleg, kamezawa.hiroyu

On Tue, Nov 01, 2011 at 04:46:27PM -0700, Tejun Heo wrote:
> Update cgroup to take advantage of the fact that threadgroup_lock()
> guarantees stable threadgroup.
> 
> * Lock threadgroup even if the target is a single task.  This
>   guarantees that the target tasks stay stable during migration
>   regardless of the target type.
> 
> * Remove PF_EXITING early exit optimization from attach_task_by_pid()
>   and check it in cgroup_task_migrate() instead.  The optimization was
>   for rather cold path to begin with and PF_EXITING state can be
>   trusted throughout migration by checking it after locking
>   threadgroup.
> 
> * Don't add PF_EXITING tasks to target task array in
>   cgroup_attach_proc().  This ensures that task migration is performed
>   only for live tasks.
> 
> * Remove -ESRCH failure path from cgroup_task_migrate().  With the
>   above changes, it's guaranteed to be called only for live tasks.
> 
> After the changes, only live tasks are migrated and they're guaranteed
> to stay alive until migration is complete.  This removes problems
> caused by exec and exit racing against cgroup migration including
> symmetry among cgroup attach methods and different cgroup methods
> racing each other.
> 
> v2: Oleg pointed out that one more PF_EXITING check can be removed
>     from cgroup_attach_proc().  Removed.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Paul Menage <paul@paulmenage.org>
> Cc: Li Zefan <lizf@cn.fujitsu.com>

Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>

Just a little thing:

> ---
>  kernel/cgroup.c |   51 +++++++++++++++++++++++----------------------------
>  1 files changed, 23 insertions(+), 28 deletions(-)
> 
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index f0e099f..83e10f9 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1762,7 +1762,7 @@ EXPORT_SYMBOL_GPL(cgroup_path);
>   *
>   * 'guarantee' is set if the caller promises that a new css_set for the task
>   * will already exist. If not set, this function might sleep, and can fail with
> - * -ENOMEM. Otherwise, it can only fail with -ESRCH.
> + * -ENOMEM. Must be called with cgroup_mutex and threadgroup locked.
>   */
>  static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
>  			       struct task_struct *tsk, bool guarantee)
> @@ -1800,13 +1800,9 @@ static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
>  	}
>  	put_css_set(oldcg);
>  
> -	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
> +	/* @tsk can't exit as its threadgroup is locked */
>  	task_lock(tsk);
> -	if (tsk->flags & PF_EXITING) {
> -		task_unlock(tsk);
> -		put_css_set(newcg);
> -		return -ESRCH;
> -	}
> +	WARN_ON_ONCE(tsk->flags & PF_EXITING);

I have the feeling the task_lock is now useless given that we are synchronized
against cgroup_exit() with the threadgroup_lock.

It's probably also useless in cgroup_exit(). But that's something for a further
patch...

Thanks.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 04/10] cgroup: always lock threadgroup during migration
  2011-11-14 18:46   ` Frederic Weisbecker
@ 2011-11-14 18:52     ` Frederic Weisbecker
  2011-11-21 22:05       ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Frederic Weisbecker @ 2011-11-14 18:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, matthltc,
	akpm, oleg, kamezawa.hiroyu

On Mon, Nov 14, 2011 at 07:46:34PM +0100, Frederic Weisbecker wrote:
> I have the feeling the task_lock is now useless given that we are synchronized
> against cgroup_exit() with the threadgroup_lock.
> 
> It's probably also useless in cgroup_exit(). But that's something for a further
> patch...
> 
> Thanks.

Oops, I missed what Kamezawa said and your answer, sorry.

* Re: [PATCH 05/10] cgroup: subsys->attach_task() should be called after migration
  2011-11-01 23:46 ` [PATCH 05/10] cgroup: subsys->attach_task() should be called after migration Tejun Heo
@ 2011-11-14 20:06   ` Frederic Weisbecker
  2011-11-21 22:04     ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Frederic Weisbecker @ 2011-11-14 20:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, matthltc,
	akpm, oleg, kamezawa.hiroyu

On Tue, Nov 01, 2011 at 04:46:28PM -0700, Tejun Heo wrote:
> cgroup_attach_task() calls subsys->attach_task() after
> cgroup_task_migrate(); however, cgroup_attach_proc() calls it before
> migration.  This actually affects some of the users.  Update
> cgroup_attach_proc() such that ->attach_task() is called after
> migration.

That's already upstream.

* Re: [PATCH 06/10] cgroup: improve old cgroup handling in cgroup_attach_proc()
  2011-11-01 23:46 ` [PATCH 06/10] cgroup: improve old cgroup handling in cgroup_attach_proc() Tejun Heo
@ 2011-11-14 20:37   ` Frederic Weisbecker
  0 siblings, 0 replies; 56+ messages in thread
From: Frederic Weisbecker @ 2011-11-14 20:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, matthltc,
	akpm, oleg, kamezawa.hiroyu

On Tue, Nov 01, 2011 at 04:46:29PM -0700, Tejun Heo wrote:
> cgroup_attach_proc() behaves differently from cgroup_attach_task() in
> the following aspects.
> 
> * All hooks are invoked even if no task is actually being moved.
> 
> * ->can_attach_task() is called for all tasks in the group whether the
>   new cgrp is different from the current cgrp or not; however,
>   ->attach_task() is skipped if new equals old.  This makes the calls
>   asymmetric.
> 
> This patch improves old cgroup handling in cgroup_attach_proc() by
> looking up the current cgroup at the head, recording it in the flex
> array along with the task itself, and using it to remove the above two
> differences.  This will also ease further changes.
> 
> -v2: nr_todo renamed to nr_migrating_tasks as per Paul Menage's
>      suggestion.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Acked-by: Paul Menage <paul@paulmenage.org>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Li Zefan <lizf@cn.fujitsu.com>

Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>

* Re: [PATCH 07/10] cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()
  2011-11-01 23:46 ` [PATCH 07/10] cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach() Tejun Heo
@ 2011-11-14 21:16   ` Frederic Weisbecker
  0 siblings, 0 replies; 56+ messages in thread
From: Frederic Weisbecker @ 2011-11-14 21:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, matthltc,
	akpm, oleg, kamezawa.hiroyu, Balbir Singh, Daisuke Nishimura,
	James Morris

On Tue, Nov 01, 2011 at 04:46:30PM -0700, Tejun Heo wrote:
> Currently, there's no way to pass multiple tasks to cgroup_subsys
> methods necessitating the need for separate per-process and per-task
> methods.  This patch introduces cgroup_taskset which can be used to
> pass multiple tasks and their associated cgroups to cgroup_subsys
> methods.
> 
> Three methods - can_attach(), cancel_attach() and attach() - are
> converted to use cgroup_taskset.  This unifies passed parameters so
> that all methods have access to all information.  Conversions in this
> patchset are identical and don't introduce any behavior change.
> 
> -v2: documentation updated as per Paul Menage's suggestion.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Acked-by: Paul Menage <paul@paulmenage.org>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Li Zefan <lizf@cn.fujitsu.com>
> Cc: Balbir Singh <bsingharora@gmail.com>
> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: James Morris <jmorris@namei.org>

Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>

* Re: [PATCH 08/10] cgroup: don't use subsys->can_attach_task() or ->attach_task()
  2011-11-01 23:46 ` [PATCH 08/10] cgroup: don't use subsys->can_attach_task() or ->attach_task() Tejun Heo
  2011-11-04  9:08   ` KAMEZAWA Hiroyuki
@ 2011-11-14 23:54   ` Frederic Weisbecker
  1 sibling, 0 replies; 56+ messages in thread
From: Frederic Weisbecker @ 2011-11-14 23:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, matthltc,
	akpm, oleg, kamezawa.hiroyu, Balbir Singh, Daisuke Nishimura,
	James Morris, Ingo Molnar, Peter Zijlstra

On Tue, Nov 01, 2011 at 04:46:31PM -0700, Tejun Heo wrote:
> Now that subsys->can_attach() and attach() take @tset instead of
> @task, they can handle per-task operations.  Convert
> ->can_attach_task() and ->attach_task() users to use ->can_attach()
> and attach() instead.  Most conversions are straightforward.
> Noteworthy changes are,
> 
> * In cgroup_freezer, remove unnecessary NULL assignments to unused
>   methods.  It's useless and very prone to get out of sync, which
>   already happened.
> 
> * In cpuset, PF_THREAD_BOUND is checked for each task.  This
>   doesn't make any practical difference but is conceptually cleaner.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Paul Menage <paul@paulmenage.org>
> Cc: Li Zefan <lizf@cn.fujitsu.com>
> Cc: Balbir Singh <bsingharora@gmail.com>
> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: James Morris <jmorris@namei.org>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Peter Zijlstra <peterz@infradead.org>

Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>

* Re: [PATCH 09/10] cgroup, cpuset: don't use ss->pre_attach()
  2011-11-01 23:46 ` [PATCH 09/10] cgroup, cpuset: don't use ss->pre_attach() Tejun Heo
@ 2011-11-15  0:51   ` Frederic Weisbecker
  0 siblings, 0 replies; 56+ messages in thread
From: Frederic Weisbecker @ 2011-11-15  0:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, matthltc,
	akpm, oleg, kamezawa.hiroyu

On Tue, Nov 01, 2011 at 04:46:32PM -0700, Tejun Heo wrote:
> ->pre_attach() is supposed to be called before migration, which is
> observed during process migration, but task migration does it the other
> way around.  The only ->pre_attach() user is cpuset, which can do the
> same operations in ->can_attach().  Collapse cpuset_pre_attach() into
> cpuset_can_attach().
> 
> -v2: Patch contamination from later patch removed.  Spotted by Paul
>      Menage.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Acked-by: Paul Menage <paul@paulmenage.org>
> Cc: Li Zefan <lizf@cn.fujitsu.com>

Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>

* Re: [PATCH 10/10] cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
  2011-11-01 23:46 ` [PATCH 10/10] cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task() Tejun Heo
  2011-11-04  9:10   ` KAMEZAWA Hiroyuki
@ 2011-11-15  0:54   ` Frederic Weisbecker
  1 sibling, 0 replies; 56+ messages in thread
From: Frederic Weisbecker @ 2011-11-15  0:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, matthltc,
	akpm, oleg, kamezawa.hiroyu

On Tue, Nov 01, 2011 at 04:46:33PM -0700, Tejun Heo wrote:
> These three methods are no longer used.  Kill them.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Acked-by: Paul Menage <paul@paulmenage.org>
> Cc: Li Zefan <lizf@cn.fujitsu.com>

Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>

Thanks!

* Re: [PATCH 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-11-13 16:44   ` Frederic Weisbecker
  2011-11-14 13:54     ` Frederic Weisbecker
@ 2011-11-21 21:58     ` Tejun Heo
  2011-11-23 14:02       ` Frederic Weisbecker
  1 sibling, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2011-11-21 21:58 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, matthltc,
	akpm, oleg, kamezawa.hiroyu

Hello, Frederic.

On Sun, Nov 13, 2011 at 05:44:32PM +0100, Frederic Weisbecker wrote:
> > The next patch will update cgroup so that it can take full advantage
> > of this change.
> 
> I don't want to nitpick really,

Heh, please go ahead and nitpick.  That's one of the main purposes of
LKML after all. :)

> but IMHO the races involved in exit and exec
> are too different, specific and complicated on their own to be solved in a
> single patch. This should be split in two.
> 
> The specific problems and their fixes need to be described in more detail
> in the changelog because the issues are very tricky.
> 
> The exec case:
> 
> IIUC, the race in exec is about the group leader that can be changed
> to become the exec'ing thread, making while_each_thread() unsafe.

Not only that, pid changes, sighand struct may get swapped, and other
weird things which aren't usually expected for a live task happen.
It's basically semi-killed and then resurrected.

> We also have other things happening there like all the other threads
> in the group that get killed, but that should be handled by the threadgroup_change_begin()
> you put in the exit path.
> The old leader is also killed but release_task() -> __unhash_process() is called
> for it manually from the exec path. However this thread too should be covered by your
> synchronisation in exit().
> 
> So after your protection in the exit path, the only thing to protect against in exec
> is the group_leader that can change concurrently. But I may be missing something in the picture.

Hmm... an exec'ing task goes through transitions which usually aren't
allowed to happen.  It assumes exclusive access to group-shared data
structures and may ditch the current ones and create new ones.

Sure, it's possible to make every cgroup method correct w.r.t. all
those via explicit locking or by being more careful but I don't see
much point in such an exercise.  If the code is sitting in some hot path
and the added exclusion is gonna add a significant amount of contention,
sure, we should trade off easiness for better performance /
scalability but this is for cgroup modifications, a fundamentally cold
path, and the added locking is per-task or per-threadgroup.  I don't
see any reason not to be easy here.

Please read on.

> Also note this is currently protected by the tasklist readlock. Cred
> guard mutex is certainly better, I just don't remember if you remove
> the tasklist lock in a further patch.

I don't remove tasklist_lock.

> The exit case:
> 
> There the issue is about racing against cgroup_exit() where the task can be
> assigned to the root cgroup. Is this really a problem in fact? I don't know
> if subsystems care about that. Perhaps some of them are buggy in that
> regard. At least the task counter handles that and it needs a
> ->cancel_attach_task() for this purpose in the case the migration fails due to exit.
> Not sure about others. A real synchronization against exit is less error prone for sure.
> In the end that's probably a win.

This is the same story.  Yes, we can narrow the locking and try to
make sure everyone handles partially destroyed tasks properly in all
methods, but for what?  If we can give stable threadgroups to all
cgroup methods without introducing performance or scalability
bottleneck, that's the right thing to do.  Please also note that bugs
stemming from failure to handle those corner cases properly will be
subtle, difficult to reproduce and track down.

In general, for any subsystem with a pluggable interface, I think it's
best to put as much complexity as possible into the core layer to make
things easier for its customers.  It becomes excruciatingly painful if
the API invites subtle corner case bugs and the subsystem grows
several dozen clients down the road.

So, to me, what seems more important is how to make it easier for each
cgroup client rather than what's minimally necessary right now.

Heh, did I make any sense? :)

Thanks.

-- 
tejun

* Re: [PATCH 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-11-14 13:54     ` Frederic Weisbecker
@ 2011-11-21 22:03       ` Tejun Heo
  2011-11-23 14:34         ` Frederic Weisbecker
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2011-11-21 22:03 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, matthltc,
	akpm, oleg, kamezawa.hiroyu

On Mon, Nov 14, 2011 at 02:54:08PM +0100, Frederic Weisbecker wrote:
> > Also note this is currently protected by the tasklist readlock. Cred guard mutex is
> > certainly better, I just don't remember if you remove the tasklist lock in a
> > further patch.
> 
> Ah recalling what Ben Blum said, we also need the leader to stay stable because it
> is expected to be passed in ->can_attach(), ->attach(), ->cancel_attach(), ...
> Although that's going to change after your patches that pass a flex array.

Not really without locking out exec.  The thing is whoever is exec'ing
will be the leader and we can't guarantee that the first in the flex
array is always the leader.  One method may see it as the leader, the
next one might not.
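
As an illustration, consider a callback keying off leadership (a sketch
using the taskset accessors this series introduces; do_leader_work() is
a hypothetical helper):

        struct task_struct *task;

        for (task = cgroup_taskset_first(tset); task;
             task = cgroup_taskset_next(tset)) {
                /*
                 * exec may transfer leadership between two method
                 * invocations, so without locking exec out this
                 * test can give different answers for the same task.
                 */
                if (thread_group_leader(task))
                        do_leader_work(task);   /* hypothetical */
        }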

Thanks.

-- 
tejun

* Re: [PATCH 05/10] cgroup: subsys->attach_task() should be called after migration
  2011-11-14 20:06   ` Frederic Weisbecker
@ 2011-11-21 22:04     ` Tejun Heo
  0 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2011-11-21 22:04 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, matthltc,
	akpm, oleg, kamezawa.hiroyu

On Mon, Nov 14, 2011 at 09:06:08PM +0100, Frederic Weisbecker wrote:
> On Tue, Nov 01, 2011 at 04:46:28PM -0700, Tejun Heo wrote:
> > cgroup_attach_task() calls subsys->attach_task() after
> > cgroup_task_migrate(); however, cgroup_attach_proc() calls it before
> > migration.  This actually affects some of the users.  Update
> > cgroup_attach_proc() such that ->attach_task() is called after
> > migration.
> 
> That's already upstream.

Yeap, just dropped this one while rebasing.

Thanks.

-- 
tejun

* Re: [PATCH 04/10] cgroup: always lock threadgroup during migration
  2011-11-14 18:52     ` Frederic Weisbecker
@ 2011-11-21 22:05       ` Tejun Heo
  0 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2011-11-21 22:05 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, matthltc,
	akpm, oleg, kamezawa.hiroyu

On Mon, Nov 14, 2011 at 07:52:43PM +0100, Frederic Weisbecker wrote:
> On Mon, Nov 14, 2011 at 07:46:34PM +0100, Frederic Weisbecker wrote:
> > I have the feeling the task_lock is now useless given that we are synchronized
> > against cgroup_exit() with the threadgroup_lock.
> > 
> > It's probably also useless in cgroup_exit(). But that's something for a further
> > patch...
> > 
> > Thanks.
> 
> Oops, I missed what Kamezawa said and your answer, sorry.

Yeap, I'll see how removing it for put looks.

Thanks.

-- 
tejun

* Re: [PATCHSET] cgroup: stable threadgroup during attach & subsys methods consolidation
  2011-11-01 23:46 [PATCHSET] cgroup: stable threadgroup during attach & subsys methods consolidation Tejun Heo
                   ` (9 preceding siblings ...)
  2011-11-01 23:46 ` [PATCH 10/10] cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task() Tejun Heo
@ 2011-11-21 22:07 ` Tejun Heo
  2011-11-22  2:27   ` Li Zefan
  2011-11-24 22:51 ` Tejun Heo
  11 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2011-11-21 22:07 UTC (permalink / raw)
  To: paul, rjw, lizf
  Cc: linux-pm, linux-kernel, containers, fweisbec, matthltc, akpm,
	oleg, kamezawa.hiroyu

On Tue, Nov 01, 2011 at 04:46:23PM -0700, Tejun Heo wrote:
> This patchset is combination of the following two patchsets.
> 
>  [1] cgroup: extend threadgroup locking
>  [2] cgroup: introduce cgroup_taskset and consolidate subsys methods, take#2

Li, do you agree with the proposed patches?  If so, I'll wait for
Frederic's response, repost updated series and merge it into the
cgroup tree.

Thank you.

-- 
tejun

* Re: [PATCHSET] cgroup: stable threadgroup during attach & subsys methods consolidation
  2011-11-21 22:07 ` [PATCHSET] cgroup: stable threadgroup during attach & subsys methods consolidation Tejun Heo
@ 2011-11-22  2:27   ` Li Zefan
  2011-11-22 16:20     ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Li Zefan @ 2011-11-22  2:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, linux-pm, linux-kernel, containers, fweisbec,
	matthltc, akpm, oleg, kamezawa.hiroyu

Tejun Heo wrote:
> On Tue, Nov 01, 2011 at 04:46:23PM -0700, Tejun Heo wrote:
>> This patchset is combination of the following two patchsets.
>>
>>  [1] cgroup: extend threadgroup locking
>>  [2] cgroup: introduce cgroup_taskset and consolidate subsys methods, take#2
> 
> Li, do you agree with the proposed patches?  If so, I'll wait for
> Frederic's response, repost updated series and merge it into the
> cgroup tree.
> 

I remember I've acked both patchsets in your earlier post. :)

--
Li Zefan

* Re: [PATCHSET] cgroup: stable threadgroup during attach & subsys methods consolidation
  2011-11-22  2:27   ` Li Zefan
@ 2011-11-22 16:20     ` Tejun Heo
  0 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2011-11-22 16:20 UTC (permalink / raw)
  To: Li Zefan
  Cc: paul, rjw, linux-pm, linux-kernel, containers, fweisbec,
	matthltc, akpm, oleg, kamezawa.hiroyu

On Tue, Nov 22, 2011 at 10:27:40AM +0800, Li Zefan wrote:
> Tejun Heo wrote:
> > On Tue, Nov 01, 2011 at 04:46:23PM -0700, Tejun Heo wrote:
> >> This patchset is combination of the following two patchsets.
> >>
> >>  [1] cgroup: extend threadgroup locking
> >>  [2] cgroup: introduce cgroup_taskset and consolidate subsys methods, take#2
> > 
> > Li, do you agree with the proposed patches?  If so, I'll wait for
> > Frederic's response, repost updated series and merge it into the
> > cgroup tree.
> > 
> 
> I remember I've acked both patchsets in your earlier post. :)

Heh, right, it has been too long and I forgot.  I'll add the acks.

Thank you.

-- 
tejun

* Re: [PATCH 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-11-21 21:58     ` Tejun Heo
@ 2011-11-23 14:02       ` Frederic Weisbecker
  2011-11-24 21:22         ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Frederic Weisbecker @ 2011-11-23 14:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, matthltc,
	akpm, oleg, kamezawa.hiroyu

On Mon, Nov 21, 2011 at 01:58:39PM -0800, Tejun Heo wrote:
> > but IMHO the races involved in exit and exec
> > are too different, specific and complicated on their own to be solved in a
> > single patch. This should be split in two.
> > 
> > The specific problems and their fixes need to be described in more detail
> > in the changelog because the issues are very tricky.
> > 
> > The exec case:
> > 
> > IIUC, the race in exec is about the group leader that can be changed
> > to become the exec'ing thread, making while_each_thread() unsafe.
> 
> Not only that, pid changes, sighand struct may get swapped, and other
> weird things which aren't usually expected for a live task happen.
> It's basically semi-killed and then resurrected.

Right. I've no problem with the fact that you want to make the threadgroup
more stable for subsystem attachment. Just please put more
detailed comments about what you are protecting. It took me quite
some time to precisely identify the races involved in that game in
order to properly review your patches. And I'm thinking about
the next people who will work on this piece of code.

This:

 static inline void threadgroup_lock(struct task_struct *tsk)
 {
+       /* exec uses exit for de-threading, grab cred_guard_mutex first */
+       mutex_lock(&tsk->signal->cred_guard_mutex);
        down_write(&tsk->signal->group_rwsem);

is really not obvious.

Just point out we want to synchronize against the leader, pid and the sighand
that may change concurrently.

> > We also have other things happening there like all the other threads
> > in the group that get killed, but that should be handled by the threadgroup_change_begin()
> > you put in the exit path.
> > The old leader is also killed but release_task() -> __unhash_process() is called
> > for it manually from the exec path. However this thread too should be covered by your
> > synchronisation in exit().
> > 
> > So after your protection in the exit path, the only thing to protect against in exec
> > is the group_leader that can change concurrently. But I may be missing something in the picture.
> 
> Hmm... an exec'ing task goes through transitions which usually aren't
> allowed to happen.  It assumes exclusive access to group-shared data
> structures and may ditch the current ones and create new ones.
> 
> Sure, it's possible to make every cgroup method correct w.r.t. all
> those via explicit locking or by being more careful but I don't see
> much point in such an exercise.  If the code is sitting in some hot path
> and the added exclusion is gonna add a significant amount of contention,
> sure, we should trade off easiness for better performance /
> scalability but this is for cgroup modifications, a fundamentally cold
> path, and the added locking is per-task or per-threadgroup.  I don't
> see any reason not to be easy here.

Sure. I have no problem with that. I just wish we could precisely identify
what we are protecting against, and not have foggy locking. Especially
when it's about such a core issue.

> Please read on.
> 
> > Also note this is currently protected by the tasklist readlock. Cred
> > guard mutex is certainly better, I just don't remember if you remove
> > the tasklist lock in a further patch.
> 
> I don't remove tasklist_lock.

I believe you can do it after using the cred guard mutex. That needs to
be double-checked though. I think it was mostly there to keep while_each_thread()
stable and non-racy against leader change. cred_guard_mutex should handle that
now.

> 
> > The exit case:
> > 
> > There the issue is about racing against cgroup_exit() where the task can be
> > assigned to the root cgroup. Is this really a problem in fact? I don't know
> > if subsystems care about that. Perhaps some of them are buggy in that
> > regard. At least the task counter handles that and it needs a
> > ->cancel_attach_task() for this purpose in the case the migration fails due to exit.
> > Not sure about others. A real synchronization against exit is less error prone for sure.
> > In the end that's probably a win.
> 
> This is the same story.  Yes, we can narrow the locking and try to
> make sure everyone handles partially destroyed tasks properly in all
> methods, but for what?  If we can give stable threadgroups to all
> cgroup methods without introducing performance or scalability
> bottleneck, that's the right thing to do.  Please also note that bugs
> stemming from failure to handle those corner cases properly will be
> subtle, difficult to reproduce and track down.

Agreed.

> In general, for any subsystem with a pluggable interface, I think it's
> best to put as much complexity as possible into the core layer to make
> things easier for its customers.  It becomes excruciatingly painful if
> the API invites subtle corner case bugs and the subsystem grows
> several dozen clients down the road.

Sure.

> 
> So, to me, what seems more important is how to make it easier for each
> cgroup client rather than what's minimally necessary right now.
> 
> Heh, did I make any sense? :)

Yep, fine for me :)
Just wanted to ensure that I (and others) properly understood and identified
the issues and the fixes.

* Re: [PATCH 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-11-21 22:03       ` Tejun Heo
@ 2011-11-23 14:34         ` Frederic Weisbecker
  0 siblings, 0 replies; 56+ messages in thread
From: Frederic Weisbecker @ 2011-11-23 14:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, matthltc,
	akpm, oleg, kamezawa.hiroyu

On Mon, Nov 21, 2011 at 02:03:26PM -0800, Tejun Heo wrote:
> On Mon, Nov 14, 2011 at 02:54:08PM +0100, Frederic Weisbecker wrote:
> > > Also note this is currently protected by the tasklist readlock. Cred guard mutex is
> > > certainly better, I just don't remember if you remove the tasklist lock in a
> > > further patch.
> > 
> > Ah recalling what Ben Blum said, we also need the leader to stay stable because it
> > is expected to be passed in ->can_attach(), ->attach(), ->cancel_attach(), ...
> > Although that's going to change after your patches that pass a flex array.
> 
> Not really without locking out exec.  The thing is whoever is exec'ing
> will be the leader and we can't guarantee that the first in the flex
> array is always the leader.  One method may see it as the leader, the
> next one might not.

Exactly!

* Re: [PATCH 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-11-23 14:02       ` Frederic Weisbecker
@ 2011-11-24 21:22         ` Tejun Heo
  0 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2011-11-24 21:22 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, matthltc,
	akpm, oleg, kamezawa.hiroyu

Hello, Frederic.

On Wed, Nov 23, 2011 at 03:02:05PM +0100, Frederic Weisbecker wrote:
> This:
> 
>  static inline void threadgroup_lock(struct task_struct *tsk)                                                                
>  {                                                                                                                           
> +       /* exec uses exit for de-threading, grab cred_guard_mutex first */                                                   
> +       mutex_lock(&tsk->signal->cred_guard_mutex);                                                                          
>         down_write(&tsk->signal->group_rwsem);
> 
> is really not obvious.
> 
> Just point out we want to synchronize against the leader, pid and the sighand
> that may change concurrently.

Sure thing.  Will beef up the comments.

> > > Also note this is currently protected by the tasklist readlock. Cred
> > > guard mutex is certainly better, I just don't remember if you remove
> > > the tasklist lock in a further patch.
> > 
> > I don't remove tasklist_lock.
> 
> I believe you can do it after using the cred guard mutex. That needs to
> be double-checked though. I think it was mostly there to keep while_each_thread()
> stable and non-racy against leader change. cred_guard_mutex should handle that
> now.

Yes, probably, but I still don't think removing that is a good idea.
We're walking the tasklist in a rather cold path; IMHO it's better to keep
the locking obvious.
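
For reference, the walk in question follows the usual pattern (a sketch;
the actual code in cgroup_attach_proc() also fills the flex array):

        struct task_struct *tsk = leader;

        read_lock(&tasklist_lock);
        do {
                get_task_struct(tsk);   /* collect @tsk for migration */
        } while_each_thread(leader, tsk);
        read_unlock(&tasklist_lock);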

> > So, to me, what seems more important is how to make it easier for each
> > cgroup client instead of what's the minimal that's necessary right
> > now.
> > 
> > Heh, did I make any sense? :)
> 
> Yep, fine for me :)
> Just wanted to ensure I (and others) understood and identified well the issues
> and the fixes.

Yeap, fair enough.  Will update the patches, repost and push them
through cgroup/for-3.3, and move on to other pending patchsets.

Thank you.

-- 
tejun

* [PATCH UPDATED 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-11-01 23:46 ` [PATCH 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec Tejun Heo
                     ` (2 preceding siblings ...)
  2011-11-13 18:20   ` Frederic Weisbecker
@ 2011-11-24 22:50   ` Tejun Heo
  2011-11-25  4:02     ` Linus Torvalds
  2011-11-25 14:01     ` Frederic Weisbecker
  3 siblings, 2 replies; 56+ messages in thread
From: Tejun Heo @ 2011-11-24 22:50 UTC (permalink / raw)
  To: paul, rjw, lizf, Linus Torvalds
  Cc: linux-pm, linux-kernel, containers, fweisbec, matthltc, akpm,
	oleg, kamezawa.hiroyu

threadgroup_lock() only protected against new additions to the
threadgroup, which was inherently somewhat incomplete and problematic
for its only user, cgroup.  On-going migration could race
against exec and exit leading to interesting problems - the symmetry
between various attach methods, task exiting during method execution,
->exit() racing against attach methods, migrating task switching basic
properties during exec and so on.

This patch extends threadgroup_lock() such that it protects against
all three threadgroup altering operations - fork, exit and exec.  For
exit, threadgroup_change_begin/end() calls are added to exit path.
For exec, threadgroup_[un]lock() are updated to also grab and release
cred_guard_mutex.

With this change, threadgroup_lock() guarantees that the target
threadgroup will remain stable - no new task will be added, no new
PF_EXITING will be set and exec won't happen.

The next patch will update cgroup so that it can take full advantage
of this change.

-v2: beefed up comment as suggested by Frederic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---

Linus, this is something being scheduled for the next merge window.
It extends threadgroup locking which cgroup used to do to exclude only
fork path so that it includes exit and exec paths.  threadgroup
locking is used by cgroup to implement process-scope cgroup migration.

Migration happens in multiple steps, at each of which the matching
method of each cgroup plugin is called.  When a whole process is being
migrated, methods being called at those different steps expect to see
consistent image of the thread group.

cgroup currently only locks out the addition of new tasks into the
thread group and holds an extra ref to member tasks.  This forces all
cgroup plugins to deal with tasks being torn down and exec morphing the
threadgroup.  This patch extends the scope of threadgroup locking such
that the thread group is guaranteed to be stable (no new task, tasks
are either live or dead and no exec morphing) while locked.

This is part of the changes to clean up cgroup methods and iron out
corner-case fuzziness; it should make difficult-to-reproduce race
conditions less likely and cgroup plugins easier to implement and verify.

The synchronization is strictly per-threadgroup and goes away if
cgroup is not configured.

The whole series is available at

  http://thread.gmane.org/gmane.linux.kernel.containers/21716
  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-3.3

Thank you.

 include/linux/sched.h |   47 +++++++++++++++++++++++++++++++++++++++++------
 kernel/exit.c         |   19 +++++++++++++++----
 2 files changed, 56 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c5acbce..481c5ed 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -635,11 +635,12 @@ struct signal_struct {
 #endif
 #ifdef CONFIG_CGROUPS
 	/*
-	 * The group_rwsem prevents threads from forking with
-	 * CLONE_THREAD while held for writing. Use this for fork-sensitive
-	 * threadgroup-wide operations. It's taken for reading in fork.c in
-	 * copy_process().
-	 * Currently only needed write-side by cgroups.
+	 * group_rwsem prevents new tasks from entering the threadgroup and
+	 * member tasks from exiting.  fork and exit paths are protected
+	 * with this rwsem using threadgroup_change_begin/end().  Users
+	 * which require threadgroup to remain stable should use
+	 * threadgroup_[un]lock() which also takes care of exec path.
+	 * Currently, cgroup is the only user.
 	 */
 	struct rw_semaphore group_rwsem;
 #endif
@@ -2374,7 +2375,6 @@ static inline void unlock_task_sighand(struct task_struct *tsk,
 	spin_unlock_irqrestore(&tsk->sighand->siglock, *flags);
 }
 
-/* See the declaration of group_rwsem in signal_struct. */
 #ifdef CONFIG_CGROUPS
 static inline void threadgroup_change_begin(struct task_struct *tsk)
 {
@@ -2384,13 +2384,48 @@ static inline void threadgroup_change_end(struct task_struct *tsk)
 {
 	up_read(&tsk->signal->group_rwsem);
 }
+
+/**
+ * threadgroup_lock - lock threadgroup
+ * @tsk: member task of the threadgroup to lock
+ *
+ * Lock the threadgroup @tsk belongs to.  No new task is allowed to enter
+ * and member tasks aren't allowed to exit (as indicated by PF_EXITING) or
+ * perform exec.  This is useful for cases where the threadgroup needs to
+ * stay stable across blockable operations.
+ *
+ * fork and exit explicitly call threadgroup_change_{begin|end}() for
+ * synchronization.  This excludes most of do_exit() to ensure that, while
+ * locked, tasks belonging to a locked group are not in the process of
+ * deconstruction - they're either alive or dead.
+ *
+ * During exec, a task goes and puts its thread group through unusual
+ * changes.  After de-threading, exclusive access is assumed to resources
+ * which are usually shared by tasks in the same group - e.g. sighand may
+ * be replaced with a new one.  Also, the exec'ing task takes over group
+ * leader role including its pid.  Exclude these changes while locked by
+ * grabbing cred_guard_mutex which is used to synchronize exec path.
+ */
 static inline void threadgroup_lock(struct task_struct *tsk)
 {
+	/*
+	 * exec uses exit for de-threading nesting group_rwsem inside
+	 * cred_guard_mutex. Grab cred_guard_mutex first.
+	 */
+	mutex_lock(&tsk->signal->cred_guard_mutex);
 	down_write(&tsk->signal->group_rwsem);
 }
+
+/**
+ * threadgroup_unlock - unlock threadgroup
+ * @tsk: member task of the threadgroup to unlock
+ *
+ * Reverse threadgroup_lock().
+ */
 static inline void threadgroup_unlock(struct task_struct *tsk)
 {
 	up_write(&tsk->signal->group_rwsem);
+	mutex_unlock(&tsk->signal->cred_guard_mutex);
 }
 #else
 static inline void threadgroup_change_begin(struct task_struct *tsk) {}
diff --git a/kernel/exit.c b/kernel/exit.c
index d0b7d98..b2cb562 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -936,6 +936,14 @@ NORET_TYPE void do_exit(long code)
 		schedule();
 	}
 
+	/*
+	 * @tsk's threadgroup is going through changes - lock out users
+	 * which expect stable threadgroup.  Do this before actually
+	 * starting tearing down @tsk so that locked threadgroup has either
+	 * alive or dead tasks, not something inbetween.
+	 */
+	threadgroup_change_begin(tsk);
+
 	exit_irq_thread();
 
 	exit_signals(tsk);  /* sets PF_EXITING */
@@ -1018,10 +1026,6 @@ NORET_TYPE void do_exit(long code)
 		kfree(current->pi_state_cache);
 #endif
 	/*
-	 * Make sure we are holding no locks:
-	 */
-	debug_check_no_locks_held(tsk);
-	/*
 	 * We can do this unlocked here. The futex code uses this flag
 	 * just to verify whether the pi state cleanup has been done
 	 * or not. In the worst case it loops once more.
@@ -1038,6 +1042,13 @@ NORET_TYPE void do_exit(long code)
 
 	preempt_disable();
 	exit_rcu();
+
+	/*
+	 * Release threadgroup and make sure we are holding no locks.
+	 */
+	threadgroup_change_end(tsk);
+	debug_check_no_locks_held(tsk);
+
 	/* causes final put_task_struct in finish_task_switch(). */
 	tsk->state = TASK_DEAD;
 	schedule();
-- 
1.7.3.1


* Re: [PATCHSET] cgroup: stable threadgroup during attach & subsys methods consolidation
  2011-11-01 23:46 [PATCHSET] cgroup: stable threadgroup during attach & subsys methods consolidation Tejun Heo
                   ` (10 preceding siblings ...)
  2011-11-21 22:07 ` [PATCHSET] cgroup: stable threadgroup during attach & subsys methods consolidation Tejun Heo
@ 2011-11-24 22:51 ` Tejun Heo
  11 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2011-11-24 22:51 UTC (permalink / raw)
  To: paul, rjw, lizf
  Cc: linux-pm, linux-kernel, containers, fweisbec, matthltc, akpm,
	oleg, kamezawa.hiroyu

Series committed to cgroup/for-3.3 with updated patch and pushed out
to linux-next.

Thank you.

-- 
tejun

* Re: [PATCH UPDATED 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-11-24 22:50   ` [PATCH UPDATED " Tejun Heo
@ 2011-11-25  4:02     ` Linus Torvalds
  2011-11-27 19:21       ` Tejun Heo
  2011-11-25 14:01     ` Frederic Weisbecker
  1 sibling, 1 reply; 56+ messages in thread
From: Linus Torvalds @ 2011-11-25  4:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, fweisbec,
	matthltc, akpm, oleg, kamezawa.hiroyu

On Thu, Nov 24, 2011 at 2:50 PM, Tejun Heo <tj@kernel.org> wrote:
>
> Linus, this is something being scheduled for the next merge window.
> It extends threadgroup locking which cgroup used to do to exclude only
> fork path so that it includes exit and exec paths.  threadgroup
> locking is used by cgroup to implement process-scope cgroup migration.
>
> Migration happens in multiple steps, at each of which the matching
> method of each cgroup plugin is called.  When a whole process is being
> migrated, methods being called at those different steps expect to see
> consistent image of the thread group.

Ugh.

I see why you want to do this, but I really don't much like some of it.

I've got a few questions:

 - do you even *need* two separate locks at all? If you extend the
threadgroup_fork_lock to cover exec/exit too (and rename it), then why
does that separate cred_guard_mutex make sense at all?

  Taking two locks for the common exit case seems disgusting. What do
the separate locks buy us?

 - could we possibly avoid the lock(s) entirely for the
single-threaded case? The fact that ptrace wants to serialize makes me
say "maybe we can't avoid it", but I thought I'd ask. Even if we do
need/want two separate locks, do we really want to take them both for
the case that shouldn't really need even a single one?

Hmm? Simplifying the locking rules is always a good idea, and I think
this seems to make some of it more straightforward, but at the same
time I shudder when I look at some of the patches in the series that
nest locking three locks deep. Ugh.

                 Linus

* Re: [PATCH UPDATED 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-11-24 22:50   ` [PATCH UPDATED " Tejun Heo
  2011-11-25  4:02     ` Linus Torvalds
@ 2011-11-25 14:01     ` Frederic Weisbecker
  2011-11-27 19:30       ` Tejun Heo
  1 sibling, 1 reply; 56+ messages in thread
From: Frederic Weisbecker @ 2011-11-25 14:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, lizf, Linus Torvalds, linux-pm, linux-kernel,
	containers, matthltc, akpm, oleg, kamezawa.hiroyu

On Thu, Nov 24, 2011 at 02:50:54PM -0800, Tejun Heo wrote:
> [...]
> diff --git a/kernel/exit.c b/kernel/exit.c
> index d0b7d98..b2cb562 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -936,6 +936,14 @@ NORET_TYPE void do_exit(long code)
>  		schedule();
>  	}
>  
> +	/*
> +	 * @tsk's threadgroup is going through changes - lock out users
> +	 * which expect stable threadgroup.  Do this before actually
> +	 * starting tearing down @tsk so that locked threadgroup has either
> +	 * alive or dead tasks, not something inbetween.
> +	 */
> +	threadgroup_change_begin(tsk);
> +

I still wonder why this lock has such big coverage. I mean
why is it taken right before exit_irq_thread() and released so late.
All we want is to lock cgroup_exit() I think, after which tasks can't be
migrated.
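
For instance, the narrower coverage could look roughly like this in
do_exit() (a sketch only; cgroup_exit() keeps its current signature):

        /* sketch: serialize only the cgroup unhooking against attach */
        threadgroup_change_begin(tsk);
        cgroup_exit(tsk, 1);
        threadgroup_change_end(tsk);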

Thanks.

* Re: [PATCH UPDATED 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-11-25  4:02     ` Linus Torvalds
@ 2011-11-27 19:21       ` Tejun Heo
  2011-11-27 21:25         ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2011-11-27 19:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, fweisbec,
	matthltc, akpm, oleg, kamezawa.hiroyu

Hello, Linus.

On Thu, Nov 24, 2011 at 08:02:18PM -0800, Linus Torvalds wrote:
>  - do you even *need* two separate locks at all? If you extend the
> threadgroup_fork_lock to cover exec/exit too (and rename it), then why
> does that separate cred_guard_mutex make sense at all?

That was mostly choosing the path of least resistance as both locks
already existed and exec exclusion was added later on, but yeah there
isn't much to lose by merging those two locks (do_exit nesting inside
exec will need some care tho).  I'll try to unify the two locks.

>   Taking two locks for the common exit case seems disgusting. What do
> the separate locks buy us?

A bit confused.  threadgroup_change_begin() only locks
signal->group_rwsem, i.e. fork/exit is protected by
signal->group_rwsem.  exec is protected by signal->cred_guard_mutex.
Grabbing both locks gives stable thread group.

>  - could we possibly avoid the lock(s) entirely for the
> single-threaded case? The fact that ptrace wants to serialize makes me
> say "maybe we can't avoid it", but I thought I'd ask. Even if we do
> need/want two separate locks, do we really want to take them both for
> the case that shouldn't really need even a single one?

Maybe we can avoid grabbing a lock on exec and exit if the task
belongs to a single-task process, but optimizations like that can be very
fragile.  I think it would be better to merge the two locks and keep
the locking unconditional.

> Hmm? Simplifying the locking rules is always a good idea, and I think
> this seems to make some of it more straightforward, but at the same
> time I shudder when I look at some of the patches in the series that
> nest locking three locks deep. Ugh.

cgroup_root_mutex separation is to avoid lock recursion via
cgroup_show_options() and can probably be done in a lighter-handed way.
As I'm still getting used to cgroup code, I wanted to resolve the
locking order without changing much else.

Anyways, I'll send separated patch series dealing with fork/exec/exit
synchronization.

Thank you.

-- 
tejun

* Re: [PATCH UPDATED 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-11-25 14:01     ` Frederic Weisbecker
@ 2011-11-27 19:30       ` Tejun Heo
  2011-12-02 16:28         ` Frederic Weisbecker
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2011-11-27 19:30 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: paul, rjw, lizf, Linus Torvalds, linux-pm, linux-kernel,
	containers, matthltc, akpm, oleg, kamezawa.hiroyu

Hello, Frederic.

On Fri, Nov 25, 2011 at 03:01:39PM +0100, Frederic Weisbecker wrote:
> > +	/*
> > +	 * @tsk's threadgroup is going through changes - lock out users
> > +	 * which expect stable threadgroup.  Do this before actually
> > +	 * starting tearing down @tsk so that locked threadgroup has either
> > +	 * alive or dead tasks, not something inbetween.
> > +	 */
> > +	threadgroup_change_begin(tsk);
> > +
> 
> I still wonder why this lock has such big coverage. I mean
> why is it taken right before exit_irq_thread() and released so late.
> All we want is to lock cgroup_exit() I think, after which tasks can't be
> migrated.

That way, cgroup plugins never see tasks in the process of
deconstruction.  Its ->mm, ->sighand, ->signal and so on either don't
exist or continue to exist across all cgroup callbacks.  It's simpler
and safer (especially as bugs in this area would only be visible when
migration and exit race), and if we're gonna synchronize exit path at
all, there isn't anything to lose by excluding the whole thing.

Thanks.

-- 
tejun

* Re: [PATCH UPDATED 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-11-27 19:21       ` Tejun Heo
@ 2011-11-27 21:25         ` Tejun Heo
  2011-12-01 19:29           ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2011-11-27 21:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, fweisbec,
	matthltc, akpm, oleg, kamezawa.hiroyu

Hello, Linus.

On Sun, Nov 27, 2011 at 11:21:55AM -0800, Tejun Heo wrote:
> On Thu, Nov 24, 2011 at 08:02:18PM -0800, Linus Torvalds wrote:
> >  - do you even *need* two separate locks at all? If you extend the
> > threadgroup_fork_lock to cover exec/exit too (and rename it), then why
> > does that separate cred_guard_mutex make sense at all?
> 
> That was mostly choosing the path of least resistance as both locks
> already existed and exec exclusion was added later on, but yeah there
> isn't much to lose by merging those two locks (do_exit nesting inside
> exec will need some care tho).  I'll try to unify the two locks.

Hmmm... I tried this but it seems more difficult than originally
expected.  The biggest obstacle is that we don't have _interruptible
and _killable rwsem ops.

As proposed, what we have is,

* fork and exit paths are read-protected by group_rwsem.  ie. forks
  and exits don't block each other.

* exec path is protected by cred_guard_mutex.  cred_guard_mutex also
  synchronizes parallel exec attempts.  (It's probably better to
  rename it to exec_mutex)

* thread_group_lock() makes the thread group stable by write-locking
  group_rwsem and grabbing cred_guard_mutex.  The former excludes
  forks and exits and the latter exec.

group_rwsem and cred_guard_mutex can be merged by,

* fork and exit paths read-lock group_rwsem as it currently does.

* exec path write-locks group_rwsem.  Note that it'll need to unlock
  in de_thread() while waiting for other threads to exit, as the exit path
  can't proceed while group_rwsem is write-locked.
  prepare_bprm_creds() should be updated to check signal_pending()
  after acquiring group_rwsem.

  This introduces synchronization between exec and fork/exit paths,
  but I don't think this is something we need to worry about, as neither
  fork-exec nor exit-exec scalability matters.

* thread_group_lock() write-locks group_rwsem.

The problem is that cred_guard_mutex uses _interruptible/_killable
operations and rwsem doesn't have them, so cred_guard_mutex can't be
easily replaced with write-locking group_rwsem.
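
To make the obstacle concrete, the exec side currently does, in
prepare_bprm_creds(),

        if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
                return -ERESTARTNOINTR;

and a merged lock would need an equivalent interruptible/killable
down_write() on group_rwsem, which rwsem simply doesn't provide.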

If the two locks can't be merged, under the proposed scheme, while not
exactly pretty, both the fork/exit and exec paths go through a single
lock and only the ones which want a stable threadgroup need to grab
both locks, so IMHO it is at least reasonable.

Any better ideas?

Thanks.

-- 
tejun

* Re: [PATCH UPDATED 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-11-27 21:25         ` Tejun Heo
@ 2011-12-01 19:29           ` Tejun Heo
  0 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2011-12-01 19:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, fweisbec,
	matthltc, akpm, oleg, kamezawa.hiroyu

Hello again, Linus.

On Sun, Nov 27, 2011 at 01:25:58PM -0800, Tejun Heo wrote:
> The problem is that cred_guard_mutex uses _interruptible/_killable
> operations and rwsem doesn't have them, so cred_guard_mutex can't be
> easily replaced with write-locking group_rwsem.
> 
> If the two locks can't be merged, under the proposed scheme, while not
> exactly pretty, both the fork/exit and exec paths go through a single
> lock and only the ones which want a stable threadgroup need to grab
> both locks, so IMHO it is at least reasonable.
> 
> Any better ideas?

I agree that the proposed solution is rather ugly but stable
thread-group is a valid mechanism to have and cgroup can benefit a lot
from it.  I'd be happy to revamp the implementation if anyone can come
up with a better way and can add big fat comment stating that.  Until
something better comes up, would it be okay to stick with this
implementation?

Thanks.

-- 
tejun

* Re: [PATCH UPDATED 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-11-27 19:30       ` Tejun Heo
@ 2011-12-02 16:28         ` Frederic Weisbecker
  2011-12-05 18:43           ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Frederic Weisbecker @ 2011-12-02 16:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, lizf, Linus Torvalds, linux-pm, linux-kernel,
	containers, matthltc, akpm, oleg, kamezawa.hiroyu

On Sun, Nov 27, 2011 at 11:30:01AM -0800, Tejun Heo wrote:
> Hello, Frederic.
> 
> On Fri, Nov 25, 2011 at 03:01:39PM +0100, Frederic Weisbecker wrote:
> > > +	/*
> > > +	 * @tsk's threadgroup is going through changes - lock out users
> > > +	 * which expect stable threadgroup.  Do this before actually
> > > +	 * starting tearing down @tsk so that locked threadgroup has either
> > > +	 * alive or dead tasks, not something inbetween.
> > > +	 */
> > > +	threadgroup_change_begin(tsk);
> > > +
> > 
> > I still wonder why the coverage of this lock is so big. I mean,
> > why is it called right before exit_irq_thread() and released so late?
> > All we want is to lock cgroup_exit(), I think, after which tasks
> > can't be migrated.
> 
> That way, cgroup plugins never see tasks in the process of
> deconstruction.  Its ->mm, ->sighand, ->signal and so on either don't
> exist or continue to exist across all cgroup callbacks.  It's simpler
> and safer (especially as bugs in this area would only be visible when
> migration and exit race), and if we're gonna synchronize exit path at
> all, there isn't anything to lose by excluding the whole thing.

Fine.

But I don't think it's very useful to protect against exit_irq_thread();
what happens there is purely of internal irq interest.

Then right after, PF_EXITING is set before any interesting change.
Isn't it possible to simply lock this flag setting? IIRC, as soon
as the PF_EXITING flag is set, you ignore the task for attachment.


* Re: [PATCH UPDATED 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-12-02 16:28         ` Frederic Weisbecker
@ 2011-12-05 18:43           ` Tejun Heo
  2011-12-07 15:30             ` Frederic Weisbecker
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2011-12-05 18:43 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: paul, rjw, lizf, Linus Torvalds, linux-pm, linux-kernel,
	containers, matthltc, akpm, oleg, kamezawa.hiroyu

Hello, Frederic.

On Fri, Dec 02, 2011 at 05:28:03PM +0100, Frederic Weisbecker wrote:
> But I don't think it's very useful to protect against exit_irq_thread();
> what happens there is purely of internal irq interest.
> 
> Then right after, PF_EXITING is set before any interesting change.
> Isn't it possible to simply lock this flag setting? IIRC, as soon
> as the PF_EXITING flag is set, you ignore the task for attachment.

I think that's technically possible but it does introduce another
class of tasks - the dying ones.  e.g. if a task has PF_EXITING set
and the containing process is migrating, we'll have to migrate all
tasks but the dying one, and cgroup ->exit callbacks can be called on
the lonely task after the migration is complete.  It's kinda messy and
if someone makes a wrong assumption there, the bug is gonna be even
more difficult to reproduce / track down than now.  Yes, smaller scope
locking is nicer, but I would like to avoid API oddities like that.

Thanks.

-- 
tejun


* Re: [PATCH UPDATED 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-12-05 18:43           ` Tejun Heo
@ 2011-12-07 15:30             ` Frederic Weisbecker
  2011-12-07 18:22               ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Frederic Weisbecker @ 2011-12-07 15:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paul, rjw, lizf, Linus Torvalds, linux-pm, linux-kernel,
	containers, matthltc, akpm, oleg, kamezawa.hiroyu

On Mon, Dec 05, 2011 at 10:43:15AM -0800, Tejun Heo wrote:
> Hello, Frederic.
> 
> On Fri, Dec 02, 2011 at 05:28:03PM +0100, Frederic Weisbecker wrote:
> > But I don't think it's very useful to protect against exit_irq_thread();
> > what happens there is purely of internal irq interest.
> > 
> > Then right after, PF_EXITING is set before any interesting change.
> > Isn't it possible to simply lock this flag setting? IIRC, as soon
> > as the PF_EXITING flag is set, you ignore the task for attachment.
> 
> I think that's technically possible but it does introduce another
> class of tasks - the dying ones.  e.g. if a task has PF_EXITING set
> and the containing process is migrating, we'll have to migrate all
> tasks but the dying one, and cgroup ->exit callbacks can be called on
> the lonely task after the migration is complete.  It's kinda messy and
> if someone makes a wrong assumption there, the bug is gonna be even
> more difficult to reproduce / track down than now.  Yes, smaller scope
> locking is nicer, but I would like to avoid API oddities like that.

I don't understand what you mean. In your patches, you only process tasks
that don't have PF_EXITING, i.e. you don't include them in the flex array
in cgroup_attach_proc(). So that still applies in my proposal.

From the exit path we would have:

	exit_signals() {
		lock_threadgroup_change(task);
		task->flags |= PF_EXITING;
		unlock_threadgroup_change(task);
	}

	exit all the rest: mm, etc...

Then from cgroup_attach_proc():

	lock_threadgroup(task);
	for_each_thread(task) {
		if (!(task->flags & PF_EXITING))
			include in flex array
	}

Am I forgetting something?


* Re: [PATCH UPDATED 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-12-07 15:30             ` Frederic Weisbecker
@ 2011-12-07 18:22               ` Tejun Heo
  2011-12-08 20:50                 ` [PATCH UPDATED AGAIN " Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2011-12-07 18:22 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: paul, rjw, lizf, Linus Torvalds, linux-pm, linux-kernel,
	containers, matthltc, akpm, oleg, kamezawa.hiroyu

Hello, Frederic.

On Wed, Dec 07, 2011 at 04:30:54PM +0100, Frederic Weisbecker wrote:
> I don't understand what you mean. In your patches, you only process tasks
> that don't have PF_EXITING, i.e. you don't include them in the flex array
> in cgroup_attach_proc(). So that still applies in my proposal.
> 
> From the exit path we would have:
> 
> 	exit_signals() {
> 		lock_threadgroup_change(task);
> 		task->flags |= PF_EXITING;
> 		unlock_threadgroup_change(task);
> 	}
> 
> 	exit all the rest: mm, etc...
> 
> Then from cgroup_attach_proc():
> 
> 	lock_threadgroup(task);
> 	for_each_thread(task) {
> 		if (!(task->flags & PF_EXITING))
> 			include in flex array
> 	}
> 
> Am I forgetting something?

The point I was trying to make was that doing the above would get
->exit() called on a dangling task of a threadgroup in rare cases.
i.e. with the patchset as posted, after a threadgroup migration, all
tasks in the threadgroup are in the new cgroup and no method will be
called on the old cgroup for any of the member tasks.  With the above
change, process migration would leave out dying tasks and ->exit()
could be called with the old cgroup.
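
To illustrate, with the narrowed locking the following interleaving
becomes possible (T1 being a dying member of leader T0's threadgroup):

	T1 (exiting)			T0's process being migrated
	------------			---------------------------
	threadgroup_change_begin(T1)
	T1->flags |= PF_EXITING
	threadgroup_change_end(T1)
					threadgroup_lock(T0)
					T1 skipped - PF_EXITING set
					rest moved to the new cgroup
					threadgroup_unlock(T0)
	cgroup_exit(T1)
	  ss->exit() sees the old cgroup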

Hmm... that said, it probably doesn't matter all that much either way.
I'll update the patchset and repost.

Thanks.

-- 
tejun


* [PATCH UPDATED AGAIN 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-12-07 18:22               ` Tejun Heo
@ 2011-12-08 20:50                 ` Tejun Heo
  2011-12-09 23:42                   ` Frederic Weisbecker
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2011-12-08 20:50 UTC (permalink / raw)
  To: Frederic Weisbecker, Linus Torvalds
  Cc: paul, rjw, lizf, linux-pm, linux-kernel, containers, matthltc,
	akpm, oleg, kamezawa.hiroyu

(Note for Linus at the bottom)

threadgroup_lock() protected only against new additions to
the threadgroup, which was inherently somewhat incomplete and
problematic for its only user cgroup.  On-going migration could race
against exec and exit leading to interesting problems - the symmetry
between various attach methods, task exiting during method execution,
->exit() racing against attach methods, migrating task switching basic
properties during exec and so on.

This patch extends threadgroup_lock() such that it protects against
all three threadgroup altering operations - fork, exit and exec.  For
exit, threadgroup_change_begin/end() calls are added to exit_signals
around assertion of PF_EXITING.  For exec, threadgroup_[un]lock() are
updated to also grab and release cred_guard_mutex.

With this change, threadgroup_lock() guarantees that the target
threadgroup will remain stable - no new task will be added, no new
PF_EXITING will be set and exec won't happen.

The next patch will update cgroup so that it can take full advantage
of this change.
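
For illustration, a threadgroup-stable section then looks like the
following (sketch only; the actual cgroup conversion is in the next
patch).

	threadgroup_lock(leader);
	/*
	 * @leader's threadgroup is now stable - no task will be added
	 * to it, no member will set PF_EXITING and exec won't happen -
	 * until threadgroup_unlock().
	 */
	...
	threadgroup_unlock(leader);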

-v2: beefed up comment as suggested by Frederic.

-v3: narrowed scope of protection in exit path as suggested by
     Frederic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
Okay, narrowed exit path protection down to setting of PF_EXITING
itself.  ->exit() on dangling tasks is a bit weird but I don't think
it's too bad.  Frederic, are you okay with this version?

Linus, if Frederic is okay with it, I'm gonna rebase the series on top
of freezer changes in pm tree to avoid conflicts in cgroup_freezer,
which sits between cgroup and freezer, both of which are going through
non-trivial changes, push the branch to linux-next and put pending
cgroup patches on top.  Please scream if you're mighty unhappy with it
or have a better idea.

Thank you.

 include/linux/sched.h |   47 +++++++++++++++++++++++++++++++++++++++++------
 kernel/signal.c       |   10 ++++++++++
 2 files changed, 51 insertions(+), 6 deletions(-)

Index: work/include/linux/sched.h
===================================================================
--- work.orig/include/linux/sched.h
+++ work/include/linux/sched.h
@@ -635,11 +635,13 @@ struct signal_struct {
 #endif
 #ifdef CONFIG_CGROUPS
 	/*
-	 * The group_rwsem prevents threads from forking with
-	 * CLONE_THREAD while held for writing. Use this for fork-sensitive
-	 * threadgroup-wide operations. It's taken for reading in fork.c in
-	 * copy_process().
-	 * Currently only needed write-side by cgroups.
+	 * group_rwsem prevents new tasks from entering the threadgroup and
+	 * member tasks from exiting, more specifically, setting of
+	 * PF_EXITING.  fork and exit paths are protected with this rwsem
+	 * using threadgroup_change_begin/end().  Users which require
+	 * threadgroup to remain stable should use threadgroup_[un]lock()
+	 * which also takes care of exec path.  Currently, cgroup is the
+	 * only user.
 	 */
 	struct rw_semaphore group_rwsem;
 #endif
@@ -2374,7 +2376,6 @@ static inline void unlock_task_sighand(s
 	spin_unlock_irqrestore(&tsk->sighand->siglock, *flags);
 }
 
-/* See the declaration of group_rwsem in signal_struct. */
 #ifdef CONFIG_CGROUPS
 static inline void threadgroup_change_begin(struct task_struct *tsk)
 {
@@ -2384,13 +2385,47 @@ static inline void threadgroup_change_en
 {
 	up_read(&tsk->signal->group_rwsem);
 }
+
+/**
+ * threadgroup_lock - lock threadgroup
+ * @tsk: member task of the threadgroup to lock
+ *
+ * Lock the threadgroup @tsk belongs to.  No new task is allowed to enter
+ * and member tasks aren't allowed to exit (as indicated by PF_EXITING) or
+ * perform exec.  This is useful for cases where the threadgroup needs to
+ * stay stable across blockable operations.
+ *
+ * fork and exit paths explicitly call threadgroup_change_{begin|end}() for
+ * synchronization.  While held, no new task will be added to threadgroup
+ * and no existing live task will have its PF_EXITING set.
+ *
+ * During exec, a task goes and puts its thread group through unusual
+ * changes.  After de-threading, exclusive access is assumed to resources
+ * which are usually shared by tasks in the same group - e.g. sighand may
+ * be replaced with a new one.  Also, the exec'ing task takes over group
+ * leader role including its pid.  Exclude these changes while locked by
+ * grabbing cred_guard_mutex which is used to synchronize exec path.
+ */
 static inline void threadgroup_lock(struct task_struct *tsk)
 {
+	/*
+	 * exec uses exit for de-threading, nesting group_rwsem inside
+	 * cred_guard_mutex. Grab cred_guard_mutex first.
+	 */
+	mutex_lock(&tsk->signal->cred_guard_mutex);
 	down_write(&tsk->signal->group_rwsem);
 }
+
+/**
+ * threadgroup_unlock - unlock threadgroup
+ * @tsk: member task of the threadgroup to unlock
+ *
+ * Reverse threadgroup_lock().
+ */
 static inline void threadgroup_unlock(struct task_struct *tsk)
 {
 	up_write(&tsk->signal->group_rwsem);
+	mutex_unlock(&tsk->signal->cred_guard_mutex);
 }
 #else
 static inline void threadgroup_change_begin(struct task_struct *tsk) {}
Index: work/kernel/signal.c
===================================================================
--- work.orig/kernel/signal.c
+++ work/kernel/signal.c
@@ -2359,8 +2359,15 @@ void exit_signals(struct task_struct *ts
 	int group_stop = 0;
 	sigset_t unblocked;
 
+	/*
+	 * @tsk is about to have PF_EXITING set - lock out users which
+	 * expect stable threadgroup.
+	 */
+	threadgroup_change_begin(tsk);
+
 	if (thread_group_empty(tsk) || signal_group_exit(tsk->signal)) {
 		tsk->flags |= PF_EXITING;
+		threadgroup_change_end(tsk);
 		return;
 	}
 
@@ -2370,6 +2377,9 @@ void exit_signals(struct task_struct *ts
 	 * see wants_signal(), do_signal_stop().
 	 */
 	tsk->flags |= PF_EXITING;
+
+	threadgroup_change_end(tsk);
+
 	if (!signal_pending(tsk))
 		goto out;
 


* Re: [PATCH UPDATED AGAIN 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-12-08 20:50                 ` [PATCH UPDATED AGAIN " Tejun Heo
@ 2011-12-09 23:42                   ` Frederic Weisbecker
  2011-12-13  1:33                     ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Frederic Weisbecker @ 2011-12-09 23:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, paul, rjw, lizf, linux-pm, linux-kernel,
	containers, matthltc, akpm, oleg, kamezawa.hiroyu

On Thu, Dec 08, 2011 at 12:50:55PM -0800, Tejun Heo wrote:
> (Note for Linus at the bottom)
> 
> threadgroup_lock() protected only against new additions to
> the threadgroup, which was inherently somewhat incomplete and
> problematic for its only user cgroup.  On-going migration could race
> against exec and exit leading to interesting problems - the symmetry
> between various attach methods, task exiting during method execution,
> ->exit() racing against attach methods, migrating task switching basic
> properties during exec and so on.
> 
> This patch extends threadgroup_lock() such that it protects against
> all three threadgroup altering operations - fork, exit and exec.  For
> exit, threadgroup_change_begin/end() calls are added to exit_signals
> around assertion of PF_EXITING.  For exec, threadgroup_[un]lock() are
> updated to also grab and release cred_guard_mutex.
> 
> With this change, threadgroup_lock() guarantees that the target
> threadgroup will remain stable - no new task will be added, no new
> PF_EXITING will be set and exec won't happen.
> 
> The next patch will update cgroup so that it can take full advantage
> of this change.
> 
> -v2: beefed up comment as suggested by Frederic.
> 
> -v3: narrowed scope of protection in exit path as suggested by
>      Frederic.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Acked-by: Li Zefan <lizf@cn.fujitsu.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Paul Menage <paul@paulmenage.org>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> ---
> Okay, narrowed exit path protection down to setting of PF_EXITING
> itself.  ->exit() on dangling tasks is a bit weird but I don't think
> it's too bad.  Frederic, are you okay with this version?

Yeah, that new scheme that only protects PF_EXITING may look a bit
strange, but I think we are fine. With RCU list traversal, it should
be safe even if a group member is concurrently dropped from the list
(in that case, all we do is check its PF_EXITING and give up on it).
And we may have a concurrent ->exit(), but that should be fine too.
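
IOW the traversal in cgroup_attach_proc() keeps working unchanged;
roughly (a sketch of the pattern, not the exact code):

	rcu_read_lock();
	tsk = leader;
	do {
		if (!(tsk->flags & PF_EXITING))
			flex_array_put(group, i++, &tsk, GFP_ATOMIC);
	} while_each_thread(leader, tsk);
	rcu_read_unlock();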

Thanks!

Acked-by: Frederic Weisbecker <fweisbec@gmail.com>

> 
> Linus, if Frederic is okay with it, I'm gonna rebase the series on top
> of freezer changes in pm tree to avoid conflicts in cgroup_freezer,
> which sits between cgroup and freezer, both of which are going through
> non-trivial changes, push the branch to linux-next and put pending
> cgroup patches on top.  Please scream if you're mighty unhappy with it
> or have a better idea.
> 
> Thank you.


* Re: [PATCH UPDATED AGAIN 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-12-09 23:42                   ` Frederic Weisbecker
@ 2011-12-13  1:33                     ` Tejun Heo
  2011-12-13  2:17                       ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2011-12-13  1:33 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Linus Torvalds, paul, rjw, lizf, linux-pm, linux-kernel,
	containers, matthltc, akpm, oleg, kamezawa.hiroyu

Hello,

On Sat, Dec 10, 2011 at 12:42:53AM +0100, Frederic Weisbecker wrote:
> > Okay, narrowed exit path protection down to setting of PF_EXITING
> > itself.  ->exit() on dangling tasks is a bit weird but I don't think
> > it's too bad.  Frederic, are you okay with this version?
> 
> Yeah, that new scheme that only protects PF_EXITING may look a bit
> strange, but I think we are fine. With RCU list traversal, it should
> be safe even if a group member is concurrently dropped from the list
> (in that case, all we do is check its PF_EXITING and give up on it).
> And we may have a concurrent ->exit(), but that should be fine too.
> 
> Thanks!
> 
> Acked-by: Frederic Weisbecker <fweisbec@gmail.com>

Awesome, adding Acked-by, putting it on top of pm-freezer and pushing
it out to linux-next.

Thank you.

-- 
tejun


* Re: [PATCH UPDATED AGAIN 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec
  2011-12-13  1:33                     ` Tejun Heo
@ 2011-12-13  2:17                       ` Tejun Heo
  0 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2011-12-13  2:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Linus Torvalds, paul, rjw, lizf, linux-pm, linux-kernel,
	containers, matthltc, akpm, oleg, kamezawa.hiroyu

Updated patchset published at

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-3.3

and merged branch also pushed out to for-next.

Thanks.

-- 
tejun


end of thread

Thread overview: 56+ messages
2011-11-01 23:46 [PATCHSET] cgroup: stable threadgroup during attach & subsys methods consolidation Tejun Heo
2011-11-01 23:46 ` [PATCH 01/10] cgroup: add cgroup_root_mutex Tejun Heo
2011-11-04  8:38   ` KAMEZAWA Hiroyuki
2011-11-01 23:46 ` [PATCH 02/10] threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem Tejun Heo
2011-11-04  8:40   ` KAMEZAWA Hiroyuki
2011-11-04 15:16     ` Tejun Heo
2011-11-01 23:46 ` [PATCH 03/10] threadgroup: extend threadgroup_lock() to cover exit and exec Tejun Heo
2011-11-04  8:45   ` KAMEZAWA Hiroyuki
2011-11-13 16:44   ` Frederic Weisbecker
2011-11-14 13:54     ` Frederic Weisbecker
2011-11-21 22:03       ` Tejun Heo
2011-11-23 14:34         ` Frederic Weisbecker
2011-11-21 21:58     ` Tejun Heo
2011-11-23 14:02       ` Frederic Weisbecker
2011-11-24 21:22         ` Tejun Heo
2011-11-13 18:20   ` Frederic Weisbecker
2011-11-24 22:50   ` [PATCH UPDATED " Tejun Heo
2011-11-25  4:02     ` Linus Torvalds
2011-11-27 19:21       ` Tejun Heo
2011-11-27 21:25         ` Tejun Heo
2011-12-01 19:29           ` Tejun Heo
2011-11-25 14:01     ` Frederic Weisbecker
2011-11-27 19:30       ` Tejun Heo
2011-12-02 16:28         ` Frederic Weisbecker
2011-12-05 18:43           ` Tejun Heo
2011-12-07 15:30             ` Frederic Weisbecker
2011-12-07 18:22               ` Tejun Heo
2011-12-08 20:50                 ` [PATCH UPDATED AGAIN " Tejun Heo
2011-12-09 23:42                   ` Frederic Weisbecker
2011-12-13  1:33                     ` Tejun Heo
2011-12-13  2:17                       ` Tejun Heo
2011-11-01 23:46 ` [PATCH 04/10] cgroup: always lock threadgroup during migration Tejun Heo
2011-11-04  8:54   ` KAMEZAWA Hiroyuki
2011-11-04 15:21     ` Tejun Heo
2011-11-14 18:46   ` Frederic Weisbecker
2011-11-14 18:52     ` Frederic Weisbecker
2011-11-21 22:05       ` Tejun Heo
2011-11-01 23:46 ` [PATCH 05/10] cgroup: subsys->attach_task() should be called after migration Tejun Heo
2011-11-14 20:06   ` Frederic Weisbecker
2011-11-21 22:04     ` Tejun Heo
2011-11-01 23:46 ` [PATCH 06/10] cgroup: improve old cgroup handling in cgroup_attach_proc() Tejun Heo
2011-11-14 20:37   ` Frederic Weisbecker
2011-11-01 23:46 ` [PATCH 07/10] cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach() Tejun Heo
2011-11-14 21:16   ` Frederic Weisbecker
2011-11-01 23:46 ` [PATCH 08/10] cgroup: don't use subsys->can_attach_task() or ->attach_task() Tejun Heo
2011-11-04  9:08   ` KAMEZAWA Hiroyuki
2011-11-14 23:54   ` Frederic Weisbecker
2011-11-01 23:46 ` [PATCH 09/10] cgroup, cpuset: don't use ss->pre_attach() Tejun Heo
2011-11-15  0:51   ` Frederic Weisbecker
2011-11-01 23:46 ` [PATCH 10/10] cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task() Tejun Heo
2011-11-04  9:10   ` KAMEZAWA Hiroyuki
2011-11-15  0:54   ` Frederic Weisbecker
2011-11-21 22:07 ` [PATCHSET] cgroup: stable threadgroup during attach & subsys methods consolidation Tejun Heo
2011-11-22  2:27   ` Li Zefan
2011-11-22 16:20     ` Tejun Heo
2011-11-24 22:51 ` Tejun Heo
