* [PATCH v2 0/6] CGroups: cgroup memberlist enhancement+fix
@ 2009-07-31  1:51 Ben Blum
  2009-07-31  1:51 ` [PATCH 1/6] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids Ben Blum
                   ` (5 more replies)
  0 siblings, 6 replies; 100+ messages in thread
From: Ben Blum @ 2009-07-31  1:51 UTC (permalink / raw)
  To: linux-kernel, containers, akpm, serue, lizf, menage, bblum

(This patch series is a revision of http://lkml.org/lkml/2009/7/23/330 .)

The following series adds a "cgroup.procs" file to each cgroup that reports
unique tgids rather than pids; it can also be written to in order to move all
the threads in a threadgroup at once.
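
As a rough userspace sketch of the intended interface (assuming a cgroup
hierarchy mounted at /cgroup with hypothetical groups "foo" and "bar", and
with the later "Makes procs file writable" patch applied), moving a whole
threadgroup looks like this:

/* Sketch only: list unique tgids in one cgroup, then move an entire
 * threadgroup into another by writing its tgid to cgroup.procs. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char buf[64];
	FILE *f;
	pid_t tgid = (argc > 1) ? atoi(argv[1]) : getpid();

	/* one entry per process (tgid), not one per thread */
	f = fopen("/cgroup/foo/cgroup.procs", "r");
	if (f) {
		while (fgets(buf, sizeof(buf), f))
			fputs(buf, stdout);
		fclose(f);
	}

	/* move every thread in the threadgroup at once */
	f = fopen("/cgroup/bar/cgroup.procs", "w");
	if (f) {
		fprintf(f, "%d\n", tgid);
		fclose(f);
	}
	return 0;
}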

Patch #5 modifies the subsystem attach interface so that whole threadgroups
can be attached at a time, which could introduce problems if any subsystem
were to need the old cgroup of every thread being moved.

Patch #6 introduces a new rwsem (per threadgroup) that is taken for reading in
the fork() path, and it reveals a potential race when forking before a
subsystem's attach() function is called; one possible solution, in case any
subsystem has this problem, is to hang on to the group's fork mutex through
the attach() calls.
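
The locking shape is roughly the following (a sketch only; the names and the
placement of the rwsem are illustrative, not lifted from the patch):

#include <linux/rwsem.h>

/*
 * Illustrative only: a per-threadgroup rwsem, taken for reading around the
 * fork() path and for writing while a whole threadgroup is being attached,
 * so that no new thread can slip in mid-attach.
 */
static void example_fork_side(struct rw_semaphore *threadgroup_fork_sem)
{
	down_read(threadgroup_fork_sem);
	/* ... copy_process() links the new thread into its group ... */
	up_read(threadgroup_fork_sem);
}

static void example_attach_side(struct rw_semaphore *threadgroup_fork_sem)
{
	down_write(threadgroup_fork_sem);
	/* ... walk the threadgroup, attaching each thread ... */
	up_write(threadgroup_fork_sem);
}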

This patch series was written at the same time as Li Zefan's pid namespace
bugfix patch (from http://lkml.org/lkml/2009/7/1/559 ), and contains a similar
but finer-grained fix for the same bug. These patches can either be rewritten
to be applied on top of Li's patch, or be applied as they are with Li's patch
reversed.

---

Ben Blum (6):
      Makes procs file writable to move all threads by tgid at once
      Lets ss->can_attach and ss->attach do whole threadgroups at a time
      Changes css_set freeing mechanism to be under RCU
      Quick vmalloc vs kmalloc fix to the case where array size is too large
      Ensures correct concurrent opening/reading of pidlists across pid namespaces
      Adds a read-only "procs" file similar to "tasks" that shows only unique tgids


 Documentation/cgroups/cgroups.txt |   12 +
 include/linux/cgroup.h            |   68 ++-
 include/linux/init_task.h         |    9 
 include/linux/sched.h             |    2 
 kernel/cgroup.c                   |  810 ++++++++++++++++++++++++++++++-------
 kernel/cgroup_freezer.c           |   15 +
 kernel/cpuset.c                   |   65 ++-
 kernel/fork.c                     |    6 
 kernel/ns_cgroup.c                |   16 +
 kernel/sched.c                    |   37 ++
 mm/memcontrol.c                   |    3 
 security/device_cgroup.c          |    3 
 12 files changed, 855 insertions(+), 191 deletions(-)


* [PATCH 1/6] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids
  2009-07-31  1:51 [PATCH v2 0/6] CGroups: cgroup memberlist enhancement+fix Ben Blum
@ 2009-07-31  1:51 ` Ben Blum
  2009-07-31  1:51 ` [PATCH 2/6] Ensures correct concurrent opening/reading of pidlists across pid namespaces Ben Blum
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 100+ messages in thread
From: Ben Blum @ 2009-07-31  1:51 UTC (permalink / raw)
  To: linux-kernel, containers, akpm, serue, lizf, menage, bblum

Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

struct cgroup used to have a bunch of fields for keeping track of the pidlist
for the tasks file. Those are now separated out into a new struct
cgroup_pidlist, of which each cgroup holds two: one for procs and one for
tasks. The way the seq_file operations are set up is also changed so that just
the pidlist struct gets passed around as the private data.

Interface example: Suppose a multithreaded process has pid 1000 and other
threads with ids 1001, 1002, 1003:
$ cat tasks
1000
1001
1002
1003
$ cat cgroup.procs
1000
$

Signed-off-by: Ben Blum <bblum@google.com>

---

 include/linux/cgroup.h |   22 ++--
 kernel/cgroup.c        |  278 ++++++++++++++++++++++++++++++------------------
 2 files changed, 186 insertions(+), 114 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 665fa70..8a3a3ac 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -141,6 +141,17 @@ enum {
 	CGRP_WAIT_ON_RMDIR,
 };
 
+struct cgroup_pidlist {
+	/* protects the other fields */
+	struct rw_semaphore mutex;
+	/* array of xids */
+	pid_t *list;
+	/* how many elements the above list has */
+	int length;
+	/* how many files are using the current array */
+	int use_count;
+};
+
 struct cgroup {
 	unsigned long flags;		/* "unsigned long" so bitops work */
 
@@ -179,14 +190,9 @@ struct cgroup {
 	 */
 	struct list_head release_list;
 
-	/* pids_mutex protects the fields below */
-	struct rw_semaphore pids_mutex;
-	/* Array of process ids in the cgroup */
-	pid_t *tasks_pids;
-	/* How many files are using the current tasks_pids array */
-	int pids_use_count;
-	/* Length of the current tasks_pids array */
-	int pids_length;
+	/* we will have two separate pidlists, one for pids (the tasks file)
+	 * and one for tgids (the procs file). */
+	struct cgroup_pidlist tasks, procs;
 
 	/* For RCU-protected deletion */
 	struct rcu_head rcu_head;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3737a68..4ceee3f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -960,7 +960,8 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 	INIT_LIST_HEAD(&cgrp->children);
 	INIT_LIST_HEAD(&cgrp->css_sets);
 	INIT_LIST_HEAD(&cgrp->release_list);
-	init_rwsem(&cgrp->pids_mutex);
+	init_rwsem(&(cgrp->tasks.mutex));
+	init_rwsem(&(cgrp->procs.mutex));
 }
 static void init_cgroup_root(struct cgroupfs_root *root)
 {
@@ -1408,15 +1409,6 @@ static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
 	return ret;
 }
 
-/* The various types of files and directories in a cgroup file system */
-enum cgroup_filetype {
-	FILE_ROOT,
-	FILE_DIR,
-	FILE_TASKLIST,
-	FILE_NOTIFY_ON_RELEASE,
-	FILE_RELEASE_AGENT,
-};
-
 /**
  * cgroup_lock_live_group - take cgroup_mutex and check that cgrp is alive.
  * @cgrp: the cgroup to be checked for liveness
@@ -2114,7 +2106,7 @@ int cgroup_scan_tasks(struct cgroup_scanner *scan)
 }
 
 /*
- * Stuff for reading the 'tasks' file.
+ * Stuff for reading the 'tasks'/'procs' files.
  *
  * Reading this file can return large amounts of data if a cgroup has
  * *lots* of attached tasks. So it may need several calls to read(),
@@ -2124,27 +2116,106 @@ int cgroup_scan_tasks(struct cgroup_scanner *scan)
  */
 
 /*
- * Load into 'pidarray' up to 'npids' of the tasks using cgroup
- * 'cgrp'.  Return actual number of pids loaded.  No need to
- * task_lock(p) when reading out p->cgroup, since we're in an RCU
- * read section, so the css_set can't go away, and is
- * immutable after creation.
+ * pidlist_uniq - given a kmalloc()ed list, strip out all duplicate entries
+ * If the new stripped list is sufficiently smaller and there's enough memory
+ * to allocate a new buffer, will let go of the unneeded memory. Returns the
+ * number of unique elements.
  */
-static int pid_array_load(pid_t *pidarray, int npids, struct cgroup *cgrp)
+/* is the size difference enough that we should re-allocate the array? */
+#define PIDLIST_REALLOC_DIFFERENCE(old, new) ((old) - PAGE_SIZE >= (new))
+static int pidlist_uniq(pid_t **p, int length)
 {
-	int n = 0, pid;
+	int src, dest = 1;
+	pid_t *list = *p;
+	pid_t *newlist;
+
+	/*
+	 * we presume the 0th element is unique, so i starts at 1. trivial
+	 * edge cases first; no work needs to be done for either
+	 */
+	if (length == 0 || length == 1)
+		return length;
+	/* src and dest walk down the list; dest counts unique elements */
+	for (src = 1; src < length; src++) {
+		/* find next unique element */
+		while (list[src] == list[src-1]) {
+			src++;
+			if (src == length)
+				goto after;
+		}
+		/* dest always points to where the next unique element goes */
+		list[dest] = list[src];
+		dest++;
+	}
+after:
+	/*
+	 * if the length difference is large enough, we want to allocate a
+	 * smaller buffer to save memory. if this fails due to out of memory,
+	 * we'll just stay with what we've got.
+	 */
+	if (PIDLIST_REALLOC_DIFFERENCE(length, dest)) {
+		newlist = krealloc(list, dest * sizeof(pid_t), GFP_KERNEL);
+		if (newlist)
+			*p = newlist;
+	}
+	return dest;
+}
+
+static int cmppid(const void *a, const void *b)
+{
+	return *(pid_t *)a - *(pid_t *)b;
+}
+
+/*
+ * Load a cgroup's pidarray with either procs' tgids or tasks' pids
+ */
+static int pidlist_array_load(struct cgroup *cgrp, bool procs)
+{
+	pid_t *array;
+	int length;
+	int pid, n = 0; /* used for populating the array */
 	struct cgroup_iter it;
 	struct task_struct *tsk;
+	struct cgroup_pidlist *l;
+
+	/*
+	 * If cgroup gets more users after we read count, we won't have
+	 * enough space - tough.  This race is indistinguishable to the
+	 * caller from the case that the additional cgroup users didn't
+	 * show up until sometime later on.
+	 */
+	length = cgroup_task_count(cgrp);
+	array = kmalloc(length * sizeof(pid_t), GFP_KERNEL);
+	if (!array)
+		return -ENOMEM;
+	/* now, populate the array */
 	cgroup_iter_start(cgrp, &it);
 	while ((tsk = cgroup_iter_next(cgrp, &it))) {
-		if (unlikely(n == npids))
+		if (unlikely(n == length))
 			break;
-		pid = task_pid_vnr(tsk);
-		if (pid > 0)
-			pidarray[n++] = pid;
+		/* get tgid or pid for procs or tasks file respectively */
+		pid = (procs ? task_tgid_vnr(tsk) : task_pid_vnr(tsk));
+		if (pid > 0) /* make sure to only use valid results */
+			array[n++] = pid;
 	}
 	cgroup_iter_end(cgrp, &it);
-	return n;
+	length = n;
+	/* now sort & (if procs) strip out duplicates */
+	sort(array, length, sizeof(pid_t), cmppid, NULL);
+	if (procs) {
+		length = pidlist_uniq(&array, length);
+		l = &(cgrp->procs);
+	} else {
+		l = &(cgrp->tasks);
+	}
+	/* store array in cgroup, freeing old if necessary */
+	down_write(&l->mutex);
+	kfree(l->list);
+	l->list = array;
+	l->length = length;
+	l->use_count++;
+	up_write(&l->mutex);
+	return 0;
 }
 
 /**
@@ -2201,19 +2272,14 @@ err:
 	return ret;
 }
 
-static int cmppid(const void *a, const void *b)
-{
-	return *(pid_t *)a - *(pid_t *)b;
-}
-
 
 /*
- * seq_file methods for the "tasks" file. The seq_file position is the
+ * seq_file methods for the tasks/procs files. The seq_file position is the
  * next pid to display; the seq_file iterator is a pointer to the pid
- * in the cgroup->tasks_pids array.
+ * in the cgroup->l->list array.
  */
 
-static void *cgroup_tasks_start(struct seq_file *s, loff_t *pos)
+static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
 {
 	/*
 	 * Initially we receive a position value that corresponds to
@@ -2221,46 +2287,45 @@ static void *cgroup_tasks_start(struct seq_file *s, loff_t *pos)
 	 * after a seek to the start). Use a binary-search to find the
 	 * next pid to display, if any
 	 */
-	struct cgroup *cgrp = s->private;
+	struct cgroup_pidlist *l = s->private;
 	int index = 0, pid = *pos;
 	int *iter;
 
-	down_read(&cgrp->pids_mutex);
+	down_read(&l->mutex);
 	if (pid) {
-		int end = cgrp->pids_length;
+		int end = l->length;
 
 		while (index < end) {
 			int mid = (index + end) / 2;
-			if (cgrp->tasks_pids[mid] == pid) {
+			if (l->list[mid] == pid) {
 				index = mid;
 				break;
-			} else if (cgrp->tasks_pids[mid] <= pid)
+			} else if (l->list[mid] <= pid)
 				index = mid + 1;
 			else
 				end = mid;
 		}
 	}
 	/* If we're off the end of the array, we're done */
-	if (index >= cgrp->pids_length)
+	if (index >= l->length)
 		return NULL;
 	/* Update the abstract position to be the actual pid that we found */
-	iter = cgrp->tasks_pids + index;
+	iter = l->list + index;
 	*pos = *iter;
 	return iter;
 }
 
-static void cgroup_tasks_stop(struct seq_file *s, void *v)
+static void cgroup_pidlist_stop(struct seq_file *s, void *v)
 {
-	struct cgroup *cgrp = s->private;
-	up_read(&cgrp->pids_mutex);
+	struct cgroup_pidlist *l = s->private;
+	up_read(&l->mutex);
 }
 
-static void *cgroup_tasks_next(struct seq_file *s, void *v, loff_t *pos)
+static void *cgroup_pidlist_next(struct seq_file *s, void *v, loff_t *pos)
 {
-	struct cgroup *cgrp = s->private;
-	int *p = v;
-	int *end = cgrp->tasks_pids + cgrp->pids_length;
-
+	struct cgroup_pidlist *l = s->private;
+	pid_t *p = v;
+	pid_t *end = l->list + l->length;
 	/*
 	 * Advance to the next pid in the array. If this goes off the
 	 * end, we're done
@@ -2274,98 +2339,94 @@ static void *cgroup_tasks_next(struct seq_file *s, void *v, loff_t *pos)
 	}
 }
 
-static int cgroup_tasks_show(struct seq_file *s, void *v)
+static int cgroup_pidlist_show(struct seq_file *s, void *v)
 {
 	return seq_printf(s, "%d\n", *(int *)v);
 }
 
-static struct seq_operations cgroup_tasks_seq_operations = {
-	.start = cgroup_tasks_start,
-	.stop = cgroup_tasks_stop,
-	.next = cgroup_tasks_next,
-	.show = cgroup_tasks_show,
+/*
+ * seq_operations functions for iterating on pidlists through seq_file -
+ * independent of whether it's tasks or procs
+ */
+static const struct seq_operations cgroup_pidlist_seq_operations = {
+	.start = cgroup_pidlist_start,
+	.stop = cgroup_pidlist_stop,
+	.next = cgroup_pidlist_next,
+	.show = cgroup_pidlist_show,
 };
 
-static void release_cgroup_pid_array(struct cgroup *cgrp)
+static void cgroup_release_pid_array(struct cgroup_pidlist *l)
 {
-	down_write(&cgrp->pids_mutex);
-	BUG_ON(!cgrp->pids_use_count);
-	if (!--cgrp->pids_use_count) {
-		kfree(cgrp->tasks_pids);
-		cgrp->tasks_pids = NULL;
-		cgrp->pids_length = 0;
+	down_write(&l->mutex);
+	BUG_ON(!l->use_count);
+	if (!--l->use_count) {
+		kfree(l->list);
+		l->list = NULL;
+		l->length = 0;
 	}
-	up_write(&cgrp->pids_mutex);
+	up_write(&l->mutex);
 }
 
-static int cgroup_tasks_release(struct inode *inode, struct file *file)
+static int cgroup_pidlist_release(struct inode *inode, struct file *file)
 {
-	struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
-
+	struct cgroup_pidlist *l;
 	if (!(file->f_mode & FMODE_READ))
 		return 0;
-
-	release_cgroup_pid_array(cgrp);
+	/*
+	 * the seq_file will only be initialized if the file was opened for
+	 * reading; hence we check if it's not null only in that case.
+	 */
+	l = ((struct seq_file *)file->private_data)->private;
+	cgroup_release_pid_array(l);
 	return seq_release(inode, file);
 }
 
-static struct file_operations cgroup_tasks_operations = {
+static const struct file_operations cgroup_pidlist_operations = {
 	.read = seq_read,
 	.llseek = seq_lseek,
 	.write = cgroup_file_write,
-	.release = cgroup_tasks_release,
+	.release = cgroup_pidlist_release,
 };
 
 /*
- * Handle an open on 'tasks' file.  Prepare an array containing the
- * process id's of tasks currently attached to the cgroup being opened.
+ * The following functions handle opens on a file that displays a pidlist
+ * (tasks or procs). Prepare an array of the process/thread IDs of whoever's
+ * in the cgroup.
  */
-
-static int cgroup_tasks_open(struct inode *unused, struct file *file)
+/* helper function for the two below it */
+static int cgroup_pidlist_open(struct file *file, bool procs)
 {
 	struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
-	pid_t *pidarray;
-	int npids;
+	struct cgroup_pidlist *l = (procs ? &cgrp->procs : &cgrp->tasks);
 	int retval;
 
 	/* Nothing to do for write-only files */
 	if (!(file->f_mode & FMODE_READ))
 		return 0;
 
-	/*
-	 * If cgroup gets more users after we read count, we won't have
-	 * enough space - tough.  This race is indistinguishable to the
-	 * caller from the case that the additional cgroup users didn't
-	 * show up until sometime later on.
-	 */
-	npids = cgroup_task_count(cgrp);
-	pidarray = kmalloc(npids * sizeof(pid_t), GFP_KERNEL);
-	if (!pidarray)
-		return -ENOMEM;
-	npids = pid_array_load(pidarray, npids, cgrp);
-	sort(pidarray, npids, sizeof(pid_t), cmppid, NULL);
-
-	/*
-	 * Store the array in the cgroup, freeing the old
-	 * array if necessary
-	 */
-	down_write(&cgrp->pids_mutex);
-	kfree(cgrp->tasks_pids);
-	cgrp->tasks_pids = pidarray;
-	cgrp->pids_length = npids;
-	cgrp->pids_use_count++;
-	up_write(&cgrp->pids_mutex);
-
-	file->f_op = &cgroup_tasks_operations;
+	/* have the array populated */
+	retval = pidlist_array_load(cgrp, procs);
+	if (retval)
+		return retval;
+	/* configure file information */
+	file->f_op = &cgroup_pidlist_operations;
 
-	retval = seq_open(file, &cgroup_tasks_seq_operations);
+	retval = seq_open(file, &cgroup_pidlist_seq_operations);
 	if (retval) {
-		release_cgroup_pid_array(cgrp);
+		cgroup_release_pid_array(l);
 		return retval;
 	}
-	((struct seq_file *)file->private_data)->private = cgrp;
+	((struct seq_file *)file->private_data)->private = l;
 	return 0;
 }
+static int cgroup_tasks_open(struct inode *unused, struct file *file)
+{
+	return cgroup_pidlist_open(file, false);
+}
+static int cgroup_procs_open(struct inode *unused, struct file *file)
+{
+	return cgroup_pidlist_open(file, true);
+}
 
 static u64 cgroup_read_notify_on_release(struct cgroup *cgrp,
 					    struct cftype *cft)
@@ -2388,21 +2449,27 @@ static int cgroup_write_notify_on_release(struct cgroup *cgrp,
 /*
  * for the common functions, 'private' gives the type of file
  */
+/* for hysterical raisins, we can't put this on the older files */
+#define CGROUP_FILE_GENERIC_PREFIX "cgroup."
 static struct cftype files[] = {
 	{
 		.name = "tasks",
 		.open = cgroup_tasks_open,
 		.write_u64 = cgroup_tasks_write,
-		.release = cgroup_tasks_release,
-		.private = FILE_TASKLIST,
+		.release = cgroup_pidlist_release,
 		.mode = S_IRUGO | S_IWUSR,
 	},
-
+	{
+		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
+		.open = cgroup_procs_open,
+		/* .write_u64 = cgroup_procs_write, TODO */
+		.release = cgroup_pidlist_release,
+		.mode = S_IRUGO,
+	},
 	{
 		.name = "notify_on_release",
 		.read_u64 = cgroup_read_notify_on_release,
 		.write_u64 = cgroup_write_notify_on_release,
-		.private = FILE_NOTIFY_ON_RELEASE,
 	},
 };
 
@@ -2411,7 +2478,6 @@ static struct cftype cft_release_agent = {
 	.read_seq_string = cgroup_release_agent_show,
 	.write_string = cgroup_release_agent_write,
 	.max_write_len = PATH_MAX,
-	.private = FILE_RELEASE_AGENT,
 };
 
 static int cgroup_populate_dir(struct cgroup *cgrp)


* [PATCH 2/6] Ensures correct concurrent opening/reading of pidlists across pid namespaces
  2009-07-31  1:51 [PATCH v2 0/6] CGroups: cgroup memberlist enhancement+fix Ben Blum
  2009-07-31  1:51 ` [PATCH 1/6] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids Ben Blum
@ 2009-07-31  1:51 ` Ben Blum
  2009-07-31  1:51 ` [PATCH 4/6] Changes css_set freeing mechanism to be under RCU Ben Blum
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 100+ messages in thread
From: Ben Blum @ 2009-07-31  1:51 UTC (permalink / raw)
  To: linux-kernel, containers, akpm, serue, lizf, menage, bblum

Ensures correct concurrent opening/reading of pidlists across pid namespaces

Previously, two processes from different pid namespaces reading the tasks or
procs file could each end up seeing results from the other's namespace. Rather
than one pidlist for each file in a cgroup, we now keep a list of pidlists
keyed by namespace and file type (tasks versus procs), in which entries are
created on demand. Each pidlist has its own lock, and because the pidlists
themselves are passed around in the seq_file's private pointer, we don't have
to touch the cgroup or its master list except when creating and destroying
entries.
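
For reference, a simplified userspace sketch of the two-namespace scenario
(the /cgroup mount point and the use of CLONE_NEWPID, which needs
CAP_SYS_ADMIN, are assumptions; the real issue is with concurrent readers,
which this sequential sketch only hints at):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Sketch only: read the same cgroup's tasks file from the initial pid
 * namespace and from a fresh one; each reader should see pids translated
 * into its own namespace, never the other reader's.
 */
static void dump(const char *who, const char *path)
{
	char buf[64];
	FILE *f = fopen(path, "r");

	if (!f)
		return;
	while (fgets(buf, sizeof(buf), f))
		printf("%s sees: %s", who, buf);
	fclose(f);
}

static int child(void *path)
{
	dump("child ns", path);	/* runs inside the new pid namespace */
	return 0;
}

int main(void)
{
	static char stack[64 * 1024];
	static char path[] = "/cgroup/foo/tasks";	/* assumed mount point */
	pid_t pid;

	dump("init ns", path);
	pid = clone(child, stack + sizeof(stack), CLONE_NEWPID | SIGCHLD, path);
	if (pid > 0)
		waitpid(pid, NULL, 0);
	return 0;
}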

Signed-off-by: Ben Blum <bblum@google.com>

---

 include/linux/cgroup.h |   34 +++++++++++++--
 kernel/cgroup.c        |  108 ++++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 120 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 8a3a3ac..b934b72 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -141,15 +141,36 @@ enum {
 	CGRP_WAIT_ON_RMDIR,
 };
 
+/* which pidlist file are we talking about? */
+enum cgroup_filetype {
+	CGROUP_FILE_PROCS,
+	CGROUP_FILE_TASKS,
+};
+
+/*
+ * A pidlist is a list of pids that virtually represents the contents of one
+ * of the cgroup files ("procs" or "tasks"). We keep a list of such pidlists,
+ * a pair (one each for procs, tasks) for each pid namespace that's relevant
+ * to the cgroup.
+ */
 struct cgroup_pidlist {
-	/* protects the other fields */
-	struct rw_semaphore mutex;
+	/*
+	 * used to find which pidlist is wanted. doesn't change as long as
+	 * this particular list stays in the list.
+	 */
+	struct { enum cgroup_filetype type; struct pid_namespace *ns; } key;
 	/* array of xids */
 	pid_t *list;
 	/* how many elements the above list has */
 	int length;
 	/* how many files are using the current array */
 	int use_count;
+	/* each of these stored in a list by its cgroup */
+	struct list_head links;
+	/* pointer to the cgroup we belong to, for list removal purposes */
+	struct cgroup *owner;
+	/* protects the other fields */
+	struct rw_semaphore mutex;
 };
 
 struct cgroup {
@@ -190,9 +211,12 @@ struct cgroup {
 	 */
 	struct list_head release_list;
 
-	/* we will have two separate pidlists, one for pids (the tasks file)
-	 * and one for tgids (the procs file). */
-	struct cgroup_pidlist tasks, procs;
+	/*
+	 * list of pidlists, up to two for each namespace (one for procs, one
+	 * for tasks); created on demand.
+	 */
+	struct list_head pidlists;
+	struct mutex pidlist_mutex;
 
 	/* For RCU-protected deletion */
 	struct rcu_head rcu_head;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 4ceee3f..cdd0112 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -47,6 +47,7 @@
 #include <linux/hash.h>
 #include <linux/namei.h>
 #include <linux/smp_lock.h>
+#include <linux/pid_namespace.h>
 
 #include <asm/atomic.h>
 
@@ -675,6 +676,12 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
 		 */
 		deactivate_super(cgrp->root->sb);
 
+		/*
+		 * if we're getting rid of the cgroup, refcount should ensure
+		 * that there are no pidlists left.
+		 */
+		BUG_ON(!list_empty(&cgrp->pidlists));
+
 		call_rcu(&cgrp->rcu_head, free_cgroup_rcu);
 	}
 	iput(inode);
@@ -960,8 +967,8 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 	INIT_LIST_HEAD(&cgrp->children);
 	INIT_LIST_HEAD(&cgrp->css_sets);
 	INIT_LIST_HEAD(&cgrp->release_list);
-	init_rwsem(&(cgrp->tasks.mutex));
-	init_rwsem(&(cgrp->procs.mutex));
+	INIT_LIST_HEAD(&cgrp->pidlists);
+	mutex_init(&cgrp->pidlist_mutex);
 }
 static void init_cgroup_root(struct cgroupfs_root *root)
 {
@@ -2167,9 +2174,59 @@ static int cmppid(const void *a, const void *b)
 }
 
 /*
+ * find the appropriate pidlist for our purpose (given procs vs tasks)
+ * returns with the lock on that pidlist already held, and takes care
+ * of the use count, or returns NULL with no locks held if we're out of
+ * memory.
+ */
+static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp,
+						  enum cgroup_filetype type)
+{
+	struct cgroup_pidlist *l;
+	/* don't need task_nsproxy() if we're looking at ourself */
+	struct pid_namespace *ns = get_pid_ns(current->nsproxy->pid_ns);
+	/*
+	 * We can't drop the pidlist_mutex before taking the l->mutex in case
+	 * the last ref-holder is trying to remove l from the list at the same
+	 * time. Holding the pidlist_mutex precludes somebody taking whichever
+	 * list we find out from under us - compare release_pid_array().
+	 */
+	mutex_lock(&cgrp->pidlist_mutex);
+	list_for_each_entry(l, &cgrp->pidlists, links) {
+		if (l->key.type == type && l->key.ns == ns) {
+			/* found a matching list - drop the extra refcount */
+			put_pid_ns(ns);
+			/* make sure l doesn't vanish out from under us */
+			down_write(&l->mutex);
+			mutex_unlock(&cgrp->pidlist_mutex);
+			l->use_count++;
+			return l;
+		}
+	}
+	/* entry not found; create a new one */
+	l = kmalloc(sizeof(struct cgroup_pidlist), GFP_KERNEL);
+	if (!l) {
+		mutex_unlock(&cgrp->pidlist_mutex);
+		put_pid_ns(ns);
+		return l;
+	}
+	init_rwsem(&l->mutex);
+	down_write(&l->mutex);
+	l->key.type = type;
+	l->key.ns = ns;
+	l->use_count = 0; /* don't increment here */
+	l->list = NULL;
+	l->owner = cgrp;
+	list_add(&l->links, &cgrp->pidlists);
+	mutex_unlock(&cgrp->pidlist_mutex);
+	return l;
+}
+
+/*
  * Load a cgroup's pidarray with either procs' tgids or tasks' pids
  */
-static int pidlist_array_load(struct cgroup *cgrp, bool procs)
+static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
+			      struct cgroup_pidlist **lp)
 {
 	pid_t *array;
 	int length;
@@ -2194,7 +2251,10 @@ static int pidlist_array_load(struct cgroup *cgrp, bool procs)
 		if (unlikely(n == length))
 			break;
 		/* get tgid or pid for procs or tasks file respectively */
-		pid = (procs ? task_tgid_vnr(tsk) : task_pid_vnr(tsk));
+		if (type == CGROUP_FILE_PROCS)
+			pid = task_tgid_vnr(tsk);
+		else
+			pid = task_pid_vnr(tsk);
 		if (pid > 0) /* make sure to only use valid results */
 			array[n++] = pid;
 	}
@@ -2202,19 +2262,20 @@ static int pidlist_array_load(struct cgroup *cgrp, bool procs)
 	length = n;
 	/* now sort & (if procs) strip out duplicates */
 	sort(array, length, sizeof(pid_t), cmppid, NULL);
-	if (procs) {
+	if (type == CGROUP_FILE_PROCS)
 		length = pidlist_uniq(&array, length);
-		l = &(cgrp->procs);
-	} else {
-		l = &(cgrp->tasks);
+	l = cgroup_pidlist_find(cgrp, type);
+	if (!l) {
+		kfree(array);
+		return -ENOMEM;
 	}
-	/* store array in cgroup, freeing old if necessary */
-	down_write(&l->mutex);
+	/* store array, freeing old if necessary - lock already held */
 	kfree(l->list);
 	l->list = array;
 	l->length = length;
 	l->use_count++;
 	up_write(&l->mutex);
+	*lp = l;
 	return 0;
 }
 
@@ -2357,13 +2418,26 @@ static const struct seq_operations cgroup_pidlist_seq_operations = {
 
 static void cgroup_release_pid_array(struct cgroup_pidlist *l)
 {
+	/*
+	 * the case where we're the last user of this particular pidlist will
+	 * have us remove it from the cgroup's list, which entails taking the
+	 * mutex. since in pidlist_find the pidlist->lock depends on cgroup->
+	 * pidlist_mutex, we have to take pidlist_mutex first.
+	 */
+	mutex_lock(&l->owner->pidlist_mutex);
 	down_write(&l->mutex);
 	BUG_ON(!l->use_count);
 	if (!--l->use_count) {
+		/* we're the last user if refcount is 0; remove and free */
+		list_del(&l->links);
+		mutex_unlock(&l->owner->pidlist_mutex);
 		kfree(l->list);
-		l->list = NULL;
-		l->length = 0;
+		put_pid_ns(l->key.ns);
+		up_write(&l->mutex);
+		kfree(l);
+		return;
 	}
+	mutex_unlock(&l->owner->pidlist_mutex);
 	up_write(&l->mutex);
 }
 
@@ -2394,10 +2468,10 @@ static const struct file_operations cgroup_pidlist_operations = {
  * in the cgroup.
  */
 /* helper function for the two below it */
-static int cgroup_pidlist_open(struct file *file, bool procs)
+static int cgroup_pidlist_open(struct file *file, enum cgroup_filetype type)
 {
 	struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
-	struct cgroup_pidlist *l = (procs ? &cgrp->procs : &cgrp->tasks);
+	struct cgroup_pidlist *l;
 	int retval;
 
 	/* Nothing to do for write-only files */
@@ -2405,7 +2479,7 @@ static int cgroup_pidlist_open(struct file *file, bool procs)
 		return 0;
 
 	/* have the array populated */
-	retval = pidlist_array_load(cgrp, procs);
+	retval = pidlist_array_load(cgrp, type, &l);
 	if (retval)
 		return retval;
 	/* configure file information */
@@ -2421,11 +2495,11 @@ static int cgroup_pidlist_open(struct file *file, bool procs)
 }
 static int cgroup_tasks_open(struct inode *unused, struct file *file)
 {
-	return cgroup_pidlist_open(file, false);
+	return cgroup_pidlist_open(file, CGROUP_FILE_TASKS);
 }
 static int cgroup_procs_open(struct inode *unused, struct file *file)
 {
-	return cgroup_pidlist_open(file, true);
+	return cgroup_pidlist_open(file, CGROUP_FILE_PROCS);
 }
 
 static u64 cgroup_read_notify_on_release(struct cgroup *cgrp,


* [PATCH 3/6] Quick vmalloc vs kmalloc fix to the case where array size is too large
  2009-07-31  1:51 [PATCH v2 0/6] CGroups: cgroup memberlist enhancement+fix Ben Blum
@ 2009-07-31  1:51     ` Ben Blum
  2009-07-31  1:51 ` [PATCH 2/6] Ensures correct concurrent opening/reading of pidlists across pid namespaces Ben Blum
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 100+ messages in thread
From: Ben Blum @ 2009-07-31  1:51 UTC (permalink / raw)
  To: linux-kernel, containers, akpm, serue, lizf, menage, bblum

Quick vmalloc vs kmalloc fix to the case where array size is too large

Moves all pidlist allocation requests into a helper function that decides,
based on the requested size, whether the array needs to be vmalloc()ed or can
be obtained via kmalloc(), and does the same for freeing with vfree/kfree.
This should eventually be replaced entirely with a kernel-wide solution to
this general problem.
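
For a sense of scale, a back-of-the-envelope check of the threshold used
below (4 KiB pages and a 4-byte pid_t are assumptions about the target
config): the kmalloc path covers lists of up to 2 * 4096 / 4 = 2048 pids,
and anything larger goes through vmalloc.

#include <stdio.h>

/* Sketch only: mirrors PIDLIST_TOO_LARGE(c): c * sizeof(pid_t) > PAGE_SIZE * 2 */
int main(void)
{
	unsigned long page_size = 4096;	/* assumed PAGE_SIZE */
	unsigned long pid_size = 4;	/* assumed sizeof(pid_t) */

	printf("kmalloc handles up to %lu pids; beyond that, vmalloc\n",
	       2 * page_size / pid_size);
	return 0;
}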

Depends on cgroup-pidlist-namespace.patch, cgroup-procs.patch

Signed-off-by: Ben Blum <bblum@google.com>

---

 kernel/cgroup.c |   47 ++++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index cdd0112..f6d5969 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -48,6 +48,7 @@
 #include <linux/namei.h>
 #include <linux/smp_lock.h>
 #include <linux/pid_namespace.h>
+#include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 
 #include <asm/atomic.h>
 
@@ -2123,6 +2124,42 @@ int cgroup_scan_tasks(struct cgroup_scanner *scan)
  */
 
 /*
+ * The following two functions "fix" the issue where there are more pids
+ * than kmalloc will give memory for; in such cases, we use vmalloc/vfree.
+ * TODO: replace with a kernel-wide solution to this problem
+ */
+#define PIDLIST_TOO_LARGE(c) ((c) * sizeof(pid_t) > (PAGE_SIZE * 2))
+static void *pidlist_allocate(int count)
+{
+	if (PIDLIST_TOO_LARGE(count))
+		return vmalloc(count * sizeof(pid_t));
+	else
+		return kmalloc(count * sizeof(pid_t), GFP_KERNEL);
+}
+static void pidlist_free(void *p)
+{
+	if (is_vmalloc_addr(p))
+		vfree(p);
+	else
+		kfree(p);
+}
+static void *pidlist_resize(void *p, int newcount)
+{
+	void *newlist;
+	/* note: if new alloc fails, old p will still be valid either way */
+	if (is_vmalloc_addr(p)) {
+		newlist = vmalloc(newcount * sizeof(pid_t));
+		if (!newlist)
+			return NULL;
+		memcpy(newlist, p, newcount * sizeof(pid_t));
+		vfree(p);
+	} else {
+		newlist = krealloc(p, newcount * sizeof(pid_t), GFP_KERNEL);
+	}
+	return newlist;
+}
+
+/*
  * pidlist_uniq - given a kmalloc()ed list, strip out all duplicate entries
  * If the new stripped list is sufficiently smaller and there's enough memory
  * to allocate a new buffer, will let go of the unneeded memory. Returns the
@@ -2161,7 +2198,7 @@ after:
 	 * we'll just stay with what we've got.
 	 */
 	if (PIDLIST_REALLOC_DIFFERENCE(length, dest)) {
-		newlist = krealloc(list, dest * sizeof(pid_t), GFP_KERNEL);
+		newlist = pidlist_resize(list, dest);
 		if (newlist)
 			*p = newlist;
 	}
@@ -2242,7 +2279,7 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
 	 * show up until sometime later on.
 	 */
 	length = cgroup_task_count(cgrp);
-	array = kmalloc(length * sizeof(pid_t), GFP_KERNEL);
+	array = pidlist_allocate(length);
 	if (!array)
 		return -ENOMEM;
 	/* now, populate the array */
@@ -2266,11 +2303,11 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
 		length = pidlist_uniq(&array, length);
 	l = cgroup_pidlist_find(cgrp, type);
 	if (!l) {
-		kfree(array);
+		pidlist_free(array);
 		return -ENOMEM;
 	}
 	/* store array, freeing old if necessary - lock already held */
-	kfree(l->list);
+	pidlist_free(l->list);
 	l->list = array;
 	l->length = length;
 	l->use_count++;
@@ -2431,7 +2468,7 @@ static void cgroup_release_pid_array(struct cgroup_pidlist *l)
 		/* we're the last user if refcount is 0; remove and free */
 		list_del(&l->links);
 		mutex_unlock(&l->owner->pidlist_mutex);
-		kfree(l->list);
+		pidlist_free(l->list);
 		put_pid_ns(l->key.ns);
 		up_write(&l->mutex);
 		kfree(l);

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH 3/6] Quick vmalloc vs kmalloc fix to the case where array size is too large
@ 2009-07-31  1:51     ` Ben Blum
  0 siblings, 0 replies; 100+ messages in thread
From: Ben Blum @ 2009-07-31  1:51 UTC (permalink / raw)
  To: linux-kernel, containers, akpm, serue, lizf, menage, bblum

Quick vmalloc vs kmalloc fix to the case where array size is too large

Moves all pidlist allocation requests into a single helper that decides, based
on the requested size, whether the array needs to be vmalloced or can be
obtained via kmalloc, with a matching helper for kfree/vfree. Should be
replaced entirely by a kernel-wide solution to this general problem.

Depends on cgroup-pidlist-namespace.patch, cgroup-procs.patch

Signed-off-by: Ben Blum <bblum@google.com>

---

 kernel/cgroup.c |   47 ++++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index cdd0112..f6d5969 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -48,6 +48,7 @@
 #include <linux/namei.h>
 #include <linux/smp_lock.h>
 #include <linux/pid_namespace.h>
+#include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 
 #include <asm/atomic.h>
 
@@ -2123,6 +2124,42 @@ int cgroup_scan_tasks(struct cgroup_scanner *scan)
  */
 
 /*
+ * The following functions "fix" the issue where there are more pids
+ * than kmalloc will give memory for; in such cases, we use vmalloc/vfree.
+ * TODO: replace with a kernel-wide solution to this problem
+ */
+#define PIDLIST_TOO_LARGE(c) ((c) * sizeof(pid_t) > (PAGE_SIZE * 2))
+static void *pidlist_allocate(int count)
+{
+	if (PIDLIST_TOO_LARGE(count))
+		return vmalloc(count * sizeof(pid_t));
+	else
+		return kmalloc(count * sizeof(pid_t), GFP_KERNEL);
+}
+static void pidlist_free(void *p)
+{
+	if (is_vmalloc_addr(p))
+		vfree(p);
+	else
+		kfree(p);
+}
+static void *pidlist_resize(void *p, int newcount)
+{
+	void *newlist;
+	/* note: if the new alloc fails, old p stays valid; callers only shrink */
+	if (is_vmalloc_addr(p)) {
+		newlist = vmalloc(newcount * sizeof(pid_t));
+		if (!newlist)
+			return NULL;
+		memcpy(newlist, p, newcount * sizeof(pid_t));
+		vfree(p);
+	} else {
+		newlist = krealloc(p, newcount * sizeof(pid_t), GFP_KERNEL);
+	}
+	return newlist;
+}
+
+/*
  * pidlist_uniq - given a kmalloc()ed list, strip out all duplicate entries
  * If the new stripped list is sufficiently smaller and there's enough memory
  * to allocate a new buffer, will let go of the unneeded memory. Returns the
@@ -2161,7 +2198,7 @@ after:
 	 * we'll just stay with what we've got.
 	 */
 	if (PIDLIST_REALLOC_DIFFERENCE(length, dest)) {
-		newlist = krealloc(list, dest * sizeof(pid_t), GFP_KERNEL);
+		newlist = pidlist_resize(list, dest);
 		if (newlist)
 			*p = newlist;
 	}
@@ -2242,7 +2279,7 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
 	 * show up until sometime later on.
 	 */
 	length = cgroup_task_count(cgrp);
-	array = kmalloc(length * sizeof(pid_t), GFP_KERNEL);
+	array = pidlist_allocate(length);
 	if (!array)
 		return -ENOMEM;
 	/* now, populate the array */
@@ -2266,11 +2303,11 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
 		length = pidlist_uniq(&array, length);
 	l = cgroup_pidlist_find(cgrp, type);
 	if (!l) {
-		kfree(array);
+		pidlist_free(array);
 		return -ENOMEM;
 	}
 	/* store array, freeing old if necessary - lock already held */
-	kfree(l->list);
+	pidlist_free(l->list);
 	l->list = array;
 	l->length = length;
 	l->use_count++;
@@ -2431,7 +2468,7 @@ static void cgroup_release_pid_array(struct cgroup_pidlist *l)
 		/* we're the last user if refcount is 0; remove and free */
 		list_del(&l->links);
 		mutex_unlock(&l->owner->pidlist_mutex);
-		kfree(l->list);
+		pidlist_free(l->list);
 		put_pid_ns(l->key.ns);
 		up_write(&l->mutex);
 		kfree(l);


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH 4/6] Changes css_set freeing mechanism to be under RCU
       [not found] ` <20090731012908.27908.62208.stgit-/yCBOHwbXCxd3OlUiQof+WCaruZE5nAUZeezCHUQhQ4@public.gmane.org>
                     ` (2 preceding siblings ...)
  2009-07-31  1:51     ` Ben Blum
@ 2009-07-31  1:51   ` Ben Blum
  2009-07-31  1:51   ` [PATCH 5/6] Lets ss->can_attach and ss->attach do whole threadgroups at a time Ben Blum
  2009-07-31  1:51   ` [PATCH 6/6] Makes procs file writable to move all threads by tgid at once Ben Blum
  5 siblings, 0 replies; 100+ messages in thread
From: Ben Blum @ 2009-07-31  1:51 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A,
	menage-hpIqsD4AKlfQT0dZR+AlfA, bblum-hpIqsD4AKlfQT0dZR+AlfA

Changes css_set freeing mechanism to be under RCU

This is a prepatch for making the procs file writable. In order to free each
moved task's old css_set as the tasks are being moved, the freeing mechanism
must be RCU-protected; otherwise we would need a call to synchronize_rcu() for
each task before freeing its old css_set.
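
For illustration, a minimal sketch of the two freeing strategies contrasted
above, using a hypothetical struct foo; the actual patch below does the same
thing by embedding an rcu_head in struct css_set and freeing it from an RCU
callback:

	#include <linux/kernel.h>
	#include <linux/rcupdate.h>
	#include <linux/slab.h>

	struct foo {
		int data;
		struct rcu_head rcu_head;
	};

	/* blocking approach: one full grace-period wait per object freed */
	static void foo_release_slow(struct foo *f)
	{
		synchronize_rcu();	/* may sleep; unusable in atomic context */
		kfree(f);
	}

	/* deferred approach: queue the free and return immediately */
	static void foo_free_rcu(struct rcu_head *head)
	{
		kfree(container_of(head, struct foo, rcu_head));
	}

	static void foo_release_fast(struct foo *f)
	{
		call_rcu(&f->rcu_head, foo_free_rcu);
	}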

Signed-off-by: Ben Blum <bblum-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

---

 include/linux/cgroup.h |    3 +++
 kernel/cgroup.c        |    8 +++++++-
 2 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b934b72..24e3f1a 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -260,6 +260,9 @@ struct css_set {
 	 * during subsystem registration (at boot time).
 	 */
 	struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
+
+	/* For RCU-protected deletion */
+	struct rcu_head rcu_head;
 };
 
 /*
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f6d5969..a12bc8e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -288,6 +288,12 @@ static void unlink_css_set(struct css_set *cg)
 	}
 }
 
+static void free_css_set_rcu(struct rcu_head *obj)
+{
+	struct css_set *cg = container_of(obj, struct css_set, rcu_head);
+	kfree(cg);
+}
+
 static void __put_css_set(struct css_set *cg, int taskexit)
 {
 	int i;
@@ -317,7 +323,7 @@ static void __put_css_set(struct css_set *cg, int taskexit)
 		}
 	}
 	rcu_read_unlock();
-	kfree(cg);
+	call_rcu(&cg->rcu_head, free_css_set_rcu);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH 4/6] Changes css_set freeing mechanism to be under RCU
  2009-07-31  1:51 [PATCH v2 0/6] CGroups: cgroup memberlist enhancement+fix Ben Blum
  2009-07-31  1:51 ` [PATCH 1/6] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids Ben Blum
  2009-07-31  1:51 ` [PATCH 2/6] Ensures correct concurrent opening/reading of pidlists across pid namespaces Ben Blum
@ 2009-07-31  1:51 ` Ben Blum
  2009-07-31  1:51 ` [PATCH 5/6] Lets ss->can_attach and ss->attach do whole threadgroups at a time Ben Blum
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 100+ messages in thread
From: Ben Blum @ 2009-07-31  1:51 UTC (permalink / raw)
  To: linux-kernel, containers, akpm, serue, lizf, menage, bblum

Changes css_set freeing mechanism to be under RCU

This is a prepatch for making the procs file writable. In order to free each
moved task's old css_set as the tasks are being moved, the freeing mechanism
must be RCU-protected; otherwise we would need a call to synchronize_rcu() for
each task before freeing its old css_set.

Signed-off-by: Ben Blum <bblum@google.com>

---

 include/linux/cgroup.h |    3 +++
 kernel/cgroup.c        |    8 +++++++-
 2 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b934b72..24e3f1a 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -260,6 +260,9 @@ struct css_set {
 	 * during subsystem registration (at boot time).
 	 */
 	struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
+
+	/* For RCU-protected deletion */
+	struct rcu_head rcu_head;
 };
 
 /*
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f6d5969..a12bc8e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -288,6 +288,12 @@ static void unlink_css_set(struct css_set *cg)
 	}
 }
 
+static void free_css_set_rcu(struct rcu_head *obj)
+{
+	struct css_set *cg = container_of(obj, struct css_set, rcu_head);
+	kfree(cg);
+}
+
 static void __put_css_set(struct css_set *cg, int taskexit)
 {
 	int i;
@@ -317,7 +323,7 @@ static void __put_css_set(struct css_set *cg, int taskexit)
 		}
 	}
 	rcu_read_unlock();
-	kfree(cg);
+	call_rcu(&cg->rcu_head, free_css_set_rcu);
 }
 
 /*


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH 5/6] Lets ss->can_attach and ss->attach do whole threadgroups at a time
       [not found] ` <20090731012908.27908.62208.stgit-/yCBOHwbXCxd3OlUiQof+WCaruZE5nAUZeezCHUQhQ4@public.gmane.org>
                     ` (3 preceding siblings ...)
  2009-07-31  1:51   ` [PATCH 4/6] Changes css_set freeing mechanism to be under RCU Ben Blum
@ 2009-07-31  1:51   ` Ben Blum
  2009-07-31  1:51   ` [PATCH 6/6] Makes procs file writable to move all threads by tgid at once Ben Blum
  5 siblings, 0 replies; 100+ messages in thread
From: Ben Blum @ 2009-07-31  1:51 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A,
	menage-hpIqsD4AKlfQT0dZR+AlfA, bblum-hpIqsD4AKlfQT0dZR+AlfA

Lets ss->can_attach and ss->attach do whole threadgroups at a time

This patch alters the ss->can_attach and ss->attach functions to be able to
deal with a whole threadgroup at a time, for use in cgroup_attach_proc. (This
is a pre-patch to cgroup-procs-writable.patch.)

Currently, the new threadgroup mode of the attach function can only tell the
subsystem the old cgroup of the threadgroup leader. No subsystem currently
needs that information for each thread being moved, but if one were added (for
example, one that counts tasks within a group), this interface would need to
be reworked to pass the right information to the subsystem.
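
All of the per-subsystem hunks below follow the same shape when threadgroup is
true: check the leader first, then walk leader->thread_group under RCU. As a
standalone sketch (check_one() is a hypothetical stand-in for whatever
per-task check a given subsystem performs):

	#include <linux/sched.h>
	#include <linux/rculist.h>

	/* hypothetical per-task check; a real subsystem supplies its own */
	static int check_one(struct task_struct *t)
	{
		return 0;
	}

	static int check_whole_threadgroup(struct task_struct *leader,
					   bool threadgroup)
	{
		struct task_struct *c;
		int ret = check_one(leader);

		if (ret || !threadgroup)
			return ret;
		rcu_read_lock();
		list_for_each_entry_rcu(c, &leader->thread_group, thread_group) {
			ret = check_one(c);
			if (ret)
				break;
		}
		rcu_read_unlock();
		return ret;
	}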

Signed-off-by: Ben Blum <bblum-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

---

 include/linux/cgroup.h   |    7 +++--
 kernel/cgroup.c          |    4 +--
 kernel/cgroup_freezer.c  |   15 +++++++++--
 kernel/cpuset.c          |   65 ++++++++++++++++++++++++++++++++++++----------
 kernel/ns_cgroup.c       |   16 ++++++++++-
 kernel/sched.c           |   37 ++++++++++++++++++++++++--
 mm/memcontrol.c          |    3 +-
 security/device_cgroup.c |    3 +-
 8 files changed, 122 insertions(+), 28 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 24e3f1a..8286758 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -408,10 +408,11 @@ struct cgroup_subsys {
 						  struct cgroup *cgrp);
 	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
-	int (*can_attach)(struct cgroup_subsys *ss,
-			  struct cgroup *cgrp, struct task_struct *tsk);
+	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
+			  struct task_struct *tsk, bool threadgroup);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
-			struct cgroup *old_cgrp, struct task_struct *tsk);
+			struct cgroup *old_cgrp, struct task_struct *tsk,
+			bool threadgroup);
 	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
 	void (*exit)(struct cgroup_subsys *ss, struct task_struct *task);
 	int (*populate)(struct cgroup_subsys *ss,
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index a12bc8e..ea05d6b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1324,7 +1324,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 
 	for_each_subsys(root, ss) {
 		if (ss->can_attach) {
-			retval = ss->can_attach(ss, cgrp, tsk);
+			retval = ss->can_attach(ss, cgrp, tsk, false);
 			if (retval)
 				return retval;
 		}
@@ -1362,7 +1362,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 
 	for_each_subsys(root, ss) {
 		if (ss->attach)
-			ss->attach(ss, cgrp, oldcgrp, tsk);
+			ss->attach(ss, cgrp, oldcgrp, tsk, false);
 	}
 	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
 	synchronize_rcu();
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index fb249e2..4e352ab 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -159,10 +159,9 @@ static bool is_task_frozen_enough(struct task_struct *task)
  */
 static int freezer_can_attach(struct cgroup_subsys *ss,
 			      struct cgroup *new_cgroup,
-			      struct task_struct *task)
+			      struct task_struct *task, bool threadgroup)
 {
 	struct freezer *freezer;
-
 	/*
 	 * Anything frozen can't move or be moved to/from.
 	 *
@@ -177,6 +176,18 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 	if (freezer->state == CGROUP_FROZEN)
 		return -EBUSY;
 
+	if (threadgroup) {
+		struct task_struct *c;
+		rcu_read_lock();
+		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
+			if (is_task_frozen_enough(c)) {
+				rcu_read_unlock();
+				return -EBUSY;
+			}
+		}
+		rcu_read_unlock();
+	}
+
 	return 0;
 }
 
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 7e75a41..86397f4 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1324,9 +1324,10 @@ static int fmeter_getrate(struct fmeter *fmp)
 static cpumask_var_t cpus_attach;
 
 /* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
-static int cpuset_can_attach(struct cgroup_subsys *ss,
-			     struct cgroup *cont, struct task_struct *tsk)
+static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
+			     struct task_struct *tsk, bool threadgroup)
 {
+	int ret;
 	struct cpuset *cs = cgroup_cs(cont);
 
 	if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
@@ -1343,18 +1344,50 @@ static int cpuset_can_attach(struct cgroup_subsys *ss,
 	if (tsk->flags & PF_THREAD_BOUND)
 		return -EINVAL;
 
-	return security_task_setscheduler(tsk, 0, NULL);
+	ret = security_task_setscheduler(tsk, 0, NULL);
+	if (ret)
+		return ret;
+	if (threadgroup) {
+		struct task_struct *c;
+		rcu_read_lock();
+		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
+			ret = security_task_setscheduler(c, 0, NULL);
+			if (ret) {
+				rcu_read_unlock();
+				return ret;
+			}
+		}
+		rcu_read_unlock();
+	}
+	return 0;
+}
+
+static void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
+			       struct cpuset *cs)
+{
+	int err;
+	/*
+	 * can_attach beforehand should guarantee that this doesn't fail.
+	 * TODO: have a better way to handle failure here
+	 */
+	err = set_cpus_allowed_ptr(tsk, cpus_attach);
+	WARN_ON_ONCE(err);
+
+	task_lock(tsk);
+	cpuset_change_task_nodemask(tsk, to);
+	task_unlock(tsk);
+	cpuset_update_task_spread_flag(cs, tsk);
+
 }
 
-static void cpuset_attach(struct cgroup_subsys *ss,
-			  struct cgroup *cont, struct cgroup *oldcont,
-			  struct task_struct *tsk)
+static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
+			  struct cgroup *oldcont, struct task_struct *tsk,
+			  bool threadgroup)
 {
 	nodemask_t from, to;
 	struct mm_struct *mm;
 	struct cpuset *cs = cgroup_cs(cont);
 	struct cpuset *oldcs = cgroup_cs(oldcont);
-	int err;
 
 	if (cs == &top_cpuset) {
 		cpumask_copy(cpus_attach, cpu_possible_mask);
@@ -1363,15 +1396,19 @@ static void cpuset_attach(struct cgroup_subsys *ss,
 		guarantee_online_cpus(cs, cpus_attach);
 		guarantee_online_mems(cs, &to);
 	}
-	err = set_cpus_allowed_ptr(tsk, cpus_attach);
-	if (err)
-		return;
 
-	task_lock(tsk);
-	cpuset_change_task_nodemask(tsk, &to);
-	task_unlock(tsk);
-	cpuset_update_task_spread_flag(cs, tsk);
+	/* do per-task migration stuff possibly for each in the threadgroup */
+	cpuset_attach_task(tsk, &to, cs);
+	if (threadgroup) {
+		struct task_struct *c;
+		rcu_read_lock();
+		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
+			cpuset_attach_task(c, &to, cs);
+		}
+		rcu_read_unlock();
+	}
 
+	/* change mm; only needs to be done once even if threadgroup */
 	from = oldcs->mems_allowed;
 	to = cs->mems_allowed;
 	mm = get_task_mm(tsk);
diff --git a/kernel/ns_cgroup.c b/kernel/ns_cgroup.c
index 5aa854f..2a5dfec 100644
--- a/kernel/ns_cgroup.c
+++ b/kernel/ns_cgroup.c
@@ -42,8 +42,8 @@ int ns_cgroup_clone(struct task_struct *task, struct pid *pid)
  *       (hence either you are in the same cgroup as task, or in an
  *        ancestor cgroup thereof)
  */
-static int ns_can_attach(struct cgroup_subsys *ss,
-		struct cgroup *new_cgroup, struct task_struct *task)
+static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
+			 struct task_struct *task, bool threadgroup)
 {
 	if (current != task) {
 		if (!capable(CAP_SYS_ADMIN))
@@ -56,6 +56,18 @@ static int ns_can_attach(struct cgroup_subsys *ss,
 	if (!cgroup_is_descendant(new_cgroup, task))
 		return -EPERM;
 
+	if (threadgroup) {
+		struct task_struct *c;
+		rcu_read_lock();
+		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
+			if (!cgroup_is_descendant(new_cgroup, c)) {
+				rcu_read_unlock();
+				return -EPERM;
+			}
+		}
+		rcu_read_unlock();
+	}
+
 	return 0;
 }
 
diff --git a/kernel/sched.c b/kernel/sched.c
index 3393c18..b5e371b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -10194,8 +10194,7 @@ cpu_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 }
 
 static int
-cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-		      struct task_struct *tsk)
+cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 #ifdef CONFIG_RT_GROUP_SCHED
 	if (!sched_rt_can_attach(cgroup_tg(cgrp), tsk))
@@ -10209,11 +10208,43 @@ cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	return 0;
 }
 
+static int
+cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+		      struct task_struct *tsk, bool threadgroup)
+{
+	int retval = cpu_cgroup_can_attach_task(cgrp, tsk);
+	if (retval)
+		return retval;
+	if (threadgroup) {
+		struct task_struct *c;
+		rcu_read_lock();
+		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
+			retval = cpu_cgroup_can_attach_task(cgrp, c);
+			if (retval) {
+				rcu_read_unlock();
+				return retval;
+			}
+		}
+		rcu_read_unlock();
+	}
+	return 0;
+
+}
+
 static void
 cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-			struct cgroup *old_cont, struct task_struct *tsk)
+		  struct cgroup *old_cont, struct task_struct *tsk,
+		  bool threadgroup)
 {
 	sched_move_task(tsk);
+	if (threadgroup) {
+		struct task_struct *c;
+		rcu_read_lock();
+		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
+			sched_move_task(c);
+		}
+		rcu_read_unlock();
+	}
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ceb6f2..d9e9cf4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2584,7 +2584,8 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct cgroup *cont,
 				struct cgroup *old_cont,
-				struct task_struct *p)
+				struct task_struct *p,
+				bool threadgroup)
 {
 	mutex_lock(&memcg_tasklist);
 	/*
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index b8186ba..6cf8fd2 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -61,7 +61,8 @@ static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
 struct cgroup_subsys devices_subsys;
 
 static int devcgroup_can_attach(struct cgroup_subsys *ss,
-		struct cgroup *new_cgroup, struct task_struct *task)
+		struct cgroup *new_cgroup, struct task_struct *task,
+		bool threadgroup)
 {
 	if (current != task && !capable(CAP_SYS_ADMIN))
 			return -EPERM;

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH 5/6] Lets ss->can_attach and ss->attach do whole threadgroups at a time
  2009-07-31  1:51 [PATCH v2 0/6] CGroups: cgroup memberlist enhancement+fix Ben Blum
                   ` (2 preceding siblings ...)
  2009-07-31  1:51 ` [PATCH 4/6] Changes css_set freeing mechanism to be under RCU Ben Blum
@ 2009-07-31  1:51 ` Ben Blum
       [not found]   ` <20090731015149.27908.25403.stgit-/yCBOHwbXCxd3OlUiQof+WCaruZE5nAUZeezCHUQhQ4@public.gmane.org>
  2009-08-03  2:22   ` Li Zefan
       [not found] ` <20090731012908.27908.62208.stgit-/yCBOHwbXCxd3OlUiQof+WCaruZE5nAUZeezCHUQhQ4@public.gmane.org>
  2009-07-31  1:51 ` Ben Blum
  5 siblings, 2 replies; 100+ messages in thread
From: Ben Blum @ 2009-07-31  1:51 UTC (permalink / raw)
  To: linux-kernel, containers, akpm, serue, lizf, menage, bblum

Lets ss->can_attach and ss->attach do whole threadgroups at a time

This patch alters the ss->can_attach and ss->attach functions to be able to
deal with a whole threadgroup at a time, for use in cgroup_attach_proc. (This
is a pre-patch to cgroup-procs-writable.patch.)

Currently, the new threadgroup mode of the attach function can only tell the
subsystem the old cgroup of the threadgroup leader. No subsystem currently
needs that information for each thread being moved, but if one were added (for
example, one that counts tasks within a group), this interface would need to
be reworked to pass the right information to the subsystem.

Signed-off-by: Ben Blum <bblum@google.com>

---

 include/linux/cgroup.h   |    7 +++--
 kernel/cgroup.c          |    4 +--
 kernel/cgroup_freezer.c  |   15 +++++++++--
 kernel/cpuset.c          |   65 ++++++++++++++++++++++++++++++++++++----------
 kernel/ns_cgroup.c       |   16 ++++++++++-
 kernel/sched.c           |   37 ++++++++++++++++++++++++--
 mm/memcontrol.c          |    3 +-
 security/device_cgroup.c |    3 +-
 8 files changed, 122 insertions(+), 28 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 24e3f1a..8286758 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -408,10 +408,11 @@ struct cgroup_subsys {
 						  struct cgroup *cgrp);
 	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
-	int (*can_attach)(struct cgroup_subsys *ss,
-			  struct cgroup *cgrp, struct task_struct *tsk);
+	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
+			  struct task_struct *tsk, bool threadgroup);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
-			struct cgroup *old_cgrp, struct task_struct *tsk);
+			struct cgroup *old_cgrp, struct task_struct *tsk,
+			bool threadgroup);
 	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
 	void (*exit)(struct cgroup_subsys *ss, struct task_struct *task);
 	int (*populate)(struct cgroup_subsys *ss,
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index a12bc8e..ea05d6b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1324,7 +1324,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 
 	for_each_subsys(root, ss) {
 		if (ss->can_attach) {
-			retval = ss->can_attach(ss, cgrp, tsk);
+			retval = ss->can_attach(ss, cgrp, tsk, false);
 			if (retval)
 				return retval;
 		}
@@ -1362,7 +1362,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 
 	for_each_subsys(root, ss) {
 		if (ss->attach)
-			ss->attach(ss, cgrp, oldcgrp, tsk);
+			ss->attach(ss, cgrp, oldcgrp, tsk, false);
 	}
 	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
 	synchronize_rcu();
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index fb249e2..4e352ab 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -159,10 +159,9 @@ static bool is_task_frozen_enough(struct task_struct *task)
  */
 static int freezer_can_attach(struct cgroup_subsys *ss,
 			      struct cgroup *new_cgroup,
-			      struct task_struct *task)
+			      struct task_struct *task, bool threadgroup)
 {
 	struct freezer *freezer;
-
 	/*
 	 * Anything frozen can't move or be moved to/from.
 	 *
@@ -177,6 +176,18 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 	if (freezer->state == CGROUP_FROZEN)
 		return -EBUSY;
 
+	if (threadgroup) {
+		struct task_struct *c;
+		rcu_read_lock();
+		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
+			if (is_task_frozen_enough(c)) {
+				rcu_read_unlock();
+				return -EBUSY;
+			}
+		}
+		rcu_read_unlock();
+	}
+
 	return 0;
 }
 
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 7e75a41..86397f4 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1324,9 +1324,10 @@ static int fmeter_getrate(struct fmeter *fmp)
 static cpumask_var_t cpus_attach;
 
 /* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
-static int cpuset_can_attach(struct cgroup_subsys *ss,
-			     struct cgroup *cont, struct task_struct *tsk)
+static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
+			     struct task_struct *tsk, bool threadgroup)
 {
+	int ret;
 	struct cpuset *cs = cgroup_cs(cont);
 
 	if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
@@ -1343,18 +1344,50 @@ static int cpuset_can_attach(struct cgroup_subsys *ss,
 	if (tsk->flags & PF_THREAD_BOUND)
 		return -EINVAL;
 
-	return security_task_setscheduler(tsk, 0, NULL);
+	ret = security_task_setscheduler(tsk, 0, NULL);
+	if (ret)
+		return ret;
+	if (threadgroup) {
+		struct task_struct *c;
+		rcu_read_lock();
+		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
+			ret = security_task_setscheduler(c, 0, NULL);
+			if (ret) {
+				rcu_read_unlock();
+				return ret;
+			}
+		}
+		rcu_read_unlock();
+	}
+	return 0;
+}
+
+static void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
+			       struct cpuset *cs)
+{
+	int err;
+	/*
+	 * can_attach beforehand should guarantee that this doesn't fail.
+	 * TODO: have a better way to handle failure here
+	 */
+	err = set_cpus_allowed_ptr(tsk, cpus_attach);
+	WARN_ON_ONCE(err);
+
+	task_lock(tsk);
+	cpuset_change_task_nodemask(tsk, to);
+	task_unlock(tsk);
+	cpuset_update_task_spread_flag(cs, tsk);
+
 }
 
-static void cpuset_attach(struct cgroup_subsys *ss,
-			  struct cgroup *cont, struct cgroup *oldcont,
-			  struct task_struct *tsk)
+static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
+			  struct cgroup *oldcont, struct task_struct *tsk,
+			  bool threadgroup)
 {
 	nodemask_t from, to;
 	struct mm_struct *mm;
 	struct cpuset *cs = cgroup_cs(cont);
 	struct cpuset *oldcs = cgroup_cs(oldcont);
-	int err;
 
 	if (cs == &top_cpuset) {
 		cpumask_copy(cpus_attach, cpu_possible_mask);
@@ -1363,15 +1396,19 @@ static void cpuset_attach(struct cgroup_subsys *ss,
 		guarantee_online_cpus(cs, cpus_attach);
 		guarantee_online_mems(cs, &to);
 	}
-	err = set_cpus_allowed_ptr(tsk, cpus_attach);
-	if (err)
-		return;
 
-	task_lock(tsk);
-	cpuset_change_task_nodemask(tsk, &to);
-	task_unlock(tsk);
-	cpuset_update_task_spread_flag(cs, tsk);
+	/* do per-task migration stuff possibly for each in the threadgroup */
+	cpuset_attach_task(tsk, &to, cs);
+	if (threadgroup) {
+		struct task_struct *c;
+		rcu_read_lock();
+		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
+			cpuset_attach_task(c, &to, cs);
+		}
+		rcu_read_unlock();
+	}
 
+	/* change mm; only needs to be done once even if threadgroup */
 	from = oldcs->mems_allowed;
 	to = cs->mems_allowed;
 	mm = get_task_mm(tsk);
diff --git a/kernel/ns_cgroup.c b/kernel/ns_cgroup.c
index 5aa854f..2a5dfec 100644
--- a/kernel/ns_cgroup.c
+++ b/kernel/ns_cgroup.c
@@ -42,8 +42,8 @@ int ns_cgroup_clone(struct task_struct *task, struct pid *pid)
  *       (hence either you are in the same cgroup as task, or in an
  *        ancestor cgroup thereof)
  */
-static int ns_can_attach(struct cgroup_subsys *ss,
-		struct cgroup *new_cgroup, struct task_struct *task)
+static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
+			 struct task_struct *task, bool threadgroup)
 {
 	if (current != task) {
 		if (!capable(CAP_SYS_ADMIN))
@@ -56,6 +56,18 @@ static int ns_can_attach(struct cgroup_subsys *ss,
 	if (!cgroup_is_descendant(new_cgroup, task))
 		return -EPERM;
 
+	if (threadgroup) {
+		struct task_struct *c;
+		rcu_read_lock();
+		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
+			if (!cgroup_is_descendant(new_cgroup, c)) {
+				rcu_read_unlock();
+				return -EPERM;
+			}
+		}
+		rcu_read_unlock();
+	}
+
 	return 0;
 }
 
diff --git a/kernel/sched.c b/kernel/sched.c
index 3393c18..b5e371b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -10194,8 +10194,7 @@ cpu_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 }
 
 static int
-cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-		      struct task_struct *tsk)
+cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 #ifdef CONFIG_RT_GROUP_SCHED
 	if (!sched_rt_can_attach(cgroup_tg(cgrp), tsk))
@@ -10209,11 +10208,43 @@ cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	return 0;
 }
 
+static int
+cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+		      struct task_struct *tsk, bool threadgroup)
+{
+	int retval = cpu_cgroup_can_attach_task(cgrp, tsk);
+	if (retval)
+		return retval;
+	if (threadgroup) {
+		struct task_struct *c;
+		rcu_read_lock();
+		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
+			retval = cpu_cgroup_can_attach_task(cgrp, c);
+			if (retval) {
+				rcu_read_unlock();
+				return retval;
+			}
+		}
+		rcu_read_unlock();
+	}
+	return 0;
+
+}
+
 static void
 cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-			struct cgroup *old_cont, struct task_struct *tsk)
+		  struct cgroup *old_cont, struct task_struct *tsk,
+		  bool threadgroup)
 {
 	sched_move_task(tsk);
+	if (threadgroup) {
+		struct task_struct *c;
+		rcu_read_lock();
+		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
+			sched_move_task(c);
+		}
+		rcu_read_unlock();
+	}
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ceb6f2..d9e9cf4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2584,7 +2584,8 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct cgroup *cont,
 				struct cgroup *old_cont,
-				struct task_struct *p)
+				struct task_struct *p,
+				bool threadgroup)
 {
 	mutex_lock(&memcg_tasklist);
 	/*
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index b8186ba..6cf8fd2 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -61,7 +61,8 @@ static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
 struct cgroup_subsys devices_subsys;
 
 static int devcgroup_can_attach(struct cgroup_subsys *ss,
-		struct cgroup *new_cgroup, struct task_struct *task)
+		struct cgroup *new_cgroup, struct task_struct *task,
+		bool threadgroup)
 {
 	if (current != task && !capable(CAP_SYS_ADMIN))
 			return -EPERM;


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH 6/6] Makes procs file writable to move all threads by tgid at once
       [not found] ` <20090731012908.27908.62208.stgit-/yCBOHwbXCxd3OlUiQof+WCaruZE5nAUZeezCHUQhQ4@public.gmane.org>
                     ` (4 preceding siblings ...)
  2009-07-31  1:51   ` [PATCH 5/6] Lets ss->can_attach and ss->attach do whole threadgroups at a time Ben Blum
@ 2009-07-31  1:51   ` Ben Blum
  5 siblings, 0 replies; 100+ messages in thread
From: Ben Blum @ 2009-07-31  1:51 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	serue-r/Jw6+rmf7HQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A,
	menage-hpIqsD4AKlfQT0dZR+AlfA, bblum-hpIqsD4AKlfQT0dZR+AlfA

Makes procs file writable to move all threads by tgid at once

This patch adds functionality that enables users to move all threads in a
threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
file. This current implementation makes use of a per-threadgroup rwsem that's
taken for reading in the fork() path to prevent newly forking threads within
the threadgroup from "escaping" while the move is in progress.

There is a gap between releasing the fork mutex and calling each subsystem's
attach function, which could lead to problems if a subsystem relies on
something that forking threads could change in the meantime. No particular
issue is apparent, but if some subsystem did have a problem here, the
per-threadgroup fork mutex could be held longer, until after the attach calls
are done.
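
For reference, a minimal userspace sketch (not part of the patch) of the
intended use: a single write of a tgid into the destination cgroup's
cgroup.procs file moves the whole threadgroup. The cgroup mount point, program
name, and error handling here are illustrative only:

	#include <stdio.h>

	int main(int argc, char **argv)
	{
		char path[256];
		FILE *f;

		if (argc != 3) {
			fprintf(stderr, "usage: %s <cgroup-dir> <tgid>\n", argv[0]);
			return 1;
		}
		snprintf(path, sizeof(path), "%s/cgroup.procs", argv[1]);
		f = fopen(path, "w");
		if (!f) {
			perror(path);
			return 1;
		}
		/* one write moves every thread in the threadgroup at once */
		fprintf(f, "%s\n", argv[2]);
		if (fclose(f) != 0) {
			perror("writing cgroup.procs");
			return 1;
		}
		return 0;
	}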

Signed-off-by: Ben Blum <bblum-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

---

 Documentation/cgroups/cgroups.txt |   12 +
 include/linux/cgroup.h            |   12 +
 include/linux/init_task.h         |    9 +
 include/linux/sched.h             |    2 
 kernel/cgroup.c                   |  417 +++++++++++++++++++++++++++++++++----
 kernel/fork.c                     |    6 -
 6 files changed, 406 insertions(+), 52 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 6eb1a97..d579346 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -228,6 +228,7 @@ Each cgroup is represented by a directory in the cgroup file system
 containing the following files describing that cgroup:
 
  - tasks: list of tasks (by pid) attached to that cgroup
+ - cgroup.procs: list of unique tgids in the cgroup
  - notify_on_release flag: run the release agent on exit?
  - release_agent: the path to use for release notifications (this file
    exists in the top cgroup only)
@@ -374,7 +375,7 @@ Now you want to do something with this cgroup.
 
 In this directory you can find several files:
 # ls
-notify_on_release tasks
+cgroup.procs notify_on_release tasks
 (plus whatever files added by the attached subsystems)
 
 Now attach your shell to this cgroup:
@@ -408,6 +409,15 @@ You can attach the current shell task by echoing 0:
 
 # echo 0 > tasks
 
+The cgroup.procs file is useful for managing all tasks in a threadgroup at
+once. It works the same way as the tasks file, but moves all tasks in the
+threadgroup with the specified tgid.
+
+Writing the pid of a task that's not the threadgroup leader (i.e., a pid
+that isn't a tgid) is treated as invalid. Writing a '0' to cgroup.procs will
+attach the writing task and all tasks in its threadgroup, but is invalid if
+the writing task is not the leader of the threadgroup.
+
 3. Kernel API
 =============
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 8286758..105d681 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -30,10 +30,12 @@ extern int cgroup_init(void);
 extern void cgroup_lock(void);
 extern bool cgroup_lock_live_group(struct cgroup *cgrp);
 extern void cgroup_unlock(void);
-extern void cgroup_fork(struct task_struct *p);
+extern void cgroup_fork(struct task_struct *p, int clone_flags);
 extern void cgroup_fork_callbacks(struct task_struct *p);
-extern void cgroup_post_fork(struct task_struct *p);
+extern void cgroup_post_fork(struct task_struct *p, int clone_flags);
 extern void cgroup_exit(struct task_struct *p, int run_callbacks);
+extern void cgroup_fork_failed(struct task_struct *p, int run_callbacks,
+			       int clone_flags);
 extern int cgroupstats_build(struct cgroupstats *stats,
 				struct dentry *dentry);
 
@@ -551,10 +553,12 @@ unsigned short css_depth(struct cgroup_subsys_state *css);
 
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
-static inline void cgroup_fork(struct task_struct *p) {}
+static inline void cgroup_fork(struct task_struct *p, int clone_flags) {}
 static inline void cgroup_fork_callbacks(struct task_struct *p) {}
-static inline void cgroup_post_fork(struct task_struct *p) {}
+static inline void cgroup_post_fork(struct task_struct *p, int clone_flags) {}
 static inline void cgroup_exit(struct task_struct *p, int callbacks) {}
+static inline void cgroup_fork_failed(struct task_struct *p, int callbacks,
+				      int clone_flags) {}
 
 static inline void cgroup_lock(void) {}
 static inline void cgroup_unlock(void) {}
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index aecd24e..26d814f 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -105,6 +105,14 @@ extern struct cred init_cred;
 # define INIT_PERF_COUNTERS(tsk)
 #endif
 
+#ifdef CONFIG_CGROUPS
+# define INIT_CGROUP_FORK_MUTEX(tsk)					\
+	.cgroup_fork_mutex =						\
+		__RWSEM_INITIALIZER(tsk.cgroup_fork_mutex),
+#else
+# define INIT_CGROUP_FORK_MUTEX(tsk)
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -174,6 +182,7 @@ extern struct cred init_cred;
 	INIT_LOCKDEP							\
 	INIT_FTRACE_GRAPH						\
 	INIT_TRACE_RECURSION						\
+	INIT_CGROUP_FORK_MUTEX(tsk)					\
 }
 
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 55e3e11..5d38980 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1400,6 +1400,8 @@ struct task_struct {
 	struct css_set *cgroups;
 	/* cg_list protected by css_set_lock and tsk->alloc_lock */
 	struct list_head cg_list;
+	/* guarantees atomic threadgroup movement via the procs file */
+	struct rw_semaphore cgroup_fork_mutex;
 #endif
 #ifdef CONFIG_FUTEX
 	struct robust_list_head __user *robust_list;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index ea05d6b..3ce7298 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1297,6 +1297,87 @@ static void get_first_subsys(const struct cgroup *cgrp,
 		*subsys_id = test_ss->subsys_id;
 }
 
+/*
+ * cgroup_task_migrate - move a task from one cgroup to another.
+ *
+ * 'guarantee' is set if the caller promises that a new css_set for the task
+ * will already exist. If not set, this function might sleep, and can fail
+ * with -ENOMEM. Otherwise, it can only fail with -ESRCH.
+ */
+static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
+			       struct task_struct *tsk, int guarantee)
+{
+	struct css_set *oldcg;
+	struct css_set *newcg;
+
+	/*
+	 * get old css_set. we need to take task_lock and refcount it, because
+	 * an exiting task can change its css_set to init_css_set and drop its
+	 * old one without taking cgroup_mutex.
+	 */
+	task_lock(tsk);
+	oldcg = tsk->cgroups;
+	get_css_set(oldcg);
+	task_unlock(tsk);
+	/*
+	 * locate or allocate a new css_set for this task. 'guarantee' tells
+	 * us whether or not we are sure that a new css_set already exists;
+	 * in that case, we are not allowed to fail, as we won't need malloc.
+	 */
+	if (guarantee) {
+		/*
+		 * our caller promises us that the css_set we want already
+		 * exists, so we use find_existing_css_set directly.
+		 */
+		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+		read_lock(&css_set_lock);
+		newcg = find_existing_css_set(oldcg, cgrp, template);
+		BUG_ON(!newcg);
+		get_css_set(newcg);
+		read_unlock(&css_set_lock);
+	} else {
+		might_sleep();
+		/* find_css_set will give us newcg already referenced. */
+		newcg = find_css_set(oldcg, cgrp);
+		if (!newcg) {
+			put_css_set(oldcg);
+			return -ENOMEM;
+		}
+	}
+	put_css_set(oldcg);
+
+	/*
+	 * we cannot move a task that's declared itself as exiting, as once
+	 * PF_EXITING is set, the tsk->cgroups pointer is no longer safe.
+	 */
+	task_lock(tsk);
+	if (tsk->flags & PF_EXITING) {
+		task_unlock(tsk);
+		put_css_set(newcg);
+		return -ESRCH;
+	}
+	rcu_assign_pointer(tsk->cgroups, newcg);
+	task_unlock(tsk);
+
+	/* Update the css_set linked lists if we're using them */
+	write_lock(&css_set_lock);
+	if (!list_empty(&tsk->cg_list)) {
+		list_del(&tsk->cg_list);
+		list_add(&tsk->cg_list, &newcg->tasks);
+	}
+	write_unlock(&css_set_lock);
+
+	/*
+	 * We just gained a reference on oldcg by taking it from the task. As
+	 * trading it for newcg is protected by cgroup_mutex, we're safe to
+	 * drop it here; it will be freed under RCU.
+	 */
+	put_css_set(oldcg);
+
+	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+	return 0;
+}
+
 /**
  * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
  * @cgrp: the cgroup the task is attaching to
@@ -1307,11 +1388,9 @@ static void get_first_subsys(const struct cgroup *cgrp,
  */
 int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
-	int retval = 0;
+	int retval;
 	struct cgroup_subsys *ss;
 	struct cgroup *oldcgrp;
-	struct css_set *cg;
-	struct css_set *newcg;
 	struct cgroupfs_root *root = cgrp->root;
 	int subsys_id;
 
@@ -1330,75 +1409,293 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		}
 	}
 
-	task_lock(tsk);
-	cg = tsk->cgroups;
-	get_css_set(cg);
-	task_unlock(tsk);
+	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, 0);
+	if (retval)
+		return retval;
+
+	for_each_subsys(root, ss) {
+		if (ss->attach)
+			ss->attach(ss, cgrp, oldcgrp, tsk, false);
+	}
+
+	synchronize_rcu();
+
 	/*
-	 * Locate or allocate a new css_set for this task,
-	 * based on its final set of cgroups
+	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
+	 * is no longer empty.
 	 */
+	cgroup_wakeup_rmdir_waiters(cgrp);
+	return 0;
+}
+
+/*
+ * cgroup_attach_proc works in two stages, the first of which prefetches all
+ * new css_sets needed (to make sure we have enough memory before committing
+ * to the move) and stores them in a list, of entries of the following type.
+ * TODO: possible optimization: use css_set->rcu_head for chaining instead
+ */
+struct cg_list_entry {
+	struct css_set *cg;
+	struct list_head links;
+};
+
+static int css_set_check_fetched(struct cgroup *cgrp, struct task_struct *tsk,
+				 struct css_set *cg,
+				 struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+	read_lock(&css_set_lock);
+	newcg = find_existing_css_set(cg, cgrp, template);
+	if (newcg)
+		get_css_set(newcg);
+	read_unlock(&css_set_lock);
+	/* doesn't exist at all? */
+	if (!newcg)
+		return 1;
+	/* see if it's already in the list */
+	list_for_each_entry(cg_entry, newcg_list, links) {
+		if (cg_entry->cg == newcg) {
+			put_css_set(newcg);
+			return 0;
+		}
+	}
+	/* not found */
+	put_css_set(newcg);
+	return 1;
+}
+
+/*
+ * Find the new css_set and store it in the list in preparation for moving
+ * the given task to the given cgroup. Returns 0 on success, -ENOMEM if we
+ * run out of memory.
+ */
+static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
+			    struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+	/* ensure a new css_set will exist for this thread */
 	newcg = find_css_set(cg, cgrp);
-	put_css_set(cg);
 	if (!newcg)
 		return -ENOMEM;
+	/* add new element to list */
+	cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
+	if (!cg_entry) {
+		put_css_set(newcg);
+		return -ENOMEM;
+	}
+	cg_entry->cg = newcg;
+	list_add(&cg_entry->links, newcg_list);
+	return 0;
+}
 
-	task_lock(tsk);
-	if (tsk->flags & PF_EXITING) {
+/**
+ * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
+ * @cgrp: the cgroup to attach to
+ * @leader: the threadgroup leader task_struct of the group to be attached
+ *
+ * Call holding cgroup_mutex. Will take task_lock of each thread in leader's
+ * threadgroup individually in turn.
+ */
+int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
+{
+	int retval;
+	struct cgroup_subsys *ss;
+	struct cgroup *oldcgrp;
+	struct css_set *oldcg;
+	struct cgroupfs_root *root = cgrp->root;
+	int subsys_id;
+	/* threadgroup list cursor */
+	struct task_struct *tsk;
+	/*
+	 * we need to make sure we have css_sets for all the tasks we're
+	 * going to move -before- we actually start moving them, so that in
+	 * case we get an ENOMEM we can bail out before making any changes.
+	 */
+	struct list_head newcg_list;
+	struct cg_list_entry *cg_entry;
+
+	/* first, make sure this came from a valid tgid */
+	if (!thread_group_leader(leader))
+		return -EINVAL;
+	/*
+	 * check that we can legitimately attach to the cgroup.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->can_attach) {
+			retval = ss->can_attach(ss, cgrp, leader, true);
+			if (retval)
+				return retval;
+		}
+	}
+
+	get_first_subsys(cgrp, NULL, &subsys_id);
+
+	/*
+	 * step 1: make sure css_sets exist for all threads to be migrated.
+	 * we use find_css_set, which allocates a new one if necessary.
+	 */
+	INIT_LIST_HEAD(&newcg_list);
+	oldcgrp = task_cgroup(leader, subsys_id);
+	if (cgrp != oldcgrp) {
+		/* get old css_set */
+		task_lock(leader);
+		if (leader->flags & PF_EXITING) {
+			task_unlock(leader);
+			goto prefetch_loop;
+		}
+		oldcg = leader->cgroups;
+		get_css_set(oldcg);
+		task_unlock(leader);
+		/* acquire new one */
+		retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+		put_css_set(oldcg);
+		if (retval)
+			goto list_teardown;
+	}
+prefetch_loop:
+	rcu_read_lock();
+	/*
+	 * if we need to fetch a new css_set for this task, we must exit the
+	 * rcu_read section because allocating it can sleep. afterwards, we'll
+	 * need to restart iteration on the threadgroup list - the whole thing
+	 * will be O(nm) in the number of threads and css_sets; as the typical
+	 * case only has one css_set for all of them, usually O(n).
+	 */
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* nothing to do if this task is already in the cgroup */
+		oldcgrp = task_cgroup(tsk, subsys_id);
+		if (cgrp == oldcgrp)
+			continue;
+		/* get old css_set pointer */
+		task_lock(tsk);
+		if (tsk->flags & PF_EXITING) {
+			/* ignore this task if it's going away */
+			task_unlock(tsk);
+			continue;
+		}
+		oldcg = tsk->cgroups;
+		get_css_set(oldcg);
 		task_unlock(tsk);
-		put_css_set(newcg);
-		return -ESRCH;
+		/* see if the new one for us is already in the list? */
+		retval = css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list);
+		if (retval) {
+			/* we don't already have it. get new one. */
+			rcu_read_unlock();
+			retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+			put_css_set(oldcg);
+			if (retval)
+				goto list_teardown;
+			/* begin iteration again. */
+			goto prefetch_loop;
+		} else {
+			/* was already there, nothing to do. */
+			put_css_set(oldcg);
+		}
 	}
-	rcu_assign_pointer(tsk->cgroups, newcg);
-	task_unlock(tsk);
+	rcu_read_unlock();
 
-	/* Update the css_set linked lists if we're using them */
-	write_lock(&css_set_lock);
-	if (!list_empty(&tsk->cg_list)) {
-		list_del(&tsk->cg_list);
-		list_add(&tsk->cg_list, &newcg->tasks);
+	/*
+	 * step 2: now that we're guaranteed success wrt the css_sets, proceed
+	 * to move all tasks to the new cgroup. Even if the threadgroup leader
+	 * is PF_EXITING, we still proceed to move all of its sub-threads to
+	 * the new cgroup; if everybody is PF_EXITING, we'll just end up doing
+	 * nothing, which is ok.
+	 */
+	oldcgrp = task_cgroup(leader, subsys_id);
+	/* if leader is already there, skip moving him */
+	if (cgrp != oldcgrp) {
+		retval = cgroup_task_migrate(cgrp, oldcgrp, leader, 1);
+		BUG_ON(retval != 0 && retval != -ESRCH);
 	}
-	write_unlock(&css_set_lock);
+	/*
+	 * now move all the rest of the threads - need to lock against
+	 * possible races with fork().
+	 */
+	down_write(&leader->cgroup_fork_mutex);
+	rcu_read_lock();
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* leave current thread as it is if it's already there */
+		oldcgrp = task_cgroup(tsk, subsys_id);
+		if (cgrp == oldcgrp)
+			continue;
+		/* we don't care whether these threads are exiting */
+		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, 1);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+	rcu_read_unlock();
+	up_write(&leader->cgroup_fork_mutex);
 
+	/*
+	 * step 3: attach whole threadgroup to each subsystem
+	 * TODO: if ever a subsystem needs to know the oldcgrp for each task
+	 * being moved, this call will need to be reworked to communicate that
+	 * information.
+	 */
 	for_each_subsys(root, ss) {
 		if (ss->attach)
-			ss->attach(ss, cgrp, oldcgrp, tsk, false);
+			ss->attach(ss, cgrp, oldcgrp, tsk, true);
 	}
-	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
-	synchronize_rcu();
-	put_css_set(cg);
 
 	/*
-	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
-	 * is no longer empty.
+	 * step 4: success! ...and cleanup
 	 */
+	synchronize_rcu();
 	cgroup_wakeup_rmdir_waiters(cgrp);
-	return 0;
+	retval = 0;
+list_teardown:
+	/* no longer need the list of css_sets, so get rid of it */
+	while (!list_empty(&newcg_list)) {
+		/* pop from the list */
+		cg_entry = list_first_entry(&newcg_list, struct cg_list_entry,
+					    links);
+		list_del(&cg_entry->links);
+		/* drop the refcount */
+		put_css_set(cg_entry->cg);
+		kfree(cg_entry);
+	}
+	/* done! */
+	return retval;
 }
 
 /*
- * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
- * held. May take task_lock of task
+ * Find the task_struct of the task to attach by vpid and pass it along to the
+ * function to attach either it or all tasks in its threadgroup. Will take
+ * cgroup_mutex; may take task_lock of task.
  */
-static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
+static int attach_task_by_pid(struct cgroup *cgrp, u64 pid,
+			      int attach(struct cgroup *,
+					 struct task_struct *))
 {
 	struct task_struct *tsk;
 	const struct cred *cred = current_cred(), *tcred;
 	int ret;
 
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+
 	if (pid) {
 		rcu_read_lock();
 		tsk = find_task_by_vpid(pid);
 		if (!tsk || tsk->flags & PF_EXITING) {
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -ESRCH;
 		}
-
+		/*
+		 * even if we're attaching all tasks in the thread group, we
+		 * only need to check permissions on the group leader, because
+		 * even if another task has different permissions, the group
+		 * leader will have sufficient access to change it.
+		 */
 		tcred = __task_cred(tsk);
 		if (cred->euid &&
 		    cred->euid != tcred->uid &&
 		    cred->euid != tcred->suid) {
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -EACCES;
 		}
 		get_task_struct(tsk);
@@ -1408,19 +1705,25 @@ static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
 		get_task_struct(tsk);
 	}
 
-	ret = cgroup_attach_task(cgrp, tsk);
+	/*
+	 * Note that the check for whether the task is its threadgroup leader
+	 * is done in cgroup_attach_proc. This means that writing 0 to the
+	 * procs file will only work if the writing task is the leader.
+	 */
+	ret = attach(cgrp, tsk);
 	put_task_struct(tsk);
+	cgroup_unlock();
 	return ret;
 }
 
 static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
 {
-	int ret;
-	if (!cgroup_lock_live_group(cgrp))
-		return -ENODEV;
-	ret = attach_task_by_pid(cgrp, pid);
-	cgroup_unlock();
-	return ret;
+	return attach_task_by_pid(cgrp, pid, cgroup_attach_task);
+}
+
+static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
+{
+	return attach_task_by_pid(cgrp, tgid, cgroup_attach_proc);
 }
 
 /**
@@ -2579,9 +2882,9 @@ static struct cftype files[] = {
 	{
 		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
 		.open = cgroup_procs_open,
-		/* .write_u64 = cgroup_procs_write, TODO */
+		.write_u64 = cgroup_procs_write,
 		.release = cgroup_pidlist_release,
-		.mode = S_IRUGO,
+		.mode = S_IRUGO | S_IWUSR,
 	},
 	{
 		.name = "notify_on_release",
@@ -3182,8 +3485,12 @@ static struct file_operations proc_cgroupstats_operations = {
  * At the point that cgroup_fork() is called, 'current' is the parent
  * task, and the passed argument 'child' points to the child task.
  */
-void cgroup_fork(struct task_struct *child)
+void cgroup_fork(struct task_struct *child, int clone_flags)
 {
+	if (clone_flags & CLONE_THREAD)
+		down_read(&current->group_leader->cgroup_fork_mutex);
+	else
+		init_rwsem(&child->cgroup_fork_mutex);
 	task_lock(current);
 	child->cgroups = current->cgroups;
 	get_css_set(child->cgroups);
@@ -3220,7 +3527,7 @@ void cgroup_fork_callbacks(struct task_struct *child)
  * with the first call to cgroup_iter_start() - to guarantee that the
  * new task ends up on its list.
  */
-void cgroup_post_fork(struct task_struct *child)
+void cgroup_post_fork(struct task_struct *child, int clone_flags)
 {
 	if (use_task_css_set_links) {
 		write_lock(&css_set_lock);
@@ -3230,6 +3537,8 @@ void cgroup_post_fork(struct task_struct *child)
 		task_unlock(child);
 		write_unlock(&css_set_lock);
 	}
+	if (clone_flags & CLONE_THREAD)
+		up_read(&current->group_leader->cgroup_fork_mutex);
 }
 /**
  * cgroup_exit - detach cgroup from exiting task
@@ -3301,6 +3610,26 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
 }
 
 /**
+ * cgroup_fork_failed - undo operations for fork failure
+ * @tsk: pointer to the task_struct of the child whose fork failed
+ * @run_callbacks: run exit callbacks?
+ *
+ * Description: Undo cgroup operations after cgroup_fork in fork failure.
+ *
+ * We release the read lock that was taken in cgroup_fork(), since it is
+ * supposed to be dropped in cgroup_post_fork() in the success case. The
+ * failed child task also needs to be detached from the cgroup, so we wrap
+ * cgroup_exit.
+ */
+void cgroup_fork_failed(struct task_struct *tsk, int run_callbacks,
+			int clone_flags)
+{
+	if (clone_flags & CLONE_THREAD)
+		up_read(&current->group_leader->cgroup_fork_mutex);
+	cgroup_exit(tsk, run_callbacks);
+}
+
+/**
  * cgroup_clone - clone the cgroup the given subsystem is attached to
  * @tsk: the task to be moved
  * @subsys: the given subsystem
diff --git a/kernel/fork.c b/kernel/fork.c
index 926c117..3f9decf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1053,7 +1053,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	monotonic_to_bootbased(&p->real_start_time);
 	p->io_context = NULL;
 	p->audit_context = NULL;
-	cgroup_fork(p);
+	cgroup_fork(p, clone_flags);
 #ifdef CONFIG_NUMA
 	p->mempolicy = mpol_dup(p->mempolicy);
  	if (IS_ERR(p->mempolicy)) {
@@ -1269,7 +1269,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	spin_unlock(&current->sighand->siglock);
 	write_unlock_irq(&tasklist_lock);
 	proc_fork_connector(p);
-	cgroup_post_fork(p);
+	cgroup_post_fork(p, clone_flags);
 	return p;
 
 bad_fork_free_pid:
@@ -1300,7 +1300,7 @@ bad_fork_cleanup_policy:
 	mpol_put(p->mempolicy);
 bad_fork_cleanup_cgroup:
 #endif
-	cgroup_exit(p, cgroup_callbacks_done);
+	cgroup_fork_failed(p, cgroup_callbacks_done, clone_flags);
 	delayacct_tsk_free(p);
 	if (p->binfmt)
 		module_put(p->binfmt->module);

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH 6/6] Makes procs file writable to move all threads by tgid at once
  2009-07-31  1:51 [PATCH v2 0/6] CGroups: cgroup memberlist enhancement+fix Ben Blum
                   ` (4 preceding siblings ...)
       [not found] ` <20090731012908.27908.62208.stgit-/yCBOHwbXCxd3OlUiQof+WCaruZE5nAUZeezCHUQhQ4@public.gmane.org>
@ 2009-07-31  1:51 ` Ben Blum
  2009-08-03  3:00   ` Li Zefan
                     ` (2 more replies)
  5 siblings, 3 replies; 100+ messages in thread
From: Ben Blum @ 2009-07-31  1:51 UTC (permalink / raw)
  To: linux-kernel, containers, akpm, serue, lizf, menage, bblum

Makes procs file writable to move all threads by tgid at once

This patch adds functionality that enables users to move all threads in a
threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
file. This current implementation makes use of a per-threadgroup rwsem that's
taken for reading in the fork() path to prevent newly forking threads within
the threadgroup from "escaping" while the move is in progress.

There is a gap between releasing the fork mutex and calling each subsystem's
attach function, which could lead to problems if a subsystem relies on state
that newly forking threads could change in the meantime. No such problem is
apparent in any current subsystem, but if one were to arise, the
per-threadgroup fork mutex could instead be held until after the attach calls
complete.
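
As a rough illustration only - not the literal implementation, and with helper
names invented just for this sketch - the locking pattern is:

/* fork path: a new thread takes its group leader's rwsem for reading */
static void fork_side_sketch(struct task_struct *child, int clone_flags)
{
	if (clone_flags & CLONE_THREAD)
		down_read(&current->group_leader->cgroup_fork_mutex);
	else
		init_rwsem(&child->cgroup_fork_mutex);
	/*
	 * ... the usual cgroup_fork() work goes here; the matching up_read()
	 * happens later in cgroup_post_fork() or cgroup_fork_failed().
	 */
}

/*
 * procs-write path: the mover takes the same rwsem for writing, so no thread
 * can complete a fork (and escape the move) while the migration loop runs.
 */
static void attach_side_sketch(struct task_struct *leader)
{
	struct task_struct *tsk;

	down_write(&leader->cgroup_fork_mutex);
	rcu_read_lock();
	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
		/* migrate tsk; its new css_set was prefetched beforehand */
	}
	rcu_read_unlock();
	up_write(&leader->cgroup_fork_mutex);
}

In the actual patch the read/write pairs live in cgroup_fork(),
cgroup_post_fork() and cgroup_attach_proc(); see below.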

Signed-off-by: Ben Blum <bblum@google.com>

---

 Documentation/cgroups/cgroups.txt |   12 +
 include/linux/cgroup.h            |   12 +
 include/linux/init_task.h         |    9 +
 include/linux/sched.h             |    2 
 kernel/cgroup.c                   |  417 +++++++++++++++++++++++++++++++++----
 kernel/fork.c                     |    6 -
 6 files changed, 406 insertions(+), 52 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 6eb1a97..d579346 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -228,6 +228,7 @@ Each cgroup is represented by a directory in the cgroup file system
 containing the following files describing that cgroup:
 
  - tasks: list of tasks (by pid) attached to that cgroup
+ - cgroup.procs: list of unique tgids in the cgroup
  - notify_on_release flag: run the release agent on exit?
  - release_agent: the path to use for release notifications (this file
    exists in the top cgroup only)
@@ -374,7 +375,7 @@ Now you want to do something with this cgroup.
 
 In this directory you can find several files:
 # ls
-notify_on_release tasks
+cgroup.procs notify_on_release tasks
 (plus whatever files added by the attached subsystems)
 
 Now attach your shell to this cgroup:
@@ -408,6 +409,15 @@ You can attach the current shell task by echoing 0:
 
 # echo 0 > tasks
 
+The cgroup.procs file is useful for managing all tasks in a threadgroup at
+once. It works the same way as the tasks file, but moves all tasks in the
+threadgroup with the specified tgid.
+
+Writing the pid of a task that's not the threadgroup leader (i.e., a pid
+that isn't a tgid) is treated as invalid. Writing a '0' to cgroup.procs will
+attach the writing task and all tasks in its threadgroup, but is invalid if
+the writing task is not the leader of the threadgroup.
+
 3. Kernel API
 =============
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 8286758..105d681 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -30,10 +30,12 @@ extern int cgroup_init(void);
 extern void cgroup_lock(void);
 extern bool cgroup_lock_live_group(struct cgroup *cgrp);
 extern void cgroup_unlock(void);
-extern void cgroup_fork(struct task_struct *p);
+extern void cgroup_fork(struct task_struct *p, int clone_flags);
 extern void cgroup_fork_callbacks(struct task_struct *p);
-extern void cgroup_post_fork(struct task_struct *p);
+extern void cgroup_post_fork(struct task_struct *p, int clone_flags);
 extern void cgroup_exit(struct task_struct *p, int run_callbacks);
+extern void cgroup_fork_failed(struct task_struct *p, int run_callbacks,
+			       int clone_flags);
 extern int cgroupstats_build(struct cgroupstats *stats,
 				struct dentry *dentry);
 
@@ -551,10 +553,12 @@ unsigned short css_depth(struct cgroup_subsys_state *css);
 
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
-static inline void cgroup_fork(struct task_struct *p) {}
+static inline void cgroup_fork(struct task_struct *p, int clone_flags) {}
 static inline void cgroup_fork_callbacks(struct task_struct *p) {}
-static inline void cgroup_post_fork(struct task_struct *p) {}
+static inline void cgroup_post_fork(struct task_struct *p, int clone_flags) {}
 static inline void cgroup_exit(struct task_struct *p, int callbacks) {}
+static inline void cgroup_fork_failed(struct task_struct *p, int callbacks,
+				      int clone_flags) {}
 
 static inline void cgroup_lock(void) {}
 static inline void cgroup_unlock(void) {}
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index aecd24e..26d814f 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -105,6 +105,14 @@ extern struct cred init_cred;
 # define INIT_PERF_COUNTERS(tsk)
 #endif
 
+#ifdef CONFIG_CGROUPS
+# define INIT_CGROUP_FORK_MUTEX(tsk)					\
+	.cgroup_fork_mutex =						\
+		__RWSEM_INITIALIZER(tsk.cgroup_fork_mutex),
+#else
+# define INIT_CGROUP_FORK_MUTEX(tsk)
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -174,6 +182,7 @@ extern struct cred init_cred;
 	INIT_LOCKDEP							\
 	INIT_FTRACE_GRAPH						\
 	INIT_TRACE_RECURSION						\
+	INIT_CGROUP_FORK_MUTEX(tsk)					\
 }
 
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 55e3e11..5d38980 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1400,6 +1400,8 @@ struct task_struct {
 	struct css_set *cgroups;
 	/* cg_list protected by css_set_lock and tsk->alloc_lock */
 	struct list_head cg_list;
+	/* guarantees atomic threadgroup movement via the procs file */
+	struct rw_semaphore cgroup_fork_mutex;
 #endif
 #ifdef CONFIG_FUTEX
 	struct robust_list_head __user *robust_list;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index ea05d6b..3ce7298 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1297,6 +1297,87 @@ static void get_first_subsys(const struct cgroup *cgrp,
 		*subsys_id = test_ss->subsys_id;
 }
 
+/*
+ * cgroup_task_migrate - move a task from one cgroup to another.
+ *
+ * 'guarantee' is set if the caller promises that a new css_set for the task
+ * will already exist. If not set, this function might sleep, and can fail
+ * with -ENOMEM. Otherwise, it can only fail with -ESRCH.
+ */
+static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
+			       struct task_struct *tsk, int guarantee)
+{
+	struct css_set *oldcg;
+	struct css_set *newcg;
+
+	/*
+	 * get old css_set. we need to take task_lock and refcount it, because
+	 * an exiting task can change its css_set to init_css_set and drop its
+	 * old one without taking cgroup_mutex.
+	 */
+	task_lock(tsk);
+	oldcg = tsk->cgroups;
+	get_css_set(oldcg);
+	task_unlock(tsk);
+	/*
+	 * locate or allocate a new css_set for this task. 'guarantee' tells
+	 * us whether or not we are sure that a new css_set already exists;
+	 * in that case, we are not allowed to fail, as we won't need malloc.
+	 */
+	if (guarantee) {
+		/*
+		 * our caller promises us that the css_set we want already
+		 * exists, so we use find_existing_css_set directly.
+		 */
+		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+		read_lock(&css_set_lock);
+		newcg = find_existing_css_set(oldcg, cgrp, template);
+		BUG_ON(!newcg);
+		get_css_set(newcg);
+		read_unlock(&css_set_lock);
+	} else {
+		might_sleep();
+		/* find_css_set will give us newcg already referenced. */
+		newcg = find_css_set(oldcg, cgrp);
+		if (!newcg) {
+			put_css_set(oldcg);
+			return -ENOMEM;
+		}
+	}
+	put_css_set(oldcg);
+
+	/*
+	 * we cannot move a task that's declared itself as exiting, as once
+	 * PF_EXITING is set, the tsk->cgroups pointer is no longer safe.
+	 */
+	task_lock(tsk);
+	if (tsk->flags & PF_EXITING) {
+		task_unlock(tsk);
+		put_css_set(newcg);
+		return -ESRCH;
+	}
+	rcu_assign_pointer(tsk->cgroups, newcg);
+	task_unlock(tsk);
+
+	/* Update the css_set linked lists if we're using them */
+	write_lock(&css_set_lock);
+	if (!list_empty(&tsk->cg_list)) {
+		list_del(&tsk->cg_list);
+		list_add(&tsk->cg_list, &newcg->tasks);
+	}
+	write_unlock(&css_set_lock);
+
+	/*
+	 * We just gained a reference on oldcg by taking it from the task. As
+	 * trading it for newcg is protected by cgroup_mutex, we're safe to
+	 * drop it here; it will be freed under RCU.
+	 */
+	put_css_set(oldcg);
+
+	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+	return 0;
+}
+
 /**
  * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
  * @cgrp: the cgroup the task is attaching to
@@ -1307,11 +1388,9 @@ static void get_first_subsys(const struct cgroup *cgrp,
  */
 int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
-	int retval = 0;
+	int retval;
 	struct cgroup_subsys *ss;
 	struct cgroup *oldcgrp;
-	struct css_set *cg;
-	struct css_set *newcg;
 	struct cgroupfs_root *root = cgrp->root;
 	int subsys_id;
 
@@ -1330,75 +1409,293 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		}
 	}
 
-	task_lock(tsk);
-	cg = tsk->cgroups;
-	get_css_set(cg);
-	task_unlock(tsk);
+	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, 0);
+	if (retval)
+		return retval;
+
+	for_each_subsys(root, ss) {
+		if (ss->attach)
+			ss->attach(ss, cgrp, oldcgrp, tsk, false);
+	}
+
+	synchronize_rcu();
+
 	/*
-	 * Locate or allocate a new css_set for this task,
-	 * based on its final set of cgroups
+	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
+	 * is no longer empty.
 	 */
+	cgroup_wakeup_rmdir_waiters(cgrp);
+	return 0;
+}
+
+/*
+ * cgroup_attach_proc works in two stages, the first of which prefetches all
+ * new css_sets needed (to make sure we have enough memory before committing
+ * to the move) and stores them in a list, of entries of the following type.
+ * TODO: possible optimization: use css_set->rcu_head for chaining instead
+ */
+struct cg_list_entry {
+	struct css_set *cg;
+	struct list_head links;
+};
+
+static int css_set_check_fetched(struct cgroup *cgrp, struct task_struct *tsk,
+				 struct css_set *cg,
+				 struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+	read_lock(&css_set_lock);
+	newcg = find_existing_css_set(cg, cgrp, template);
+	if (newcg)
+		get_css_set(newcg);
+	read_unlock(&css_set_lock);
+	/* doesn't exist at all? */
+	if (!newcg)
+		return 1;
+	/* see if it's already in the list */
+	list_for_each_entry(cg_entry, newcg_list, links) {
+		if (cg_entry->cg == newcg) {
+			put_css_set(newcg);
+			return 0;
+		}
+	}
+	/* not found */
+	put_css_set(newcg);
+	return 1;
+}
+
+/*
+ * Find the new css_set and store it in the list in preparation for moving
+ * the given task to the given cgroup. Returns 0 on success, -ENOMEM if we
+ * run out of memory.
+ */
+static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
+			    struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+	/* ensure a new css_set will exist for this thread */
 	newcg = find_css_set(cg, cgrp);
-	put_css_set(cg);
 	if (!newcg)
 		return -ENOMEM;
+	/* add new element to list */
+	cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
+	if (!cg_entry) {
+		put_css_set(newcg);
+		return -ENOMEM;
+	}
+	cg_entry->cg = newcg;
+	list_add(&cg_entry->links, newcg_list);
+	return 0;
+}
 
-	task_lock(tsk);
-	if (tsk->flags & PF_EXITING) {
+/**
+ * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
+ * @cgrp: the cgroup to attach to
+ * @leader: the threadgroup leader task_struct of the group to be attached
+ *
+ * Call holding cgroup_mutex. Will take task_lock of each thread in leader's
+ * threadgroup individually in turn.
+ */
+int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
+{
+	int retval;
+	struct cgroup_subsys *ss;
+	struct cgroup *oldcgrp;
+	struct css_set *oldcg;
+	struct cgroupfs_root *root = cgrp->root;
+	int subsys_id;
+	/* threadgroup list cursor */
+	struct task_struct *tsk;
+	/*
+	 * we need to make sure we have css_sets for all the tasks we're
+	 * going to move -before- we actually start moving them, so that in
+	 * case we get an ENOMEM we can bail out before making any changes.
+	 */
+	struct list_head newcg_list;
+	struct cg_list_entry *cg_entry;
+
+	/* first, make sure this came from a valid tgid */
+	if (!thread_group_leader(leader))
+		return -EINVAL;
+	/*
+	 * check that we can legitimately attach to the cgroup.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->can_attach) {
+			retval = ss->can_attach(ss, cgrp, leader, true);
+			if (retval)
+				return retval;
+		}
+	}
+
+	get_first_subsys(cgrp, NULL, &subsys_id);
+
+	/*
+	 * step 1: make sure css_sets exist for all threads to be migrated.
+	 * we use find_css_set, which allocates a new one if necessary.
+	 */
+	INIT_LIST_HEAD(&newcg_list);
+	oldcgrp = task_cgroup(leader, subsys_id);
+	if (cgrp != oldcgrp) {
+		/* get old css_set */
+		task_lock(leader);
+		if (leader->flags & PF_EXITING) {
+			task_unlock(leader);
+			goto prefetch_loop;
+		}
+		oldcg = leader->cgroups;
+		get_css_set(oldcg);
+		task_unlock(leader);
+		/* acquire new one */
+		retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+		put_css_set(oldcg);
+		if (retval)
+			goto list_teardown;
+	}
+prefetch_loop:
+	rcu_read_lock();
+	/*
+	 * if we need to fetch a new css_set for this task, we must exit the
+	 * rcu_read section because allocating it can sleep. afterwards, we'll
+	 * need to restart iteration on the threadgroup list - the whole thing
+	 * will be O(nm) in the number of threads and css_sets; as the typical
+	 * case only has one css_set for all of them, usually O(n).
+	 */
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* nothing to do if this task is already in the cgroup */
+		oldcgrp = task_cgroup(tsk, subsys_id);
+		if (cgrp == oldcgrp)
+			continue;
+		/* get old css_set pointer */
+		task_lock(tsk);
+		if (tsk->flags & PF_EXITING) {
+			/* ignore this task if it's going away */
+			task_unlock(tsk);
+			continue;
+		}
+		oldcg = tsk->cgroups;
+		get_css_set(oldcg);
 		task_unlock(tsk);
-		put_css_set(newcg);
-		return -ESRCH;
+		/* see if the new one for us is already in the list? */
+		retval = css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list);
+		if (retval) {
+			/* we don't already have it. get new one. */
+			rcu_read_unlock();
+			retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+			put_css_set(oldcg);
+			if (retval)
+				goto list_teardown;
+			/* begin iteration again. */
+			goto prefetch_loop;
+		} else {
+			/* was already there, nothing to do. */
+			put_css_set(oldcg);
+		}
 	}
-	rcu_assign_pointer(tsk->cgroups, newcg);
-	task_unlock(tsk);
+	rcu_read_unlock();
 
-	/* Update the css_set linked lists if we're using them */
-	write_lock(&css_set_lock);
-	if (!list_empty(&tsk->cg_list)) {
-		list_del(&tsk->cg_list);
-		list_add(&tsk->cg_list, &newcg->tasks);
+	/*
+	 * step 2: now that we're guaranteed success wrt the css_sets, proceed
+	 * to move all tasks to the new cgroup. Even if the threadgroup leader
+	 * is PF_EXITING, we still proceed to move all of its sub-threads to
+	 * the new cgroup; if everybody is PF_EXITING, we'll just end up doing
+	 * nothing, which is ok.
+	 */
+	oldcgrp = task_cgroup(leader, subsys_id);
+	/* if leader is already there, skip moving him */
+	if (cgrp != oldcgrp) {
+		retval = cgroup_task_migrate(cgrp, oldcgrp, leader, 1);
+		BUG_ON(retval != 0 && retval != -ESRCH);
 	}
-	write_unlock(&css_set_lock);
+	/*
+	 * now move all the rest of the threads - need to lock against
+	 * possible races with fork().
+	 */
+	down_write(&leader->cgroup_fork_mutex);
+	rcu_read_lock();
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* leave current thread as it is if it's already there */
+		oldcgrp = task_cgroup(tsk, subsys_id);
+		if (cgrp == oldcgrp)
+			continue;
+		/* we don't care whether these threads are exiting */
+		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, 1);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+	rcu_read_unlock();
+	up_write(&leader->cgroup_fork_mutex);
 
+	/*
+	 * step 3: attach whole threadgroup to each subsystem
+	 * TODO: if ever a subsystem needs to know the oldcgrp for each task
+	 * being moved, this call will need to be reworked to communicate that
+	 * information.
+	 */
 	for_each_subsys(root, ss) {
 		if (ss->attach)
-			ss->attach(ss, cgrp, oldcgrp, tsk, false);
+			ss->attach(ss, cgrp, oldcgrp, tsk, true);
 	}
-	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
-	synchronize_rcu();
-	put_css_set(cg);
 
 	/*
-	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
-	 * is no longer empty.
+	 * step 4: success! ...and cleanup
 	 */
+	synchronize_rcu();
 	cgroup_wakeup_rmdir_waiters(cgrp);
-	return 0;
+	retval = 0;
+list_teardown:
+	/* no longer need the list of css_sets, so get rid of it */
+	while (!list_empty(&newcg_list)) {
+		/* pop from the list */
+		cg_entry = list_first_entry(&newcg_list, struct cg_list_entry,
+					    links);
+		list_del(&cg_entry->links);
+		/* drop the refcount */
+		put_css_set(cg_entry->cg);
+		kfree(cg_entry);
+	}
+	/* done! */
+	return retval;
 }
 
 /*
- * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
- * held. May take task_lock of task
+ * Find the task_struct of the task to attach by vpid and pass it along to the
+ * function to attach either it or all tasks in its threadgroup. Will take
+ * cgroup_mutex; may take task_lock of task.
  */
-static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
+static int attach_task_by_pid(struct cgroup *cgrp, u64 pid,
+			      int attach(struct cgroup *,
+					 struct task_struct *))
 {
 	struct task_struct *tsk;
 	const struct cred *cred = current_cred(), *tcred;
 	int ret;
 
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+
 	if (pid) {
 		rcu_read_lock();
 		tsk = find_task_by_vpid(pid);
 		if (!tsk || tsk->flags & PF_EXITING) {
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -ESRCH;
 		}
-
+		/*
+		 * even if we're attaching all tasks in the thread group, we
+		 * only need to check permissions on the group leader, because
+		 * even if another task has different permissions, the group
+		 * leader will have sufficient access to change it.
+		 */
 		tcred = __task_cred(tsk);
 		if (cred->euid &&
 		    cred->euid != tcred->uid &&
 		    cred->euid != tcred->suid) {
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -EACCES;
 		}
 		get_task_struct(tsk);
@@ -1408,19 +1705,25 @@ static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
 		get_task_struct(tsk);
 	}
 
-	ret = cgroup_attach_task(cgrp, tsk);
+	/*
+	 * Note that the check for whether the task is its threadgroup leader
+	 * is done in cgroup_attach_proc. This means that writing 0 to the
+	 * procs file will only work if the writing task is the leader.
+	 */
+	ret = attach(cgrp, tsk);
 	put_task_struct(tsk);
+	cgroup_unlock();
 	return ret;
 }
 
 static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
 {
-	int ret;
-	if (!cgroup_lock_live_group(cgrp))
-		return -ENODEV;
-	ret = attach_task_by_pid(cgrp, pid);
-	cgroup_unlock();
-	return ret;
+	return attach_task_by_pid(cgrp, pid, cgroup_attach_task);
+}
+
+static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
+{
+	return attach_task_by_pid(cgrp, tgid, cgroup_attach_proc);
 }
 
 /**
@@ -2579,9 +2882,9 @@ static struct cftype files[] = {
 	{
 		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
 		.open = cgroup_procs_open,
-		/* .write_u64 = cgroup_procs_write, TODO */
+		.write_u64 = cgroup_procs_write,
 		.release = cgroup_pidlist_release,
-		.mode = S_IRUGO,
+		.mode = S_IRUGO | S_IWUSR,
 	},
 	{
 		.name = "notify_on_release",
@@ -3182,8 +3485,12 @@ static struct file_operations proc_cgroupstats_operations = {
  * At the point that cgroup_fork() is called, 'current' is the parent
  * task, and the passed argument 'child' points to the child task.
  */
-void cgroup_fork(struct task_struct *child)
+void cgroup_fork(struct task_struct *child, int clone_flags)
 {
+	if (clone_flags & CLONE_THREAD)
+		down_read(&current->group_leader->cgroup_fork_mutex);
+	else
+		init_rwsem(&child->cgroup_fork_mutex);
 	task_lock(current);
 	child->cgroups = current->cgroups;
 	get_css_set(child->cgroups);
@@ -3220,7 +3527,7 @@ void cgroup_fork_callbacks(struct task_struct *child)
  * with the first call to cgroup_iter_start() - to guarantee that the
  * new task ends up on its list.
  */
-void cgroup_post_fork(struct task_struct *child)
+void cgroup_post_fork(struct task_struct *child, int clone_flags)
 {
 	if (use_task_css_set_links) {
 		write_lock(&css_set_lock);
@@ -3230,6 +3537,8 @@ void cgroup_post_fork(struct task_struct *child)
 		task_unlock(child);
 		write_unlock(&css_set_lock);
 	}
+	if (clone_flags & CLONE_THREAD)
+		up_read(&current->group_leader->cgroup_fork_mutex);
 }
 /**
  * cgroup_exit - detach cgroup from exiting task
@@ -3301,6 +3610,26 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
 }
 
 /**
+ * cgroup_fork_failed - undo operations for fork failure
+ * @tsk: pointer to the task_struct of the child whose fork failed
+ * @run_callbacks: whether to run the subsystem exit callbacks
+ *
+ * Description: Undo cgroup operations after cgroup_fork() when fork fails.
+ *
+ * We release the read lock taken in cgroup_fork(), since it would otherwise
+ * be dropped in cgroup_post_fork() in the success case. We also need to
+ * detach the failed child task from the cgroup, which we do by wrapping
+ * cgroup_exit().
+ */
+void cgroup_fork_failed(struct task_struct *tsk, int run_callbacks,
+			int clone_flags)
+{
+	if (clone_flags & CLONE_THREAD)
+		up_read(&current->group_leader->cgroup_fork_mutex);
+	cgroup_exit(tsk, run_callbacks);
+}
+
+/**
  * cgroup_clone - clone the cgroup the given subsystem is attached to
  * @tsk: the task to be moved
  * @subsys: the given subsystem
diff --git a/kernel/fork.c b/kernel/fork.c
index 926c117..3f9decf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1053,7 +1053,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	monotonic_to_bootbased(&p->real_start_time);
 	p->io_context = NULL;
 	p->audit_context = NULL;
-	cgroup_fork(p);
+	cgroup_fork(p, clone_flags);
 #ifdef CONFIG_NUMA
 	p->mempolicy = mpol_dup(p->mempolicy);
  	if (IS_ERR(p->mempolicy)) {
@@ -1269,7 +1269,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	spin_unlock(&current->sighand->siglock);
 	write_unlock_irq(&tasklist_lock);
 	proc_fork_connector(p);
-	cgroup_post_fork(p);
+	cgroup_post_fork(p, clone_flags);
 	return p;
 
 bad_fork_free_pid:
@@ -1300,7 +1300,7 @@ bad_fork_cleanup_policy:
 	mpol_put(p->mempolicy);
 bad_fork_cleanup_cgroup:
 #endif
-	cgroup_exit(p, cgroup_callbacks_done);
+	cgroup_fork_failed(p, cgroup_callbacks_done, clone_flags);
 	delayacct_tsk_free(p);
 	if (p->binfmt)
 		module_put(p->binfmt->module);


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* Re: [PATCH 5/6] Lets ss->can_attach and ss->attach do whole threadgroups at a time
       [not found]   ` <20090731015149.27908.25403.stgit-/yCBOHwbXCxd3OlUiQof+WCaruZE5nAUZeezCHUQhQ4@public.gmane.org>
@ 2009-08-03  2:22     ` Li Zefan
  0 siblings, 0 replies; 100+ messages in thread
From: Li Zefan @ 2009-08-03  2:22 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

09:51, Ben Blum wrote:
> Lets ss->can_attach and ss->attach do whole threadgroups at a time
> 
> This patch alters the ss->can_attach and ss->attach functions to be able to
> deal with a whole threadgroup at a time, for use in cgroup_attach_proc. (This
> is a pre-patch to cgroup-procs-writable.patch.)
> 
> Currently, the new mode of the attach function can only tell the subsystem
> about the old cgroup of the threadgroup leader. No subsystem currently needs
> that information for each thread being moved, but if one were to be added
> (for example, one that counts tasks within a group), this interface would
> need to be reworked to tell the subsystem the right information.
> 
> Signed-off-by: Ben Blum <bblum-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> 
> ---
> 
>  include/linux/cgroup.h   |    7 +++--
>  kernel/cgroup.c          |    4 +--
>  kernel/cgroup_freezer.c  |   15 +++++++++--
>  kernel/cpuset.c          |   65 ++++++++++++++++++++++++++++++++++++----------
>  kernel/ns_cgroup.c       |   16 ++++++++++-
>  kernel/sched.c           |   37 ++++++++++++++++++++++++--
>  mm/memcontrol.c          |    3 +-
>  security/device_cgroup.c |    3 +-
>  8 files changed, 122 insertions(+), 28 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 24e3f1a..8286758 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -408,10 +408,11 @@ struct cgroup_subsys {
>  						  struct cgroup *cgrp);
>  	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>  	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> -	int (*can_attach)(struct cgroup_subsys *ss,
> -			  struct cgroup *cgrp, struct task_struct *tsk);
> +	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
> +			  struct task_struct *tsk, bool threadgroup);
>  	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
> -			struct cgroup *old_cgrp, struct task_struct *tsk);
> +			struct cgroup *old_cgrp, struct task_struct *tsk,
> +			bool threadgroup);

Could you also update Documentation/cgroups/cgroups.txt?

>  	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
>  	void (*exit)(struct cgroup_subsys *ss, struct task_struct *task);
>  	int (*populate)(struct cgroup_subsys *ss,
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index a12bc8e..ea05d6b 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1324,7 +1324,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  
>  	for_each_subsys(root, ss) {
>  		if (ss->can_attach) {
> -			retval = ss->can_attach(ss, cgrp, tsk);
> +			retval = ss->can_attach(ss, cgrp, tsk, false);
>  			if (retval)
>  				return retval;
>  		}
> @@ -1362,7 +1362,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  
>  	for_each_subsys(root, ss) {
>  		if (ss->attach)
> -			ss->attach(ss, cgrp, oldcgrp, tsk);
> +			ss->attach(ss, cgrp, oldcgrp, tsk, false);
>  	}
>  	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
>  	synchronize_rcu();
> diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
> index fb249e2..4e352ab 100644
> --- a/kernel/cgroup_freezer.c
> +++ b/kernel/cgroup_freezer.c
> @@ -159,10 +159,9 @@ static bool is_task_frozen_enough(struct task_struct *task)
>   */
>  static int freezer_can_attach(struct cgroup_subsys *ss,
>  			      struct cgroup *new_cgroup,
> -			      struct task_struct *task)
> +			      struct task_struct *task, bool threadgroup)
>  {
>  	struct freezer *freezer;
> -

Please keep this blank line.

>  	/*
>  	 * Anything frozen can't move or be moved to/from.
>  	 *
> @@ -177,6 +176,18 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
>  	if (freezer->state == CGROUP_FROZEN)
>  		return -EBUSY;
>  
> +	if (threadgroup) {
> +		struct task_struct *c;

And better have a blank line here. ;)

> +		rcu_read_lock();
> +		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
> +			if (is_task_frozen_enough(c)) {
> +				rcu_read_unlock();
> +				return -EBUSY;
> +			}
> +		}
> +		rcu_read_unlock();
> +	}
> +
>  	return 0;
>  }
>  
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index 7e75a41..86397f4 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -1324,9 +1324,10 @@ static int fmeter_getrate(struct fmeter *fmp)
>  static cpumask_var_t cpus_attach;
>  
>  /* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
> -static int cpuset_can_attach(struct cgroup_subsys *ss,
> -			     struct cgroup *cont, struct task_struct *tsk)
> +static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
> +			     struct task_struct *tsk, bool threadgroup)
>  {
> +	int ret;
>  	struct cpuset *cs = cgroup_cs(cont);
>  
>  	if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
> @@ -1343,18 +1344,50 @@ static int cpuset_can_attach(struct cgroup_subsys *ss,
>  	if (tsk->flags & PF_THREAD_BOUND)
>  		return -EINVAL;
>  
> -	return security_task_setscheduler(tsk, 0, NULL);
> +	ret = security_task_setscheduler(tsk, 0, NULL);
> +	if (ret)
> +		return ret;
> +	if (threadgroup) {
> +		struct task_struct *c;

ditto

> +		rcu_read_lock();
> +		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
> +			ret = security_task_setscheduler(c, 0, NULL);
> +			if (ret) {
> +				rcu_read_unlock();
> +				return ret;
> +			}
> +		}
> +		rcu_read_unlock();
> +	}
> +	return 0;
> +}

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 5/6] Lets ss->can_attach and ss->attach do whole threadgroups at a time
  2009-07-31  1:51 ` [PATCH 5/6] Lets ss->can_attach and ss->attach do whole threadgroups at a time Ben Blum
       [not found]   ` <20090731015149.27908.25403.stgit-/yCBOHwbXCxd3OlUiQof+WCaruZE5nAUZeezCHUQhQ4@public.gmane.org>
@ 2009-08-03  2:22   ` Li Zefan
  2009-08-04  0:35     ` Benjamin Blum
       [not found]     ` <4A7649E1.4000200-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  1 sibling, 2 replies; 100+ messages in thread
From: Li Zefan @ 2009-08-03  2:22 UTC (permalink / raw)
  To: Ben Blum; +Cc: linux-kernel, containers, akpm, serue, menage

09:51, Ben Blum wrote:
> Lets ss->can_attach and ss->attach do whole threadgroups at a time
> 
> This patch alters the ss->can_attach and ss->attach functions to be able to
> deal with a whole threadgroup at a time, for use in cgroup_attach_proc. (This
> is a pre-patch to cgroup-procs-writable.patch.)
> 
> Currently, the new mode of the attach function can only tell the subsystem
> about the old cgroup of the threadgroup leader. No subsystem currently needs
> that information for each thread being moved, but if one were to be added
> (for example, one that counts tasks within a group), this interface would
> need to be reworked to tell the subsystem the right information.
> 
> Signed-off-by: Ben Blum <bblum@google.com>
> 
> ---
> 
>  include/linux/cgroup.h   |    7 +++--
>  kernel/cgroup.c          |    4 +--
>  kernel/cgroup_freezer.c  |   15 +++++++++--
>  kernel/cpuset.c          |   65 ++++++++++++++++++++++++++++++++++++----------
>  kernel/ns_cgroup.c       |   16 ++++++++++-
>  kernel/sched.c           |   37 ++++++++++++++++++++++++--
>  mm/memcontrol.c          |    3 +-
>  security/device_cgroup.c |    3 +-
>  8 files changed, 122 insertions(+), 28 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 24e3f1a..8286758 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -408,10 +408,11 @@ struct cgroup_subsys {
>  						  struct cgroup *cgrp);
>  	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>  	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> -	int (*can_attach)(struct cgroup_subsys *ss,
> -			  struct cgroup *cgrp, struct task_struct *tsk);
> +	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
> +			  struct task_struct *tsk, bool threadgroup);
>  	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
> -			struct cgroup *old_cgrp, struct task_struct *tsk);
> +			struct cgroup *old_cgrp, struct task_struct *tsk,
> +			bool threadgroup);

Could you also update Documentation/cgroups/cgroups.txt?

>  	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
>  	void (*exit)(struct cgroup_subsys *ss, struct task_struct *task);
>  	int (*populate)(struct cgroup_subsys *ss,
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index a12bc8e..ea05d6b 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1324,7 +1324,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  
>  	for_each_subsys(root, ss) {
>  		if (ss->can_attach) {
> -			retval = ss->can_attach(ss, cgrp, tsk);
> +			retval = ss->can_attach(ss, cgrp, tsk, false);
>  			if (retval)
>  				return retval;
>  		}
> @@ -1362,7 +1362,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  
>  	for_each_subsys(root, ss) {
>  		if (ss->attach)
> -			ss->attach(ss, cgrp, oldcgrp, tsk);
> +			ss->attach(ss, cgrp, oldcgrp, tsk, false);
>  	}
>  	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
>  	synchronize_rcu();
> diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
> index fb249e2..4e352ab 100644
> --- a/kernel/cgroup_freezer.c
> +++ b/kernel/cgroup_freezer.c
> @@ -159,10 +159,9 @@ static bool is_task_frozen_enough(struct task_struct *task)
>   */
>  static int freezer_can_attach(struct cgroup_subsys *ss,
>  			      struct cgroup *new_cgroup,
> -			      struct task_struct *task)
> +			      struct task_struct *task, bool threadgroup)
>  {
>  	struct freezer *freezer;
> -

Please keep this blank line.

>  	/*
>  	 * Anything frozen can't move or be moved to/from.
>  	 *
> @@ -177,6 +176,18 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
>  	if (freezer->state == CGROUP_FROZEN)
>  		return -EBUSY;
>  
> +	if (threadgroup) {
> +		struct task_struct *c;

And better have a blank line here. ;)

> +		rcu_read_lock();
> +		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
> +			if (is_task_frozen_enough(c)) {
> +				rcu_read_unlock();
> +				return -EBUSY;
> +			}
> +		}
> +		rcu_read_unlock();
> +	}
> +
>  	return 0;
>  }
>  
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index 7e75a41..86397f4 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -1324,9 +1324,10 @@ static int fmeter_getrate(struct fmeter *fmp)
>  static cpumask_var_t cpus_attach;
>  
>  /* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
> -static int cpuset_can_attach(struct cgroup_subsys *ss,
> -			     struct cgroup *cont, struct task_struct *tsk)
> +static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
> +			     struct task_struct *tsk, bool threadgroup)
>  {
> +	int ret;
>  	struct cpuset *cs = cgroup_cs(cont);
>  
>  	if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
> @@ -1343,18 +1344,50 @@ static int cpuset_can_attach(struct cgroup_subsys *ss,
>  	if (tsk->flags & PF_THREAD_BOUND)
>  		return -EINVAL;
>  
> -	return security_task_setscheduler(tsk, 0, NULL);
> +	ret = security_task_setscheduler(tsk, 0, NULL);
> +	if (ret)
> +		return ret;
> +	if (threadgroup) {
> +		struct task_struct *c;

ditto

> +		rcu_read_lock();
> +		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
> +			ret = security_task_setscheduler(c, 0, NULL);
> +			if (ret) {
> +				rcu_read_unlock();
> +				return ret;
> +			}
> +		}
> +		rcu_read_unlock();
> +	}
> +	return 0;
> +}


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid at once
       [not found]   ` <20090731015154.27908.9639.stgit-/yCBOHwbXCxd3OlUiQof+WCaruZE5nAUZeezCHUQhQ4@public.gmane.org>
@ 2009-08-03  3:00     ` Li Zefan
  2009-08-03 17:54     ` Serge E. Hallyn
  1 sibling, 0 replies; 100+ messages in thread
From: Li Zefan @ 2009-08-03  3:00 UTC (permalink / raw)
  To: Ben Blum
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Ben Blum wrote:
> Makes procs file writable to move all threads by tgid at once
> 
> This patch adds functionality that enables users to move all threads in a
> threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
> file. This current implementation makes use of a per-threadgroup rwsem that's
> taken for reading in the fork() path to prevent newly forking threads within
> the threadgroup from "escaping" while the move is in progress.
> 
> There is a gap between releasing the fork mutex and calling each subsystem's
> attach function, which could lead to problems if a subsystem relies on state
> that newly forking threads could change in the meantime. No such problem is
> apparent in any current subsystem, but if one were to arise, the
> per-threadgroup fork mutex could instead be held until after the attach calls
> complete.
> 

This seems to work.

A few comments below..

> Signed-off-by: Ben Blum <bblum-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> 
> ---
> 
>  Documentation/cgroups/cgroups.txt |   12 +
>  include/linux/cgroup.h            |   12 +
>  include/linux/init_task.h         |    9 +
>  include/linux/sched.h             |    2 
>  kernel/cgroup.c                   |  417 +++++++++++++++++++++++++++++++++----
>  kernel/fork.c                     |    6 -
>  6 files changed, 406 insertions(+), 52 deletions(-)
> 
> diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
> index 6eb1a97..d579346 100644
> --- a/Documentation/cgroups/cgroups.txt
> +++ b/Documentation/cgroups/cgroups.txt
> @@ -228,6 +228,7 @@ Each cgroup is represented by a directory in the cgroup file system
>  containing the following files describing that cgroup:
>  
>   - tasks: list of tasks (by pid) attached to that cgroup
> + - cgroup.procs: list of unique tgids in the cgroup
>   - notify_on_release flag: run the release agent on exit?
>   - release_agent: the path to use for release notifications (this file
>     exists in the top cgroup only)
> @@ -374,7 +375,7 @@ Now you want to do something with this cgroup.
>  
>  In this directory you can find several files:
>  # ls
> -notify_on_release tasks
> +cgroup.procs notify_on_release tasks
>  (plus whatever files added by the attached subsystems)
>  
>  Now attach your shell to this cgroup:
> @@ -408,6 +409,15 @@ You can attach the current shell task by echoing 0:
>  
>  # echo 0 > tasks
>  
> +The cgroup.procs file is useful for managing all tasks in a threadgroup at
> +once. It works the same way as the tasks file, but moves all tasks in the
> +threadgroup with the specified tgid.
> +
> +Writing the pid of a task that's not the threadgroup leader (i.e., a pid
> +that isn't a tgid) is treated as invalid. Writing a '0' to cgroup.procs will
> +attach the writing task and all tasks in its threadgroup, but is invalid if
> +the writing task is not the leader of the threadgroup.
> +
>  3. Kernel API
>  =============
>  
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 8286758..105d681 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -30,10 +30,12 @@ extern int cgroup_init(void);
>  extern void cgroup_lock(void);
>  extern bool cgroup_lock_live_group(struct cgroup *cgrp);
>  extern void cgroup_unlock(void);
> -extern void cgroup_fork(struct task_struct *p);
> +extern void cgroup_fork(struct task_struct *p, int clone_flags);
>  extern void cgroup_fork_callbacks(struct task_struct *p);
> -extern void cgroup_post_fork(struct task_struct *p);
> +extern void cgroup_post_fork(struct task_struct *p, int clone_flags);
>  extern void cgroup_exit(struct task_struct *p, int run_callbacks);
> +extern void cgroup_fork_failed(struct task_struct *p, int run_callbacks,
> +			       int clone_flags);
>  extern int cgroupstats_build(struct cgroupstats *stats,
>  				struct dentry *dentry);
>  
> @@ -551,10 +553,12 @@ unsigned short css_depth(struct cgroup_subsys_state *css);
>  
>  static inline int cgroup_init_early(void) { return 0; }
>  static inline int cgroup_init(void) { return 0; }
> -static inline void cgroup_fork(struct task_struct *p) {}
> +static inline void cgroup_fork(struct task_struct *p, int clone_flags) {}
>  static inline void cgroup_fork_callbacks(struct task_struct *p) {}
> -static inline void cgroup_post_fork(struct task_struct *p) {}
> +static inline void cgroup_post_fork(struct task_struct *p, int clone_flags) {}
>  static inline void cgroup_exit(struct task_struct *p, int callbacks) {}
> +static inline void cgroup_fork_failed(struct task_struct *p, int callbacks,
> +				      int clone_flags) {}
>  
>  static inline void cgroup_lock(void) {}
>  static inline void cgroup_unlock(void) {}
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index aecd24e..26d814f 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -105,6 +105,14 @@ extern struct cred init_cred;
>  # define INIT_PERF_COUNTERS(tsk)
>  #endif
>  
> +#ifdef CONFIG_CGROUPS
> +# define INIT_CGROUP_FORK_MUTEX(tsk)					\
> +	.cgroup_fork_mutex =						\
> +		__RWSEM_INITIALIZER(tsk.cgroup_fork_mutex),
> +#else
> +# define INIT_CGROUP_FORK_MUTEX(tsk)
> +#endif
> +
>  /*
>   *  INIT_TASK is used to set up the first task table, touch at
>   * your own risk!. Base=0, limit=0x1fffff (=2MB)
> @@ -174,6 +182,7 @@ extern struct cred init_cred;
>  	INIT_LOCKDEP							\
>  	INIT_FTRACE_GRAPH						\
>  	INIT_TRACE_RECURSION						\
> +	INIT_CGROUP_FORK_MUTEX(tsk)					\
>  }
>  
>  
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 55e3e11..5d38980 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1400,6 +1400,8 @@ struct task_struct {
>  	struct css_set *cgroups;
>  	/* cg_list protected by css_set_lock and tsk->alloc_lock */
>  	struct list_head cg_list;
> +	/* guarantees atomic threadgroup movement via the procs file */
> +	struct rw_semaphore cgroup_fork_mutex;
>  #endif
>  #ifdef CONFIG_FUTEX
>  	struct robust_list_head __user *robust_list;
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index ea05d6b..3ce7298 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1297,6 +1297,87 @@ static void get_first_subsys(const struct cgroup *cgrp,
>  		*subsys_id = test_ss->subsys_id;
>  }
>  
> +/*
> + * cgroup_task_migrate - move a task from one cgroup to another.
> + *
> + * 'guarantee' is set if the caller promises that a new css_set for the task
> + * will already exist. If not set, this function might sleep, and can fail
> + * with -ENOMEM. Otherwise, it can only fail with -ESRCH.
> + */
> +static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
> +			       struct task_struct *tsk, int guarantee)
> +{
> +	struct css_set *oldcg;
> +	struct css_set *newcg;
> +
> +	/*
> +	 * get old css_set. we need to take task_lock and refcount it, because
> +	 * an exiting task can change its css_set to init_css_set and drop its
> +	 * old one without taking cgroup_mutex.
> +	 */
> +	task_lock(tsk);
> +	oldcg = tsk->cgroups;
> +	get_css_set(oldcg);
> +	task_unlock(tsk);

It would be better to use more blank lines here to improve code readability.

> +	/*
> +	 * locate or allocate a new css_set for this task. 'guarantee' tells
> +	 * us whether or not we are sure that a new css_set already exists;
> +	 * in that case, we are not allowed to fail, as we won't need malloc.
> +	 */
> +	if (guarantee) {
> +		/*
> +		 * our caller promises us that the css_set we want already
> +		 * exists, so we use find_existing_css_set directly.
> +		 */
> +		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
> +		read_lock(&css_set_lock);
> +		newcg = find_existing_css_set(oldcg, cgrp, template);
> +		BUG_ON(!newcg);
> +		get_css_set(newcg);
> +		read_unlock(&css_set_lock);
> +	} else {
> +		might_sleep();
> +		/* find_css_set will give us newcg already referenced. */
> +		newcg = find_css_set(oldcg, cgrp);
> +		if (!newcg) {
> +			put_css_set(oldcg);
> +			return -ENOMEM;
> +		}
> +	}
> +	put_css_set(oldcg);
> +
> +	/*
> +	 * we cannot move a task that's declared itself as exiting, as once
> +	 * PF_EXITING is set, the tsk->cgroups pointer is no longer safe.
> +	 */
> +	task_lock(tsk);
> +	if (tsk->flags & PF_EXITING) {
> +		task_unlock(tsk);
> +		put_css_set(newcg);
> +		return -ESRCH;
> +	}
> +	rcu_assign_pointer(tsk->cgroups, newcg);
> +	task_unlock(tsk);
> +
> +	/* Update the css_set linked lists if we're using them */
> +	write_lock(&css_set_lock);
> +	if (!list_empty(&tsk->cg_list)) {
> +		list_del(&tsk->cg_list);
> +		list_add(&tsk->cg_list, &newcg->tasks);

list_move()
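
i.e., roughly (just the suggested replacement, untested):

	if (!list_empty(&tsk->cg_list))
		list_move(&tsk->cg_list, &newcg->tasks);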

> +	}
> +	write_unlock(&css_set_lock);
> +
> +	/*
> +	 * We just gained a reference on oldcg by taking it from the task. As

This comment is incorrect; the ref we just got has already been dropped by
the put_css_set(oldcg) above.

> +	 * trading it for newcg is protected by cgroup_mutex, we're safe to
> +	 * drop it here; it will be freed under RCU.
> +	 */
> +	put_css_set(oldcg);
> +
> +	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
> +	return 0;
> +}
> +
>  /**
>   * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
>   * @cgrp: the cgroup the task is attaching to
> @@ -1307,11 +1388,9 @@ static void get_first_subsys(const struct cgroup *cgrp,
>   */
>  int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  {
> -	int retval = 0;
> +	int retval;
>  	struct cgroup_subsys *ss;
>  	struct cgroup *oldcgrp;
> -	struct css_set *cg;
> -	struct css_set *newcg;
>  	struct cgroupfs_root *root = cgrp->root;
>  	int subsys_id;
>  
> @@ -1330,75 +1409,293 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  		}
>  	}
>  
> -	task_lock(tsk);
> -	cg = tsk->cgroups;
> -	get_css_set(cg);
> -	task_unlock(tsk);
> +	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, 0);
> +	if (retval)
> +		return retval;
> +
> +	for_each_subsys(root, ss) {
> +		if (ss->attach)
> +			ss->attach(ss, cgrp, oldcgrp, tsk, false);
> +	}
> +
> +	synchronize_rcu();
> +
>  	/*
> -	 * Locate or allocate a new css_set for this task,
> -	 * based on its final set of cgroups
> +	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
> +	 * is no longer empty.
>  	 */
> +	cgroup_wakeup_rmdir_waiters(cgrp);
> +	return 0;
> +}
> +
> +/*
> + * cgroup_attach_proc works in two stages, the first of which prefetches all
> + * new css_sets needed (to make sure we have enough memory before committing
> + * to the move) and stores them in a list, of entries of the following type.
> + * TODO: possible optimization: use css_set->rcu_head for chaining instead
> + */
> +struct cg_list_entry {
> +	struct css_set *cg;
> +	struct list_head links;
> +};
> +
> +static int css_set_check_fetched(struct cgroup *cgrp, struct task_struct *tsk,
> +				 struct css_set *cg,
> +				 struct list_head *newcg_list)
> +{
> +	struct css_set *newcg;
> +	struct cg_list_entry *cg_entry;
> +	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
> +	read_lock(&css_set_lock);
> +	newcg = find_existing_css_set(cg, cgrp, template);
> +	if (newcg)
> +		get_css_set(newcg);
> +	read_unlock(&css_set_lock);
> +	/* doesn't exist at all? */
> +	if (!newcg)
> +		return 1;

I think it's more intuitive to return 1 if found and 0 if not found.
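
Roughly something like this (a sketch only, not tested, with the unused @tsk
parameter dropped and the return type switched to bool):

static bool css_set_fetched(struct cgroup *cgrp, struct css_set *cg,
			    struct list_head *newcg_list)
{
	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
	struct cg_list_entry *cg_entry;
	struct css_set *newcg;
	bool found = false;

	read_lock(&css_set_lock);
	newcg = find_existing_css_set(cg, cgrp, template);
	if (newcg)
		get_css_set(newcg);
	read_unlock(&css_set_lock);

	/* doesn't exist at all? then it certainly isn't in the list */
	if (!newcg)
		return false;

	list_for_each_entry(cg_entry, newcg_list, links) {
		if (cg_entry->cg == newcg) {
			found = true;
			break;
		}
	}
	put_css_set(newcg);
	return found;
}

The callers in cgroup_attach_proc() would then prefetch when this returns
false.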

> +	/* see if it's already in the list */
> +	list_for_each_entry(cg_entry, newcg_list, links) {
> +		if (cg_entry->cg == newcg) {
> +			put_css_set(newcg);
> +			return 0;
> +		}
> +	}
> +	/* not found */
> +	put_css_set(newcg);
> +	return 1;

Those lines are squeezed too tight. ;)

> +}
> +

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid at once
  2009-07-31  1:51 ` Ben Blum
@ 2009-08-03  3:00   ` Li Zefan
  2009-08-04  0:56     ` Benjamin Blum
       [not found]     ` <4A7652E7.4020206-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2009-08-03 17:54   ` Serge E. Hallyn
       [not found]   ` <20090731015154.27908.9639.stgit-/yCBOHwbXCxd3OlUiQof+WCaruZE5nAUZeezCHUQhQ4@public.gmane.org>
  2 siblings, 2 replies; 100+ messages in thread
From: Li Zefan @ 2009-08-03  3:00 UTC (permalink / raw)
  To: Ben Blum; +Cc: linux-kernel, containers, akpm, serue, menage

Ben Blum wrote:
> Makes procs file writable to move all threads by tgid at once
> 
> This patch adds functionality that enables users to move all threads in a
> threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
> file. This current implementation makes use of a per-threadgroup rwsem that's
> taken for reading in the fork() path to prevent newly forking threads within
> the threadgroup from "escaping" while the move is in progress.
> 
> There is a gap between releasing the fork mutex and calling each subsystem's
> attach function, which could lead to problems if a subsystem relies on state
> that newly forking threads could change in the meantime. No such problem is
> apparent in any current subsystem, but if one were to arise, the
> per-threadgroup fork mutex could instead be held until after the attach calls
> complete.
> 

This seems to work.

A few comments below..

> Signed-off-by: Ben Blum <bblum@google.com>
> 
> ---
> 
>  Documentation/cgroups/cgroups.txt |   12 +
>  include/linux/cgroup.h            |   12 +
>  include/linux/init_task.h         |    9 +
>  include/linux/sched.h             |    2 
>  kernel/cgroup.c                   |  417 +++++++++++++++++++++++++++++++++----
>  kernel/fork.c                     |    6 -
>  6 files changed, 406 insertions(+), 52 deletions(-)
> 
> diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
> index 6eb1a97..d579346 100644
> --- a/Documentation/cgroups/cgroups.txt
> +++ b/Documentation/cgroups/cgroups.txt
> @@ -228,6 +228,7 @@ Each cgroup is represented by a directory in the cgroup file system
>  containing the following files describing that cgroup:
>  
>   - tasks: list of tasks (by pid) attached to that cgroup
> + - cgroup.procs: list of unique tgids in the cgroup
>   - notify_on_release flag: run the release agent on exit?
>   - release_agent: the path to use for release notifications (this file
>     exists in the top cgroup only)
> @@ -374,7 +375,7 @@ Now you want to do something with this cgroup.
>  
>  In this directory you can find several files:
>  # ls
> -notify_on_release tasks
> +cgroup.procs notify_on_release tasks
>  (plus whatever files added by the attached subsystems)
>  
>  Now attach your shell to this cgroup:
> @@ -408,6 +409,15 @@ You can attach the current shell task by echoing 0:
>  
>  # echo 0 > tasks
>  
> +The cgroup.procs file is useful for managing all tasks in a threadgroup at
> +once. It works the same way as the tasks file, but moves all tasks in the
> +threadgroup with the specified tgid.
> +
> +Writing the pid of a task that's not the threadgroup leader (i.e., a pid
> +that isn't a tgid) is treated as invalid. Writing a '0' to cgroup.procs will
> +attach the writing task and all tasks in its threadgroup, but is invalid if
> +the writing task is not the leader of the threadgroup.
> +
>  3. Kernel API
>  =============
>  
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 8286758..105d681 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -30,10 +30,12 @@ extern int cgroup_init(void);
>  extern void cgroup_lock(void);
>  extern bool cgroup_lock_live_group(struct cgroup *cgrp);
>  extern void cgroup_unlock(void);
> -extern void cgroup_fork(struct task_struct *p);
> +extern void cgroup_fork(struct task_struct *p, int clone_flags);
>  extern void cgroup_fork_callbacks(struct task_struct *p);
> -extern void cgroup_post_fork(struct task_struct *p);
> +extern void cgroup_post_fork(struct task_struct *p, int clone_flags);
>  extern void cgroup_exit(struct task_struct *p, int run_callbacks);
> +extern void cgroup_fork_failed(struct task_struct *p, int run_callbacks,
> +			       int clone_flags);
>  extern int cgroupstats_build(struct cgroupstats *stats,
>  				struct dentry *dentry);
>  
> @@ -551,10 +553,12 @@ unsigned short css_depth(struct cgroup_subsys_state *css);
>  
>  static inline int cgroup_init_early(void) { return 0; }
>  static inline int cgroup_init(void) { return 0; }
> -static inline void cgroup_fork(struct task_struct *p) {}
> +static inline void cgroup_fork(struct task_struct *p, int clone_flags) {}
>  static inline void cgroup_fork_callbacks(struct task_struct *p) {}
> -static inline void cgroup_post_fork(struct task_struct *p) {}
> +static inline void cgroup_post_fork(struct task_struct *p, int clone_flags) {}
>  static inline void cgroup_exit(struct task_struct *p, int callbacks) {}
> +static inline void cgroup_fork_failed(struct task_struct *p, int callbacks,
> +				      int clone_flags) {}
>  
>  static inline void cgroup_lock(void) {}
>  static inline void cgroup_unlock(void) {}
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index aecd24e..26d814f 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -105,6 +105,14 @@ extern struct cred init_cred;
>  # define INIT_PERF_COUNTERS(tsk)
>  #endif
>  
> +#ifdef CONFIG_CGROUPS
> +# define INIT_CGROUP_FORK_MUTEX(tsk)					\
> +	.cgroup_fork_mutex =						\
> +		__RWSEM_INITIALIZER(tsk.cgroup_fork_mutex),
> +#else
> +# define INIT_CGROUP_FORK_MUTEX(tsk)
> +#endif
> +
>  /*
>   *  INIT_TASK is used to set up the first task table, touch at
>   * your own risk!. Base=0, limit=0x1fffff (=2MB)
> @@ -174,6 +182,7 @@ extern struct cred init_cred;
>  	INIT_LOCKDEP							\
>  	INIT_FTRACE_GRAPH						\
>  	INIT_TRACE_RECURSION						\
> +	INIT_CGROUP_FORK_MUTEX(tsk)					\
>  }
>  
>  
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 55e3e11..5d38980 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1400,6 +1400,8 @@ struct task_struct {
>  	struct css_set *cgroups;
>  	/* cg_list protected by css_set_lock and tsk->alloc_lock */
>  	struct list_head cg_list;
> +	/* guarantees atomic threadgroup movement via the procs file */
> +	struct rw_semaphore cgroup_fork_mutex;
>  #endif
>  #ifdef CONFIG_FUTEX
>  	struct robust_list_head __user *robust_list;
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index ea05d6b..3ce7298 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1297,6 +1297,87 @@ static void get_first_subsys(const struct cgroup *cgrp,
>  		*subsys_id = test_ss->subsys_id;
>  }
>  
> +/*
> + * cgroup_task_migrate - move a task from one cgroup to another.
> + *
> + * 'guarantee' is set if the caller promises that a new css_set for the task
> + * will already exist. If not set, this function might sleep, and can fail
> + * with -ENOMEM. Otherwise, it can only fail with -ESRCH.
> + */
> +static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
> +			       struct task_struct *tsk, int guarantee)
> +{
> +	struct css_set *oldcg;
> +	struct css_set *newcg;
> +
> +	/*
> +	 * get old css_set. we need to take task_lock and refcount it, because
> +	 * an exiting task can change its css_set to init_css_set and drop its
> +	 * old one without taking cgroup_mutex.
> +	 */
> +	task_lock(tsk);
> +	oldcg = tsk->cgroups;
> +	get_css_set(oldcg);
> +	task_unlock(tsk);

It would be better to use more blank lines here to improve code readability.

> +	/*
> +	 * locate or allocate a new css_set for this task. 'guarantee' tells
> +	 * us whether or not we are sure that a new css_set already exists;
> +	 * in that case, we are not allowed to fail, as we won't need malloc.
> +	 */
> +	if (guarantee) {
> +		/*
> +		 * our caller promises us that the css_set we want already
> +		 * exists, so we use find_existing_css_set directly.
> +		 */
> +		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
> +		read_lock(&css_set_lock);
> +		newcg = find_existing_css_set(oldcg, cgrp, template);
> +		BUG_ON(!newcg);
> +		get_css_set(newcg);
> +		read_unlock(&css_set_lock);
> +	} else {
> +		might_sleep();
> +		/* find_css_set will give us newcg already referenced. */
> +		newcg = find_css_set(oldcg, cgrp);
> +		if (!newcg) {
> +			put_css_set(oldcg);
> +			return -ENOMEM;
> +		}
> +	}
> +	put_css_set(oldcg);
> +
> +	/*
> +	 * we cannot move a task that's declared itself as exiting, as once
> +	 * PF_EXITING is set, the tsk->cgroups pointer is no longer safe.
> +	 */
> +	task_lock(tsk);
> +	if (tsk->flags & PF_EXITING) {
> +		task_unlock(tsk);
> +		put_css_set(newcg);
> +		return -ESRCH;
> +	}
> +	rcu_assign_pointer(tsk->cgroups, newcg);
> +	task_unlock(tsk);
> +
> +	/* Update the css_set linked lists if we're using them */
> +	write_lock(&css_set_lock);
> +	if (!list_empty(&tsk->cg_list)) {
> +		list_del(&tsk->cg_list);
> +		list_add(&tsk->cg_list, &newcg->tasks);

list_move()

> +	}
> +	write_unlock(&css_set_lock);
> +
> +	/*
> +	 * We just gained a reference on oldcg by taking it from the task. As

This comment is incorrect; the ref we just got has already been dropped by
the put_css_set(oldcg) above.

> +	 * trading it for newcg is protected by cgroup_mutex, we're safe to
> +	 * drop it here; it will be freed under RCU.
> +	 */
> +	put_css_set(oldcg);
> +
> +	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
> +	return 0;
> +}
> +
>  /**
>   * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
>   * @cgrp: the cgroup the task is attaching to
> @@ -1307,11 +1388,9 @@ static void get_first_subsys(const struct cgroup *cgrp,
>   */
>  int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  {
> -	int retval = 0;
> +	int retval;
>  	struct cgroup_subsys *ss;
>  	struct cgroup *oldcgrp;
> -	struct css_set *cg;
> -	struct css_set *newcg;
>  	struct cgroupfs_root *root = cgrp->root;
>  	int subsys_id;
>  
> @@ -1330,75 +1409,293 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>  		}
>  	}
>  
> -	task_lock(tsk);
> -	cg = tsk->cgroups;
> -	get_css_set(cg);
> -	task_unlock(tsk);
> +	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, 0);
> +	if (retval)
> +		return retval;
> +
> +	for_each_subsys(root, ss) {
> +		if (ss->attach)
> +			ss->attach(ss, cgrp, oldcgrp, tsk, false);
> +	}
> +
> +	synchronize_rcu();
> +
>  	/*
> -	 * Locate or allocate a new css_set for this task,
> -	 * based on its final set of cgroups
> +	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
> +	 * is no longer empty.
>  	 */
> +	cgroup_wakeup_rmdir_waiters(cgrp);
> +	return 0;
> +}
> +
> +/*
> + * cgroup_attach_proc works in two stages, the first of which prefetches all
> + * new css_sets needed (to make sure we have enough memory before committing
> + * to the move) and stores them in a list, of entries of the following type.
> + * TODO: possible optimization: use css_set->rcu_head for chaining instead
> + */
> +struct cg_list_entry {
> +	struct css_set *cg;
> +	struct list_head links;
> +};
> +
> +static int css_set_check_fetched(struct cgroup *cgrp, struct task_struct *tsk,
> +				 struct css_set *cg,
> +				 struct list_head *newcg_list)
> +{
> +	struct css_set *newcg;
> +	struct cg_list_entry *cg_entry;
> +	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
> +	read_lock(&css_set_lock);
> +	newcg = find_existing_css_set(cg, cgrp, template);
> +	if (newcg)
> +		get_css_set(newcg);
> +	read_unlock(&css_set_lock);
> +	/* doesn't exist at all? */
> +	if (!newcg)
> +		return 1;

I think it's more intuitive to return 1 if found and 0 if not found.

> +	/* see if it's already in the list */
> +	list_for_each_entry(cg_entry, newcg_list, links) {
> +		if (cg_entry->cg == newcg) {
> +			put_css_set(newcg);
> +			return 0;
> +		}
> +	}
> +	/* not found */
> +	put_css_set(newcg);
> +	return 1;

Those lines are squeezed too tight. ;)

> +}
> +


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid at once
  2009-07-31  1:51 ` Ben Blum
  2009-08-03  3:00   ` Li Zefan
@ 2009-08-03 17:54   ` Serge E. Hallyn
  2009-08-03 18:07     ` Paul Menage
                       ` (2 more replies)
       [not found]   ` <20090731015154.27908.9639.stgit-/yCBOHwbXCxd3OlUiQof+WCaruZE5nAUZeezCHUQhQ4@public.gmane.org>
  2 siblings, 3 replies; 100+ messages in thread
From: Serge E. Hallyn @ 2009-08-03 17:54 UTC (permalink / raw)
  To: Ben Blum; +Cc: linux-kernel, containers, akpm, lizf, menage

Quoting Ben Blum (bblum@google.com):
...
> +static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
> +			       struct task_struct *tsk, int guarantee)
> +{
> +	struct css_set *oldcg;
> +	struct css_set *newcg;
> +
> +	/*
> +	 * get old css_set. we need to take task_lock and refcount it, because
> +	 * an exiting task can change its css_set to init_css_set and drop its
> +	 * old one without taking cgroup_mutex.
> +	 */
> +	task_lock(tsk);
> +	oldcg = tsk->cgroups;
> +	get_css_set(oldcg);
> +	task_unlock(tsk);
> +	/*
> +	 * locate or allocate a new css_set for this task. 'guarantee' tells
> +	 * us whether or not we are sure that a new css_set already exists;
> +	 * in that case, we are not allowed to fail, as we won't need malloc.
> +	 */
> +	if (guarantee) {
> +		/*
> +		 * our caller promises us that the css_set we want already
> +		 * exists, so we use find_existing_css_set directly.
> +		 */
> +		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
> +		read_lock(&css_set_lock);
> +		newcg = find_existing_css_set(oldcg, cgrp, template);
> +		BUG_ON(!newcg);
> +		get_css_set(newcg);
> +		read_unlock(&css_set_lock);
> +	} else {
> +		might_sleep();

So cgroup_task_migrate() might sleep, but

...


> +	down_write(&leader->cgroup_fork_mutex);
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
> +		/* leave current thread as it is if it's already there */
> +		oldcgrp = task_cgroup(tsk, subsys_id);
> +		if (cgrp == oldcgrp)
> +			continue;
> +		/* we don't care whether these threads are exiting */
> +		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, 1);

Here it is called under rcu_read_lock().

...

> -void cgroup_fork(struct task_struct *child)
> +void cgroup_fork(struct task_struct *child, int clone_flags)
>  {
> +	if (clone_flags & CLONE_THREAD)
> +		down_read(&current->group_leader->cgroup_fork_mutex);
> +	else
> +		init_rwsem(&child->cgroup_fork_mutex);

I'm also worried about the overhead here on what should be a
fast case, CLONE_THREAD.  Have you done any benchmarking of
one thread spawning a bunch of others?

What *exactly* is it we are protecting with cgroup_fork_mutex?
'fork' (as the name implies) is not a good answer, since we should be
protecting data, not code.  If it is solely tsk->cgroups, then perhaps
we should in fact try switching to (s?)rcu.  Then cgroup_fork() could
just do rcu_read_lock, while cgroup_task_migrate() would make the change
under a spinlock (to protect against concurrent cgroup_task_migrate()s),
and using rcu_assign_pointer to let cgroup_fork() see consistent data
either before or after the update...  That might mean that any checks done
before completing the migrate which involve the # of tasks might become
invalidated before the migration completes?  Seems acceptable (since
it'll be a small overcharge at most and can be quickly remedied).
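
Very roughly, I mean something like this (just a sketch; 'migrate_lock' is
a made-up name, and refcounting/error handling are glossed over):

	/* cgroup_fork(), under the suggested scheme */
	rcu_read_lock();
	child->cgroups = rcu_dereference(current->cgroups);
	get_css_set(child->cgroups);
	rcu_read_unlock();

	/* cgroup_task_migrate(), under the suggested scheme */
	spin_lock(&migrate_lock);	/* only serializes concurrent migrations */
	rcu_assign_pointer(tsk->cgroups, newcg);
	spin_unlock(&migrate_lock);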

-serge

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid  at once
  2009-08-03 17:54   ` Serge E. Hallyn
@ 2009-08-03 18:07     ` Paul Menage
       [not found]     ` <20090803175452.GA5481-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-08-03 18:13     ` Benjamin Blum
  2 siblings, 0 replies; 100+ messages in thread
From: Paul Menage @ 2009-08-03 18:07 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Ben Blum, linux-kernel, containers, akpm, lizf

On Mon, Aug 3, 2009 at 10:54 AM, Serge E. Hallyn<serue@us.ibm.com> wrote:
>> +     if (guarantee) {
>> +             /*
>> +              * our caller promises us that the css_set we want already
>> +              * exists, so we use find_existing_css_set directly.
>> +              */
>> +             struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
>> +             read_lock(&css_set_lock);
>> +             newcg = find_existing_css_set(oldcg, cgrp, template);
>> +             BUG_ON(!newcg);
>> +             get_css_set(newcg);
>> +             read_unlock(&css_set_lock);
>> +     } else {
>> +             might_sleep();
>
> So cgroup_task_migrate() might sleep, but
>

might sleep if guarantee==0

>> +             retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, 1);
>
> Here it is called under rcu_read_lock().

With guarantee==1

>
> What *exactly* is it we are protecting with cgroup_fork_mutex?
> 'fork' (as the name implies) is not a good answer, since we should be
> protecting data, not code.  If it is solely tsk->cgroups, then perhaps
> we should in fact try switching to (s?)rcu.  Then cgroup_fork() could
> just do rcu_read_lock, while cgroup_task_migrate() would make the change
> under a spinlock (to protect against concurrent cgroup_task_migrate()s),
> and using rcu_assign_pointer to let cgroup_fork() see consistent data
> either before or after the update...  That might mean that any checks done
> before completing the migrate which involve the # of tasks might become
> invalidated before the migration completes?

What's being protected is the ability to move an entire thread group
to the new destination cgroup, even in the presence of concurrent
thread clone operations. New threads aren't visible to other threads
until the point when they're attached to the tasklist, so if any
concurrent do_fork() operation is somewhere between the call to
cgroup_fork() and the attachment to the tasklist when the "attach proc
to new cgroup" operation occurs, it may get left behind.

Paul

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid  at once
  2009-08-03 17:54   ` Serge E. Hallyn
  2009-08-03 18:07     ` Paul Menage
       [not found]     ` <20090803175452.GA5481-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-08-03 18:13     ` Benjamin Blum
       [not found]       ` <2f86c2480908031113y525b6cbdhe418b8a0364c7760-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2009-08-03 18:55       ` Serge E. Hallyn
  2 siblings, 2 replies; 100+ messages in thread
From: Benjamin Blum @ 2009-08-03 18:13 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: linux-kernel, containers, akpm, lizf, menage

On Mon, Aug 3, 2009 at 1:54 PM, Serge E. Hallyn<serue@us.ibm.com> wrote:
> Quoting Ben Blum (bblum@google.com):
> ...
>> +static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
>> +                            struct task_struct *tsk, int guarantee)
>> +{
>> +     struct css_set *oldcg;
>> +     struct css_set *newcg;
>> +
>> +     /*
>> +      * get old css_set. we need to take task_lock and refcount it, because
>> +      * an exiting task can change its css_set to init_css_set and drop its
>> +      * old one without taking cgroup_mutex.
>> +      */
>> +     task_lock(tsk);
>> +     oldcg = tsk->cgroups;
>> +     get_css_set(oldcg);
>> +     task_unlock(tsk);
>> +     /*
>> +      * locate or allocate a new css_set for this task. 'guarantee' tells
>> +      * us whether or not we are sure that a new css_set already exists;
>> +      * in that case, we are not allowed to fail, as we won't need malloc.
>> +      */
>> +     if (guarantee) {
>> +             /*
>> +              * our caller promises us that the css_set we want already
>> +              * exists, so we use find_existing_css_set directly.
>> +              */
>> +             struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
>> +             read_lock(&css_set_lock);
>> +             newcg = find_existing_css_set(oldcg, cgrp, template);
>> +             BUG_ON(!newcg);
>> +             get_css_set(newcg);
>> +             read_unlock(&css_set_lock);
>> +     } else {
>> +             might_sleep();
>
> So cgroup_task_migrate() might sleep, but
>
> ...
>
>
>> +     down_write(&leader->cgroup_fork_mutex);
>> +     rcu_read_lock();
>> +     list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
>> +             /* leave current thread as it is if it's already there */
>> +             oldcgrp = task_cgroup(tsk, subsys_id);
>> +             if (cgrp == oldcgrp)
>> +                     continue;
>> +             /* we don't care whether these threads are exiting */
>> +             retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, 1);
>
> Here it is called under rcu_read_lock().

You'll notice the fourth argument, which tells cgroup_task_migrate
whether the css_set is guaranteed or not. If we say we've already got
it covered, the might_sleep section doesn't happen.

>> -void cgroup_fork(struct task_struct *child)
>> +void cgroup_fork(struct task_struct *child, int clone_flags)
>>  {
>> +     if (clone_flags & CLONE_THREAD)
>> +             down_read(&current->group_leader->cgroup_fork_mutex);
>> +     else
>> +             init_rwsem(&child->cgroup_fork_mutex);
>
> I'm also worried about the overhead here on what should be a
> fast case, CLONE_THREAD.  Have you done any benchmarking of
> one thread spawning a bunch of others?

Should be strictly better as this is making the rwsem local to the
threadgroup - at least in comparison to the previous edition of this
patch which had it as a global lock.

> What *exactly* is it we are protecting with cgroup_fork_mutex?
> 'fork' (as the name implies) is not a good answer, since we should be
> protecting data, not code.  If it is solely tsk->cgroups, then perhaps
> we should in fact try switching to (s?)rcu.  Then cgroup_fork() could
> just do rcu_read_lock, while cgroup_task_migrate() would make the change
> under a spinlock (to protect against concurrent cgroup_task_migrate()s),
> and using rcu_assign_pointer to let cgroup_fork() see consistent data
> either before or after the update...  That might mean that any checks done
> before completing the migrate which involve the # of tasks might become
> invalidated before the migration completes?  Seems acceptable (since
> it'll be a small overcharge at most and can be quickly remedied).

You'll notice where the rwsem is released - not until cgroup_post_fork
or cgroup_fork_failed. It doesn't just protect the tsk->cgroups
pointer, but rather guarantees atomicity between adjusting
tsk->cgroups and attaching it to the cgroups lists with respect to the
critical section in attach_proc. If you've a better name for the lock
for such a race condition, do suggest.
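
Condensed, the pairing looks roughly like this (simplified from the patch;
the read side is taken in cgroup_fork() and dropped in cgroup_post_fork()
or cgroup_fork_failed(), error paths omitted):

	/* clone(CLONE_THREAD) path */
	down_read(&current->group_leader->cgroup_fork_mutex);
	/* ... copy tsk->cgroups, set up the child, link it into the
	 * tasklist and the css_set lists ... */
	up_read(&current->group_leader->cgroup_fork_mutex);

	/* cgroup_attach_proc() path */
	down_write(&leader->cgroup_fork_mutex);
	rcu_read_lock();
	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
		/* ... */
		cgroup_task_migrate(cgrp, oldcgrp, tsk, 1);
	}
	rcu_read_unlock();
	up_write(&leader->cgroup_fork_mutex);

so a thread that is part-way through being created either becomes visible
before the migrate starts or only after it finishes, and can't be skipped.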

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid at once
  2009-08-03 18:13     ` Benjamin Blum
       [not found]       ` <2f86c2480908031113y525b6cbdhe418b8a0364c7760-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-08-03 18:55       ` Serge E. Hallyn
       [not found]         ` <20090803185556.GA8469-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-08-03 19:45         ` Serge E. Hallyn
  1 sibling, 2 replies; 100+ messages in thread
From: Serge E. Hallyn @ 2009-08-03 18:55 UTC (permalink / raw)
  To: Benjamin Blum; +Cc: linux-kernel, containers, akpm, lizf, menage

Quoting Benjamin Blum (bblum@google.com):
> On Mon, Aug 3, 2009 at 1:54 PM, Serge E. Hallyn<serue@us.ibm.com> wrote:
> > Quoting Ben Blum (bblum@google.com):
> > ...
> >> +static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
> >> +                            struct task_struct *tsk, int guarantee)
> >> +{
> >> +     struct css_set *oldcg;
> >> +     struct css_set *newcg;
> >> +
> >> +     /*
> >> +      * get old css_set. we need to take task_lock and refcount it, because
> >> +      * an exiting task can change its css_set to init_css_set and drop its
> >> +      * old one without taking cgroup_mutex.
> >> +      */
> >> +     task_lock(tsk);
> >> +     oldcg = tsk->cgroups;
> >> +     get_css_set(oldcg);
> >> +     task_unlock(tsk);
> >> +     /*
> >> +      * locate or allocate a new css_set for this task. 'guarantee' tells
> >> +      * us whether or not we are sure that a new css_set already exists;
> >> +      * in that case, we are not allowed to fail, as we won't need malloc.
> >> +      */
> >> +     if (guarantee) {
> >> +             /*
> >> +              * our caller promises us that the css_set we want already
> >> +              * exists, so we use find_existing_css_set directly.
> >> +              */
> >> +             struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
> >> +             read_lock(&css_set_lock);
> >> +             newcg = find_existing_css_set(oldcg, cgrp, template);
> >> +             BUG_ON(!newcg);
> >> +             get_css_set(newcg);
> >> +             read_unlock(&css_set_lock);
> >> +     } else {
> >> +             might_sleep();
> >
> > So cgroup_task_migrate() might sleep, but
> >
> > ...
> >
> >
> >> +     down_write(&leader->cgroup_fork_mutex);
> >> +     rcu_read_lock();
> >> +     list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
> >> +             /* leave current thread as it is if it's already there */
> >> +             oldcgrp = task_cgroup(tsk, subsys_id);
> >> +             if (cgrp == oldcgrp)
> >> +                     continue;
> >> +             /* we don't care whether these threads are exiting */
> >> +             retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, 1);
> >
> > Here it is called under rcu_read_lock().
> 
> You'll notice the fourth argument, which tells cgroup_task_migrate

Hmmm, in my defense one would notice it more readily if the caller used a
meaningful #define instead of '1'.
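Something like this, say (names are only a suggestion):

	#define CGROUP_CSS_SET_MAY_ALLOC	0
	#define CGROUP_CSS_SET_PREFETCHED	1

	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, CGROUP_CSS_SET_PREFETCHED);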

> whether the css_set is guaranteed or not. If we say we've already got
> it covered, the might_sleep section doesn't happen.
> 
> >> -void cgroup_fork(struct task_struct *child)
> >> +void cgroup_fork(struct task_struct *child, int clone_flags)
> >>  {
> >> +     if (clone_flags & CLONE_THREAD)
> >> +             down_read(&current->group_leader->cgroup_fork_mutex);
> >> +     else
> >> +             init_rwsem(&child->cgroup_fork_mutex);
> >
> > I'm also worried about the overhead here on what should be a
> > fast case, CLONE_THREAD.  Have you done any benchmarking of
> > one thread spawning a bunch of others?
> 
> Should be strictly better as this is making the rwsem local to the
> threadgroup - at least in comparison to the previous edition of this
> patch which had it as a global lock.
> 
> > What *exactly* is it we are protecting with cgroup_fork_mutex?
> > 'fork' (as the name implies) is not a good answer, since we should be
> > protecting data, not code.  If it is solely tsk->cgroups, then perhaps
> > we should in fact try switching to (s?)rcu.  Then cgroup_fork() could
> > just do rcu_read_lock, while cgroup_task_migrate() would make the change
> > under a spinlock (to protect against concurrent cgroup_task_migrate()s),
> > and using rcu_assign_pointer to let cgroup_fork() see consistent data
> > either before or after the update...  That might mean that any checks done
> > before completing the migrate which involve the # of tasks might become
> > invalidated before the migration completes?  Seems acceptable (since
> > it'll be a small overcharge at most and can be quickly remedied).
> 
> You'll notice where the rwsem is released - not until cgroup_post_fork
> or cgroup_fork_failed. It doesn't just protect the tsk->cgroups
> pointer, but rather guarantees atomicity between adjusting
> tsk->cgroups and attaching it to the cgroups lists with respect to the
> critical section in attach_proc. If you've a better name for the lock
> for such a race condition, do suggest.

No the name is pretty accurate - it's the lock itself I'm objecting
to.  Maybe it's the best we can do, though.

thanks,
-serge

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid at once
  2009-08-03 18:55       ` Serge E. Hallyn
       [not found]         ` <20090803185556.GA8469-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-08-03 19:45         ` Serge E. Hallyn
  2009-08-03 19:55           ` Paul Menage
       [not found]           ` <20090803194555.GA10158-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  1 sibling, 2 replies; 100+ messages in thread
From: Serge E. Hallyn @ 2009-08-03 19:45 UTC (permalink / raw)
  To: Benjamin Blum; +Cc: menage, containers, linux-kernel, akpm

Quoting Serge E. Hallyn (serue@us.ibm.com):
> Quoting Benjamin Blum (bblum@google.com):
> > On Mon, Aug 3, 2009 at 1:54 PM, Serge E. Hallyn<serue@us.ibm.com> wrote:
> > > Quoting Ben Blum (bblum@google.com):
> > > What *exactly* is it we are protecting with cgroup_fork_mutex?
> > > 'fork' (as the name implies) is not a good answer, since we should be
> > > protecting data, not code.  If it is solely tsk->cgroups, then perhaps
> > > we should in fact try switching to (s?)rcu.  Then cgroup_fork() could
> > > just do rcu_read_lock, while cgroup_task_migrate() would make the change
> > > under a spinlock (to protect against concurrent cgroup_task_migrate()s),
> > > and using rcu_assign_pointer to let cgroup_fork() see consistent data
> > > either before or after the update...  That might mean that any checks done
> > > before completing the migrate which involve the # of tasks might become
> > > invalidated before the migration completes?  Seems acceptable (since
> > > it'll be a small overcharge at most and can be quickly remedied).
> > 
> > You'll notice where the rwsem is released - not until cgroup_post_fork
> > or cgroup_fork_failed. It doesn't just protect the tsk->cgroups
> > pointer, but rather guarantees atomicity between adjusting
> > tsk->cgroups and attaching it to the cgroups lists with respect to the
> > critical section in attach_proc. If you've a better name for the lock
> > for such a race condition, do suggest.
> 
> No the name is pretty accurate - it's the lock itself I'm objecting
> to.  Maybe it's the best we can do, though.

This is probably a stupid idea, but...  what about having zero
overhead at clone(), and instead, at cgroup_task_migrate(),
dequeue_task()ing all of the affected threads for the duration of
the migrate?

/me prepares to be hit by blunt objects

-serge

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid  at once
  2009-08-03 19:45         ` Serge E. Hallyn
@ 2009-08-03 19:55           ` Paul Menage
  2009-08-04 14:01             ` Serge E. Hallyn
                               ` (3 more replies)
       [not found]           ` <20090803194555.GA10158-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  1 sibling, 4 replies; 100+ messages in thread
From: Paul Menage @ 2009-08-03 19:55 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Benjamin Blum, containers, linux-kernel, akpm

On Mon, Aug 3, 2009 at 12:45 PM, Serge E. Hallyn<serue@us.ibm.com> wrote:
>
> This is probably a stupid idea, but...  what about having zero
> overhead at clone(), and instead, at cgroup_task_migrate(),
> dequeue_task()ing all of the affected threads for the duration of
> the migrate?

That doesn't sound too unreasonable, actually - it would certainly
simplify things a fair bit. Is there a standard API for doing that?
dequeue_task() itself doesn't really look like a public API. I guess
that the task freezer would be one way to accomplish this?

I can imagine that the set of people who'd complain about the latency
hit when migrating with your solution would be smaller than the set of
people who'd complain about the increased overhead in the normal clone case.

Paul

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 5/6] Lets ss->can_attach and ss->attach do whole  threadgroups at a time
  2009-08-03  2:22   ` Li Zefan
@ 2009-08-04  0:35     ` Benjamin Blum
       [not found]     ` <4A7649E1.4000200-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  1 sibling, 0 replies; 100+ messages in thread
From: Benjamin Blum @ 2009-08-04  0:35 UTC (permalink / raw)
  To: Li Zefan; +Cc: linux-kernel, containers, akpm, serue, menage

On Sun, Aug 2, 2009 at 7:22 PM, Li Zefan<lizf@cn.fujitsu.com> wrote:
> 09:51, Ben Blum wrote:
>> Lets ss->can_attach and ss->attach do whole threadgroups at a time
>>
>> This patch alters the ss->can_attach and ss->attach functions to be able to
>> deal with a whole threadgroup at a time, for use in cgroup_attach_proc. (This
>> is a pre-patch to cgroup-procs-writable.patch.)
>>
>> Currently, new mode of the attach function can only tell the subsystem about
>> the old cgroup of the threadgroup leader. No subsystem currently needs that
>> information for each thread that's being moved, but if one were to be added
>> (for example, one that counts tasks within a group) this bit would need to be
>> reworked a bit to tell the subsystem the right information.
>>
>> Signed-off-by: Ben Blum <bblum@google.com>
>>
>> ---
>>
>>  include/linux/cgroup.h   |    7 +++--
>>  kernel/cgroup.c          |    4 +--
>>  kernel/cgroup_freezer.c  |   15 +++++++++--
>>  kernel/cpuset.c          |   65 ++++++++++++++++++++++++++++++++++++----------
>>  kernel/ns_cgroup.c       |   16 ++++++++++-
>>  kernel/sched.c           |   37 ++++++++++++++++++++++++--
>>  mm/memcontrol.c          |    3 +-
>>  security/device_cgroup.c |    3 +-
>>  8 files changed, 122 insertions(+), 28 deletions(-)
>>
>> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
>> index 24e3f1a..8286758 100644
>> --- a/include/linux/cgroup.h
>> +++ b/include/linux/cgroup.h
>> @@ -408,10 +408,11 @@ struct cgroup_subsys {
>>                                                 struct cgroup *cgrp);
>>       int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>>       void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
>> -     int (*can_attach)(struct cgroup_subsys *ss,
>> -                       struct cgroup *cgrp, struct task_struct *tsk);
>> +     int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
>> +                       struct task_struct *tsk, bool threadgroup);
>>       void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
>> -                     struct cgroup *old_cgrp, struct task_struct *tsk);
>> +                     struct cgroup *old_cgrp, struct task_struct *tsk,
>> +                     bool threadgroup);
>
> Could you also update Documentation/cgroups/cgroups.txt?
>
>>       void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
>>       void (*exit)(struct cgroup_subsys *ss, struct task_struct *task);
>>       int (*populate)(struct cgroup_subsys *ss,
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index a12bc8e..ea05d6b 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -1324,7 +1324,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>>
>>       for_each_subsys(root, ss) {
>>               if (ss->can_attach) {
>> -                     retval = ss->can_attach(ss, cgrp, tsk);
>> +                     retval = ss->can_attach(ss, cgrp, tsk, false);
>>                       if (retval)
>>                               return retval;
>>               }
>> @@ -1362,7 +1362,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
>>
>>       for_each_subsys(root, ss) {
>>               if (ss->attach)
>> -                     ss->attach(ss, cgrp, oldcgrp, tsk);
>> +                     ss->attach(ss, cgrp, oldcgrp, tsk, false);
>>       }
>>       set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
>>       synchronize_rcu();
>> diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
>> index fb249e2..4e352ab 100644
>> --- a/kernel/cgroup_freezer.c
>> +++ b/kernel/cgroup_freezer.c
>> @@ -159,10 +159,9 @@ static bool is_task_frozen_enough(struct task_struct *task)
>>   */
>>  static int freezer_can_attach(struct cgroup_subsys *ss,
>>                             struct cgroup *new_cgroup,
>> -                           struct task_struct *task)
>> +                           struct task_struct *task, bool threadgroup)
>>  {
>>       struct freezer *freezer;
>> -
>
> Please reserve this blank line.
>
>>       /*
>>        * Anything frozen can't move or be moved to/from.
>>        *
>> @@ -177,6 +176,18 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
>>       if (freezer->state == CGROUP_FROZEN)
>>               return -EBUSY;
>>
>> +     if (threadgroup) {
>> +             struct task_struct *c;
>
> And better have a blank line here. ;)
>
>> +             rcu_read_lock();
>> +             list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
>> +                     if (is_task_frozen_enough(c)) {
>> +                             rcu_read_unlock();
>> +                             return -EBUSY;
>> +                     }
>> +             }
>> +             rcu_read_unlock();
>> +     }
>> +
>>       return 0;
>>  }
>>
>> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
>> index 7e75a41..86397f4 100644
>> --- a/kernel/cpuset.c
>> +++ b/kernel/cpuset.c
>> @@ -1324,9 +1324,10 @@ static int fmeter_getrate(struct fmeter *fmp)
>>  static cpumask_var_t cpus_attach;
>>
>>  /* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
>> -static int cpuset_can_attach(struct cgroup_subsys *ss,
>> -                          struct cgroup *cont, struct task_struct *tsk)
>> +static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
>> +                          struct task_struct *tsk, bool threadgroup)
>>  {
>> +     int ret;
>>       struct cpuset *cs = cgroup_cs(cont);
>>
>>       if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
>> @@ -1343,18 +1344,50 @@ static int cpuset_can_attach(struct cgroup_subsys *ss,
>>       if (tsk->flags & PF_THREAD_BOUND)
>>               return -EINVAL;
>>
>> -     return security_task_setscheduler(tsk, 0, NULL);
>> +     ret = security_task_setscheduler(tsk, 0, NULL);
>> +     if (ret)
>> +             return ret;
>> +     if (threadgroup) {
>> +             struct task_struct *c;
>
> ditto
>
>> +             rcu_read_lock();
>> +             list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
>> +                     ret = security_task_setscheduler(c, 0, NULL);
>> +                     if (ret) {
>> +                             rcu_read_unlock();
>> +                             return ret;
>> +                     }
>> +             }
>> +             rcu_read_unlock();
>> +     }
>> +     return 0;
>> +}
>
>

All taken care of. Thanks.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid  at once
  2009-08-03  3:00   ` Li Zefan
@ 2009-08-04  0:56     ` Benjamin Blum
  2009-08-04  1:05       ` Paul Menage
       [not found]       ` <2f86c2480908031756j557e7aebmbf7951da6a1aadb0-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
       [not found]     ` <4A7652E7.4020206-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  1 sibling, 2 replies; 100+ messages in thread
From: Benjamin Blum @ 2009-08-04  0:56 UTC (permalink / raw)
  To: Li Zefan; +Cc: linux-kernel, containers, akpm, serue, menage

On Sun, Aug 2, 2009 at 8:00 PM, Li Zefan<lizf@cn.fujitsu.com> wrote:
> Ben Blum wrote:
>> +     }
>> +     write_unlock(&css_set_lock);
>> +
>> +     /*
>> +      * We just gained a reference on oldcg by taking it from the task. As
>
> This comment is incorrect, the ref we just got has been dropped by
> the above put_css_set(oldcg).

No, the idea is that even though we had a reference that we already
dropped, we in effect "traded" the newcg to the task for its oldcg,
giving it our reference on newcg and gaining its reference on oldcg. I
believe the cgroup_mutex guarantees that it'll still be there when we
do the trade - perhaps a BUG_ON(tsk->cgroups != oldcg) is wanted
inside the second task_lock section there? At the very least, a
clearer comment.

>> +static int css_set_check_fetched(struct cgroup *cgrp, struct task_struct *tsk,
>> +                              struct css_set *cg,
>> +                              struct list_head *newcg_list)
>> +{
>> +     struct css_set *newcg;
>> +     struct cg_list_entry *cg_entry;
>> +     struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
>> +     read_lock(&css_set_lock);
>> +     newcg = find_existing_css_set(cg, cgrp, template);
>> +     if (newcg)
>> +             get_css_set(newcg);
>> +     read_unlock(&css_set_lock);
>> +     /* doesn't exist at all? */
>> +     if (!newcg)
>> +             return 1;
>
> I think it's more intuitive to return 1 if found and 0 if not found.

I was sticking with the convention of nonzero return values indicating
failure, as is used everywhere else in this context.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid  at once
  2009-08-04  0:56     ` Benjamin Blum
@ 2009-08-04  1:05       ` Paul Menage
       [not found]         ` <6599ad830908031805y31136eceqeff0bab455100d6c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2009-08-04  1:11         ` Benjamin Blum
       [not found]       ` <2f86c2480908031756j557e7aebmbf7951da6a1aadb0-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 2 replies; 100+ messages in thread
From: Paul Menage @ 2009-08-04  1:05 UTC (permalink / raw)
  To: Benjamin Blum; +Cc: Li Zefan, linux-kernel, containers, akpm, serue

On Mon, Aug 3, 2009 at 5:56 PM, Benjamin Blum<bblum@google.com> wrote:
> No, the idea is that even though we had a reference that we already
> dropped, we in effect "traded" the newcg to the task for its oldcg,
> giving it our reference on newcg and gaining its reference on oldcg. I
> believe the cgroup_mutex guarantees that it'll still be there when we
> do the trade - perhaps a BUG_ON(tsk->cgroups != oldcg) is wanted
> inside the second task_lock section there?

I don't think that's safe, since the task could exit and drop its
reference (and switch to init_css_set).

>>
>> I think it's more intuitive to return 1 if found and 0 if not found.
>
> I was sticking with the convention of nonzero return values indicating
> failure, as is used everywhere else in this context.
>

Make it a bool, so that it's intuitively more obvious that true==success?

Paul

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid  at once
@ 2009-08-04  1:09           ` Li Zefan
  0 siblings, 0 replies; 100+ messages in thread
From: Li Zefan @ 2009-08-04  1:09 UTC (permalink / raw)
  To: Benjamin Blum; +Cc: linux-kernel, containers, akpm, serue, menage

Benjamin Blum wrote:
> On Sun, Aug 2, 2009 at 8:00 PM, Li Zefan<lizf@cn.fujitsu.com> wrote:
>> Ben Blum wrote:
>>> +     }
>>> +     write_unlock(&css_set_lock);
>>> +
>>> +     /*
>>> +      * We just gained a reference on oldcg by taking it from the task. As
>> This comment is incorrect, the ref we just got has been dropped by
>> the above put_css_set(oldcg).
> 
> No, the idea is that even though we had a reference that we already
> dropped, we in effect "traded" the newcg to the task for its oldcg,
> giving it our reference on newcg and gaining its reference on oldcg. I
> believe the cgroup_mutex guarantees that it'll still be there when we
> do the trade - perhaps a BUG_ON(tsk->cgroups != oldcg) is wanted
> inside the second task_lock section there? At the very least, a
> clearer comment.
> 

Maybe my English sucks..

By "gained a reference", doesn't it mean get_css_set()? But this
put_css_set() is not against the get() just called.

And in fact the ref can be 0 before this put(), because task_exit
can drop the last ref, but put_css_set() will check this case,
so it's Ok.

>>> +static int css_set_check_fetched(struct cgroup *cgrp, struct task_struct *tsk,
>>> +                              struct css_set *cg,
>>> +                              struct list_head *newcg_list)
>>> +{
>>> +     struct css_set *newcg;
>>> +     struct cg_list_entry *cg_entry;
>>> +     struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
>>> +     read_lock(&css_set_lock);
>>> +     newcg = find_existing_css_set(cg, cgrp, template);
>>> +     if (newcg)
>>> +             get_css_set(newcg);
>>> +     read_unlock(&css_set_lock);
>>> +     /* doesn't exist at all? */
>>> +     if (!newcg)
>>> +             return 1;
>> I think it's more intuitive to return 1 if found and 0 if not found.
> 
> I was sticking with the convention of nonzero return values indicating
> failure, as is used everywhere else in this context.
> 

Quoted from Documentation/CodingStyle:

...Such a value can be represented as an error-code integer
(-Exxx = failure, 0 = success) or a "succeeded" boolean (0 = failure,
non-zero = success).

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid  at once
  2009-08-04  1:05       ` Paul Menage
       [not found]         ` <6599ad830908031805y31136eceqeff0bab455100d6c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-08-04  1:11         ` Benjamin Blum
  1 sibling, 0 replies; 100+ messages in thread
From: Benjamin Blum @ 2009-08-04  1:11 UTC (permalink / raw)
  To: Paul Menage; +Cc: Li Zefan, linux-kernel, containers, akpm, serue

On Mon, Aug 3, 2009 at 6:05 PM, Paul Menage<menage@google.com> wrote:
> On Mon, Aug 3, 2009 at 5:56 PM, Benjamin Blum<bblum@google.com> wrote:
>> No, the idea is that even though we had a reference that we already
>> dropped, we in effect "traded" the newcg to the task for its oldcg,
>> giving it our reference on newcg and gaining its reference on oldcg. I
>> believe the cgroup_mutex guarantees that it'll still be there when we
>> do the trade - perhaps a BUG_ON(tsk->cgroups != oldcg) is wanted
>> inside the second task_lock section there?
>
> I don't think that's safe, since the task could exit and drop its
> reference (and switch to init_css_set).

Yes, but while holding the task's lock while trading the pointers we
first do a PF_EXITING check, so it can't have gone away if we
successfully traded.

>>>
>>> I think it's more intuitive to return 1 if found and 0 if not found.
>>
>> I was sticking with the convention of nonzero return values indicating
>> failure, as is used everywhere else in this context.
>>
>
> Make it a bool, so that it's intuitively more obvious that true==success?

ok.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid  at once
  2009-08-04  1:09           ` Li Zefan
  (?)
@ 2009-08-04  1:19           ` Benjamin Blum
  2009-08-04  1:45             ` Li Zefan
       [not found]             ` <2f86c2480908031819h2513cdb4tac3d6def3e0aa320-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  -1 siblings, 2 replies; 100+ messages in thread
From: Benjamin Blum @ 2009-08-04  1:19 UTC (permalink / raw)
  To: Li Zefan; +Cc: linux-kernel, containers, akpm, serue, menage

On Mon, Aug 3, 2009 at 6:09 PM, Li Zefan<lizf@cn.fujitsu.com> wrote:
> Benjamin Blum wrote:
>> On Sun, Aug 2, 2009 at 8:00 PM, Li Zefan<lizf@cn.fujitsu.com> wrote:
>>> Ben Blum wrote:
>>>> +     }
>>>> +     write_unlock(&css_set_lock);
>>>> +
>>>> +     /*
>>>> +      * We just gained a reference on oldcg by taking it from the task. As
>>> This comment is incorrect, the ref we just got has been dropped by
>>> the above put_css_set(oldcg).
>>
>> No, the idea is that even though we had a reference that we already
>> dropped, we in effect "traded" the newcg to the task for its oldcg,
>> giving it our reference on newcg and gaining its reference on oldcg. I
>> believe the cgroup_mutex guarantees that it'll still be there when we
>> do the trade - perhaps a BUG_ON(tsk->cgroups != oldcg) is wanted
>> inside the second task_lock section there? At the very least, a
>> clearer comment.
>>
>
> Maybe my English sucks..
>
> By "gained a reference", doesn't it mean get_css_set()? But this
> put_css_set() is not against the get() just called.

not in the conventional way, no. the comment there is bad enough that
this is unclear: before trading pointers, the task had a reference on
its tsk->cgroups pointer (same as our oldcg pointer), which is what we
are overwriting with newcg. the task will think that the reference it
has is still on tsk->cgroups, but since the pointer has changed, its
reference also changes to a reference on newcg - one that this
function took care of getting for the task. additionally, now that the
task's reference is no longer for oldcg, we have to take care of the
refcount that still thinks it's being used.
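
To make the "trade" concrete, here is a minimal sketch of the attach-side swap
being described (locking and error handling simplified; not the exact patch
code):

    task_lock(tsk);
    if (tsk->flags & PF_EXITING) {
            task_unlock(tsk);
            put_css_set(newcg);   /* nobody left to hand our newcg ref to */
            return -ESRCH;
    }
    oldcg = tsk->cgroups;                     /* the task's old reference... */
    rcu_assign_pointer(tsk->cgroups, newcg);  /* ...now covers newcg instead */
    task_unlock(tsk);
    put_css_set(oldcg);   /* drop the reference we took over from the task */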

> And in fact the ref can be 0 before this put(), because task_exit
> can drop the last ref, but put_css_set() will check this case,
> so it's Ok.

the check for PF_EXITING precludes that case.

>>>> +static int css_set_check_fetched(struct cgroup *cgrp, struct task_struct *tsk,
>>>> +                              struct css_set *cg,
>>>> +                              struct list_head *newcg_list)
>>>> +{
>>>> +     struct css_set *newcg;
>>>> +     struct cg_list_entry *cg_entry;
>>>> +     struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
>>>> +     read_lock(&css_set_lock);
>>>> +     newcg = find_existing_css_set(cg, cgrp, template);
>>>> +     if (newcg)
>>>> +             get_css_set(newcg);
>>>> +     read_unlock(&css_set_lock);
>>>> +     /* doesn't exist at all? */
>>>> +     if (!newcg)
>>>> +             return 1;
>>> I think it's more intuitive to return 1 if found and 0 if not found.
>>
>> I was sticking with the convention of nonzero return values indicating
>> failure, as is used everywhere else in this context.
>>
>
> Quoted from Documentation/CodingStyle:
>
> ...Such a value can be represented as an error-code integer
> (-Exxx = failure, 0 = success) or a "succeeded" boolean (0 = failure,
> non-zero = success).

sure. a boolean return value will be better here.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid  at once
  2009-08-04  1:19           ` Benjamin Blum
@ 2009-08-04  1:45             ` Li Zefan
  2009-08-04  1:55               ` Paul Menage
       [not found]               ` <4A7792C4.5010504-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
       [not found]             ` <2f86c2480908031819h2513cdb4tac3d6def3e0aa320-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 2 replies; 100+ messages in thread
From: Li Zefan @ 2009-08-04  1:45 UTC (permalink / raw)
  To: Benjamin Blum; +Cc: linux-kernel, containers, akpm, serue, menage

Benjamin Blum wrote:
> On Mon, Aug 3, 2009 at 6:09 PM, Li Zefan<lizf@cn.fujitsu.com> wrote:
>> Benjamin Blum wrote:
>>> On Sun, Aug 2, 2009 at 8:00 PM, Li Zefan<lizf@cn.fujitsu.com> wrote:
>>>> Ben Blum wrote:
>>>>> +     }
>>>>> +     write_unlock(&css_set_lock);
>>>>> +
>>>>> +     /*
>>>>> +      * We just gained a reference on oldcg by taking it from the task. As
>>>> This comment is incorrect, the ref we just got has been dropped by
>>>> the above put_css_set(oldcg).
>>> No, the idea is that even though we had a reference that we already
>>> dropped, we in effect "traded" the newcg to the task for its oldcg,
>>> giving it our reference on newcg and gaining its reference on oldcg. I
>>> believe the cgroup_mutex guarantees that it'll still be there when we
>>> do the trade - perhaps a BUG_ON(tsk->cgroups != oldcg) is wanted
>>> inside the second task_lock section there? At the very least, a
>>> clearer comment.
>>>
>> Maybe my English sucks..
>>
>> By "gained a reference", doesn't it mean get_css_set()? But this
>> put_css_set() is not against the get() just called.
> 
> not in the conventional way, no. the comment there is bad enough that
> this is unclear: before trading pointers, the task had a reference on
> its tsk->cgroups pointer (same as our oldcg pointer), which is what we
> are overwriting with newcg. the task will think that the reference it
> has is still on tsk->cgroups, but since the pointer has changed, its
> reference also changes to a reference on newcg - one that this
> function took care of getting for the task. additionally, now that the
> task's reference is no longer for oldcg, we have to take care of the
> refcount that still thinks it's being used.
> 

Ok.

>> And in fact the ref can be 0 before this put(), because task_exit
>> can drop the last ref, but put_css_set() will check this case,
>> so it's Ok.
> 
> the check for PF_EXITING precludes that case.
> 

No. Note task exiting is not protected by cgroup_lock, so this can
happen:

                               | cgroup_attach_task()
                               |   oldcg = tsk->cgroups;
                               |   ((tsk->flags & PF_EXITING) == 0)
                               |   rcu_assign_pointer(tsk->cgroups, newcg);
cgroup_exit()                  |
  oldcg = tsk->cgroups;        |
  put_css_set_taskexit(oldcg); |
  (now ref of oldcg is 0)      |
                               |   put_css_set(oldcg);

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid  at once
  2009-08-04  1:45             ` Li Zefan
@ 2009-08-04  1:55               ` Paul Menage
       [not found]               ` <4A7792C4.5010504-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  1 sibling, 0 replies; 100+ messages in thread
From: Paul Menage @ 2009-08-04  1:55 UTC (permalink / raw)
  To: Li Zefan; +Cc: Benjamin Blum, linux-kernel, containers, akpm, serue

On Mon, Aug 3, 2009 at 6:45 PM, Li Zefan<lizf@cn.fujitsu.com> wrote:
>
> No. Note task exiting is not protected by cgroup_lock, so this can
> happen:
>
>                                | cgroup_attach_task()
>                                |   oldcg = tsk->cgroups;
>                                |   ((tsk->flags & PF_EXITING) == 0)
>                                |   rcu_assign_pointer(tsk->cgroups, newcg);
> cgroup_exit()                  |
>   oldcg = tsk->cgroups;        |
>   put_css_set_taskexit(oldcg); |
>   (now ref of oldcg is 0)      |
>                                |   put_css_set(oldcg);
>

But in both cases, reading tsk->cgroups occurs with task->alloc_lock held.
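
For reference, the exit side reads and replaces the pointer under the same
lock - a simplified sketch of what cgroup_exit() does, details omitted:

    task_lock(tsk);
    cg = tsk->cgroups;
    tsk->cgroups = &init_css_set;   /* exiting task falls back to the default set */
    task_unlock(tsk);
    if (cg)
            put_css_set_taskexit(cg);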

Paul

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid at once
  2009-08-03 19:55           ` Paul Menage
@ 2009-08-04 14:01             ` Serge E. Hallyn
       [not found]             ` <6599ad830908031255j68ce047x7165bfefa62ed53c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 100+ messages in thread
From: Serge E. Hallyn @ 2009-08-04 14:01 UTC (permalink / raw)
  To: Paul Menage; +Cc: Benjamin Blum, containers, linux-kernel, akpm

Quoting Paul Menage (menage@google.com):
> On Mon, Aug 3, 2009 at 12:45 PM, Serge E. Hallyn<serue@us.ibm.com> wrote:
> >
> > This is probably a stupid idea, but...  what about having zero
> > overhead at clone(), and instead, at cgroup_task_migrate(),
> > dequeue_task()ing all of the affected threads for the duration of
> > the migrate?
> 
> That doesn't sound too unreasonable, actually - it would certainly
> simplify things a fair bit. Is there a standard API for doing that?
> dequeue_task() itself doesn't really look like a public API. I guess
> that the task freezer would be one way to accomplish this?

Actually if we wanted to try to avoid -EINTRs for the tasks, which
they get with the freezer, we might want to code our own helpers
in sched.c based on dequeue_task().  Instead of interrupting ongoing
system calls, we'll want to set a flag saying that if the syscall exits
(before we unset the flag) then the task gets dequeued right away.

> I can imagine that the set of people who'd complain about the latency
> hit when migrating with your solution would be smaller than the people
> who'd complain about the increased overhead in the normal clone case.
> 
> Paul

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid  at once
@ 2009-08-04 18:48               ` Paul Menage
  0 siblings, 0 replies; 100+ messages in thread
From: Paul Menage @ 2009-08-04 18:48 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Benjamin Blum, containers, linux-kernel, akpm

On Mon, Aug 3, 2009 at 12:45 PM, Serge E. Hallyn<serue@us.ibm.com> wrote:
>
> This is probably a stupid idea, but...  what about having zero
> overhead at clone(), and instead, at cgroup_task_migrate(),
> dequeue_task()ing all of the affected threads for the duration of
> the migrate?
>

Or a simpler alternative - rather than taking the thread group
leader's rwsem in cgroup_fork(), always take current's rwsem. Then
you're always locking a (probably?) local rwsem and minimizing the
overhead. So not quite zero overhead in the fork path, but I'd be
surprised if it was measurable. In cgroup_attach_proc() you then have
to take the rwsem of every thread in the process. Kind of the
equivalent of a per-threadgroup big-reader lock.
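
A minimal sketch of the fork-path half of that idea (the per-task field and
helper names below are made up purely for illustration):

    /* assumes e.g. "struct rw_semaphore threadgroup_fork_rwsem;" in task_struct */
    static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
    {
            down_read(&tsk->threadgroup_fork_rwsem);
    }

    static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
    {
            up_read(&tsk->threadgroup_fork_rwsem);
    }

cgroup_fork() would bracket its bookkeeping with
threadgroup_fork_read_lock(current) / threadgroup_fork_read_unlock(current),
while cgroup_attach_proc() write-locks every thread's rwsem as described above.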

Paul

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid at once
@ 2009-08-04 19:01                   ` Serge E. Hallyn
  0 siblings, 0 replies; 100+ messages in thread
From: Serge E. Hallyn @ 2009-08-04 19:01 UTC (permalink / raw)
  To: Paul Menage; +Cc: Benjamin Blum, containers, linux-kernel, akpm

Quoting Paul Menage (menage@google.com):
> On Mon, Aug 3, 2009 at 12:45 PM, Serge E. Hallyn<serue@us.ibm.com> wrote:
> >
> > This is probably a stupid idea, but...  what about having zero
> > overhead at clone(), and instead, at cgroup_task_migrate(),
> > dequeue_task()ing all of the affected threads for the duration of
> > the migrate?
> >
> 
> Or a simpler alternative - rather than taking the thread group
> leader's rwsem in cgroup_fork(), always take current's rwsem. Then
> you're always locking a (probably?) local rwsem and minimizing the
> overhead. So not quite zero overhead in the fork path, but I'd be
> surprised if it was measurable. In cgroup_attach_proc() you then have
> to take the rwsem of every thread in the process. Kind of the
> equivalent of a per-threadgroup big-reader lock.
> 
> Paul

Yup I think that would address my concern (cache-line bouncing in hot
clone(CLONE_THREAD) case).

thanks,
-serge

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid  at once
@ 2009-08-04 19:14                   ` Benjamin Blum
  0 siblings, 0 replies; 100+ messages in thread
From: Benjamin Blum @ 2009-08-04 19:14 UTC (permalink / raw)
  To: Paul Menage; +Cc: Serge E. Hallyn, containers, linux-kernel, akpm

On Tue, Aug 4, 2009 at 2:48 PM, Paul Menage<menage@google.com> wrote:
> On Mon, Aug 3, 2009 at 12:45 PM, Serge E. Hallyn<serue@us.ibm.com> wrote:
>>
>> This is probably a stupid idea, but...  what about having zero
>> overhead at clone(), and instead, at cgroup_task_migrate(),
>> dequeue_task()ing all of the affected threads for the duration of
>> the migrate?
>>
>
> Or a simpler alternative - rather than taking the thread group
> leader's rwsem in cgroup_fork(), always take current's rwsem. Then
> you're always locking a (probably?) local rwsem and minimizing the
> overhead. So not quite zero overhead in the fork path, but I'd be
> surprised if it was measurable. In cgroup_attach_proc() you then have
> to take the rwsem of every thread in the process. Kind of the
> equivalent of a per-threadgroup big-reader lock.

Hmm, the tasklist_lock section in fork() is entirely inside the
read-lock. Presumably then iterating the threadgroup list to take all
rwsems is safe from a race in which one thread escapes?

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid  at once
@ 2009-08-04 19:28                       ` Paul Menage
  0 siblings, 0 replies; 100+ messages in thread
From: Paul Menage @ 2009-08-04 19:28 UTC (permalink / raw)
  To: Benjamin Blum; +Cc: Serge E. Hallyn, containers, linux-kernel, akpm

On Tue, Aug 4, 2009 at 12:14 PM, Benjamin Blum<bblum@google.com> wrote:
>
> Hmm, the tasklist_lock section in fork() is entirely inside the
> read-lock. Presumably then iterating the threadgroup list to take all
> rwsems is safe from a race in which one thread escapes?
>

I think so - the write-side locking would be something like:

again:
  read_lock(&tasklist_lock);
  find first not-yet-locked task t
  while (t != group_leader) {
    if (!down_write_trylock(&t->threadgroup_stability_mutex)) {
      get_task_struct(t);
      read_unlock(&tasklist_lock);
      down_write(&t->threadgroup_stability_mutex);
      put_task_struct(t);
      goto again;
    }
    t = t->next_thread;
  }
  read_unlock(&tasklist_lock);

  Paul

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid at once
  2009-08-03 19:55           ` Paul Menage
                               ` (2 preceding siblings ...)
  2009-08-04 21:40             ` Matt Helsley
@ 2009-08-04 21:40             ` Matt Helsley
  3 siblings, 0 replies; 100+ messages in thread
From: Matt Helsley @ 2009-08-04 21:40 UTC (permalink / raw)
  To: Paul Menage
  Cc: Serge E. Hallyn, Benjamin Blum, containers, akpm, linux-kernel,
	Rafael Wysocki, Linux Power Management

[ Cc'ing Rafael and linux-pm for more eyes on proposed freezer usage. ]

On Mon, Aug 03, 2009 at 12:55:33PM -0700, Paul Menage wrote:
> On Mon, Aug 3, 2009 at 12:45 PM, Serge E. Hallyn<serue@us.ibm.com> wrote:
> >
> > This is probably a stupid idea, but...  what about having zero
> > overhead at clone(), and instead, at cgroup_task_migrate(),
> > dequeue_task()ing all of the affected threads for the duration of
> > the migrate?
> 
> That doesn't sound too unreasonable, actually - it would certainly
> simplify things a fair bit. Is there a standard API for doing that?

I'm all for simplifying cgroup locking. I doubt anybody's against
it, given the "right" simplification.

I'm not sure if the freezer is actually the right thing to
use for this though. Perhaps CFS/scheduler folks could advise?

> dequeue_task() itself doesn't really look like a public API. I guess
> that the task freezer would be one way to accomplish this?
 
The freezer won't actually remove the task from the runqueue -- just
cause it to go into a schedule() loop until it's thawed.

[ Incidentally, sorry if this is a dumb question, but why don't frozen
tasks go onto a special wait queue rather than loop around schedule() ? 
At least for the cgroup freezer I can imagine keeping the wait queue
with the cgroup subsystem... ]

The freezer sends a fake signal to the task which will interrupt syscalls
and userspace to handle the signal. So all of the frozen tasks would be
looping around schedule() just inside the syscall entry layer "handling"
the fake signal until they are thawed.
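
[ For reference, the core of that loop - a simplified sketch of refrigerator(),
give or take details in the real source - is just:

    for (;;) {
            set_current_state(TASK_UNINTERRUPTIBLE);
            if (!frozen(current))   /* cleared when the task is thawed */
                    break;
            schedule();
    }
    /* then restore the task's previous state and return to the signal path */

so a frozen task just sits in this loop until it is thawed. ]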

This could interrupt a read of the cgroup pidlist for example.

I don't think it's 100% reliable -- vfork-ing tasks could delay freezing
the task "indefinitely" if the vfork'ing userspace tasks are 
clueless/malicious.

However the signaling code used there uses kick_process() which may be
needed for this idea.

So if I understand correctly it goes something like:

for each thread
	dequeue from runqueue onto ?what?
	kick thread (I think this should ensure that the thread is no longer
			"current" on any CPU since we dequeued..)

<seems we'd need something to ensure that the previous operations on each
thread have "completed" as far as all other cpus are concerned...>

for each thread
	cgroup migrate

for each thread
	enqueue back on runqueue from ?what? (is this still the right
						queue?)

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid at once
  2009-08-04 19:28                       ` Paul Menage
  (?)
@ 2009-08-05 10:20                       ` Louis Rilling
       [not found]                         ` <20090805102057.GT29252-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
  -1 siblings, 1 reply; 100+ messages in thread
From: Louis Rilling @ 2009-08-05 10:20 UTC (permalink / raw)
  To: Paul Menage; +Cc: Benjamin Blum, containers, linux-kernel, akpm

[-- Attachment #1: Type: text/plain, Size: 1363 bytes --]

On 04/08/09 12:28 -0700, Paul Menage wrote:
> On Tue, Aug 4, 2009 at 12:14 PM, Benjamin Blum<bblum@google.com> wrote:
> >
> > Hmm, the tasklist_lock section in fork() is entirely inside the
> > read-lock. Presumably then iterating the threadgroup list to take all
> > rwsems is safe from a race in which one thread escapes?
> >
> 
> I think so - the write-side locking would be something like:
> 
> again:
>   read_lock(&tasklist_lock);
>   find first not-yet-locked task t
>   while (t != group_leader) {
>     if (!down_write_trylock(&t->threadgroup_stability_mutex)) {
>       get_task_struct(t);
>       read_unlock(&tasklist_lock);
>       down_write(&t->threadgroup_stability_mutex);
>       put_task_struct(t);
>       goto again;
>     }
>     t = t->next_thread;
>   }
>   read_unlock(&tasklist_lock);

The downside of this is teaching lockdep about this recursive locking. Not that
simple actually... so that such cases are currently handled using a higher-level
lock that prevents races in locking the whole chain (there was one such example
for locking all vmas with KVM). IIUC, the intent here is to avoid such
higher-level lock.

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 100+ messages in thread
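
Spelled out with the actual kernel primitives, the write-side loop quoted above
would look something like the sketch below. threadgroup_stability_mutex is the
hypothetical per-task rwsem from this discussion (not an existing task_struct
field), and first_unlocked_thread() is a placeholder for the "find first
not-yet-locked task" step.

/* sketch only -- field and helper names are from the discussion above */
static void threadgroup_lock_all(struct task_struct *leader)
{
	struct task_struct *t;

again:
	read_lock(&tasklist_lock);
	t = first_unlocked_thread(leader);	/* placeholder helper */
	while (t != leader) {
		if (!down_write_trylock(&t->threadgroup_stability_mutex)) {
			/* can't sleep under tasklist_lock: pin the task,
			 * drop the lock, sleep, then rescan the group */
			get_task_struct(t);
			read_unlock(&tasklist_lock);
			down_write(&t->threadgroup_stability_mutex);
			put_task_struct(t);
			goto again;
		}
		t = next_thread(t);
	}
	read_unlock(&tasklist_lock);
}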

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid  at once
@ 2009-08-05 16:11                             ` Paul Menage
  0 siblings, 0 replies; 100+ messages in thread
From: Paul Menage @ 2009-08-05 16:11 UTC (permalink / raw)
  To: Paul Menage, Benjamin Blum, containers, linux-kernel, akpm

On Wed, Aug 5, 2009 at 3:20 AM, Louis Rilling<Louis.Rilling@kerlabs.com> wrote:
>
> The downside of this is teaching lockdep about this recursive locking. Not that
> simple actually...

Don't we just give each thread's lock its own lock class? That's what
we did for the cgroup hierarchy_mutex.

> so that such cases are currently handled using a higher-level
> lock that prevents races in locking the whole chain (there was one such example
> for locking all vmas with KVM). IIUC, the intent here is to avoid such
> higher-level lock.

cgroup_mutex already fulfills the role of the higher-level lock.

Paul

^ permalink raw reply	[flat|nested] 100+ messages in thread
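
For reference, giving a lock instance its own class is done with
lockdep_set_class() and a static struct lock_class_key, roughly as below (the
names are illustrative). Because the key has to live in static storage, this
works for a handful of subsystem locks but does not obviously scale to one lock
per thread.

#include <linux/lockdep.h>
#include <linux/mutex.h>

/* one static key == one lock class */
static struct lock_class_key css_demo_key;
static struct mutex css_demo_mutex;

static void css_demo_init(void)
{
	mutex_init(&css_demo_mutex);
	/* put this instance in its own class so lockdep keeps it apart
	 * from other mutexes initialised at the same source location */
	lockdep_set_class(&css_demo_mutex, &css_demo_key);
}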

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid at once
  2009-08-05 16:11                             ` Paul Menage
  (?)
  (?)
@ 2009-08-05 16:42                             ` Louis Rilling
  2009-08-05 16:53                               ` Peter Zijlstra
                                                 ` (2 more replies)
  -1 siblings, 3 replies; 100+ messages in thread
From: Louis Rilling @ 2009-08-05 16:42 UTC (permalink / raw)
  To: Paul Menage; +Cc: Benjamin Blum, containers, linux-kernel, akpm, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 1331 bytes --]

On 05/08/09  9:11 -0700, Paul Menage wrote:
> On Wed, Aug 5, 2009 at 3:20 AM, Louis Rilling<Louis.Rilling@kerlabs.com> wrote:
> >
> > The downside of this is teaching lockdep about this recursive locking. Not that
> > simple actually...
> 
> Don't we just give each thread's lock its own lock class? That's what
> we did for the cgroup hierarchy_mutex.

Given that lock classes must be static and that lockdep only supports a limited
lock depth, this is an issue for processes having many threads.

> 
> > so that such cases are currently handled using a higher-level
> > lock that prevents races in locking the whole chain (there was one such example
> > for locking all vmas with KVM). IIUC, the intent here is to avoid such
> > higher-level lock.
> 
> cgroup_mutex already fulfills the role of the higher-level lock.

If so (that is, here cgroup_mutex is taken before write-locking all threads'
rw_sem), then enhancing rwsem's interface in a similar way to the
spin_lock_nest_lock() interface could do it. There will still be an issue with
many threads and lockdep limited lock depth though.

Added Peter in CC.

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid at once
  2009-08-05 16:42                             ` Louis Rilling
@ 2009-08-05 16:53                               ` Peter Zijlstra
       [not found]                               ` <20090805164218.GB26446-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
  2009-08-06  0:01                               ` Benjamin Blum
  2 siblings, 0 replies; 100+ messages in thread
From: Peter Zijlstra @ 2009-08-05 16:53 UTC (permalink / raw)
  To: Louis Rilling
  Cc: Paul Menage, Benjamin Blum, containers, linux-kernel, akpm, Ingo Molnar

On Wed, 2009-08-05 at 18:42 +0200, Louis Rilling wrote:
> On 05/08/09  9:11 -0700, Paul Menage wrote:
> > On Wed, Aug 5, 2009 at 3:20 AM, Louis Rilling<Louis.Rilling@kerlabs.com> wrote:
> > >
> > > The downside of this is teaching lockdep about this recursive locking. Not that
> > > simple actually...
> > 
> > Don't we just give each thread's lock its own lock class? That's what
> > we did for the cgroup hierarchy_mutex.
> 
> Given that lock classes must be static and that lockdep only supports a limited
> lock depth, this is an issue for processes having many threads.
> 
> > 
> > > so that such cases are currently handled using a higher-level
> > > lock that prevents races in locking the whole chain (there was one such example
> > > for locking all vmas with KVM). IIUC, the intent here is to avoid such
> > > higher-level lock.
> > 
> > cgroup_mutex already fulfills the role of the higher-level lock.
> 
> If so (that is, here cgroup_mutex is taken before write-locking all threads'
> rw_sem), then enhancing rwsem's interface in a similar way to the
> spin_lock_nest_lock() interface could do it. There will still be an issue with
> many threads and lockdep limited lock depth though.
> 
> Added Peter in CC.

Ah, I recently tinkered with that, see the below commit from -tip.

It should start counting instances instead of tracking each one once you
start using the _nest_lock() variant with multiple instances of the same
class.

We lose some precision (can't match the exact instance on unlock, and
lockstat loses out), but it keeps lockdep going up to 2048 instances
(those 11 bits).

---
commit bb97a91e2549a7f2df9c21d32542582f549ab3ec
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date:   Mon Jul 20 19:15:35 2009 +0200

    lockdep: Deal with many similar locks
    
    spin_lock_nest_lock() allows taking many instances of the same
    class, which can easily lead to overflow of MAX_LOCK_DEPTH.
    
    To avoid this overflow, we'll stop accounting instances but
    start reference counting the class in the held_lock structure.
    
    [ We could maintain a list of instances, if we'd move the hlock
      stuff into __lock_acquired(), but that would require
      significant modifications to the current code. ]
    
    We restrict this mode to spin_lock_nest_lock() only, because it
    degrades the lockdep quality due to the loss of instance
    information.
    
    For lockstat this means we don't track lock statistics for any
    but the first lock in the series.
    
    Currently nesting is limited to 11 bits because that was the
    spare space available in held_lock. This yields a maximum of
    2048 instances.
    
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Marcelo Tosatti <mtosatti@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index a6d5e5e..47d42ef 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -213,10 +213,12 @@ struct held_lock {
 	 * interrupt context:
 	 */
 	unsigned int irq_context:2; /* bit 0 - soft, bit 1 - hard */
-	unsigned int trylock:1;
+	unsigned int trylock:1;						/* 16 bits */
+
 	unsigned int read:2;        /* see lock_acquire() comment */
 	unsigned int check:2;       /* see lock_acquire() comment */
 	unsigned int hardirqs_off:1;
+	unsigned int references:11;					/* 32 bits */
 };
 
 /*
diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 28914a5..0bb246e 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -2708,13 +2708,15 @@ EXPORT_SYMBOL_GPL(lockdep_init_map);
  */
 static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
 			  int trylock, int read, int check, int hardirqs_off,
-			  struct lockdep_map *nest_lock, unsigned long ip)
+			  struct lockdep_map *nest_lock, unsigned long ip,
+			  int references)
 {
 	struct task_struct *curr = current;
 	struct lock_class *class = NULL;
 	struct held_lock *hlock;
 	unsigned int depth, id;
 	int chain_head = 0;
+	int class_idx;
 	u64 chain_key;
 
 	if (!prove_locking)
@@ -2762,10 +2764,24 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
 	if (DEBUG_LOCKS_WARN_ON(depth >= MAX_LOCK_DEPTH))
 		return 0;
 
+	class_idx = class - lock_classes + 1;
+
+	if (depth) {
+		hlock = curr->held_locks + depth - 1;
+		if (hlock->class_idx == class_idx && nest_lock) {
+			if (hlock->references)
+				hlock->references++;
+			else
+				hlock->references = 2;
+
+			return 1;
+		}
+	}
+
 	hlock = curr->held_locks + depth;
 	if (DEBUG_LOCKS_WARN_ON(!class))
 		return 0;
-	hlock->class_idx = class - lock_classes + 1;
+	hlock->class_idx = class_idx;
 	hlock->acquire_ip = ip;
 	hlock->instance = lock;
 	hlock->nest_lock = nest_lock;
@@ -2773,6 +2789,7 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
 	hlock->read = read;
 	hlock->check = check;
 	hlock->hardirqs_off = !!hardirqs_off;
+	hlock->references = references;
 #ifdef CONFIG_LOCK_STAT
 	hlock->waittime_stamp = 0;
 	hlock->holdtime_stamp = sched_clock();
@@ -2881,6 +2898,30 @@ static int check_unlock(struct task_struct *curr, struct lockdep_map *lock,
 	return 1;
 }
 
+static int match_held_lock(struct held_lock *hlock, struct lockdep_map *lock)
+{
+	if (hlock->instance == lock)
+		return 1;
+
+	if (hlock->references) {
+		struct lock_class *class = lock->class_cache;
+
+		if (!class)
+			class = look_up_lock_class(lock, 0);
+
+		if (DEBUG_LOCKS_WARN_ON(!class))
+			return 0;
+
+		if (DEBUG_LOCKS_WARN_ON(!hlock->nest_lock))
+			return 0;
+
+		if (hlock->class_idx == class - lock_classes + 1)
+			return 1;
+	}
+
+	return 0;
+}
+
 static int
 __lock_set_class(struct lockdep_map *lock, const char *name,
 		 struct lock_class_key *key, unsigned int subclass,
@@ -2904,7 +2945,7 @@ __lock_set_class(struct lockdep_map *lock, const char *name,
 		 */
 		if (prev_hlock && prev_hlock->irq_context != hlock->irq_context)
 			break;
-		if (hlock->instance == lock)
+		if (match_held_lock(hlock, lock))
 			goto found_it;
 		prev_hlock = hlock;
 	}
@@ -2923,7 +2964,8 @@ found_it:
 		if (!__lock_acquire(hlock->instance,
 			hlock_class(hlock)->subclass, hlock->trylock,
 				hlock->read, hlock->check, hlock->hardirqs_off,
-				hlock->nest_lock, hlock->acquire_ip))
+				hlock->nest_lock, hlock->acquire_ip,
+				hlock->references))
 			return 0;
 	}
 
@@ -2962,20 +3004,34 @@ lock_release_non_nested(struct task_struct *curr,
 		 */
 		if (prev_hlock && prev_hlock->irq_context != hlock->irq_context)
 			break;
-		if (hlock->instance == lock)
+		if (match_held_lock(hlock, lock))
 			goto found_it;
 		prev_hlock = hlock;
 	}
 	return print_unlock_inbalance_bug(curr, lock, ip);
 
 found_it:
-	lock_release_holdtime(hlock);
+	if (hlock->instance == lock)
+		lock_release_holdtime(hlock);
+
+	if (hlock->references) {
+		hlock->references--;
+		if (hlock->references) {
+			/*
+			 * We had, and after removing one, still have
+			 * references, the current lock stack is still
+			 * valid. We're done!
+			 */
+			return 1;
+		}
+	}
 
 	/*
 	 * We have the right lock to unlock, 'hlock' points to it.
 	 * Now we remove it from the stack, and add back the other
 	 * entries (if any), recalculating the hash along the way:
 	 */
+
 	curr->lockdep_depth = i;
 	curr->curr_chain_key = hlock->prev_chain_key;
 
@@ -2984,7 +3040,8 @@ found_it:
 		if (!__lock_acquire(hlock->instance,
 			hlock_class(hlock)->subclass, hlock->trylock,
 				hlock->read, hlock->check, hlock->hardirqs_off,
-				hlock->nest_lock, hlock->acquire_ip))
+				hlock->nest_lock, hlock->acquire_ip,
+				hlock->references))
 			return 0;
 	}
 
@@ -3014,7 +3071,7 @@ static int lock_release_nested(struct task_struct *curr,
 	/*
 	 * Is the unlock non-nested:
 	 */
-	if (hlock->instance != lock)
+	if (hlock->instance != lock || hlock->references)
 		return lock_release_non_nested(curr, lock, ip);
 	curr->lockdep_depth--;
 
@@ -3065,7 +3122,9 @@ static int __lock_is_held(struct lockdep_map *lock)
 	int i;
 
 	for (i = 0; i < curr->lockdep_depth; i++) {
-		if (curr->held_locks[i].instance == lock)
+		struct held_lock *hlock = curr->held_locks + i;
+
+		if (match_held_lock(hlock, lock))
 			return 1;
 	}
 
@@ -3148,7 +3207,7 @@ void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
 
 	current->lockdep_recursion = 1;
 	__lock_acquire(lock, subclass, trylock, read, check,
-		       irqs_disabled_flags(flags), nest_lock, ip);
+		       irqs_disabled_flags(flags), nest_lock, ip, 0);
 	current->lockdep_recursion = 0;
 	raw_local_irq_restore(flags);
 }
@@ -3252,7 +3311,7 @@ __lock_contended(struct lockdep_map *lock, unsigned long ip)
 		 */
 		if (prev_hlock && prev_hlock->irq_context != hlock->irq_context)
 			break;
-		if (hlock->instance == lock)
+		if (match_held_lock(hlock, lock))
 			goto found_it;
 		prev_hlock = hlock;
 	}
@@ -3260,6 +3319,9 @@ __lock_contended(struct lockdep_map *lock, unsigned long ip)
 	return;
 
 found_it:
+	if (hlock->instance != lock)
+		return;
+
 	hlock->waittime_stamp = sched_clock();
 
 	contention_point = lock_point(hlock_class(hlock)->contention_point, ip);
@@ -3299,7 +3361,7 @@ __lock_acquired(struct lockdep_map *lock, unsigned long ip)
 		 */
 		if (prev_hlock && prev_hlock->irq_context != hlock->irq_context)
 			break;
-		if (hlock->instance == lock)
+		if (match_held_lock(hlock, lock))
 			goto found_it;
 		prev_hlock = hlock;
 	}
@@ -3307,6 +3369,9 @@ __lock_acquired(struct lockdep_map *lock, unsigned long ip)
 	return;
 
 found_it:
+	if (hlock->instance != lock)
+		return;
+
 	cpu = smp_processor_id();
 	if (hlock->waittime_stamp) {
 		now = sched_clock();



^ permalink raw reply related	[flat|nested] 100+ messages in thread
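
Concretely, the annotation the patch above deals with looks like this at a call
site: many locks of one class are taken while a single outer lock is held, and
that outer lock is passed as the nest lock so lockdep counts instances instead
of recording each one (mm_take_all_locks() uses this pattern in-tree). The
structures below are illustrative, and the matching unlocks are omitted.

#include <linux/spinlock.h>
#include <linux/list.h>

struct demo_parent {
	spinlock_t		lock;		/* the "nest" lock */
	struct list_head	children;
};

struct demo_child {
	spinlock_t		lock;		/* many instances, one class */
	struct list_head	list;
};

static void demo_take_all(struct demo_parent *p)
{
	struct demo_child *c;

	spin_lock(&p->lock);
	list_for_each_entry(c, &p->children, list)
		spin_lock_nest_lock(&c->lock, &p->lock);
	/* ... operate on the whole set, then unlock in reverse ... */
}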

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid  at once
  2009-08-05 16:42                             ` Louis Rilling
  2009-08-05 16:53                               ` Peter Zijlstra
       [not found]                               ` <20090805164218.GB26446-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
@ 2009-08-06  0:01                               ` Benjamin Blum
  2009-08-06  9:58                                 ` Louis Rilling
       [not found]                                 ` <2f86c2480908051701s57120404q475edbedb58cdca1-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2 siblings, 2 replies; 100+ messages in thread
From: Benjamin Blum @ 2009-08-06  0:01 UTC (permalink / raw)
  To: Paul Menage, Benjamin Blum, containers, linux-kernel, akpm,
	Peter Zijlstra

On Wed, Aug 5, 2009 at 9:42 AM, Louis Rilling<Louis.Rilling@kerlabs.com> wrote:
> On 05/08/09  9:11 -0700, Paul Menage wrote:
>> On Wed, Aug 5, 2009 at 3:20 AM, Louis Rilling<Louis.Rilling@kerlabs.com> wrote:
>> >
>> > The downside of this is teaching lockdep about this recursive locking. Not that
>> > simple actually...
>>
>> Don't we just give each thread's lock its own lock class? That's what
>> we did for the cgroup hierarchy_mutex.
>
> Given that lock classes must be static and that lockdep only supports a limited
> lock depth, this is an issue for processes having many threads.
>
>>
>> > so that such cases are currently handled using a higher-level
>> > lock that prevents races in locking the whole chain (there was one such example
>> > for locking all vmas with KVM). IIUC, the intent here is to avoid such
>> > higher-level lock.
>>
>> cgroup_mutex already fulfills the role of the higher-level lock.
>
> If so (that is, here cgroup_mutex is taken before write-locking all threads'
> rw_sem), then enhancing rwsem's interface in a similar way to the
> spin_lock_nest_lock() interface could do it. There will still be an issue with
> many threads and lockdep limited lock depth though.

If we make the locks per-thread, then we can use plain mutexes instead
of rwsems since the only reader will ever be the owning thread itself,
and we can use mutex_lock_nested.

>
> Added Peter in CC.
>
> Louis
>
> --
> Dr Louis Rilling                        Kerlabs
> Skype: louis.rilling                    Batiment Germanium
> Phone: (+33|0) 6 80 89 08 23            80 avenue des Buttes de Coesmes
> http://www.kerlabs.com/                 35700 Rennes
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.9 (GNU/Linux)
>
> iEYEARECAAYFAkp5tmoACgkQVKcRuvQ9Q1TwdACeMwdKtGc3rU3PGXPgYvdj9Vxe
> xYIAmQFMW6Ri9JGuc7+A0WmGzXkzQ81A
> =Wj3u
> -----END PGP SIGNATURE-----
>
>

^ permalink raw reply	[flat|nested] 100+ messages in thread
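
As a reference point for the mutex_lock_nested() suggestion: the subclass
argument is a small compile-time constant (it must stay below
MAX_LOCKDEP_SUBCLASSES, which is 8), and lockdep additionally caps the number of
simultaneously held locks at MAX_LOCK_DEPTH, so the annotation as it stands
cannot be stretched to one nesting level per thread. A minimal sketch of its
intended use, with illustrative names:

#include <linux/mutex.h>

/* two locks of the same class with a fixed parent -> child ordering */
enum { DEMO_PARENT_NESTING = 0, DEMO_CHILD_NESTING = 1 };

static void demo_lock_pair(struct mutex *parent, struct mutex *child)
{
	mutex_lock_nested(parent, DEMO_PARENT_NESTING);
	mutex_lock_nested(child, DEMO_CHILD_NESTING);
	/* ... */
	mutex_unlock(child);
	mutex_unlock(parent);
}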

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid at once
  2009-08-06  0:01                               ` Benjamin Blum
@ 2009-08-06  9:58                                 ` Louis Rilling
  2009-08-06 10:04                                   ` Louis Rilling
       [not found]                                   ` <20090806095854.GD26446-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
       [not found]                                 ` <2f86c2480908051701s57120404q475edbedb58cdca1-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 2 replies; 100+ messages in thread
From: Louis Rilling @ 2009-08-06  9:58 UTC (permalink / raw)
  To: Benjamin Blum; +Cc: Paul Menage, containers, linux-kernel, akpm, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 1924 bytes --]

On 05/08/09 17:01 -0700, Benjamin Blum wrote:
> On Wed, Aug 5, 2009 at 9:42 AM, Louis Rilling<Louis.Rilling@kerlabs.com> wrote:
> > On 05/08/09  9:11 -0700, Paul Menage wrote:
> >> On Wed, Aug 5, 2009 at 3:20 AM, Louis Rilling<Louis.Rilling@kerlabs.com> wrote:
> >> >
> >> > The downside of this is teaching lockdep about this recursive locking. Not that
> >> > simple actually...
> >>
> >> Don't we just give each thread's lock its own lock class? That's what
> >> we did for the cgroup hierarchy_mutex.
> >
> > Given that lock classes must be static and that lockdep only supports a limited
> > lock depth, this is an issue for processes having many threads.
> >
> >>
> >> > so that such cases are currently handled using a higher-level
> >> > lock that prevents races in locking the whole chain (there was one such example
> >> > for locking all vmas with KVM). IIUC, the intent here is to avoid such
> >> > higher-level lock.
> >>
> >> cgroup_mutex already fulfills the role of the higher-level lock.
> >
> > If so (that is, here cgroup_mutex is taken before write-locking all threads'
> > rw_sem), then enhancing rwsem's interface in a similar way to the
> > spin_lock_nest_lock() interface could do it. There will still be an issue with
> > many threads and lockdep limited lock depth though.
> 
> If we make the locks per-thread, then we can use plain mutexes instead
> of rwsems since the only reader will ever be the owning thread itself,
> and we can use mutex_lock_nested.

mutex_lock_nested is not enough, since this would require putting each thread's
mutex in a different class. Again, something like mutex_lock_nest_lock() is
the solution, especially since Peter's recent improvement.

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid at once
  2009-08-06  9:58                                 ` Louis Rilling
@ 2009-08-06 10:04                                   ` Louis Rilling
       [not found]                                   ` <20090806095854.GD26446-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
  1 sibling, 0 replies; 100+ messages in thread
From: Louis Rilling @ 2009-08-06 10:04 UTC (permalink / raw)
  To: Benjamin Blum, Paul Menage, containers, linux-kernel, akpm,
	Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 2202 bytes --]

On 06/08/09 11:58 +0200, Louis Rilling wrote:
> On 05/08/09 17:01 -0700, Benjamin Blum wrote:
> > On Wed, Aug 5, 2009 at 9:42 AM, Louis Rilling<Louis.Rilling@kerlabs.com> wrote:
> > > On 05/08/09  9:11 -0700, Paul Menage wrote:
> > >> On Wed, Aug 5, 2009 at 3:20 AM, Louis Rilling<Louis.Rilling@kerlabs.com> wrote:
> > >> >
> > >> > The downside of this is teaching lockdep about this recursive locking. Not that
> > >> > simple actually...
> > >>
> > >> Don't we just give each thread's lock its own lock class? That's what
> > >> we did for the cgroup hierarchy_mutex.
> > >
> > > Given that lock classes must be static and that lockdep only supports a limited
> > > lock depth, this is an issue for processes having many threads.
> > >
> > >>
> > >> > so that such cases are currently handled using a higher-level
> > >> > lock that prevents races in locking the whole chain (there was one such example
> > >> > for locking all vmas with KVM). IIUC, the intent here is to avoid such
> > >> > higher-level lock.
> > >>
> > >> cgroup_mutex already fulfills the role of the higher-level lock.
> > >
> > > If so (that is, here cgroup_mutex is taken before write-locking all threads'
> > > rw_sem), then enhancing rwsem's interface in a similar way to the
> > > spin_lock_nest_lock() interface could do it. There will still be an issue with
> > > many threads and lockdep limited lock depth though.
> > 
> > If we make the locks per-thread, then we can use plain mutexes instead
> > of rwsems since the only reader will ever be the owning thread itself,
> > and we can use mutex_lock_nested.
> 
> mutex_lock_nested is not enough, since this would require putting each thread's
> mutex in a different class.

Not exactly what I meant, sorry. This would require defining as many sub-classes
as possible (still limited to MAX_LOCK_DEPTH) to support many threads.

> Again, something like mutex_lock_nest_lock() is
> the solution, especially since Peter's recent improvement.
> 
Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid  at once
@ 2009-08-06 10:28                                       ` Paul Menage
  0 siblings, 0 replies; 100+ messages in thread
From: Paul Menage @ 2009-08-06 10:28 UTC (permalink / raw)
  To: Benjamin Blum, Paul Menage, containers, linux-kernel, akpm,
	Peter Zijlstra

On Thu, Aug 6, 2009 at 2:58 AM, Louis Rilling<Louis.Rilling@kerlabs.com> wrote:
>
> mutex_lock_nested is not enough, since this would require putting each thread's
> mutex in a different class. Again, something like mutex_lock_nest_lock() is
> the solution, especially since Peter's recent improvement.
>

OK, well if lockdep can't currently handle the "writer takes a lock on
every thread" model, then maybe we should go with a simpler model
until someone shows a performance issue with it? Ben's original
patches had a per-task_struct lock, and a thread forking with CLONE_VM
would down_read() its group leader's lock. Something that's even
simpler (doesn't have to deal with thread group leader changing due to
an execve()), and avoids the per-task_struct overhead would be to put
the lock in sighand_struct instead (so only one per process). The
procs file writer does a down_write(&tsk->sighand->fork_sem), and
cgroup_fork() can do a down_read(&current->sighand->fork_sem) if
flags&CLONE_SIGHAND.

If you put it as the second member of sighand_struct, there wouldn't
even be any extra cacheline bouncing in the common case, since
copy_sighand() would already have brought that line into cache in
order to do atomic_inc(&current->sighand->count)

Paul

^ permalink raw reply	[flat|nested] 100+ messages in thread
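
In code, the sighand_struct variant described above would amount to something
like the sketch below. fork_sem is the hypothetical rwsem named in this mail,
not an existing field, the struct layout is abbreviated, the helper names are
made up, and the rwsem would need an init_rwsem() wherever sighand_struct is
allocated.

struct sighand_struct {
	atomic_t		count;
	struct rw_semaphore	fork_sem;	/* proposed second member */
	struct k_sigaction	action[_NSIG];
	spinlock_t		siglock;
	/* ... */
};

/* fork/clone side, e.g. from cgroup_fork(): cheap shared acquisition */
static inline void threadgroup_fork_lock(unsigned long clone_flags)
{
	if (clone_flags & CLONE_SIGHAND)
		down_read(&current->sighand->fork_sem);
}

/* "procs" writer side: exclude forks while migrating the whole group */
static void threadgroup_fork_write_lock(struct task_struct *tsk)
{
	down_write(&tsk->sighand->fork_sem);
	/* ... migrate every thread in tsk's thread group here ... */
	up_write(&tsk->sighand->fork_sem);
}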

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid  at once
@ 2009-08-06 10:34                                           ` Peter Zijlstra
  0 siblings, 0 replies; 100+ messages in thread
From: Peter Zijlstra @ 2009-08-06 10:34 UTC (permalink / raw)
  To: Paul Menage; +Cc: Benjamin Blum, containers, linux-kernel, akpm, Ingo Molnar

On Thu, 2009-08-06 at 03:28 -0700, Paul Menage wrote:

> OK, well if lockdep can't currently handle the "writer takes a lock on
> every thread" model,

I haven't read what this is about, but simply looking at that sentence
makes me want to hit someone with a cluebat. Have you any idea how
expensive that is?




^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid  at once
  2009-08-06 10:42                                           ` Paul Menage
       [not found]                                             ` <6599ad830908060342m1fc8cdd2me25af248a8e0e183-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-08-06 11:02                                             ` Peter Zijlstra
  2009-08-06 11:24                                                 ` Paul Menage
  1 sibling, 1 reply; 100+ messages in thread
From: Peter Zijlstra @ 2009-08-06 11:02 UTC (permalink / raw)
  To: Paul Menage
  Cc: Benjamin Blum, containers, linux-kernel, akpm, Ingo Molnar,
	paulmck, oleg

On Thu, 2009-08-06 at 03:42 -0700, Paul Menage wrote:
> On Thu, Aug 6, 2009 at 3:34 AM, Peter Zijlstra<a.p.zijlstra@chello.nl> wrote:
> > On Thu, 2009-08-06 at 03:28 -0700, Paul Menage wrote:
> >
> >> OK, well if lockdep can't currently handle the "writer takes a lock on
> >> every thread" model,
> >
> > I haven't read what this is about, but simply looking at that sentence
> > makes me want to hit someone with a cluebat. Have you any idea how
> > expensive that is?
> 
> For lockdep to track that many locks, or just the concept of taking
> that many locks generally?

Taking that many locks in general, some apps (JVM based usually) tend to
be thread heavy and can easily have hundreds of them, even on relatively
small (~8 CPU) machines. Having to take hundreds of locks on such a
small machine seems to indicate to me you're doing something wrong.

[ I now understand it's only 'all threads of a process', which is
  slightly better than 'all threads' but still crazy ]

> The basic idea is that in order to implement a "procs" file in cgroups
> that can migrate all threads in a process atomically, we need to
> synchronize with concurrent clone() calls. But since thread clones are
> likely to occur far more often than "procs" writes, and we wanted to
> avoid introducing overhead into the clone path, one approach was to
> give each thread a fork mutex, which it could take around the relevant
> parts of the fork/clone operation, and have the "procs" writer deal
> with obtaining the fork mutex for every thread in the process being
> moved, so pushing the overhead on to the "procs" writer.

See below.

> I don't think it's a deficiency of lockdep that it would have trouble
> dealing with this - in fact, my original plan was that we'd just have
> to live with the fact that anyone doing a "procs" move on a massive
> process would have to live with lockdep printing an overflow warning.

Now that's not real nice is it ;-)

> But given that AFAICS we can eliminate the overhead associated with a
> single lock by piggy-backing on the cache line containing
> sighand->count, hopefully this won't be an issue any more.

Right, so this is a write rarely, read frequently thing, which suggests
an RCU like approach where the readers pay a minimum synchronization
penalty.

Then again, clone() might already serialize on the process as a whole
(not sure though, Oleg/Ingo?), in which case you can indeed take a
process wide lock.

If not, you might think about using an SRCU variant where the read
(clone) side does something like:

 if (rcu_dereference(tsk->sighand->cgroup_block))
   wait_event(&some_queue, !tsk->sighand->cgroup_block);

 srcu_read_lock(&my_srcu_thing);

 /* do the clone bit */

 srcu_read_unlock(&my_srcu_thing);


and the write (cgroup thingy) side does:

 rcu_assign(tsk->sighand->cgroup_block, 1);

 synchronize_srcu(&my_srcu_thing);

 /* everybody will be out of clone */

 /* do your thing */

 rcu_assign(tsk->sighand->cgroup_block, 0);
 wake_up_all(&some_queue);

Or something like that, which should reduce the read-side overhead to a
single shared read and some SRCU ops which should be per-cpu in the
normal case.




^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid  at once
@ 2009-08-06 11:24                                                 ` Paul Menage
  0 siblings, 0 replies; 100+ messages in thread
From: Paul Menage @ 2009-08-06 11:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Benjamin Blum, containers, linux-kernel, akpm, Ingo Molnar,
	paulmck, oleg

On Thu, Aug 6, 2009 at 4:02 AM, Peter Zijlstra<a.p.zijlstra@chello.nl> wrote:
>
> Taking that many locks in general, some apps (JVM based usually) tend to
> be thread heavy and can easily have hundreds of them, even on relatively

Oh, I'm well aware that apps can be heavily multi-threaded - we have
much worse cases at Google.

>
> Now that's not real nice is it ;-)

Not particularly - but who exactly is going to be moving processes
with thousands of threads between cgroups on a lockdep-enabled debug
kernel?

>
>> But given that AFAICS we can eliminate the overhead associated with a
>> single lock by piggy-backing on the cache line containing
>> sighand->count, hopefully this won't be an issue any more.
>
> Right, so this is a write rarely, read frequently thing, which suggests
> an RCU like approach where the readers pay a minimum synchronization
> penalty.

The documentation for SRCU mentions:

        Therefore, SRCU should be used in preference to rw_semaphore
        only in extremely read-intensive situations, or in situations
        requiring SRCU's read-side deadlock immunity or low read-side
        realtime latency.

What benefits does the additional complexity of SRCU give, over the
simple solution of putting an rwsem in the same cache line as
sighand->count ?

Paul
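
For reference, the rwsem alternative in question reduces to a single per-process
reader/writer lock. A rough userspace model (illustration only: pthread_rwlock_t
stands in for the kernel rwsem, and fork_sem is the hypothetical field name
already used in this thread):

/*
 * Rough model of the per-process rwsem approach: clone() is the read
 * side and runs concurrently with other clones; the rare "procs"
 * writer takes the write side, waiting out in-flight clones and
 * blocking new ones while it migrates the threadgroup.
 */
#include <pthread.h>

static pthread_rwlock_t fork_sem = PTHREAD_RWLOCK_INITIALIZER;

/* clone()/fork() path: shared lock, readers do not exclude each other */
static void model_clone_enter(void)
{
	pthread_rwlock_rdlock(&fork_sem);
}

static void model_clone_exit(void)
{
	pthread_rwlock_unlock(&fork_sem);
}

/* "procs" writer: exclusive lock for the duration of the migration */
static void model_procs_write(void)
{
	pthread_rwlock_wrlock(&fork_sem);
	/* ... move every thread in the group to the destination cgroup ... */
	pthread_rwlock_unlock(&fork_sem);
}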

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid at once
@ 2009-08-06 11:24                                           ` Louis Rilling
  0 siblings, 0 replies; 100+ messages in thread
From: Louis Rilling @ 2009-08-06 11:24 UTC (permalink / raw)
  To: Paul Menage; +Cc: Benjamin Blum, containers, linux-kernel, akpm, Peter Zijlstra

On 06/08/09  3:28 -0700, Paul Menage wrote:
> On Thu, Aug 6, 2009 at 2:58 AM, Louis Rilling<Louis.Rilling@kerlabs.com> wrote:
> >
> > mutex_lock_nested is not enough, since this would require putting each thread's
> > mutex in a different class. Again, something like mutex_lock_nest_lock() is
> > the solution, especially since Peter's recent improvement.
> >
> 
> OK, well if lockdep can't currently handle the "writer takes a lock on
> every thread" model, then maybe we should go with a simpler model
> until someone shows a performance issue with it? Ben's original
> patches had a per-task_struct lock, and a thread forking with CLONE_VM
> would down_read() its group leader's lock. Something that's even
> simpler (doesn't have to deal with thread group leader changing due to
> an execve()), and avoids the per-task_struct overhead would be to put
> the lock in sighand_struct instead (so only one per process). The
> procs file writer does a down_write(&tsk->sighand->fork_sem), and
> cgroup_fork() can do a down_read(&current->sighand->fork_sem) if
> flags&CLONE_SIGHAND.
> 
> If you put it as the second member of sighand_struct, there wouldn't
> even be any extra cacheline bouncing in the common case, since
> copy_sighand() would already have brought that line into cache in
> order to do atomic_inc(&current->sighand->count)

You meant signal_struct, right? sighand_struct can be shared by several
thread groups, while signal_struct can't.

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid  at once
  2009-08-06 11:24                                                 ` Paul Menage
  (?)
  (?)
@ 2009-08-06 11:39                                                 ` Peter Zijlstra
  2009-08-06 15:19                                                     ` Paul E. McKenney
  -1 siblings, 1 reply; 100+ messages in thread
From: Peter Zijlstra @ 2009-08-06 11:39 UTC (permalink / raw)
  To: Paul Menage
  Cc: Benjamin Blum, containers, linux-kernel, akpm, Ingo Molnar,
	paulmck, oleg

On Thu, 2009-08-06 at 04:24 -0700, Paul Menage wrote:
> On Thu, Aug 6, 2009 at 4:02 AM, Peter Zijlstra<a.p.zijlstra@chello.nl> wrote:
> >
> > Taking that many locks in general, some apps (JVM based usually) tend to
> > be thread heavy and can easily have hundreds of them, even on relatively
> 
> Oh, I'm well aware that apps can be heavily multi-threaded - we have
> much worse cases at Google.
> 
> >
> > Now that's not real nice is it ;-)
> 
> Not particularly - but who exactly is going to be moving processes
> with thousands of threads between cgroups on a lockdep-enabled debug
> kernel?

All it takes is 8 or 48 (or soon 2048), depending on your particular
annotation. I might, and then I'd have to come and kick you ;-)

Really, lockdep not being able to deal with something is a strong
indication that you're doing something wonky.

Stronger, you can even do wonky things which lockdep thinks are
absolutely fine.

And doing wonky things should be avoided :-)

Luckily we seem to have found a sensible solution.

> What benefits does the additional complexity of SRCU give, over the
> simple solution of putting an rwsem in the same cache line as
> sighand->count ?

I said:

> Then again, clone() might already serialize on the process as a whole
> (not sure though, Oleg/Ingo?), in which case you can indeed take a
> process wide lock.

Which, looking up sighand->count, seems to be the case:

static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk)
{
        struct sighand_struct *sig;

        if (clone_flags & CLONE_SIGHAND) {
                atomic_inc(&current->sighand->count);
                return 0;
        }


So yes, putting a rwsem in there sounds fine, you're already bouncing
it.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid  at once
  2009-08-06 11:24                                           ` Louis Rilling
  (?)
@ 2009-08-06 11:40                                           ` Paul Menage
       [not found]                                             ` <6599ad830908060440g2f6cbed6xdc54c7096cd3745e-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
                                                               ` (2 more replies)
  -1 siblings, 3 replies; 100+ messages in thread
From: Paul Menage @ 2009-08-06 11:40 UTC (permalink / raw)
  To: Paul Menage, Benjamin Blum, containers, linux-kernel, akpm,
	Peter Zijlstra

On Thu, Aug 6, 2009 at 4:24 AM, Louis Rilling<Louis.Rilling@kerlabs.com> wrote:
>
> You meant signal_struct, right? sighand_struct can be shared by several
> thread groups, while signal_struct can't.
>

No, I meant sighand_struct. I realise that it *can* be shared between
processes, but I didn't think that NPTL actually did so. (Are there
common cases of this happening?) And in cases where it was shared, it
wouldn't affect correctness, but simply create the potential for a
little more contention.

I agree that signal_struct might in principle be a better place for
it, but the first cacheline of signal_struct appears to be occupied
with performance-sensitive things (a couple of counters and a queue
used in do_wait()) already, whereas the first cacheline of
sighand_struct only appears to be incremented/decremented during
fork/exit, and when delivering a bunch of mostly-fatal signals.

But having said that, if having it in signal_struct isn't considered a
potential performance hit, it would be fine there too.

Paul
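
In other words, the layout being argued for is roughly the following (sketch
only, not the actual kernel definition: the existing members are elided and
fork_sem is the hypothetical name used earlier in the thread):

/*
 * Sketch of the proposed placement: fork_sem sits right next to count,
 * so the "procs" writer and copy_sighand() touch the same cache line
 * that CLONE_SIGHAND forks already bounce via atomic_inc(&sighand->count).
 */
struct sighand_struct {
	atomic_t		count;
	struct rw_semaphore	fork_sem;	/* proposed addition */
	/* ... remaining members (action[], siglock, ...) unchanged ... */
};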

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid at once
  2009-08-06 11:40                                           ` Paul Menage
       [not found]                                             ` <6599ad830908060440g2f6cbed6xdc54c7096cd3745e-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-08-06 14:54                                             ` Louis Rilling
  2009-08-08  1:41                                             ` Benjamin Blum
  2 siblings, 0 replies; 100+ messages in thread
From: Louis Rilling @ 2009-08-06 14:54 UTC (permalink / raw)
  To: Paul Menage; +Cc: Benjamin Blum, containers, linux-kernel, akpm, Peter Zijlstra

On 06/08/09  4:40 -0700, Paul Menage wrote:
> On Thu, Aug 6, 2009 at 4:24 AM, Louis Rilling<Louis.Rilling@kerlabs.com> wrote:
> >
> > You meant signal_struct, right? sighand_struct can be shared by several
> > thread groups, while signal_struct can't.
> >
> 
> No, I meant sighand_struct. I realise that it *can* be shared between
> processes, but I didn't think that NPTL actually did so. (Are there
> common cases of this happening?)

Not that I'm aware of, besides old LinuxThreads of course.

Thanks,

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid at once
@ 2009-08-06 15:19                                                     ` Paul E. McKenney
  0 siblings, 0 replies; 100+ messages in thread
From: Paul E. McKenney @ 2009-08-06 15:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Menage, Benjamin Blum, containers, linux-kernel, akpm,
	Ingo Molnar, oleg

On Thu, Aug 06, 2009 at 01:39:21PM +0200, Peter Zijlstra wrote:
> On Thu, 2009-08-06 at 04:24 -0700, Paul Menage wrote:
> > On Thu, Aug 6, 2009 at 4:02 AM, Peter Zijlstra<a.p.zijlstra@chello.nl> wrote:
> > >
> > > Taking that many locks in general, some apps (JVM based usually) tend to
> > > be thread heavy and can easily have hundreds of them, even on relatively
> > 
> > Oh, I'm well aware that apps can be heavily multi-threaded - we have
> > much worse cases at Google.
> > 
> > >
> > > Now that's not real nice is it ;-)
> > 
> > Not particularly - but who exactly is going to be moving processes
> > with thousands of threads between cgroups on a lockdep-enabled debug
> > kernel?
> 
> All it takes is 8 or 48 (or soon 2048), depending on your particular
> annotation. I might, and then I'd have to come and kick you ;-)
> 
> Really, lockdep not being able to deal with something is a strong
> indication that you're doing something wonky.
> 
> Stronger, you can even do wonky things which lockdep thinks are
> absolutely fine.
> 
> And doing wonky things should be avoided :-)
> 
> Luckily we seem to have found a sensible solution.
> 
> > What benefits does the additional complexity of SRCU give, over the
> > simple solution of putting an rwsem in the same cache line as
> > sighand->count ?
> 
> I said:
> 
> > Then again, clone() might already serialize on the process as a whole
> > (not sure though, Oleg/Ingo?), in which case you can indeed take a
> > process wide lock.
> 
> Which, looking up sighand->count, seems to be the case:
> 
> static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk)
> {
>         struct sighand_struct *sig;
> 
>         if (clone_flags & CLONE_SIGHAND) {
>                 atomic_inc(&current->sighand->count);
>                 return 0;
>         }
> 
> 
> So yes, putting a rwsem in there sounds fine, you're already bouncing
> it.

If the critical section is small, is an rwsem really better than a
straight mutex?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid at once
  2009-08-06 15:19                                                     ` Paul E. McKenney
  (?)
@ 2009-08-06 15:24                                                     ` Peter Zijlstra
  2009-08-06 15:37                                                       ` Paul E. McKenney
  2009-08-06 15:37                                                       ` Paul E. McKenney
  -1 siblings, 2 replies; 100+ messages in thread
From: Peter Zijlstra @ 2009-08-06 15:24 UTC (permalink / raw)
  To: paulmck
  Cc: Paul Menage, Benjamin Blum, containers, linux-kernel, akpm,
	Ingo Molnar, oleg

On Thu, 2009-08-06 at 08:19 -0700, Paul E. McKenney wrote:
> 
> > So yes, putting a rwsem in there sounds fine, you're already bouncing
> > it.
> 
> If the critical section is small, is an rwsem really better than a
> straight mutex?

Well, my understanding was that it needed to surround much, if not all,
of clone(). That's a rather large bit of code.


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid at once
  2009-08-06 15:24                                                     ` Peter Zijlstra
  2009-08-06 15:37                                                       ` Paul E. McKenney
@ 2009-08-06 15:37                                                       ` Paul E. McKenney
  1 sibling, 0 replies; 100+ messages in thread
From: Paul E. McKenney @ 2009-08-06 15:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Menage, Benjamin Blum, containers, linux-kernel, akpm,
	Ingo Molnar, oleg

On Thu, Aug 06, 2009 at 05:24:46PM +0200, Peter Zijlstra wrote:
> On Thu, 2009-08-06 at 08:19 -0700, Paul E. McKenney wrote:
> > 
> > > So yes, putting a rwsem in there sounds fine, you're already bouncing
> > > it.
> > 
> > If the critical section is small, is an rwsem really better than a
> > straight mutex?
> 
> Well, my understanding was that it needed to surround much, if not all,
> of clone(). That's a rather large bit of code.

That would indeed justify an rwsem!  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid  at once
  2009-08-06 11:40                                           ` Paul Menage
       [not found]                                             ` <6599ad830908060440g2f6cbed6xdc54c7096cd3745e-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2009-08-06 14:54                                             ` Louis Rilling
@ 2009-08-08  1:41                                             ` Benjamin Blum
  2009-08-08  1:51                                               ` Benjamin Blum
       [not found]                                               ` <2f86c2480908071841h13009856hd8fcae167b1fadbf-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2 siblings, 2 replies; 100+ messages in thread
From: Benjamin Blum @ 2009-08-08  1:41 UTC (permalink / raw)
  To: Paul Menage; +Cc: containers, linux-kernel, akpm, Peter Zijlstra

On Thu, Aug 6, 2009 at 4:40 AM, Paul Menage<menage@google.com> wrote:
> On Thu, Aug 6, 2009 at 4:24 AM, Louis Rilling<Louis.Rilling@kerlabs.com> wrote:
>>
>> You meant signal_struct, right? sighand_struct can be shared by several
>> thread groups, while signal_struct can't.
>>
>
> No, I meant sighand_struct. I realise that it *can* be shared between
> processes, but I didn't think that NPTL actually did so. (Are there
> common cases of this happening?) And in cases where it was shared, it
> wouldn't affect correctness, but simply create the potential for a
> little more contention.
>
> I agree that signal_struct might in principle be a better place for
> it, but the first cacheline of signal_struct appears to be occupied
> with performance-sensitive things (a couple of counters and a queue
> used in do_wait()) already, whereas the first cacheline of
> sighand_struct only appears to be incremented/decremented during
> fork/exit, and when delivering a bunch of mostly-fatal signals.
>
> But having said that, if having it in signal_struct isn't considered a
> potential performance hit, it would be fine there too.
>
> Paul
>

I'm presently rewriting the locking scheme here to have the rwsem live
in sighand_struct, and writing two new functions lock_threadgroup_fork
and unlock_threadgroup_fork (for use in cgroup_attach_proc) which will
live in cgroup.c but are generic enough that they could be used by
anybody who wants to do threadgroup-wide fork-sensitive changes. I'm
also having the fork_lock under an #ifdef CONFIG_CGROUPS, which could
be expanded to, say, CONFIG_FORK_LOCK_THREADGROUP (which CGROUPS would
depend upon) if somebody else ever wanted to use this lock as well.
For genericness, I'll have the down_read and up_read on the lock in
do_fork() directly (instead of cgroup_fork and cgroup_post_fork as it
is in this version of the patch).

If there are no more comments/discussion on the locking scheme, I'll
resubmit the patch series with these changes approximately Monday.
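
For illustration, the two helpers described above might end up looking roughly
like this (the bodies are guesses built on the fork_sem idea discussed earlier
in the thread, not the resubmitted patch):

#ifdef CONFIG_CGROUPS
/*
 * Sketch only: exclude every fork()/clone() in tsk's threadgroup for
 * the duration of a cgroup_attach_proc() call. The read side would be
 * a matching down_read()/up_read() around the fork path.
 */
void lock_threadgroup_fork(struct task_struct *tsk)
{
	down_write(&tsk->sighand->fork_sem);
}

void unlock_threadgroup_fork(struct task_struct *tsk)
{
	up_write(&tsk->sighand->fork_sem);
}
#endif /* CONFIG_CGROUPS */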

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH 6/6] Makes procs file writable to move all threads by tgid  at once
  2009-08-08  1:41                                             ` Benjamin Blum
@ 2009-08-08  1:51                                               ` Benjamin Blum
       [not found]                                               ` <2f86c2480908071841h13009856hd8fcae167b1fadbf-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 0 replies; 100+ messages in thread
From: Benjamin Blum @ 2009-08-08  1:51 UTC (permalink / raw)
  To: Paul Menage; +Cc: containers, linux-kernel, akpm, Peter Zijlstra

On Fri, Aug 7, 2009 at 6:41 PM, Benjamin Blum<bblum@google.com> wrote:
> For genericness, I'll have the down_read and up_read on the lock in
> do_fork() directly (instead of cgroup_fork and cgroup_post_fork as it
> is in this version of the patch).

Rather, I'll keep the read-locking in the cgroup functions for now, with
a comment noting that any future user that also wants the lock can move
those calls to fork.c.

^ permalink raw reply	[flat|nested] 100+ messages in thread

end of thread, other threads:[~2009-08-08  1:51 UTC | newest]

Thread overview: 100+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-07-31  1:51 [PATCH v2 0/6] CGroups: cgroup memberlist enhancement+fix Ben Blum
2009-07-31  1:51 ` [PATCH 1/6] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids Ben Blum
2009-07-31  1:51 ` [PATCH 2/6] Ensures correct concurrent opening/reading of pidlists across pid namespaces Ben Blum
2009-07-31  1:51 ` [PATCH 4/6] Changes css_set freeing mechanism to be under RCU Ben Blum
2009-07-31  1:51 ` [PATCH 5/6] Lets ss->can_attach and ss->attach do whole threadgroups at a time Ben Blum
     [not found]   ` <20090731015149.27908.25403.stgit-/yCBOHwbXCxd3OlUiQof+WCaruZE5nAUZeezCHUQhQ4@public.gmane.org>
2009-08-03  2:22     ` Li Zefan
2009-08-03  2:22   ` Li Zefan
2009-08-04  0:35     ` Benjamin Blum
     [not found]     ` <4A7649E1.4000200-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-08-04  0:35       ` Benjamin Blum
     [not found] ` <20090731012908.27908.62208.stgit-/yCBOHwbXCxd3OlUiQof+WCaruZE5nAUZeezCHUQhQ4@public.gmane.org>
2009-07-31  1:51   ` [PATCH 1/6] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids Ben Blum
2009-07-31  1:51   ` [PATCH 2/6] Ensures correct concurrent opening/reading of pidlists across pid namespaces Ben Blum
2009-07-31  1:51   ` [PATCH 3/6] Quick vmalloc vs kmalloc fix to the case where array size is too large Ben Blum
2009-07-31  1:51     ` Ben Blum
2009-07-31  1:51   ` [PATCH 4/6] Changes css_set freeing mechanism to be under RCU Ben Blum
2009-07-31  1:51   ` [PATCH 5/6] Lets ss->can_attach and ss->attach do whole threadgroups at a time Ben Blum
2009-07-31  1:51   ` [PATCH 6/6] Makes procs file writable to move all threads by tgid at once Ben Blum
2009-07-31  1:51 ` Ben Blum
2009-08-03  3:00   ` Li Zefan
2009-08-04  0:56     ` Benjamin Blum
2009-08-04  1:05       ` Paul Menage
     [not found]         ` <6599ad830908031805y31136eceqeff0bab455100d6c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-08-04  1:11           ` Benjamin Blum
2009-08-04  1:11         ` Benjamin Blum
     [not found]       ` <2f86c2480908031756j557e7aebmbf7951da6a1aadb0-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-08-04  1:05         ` Paul Menage
2009-08-04  1:09         ` Li Zefan
2009-08-04  1:09           ` Li Zefan
2009-08-04  1:19           ` Benjamin Blum
2009-08-04  1:45             ` Li Zefan
2009-08-04  1:55               ` Paul Menage
     [not found]               ` <4A7792C4.5010504-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-08-04  1:55                 ` Paul Menage
     [not found]             ` <2f86c2480908031819h2513cdb4tac3d6def3e0aa320-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-08-04  1:45               ` Li Zefan
     [not found]           ` <4A778A49.6040302-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-08-04  1:19             ` Benjamin Blum
     [not found]     ` <4A7652E7.4020206-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-08-04  0:56       ` Benjamin Blum
2009-08-03 17:54   ` Serge E. Hallyn
2009-08-03 18:07     ` Paul Menage
     [not found]     ` <20090803175452.GA5481-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-08-03 18:07       ` Paul Menage
2009-08-03 18:13       ` Benjamin Blum
2009-08-03 18:13     ` Benjamin Blum
     [not found]       ` <2f86c2480908031113y525b6cbdhe418b8a0364c7760-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-08-03 18:55         ` Serge E. Hallyn
2009-08-03 18:55       ` Serge E. Hallyn
     [not found]         ` <20090803185556.GA8469-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-08-03 19:45           ` Serge E. Hallyn
2009-08-03 19:45         ` Serge E. Hallyn
2009-08-03 19:55           ` Paul Menage
2009-08-04 14:01             ` Serge E. Hallyn
     [not found]             ` <6599ad830908031255j68ce047x7165bfefa62ed53c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-08-04 14:01               ` Serge E. Hallyn
2009-08-04 21:40               ` Matt Helsley
2009-08-04 21:40             ` Matt Helsley
2009-08-04 21:40             ` Matt Helsley
     [not found]           ` <20090803194555.GA10158-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-08-03 19:55             ` Paul Menage
2009-08-04 18:48             ` Paul Menage
2009-08-04 18:48               ` Paul Menage
     [not found]               ` <6599ad830908041148h6d3f3e9bxfef9f3eedec0ab6d-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-08-04 19:01                 ` Serge E. Hallyn
2009-08-04 19:01                   ` Serge E. Hallyn
2009-08-04 19:14                 ` Benjamin Blum
2009-08-04 19:14                   ` Benjamin Blum
     [not found]                   ` <2f86c2480908041214r1f23c1b7q9a25b04e26c92a1a-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-08-04 19:28                     ` Paul Menage
2009-08-04 19:28                       ` Paul Menage
2009-08-05 10:20                       ` Louis Rilling
     [not found]                         ` <20090805102057.GT29252-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
2009-08-05 16:11                           ` Paul Menage
2009-08-05 16:11                             ` Paul Menage
     [not found]                             ` <6599ad830908050911t6f23f810i65fe8fe17f3ee698-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-08-05 16:42                               ` Louis Rilling
2009-08-05 16:42                             ` Louis Rilling
2009-08-05 16:53                               ` Peter Zijlstra
     [not found]                               ` <20090805164218.GB26446-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
2009-08-05 16:53                                 ` Peter Zijlstra
2009-08-06  0:01                                 ` Benjamin Blum
2009-08-06  0:01                               ` Benjamin Blum
2009-08-06  9:58                                 ` Louis Rilling
2009-08-06 10:04                                   ` Louis Rilling
     [not found]                                   ` <20090806095854.GD26446-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
2009-08-06 10:04                                     ` Louis Rilling
2009-08-06 10:28                                     ` Paul Menage
2009-08-06 10:28                                       ` Paul Menage
     [not found]                                       ` <6599ad830908060328y21a008c1pc5ed5c27e0ec905d-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-08-06 10:34                                         ` Peter Zijlstra
2009-08-06 10:34                                           ` Peter Zijlstra
2009-08-06 10:42                                           ` Paul Menage
2009-08-06 10:42                                           ` Paul Menage
     [not found]                                             ` <6599ad830908060342m1fc8cdd2me25af248a8e0e183-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-08-06 11:02                                               ` Peter Zijlstra
2009-08-06 11:02                                             ` Peter Zijlstra
2009-08-06 11:24                                               ` Paul Menage
2009-08-06 11:24                                                 ` Paul Menage
     [not found]                                                 ` <6599ad830908060424r72e1aa12g2b246785e7bc039c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-08-06 11:39                                                   ` Peter Zijlstra
2009-08-06 11:39                                                 ` Peter Zijlstra
2009-08-06 15:19                                                   ` Paul E. McKenney
2009-08-06 15:19                                                     ` Paul E. McKenney
2009-08-06 15:24                                                     ` Peter Zijlstra
2009-08-06 15:37                                                       ` Paul E. McKenney
2009-08-06 15:37                                                       ` Paul E. McKenney
     [not found]                                                     ` <20090806151922.GB6747-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2009-08-06 15:24                                                       ` Peter Zijlstra
2009-08-06 11:24                                         ` Louis Rilling
2009-08-06 11:24                                           ` Louis Rilling
2009-08-06 11:40                                           ` Paul Menage
     [not found]                                             ` <6599ad830908060440g2f6cbed6xdc54c7096cd3745e-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-08-06 14:54                                               ` Louis Rilling
2009-08-08  1:41                                               ` Benjamin Blum
2009-08-06 14:54                                             ` Louis Rilling
2009-08-08  1:41                                             ` Benjamin Blum
2009-08-08  1:51                                               ` Benjamin Blum
     [not found]                                               ` <2f86c2480908071841h13009856hd8fcae167b1fadbf-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-08-08  1:51                                                 ` Benjamin Blum
     [not found]                                           ` <20090806112450.GF26446-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
2009-08-06 11:40                                             ` Paul Menage
     [not found]                                 ` <2f86c2480908051701s57120404q475edbedb58cdca1-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-08-06  9:58                                   ` Louis Rilling
     [not found]                       ` <6599ad830908041228w67bc6f7fh57e28f244e1923b3-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-08-05 10:20                         ` Louis Rilling
     [not found]   ` <20090731015154.27908.9639.stgit-/yCBOHwbXCxd3OlUiQof+WCaruZE5nAUZeezCHUQhQ4@public.gmane.org>
2009-08-03  3:00     ` Li Zefan
2009-08-03 17:54     ` Serge E. Hallyn
