linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 00/10] cpuset: implement sane hierarchy behaviors
@ 2013-06-05  9:14 Li Zefan
  2013-06-05  9:15 ` [PATCH v2 01/10] cpuset: remove redundant check in cpuset_cpus_allowed_fallback() Li Zefan
                   ` (9 more replies)
  0 siblings, 10 replies; 17+ messages in thread
From: Li Zefan @ 2013-06-05  9:14 UTC (permalink / raw)
  To: Tejun Heo; +Cc: LKML, Cgroups, Containers

v1 -> v2:

- add documentation in include/linux/cgroup.h
- drop rcu_read_lock() before calling update_tasks_nodemask() when iterating
  the cpuset hierarchy

======================================

Currently some cpuset behaviors are not friendly when cpuset is co-mounted
with other cgroup controllers.

With this patchset, if cpuset is mounted with the sane_behavior option, it
behaves differently:

- Tasks will be kept in empty cpusets when hotplug happens and take masks
of ancestors with non-empty cpus/mems, instead of being moved to an ancestor.

- A task can be moved into an empty cpuset, and again it takes the masks of
its ancestors, so the user can drop a task into a newly created cgroup without
having to configure it first.

As tasks can reside in empty cpusets, here are the rules:

- They can be moved to another cpuset, regardless of whether it's empty or not.

- Though tasks take masks from ancestors, they take all other configs from the
empty cpuset they reside in.

- If the ancestors' masks are changed, those tasks will also be updated
to take new masks.

Li Zefan (10):
      cpuset: remove redundant check in cpuset_cpus_allowed_fallback()
      cpuset: cleanup guarantee_online_{cpus|mems}()
      cpuset: remove unnecessary variable in cpuset_attach()
      cpuset: remove cpuset_test_cpumask()
      cpuset: re-structure update_cpumask() a bit
      cpuset: record old_mems_allowed in struct cpuset
      cpuset: introduce effective_{cpumask|nodemask}_cpuset()
      cpuset: allow to keep tasks in empty cpusets
      cpuset: allow to move tasks to empty cpusets
      cpuset: fix to migrate mm correctly in a corner case

--
 include/linux/cgroup.h |   7 +
 kernel/cpuset.c        | 342 ++++++++++++++++++++++++++++++++++---------------
 2 files changed, 247 insertions(+), 102 deletions(-)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 01/10] cpuset: remove redundant check in cpuset_cpus_allowed_fallback()
  2013-06-05  9:14 [PATCH v2 00/10] cpuset: implement sane hierarchy behaviors Li Zefan
@ 2013-06-05  9:15 ` Li Zefan
  2013-06-05  9:15 ` [PATCH v2 02/10] cpuset: cleanup guarantee_online_{cpus|mems}() Li Zefan
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Li Zefan @ 2013-06-05  9:15 UTC (permalink / raw)
  To: Tejun Heo; +Cc: LKML, Cgroups, Containers

task_cs() will never return NULL.

Signed-off-by: Li Zefan <lizefan@huawei.com>
---
 kernel/cpuset.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 64b3f79..f0c884a 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2253,8 +2253,7 @@ void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 
 	rcu_read_lock();
 	cs = task_cs(tsk);
-	if (cs)
-		do_set_cpus_allowed(tsk, cs->cpus_allowed);
+	do_set_cpus_allowed(tsk, cs->cpus_allowed);
 	rcu_read_unlock();
 
 	/*
-- 
1.8.0.2


* [PATCH v2 02/10] cpuset: cleanup guarantee_online_{cpus|mems}()
  2013-06-05  9:14 [PATCH v2 00/10] cpuset: implement sane hierarchy behaviors Li Zefan
  2013-06-05  9:15 ` [PATCH v2 01/10] cpuset: remove redundant check in cpuset_cpus_allowed_fallback() Li Zefan
@ 2013-06-05  9:15 ` Li Zefan
  2013-06-05  9:15 ` [PATCH v2 03/10] cpuset: remove unnecessary variable in cpuset_attach() Li Zefan
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Li Zefan @ 2013-06-05  9:15 UTC (permalink / raw)
  To: Tejun Heo; +Cc: LKML, Cgroups, Containers

- We never pass a NULL @cs to these functions.
- The top cpuset always has some online cpus/mems.

Signed-off-by: Li Zefan <lizefan@huawei.com>
---
 kernel/cpuset.c | 29 +++++++----------------------
 1 file changed, 7 insertions(+), 22 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index f0c884a..d753837 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -304,53 +304,38 @@ static struct file_system_type cpuset_fs_type = {
 /*
  * Return in pmask the portion of a cpusets's cpus_allowed that
  * are online.  If none are online, walk up the cpuset hierarchy
- * until we find one that does have some online cpus.  If we get
- * all the way to the top and still haven't found any online cpus,
- * return cpu_online_mask.  Or if passed a NULL cs from an exit'ing
- * task, return cpu_online_mask.
+ * until we find one that does have some online cpus.  The top
+ * cpuset always has some cpus online.
  *
  * One way or another, we guarantee to return some non-empty subset
  * of cpu_online_mask.
  *
  * Call with callback_mutex held.
  */
-
 static void guarantee_online_cpus(const struct cpuset *cs,
 				  struct cpumask *pmask)
 {
-	while (cs && !cpumask_intersects(cs->cpus_allowed, cpu_online_mask))
+	while (!cpumask_intersects(cs->cpus_allowed, cpu_online_mask))
 		cs = parent_cs(cs);
-	if (cs)
-		cpumask_and(pmask, cs->cpus_allowed, cpu_online_mask);
-	else
-		cpumask_copy(pmask, cpu_online_mask);
-	BUG_ON(!cpumask_intersects(pmask, cpu_online_mask));
+	cpumask_and(pmask, cs->cpus_allowed, cpu_online_mask);
 }
 
 /*
  * Return in *pmask the portion of a cpusets's mems_allowed that
  * are online, with memory.  If none are online with memory, walk
  * up the cpuset hierarchy until we find one that does have some
- * online mems.  If we get all the way to the top and still haven't
- * found any online mems, return node_states[N_MEMORY].
+ * online mems.  The top cpuset always has some mems online.
  *
  * One way or another, we guarantee to return some non-empty subset
  * of node_states[N_MEMORY].
  *
  * Call with callback_mutex held.
  */
-
 static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask)
 {
-	while (cs && !nodes_intersects(cs->mems_allowed,
-					node_states[N_MEMORY]))
+	while (!nodes_intersects(cs->mems_allowed, node_states[N_MEMORY]))
 		cs = parent_cs(cs);
-	if (cs)
-		nodes_and(*pmask, cs->mems_allowed,
-					node_states[N_MEMORY]);
-	else
-		*pmask = node_states[N_MEMORY];
-	BUG_ON(!nodes_intersects(*pmask, node_states[N_MEMORY]));
+	nodes_and(*pmask, cs->mems_allowed, node_states[N_MEMORY]);
 }
 
 /*
-- 
1.8.0.2


* [PATCH v2 03/10] cpuset: remove unnecessary variable in cpuset_attach()
  2013-06-05  9:14 [PATCH v2 00/10] cpuset: implement sane hierarchy behaviors Li Zefan
  2013-06-05  9:15 ` [PATCH v2 01/10] cpuset: remove redundant check in cpuset_cpus_allowed_fallback() Li Zefan
  2013-06-05  9:15 ` [PATCH v2 02/10] cpuset: cleanup guarantee_online_{cpus|mems}() Li Zefan
@ 2013-06-05  9:15 ` Li Zefan
  2013-06-05  9:15 ` [PATCH v2 04/10] cpuset: remove cpuset_test_cpumask() Li Zefan
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Li Zefan @ 2013-06-05  9:15 UTC (permalink / raw)
  To: Tejun Heo; +Cc: LKML, Cgroups, Containers

We can just use oldcs->mems_allowed.

Signed-off-by: Li Zefan <lizefan@huawei.com>
---
 kernel/cpuset.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index d753837..dbef832 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1407,8 +1407,7 @@ static cpumask_var_t cpus_attach;
 
 static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
 {
-	/* static bufs protected by cpuset_mutex */
-	static nodemask_t cpuset_attach_nodemask_from;
+	/* static buf protected by cpuset_mutex */
 	static nodemask_t cpuset_attach_nodemask_to;
 	struct mm_struct *mm;
 	struct task_struct *task;
@@ -1442,13 +1441,12 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
 	 * Change mm, possibly for multiple threads in a threadgroup. This is
 	 * expensive and may sleep.
 	 */
-	cpuset_attach_nodemask_from = oldcs->mems_allowed;
 	cpuset_attach_nodemask_to = cs->mems_allowed;
 	mm = get_task_mm(leader);
 	if (mm) {
 		mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
 		if (is_memory_migrate(cs))
-			cpuset_migrate_mm(mm, &cpuset_attach_nodemask_from,
+			cpuset_migrate_mm(mm, &oldcs->mems_allowed,
 					  &cpuset_attach_nodemask_to);
 		mmput(mm);
 	}
-- 
1.8.0.2


* [PATCH v2 04/10] cpuset: remove cpuset_test_cpumask()
  2013-06-05  9:14 [PATCH v2 00/10] cpuset: implement sane hierarchy behaviors Li Zefan
                   ` (2 preceding siblings ...)
  2013-06-05  9:15 ` [PATCH v2 03/10] cpuset: remove unnecessary variable in cpuset_attach() Li Zefan
@ 2013-06-05  9:15 ` Li Zefan
  2013-06-05  9:15 ` [PATCH v2 05/10] cpuset: re-structure update_cpumask() a bit Li Zefan
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Li Zefan @ 2013-06-05  9:15 UTC (permalink / raw)
  To: Tejun Heo; +Cc: LKML, Cgroups, Containers

The same test is already done in set_cpus_allowed_ptr(), so
cpuset_test_cpumask() is redundant.

Signed-off-by: Li Zefan <lizefan@huawei.com>
---
 kernel/cpuset.c | 19 +------------------
 1 file changed, 1 insertion(+), 18 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index dbef832..51f8e1d 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -784,23 +784,6 @@ void rebuild_sched_domains(void)
 }
 
 /**
- * cpuset_test_cpumask - test a task's cpus_allowed versus its cpuset's
- * @tsk: task to test
- * @scan: struct cgroup_scanner contained in its struct cpuset_hotplug_scanner
- *
- * Call with cpuset_mutex held.  May take callback_mutex during call.
- * Called for each task in a cgroup by cgroup_scan_tasks().
- * Return nonzero if this tasks's cpus_allowed mask should be changed (in other
- * words, if its mask is not equal to its cpuset's mask).
- */
-static int cpuset_test_cpumask(struct task_struct *tsk,
-			       struct cgroup_scanner *scan)
-{
-	return !cpumask_equal(&tsk->cpus_allowed,
-			(cgroup_cs(scan->cg))->cpus_allowed);
-}
-
-/**
  * cpuset_change_cpumask - make a task's cpus_allowed the same as its cpuset's
  * @tsk: task to test
  * @scan: struct cgroup_scanner containing the cgroup of the task
@@ -835,7 +818,7 @@ static void update_tasks_cpumask(struct cpuset *cs, struct ptr_heap *heap)
 	struct cgroup_scanner scan;
 
 	scan.cg = cs->css.cgroup;
-	scan.test_task = cpuset_test_cpumask;
+	scan.test_task = NULL;
 	scan.process_task = cpuset_change_cpumask;
 	scan.heap = heap;
 	cgroup_scan_tasks(&scan);
-- 
1.8.0.2


* [PATCH v2 05/10] cpuset: re-structure update_cpumask() a bit
  2013-06-05  9:14 [PATCH v2 00/10] cpuset: implement sane hierarchy behaviors Li Zefan
                   ` (3 preceding siblings ...)
  2013-06-05  9:15 ` [PATCH v2 04/10] cpuset: remove cpuset_test_cpumask() Li Zefan
@ 2013-06-05  9:15 ` Li Zefan
  2013-06-05 20:57   ` Tejun Heo
  2013-06-05  9:16 ` [PATCH v2 06/10] cpuset: record old_mems_allowed in struct cpuset Li Zefan
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 17+ messages in thread
From: Li Zefan @ 2013-06-05  9:15 UTC (permalink / raw)
  To: Tejun Heo; +Cc: LKML, Cgroups, Containers

Check if cpus_allowed is to be changed before calling validate_change().

This won't change any behavior, but later it will allow us to do this:

 # mkdir /cpuset/child
 # echo $$ > /cpuset/child/tasks	/* empty cpuset */
 # echo > /cpuset/child/cpuset.cpus	/* do nothing, won't fail */

Without this patch, the last operation will fail.

Signed-off-by: Li Zefan <lizefan@huawei.com>
---
 kernel/cpuset.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 51f8e1d..535dce6 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -856,14 +856,15 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 		if (!cpumask_subset(trialcs->cpus_allowed, cpu_active_mask))
 			return -EINVAL;
 	}
-	retval = validate_change(cs, trialcs);
-	if (retval < 0)
-		return retval;
 
 	/* Nothing to do if the cpus didn't change */
 	if (cpumask_equal(cs->cpus_allowed, trialcs->cpus_allowed))
 		return 0;
 
+	retval = validate_change(cs, trialcs);
+	if (retval < 0)
+		return retval;
+
 	retval = heap_init(&heap, PAGE_SIZE, GFP_KERNEL, NULL);
 	if (retval)
 		return retval;
-- 
1.8.0.2


* [PATCH v2 06/10] cpuset: record old_mems_allowed in struct cpuset
  2013-06-05  9:14 [PATCH v2 00/10] cpuset: implement sane hierarchy behaviors Li Zefan
                   ` (4 preceding siblings ...)
  2013-06-05  9:15 ` [PATCH v2 05/10] cpuset: re-structure update_cpumask() a bit Li Zefan
@ 2013-06-05  9:16 ` Li Zefan
  2013-06-05 19:45   ` Tejun Heo
  2013-06-05  9:16 ` [PATCH v2 07/10] cpuset: introduce effective_{cpumask|nodemask}_cpuset() Li Zefan
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 17+ messages in thread
From: Li Zefan @ 2013-06-05  9:16 UTC (permalink / raw)
  To: Tejun Heo; +Cc: LKML, Cgroups, Containers

When we update a cpuset's mems_allowed and thus update tasks'
mems_allowed, it's required to pass the old mems_allowed and new
mems_allowed to cpuset_migrate_mm().

Currently we save old mems_allowed in a temp local variable before
changing cpuset->mems_allowed. This patch changes it by saving
old mems_allowed in cpuset->old_mems_allowed.

This currently won't change any behavior, but it will later allow
us to keep tasks in empty cpusets.

Signed-off-by: Li Zefan <lizefan@huawei.com>
---
 kernel/cpuset.c | 62 +++++++++++++++++++++++++++++++++------------------------
 1 file changed, 36 insertions(+), 26 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 535dce6..b848505 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -87,6 +87,18 @@ struct cpuset {
 	cpumask_var_t cpus_allowed;	/* CPUs allowed to tasks in cpuset */
 	nodemask_t mems_allowed;	/* Memory Nodes allowed to tasks */
 
+	/*
+	 * These are the old Memory Nodes tasks took on.
+	 *
+	 * - top_cpuset.old_mems_allowed is initialized to mems_allowed.
+	 * - A new cpuset's old_mems_allowed is initialized when some
+	 *   task is moved into it.
+	 * - old_mems_allowed is used in cpuset_migrate_mm() when we change
+	 *   cpuset.mems_allowed and have tasks' nodemask updated, and
+	 *   then old_mems_allowed is updated to mems_allowed.
+	 */
+	nodemask_t old_mems_allowed;
+
 	struct fmeter fmeter;		/* memory_pressure filter */
 
 	/*
@@ -976,16 +988,12 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk,
 static void cpuset_change_nodemask(struct task_struct *p,
 				   struct cgroup_scanner *scan)
 {
+	struct cpuset *cs = cgroup_cs(scan->cg);
 	struct mm_struct *mm;
-	struct cpuset *cs;
 	int migrate;
-	const nodemask_t *oldmem = scan->data;
-	static nodemask_t newmems;	/* protected by cpuset_mutex */
-
-	cs = cgroup_cs(scan->cg);
-	guarantee_online_mems(cs, &newmems);
+	nodemask_t *newmems = scan->data;
 
-	cpuset_change_task_nodemask(p, &newmems);
+	cpuset_change_task_nodemask(p, newmems);
 
 	mm = get_task_mm(p);
 	if (!mm)
@@ -995,7 +1003,7 @@ static void cpuset_change_nodemask(struct task_struct *p,
 
 	mpol_rebind_mm(mm, &cs->mems_allowed);
 	if (migrate)
-		cpuset_migrate_mm(mm, oldmem, &cs->mems_allowed);
+		cpuset_migrate_mm(mm, &cs->old_mems_allowed, newmems);
 	mmput(mm);
 }
 
@@ -1004,25 +1012,26 @@ static void *cpuset_being_rebound;
 /**
  * update_tasks_nodemask - Update the nodemasks of tasks in the cpuset.
  * @cs: the cpuset in which each task's mems_allowed mask needs to be changed
- * @oldmem: old mems_allowed of cpuset cs
  * @heap: if NULL, defer allocating heap memory to cgroup_scan_tasks()
  *
  * Called with cpuset_mutex held
  * No return value. It's guaranteed that cgroup_scan_tasks() always returns 0
  * if @heap != NULL.
  */
-static void update_tasks_nodemask(struct cpuset *cs, const nodemask_t *oldmem,
-				 struct ptr_heap *heap)
+static void update_tasks_nodemask(struct cpuset *cs, struct ptr_heap *heap)
 {
+	static nodemask_t newmems;	/* protected by cpuset_mutex */
 	struct cgroup_scanner scan;
 
 	cpuset_being_rebound = cs;		/* causes mpol_dup() rebind */
 
+	guarantee_online_mems(cs, &newmems);
+
 	scan.cg = cs->css.cgroup;
 	scan.test_task = NULL;
 	scan.process_task = cpuset_change_nodemask;
 	scan.heap = heap;
-	scan.data = (nodemask_t *)oldmem;
+	scan.data = &newmems;
 
 	/*
 	 * The mpol_rebind_mm() call takes mmap_sem, which we couldn't
@@ -1036,6 +1045,12 @@ static void update_tasks_nodemask(struct cpuset *cs, const nodemask_t *oldmem,
 	 */
 	cgroup_scan_tasks(&scan);
 
+	/*
+	 * All the tasks' nodemasks have been updated, update
+	 * cs->old_mems_allowed.
+	 */
+	cs->old_mems_allowed = newmems;
+
 	/* We're done rebinding vmas to this cpuset's new mems_allowed. */
 	cpuset_being_rebound = NULL;
 }
@@ -1056,13 +1071,9 @@ static void update_tasks_nodemask(struct cpuset *cs, const nodemask_t *oldmem,
 static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
 			   const char *buf)
 {
-	NODEMASK_ALLOC(nodemask_t, oldmem, GFP_KERNEL);
 	int retval;
 	struct ptr_heap heap;
 
-	if (!oldmem)
-		return -ENOMEM;
-
 	/*
 	 * top_cpuset.mems_allowed tracks node_stats[N_MEMORY];
 	 * it's read-only
@@ -1091,8 +1102,8 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
 			goto done;
 		}
 	}
-	*oldmem = cs->mems_allowed;
-	if (nodes_equal(*oldmem, trialcs->mems_allowed)) {
+
+	if (nodes_equal(cs->mems_allowed, trialcs->mems_allowed)) {
 		retval = 0;		/* Too easy - nothing to do */
 		goto done;
 	}
@@ -1108,11 +1119,10 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
 	cs->mems_allowed = trialcs->mems_allowed;
 	mutex_unlock(&callback_mutex);
 
-	update_tasks_nodemask(cs, oldmem, &heap);
+	update_tasks_nodemask(cs, &heap);
 
 	heap_free(&heap);
 done:
-	NODEMASK_FREE(oldmem);
 	return retval;
 }
 
@@ -1425,7 +1435,6 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
 	 * Change mm, possibly for multiple threads in a threadgroup. This is
 	 * expensive and may sleep.
 	 */
-	cpuset_attach_nodemask_to = cs->mems_allowed;
 	mm = get_task_mm(leader);
 	if (mm) {
 		mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
@@ -1435,6 +1444,8 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
 		mmput(mm);
 	}
 
+	cs->old_mems_allowed = cpuset_attach_nodemask_to;
+
 	cs->attach_in_progress--;
 
 	/*
@@ -2001,7 +2012,7 @@ static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
 static void cpuset_propagate_hotplug_workfn(struct work_struct *work)
 {
 	static cpumask_t off_cpus;
-	static nodemask_t off_mems, tmp_mems;
+	static nodemask_t off_mems;
 	struct cpuset *cs = container_of(work, struct cpuset, hotplug_work);
 	bool is_empty;
 
@@ -2020,11 +2031,10 @@ static void cpuset_propagate_hotplug_workfn(struct work_struct *work)
 
 	/* remove offline mems from @cs */
 	if (!nodes_empty(off_mems)) {
-		tmp_mems = cs->mems_allowed;
 		mutex_lock(&callback_mutex);
 		nodes_andnot(cs->mems_allowed, cs->mems_allowed, off_mems);
 		mutex_unlock(&callback_mutex);
-		update_tasks_nodemask(cs, &tmp_mems, NULL);
+		update_tasks_nodemask(cs, NULL);
 	}
 
 	is_empty = cpumask_empty(cs->cpus_allowed) ||
@@ -2116,11 +2126,10 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 
 	/* synchronize mems_allowed to N_MEMORY */
 	if (mems_updated) {
-		tmp_mems = top_cpuset.mems_allowed;
 		mutex_lock(&callback_mutex);
 		top_cpuset.mems_allowed = new_mems;
 		mutex_unlock(&callback_mutex);
-		update_tasks_nodemask(&top_cpuset, &tmp_mems, NULL);
+		update_tasks_nodemask(&top_cpuset, NULL);
 	}
 
 	/* if cpus or mems went down, we need to propagate to descendants */
@@ -2186,6 +2195,7 @@ void __init cpuset_init_smp(void)
 {
 	cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
 	top_cpuset.mems_allowed = node_states[N_MEMORY];
+	top_cpuset.old_mems_allowed = top_cpuset.mems_allowed;
 
 	register_hotmemory_notifier(&cpuset_track_online_nodes_nb);
 
-- 
1.8.0.2


* [PATCH v2 07/10] cpuset: introduce effective_{cpumask|nodemask}_cpuset()
  2013-06-05  9:14 [PATCH v2 00/10] cpuset: implement sane hierarchy behaviors Li Zefan
                   ` (5 preceding siblings ...)
  2013-06-05  9:16 ` [PATCH v2 06/10] cpuset: record old_mems_allowed in struct cpuset Li Zefan
@ 2013-06-05  9:16 ` Li Zefan
  2013-06-05  9:16 ` [PATCH v2 08/10] cpuset: allow to keep tasks in empty cpusets Li Zefan
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Li Zefan @ 2013-06-05  9:16 UTC (permalink / raw)
  To: Tejun Heo; +Cc: LKML, Cgroups, Containers

effective_cpumask_cpuset() returns the nearest ancestor cpuset which has
a non-empty cpumask.

If a cpuset is empty and the tasks in it need to update their
cpus_allowed, they take on the ancestor cpuset's cpumask.

This currently won't change any behavior, but it will later allow us
to keep tasks in empty cpusets.

Signed-off-by: Li Zefan <lizefan@huawei.com>
---
 kernel/cpuset.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 65 insertions(+), 11 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index b848505..5252f94 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -795,6 +795,45 @@ void rebuild_sched_domains(void)
 	mutex_unlock(&cpuset_mutex);
 }
 
+/*
+ * effective_cpumask_cpuset - return nearest ancestor with non-empty cpus
+ * @cs: the cpuset in interest
+ *
+ * A cpuset's effective cpumask is the cpumask of the nearest ancestor
+ * with non-empty cpus. We use effective cpumask whenever:
+ * - we update tasks' cpus_allowed. (they take on the ancestor's cpumask
+ *   if the cpuset they reside in has no cpus)
+ * - we want to retrieve task_cs(tsk)'s cpus_allowed.
+ *
+ * Called with cpuset_mutex held. cpuset_cpus_allowed_fallback() is an
+ * exception. See comments there.
+ */
+static struct cpuset *effective_cpumask_cpuset(struct cpuset *cs)
+{
+	while (cpumask_empty(cs->cpus_allowed))
+		cs = parent_cs(cs);
+	return cs;
+}
+
+/*
+ * effective_nodemask_cpuset - return nearest ancestor with non-empty mems
+ * @cs: the cpuset in interest
+ *
+ * A cpuset's effective nodemask is the nodemask of the nearest ancestor
+ * with non-empty mems. We use effective nodemask whenever:
+ * - we update tasks' mems_allowed. (they take on the ancestor's nodemask
+ *   if the cpuset they reside in has no mems)
+ * - we want to retrieve task_cs(tsk)'s mems_allowed.
+ *
+ * Called with cpuset_mutex held.
+ */
+static struct cpuset *effective_nodemask_cpuset(struct cpuset *cs)
+{
+	while (nodes_empty(cs->mems_allowed))
+		cs = parent_cs(cs);
+	return cs;
+}
+
 /**
  * cpuset_change_cpumask - make a task's cpus_allowed the same as its cpuset's
  * @tsk: task to test
@@ -809,7 +848,10 @@ void rebuild_sched_domains(void)
 static void cpuset_change_cpumask(struct task_struct *tsk,
 				  struct cgroup_scanner *scan)
 {
-	set_cpus_allowed_ptr(tsk, ((cgroup_cs(scan->cg))->cpus_allowed));
+	struct cpuset *cpus_cs;
+
+	cpus_cs = effective_cpumask_cpuset(cgroup_cs(scan->cg));
+	set_cpus_allowed_ptr(tsk, cpus_cs->cpus_allowed);
 }
 
 /**
@@ -924,12 +966,14 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
 							const nodemask_t *to)
 {
 	struct task_struct *tsk = current;
+	struct cpuset *mems_cs;
 
 	tsk->mems_allowed = *to;
 
 	do_migrate_pages(mm, from, to, MPOL_MF_MOVE_ALL);
 
-	guarantee_online_mems(task_cs(tsk),&tsk->mems_allowed);
+	mems_cs = effective_nodemask_cpuset(task_cs(tsk));
+	guarantee_online_mems(mems_cs, &tsk->mems_allowed);
 }
 
 /*
@@ -1022,10 +1066,11 @@ static void update_tasks_nodemask(struct cpuset *cs, struct ptr_heap *heap)
 {
 	static nodemask_t newmems;	/* protected by cpuset_mutex */
 	struct cgroup_scanner scan;
+	struct cpuset *mems_cs = effective_nodemask_cpuset(cs);
 
 	cpuset_being_rebound = cs;		/* causes mpol_dup() rebind */
 
-	guarantee_online_mems(cs, &newmems);
+	guarantee_online_mems(mems_cs, &newmems);
 
 	scan.cg = cs->css.cgroup;
 	scan.test_task = NULL;
@@ -1409,6 +1454,8 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
 	struct cgroup *oldcgrp = cgroup_taskset_cur_cgroup(tset);
 	struct cpuset *cs = cgroup_cs(cgrp);
 	struct cpuset *oldcs = cgroup_cs(oldcgrp);
+	struct cpuset *cpus_cs = effective_cpumask_cpuset(cs);
+	struct cpuset *mems_cs = effective_nodemask_cpuset(cs);
 
 	mutex_lock(&cpuset_mutex);
 
@@ -1416,9 +1463,9 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
 	if (cs == &top_cpuset)
 		cpumask_copy(cpus_attach, cpu_possible_mask);
 	else
-		guarantee_online_cpus(cs, cpus_attach);
+		guarantee_online_cpus(cpus_cs, cpus_attach);
 
-	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+	guarantee_online_mems(mems_cs, &cpuset_attach_nodemask_to);
 
 	cgroup_taskset_for_each(task, cgrp, tset) {
 		/*
@@ -1437,9 +1484,11 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
 	 */
 	mm = get_task_mm(leader);
 	if (mm) {
+		struct cpuset *mems_oldcs = effective_nodemask_cpuset(oldcs);
+
 		mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
 		if (is_memory_migrate(cs))
-			cpuset_migrate_mm(mm, &oldcs->mems_allowed,
+			cpuset_migrate_mm(mm, &mems_oldcs->mems_allowed,
 					  &cpuset_attach_nodemask_to);
 		mmput(mm);
 	}
@@ -2217,20 +2266,23 @@ void __init cpuset_init_smp(void)
 
 void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 {
+	struct cpuset *cpus_cs;
+
 	mutex_lock(&callback_mutex);
 	task_lock(tsk);
-	guarantee_online_cpus(task_cs(tsk), pmask);
+	cpus_cs = effective_cpumask_cpuset(task_cs(tsk));
+	guarantee_online_cpus(cpus_cs, pmask);
 	task_unlock(tsk);
 	mutex_unlock(&callback_mutex);
 }
 
 void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 {
-	const struct cpuset *cs;
+	const struct cpuset *cpus_cs;
 
 	rcu_read_lock();
-	cs = task_cs(tsk);
-	do_set_cpus_allowed(tsk, cs->cpus_allowed);
+	cpus_cs = effective_cpumask_cpuset(task_cs(tsk));
+	do_set_cpus_allowed(tsk, cpus_cs->cpus_allowed);
 	rcu_read_unlock();
 
 	/*
@@ -2269,11 +2321,13 @@ void cpuset_init_current_mems_allowed(void)
 
 nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
 {
+	struct cpuset *mems_cs;
 	nodemask_t mask;
 
 	mutex_lock(&callback_mutex);
 	task_lock(tsk);
-	guarantee_online_mems(task_cs(tsk), &mask);
+	mems_cs = effective_nodemask_cpuset(task_cs(tsk));
+	guarantee_online_mems(mems_cs, &mask);
 	task_unlock(tsk);
 	mutex_unlock(&callback_mutex);
 
-- 
1.8.0.2


* [PATCH v2 08/10] cpuset: allow to keep tasks in empty cpusets
  2013-06-05  9:14 [PATCH v2 00/10] cpuset: implement sane hierarchy behaviors Li Zefan
                   ` (6 preceding siblings ...)
  2013-06-05  9:16 ` [PATCH v2 07/10] cpuset: introduce effective_{cpumask|nodemask}_cpuset() Li Zefan
@ 2013-06-05  9:16 ` Li Zefan
  2013-06-05 20:51   ` Tejun Heo
  2013-06-05  9:17 ` [PATCH v2 09/10] cpuset: allow to move tasks to " Li Zefan
  2013-06-05  9:17 ` [PATCH v2 10/10] cpuset: fix to migrate mm correctly in a corner case Li Zefan
  9 siblings, 1 reply; 17+ messages in thread
From: Li Zefan @ 2013-06-05  9:16 UTC (permalink / raw)
  To: Tejun Heo; +Cc: LKML, Cgroups, Containers

To achieve this:

- We call update_tasks_cpumask/nodemask() for empty cpusets when
hotplug happens, instead of moving tasks out of them.

- When a cpuset's masks are changed by writing cpuset.cpus/mems,
we also update tasks in child cpusets which are empty.

v2:

- drop rcu_read_lock before calling update_tasks_nodemask() and
  update_tasks_cpumask(), instead of using a workqueue.

- add documentation in include/linux/cgroup.h

Signed-off-by: Li Zefan <lizefan@huawei.com>
---
 include/linux/cgroup.h |   4 ++
 kernel/cpuset.c        | 137 +++++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 120 insertions(+), 21 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index d0ad379..53e81a6 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -277,6 +277,10 @@ enum {
 	 *
 	 * - Remount is disallowed.
 	 *
+	 * - cpuset: tasks will be kept in empty cpusets when hotplug happens
+	 *   and take masks of ancestors with non-empty cpus/mems, instead of
+	 *   being moved to an ancestor.
+	 *
 	 * - memcg: use_hierarchy is on by default and the cgroup file for
 	 *   the flag is not created.
 	 *
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 5252f94..3b93098 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -878,6 +878,49 @@ static void update_tasks_cpumask(struct cpuset *cs, struct ptr_heap *heap)
 	cgroup_scan_tasks(&scan);
 }
 
+/*
+ * update_tasks_cpumask_hier - Update the cpumasks of tasks in the hierarchy.
+ * @root_cs: the root cpuset of the hierarchy
+ * @update_root: update root cpuset or not?
+ * @heap: the heap used by cgroup_scan_tasks()
+ *
+ * This will update cpumasks of tasks in @root_cs and all other empty cpusets
+ * which take on cpumask of @root_cs.
+ *
+ * Called with cpuset_mutex held
+ */
+static void update_tasks_cpumask_hier(struct cpuset *root_cs,
+				      bool update_root, struct ptr_heap *heap)
+{
+	struct cpuset *cp;
+	struct cgroup *pos_cgrp;
+
+	if (update_root)
+		update_tasks_cpumask(root_cs, heap);
+
+	rcu_read_lock();
+	cpuset_for_each_descendant_pre(cp, pos_cgrp, root_cs) {
+		/* skip the whole subtree if @cp has some CPUs */
+		if (!cpumask_empty(cp->cpus_allowed)) {
+			pos_cgrp = cgroup_rightmost_descendant(pos_cgrp);
+			continue;
+		}
+		if (!css_tryget(&cp->css))
+			continue;
+		rcu_read_unlock();
+
+		update_tasks_cpumask(cp, heap);
+
+		rcu_read_lock();
+		/*
+		 * We should acquire rcu_read_lock first, so it's still
+		 * valid to access @cp after dropping the css refcnt.
+		 */
+		css_put(&cp->css);
+	}
+	rcu_read_unlock();
+}
+
 /**
  * update_cpumask - update the cpus_allowed mask of a cpuset and all tasks in it
  * @cs: the cpuset to consider
@@ -929,11 +972,7 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
 	mutex_unlock(&callback_mutex);
 
-	/*
-	 * Scan tasks in the cpuset, and update the cpumasks of any
-	 * that need an update.
-	 */
-	update_tasks_cpumask(cs, &heap);
+	update_tasks_cpumask_hier(cs, true, &heap);
 
 	heap_free(&heap);
 
@@ -1101,6 +1140,49 @@ static void update_tasks_nodemask(struct cpuset *cs, struct ptr_heap *heap)
 }
 
 /*
+ * update_tasks_nodemask_hier - Update the nodemasks of tasks in the hierarchy.
+ * @root_cs: the root cpuset of the hierarchy
+ * @update_root: update the root cpuset or not?
+ * @heap: the heap used by cgroup_scan_tasks()
+ *
+ * This will update nodemasks of tasks in @root_cs and all other empty cpusets
+ * which take on nodemask of @root_cs.
+ *
+ * Called with cpuset_mutex held
+ */
+static void update_tasks_nodemask_hier(struct cpuset *root_cs,
+				       bool update_root, struct ptr_heap *heap)
+{
+	struct cpuset *cp;
+	struct cgroup *pos_cgrp;
+
+	if (update_root)
+		update_tasks_nodemask(root_cs, heap);
+
+	rcu_read_lock();
+	cpuset_for_each_descendant_pre(cp, pos_cgrp, root_cs) {
+		/* skip the whole subtree if @cp has some mems */
+		if (!nodes_empty(cp->mems_allowed)) {
+			pos_cgrp = cgroup_rightmost_descendant(pos_cgrp);
+			continue;
+		}
+		if (!css_tryget(&cp->css))
+			continue;
+		rcu_read_unlock();
+
+		update_tasks_nodemask(cp, heap);
+
+		rcu_read_lock();
+		/*
+		 * rcu_read_lock must be re-acquired before dropping the
+		 * css refcnt, so it's still valid to access @cp afterwards.
+		 */
+		css_put(&cp->css);
+	}
+	rcu_read_unlock();
+}
+
+/*
  * Handle user request to change the 'mems' memory placement
  * of a cpuset.  Needs to validate the request, update the
  * cpusets mems_allowed, and for each task in the cpuset,
@@ -1164,7 +1246,7 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
 	cs->mems_allowed = trialcs->mems_allowed;
 	mutex_unlock(&callback_mutex);
 
-	update_tasks_nodemask(cs, &heap);
+	update_tasks_nodemask_hier(cs, true, &heap);
 
 	heap_free(&heap);
 done:
@@ -2064,27 +2146,36 @@ static void cpuset_propagate_hotplug_workfn(struct work_struct *work)
 	static nodemask_t off_mems;
 	struct cpuset *cs = container_of(work, struct cpuset, hotplug_work);
 	bool is_empty;
+	bool sane = cgroup_sane_behavior(cs->css.cgroup);
 
 	mutex_lock(&cpuset_mutex);
 
 	cpumask_andnot(&off_cpus, cs->cpus_allowed, top_cpuset.cpus_allowed);
 	nodes_andnot(off_mems, cs->mems_allowed, top_cpuset.mems_allowed);
 
-	/* remove offline cpus from @cs */
-	if (!cpumask_empty(&off_cpus)) {
-		mutex_lock(&callback_mutex);
-		cpumask_andnot(cs->cpus_allowed, cs->cpus_allowed, &off_cpus);
-		mutex_unlock(&callback_mutex);
+	mutex_lock(&callback_mutex);
+	cpumask_andnot(cs->cpus_allowed, cs->cpus_allowed, &off_cpus);
+	mutex_unlock(&callback_mutex);
+
+	/*
+	 * If the sane_behavior flag is set, update the tasks' cpumask so
+	 * that an empty cpuset takes on its ancestor's cpumask.
+	 */
+	if ((sane && cpumask_empty(cs->cpus_allowed)) ||
+	    !cpumask_empty(&off_cpus))
 		update_tasks_cpumask(cs, NULL);
-	}
 
-	/* remove offline mems from @cs */
-	if (!nodes_empty(off_mems)) {
-		mutex_lock(&callback_mutex);
-		nodes_andnot(cs->mems_allowed, cs->mems_allowed, off_mems);
-		mutex_unlock(&callback_mutex);
+	mutex_lock(&callback_mutex);
+	nodes_andnot(cs->mems_allowed, cs->mems_allowed, off_mems);
+	mutex_unlock(&callback_mutex);
+
+	/*
+	 * If the sane_behavior flag is set, update the tasks' nodemask so
+	 * that an empty cpuset takes on its ancestor's nodemask.
+	 */
+	if ((sane && nodes_empty(cs->mems_allowed)) ||
+	    !nodes_empty(off_mems))
 		update_tasks_nodemask(cs, NULL);
-	}
 
 	is_empty = cpumask_empty(cs->cpus_allowed) ||
 		nodes_empty(cs->mems_allowed);
@@ -2092,11 +2183,13 @@ static void cpuset_propagate_hotplug_workfn(struct work_struct *work)
 	mutex_unlock(&cpuset_mutex);
 
 	/*
-	 * If @cs became empty, move tasks to the nearest ancestor with
-	 * execution resources.  This is full cgroup operation which will
+	 * If sane_behavior flag is set, we'll keep tasks in empty cpusets.
+	 *
+	 * Otherwise move tasks to the nearest ancestor with execution
+	 * resources.  This is a full cgroup operation which will
 	 * also call back into cpuset.  Should be done outside any lock.
 	 */
-	if (is_empty)
+	if (!sane && is_empty)
 		remove_tasks_in_empty_cpuset(cs);
 
 	/* the following may free @cs, should be the last operation */
@@ -2171,6 +2264,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 		cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
 		mutex_unlock(&callback_mutex);
 		/* we don't mess with cpumasks of tasks in top_cpuset */
+		update_tasks_cpumask_hier(&top_cpuset, false, NULL);
 	}
 
 	/* synchronize mems_allowed to N_MEMORY */
@@ -2179,6 +2273,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 		top_cpuset.mems_allowed = new_mems;
 		mutex_unlock(&callback_mutex);
 		update_tasks_nodemask(&top_cpuset, NULL);
+		update_tasks_nodemask_hier(&top_cpuset, false, NULL);
 	}
 
 	/* if cpus or mems went down, we need to propagate to descendants */
-- 
1.8.0.2

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 09/10] cpuset: allow to move tasks to empty cpusets
  2013-06-05  9:14 [PATCH v2 00/10] cpuset: implement sane hierarchy behaviors Li Zefan
                   ` (7 preceding siblings ...)
  2013-06-05  9:16 ` [PATCH v2 08/10] cpuset: allow to keep tasks in empty cpusets Li Zefan
@ 2013-06-05  9:17 ` Li Zefan
  2013-06-05  9:17 ` [PATCH v2 10/10] cpuset: fix to migrate mm correctly in a corner case Li Zefan
  9 siblings, 0 replies; 17+ messages in thread
From: Li Zefan @ 2013-06-05  9:17 UTC (permalink / raw)
  To: Tejun Heo; +Cc: LKML, Cgroups, Containers

Currently some cpuset behaviors are not friendly when cpuset is co-mounted
with other cgroup controllers.

Now with this patchset, if cpuset is mounted with the sane_behavior option,
it behaves differently:

- Tasks will be kept in empty cpusets when hotplug happens and take
  masks of ancestors with non-empty cpus/mems, instead of being moved to
  an ancestor.

- A task can be moved into an empty cpuset, and again it takes masks of
  ancestors, so the user can drop a task into a newly created cgroup without
  having to do anything for it.

As tasks can reside in empty cpusets, here are some rules:

- They can be moved to another cpuset, regardless of whether it is
  empty or not.

- Though a task takes masks from its ancestors, it takes all other
  configs from the empty cpuset itself.

- If the ancestors' masks are changed, those tasks will also be updated
  to take the new masks.
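The ancestor-fallback rule above can be sketched with a toy userspace model
(purely illustrative; the class and method names are hypothetical and not
kernel APIs):

```python
# Toy model of the sane_behavior mask rules: an empty cpuset takes the
# mask of its nearest non-empty ancestor. Illustrative only.

class Cpuset:
    def __init__(self, cpus, parent=None):
        self.cpus = set(cpus)       # cpus_allowed (may be empty)
        self.parent = parent

    def effective_cpus(self):
        """Walk up until a cpuset with a non-empty mask is found."""
        cs = self
        while cs.parent is not None and not cs.cpus:
            cs = cs.parent
        return cs.cpus

root = Cpuset({0, 1, 2, 3})
child = Cpuset(set(), parent=root)      # newly created, empty cpuset

# A task dropped into the empty child still runs on the ancestor's CPUs.
print(sorted(child.effective_cpus()))   # -> [0, 1, 2, 3]
```

If the root's mask later changes, recomputing `effective_cpus()` reflects the
new mask, matching the last rule above.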

v2: add documentation in include/linux/cgroup.h

Signed-off-by: Li Zefan <lizefan@huawei.com>
---
 include/linux/cgroup.h |  3 +++
 kernel/cpuset.c        | 12 +++++++++---
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 53e81a6..74e8b8e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -281,6 +281,9 @@ enum {
 	 *   and take masks of ancestors with non-empty cpus/mems, instead of
 	 *   being moved to an ancestor.
 	 *
+	 * - cpuset: a task can be moved into an empty cpuset, and again it
+	 *   takes masks of ancestors.
+	 *
 	 * - memcg: use_hierarchy is on by default and the cgroup file for
 	 *   the flag is not created.
 	 *
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 3b93098..9bb6a47 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -483,7 +483,7 @@ static int validate_change(const struct cpuset *cur, const struct cpuset *trial)
 	 */
 	ret = -ENOSPC;
 	if ((cgroup_task_count(cur->css.cgroup) || cur->attach_in_progress) &&
-	    (cpumask_empty(trial->cpus_allowed) ||
+	    (cpumask_empty(trial->cpus_allowed) &&
 	     nodes_empty(trial->mems_allowed)))
 		goto out;
 
@@ -1478,8 +1478,13 @@ static int cpuset_can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
 
 	mutex_lock(&cpuset_mutex);
 
+	/*
+	 * We allow moving tasks into an empty cpuset if the
+	 * sane_behavior flag is set.
+	 */
 	ret = -ENOSPC;
-	if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
+	if (!cgroup_sane_behavior(cgrp) &&
+	    (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed)))
 		goto out_unlock;
 
 	cgroup_taskset_for_each(task, cgrp, tset) {
@@ -1584,7 +1589,8 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
 	 * propagation if @cs doesn't have any CPU or memory.  It will move
 	 * the newly added tasks to the nearest parent which can execute.
 	 */
-	if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
+	if (!cgroup_sane_behavior(cgrp) &&
+	    (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed)))
 		schedule_cpuset_propagate_hotplug(cs);
 
 	mutex_unlock(&cpuset_mutex);
-- 
1.8.0.2


* [PATCH v2 10/10] cpuset: fix to migrate mm correctly in a corner case
  2013-06-05  9:14 [PATCH v2 00/10] cpuset: implement sane hierarchy behaviors Li Zefan
                   ` (8 preceding siblings ...)
  2013-06-05  9:17 ` [PATCH v2 09/10] cpuset: allow to move tasks to " Li Zefan
@ 2013-06-05  9:17 ` Li Zefan
  9 siblings, 0 replies; 17+ messages in thread
From: Li Zefan @ 2013-06-05  9:17 UTC (permalink / raw)
  To: Tejun Heo; +Cc: LKML, Cgroups, Containers

Before moving tasks out of empty cpusets, update_tasks_nodemask()
is called, which calls do_migrate_pages(xx, from, to). Then those
tasks are moved to an ancestor, and do_migrate_pages() is called
again.

The first time: from = node_to_be_offlined, to = empty.
The second time: from = empty, to = ancestor's nodemask.

So it looks like no pages will be migrated.

Fix this by:

- Don't call update_tasks_nodemask() on empty cpusets.
- Pass cs->old_mems_allowed to do_migrate_pages().
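The corner case and the fix can be sketched with a toy model (the
`do_migrate_pages()` below is an illustrative stand-in that just computes
which nodes the pages end up on, not the kernel function):

```python
# Toy model of the migration corner case: migrating to or from an empty
# nodemask moves nothing, so the two buggy calls are both no-ops.

def do_migrate_pages(page_nodes, from_nodes, to_nodes):
    """Move pages that live on from_nodes onto some node in to_nodes."""
    if not (from_nodes & page_nodes) or not to_nodes:
        return page_nodes                    # nothing to migrate
    target = min(to_nodes)
    return {target if n in from_nodes else n for n in page_nodes}

pages = {1}                                  # task's pages live on node 1
offlined = {1}

# Buggy sequence described above.
after1 = do_migrate_pages(pages, offlined, set())   # to = empty: no-op
after2 = do_migrate_pages(after1, set(), {0})       # from = empty: no-op
assert after2 == {1}                         # pages stuck on offlined node

# Fixed sequence: pass old_mems_allowed (the offlined nodes) as "from".
fixed = do_migrate_pages(after1, offlined, {0})
assert fixed == {0}                          # pages reach the ancestor's node
```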

Signed-off-by: Li Zefan <lizefan@huawei.com>
---
 kernel/cpuset.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 9bb6a47..de7f6c1 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1574,9 +1574,16 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
 		struct cpuset *mems_oldcs = effective_nodemask_cpuset(oldcs);
 
 		mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
-		if (is_memory_migrate(cs))
-			cpuset_migrate_mm(mm, &mems_oldcs->mems_allowed,
+		if (is_memory_migrate(cs)) {
+			/*
+			 * old_mems_allowed is the same as mems_allowed,
+			 * except when this task is being moved automatically
+			 * due to hotplug, in which case mems_allowed is
+			 * empty and old_mems_allowed holds the offlined nodes.
+			 */
+			cpuset_migrate_mm(mm, &mems_oldcs->old_mems_allowed,
 					  &cpuset_attach_nodemask_to);
+		}
 		mmput(mm);
 	}
 
@@ -2168,7 +2175,7 @@ static void cpuset_propagate_hotplug_workfn(struct work_struct *work)
 	 * for empty cpuset to take on ancestor's cpumask
 	 */
 	if ((sane && cpumask_empty(cs->cpus_allowed)) ||
-	    !cpumask_empty(&off_cpus))
+	    (!cpumask_empty(&off_cpus) && !cpumask_empty(cs->cpus_allowed)))
 		update_tasks_cpumask(cs, NULL);
 
 	mutex_lock(&callback_mutex);
@@ -2180,7 +2187,7 @@ static void cpuset_propagate_hotplug_workfn(struct work_struct *work)
 	 * for empty cpuset to take on ancestor's nodemask
 	 */
 	if ((sane && nodes_empty(cs->mems_allowed)) ||
-	    !nodes_empty(off_mems))
+	    (!nodes_empty(off_mems) && !nodes_empty(cs->mems_allowed)))
 		update_tasks_nodemask(cs, NULL);
 
 	is_empty = cpumask_empty(cs->cpus_allowed) ||
-- 
1.8.0.2


* Re: [PATCH v2 06/10] cpuset: record old_mems_allowed in struct cpuset
  2013-06-05  9:16 ` [PATCH v2 06/10] cpuset: record old_mems_allowed in struct cpuset Li Zefan
@ 2013-06-05 19:45   ` Tejun Heo
  2013-06-06  9:58     ` Li Zefan
  0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2013-06-05 19:45 UTC (permalink / raw)
  To: Li Zefan; +Cc: LKML, Cgroups, Containers

Hello, Li.

On Wed, Jun 05, 2013 at 05:16:24PM +0800, Li Zefan wrote:
> @@ -1425,7 +1435,6 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
>  	 * Change mm, possibly for multiple threads in a threadgroup. This is
>  	 * expensive and may sleep.
>  	 */
> -	cpuset_attach_nodemask_to = cs->mems_allowed;
>  	mm = get_task_mm(leader);
>  	if (mm) {
>  		mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);

This looks a bit suspicious to me.  Now we're setting mm's nodemask to
guarantee_online_mems() output rather than cs->mems_allowed.  Is this
change intended?  If so, it probably deserves an explanation in the
description?

Thanks.

-- 
tejun


* Re: [PATCH v2 08/10] cpuset: allow to keep tasks in empty cpusets
  2013-06-05  9:16 ` [PATCH v2 08/10] cpuset: allow to keep tasks in empty cpusets Li Zefan
@ 2013-06-05 20:51   ` Tejun Heo
  2013-06-06 10:26     ` Li Zefan
  0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2013-06-05 20:51 UTC (permalink / raw)
  To: Li Zefan; +Cc: LKML, Cgroups, Containers

Hello, Li.

On Wed, Jun 05, 2013 at 05:16:59PM +0800, Li Zefan wrote:
> @@ -2092,11 +2183,13 @@ static void cpuset_propagate_hotplug_workfn(struct work_struct *work)
>  	mutex_unlock(&cpuset_mutex);
>  
>  	/*
> -	 * If @cs became empty, move tasks to the nearest ancestor with
> -	 * execution resources.  This is full cgroup operation which will
> +	 * If sane_behavior flag is set, we'll keep tasks in empty cpusets.
> +	 *
> +	 * Otherwise move tasks to the nearest ancestor with execution
> +	 *  resources.  This is full cgroup operation which will
>  	 * also call back into cpuset.  Should be done outside any lock.
>  	 */
> -	if (is_empty)
> +	if (!sane && is_empty)
>  		remove_tasks_in_empty_cpuset(cs);
>  
>  	/* the following may free @cs, should be the last operation */
> @@ -2171,6 +2264,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
>  		cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
>  		mutex_unlock(&callback_mutex);
>  		/* we don't mess with cpumasks of tasks in top_cpuset */
> +		update_tasks_cpumask_hier(&top_cpuset, false, NULL);
>  	}

I'm a little confused by the order of operation.  We now have two
different hierarchical walks for hotplug propagation, right?  I
suppose the above one is added because we now also need to update the
mask when cpus are being brought online?

I wonder whether it'd be possible to merge the two paths.  My
suspicion is that we probably don't need propagate_hotplug_work
anymore now that we can drop RCU read lock while doing the pre-order
walk.  What do you think?

Thanks.

-- 
tejun


* Re: [PATCH v2 05/10] cpuset: re-structure update_cpumask() a bit
  2013-06-05  9:15 ` [PATCH v2 05/10] cpuset: re-structure update_cpumask() a bit Li Zefan
@ 2013-06-05 20:57   ` Tejun Heo
  0 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2013-06-05 20:57 UTC (permalink / raw)
  To: Li Zefan; +Cc: LKML, Cgroups, Containers

On Wed, Jun 05, 2013 at 05:15:59PM +0800, Li Zefan wrote:
> Check if cpus_allowed is to be changed before calling validate_change().
> 
> This won't change any behavior, but later it will allow us to do this:
> 
>  # mkdir /cpuset/child
>  # echo $$ > /cpuset/child/tasks	/* empty cpuset */
>  # echo > /cpuset/child/cpuset.cpus	/* do nothing, won't fail */
> 
> Without this patch, the last operation will fail.
> 
> Signed-off-by: Li Zefan <lizefan@huawei.com>

Applied 1-5 to cgroup/for-3.11-cpuset.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-3.11-cpuset

Thanks.

-- 
tejun


* Re: [PATCH v2 06/10] cpuset: record old_mems_allowed in struct cpuset
  2013-06-05 19:45   ` Tejun Heo
@ 2013-06-06  9:58     ` Li Zefan
  0 siblings, 0 replies; 17+ messages in thread
From: Li Zefan @ 2013-06-06  9:58 UTC (permalink / raw)
  To: Tejun Heo; +Cc: LKML, Cgroups, Containers

On 2013/6/6 3:45, Tejun Heo wrote:
> Hello, Li.
> 
> On Wed, Jun 05, 2013 at 05:16:24PM +0800, Li Zefan wrote:
>> @@ -1425,7 +1435,6 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
>>  	 * Change mm, possibly for multiple threads in a threadgroup. This is
>>  	 * expensive and may sleep.
>>  	 */
>> -	cpuset_attach_nodemask_to = cs->mems_allowed;
>>  	mm = get_task_mm(leader);
>>  	if (mm) {
>>  		mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
> 
> This looks a bit suspicious to me.  Now we're setting mm's nodemask to
> guarantee_online_mems() output rather than cs->mems_allowed.  Is this
> change intended?  If so, it probably deserves an explanation in the
> description?
> 

It's possible that cs->mems_allowed is empty, but I should have left
it as it is, as it's not related to this patch.



* Re: [PATCH v2 08/10] cpuset: allow to keep tasks in empty cpusets
  2013-06-05 20:51   ` Tejun Heo
@ 2013-06-06 10:26     ` Li Zefan
  2013-06-06 21:24       ` Tejun Heo
  0 siblings, 1 reply; 17+ messages in thread
From: Li Zefan @ 2013-06-06 10:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: LKML, Cgroups, Containers

On 2013/6/6 4:51, Tejun Heo wrote:
> Hello, Li.
> 
> On Wed, Jun 05, 2013 at 05:16:59PM +0800, Li Zefan wrote:
>> @@ -2092,11 +2183,13 @@ static void cpuset_propagate_hotplug_workfn(struct work_struct *work)
>>  	mutex_unlock(&cpuset_mutex);
>>  
>>  	/*
>> -	 * If @cs became empty, move tasks to the nearest ancestor with
>> -	 * execution resources.  This is full cgroup operation which will
>> +	 * If sane_behavior flag is set, we'll keep tasks in empty cpusets.
>> +	 *
>> +	 * Otherwise move tasks to the nearest ancestor with execution
>> +	 *  resources.  This is full cgroup operation which will
>>  	 * also call back into cpuset.  Should be done outside any lock.
>>  	 */
>> -	if (is_empty)
>> +	if (!sane && is_empty)
>>  		remove_tasks_in_empty_cpuset(cs);
>>  
>>  	/* the following may free @cs, should be the last operation */
>> @@ -2171,6 +2264,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
>>  		cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
>>  		mutex_unlock(&callback_mutex);
>>  		/* we don't mess with cpumasks of tasks in top_cpuset */
>> +		update_tasks_cpumask_hier(&top_cpuset, false, NULL);
>>  	}
> 
> I'm a little confused by the order of operation.  We now have two
> different hierarchical walks for hotplug propagation, right?  I
> suppose the above one is added because we now also need to update the
> mask when cpus are being brought online?
> 

The first one will only update tasks in empty cpusets (no matter online or
offline), and the second one will only update tasks in non-empty cpusets
(only when offline).

> I wonder whether it'd be possible to merge the two paths.  My
> suspicion is that we probably don't need propagate_hotplug_work
> anymore now that we can drop RCU read lock while doing the pre-order
> walk.  What do you think?
> 

It indeed can be confusing. I'll see if we can make the code clearer.



* Re: [PATCH v2 08/10] cpuset: allow to keep tasks in empty cpusets
  2013-06-06 10:26     ` Li Zefan
@ 2013-06-06 21:24       ` Tejun Heo
  0 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2013-06-06 21:24 UTC (permalink / raw)
  To: Li Zefan; +Cc: LKML, Cgroups, Containers

Hey,

On Thu, Jun 06, 2013 at 06:26:46PM +0800, Li Zefan wrote:
> > I wonder whether it'd be possible to merge the two paths.  My
> > suspicion is that we probably don't need propagate_hotplug_work
> > anymore now that we can drop RCU read lock while doing the pre-order
> > walk.  What do you think?
> 
> It indeed can be confusing. I'll see if we can make the code clearer.

Yeah, if possible, it'd be great to remove the propagation work item
and do things inline first and then add the new online propagation.

Thanks!

-- 
tejun


end of thread, other threads:[~2013-06-06 21:24 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
2013-06-05  9:14 [PATCH v2 00/10] cpuset: implement sane hierarchy behaviors Li Zefan
2013-06-05  9:15 ` [PATCH v2 01/10] cpuset: remove redundant check in cpuset_cpus_allowed_fallback() Li Zefan
2013-06-05  9:15 ` [PATCH v2 02/10] cpuset: cleanup guarantee_online_{cpus|mems}() Li Zefan
2013-06-05  9:15 ` [PATCH v2 03/10] cpuset: remove unnecessary variable in cpuset_attach() Li Zefan
2013-06-05  9:15 ` [PATCH v2 04/10] cpuset: remove cpuset_test_cpumask() Li Zefan
2013-06-05  9:15 ` [PATCH v2 05/10] cpuset: re-structure update_cpumask() a bit Li Zefan
2013-06-05 20:57   ` Tejun Heo
2013-06-05  9:16 ` [PATCH v2 06/10] cpuset: record old_mems_allowed in struct cpuset Li Zefan
2013-06-05 19:45   ` Tejun Heo
2013-06-06  9:58     ` Li Zefan
2013-06-05  9:16 ` [PATCH v2 07/10] cpuset: introduce effective_{cpumask|nodemask}_cpuset() Li Zefan
2013-06-05  9:16 ` [PATCH v2 08/10] cpuset: allow to keep tasks in empty cpusets Li Zefan
2013-06-05 20:51   ` Tejun Heo
2013-06-06 10:26     ` Li Zefan
2013-06-06 21:24       ` Tejun Heo
2013-06-05  9:17 ` [PATCH v2 09/10] cpuset: allow to move tasks to " Li Zefan
2013-06-05  9:17 ` [PATCH v2 10/10] cpuset: fix to migrate mm correctly in a corner case Li Zefan
