linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCHSET v2 cgroup/for-4.4] cgroup: make multi-process migration atomic
@ 2015-09-11 19:00 Tejun Heo
  2015-09-11 19:00 ` [PATCH 1/5] cpuset: migrate memory only for threadgroup leaders Tejun Heo
                   ` (5 more replies)
  0 siblings, 6 replies; 10+ messages in thread
From: Tejun Heo @ 2015-09-11 19:00 UTC (permalink / raw)
  To: lizefan; +Cc: cgroups, hannes, mhocko, linux-mm, linux-kernel

Hello,

This is v2 of atomic multi-process migration patchset.  This one
slipped through crack somehow.  Changes from the last take[L] are.

* 0002-memcg-restructure-mem_cgroup_can_attach.patch already in
  upstream.

* 0003-memcg-immigrate-charges-only-when-a-threadgroup-lead.patch
  dropped and
  0004-cgroup-memcg-cpuset-implement-cgroup_taskset_for_eac.patch
  updated accordingly.

* Li's acks added and patchset refreshed.

When a controller is enabled or disabled on the unified hierarchy, the
effective css changes for all processes in the sub-hierarchy which
virtually is multi-process migration.  This is implemented in
cgroup_update_dfl_csses() as process-by-process migration - all the
target source css_sets are first chained to the target list and
processes are drained from them one-by-one.

If a process gets rejected by a controller after some are successfully
migrated, the recovery action is tricky.  The changes which have
happened upto this point have to be rolled back but there's nothing
guaranteeing such rollback would be successful either.

The unified hierarchy didn't need to deal with this issue because
organizational operations were expected to always succeed;
unfortunately, it turned out that such policy doesn't work too well
for certain type of resources and unified hierarchy would need to
allow migration failures for some restrictied cases.

This patch updates multi-process migration in
cgroup_update_dfl_csses() atomic so that ->can_attach() can fail the
whole transaction.  It's consisted of the following seven patches.

 0001-cpuset-migrate-memory-only-for-threadgroup-leaders.patch
 0002-cgroup-memcg-cpuset-implement-cgroup_taskset_for_eac.patch
 0003-reorder-cgroup_migrate-s-parameters.patch
 0004-cgroup-separate-out-taskset-operations-from-cgroup_m.patch
 0005-cgroup-make-cgroup_update_dfl_csses-migrate-all-targ.patch

0001-0002 prepare cpuset and memcg.  Note that 0001 causes behavioral
changes in that mm is now always tied to the threadgroup leader.
Avoiding this change isn't too difficult but both the code and
behavior are saner this way and the change is unlikely to cause
breakage.

0003-0005 prepare and implement atomic multi-process migration.

This patchset is on top of 64d1def7d338 ("Merge tag
'sound-fix-4.3-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound").

and available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-multi-process-migration

diffstat follows.  Thanks.

 include/linux/cgroup.h |   22 +++
 kernel/cgroup.c        |  278 ++++++++++++++++++++++++-------------------------
 kernel/cpuset.c        |   41 +++----
 mm/memcontrol.c        |   17 ++
 4 files changed, 198 insertions(+), 160 deletions(-)

--
tejun

[L] http://lkml.kernel.org/g/1431978595-12176-1-git-send-email-tj@kernel.org

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/5] cpuset: migrate memory only for threadgroup leaders
  2015-09-11 19:00 [PATCHSET v2 cgroup/for-4.4] cgroup: make multi-process migration atomic Tejun Heo
@ 2015-09-11 19:00 ` Tejun Heo
  2015-09-11 19:00 ` [PATCH 2/5] cgroup, memcg, cpuset: implement cgroup_taskset_for_each_leader() Tejun Heo
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Tejun Heo @ 2015-09-11 19:00 UTC (permalink / raw)
  To: lizefan; +Cc: cgroups, hannes, mhocko, linux-mm, linux-kernel, Tejun Heo

If memory_migrate flag is set, cpuset migrates memory according to the
destnation css's nodemask.  The current implementation migrates memory
whenever any thread of a process is migrated making the behavior
somewhat arbitrary.  Let's tie memory operations to the threadgroup
leader so that memory is migrated only when the leader is migrated.

While this is a behavior change, given the inherent fuziness, this
change is not too likely to be noticed and allows us to clearly define
who owns the memory (always the leader) and helps the planned atomic
multi-process migration.

Note that we're currently migrating memory in migration path proper
while holding all the locks.  In the long term, this should be moved
out to an async work item.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
---
 kernel/cpuset.c | 40 ++++++++++++++++++++++------------------
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index f0acff0..09393f6 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1484,7 +1484,6 @@ static void cpuset_attach(struct cgroup_subsys_state *css,
 {
 	/* static buf protected by cpuset_mutex */
 	static nodemask_t cpuset_attach_nodemask_to;
-	struct mm_struct *mm;
 	struct task_struct *task;
 	struct task_struct *leader = cgroup_taskset_first(tset);
 	struct cpuset *cs = css_cs(css);
@@ -1512,26 +1511,31 @@ static void cpuset_attach(struct cgroup_subsys_state *css,
 	}
 
 	/*
-	 * Change mm, possibly for multiple threads in a threadgroup. This is
-	 * expensive and may sleep.
+	 * Change mm, possibly for multiple threads in a threadgroup. This
+	 * is expensive and may sleep and should be moved outside migration
+	 * path proper.
 	 */
 	cpuset_attach_nodemask_to = cs->effective_mems;
-	mm = get_task_mm(leader);
-	if (mm) {
-		mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
-
-		/*
-		 * old_mems_allowed is the same with mems_allowed here, except
-		 * if this task is being moved automatically due to hotplug.
-		 * In that case @mems_allowed has been updated and is empty,
-		 * so @old_mems_allowed is the right nodesets that we migrate
-		 * mm from.
-		 */
-		if (is_memory_migrate(cs)) {
-			cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
-					  &cpuset_attach_nodemask_to);
+	if (thread_group_leader(leader)) {
+		struct mm_struct *mm = get_task_mm(leader);
+
+		if (mm) {
+			mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
+
+			/*
+			 * old_mems_allowed is the same with mems_allowed
+			 * here, except if this task is being moved
+			 * automatically due to hotplug.  In that case
+			 * @mems_allowed has been updated and is empty, so
+			 * @old_mems_allowed is the right nodesets that we
+			 * migrate mm from.
+			 */
+			if (is_memory_migrate(cs)) {
+				cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
+						  &cpuset_attach_nodemask_to);
+			}
+			mmput(mm);
 		}
-		mmput(mm);
 	}
 
 	cs->old_mems_allowed = cpuset_attach_nodemask_to;
-- 
2.4.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/5] cgroup, memcg, cpuset: implement cgroup_taskset_for_each_leader()
  2015-09-11 19:00 [PATCHSET v2 cgroup/for-4.4] cgroup: make multi-process migration atomic Tejun Heo
  2015-09-11 19:00 ` [PATCH 1/5] cpuset: migrate memory only for threadgroup leaders Tejun Heo
@ 2015-09-11 19:00 ` Tejun Heo
  2015-09-14 20:49   ` Tejun Heo
  2015-09-11 19:00 ` [PATCH 3/5] reorder cgroup_migrate()'s parameters Tejun Heo
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 10+ messages in thread
From: Tejun Heo @ 2015-09-11 19:00 UTC (permalink / raw)
  To: lizefan; +Cc: cgroups, hannes, mhocko, linux-mm, linux-kernel, Tejun Heo

It wasn't explicitly documented but, when a process is being migrated,
cpuset and memcg depend on cgroup_taskset_first() returning the
threadgroup leader; however, this approach is somewhat ghetto and
would no longer work for the planned multi-process migration.

This patch introduces explicit cgroup_taskset_for_each_leader() which
iterates over only the threadgroup leaders and replaces
cgroup_taskset_first() usages for accessing the leader with it.

This prepares both memcg and cpuset for multi-process migration.  This
patch also updates the documentation for cgroup_taskset_for_each() to
clarify the iteration rules and removes comments mentioning task
ordering in tasksets.

v2: A previous patch which added threadgroup leader test was dropped.
    Patch updated accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
---
 include/linux/cgroup.h | 22 ++++++++++++++++++++++
 kernel/cgroup.c        | 11 -----------
 kernel/cpuset.c        |  9 ++++-----
 mm/memcontrol.c        | 17 +++++++++++++++--
 4 files changed, 41 insertions(+), 18 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index eb7ca55..916a1e0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -211,11 +211,33 @@ void css_task_iter_end(struct css_task_iter *it);
  * cgroup_taskset_for_each - iterate cgroup_taskset
  * @task: the loop cursor
  * @tset: taskset to iterate
+ *
+ * @tset may contain multiple tasks and they may belong to multiple
+ * processes.  When there are multiple tasks in @tset, if a task of a
+ * process is in @tset, all tasks of the process are in @tset.  Also, all
+ * are guaranteed to share the same source and destination csses.
+ *
+ * Iteration is not in any specific order.
  */
 #define cgroup_taskset_for_each(task, tset)				\
 	for ((task) = cgroup_taskset_first((tset)); (task);		\
 	     (task) = cgroup_taskset_next((tset)))
 
+/**
+ * cgroup_taskset_for_each_leader - iterate group leaders in a cgroup_taskset
+ * @leader: the loop cursor
+ * @tset: takset to iterate
+ *
+ * Iterate threadgroup leaders of @tset.  For single-task migrations, @tset
+ * may not contain any.
+ */
+#define cgroup_taskset_for_each_leader(leader, tset)			\
+	for ((leader) = cgroup_taskset_first((tset)); (leader);		\
+	     (leader) = cgroup_taskset_next((tset)))			\
+		if ((leader) != (leader)->group_leader)			\
+			;						\
+		else
+
 /*
  * Inline functions.
  */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2cf0f79..0b732dd 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2083,13 +2083,6 @@ static void cgroup_task_migrate(struct cgroup *old_cgrp,
 
 	get_css_set(new_cset);
 	rcu_assign_pointer(tsk->cgroups, new_cset);
-
-	/*
-	 * Use move_tail so that cgroup_taskset_first() still returns the
-	 * leader after migration.  This works because cgroup_migrate()
-	 * ensures that the dst_cset of the leader is the first on the
-	 * tset's dst_csets list.
-	 */
 	list_move_tail(&tsk->cg_list, &new_cset->mg_tasks);
 
 	/*
@@ -2285,10 +2278,6 @@ static int cgroup_migrate(struct cgroup *cgrp, struct task_struct *leader,
 		if (!cset->mg_src_cgrp)
 			goto next;
 
-		/*
-		 * cgroup_taskset_first() must always return the leader.
-		 * Take care to avoid disturbing the ordering.
-		 */
 		list_move_tail(&task->cg_list, &cset->mg_tasks);
 		if (list_empty(&cset->mg_node))
 			list_add_tail(&cset->mg_node, &tset.src_csets);
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 09393f6..e7afde6 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1485,7 +1485,7 @@ static void cpuset_attach(struct cgroup_subsys_state *css,
 	/* static buf protected by cpuset_mutex */
 	static nodemask_t cpuset_attach_nodemask_to;
 	struct task_struct *task;
-	struct task_struct *leader = cgroup_taskset_first(tset);
+	struct task_struct *leader;
 	struct cpuset *cs = css_cs(css);
 	struct cpuset *oldcs = cpuset_attach_old_cs;
 
@@ -1511,12 +1511,11 @@ static void cpuset_attach(struct cgroup_subsys_state *css,
 	}
 
 	/*
-	 * Change mm, possibly for multiple threads in a threadgroup. This
-	 * is expensive and may sleep and should be moved outside migration
-	 * path proper.
+	 * Change mm for all threadgroup leaders. This is expensive and may
+	 * sleep and should be moved outside migration path proper.
 	 */
 	cpuset_attach_nodemask_to = cs->effective_mems;
-	if (thread_group_leader(leader)) {
+	cgroup_taskset_for_each_leader(leader, tset) {
 		struct mm_struct *mm = get_task_mm(leader);
 
 		if (mm) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ddaeba..32b6bfd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4829,7 +4829,7 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 	struct mem_cgroup *from;
-	struct task_struct *p;
+	struct task_struct *leader, *p;
 	struct mm_struct *mm;
 	unsigned long move_flags;
 	int ret = 0;
@@ -4843,7 +4843,20 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
 	if (!move_flags)
 		return 0;
 
-	p = cgroup_taskset_first(tset);
+	/*
+	 * Multi-process migrations only happen on the default hierarchy
+	 * where charge immigration is not used.  Perform charge
+	 * immigration if @tset contains a leader and whine if there are
+	 * multiple.
+	 */
+	p = NULL;
+	cgroup_taskset_for_each_leader(leader, tset) {
+		WARN_ON_ONCE(p);
+		p = leader;
+	}
+	if (!p)
+		return 0;
+
 	from = mem_cgroup_from_task(p);
 
 	VM_BUG_ON(from == memcg);
-- 
2.4.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 3/5] reorder cgroup_migrate()'s parameters
  2015-09-11 19:00 [PATCHSET v2 cgroup/for-4.4] cgroup: make multi-process migration atomic Tejun Heo
  2015-09-11 19:00 ` [PATCH 1/5] cpuset: migrate memory only for threadgroup leaders Tejun Heo
  2015-09-11 19:00 ` [PATCH 2/5] cgroup, memcg, cpuset: implement cgroup_taskset_for_each_leader() Tejun Heo
@ 2015-09-11 19:00 ` Tejun Heo
  2015-09-11 19:00 ` [PATCH 4/5] cgroup: separate out taskset operations from cgroup_migrate() Tejun Heo
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Tejun Heo @ 2015-09-11 19:00 UTC (permalink / raw)
  To: lizefan; +Cc: cgroups, hannes, mhocko, linux-mm, linux-kernel, Tejun Heo

cgroup_migrate() has the destination cgroup as the first parameter
while cgroup_task_migrate() has the destination cset as the last.
Another migration function is scheduled to be added which can make the
discrepancy further stand out.  Let's reorder cgroup_migrate()'s
parameters so that the destination cgroup is the last.

This doesn't cause any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
---
 kernel/cgroup.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0b732dd..1f0c189 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2228,9 +2228,9 @@ static int cgroup_migrate_prepare_dst(struct cgroup *dst_cgrp,
 
 /**
  * cgroup_migrate - migrate a process or task to a cgroup
- * @cgrp: the destination cgroup
  * @leader: the leader of the process or the task to migrate
  * @threadgroup: whether @leader points to the whole process or a single task
+ * @cgrp: the destination cgroup
  *
  * Migrate a process or task denoted by @leader to @cgrp.  If migrating a
  * process, the caller must be holding cgroup_threadgroup_rwsem.  The
@@ -2244,8 +2244,8 @@ static int cgroup_migrate_prepare_dst(struct cgroup *dst_cgrp,
  * decided for all targets by invoking group_migrate_prepare_dst() before
  * actually starting migrating.
  */
-static int cgroup_migrate(struct cgroup *cgrp, struct task_struct *leader,
-			  bool threadgroup)
+static int cgroup_migrate(struct task_struct *leader, bool threadgroup,
+			  struct cgroup *cgrp)
 {
 	struct cgroup_taskset tset = {
 		.src_csets	= LIST_HEAD_INIT(tset.src_csets),
@@ -2382,7 +2382,7 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
 	/* prepare dst csets and commit */
 	ret = cgroup_migrate_prepare_dst(dst_cgrp, &preloaded_csets);
 	if (!ret)
-		ret = cgroup_migrate(dst_cgrp, leader, threadgroup);
+		ret = cgroup_migrate(leader, threadgroup, dst_cgrp);
 
 	cgroup_migrate_finish(&preloaded_csets);
 	return ret;
@@ -2689,7 +2689,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 				goto out_finish;
 			last_task = task;
 
-			ret = cgroup_migrate(src_cset->dfl_cgrp, task, true);
+			ret = cgroup_migrate(task, true, src_cset->dfl_cgrp);
 
 			put_task_struct(task);
 
@@ -3760,7 +3760,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 		css_task_iter_end(&it);
 
 		if (task) {
-			ret = cgroup_migrate(to, task, false);
+			ret = cgroup_migrate(task, false, to);
 			put_task_struct(task);
 		}
 	} while (task && !ret);
-- 
2.4.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 4/5] cgroup: separate out taskset operations from cgroup_migrate()
  2015-09-11 19:00 [PATCHSET v2 cgroup/for-4.4] cgroup: make multi-process migration atomic Tejun Heo
                   ` (2 preceding siblings ...)
  2015-09-11 19:00 ` [PATCH 3/5] reorder cgroup_migrate()'s parameters Tejun Heo
@ 2015-09-11 19:00 ` Tejun Heo
  2015-09-11 19:00 ` [PATCH 5/5] cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically Tejun Heo
  2015-09-22 16:48 ` [PATCHSET v2 cgroup/for-4.4] cgroup: make multi-process migration atomic Tejun Heo
  5 siblings, 0 replies; 10+ messages in thread
From: Tejun Heo @ 2015-09-11 19:00 UTC (permalink / raw)
  To: lizefan; +Cc: cgroups, hannes, mhocko, linux-mm, linux-kernel, Tejun Heo

Currently, cgroup_migreate() implements large part of the migration
logic inline including building the target taskset and actually
migrating them.  This patch separates out the following taskset
operations.

 CGROUP_TASKSET_INIT()		: taskset initializer
 cgroup_taskset_add()		: add a task to a taskset
 cgroup_taskset_migrate()	: migrate a taskset to the destination cgroup

This will be used to implement atomic multi-process migration in
cgroup_update_dfl_csses().  This is pure reorganization which doesn't
introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
---
 kernel/cgroup.c | 211 +++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 125 insertions(+), 86 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1f0c189..fa63b92 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2010,6 +2010,49 @@ struct cgroup_taskset {
 	struct task_struct	*cur_task;
 };
 
+#define CGROUP_TASKSET_INIT(tset)	(struct cgroup_taskset){	\
+	.src_csets		= LIST_HEAD_INIT(tset.src_csets),	\
+	.dst_csets		= LIST_HEAD_INIT(tset.dst_csets),	\
+	.csets			= &tset.src_csets,			\
+}
+
+/**
+ * cgroup_taskset_add - try to add a migration target task to a taskset
+ * @task: target task
+ * @tset: target taskset
+ *
+ * Add @task, which is a migration target, to @tset.  This function becomes
+ * noop if @task doesn't need to be migrated.  @task's css_set should have
+ * been added as a migration source and @task->cg_list will be moved from
+ * the css_set's tasks list to mg_tasks one.
+ */
+static void cgroup_taskset_add(struct task_struct *task,
+			       struct cgroup_taskset *tset)
+{
+	struct css_set *cset;
+
+	lockdep_assert_held(&css_set_rwsem);
+
+	/* @task either already exited or can't exit until the end */
+	if (task->flags & PF_EXITING)
+		return;
+
+	/* leave @task alone if post_fork() hasn't linked it yet */
+	if (list_empty(&task->cg_list))
+		return;
+
+	cset = task_css_set(task);
+	if (!cset->mg_src_cgrp)
+		return;
+
+	list_move_tail(&task->cg_list, &cset->mg_tasks);
+	if (list_empty(&cset->mg_node))
+		list_add_tail(&cset->mg_node, &tset->src_csets);
+	if (list_empty(&cset->mg_dst_cset->mg_node))
+		list_move_tail(&cset->mg_dst_cset->mg_node,
+			       &tset->dst_csets);
+}
+
 /**
  * cgroup_taskset_first - reset taskset and return the first task
  * @tset: taskset of interest
@@ -2094,6 +2137,84 @@ static void cgroup_task_migrate(struct cgroup *old_cgrp,
 }
 
 /**
+ * cgroup_taskset_migrate - migrate a taskset to a cgroup
+ * @tset: taget taskset
+ * @dst_cgrp: destination cgroup
+ *
+ * Migrate tasks in @tset to @dst_cgrp.  This function fails iff one of the
+ * ->can_attach callbacks fails and guarantees that either all or none of
+ * the tasks in @tset are migrated.  @tset is consumed regardless of
+ * success.
+ */
+static int cgroup_taskset_migrate(struct cgroup_taskset *tset,
+				  struct cgroup *dst_cgrp)
+{
+	struct cgroup_subsys_state *css, *failed_css = NULL;
+	struct task_struct *task, *tmp_task;
+	struct css_set *cset, *tmp_cset;
+	int i, ret;
+
+	/* methods shouldn't be called if no task is actually migrating */
+	if (list_empty(&tset->src_csets))
+		return 0;
+
+	/* check that we can legitimately attach to the cgroup */
+	for_each_e_css(css, i, dst_cgrp) {
+		if (css->ss->can_attach) {
+			ret = css->ss->can_attach(css, tset);
+			if (ret) {
+				failed_css = css;
+				goto out_cancel_attach;
+			}
+		}
+	}
+
+	/*
+	 * Now that we're guaranteed success, proceed to move all tasks to
+	 * the new cgroup.  There are no failure cases after here, so this
+	 * is the commit point.
+	 */
+	down_write(&css_set_rwsem);
+	list_for_each_entry(cset, &tset->src_csets, mg_node) {
+		list_for_each_entry_safe(task, tmp_task, &cset->mg_tasks, cg_list)
+			cgroup_task_migrate(cset->mg_src_cgrp, task,
+					    cset->mg_dst_cset);
+	}
+	up_write(&css_set_rwsem);
+
+	/*
+	 * Migration is committed, all target tasks are now on dst_csets.
+	 * Nothing is sensitive to fork() after this point.  Notify
+	 * controllers that migration is complete.
+	 */
+	tset->csets = &tset->dst_csets;
+
+	for_each_e_css(css, i, dst_cgrp)
+		if (css->ss->attach)
+			css->ss->attach(css, tset);
+
+	ret = 0;
+	goto out_release_tset;
+
+out_cancel_attach:
+	for_each_e_css(css, i, dst_cgrp) {
+		if (css == failed_css)
+			break;
+		if (css->ss->cancel_attach)
+			css->ss->cancel_attach(css, tset);
+	}
+out_release_tset:
+	down_write(&css_set_rwsem);
+	list_splice_init(&tset->dst_csets, &tset->src_csets);
+	list_for_each_entry_safe(cset, tmp_cset, &tset->src_csets, mg_node) {
+		list_splice_tail_init(&cset->mg_tasks, &cset->tasks);
+		list_del_init(&cset->mg_node);
+	}
+	up_write(&css_set_rwsem);
+	return ret;
+}
+
+/**
  * cgroup_migrate_finish - cleanup after attach
  * @preloaded_csets: list of preloaded css_sets
  *
@@ -2247,15 +2368,8 @@ static int cgroup_migrate_prepare_dst(struct cgroup *dst_cgrp,
 static int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 			  struct cgroup *cgrp)
 {
-	struct cgroup_taskset tset = {
-		.src_csets	= LIST_HEAD_INIT(tset.src_csets),
-		.dst_csets	= LIST_HEAD_INIT(tset.dst_csets),
-		.csets		= &tset.src_csets,
-	};
-	struct cgroup_subsys_state *css, *failed_css = NULL;
-	struct css_set *cset, *tmp_cset;
-	struct task_struct *task, *tmp_task;
-	int i, ret;
+	struct cgroup_taskset tset = CGROUP_TASKSET_INIT(tset);
+	struct task_struct *task;
 
 	/*
 	 * Prevent freeing of tasks while we take a snapshot. Tasks that are
@@ -2266,89 +2380,14 @@ static int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 	rcu_read_lock();
 	task = leader;
 	do {
-		/* @task either already exited or can't exit until the end */
-		if (task->flags & PF_EXITING)
-			goto next;
-
-		/* leave @task alone if post_fork() hasn't linked it yet */
-		if (list_empty(&task->cg_list))
-			goto next;
-
-		cset = task_css_set(task);
-		if (!cset->mg_src_cgrp)
-			goto next;
-
-		list_move_tail(&task->cg_list, &cset->mg_tasks);
-		if (list_empty(&cset->mg_node))
-			list_add_tail(&cset->mg_node, &tset.src_csets);
-		if (list_empty(&cset->mg_dst_cset->mg_node))
-			list_move_tail(&cset->mg_dst_cset->mg_node,
-				       &tset.dst_csets);
-	next:
+		cgroup_taskset_add(task, &tset);
 		if (!threadgroup)
 			break;
 	} while_each_thread(leader, task);
 	rcu_read_unlock();
 	up_write(&css_set_rwsem);
 
-	/* methods shouldn't be called if no task is actually migrating */
-	if (list_empty(&tset.src_csets))
-		return 0;
-
-	/* check that we can legitimately attach to the cgroup */
-	for_each_e_css(css, i, cgrp) {
-		if (css->ss->can_attach) {
-			ret = css->ss->can_attach(css, &tset);
-			if (ret) {
-				failed_css = css;
-				goto out_cancel_attach;
-			}
-		}
-	}
-
-	/*
-	 * Now that we're guaranteed success, proceed to move all tasks to
-	 * the new cgroup.  There are no failure cases after here, so this
-	 * is the commit point.
-	 */
-	down_write(&css_set_rwsem);
-	list_for_each_entry(cset, &tset.src_csets, mg_node) {
-		list_for_each_entry_safe(task, tmp_task, &cset->mg_tasks, cg_list)
-			cgroup_task_migrate(cset->mg_src_cgrp, task,
-					    cset->mg_dst_cset);
-	}
-	up_write(&css_set_rwsem);
-
-	/*
-	 * Migration is committed, all target tasks are now on dst_csets.
-	 * Nothing is sensitive to fork() after this point.  Notify
-	 * controllers that migration is complete.
-	 */
-	tset.csets = &tset.dst_csets;
-
-	for_each_e_css(css, i, cgrp)
-		if (css->ss->attach)
-			css->ss->attach(css, &tset);
-
-	ret = 0;
-	goto out_release_tset;
-
-out_cancel_attach:
-	for_each_e_css(css, i, cgrp) {
-		if (css == failed_css)
-			break;
-		if (css->ss->cancel_attach)
-			css->ss->cancel_attach(css, &tset);
-	}
-out_release_tset:
-	down_write(&css_set_rwsem);
-	list_splice_init(&tset.dst_csets, &tset.src_csets);
-	list_for_each_entry_safe(cset, tmp_cset, &tset.src_csets, mg_node) {
-		list_splice_tail_init(&cset->mg_tasks, &cset->tasks);
-		list_del_init(&cset->mg_node);
-	}
-	up_write(&css_set_rwsem);
-	return ret;
+	return cgroup_taskset_migrate(&tset, cgrp);
 }
 
 /**
-- 
2.4.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 5/5] cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically
  2015-09-11 19:00 [PATCHSET v2 cgroup/for-4.4] cgroup: make multi-process migration atomic Tejun Heo
                   ` (3 preceding siblings ...)
  2015-09-11 19:00 ` [PATCH 4/5] cgroup: separate out taskset operations from cgroup_migrate() Tejun Heo
@ 2015-09-11 19:00 ` Tejun Heo
  2015-09-22 16:48 ` [PATCHSET v2 cgroup/for-4.4] cgroup: make multi-process migration atomic Tejun Heo
  5 siblings, 0 replies; 10+ messages in thread
From: Tejun Heo @ 2015-09-11 19:00 UTC (permalink / raw)
  To: lizefan; +Cc: cgroups, hannes, mhocko, linux-mm, linux-kernel, Tejun Heo

cgroup_update_dfl_csses() is responsible for migrating processes when
controllers are enabled or disabled on the default hierarchy.  As the
css association changes for all the processes in the affected cgroups,
this involves migrating multiple processes.

Up until now, it was implemented by migrating process-by-process until
the source css_sets are empty; however, this means that if a process
fails to migrate after some succeed before it, the recovery is very
tricky.  This was considered okay as subsystems weren't allowed to
reject process migration on the default hierarchy; unfortunately,
enforcing this policy turned out to be problematic for certain types
of resources - realtime slices for now.

As such, the default hierarchy is gonna allow restricted failures
during migration and to support that this patch makes
cgroup_update_dfl_csses() migrate all target processes atomically
rather than one-by-one.  The preceding patches made subsystems ready
for multi-process migration and factored out taskset operations making
this almost trivial.  All tasks of the target processes are put in the
same taskset and the migration operations are performed once which
either fails or succeeds for all.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
---
 kernel/cgroup.c | 44 ++++++++------------------------------------
 1 file changed, 8 insertions(+), 36 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index fa63b92..cb07a2c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2665,6 +2665,7 @@ static int cgroup_subtree_control_show(struct seq_file *seq, void *v)
 static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 {
 	LIST_HEAD(preloaded_csets);
+	struct cgroup_taskset tset = CGROUP_TASKSET_INIT(tset);
 	struct cgroup_subsys_state *css;
 	struct css_set *src_cset;
 	int ret;
@@ -2693,50 +2694,21 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	if (ret)
 		goto out_finish;
 
+	down_write(&css_set_rwsem);
 	list_for_each_entry(src_cset, &preloaded_csets, mg_preload_node) {
-		struct task_struct *last_task = NULL, *task;
+		struct task_struct *task, *ntask;
 
 		/* src_csets precede dst_csets, break on the first dst_cset */
 		if (!src_cset->mg_src_cgrp)
 			break;
 
-		/*
-		 * All tasks in src_cset need to be migrated to the
-		 * matching dst_cset.  Empty it process by process.  We
-		 * walk tasks but migrate processes.  The leader might even
-		 * belong to a different cset but such src_cset would also
-		 * be among the target src_csets because the default
-		 * hierarchy enforces per-process membership.
-		 */
-		while (true) {
-			down_read(&css_set_rwsem);
-			task = list_first_entry_or_null(&src_cset->tasks,
-						struct task_struct, cg_list);
-			if (task) {
-				task = task->group_leader;
-				WARN_ON_ONCE(!task_css_set(task)->mg_src_cgrp);
-				get_task_struct(task);
-			}
-			up_read(&css_set_rwsem);
-
-			if (!task)
-				break;
-
-			/* guard against possible infinite loop */
-			if (WARN(last_task == task,
-				 "cgroup: update_dfl_csses failed to make progress, aborting in inconsistent state\n"))
-				goto out_finish;
-			last_task = task;
-
-			ret = cgroup_migrate(task, true, src_cset->dfl_cgrp);
-
-			put_task_struct(task);
-
-			if (WARN(ret, "cgroup: failed to update controllers for the default hierarchy (%d), further operations may crash or hang\n", ret))
-				goto out_finish;
-		}
+		/* all tasks in src_csets need to be migrated */
+		list_for_each_entry_safe(task, ntask, &src_cset->tasks, cg_list)
+			cgroup_taskset_add(task, &tset);
 	}
+	up_write(&css_set_rwsem);
 
+	ret = cgroup_taskset_migrate(&tset, cgrp);
 out_finish:
 	cgroup_migrate_finish(&preloaded_csets);
 	percpu_up_write(&cgroup_threadgroup_rwsem);
-- 
2.4.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/5] cgroup, memcg, cpuset: implement cgroup_taskset_for_each_leader()
  2015-09-11 19:00 ` [PATCH 2/5] cgroup, memcg, cpuset: implement cgroup_taskset_for_each_leader() Tejun Heo
@ 2015-09-14 20:49   ` Tejun Heo
  2015-09-18 16:04     ` Tejun Heo
  0 siblings, 1 reply; 10+ messages in thread
From: Tejun Heo @ 2015-09-14 20:49 UTC (permalink / raw)
  To: lizefan; +Cc: cgroups, hannes, mhocko, linux-mm, linux-kernel

On Fri, Sep 11, 2015 at 03:00:19PM -0400, Tejun Heo wrote:
> It wasn't explicitly documented but, when a process is being migrated,
> cpuset and memcg depend on cgroup_taskset_first() returning the
> threadgroup leader; however, this approach is somewhat ghetto and
> would no longer work for the planned multi-process migration.
> 
> This patch introduces explicit cgroup_taskset_for_each_leader() which
> iterates over only the threadgroup leaders and replaces
> cgroup_taskset_first() usages for accessing the leader with it.
> 
> This prepares both memcg and cpuset for multi-process migration.  This
> patch also updates the documentation for cgroup_taskset_for_each() to
> clarify the iteration rules and removes comments mentioning task
> ordering in tasksets.
> 
> v2: A previous patch which added threadgroup leader test was dropped.
>     Patch updated accordingly.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Acked-by: Zefan Li <lizefan@huawei.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>

Michal, if you're okay with this patch, I'll apply the patchset in
cgroup/for-4.4.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/5] cgroup, memcg, cpuset: implement cgroup_taskset_for_each_leader()
  2015-09-14 20:49   ` Tejun Heo
@ 2015-09-18 16:04     ` Tejun Heo
  2015-09-22 15:49       ` Michal Hocko
  0 siblings, 1 reply; 10+ messages in thread
From: Tejun Heo @ 2015-09-18 16:04 UTC (permalink / raw)
  To: lizefan; +Cc: cgroups, hannes, mhocko, linux-mm, linux-kernel

On Mon, Sep 14, 2015 at 04:49:07PM -0400, Tejun Heo wrote:
> Michal, if you're okay with this patch, I'll apply the patchset in
> cgroup/for-4.4.

Michal?

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/5] cgroup, memcg, cpuset: implement cgroup_taskset_for_each_leader()
  2015-09-18 16:04     ` Tejun Heo
@ 2015-09-22 15:49       ` Michal Hocko
  0 siblings, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2015-09-22 15:49 UTC (permalink / raw)
  To: Tejun Heo; +Cc: lizefan, cgroups, hannes, linux-mm, linux-kernel

On Fri 18-09-15 12:04:17, Tejun Heo wrote:
> On Mon, Sep 14, 2015 at 04:49:07PM -0400, Tejun Heo wrote:
> > Michal, if you're okay with this patch, I'll apply the patchset in
> > cgroup/for-4.4.
> 
> Michal?

I am sorry but I wasn't online very much last week. The old and new code
are similarly hackish so I am OK with it. Ideally we should be able to
migrate all the tasks for both the legacy and the default hierarchies
this would require more changes in this area though.

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCHSET v2 cgroup/for-4.4] cgroup: make multi-process migration atomic
  2015-09-11 19:00 [PATCHSET v2 cgroup/for-4.4] cgroup: make multi-process migration atomic Tejun Heo
                   ` (4 preceding siblings ...)
  2015-09-11 19:00 ` [PATCH 5/5] cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically Tejun Heo
@ 2015-09-22 16:48 ` Tejun Heo
  5 siblings, 0 replies; 10+ messages in thread
From: Tejun Heo @ 2015-09-22 16:48 UTC (permalink / raw)
  To: lizefan; +Cc: cgroups, hannes, mhocko, linux-mm, linux-kernel

On Fri, Sep 11, 2015 at 03:00:17PM -0400, Tejun Heo wrote:
> This is v2 of atomic multi-process migration patchset.  This one
> slipped through crack somehow.  Changes from the last take[L] are.

Applied to cgroup/for-4.4.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2015-09-22 16:48 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-11 19:00 [PATCHSET v2 cgroup/for-4.4] cgroup: make multi-process migration atomic Tejun Heo
2015-09-11 19:00 ` [PATCH 1/5] cpuset: migrate memory only for threadgroup leaders Tejun Heo
2015-09-11 19:00 ` [PATCH 2/5] cgroup, memcg, cpuset: implement cgroup_taskset_for_each_leader() Tejun Heo
2015-09-14 20:49   ` Tejun Heo
2015-09-18 16:04     ` Tejun Heo
2015-09-22 15:49       ` Michal Hocko
2015-09-11 19:00 ` [PATCH 3/5] reorder cgroup_migrate()'s parameters Tejun Heo
2015-09-11 19:00 ` [PATCH 4/5] cgroup: separate out taskset operations from cgroup_migrate() Tejun Heo
2015-09-11 19:00 ` [PATCH 5/5] cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically Tejun Heo
2015-09-22 16:48 ` [PATCHSET v2 cgroup/for-4.4] cgroup: make multi-process migration atomic Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).