* [PATCHSET cgroup/for-4.2] cgroup: make multi-process migration atomic
@ 2015-05-18 19:49 ` Tejun Heo
  0 siblings, 0 replies; 59+ messages in thread
From: Tejun Heo @ 2015-05-18 19:49 UTC (permalink / raw)
  To: lizefan; +Cc: cgroups, hannes, mhocko, linux-mm

Hello,

When a controller is enabled or disabled on the unified hierarchy, the
effective css changes for all processes in the sub-hierarchy, which is
in effect a multi-process migration.  This is implemented in
cgroup_update_dfl_csses() as process-by-process migration - all the
target source css_sets are first chained to the target list and then
processes are drained from them one-by-one.

If a process gets rejected by a controller after some have already
been migrated successfully, the recovery action is tricky.  The
changes which have happened up to this point have to be rolled back,
but there's nothing guaranteeing that such a rollback would be
successful either.
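To make the hazard concrete, here is a minimal userspace sketch (all
names hypothetical, not the actual kernel code) of process-by-process
migration: a controller rejecting one process mid-stream leaves the
earlier ones already moved, which is exactly the partial state that
would need rolling back.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for process-by-process migration.  Each
 * "process" is moved individually; if the controller rejects one
 * mid-stream, the earlier moves have already been committed.
 */
enum { NR_PROCS = 4 };

static int proc_css[NR_PROCS];		/* effective css per process */

/* controller hook: rejects process 2, e.g. due to a resource limit */
static bool can_attach(int proc)
{
	return proc != 2;
}

/* returns the number of processes migrated before the first failure */
static int migrate_one_by_one(int new_css)
{
	int n;

	for (n = 0; n < NR_PROCS; n++) {
		if (!can_attach(n))
			break;		/* partial state left behind */
		proc_css[n] = new_css;
	}
	return n;
}
```

After migrate_one_by_one(1), processes 0 and 1 have the new css while
2 and 3 keep the old one - the inconsistent intermediate state the
cover letter describes.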

The unified hierarchy didn't need to deal with this issue because
organizational operations were expected to always succeed;
unfortunately, it turned out that such a policy doesn't work too well
for certain types of resources, and the unified hierarchy will need
to allow migration failures in some restricted cases.

This patchset makes the multi-process migration in
cgroup_update_dfl_csses() atomic so that ->can_attach() can fail the
whole transaction.  It consists of the following seven patches.

 0001-cpuset-migrate-memory-only-for-threadgroup-leaders.patch
 0002-memcg-restructure-mem_cgroup_can_attach.patch
 0003-memcg-immigrate-charges-only-when-a-threadgroup-lead.patch
 0004-cgroup-memcg-cpuset-implement-cgroup_taskset_for_eac.patch
 0005-reorder-cgroup_migrate-s-parameters.patch
 0006-cgroup-separate-out-taskset-operations-from-cgroup_m.patch
 0007-cgroup-make-cgroup_update_dfl_csses-migrate-all-targ.patch

0001-0004 prepare cpuset and memcg.  Note that 0001 and 0003 do cause
behavioral changes in that the mm is now always tied to the
threadgroup leader.  Avoiding this change isn't too difficult, but
both the code and behavior are saner this way, and I don't think the
change is likely to cause breakage.

0005-0007 prepare and implement atomic multi-process migration.
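The all-or-nothing idea behind 0005-0007 can be sketched in userspace
as a two-pass scheme (hypothetical names; the real work happens in
cgroup_migrate() and the taskset helpers): run every
->can_attach()-style check first, and only touch state in a commit
pass that cannot fail.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum { NR_PROCS = 4 };

static int proc_css[NR_PROCS];	/* effective css per process */

/* controller hook: rejects process 2 */
static bool can_attach(int proc)
{
	return proc != 2;
}

/* all-or-nothing: check every process before committing any */
static int migrate_atomically(int new_css)
{
	int n;

	for (n = 0; n < NR_PROCS; n++)
		if (!can_attach(n))
			return -EPERM;	/* nothing committed yet */

	for (n = 0; n < NR_PROCS; n++)
		proc_css[n] = new_css;	/* commit pass cannot fail */
	return 0;
}
```

With a rejecting controller the whole transaction fails and no
process's css changes, so no rollback path is needed.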

This patchset is on top of the following patches.

 cgroup/for-4.2 d0f702e648dc ("cgroup: fix some comment typos")
 + [1] [PATCH] cgroup: separate out include/linux/cgroup-defs.h
 + [2] [PATCH] cgroup: reorganize include/linux/cgroup.h
 + [3] [PATCHSET] cgroup, sched: restructure threadgroup locking and replace it with a percpu_rwsem

and available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-multi-process-migration

diffstat follows.  Thanks.

 include/linux/cgroup.h |   22 +++
 kernel/cgroup.c        |  278 ++++++++++++++++++++++++-------------------------
 kernel/cpuset.c        |   41 +++----
 mm/memcontrol.c        |   74 +++++++------
 4 files changed, 228 insertions(+), 187 deletions(-)

--
tejun

[1] http://lkml.kernel.org/g/20150513193840.GC11388@htj.duckdns.org
[2] http://lkml.kernel.org/g/20150513202416.GE11388@htj.duckdns.org
[3] http://lkml.kernel.org/g/1431549318-16756-1-git-send-email-tj@kernel.org

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* [PATCH 1/7] cpuset: migrate memory only for threadgroup leaders
@ 2015-05-18 19:49   ` Tejun Heo
  0 siblings, 0 replies; 59+ messages in thread
From: Tejun Heo @ 2015-05-18 19:49 UTC (permalink / raw)
  To: lizefan; +Cc: cgroups, hannes, mhocko, linux-mm, Tejun Heo

If the memory_migrate flag is set, cpuset migrates memory according
to the destination css's nodemask.  The current implementation
migrates memory whenever any thread of a process is migrated, making
the behavior somewhat arbitrary.  Let's tie memory operations to the
threadgroup leader so that memory is migrated only when the leader is
migrated.

While this is a behavior change, given the inherent fuzziness it is
not too likely to be noticed.  It also allows us to clearly define
who owns the memory (always the leader) and helps the planned atomic
multi-process migration.

Note that we're currently migrating memory in the migration path
proper while holding all the locks.  In the long term, this should be
moved out to an async work item.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
---
 kernel/cpuset.c | 40 ++++++++++++++++++++++------------------
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index ee14e3a..43db5b7 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1484,7 +1484,6 @@ static void cpuset_attach(struct cgroup_subsys_state *css,
 {
 	/* static buf protected by cpuset_mutex */
 	static nodemask_t cpuset_attach_nodemask_to;
-	struct mm_struct *mm;
 	struct task_struct *task;
 	struct task_struct *leader = cgroup_taskset_first(tset);
 	struct cpuset *cs = css_cs(css);
@@ -1512,26 +1511,31 @@ static void cpuset_attach(struct cgroup_subsys_state *css,
 	}
 
 	/*
-	 * Change mm, possibly for multiple threads in a threadgroup. This is
-	 * expensive and may sleep.
+	 * Change mm, possibly for multiple threads in a threadgroup. This
+	 * is expensive and may sleep and should be moved outside migration
+	 * path proper.
 	 */
 	cpuset_attach_nodemask_to = cs->effective_mems;
-	mm = get_task_mm(leader);
-	if (mm) {
-		mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
-
-		/*
-		 * old_mems_allowed is the same with mems_allowed here, except
-		 * if this task is being moved automatically due to hotplug.
-		 * In that case @mems_allowed has been updated and is empty,
-		 * so @old_mems_allowed is the right nodesets that we migrate
-		 * mm from.
-		 */
-		if (is_memory_migrate(cs)) {
-			cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
-					  &cpuset_attach_nodemask_to);
+	if (thread_group_leader(leader)) {
+		struct mm_struct *mm = get_task_mm(leader);
+
+		if (mm) {
+			mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
+
+			/*
+			 * old_mems_allowed is the same with mems_allowed
+			 * here, except if this task is being moved
+			 * automatically due to hotplug.  In that case
+			 * @mems_allowed has been updated and is empty, so
+			 * @old_mems_allowed is the right nodesets that we
+			 * migrate mm from.
+			 */
+			if (is_memory_migrate(cs)) {
+				cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
+						  &cpuset_attach_nodemask_to);
+			}
+			mmput(mm);
 		}
-		mmput(mm);
 	}
 
 	cs->old_mems_allowed = cpuset_attach_nodemask_to;
-- 
2.4.0



* [PATCH 2/7] memcg: restructure mem_cgroup_can_attach()
@ 2015-05-18 19:49   ` Tejun Heo
  0 siblings, 0 replies; 59+ messages in thread
From: Tejun Heo @ 2015-05-18 19:49 UTC (permalink / raw)
  To: lizefan; +Cc: cgroups, hannes, mhocko, linux-mm, Tejun Heo

Restructure it to reduce the nesting level and to help the planned
threadgroup leader iteration changes.

This is pure reorganization.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 61 ++++++++++++++++++++++++++++++---------------------------
 1 file changed, 32 insertions(+), 29 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 14c2f20..b1b834d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4997,10 +4997,12 @@ static void mem_cgroup_clear_mc(void)
 static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
 				 struct cgroup_taskset *tset)
 {
-	struct task_struct *p = cgroup_taskset_first(tset);
-	int ret = 0;
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	struct mem_cgroup *from;
+	struct task_struct *p;
+	struct mm_struct *mm;
 	unsigned long move_flags;
+	int ret = 0;
 
 	/*
 	 * We are now commited to this value whatever it is. Changes in this
@@ -5008,36 +5010,37 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
 	 * So we need to save it, and keep it going.
 	 */
 	move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
-	if (move_flags) {
-		struct mm_struct *mm;
-		struct mem_cgroup *from = mem_cgroup_from_task(p);
+	if (!move_flags)
+		return 0;
 
-		VM_BUG_ON(from == memcg);
+	p = cgroup_taskset_first(tset);
+	from = mem_cgroup_from_task(p);
 
-		mm = get_task_mm(p);
-		if (!mm)
-			return 0;
-		/* We move charges only when we move a owner of the mm */
-		if (mm->owner == p) {
-			VM_BUG_ON(mc.from);
-			VM_BUG_ON(mc.to);
-			VM_BUG_ON(mc.precharge);
-			VM_BUG_ON(mc.moved_charge);
-			VM_BUG_ON(mc.moved_swap);
-
-			spin_lock(&mc.lock);
-			mc.from = from;
-			mc.to = memcg;
-			mc.flags = move_flags;
-			spin_unlock(&mc.lock);
-			/* We set mc.moving_task later */
-
-			ret = mem_cgroup_precharge_mc(mm);
-			if (ret)
-				mem_cgroup_clear_mc();
-		}
-		mmput(mm);
+	VM_BUG_ON(from == memcg);
+
+	mm = get_task_mm(p);
+	if (!mm)
+		return 0;
+	/* We move charges only when we move a owner of the mm */
+	if (mm->owner == p) {
+		VM_BUG_ON(mc.from);
+		VM_BUG_ON(mc.to);
+		VM_BUG_ON(mc.precharge);
+		VM_BUG_ON(mc.moved_charge);
+		VM_BUG_ON(mc.moved_swap);
+
+		spin_lock(&mc.lock);
+		mc.from = from;
+		mc.to = memcg;
+		mc.flags = move_flags;
+		spin_unlock(&mc.lock);
+		/* We set mc.moving_task later */
+
+		ret = mem_cgroup_precharge_mc(mm);
+		if (ret)
+			mem_cgroup_clear_mc();
 	}
+	mmput(mm);
 	return ret;
 }
 
-- 
2.4.0



* [PATCH 3/7] memcg: immigrate charges only when a threadgroup leader is moved
@ 2015-05-18 19:49   ` Tejun Heo
  0 siblings, 0 replies; 59+ messages in thread
From: Tejun Heo @ 2015-05-18 19:49 UTC (permalink / raw)
  To: lizefan; +Cc: cgroups, hannes, mhocko, linux-mm, Tejun Heo

If the move_charge flag is set, memcg tries to move memory charges to
the destination css.  The current implementation moves charges
whenever any thread of a process is migrated, making the behavior
somewhat arbitrary.  Let's tie memory operations to the threadgroup
leader so that charges are moved only when the leader is migrated.

While this is a behavior change, given the inherent fuzziness it is
not too likely to be noticed.  It also allows us to clearly define
who owns the memory (always the leader) and helps the planned atomic
multi-process migration.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b1b834d..74fcea3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5014,6 +5014,9 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
 		return 0;
 
 	p = cgroup_taskset_first(tset);
+	if (!thread_group_leader(p))
+		return 0;
+
 	from = mem_cgroup_from_task(p);
 
 	VM_BUG_ON(from == memcg);
-- 
2.4.0



* [PATCH 4/7] cgroup, memcg, cpuset: implement cgroup_taskset_for_each_leader()
@ 2015-05-18 19:49   ` Tejun Heo
  0 siblings, 0 replies; 59+ messages in thread
From: Tejun Heo @ 2015-05-18 19:49 UTC (permalink / raw)
  To: lizefan; +Cc: cgroups, hannes, mhocko, linux-mm, Tejun Heo

It wasn't explicitly documented but, when a process is being
migrated, cpuset and memcg depend on cgroup_taskset_first() returning
the threadgroup leader; however, this approach is fragile and would
no longer work for the planned multi-process migration.

This patch introduces explicit cgroup_taskset_for_each_leader() which
iterates over only the threadgroup leaders and replaces
cgroup_taskset_first() usages for accessing the leader with it.

This prepares both memcg and cpuset for multi-process migration.  This
patch also updates the documentation for cgroup_taskset_for_each() to
clarify the iteration rules and removes comments mentioning task
ordering in tasksets.
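The new iterator relies on the classic inverted if/else filtering
idiom so the macro stays safe to use with or without braces.  A
self-contained userspace sketch of the same idiom over a plain array
(hypothetical struct, not the kernel code) illustrates it:

```c
#include <assert.h>

struct task {
	int pid;
	int leader_pid;	/* == pid for a threadgroup leader */
};

/*
 * Same idiom as cgroup_taskset_for_each_leader(): the inverted test
 * with an empty if-branch makes the caller's statement bind to the
 * else, so non-leaders are skipped and a trailing "else" in user
 * code can't accidentally pair with the macro's internal if.
 */
#define for_each_leader(t, arr, n)					\
	for ((t) = (arr); (t) < (arr) + (n); (t)++)			\
		if ((t)->pid != (t)->leader_pid)			\
			;						\
		else

static int count_leaders(struct task *tasks, int n)
{
	struct task *t;
	int leaders = 0;

	for_each_leader(t, tasks, n)
		leaders++;
	return leaders;
}
```

For a taskset-like array holding two threadgroups, only the two
leader entries reach the loop body.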

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Li Zefan <lizefan@huawei.com>
---
 include/linux/cgroup.h | 22 ++++++++++++++++++++++
 kernel/cgroup.c        | 11 -----------
 kernel/cpuset.c        |  9 ++++-----
 mm/memcontrol.c        | 16 +++++++++++++---
 4 files changed, 39 insertions(+), 19 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 82319fb..788fab38 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -197,11 +197,33 @@ void css_task_iter_end(struct css_task_iter *it);
  * cgroup_taskset_for_each - iterate cgroup_taskset
  * @task: the loop cursor
  * @tset: taskset to iterate
+ *
+ * @tset may contain multiple tasks and they may belong to multiple
+ * processes.  When there are multiple tasks in @tset, if a task of a
+ * process is in @tset, all tasks of the process are in @tset.  Also, all
+ * are guaranteed to share the same source and destination csses.
+ *
+ * Iteration is not in any specific order.
  */
 #define cgroup_taskset_for_each(task, tset)				\
 	for ((task) = cgroup_taskset_first((tset)); (task);		\
 	     (task) = cgroup_taskset_next((tset)))
 
+/**
+ * cgroup_taskset_for_each_leader - iterate group leaders in a cgroup_taskset
+ * @leader: the loop cursor
+ * @tset: taskset to iterate
+ *
+ * Iterate threadgroup leaders of @tset.  For single-task migrations, @tset
+ * may not contain any.
+ */
+#define cgroup_taskset_for_each_leader(leader, tset)			\
+	for ((leader) = cgroup_taskset_first((tset)); (leader);		\
+	     (leader) = cgroup_taskset_next((tset)))			\
+		if ((leader) != (leader)->group_leader)			\
+			;						\
+		else
+
 /*
  * Inline functions.
  */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index afe9a7e..da45ce9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2064,13 +2064,6 @@ static void cgroup_task_migrate(struct cgroup *old_cgrp,
 
 	get_css_set(new_cset);
 	rcu_assign_pointer(tsk->cgroups, new_cset);
-
-	/*
-	 * Use move_tail so that cgroup_taskset_first() still returns the
-	 * leader after migration.  This works because cgroup_migrate()
-	 * ensures that the dst_cset of the leader is the first on the
-	 * tset's dst_csets list.
-	 */
 	list_move_tail(&tsk->cg_list, &new_cset->mg_tasks);
 
 	/*
@@ -2266,10 +2259,6 @@ static int cgroup_migrate(struct cgroup *cgrp, struct task_struct *leader,
 		if (!cset->mg_src_cgrp)
 			goto next;
 
-		/*
-		 * cgroup_taskset_first() must always return the leader.
-		 * Take care to avoid disturbing the ordering.
-		 */
 		list_move_tail(&task->cg_list, &cset->mg_tasks);
 		if (list_empty(&cset->mg_node))
 			list_add_tail(&cset->mg_node, &tset.src_csets);
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 43db5b7..54a2b99 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1485,7 +1485,7 @@ static void cpuset_attach(struct cgroup_subsys_state *css,
 	/* static buf protected by cpuset_mutex */
 	static nodemask_t cpuset_attach_nodemask_to;
 	struct task_struct *task;
-	struct task_struct *leader = cgroup_taskset_first(tset);
+	struct task_struct *leader;
 	struct cpuset *cs = css_cs(css);
 	struct cpuset *oldcs = cpuset_attach_old_cs;
 
@@ -1511,12 +1511,11 @@ static void cpuset_attach(struct cgroup_subsys_state *css,
 	}
 
 	/*
-	 * Change mm, possibly for multiple threads in a threadgroup. This
-	 * is expensive and may sleep and should be moved outside migration
-	 * path proper.
+	 * Change mm for all threadgroup leaders. This is expensive and may
+	 * sleep and should be moved outside migration path proper.
 	 */
 	cpuset_attach_nodemask_to = cs->effective_mems;
-	if (thread_group_leader(leader)) {
+	cgroup_taskset_for_each_leader(leader, tset) {
 		struct mm_struct *mm = get_task_mm(leader);
 
 		if (mm) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 74fcea3..526ed58 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4999,7 +4999,7 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 	struct mem_cgroup *from;
-	struct task_struct *p;
+	struct task_struct *leader, *p;
 	struct mm_struct *mm;
 	unsigned long move_flags;
 	int ret = 0;
@@ -5013,8 +5013,18 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
 	if (!move_flags)
 		return 0;
 
-	p = cgroup_taskset_first(tset);
-	if (!thread_group_leader(p))
+	/*
+	 * Multi-process migrations only happen on the default hierarchy
+	 * where charge immigration is not used.  Perform charge
+	 * immigration if @tset contains a leader and whine if there are
+	 * multiple.
+	 */
+	p = NULL;
+	cgroup_taskset_for_each_leader(leader, tset) {
+		WARN_ON_ONCE(p);
+		p = leader;
+	}
+	if (!p)
 		return 0;
 
 	from = mem_cgroup_from_task(p);
-- 
2.4.0



-	 * is expensive and may sleep and should be moved outside migration
-	 * path proper.
+	 * Change mm for all threadgroup leaders. This is expensive and may
+	 * sleep and should be moved outside migration path proper.
 	 */
 	cpuset_attach_nodemask_to = cs->effective_mems;
-	if (thread_group_leader(leader)) {
+	cgroup_taskset_for_each_leader(leader, tset) {
 		struct mm_struct *mm = get_task_mm(leader);
 
 		if (mm) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 74fcea3..526ed58 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4999,7 +4999,7 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 	struct mem_cgroup *from;
-	struct task_struct *p;
+	struct task_struct *leader, *p;
 	struct mm_struct *mm;
 	unsigned long move_flags;
 	int ret = 0;
@@ -5013,8 +5013,18 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
 	if (!move_flags)
 		return 0;
 
-	p = cgroup_taskset_first(tset);
-	if (!thread_group_leader(p))
+	/*
+	 * Multi-process migrations only happen on the default hierarchy
+	 * where charge immigration is not used.  Perform charge
+	 * immigration if @tset contains a leader and whine if there are
+	 * multiple.
+	 */
+	p = NULL;
+	cgroup_taskset_for_each_leader(leader, tset) {
+		WARN_ON_ONCE(p);
+		p = leader;
+	}
+	if (!p)
 		return 0;
 
 	from = mem_cgroup_from_task(p);
-- 
2.4.0

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 5/7] reorder cgroup_migrate()'s parameters
@ 2015-05-18 19:49   ` Tejun Heo
  0 siblings, 0 replies; 59+ messages in thread
From: Tejun Heo @ 2015-05-18 19:49 UTC (permalink / raw)
  To: lizefan; +Cc: cgroups, hannes, mhocko, linux-mm, Tejun Heo

cgroup_migrate() has the destination cgroup as the first parameter
while cgroup_task_migrate() has the destination cset as the last.
Another migration function is scheduled to be added, which would make
the discrepancy stand out further.  Let's reorder cgroup_migrate()'s
parameters so that the destination cgroup is the last.

This doesn't cause any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index da45ce9..b36707b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2209,9 +2209,9 @@ err:
 
 /**
  * cgroup_migrate - migrate a process or task to a cgroup
- * @cgrp: the destination cgroup
  * @leader: the leader of the process or the task to migrate
  * @threadgroup: whether @leader points to the whole process or a single task
+ * @cgrp: the destination cgroup
  *
  * Migrate a process or task denoted by @leader to @cgrp.  If migrating a
  * process, the caller must be holding cgroup_threadgroup_rwsem.  The
@@ -2225,8 +2225,8 @@ err:
  * decided for all targets by invoking group_migrate_prepare_dst() before
  * actually starting migrating.
  */
-static int cgroup_migrate(struct cgroup *cgrp, struct task_struct *leader,
-			  bool threadgroup)
+static int cgroup_migrate(struct task_struct *leader, bool threadgroup,
+			  struct cgroup *cgrp)
 {
 	struct cgroup_taskset tset = {
 		.src_csets	= LIST_HEAD_INIT(tset.src_csets),
@@ -2363,7 +2363,7 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
 	/* prepare dst csets and commit */
 	ret = cgroup_migrate_prepare_dst(dst_cgrp, &preloaded_csets);
 	if (!ret)
-		ret = cgroup_migrate(dst_cgrp, leader, threadgroup);
+		ret = cgroup_migrate(leader, threadgroup, dst_cgrp);
 
 	cgroup_migrate_finish(&preloaded_csets);
 	return ret;
@@ -2640,7 +2640,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 				goto out_finish;
 			last_task = task;
 
-			ret = cgroup_migrate(src_cset->dfl_cgrp, task, true);
+			ret = cgroup_migrate(task, true, src_cset->dfl_cgrp);
 
 			put_task_struct(task);
 
@@ -3711,7 +3711,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 		css_task_iter_end(&it);
 
 		if (task) {
-			ret = cgroup_migrate(to, task, false);
+			ret = cgroup_migrate(task, false, to);
 			put_task_struct(task);
 		}
 	} while (task && !ret);
-- 
2.4.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org


* [PATCH 6/7] cgroup: separate out taskset operations from cgroup_migrate()
@ 2015-05-18 19:49   ` Tejun Heo
  0 siblings, 0 replies; 59+ messages in thread
From: Tejun Heo @ 2015-05-18 19:49 UTC (permalink / raw)
  To: lizefan; +Cc: cgroups, hannes, mhocko, linux-mm, Tejun Heo

Currently, cgroup_migrate() implements a large part of the migration
logic inline, including building the target taskset and actually
migrating it.  This patch separates out the following taskset
operations.

 CGROUP_TASKSET_INIT()		: taskset initializer
 cgroup_taskset_add()		: add a task to a taskset
 cgroup_taskset_migrate()	: migrate a taskset to the destination cgroup

This will be used to implement atomic multi-process migration in
cgroup_update_dfl_csses().  This is pure reorganization which doesn't
introduce any functional changes.
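
The initializer is built on a C99 compound literal with designated
initializers that point the embedded list heads back at the named
variable.  A hedged userspace sketch of the same pattern, using a minimal
list_head stand-in rather than the kernel's <linux/list.h>:

```c
#include <assert.h>

/* Minimal circular doubly linked list, in the style of <linux/list.h>. */
struct list_head {
	struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name)	{ &(name), &(name) }

static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Simplified taskset: just the two cset lists and the active-list cursor. */
struct taskset {
	struct list_head src_csets;
	struct list_head dst_csets;
	struct list_head *csets;
};

/*
 * Model of CGROUP_TASKSET_INIT(): the compound literal's designated
 * initializers reference the named variable itself, so the taskset
 * starts with two valid empty lists and the cursor on the source list.
 */
#define TASKSET_INIT(tset)	(struct taskset){			\
	.src_csets	= LIST_HEAD_INIT(tset.src_csets),		\
	.dst_csets	= LIST_HEAD_INIT(tset.dst_csets),		\
	.csets		= &tset.src_csets,				\
}
```

Because the macro names the variable being initialized, it can only be
used in a declaration such as `struct taskset ts = TASKSET_INIT(ts);`,
mirroring how CGROUP_TASKSET_INIT() is used in cgroup_migrate().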

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup.c | 211 +++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 125 insertions(+), 86 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b36707b..1d86d17 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1991,6 +1991,49 @@ struct cgroup_taskset {
 	struct task_struct	*cur_task;
 };
 
+#define CGROUP_TASKSET_INIT(tset)	(struct cgroup_taskset){	\
+	.src_csets		= LIST_HEAD_INIT(tset.src_csets),	\
+	.dst_csets		= LIST_HEAD_INIT(tset.dst_csets),	\
+	.csets			= &tset.src_csets,			\
+}
+
+/**
+ * cgroup_taskset_add - try to add a migration target task to a taskset
+ * @task: target task
+ * @tset: target taskset
+ *
+ * Add @task, which is a migration target, to @tset.  This function becomes
+ * noop if @task doesn't need to be migrated.  @task's css_set should have
+ * been added as a migration source and @task->cg_list will be moved from
+ * the css_set's tasks list to mg_tasks one.
+ */
+static void cgroup_taskset_add(struct task_struct *task,
+			       struct cgroup_taskset *tset)
+{
+	struct css_set *cset;
+
+	lockdep_assert_held(&css_set_rwsem);
+
+	/* @task either already exited or can't exit until the end */
+	if (task->flags & PF_EXITING)
+		return;
+
+	/* leave @task alone if post_fork() hasn't linked it yet */
+	if (list_empty(&task->cg_list))
+		return;
+
+	cset = task_css_set(task);
+	if (!cset->mg_src_cgrp)
+		return;
+
+	list_move_tail(&task->cg_list, &cset->mg_tasks);
+	if (list_empty(&cset->mg_node))
+		list_add_tail(&cset->mg_node, &tset->src_csets);
+	if (list_empty(&cset->mg_dst_cset->mg_node))
+		list_move_tail(&cset->mg_dst_cset->mg_node,
+			       &tset->dst_csets);
+}
+
 /**
  * cgroup_taskset_first - reset taskset and return the first task
  * @tset: taskset of interest
@@ -2075,6 +2118,84 @@ static void cgroup_task_migrate(struct cgroup *old_cgrp,
 }
 
 /**
+ * cgroup_taskset_migrate - migrate a taskset to a cgroup
+ * @tset: target taskset
+ * @dst_cgrp: destination cgroup
+ *
+ * Migrate tasks in @tset to @dst_cgrp.  This function fails iff one of the
+ * ->can_attach callbacks fails and guarantees that either all or none of
+ * the tasks in @tset are migrated.  @tset is consumed regardless of
+ * success.
+ */
+static int cgroup_taskset_migrate(struct cgroup_taskset *tset,
+				  struct cgroup *dst_cgrp)
+{
+	struct cgroup_subsys_state *css, *failed_css = NULL;
+	struct task_struct *task, *tmp_task;
+	struct css_set *cset, *tmp_cset;
+	int i, ret;
+
+	/* methods shouldn't be called if no task is actually migrating */
+	if (list_empty(&tset->src_csets))
+		return 0;
+
+	/* check that we can legitimately attach to the cgroup */
+	for_each_e_css(css, i, dst_cgrp) {
+		if (css->ss->can_attach) {
+			ret = css->ss->can_attach(css, tset);
+			if (ret) {
+				failed_css = css;
+				goto out_cancel_attach;
+			}
+		}
+	}
+
+	/*
+	 * Now that we're guaranteed success, proceed to move all tasks to
+	 * the new cgroup.  There are no failure cases after here, so this
+	 * is the commit point.
+	 */
+	down_write(&css_set_rwsem);
+	list_for_each_entry(cset, &tset->src_csets, mg_node) {
+		list_for_each_entry_safe(task, tmp_task, &cset->mg_tasks, cg_list)
+			cgroup_task_migrate(cset->mg_src_cgrp, task,
+					    cset->mg_dst_cset);
+	}
+	up_write(&css_set_rwsem);
+
+	/*
+	 * Migration is committed, all target tasks are now on dst_csets.
+	 * Nothing is sensitive to fork() after this point.  Notify
+	 * controllers that migration is complete.
+	 */
+	tset->csets = &tset->dst_csets;
+
+	for_each_e_css(css, i, dst_cgrp)
+		if (css->ss->attach)
+			css->ss->attach(css, tset);
+
+	ret = 0;
+	goto out_release_tset;
+
+out_cancel_attach:
+	for_each_e_css(css, i, dst_cgrp) {
+		if (css == failed_css)
+			break;
+		if (css->ss->cancel_attach)
+			css->ss->cancel_attach(css, tset);
+	}
+out_release_tset:
+	down_write(&css_set_rwsem);
+	list_splice_init(&tset->dst_csets, &tset->src_csets);
+	list_for_each_entry_safe(cset, tmp_cset, &tset->src_csets, mg_node) {
+		list_splice_tail_init(&cset->mg_tasks, &cset->tasks);
+		list_del_init(&cset->mg_node);
+	}
+	up_write(&css_set_rwsem);
+	return ret;
+}
+
+/**
  * cgroup_migrate_finish - cleanup after attach
  * @preloaded_csets: list of preloaded css_sets
  *
@@ -2228,15 +2349,8 @@ err:
 static int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 			  struct cgroup *cgrp)
 {
-	struct cgroup_taskset tset = {
-		.src_csets	= LIST_HEAD_INIT(tset.src_csets),
-		.dst_csets	= LIST_HEAD_INIT(tset.dst_csets),
-		.csets		= &tset.src_csets,
-	};
-	struct cgroup_subsys_state *css, *failed_css = NULL;
-	struct css_set *cset, *tmp_cset;
-	struct task_struct *task, *tmp_task;
-	int i, ret;
+	struct cgroup_taskset tset = CGROUP_TASKSET_INIT(tset);
+	struct task_struct *task;
 
 	/*
 	 * Prevent freeing of tasks while we take a snapshot. Tasks that are
@@ -2247,89 +2361,14 @@ static int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 	rcu_read_lock();
 	task = leader;
 	do {
-		/* @task either already exited or can't exit until the end */
-		if (task->flags & PF_EXITING)
-			goto next;
-
-		/* leave @task alone if post_fork() hasn't linked it yet */
-		if (list_empty(&task->cg_list))
-			goto next;
-
-		cset = task_css_set(task);
-		if (!cset->mg_src_cgrp)
-			goto next;
-
-		list_move_tail(&task->cg_list, &cset->mg_tasks);
-		if (list_empty(&cset->mg_node))
-			list_add_tail(&cset->mg_node, &tset.src_csets);
-		if (list_empty(&cset->mg_dst_cset->mg_node))
-			list_move_tail(&cset->mg_dst_cset->mg_node,
-				       &tset.dst_csets);
-	next:
+		cgroup_taskset_add(task, &tset);
 		if (!threadgroup)
 			break;
 	} while_each_thread(leader, task);
 	rcu_read_unlock();
 	up_write(&css_set_rwsem);
 
-	/* methods shouldn't be called if no task is actually migrating */
-	if (list_empty(&tset.src_csets))
-		return 0;
-
-	/* check that we can legitimately attach to the cgroup */
-	for_each_e_css(css, i, cgrp) {
-		if (css->ss->can_attach) {
-			ret = css->ss->can_attach(css, &tset);
-			if (ret) {
-				failed_css = css;
-				goto out_cancel_attach;
-			}
-		}
-	}
-
-	/*
-	 * Now that we're guaranteed success, proceed to move all tasks to
-	 * the new cgroup.  There are no failure cases after here, so this
-	 * is the commit point.
-	 */
-	down_write(&css_set_rwsem);
-	list_for_each_entry(cset, &tset.src_csets, mg_node) {
-		list_for_each_entry_safe(task, tmp_task, &cset->mg_tasks, cg_list)
-			cgroup_task_migrate(cset->mg_src_cgrp, task,
-					    cset->mg_dst_cset);
-	}
-	up_write(&css_set_rwsem);
-
-	/*
-	 * Migration is committed, all target tasks are now on dst_csets.
-	 * Nothing is sensitive to fork() after this point.  Notify
-	 * controllers that migration is complete.
-	 */
-	tset.csets = &tset.dst_csets;
-
-	for_each_e_css(css, i, cgrp)
-		if (css->ss->attach)
-			css->ss->attach(css, &tset);
-
-	ret = 0;
-	goto out_release_tset;
-
-out_cancel_attach:
-	for_each_e_css(css, i, cgrp) {
-		if (css == failed_css)
-			break;
-		if (css->ss->cancel_attach)
-			css->ss->cancel_attach(css, &tset);
-	}
-out_release_tset:
-	down_write(&css_set_rwsem);
-	list_splice_init(&tset.dst_csets, &tset.src_csets);
-	list_for_each_entry_safe(cset, tmp_cset, &tset.src_csets, mg_node) {
-		list_splice_tail_init(&cset->mg_tasks, &cset->tasks);
-		list_del_init(&cset->mg_node);
-	}
-	up_write(&css_set_rwsem);
-	return ret;
+	return cgroup_taskset_migrate(&tset, cgrp);
 }
 
 /**
-- 
2.4.0


* [PATCH 7/7] cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically
@ 2015-05-18 19:49   ` Tejun Heo
  0 siblings, 0 replies; 59+ messages in thread
From: Tejun Heo @ 2015-05-18 19:49 UTC (permalink / raw)
  To: lizefan; +Cc: cgroups, hannes, mhocko, linux-mm, Tejun Heo

cgroup_update_dfl_csses() is responsible for migrating processes when
controllers are enabled or disabled on the default hierarchy.  As the
css association changes for all the processes in the affected cgroups,
this involves migrating multiple processes.

Up until now, it was implemented by migrating process-by-process until
the source css_sets are empty; however, this means that if a process
fails to migrate after others have already succeeded, recovery is very
tricky.  This was considered acceptable as subsystems weren't allowed to
reject process migration on the default hierarchy; unfortunately,
enforcing this policy turned out to be problematic for certain types
of resources - realtime slices for now.

As such, the default hierarchy is going to allow restricted failures
during migration, and to support that this patch makes
cgroup_update_dfl_csses() migrate all target processes atomically
rather than one-by-one.  The preceding patches made subsystems ready
for multi-process migration and factored out taskset operations making
this almost trivial.  All tasks of the target processes are put in the
same taskset and the migration operation is performed once, either
failing or succeeding for all of them.
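
The all-or-nothing shape can be sketched as a two-phase loop: every
controller may veto in ->can_attach() before anything is committed, and
the first veto unwinds the controllers that already agreed.  This is a
simplified userspace model of what cgroup_taskset_migrate() does; the
struct controller type and the demo counters are illustrative stand-ins,
not kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative controller with optional attach hooks, loosely modelled
 * on the cgroup_subsys callbacks. */
struct controller {
	int  (*can_attach)(struct controller *c);
	void (*cancel_attach)(struct controller *c);
	void (*attach)(struct controller *c);
};

/*
 * All-or-nothing migration skeleton: collect vetoes first, commit only
 * if every controller agreed, and on failure cancel the ones that had
 * already said yes.  Nothing is migrated on the failure path.
 */
static int migrate_all_or_nothing(struct controller **ctrls, int n)
{
	int i, ret;

	for (i = 0; i < n; i++) {
		if (ctrls[i]->can_attach) {
			ret = ctrls[i]->can_attach(ctrls[i]);
			if (ret)
				goto cancel;
		}
	}

	/* commit point: no failures are possible past here */
	for (i = 0; i < n; i++)
		if (ctrls[i]->attach)
			ctrls[i]->attach(ctrls[i]);
	return 0;

cancel:
	while (i-- > 0)
		if (ctrls[i]->cancel_attach)
			ctrls[i]->cancel_attach(ctrls[i]);
	return ret;
}

/* Demo controllers with counters, for exercising the skeleton. */
static int ok_attached, ok_cancelled;
static int  ok_can(struct controller *c)     { (void)c; return 0; }
static void ok_cancel(struct controller *c)  { (void)c; ok_cancelled++; }
static void ok_attach(struct controller *c)  { (void)c; ok_attached++; }
static int  veto_can(struct controller *c)   { (void)c; return -1; }
```

In the kernel the commit phase additionally moves tasks between css_sets
under css_set_rwsem, but the control flow follows the same skeleton.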

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup.c | 44 ++++++++------------------------------------
 1 file changed, 8 insertions(+), 36 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1d86d17..978d741 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2616,6 +2616,7 @@ static int cgroup_subtree_control_show(struct seq_file *seq, void *v)
 static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 {
 	LIST_HEAD(preloaded_csets);
+	struct cgroup_taskset tset = CGROUP_TASKSET_INIT(tset);
 	struct cgroup_subsys_state *css;
 	struct css_set *src_cset;
 	int ret;
@@ -2644,50 +2645,21 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	if (ret)
 		goto out_finish;
 
+	down_write(&css_set_rwsem);
 	list_for_each_entry(src_cset, &preloaded_csets, mg_preload_node) {
-		struct task_struct *last_task = NULL, *task;
+		struct task_struct *task, *ntask;
 
 		/* src_csets precede dst_csets, break on the first dst_cset */
 		if (!src_cset->mg_src_cgrp)
 			break;
 
-		/*
-		 * All tasks in src_cset need to be migrated to the
-		 * matching dst_cset.  Empty it process by process.  We
-		 * walk tasks but migrate processes.  The leader might even
-		 * belong to a different cset but such src_cset would also
-		 * be among the target src_csets because the default
-		 * hierarchy enforces per-process membership.
-		 */
-		while (true) {
-			down_read(&css_set_rwsem);
-			task = list_first_entry_or_null(&src_cset->tasks,
-						struct task_struct, cg_list);
-			if (task) {
-				task = task->group_leader;
-				WARN_ON_ONCE(!task_css_set(task)->mg_src_cgrp);
-				get_task_struct(task);
-			}
-			up_read(&css_set_rwsem);
-
-			if (!task)
-				break;
-
-			/* guard against possible infinite loop */
-			if (WARN(last_task == task,
-				 "cgroup: update_dfl_csses failed to make progress, aborting in inconsistent state\n"))
-				goto out_finish;
-			last_task = task;
-
-			ret = cgroup_migrate(task, true, src_cset->dfl_cgrp);
-
-			put_task_struct(task);
-
-			if (WARN(ret, "cgroup: failed to update controllers for the default hierarchy (%d), further operations may crash or hang\n", ret))
-				goto out_finish;
-		}
+		/* all tasks in src_csets need to be migrated */
+		list_for_each_entry_safe(task, ntask, &src_cset->tasks, cg_list)
+			cgroup_taskset_add(task, &tset);
 	}
+	up_write(&css_set_rwsem);
 
+	ret = cgroup_taskset_migrate(&tset, cgrp);
 out_finish:
 	cgroup_migrate_finish(&preloaded_csets);
 	percpu_up_write(&cgroup_threadgroup_rwsem);
-- 
2.4.0


* [PATCH 7/7] cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically
@ 2015-05-18 19:49   ` Tejun Heo
  0 siblings, 0 replies; 59+ messages in thread
From: Tejun Heo @ 2015-05-18 19:49 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, hannes-druUgvl0LCNAfugRpC6u6w,
	mhocko-AlSwsSmVLrQ, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Tejun Heo

cgroup_update_dfl_csses() is responsible for migrating processes when
controllers are enabled or disabled on the default hierarchy.  As the
css association changes for all the processes in the affected cgroups,
this involves migrating multiple processes.

Up until now, it was implemented by migrating process-by-process until
the source css_sets are empty; however, this means that if a process
fails to migrate after some succeed before it, the recovery is very
tricky.  This was considered okay as subsystems weren't allowed to
reject process migration on the default hierarchy; unfortunately,
enforcing this policy turned out to be problematic for certain types
of resources - realtime slices for now.

As such, the default hierarchy will allow restricted failures during
migration.  To support that, this patch makes
cgroup_update_dfl_csses() migrate all target processes atomically
rather than one-by-one.  The preceding patches made subsystems ready
for multi-process migration and factored out taskset operations,
making this almost trivial.  All tasks of the target processes are put
in the same taskset and the migration operations are performed once,
either failing or succeeding for all of them.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 kernel/cgroup.c | 44 ++++++++------------------------------------
 1 file changed, 8 insertions(+), 36 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1d86d17..978d741 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2616,6 +2616,7 @@ static int cgroup_subtree_control_show(struct seq_file *seq, void *v)
 static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 {
 	LIST_HEAD(preloaded_csets);
+	struct cgroup_taskset tset = CGROUP_TASKSET_INIT(tset);
 	struct cgroup_subsys_state *css;
 	struct css_set *src_cset;
 	int ret;
@@ -2644,50 +2645,21 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	if (ret)
 		goto out_finish;
 
+	down_write(&css_set_rwsem);
 	list_for_each_entry(src_cset, &preloaded_csets, mg_preload_node) {
-		struct task_struct *last_task = NULL, *task;
+		struct task_struct *task, *ntask;
 
 		/* src_csets precede dst_csets, break on the first dst_cset */
 		if (!src_cset->mg_src_cgrp)
 			break;
 
-		/*
-		 * All tasks in src_cset need to be migrated to the
-		 * matching dst_cset.  Empty it process by process.  We
-		 * walk tasks but migrate processes.  The leader might even
-		 * belong to a different cset but such src_cset would also
-		 * be among the target src_csets because the default
-		 * hierarchy enforces per-process membership.
-		 */
-		while (true) {
-			down_read(&css_set_rwsem);
-			task = list_first_entry_or_null(&src_cset->tasks,
-						struct task_struct, cg_list);
-			if (task) {
-				task = task->group_leader;
-				WARN_ON_ONCE(!task_css_set(task)->mg_src_cgrp);
-				get_task_struct(task);
-			}
-			up_read(&css_set_rwsem);
-
-			if (!task)
-				break;
-
-			/* guard against possible infinite loop */
-			if (WARN(last_task == task,
-				 "cgroup: update_dfl_csses failed to make progress, aborting in inconsistent state\n"))
-				goto out_finish;
-			last_task = task;
-
-			ret = cgroup_migrate(task, true, src_cset->dfl_cgrp);
-
-			put_task_struct(task);
-
-			if (WARN(ret, "cgroup: failed to update controllers for the default hierarchy (%d), further operations may crash or hang\n", ret))
-				goto out_finish;
-		}
+		/* all tasks in src_csets need to be migrated */
+		list_for_each_entry_safe(task, ntask, &src_cset->tasks, cg_list)
+			cgroup_taskset_add(task, &tset);
 	}
+	up_write(&css_set_rwsem);
 
+	ret = cgroup_taskset_migrate(&tset, cgrp);
 out_finish:
 	cgroup_migrate_finish(&preloaded_csets);
 	percpu_up_write(&cgroup_threadgroup_rwsem);
-- 
2.4.0

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCHSET cgroup/for-4.2] cgroup: make multi-process migration atomic
  2015-05-18 19:49 ` Tejun Heo
                   ` (7 preceding siblings ...)
  (?)
@ 2015-05-19  6:57 ` Zefan Li
  -1 siblings, 0 replies; 59+ messages in thread
From: Zefan Li @ 2015-05-19  6:57 UTC (permalink / raw)
  To: Tejun Heo; +Cc: cgroups, hannes, mhocko, linux-mm

On 2015/5/19 3:49, Tejun Heo wrote:
> Hello,
> 
> When a controller is enabled or disabled on the unified hierarchy, the
> effective css changes for all processes in the sub-hierarchy which
> virtually is multi-process migration.  This is implemented in
> cgroup_update_dfl_csses() as process-by-process migration - all the
> target source css_sets are first chained to the target list and
> processes are drained from them one-by-one.
> 
> If a process gets rejected by a controller after some are successfully
> migrated, the recovery action is tricky.  The changes which have
> happened up to this point have to be rolled back, but there's nothing
> guaranteeing such rollback would be successful either.
> 
> The unified hierarchy didn't need to deal with this issue because
> organizational operations were expected to always succeed;
> unfortunately, it turned out that such a policy doesn't work too well
> for certain types of resources and the unified hierarchy would need to
> allow migration failures for some restricted cases.
> 
> This patchset makes multi-process migration in
> cgroup_update_dfl_csses() atomic so that ->can_attach() can fail the
> whole transaction.  It consists of the following seven patches.
> 
>  0001-cpuset-migrate-memory-only-for-threadgroup-leaders.patch
>  0002-memcg-restructure-mem_cgroup_can_attach.patch
>  0003-memcg-immigrate-charges-only-when-a-threadgroup-lead.patch
>  0004-cgroup-memcg-cpuset-implement-cgroup_taskset_for_eac.patch
>  0005-reorder-cgroup_migrate-s-parameters.patch
>  0006-cgroup-separate-out-taskset-operations-from-cgroup_m.patch
>  0007-cgroup-make-cgroup_update_dfl_csses-migrate-all-targ.patch
> 

Thanks for working on this. The patchset looks good to me.

Acked-by: Zefan Li <lizefan@huawei.com>


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/7] memcg: restructure mem_cgroup_can_attach()
@ 2015-05-19  9:03     ` Michal Hocko
  0 siblings, 0 replies; 59+ messages in thread
From: Michal Hocko @ 2015-05-19  9:03 UTC (permalink / raw)
  To: Tejun Heo; +Cc: lizefan, cgroups, hannes, linux-mm

On Mon 18-05-15 15:49:50, Tejun Heo wrote:
> Restructure it to lower nesting level and help the planned threadgroup
> leader iteration changes.
> 
> This is pure reorganization.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/memcontrol.c | 61 ++++++++++++++++++++++++++++++---------------------------
>  1 file changed, 32 insertions(+), 29 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 14c2f20..b1b834d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4997,10 +4997,12 @@ static void mem_cgroup_clear_mc(void)
>  static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
>  				 struct cgroup_taskset *tset)
>  {
> -	struct task_struct *p = cgroup_taskset_first(tset);
> -	int ret = 0;
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +	struct mem_cgroup *from;
> +	struct task_struct *p;
> +	struct mm_struct *mm;
>  	unsigned long move_flags;
> +	int ret = 0;
>  
>  	/*
>  	 * We are now commited to this value whatever it is. Changes in this
> @@ -5008,36 +5010,37 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
>  	 * So we need to save it, and keep it going.
>  	 */
>  	move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
> -	if (move_flags) {
> -		struct mm_struct *mm;
> -		struct mem_cgroup *from = mem_cgroup_from_task(p);
> +	if (!move_flags)
> +		return 0;
>  
> -		VM_BUG_ON(from == memcg);
> +	p = cgroup_taskset_first(tset);
> +	from = mem_cgroup_from_task(p);
>  
> -		mm = get_task_mm(p);
> -		if (!mm)
> -			return 0;
> -		/* We move charges only when we move a owner of the mm */
> -		if (mm->owner == p) {
> -			VM_BUG_ON(mc.from);
> -			VM_BUG_ON(mc.to);
> -			VM_BUG_ON(mc.precharge);
> -			VM_BUG_ON(mc.moved_charge);
> -			VM_BUG_ON(mc.moved_swap);
> -
> -			spin_lock(&mc.lock);
> -			mc.from = from;
> -			mc.to = memcg;
> -			mc.flags = move_flags;
> -			spin_unlock(&mc.lock);
> -			/* We set mc.moving_task later */
> -
> -			ret = mem_cgroup_precharge_mc(mm);
> -			if (ret)
> -				mem_cgroup_clear_mc();
> -		}
> -		mmput(mm);
> +	VM_BUG_ON(from == memcg);
> +
> +	mm = get_task_mm(p);
> +	if (!mm)
> +		return 0;
> +	/* We move charges only when we move a owner of the mm */
> +	if (mm->owner == p) {
> +		VM_BUG_ON(mc.from);
> +		VM_BUG_ON(mc.to);
> +		VM_BUG_ON(mc.precharge);
> +		VM_BUG_ON(mc.moved_charge);
> +		VM_BUG_ON(mc.moved_swap);
> +
> +		spin_lock(&mc.lock);
> +		mc.from = from;
> +		mc.to = memcg;
> +		mc.flags = move_flags;
> +		spin_unlock(&mc.lock);
> +		/* We set mc.moving_task later */
> +
> +		ret = mem_cgroup_precharge_mc(mm);
> +		if (ret)
> +			mem_cgroup_clear_mc();
>  	}
> +	mmput(mm);
>  	return ret;
>  }
>  
> -- 
> 2.4.0
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 59+ messages in thread


* Re: [PATCH 3/7] memcg: immigrate charges only when a threadgroup leader is moved
@ 2015-05-19 12:13     ` Michal Hocko
  0 siblings, 0 replies; 59+ messages in thread
From: Michal Hocko @ 2015-05-19 12:13 UTC (permalink / raw)
  To: Tejun Heo; +Cc: lizefan, cgroups, hannes, linux-mm

On Mon 18-05-15 15:49:51, Tejun Heo wrote:
> If move_charge flag is set, memcg tries to move memory charges to the
> destnation css.  The current implementation migrates memory whenever
> any thread of a process is migrated making the behavior somewhat
> arbitrary. 

This is not true. We have:
                mm = get_task_mm(p);
                if (!mm)
                        return 0;
                /* We move charges only when we move a owner of the mm */
                if (mm->owner == p) {

So we are ignoring threads which are not the owner of the mm struct, and
that should be the thread group leader AFAICS.

mm_update_next_owner is rather complex (maybe too much and it would
deserve some attention) so there might really be some corner cases but
the whole memcg code relies on mm->owner rather than thread group leader
so I would keep the same logic here.

> Let's tie memory operations to the threadgroup leader so
> that memory is migrated only when the leader is migrated.

This would lead to another strange behavior when the group leader is not
the owner (if that is possible at all) and the memory wouldn't get
migrated at all.

I am trying to wrap my head around mm_update_next_owner and maybe we can
change it to use the thread group leader but this needs more thinking...

> While this is a behavior change, given the inherent fuzziness, this
> change is not too likely to be noticed and allows us to clearly define
> who owns the memory (always the leader) and helps the planned atomic
> multi-process migration.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> ---
>  mm/memcontrol.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b1b834d..74fcea3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5014,6 +5014,9 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
>  		return 0;
>  
>  	p = cgroup_taskset_first(tset);
> +	if (!thread_group_leader(p))
> +		return 0;
> +
>  	from = mem_cgroup_from_task(p);
>  
>  	VM_BUG_ON(from == memcg);
> -- 
> 2.4.0
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 59+ messages in thread


* Re: [PATCH 3/7] memcg: immigrate charges only when a threadgroup leader is moved
@ 2015-05-19 13:10       ` Michal Hocko
  0 siblings, 0 replies; 59+ messages in thread
From: Michal Hocko @ 2015-05-19 13:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: lizefan, cgroups, hannes, linux-mm

On Tue 19-05-15 14:13:21, Michal Hocko wrote:
> On Mon 18-05-15 15:49:51, Tejun Heo wrote:
> > If move_charge flag is set, memcg tries to move memory charges to the
> > destination css.  The current implementation migrates memory whenever
> > any thread of a process is migrated making the behavior somewhat
> > arbitrary. 
> 
> This is not true. We have:
>                 mm = get_task_mm(p);
>                 if (!mm)
>                         return 0;
>                 /* We move charges only when we move a owner of the mm */
>                 if (mm->owner == p) {
> 
> So we are ignoring threads which are not owner of the mm struct and that
> should be the thread group leader AFAICS.
> 
> mm_update_next_owner is rather complex (maybe too much and it would
> deserve some attention) so there might really be some corner cases but
> the whole memcg code relies on mm->owner rather than thread group leader
> so I would keep the same logic here.
> 
> > Let's tie memory operations to the threadgroup leader so
> > that memory is migrated only when the leader is migrated.
> 
> This would lead to another strange behavior when the group leader is not
> owner (if that is possible at all) and the memory wouldn't get migrated
> at all.
> 
> I am trying to wrap my head around mm_update_next_owner and maybe we can
> change it to use the thread group leader but this needs more thinking...

OK, I guess I see the reason now. CLONE_VM doesn't imply CLONE_THREAD
so we can have a separate task (with its own group leader) which
handles its own signals while it still shares the address space. So we
cannot really use the group leader here.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 59+ messages in thread


* Re: [PATCH 3/7] memcg: immigrate charges only when a threadgroup leader is moved
@ 2015-05-19 21:27       ` Tejun Heo
  0 siblings, 0 replies; 59+ messages in thread
From: Tejun Heo @ 2015-05-19 21:27 UTC (permalink / raw)
  To: Michal Hocko; +Cc: lizefan, cgroups, hannes, linux-mm

Hello, Michal.

On Tue, May 19, 2015 at 02:13:21PM +0200, Michal Hocko wrote:
> This is not true. We have:
>                 mm = get_task_mm(p);
>                 if (!mm)
>                         return 0;
>                 /* We move charges only when we move a owner of the mm */
>                 if (mm->owner == p) {

Ah, missed that part.

> So we are ignoring threads which are not owner of the mm struct and that
> should be the thread group leader AFAICS.
> 
> mm_update_next_owner is rather complex (maybe too much and it would
> deserve some attention) so there might really be some corner cases but
> the whole memcg code relies on mm->owner rather than thread group leader
> so I would keep the same logic here.
> 
> > Let's tie memory operations to the threadgroup leader so
> > that memory is migrated only when the leader is migrated.
> 
> This would lead to another strange behavior when the group leader is not
> owner (if that is possible at all) and the memory wouldn't get migrated
> at all.

Hmmm... is it guaranteed that if a threadgroup owns an mm, the mm's
owner would be the threadgroup leader?  If not, the current code is
broken too, as it always takes the first member, which is the
threadgroup leader, and if that's not the mm owner we may skip
charge immigration while migrating the whole process.

I suppose the right thing to do here is to iterate the taskset and
find the mm owner?
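
In code, that search might look something like the following kernel-style
sketch (untested pseudocode; `cgroup_taskset_for_each()` is the existing
taskset iterator, the helper name is made up for illustration):

```c
/* hypothetical helper: find the task owning an mm among the migrating set */
static struct task_struct *tset_find_mm_owner(struct cgroup_taskset *tset)
{
	struct task_struct *task;

	cgroup_taskset_for_each(task, tset) {
		struct mm_struct *mm = task->mm;

		if (mm && mm->owner == task)
			return task;
	}
	return NULL;
}
```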

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 59+ messages in thread


* Re: [PATCH 3/7] memcg: immigrate charges only when a threadgroup leader is moved
  2015-05-19 21:27       ` Tejun Heo
  (?)
@ 2015-05-20 13:10       ` Michal Hocko
  2015-05-20 13:21         ` Michal Hocko
  -1 siblings, 1 reply; 59+ messages in thread
From: Michal Hocko @ 2015-05-20 13:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: lizefan, cgroups, hannes, linux-mm, Oleg Nesterov

On Tue 19-05-15 17:27:54, Tejun Heo wrote:
> Hello, Michal.
> 
> On Tue, May 19, 2015 at 02:13:21PM +0200, Michal Hocko wrote:
> > This is not true. We have:
> >                 mm = get_task_mm(p);
> >                 if (!mm)
> >                         return 0;
> >                 /* We move charges only when we move a owner of the mm */
> >                 if (mm->owner == p) {
> 
> Ah, missed that part.
> 
> > So we are ignoring threads which are not owner of the mm struct and that
> > should be the thread group leader AFAICS.
> > 
> > mm_update_next_owner is rather complex (maybe too much and it would
> > deserve some attention) so there might really be some corner cases but
> > the whole memcg code relies on mm->owner rather than thread group leader
> > so I would keep the same logic here.
> > 
> > > Let's tie memory operations to the threadgroup leader so
> > > that memory is migrated only when the leader is migrated.
> > 
> > This would lead to another strange behavior when the group leader is not
> > owner (if that is possible at all) and the memory wouldn't get migrated
> > at all.
> 
> Hmmm... is it guaranteed that if a threadgroup owns a mm, the mm's
> owner would be the threadgroup leader? 

That is a good question. As I've said, I would expect it to be the thread
group leader, but 4cd1a8fc3d3c ("memcg: fix possible panic when
CONFIG_MM_OWNER=y") confused me by claiming:
"
    Also, the owner member comment description is wrong. mm->owner does
    not necessarily point to the thread group leader.
"

But now I am looking closer into mm_update_next_owner. for_each_process
should see only thread group leaders. p->{real_parent->}children
siblings search should return group leaders as well AFAICS.

But I am completely lost in the exit code paths. E.g. what happens
when the thread group leader exits and the other threads are still
alive? I would expect another thread would be chosen as a new leader and
siblings would be updated. But I cannot find that code. Maybe the
original leader just waits for all other threads to terminate and stays
in the linked lists.

The scary comment for has_group_leader_pid suggests that a thread might
have the real pid without being the group leader. /me confused

Something for Oleg I guess.

> If not, the current code is
> broken too as it always takes the first member which is the
> threadgroup leader and if that's not the mm owner we may skip
> immigration while migrating the whole process.
> 
> I suppose the right thing to do here is iterating the taskset and find
> the mm owner?
> 
> Thanks.
> 
> -- 
> tejun

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/7] memcg: immigrate charges only when a threadgroup leader is moved
  2015-05-20 13:10       ` Michal Hocko
@ 2015-05-20 13:21         ` Michal Hocko
  2015-05-20 17:53             ` Oleg Nesterov
  0 siblings, 1 reply; 59+ messages in thread
From: Michal Hocko @ 2015-05-20 13:21 UTC (permalink / raw)
  To: Tejun Heo; +Cc: lizefan, cgroups, hannes, linux-mm, Oleg Nesterov

On Wed 20-05-15 15:10:44, Michal Hocko wrote:
[...]
> But I am completely lost in the exit code paths. E.g. what happens
> when the thread group leader exits and the other threads are still
> alive? I would expect another thread would be chosen as a new leader and
> siblings would be updated. But I cannot find that code. Maybe the
> original leader just waits for all other threads to terminate and stay
> in the linked lists.

I've tried a simple test where the main thread (group leader) calls
pthread_exit after it creates another thread. The other thread continues
to run and the leader is marked as Zombie:
$ ./t &
Main pid:2432 tid:2432
Exiting main thread
Secondary pid:2432 tid:2433......

$ ps ax | grep 2432
 2432 pts/3    Zl+    0:00 [t] <defunct>

So I assume the leader simply waits for its threads to finish and it
stays in the sibling list. __unhash_process seems to do the final
cleanup and unlinks the leader from the lists. Which means that
mm_update_next_owner never sees !group_leader. Is that correct Oleg?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/7] memcg: immigrate charges only when a threadgroup leader is moved
@ 2015-05-20 17:53             ` Oleg Nesterov
  0 siblings, 0 replies; 59+ messages in thread
From: Oleg Nesterov @ 2015-05-20 17:53 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tejun Heo, lizefan, cgroups, hannes, linux-mm

On 05/20, Michal Hocko wrote:
>
> So I assume the leader simply waits for its threads to finish and it
> stays in the sibling list. __unhash_process seems like it does the final
> cleanup and unlinks the leader from the lists. Which means that
> mm_update_next_owner never sees !group_leader. Is that correct Oleg?

Yes, yes, the group leader can't go away until the whole thread-group dies.

But can't we kill mm->owner somehow? I mean, turn it into something else,
ideally into "struct mem_cgroup *" although I doubt this is possible.

It would be nice to kill mm_update_next_owner()/etc, this looks really
ugly. We only need it for mem_cgroup_from_task(), and it would be much
cleaner to have mem_cgroup_from_mm(struct mm_struct *mm), imho.

Oleg.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/7] memcg: immigrate charges only when a threadgroup leader is moved
  2015-05-20 17:53             ` Oleg Nesterov
@ 2015-05-20 20:22               ` Michal Hocko
  -1 siblings, 0 replies; 59+ messages in thread
From: Michal Hocko @ 2015-05-20 20:22 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Tejun Heo, lizefan, cgroups, hannes, linux-mm

On Wed 20-05-15 19:53:02, Oleg Nesterov wrote:
> On 05/20, Michal Hocko wrote:
> >
> > So I assume the leader simply waits for its threads to finish and it
> > stays in the sibling list. __unhash_process seems like it does the final
> > cleanup and unlinks the leader from the lists. Which means that
> > mm_update_next_owner never sees !group_leader. Is that correct Oleg?
> 
> Yes, yes, the group leader can't go away until the whole thread-group dies.

OK, then we should have a guarantee that mm->owner is always thread
group leader, right?

> But can't we kill mm->owner somehow?

I would be happy about that. But it is not that simple.

> I mean, turn it into something else,
> ideally into "struct mem_cgroup *" although I doubt this is possible.

Sounds like a good idea but... it duplicates the cgroup tracking into
two places and that asks for troubles. On the other hand we are doing
that already because mm->owner might be in a different cgroup than the
current. However, this is an inherent problem because CLONE_VM doesn't
imply CLONE_THREAD. So in the end it doesn't look much worse IMO.
We will lose the "this task is in charge" aspect and that would be a
user space visible change, but I am not sure how much of a problem it
is. Maybe somebody is (ab)using this to work around the restriction
that all threads are in the same cgroup.

From the implementation POV it even looks easier because we just have to
hook to fork (pin the memcg on dup_mm), to attach to change the memcg
and to mmput to unpin the memcg.
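
Concretely, the hooks I have in mind would look something like this toy model (userspace C with simplified names; css_get/css_put stand in for the real pinning and all function names are made up for illustration):

```c
/*
 * Toy model of the pin/unpin lifetime scheme: dup_mm pins the memcg,
 * cgroup attach moves the pin to the new memcg, the final mmput drops
 * it.  Plain userspace C, not kernel code.
 */
#include <stddef.h>

struct mem_cgroup { int refcnt; };

struct mm_struct { struct mem_cgroup *memcg; };

static void css_get(struct mem_cgroup *cg) { cg->refcnt++; }
static void css_put(struct mem_cgroup *cg) { cg->refcnt--; }

/* fork path: the new mm pins the parent's memcg */
static void dup_mm_memcg(struct mm_struct *mm, struct mem_cgroup *cg)
{
	css_get(cg);
	mm->memcg = cg;
}

/* cgroup attach path: swap the cached pointer, moving the pin */
static void mm_attach_memcg(struct mm_struct *mm, struct mem_cgroup *to)
{
	css_get(to);
	css_put(mm->memcg);
	mm->memcg = to;
}

/* last mmput: drop the pin */
static void mm_drop_memcg(struct mm_struct *mm)
{
	css_put(mm->memcg);
	mm->memcg = NULL;
}
```
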

I will think about that some more.

> It would be nice to kill mm_update_next_owner()/etc, this looks really
> ugly. We only need it for mem_cgroup_from_task(), and it would be much
> more clean to have mem_cgroup_from_mm(struct mm_struct *mm), imho.
> 
> Oleg.
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/7] memcg: immigrate charges only when a threadgroup leader is moved
@ 2015-05-21 14:12     ` Michal Hocko
  0 siblings, 0 replies; 59+ messages in thread
From: Michal Hocko @ 2015-05-21 14:12 UTC (permalink / raw)
  To: Tejun Heo; +Cc: lizefan, cgroups, hannes, linux-mm

On Mon 18-05-15 15:49:51, Tejun Heo wrote:
> If move_charge flag is set, memcg tries to move memory charges to the
> destnation css.  The current implementation migrates memory whenever
> any thread of a process is migrated making the behavior somewhat
> arbitrary.  Let's tie memory operations to the threadgroup leader so
> that memory is migrated only when the leader is migrated.
> 
> While this is a behavior change, given the inherent fuziness, this
> change is not too likely to be noticed and allows us to clearly define
> who owns the memory (always the leader) and helps the planned atomic
> multi-process migration.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>

OK, I guess the discussion with Oleg confirmed that the patch is not
really needed because mm_struct->owner check implies thread group
leader. This should be sufficient for your purpose Tejun, right?

> ---
>  mm/memcontrol.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b1b834d..74fcea3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5014,6 +5014,9 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
>  		return 0;
>  
>  	p = cgroup_taskset_first(tset);
> +	if (!thread_group_leader(p))
> +		return 0;
> +
>  	from = mem_cgroup_from_task(p);
>  
>  	VM_BUG_ON(from == memcg);
> -- 
> 2.4.0
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/7] memcg: immigrate charges only when a threadgroup leader is moved
@ 2015-05-21 17:22                 ` Johannes Weiner
  0 siblings, 0 replies; 59+ messages in thread
From: Johannes Weiner @ 2015-05-21 17:22 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Oleg Nesterov, Tejun Heo, lizefan, cgroups, linux-mm

On Wed, May 20, 2015 at 10:22:21PM +0200, Michal Hocko wrote:
> On Wed 20-05-15 19:53:02, Oleg Nesterov wrote:
> > On 05/20, Michal Hocko wrote:
> > >
> > > So I assume the leader simply waits for its threads to finish and it
> > > stays in the sibling list. __unhash_process seems like it does the final
> > > cleanup and unlinks the leader from the lists. Which means that
> > > mm_update_next_owner never sees !group_leader. Is that correct Oleg?
> > 
> > Yes, yes, the group leader can't go away until the whole thread-group dies.
> 
> OK, then we should have a guarantee that mm->owner is always thread
> group leader, right?
> 
> > But can't we kill mm->owner somehow?
> 
> I would be happy about that. But it is not that simple.
> 
> > I mean, turn it into something else,
> > ideally into "struct mem_cgroup *" although I doubt this is possible.
> 
> Sounds like a good idea but... it duplicates the cgroup tracking into
> two places and that asks for troubles. On the other hand we are doing
> that already because mm->owner might be in a different cgroup than the
> current. However, this is an inherent problem because CLONE_VM doesn't
> imply CLONE_THREAD. So in the end it doesn't look much worse IMO.
> We will lose the "this task is in charge" aspect and that would
> be a user space visible change, but I am not sure how much of a
> problem it is. Maybe somebody is (ab)using this to work around the
> restriction that all threads are in the same cgroup.

If mm->owner is currently always the threadgroup leader, it should be
fairly straight forward to maintain mm->memcg on all events that move
any threadgroup leader between cgroups, without having mm->owner, no?

It would have a lot of benefits for sure.  The code would be simpler,
but it would also reduce some of the cost that Mel is observing inside
__mem_cgroup_count_vm_event(), by reducing one level of indirection.
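
In a toy model (userspace C with simplified stand-in types, not the kernel's), the difference is just one dependent load versus two:

```c
/*
 * Model of the two lookup schemes under discussion: going through
 * mm->owner (a task) versus caching the memcg pointer directly in
 * the mm.  All type and field names are simplified stand-ins.
 */
#include <stddef.h>

struct mem_cgroup { int id; };

struct task_struct { struct mem_cgroup *memcg; };

struct mm_struct {
	struct task_struct *owner;	/* current scheme: task indirection */
	struct mem_cgroup *memcg;	/* proposed scheme: cached pointer  */
};

/* current scheme: two dependent loads (mm->owner, then owner->memcg) */
static struct mem_cgroup *memcg_via_owner(struct mm_struct *mm)
{
	return mm->owner ? mm->owner->memcg : NULL;
}

/* proposed scheme: one load, and no exiting-owner task to chase */
static struct mem_cgroup *memcg_via_mm(struct mm_struct *mm)
{
	return mm->memcg;
}
```

The cached pointer keeps answering the question callers actually ask even after the owner task is gone, which is where the simplification (and the saved indirection) comes from.
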

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/7] memcg: immigrate charges only when a threadgroup leader is moved
@ 2015-05-21 19:27                 ` Oleg Nesterov
  0 siblings, 0 replies; 59+ messages in thread
From: Oleg Nesterov @ 2015-05-21 19:27 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tejun Heo, lizefan, cgroups, hannes, linux-mm

On 05/20, Michal Hocko wrote:
>
> On Wed 20-05-15 19:53:02, Oleg Nesterov wrote:
> >
> > Yes, yes, the group leader can't go away until the whole thread-group dies.
>
> OK, then we should have a guarantee that mm->owner is always thread
> group leader, right?

No, please note that the exiting leader does exit_mm()->mm_update_next_owner()
and this changes mm->owner.

Btw, this connects to other potential cleanups... task_struct->mm looks
a bit strange, we probably want to move it into signal_struct->mm and
make exit_mm/etc per-process. But this is not trivial, and off-topic.

Offtopic, because the exiting leader has to call mm_update_next_owner()
in any case. Simply because it does cgroup_exit() after that, so
get_mem_cgroup_from_mm() can't work if mm->owner is zombie.

This also means that get_mem_cgroup_from_mm() can race with the exiting
mm->owner unless I missed something.

> > But can't we kill mm->owner somehow?
>
> I would be happy about that. But it is not that simple.

Yes, yes, I understand.

Oleg.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/7] memcg: immigrate charges only when a threadgroup leader is moved
@ 2015-05-21 22:09       ` Tejun Heo
  0 siblings, 0 replies; 59+ messages in thread
From: Tejun Heo @ 2015-05-21 22:09 UTC (permalink / raw)
  To: Michal Hocko; +Cc: lizefan, cgroups, hannes, linux-mm

On Thu, May 21, 2015 at 04:12:26PM +0200, Michal Hocko wrote:
> On Mon 18-05-15 15:49:51, Tejun Heo wrote:
> > If move_charge flag is set, memcg tries to move memory charges to the
> > destnation css.  The current implementation migrates memory whenever
> > any thread of a process is migrated making the behavior somewhat
> > arbitrary.  Let's tie memory operations to the threadgroup leader so
> > that memory is migrated only when the leader is migrated.
> > 
> > While this is a behavior change, given the inherent fuziness, this
> > change is not too likely to be noticed and allows us to clearly define
> > who owns the memory (always the leader) and helps the planned atomic
> > multi-process migration.
> > 
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Michal Hocko <mhocko@suse.cz>
> 
> OK, I guess the discussion with Oleg confirmed that the patch is not
> really needed because mm_struct->owner check implies thread group
> leader. This should be sufficient for your purpose Tejun, right?

Hmmm... we still need an update so that it actually iterates the
leaders to find the owner, as the first-in-taskset == leader assumption
is going away; but yeah, this patch in itself can go away.  I'll update
the next patch accordingly.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/7] memcg: immigrate charges only when a threadgroup leader is moved
  2015-05-21 17:22                 ` Johannes Weiner
  (?)
@ 2015-05-22  9:34                 ` Michal Hocko
  -1 siblings, 0 replies; 59+ messages in thread
From: Michal Hocko @ 2015-05-22  9:34 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Oleg Nesterov, Tejun Heo, lizefan, cgroups, linux-mm

On Thu 21-05-15 13:22:17, Johannes Weiner wrote:
> On Wed, May 20, 2015 at 10:22:21PM +0200, Michal Hocko wrote:
> > On Wed 20-05-15 19:53:02, Oleg Nesterov wrote:
> > > On 05/20, Michal Hocko wrote:
> > > >
> > > > So I assume the leader simply waits for its threads to finish and it
> > > > stays in the sibling list. __unhash_process seems like it does the final
> > > > cleanup and unlinks the leader from the lists. Which means that
> > > > mm_update_next_owner never sees !group_leader. Is that correct Oleg?
> > > 
> > > Yes, yes, the group leader can't go away until the whole thread-group dies.
> > 
> > OK, then we should have a guarantee that mm->owner is always thread
> > group leader, right?
> > 
> > > But can't we kill mm->owner somehow?
> > 
> > I would be happy about that. But it is not that simple.
> > 
> > > I mean, turn it into something else,
> > > ideally into "struct mem_cgroup *" although I doubt this is possible.
> > 
> > Sounds like a good idea but... it duplicates the cgroup tracking into
> > two places and that asks for troubles. On the other hand we are doing
> > that already because mm->owner might be in a different cgroup than the
> > current. However, this is an inherent problem because CLONE_VM doesn't
> > imply CLONE_THREAD. So in the end it doesn't look much worse IMO.
> > We will lose the "this task is in charge" aspect and that would
> > be a user space visible change, but I am not sure how much of a
> > problem it is. Maybe somebody is (ab)using this to work around the
> > restriction that all threads are in the same cgroup.
> 
> If mm->owner is currently always the threadgroup leader, it should be
> fairly straight forward to maintain mm->memcg on all events that move
> any threadgroup leader between cgroups, without having mm->owner, no?

I have a tentative patch for that. It is fairly straightforward and it
even reduces the code size. I plan to post it early next week after it
gets some testing. The primary thing I am worried about is the user
visible behavior change, though.

> It would have a lot of benefits for sure.  The code would be simpler,
> but it would also reduce some of the cost that Mel is observing inside
> __mem_cgroup_count_vm_event(), by reducing one level of indirection.

Agreed!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/7] memcg: immigrate charges only when a threadgroup leader is moved
@ 2015-05-22  9:36                   ` Michal Hocko
  0 siblings, 0 replies; 59+ messages in thread
From: Michal Hocko @ 2015-05-22  9:36 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Tejun Heo, lizefan, cgroups, hannes, linux-mm

On Thu 21-05-15 21:27:16, Oleg Nesterov wrote:
> On 05/20, Michal Hocko wrote:
> >
> > On Wed 20-05-15 19:53:02, Oleg Nesterov wrote:
> > >
> > > Yes, yes, the group leader can't go away until the whole thread-group dies.
> >
> > OK, then we should have a guarantee that mm->owner is always thread
> > group leader, right?
> 
> No, please note that the exiting leader does exit_mm()->mm_update_next_owner()
> and this changes mm->owner.

I am confused now. Yeah it changes the owner but the new one will be
again the thread group leader, right?
 
> Btw, this connects to other potential cleanups... task_struct->mm looks
> a bit strange, we probably want to move it into signal_struct->mm and
> make exit_mm/etc per-process. But this is not trivial, and off-topic.

I am not sure this is a good idea but I would have to think about this
some more. Let's not distract this email thread and discuss it in a
separate thread please.

> Offtopic, because the exiting leader has to call mm_update_next_owner()
> in any case. Simply because it does cgroup_exit() after that, so
> get_mem_cgroup_from_mm() can't work if mm->owner is zombie.
> 
> This also means that get_mem_cgroup_from_mm() can race with the exiting
> mm->owner unless I missed something.
> 
> > > But can't we kill mm->owner somehow?
> >
> > I would be happy about that. But it is not that simple.
> 
> Yes, yes, I understand.
> 
> Oleg.
> 

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/7] memcg: immigrate charges only when a threadgroup leader is moved
@ 2015-05-22 16:29                     ` Oleg Nesterov
  0 siblings, 0 replies; 59+ messages in thread
From: Oleg Nesterov @ 2015-05-22 16:29 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tejun Heo, lizefan, cgroups, hannes, linux-mm

On 05/22, Michal Hocko wrote:
>
> On Thu 21-05-15 21:27:16, Oleg Nesterov wrote:
> > On 05/20, Michal Hocko wrote:
> > >
> > > On Wed 20-05-15 19:53:02, Oleg Nesterov wrote:
> > > >
> > > > Yes, yes, the group leader can't go away until the whole thread-group dies.
> > >
> > > OK, then we should have a guarantee that mm->owner is always thread
> > > group leader, right?
> >
> > No, please note that the exiting leader does exit_mm()->mm_update_next_owner()
> > and this changes mm->owner.
>
> I am confused now. Yeah it changes the owner but the new one will be
> again the thread group leader, right?

Why?

In the likely case (if CLONE_VM without CLONE_THREAD was not used) the
last for_each_process() in mm_update_next_owner() will find another thread
from the same group.
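
As a rough userspace sketch of that search order (simplified made-up names, not the real function; the real one walks children, then siblings, then falls back to scanning all tasks):

```c
/*
 * Model of mm_update_next_owner()'s candidate search: try the old
 * owner's children, then its siblings, and only then fall back to
 * scanning every task that shares the mm - which, in the common case
 * (no CLONE_VM without CLONE_THREAD), finds another thread of the old
 * owner's own group.  Userspace C, simplified names.
 */
#include <stddef.h>

struct mm { int dummy; };

struct task {
	struct mm *mm;
	int tgid;		/* thread-group id           */
	int parent_tgid;	/* group id of the parent    */
};

static struct task *next_owner(struct mm *mm, struct task *tasks, int n,
			       struct task *old)
{
	int i;

	/* 1) a child of the old owner using this mm */
	for (i = 0; i < n; i++)
		if (tasks[i].mm == mm && tasks[i].parent_tgid == old->tgid)
			return &tasks[i];
	/* 2) a sibling of the old owner using this mm */
	for (i = 0; i < n; i++)
		if (tasks[i].mm == mm && &tasks[i] != old &&
		    tasks[i].parent_tgid == old->parent_tgid)
			return &tasks[i];
	/* 3) the final scan: any remaining task sharing the mm */
	for (i = 0; i < n; i++)
		if (tasks[i].mm == mm && &tasks[i] != old)
			return &tasks[i];
	return NULL;
}
```
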

Oh. I think mm_update_next_owner() needs some cleanups. Perhaps I'll send
the patch today.

> > Btw, this connects to other potential cleanups... task_struct->mm looks
> > a bit strange, we probably want to move it into signal_struct->mm and
> > make exit_mm/etc per-process. But this is not trivial, and off-topic.
>
> I am not sure this is a good idea but I would have to think about this
> some more. Let's not distract this email thread and discuss it in a
> separate thread please.

Yes, yes. I mentioned this to explain that we can't keep the exited
leader as mm->owner in any case, and to report that
get_mem_cgroup_from_mm() looks racy.

Oleg.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/7] memcg: immigrate charges only when a threadgroup leader is moved
  2015-05-22 16:29                     ` Oleg Nesterov
  (?)
@ 2015-05-22 16:57                     ` Michal Hocko
  2015-05-22 18:30                       ` Oleg Nesterov
  -1 siblings, 1 reply; 59+ messages in thread
From: Michal Hocko @ 2015-05-22 16:57 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Tejun Heo, lizefan, cgroups, hannes, linux-mm

On Fri 22-05-15 18:29:00, Oleg Nesterov wrote:
> On 05/22, Michal Hocko wrote:
> >
> > On Thu 21-05-15 21:27:16, Oleg Nesterov wrote:
> > > On 05/20, Michal Hocko wrote:
> > > >
> > > > On Wed 20-05-15 19:53:02, Oleg Nesterov wrote:
> > > > >
> > > > > Yes, yes, the group leader can't go away until the whole thread-group dies.
> > > >
> > > > OK, then we should have a guarantee that mm->owner is always thread
> > > > group leader, right?
> > >
> > > No, please note that the exiting leader does exit_mm()->mm_update_next_owner()
> > > and this changes mm->owner.
> >
> > I am confused now. Yeah it changes the owner but the new one will be
> > again the thread group leader, right?
> 
> Why?
> 
> In the likely case (if CLONE_VM without CLONE_THREAD was not used) the
> last for_each_process() in mm_update_next_owner() will find another thread
> from the same group.

My understanding was that for_each_process will iterate only over
processes (represented by their thread group leaders). That was the reason
I was asking about the thread group leader exiting before the other threads.
I am sorry, but let me ask again: how would we get !group_leader from
p->{real_parent->}sibling or from for_each_process?

> Oh. I think mm_update_next_owner() needs some cleanups. Perhaps I'll send
> the patch today.

Please hold on, I have a patch to get rid of the owner altogether. I
will post it sometime next week. Let's see whether this is a viable
option. If not, then we can clean this up.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH 0/3] memcg: mm_update_next_owner() cleanups
  2015-05-22 16:29                     ` Oleg Nesterov
  (?)
  (?)
@ 2015-05-22 18:20                     ` Oleg Nesterov
  2015-05-22 18:21                         ` Oleg Nesterov
                                         ` (3 more replies)
  -1 siblings, 4 replies; 59+ messages in thread
From: Oleg Nesterov @ 2015-05-22 18:20 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tejun Heo, lizefan, cgroups, hannes, linux-mm

On 05/22, Oleg Nesterov wrote:
>
> Oh. I think mm_update_next_owner() needs some cleanups. Perhaps I'll send
> the patch today.

At least something like this...

Although I still think we can just remove this code. Plus this series
was only compile-tested, so please feel free to ignore.

Oleg.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH 1/3] memcg: introduce assign_new_owner()
@ 2015-05-22 18:21                         ` Oleg Nesterov
  0 siblings, 0 replies; 59+ messages in thread
From: Oleg Nesterov @ 2015-05-22 18:21 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tejun Heo, lizefan, cgroups, hannes, linux-mm

The code under "assign_new_owner" looks very ugly and suboptimal.

We do not really need get_task_struct/put_task_struct(); we can
simply recheck/change mm->owner under tasklist_lock. And we do not
want to restart from the very beginning if ->mm was changed by the
time we take task_lock(); we can simply continue (as long as we do
not drop tasklist_lock).

Just move this code into the new simple helper, assign_new_owner().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c |   56 ++++++++++++++++++++++++++------------------------------
 1 files changed, 26 insertions(+), 30 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 22fcc05..4d446ab 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -293,6 +293,23 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
 }
 
 #ifdef CONFIG_MEMCG
+static bool assign_new_owner(struct mm_struct *mm, struct task_struct *c)
+{
+	bool ret = false;
+
+	if (c->mm != mm)
+		return ret;
+
+	task_lock(c); /* protects c->mm from changing */
+	if (c->mm == mm) {
+		mm->owner = c;
+		ret = true;
+	}
+	task_unlock(c);
+
+	return ret;
+}
+
 /*
  * A task is exiting.   If it owned this mm, find a new owner for the mm.
  */
@@ -300,7 +317,6 @@ void mm_update_next_owner(struct mm_struct *mm)
 {
 	struct task_struct *c, *g, *p = current;
 
-retry:
 	/*
 	 * If the exiting or execing task is not the owner, it's
 	 * someone else's problem.
@@ -322,16 +338,16 @@ retry:
 	 * Search in the children
 	 */
 	list_for_each_entry(c, &p->children, sibling) {
-		if (c->mm == mm)
-			goto assign_new_owner;
+		if (assign_new_owner(mm, c))
+			goto done;
 	}
 
 	/*
 	 * Search in the siblings
 	 */
 	list_for_each_entry(c, &p->real_parent->children, sibling) {
-		if (c->mm == mm)
-			goto assign_new_owner;
+		if (assign_new_owner(mm, c))
+			goto done;
 	}
 
 	/*
@@ -341,42 +357,22 @@ retry:
 		if (g->flags & PF_KTHREAD)
 			continue;
 		for_each_thread(g, c) {
-			if (c->mm == mm)
-				goto assign_new_owner;
+			if (assign_new_owner(mm, c))
+				goto done;
 			if (c->mm)
 				break;
 		}
 	}
-	read_unlock(&tasklist_lock);
+
 	/*
 	 * We found no owner yet mm_users > 1: this implies that we are
 	 * most likely racing with swapoff (try_to_unuse()) or /proc or
 	 * ptrace or page migration (get_task_mm()).  Mark owner as NULL.
 	 */
 	mm->owner = NULL;
-	return;
-
-assign_new_owner:
-	BUG_ON(c == p);
-	get_task_struct(c);
-	/*
-	 * The task_lock protects c->mm from changing.
-	 * We always want mm->owner->mm == mm
-	 */
-	task_lock(c);
-	/*
-	 * Delay read_unlock() till we have the task_lock()
-	 * to ensure that c does not slip away underneath us
-	 */
+done:
 	read_unlock(&tasklist_lock);
-	if (c->mm != mm) {
-		task_unlock(c);
-		put_task_struct(c);
-		goto retry;
-	}
-	mm->owner = c;
-	task_unlock(c);
-	put_task_struct(c);
+	return;
 }
 #endif /* CONFIG_MEMCG */
 
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 2/3] memcg: change assign_new_owner() to consider the sub-threads
@ 2015-05-22 18:21                         ` Oleg Nesterov
  0 siblings, 0 replies; 59+ messages in thread
From: Oleg Nesterov @ 2015-05-22 18:21 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tejun Heo, lizefan, cgroups, hannes, linux-mm

mm_update_next_owner() checks the children and siblings first, but
it only inspects the group leaders, and thus this optimization won't
work if the leader is a zombie.

This is actually correct: the last for_each_process() loop will find
these children/siblings again, but it doesn't look consistent/clean.

Move the for_each_thread() logic from mm_update_next_owner() to
assign_new_owner(). We can also remove the "struct task_struct *c"
local.

Note that the next patch relies on this change.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c |   39 ++++++++++++++++++++-------------------
 1 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 4d446ab..1d1810d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -293,19 +293,24 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
 }
 
 #ifdef CONFIG_MEMCG
-static bool assign_new_owner(struct mm_struct *mm, struct task_struct *c)
+static bool assign_new_owner(struct mm_struct *mm, struct task_struct *g)
 {
+	struct task_struct *c;
 	bool ret = false;
 
-	if (c->mm != mm)
-		return ret;
+	for_each_thread(g, c) {
+		if (c->mm == mm) {
+			task_lock(c); /* protects c->mm from changing */
+			if (c->mm == mm) {
+				mm->owner = c;
+				ret = true;
+			}
+			task_unlock(c);
+		}
 
-	task_lock(c); /* protects c->mm from changing */
-	if (c->mm == mm) {
-		mm->owner = c;
-		ret = true;
+		if (ret || c->mm)
+			break;
 	}
-	task_unlock(c);
 
 	return ret;
 }
@@ -315,7 +320,7 @@ static bool assign_new_owner(struct mm_struct *mm, struct task_struct *c)
  */
 void mm_update_next_owner(struct mm_struct *mm)
 {
-	struct task_struct *c, *g, *p = current;
+	struct task_struct *g, *p = current;
 
 	/*
 	 * If the exiting or execing task is not the owner, it's
@@ -337,16 +342,16 @@ void mm_update_next_owner(struct mm_struct *mm)
 	/*
 	 * Search in the children
 	 */
-	list_for_each_entry(c, &p->children, sibling) {
-		if (assign_new_owner(mm, c))
+	list_for_each_entry(g, &p->children, sibling) {
+		if (assign_new_owner(mm, g))
 			goto done;
 	}
 
 	/*
 	 * Search in the siblings
 	 */
-	list_for_each_entry(c, &p->real_parent->children, sibling) {
-		if (assign_new_owner(mm, c))
+	list_for_each_entry(g, &p->real_parent->children, sibling) {
+		if (assign_new_owner(mm, g))
 			goto done;
 	}
 
@@ -356,12 +361,8 @@ void mm_update_next_owner(struct mm_struct *mm)
 	for_each_process(g) {
 		if (g->flags & PF_KTHREAD)
 			continue;
-		for_each_thread(g, c) {
-			if (assign_new_owner(mm, c))
-				goto done;
-			if (c->mm)
-				break;
-		}
+		if (assign_new_owner(mm, g))
+			goto done;
 	}
 
 	/*
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 3/3] memcg: change mm_update_next_owner() to search in sub-threads first
  2015-05-22 18:20                     ` [PATCH 0/3] memcg: mm_update_next_owner() cleanups Oleg Nesterov
  2015-05-22 18:21                         ` Oleg Nesterov
  2015-05-22 18:21                         ` Oleg Nesterov
@ 2015-05-22 18:21                       ` Oleg Nesterov
  2015-05-22 18:22                       ` [PATCH 0/3] memcg: mm_update_next_owner() cleanups Oleg Nesterov
  3 siblings, 0 replies; 59+ messages in thread
From: Oleg Nesterov @ 2015-05-22 18:21 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tejun Heo, lizefan, cgroups, hannes, linux-mm

mm_update_next_owner() checks the children and siblings to avoid
the "global" for_each_process() loop. It is not clear this makes any
sense, but it certainly makes sense to check our own sub-threads before
anything else.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 1d1810d..b1f7135 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -340,6 +340,12 @@ void mm_update_next_owner(struct mm_struct *mm)
 
 	read_lock(&tasklist_lock);
 	/*
+	 * Search in the sub-threads
+	 */
+	if (assign_new_owner(mm, p))
+		goto done;
+
+	/*
 	 * Search in the children
 	 */
 	list_for_each_entry(g, &p->children, sibling) {
@@ -359,7 +365,7 @@ void mm_update_next_owner(struct mm_struct *mm)
 	 * Search through everything else, we should not get here often.
 	 */
 	for_each_process(g) {
-		if (g->flags & PF_KTHREAD)
+		if (g == p || g->flags & PF_KTHREAD)
 			continue;
 		if (assign_new_owner(mm, g))
 			goto done;
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH 0/3] memcg: mm_update_next_owner() cleanups
  2015-05-22 18:20                     ` [PATCH 0/3] memcg: mm_update_next_owner() cleanups Oleg Nesterov
                                         ` (2 preceding siblings ...)
  2015-05-22 18:21                       ` [PATCH 3/3] memcg: change mm_update_next_owner() to search in sub-threads first Oleg Nesterov
@ 2015-05-22 18:22                       ` Oleg Nesterov
  3 siblings, 0 replies; 59+ messages in thread
From: Oleg Nesterov @ 2015-05-22 18:22 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tejun Heo, lizefan, cgroups, hannes, linux-mm

On 05/22, Oleg Nesterov wrote:
>
> On 05/22, Oleg Nesterov wrote:
> >
> > Oh. I think mm_update_next_owner() needs some cleanups. Perhaps I'll send
> > the patch today.
>
> At least something like this...
>
> Although I still think we can just remove this code. Plus this series
> was only compile tested, so please feel free to ignore.

Heh ;) sorry for the noise.

Yes, please ignore, I just noticed another email from you. Will reply
in a minute.

Oleg.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/7] memcg: immigrate charges only when a threadgroup leader is moved
  2015-05-22 16:57                     ` Michal Hocko
@ 2015-05-22 18:30                       ` Oleg Nesterov
  2015-05-25 16:06                           ` Michal Hocko
  0 siblings, 1 reply; 59+ messages in thread
From: Oleg Nesterov @ 2015-05-22 18:30 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tejun Heo, lizefan, cgroups, hannes, linux-mm

On 05/22, Michal Hocko wrote:
>
> On Fri 22-05-15 18:29:00, Oleg Nesterov wrote:
> >
> > In the likely case (if CLONE_VM without CLONE_THREAD was not used) the
> > last for_each_process() in mm_update_next_owner() will find another thread
> > from the same group.
>
> My understanding was that for_each_process will iterate only over
> processes (represented by the thread group leaders).

Yes. But note the inner for_each_thread() loop. And note that we need
this loop exactly because the leader can be a zombie.

> How would we get
> !group_leader from p->{real_parent->}sibling

As for children/siblings we can't get !group_leader, yes. And this is
actually not right ;) See the (self-nacked) 2/3 I just sent.

> > Oh. I think mm_update_next_owner() needs some cleanups. Perhaps I'll send
> > the patch today.
>
> Please hold on, I have a patch to get rid of the owner altogether. I
> will post it sometime next week. Let's see whether this is a viable
> option. If not, then we can clean this up.

Great. Please ignore 1-3 I already sent.

Oleg.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/7] memcg: immigrate charges only when a threadgroup leader is moved
@ 2015-05-25 16:06                           ` Michal Hocko
  0 siblings, 0 replies; 59+ messages in thread
From: Michal Hocko @ 2015-05-25 16:06 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Tejun Heo, lizefan, cgroups, hannes, linux-mm

On Fri 22-05-15 20:30:42, Oleg Nesterov wrote:
> On 05/22, Michal Hocko wrote:
> >
> > On Fri 22-05-15 18:29:00, Oleg Nesterov wrote:
> > >
> > > In the likely case (if CLONE_VM without CLONE_THREAD was not used) the
> > > last for_each_process() in mm_update_next_owner() will find another thread
> > > from the same group.
> >
> > My understanding was that for_each_process will iterate only over
> > processes (represented by the thread group leaders).
> 
> Yes. But note the inner for_each_thread() loop. And note that we need
> this loop exactly because the leader can be a zombie.

I was too vague, sorry about that. What I meant was that
for_each_process would pick up a group leader and the inner
for_each_thread will return it as the first element in the list. Since the
leader waits for the other threads, it should stay on the thread_node
list as well. But I might easily be wrong here because the whole thing
is really quite confusing, to be honest.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/7] memcg: immigrate charges only when a threadgroup leader is moved
@ 2015-05-25 17:06                             ` Oleg Nesterov
  0 siblings, 0 replies; 59+ messages in thread
From: Oleg Nesterov @ 2015-05-25 17:06 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tejun Heo, lizefan, cgroups, hannes, linux-mm

On 05/25, Michal Hocko wrote:
>
> On Fri 22-05-15 20:30:42, Oleg Nesterov wrote:
> > On 05/22, Michal Hocko wrote:
> > >
> > > On Fri 22-05-15 18:29:00, Oleg Nesterov wrote:
> > > >
> > > > In the likely case (if CLONE_VM without CLONE_THREAD was not used) the
> > > > last for_each_process() in mm_update_next_owner() will find another thread
> > > > from the same group.
> > >
> > > My understanding was that for_each_process will iterate only over
> > > processes (represented by the thread group leaders).
> >
> > Yes. But note the inner for_each_thread() loop. And note that we need
> > this loop exactly because the leader can be a zombie.
>
> I was too vague, sorry about that.

Looks like we confused each other somehow ;) Not sure I understand your
concerns...

But,

> What I meant was that
> for_each_process would pick up a group leader

Yes. In the case above it will find the caller (current) too,

> and the inner
> for_each_thread will return it as the first element in the list.

Yes, and this will be the "current" task. But current->mm == NULL, so
for_each_thread() will continue and find another thread which becomes
the new mm->owner.

Just in case, note the BUG_ON(c == p). I think that BUG_ON(p->mm) at
the start will look much better. This is what mm_update_next_owner()
actually assumes.

Oleg.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/7] memcg: immigrate charges only when a threadgroup leader is moved
@ 2015-05-26  7:16                               ` Michal Hocko
  0 siblings, 0 replies; 59+ messages in thread
From: Michal Hocko @ 2015-05-26  7:16 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Tejun Heo, lizefan, cgroups, hannes, linux-mm

On Mon 25-05-15 19:06:01, Oleg Nesterov wrote:
> On 05/25, Michal Hocko wrote:
> >
> > On Fri 22-05-15 20:30:42, Oleg Nesterov wrote:
> > > On 05/22, Michal Hocko wrote:
> > > >
> > > > On Fri 22-05-15 18:29:00, Oleg Nesterov wrote:
> > > > >
> > > > > In the likely case (if CLONE_VM without CLONE_THREAD was not used) the
> > > > > last for_each_process() in mm_update_next_owner() will find another thread
> > > > > from the same group.
> > > >
> > > > My understanding was that for_each_process will iterate only over
> > > > processes (represented by the thread group leaders).
> > >
> > > Yes. But note the inner for_each_thread() loop. And note that we need
> > > this loop exactly because the leader can be a zombie.
> >
> > I was too vague, sorry about that.
> 
> Looks like, we confused each other somehow ;) Not sure I understand your
> concerns...
> 
> But,
> 
> > What I meant was that
> > for_each_process would pick up a group leader
> 
> Yes. In the case above it will find the caller (current) too,
> 
> > and the inner
> > for_each_thread will return it as the first element in the list.
> 
> Yes, and this will be "current" task. But current->mm == NULL, so
> for_each_thread() will continue and find another thread which becomes
> the new mm->owner.

OK, this is the part I was missing: exit_mm() manages to set mm to NULL
before calling mm_update_next_owner(). Now it makes much more sense.
Thanks for your patience! It really seems like we are broken here.
The most reasonable way forward is to get rid of the owner. The patch
will go to the list today.

> Just in case, note the BUG_ON(c == p). I think that BUG_ON(p->mm) at
> the start will look much better. This is what mm_update_next_owner()
> actually assumes.
> 
> Oleg.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/7] memcg: immigrate charges only when a threadgroup leader is moved
@ 2015-05-26  7:16                               ` Michal Hocko
  0 siblings, 0 replies; 59+ messages in thread
From: Michal Hocko @ 2015-05-26  7:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Tejun Heo, lizefan-hv44wF8Li93QT0dZR+AlfA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, hannes-druUgvl0LCNAfugRpC6u6w,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Mon 25-05-15 19:06:01, Oleg Nesterov wrote:
> On 05/25, Michal Hocko wrote:
> >
> > On Fri 22-05-15 20:30:42, Oleg Nesterov wrote:
> > > On 05/22, Michal Hocko wrote:
> > > >
> > > > On Fri 22-05-15 18:29:00, Oleg Nesterov wrote:
> > > > >
> > > > > In the likely case (if CLONE_VM without CLONE_THREAD was not used) the
> > > > > last for_each_process() in mm_update_next_owner() will find another thread
> > > > > from the same group.
> > > >
> > > > My understanding was that for_each_process will iterate only over
> > > > processes (represented by the thread group leaders).
> > >
> > > Yes. But note the inner for_each_thread() loop. And note that we
> > > need this loop exactly because the leader can be a zombie.
> >
> > I was too vague, sorry about that.
> 
> Looks like we confused each other somehow ;) Not sure I understand your
> concerns...
> 
> But,
> 
> > What I meant was that
> > for_each_process would pick up a group leader
> 
> Yes. In the case above it will find the caller (current) too,
> 
> > and the inner
> > for_each_thread will return it as the first element in the list.
> 
> Yes, and this will be "current" task. But current->mm == NULL, so
> for_each_thread() will continue and find another thread which becomes
> the new mm->owner.

OK, this is the part I was missing. exit_mm() manages to set mm to NULL
before calling mm_update_next_owner(). Now it makes much more sense.
Thanks for your patience! It really seems like we are broken here.
The most reasonable way forward is to get rid of the owner. The patch
will go to the list today.

> Just in case, note the BUG_ON(c == p). I think that BUG_ON(p->mm) at
> the start will look much better. This is what mm_update_next_owner()
> actually assumes.
> 
> Oleg.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2015-05-26  7:16 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
2015-05-18 19:49 [PATCHSET cgroup/for-4.2] cgroup: make multi-process migration atomic Tejun Heo
2015-05-18 19:49 ` Tejun Heo
2015-05-18 19:49 ` [PATCH 1/7] cpuset: migrate memory only for threadgroup leaders Tejun Heo
2015-05-18 19:49   ` Tejun Heo
2015-05-18 19:49 ` [PATCH 2/7] memcg: restructure mem_cgroup_can_attach() Tejun Heo
2015-05-18 19:49   ` Tejun Heo
2015-05-19  9:03   ` Michal Hocko
2015-05-19  9:03     ` Michal Hocko
2015-05-18 19:49 ` [PATCH 3/7] memcg: immigrate charges only when a threadgroup leader is moved Tejun Heo
2015-05-18 19:49   ` Tejun Heo
2015-05-19 12:13   ` Michal Hocko
2015-05-19 12:13     ` Michal Hocko
2015-05-19 13:10     ` Michal Hocko
2015-05-19 13:10       ` Michal Hocko
2015-05-19 21:27     ` Tejun Heo
2015-05-19 21:27       ` Tejun Heo
2015-05-20 13:10       ` Michal Hocko
2015-05-20 13:21         ` Michal Hocko
2015-05-20 17:53           ` Oleg Nesterov
2015-05-20 17:53             ` Oleg Nesterov
2015-05-20 20:22             ` Michal Hocko
2015-05-20 20:22               ` Michal Hocko
2015-05-21 17:22               ` Johannes Weiner
2015-05-21 17:22                 ` Johannes Weiner
2015-05-22  9:34                 ` Michal Hocko
2015-05-21 19:27               ` Oleg Nesterov
2015-05-21 19:27                 ` Oleg Nesterov
2015-05-22  9:36                 ` Michal Hocko
2015-05-22  9:36                   ` Michal Hocko
2015-05-22 16:29                   ` Oleg Nesterov
2015-05-22 16:29                     ` Oleg Nesterov
2015-05-22 16:57                     ` Michal Hocko
2015-05-22 18:30                       ` Oleg Nesterov
2015-05-25 16:06                         ` Michal Hocko
2015-05-25 16:06                           ` Michal Hocko
2015-05-25 17:06                           ` Oleg Nesterov
2015-05-25 17:06                             ` Oleg Nesterov
2015-05-26  7:16                             ` Michal Hocko
2015-05-26  7:16                               ` Michal Hocko
2015-05-22 18:20                     ` [PATCH 0/3] memcg: mm_update_next_owner() cleanups Oleg Nesterov
2015-05-22 18:21                       ` [PATCH 1/3] memcg: introduce assign_new_owner() Oleg Nesterov
2015-05-22 18:21                         ` Oleg Nesterov
2015-05-22 18:21                       ` [PATCH 2/3] memcg: change assign_new_owner() to consider the sub-htreads Oleg Nesterov
2015-05-22 18:21                         ` Oleg Nesterov
2015-05-22 18:21                       ` [PATCH 3/3] memcg: change mm_update_next_owner() to search in sub-threads first Oleg Nesterov
2015-05-22 18:22                       ` [PATCH 0/3] memcg: mm_update_next_owner() cleanups Oleg Nesterov
2015-05-21 14:12   ` [PATCH 3/7] memcg: immigrate charges only when a threadgroup leader is moved Michal Hocko
2015-05-21 14:12     ` Michal Hocko
2015-05-21 22:09     ` Tejun Heo
2015-05-21 22:09       ` Tejun Heo
2015-05-18 19:49 ` [PATCH 4/7] cgroup, memcg, cpuset: implement cgroup_taskset_for_each_leader() Tejun Heo
2015-05-18 19:49   ` Tejun Heo
2015-05-18 19:49 ` [PATCH 5/7] reorder cgroup_migrate()'s parameters Tejun Heo
2015-05-18 19:49   ` Tejun Heo
2015-05-18 19:49 ` [PATCH 6/7] cgroup: separate out taskset operations from cgroup_migrate() Tejun Heo
2015-05-18 19:49   ` Tejun Heo
2015-05-18 19:49 ` [PATCH 7/7] cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically Tejun Heo
2015-05-18 19:49   ` Tejun Heo
2015-05-19  6:57 ` [PATCHSET cgroup/for-4.2] cgroup: make multi-process migration atomic Zefan Li
