linux-kernel.vger.kernel.org archive mirror
* [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and fix pids controller
@ 2015-10-10  3:29 Tejun Heo
  2015-10-10  3:29 ` [PATCH 01/14] cgroup: remove an unused parameter from cgroup_task_migrate() Tejun Heo
                   ` (14 more replies)
  0 siblings, 15 replies; 23+ messages in thread
From: Tejun Heo @ 2015-10-10  3:29 UTC (permalink / raw)
  To: lizefan, hannes; +Cc: cgroups, cyphar, linux-kernel, kernel-team

Hello,

cgroup currently disassociates a task from its cgroups on exit and
reassigns it to the root cgroup.  This behavior turns out to be
problematic for several reasons.

* Resources can't be tracked for zombies.  This breaks the pids
  controller as zombies escape resource restriction.  A cgroup can
  easily go way above its limits by creating a bunch of zombies.

* It's difficult to tell where zombies came from.  /proc/PID/cgroup
  gets reset to / on exit, so given a zombie it's difficult to tell
  which cgroup the zombie came from.

* It creates extra work for controllers for no reason.  cpu and
  perf_events controllers implement exit callbacks to switch the
  exiting task's membership to root when simply leaving it as-is
  would be enough.

Unfortunately, fixing this involves opening a few cans of worms.

* Decoupling a css_set's reference counting from tasks being on it so
  that a css_set can be pinned w/o tasks being on it, and decoupling
  css_set existence from whether a cgroup is populated so that pinning
  a css_set doesn't confuse populated state tracking and populated
  state can be used to decide whether certain operations are allowed.

* Making css task iteration drop css_set_rwsem between iteration steps
  so that internal locking is not exposed to iterator users and
  css_set_rwsem can be converted to a spinlock which can be grabbed
  from the task free path.

After this patchset, besides the pids controller being fixed, the
visible behavior isn't changed on traditional hierarchies; on the
default hierarchy, however, a zombie reports in /proc/PID/cgroup the
cgroup it belonged to at the time of exit.  If the cgroup gets removed
before the task is reaped, " (deleted)" is appended to the reported
path.

This patchset contains the following 14 patches.

 0001-cgroup-remove-an-unused-parameter-from-cgroup_task_m.patch
 0002-cgroup-make-cgroup-nr_populated-count-the-number-of-.patch
 0003-cgroup-replace-cgroup_has_tasks-with-cgroup_is_popul.patch
 0004-cgroup-move-check_for_release-invocation.patch
 0005-cgroup-relocate-cgroup_-try-get-put.patch
 0006-cgroup-make-css_sets-pin-the-associated-cgroups.patch
 0007-cgroup-make-cgroup_destroy_locked-test-cgroup_is_pop.patch
 0008-cgroup-keep-css_set-and-task-lists-in-chronological-.patch
 0009-cgroup-factor-out-css_set_move_task.patch
 0010-cgroup-reorganize-css_task_iter-functions.patch
 0011-cgroup-don-t-hold-css_set_rwsem-across-css-task-iter.patch
 0012-cgroup-make-css_set_rwsem-a-spinlock-and-rename-it-t.patch
 0013-cgroup-keep-zombies-associated-with-their-original-c.patch
 0014-cgroup-add-cgroup_subsys-free-method-and-use-it-to-f.patch

0001-0007 decouple populated state tracking from css_set existence and
allow css_sets to be pinned without tasks on them.

0008-0012 update css_set task iterator to not hold lock across
iteration steps and replace css_set_rwsem with a spinlock.

0013 makes zombies keep their cgroup associations.  0014 introduces
the ->free() method and fixes the pids controller.

The patchset is pretty lightly tested and I need to verify that the
corner cases behave as expected.

This patchset is on top of cgroup/for-4.4 a3e72739b7a7 ("cgroup: fix
too early usage of static_branch_disable()") and available in the
following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-zombies

diffstat follows.  Thanks.

 Documentation/cgroups/cgroups.txt           |    4 
 Documentation/cgroups/unified-hierarchy.txt |    4 
 include/linux/cgroup-defs.h                 |   16 
 include/linux/cgroup.h                      |   14 
 kernel/cgroup.c                             |  522 +++++++++++++++++-----------
 kernel/cgroup_pids.c                        |    8 
 kernel/cpuset.c                             |    2 
 kernel/events/core.c                        |   16 
 kernel/fork.c                               |    1 
 kernel/sched/core.c                         |   16 
 mm/memcontrol.c                             |    2 
 11 files changed, 354 insertions(+), 251 deletions(-)

--
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 01/14] cgroup: remove an unused parameter from cgroup_task_migrate()
  2015-10-10  3:29 [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and fix pids controller Tejun Heo
@ 2015-10-10  3:29 ` Tejun Heo
  2015-10-10  3:29 ` [PATCH 02/14] cgroup: make cgroup->nr_populated count the number of populated css_sets Tejun Heo
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2015-10-10  3:29 UTC (permalink / raw)
  To: lizefan, hannes; +Cc: cgroups, cyphar, linux-kernel, kernel-team, Tejun Heo

cgroup_task_migrate() no longer uses @old_cgrp.  Remove it.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index ae23814..49f30f1 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2235,14 +2235,12 @@ struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset)
 
 /**
  * cgroup_task_migrate - move a task from one cgroup to another.
- * @old_cgrp: the cgroup @tsk is being migrated from
  * @tsk: the task being migrated
  * @new_cset: the new css_set @tsk is being attached to
  *
  * Must be called with cgroup_mutex, threadgroup and css_set_rwsem locked.
  */
-static void cgroup_task_migrate(struct cgroup *old_cgrp,
-				struct task_struct *tsk,
+static void cgroup_task_migrate(struct task_struct *tsk,
 				struct css_set *new_cset)
 {
 	struct css_set *old_cset;
@@ -2311,8 +2309,7 @@ static int cgroup_taskset_migrate(struct cgroup_taskset *tset,
 	down_write(&css_set_rwsem);
 	list_for_each_entry(cset, &tset->src_csets, mg_node) {
 		list_for_each_entry_safe(task, tmp_task, &cset->mg_tasks, cg_list)
-			cgroup_task_migrate(cset->mg_src_cgrp, task,
-					    cset->mg_dst_cset);
+			cgroup_task_migrate(task, cset->mg_dst_cset);
 	}
 	up_write(&css_set_rwsem);
 
-- 
2.4.3



* [PATCH 02/14] cgroup: make cgroup->nr_populated count the number of populated css_sets
  2015-10-10  3:29 [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and fix pids controller Tejun Heo
  2015-10-10  3:29 ` [PATCH 01/14] cgroup: remove an unused parameter from cgroup_task_migrate() Tejun Heo
@ 2015-10-10  3:29 ` Tejun Heo
  2015-10-10  3:29 ` [PATCH 03/14] cgroup: replace cgroup_has_tasks() with cgroup_is_populated() Tejun Heo
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2015-10-10  3:29 UTC (permalink / raw)
  To: lizefan, hannes; +Cc: cgroups, cyphar, linux-kernel, kernel-team, Tejun Heo

Currently, cgroup->nr_populated counts whether the cgroup has any
css_sets linked to it and the number of children which have non-zero
->nr_populated.  This works because a css_set's refcnt converges with
the number of tasks linked to it and thus there's no css_set linked to
a cgroup if it doesn't have any live tasks.

To help track resource usage of zombie tasks, putting the ref of
css_set will be separated from disassociating the task from the
css_set which means that a cgroup may have css_sets linked to it even
when it doesn't have any live tasks.

This patch updates cgroup->nr_populated so that, for the cgroup
itself, it counts the number of css_sets which have tasks associated
with them; this way, empty css_sets don't skew the populated test.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup-defs.h |  8 +++---
 kernel/cgroup.c             | 65 ++++++++++++++++++++++++++++++++++++---------
 2 files changed, 56 insertions(+), 17 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index df589a0..1744450 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -232,10 +232,10 @@ struct cgroup {
 	int id;
 
 	/*
-	 * If this cgroup contains any tasks, it contributes one to
-	 * populated_cnt.  All children with non-zero popuplated_cnt of
-	 * their own contribute one.  The count is zero iff there's no task
-	 * in this cgroup or its subtree.
+	 * Each non-empty css_set associated with this cgroup contributes
+	 * one to populated_cnt.  All children with non-zero populated_cnt
+	 * of their own contribute one.  The count is zero iff there's no
+	 * task in this cgroup or its subtree.
 	 */
 	int populated_cnt;
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 49f30f1..e5231d0 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -582,14 +582,25 @@ struct css_set init_css_set = {
 static int css_set_count	= 1;	/* 1 for init_css_set */
 
 /**
+ * css_set_populated - does a css_set contain any tasks?
+ * @cset: target css_set
+ */
+static bool css_set_populated(struct css_set *cset)
+{
+	lockdep_assert_held(&css_set_rwsem);
+
+	return !list_empty(&cset->tasks) || !list_empty(&cset->mg_tasks);
+}
+
+/**
  * cgroup_update_populated - updated populated count of a cgroup
  * @cgrp: the target cgroup
  * @populated: inc or dec populated count
  *
- * @cgrp is either getting the first task (css_set) or losing the last.
- * Update @cgrp->populated_cnt accordingly.  The count is propagated
- * towards root so that a given cgroup's populated_cnt is zero iff the
- * cgroup and all its descendants are empty.
+ * One of the css_sets associated with @cgrp is either getting its first
+ * task or losing the last.  Update @cgrp->populated_cnt accordingly.  The
+ * count is propagated towards root so that a given cgroup's populated_cnt
+ * is zero iff the cgroup and all its descendants don't contain any tasks.
  *
  * @cgrp's interface file "cgroup.populated" is zero if
  * @cgrp->populated_cnt is zero and 1 otherwise.  When @cgrp->populated_cnt
@@ -618,6 +629,24 @@ static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
 	} while (cgrp);
 }
 
+/**
+ * css_set_update_populated - update populated state of a css_set
+ * @cset: target css_set
+ * @populated: whether @cset is populated or depopulated
+ *
+ * @cset is either getting the first task or losing the last.  Update the
+ * ->populated_cnt of all associated cgroups accordingly.
+ */
+static void css_set_update_populated(struct css_set *cset, bool populated)
+{
+	struct cgrp_cset_link *link;
+
+	lockdep_assert_held(&css_set_rwsem);
+
+	list_for_each_entry(link, &cset->cgrp_links, cgrp_link)
+		cgroup_update_populated(link->cgrp, populated);
+}
+
 /*
  * hash table for cgroup groups. This improves the performance to find
  * an existing css_set. This hash doesn't (currently) take into
@@ -663,10 +692,8 @@ static void put_css_set_locked(struct css_set *cset)
 		list_del(&link->cgrp_link);
 
 		/* @cgrp can't go away while we're holding css_set_rwsem */
-		if (list_empty(&cgrp->cset_links)) {
-			cgroup_update_populated(cgrp, false);
+		if (list_empty(&cgrp->cset_links))
 			check_for_release(cgrp);
-		}
 
 		kfree(link);
 	}
@@ -875,8 +902,6 @@ static void link_css_set(struct list_head *tmp_links, struct css_set *cset,
 	link->cset = cset;
 	link->cgrp = cgrp;
 
-	if (list_empty(&cgrp->cset_links))
-		cgroup_update_populated(cgrp, true);
 	list_move(&link->cset_link, &cgrp->cset_links);
 
 	/*
@@ -1754,6 +1779,8 @@ static void cgroup_enable_task_cg_lists(void)
 		if (!(p->flags & PF_EXITING)) {
 			struct css_set *cset = task_css_set(p);
 
+			if (!css_set_populated(cset))
+				css_set_update_populated(cset, true);
 			list_add(&p->cg_list, &cset->tasks);
 			get_css_set(cset);
 		}
@@ -1868,8 +1895,11 @@ static int cgroup_setup_root(struct cgroup_root *root, unsigned long ss_mask)
 	 * objects.
 	 */
 	down_write(&css_set_rwsem);
-	hash_for_each(css_set_table, i, cset, hlist)
+	hash_for_each(css_set_table, i, cset, hlist) {
 		link_css_set(&tmp_links, cset, root_cgrp);
+		if (css_set_populated(cset))
+			cgroup_update_populated(root_cgrp, true);
+	}
 	up_write(&css_set_rwsem);
 
 	BUG_ON(!list_empty(&root_cgrp->self.children));
@@ -2256,10 +2286,16 @@ static void cgroup_task_migrate(struct task_struct *tsk,
 	WARN_ON_ONCE(tsk->flags & PF_EXITING);
 	old_cset = task_css_set(tsk);
 
+	if (!css_set_populated(new_cset))
+		css_set_update_populated(new_cset, true);
+
 	get_css_set(new_cset);
 	rcu_assign_pointer(tsk->cgroups, new_cset);
 	list_move_tail(&tsk->cg_list, &new_cset->mg_tasks);
 
+	if (!css_set_populated(old_cset))
+		css_set_update_populated(old_cset, false);
+
 	/*
 	 * We just gained a reference on old_cset by taking it from the
 	 * task. As trading it for new_cset is protected by cgroup_mutex,
@@ -3774,7 +3810,7 @@ static void css_advance_task_iter(struct css_task_iter *it)
 			link = list_entry(l, struct cgrp_cset_link, cset_link);
 			cset = link->cset;
 		}
-	} while (list_empty(&cset->tasks) && list_empty(&cset->mg_tasks));
+	} while (!css_set_populated(cset));
 
 	it->cset_pos = l;
 
@@ -5492,17 +5528,20 @@ void cgroup_exit(struct task_struct *tsk)
 
 	/*
 	 * Unlink from @tsk from its css_set.  As migration path can't race
-	 * with us, we can check cg_list without grabbing css_set_rwsem.
+	 * with us, we can check css_set and cg_list without synchronization.
 	 */
+	cset = task_css_set(tsk);
+
 	if (!list_empty(&tsk->cg_list)) {
 		down_write(&css_set_rwsem);
 		list_del_init(&tsk->cg_list);
+		if (!css_set_populated(cset))
+			css_set_update_populated(cset, false);
 		up_write(&css_set_rwsem);
 		put_cset = true;
 	}
 
 	/* Reassign the task to the init_css_set. */
-	cset = task_css_set(tsk);
 	RCU_INIT_POINTER(tsk->cgroups, &init_css_set);
 
 	/* see cgroup_post_fork() for details */
-- 
2.4.3



* [PATCH 03/14] cgroup: replace cgroup_has_tasks() with cgroup_is_populated()
  2015-10-10  3:29 [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and fix pids controller Tejun Heo
  2015-10-10  3:29 ` [PATCH 01/14] cgroup: remove an unused parameter from cgroup_task_migrate() Tejun Heo
  2015-10-10  3:29 ` [PATCH 02/14] cgroup: make cgroup->nr_populated count the number of populated css_sets Tejun Heo
@ 2015-10-10  3:29 ` Tejun Heo
  2015-10-10  3:29 ` [PATCH 04/14] cgroup: move check_for_release() invocation Tejun Heo
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2015-10-10  3:29 UTC (permalink / raw)
  To: lizefan, hannes
  Cc: cgroups, cyphar, linux-kernel, kernel-team, Tejun Heo, Michal Hocko

Currently, cgroup_has_tasks() tests whether the target cgroup has any
css_set linked to it.  This works because a css_set's refcnt converges
with the number of tasks linked to it and thus there's no css_set
linked to a cgroup if it doesn't have any live tasks.

To help track resource usage of zombie tasks, putting the ref of
css_set will be separated from disassociating the task from the
css_set which means that a cgroup may have css_sets linked to it even
when it doesn't have any live tasks.

This patch replaces cgroup_has_tasks() with cgroup_is_populated(),
which instead tests cgroup->nr_populated, a count of the populated
css_sets.  Unlike cgroup_has_tasks(),
cgroup_is_populated() is recursive - if any of the descendants is
populated, the cgroup is populated too.  While this changes the
meaning of the test, all the existing users are okay with the change.

While at it, replace the open-coded ->populated_cnt test in
cgroup_events_show() with cgroup_is_populated().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
---
 include/linux/cgroup.h | 4 ++--
 kernel/cgroup.c        | 6 +++---
 kernel/cpuset.c        | 2 +-
 mm/memcontrol.c        | 2 +-
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index e9c3eac..bdfdb3a 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -456,9 +456,9 @@ static inline struct cgroup *task_cgroup(struct task_struct *task,
 }
 
 /* no synchronization, the result can only be used as a hint */
-static inline bool cgroup_has_tasks(struct cgroup *cgrp)
+static inline bool cgroup_is_populated(struct cgroup *cgrp)
 {
-	return !list_empty(&cgrp->cset_links);
+	return cgrp->populated_cnt;
 }
 
 /* returns ino associated with a cgroup */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index e5231d0..435aa68 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3121,7 +3121,7 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 static int cgroup_events_show(struct seq_file *seq, void *v)
 {
 	seq_printf(seq, "populated %d\n",
-		   (bool)seq_css(seq)->cgroup->populated_cnt);
+		   cgroup_is_populated(seq_css(seq)->cgroup));
 	return 0;
 }
 
@@ -5558,7 +5558,7 @@ void cgroup_exit(struct task_struct *tsk)
 
 static void check_for_release(struct cgroup *cgrp)
 {
-	if (notify_on_release(cgrp) && !cgroup_has_tasks(cgrp) &&
+	if (notify_on_release(cgrp) && !cgroup_is_populated(cgrp) &&
 	    !css_has_online_children(&cgrp->self) && !cgroup_is_dead(cgrp))
 		schedule_work(&cgrp->release_agent_work);
 }
@@ -5806,7 +5806,7 @@ static int cgroup_css_links_read(struct seq_file *seq, void *v)
 
 static u64 releasable_read(struct cgroup_subsys_state *css, struct cftype *cft)
 {
-	return (!cgroup_has_tasks(css->cgroup) &&
+	return (!cgroup_is_populated(css->cgroup) &&
 		!css_has_online_children(&css->cgroup->self));
 }
 
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index e4d9999..d7ccb87 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -498,7 +498,7 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
 	 * be changed to have empty cpus_allowed or mems_allowed.
 	 */
 	ret = -ENOSPC;
-	if ((cgroup_has_tasks(cur->css.cgroup) || cur->attach_in_progress)) {
+	if ((cgroup_is_populated(cur->css.cgroup) || cur->attach_in_progress)) {
 		if (!cpumask_empty(cur->cpus_allowed) &&
 		    cpumask_empty(trial->cpus_allowed))
 			goto out;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 33c8dad..0ddd0ff 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2920,7 +2920,7 @@ static int memcg_activate_kmem(struct mem_cgroup *memcg,
 	 * of course permitted.
 	 */
 	mutex_lock(&memcg_create_mutex);
-	if (cgroup_has_tasks(memcg->css.cgroup) ||
+	if (cgroup_is_populated(memcg->css.cgroup) ||
 	    (memcg->use_hierarchy && memcg_has_children(memcg)))
 		err = -EBUSY;
 	mutex_unlock(&memcg_create_mutex);
-- 
2.4.3



* [PATCH 04/14] cgroup: move check_for_release() invocation
  2015-10-10  3:29 [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and fix pids controller Tejun Heo
                   ` (2 preceding siblings ...)
  2015-10-10  3:29 ` [PATCH 03/14] cgroup: replace cgroup_has_tasks() with cgroup_is_populated() Tejun Heo
@ 2015-10-10  3:29 ` Tejun Heo
  2015-10-10  3:29 ` [PATCH 05/14] cgroup: relocate cgroup_[try]get/put() Tejun Heo
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2015-10-10  3:29 UTC (permalink / raw)
  To: lizefan, hannes; +Cc: cgroups, cyphar, linux-kernel, kernel-team, Tejun Heo

To trigger the release agent when the last task leaves the cgroup,
check_for_release() is called from put_css_set_locked(); however,
unlinking a css_set is being decoupled from tasks leaving the cgroup,
and the correct condition to test is cgroup->nr_populated dropping to
zero, which check_for_release() has already been updated to test.

This patch moves check_for_release() invocation from
put_css_set_locked() to cgroup_update_populated().

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 435aa68..855313d 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -623,6 +623,7 @@ static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
 		if (!trigger)
 			break;
 
+		check_for_release(cgrp);
 		cgroup_file_notify(&cgrp->events_file);
 
 		cgrp = cgroup_parent(cgrp);
@@ -686,15 +687,8 @@ static void put_css_set_locked(struct css_set *cset)
 	css_set_count--;
 
 	list_for_each_entry_safe(link, tmp_link, &cset->cgrp_links, cgrp_link) {
-		struct cgroup *cgrp = link->cgrp;
-
 		list_del(&link->cset_link);
 		list_del(&link->cgrp_link);
-
-		/* @cgrp can't go away while we're holding css_set_rwsem */
-		if (list_empty(&cgrp->cset_links))
-			check_for_release(cgrp);
-
 		kfree(link);
 	}
 
-- 
2.4.3



* [PATCH 05/14] cgroup: relocate cgroup_[try]get/put()
  2015-10-10  3:29 [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and fix pids controller Tejun Heo
                   ` (3 preceding siblings ...)
  2015-10-10  3:29 ` [PATCH 04/14] cgroup: move check_for_release() invocation Tejun Heo
@ 2015-10-10  3:29 ` Tejun Heo
  2015-10-10  3:29 ` [PATCH 06/14] cgroup: make css_sets pin the associated cgroups Tejun Heo
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2015-10-10  3:29 UTC (permalink / raw)
  To: lizefan, hannes; +Cc: cgroups, cyphar, linux-kernel, kernel-team, Tejun Heo

Relocate cgroup_get(), cgroup_tryget() and cgroup_put() upwards.  This
is pure code reorganization to prepare for future changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 855313d..ab5c9a5 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -428,6 +428,22 @@ static inline bool cgroup_is_dead(const struct cgroup *cgrp)
 	return !(cgrp->self.flags & CSS_ONLINE);
 }
 
+static void cgroup_get(struct cgroup *cgrp)
+{
+	WARN_ON_ONCE(cgroup_is_dead(cgrp));
+	css_get(&cgrp->self);
+}
+
+static bool cgroup_tryget(struct cgroup *cgrp)
+{
+	return css_tryget(&cgrp->self);
+}
+
+static void cgroup_put(struct cgroup *cgrp)
+{
+	css_put(&cgrp->self);
+}
+
 struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
 {
 	struct cgroup *cgrp = of->kn->parent->priv;
@@ -1177,22 +1193,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
 	return mode;
 }
 
-static void cgroup_get(struct cgroup *cgrp)
-{
-	WARN_ON_ONCE(cgroup_is_dead(cgrp));
-	css_get(&cgrp->self);
-}
-
-static bool cgroup_tryget(struct cgroup *cgrp)
-{
-	return css_tryget(&cgrp->self);
-}
-
-static void cgroup_put(struct cgroup *cgrp)
-{
-	css_put(&cgrp->self);
-}
-
 /**
  * cgroup_calc_child_subsys_mask - calculate child_subsys_mask
  * @cgrp: the target cgroup
-- 
2.4.3



* [PATCH 06/14] cgroup: make css_sets pin the associated cgroups
  2015-10-10  3:29 [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and fix pids controller Tejun Heo
                   ` (4 preceding siblings ...)
  2015-10-10  3:29 ` [PATCH 05/14] cgroup: relocate cgroup_[try]get/put() Tejun Heo
@ 2015-10-10  3:29 ` Tejun Heo
  2015-10-15  1:34   ` [PATCH v2 " Tejun Heo
  2015-10-10  3:29 ` [PATCH 07/14] cgroup: make cgroup_destroy_locked() test cgroup_is_populated() Tejun Heo
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2015-10-10  3:29 UTC (permalink / raw)
  To: lizefan, hannes; +Cc: cgroups, cyphar, linux-kernel, kernel-team, Tejun Heo

Currently, css_sets don't pin the associated cgroups.  This is okay
as a cgroup with css_sets associated is not allowed to be removed;
however, to help resource tracking for zombie tasks, this is scheduled
to change such that a cgroup can be removed even when it has css_sets
associated as long as none of them are populated.

To ensure that a cgroup doesn't go away while css_sets are still
associated with it, make each associated css_set hold a reference on
the cgroup.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index ab5c9a5..ba11496 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -705,6 +705,7 @@ static void put_css_set_locked(struct css_set *cset)
 	list_for_each_entry_safe(link, tmp_link, &cset->cgrp_links, cgrp_link) {
 		list_del(&link->cset_link);
 		list_del(&link->cgrp_link);
+		cgroup_put(link->cgrp);
 		kfree(link);
 	}
 
@@ -919,6 +920,8 @@ static void link_css_set(struct list_head *tmp_links, struct css_set *cset,
 	 * is sorted by order of hierarchy creation
 	 */
 	list_add_tail(&link->cgrp_link, &cset->cgrp_links);
+
+	cgroup_get(cgrp);
 }
 
 /**
@@ -4998,10 +5001,7 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 
 	lockdep_assert_held(&cgroup_mutex);
 
-	/*
-	 * css_set_rwsem synchronizes access to ->cset_links and prevents
-	 * @cgrp from being removed while put_css_set() is in progress.
-	 */
+	/* css_set_rwsem synchronizes access to ->cset_links */
 	down_read(&css_set_rwsem);
 	empty = list_empty(&cgrp->cset_links);
 	up_read(&css_set_rwsem);
-- 
2.4.3



* [PATCH 07/14] cgroup: make cgroup_destroy_locked() test cgroup_is_populated()
  2015-10-10  3:29 [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and fix pids controller Tejun Heo
                   ` (5 preceding siblings ...)
  2015-10-10  3:29 ` [PATCH 06/14] cgroup: make css_sets pin the associated cgroups Tejun Heo
@ 2015-10-10  3:29 ` Tejun Heo
  2015-10-10  3:29 ` [PATCH 08/14] cgroup: keep css_set and task lists in chronological order Tejun Heo
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2015-10-10  3:29 UTC (permalink / raw)
  To: lizefan, hannes; +Cc: cgroups, cyphar, linux-kernel, kernel-team, Tejun Heo

cgroup_destroy_locked() currently tests whether any css_sets are
associated to reject removal if the cgroup contains tasks.  This works
because a css_set's refcnt converges with the number of tasks linked
to it and thus there's no css_set linked to a cgroup if it doesn't
have any live tasks.

To help track resource usage of zombie tasks, putting the ref of
css_set will be separated from disassociating the task from the
css_set which means that a cgroup may have css_sets linked to it even
when it doesn't have any live tasks.

This patch updates cgroup_destroy_locked() so that it tests
cgroup_is_populated(), which counts the number of populated css_sets,
instead of whether cgrp->cset_links is empty to determine whether the
cgroup is populated or not.  This ensures that rmdirs won't be
incorrectly rejected for cgroups which only contain zombie tasks.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index ba11496..b8da97a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4996,16 +4996,15 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 	__releases(&cgroup_mutex) __acquires(&cgroup_mutex)
 {
 	struct cgroup_subsys_state *css;
-	bool empty;
 	int ssid;
 
 	lockdep_assert_held(&cgroup_mutex);
 
-	/* css_set_rwsem synchronizes access to ->cset_links */
-	down_read(&css_set_rwsem);
-	empty = list_empty(&cgrp->cset_links);
-	up_read(&css_set_rwsem);
-	if (!empty)
+	/*
+	 * Only migration can raise populated from zero and we're already
+	 * holding cgroup_mutex.
+	 */
+	if (cgroup_is_populated(cgrp))
 		return -EBUSY;
 
 	/*
-- 
2.4.3



* [PATCH 08/14] cgroup: keep css_set and task lists in chronological order
  2015-10-10  3:29 [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and fix pids controller Tejun Heo
                   ` (6 preceding siblings ...)
  2015-10-10  3:29 ` [PATCH 07/14] cgroup: make cgroup_destroy_locked() test cgroup_is_populated() Tejun Heo
@ 2015-10-10  3:29 ` Tejun Heo
  2015-10-10  3:29 ` [PATCH 09/14] cgroup: factor out css_set_move_task() Tejun Heo
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2015-10-10  3:29 UTC (permalink / raw)
  To: lizefan, hannes; +Cc: cgroups, cyphar, linux-kernel, kernel-team, Tejun Heo

css task iteration will be updated to not leak cgroup internal locking
to iterator users.  In preparation, update css_set and task lists to
be in chronological order.

For tasks, as the migration path is already using list_splice_tail_init(),
only cgroup_enable_task_cg_lists() and cgroup_post_fork() need
updating.  For css_sets, link_css_set() is the only place which needs
to be updated.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b8da97a..5cb28d2 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -913,12 +913,11 @@ static void link_css_set(struct list_head *tmp_links, struct css_set *cset,
 	link->cset = cset;
 	link->cgrp = cgrp;
 
-	list_move(&link->cset_link, &cgrp->cset_links);
-
 	/*
-	 * Always add links to the tail of the list so that the list
-	 * is sorted by order of hierarchy creation
+	 * Always add links to the tail of the lists so that the lists are
+	 * in chronological order.
 	 */
+	list_move_tail(&link->cset_link, &cgrp->cset_links);
 	list_add_tail(&link->cgrp_link, &cset->cgrp_links);
 
 	cgroup_get(cgrp);
@@ -1778,7 +1777,7 @@ static void cgroup_enable_task_cg_lists(void)
 
 			if (!css_set_populated(cset))
 				css_set_update_populated(cset, true);
-			list_add(&p->cg_list, &cset->tasks);
+			list_add_tail(&p->cg_list, &cset->tasks);
 			get_css_set(cset);
 		}
 		spin_unlock_irq(&p->sighand->siglock);
@@ -5478,7 +5477,7 @@ void cgroup_post_fork(struct task_struct *child,
 		cset = task_css_set(current);
 		if (list_empty(&child->cg_list)) {
 			rcu_assign_pointer(child->cgroups, cset);
-			list_add(&child->cg_list, &cset->tasks);
+			list_add_tail(&child->cg_list, &cset->tasks);
 			get_css_set(cset);
 		}
 		up_write(&css_set_rwsem);
-- 
2.4.3



* [PATCH 09/14] cgroup: factor out css_set_move_task()
  2015-10-10  3:29 [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and fix pids controller Tejun Heo
                   ` (7 preceding siblings ...)
  2015-10-10  3:29 ` [PATCH 08/14] cgroup: keep css_set and task lists in chronological order Tejun Heo
@ 2015-10-10  3:29 ` Tejun Heo
  2015-10-10  3:29 ` [PATCH 10/14] cgroup: reorganize css_task_iter functions Tejun Heo
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2015-10-10  3:29 UTC (permalink / raw)
  To: lizefan, hannes; +Cc: cgroups, cyphar, linux-kernel, kernel-team, Tejun Heo

A task is associated with and disassociated from its css_set in three
places - during migration, after a new task is created and when a task
exits.  The first is handled by cgroup_task_migrate() and the latter
two are open-coded.

These are similar operations and spreading them over multiple places
makes it harder to follow and update.  This patch collects all task
css_set [dis]association operations into css_set_move_task().

While css_set_move_task() may check whether the populated state needs
updating even when it's not strictly necessary, the behavior is
essentially equivalent before and after this patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup.c | 104 ++++++++++++++++++++++++++++++--------------------------
 1 file changed, 56 insertions(+), 48 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 5cb28d2..4e239e4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -664,6 +664,52 @@ static void css_set_update_populated(struct css_set *cset, bool populated)
 		cgroup_update_populated(link->cgrp, populated);
 }
 
+/**
+ * css_set_move_task - move a task from one css_set to another
+ * @task: task being moved
+ * @from_cset: css_set @task currently belongs to (may be NULL)
+ * @to_cset: new css_set @task is being moved to (may be NULL)
+ * @use_mg_tasks: move to @to_cset->mg_tasks instead of ->tasks
+ *
+ * Move @task from @from_cset to @to_cset.  If @task didn't belong to any
+ * css_set, @from_cset can be NULL.  If @task is being disassociated
+ * instead of moved, @to_cset can be NULL.
+ *
+ * This function automatically handles populated_cnt updates but the caller
+ * is responsible for managing @from_cset and @to_cset's reference counts.
+ */
+static void css_set_move_task(struct task_struct *task,
+			      struct css_set *from_cset, struct css_set *to_cset,
+			      bool use_mg_tasks)
+{
+	lockdep_assert_held(&css_set_rwsem);
+
+	if (from_cset) {
+		WARN_ON_ONCE(list_empty(&task->cg_list));
+		list_del_init(&task->cg_list);
+		if (!css_set_populated(from_cset))
+			css_set_update_populated(from_cset, false);
+	} else {
+		WARN_ON_ONCE(!list_empty(&task->cg_list));
+	}
+
+	if (to_cset) {
+		/*
+		 * We are synchronized through cgroup_threadgroup_rwsem
+		 * against PF_EXITING setting such that we can't race
+		 * against cgroup_exit() changing the css_set to
+		 * init_css_set and dropping the old one.
+		 */
+		WARN_ON_ONCE(task->flags & PF_EXITING);
+
+		if (!css_set_populated(to_cset))
+			css_set_update_populated(to_cset, true);
+		rcu_assign_pointer(task->cgroups, to_cset);
+		list_add_tail(&task->cg_list, use_mg_tasks ? &to_cset->mg_tasks :
+							     &to_cset->tasks);
+	}
+}
+
 /*
  * hash table for cgroup groups. This improves the performance to find
  * an existing css_set. This hash doesn't (currently) take into
@@ -2260,47 +2306,6 @@ struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset)
 }
 
 /**
- * cgroup_task_migrate - move a task from one cgroup to another.
- * @tsk: the task being migrated
- * @new_cset: the new css_set @tsk is being attached to
- *
- * Must be called with cgroup_mutex, threadgroup and css_set_rwsem locked.
- */
-static void cgroup_task_migrate(struct task_struct *tsk,
-				struct css_set *new_cset)
-{
-	struct css_set *old_cset;
-
-	lockdep_assert_held(&cgroup_mutex);
-	lockdep_assert_held(&css_set_rwsem);
-
-	/*
-	 * We are synchronized through cgroup_threadgroup_rwsem against
-	 * PF_EXITING setting such that we can't race against cgroup_exit()
-	 * changing the css_set to init_css_set and dropping the old one.
-	 */
-	WARN_ON_ONCE(tsk->flags & PF_EXITING);
-	old_cset = task_css_set(tsk);
-
-	if (!css_set_populated(new_cset))
-		css_set_update_populated(new_cset, true);
-
-	get_css_set(new_cset);
-	rcu_assign_pointer(tsk->cgroups, new_cset);
-	list_move_tail(&tsk->cg_list, &new_cset->mg_tasks);
-
-	if (!css_set_populated(old_cset))
-		css_set_update_populated(old_cset, false);
-
-	/*
-	 * We just gained a reference on old_cset by taking it from the
-	 * task. As trading it for new_cset is protected by cgroup_mutex,
-	 * we're safe to drop it here; it will be freed under RCU.
-	 */
-	put_css_set_locked(old_cset);
-}
-
-/**
  * cgroup_taskset_migrate - migrate a taskset to a cgroup
  * @tset: target taskset
  * @dst_cgrp: destination cgroup
@@ -2340,8 +2345,14 @@ static int cgroup_taskset_migrate(struct cgroup_taskset *tset,
 	 */
 	down_write(&css_set_rwsem);
 	list_for_each_entry(cset, &tset->src_csets, mg_node) {
-		list_for_each_entry_safe(task, tmp_task, &cset->mg_tasks, cg_list)
-			cgroup_task_migrate(task, cset->mg_dst_cset);
+		list_for_each_entry_safe(task, tmp_task, &cset->mg_tasks, cg_list) {
+			struct css_set *from_cset = task_css_set(task);
+			struct css_set *to_cset = cset->mg_dst_cset;
+
+			get_css_set(to_cset);
+			css_set_move_task(task, from_cset, to_cset, true);
+			put_css_set_locked(from_cset);
+		}
 	}
 	up_write(&css_set_rwsem);
 
@@ -5476,9 +5487,8 @@ void cgroup_post_fork(struct task_struct *child,
 		down_write(&css_set_rwsem);
 		cset = task_css_set(current);
 		if (list_empty(&child->cg_list)) {
-			rcu_assign_pointer(child->cgroups, cset);
-			list_add_tail(&child->cg_list, &cset->tasks);
 			get_css_set(cset);
+			css_set_move_task(child, NULL, cset, false);
 		}
 		up_write(&css_set_rwsem);
 	}
@@ -5526,9 +5536,7 @@ void cgroup_exit(struct task_struct *tsk)
 
 	if (!list_empty(&tsk->cg_list)) {
 		down_write(&css_set_rwsem);
-		list_del_init(&tsk->cg_list);
-		if (!css_set_populated(cset))
-			css_set_update_populated(cset, false);
+		css_set_move_task(tsk, cset, NULL, false);
 		up_write(&css_set_rwsem);
 		put_cset = true;
 	}
-- 
2.4.3



* [PATCH 10/14] cgroup: reorganize css_task_iter functions
  2015-10-10  3:29 [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and fix pids controller Tejun Heo
                   ` (8 preceding siblings ...)
  2015-10-10  3:29 ` [PATCH 09/14] cgroup: factor out css_set_move_task() Tejun Heo
@ 2015-10-10  3:29 ` Tejun Heo
  2015-10-11 13:30 ` [PATCH 11/14] cgroup: don't hold css_set_rwsem across css task iteration Tejun Heo
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2015-10-10  3:29 UTC (permalink / raw)
  To: lizefan, hannes; +Cc: cgroups, cyphar, linux-kernel, kernel-team, Tejun Heo

* Rename css_advance_task_iter() to css_task_iter_advance_css_set()
  and make it also clear it->task_pos at the end of the iteration.

* Factor out css_task_iter_advance() from css_task_iter_next().  The
  new function whines if called on a terminated iterator.

Except for the termination check, this is pure reorganization and
doesn't introduce any behavior changes.  This will help the planned
locking update for css_task_iter.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup.c | 49 ++++++++++++++++++++++++++++---------------------
 1 file changed, 28 insertions(+), 21 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 4e239e4..0a58445 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3791,12 +3791,12 @@ bool css_has_online_children(struct cgroup_subsys_state *css)
 }
 
 /**
- * css_advance_task_iter - advance a task itererator to the next css_set
+ * css_task_iter_advance_css_set - advance a task iterator to the next css_set
  * @it: the iterator to advance
  *
  * Advance @it to the next css_set to walk.
  */
-static void css_advance_task_iter(struct css_task_iter *it)
+static void css_task_iter_advance_css_set(struct css_task_iter *it)
 {
 	struct list_head *l = it->cset_pos;
 	struct cgrp_cset_link *link;
@@ -3807,6 +3807,7 @@ static void css_advance_task_iter(struct css_task_iter *it)
 		l = l->next;
 		if (l == it->cset_head) {
 			it->cset_pos = NULL;
+			it->task_pos = NULL;
 			return;
 		}
 
@@ -3830,6 +3831,28 @@ static void css_advance_task_iter(struct css_task_iter *it)
 	it->mg_tasks_head = &cset->mg_tasks;
 }
 
+static void css_task_iter_advance(struct css_task_iter *it)
+{
+	struct list_head *l = it->task_pos;
+
+	WARN_ON_ONCE(!l);
+
+	/*
+	 * Advance iterator to find next entry.  cset->tasks is consumed
+	 * first and then ->mg_tasks.  After ->mg_tasks, we move onto the
+	 * next cset.
+	 */
+	l = l->next;
+
+	if (l == it->tasks_head)
+		l = it->mg_tasks_head->next;
+
+	if (l == it->mg_tasks_head)
+		css_task_iter_advance_css_set(it);
+	else
+		it->task_pos = l;
+}
+
 /**
  * css_task_iter_start - initiate task iteration
  * @css: the css to walk tasks of
@@ -3862,7 +3885,7 @@ void css_task_iter_start(struct cgroup_subsys_state *css,
 
 	it->cset_head = it->cset_pos;
 
-	css_advance_task_iter(it);
+	css_task_iter_advance_css_set(it);
 }
 
 /**
@@ -3876,28 +3899,12 @@ void css_task_iter_start(struct cgroup_subsys_state *css,
 struct task_struct *css_task_iter_next(struct css_task_iter *it)
 {
 	struct task_struct *res;
-	struct list_head *l = it->task_pos;
 
-	/* If the iterator cg is NULL, we have no tasks */
 	if (!it->cset_pos)
 		return NULL;
-	res = list_entry(l, struct task_struct, cg_list);
-
-	/*
-	 * Advance iterator to find next entry.  cset->tasks is consumed
-	 * first and then ->mg_tasks.  After ->mg_tasks, we move onto the
-	 * next cset.
-	 */
-	l = l->next;
-
-	if (l == it->tasks_head)
-		l = it->mg_tasks_head->next;
-
-	if (l == it->mg_tasks_head)
-		css_advance_task_iter(it);
-	else
-		it->task_pos = l;
 
+	res = list_entry(it->task_pos, struct task_struct, cg_list);
+	css_task_iter_advance(it);
 	return res;
 }
 
-- 
2.4.3



* [PATCH 11/14] cgroup: don't hold css_set_rwsem across css task iteration
  2015-10-10  3:29 [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and fix pids controller Tejun Heo
                   ` (9 preceding siblings ...)
  2015-10-10  3:29 ` [PATCH 10/14] cgroup: reorganize css_task_iter functions Tejun Heo
@ 2015-10-11 13:30 ` Tejun Heo
  2015-10-15  1:35   ` [PATCH v2 " Tejun Heo
  2015-10-11 13:30 ` [PATCH 12/14] cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock Tejun Heo
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2015-10-11 13:30 UTC (permalink / raw)
  To: lizefan, hannes; +Cc: cgroups, cyphar, linux-kernel, kernel-team, Tejun Heo

css_sets are synchronized through css_set_rwsem but the locking scheme
is kinda bizarre.  The hot paths - fork and exit - have to write-lock
the rwsem, making the rw part pointless; furthermore, many readers
already hold cgroup_mutex.

One of the readers is css task iteration.  It read-locks the rwsem
for the entire duration of the iteration.  This leads to silly locking
behavior.  When cpuset tries to migrate processes of a cgroup to a
different NUMA node, css_set_rwsem is held across the entire migration
attempt, which can take a long time, locking out forking, exiting and
other cgroup operations.

This patch updates css task iteration so that it locks css_set_rwsem
only while the iterator is being advanced.  css task iteration
involves two levels - css_set and task iteration.  As css_sets in use
are practically immutable, simply pinning the current one is enough
for resuming iteration afterwards.  Task iteration is tricky as tasks
may leave their css_set while iteration is in progress.  This is
solved by keeping track of active iterators and advancing them if
their next task leaves its css_set.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup-defs.h |  3 ++
 include/linux/cgroup.h      |  4 +++
 kernel/cgroup.c             | 83 +++++++++++++++++++++++++++++++++++++--------
 3 files changed, 76 insertions(+), 14 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1744450..62413c3 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -211,6 +211,9 @@ struct css_set {
 	 */
 	struct list_head e_cset_node[CGROUP_SUBSYS_COUNT];
 
+	/* all css_task_iters currently walking this cset */
+	struct list_head task_iters;
+
 	/* For RCU-protected deletion */
 	struct rcu_head rcu_head;
 };
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index bdfdb3a..a9dcf0e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -42,6 +42,10 @@ struct css_task_iter {
 	struct list_head		*task_pos;
 	struct list_head		*tasks_head;
 	struct list_head		*mg_tasks_head;
+
+	struct css_set			*cur_cset;
+	struct task_struct		*cur_task;
+	struct list_head		iters_node;	/* css_set->task_iters */
 };
 
 extern struct cgroup_root cgrp_dfl_root;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0a58445..e624363 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -216,6 +216,7 @@ static struct cftype cgroup_legacy_base_files[];
 
 static int rebind_subsystems(struct cgroup_root *dst_root,
 			     unsigned long ss_mask);
+static void css_task_iter_advance(struct css_task_iter *it);
 static int cgroup_destroy_locked(struct cgroup *cgrp);
 static int create_css(struct cgroup *cgrp, struct cgroup_subsys *ss,
 		      bool visible);
@@ -593,6 +594,7 @@ struct css_set init_css_set = {
 	.mg_tasks		= LIST_HEAD_INIT(init_css_set.mg_tasks),
 	.mg_preload_node	= LIST_HEAD_INIT(init_css_set.mg_preload_node),
 	.mg_node		= LIST_HEAD_INIT(init_css_set.mg_node),
+	.task_iters		= LIST_HEAD_INIT(init_css_set.task_iters),
 };
 
 static int css_set_count	= 1;	/* 1 for init_css_set */
@@ -675,8 +677,9 @@ static void css_set_update_populated(struct css_set *cset, bool populated)
  * css_set, @from_cset can be NULL.  If @task is being disassociated
  * instead of moved, @to_cset can be NULL.
  *
- * This function automatically handles populated_cnt updates but the caller
- * is responsible for managing @from_cset and @to_cset's reference counts.
+ * This function automatically handles populated_cnt updates and
+ * css_task_iter adjustments but the caller is responsible for managing
+ * @from_cset and @to_cset's reference counts.
  */
 static void css_set_move_task(struct task_struct *task,
 			      struct css_set *from_cset, struct css_set *to_cset,
@@ -685,7 +688,19 @@ static void css_set_move_task(struct task_struct *task,
 	lockdep_assert_held(&css_set_rwsem);
 
 	if (from_cset) {
+		struct css_task_iter *it;
+
 		WARN_ON_ONCE(list_empty(&task->cg_list));
+
+		/*
+		 * @task is leaving, advance task iterators which are
+		 * pointing to it so that they can resume at the next
+		 * position.  See css_task_iter_advance*() for details.
+		 */
+		list_for_each_entry(it, &from_cset->task_iters, iters_node)
+			if (it->task_pos == &task->cg_list)
+				css_task_iter_advance(it);
+
 		list_del_init(&task->cg_list);
 		if (!css_set_populated(from_cset))
 			css_set_update_populated(from_cset, false);
@@ -1017,6 +1032,7 @@ static struct css_set *find_css_set(struct css_set *old_cset,
 	INIT_LIST_HEAD(&cset->mg_tasks);
 	INIT_LIST_HEAD(&cset->mg_preload_node);
 	INIT_LIST_HEAD(&cset->mg_node);
+	INIT_LIST_HEAD(&cset->task_iters);
 	INIT_HLIST_NODE(&cset->hlist);
 
 	/* Copy the set of subsystem state objects generated in
@@ -3802,6 +3818,8 @@ static void css_task_iter_advance_css_set(struct css_task_iter *it)
 	struct cgrp_cset_link *link;
 	struct css_set *cset;
 
+	lockdep_assert_held(&css_set_rwsem);
+
 	/* Advance to the next non-empty css_set */
 	do {
 		l = l->next;
@@ -3829,12 +3847,36 @@ static void css_task_iter_advance_css_set(struct css_task_iter *it)
 
 	it->tasks_head = &cset->tasks;
 	it->mg_tasks_head = &cset->mg_tasks;
+
+	/*
+	 * We don't keep css_sets locked across iteration steps and thus
+	 * need to take steps to ensure that iteration can be resumed after
+	 * the lock is re-acquired.  Iteration is performed at two levels -
+	 * css_sets and tasks in them.
+	 *
+	 * Once created, a css_set never leaves its cgroup lists, so a
+	 * pinned css_set is guaranteed to stay put and we can resume
+	 * iteration afterwards.
+	 *
+	 * Tasks may leave @cset across iteration steps.  This is resolved
+	 * by registering each iterator with the css_set currently being
+	 * walked and making css_set_move_task() advance iterators whose
+	 * next task is leaving.
+	 */
+	if (it->cur_cset) {
+		list_del(&it->iters_node);
+		put_css_set_locked(it->cur_cset);
+	}
+	get_css_set(cset);
+	it->cur_cset = cset;
+	list_add(&it->iters_node, &cset->task_iters);
 }
 
 static void css_task_iter_advance(struct css_task_iter *it)
 {
 	struct list_head *l = it->task_pos;
 
+	lockdep_assert_held(&css_set_rwsem);
 	WARN_ON_ONCE(!l);
 
 	/*
@@ -3862,19 +3904,16 @@ static void css_task_iter_advance(struct css_task_iter *it)
  * css_task_iter_next() to walk through the tasks until the function
  * returns NULL.  On completion of iteration, css_task_iter_end() must be
  * called.
- *
- * Note that this function acquires a lock which is released when the
- * iteration finishes.  The caller can't sleep while iteration is in
- * progress.
  */
 void css_task_iter_start(struct cgroup_subsys_state *css,
 			 struct css_task_iter *it)
-	__acquires(css_set_rwsem)
 {
 	/* no one should try to iterate before mounting cgroups */
 	WARN_ON_ONCE(!use_task_css_set_links);
 
-	down_read(&css_set_rwsem);
+	memset(it, 0, sizeof(*it));
+
+	down_write(&css_set_rwsem);
 
 	it->ss = css->ss;
 
@@ -3886,6 +3925,8 @@ void css_task_iter_start(struct cgroup_subsys_state *css,
 	it->cset_head = it->cset_pos;
 
 	css_task_iter_advance_css_set(it);
+
+	up_write(&css_set_rwsem);
 }
 
 /**
@@ -3898,14 +3939,21 @@ void css_task_iter_start(struct cgroup_subsys_state *css,
  */
 struct task_struct *css_task_iter_next(struct css_task_iter *it)
 {
-	struct task_struct *res;
-
 	if (!it->cset_pos)
 		return NULL;
 
-	res = list_entry(it->task_pos, struct task_struct, cg_list);
+	down_write(&css_set_rwsem);
+
+	if (it->cur_task)
+		put_task_struct(it->cur_task);
+	it->cur_task = list_entry(it->task_pos, struct task_struct, cg_list);
+	get_task_struct(it->cur_task);
+
 	css_task_iter_advance(it);
-	return res;
+
+	up_write(&css_set_rwsem);
+
+	return it->cur_task;
 }
 
 /**
@@ -3915,9 +3963,16 @@ struct task_struct *css_task_iter_next(struct css_task_iter *it)
  * Finish task iteration started by css_task_iter_start().
  */
 void css_task_iter_end(struct css_task_iter *it)
-	__releases(css_set_rwsem)
 {
-	up_read(&css_set_rwsem);
+	if (it->cur_cset) {
+		down_write(&css_set_rwsem);
+		list_del(&it->iters_node);
+		put_css_set_locked(it->cur_cset);
+		up_write(&css_set_rwsem);
+	}
+
+	if (it->cur_task)
+		put_task_struct(it->cur_task);
 }
 
 /**
-- 
2.4.3



* [PATCH 12/14] cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock
  2015-10-10  3:29 [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and fix pids controller Tejun Heo
                   ` (10 preceding siblings ...)
  2015-10-11 13:30 ` [PATCH 11/14] cgroup: don't hold css_set_rwsem across css task iteration Tejun Heo
@ 2015-10-11 13:30 ` Tejun Heo
  2015-10-11 13:30 ` [PATCH 13/14] cgroup: keep zombies associated with their original cgroups Tejun Heo
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2015-10-11 13:30 UTC (permalink / raw)
  To: lizefan, hannes; +Cc: cgroups, cyphar, linux-kernel, kernel-team, Tejun Heo

css_set_rwsem is the inner lock protecting css_sets and is accessed
from hot paths such as fork and exit.  Internally, it has no reason to
be an rwsem or even a mutex - there are no blocking operations while
holding it.  It was made an rwsem because css task iteration used to
expose it to external iterator users.  As the previous patch updated
css task iteration so that the locking is no longer leaked to its
users, there's no reason to keep it an rwsem.

This patch converts css_set_rwsem to a spinlock and renames it to
css_set_lock.  It uses bh-safe operations as a planned usage needs to
access it from RCU callback context.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup.h |   5 +-
 kernel/cgroup.c        | 144 ++++++++++++++++++++++++-------------------------
 2 files changed, 74 insertions(+), 75 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index a9dcf0e..4602073 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -13,7 +13,6 @@
 #include <linux/nodemask.h>
 #include <linux/rculist.h>
 #include <linux/cgroupstats.h>
-#include <linux/rwsem.h>
 #include <linux/fs.h>
 #include <linux/seq_file.h>
 #include <linux/kernfs.h>
@@ -367,11 +366,11 @@ static inline void css_put_many(struct cgroup_subsys_state *css, unsigned int n)
  */
 #ifdef CONFIG_PROVE_RCU
 extern struct mutex cgroup_mutex;
-extern struct rw_semaphore css_set_rwsem;
+extern spinlock_t css_set_lock;
 #define task_css_set_check(task, __c)					\
 	rcu_dereference_check((task)->cgroups,				\
 		lockdep_is_held(&cgroup_mutex) ||			\
-		lockdep_is_held(&css_set_rwsem) ||			\
+		lockdep_is_held(&css_set_lock) ||			\
 		((task)->flags & PF_EXITING) || (__c))
 #else
 #define task_css_set_check(task, __c)					\
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index e624363..0bfa26a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -45,7 +45,6 @@
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
-#include <linux/rwsem.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/string.h>
 #include <linux/sort.h>
@@ -76,7 +75,7 @@
  * cgroup_mutex is the master lock.  Any modification to cgroup or its
  * hierarchy must be performed while holding it.
  *
- * css_set_rwsem protects task->cgroups pointer, the list of css_set
+ * css_set_lock protects task->cgroups pointer, the list of css_set
  * objects, and the chain of tasks off each css_set.
  *
  * These locks are exported if CONFIG_PROVE_RCU so that accessors in
@@ -84,12 +83,12 @@
  */
 #ifdef CONFIG_PROVE_RCU
 DEFINE_MUTEX(cgroup_mutex);
-DECLARE_RWSEM(css_set_rwsem);
+DEFINE_SPINLOCK(css_set_lock);
 EXPORT_SYMBOL_GPL(cgroup_mutex);
-EXPORT_SYMBOL_GPL(css_set_rwsem);
+EXPORT_SYMBOL_GPL(css_set_lock);
 #else
 static DEFINE_MUTEX(cgroup_mutex);
-static DECLARE_RWSEM(css_set_rwsem);
+static DEFINE_SPINLOCK(css_set_lock);
 #endif
 
 /*
@@ -605,7 +604,7 @@ static int css_set_count	= 1;	/* 1 for init_css_set */
  */
 static bool css_set_populated(struct css_set *cset)
 {
-	lockdep_assert_held(&css_set_rwsem);
+	lockdep_assert_held(&css_set_lock);
 
 	return !list_empty(&cset->tasks) || !list_empty(&cset->mg_tasks);
 }
@@ -628,7 +627,7 @@ static bool css_set_populated(struct css_set *cset)
  */
 static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
 {
-	lockdep_assert_held(&css_set_rwsem);
+	lockdep_assert_held(&css_set_lock);
 
 	do {
 		bool trigger;
@@ -660,7 +659,7 @@ static void css_set_update_populated(struct css_set *cset, bool populated)
 {
 	struct cgrp_cset_link *link;
 
-	lockdep_assert_held(&css_set_rwsem);
+	lockdep_assert_held(&css_set_lock);
 
 	list_for_each_entry(link, &cset->cgrp_links, cgrp_link)
 		cgroup_update_populated(link->cgrp, populated);
@@ -685,7 +684,7 @@ static void css_set_move_task(struct task_struct *task,
 			      struct css_set *from_cset, struct css_set *to_cset,
 			      bool use_mg_tasks)
 {
-	lockdep_assert_held(&css_set_rwsem);
+	lockdep_assert_held(&css_set_lock);
 
 	if (from_cset) {
 		struct css_task_iter *it;
@@ -752,7 +751,7 @@ static void put_css_set_locked(struct css_set *cset)
 	struct cgroup_subsys *ss;
 	int ssid;
 
-	lockdep_assert_held(&css_set_rwsem);
+	lockdep_assert_held(&css_set_lock);
 
 	if (!atomic_dec_and_test(&cset->refcount))
 		return;
@@ -783,9 +782,9 @@ static void put_css_set(struct css_set *cset)
 	if (atomic_add_unless(&cset->refcount, -1, 1))
 		return;
 
-	down_write(&css_set_rwsem);
+	spin_lock_bh(&css_set_lock);
 	put_css_set_locked(cset);
-	up_write(&css_set_rwsem);
+	spin_unlock_bh(&css_set_lock);
 }
 
 /*
@@ -1007,11 +1006,11 @@ static struct css_set *find_css_set(struct css_set *old_cset,
 
 	/* First see if we already have a cgroup group that matches
 	 * the desired set */
-	down_read(&css_set_rwsem);
+	spin_lock_bh(&css_set_lock);
 	cset = find_existing_css_set(old_cset, cgrp, template);
 	if (cset)
 		get_css_set(cset);
-	up_read(&css_set_rwsem);
+	spin_unlock_bh(&css_set_lock);
 
 	if (cset)
 		return cset;
@@ -1039,7 +1038,7 @@ static struct css_set *find_css_set(struct css_set *old_cset,
 	 * find_existing_css_set() */
 	memcpy(cset->subsys, template, sizeof(cset->subsys));
 
-	down_write(&css_set_rwsem);
+	spin_lock_bh(&css_set_lock);
 	/* Add reference counts and links from the new css_set. */
 	list_for_each_entry(link, &old_cset->cgrp_links, cgrp_link) {
 		struct cgroup *c = link->cgrp;
@@ -1061,7 +1060,7 @@ static struct css_set *find_css_set(struct css_set *old_cset,
 		list_add_tail(&cset->e_cset_node[ssid],
 			      &cset->subsys[ssid]->cgroup->e_csets[ssid]);
 
-	up_write(&css_set_rwsem);
+	spin_unlock_bh(&css_set_lock);
 
 	return cset;
 }
@@ -1125,14 +1124,15 @@ static void cgroup_destroy_root(struct cgroup_root *root)
 	 * Release all the links from cset_links to this hierarchy's
 	 * root cgroup
 	 */
-	down_write(&css_set_rwsem);
+	spin_lock_bh(&css_set_lock);
 
 	list_for_each_entry_safe(link, tmp_link, &cgrp->cset_links, cset_link) {
 		list_del(&link->cset_link);
 		list_del(&link->cgrp_link);
 		kfree(link);
 	}
-	up_write(&css_set_rwsem);
+
+	spin_unlock_bh(&css_set_lock);
 
 	if (!list_empty(&root->root_list)) {
 		list_del(&root->root_list);
@@ -1154,7 +1154,7 @@ static struct cgroup *cset_cgroup_from_root(struct css_set *cset,
 	struct cgroup *res = NULL;
 
 	lockdep_assert_held(&cgroup_mutex);
-	lockdep_assert_held(&css_set_rwsem);
+	lockdep_assert_held(&css_set_lock);
 
 	if (cset == &init_css_set) {
 		res = &root->cgrp;
@@ -1177,7 +1177,7 @@ static struct cgroup *cset_cgroup_from_root(struct css_set *cset,
 
 /*
  * Return the cgroup for "task" from the given hierarchy. Must be
- * called with cgroup_mutex and css_set_rwsem held.
+ * called with cgroup_mutex and css_set_lock held.
  */
 static struct cgroup *task_cgroup_from_root(struct task_struct *task,
 					    struct cgroup_root *root)
@@ -1526,11 +1526,11 @@ static int rebind_subsystems(struct cgroup_root *dst_root,
 		ss->root = dst_root;
 		css->cgroup = dcgrp;
 
-		down_write(&css_set_rwsem);
+		spin_lock_bh(&css_set_lock);
 		hash_for_each(css_set_table, i, cset, hlist)
 			list_move_tail(&cset->e_cset_node[ss->id],
 				       &dcgrp->e_csets[ss->id]);
-		up_write(&css_set_rwsem);
+		spin_unlock_bh(&css_set_lock);
 
 		src_root->subsys_mask &= ~(1 << ssid);
 		scgrp->subtree_control &= ~(1 << ssid);
@@ -1807,7 +1807,7 @@ static void cgroup_enable_task_cg_lists(void)
 {
 	struct task_struct *p, *g;
 
-	down_write(&css_set_rwsem);
+	spin_lock_bh(&css_set_lock);
 
 	if (use_task_css_set_links)
 		goto out_unlock;
@@ -1846,7 +1846,7 @@ static void cgroup_enable_task_cg_lists(void)
 	} while_each_thread(g, p);
 	read_unlock(&tasklist_lock);
 out_unlock:
-	up_write(&css_set_rwsem);
+	spin_unlock_bh(&css_set_lock);
 }
 
 static void init_cgroup_housekeeping(struct cgroup *cgrp)
@@ -1910,7 +1910,7 @@ static int cgroup_setup_root(struct cgroup_root *root, unsigned long ss_mask)
 		goto out;
 
 	/*
-	 * We're accessing css_set_count without locking css_set_rwsem here,
+	 * We're accessing css_set_count without locking css_set_lock here,
 	 * but that's OK - it can only be increased by someone holding
 	 * cgroup_lock, and that's us. The worst that can happen is that we
 	 * have some link structures left over
@@ -1952,13 +1952,13 @@ static int cgroup_setup_root(struct cgroup_root *root, unsigned long ss_mask)
 	 * Link the root cgroup in this hierarchy into all the css_set
 	 * objects.
 	 */
-	down_write(&css_set_rwsem);
+	spin_lock_bh(&css_set_lock);
 	hash_for_each(css_set_table, i, cset, hlist) {
 		link_css_set(&tmp_links, cset, root_cgrp);
 		if (css_set_populated(cset))
 			cgroup_update_populated(root_cgrp, true);
 	}
-	up_write(&css_set_rwsem);
+	spin_unlock_bh(&css_set_lock);
 
 	BUG_ON(!list_empty(&root_cgrp->self.children));
 	BUG_ON(atomic_read(&root->nr_cgrps) != 1);
@@ -2191,7 +2191,7 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
 	char *path = NULL;
 
 	mutex_lock(&cgroup_mutex);
-	down_read(&css_set_rwsem);
+	spin_lock_bh(&css_set_lock);
 
 	root = idr_get_next(&cgroup_hierarchy_idr, &hierarchy_id);
 
@@ -2204,7 +2204,7 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
 			path = buf;
 	}
 
-	up_read(&css_set_rwsem);
+	spin_unlock_bh(&css_set_lock);
 	mutex_unlock(&cgroup_mutex);
 	return path;
 }
@@ -2253,7 +2253,7 @@ static void cgroup_taskset_add(struct task_struct *task,
 {
 	struct css_set *cset;
 
-	lockdep_assert_held(&css_set_rwsem);
+	lockdep_assert_held(&css_set_lock);
 
 	/* @task either already exited or can't exit until the end */
 	if (task->flags & PF_EXITING)
@@ -2359,7 +2359,7 @@ static int cgroup_taskset_migrate(struct cgroup_taskset *tset,
 	 * the new cgroup.  There are no failure cases after here, so this
 	 * is the commit point.
 	 */
-	down_write(&css_set_rwsem);
+	spin_lock_bh(&css_set_lock);
 	list_for_each_entry(cset, &tset->src_csets, mg_node) {
 		list_for_each_entry_safe(task, tmp_task, &cset->mg_tasks, cg_list) {
 			struct css_set *from_cset = task_css_set(task);
@@ -2370,7 +2370,7 @@ static int cgroup_taskset_migrate(struct cgroup_taskset *tset,
 			put_css_set_locked(from_cset);
 		}
 	}
-	up_write(&css_set_rwsem);
+	spin_unlock_bh(&css_set_lock);
 
 	/*
 	 * Migration is committed, all target tasks are now on dst_csets.
@@ -2394,13 +2394,13 @@ static int cgroup_taskset_migrate(struct cgroup_taskset *tset,
 			css->ss->cancel_attach(css, tset);
 	}
 out_release_tset:
-	down_write(&css_set_rwsem);
+	spin_lock_bh(&css_set_lock);
 	list_splice_init(&tset->dst_csets, &tset->src_csets);
 	list_for_each_entry_safe(cset, tmp_cset, &tset->src_csets, mg_node) {
 		list_splice_tail_init(&cset->mg_tasks, &cset->tasks);
 		list_del_init(&cset->mg_node);
 	}
-	up_write(&css_set_rwsem);
+	spin_unlock_bh(&css_set_lock);
 	return ret;
 }
 
@@ -2417,14 +2417,14 @@ static void cgroup_migrate_finish(struct list_head *preloaded_csets)
 
 	lockdep_assert_held(&cgroup_mutex);
 
-	down_write(&css_set_rwsem);
+	spin_lock_bh(&css_set_lock);
 	list_for_each_entry_safe(cset, tmp_cset, preloaded_csets, mg_preload_node) {
 		cset->mg_src_cgrp = NULL;
 		cset->mg_dst_cset = NULL;
 		list_del_init(&cset->mg_preload_node);
 		put_css_set_locked(cset);
 	}
-	up_write(&css_set_rwsem);
+	spin_unlock_bh(&css_set_lock);
 }
 
 /**
@@ -2450,7 +2450,7 @@ static void cgroup_migrate_add_src(struct css_set *src_cset,
 	struct cgroup *src_cgrp;
 
 	lockdep_assert_held(&cgroup_mutex);
-	lockdep_assert_held(&css_set_rwsem);
+	lockdep_assert_held(&css_set_lock);
 
 	src_cgrp = cset_cgroup_from_root(src_cset, dst_cgrp->root);
 
@@ -2566,7 +2566,7 @@ static int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 	 * already PF_EXITING could be freed from underneath us unless we
 	 * take an rcu_read_lock.
 	 */
-	down_write(&css_set_rwsem);
+	spin_lock_bh(&css_set_lock);
 	rcu_read_lock();
 	task = leader;
 	do {
@@ -2575,7 +2575,7 @@ static int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 			break;
 	} while_each_thread(leader, task);
 	rcu_read_unlock();
-	up_write(&css_set_rwsem);
+	spin_unlock_bh(&css_set_lock);
 
 	return cgroup_taskset_migrate(&tset, cgrp);
 }
@@ -2596,7 +2596,7 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
 	int ret;
 
 	/* look up all src csets */
-	down_read(&css_set_rwsem);
+	spin_lock_bh(&css_set_lock);
 	rcu_read_lock();
 	task = leader;
 	do {
@@ -2606,7 +2606,7 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
 			break;
 	} while_each_thread(leader, task);
 	rcu_read_unlock();
-	up_read(&css_set_rwsem);
+	spin_unlock_bh(&css_set_lock);
 
 	/* prepare dst csets and commit */
 	ret = cgroup_migrate_prepare_dst(dst_cgrp, &preloaded_csets);
@@ -2639,9 +2639,9 @@ static int cgroup_procs_write_permission(struct task_struct *task,
 		struct cgroup *cgrp;
 		struct inode *inode;
 
-		down_read(&css_set_rwsem);
+		spin_lock_bh(&css_set_lock);
 		cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
-		up_read(&css_set_rwsem);
+		spin_unlock_bh(&css_set_lock);
 
 		while (!cgroup_is_descendant(dst_cgrp, cgrp))
 			cgrp = cgroup_parent(cgrp);
@@ -2738,9 +2738,9 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 		if (root == &cgrp_dfl_root)
 			continue;
 
-		down_read(&css_set_rwsem);
+		spin_lock_bh(&css_set_lock);
 		from_cgrp = task_cgroup_from_root(from, root);
-		up_read(&css_set_rwsem);
+		spin_unlock_bh(&css_set_lock);
 
 		retval = cgroup_attach_task(from_cgrp, tsk, false);
 		if (retval)
@@ -2865,7 +2865,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	percpu_down_write(&cgroup_threadgroup_rwsem);
 
 	/* look up all csses currently attached to @cgrp's subtree */
-	down_read(&css_set_rwsem);
+	spin_lock_bh(&css_set_lock);
 	css_for_each_descendant_pre(css, cgroup_css(cgrp, NULL)) {
 		struct cgrp_cset_link *link;
 
@@ -2877,14 +2877,14 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 			cgroup_migrate_add_src(link->cset, cgrp,
 					       &preloaded_csets);
 	}
-	up_read(&css_set_rwsem);
+	spin_unlock_bh(&css_set_lock);
 
 	/* NULL dst indicates self on default hierarchy */
 	ret = cgroup_migrate_prepare_dst(NULL, &preloaded_csets);
 	if (ret)
 		goto out_finish;
 
-	down_write(&css_set_rwsem);
+	spin_lock_bh(&css_set_lock);
 	list_for_each_entry(src_cset, &preloaded_csets, mg_preload_node) {
 		struct task_struct *task, *ntask;
 
@@ -2896,7 +2896,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 		list_for_each_entry_safe(task, ntask, &src_cset->tasks, cg_list)
 			cgroup_taskset_add(task, &tset);
 	}
-	up_write(&css_set_rwsem);
+	spin_unlock_bh(&css_set_lock);
 
 	ret = cgroup_taskset_migrate(&tset, cgrp);
 out_finish:
@@ -3572,10 +3572,10 @@ static int cgroup_task_count(const struct cgroup *cgrp)
 	int count = 0;
 	struct cgrp_cset_link *link;
 
-	down_read(&css_set_rwsem);
+	spin_lock_bh(&css_set_lock);
 	list_for_each_entry(link, &cgrp->cset_links, cset_link)
 		count += atomic_read(&link->cset->refcount);
-	up_read(&css_set_rwsem);
+	spin_unlock_bh(&css_set_lock);
 	return count;
 }
 
@@ -3818,7 +3818,7 @@ static void css_task_iter_advance_css_set(struct css_task_iter *it)
 	struct cgrp_cset_link *link;
 	struct css_set *cset;
 
-	lockdep_assert_held(&css_set_rwsem);
+	lockdep_assert_held(&css_set_lock);
 
 	/* Advance to the next non-empty css_set */
 	do {
@@ -3876,7 +3876,7 @@ static void css_task_iter_advance(struct css_task_iter *it)
 {
 	struct list_head *l = it->task_pos;
 
-	lockdep_assert_held(&css_set_rwsem);
+	lockdep_assert_held(&css_set_lock);
 	WARN_ON_ONCE(!l);
 
 	/*
@@ -3913,7 +3913,7 @@ void css_task_iter_start(struct cgroup_subsys_state *css,
 
 	memset(it, 0, sizeof(*it));
 
-	down_write(&css_set_rwsem);
+	spin_lock_bh(&css_set_lock);
 
 	it->ss = css->ss;
 
@@ -3926,7 +3926,7 @@ void css_task_iter_start(struct cgroup_subsys_state *css,
 
 	css_task_iter_advance_css_set(it);
 
-	up_write(&css_set_rwsem);
+	spin_unlock_bh(&css_set_lock);
 }
 
 /**
@@ -3942,7 +3942,7 @@ struct task_struct *css_task_iter_next(struct css_task_iter *it)
 	if (!it->cset_pos)
 		return NULL;
 
-	down_write(&css_set_rwsem);
+	spin_lock_bh(&css_set_lock);
 
 	if (it->cur_task)
 		put_task_struct(it->cur_task);
@@ -3951,7 +3951,7 @@ struct task_struct *css_task_iter_next(struct css_task_iter *it)
 
 	css_task_iter_advance(it);
 
-	up_write(&css_set_rwsem);
+	spin_unlock_bh(&css_set_lock);
 
 	return it->cur_task;
 }
@@ -3965,10 +3965,10 @@ struct task_struct *css_task_iter_next(struct css_task_iter *it)
 void css_task_iter_end(struct css_task_iter *it)
 {
 	if (it->cur_cset) {
-		down_write(&css_set_rwsem);
+		spin_lock_bh(&css_set_lock);
 		list_del(&it->iters_node);
 		put_css_set_locked(it->cur_cset);
-		up_write(&css_set_rwsem);
+		spin_unlock_bh(&css_set_lock);
 	}
 
 	if (it->cur_task)
@@ -3997,10 +3997,10 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 	mutex_lock(&cgroup_mutex);
 
 	/* all tasks in @from are being moved, all csets are source */
-	down_read(&css_set_rwsem);
+	spin_lock_bh(&css_set_lock);
 	list_for_each_entry(link, &from->cset_links, cset_link)
 		cgroup_migrate_add_src(link->cset, to, &preloaded_csets);
-	up_read(&css_set_rwsem);
+	spin_unlock_bh(&css_set_lock);
 
 	ret = cgroup_migrate_prepare_dst(to, &preloaded_csets);
 	if (ret)
@@ -5353,7 +5353,7 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 		goto out;
 
 	mutex_lock(&cgroup_mutex);
-	down_read(&css_set_rwsem);
+	spin_lock_bh(&css_set_lock);
 
 	for_each_root(root) {
 		struct cgroup_subsys *ss;
@@ -5385,7 +5385,7 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 
 	retval = 0;
 out_unlock:
-	up_read(&css_set_rwsem);
+	spin_unlock_bh(&css_set_lock);
 	mutex_unlock(&cgroup_mutex);
 	kfree(buf);
 out:
@@ -5531,7 +5531,7 @@ void cgroup_post_fork(struct task_struct *child,
 	 * @child during its iteration.
 	 *
 	 * If we won the race, @child is associated with %current's
-	 * css_set.  Grabbing css_set_rwsem guarantees both that the
+	 * css_set.  Grabbing css_set_lock guarantees both that the
 	 * association is stable, and, on completion of the parent's
 	 * migration, @child is visible in the source of migration or
 	 * already in the destination cgroup.  This guarantee is necessary
@@ -5546,13 +5546,13 @@ void cgroup_post_fork(struct task_struct *child,
 	if (use_task_css_set_links) {
 		struct css_set *cset;
 
-		down_write(&css_set_rwsem);
+		spin_lock_bh(&css_set_lock);
 		cset = task_css_set(current);
 		if (list_empty(&child->cg_list)) {
 			get_css_set(cset);
 			css_set_move_task(child, NULL, cset, false);
 		}
-		up_write(&css_set_rwsem);
+		spin_unlock_bh(&css_set_lock);
 	}
 
 	/*
@@ -5597,9 +5597,9 @@ void cgroup_exit(struct task_struct *tsk)
 	cset = task_css_set(tsk);
 
 	if (!list_empty(&tsk->cg_list)) {
-		down_write(&css_set_rwsem);
+		spin_lock_bh(&css_set_lock);
 		css_set_move_task(tsk, cset, NULL, false);
-		up_write(&css_set_rwsem);
+		spin_unlock_bh(&css_set_lock);
 		put_cset = true;
 	}
 
@@ -5817,7 +5817,7 @@ static int current_css_set_cg_links_read(struct seq_file *seq, void *v)
 	if (!name_buf)
 		return -ENOMEM;
 
-	down_read(&css_set_rwsem);
+	spin_lock_bh(&css_set_lock);
 	rcu_read_lock();
 	cset = rcu_dereference(current->cgroups);
 	list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
@@ -5828,7 +5828,7 @@ static int current_css_set_cg_links_read(struct seq_file *seq, void *v)
 			   c->root->hierarchy_id, name_buf);
 	}
 	rcu_read_unlock();
-	up_read(&css_set_rwsem);
+	spin_unlock_bh(&css_set_lock);
 	kfree(name_buf);
 	return 0;
 }
@@ -5839,7 +5839,7 @@ static int cgroup_css_links_read(struct seq_file *seq, void *v)
 	struct cgroup_subsys_state *css = seq_css(seq);
 	struct cgrp_cset_link *link;
 
-	down_read(&css_set_rwsem);
+	spin_lock_bh(&css_set_lock);
 	list_for_each_entry(link, &css->cgroup->cset_links, cset_link) {
 		struct css_set *cset = link->cset;
 		struct task_struct *task;
@@ -5862,7 +5862,7 @@ static int cgroup_css_links_read(struct seq_file *seq, void *v)
 	overflow:
 		seq_puts(seq, "  ...\n");
 	}
-	up_read(&css_set_rwsem);
+	spin_unlock_bh(&css_set_lock);
 	return 0;
 }
 
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 13/14] cgroup: keep zombies associated with their original cgroups
  2015-10-10  3:29 [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and fix pids controller Tejun Heo
                   ` (11 preceding siblings ...)
  2015-10-11 13:30 ` [PATCH 12/14] cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock Tejun Heo
@ 2015-10-11 13:30 ` Tejun Heo
  2015-10-12 17:44   ` [PATCH v2 " Tejun Heo
  2015-10-11 13:30 ` [PATCH 14/14] cgroup: add cgroup_subsys->free() method and use it to fix pids controller Tejun Heo
  2015-10-15  1:38 ` [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and " Tejun Heo
  14 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2015-10-11 13:30 UTC (permalink / raw)
  To: lizefan, hannes
  Cc: cgroups, cyphar, linux-kernel, kernel-team, Tejun Heo,
	Ingo Molnar, Peter Zijlstra, Arnaldo Carvalho de Melo

cgroup_exit() is called when a task exits and disassociates the
exiting task from its cgroups and half-attaches it to the root cgroup.
This is unnecessary and undesirable.

No controller actually needs an exiting task to be disassociated from
non-root cgroups.  Both cpu and perf_event controllers update the
association to the root cgroup from their exit callbacks just to stay
consistent with the cgroup core behavior.

Also, this disassociation makes it difficult to track resources held
by zombies or determine where the zombies came from.  Currently, pids
controller is completely broken as it uncharges on exit and zombies
always escape the resource restriction.  With cgroup association being
reset on exit, fixing it is pretty painful.

There's no reason to reset cgroup membership on exit.  The zombie can
be removed from its css_set so that it doesn't show up on
"cgroup.procs" and thus can't be migrated or interfere with cgroup
removal.  It can still pin and point to the css_set so that its cgroup
membership is maintained.  This patch makes cgroup core keep zombies
associated with their cgroups at the time of exit.

* Previous patches decoupled populated_cnt tracking from css_set
  lifetime, so a dying task can be simply unlinked from its css_set
  while pinning and pointing to the css_set.  This keeps css_set
  association from task side alive while hiding it from "cgroup.procs"
  and populated_cnt tracking.  The css_set reference is dropped when
  the task_struct is freed.

* ->exit() callback no longer needs the css arguments as the
  associated css never changes once PF_EXITING is set.  Removed.

* cpu and perf_events controllers no longer need ->exit() callbacks.
  There's no reason to explicitly switch away on exit.  The final
  schedule out is enough.  The callbacks are removed.

* On traditional hierarchies, nothing changes.  "/proc/PID/cgroup"
  still reports "/" for all zombies.  On the default hierarchy,
  "/proc/PID/cgroup" keeps reporting the cgroup that the task belonged
  to at the time of exit.  If the cgroup gets removed before the task
  is reaped, " (deleted)" is appended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
---
 Documentation/cgroups/unified-hierarchy.txt |  4 +++
 include/linux/cgroup-defs.h                 |  4 +--
 include/linux/cgroup.h                      |  1 +
 kernel/cgroup.c                             | 51 +++++++++++++++++++----------
 kernel/cgroup_pids.c                        |  6 ++--
 kernel/events/core.c                        | 16 ---------
 kernel/fork.c                               |  1 +
 kernel/sched/core.c                         | 16 ---------
 8 files changed, 43 insertions(+), 56 deletions(-)

diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt
index 176b940..6932453 100644
--- a/Documentation/cgroups/unified-hierarchy.txt
+++ b/Documentation/cgroups/unified-hierarchy.txt
@@ -374,6 +374,10 @@ supported and the interface files "release_agent" and
 
 - The "cgroup.clone_children" file is removed.
 
+- /proc/PID/cgroup keeps reporting the cgroup that a zombie belonged
+  to before exiting.  If the cgroup is removed before the zombie is
+  reaped, " (deleted)" is appended to the path.
+
 
 5-3. Controller File Conventions
 
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 62413c3..6a1ab64 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -435,9 +435,7 @@ struct cgroup_subsys {
 	int (*can_fork)(struct task_struct *task, void **priv_p);
 	void (*cancel_fork)(struct task_struct *task, void *priv);
 	void (*fork)(struct task_struct *task, void *priv);
-	void (*exit)(struct cgroup_subsys_state *css,
-		     struct cgroup_subsys_state *old_css,
-		     struct task_struct *task);
+	void (*exit)(struct task_struct *task);
 	void (*bind)(struct cgroup_subsys_state *root_css);
 
 	int early_init;
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4602073..0e58858 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -102,6 +102,7 @@ extern void cgroup_cancel_fork(struct task_struct *p,
 extern void cgroup_post_fork(struct task_struct *p,
 			     void *old_ss_priv[CGROUP_CANFORK_COUNT]);
 void cgroup_exit(struct task_struct *p);
+void cgroup_free(struct task_struct *p);
 
 int cgroup_init_early(void);
 int cgroup_init(void);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0bfa26a..0bfab7c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5373,14 +5373,34 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 			seq_printf(m, "%sname=%s", count ? "," : "",
 				   root->name);
 		seq_putc(m, ':');
+
 		cgrp = task_cgroup_from_root(tsk, root);
-		path = cgroup_path(cgrp, buf, PATH_MAX);
-		if (!path) {
-			retval = -ENAMETOOLONG;
-			goto out_unlock;
+
+		/*
+		 * On traditional hierarchies, all zombie tasks show up as
+		 * belonging to the root cgroup.  On the default hierarchy,
+		 * while a zombie doesn't show up in "cgroup.procs" and
+		 * thus can't be migrated, its /proc/PID/cgroup keeps
+		 * reporting the cgroup it belonged to before exiting.  If
+		 * the cgroup is removed before the zombie is reaped,
+		 * " (deleted)" is appended to the cgroup path.
+		 */
+		if (cgroup_on_dfl(cgrp) || !(tsk->flags & PF_EXITING)) {
+			path = cgroup_path(cgrp, buf, PATH_MAX);
+			if (!path) {
+				retval = -ENAMETOOLONG;
+				goto out_unlock;
+			}
+		} else {
+			path = "/";
 		}
+
 		seq_puts(m, path);
-		seq_putc(m, '\n');
+
+		if (cgroup_on_dfl(cgrp) && cgroup_is_dead(cgrp))
+			seq_puts(m, " (deleted)\n");
+		else
+			seq_putc(m, '\n');
 	}
 
 	retval = 0;
@@ -5587,7 +5607,6 @@ void cgroup_exit(struct task_struct *tsk)
 {
 	struct cgroup_subsys *ss;
 	struct css_set *cset;
-	bool put_cset = false;
 	int i;
 
 	/*
@@ -5600,22 +5619,20 @@ void cgroup_exit(struct task_struct *tsk)
 		spin_lock_bh(&css_set_lock);
 		css_set_move_task(tsk, cset, NULL, false);
 		spin_unlock_bh(&css_set_lock);
-		put_cset = true;
+	} else {
+		get_css_set(cset);
 	}
 
-	/* Reassign the task to the init_css_set. */
-	RCU_INIT_POINTER(tsk->cgroups, &init_css_set);
-
 	/* see cgroup_post_fork() for details */
-	for_each_subsys_which(ss, i, &have_exit_callback) {
-		struct cgroup_subsys_state *old_css = cset->subsys[i];
-		struct cgroup_subsys_state *css = task_css(tsk, i);
+	for_each_subsys_which(ss, i, &have_exit_callback)
+		ss->exit(tsk);
+}
 
-		ss->exit(css, old_css, tsk);
-	}
+void cgroup_free(struct task_struct *task)
+{
+	struct css_set *cset = task_css_set(task);
 
-	if (put_cset)
-		put_css_set(cset);
+	put_css_set(cset);
 }
 
 static void check_for_release(struct cgroup *cgrp)
diff --git a/kernel/cgroup_pids.c b/kernel/cgroup_pids.c
index 806cd76..45f0856 100644
--- a/kernel/cgroup_pids.c
+++ b/kernel/cgroup_pids.c
@@ -266,11 +266,9 @@ static void pids_fork(struct task_struct *task, void *priv)
 	css_put(old_css);
 }
 
-static void pids_exit(struct cgroup_subsys_state *css,
-		      struct cgroup_subsys_state *old_css,
-		      struct task_struct *task)
+static void pids_exit(struct task_struct *task)
 {
-	struct pids_cgroup *pids = css_pids(old_css);
+	struct pids_cgroup *pids = css_pids(task_css(task, pids_cgrp_id));
 
 	pids_uncharge(pids, 1);
 }
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f548f69..e987494 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9293,25 +9293,9 @@ static void perf_cgroup_attach(struct cgroup_subsys_state *css,
 		task_function_call(task, __perf_cgroup_move, task);
 }
 
-static void perf_cgroup_exit(struct cgroup_subsys_state *css,
-			     struct cgroup_subsys_state *old_css,
-			     struct task_struct *task)
-{
-	/*
-	 * cgroup_exit() is called in the copy_process() failure path.
-	 * Ignore this case since the task hasn't ran yet, this avoids
-	 * trying to poke a half freed task state from generic code.
-	 */
-	if (!(task->flags & PF_EXITING))
-		return;
-
-	task_function_call(task, __perf_cgroup_move, task);
-}
-
 struct cgroup_subsys perf_event_cgrp_subsys = {
 	.css_alloc	= perf_cgroup_css_alloc,
 	.css_free	= perf_cgroup_css_free,
-	.exit		= perf_cgroup_exit,
 	.attach		= perf_cgroup_attach,
 };
 #endif /* CONFIG_CGROUP_PERF */
diff --git a/kernel/fork.c b/kernel/fork.c
index 7d5f0f1..118743b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -251,6 +251,7 @@ void __put_task_struct(struct task_struct *tsk)
 	WARN_ON(atomic_read(&tsk->usage));
 	WARN_ON(tsk == current);
 
+	cgroup_free(tsk);
 	task_numa_free(tsk);
 	security_task_free(tsk);
 	exit_creds(tsk);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3595403..2cad9ba 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8163,21 +8163,6 @@ static void cpu_cgroup_attach(struct cgroup_subsys_state *css,
 		sched_move_task(task);
 }
 
-static void cpu_cgroup_exit(struct cgroup_subsys_state *css,
-			    struct cgroup_subsys_state *old_css,
-			    struct task_struct *task)
-{
-	/*
-	 * cgroup_exit() is called in the copy_process() failure path.
-	 * Ignore this case since the task hasn't ran yet, this avoids
-	 * trying to poke a half freed task state from generic code.
-	 */
-	if (!(task->flags & PF_EXITING))
-		return;
-
-	sched_move_task(task);
-}
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
 				struct cftype *cftype, u64 shareval)
@@ -8509,7 +8494,6 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,
-	.exit		= cpu_cgroup_exit,
 	.legacy_cftypes	= cpu_files,
 	.early_init	= 1,
 };
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 14/14] cgroup: add cgroup_subsys->free() method and use it to fix pids controller
  2015-10-10  3:29 [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and fix pids controller Tejun Heo
                   ` (12 preceding siblings ...)
  2015-10-11 13:30 ` [PATCH 13/14] cgroup: keep zombies associated with their original cgroups Tejun Heo
@ 2015-10-11 13:30 ` Tejun Heo
  2015-10-12 10:29   ` Aleksa Sarai
  2015-10-15  1:38 ` [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and " Tejun Heo
  14 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2015-10-11 13:30 UTC (permalink / raw)
  To: lizefan, hannes; +Cc: cgroups, cyphar, linux-kernel, kernel-team, Tejun Heo

pids controller is completely broken in that it uncharges when a task
exits, allowing zombies to escape resource control.  With the recent
updates, cgroup core now maintains cgroup association until the task
is freed, so the pids controller can be fixed by uncharging on free
instead of on exit.

This patch adds a cgroup_subsys->free() method and updates the pids
controller to use it instead of ->exit() for uncharging.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
---
 Documentation/cgroups/cgroups.txt | 4 ++++
 include/linux/cgroup-defs.h       | 1 +
 kernel/cgroup.c                   | 7 +++++++
 kernel/cgroup_pids.c              | 4 ++--
 4 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index f935fac..c6256ae 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -637,6 +637,10 @@ void exit(struct task_struct *task)
 
 Called during task exit.
 
+void free(struct task_struct *task)
+
+Called when the task_struct is freed.
+
 void bind(struct cgroup *root)
 (cgroup_mutex held by caller)
 
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 6a1ab64..60d44b2 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -436,6 +436,7 @@ struct cgroup_subsys {
 	void (*cancel_fork)(struct task_struct *task, void *priv);
 	void (*fork)(struct task_struct *task, void *priv);
 	void (*exit)(struct task_struct *task);
+	void (*free)(struct task_struct *task);
 	void (*bind)(struct cgroup_subsys_state *root_css);
 
 	int early_init;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0bfab7c..21318e7 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -206,6 +206,7 @@ static u64 css_serial_nr_next = 1;
  */
 static unsigned long have_fork_callback __read_mostly;
 static unsigned long have_exit_callback __read_mostly;
+static unsigned long have_free_callback __read_mostly;
 
 /* Ditto for the can_fork callback. */
 static unsigned long have_canfork_callback __read_mostly;
@@ -5174,6 +5175,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
 
 	have_fork_callback |= (bool)ss->fork << ss->id;
 	have_exit_callback |= (bool)ss->exit << ss->id;
+	have_free_callback |= (bool)ss->free << ss->id;
 	have_canfork_callback |= (bool)ss->can_fork << ss->id;
 
 	/* At system boot, before all subsystems have been
@@ -5631,6 +5633,11 @@ void cgroup_exit(struct task_struct *tsk)
 void cgroup_free(struct task_struct *task)
 {
 	struct css_set *cset = task_css_set(task);
+	struct cgroup_subsys *ss;
+	int ssid;
+
+	for_each_subsys_which(ss, ssid, &have_free_callback)
+		ss->free(task);
 
 	put_css_set(cset);
 }
diff --git a/kernel/cgroup_pids.c b/kernel/cgroup_pids.c
index 45f0856..cdd8df4 100644
--- a/kernel/cgroup_pids.c
+++ b/kernel/cgroup_pids.c
@@ -266,7 +266,7 @@ static void pids_fork(struct task_struct *task, void *priv)
 	css_put(old_css);
 }
 
-static void pids_exit(struct task_struct *task)
+static void pids_free(struct task_struct *task)
 {
 	struct pids_cgroup *pids = css_pids(task_css(task, pids_cgrp_id));
 
@@ -347,7 +347,7 @@ struct cgroup_subsys pids_cgrp_subsys = {
 	.can_fork	= pids_can_fork,
 	.cancel_fork	= pids_cancel_fork,
 	.fork		= pids_fork,
-	.exit		= pids_exit,
+	.free		= pids_free,
 	.legacy_cftypes	= pids_files,
 	.dfl_cftypes	= pids_files,
 };
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 14/14] cgroup: add cgroup_subsys->free() method and use it to fix pids controller
  2015-10-11 13:30 ` [PATCH 14/14] cgroup: add cgroup_subsys->free() method and use it to fix pids controller Tejun Heo
@ 2015-10-12 10:29   ` Aleksa Sarai
  2015-10-12 15:25     ` Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Aleksa Sarai @ 2015-10-12 10:29 UTC (permalink / raw)
  To: Tejun Heo; +Cc: lizefan, hannes, cgroups, linux-kernel, kernel-team

> pids controller is completely broken in that it uncharges when a task
> exits allowing zombies to escape resource control.  With the recent
> updates, cgroup core now maintains cgroup association till task free
> and pids controller can be fixed by uncharging on free instead of
> exit.

Looks good to me. Out of interest, is there any reason why we still
have ->exit(), given the zombie process edge case? Surely the zombie
process edge case would cause issues with kmemcg and similar
controllers, if they use ->exit() and not ->free()?

-- 
Aleksa Sarai (cyphar)
www.cyphar.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 14/14] cgroup: add cgroup_subsys->free() method and use it to fix pids controller
  2015-10-12 10:29   ` Aleksa Sarai
@ 2015-10-12 15:25     ` Tejun Heo
  0 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2015-10-12 15:25 UTC (permalink / raw)
  To: Aleksa Sarai; +Cc: lizefan, hannes, cgroups, linux-kernel, kernel-team

Hello,

On Mon, Oct 12, 2015 at 09:29:14PM +1100, Aleksa Sarai wrote:
> > pids controller is completely broken in that it uncharges when a task
> > exits allowing zombies to escape resource control.  With the recent
> > updates, cgroup core now maintains cgroup association till task free
> > and pids controller can be fixed by uncharging on free instead of
> > exit.
> 
> Looks good to me. Out of interest, is there any reason why we still
> have ->exit(), given the zombie process edge case? Surely the zombie
> process edge case would cause issues with kmemcg and similar
> controllers, if they use ->exit() and not ->free()?

No, pids was the only broken one.  ->exit() is no longer used but
let's see how it goes for a while before getting rid of it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v2 13/14] cgroup: keep zombies associated with their original cgroups
  2015-10-11 13:30 ` [PATCH 13/14] cgroup: keep zombies associated with their original cgroups Tejun Heo
@ 2015-10-12 17:44   ` Tejun Heo
  0 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2015-10-12 17:44 UTC (permalink / raw)
  To: lizefan, hannes
  Cc: cgroups, cyphar, linux-kernel, kernel-team, Ingo Molnar,
	Peter Zijlstra, Arnaldo Carvalho de Melo

>From 202cd16bb2011fc09488683b57eb68967bda9373 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Mon, 12 Oct 2015 13:43:29 -0400

cgroup_exit() is called when a task exits and disassociates the
exiting task from its cgroups and half-attaches it to the root cgroup.
This is unnecessary and undesirable.

No controller actually needs an exiting task to be disassociated from
non-root cgroups.  Both cpu and perf_event controllers update the
association to the root cgroup from their exit callbacks just to stay
consistent with the cgroup core behavior.

Also, this disassociation makes it difficult to track resources held
by zombies or determine where the zombies came from.  Currently, pids
controller is completely broken as it uncharges on exit and zombies
always escape the resource restriction.  With cgroup association being
reset on exit, fixing it is pretty painful.

There's no reason to reset cgroup membership on exit.  The zombie can
be removed from its css_set so that it doesn't show up on
"cgroup.procs" and thus can't be migrated or interfere with cgroup
removal.  It can still pin and point to the css_set so that its cgroup
membership is maintained.  This patch makes cgroup core keep zombies
associated with their cgroups at the time of exit.

* Previous patches decoupled populated_cnt tracking from css_set
  lifetime, so a dying task can be simply unlinked from its css_set
  while pinning and pointing to the css_set.  This keeps css_set
  association from task side alive while hiding it from "cgroup.procs"
  and populated_cnt tracking.  The css_set reference is dropped when
  the task_struct is freed.

* ->exit() callback no longer needs the css arguments as the
  associated css never changes once PF_EXITING is set.  Removed.

* cpu and perf_events controllers no longer need ->exit() callbacks.
  There's no reason to explicitly switch away on exit.  The final
  schedule out is enough.  The callbacks are removed.

* On traditional hierarchies, nothing changes.  "/proc/PID/cgroup"
  still reports "/" for all zombies.  On the default hierarchy,
  "/proc/PID/cgroup" keeps reporting the cgroup that the task belonged
  to at the time of exit.  If the cgroup gets removed before the task
  is reaped, " (deleted)" is appended.

v2: Build breakage due to missing dummy cgroup_free() when
    !CONFIG_CGROUP fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
---
 Documentation/cgroups/unified-hierarchy.txt |  4 +++
 include/linux/cgroup-defs.h                 |  4 +--
 include/linux/cgroup.h                      |  2 ++
 kernel/cgroup.c                             | 51 +++++++++++++++++++----------
 kernel/cgroup_pids.c                        |  6 ++--
 kernel/events/core.c                        | 16 ---------
 kernel/fork.c                               |  1 +
 kernel/sched/core.c                         | 16 ---------
 8 files changed, 44 insertions(+), 56 deletions(-)

diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt
index 176b940..6932453 100644
--- a/Documentation/cgroups/unified-hierarchy.txt
+++ b/Documentation/cgroups/unified-hierarchy.txt
@@ -374,6 +374,10 @@ supported and the interface files "release_agent" and
 
 - The "cgroup.clone_children" file is removed.
 
+- /proc/PID/cgroup keeps reporting the cgroup that a zombie belonged
+  to before exiting.  If the cgroup is removed before the zombie is
+  reaped, " (deleted)" is appended to the path.
+
 
 5-3. Controller File Conventions
 
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 62413c3..6a1ab64 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -435,9 +435,7 @@ struct cgroup_subsys {
 	int (*can_fork)(struct task_struct *task, void **priv_p);
 	void (*cancel_fork)(struct task_struct *task, void *priv);
 	void (*fork)(struct task_struct *task, void *priv);
-	void (*exit)(struct cgroup_subsys_state *css,
-		     struct cgroup_subsys_state *old_css,
-		     struct task_struct *task);
+	void (*exit)(struct task_struct *task);
 	void (*bind)(struct cgroup_subsys_state *root_css);
 
 	int early_init;
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4602073..22e3754 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -102,6 +102,7 @@ extern void cgroup_cancel_fork(struct task_struct *p,
 extern void cgroup_post_fork(struct task_struct *p,
 			     void *old_ss_priv[CGROUP_CANFORK_COUNT]);
 void cgroup_exit(struct task_struct *p);
+void cgroup_free(struct task_struct *p);
 
 int cgroup_init_early(void);
 int cgroup_init(void);
@@ -547,6 +548,7 @@ static inline void cgroup_cancel_fork(struct task_struct *p,
 static inline void cgroup_post_fork(struct task_struct *p,
 				    void *ss_priv[CGROUP_CANFORK_COUNT]) {}
 static inline void cgroup_exit(struct task_struct *p) {}
+static inline void cgroup_free(struct task_struct *p) {}
 
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0bfa26a..0bfab7c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5373,14 +5373,34 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 			seq_printf(m, "%sname=%s", count ? "," : "",
 				   root->name);
 		seq_putc(m, ':');
+
 		cgrp = task_cgroup_from_root(tsk, root);
-		path = cgroup_path(cgrp, buf, PATH_MAX);
-		if (!path) {
-			retval = -ENAMETOOLONG;
-			goto out_unlock;
+
+		/*
+		 * On traditional hierarchies, all zombie tasks show up as
+		 * belonging to the root cgroup.  On the default hierarchy,
+		 * while a zombie doesn't show up in "cgroup.procs" and
+		 * thus can't be migrated, its /proc/PID/cgroup keeps
+		 * reporting the cgroup it belonged to before exiting.  If
+		 * the cgroup is removed before the zombie is reaped,
+		 * " (deleted)" is appended to the cgroup path.
+		 */
+		if (cgroup_on_dfl(cgrp) || !(tsk->flags & PF_EXITING)) {
+			path = cgroup_path(cgrp, buf, PATH_MAX);
+			if (!path) {
+				retval = -ENAMETOOLONG;
+				goto out_unlock;
+			}
+		} else {
+			path = "/";
 		}
+
 		seq_puts(m, path);
-		seq_putc(m, '\n');
+
+		if (cgroup_on_dfl(cgrp) && cgroup_is_dead(cgrp))
+			seq_puts(m, " (deleted)\n");
+		else
+			seq_putc(m, '\n');
 	}
 
 	retval = 0;
@@ -5587,7 +5607,6 @@ void cgroup_exit(struct task_struct *tsk)
 {
 	struct cgroup_subsys *ss;
 	struct css_set *cset;
-	bool put_cset = false;
 	int i;
 
 	/*
@@ -5600,22 +5619,20 @@ void cgroup_exit(struct task_struct *tsk)
 		spin_lock_bh(&css_set_lock);
 		css_set_move_task(tsk, cset, NULL, false);
 		spin_unlock_bh(&css_set_lock);
-		put_cset = true;
+	} else {
+		get_css_set(cset);
 	}
 
-	/* Reassign the task to the init_css_set. */
-	RCU_INIT_POINTER(tsk->cgroups, &init_css_set);
-
 	/* see cgroup_post_fork() for details */
-	for_each_subsys_which(ss, i, &have_exit_callback) {
-		struct cgroup_subsys_state *old_css = cset->subsys[i];
-		struct cgroup_subsys_state *css = task_css(tsk, i);
+	for_each_subsys_which(ss, i, &have_exit_callback)
+		ss->exit(tsk);
+}
 
-		ss->exit(css, old_css, tsk);
-	}
+void cgroup_free(struct task_struct *task)
+{
+	struct css_set *cset = task_css_set(task);
 
-	if (put_cset)
-		put_css_set(cset);
+	put_css_set(cset);
 }
 
 static void check_for_release(struct cgroup *cgrp)
diff --git a/kernel/cgroup_pids.c b/kernel/cgroup_pids.c
index 806cd76..45f0856 100644
--- a/kernel/cgroup_pids.c
+++ b/kernel/cgroup_pids.c
@@ -266,11 +266,9 @@ static void pids_fork(struct task_struct *task, void *priv)
 	css_put(old_css);
 }
 
-static void pids_exit(struct cgroup_subsys_state *css,
-		      struct cgroup_subsys_state *old_css,
-		      struct task_struct *task)
+static void pids_exit(struct task_struct *task)
 {
-	struct pids_cgroup *pids = css_pids(old_css);
+	struct pids_cgroup *pids = css_pids(task_css(task, pids_cgrp_id));
 
 	pids_uncharge(pids, 1);
 }
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f548f69..e987494 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9293,25 +9293,9 @@ static void perf_cgroup_attach(struct cgroup_subsys_state *css,
 		task_function_call(task, __perf_cgroup_move, task);
 }
 
-static void perf_cgroup_exit(struct cgroup_subsys_state *css,
-			     struct cgroup_subsys_state *old_css,
-			     struct task_struct *task)
-{
-	/*
-	 * cgroup_exit() is called in the copy_process() failure path.
-	 * Ignore this case since the task hasn't ran yet, this avoids
-	 * trying to poke a half freed task state from generic code.
-	 */
-	if (!(task->flags & PF_EXITING))
-		return;
-
-	task_function_call(task, __perf_cgroup_move, task);
-}
-
 struct cgroup_subsys perf_event_cgrp_subsys = {
 	.css_alloc	= perf_cgroup_css_alloc,
 	.css_free	= perf_cgroup_css_free,
-	.exit		= perf_cgroup_exit,
 	.attach		= perf_cgroup_attach,
 };
 #endif /* CONFIG_CGROUP_PERF */
diff --git a/kernel/fork.c b/kernel/fork.c
index 7d5f0f1..118743b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -251,6 +251,7 @@ void __put_task_struct(struct task_struct *tsk)
 	WARN_ON(atomic_read(&tsk->usage));
 	WARN_ON(tsk == current);
 
+	cgroup_free(tsk);
 	task_numa_free(tsk);
 	security_task_free(tsk);
 	exit_creds(tsk);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3595403..2cad9ba 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8163,21 +8163,6 @@ static void cpu_cgroup_attach(struct cgroup_subsys_state *css,
 		sched_move_task(task);
 }
 
-static void cpu_cgroup_exit(struct cgroup_subsys_state *css,
-			    struct cgroup_subsys_state *old_css,
-			    struct task_struct *task)
-{
-	/*
-	 * cgroup_exit() is called in the copy_process() failure path.
-	 * Ignore this case since the task hasn't ran yet, this avoids
-	 * trying to poke a half freed task state from generic code.
-	 */
-	if (!(task->flags & PF_EXITING))
-		return;
-
-	sched_move_task(task);
-}
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
 				struct cftype *cftype, u64 shareval)
@@ -8509,7 +8494,6 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,
-	.exit		= cpu_cgroup_exit,
 	.legacy_cftypes	= cpu_files,
 	.early_init	= 1,
 };
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 06/14] cgroup: make css_sets pin the associated cgroups
  2015-10-10  3:29 ` [PATCH 06/14] cgroup: make css_sets pin the associated cgroups Tejun Heo
@ 2015-10-15  1:34   ` Tejun Heo
  0 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2015-10-15  1:34 UTC (permalink / raw)
  To: lizefan, hannes; +Cc: cgroups, cyphar, linux-kernel, kernel-team

From 12ab687370ca062d263848b52515c839809383aa Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Oct 2015 21:31:40 -0400

Currently, css_sets don't pin the associated cgroups.  This is okay as
a cgroup with css_sets associated is not allowed to be removed;
however, to help resource tracking for zombie tasks, this is scheduled
to change such that a cgroup can be removed even when it has css_sets
associated, as long as none of them are populated.

To ensure that a cgroup doesn't go away while css_sets are still
associated with it, make each associated css_set hold a reference on
the cgroup if non-root.

v2: Root cgroups are special and shouldn't be ref'd by css_sets.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index ab5c9a5..ea87340 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -705,6 +705,8 @@ static void put_css_set_locked(struct css_set *cset)
 	list_for_each_entry_safe(link, tmp_link, &cset->cgrp_links, cgrp_link) {
 		list_del(&link->cset_link);
 		list_del(&link->cgrp_link);
+		if (cgroup_parent(link->cgrp))
+			cgroup_put(link->cgrp);
 		kfree(link);
 	}
 
@@ -919,6 +921,9 @@ static void link_css_set(struct list_head *tmp_links, struct css_set *cset,
 	 * is sorted by order of hierarchy creation
 	 */
 	list_add_tail(&link->cgrp_link, &cset->cgrp_links);
+
+	if (cgroup_parent(cgrp))
+		cgroup_get(cgrp);
 }
 
 /**
@@ -4998,10 +5003,7 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 
 	lockdep_assert_held(&cgroup_mutex);
 
-	/*
-	 * css_set_rwsem synchronizes access to ->cset_links and prevents
-	 * @cgrp from being removed while put_css_set() is in progress.
-	 */
+	/* css_set_rwsem synchronizes access to ->cset_links */
 	down_read(&css_set_rwsem);
 	empty = list_empty(&cgrp->cset_links);
 	up_read(&css_set_rwsem);
-- 
2.4.3



* [PATCH v2 11/14] cgroup: don't hold css_set_rwsem across css task iteration
  2015-10-11 13:30 ` [PATCH 11/14] cgroup: don't hold css_set_rwsem across css task iteration Tejun Heo
@ 2015-10-15  1:35   ` Tejun Heo
  0 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2015-10-15  1:35 UTC (permalink / raw)
  To: lizefan, hannes; +Cc: cgroups, cyphar, linux-kernel, kernel-team

From 5cd30f075f6c8e3d68852661d317ffd1f26f96f9 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 14 Oct 2015 21:31:41 -0400

css_sets are synchronized through css_set_rwsem but the locking scheme
is awkward.  The hot paths, fork and exit, have to write-lock the
rwsem, making the rw part pointless; furthermore, many readers already
hold cgroup_mutex.

One of the readers is css task iteration.  It read locks the rwsem
over the entire duration of iteration.  This leads to silly locking
behavior.  When cpuset tries to migrate processes of a cgroup to a
different NUMA node, css_set_rwsem is held across the entire migration
attempt, which can take a long time, locking out forking, exiting and
other cgroup operations.

This patch updates css task iteration so that it locks css_set_rwsem
only while the iterator is being advanced.  css task iteration
involves two levels: css_set and task iteration.  As css_sets in use
are practically immutable, simply pinning the current one is enough
for resuming iteration afterwards.  Task iteration is trickier as
tasks may leave their css_set while iteration is in progress.  This is
solved by keeping track of active iterators and advancing them if
their next task leaves its css_set.

v2: put_task_struct() in css_task_iter_next() moved outside
    css_set_rwsem.  A later patch will add cgroup operations to
    task_struct free path which may grab the same lock and this avoids
    deadlock possibilities.

    css_set_move_task() updated to use list_for_each_entry_safe() when
    walking task_iters and advancing them.  This is necessary as
    advancing an iter may remove it from the list.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup-defs.h |  3 ++
 include/linux/cgroup.h      |  4 +++
 kernel/cgroup.c             | 87 +++++++++++++++++++++++++++++++++++++--------
 3 files changed, 80 insertions(+), 14 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1744450..62413c3 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -211,6 +211,9 @@ struct css_set {
 	 */
 	struct list_head e_cset_node[CGROUP_SUBSYS_COUNT];
 
+	/* all css_task_iters currently walking this cset */
+	struct list_head task_iters;
+
 	/* For RCU-protected deletion */
 	struct rcu_head rcu_head;
 };
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index bdfdb3a..a9dcf0e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -42,6 +42,10 @@ struct css_task_iter {
 	struct list_head		*task_pos;
 	struct list_head		*tasks_head;
 	struct list_head		*mg_tasks_head;
+
+	struct css_set			*cur_cset;
+	struct task_struct		*cur_task;
+	struct list_head		iters_node;	/* css_set->task_iters */
 };
 
 extern struct cgroup_root cgrp_dfl_root;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 56e2b77..0c5b0e6 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -216,6 +216,7 @@ static struct cftype cgroup_legacy_base_files[];
 
 static int rebind_subsystems(struct cgroup_root *dst_root,
 			     unsigned long ss_mask);
+static void css_task_iter_advance(struct css_task_iter *it);
 static int cgroup_destroy_locked(struct cgroup *cgrp);
 static int create_css(struct cgroup *cgrp, struct cgroup_subsys *ss,
 		      bool visible);
@@ -593,6 +594,7 @@ struct css_set init_css_set = {
 	.mg_tasks		= LIST_HEAD_INIT(init_css_set.mg_tasks),
 	.mg_preload_node	= LIST_HEAD_INIT(init_css_set.mg_preload_node),
 	.mg_node		= LIST_HEAD_INIT(init_css_set.mg_node),
+	.task_iters		= LIST_HEAD_INIT(init_css_set.task_iters),
 };
 
 static int css_set_count	= 1;	/* 1 for init_css_set */
@@ -675,8 +677,9 @@ static void css_set_update_populated(struct css_set *cset, bool populated)
  * css_set, @from_cset can be NULL.  If @task is being disassociated
  * instead of moved, @to_cset can be NULL.
  *
- * This function automatically handles populated_cnt updates but the caller
- * is responsible for managing @from_cset and @to_cset's reference counts.
+ * This function automatically handles populated_cnt updates and
+ * css_task_iter adjustments but the caller is responsible for managing
+ * @from_cset and @to_cset's reference counts.
  */
 static void css_set_move_task(struct task_struct *task,
 			      struct css_set *from_cset, struct css_set *to_cset,
@@ -685,7 +688,22 @@ static void css_set_move_task(struct task_struct *task,
 	lockdep_assert_held(&css_set_rwsem);
 
 	if (from_cset) {
+		struct css_task_iter *it, *pos;
+
 		WARN_ON_ONCE(list_empty(&task->cg_list));
+
+		/*
+		 * @task is leaving, advance task iterators which are
+		 * pointing to it so that they can resume at the next
+		 * position.  Advancing an iterator might remove it from
+		 * the list, use safe walk.  See css_task_iter_advance*()
+		 * for details.
+		 */
+		list_for_each_entry_safe(it, pos, &from_cset->task_iters,
+					 iters_node)
+			if (it->task_pos == &task->cg_list)
+				css_task_iter_advance(it);
+
 		list_del_init(&task->cg_list);
 		if (!css_set_populated(from_cset))
 			css_set_update_populated(from_cset, false);
@@ -1019,6 +1037,7 @@ static struct css_set *find_css_set(struct css_set *old_cset,
 	INIT_LIST_HEAD(&cset->mg_tasks);
 	INIT_LIST_HEAD(&cset->mg_preload_node);
 	INIT_LIST_HEAD(&cset->mg_node);
+	INIT_LIST_HEAD(&cset->task_iters);
 	INIT_HLIST_NODE(&cset->hlist);
 
 	/* Copy the set of subsystem state objects generated in
@@ -3804,6 +3823,8 @@ static void css_task_iter_advance_css_set(struct css_task_iter *it)
 	struct cgrp_cset_link *link;
 	struct css_set *cset;
 
+	lockdep_assert_held(&css_set_rwsem);
+
 	/* Advance to the next non-empty css_set */
 	do {
 		l = l->next;
@@ -3831,12 +3852,36 @@ static void css_task_iter_advance_css_set(struct css_task_iter *it)
 
 	it->tasks_head = &cset->tasks;
 	it->mg_tasks_head = &cset->mg_tasks;
+
+	/*
+	 * We don't keep css_sets locked across iteration steps and thus
+	 * need to take steps to ensure that iteration can be resumed after
+	 * the lock is re-acquired.  Iteration is performed at two levels -
+	 * css_sets and tasks in them.
+	 *
+	 * Once created, a css_set never leaves its cgroup lists, so a
+	 * pinned css_set is guaranteed to stay put and we can resume
+	 * iteration afterwards.
+	 *
+	 * Tasks may leave @cset across iteration steps.  This is resolved
+	 * by registering each iterator with the css_set currently being
+	 * walked and making css_set_move_task() advance iterators whose
+	 * next task is leaving.
+	 */
+	if (it->cur_cset) {
+		list_del(&it->iters_node);
+		put_css_set_locked(it->cur_cset);
+	}
+	get_css_set(cset);
+	it->cur_cset = cset;
+	list_add(&it->iters_node, &cset->task_iters);
 }
 
 static void css_task_iter_advance(struct css_task_iter *it)
 {
 	struct list_head *l = it->task_pos;
 
+	lockdep_assert_held(&css_set_rwsem);
 	WARN_ON_ONCE(!l);
 
 	/*
@@ -3864,19 +3909,16 @@ static void css_task_iter_advance(struct css_task_iter *it)
  * css_task_iter_next() to walk through the tasks until the function
  * returns NULL.  On completion of iteration, css_task_iter_end() must be
  * called.
- *
- * Note that this function acquires a lock which is released when the
- * iteration finishes.  The caller can't sleep while iteration is in
- * progress.
  */
 void css_task_iter_start(struct cgroup_subsys_state *css,
 			 struct css_task_iter *it)
-	__acquires(css_set_rwsem)
 {
 	/* no one should try to iterate before mounting cgroups */
 	WARN_ON_ONCE(!use_task_css_set_links);
 
-	down_read(&css_set_rwsem);
+	memset(it, 0, sizeof(*it));
+
+	down_write(&css_set_rwsem);
 
 	it->ss = css->ss;
 
@@ -3888,6 +3930,8 @@ void css_task_iter_start(struct cgroup_subsys_state *css,
 	it->cset_head = it->cset_pos;
 
 	css_task_iter_advance_css_set(it);
+
+	up_write(&css_set_rwsem);
 }
 
 /**
@@ -3900,14 +3944,22 @@ void css_task_iter_start(struct cgroup_subsys_state *css,
  */
 struct task_struct *css_task_iter_next(struct css_task_iter *it)
 {
-	struct task_struct *res;
-
 	if (!it->cset_pos)
 		return NULL;
 
-	res = list_entry(it->task_pos, struct task_struct, cg_list);
+	if (it->cur_task)
+		put_task_struct(it->cur_task);
+
+	down_write(&css_set_rwsem);
+
+	it->cur_task = list_entry(it->task_pos, struct task_struct, cg_list);
+	get_task_struct(it->cur_task);
+
 	css_task_iter_advance(it);
-	return res;
+
+	up_write(&css_set_rwsem);
+
+	return it->cur_task;
 }
 
 /**
@@ -3917,9 +3969,16 @@ struct task_struct *css_task_iter_next(struct css_task_iter *it)
  * Finish task iteration started by css_task_iter_start().
  */
 void css_task_iter_end(struct css_task_iter *it)
-	__releases(css_set_rwsem)
 {
-	up_read(&css_set_rwsem);
+	if (it->cur_cset) {
+		down_write(&css_set_rwsem);
+		list_del(&it->iters_node);
+		put_css_set_locked(it->cur_cset);
+		up_write(&css_set_rwsem);
+	}
+
+	if (it->cur_task)
+		put_task_struct(it->cur_task);
 }
 
 /**
-- 
2.4.3



* Re: [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and fix pids controller
  2015-10-10  3:29 [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and fix pids controller Tejun Heo
                   ` (13 preceding siblings ...)
  2015-10-11 13:30 ` [PATCH 14/14] cgroup: add cgroup_subsys->free() method and use it to fix pids controller Tejun Heo
@ 2015-10-15  1:38 ` Tejun Heo
  2015-10-15 20:41   ` Tejun Heo
  14 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2015-10-15  1:38 UTC (permalink / raw)
  To: lizefan, hannes; +Cc: cgroups, cyphar, linux-kernel, kernel-team

Hello,

On Fri, Oct 09, 2015 at 11:29:27PM -0400, Tejun Heo wrote:
> The patchset is pretty lightly tested and I need to verify that the
> corner cases behave as expected.

Fixed several bugs while testing.  Updated patches posted.  Seems to
work fine now.  git branch has been refreshed with the updated
patches.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-zombies

Li, can you please go over the changes?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and fix pids controller
  2015-10-15  1:38 ` [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and " Tejun Heo
@ 2015-10-15 20:41   ` Tejun Heo
  2015-10-19  8:48     ` Zefan Li
  0 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2015-10-15 20:41 UTC (permalink / raw)
  To: lizefan, hannes; +Cc: cgroups, cyphar, linux-kernel, kernel-team

On Wed, Oct 14, 2015 at 09:38:09PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Fri, Oct 09, 2015 at 11:29:27PM -0400, Tejun Heo wrote:
> > The patchset is pretty lightly tested and I need to verify that the
> > corner cases behave as expected.
> 
> Fixed several bugs while testing.  Updated patches posted.  Seems to
> work fine now.  git branch has been refreshed with the updated
> patches.
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-zombies
> 
> Li, can you please go over the changes?

I want this series to be in -next at least for a couple weeks before
the merge window opens, so I'm applying it to cgroup/for-4.4 now.
Please let me know if you spot something wrong.  Let's fix it up
incrementally.

Thanks.

-- 
tejun


* Re: [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and fix pids controller
  2015-10-15 20:41   ` Tejun Heo
@ 2015-10-19  8:48     ` Zefan Li
  0 siblings, 0 replies; 23+ messages in thread
From: Zefan Li @ 2015-10-19  8:48 UTC (permalink / raw)
  To: Tejun Heo, hannes; +Cc: cgroups, cyphar, linux-kernel, kernel-team

Hi Tejun,

On 2015/10/16 4:41, Tejun Heo wrote:
> On Wed, Oct 14, 2015 at 09:38:09PM -0400, Tejun Heo wrote:
>> Hello,
>>
>> On Fri, Oct 09, 2015 at 11:29:27PM -0400, Tejun Heo wrote:
>>> The patchset is pretty lightly tested and I need to verify that the
>>> corner cases behave as expected.
>>
>> Fixed several bugs while testing.  Updated patches posted.  Seems to
>> work fine now.  git branch has been refreshed with the updated
>> patches.
>>
>>    git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-zombies
>>
>> Li, can you please go over the changes?
>
> I want this series to be in -next at least for a couple weeks before
> the merge window opens, so I'm applying it to cgroup/for-4.4 now.
> Please let me know if you spot something wrong.  Let's fix it up
> incrementally.
>

Sorry for the late reply. I was travelling last week. I'll review those
changes within this week.



end of thread, other threads:[~2015-10-19  8:48 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-10  3:29 [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and fix pids controller Tejun Heo
2015-10-10  3:29 ` [PATCH 01/14] cgroup: remove an unused parameter from cgroup_task_migrate() Tejun Heo
2015-10-10  3:29 ` [PATCH 02/14] cgroup: make cgroup->nr_populated count the number of populated css_sets Tejun Heo
2015-10-10  3:29 ` [PATCH 03/14] cgroup: replace cgroup_has_tasks() with cgroup_is_populated() Tejun Heo
2015-10-10  3:29 ` [PATCH 04/14] cgroup: move check_for_release() invocation Tejun Heo
2015-10-10  3:29 ` [PATCH 05/14] cgroup: relocate cgroup_[try]get/put() Tejun Heo
2015-10-10  3:29 ` [PATCH 06/14] cgroup: make css_sets pin the associated cgroups Tejun Heo
2015-10-15  1:34   ` [PATCH v2 " Tejun Heo
2015-10-10  3:29 ` [PATCH 07/14] cgroup: make cgroup_destroy_locked() test cgroup_is_populated() Tejun Heo
2015-10-10  3:29 ` [PATCH 08/14] cgroup: keep css_set and task lists in chronological order Tejun Heo
2015-10-10  3:29 ` [PATCH 09/14] cgroup: factor out css_set_move_task() Tejun Heo
2015-10-10  3:29 ` [PATCH 10/14] cgroup: reorganize css_task_iter functions Tejun Heo
2015-10-11 13:30 ` [PATCH 11/14] cgroup: don't hold css_set_rwsem across css task iteration Tejun Heo
2015-10-15  1:35   ` [PATCH v2 " Tejun Heo
2015-10-11 13:30 ` [PATCH 12/14] cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock Tejun Heo
2015-10-11 13:30 ` [PATCH 13/14] cgroup: keep zombies associated with their original cgroups Tejun Heo
2015-10-12 17:44   ` [PATCH v2 " Tejun Heo
2015-10-11 13:30 ` [PATCH 14/14] cgroup: add cgroup_subsys->free() method and use it to fix pids controller Tejun Heo
2015-10-12 10:29   ` Aleksa Sarai
2015-10-12 15:25     ` Tejun Heo
2015-10-15  1:38 ` [PATCHSET cgroup/for-4.4] cgroup: make zombies retain cgroup membership and " Tejun Heo
2015-10-15 20:41   ` Tejun Heo
2015-10-19  8:48     ` Zefan Li
