* [RFC PATCH 00/14] cgroup: Implement cgroup v2 thread mode & CPU controller
@ 2017-04-21 14:03 Waiman Long
  2017-04-21 14:03 ` [RFC PATCH 01/14] cgroup: reorganize cgroup.procs / task write path Waiman Long
                   ` (14 more replies)
  0 siblings, 15 replies; 17+ messages in thread
From: Waiman Long @ 2017-04-21 14:03 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, Waiman Long

This patchset incorporates the following 2 patchsets from Tejun Heo:

 1) cgroup v2 thread mode patchset (5 patches)
    https://lkml.org/lkml/2017/2/2/592
 2) CPU Controller on Control Group v2 (2 patches)
    https://lkml.org/lkml/2016/8/5/368

Additional patches are then layered on top to implement the following
new features:

 1) An enhanced v2 thread mode where a thread root (the root of a
    threaded subtree) can have non-threaded children with non-threaded
    controllers enabled, and where the no-internal-process constraint
    does not apply.
 2) An enhanced debug controller which dumps out more information
    relevant to the debugging and testing of cgroup v2 in general.
 3) Separate control knobs for resource domain controllers that can
    be enabled in a thread root to manage all the internal processes
    in the threaded subtree.

Patches 1-5 are Tejun's cgroup v2 thread mode patchset.

Patch 6 fixes a task_struct reference counting bug introduced in
patch 1.

Patch 7 moves the debug cgroup out from cgroup_v1.c into its own
file.

Patch 8 keeps more accurate counts of the number of tasks associated
with each css_set.

Patch 9 enhances the debug controller to provide more information
relevant to the cgroup v2 thread mode to ease the debugging effort.

Patch 10 implements the enhanced cgroup v2 thread mode with the
following enhancements:

 1) Thread roots are treated differently from threaded cgroups.
 2) Thread root can now have non-threaded controllers enabled as well
    as non-threaded children.

Patches 11-12 are Tejun's CPU controller on control group v2 patchset.

Patch 13 makes both cpu and cpuacct controllers threaded.

Patch 14 enables the creation of a special "cgroup.self" directory
under the thread root to hold resource control knobs for controllers
that don't want resource competition between internal processes and
non-threaded child cgroups.
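
For illustration, a session with the proposed interface might look
like the following.  This is a hypothetical sketch pieced together
from the descriptions above; the actual knobs appearing under
"cgroup.self" depend on which resource domain controllers are enabled
(memory is used here purely as an example) and the paths are made up.

  # cd /sys/fs/cgroup/tr
  # echo enable > cgroup.threads
  # ls cgroup.self
  memory.high  memory.low  memory.max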

Preliminary testing was done with the debug controller enabled. Things
seem to work fine so far. More rigorous testing will be needed, and
any suggestions are welcome.

Tejun Heo (7):
  cgroup: reorganize cgroup.procs / task write path
  cgroup: add @flags to css_task_iter_start() and implement
    CSS_TASK_ITER_PROCS
  cgroup: introduce cgroup->proc_cgrp and threaded css_set handling
  cgroup: implement CSS_TASK_ITER_THREADED
  cgroup: implement cgroup v2 thread support
  sched: Misc preps for cgroup unified hierarchy interface
  sched: Implement interface for cgroup unified hierarchy

Waiman Long (7):
  cgroup: Fix reference counting bug in cgroup_procs_write()
  cgroup: Move debug cgroup to its own file
  cgroup: Keep accurate count of tasks in each css_set
  cgroup: Make debug cgroup support v2 and thread mode
  cgroup: Implement new thread mode semantics
  sched: Make cpu/cpuacct threaded controllers
  cgroup: Enable separate control knobs for thread root internal
    processes

 Documentation/cgroup-v2.txt     | 114 +++++-
 include/linux/cgroup-defs.h     |  56 +++
 include/linux/cgroup.h          |  12 +-
 kernel/cgroup/Makefile          |   1 +
 kernel/cgroup/cgroup-internal.h |  18 +-
 kernel/cgroup/cgroup-v1.c       | 217 +++-------
 kernel/cgroup/cgroup.c          | 862 ++++++++++++++++++++++++++++++++++------
 kernel/cgroup/cpuset.c          |   6 +-
 kernel/cgroup/debug.c           | 284 +++++++++++++
 kernel/cgroup/freezer.c         |   6 +-
 kernel/cgroup/pids.c            |   1 +
 kernel/events/core.c            |   1 +
 kernel/sched/core.c             | 150 ++++++-
 kernel/sched/cpuacct.c          |  55 ++-
 kernel/sched/cpuacct.h          |   5 +
 mm/memcontrol.c                 |   3 +-
 net/core/netclassid_cgroup.c    |   2 +-
 17 files changed, 1478 insertions(+), 315 deletions(-)
 create mode 100644 kernel/cgroup/debug.c

-- 
1.8.3.1


* [RFC PATCH 01/14] cgroup: reorganize cgroup.procs / task write path
  2017-04-21 14:03 [RFC PATCH 00/14] cgroup: Implement cgroup v2 thread mode & CPU controller Waiman Long
@ 2017-04-21 14:03 ` Waiman Long
  2017-04-21 14:04 ` [RFC PATCH 02/14] cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS Waiman Long
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 17+ messages in thread
From: Waiman Long @ 2017-04-21 14:03 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault

From: Tejun Heo <tj@kernel.org>

Currently, writes to the "cgroup.procs" and "cgroup.tasks" files are
all handled by __cgroup_procs_write() on both v1 and v2.  This patch
reorganizes the write path so that there are common helper functions
that the different write paths use.

While this somewhat increases LOC, the different paths are no longer
intertwined and each path has more flexibility to implement different
behaviors which will be necessary for the planned v2 thread support.
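
With the split, each write path follows the same basic pattern around
the new helpers.  A condensed sketch of the flow (permission checks
and error paths elided; see the diff below for the real thing):

	cgrp = cgroup_kn_lock_live(of->kn, false);
	if (!cgrp)
		return -ENODEV;

	task = cgroup_procs_write_start(buf, threadgroup);
	ret = PTR_ERR_OR_ZERO(task);
	if (ret)
		goto out_unlock;

	/* path-specific permission checks, then the migration itself */
	ret = cgroup_attach_task(cgrp, task, threadgroup);

	cgroup_procs_write_finish();
out_unlock:
	cgroup_kn_unlock(of->kn);
	return ret ?: nbytes;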

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup/cgroup-internal.h |   8 +-
 kernel/cgroup/cgroup-v1.c       |  58 ++++++++++++--
 kernel/cgroup/cgroup.c          | 163 +++++++++++++++++++++-------------------
 3 files changed, 142 insertions(+), 87 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 9203bfb..6ef662a 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -179,10 +179,10 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 
 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 		       bool threadgroup);
-ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
-			     size_t nbytes, loff_t off, bool threadgroup);
-ssize_t cgroup_procs_write(struct kernfs_open_file *of, char *buf, size_t nbytes,
-			   loff_t off);
+struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup)
+	__acquires(&cgroup_threadgroup_rwsem);
+void cgroup_procs_write_finish(void)
+	__releases(&cgroup_threadgroup_rwsem);
 
 void cgroup_lock_and_drain_offline(struct cgroup *cgrp);
 
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 1dc22f6..e4f3202 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -514,10 +514,58 @@ static int cgroup_pidlist_show(struct seq_file *s, void *v)
 	return 0;
 }
 
-static ssize_t cgroup_tasks_write(struct kernfs_open_file *of,
-				  char *buf, size_t nbytes, loff_t off)
+static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of,
+				     char *buf, size_t nbytes, loff_t off,
+				     bool threadgroup)
 {
-	return __cgroup_procs_write(of, buf, nbytes, off, false);
+	struct cgroup *cgrp;
+	struct task_struct *task;
+	const struct cred *cred, *tcred;
+	ssize_t ret;
+
+	cgrp = cgroup_kn_lock_live(of->kn, false);
+	if (!cgrp)
+		return -ENODEV;
+
+	task = cgroup_procs_write_start(buf, threadgroup);
+	ret = PTR_ERR_OR_ZERO(task);
+	if (ret)
+		goto out_unlock;
+
+	/*
+	 * Even if we're attaching all tasks in the thread group, we only
+	 * need to check permissions on one of them.
+	 */
+	cred = current_cred();
+	tcred = get_task_cred(task);
+	if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
+	    !uid_eq(cred->euid, tcred->uid) &&
+	    !uid_eq(cred->euid, tcred->suid))
+		ret = -EACCES;
+	put_cred(tcred);
+	if (ret)
+		goto out_finish;
+
+	ret = cgroup_attach_task(cgrp, task, threadgroup);
+
+out_finish:
+	cgroup_procs_write_finish();
+out_unlock:
+	cgroup_kn_unlock(of->kn);
+
+	return ret ?: nbytes;
+}
+
+static ssize_t cgroup1_procs_write(struct kernfs_open_file *of,
+				   char *buf, size_t nbytes, loff_t off)
+{
+	return __cgroup1_procs_write(of, buf, nbytes, off, true);
+}
+
+static ssize_t cgroup1_tasks_write(struct kernfs_open_file *of,
+				   char *buf, size_t nbytes, loff_t off)
+{
+	return __cgroup1_procs_write(of, buf, nbytes, off, false);
 }
 
 static ssize_t cgroup_release_agent_write(struct kernfs_open_file *of,
@@ -596,7 +644,7 @@ struct cftype cgroup1_base_files[] = {
 		.seq_stop = cgroup_pidlist_stop,
 		.seq_show = cgroup_pidlist_show,
 		.private = CGROUP_FILE_PROCS,
-		.write = cgroup_procs_write,
+		.write = cgroup1_procs_write,
 	},
 	{
 		.name = "cgroup.clone_children",
@@ -615,7 +663,7 @@ struct cftype cgroup1_base_files[] = {
 		.seq_stop = cgroup_pidlist_stop,
 		.seq_show = cgroup_pidlist_show,
 		.private = CGROUP_FILE_TASKS,
-		.write = cgroup_tasks_write,
+		.write = cgroup1_tasks_write,
 	},
 	{
 		.name = "notify_on_release",
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 687f5e0..b4b8c6b 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1914,6 +1914,23 @@ int task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
 }
 EXPORT_SYMBOL_GPL(task_cgroup_path);
 
+static struct cgroup *cgroup_migrate_common_ancestor(struct task_struct *task,
+						     struct cgroup *dst_cgrp)
+{
+	struct cgroup *cgrp;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	spin_lock_irq(&css_set_lock);
+	cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
+	spin_unlock_irq(&css_set_lock);
+
+	while (!cgroup_is_descendant(dst_cgrp, cgrp))
+		cgrp = cgroup_parent(cgrp);
+
+	return cgrp;
+}
+
 /**
  * cgroup_migrate_add_task - add a migration target task to a migration context
  * @task: target task
@@ -2346,76 +2363,23 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 	return ret;
 }
 
-static int cgroup_procs_write_permission(struct task_struct *task,
-					 struct cgroup *dst_cgrp,
-					 struct kernfs_open_file *of)
-{
-	int ret = 0;
-
-	if (cgroup_on_dfl(dst_cgrp)) {
-		struct super_block *sb = of->file->f_path.dentry->d_sb;
-		struct cgroup *cgrp;
-		struct inode *inode;
-
-		spin_lock_irq(&css_set_lock);
-		cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
-		spin_unlock_irq(&css_set_lock);
-
-		while (!cgroup_is_descendant(dst_cgrp, cgrp))
-			cgrp = cgroup_parent(cgrp);
-
-		ret = -ENOMEM;
-		inode = kernfs_get_inode(sb, cgrp->procs_file.kn);
-		if (inode) {
-			ret = inode_permission(inode, MAY_WRITE);
-			iput(inode);
-		}
-	} else {
-		const struct cred *cred = current_cred();
-		const struct cred *tcred = get_task_cred(task);
-
-		/*
-		 * even if we're attaching all tasks in the thread group,
-		 * we only need to check permissions on one of them.
-		 */
-		if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
-		    !uid_eq(cred->euid, tcred->uid) &&
-		    !uid_eq(cred->euid, tcred->suid))
-			ret = -EACCES;
-		put_cred(tcred);
-	}
-
-	return ret;
-}
-
-/*
- * Find the task_struct of the task to attach by vpid and pass it along to the
- * function to attach either it or all tasks in its threadgroup. Will lock
- * cgroup_mutex and threadgroup.
- */
-ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
-			     size_t nbytes, loff_t off, bool threadgroup)
+struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup)
+	__acquires(&cgroup_threadgroup_rwsem)
 {
 	struct task_struct *tsk;
-	struct cgroup_subsys *ss;
-	struct cgroup *cgrp;
 	pid_t pid;
-	int ssid, ret;
 
 	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
-		return -EINVAL;
-
-	cgrp = cgroup_kn_lock_live(of->kn, false);
-	if (!cgrp)
-		return -ENODEV;
+		return ERR_PTR(-EINVAL);
 
 	percpu_down_write(&cgroup_threadgroup_rwsem);
+
 	rcu_read_lock();
 	if (pid) {
 		tsk = find_task_by_vpid(pid);
 		if (!tsk) {
-			ret = -ESRCH;
-			goto out_unlock_rcu;
+			tsk = ERR_PTR(-ESRCH);
+			goto out_unlock_threadgroup;
 		}
 	} else {
 		tsk = current;
@@ -2431,35 +2395,30 @@ ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 	 * cgroup with no rt_runtime allocated.  Just say no.
 	 */
 	if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
-		ret = -EINVAL;
-		goto out_unlock_rcu;
+		tsk = ERR_PTR(-EINVAL);
+		goto out_unlock_threadgroup;
 	}
 
 	get_task_struct(tsk);
-	rcu_read_unlock();
-
-	ret = cgroup_procs_write_permission(tsk, cgrp, of);
-	if (!ret)
-		ret = cgroup_attach_task(cgrp, tsk, threadgroup);
-
-	put_task_struct(tsk);
-	goto out_unlock_threadgroup;
+	goto out_unlock_rcu;
 
+out_unlock_threadgroup:
+	percpu_up_write(&cgroup_threadgroup_rwsem);
 out_unlock_rcu:
 	rcu_read_unlock();
-out_unlock_threadgroup:
+	return tsk;
+}
+
+void cgroup_procs_write_finish(void)
+	__releases(&cgroup_threadgroup_rwsem)
+{
+	struct cgroup_subsys *ss;
+	int ssid;
+
 	percpu_up_write(&cgroup_threadgroup_rwsem);
 	for_each_subsys(ss, ssid)
 		if (ss->post_attach)
 			ss->post_attach();
-	cgroup_kn_unlock(of->kn);
-	return ret ?: nbytes;
-}
-
-ssize_t cgroup_procs_write(struct kernfs_open_file *of, char *buf, size_t nbytes,
-			   loff_t off)
-{
-	return __cgroup_procs_write(of, buf, nbytes, off, true);
 }
 
 static void cgroup_print_ss_mask(struct seq_file *seq, u16 ss_mask)
@@ -3783,6 +3742,54 @@ static int cgroup_procs_show(struct seq_file *s, void *v)
 	return 0;
 }
 
+static int cgroup_procs_write_permission(struct cgroup *cgrp,
+					 struct super_block *sb)
+{
+	struct inode *inode;
+	int ret;
+
+	inode = kernfs_get_inode(sb, cgrp->procs_file.kn);
+	if (!inode)
+		return -ENOMEM;
+
+	ret = inode_permission(inode, MAY_WRITE);
+	iput(inode);
+	return ret;
+}
+
+static ssize_t cgroup_procs_write(struct kernfs_open_file *of,
+				  char *buf, size_t nbytes, loff_t off)
+{
+	struct cgroup *cgrp, *common_ancestor;
+	struct task_struct *task;
+	ssize_t ret;
+
+	cgrp = cgroup_kn_lock_live(of->kn, false);
+	if (!cgrp)
+		return -ENODEV;
+
+	task = cgroup_procs_write_start(buf, true);
+	ret = PTR_ERR_OR_ZERO(task);
+	if (ret)
+		goto out_unlock;
+
+	common_ancestor = cgroup_migrate_common_ancestor(task, cgrp);
+
+	ret = cgroup_procs_write_permission(common_ancestor,
+					    of->file->f_path.dentry->d_sb);
+	if (ret)
+		goto out_finish;
+
+	ret = cgroup_attach_task(cgrp, task, true);
+
+out_finish:
+	cgroup_procs_write_finish();
+out_unlock:
+	cgroup_kn_unlock(of->kn);
+
+	return ret ?: nbytes;
+}
+
 /* cgroup core interface files for the default hierarchy */
 static struct cftype cgroup_base_files[] = {
 	{
-- 
1.8.3.1


* [RFC PATCH 02/14] cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS
  2017-04-21 14:03 [RFC PATCH 00/14] cgroup: Implement cgroup v2 thread mode & CPU controller Waiman Long
  2017-04-21 14:03 ` [RFC PATCH 01/14] cgroup: reorganize cgroup.procs / task write path Waiman Long
@ 2017-04-21 14:04 ` Waiman Long
  2017-04-21 14:04 ` [RFC PATCH 03/14] cgroup: introduce cgroup->proc_cgrp and threaded css_set handling Waiman Long
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 17+ messages in thread
From: Waiman Long @ 2017-04-21 14:04 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault

From: Tejun Heo <tj@kernel.org>

css_task_iter currently always walks all tasks.  With the scheduled
cgroup v2 thread support, the iterator would need to handle multiple
types of iteration.  As a preparation, add @flags to
css_task_iter_start() and implement CSS_TASK_ITER_PROCS.  If the flag
is not specified, it walks all tasks as before.  When asserted, the
iterator only walks the group leaders.

For now, the only user of the flag is cgroup v2 "cgroup.procs" file
which no longer needs to skip non-leader tasks in cgroup_procs_next().
Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
cgroup" but "list all thread group id's with any threads in the
cgroup".

While at it, update cgroup_procs_show() to use task_pid_vnr() instead
of task_tgid_vnr().  As the iteration guarantees that the function
only sees group leaders, this doesn't change the output and will allow
sharing the function for thread iteration.
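
For example, a walk over just the thread group leaders of a css with
the updated interface would look like this (illustrative snippet, not
taken from the patch):

	struct css_task_iter it;
	struct task_struct *task;

	/* visit each thread group leader exactly once */
	css_task_iter_start(css, CSS_TASK_ITER_PROCS, &it);
	while ((task = css_task_iter_next(&it)))
		pr_info("leader %d\n", task_pid_vnr(task));
	css_task_iter_end(&it);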

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup.h       |  6 +++++-
 kernel/cgroup/cgroup-v1.c    |  6 +++---
 kernel/cgroup/cgroup.c       | 24 ++++++++++++++----------
 kernel/cgroup/cpuset.c       |  6 +++---
 kernel/cgroup/freezer.c      |  6 +++---
 mm/memcontrol.c              |  2 +-
 net/core/netclassid_cgroup.c |  2 +-
 7 files changed, 30 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index af9c86e..37b20ef 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -36,9 +36,13 @@
 #define CGROUP_WEIGHT_DFL		100
 #define CGROUP_WEIGHT_MAX		10000
 
+/* walk only threadgroup leaders */
+#define CSS_TASK_ITER_PROCS		(1U << 0)
+
 /* a css_task_iter should be treated as an opaque object */
 struct css_task_iter {
 	struct cgroup_subsys		*ss;
+	unsigned int			flags;
 
 	struct list_head		*cset_pos;
 	struct list_head		*cset_head;
@@ -129,7 +133,7 @@ struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset,
 struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset,
 					struct cgroup_subsys_state **dst_cssp);
 
-void css_task_iter_start(struct cgroup_subsys_state *css,
+void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
 			 struct css_task_iter *it);
 struct task_struct *css_task_iter_next(struct css_task_iter *it);
 void css_task_iter_end(struct css_task_iter *it);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index e4f3202..b837e1a 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -121,7 +121,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 	 * ->can_attach() fails.
 	 */
 	do {
-		css_task_iter_start(&from->self, &it);
+		css_task_iter_start(&from->self, 0, &it);
 		task = css_task_iter_next(&it);
 		if (task)
 			get_task_struct(task);
@@ -377,7 +377,7 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
 	if (!array)
 		return -ENOMEM;
 	/* now, populate the array */
-	css_task_iter_start(&cgrp->self, &it);
+	css_task_iter_start(&cgrp->self, 0, &it);
 	while ((tsk = css_task_iter_next(&it))) {
 		if (unlikely(n == length))
 			break;
@@ -753,7 +753,7 @@ int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry)
 	}
 	rcu_read_unlock();
 
-	css_task_iter_start(&cgrp->self, &it);
+	css_task_iter_start(&cgrp->self, 0, &it);
 	while ((tsk = css_task_iter_next(&it))) {
 		switch (tsk->state) {
 		case TASK_RUNNING:
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index b4b8c6b..9bbfadc 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3590,6 +3590,7 @@ static void css_task_iter_advance(struct css_task_iter *it)
 	lockdep_assert_held(&css_set_lock);
 	WARN_ON_ONCE(!l);
 
+repeat:
 	/*
 	 * Advance iterator to find next entry.  cset->tasks is consumed
 	 * first and then ->mg_tasks.  After ->mg_tasks, we move onto the
@@ -3604,11 +3605,18 @@ static void css_task_iter_advance(struct css_task_iter *it)
 		css_task_iter_advance_css_set(it);
 	else
 		it->task_pos = l;
+
+	/* if PROCS, skip over tasks which aren't group leaders */
+	if ((it->flags & CSS_TASK_ITER_PROCS) && it->task_pos &&
+	    !thread_group_leader(list_entry(it->task_pos, struct task_struct,
+					    cg_list)))
+		goto repeat;
 }
 
 /**
  * css_task_iter_start - initiate task iteration
  * @css: the css to walk tasks of
+ * @flags: CSS_TASK_ITER_* flags
  * @it: the task iterator to use
  *
  * Initiate iteration through the tasks of @css.  The caller can call
@@ -3616,7 +3624,7 @@ static void css_task_iter_advance(struct css_task_iter *it)
  * returns NULL.  On completion of iteration, css_task_iter_end() must be
  * called.
  */
-void css_task_iter_start(struct cgroup_subsys_state *css,
+void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
 			 struct css_task_iter *it)
 {
 	/* no one should try to iterate before mounting cgroups */
@@ -3627,6 +3635,7 @@ void css_task_iter_start(struct cgroup_subsys_state *css,
 	spin_lock_irq(&css_set_lock);
 
 	it->ss = css->ss;
+	it->flags = flags;
 
 	if (it->ss)
 		it->cset_pos = &css->cgroup->e_csets[css->ss->id];
@@ -3700,13 +3709,8 @@ static void *cgroup_procs_next(struct seq_file *s, void *v, loff_t *pos)
 {
 	struct kernfs_open_file *of = s->private;
 	struct css_task_iter *it = of->priv;
-	struct task_struct *task;
-
-	do {
-		task = css_task_iter_next(it);
-	} while (task && !thread_group_leader(task));
 
-	return task;
+	return css_task_iter_next(it);
 }
 
 static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
@@ -3727,10 +3731,10 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 		if (!it)
 			return ERR_PTR(-ENOMEM);
 		of->priv = it;
-		css_task_iter_start(&cgrp->self, it);
+		css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS, it);
 	} else if (!(*pos)++) {
 		css_task_iter_end(it);
-		css_task_iter_start(&cgrp->self, it);
+		css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS, it);
 	}
 
 	return cgroup_procs_next(s, NULL, NULL);
@@ -3738,7 +3742,7 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 
 static int cgroup_procs_show(struct seq_file *s, void *v)
 {
-	seq_printf(s, "%d\n", task_tgid_vnr(v));
+	seq_printf(s, "%d\n", task_pid_vnr(v));
 	return 0;
 }
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 0f41292..04ce1cf 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -861,7 +861,7 @@ static void update_tasks_cpumask(struct cpuset *cs)
 	struct css_task_iter it;
 	struct task_struct *task;
 
-	css_task_iter_start(&cs->css, &it);
+	css_task_iter_start(&cs->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
 		set_cpus_allowed_ptr(task, cs->effective_cpus);
 	css_task_iter_end(&it);
@@ -1106,7 +1106,7 @@ static void update_tasks_nodemask(struct cpuset *cs)
 	 * It's ok if we rebind the same mm twice; mpol_rebind_mm()
 	 * is idempotent.  Also migrate pages in each mm to new nodes.
 	 */
-	css_task_iter_start(&cs->css, &it);
+	css_task_iter_start(&cs->css, 0, &it);
 	while ((task = css_task_iter_next(&it))) {
 		struct mm_struct *mm;
 		bool migrate;
@@ -1299,7 +1299,7 @@ static void update_tasks_flags(struct cpuset *cs)
 	struct css_task_iter it;
 	struct task_struct *task;
 
-	css_task_iter_start(&cs->css, &it);
+	css_task_iter_start(&cs->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
 		cpuset_update_task_spread_flag(cs, task);
 	css_task_iter_end(&it);
diff --git a/kernel/cgroup/freezer.c b/kernel/cgroup/freezer.c
index 1b72d56..0823679 100644
--- a/kernel/cgroup/freezer.c
+++ b/kernel/cgroup/freezer.c
@@ -268,7 +268,7 @@ static void update_if_frozen(struct cgroup_subsys_state *css)
 	rcu_read_unlock();
 
 	/* are all tasks frozen? */
-	css_task_iter_start(css, &it);
+	css_task_iter_start(css, 0, &it);
 
 	while ((task = css_task_iter_next(&it))) {
 		if (freezing(task)) {
@@ -320,7 +320,7 @@ static void freeze_cgroup(struct freezer *freezer)
 	struct css_task_iter it;
 	struct task_struct *task;
 
-	css_task_iter_start(&freezer->css, &it);
+	css_task_iter_start(&freezer->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
 		freeze_task(task);
 	css_task_iter_end(&it);
@@ -331,7 +331,7 @@ static void unfreeze_cgroup(struct freezer *freezer)
 	struct css_task_iter it;
 	struct task_struct *task;
 
-	css_task_iter_start(&freezer->css, &it);
+	css_task_iter_start(&freezer->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
 		__thaw_task(task);
 	css_task_iter_end(&it);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d8d562..f28ab8d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -951,7 +951,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 		struct css_task_iter it;
 		struct task_struct *task;
 
-		css_task_iter_start(&iter->css, &it);
+		css_task_iter_start(&iter->css, 0, &it);
 		while (!ret && (task = css_task_iter_next(&it)))
 			ret = fn(task, arg);
 		css_task_iter_end(&it);
diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index 029a61a..5e4f040 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -100,7 +100,7 @@ static int write_classid(struct cgroup_subsys_state *css, struct cftype *cft,
 
 	cs->classid = (u32)value;
 
-	css_task_iter_start(css, &it);
+	css_task_iter_start(css, 0, &it);
 	while ((p = css_task_iter_next(&it))) {
 		task_lock(p);
 		iterate_fd(p->files, 0, update_classid_sock,
-- 
1.8.3.1


* [RFC PATCH 03/14] cgroup: introduce cgroup->proc_cgrp and threaded css_set handling
  2017-04-21 14:03 [RFC PATCH 00/14] cgroup: Implement cgroup v2 thread mode & CPU controller Waiman Long
  2017-04-21 14:03 ` [RFC PATCH 01/14] cgroup: reorganize cgroup.procs / task write path Waiman Long
  2017-04-21 14:04 ` [RFC PATCH 02/14] cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS Waiman Long
@ 2017-04-21 14:04 ` Waiman Long
  2017-04-21 14:04 ` [RFC PATCH 04/14] cgroup: implement CSS_TASK_ITER_THREADED Waiman Long
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 17+ messages in thread
From: Waiman Long @ 2017-04-21 14:04 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault

From: Tejun Heo <tj@kernel.org>

cgroup v2 is in the process of growing thread granularity support.
Once thread mode is enabled, the root cgroup of the subtree serves as
the proc_cgrp to which the processes of the subtree conceptually
belong and to which domain-level resource consumptions not tied to any
specific task are charged.  In the subtree, threads won't be subject
to process granularity or the no-internal-task constraint and can be
distributed arbitrarily across the subtree.

This patch introduces cgroup->proc_cgrp along with threaded css_set
handling.

* cgroup->proc_cgrp is NULL if !threaded.  If threaded, points to the
  proc_cgrp (root of the threaded subtree).

* css_set->proc_cset points to self if !threaded.  If threaded, points
  to the css_set which belongs to the cgrp->proc_cgrp.  The proc_cgrp
  serves as the resource domain and needs the matching csses readily
  available.  The proc_cset holds those csses and makes them easily
  accessible.

* All threaded csets are linked on their proc_csets to enable
  iteration of all threaded tasks.

This patch adds the above but doesn't actually use them yet.  The
following patches will build on top.
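
Although nothing consumes the new pointers yet, the intended usage
can be sketched: a controller charging a domain-level consumption for
@task would resolve the resource domain's css_set via ->proc_cset
(hypothetical snippet, not part of this patch):

	struct css_set *cset;

	spin_lock_irq(&css_set_lock);
	/* self if !threaded, the thread root's css_set if threaded */
	cset = proc_css_set(task_css_set(task));
	/* cset->subsys[] now holds the csses to charge */
	spin_unlock_irq(&css_set_lock);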

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup-defs.h | 22 ++++++++++++
 kernel/cgroup/cgroup.c      | 87 +++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 103 insertions(+), 6 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 6a3f850..9283ee9 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -158,6 +158,15 @@ struct css_set {
 	/* reference count */
 	atomic_t refcount;
 
+	/*
+	 * If not threaded, the following points to self.  If threaded, to
+	 * a cset which belongs to the top cgroup of the threaded subtree.
+	 * The proc_cset provides access to the process cgroup and its
+	 * csses to which domain level resource consumptions should be
+	 * charged.
+	 */
+	struct css_set __rcu *proc_cset;
+
 	/* the default cgroup associated with this css_set */
 	struct cgroup *dfl_cgrp;
 
@@ -183,6 +192,10 @@ struct css_set {
 	 */
 	struct list_head e_cset_node[CGROUP_SUBSYS_COUNT];
 
+	/* all csets whose ->proc_cset points to this cset */
+	struct list_head threaded_csets;
+	struct list_head threaded_csets_node;
+
 	/*
 	 * List running through all cgroup groups in the same hash
 	 * slot. Protected by css_set_lock
@@ -289,6 +302,15 @@ struct cgroup {
 	struct list_head e_csets[CGROUP_SUBSYS_COUNT];
 
 	/*
+	 * If !threaded, NULL.  If threaded, it points to the top cgroup of
+	 * the threaded subtree (on that cgroup, it points to self).  Threaded
+	 * subtree is exempt from process granularity and no-internal-task
+	 * constraint.  Domain level resource consumptions which aren't
+	 * tied to a specific task should be charged to the proc_cgrp.
+	 */
+	struct cgroup *proc_cgrp;
+
+	/*
 	 * list of pidlists, up to two for each namespace (one for procs, one
 	 * for tasks); created on demand.
 	 */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9bbfadc..016bbc6 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -555,9 +555,11 @@ struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
  */
 struct css_set init_css_set = {
 	.refcount		= ATOMIC_INIT(1),
+	.proc_cset		= RCU_INITIALIZER(&init_css_set),
 	.tasks			= LIST_HEAD_INIT(init_css_set.tasks),
 	.mg_tasks		= LIST_HEAD_INIT(init_css_set.mg_tasks),
 	.task_iters		= LIST_HEAD_INIT(init_css_set.task_iters),
+	.threaded_csets		= LIST_HEAD_INIT(init_css_set.threaded_csets),
 	.cgrp_links		= LIST_HEAD_INIT(init_css_set.cgrp_links),
 	.mg_preload_node	= LIST_HEAD_INIT(init_css_set.mg_preload_node),
 	.mg_node		= LIST_HEAD_INIT(init_css_set.mg_node),
@@ -576,6 +578,17 @@ static bool css_set_populated(struct css_set *cset)
 	return !list_empty(&cset->tasks) || !list_empty(&cset->mg_tasks);
 }
 
+static struct css_set *proc_css_set(struct css_set *cset)
+{
+	return rcu_dereference_protected(cset->proc_cset,
+					 lockdep_is_held(&css_set_lock));
+}
+
+static bool css_set_threaded(struct css_set *cset)
+{
+	return proc_css_set(cset) != cset;
+}
+
 /**
  * cgroup_update_populated - updated populated count of a cgroup
  * @cgrp: the target cgroup
@@ -727,6 +740,8 @@ void put_css_set_locked(struct css_set *cset)
 	if (!atomic_dec_and_test(&cset->refcount))
 		return;
 
+	WARN_ON_ONCE(!list_empty(&cset->threaded_csets));
+
 	/* This css_set is dead. unlink it and release cgroup and css refs */
 	for_each_subsys(ss, ssid) {
 		list_del(&cset->e_cset_node[ssid]);
@@ -743,6 +758,11 @@ void put_css_set_locked(struct css_set *cset)
 		kfree(link);
 	}
 
+	if (css_set_threaded(cset)) {
+		list_del(&cset->threaded_csets_node);
+		put_css_set_locked(proc_css_set(cset));
+	}
+
 	kfree_rcu(cset, rcu_head);
 }
 
@@ -752,6 +772,7 @@ void put_css_set_locked(struct css_set *cset)
  * @old_cset: existing css_set for a task
  * @new_cgrp: cgroup that's being entered by the task
  * @template: desired set of css pointers in css_set (pre-calculated)
+ * @for_pcset: the comparison is for a new proc_cset
  *
  * Returns true if "cset" matches "old_cset" except for the hierarchy
  * which "new_cgrp" belongs to, for which it should match "new_cgrp".
@@ -759,7 +780,8 @@ void put_css_set_locked(struct css_set *cset)
 static bool compare_css_sets(struct css_set *cset,
 			     struct css_set *old_cset,
 			     struct cgroup *new_cgrp,
-			     struct cgroup_subsys_state *template[])
+			     struct cgroup_subsys_state *template[],
+			     bool for_pcset)
 {
 	struct list_head *l1, *l2;
 
@@ -771,6 +793,32 @@ static bool compare_css_sets(struct css_set *cset,
 	if (memcmp(template, cset->subsys, sizeof(cset->subsys)))
 		return false;
 
+	if (for_pcset) {
+		/*
+		 * We're looking for the pcset of @old_cset.  As @old_cset
+		 * doesn't have its ->proc_cset pointer set yet (we're
+		 * trying to find out what to set it to), @old_cset itself
+	 * may seem like a match here.  Explicitly exclude identity
+		 * matching.
+		 */
+		if (css_set_threaded(cset) || cset == old_cset)
+			return false;
+	} else {
+		bool is_threaded;
+
+		/*
+		 * Otherwise, @cset's threaded state should match the
+		 * default cgroup's.
+		 */
+		if (cgroup_on_dfl(new_cgrp))
+			is_threaded = new_cgrp->proc_cgrp;
+		else
+			is_threaded = old_cset->dfl_cgrp->proc_cgrp;
+
+		if (is_threaded != css_set_threaded(cset))
+			return false;
+	}
+
 	/*
 	 * Compare cgroup pointers in order to distinguish between
 	 * different cgroups in hierarchies.  As different cgroups may
@@ -823,10 +871,12 @@ static bool compare_css_sets(struct css_set *cset,
  * @old_cset: the css_set that we're using before the cgroup transition
  * @cgrp: the cgroup that we're moving into
  * @template: out param for the new set of csses, should be clear on entry
+ * @for_pcset: looking for a new proc_cset
  */
 static struct css_set *find_existing_css_set(struct css_set *old_cset,
 					struct cgroup *cgrp,
-					struct cgroup_subsys_state *template[])
+					struct cgroup_subsys_state *template[],
+					bool for_pcset)
 {
 	struct cgroup_root *root = cgrp->root;
 	struct cgroup_subsys *ss;
@@ -857,7 +907,7 @@ static struct css_set *find_existing_css_set(struct css_set *old_cset,
 
 	key = css_set_hash(template);
 	hash_for_each_possible(css_set_table, cset, hlist, key) {
-		if (!compare_css_sets(cset, old_cset, cgrp, template))
+		if (!compare_css_sets(cset, old_cset, cgrp, template, for_pcset))
 			continue;
 
 		/* This css_set matches what we need */
@@ -939,12 +989,13 @@ static void link_css_set(struct list_head *tmp_links, struct css_set *cset,
  * find_css_set - return a new css_set with one cgroup updated
  * @old_cset: the baseline css_set
  * @cgrp: the cgroup to be updated
+ * @for_pcset: looking for a new proc_cset
  *
  * Return a new css_set that's equivalent to @old_cset, but with @cgrp
  * substituted into the appropriate hierarchy.
  */
 static struct css_set *find_css_set(struct css_set *old_cset,
-				    struct cgroup *cgrp)
+				    struct cgroup *cgrp, bool for_pcset)
 {
 	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT] = { };
 	struct css_set *cset;
@@ -959,7 +1010,7 @@ static struct css_set *find_css_set(struct css_set *old_cset,
 	/* First see if we already have a cgroup group that matches
 	 * the desired set */
 	spin_lock_irq(&css_set_lock);
-	cset = find_existing_css_set(old_cset, cgrp, template);
+	cset = find_existing_css_set(old_cset, cgrp, template, for_pcset);
 	if (cset)
 		get_css_set(cset);
 	spin_unlock_irq(&css_set_lock);
@@ -978,9 +1029,11 @@ static struct css_set *find_css_set(struct css_set *old_cset,
 	}
 
 	atomic_set(&cset->refcount, 1);
+	RCU_INIT_POINTER(cset->proc_cset, cset);
 	INIT_LIST_HEAD(&cset->tasks);
 	INIT_LIST_HEAD(&cset->mg_tasks);
 	INIT_LIST_HEAD(&cset->task_iters);
+	INIT_LIST_HEAD(&cset->threaded_csets);
 	INIT_HLIST_NODE(&cset->hlist);
 	INIT_LIST_HEAD(&cset->cgrp_links);
 	INIT_LIST_HEAD(&cset->mg_preload_node);
@@ -1018,6 +1071,28 @@ static struct css_set *find_css_set(struct css_set *old_cset,
 
 	spin_unlock_irq(&css_set_lock);
 
+	/*
+	 * If @cset should be threaded, look up the matching proc_cset and
+	 * link them up.  We first fully initialize @cset then look for the
+	 * pcset.  It's simpler this way and safe as @cset is guaranteed to
+	 * stay empty until we return.
+	 */
+	if (!for_pcset && cset->dfl_cgrp->proc_cgrp) {
+		struct css_set *pcset;
+
+		pcset = find_css_set(cset, cset->dfl_cgrp->proc_cgrp, true);
+		if (!pcset) {
+			put_css_set(cset);
+			return NULL;
+		}
+
+		spin_lock_irq(&css_set_lock);
+		rcu_assign_pointer(cset->proc_cset, pcset);
+		list_add_tail(&cset->threaded_csets_node,
+			      &pcset->threaded_csets);
+		spin_unlock_irq(&css_set_lock);
+	}
+
 	return cset;
 }
 
@@ -2239,7 +2314,7 @@ int cgroup_migrate_prepare_dst(struct cgroup_mgctx *mgctx)
 		struct cgroup_subsys *ss;
 		int ssid;
 
-		dst_cset = find_css_set(src_cset, src_cset->mg_dst_cgrp);
+		dst_cset = find_css_set(src_cset, src_cset->mg_dst_cgrp, false);
 		if (!dst_cset)
 			goto err;
 
-- 
1.8.3.1


* [RFC PATCH 04/14] cgroup: implement CSS_TASK_ITER_THREADED
  2017-04-21 14:03 [RFC PATCH 00/14] cgroup: Implement cgroup v2 thread mode & CPU controller Waiman Long
                   ` (2 preceding siblings ...)
  2017-04-21 14:04 ` [RFC PATCH 03/14] cgroup: introduce cgroup->proc_cgrp and threaded css_set handling Waiman Long
@ 2017-04-21 14:04 ` Waiman Long
  2017-04-21 14:04 ` [RFC PATCH 05/14] cgroup: implement cgroup v2 thread support Waiman Long
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 17+ messages in thread
From: Waiman Long @ 2017-04-21 14:04 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault

From: Tejun Heo <tj@kernel.org>

cgroup v2 is in the process of growing thread granularity support.
Once thread mode is enabled, the root cgroup of the subtree serves as
the proc_cgrp to which the processes of the subtree conceptually
belong and to which domain-level resource consumptions not tied to any
specific task are charged.  In the subtree, threads won't be subject
to process granularity or the no-internal-task constraint and can be
distributed arbitrarily across the subtree.

This patch implements a new task iterator flag CSS_TASK_ITER_THREADED,
which, when used on a proc_cgrp, makes the iteration include the tasks
on all the associated threaded css_sets.  "cgroup.procs" read path is
updated to use it so that reading the file on a proc_cgrp lists all
processes.  This will also be used by controller implementations which
need to walk processes or tasks at the resource domain level.

Task iteration is implemented nested in css_set iteration.  If
CSS_TASK_ITER_THREADED is specified, after walking tasks of each
!threaded css_set, all the associated threaded css_sets are visited
before moving onto the next !threaded css_set.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup.h |  6 ++++
 kernel/cgroup/cgroup.c | 81 +++++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 73 insertions(+), 14 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 37b20ef..d62d75c 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -38,6 +38,8 @@
 
 /* walk only threadgroup leaders */
 #define CSS_TASK_ITER_PROCS		(1U << 0)
+/* walk threaded css_sets as part of their proc_csets */
+#define CSS_TASK_ITER_THREADED		(1U << 1)
 
 /* a css_task_iter should be treated as an opaque object */
 struct css_task_iter {
@@ -47,11 +49,15 @@ struct css_task_iter {
 	struct list_head		*cset_pos;
 	struct list_head		*cset_head;
 
+	struct list_head		*tcset_pos;
+	struct list_head		*tcset_head;
+
 	struct list_head		*task_pos;
 	struct list_head		*tasks_head;
 	struct list_head		*mg_tasks_head;
 
 	struct css_set			*cur_cset;
+	struct css_set			*cur_pcset;
 	struct task_struct		*cur_task;
 	struct list_head		iters_node;	/* css_set->task_iters */
 };
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 016bbc6..b2b1886 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3592,27 +3592,36 @@ bool css_has_online_children(struct cgroup_subsys_state *css)
 	return ret;
 }
 
-/**
- * css_task_iter_advance_css_set - advance a task itererator to the next css_set
- * @it: the iterator to advance
- *
- * Advance @it to the next css_set to walk.
- */
-static void css_task_iter_advance_css_set(struct css_task_iter *it)
+static struct css_set *css_task_iter_next_css_set(struct css_task_iter *it)
 {
-	struct list_head *l = it->cset_pos;
+	bool threaded = it->flags & CSS_TASK_ITER_THREADED;
+	struct list_head *l;
 	struct cgrp_cset_link *link;
 	struct css_set *cset;
 
 	lockdep_assert_held(&css_set_lock);
 
-	/* Advance to the next non-empty css_set */
+	/* find the next threaded cset */
+	if (it->tcset_pos) {
+		l = it->tcset_pos->next;
+
+		if (l != it->tcset_head) {
+			it->tcset_pos = l;
+			return container_of(l, struct css_set,
+					    threaded_csets_node);
+		}
+
+		it->tcset_pos = NULL;
+	}
+
+	/* find the next cset */
+	l = it->cset_pos;
+
 	do {
 		l = l->next;
 		if (l == it->cset_head) {
 			it->cset_pos = NULL;
-			it->task_pos = NULL;
-			return;
+			return NULL;
 		}
 
 		if (it->ss) {
@@ -3622,10 +3631,50 @@ static void css_task_iter_advance_css_set(struct css_task_iter *it)
 			link = list_entry(l, struct cgrp_cset_link, cset_link);
 			cset = link->cset;
 		}
-	} while (!css_set_populated(cset));
+
+		/*
+		 * For threaded iterations, threaded csets are walked
+		 * together with their proc_csets.  Skip here.
+		 */
+	} while (threaded && css_set_threaded(cset));
 
 	it->cset_pos = l;
 
+	/* initialize threaded cset walking */
+	if (threaded) {
+		if (it->cur_pcset)
+			put_css_set_locked(it->cur_pcset);
+		it->cur_pcset = cset;
+		get_css_set(cset);
+
+		it->tcset_head = &cset->threaded_csets;
+		it->tcset_pos = &cset->threaded_csets;
+	}
+
+	return cset;
+}
+
+/**
+ * css_task_iter_advance_css_set - advance a task iterator to the next css_set
+ * @it: the iterator to advance
+ *
+ * Advance @it to the next css_set to walk.
+ */
+static void css_task_iter_advance_css_set(struct css_task_iter *it)
+{
+	struct css_set *cset;
+
+	lockdep_assert_held(&css_set_lock);
+
+	/* Advance to the next non-empty css_set */
+	do {
+		cset = css_task_iter_next_css_set(it);
+		if (!cset) {
+			it->task_pos = NULL;
+			return;
+		}
+	} while (!css_set_populated(cset));
+
 	if (!list_empty(&cset->tasks))
 		it->task_pos = cset->tasks.next;
 	else
@@ -3768,6 +3817,9 @@ void css_task_iter_end(struct css_task_iter *it)
 		spin_unlock_irq(&css_set_lock);
 	}
 
+	if (it->cur_pcset)
+		put_css_set(it->cur_pcset);
+
 	if (it->cur_task)
 		put_task_struct(it->cur_task);
 }
@@ -3793,6 +3845,7 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 	struct kernfs_open_file *of = s->private;
 	struct cgroup *cgrp = seq_css(s)->cgroup;
 	struct css_task_iter *it = of->priv;
+	unsigned iter_flags = CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED;
 
 	/*
 	 * When a seq_file is seeked, it's always traversed sequentially
@@ -3806,10 +3859,10 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 		if (!it)
 			return ERR_PTR(-ENOMEM);
 		of->priv = it;
-		css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS, it);
+		css_task_iter_start(&cgrp->self, iter_flags, it);
 	} else if (!(*pos)++) {
 		css_task_iter_end(it);
-		css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS, it);
+		css_task_iter_start(&cgrp->self, iter_flags, it);
 	}
 
 	return cgroup_procs_next(s, NULL, NULL);
-- 
1.8.3.1


* [RFC PATCH 05/14] cgroup: implement cgroup v2 thread support
  2017-04-21 14:03 [RFC PATCH 00/14] cgroup: Implement cgroup v2 thread mode & CPU controller Waiman Long
                   ` (3 preceding siblings ...)
  2017-04-21 14:04 ` [RFC PATCH 04/14] cgroup: implement CSS_TASK_ITER_THREADED Waiman Long
@ 2017-04-21 14:04 ` Waiman Long
  2017-04-21 14:04 ` [RFC PATCH 06/14] cgroup: Fix reference counting bug in cgroup_procs_write() Waiman Long
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 17+ messages in thread
From: Waiman Long @ 2017-04-21 14:04 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault

From: Tejun Heo <tj@kernel.org>

This patch implements cgroup v2 thread support.  The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.

Once thread mode is enabled on a cgroup, the threads of the processes
which are in its subtree can be placed inside the subtree without
being restricted by process granularity or the no-internal-process
constraint.  Note that the threads aren't allowed to escape to a
different threaded subtree.  To be used inside a threaded subtree, a
controller should explicitly support threaded mode and be able to
handle internal competition in the way which is appropriate for the
resource.

The root of a threaded subtree, where thread mode is enabled in the
first place, is called the thread root and serves as the resource
domain for the whole subtree.  This is the last cgroup where
non-threaded controllers are operational and where all the
domain-level resource consumptions in the subtree are accounted.  This
allows threaded controllers to operate at thread granularity when
requested while staying inside the scope of system-level resource
distribution.

Internally, in a threaded subtree, each css_set has its ->proc_cset
pointing to a matching css_set which belongs to the thread root.  This
ensures that thread root level cgroup_subsys_state for all threaded
controllers are readily accessible for domain-level operations.

This patch enables threaded mode for the pids and perf_events
controllers.  Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
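
Opting a controller in is essentially a one-line change: the
subsystem sets the new flag in its cgroup_subsys definition (abridged
sketch; the real structure declares more callbacks):

	struct cgroup_subsys pids_cgrp_subsys = {
		/* ... allocation, attach and fork callbacks ... */
		.legacy_cftypes	= pids_files,
		.threaded	= true,
	};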

For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.  Note that the documentation update is not complete as
the rest of the documentation needs to be updated accordingly.
Rolling those updates into this patch could be confusing, so those
will be done in separate patches.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 Documentation/cgroup-v2.txt |  75 +++++++++++++-
 include/linux/cgroup-defs.h |  16 +++
 kernel/cgroup/cgroup.c      | 240 +++++++++++++++++++++++++++++++++++++++++++-
 kernel/cgroup/pids.c        |   1 +
 kernel/events/core.c        |   1 +
 5 files changed, 326 insertions(+), 7 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 49d7c99..2375e22 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -16,7 +16,9 @@ CONTENTS
   1-2. What is cgroup?
 2. Basic Operations
   2-1. Mounting
-  2-2. Organizing Processes
+  2-2. Organizing Processes and Threads
+    2-2-1. Processes
+    2-2-2. Threads
   2-3. [Un]populated Notification
   2-4. Controlling Controllers
     2-4-1. Enabling and Disabling
@@ -150,7 +152,9 @@ and experimenting easier, the kernel parameter cgroup_no_v1= allows
 disabling controllers in v1 and make them always available in v2.
 
 
-2-2. Organizing Processes
+2-2. Organizing Processes and Threads
+
+2-2-1. Processes
 
 Initially, only the root cgroup exists to which all processes belong.
 A child cgroup can be created by creating a sub-directory.
@@ -201,6 +205,73 @@ is removed subsequently, " (deleted)" is appended to the path.
   0::/test-cgroup/test-cgroup-nested (deleted)
 
 
+2-2-2. Threads
+
+cgroup v2 supports thread granularity for a subset of controllers to
+support use cases requiring hierarchical resource distribution across
+the threads of a group of processes.  By default, all threads of a
+process belong to the same cgroup, which also serves as the resource
+domain to host resource consumptions which are not specific to a
+process or thread.  The thread mode allows threads to be spread across
+a subtree while still maintaining the common resource domain for them.
+
+Enabling thread mode on a subtree makes it threaded.  The root of a
+threaded subtree is called thread root and serves as the resource
+domain for the entire subtree.  In a threaded subtree, threads of a
+process can be put in different cgroups and are not subject to the no
+internal process constraint - threaded controllers can be enabled on
+non-leaf cgroups whether they have threads in them or not.
+
+To enable the thread mode, the following conditions must be met.
+
+- The thread root doesn't have any child cgroups.
+
+- The thread root doesn't have any controllers enabled.
+
+Thread mode can be enabled by writing "enable" to "cgroup.threads"
+file.
+
+  # echo enable > cgroup.threads
+
+Inside a threaded subtree, "cgroup.threads" can be read and contains
+the list of the thread IDs of all threads in the cgroup.  Except that
+the operations are per-thread instead of per-process, "cgroup.threads"
+has the same format and behaves the same way as "cgroup.procs".
+
+The thread root serves as the resource domain for the whole subtree,
+and, while the threads can be scattered across the subtree, all the
+processes are considered to be in the thread root.  "cgroup.procs" in
+a thread root contains the PIDs of all processes in the subtree and is
+not readable in the subtree proper.  However, "cgroup.procs" can be
+written to from anywhere in the subtree to migrate all threads of the
+matching process to the cgroup.
+
+Only threaded controllers can be enabled in a threaded subtree.  When
+a threaded controller is enabled inside a threaded subtree, it only
+accounts for and controls resource consumptions associated with the
+threads in the cgroup and its descendants.  All consumptions which
+aren't tied to a specific thread belong to the thread root.
+
+Because a threaded subtree is exempt from the no internal process
+constraint, a threaded controller must be able to handle competition
+between threads in a non-leaf cgroup and its child cgroups.  Each
+threaded controller defines how such competitions are handled.
+
+To disable the thread mode, the following conditions must be met.
+
+- The cgroup is a thread root.  Thread mode can't be disabled
+  partially in the subtree.
+
+- The thread root doesn't have any child cgroups.
+
+- The thread root doesn't have any controllers enabled.
+
+Thread mode can be disabled by writing "disable" to "cgroup.threads"
+file.
+
+  # echo disable > cgroup.threads
+
+
 2-3. [Un]populated Notification
 
 Each non-root cgroup has a "cgroup.events" file which contains
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 9283ee9..bb4752a 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -226,6 +226,10 @@ struct css_set {
 	struct cgroup *mg_dst_cgrp;
 	struct css_set *mg_dst_cset;
 
+	/* used while updating ->proc_cset to enable/disable threaded mode */
+	struct list_head pcset_preload_node;
+	struct css_set *pcset_preload;
+
 	/* dead and being drained, ignore for migration */
 	bool dead;
 
@@ -497,6 +501,18 @@ struct cgroup_subsys {
 	bool implicit_on_dfl:1;
 
 	/*
+	 * If %true, the controller supports threaded mode on the default
+	 * hierarchy.  In a threaded subtree, both process granularity
+	 * and the no-internal-process constraint are ignored and threaded
+	 * controllers should be able to handle that.
+	 *
+	 * Note that as an implicit controller is automatically enabled on
+	 * all cgroups on the default hierarchy, it should also be
+	 * threaded.  implicit && !threaded is not supported.
+	 */
+	bool threaded:1;
+
+	/*
 	 * If %false, this subsystem is properly hierarchical -
 	 * configuration, resource accounting and restriction on a parent
 	 * cgroup cover those of its children.  If %true, hierarchy support
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index b2b1886..6748207 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -162,6 +162,9 @@ struct cgroup_subsys *cgroup_subsys[] = {
 /* some controllers are implicitly enabled on the default hierarchy */
 static u16 cgrp_dfl_implicit_ss_mask;
 
+/* some controllers can be threaded on the default hierarchy */
+static u16 cgrp_dfl_threaded_ss_mask;
+
 /* The list of hierarchy roots */
 LIST_HEAD(cgroup_roots);
 static int cgroup_root_count;
@@ -2911,11 +2914,18 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 		goto out_unlock;
 	}
 
+	/* can't enable !threaded controllers on a threaded cgroup */
+	if (cgrp->proc_cgrp && (enable & ~cgrp_dfl_threaded_ss_mask)) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
 	/*
-	 * Except for the root, subtree_control must be zero for a cgroup
-	 * with tasks so that child cgroups don't compete against tasks.
+	 * Except for root and threaded cgroups, subtree_control must be
+	 * zero for a cgroup with tasks so that child cgroups don't compete
+	 * against tasks.
 	 */
-	if (enable && cgroup_parent(cgrp)) {
+	if (enable && cgroup_parent(cgrp) && !cgrp->proc_cgrp) {
 		struct cgrp_cset_link *link;
 
 		/*
@@ -2956,6 +2966,124 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
+static int cgroup_enable_threaded(struct cgroup *cgrp)
+{
+	LIST_HEAD(csets);
+	struct cgrp_cset_link *link;
+	struct css_set *cset, *cset_next;
+	int ret;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	/* noop if already threaded */
+	if (cgrp->proc_cgrp)
+		return 0;
+
+	/* allow only if there are neither children nor enabled controllers */
+	if (css_has_online_children(&cgrp->self) || cgrp->subtree_control)
+		return -EBUSY;
+
+	/* find all csets which need ->proc_cset updated */
+	spin_lock_irq(&css_set_lock);
+	list_for_each_entry(link, &cgrp->cset_links, cset_link) {
+		cset = link->cset;
+		if (css_set_populated(cset)) {
+			WARN_ON_ONCE(css_set_threaded(cset));
+			WARN_ON_ONCE(cset->pcset_preload);
+
+			list_add_tail(&cset->pcset_preload_node, &csets);
+			get_css_set(cset);
+		}
+	}
+	spin_unlock_irq(&css_set_lock);
+
+	/* find the proc_csets to associate */
+	list_for_each_entry(cset, &csets, pcset_preload_node) {
+		struct css_set *pcset = find_css_set(cset, cgrp, true);
+
+		WARN_ON_ONCE(cset == pcset);
+		if (!pcset) {
+			ret = -ENOMEM;
+			goto err_put_csets;
+		}
+		cset->pcset_preload = pcset;
+	}
+
+	/* install ->proc_cset */
+	spin_lock_irq(&css_set_lock);
+	list_for_each_entry_safe(cset, cset_next, &csets, pcset_preload_node) {
+		rcu_assign_pointer(cset->proc_cset, cset->pcset_preload);
+		list_add_tail(&cset->threaded_csets_node,
+			      &cset->pcset_preload->threaded_csets);
+
+		cset->pcset_preload = NULL;
+		list_del(&cset->pcset_preload_node);
+		put_css_set_locked(cset);
+	}
+	spin_unlock_irq(&css_set_lock);
+
+	/* mark it threaded */
+	cgrp->proc_cgrp = cgrp;
+
+	return 0;
+
+err_put_csets:
+	spin_lock_irq(&css_set_lock);
+	list_for_each_entry_safe(cset, cset_next, &csets, pcset_preload_node) {
+		if (cset->pcset_preload) {
+			put_css_set_locked(cset->pcset_preload);
+			cset->pcset_preload = NULL;
+		}
+		list_del(&cset->pcset_preload_node);
+		put_css_set_locked(cset);
+	}
+	spin_unlock_irq(&css_set_lock);
+	return ret;
+}
+
+static int cgroup_disable_threaded(struct cgroup *cgrp)
+{
+	struct cgrp_cset_link *link;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	/* noop if already !threaded */
+	if (!cgrp->proc_cgrp)
+		return 0;
+
+	/* partial disable isn't supported */
+	if (cgrp->proc_cgrp != cgrp)
+		return -EBUSY;
+
+	/* allow only if there are neither children nor enabled controllers */
+	if (css_has_online_children(&cgrp->self) || cgrp->subtree_control)
+		return -EBUSY;
+
+	/* walk all csets and reset ->proc_cset */
+	spin_lock_irq(&css_set_lock);
+	list_for_each_entry(link, &cgrp->cset_links, cset_link) {
+		struct css_set *cset = link->cset;
+
+		if (css_set_threaded(cset)) {
+			struct css_set *pcset = proc_css_set(cset);
+
+			WARN_ON_ONCE(pcset->dfl_cgrp != cgrp);
+			rcu_assign_pointer(cset->proc_cset, cset);
+			list_del(&cset->threaded_csets_node);
+
+			/*
+			 * @pcset is never @cset and safe to put during
+			 * iteration.
+			 */
+			put_css_set_locked(pcset);
+		}
+	}
+	cgrp->proc_cgrp = NULL;
+	spin_unlock_irq(&css_set_lock);
+
+	return 0;
+}
+
 static int cgroup_events_show(struct seq_file *seq, void *v)
 {
 	seq_printf(seq, "populated %d\n",
@@ -3840,12 +3968,12 @@ static void *cgroup_procs_next(struct seq_file *s, void *v, loff_t *pos)
 	return css_task_iter_next(it);
 }
 
-static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
+static void *__cgroup_procs_start(struct seq_file *s, loff_t *pos,
+				  unsigned int iter_flags)
 {
 	struct kernfs_open_file *of = s->private;
 	struct cgroup *cgrp = seq_css(s)->cgroup;
 	struct css_task_iter *it = of->priv;
-	unsigned iter_flags = CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED;
 
 	/*
 	 * When a seq_file is seeked, it's always traversed sequentially
@@ -3868,6 +3996,23 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 	return cgroup_procs_next(s, NULL, NULL);
 }
 
+static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
+{
+	struct cgroup *cgrp = seq_css(s)->cgroup;
+
+	/*
+	 * All processes of a threaded subtree are in the top threaded
+	 * cgroup.  Only threads can be distributed across the subtree.
+	 * Reject reads on cgroup.procs in the subtree proper.  They're
+	 * always empty anyway.
+	 */
+	if (cgrp->proc_cgrp && cgrp->proc_cgrp != cgrp)
+		return ERR_PTR(-EINVAL);
+
+	return __cgroup_procs_start(s, pos, CSS_TASK_ITER_PROCS |
+					    CSS_TASK_ITER_THREADED);
+}
+
 static int cgroup_procs_show(struct seq_file *s, void *v)
 {
 	seq_printf(s, "%d\n", task_pid_vnr(v));
@@ -3922,6 +4067,76 @@ static ssize_t cgroup_procs_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
+static void *cgroup_threads_start(struct seq_file *s, loff_t *pos)
+{
+	struct cgroup *cgrp = seq_css(s)->cgroup;
+
+	if (!cgrp->proc_cgrp)
+		return ERR_PTR(-EINVAL);
+
+	return __cgroup_procs_start(s, pos, 0);
+}
+
+static ssize_t cgroup_threads_write(struct kernfs_open_file *of,
+				    char *buf, size_t nbytes, loff_t off)
+{
+	struct super_block *sb = of->file->f_path.dentry->d_sb;
+	struct cgroup *cgrp, *common_ancestor;
+	struct task_struct *task;
+	ssize_t ret;
+
+	buf = strstrip(buf);
+
+	cgrp = cgroup_kn_lock_live(of->kn, false);
+	if (!cgrp)
+		return -ENODEV;
+
+	/* cgroup.procs determines delegation, require permission on it too */
+	ret = cgroup_procs_write_permission(cgrp, sb);
+	if (ret)
+		goto out_unlock;
+
+	/* enable or disable? */
+	if (!strcmp(buf, "enable")) {
+		ret = cgroup_enable_threaded(cgrp);
+		goto out_unlock;
+	} else if (!strcmp(buf, "disable")) {
+		ret = cgroup_disable_threaded(cgrp);
+		goto out_unlock;
+	}
+
+	/* thread migration */
+	ret = -EINVAL;
+	if (!cgrp->proc_cgrp)
+		goto out_unlock;
+
+	task = cgroup_procs_write_start(buf, false);
+	ret = PTR_ERR_OR_ZERO(task);
+	if (ret)
+		goto out_unlock;
+
+	common_ancestor = cgroup_migrate_common_ancestor(task, cgrp);
+
+	/* can't migrate across disjoint threaded subtrees */
+	ret = -EACCES;
+	if (common_ancestor->proc_cgrp != cgrp->proc_cgrp)
+		goto out_finish;
+
+	/* and follow the cgroup.procs delegation rule */
+	ret = cgroup_procs_write_permission(common_ancestor, sb);
+	if (ret)
+		goto out_finish;
+
+	ret = cgroup_attach_task(cgrp, task, false);
+
+out_finish:
+	cgroup_procs_write_finish();
+out_unlock:
+	cgroup_kn_unlock(of->kn);
+
+	return ret ?: nbytes;
+}
+
 /* cgroup core interface files for the default hierarchy */
 static struct cftype cgroup_base_files[] = {
 	{
@@ -3934,6 +4149,14 @@ static ssize_t cgroup_procs_write(struct kernfs_open_file *of,
 		.write = cgroup_procs_write,
 	},
 	{
+		.name = "cgroup.threads",
+		.release = cgroup_procs_release,
+		.seq_start = cgroup_threads_start,
+		.seq_next = cgroup_procs_next,
+		.seq_show = cgroup_procs_show,
+		.write = cgroup_threads_write,
+	},
+	{
 		.name = "cgroup.controllers",
 		.seq_show = cgroup_controllers_show,
 	},
@@ -4247,6 +4470,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 	cgrp->self.parent = &parent->self;
 	cgrp->root = root;
 	cgrp->level = level;
+	cgrp->proc_cgrp = parent->proc_cgrp;
 
 	for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp))
 		cgrp->ancestor_ids[tcgrp->level] = tcgrp->id;
@@ -4689,11 +4913,17 @@ int __init cgroup_init(void)
 
 		cgrp_dfl_root.subsys_mask |= 1 << ss->id;
 
+		/* implicit controllers must be threaded too */
+		WARN_ON(ss->implicit_on_dfl && !ss->threaded);
+
 		if (ss->implicit_on_dfl)
 			cgrp_dfl_implicit_ss_mask |= 1 << ss->id;
 		else if (!ss->dfl_cftypes)
 			cgrp_dfl_inhibit_ss_mask |= 1 << ss->id;
 
+		if (ss->threaded)
+			cgrp_dfl_threaded_ss_mask |= 1 << ss->id;
+
 		if (ss->dfl_cftypes == ss->legacy_cftypes) {
 			WARN_ON(cgroup_add_cftypes(ss, ss->dfl_cftypes));
 		} else {
diff --git a/kernel/cgroup/pids.c b/kernel/cgroup/pids.c
index 2237201..9829c67 100644
--- a/kernel/cgroup/pids.c
+++ b/kernel/cgroup/pids.c
@@ -345,4 +345,5 @@ struct cgroup_subsys pids_cgrp_subsys = {
 	.free		= pids_free,
 	.legacy_cftypes	= pids_files,
 	.dfl_cftypes	= pids_files,
+	.threaded	= true,
 };
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 80cf340..095973b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11129,5 +11129,6 @@ struct cgroup_subsys perf_event_cgrp_subsys = {
 	 * controller is not mounted on a legacy hierarchy.
 	 */
 	.implicit_on_dfl = true,
+	.threaded	= true,
 };
 #endif /* CONFIG_CGROUP_PERF */
-- 
1.8.3.1

* [RFC PATCH 06/14] cgroup: Fix reference counting bug in cgroup_procs_write()
From: Waiman Long @ 2017-04-21 14:04 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, Waiman Long

cgroup_procs_write_start() takes a reference to the task structure
which was not properly released by cgroup_procs_write() and the
other attach paths. So a put_task_struct() call is added to
cgroup_procs_write_finish() to match the get_task_struct() in
cgroup_procs_write_start(), fixing this reference counting error.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cgroup-internal.h |  2 +-
 kernel/cgroup/cgroup-v1.c       |  2 +-
 kernel/cgroup/cgroup.c          | 10 ++++++----
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 6ef662a..bea3928 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -181,7 +181,7 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 		       bool threadgroup);
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup)
 	__acquires(&cgroup_threadgroup_rwsem);
-void cgroup_procs_write_finish(void)
+void cgroup_procs_write_finish(struct task_struct *task)
 	__releases(&cgroup_threadgroup_rwsem);
 
 void cgroup_lock_and_drain_offline(struct cgroup *cgrp);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index b837e1a..e80bc8e 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -549,7 +549,7 @@ static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of,
 	ret = cgroup_attach_task(cgrp, task, threadgroup);
 
 out_finish:
-	cgroup_procs_write_finish();
+	cgroup_procs_write_finish(task);
 out_unlock:
 	cgroup_kn_unlock(of->kn);
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 6748207..d48eedd 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2487,12 +2487,15 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup)
 	return tsk;
 }
 
-void cgroup_procs_write_finish(void)
+void cgroup_procs_write_finish(struct task_struct *task)
 	__releases(&cgroup_threadgroup_rwsem)
 {
 	struct cgroup_subsys *ss;
 	int ssid;
 
+	/* release reference from cgroup_procs_write_start() */
+	put_task_struct(task);
+
 	percpu_up_write(&cgroup_threadgroup_rwsem);
 	for_each_subsys(ss, ssid)
 		if (ss->post_attach)
@@ -3295,7 +3298,6 @@ static int cgroup_addrm_files(struct cgroup_subsys_state *css,
 
 static int cgroup_apply_cftypes(struct cftype *cfts, bool is_add)
 {
-	LIST_HEAD(pending);
 	struct cgroup_subsys *ss = cfts[0].ss;
 	struct cgroup *root = &ss->root->cgrp;
 	struct cgroup_subsys_state *css;
@@ -4060,7 +4062,7 @@ static ssize_t cgroup_procs_write(struct kernfs_open_file *of,
 	ret = cgroup_attach_task(cgrp, task, true);
 
 out_finish:
-	cgroup_procs_write_finish();
+	cgroup_procs_write_finish(task);
 out_unlock:
 	cgroup_kn_unlock(of->kn);
 
@@ -4130,7 +4132,7 @@ static ssize_t cgroup_threads_write(struct kernfs_open_file *of,
 	ret = cgroup_attach_task(cgrp, task, false);
 
 out_finish:
-	cgroup_procs_write_finish();
+	cgroup_procs_write_finish(task);
 out_unlock:
 	cgroup_kn_unlock(of->kn);
 
-- 
1.8.3.1

* [RFC PATCH 07/14] cgroup: Move debug cgroup to its own file
From: Waiman Long @ 2017-04-21 14:04 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, Waiman Long

The debug cgroup currently resides within cgroup-v1.c and is enabled
only for cgroup v1. To enable the debug cgroup for cgroup v2 as well,
it makes sense to move the code into its own file as it will no
longer be v1-specific. The only code change in this patch is the
inline expansion of cgroup_task_count() within the
debug_taskcount_read() function.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/Makefile    |   1 +
 kernel/cgroup/cgroup-v1.c | 147 -----------------------------------------
 kernel/cgroup/debug.c     | 165 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 166 insertions(+), 147 deletions(-)
 create mode 100644 kernel/cgroup/debug.c

diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
index 387348a..ce693cc 100644
--- a/kernel/cgroup/Makefile
+++ b/kernel/cgroup/Makefile
@@ -4,3 +4,4 @@ obj-$(CONFIG_CGROUP_FREEZER) += freezer.o
 obj-$(CONFIG_CGROUP_PIDS) += pids.o
 obj-$(CONFIG_CGROUP_RDMA) += rdma.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
+obj-$(CONFIG_CGROUP_DEBUG) += debug.o
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index e80bc8e..6757a50 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -1297,150 +1297,3 @@ static int __init cgroup_no_v1(char *str)
 	return 1;
 }
 __setup("cgroup_no_v1=", cgroup_no_v1);
-
-
-#ifdef CONFIG_CGROUP_DEBUG
-static struct cgroup_subsys_state *
-debug_css_alloc(struct cgroup_subsys_state *parent_css)
-{
-	struct cgroup_subsys_state *css = kzalloc(sizeof(*css), GFP_KERNEL);
-
-	if (!css)
-		return ERR_PTR(-ENOMEM);
-
-	return css;
-}
-
-static void debug_css_free(struct cgroup_subsys_state *css)
-{
-	kfree(css);
-}
-
-static u64 debug_taskcount_read(struct cgroup_subsys_state *css,
-				struct cftype *cft)
-{
-	return cgroup_task_count(css->cgroup);
-}
-
-static u64 current_css_set_read(struct cgroup_subsys_state *css,
-				struct cftype *cft)
-{
-	return (u64)(unsigned long)current->cgroups;
-}
-
-static u64 current_css_set_refcount_read(struct cgroup_subsys_state *css,
-					 struct cftype *cft)
-{
-	u64 count;
-
-	rcu_read_lock();
-	count = atomic_read(&task_css_set(current)->refcount);
-	rcu_read_unlock();
-	return count;
-}
-
-static int current_css_set_cg_links_read(struct seq_file *seq, void *v)
-{
-	struct cgrp_cset_link *link;
-	struct css_set *cset;
-	char *name_buf;
-
-	name_buf = kmalloc(NAME_MAX + 1, GFP_KERNEL);
-	if (!name_buf)
-		return -ENOMEM;
-
-	spin_lock_irq(&css_set_lock);
-	rcu_read_lock();
-	cset = rcu_dereference(current->cgroups);
-	list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
-		struct cgroup *c = link->cgrp;
-
-		cgroup_name(c, name_buf, NAME_MAX + 1);
-		seq_printf(seq, "Root %d group %s\n",
-			   c->root->hierarchy_id, name_buf);
-	}
-	rcu_read_unlock();
-	spin_unlock_irq(&css_set_lock);
-	kfree(name_buf);
-	return 0;
-}
-
-#define MAX_TASKS_SHOWN_PER_CSS 25
-static int cgroup_css_links_read(struct seq_file *seq, void *v)
-{
-	struct cgroup_subsys_state *css = seq_css(seq);
-	struct cgrp_cset_link *link;
-
-	spin_lock_irq(&css_set_lock);
-	list_for_each_entry(link, &css->cgroup->cset_links, cset_link) {
-		struct css_set *cset = link->cset;
-		struct task_struct *task;
-		int count = 0;
-
-		seq_printf(seq, "css_set %pK\n", cset);
-
-		list_for_each_entry(task, &cset->tasks, cg_list) {
-			if (count++ > MAX_TASKS_SHOWN_PER_CSS)
-				goto overflow;
-			seq_printf(seq, "  task %d\n", task_pid_vnr(task));
-		}
-
-		list_for_each_entry(task, &cset->mg_tasks, cg_list) {
-			if (count++ > MAX_TASKS_SHOWN_PER_CSS)
-				goto overflow;
-			seq_printf(seq, "  task %d\n", task_pid_vnr(task));
-		}
-		continue;
-	overflow:
-		seq_puts(seq, "  ...\n");
-	}
-	spin_unlock_irq(&css_set_lock);
-	return 0;
-}
-
-static u64 releasable_read(struct cgroup_subsys_state *css, struct cftype *cft)
-{
-	return (!cgroup_is_populated(css->cgroup) &&
-		!css_has_online_children(&css->cgroup->self));
-}
-
-static struct cftype debug_files[] =  {
-	{
-		.name = "taskcount",
-		.read_u64 = debug_taskcount_read,
-	},
-
-	{
-		.name = "current_css_set",
-		.read_u64 = current_css_set_read,
-	},
-
-	{
-		.name = "current_css_set_refcount",
-		.read_u64 = current_css_set_refcount_read,
-	},
-
-	{
-		.name = "current_css_set_cg_links",
-		.seq_show = current_css_set_cg_links_read,
-	},
-
-	{
-		.name = "cgroup_css_links",
-		.seq_show = cgroup_css_links_read,
-	},
-
-	{
-		.name = "releasable",
-		.read_u64 = releasable_read,
-	},
-
-	{ }	/* terminate */
-};
-
-struct cgroup_subsys debug_cgrp_subsys = {
-	.css_alloc = debug_css_alloc,
-	.css_free = debug_css_free,
-	.legacy_cftypes = debug_files,
-};
-#endif /* CONFIG_CGROUP_DEBUG */
diff --git a/kernel/cgroup/debug.c b/kernel/cgroup/debug.c
new file mode 100644
index 0000000..9146461
--- /dev/null
+++ b/kernel/cgroup/debug.c
@@ -0,0 +1,165 @@
+#include <linux/ctype.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+
+#include "cgroup-internal.h"
+
+static struct cgroup_subsys_state *
+debug_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+	struct cgroup_subsys_state *css = kzalloc(sizeof(*css), GFP_KERNEL);
+
+	if (!css)
+		return ERR_PTR(-ENOMEM);
+
+	return css;
+}
+
+static void debug_css_free(struct cgroup_subsys_state *css)
+{
+	kfree(css);
+}
+
+/*
+ * debug_taskcount_read - return the number of tasks in a cgroup.
+ * @css: the css of the cgroup in question
+ *
+ * Return the number of tasks in the cgroup.  The returned number can be
+ * higher than the actual number of tasks due to css_set references from
+ * namespace roots and temporary usages.
+ */
+static u64 debug_taskcount_read(struct cgroup_subsys_state *css,
+				struct cftype *cft)
+{
+	struct cgroup *cgrp = css->cgroup;
+	u64 count = 0;
+	struct cgrp_cset_link *link;
+
+	spin_lock_irq(&css_set_lock);
+	list_for_each_entry(link, &cgrp->cset_links, cset_link)
+		count += atomic_read(&link->cset->refcount);
+	spin_unlock_irq(&css_set_lock);
+	return count;
+}
+
+static u64 current_css_set_read(struct cgroup_subsys_state *css,
+				struct cftype *cft)
+{
+	return (u64)(unsigned long)current->cgroups;
+}
+
+static u64 current_css_set_refcount_read(struct cgroup_subsys_state *css,
+					 struct cftype *cft)
+{
+	u64 count;
+
+	rcu_read_lock();
+	count = atomic_read(&task_css_set(current)->refcount);
+	rcu_read_unlock();
+	return count;
+}
+
+static int current_css_set_cg_links_read(struct seq_file *seq, void *v)
+{
+	struct cgrp_cset_link *link;
+	struct css_set *cset;
+	char *name_buf;
+
+	name_buf = kmalloc(NAME_MAX + 1, GFP_KERNEL);
+	if (!name_buf)
+		return -ENOMEM;
+
+	spin_lock_irq(&css_set_lock);
+	rcu_read_lock();
+	cset = rcu_dereference(current->cgroups);
+	list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
+		struct cgroup *c = link->cgrp;
+
+		cgroup_name(c, name_buf, NAME_MAX + 1);
+		seq_printf(seq, "Root %d group %s\n",
+			   c->root->hierarchy_id, name_buf);
+	}
+	rcu_read_unlock();
+	spin_unlock_irq(&css_set_lock);
+	kfree(name_buf);
+	return 0;
+}
+
+#define MAX_TASKS_SHOWN_PER_CSS 25
+static int cgroup_css_links_read(struct seq_file *seq, void *v)
+{
+	struct cgroup_subsys_state *css = seq_css(seq);
+	struct cgrp_cset_link *link;
+
+	spin_lock_irq(&css_set_lock);
+	list_for_each_entry(link, &css->cgroup->cset_links, cset_link) {
+		struct css_set *cset = link->cset;
+		struct task_struct *task;
+		int count = 0;
+
+		seq_printf(seq, "css_set %pK\n", cset);
+
+		list_for_each_entry(task, &cset->tasks, cg_list) {
+			if (count++ > MAX_TASKS_SHOWN_PER_CSS)
+				goto overflow;
+			seq_printf(seq, "  task %d\n", task_pid_vnr(task));
+		}
+
+		list_for_each_entry(task, &cset->mg_tasks, cg_list) {
+			if (count++ > MAX_TASKS_SHOWN_PER_CSS)
+				goto overflow;
+			seq_printf(seq, "  task %d\n", task_pid_vnr(task));
+		}
+		continue;
+	overflow:
+		seq_puts(seq, "  ...\n");
+	}
+	spin_unlock_irq(&css_set_lock);
+	return 0;
+}
+
+static u64 releasable_read(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	return (!cgroup_is_populated(css->cgroup) &&
+		!css_has_online_children(&css->cgroup->self));
+}
+
+static struct cftype debug_files[] =  {
+	{
+		.name = "taskcount",
+		.read_u64 = debug_taskcount_read,
+	},
+
+	{
+		.name = "current_css_set",
+		.read_u64 = current_css_set_read,
+	},
+
+	{
+		.name = "current_css_set_refcount",
+		.read_u64 = current_css_set_refcount_read,
+	},
+
+	{
+		.name = "current_css_set_cg_links",
+		.seq_show = current_css_set_cg_links_read,
+	},
+
+	{
+		.name = "cgroup_css_links",
+		.seq_show = cgroup_css_links_read,
+	},
+
+	{
+		.name = "releasable",
+		.read_u64 = releasable_read,
+	},
+
+	{ }	/* terminate */
+};
+
+struct cgroup_subsys debug_cgrp_subsys = {
+	.css_alloc = debug_css_alloc,
+	.css_free = debug_css_free,
+	.legacy_cftypes = debug_files,
+};
-- 
1.8.3.1

* [RFC PATCH 08/14] cgroup: Keep accurate count of tasks in each css_set
From: Waiman Long @ 2017-04-21 14:04 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, Waiman Long

The reference count in the css_set data structure was used as a
proxy for the number of tasks attached to that css_set. However, that
count is not an accurate measure, especially with thread mode
support. So a new variable, task_count, is added to the css_set to
keep track of the actual task count. This new variable is protected
by the css_set_lock. Functions that require the actual task count are
updated to use the new variable.
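
As an illustration, a toy userspace model (not the kernel code; the
struct and helpers are stand-ins) of the invariant being maintained:
refcount may run higher than the number of attached tasks, while
task_count stays exact:

  #include <assert.h>

  /* css_set reduced to its two counters */
  struct css_set { int refcount; int task_count; };

  static void attach_task(struct css_set *cset)
  {
          cset->refcount++;       /* get_css_set() */
          cset->task_count++;     /* the new, precise counter */
  }

  static void detach_task(struct css_set *cset)
  {
          cset->task_count--;
          cset->refcount--;       /* put_css_set_locked() */
  }

  int main(void)
  {
          struct css_set cset = { .refcount = 1, .task_count = 0 };

          attach_task(&cset);
          cset.refcount++;        /* non-task reference, e.g. a namespace root */
          assert(cset.refcount == 3 && cset.task_count == 1);

          detach_task(&cset);
          assert(cset.task_count == 0);   /* exact despite the extra ref */
          return 0;
  }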

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/cgroup-defs.h | 3 +++
 kernel/cgroup/cgroup-v1.c   | 6 +-----
 kernel/cgroup/cgroup.c      | 5 +++++
 kernel/cgroup/debug.c       | 6 +-----
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index bb4752a..7be1a90 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -158,6 +158,9 @@ struct css_set {
 	/* reference count */
 	atomic_t refcount;
 
+	/* internal task count, protected by css_set_lock */
+	int task_count;
+
 	/*
 	 * If not threaded, the following points to self.  If threaded, to
 	 * a cset which belongs to the top cgroup of the threaded subtree.
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 6757a50..6d69796 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -334,10 +334,6 @@ static struct cgroup_pidlist *cgroup_pidlist_find_create(struct cgroup *cgrp,
 /**
  * cgroup_task_count - count the number of tasks in a cgroup.
  * @cgrp: the cgroup in question
- *
- * Return the number of tasks in the cgroup.  The returned number can be
- * higher than the actual number of tasks due to css_set references from
- * namespace roots and temporary usages.
  */
 static int cgroup_task_count(const struct cgroup *cgrp)
 {
@@ -346,7 +342,7 @@ static int cgroup_task_count(const struct cgroup *cgrp)
 
 	spin_lock_irq(&css_set_lock);
 	list_for_each_entry(link, &cgrp->cset_links, cset_link)
-		count += atomic_read(&link->cset->refcount);
+		count += link->cset->task_count;
 	spin_unlock_irq(&css_set_lock);
 	return count;
 }
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index d48eedd..3186b1f 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1671,6 +1671,7 @@ static void cgroup_enable_task_cg_lists(void)
 				css_set_update_populated(cset, true);
 			list_add_tail(&p->cg_list, &cset->tasks);
 			get_css_set(cset);
+			cset->task_count++;
 		}
 		spin_unlock(&p->sighand->siglock);
 	} while_each_thread(g, p);
@@ -2154,8 +2155,10 @@ static int cgroup_migrate_execute(struct cgroup_mgctx *mgctx)
 			struct css_set *to_cset = cset->mg_dst_cset;
 
 			get_css_set(to_cset);
+			to_cset->task_count++;
 			css_set_move_task(task, from_cset, to_cset, true);
 			put_css_set_locked(from_cset);
+			from_cset->task_count--;
 		}
 	}
 	spin_unlock_irq(&css_set_lock);
@@ -5150,6 +5153,7 @@ void cgroup_post_fork(struct task_struct *child)
 		cset = task_css_set(current);
 		if (list_empty(&child->cg_list)) {
 			get_css_set(cset);
+			cset->task_count++;
 			css_set_move_task(child, NULL, cset, false);
 		}
 		spin_unlock_irq(&css_set_lock);
@@ -5199,6 +5203,7 @@ void cgroup_exit(struct task_struct *tsk)
 	if (!list_empty(&tsk->cg_list)) {
 		spin_lock_irq(&css_set_lock);
 		css_set_move_task(tsk, cset, NULL, false);
+		cset->task_count--;
 		spin_unlock_irq(&css_set_lock);
 	} else {
 		get_css_set(cset);
diff --git a/kernel/cgroup/debug.c b/kernel/cgroup/debug.c
index 9146461..c8f7590 100644
--- a/kernel/cgroup/debug.c
+++ b/kernel/cgroup/debug.c
@@ -23,10 +23,6 @@ static void debug_css_free(struct cgroup_subsys_state *css)
 /*
  * debug_taskcount_read - return the number of tasks in a cgroup.
  * @css: the css of the cgroup in question
- *
- * Return the number of tasks in the cgroup.  The returned number can be
- * higher than the actual number of tasks due to css_set references from
- * namespace roots and temporary usages.
  */
 static u64 debug_taskcount_read(struct cgroup_subsys_state *css,
 				struct cftype *cft)
@@ -37,7 +33,7 @@ static u64 debug_taskcount_read(struct cgroup_subsys_state *css,
 
 	spin_lock_irq(&css_set_lock);
 	list_for_each_entry(link, &cgrp->cset_links, cset_link)
-		count += atomic_read(&link->cset->refcount);
+		count += link->cset->task_count;
 	spin_unlock_irq(&css_set_lock);
 	return count;
 }
-- 
1.8.3.1

* [RFC PATCH 09/14] cgroup: Make debug cgroup support v2 and thread mode
From: Waiman Long @ 2017-04-21 14:04 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, Waiman Long

Besides supporting cgroup v2 and thread mode, the following changes
are also made:
 1) current_* cgroup files now reside only at the root as we don't
    need duplicate files serving the same function all over the
    cgroup hierarchy.
 2) The cgroup_css_links_read() function is modified to report
    the number of tasks that are skipped because of overflow.
 3) The relationship between proc_cset and threaded_csets are displayed.
 4) The number of extra unaccounted references are displayed.
 5) The status of being a thread root or threaded cgroup is displayed.
 6) The current_css_set_read() function now prints out the addresses of
    the css'es associated with the current css_set.
 7) A new cgroup_subsys_states file is added to display the css objects
    associated with a cgroup.
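
As a rough sketch of the bookkeeping behind the new extra-reference
reporting (toy userspace model, not the kernel code): the raw
difference between refcount and task_count is shown per css_set as
"+N", and the one built-in reference held by init_css_set is
discounted from the summary total:

  #include <assert.h>
  #include <stdbool.h>

  static int summary_extra_refs(int refcount, int task_count,
                                bool is_init_css_set)
  {
          int extra = refcount - task_count;      /* displayed as "+N" */

          if (is_init_css_set)
                  extra--;        /* init_css_set's built-in reference */
          return extra > 0 ? extra : 0;
  }

  int main(void)
  {
          assert(summary_extra_refs(5, 3, false) == 2);
          assert(summary_extra_refs(4, 3, true) == 0);
          return 0;
  }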

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/debug.c | 151 ++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 134 insertions(+), 17 deletions(-)

diff --git a/kernel/cgroup/debug.c b/kernel/cgroup/debug.c
index c8f7590..4d74458 100644
--- a/kernel/cgroup/debug.c
+++ b/kernel/cgroup/debug.c
@@ -38,10 +38,37 @@ static u64 debug_taskcount_read(struct cgroup_subsys_state *css,
 	return count;
 }
 
-static u64 current_css_set_read(struct cgroup_subsys_state *css,
-				struct cftype *cft)
+static int current_css_set_read(struct seq_file *seq, void *v)
 {
-	return (u64)(unsigned long)current->cgroups;
+	struct css_set *cset;
+	struct cgroup_subsys *ss;
+	struct cgroup_subsys_state *css;
+	int i, refcnt;
+
+	mutex_lock(&cgroup_mutex);
+	spin_lock_irq(&css_set_lock);
+	rcu_read_lock();
+	cset = rcu_dereference(current->cgroups);
+	refcnt = atomic_read(&cset->refcount);
+	seq_printf(seq, "css_set %pK %d", cset, refcnt);
+	if (refcnt > cset->task_count)
+		seq_printf(seq, " +%d", refcnt - cset->task_count);
+	seq_puts(seq, "\n");
+
+	/*
+	 * Print the css'es stored in the current css_set.
+	 */
+	for_each_subsys(ss, i) {
+		css = cset->subsys[ss->id];
+		if (!css)
+			continue;
+		seq_printf(seq, "%2d: %-4s\t- %lx[%d]\n", ss->id, ss->name,
+			  (unsigned long)css, css->id);
+	}
+	rcu_read_unlock();
+	spin_unlock_irq(&css_set_lock);
+	mutex_unlock(&cgroup_mutex);
+	return 0;
 }
 
 static u64 current_css_set_refcount_read(struct cgroup_subsys_state *css,
@@ -86,31 +113,111 @@ static int cgroup_css_links_read(struct seq_file *seq, void *v)
 {
 	struct cgroup_subsys_state *css = seq_css(seq);
 	struct cgrp_cset_link *link;
+	int dead_cnt = 0, extra_refs = 0, threaded_csets = 0;
 
 	spin_lock_irq(&css_set_lock);
+	if (css->cgroup->proc_cgrp)
+		seq_puts(seq, (css->cgroup->proc_cgrp == css->cgroup)
+			      ? "[thread root]\n" : "[threaded]\n");
+
 	list_for_each_entry(link, &css->cgroup->cset_links, cset_link) {
 		struct css_set *cset = link->cset;
 		struct task_struct *task;
 		int count = 0;
+		int refcnt = atomic_read(&cset->refcount);
 
-		seq_printf(seq, "css_set %pK\n", cset);
+		/*
+		 * Print out the proc_cset and threaded_cset relationship
+		 * and highlight the difference between refcount and task_count.
+		 */
+		seq_printf(seq, "css_set %pK", cset);
+		if (cset->proc_cset != cset) {
+			threaded_csets++;
+			seq_printf(seq, "=>%pK", cset->proc_cset);
+		}
+		if (!list_empty(&cset->threaded_csets)) {
+			struct css_set *tcset;
+			int idx = 0;
+
+			list_for_each_entry(tcset, &cset->threaded_csets,
+					    threaded_csets_node) {
+				seq_puts(seq, idx ? "," : "<=");
+				seq_printf(seq, "%pK", tcset);
+				idx++;
+			}
+		} else {
+			seq_printf(seq, " %d", refcnt);
+			if (refcnt - cset->task_count > 0) {
+				int extra = refcnt - cset->task_count;
+
+				seq_printf(seq, " +%d", extra);
+				/*
+				 * Take out the one additional reference in
+				 * init_css_set.
+				 */
+				if (cset == &init_css_set)
+					extra--;
+				extra_refs += extra;
+			}
+		}
+		seq_puts(seq, "\n");
 
 		list_for_each_entry(task, &cset->tasks, cg_list) {
-			if (count++ > MAX_TASKS_SHOWN_PER_CSS)
-				goto overflow;
-			seq_printf(seq, "  task %d\n", task_pid_vnr(task));
+			if (count++ <= MAX_TASKS_SHOWN_PER_CSS)
+				seq_printf(seq, "  task %d\n",
+					   task_pid_vnr(task));
 		}
 
 		list_for_each_entry(task, &cset->mg_tasks, cg_list) {
-			if (count++ > MAX_TASKS_SHOWN_PER_CSS)
-				goto overflow;
-			seq_printf(seq, "  task %d\n", task_pid_vnr(task));
+			if (count++ <= MAX_TASKS_SHOWN_PER_CSS)
+				seq_printf(seq, "  task %d\n",
+					   task_pid_vnr(task));
 		}
-		continue;
-	overflow:
-		seq_puts(seq, "  ...\n");
+		/* show # of overflowed tasks */
+		if (count > MAX_TASKS_SHOWN_PER_CSS)
+			seq_printf(seq, "  ... (%d)\n",
+				   count - MAX_TASKS_SHOWN_PER_CSS);
+
+		if (cset->dead) {
+			seq_puts(seq, "    [dead]\n");
+			dead_cnt++;
+		}
+
+		WARN_ON(count != cset->task_count);
 	}
 	spin_unlock_irq(&css_set_lock);
+
+	if (!dead_cnt && !extra_refs && !threaded_csets)
+		return 0;
+
+	seq_puts(seq, "\n");
+	if (threaded_csets)
+		seq_printf(seq, "threaded css_sets = %d\n", threaded_csets);
+	if (extra_refs)
+		seq_printf(seq, "extra references = %d\n", extra_refs);
+	if (dead_cnt)
+		seq_printf(seq, "dead css_sets = %d\n", dead_cnt);
+
+	return 0;
+}
+
+static int cgroup_subsys_states_read(struct seq_file *seq, void *v)
+{
+	struct cgroup *cgrp = seq_css(seq)->cgroup;
+	struct cgroup_subsys *ss;
+	struct cgroup_subsys_state *css;
+	int i;
+
+	mutex_lock(&cgroup_mutex);
+	for_each_subsys(ss, i) {
+		css = rcu_dereference_check(cgrp->subsys[ss->id], true);
+		if (!css)
+			continue;
+		seq_printf(seq, "%2d: %-4s\t- %lx[%d] %d\n", ss->id, ss->name,
+			  (unsigned long)css, css->id,
+			  atomic_read(&css->online_cnt));
+	}
+	mutex_unlock(&cgroup_mutex);
 	return 0;
 }
 
@@ -128,17 +235,20 @@ static u64 releasable_read(struct cgroup_subsys_state *css, struct cftype *cft)
 
 	{
 		.name = "current_css_set",
-		.read_u64 = current_css_set_read,
+		.seq_show = current_css_set_read,
+		.flags = CFTYPE_ONLY_ON_ROOT,
 	},
 
 	{
 		.name = "current_css_set_refcount",
 		.read_u64 = current_css_set_refcount_read,
+		.flags = CFTYPE_ONLY_ON_ROOT,
 	},
 
 	{
 		.name = "current_css_set_cg_links",
 		.seq_show = current_css_set_cg_links_read,
+		.flags = CFTYPE_ONLY_ON_ROOT,
 	},
 
 	{
@@ -147,6 +257,11 @@ static u64 releasable_read(struct cgroup_subsys_state *css, struct cftype *cft)
 	},
 
 	{
+		.name = "cgroup_subsys_states",
+		.seq_show = cgroup_subsys_states_read,
+	},
+
+	{
 		.name = "releasable",
 		.read_u64 = releasable_read,
 	},
@@ -155,7 +270,9 @@ static u64 releasable_read(struct cgroup_subsys_state *css, struct cftype *cft)
 };
 
 struct cgroup_subsys debug_cgrp_subsys = {
-	.css_alloc = debug_css_alloc,
-	.css_free = debug_css_free,
-	.legacy_cftypes = debug_files,
+	.css_alloc	= debug_css_alloc,
+	.css_free	= debug_css_free,
+	.legacy_cftypes	= debug_files,
+	.dfl_cftypes	= debug_files,
+	.threaded	= true,
 };
-- 
1.8.3.1

* [RFC PATCH 10/14] cgroup: Implement new thread mode semantics
From: Waiman Long @ 2017-04-21 14:04 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, Waiman Long

The current thread mode semantics aren't sufficient to fully support
threaded controllers like cpu. The main problem is that when thread
mode is enabled at root (mainly for performance reasons), none of
the non-threaded controllers can be supported at all.

To alleviate this problem, the roles of thread root and threaded
cgroups are now further separated. Now thread mode can only be enabled
on a non-root leaf cgroup whose parent will then become the thread
root. All the descendants of a threaded cgroup will still need to be
threaded. All the non-threaded resources will be accounted for in the
thread root. Unlike the previous thread mode, however, a thread root
can have non-threaded children where system resources like memory
can be further split down the hierarchy.

Now we could have something like

	R -- A -- B
	 \
	  T1 -- T2

where R is the thread root, A and B are non-threaded cgroups, T1 and
T2 are threaded cgroups. The cgroups R, T1, T2 form a threaded subtree
where all the non-threaded resources are accounted for in R.  The no
internal process constraint does not apply in the threaded subtree.
Non-threaded controllers need to properly handle the competition
between internal processes and child cgroups at the thread root.

This model will be flexible enough to support the needs of the threaded
controllers.
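
As an illustration, a toy userspace model (not the kernel code) of
how the proc_cgrp linkage expresses the diagram above: it points to
self for a thread root, to the thread root for threaded cgroups, and
is NULL for ordinary cgroups:

  #include <assert.h>
  #include <stddef.h>

  /* struct cgroup reduced to the proc_cgrp linkage */
  struct cgroup { struct cgroup *proc_cgrp; };

  int main(void)
  {
          struct cgroup R, A = { NULL }, B = { NULL }, T1, T2;

          R.proc_cgrp  = &R;      /* thread root points to itself */
          T1.proc_cgrp = &R;      /* threaded cgroups point to their root */
          T2.proc_cgrp = &R;

          /* same threaded subtree <=> same proc_cgrp (migration check) */
          assert(T1.proc_cgrp == T2.proc_cgrp);

          /* A and B remain ordinary resource-domain children of R */
          assert(A.proc_cgrp == NULL && B.proc_cgrp == NULL);
          return 0;
  }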

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/cgroup-v2.txt     |  51 +++++++----
 kernel/cgroup/cgroup-internal.h |  10 +++
 kernel/cgroup/cgroup.c          | 184 +++++++++++++++++++++++++++++++++++-----
 3 files changed, 208 insertions(+), 37 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 2375e22..4d1c24d 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -222,21 +222,32 @@ process can be put in different cgroups and are not subject to the no
 internal process constraint - threaded controllers can be enabled on
 non-leaf cgroups whether they have threads in them or not.
 
-To enable the thread mode, the following conditions must be met.
+To enable the thread mode on a cgroup, the following conditions must
+be met.
 
-- The thread root doesn't have any child cgroups.
+- The cgroup doesn't have any child cgroups.
 
-- The thread root doesn't have any controllers enabled.
+- The cgroup doesn't have any non-threaded controllers enabled.
+
+- The cgroup doesn't have any processes attached to it.
 
 Thread mode can be enabled by writing "enable" to "cgroup.threads"
 file.
 
   # echo enable > cgroup.threads
 
-Inside a threaded subtree, "cgroup.threads" can be read and contains
-the list of the thread IDs of all threads in the cgroup.  Except that
-the operations are per-thread instead of per-process, "cgroup.threads"
-has the same format and behaves the same way as "cgroup.procs".
+The parent of the threaded cgroup will become the thread root, if
+it hasn't been a thread root yet. In other words, thread mode cannot
+be enabled on the root cgroup as it doesn't have a parent cgroup. A
+thread root can have child cgroups and controllers enabled before
+becoming one.
+
+A threaded subtree includes the thread root and all the threaded child
+cgroups as well as their descendants which are all threaded cgroups.
+"cgroup.threads" can be read and contains the list of the thread
+IDs of all threads in the cgroup.  Except that the operations are
+per-thread instead of per-process, "cgroup.threads" has the same
+format and behaves the same way as "cgroup.procs".
 
 The thread root serves as the resource domain for the whole subtree,
 and, while the threads can be scattered across the subtree, all the
@@ -246,25 +257,30 @@ not readable in the subtree proper.  However, "cgroup.procs" can be
 written to from anywhere in the subtree to migrate all threads of the
 matching process to the cgroup.
 
-Only threaded controllers can be enabled in a threaded subtree.  When
-a threaded controller is enabled inside a threaded subtree, it only
-accounts for and controls resource consumptions associated with the
-threads in the cgroup and its descendants.  All consumptions which
-aren't tied to a specific thread belong to the thread root.
+Only threaded controllers can be enabled in a non-root threaded cgroup.
+When a threaded controller is enabled inside a threaded subtree,
+it only accounts for and controls resource consumptions associated
+with the threads in the cgroup and its descendants.  All consumptions
+which aren't tied to a specific thread belong to the thread root.
 
 Because a threaded subtree is exempt from no internal process
 constraint, a threaded controller must be able to handle competition
 between threads in a non-leaf cgroup and its child cgroups.  Each
 threaded controller defines how such competitions are handled.
 
+A new child cgroup created under a thread root will not be threaded.
+Thread mode has to be explicitly enabled on each of the thread root's
+children.  Descendants of a threaded cgroup, however, will always be
+threaded and that mode cannot be disabled.
+
 To disable the thread mode, the following conditions must be met.
 
-- The cgroup is a thread root.  Thread mode can't be disabled
-  partially in the subtree.
+- The cgroup is a child of a thread root.  Thread mode can't be
+  disabled partially further down the hierarchy.
 
-- The thread root doesn't have any child cgroups.
+- The cgroup doesn't have any child cgroups.
 
-- The thread root doesn't have any controllers enabled.
+- The cgroup doesn't have any threads attached to it.
 
 Thread mode can be disabled by writing "disable" to "cgroup.threads"
 file.
@@ -366,6 +382,9 @@ with any other cgroups and requires special treatment from most
 controllers.  How resource consumption in the root cgroup is governed
 is up to each controller.
 
+The threaded cgroups and the thread roots are also exempt from this
+restriction.
+
 Note that the restriction doesn't get in the way if there is no
 enabled controller in the cgroup's "cgroup.subtree_control".  This is
 important as otherwise it wouldn't be possible to create children of a
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index bea3928..8d27258 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -123,6 +123,16 @@ static inline bool notify_on_release(const struct cgroup *cgrp)
 	return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
 }
 
+static inline bool cgroup_is_threaded(const struct cgroup *cgrp)
+{
+	return cgrp->proc_cgrp && (cgrp->proc_cgrp != cgrp);
+}
+
+static inline bool cgroup_is_thread_root(const struct cgroup *cgrp)
+{
+	return cgrp->proc_cgrp == cgrp;
+}
+
 void put_css_set_locked(struct css_set *cset);
 
 static inline void put_css_set(struct css_set *cset)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 3186b1f..50577c5 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -334,8 +334,13 @@ static u16 cgroup_control(struct cgroup *cgrp)
 	struct cgroup *parent = cgroup_parent(cgrp);
 	u16 root_ss_mask = cgrp->root->subsys_mask;
 
-	if (parent)
-		return parent->subtree_control;
+	if (parent) {
+		u16 ss_mask = parent->subtree_control;
+
+		if (cgroup_is_threaded(cgrp))
+			ss_mask &= cgrp_dfl_threaded_ss_mask;
+		return ss_mask;
+	}
 
 	if (cgroup_on_dfl(cgrp))
 		root_ss_mask &= ~(cgrp_dfl_inhibit_ss_mask |
@@ -348,8 +353,13 @@ static u16 cgroup_ss_mask(struct cgroup *cgrp)
 {
 	struct cgroup *parent = cgroup_parent(cgrp);
 
-	if (parent)
-		return parent->subtree_ss_mask;
+	if (parent) {
+		u16 ss_mask = parent->subtree_ss_mask;
+
+		if (cgroup_is_threaded(cgrp))
+			ss_mask &= cgrp_dfl_threaded_ss_mask;
+		return ss_mask;
+	}
 
 	return cgrp->root->subsys_mask;
 }
@@ -593,6 +603,24 @@ static bool css_set_threaded(struct css_set *cset)
 }
 
 /**
+ * threaded_children_count - returns # of threaded children
+ * @cgrp: cgroup to be tested
+ *
+ * cgroup_mutex must be held by the caller.
+ */
+static int threaded_children_count(struct cgroup *cgrp)
+{
+	struct cgroup *child;
+	int count = 0;
+
+	lockdep_assert_held(&cgroup_mutex);
+	cgroup_for_each_live_child(child, cgrp)
+		if (cgroup_is_threaded(child))
+			count++;
+	return count;
+}
+
+/**
  * cgroup_update_populated - updated populated count of a cgroup
  * @cgrp: the target cgroup
  * @populated: inc or dec populated count
@@ -2921,15 +2949,15 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 	}
 
 	/* can't enable !threaded controllers on a threaded cgroup */
-	if (cgrp->proc_cgrp && (enable & ~cgrp_dfl_threaded_ss_mask)) {
+	if (cgroup_is_threaded(cgrp) && (enable & ~cgrp_dfl_threaded_ss_mask)) {
 		ret = -EBUSY;
 		goto out_unlock;
 	}
 
 	/*
-	 * Except for root and threaded cgroups, subtree_control must be
-	 * zero for a cgroup with tasks so that child cgroups don't compete
-	 * against tasks.
+	 * Except for root, thread roots and threaded cgroups, subtree_control
+	 * must be zero for a cgroup with tasks so that child cgroups don't
+	 * compete against tasks.
 	 */
 	if (enable && cgroup_parent(cgrp) && !cgrp->proc_cgrp) {
 		struct cgrp_cset_link *link;
@@ -2977,7 +3005,9 @@ static int cgroup_enable_threaded(struct cgroup *cgrp)
 	LIST_HEAD(csets);
 	struct cgrp_cset_link *link;
 	struct css_set *cset, *cset_next;
+	struct cgroup *child;
 	int ret;
+	u16 ss_mask;
 
 	lockdep_assert_held(&cgroup_mutex);
 
@@ -2985,14 +3015,38 @@ static int cgroup_enable_threaded(struct cgroup *cgrp)
 	if (cgrp->proc_cgrp)
 		return 0;
 
-	/* allow only if there are neither children or enabled controllers */
-	if (css_has_online_children(&cgrp->self) || cgrp->subtree_control)
+	/*
+	 * Allow only if it is not the root and there are:
+	 * 1) no children,
+	 * 2) no non-threaded controllers are enabled for the children, and
+	 * 3) no attached tasks.
+	 *
+	 * With no attached tasks, it is assumed that no css_sets will be
+	 * linked to the current cgroup. This may not be true if some dead
+	 * css_sets linger around due to task_struct leakage, for example.
+	 */
+	if (css_has_online_children(&cgrp->self) ||
+	   (cgrp->subtree_control & ~cgrp_dfl_threaded_ss_mask) ||
+	   !cgroup_parent(cgrp) || cgroup_is_populated(cgrp))
 		return -EBUSY;
 
-	/* find all csets which need ->proc_cset updated */
+	/* make the parent cgroup a thread root */
+	child = cgrp;
+	cgrp = cgroup_parent(child);
+
+	/* noop for parent if parent has already been threaded */
+	if (cgrp->proc_cgrp)
+		goto setup_child;
+
+	/*
+	 * For the parent cgroup, we need to find all csets which need
+	 * ->proc_cset updated
+	 */
 	spin_lock_irq(&css_set_lock);
 	list_for_each_entry(link, &cgrp->cset_links, cset_link) {
 		cset = link->cset;
+		if (cset->dead)
+			continue;
 		if (css_set_populated(cset)) {
 			WARN_ON_ONCE(css_set_threaded(cset));
 			WARN_ON_ONCE(cset->pcset_preload);
@@ -3031,7 +3085,34 @@ static int cgroup_enable_threaded(struct cgroup *cgrp)
 	/* mark it threaded */
 	cgrp->proc_cgrp = cgrp;
 
-	return 0;
+setup_child:
+	ss_mask = cgroup_ss_mask(child);
+	/*
+	 * If some non-threaded controllers are enabled, they have to be
+	 * disabled.
+	 */
+	if (ss_mask & ~cgrp_dfl_threaded_ss_mask) {
+		cgroup_save_control(child);
+		child->proc_cgrp = cgrp;
+		ret = cgroup_apply_control(child);
+		cgroup_finalize_control(child, ret);
+		kernfs_activate(child->kn);
+
+		/*
+		 * If an error happens (it shouldn't), the thread mode
+		 * enablement fails, but the parent will remain as thread
+		 * root. That shouldn't be a problem as a thread root
+		 * without threaded children is not much different from
+		 * a non-threaded cgroup.
+		 */
+		WARN_ON_ONCE(ret);
+		if (ret)
+			child->proc_cgrp = NULL;
+	} else {
+		child->proc_cgrp = cgrp;
+		ret = 0;
+	}
+	return ret;
 
 err_put_csets:
 	spin_lock_irq(&css_set_lock);
@@ -3050,26 +3131,71 @@ static int cgroup_enable_threaded(struct cgroup *cgrp)
 static int cgroup_disable_threaded(struct cgroup *cgrp)
 {
 	struct cgrp_cset_link *link;
+	struct cgroup *parent = cgroup_parent(cgrp);
 
 	lockdep_assert_held(&cgroup_mutex);
 
-	/* noop if already !threaded */
-	if (!cgrp->proc_cgrp)
-		return 0;
-
 	/* partial disable isn't supported */
-	if (cgrp->proc_cgrp != cgrp)
+	if (cgrp->proc_cgrp != parent)
 		return -EBUSY;
 
-	/* allow only if there are neither children nor enabled controllers */
-	if (css_has_online_children(&cgrp->self) || cgrp->subtree_control)
+	/* noop if not a threaded cgroup */
+	if (!cgroup_is_threaded(cgrp))
+		return 0;
+
+	/*
+	 * Allow only if there are
+	 * 1) no children, and
+	 * 2) no attached tasks.
+	 *
+	 * With no attached tasks, it is assumed that no css_sets will be
+	 * linked to the current cgroup. This may not be true if some dead
+	 * css_sets linger around due to task_struct leakage, for example.
+	 */
+	if (css_has_online_children(&cgrp->self) || cgroup_is_populated(cgrp))
 		return -EBUSY;
 
-	/* walk all csets and reset ->proc_cset */
+	/*
+	 * If the cgroup has some non-threaded controllers enabled at the
+	 * subtree_control level of the parent, we need to re-enable those
+	 * controllers.
+	 */
+	cgrp->proc_cgrp = NULL;
+	if (cgroup_ss_mask(cgrp) & ~cgrp_dfl_threaded_ss_mask) {
+		int ret;
+
+		cgrp->proc_cgrp = parent;
+		cgroup_save_control(cgrp);
+		cgrp->proc_cgrp = NULL;
+		ret = cgroup_apply_control(cgrp);
+		cgroup_finalize_control(cgrp, ret);
+		kernfs_activate(cgrp->kn);
+
+		/*
+		 * If an error happens, we abandon the update to the thread
+		 * root and return the error.
+		 */
+		if (ret)
+			return ret;
+	}
+
+	/*
+	 * Check remaining threaded children count to see if the threaded
+	 * csets of the parent need to be removed and ->proc_cset reset.
+	 */
 	spin_lock_irq(&css_set_lock);
+
+	if (threaded_children_count(parent))
+		goto out_unlock;	/* still have threaded children left */
+
+	cgrp = parent;
 	list_for_each_entry(link, &cgrp->cset_links, cset_link) {
 		struct css_set *cset = link->cset;
 
+		/* skip dead css_set */
+		if (cset->dead)
+			continue;
+
 		if (css_set_threaded(cset)) {
 			struct css_set *pcset = proc_css_set(cset);
 
@@ -3085,6 +3211,7 @@ static int cgroup_disable_threaded(struct cgroup *cgrp)
 		}
 	}
 	cgrp->proc_cgrp = NULL;
+out_unlock:
 	spin_unlock_irq(&css_set_lock);
 
 	return 0;
@@ -4475,7 +4602,16 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 	cgrp->self.parent = &parent->self;
 	cgrp->root = root;
 	cgrp->level = level;
-	cgrp->proc_cgrp = parent->proc_cgrp;
+
+	/*
+	 * A child cgroup created directly under a thread root will not
+	 * be threaded. Thread mode has to be explicitly enabled for it.
+	 * The child cgroup will be threaded if its parent is threaded.
+	 */
+	if (cgroup_is_thread_root(parent))
+		cgrp->proc_cgrp = NULL;
+	else
+		cgrp->proc_cgrp = parent->proc_cgrp;
 
 	for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp))
 		cgrp->ancestor_ids[tcgrp->level] = tcgrp->id;
@@ -4702,6 +4838,12 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 		return -EBUSY;
 
 	/*
+	 * Do an implicit thread mode disable if on default hierarchy.
+	 */
+	if (cgroup_on_dfl(cgrp))
+		cgroup_disable_threaded(cgrp);
+
+	/*
 	 * Mark @cgrp and the associated csets dead.  The former prevents
 	 * further task migration and child creation by disabling
 	 * cgroup_lock_live_group().  The latter makes the csets ignored by
-- 
1.8.3.1

* [RFC PATCH 11/14] sched: Misc preps for cgroup unified hierarchy interface
From: Waiman Long @ 2017-04-21 14:04 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault

From: Tejun Heo <tj@kernel.org>

Make the following changes in preparation for the cpu controller
interface implementation for the unified hierarchy.  This patch
doesn't cause any functional differences.

* s/cpu_stats_show()/cpu_cfs_stats_show()/

* s/cpu_files/cpu_legacy_files/

* Separate out cpuacct_stats_read() from cpuacct_stats_show().  While
  at it, make the @val array u64 for consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/core.c    |  8 ++++----
 kernel/sched/cpuacct.c | 29 ++++++++++++++++++-----------
 2 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 27b4dd5..5e3a217 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7209,7 +7209,7 @@ static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
 	return ret;
 }
 
-static int cpu_stats_show(struct seq_file *sf, void *v)
+static int cpu_cfs_stats_show(struct seq_file *sf, void *v)
 {
 	struct task_group *tg = css_tg(seq_css(sf));
 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
@@ -7249,7 +7249,7 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
-static struct cftype cpu_files[] = {
+static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
 		.name = "shares",
@@ -7270,7 +7270,7 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 	},
 	{
 		.name = "stat",
-		.seq_show = cpu_stats_show,
+		.seq_show = cpu_cfs_stats_show,
 	},
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -7296,7 +7296,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,
-	.legacy_cftypes	= cpu_files,
+	.legacy_cftypes	= cpu_legacy_files,
 	.early_init	= true,
 };
 
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index f95ab29..6151c23 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -276,26 +276,33 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V)
 	return 0;
 }
 
-static int cpuacct_stats_show(struct seq_file *sf, void *v)
+static void cpuacct_stats_read(struct cpuacct *ca,
+			       u64 (*val)[CPUACCT_STAT_NSTATS])
 {
-	struct cpuacct *ca = css_ca(seq_css(sf));
-	s64 val[CPUACCT_STAT_NSTATS];
 	int cpu;
-	int stat;
 
-	memset(val, 0, sizeof(val));
+	memset(val, 0, sizeof(*val));
+
 	for_each_possible_cpu(cpu) {
 		u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
 
-		val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_USER];
-		val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_NICE];
-		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SYSTEM];
-		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_IRQ];
-		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SOFTIRQ];
+		(*val)[CPUACCT_STAT_USER]   += cpustat[CPUTIME_USER];
+		(*val)[CPUACCT_STAT_USER]   += cpustat[CPUTIME_NICE];
+		(*val)[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SYSTEM];
+		(*val)[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_IRQ];
+		(*val)[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SOFTIRQ];
 	}
+}
+
+static int cpuacct_stats_show(struct seq_file *sf, void *v)
+{
+	u64 val[CPUACCT_STAT_NSTATS];
+	int stat;
+
+	cpuacct_stats_read(css_ca(seq_css(sf)), &val);
 
 	for (stat = 0; stat < CPUACCT_STAT_NSTATS; stat++) {
-		seq_printf(sf, "%s %lld\n",
+		seq_printf(sf, "%s %llu\n",
 			   cpuacct_stat_desc[stat],
 			   (long long)nsec_to_clock_t(val[stat]));
 	}
-- 
1.8.3.1

* [RFC PATCH 12/14] sched: Implement interface for cgroup unified hierarchy
From: Waiman Long @ 2017-04-21 14:04 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault

From: Tejun Heo <tj@kernel.org>

While the cpu controller doesn't have any functional problems, there
are a couple of interface issues which can be addressed in the v2
interface.

* cpuacct being a separate controller.  This separation is artificial
  and rather pointless as demonstrated by most use cases co-mounting
  the two controllers.  It also forces certain information to be
  accounted twice.

* Use of different time units.  Writable control knobs use
  microseconds, some stat fields use nanoseconds while other cpuacct
  stat fields use centiseconds.

* Control knobs which can't be used in the root cgroup still show up
  in the root.

* Control knob names and semantics aren't consistent with other
  controllers.

This patchset implements the cpu controller's interface on the unified
hierarchy which adheres to the controller file conventions described in
Documentation/cgroup-v2.txt.  Overall, the following changes are made.

* cpuacct is implicitly enabled and disabled by cpu and its information
  is reported through "cpu.stat" which now uses microseconds for all
  time durations.  All time duration fields now have "_usec" appended
  to them for clarity.  While this doesn't solve the double accounting
  immediately, once majority of users switch to v2, cpu can directly
  account and report the relevant stats and cpuacct can be disabled on
  the unified hierarchy.

  Note that cpuacct.usage_percpu is currently not included in
  "cpu.stat".  If this information is actually called for, it can be
  added later.

* "cpu.shares" is replaced with "cpu.weight" and operates on the
  standard scale defined by CGROUP_WEIGHT_MIN/DFL/MAX (1, 100, 10000).
  The weight is scaled to scheduler weight so that 100 maps to 1024
  and the ratio relationship is preserved - if weight is W and its
  scaled value is S, W / 100 == S / 1024.  While the mapped range is a
  bit smaller than the original scheduler weight range, the dead
  zones on both sides are relatively small and cover a wider range
  than the nice value mappings.  This file doesn't make sense in the
  root cgroup and isn't created on root (see the conversion sketch
  below).

* "cpu.cfs_quota_us" and "cpu.cfs_period_us" are replaced by "cpu.max"
  which contains both quota and period.

* "cpu.rt_runtime_us" and "cpu.rt_period_us" are replaced by
  "cpu.rt.max" which contains both runtime and period.

v2: cpu_stats_show() was incorrectly using CONFIG_FAIR_GROUP_SCHED for
    CFS bandwidth stats and also using raw division for u64.  Use
    CONFIG_CFS_BANDWIDTH and do_div() instead.

    The semantics of "cpu.rt.max" is not fully decided yet.  Dropped
    for now.
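
As a sketch of the weight mapping described above (userspace toy
code; the rounding helpers stand in for DIV_ROUND_CLOSEST_ULL and the
scale_load()/scale_load_down() steps are ignored):

  #include <assert.h>

  #define CGROUP_WEIGHT_DFL 100ULL

  static unsigned long long weight_to_shares(unsigned long long w)
  {
          return (w * 1024 + CGROUP_WEIGHT_DFL / 2) / CGROUP_WEIGHT_DFL;
  }

  static unsigned long long shares_to_weight(unsigned long long s)
  {
          return (s * CGROUP_WEIGHT_DFL + 512) / 1024;
  }

  int main(void)
  {
          assert(weight_to_shares(100) == 1024);  /* default maps to 1024 */
          assert(weight_to_shares(200) == 2048);  /* W / 100 == S / 1024 */

          /* round trips preserve the value at both ends of the range */
          assert(shares_to_weight(weight_to_shares(1)) == 1);
          assert(shares_to_weight(weight_to_shares(10000)) == 10000);
          return 0;
  }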

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/core.c    | 141 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/cpuacct.c |  25 +++++++++
 kernel/sched/cpuacct.h |   5 ++
 3 files changed, 171 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e3a217..78dfcaa 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7288,6 +7288,139 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 	{ }	/* Terminate */
 };
 
+static int cpu_stats_show(struct seq_file *sf, void *v)
+{
+	cpuacct_cpu_stats_show(sf);
+
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		struct task_group *tg = css_tg(seq_css(sf));
+		struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+		u64 throttled_usec;
+
+		throttled_usec = cfs_b->throttled_time;
+		do_div(throttled_usec, NSEC_PER_USEC);
+
+		seq_printf(sf, "nr_periods %d\n"
+			   "nr_throttled %d\n"
+			   "throttled_usec %llu\n",
+			   cfs_b->nr_periods, cfs_b->nr_throttled,
+			   throttled_usec);
+	}
+#endif
+	return 0;
+}
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css,
+			       struct cftype *cft)
+{
+	struct task_group *tg = css_tg(css);
+	u64 weight = scale_load_down(tg->shares);
+
+	return DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024);
+}
+
+static int cpu_weight_write_u64(struct cgroup_subsys_state *css,
+				struct cftype *cftype, u64 weight)
+{
+	/*
+	 * cgroup weight knobs should use the common MIN, DFL and MAX
+	 * values which are 1, 100 and 10000 respectively.  While it loses
+	 * a bit of range on both ends, it maps pretty well onto the shares
+	 * value used by scheduler and the round-trip conversions preserve
+	 * the original value over the entire range.
+	 */
+	if (weight < CGROUP_WEIGHT_MIN || weight > CGROUP_WEIGHT_MAX)
+		return -ERANGE;
+
+	weight = DIV_ROUND_CLOSEST_ULL(weight * 1024, CGROUP_WEIGHT_DFL);
+
+	return sched_group_set_shares(css_tg(css), scale_load(weight));
+}
+#endif
+
+static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
+						  long period, long quota)
+{
+	if (quota < 0)
+		seq_puts(sf, "max");
+	else
+		seq_printf(sf, "%ld", quota);
+
+	seq_printf(sf, " %ld\n", period);
+}
+
+/* caller should put the current value in *@periodp before calling */
+static int __maybe_unused cpu_period_quota_parse(char *buf,
+						 u64 *periodp, u64 *quotap)
+{
+	char tok[21];	/* U64_MAX */
+
+	if (sscanf(buf, "%20s %llu", tok, periodp) < 1)
+		return -EINVAL;
+
+	*periodp *= NSEC_PER_USEC;
+
+	if (sscanf(tok, "%llu", quotap))
+		*quotap *= NSEC_PER_USEC;
+	else if (!strcmp(tok, "max"))
+		*quotap = RUNTIME_INF;
+	else
+		return -EINVAL;
+
+	return 0;
+}
+
+#ifdef CONFIG_CFS_BANDWIDTH
+static int cpu_max_show(struct seq_file *sf, void *v)
+{
+	struct task_group *tg = css_tg(seq_css(sf));
+
+	cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
+	return 0;
+}
+
+static ssize_t cpu_max_write(struct kernfs_open_file *of,
+			     char *buf, size_t nbytes, loff_t off)
+{
+	struct task_group *tg = css_tg(of_css(of));
+	u64 period = tg_get_cfs_period(tg);
+	u64 quota;
+	int ret;
+
+	ret = cpu_period_quota_parse(buf, &period, &quota);
+	if (!ret)
+		ret = tg_set_cfs_bandwidth(tg, period, quota);
+	return ret ?: nbytes;
+}
+#endif
+
+static struct cftype cpu_files[] = {
+	{
+		.name = "stat",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_stats_show,
+	},
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	{
+		.name = "weight",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_weight_read_u64,
+		.write_u64 = cpu_weight_write_u64,
+	},
+#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.name = "max",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_max_show,
+		.write = cpu_max_write,
+	},
+#endif
+	{ }	/* terminate */
+};
+
 struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
 	.css_online	= cpu_cgroup_css_online,
@@ -7297,7 +7430,15 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,
 	.legacy_cftypes	= cpu_legacy_files,
+	.dfl_cftypes	= cpu_files,
 	.early_init	= true,
+#ifdef CONFIG_CGROUP_CPUACCT
+	/*
+	 * cpuacct is enabled together with cpu on the unified hierarchy
+	 * and its stats are reported through "cpu.stat".
+	 */
+	.depends_on	= 1 << cpuacct_cgrp_id,
+#endif
 };
 
 #endif	/* CONFIG_CGROUP_SCHED */
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 6151c23..fc1cf13 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -347,6 +347,31 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v)
 	{ }	/* terminate */
 };
 
+/* used to print cpuacct stats in cpu.stat on the unified hierarchy */
+void cpuacct_cpu_stats_show(struct seq_file *sf)
+{
+	struct cgroup_subsys_state *css;
+	u64 usage, val[CPUACCT_STAT_NSTATS];
+
+	css = cgroup_get_e_css(seq_css(sf)->cgroup, &cpuacct_cgrp_subsys);
+
+	usage = cpuusage_read(css, seq_cft(sf));
+	cpuacct_stats_read(css_ca(css), &val);
+
+	val[CPUACCT_STAT_USER] *= TICK_NSEC;
+	val[CPUACCT_STAT_SYSTEM] *= TICK_NSEC;
+	do_div(usage, NSEC_PER_USEC);
+	do_div(val[CPUACCT_STAT_USER], NSEC_PER_USEC);
+	do_div(val[CPUACCT_STAT_SYSTEM], NSEC_PER_USEC);
+
+	seq_printf(sf, "usage_usec %llu\n"
+		   "user_usec %llu\n"
+		   "system_usec %llu\n",
+		   usage, val[CPUACCT_STAT_USER], val[CPUACCT_STAT_SYSTEM]);
+
+	css_put(css);
+}
+
 /*
  * charge this task's execution time to its accounting group.
  *
diff --git a/kernel/sched/cpuacct.h b/kernel/sched/cpuacct.h
index ba72807..ddf7af4 100644
--- a/kernel/sched/cpuacct.h
+++ b/kernel/sched/cpuacct.h
@@ -2,6 +2,7 @@
 
 extern void cpuacct_charge(struct task_struct *tsk, u64 cputime);
 extern void cpuacct_account_field(struct task_struct *tsk, int index, u64 val);
+extern void cpuacct_cpu_stats_show(struct seq_file *sf);
 
 #else
 
@@ -14,4 +15,8 @@ static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime)
 {
 }
 
+static inline void cpuacct_cpu_stats_show(struct seq_file *sf)
+{
+}
+
 #endif
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 13/14] sched: Make cpu/cpuacct threaded controllers
  2017-04-21 14:03 [RFC PATCH 00/14] cgroup: Implement cgroup v2 thread mode & CPU controller Waiman Long
                   ` (11 preceding siblings ...)
  2017-04-21 14:04 ` [RFC PATCH 12/14] sched: Implement interface for cgroup unified hierarchy Waiman Long
@ 2017-04-21 14:04 ` Waiman Long
  2017-04-21 14:04 ` [RFC PATCH 14/14] cgroup: Enable separate control knobs for thread root internal processes Waiman Long
  2017-04-26 16:05 ` [RFC PATCH 00/14] cgroup: Implement cgroup v2 thread mode & CPU controller Waiman Long
  14 siblings, 0 replies; 17+ messages in thread
From: Waiman Long @ 2017-04-21 14:04 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, Waiman Long

Make cpu and cpuacct cgroup controllers usable within a threaded cgroup.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/core.c    | 1 +
 kernel/sched/cpuacct.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 78dfcaa..9d8beda 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7432,6 +7432,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.legacy_cftypes	= cpu_legacy_files,
 	.dfl_cftypes	= cpu_files,
 	.early_init	= true,
+	.threaded	= true,
 #ifdef CONFIG_CGROUP_CPUACCT
 	/*
 	 * cpuacct is enabled together with cpu on the unified hierarchy
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index fc1cf13..853d18a 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -414,4 +414,5 @@ struct cgroup_subsys cpuacct_cgrp_subsys = {
 	.css_free	= cpuacct_css_free,
 	.legacy_cftypes	= files,
 	.early_init	= true,
+	.threaded	= true,
 };
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 14/14] cgroup: Enable separate control knobs for thread root internal processes
  2017-04-21 14:03 [RFC PATCH 00/14] cgroup: Implement cgroup v2 thread mode & CPU controller Waiman Long
                   ` (12 preceding siblings ...)
  2017-04-21 14:04 ` [RFC PATCH 13/14] sched: Make cpu/cpuacct threaded controllers Waiman Long
@ 2017-04-21 14:04 ` Waiman Long
  2017-04-26 16:05 ` [RFC PATCH 00/14] cgroup: Implement cgroup v2 thread mode & CPU controller Waiman Long
  14 siblings, 0 replies; 17+ messages in thread
From: Waiman Long @ 2017-04-21 14:04 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, Waiman Long

Internal processes are allowed in a thread root of the cgroup v2
default hierarchy. For those resource domain controllers that don't
want to deal with resource competition between internal processes and
child cgroups, there is now the option of specifying the sep_res_domain
flag in their cgroup_subsys data structure. This flag will tell the
cgroup core to create a special directory "cgroup.self" under the
thread root to hold their resource control knobs for all the processes
within the threaded subtree.

User applications can then tune the control knobs in the "cgroup.self"
directory as if all the threaded subtree processes were under it for
resource tracking and control purposes.

This directory name is reserved, so the directory cannot be created or
deleted directly. Moreover, no sub-directory can be created under it.

This sep_res_domain flag is turned on in the memcg to showcase
its effect.
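
As a rough user-space illustration of the intended workflow (the mount
point, the "thread-root" cgroup name and the memory.max knob placement
are assumptions based on the description above, not a tested
interface):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Sketch: cap memory for all internal processes of a thread root by
 * writing the knob memcg would expose under the reserved
 * "cgroup.self" directory.  Path and value are hypothetical. */
static int write_knob(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	if (write(fd, val, strlen(val)) != (ssize_t)strlen(val)) {
		close(fd);
		return -1;
	}
	return close(fd);
}

int main(void)
{
	if (write_knob("/sys/fs/cgroup/thread-root/cgroup.self/memory.max",
		       "512M\n"))
		perror("write_knob");
	return 0;
}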

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/cgroup-v2.txt |  20 ++++++++
 include/linux/cgroup-defs.h |  15 ++++++
 kernel/cgroup/cgroup.c      | 122 +++++++++++++++++++++++++++++++++++++++-----
 kernel/cgroup/debug.c       |   6 +++
 mm/memcontrol.c             |   1 +
 5 files changed, 150 insertions(+), 14 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 4d1c24d..e4c25ec 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -393,6 +393,26 @@ cgroup must create children and transfer all its processes to the
 children before enabling controllers in its "cgroup.subtree_control"
 file.
 
+2-4-4. Resource Domain Controllers
+
+As internal processes are allowed in a threaded subtree, a non-threaded
+controller at a thread root cgroup has to properly manage resource
+competition between internal processes and other child non-threaded
+cgroups. However, a controller can specify that it wants a separate
+resource domain to manage the resources of the processes in the
+threaded subtree instead of each process individually. In this
+case, a "cgroup.self" directory will be created at the thread root
+to hold the resource control knobs for the processes in the threaded
+subtree as if those internal processes are all under the cgroup.self
+child cgroup for resource tracking and control purposes.
+
+The "cgroup.self" directory is a special directory which cannot
+be created or deleted directly. No sub-directory can be created
+under it and special files like "cgroup.procs" are not present so
+tasks cannot be moved directly into it.  It is created when a cgroup
+becomes a thread root and has controllers that request separate
+resource domains. It will be removed when that cgroup is no longer
+a thread root.
 
 2-5. Delegation
 
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 7be1a90..e383f10 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -65,6 +65,7 @@ enum {
 enum {
 	CGRP_ROOT_NOPREFIX	= (1 << 1), /* mounted subsystems have no named prefix */
 	CGRP_ROOT_XATTR		= (1 << 2), /* supports extended attributes */
+	CGRP_RESOURCE_DOMAIN	= (1 << 3), /* thread root resource domain */
 };
 
 /* cftype->flags */
@@ -293,6 +294,9 @@ struct cgroup {
 
 	struct cgroup_root *root;
 
+	/* Pointer to separate resource domain for thread root */
+	struct cgroup *resource_domain;
+
 	/*
 	 * List of cgrp_cset_links pointing at css_sets with tasks in this
 	 * cgroup.  Protected by css_set_lock.
@@ -516,6 +520,17 @@ struct cgroup_subsys {
 	bool threaded:1;
 
 	/*
+	 * If %true, the controller needs a separate resource domain in
+	 * a thread root to keep internal processes associated with the
+	 * threaded subtree from competing with other child cgroups. This
+	 * is done by having a separate set of knobs in the cgroup.self
+	 * directory. These knobs control how much of each resource is
+	 * allocated to the processes in the threaded subtree. Only
+	 * !threaded controllers should have this flag turned on.
+	 */
+	bool sep_res_domain:1;
+
+	/*
 	 * If %false, this subsystem is properly hierarchical -
 	 * configuration, resource accounting and restriction on a parent
 	 * cgroup cover those of its children.  If %true, hierarchy support
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 50577c5..3ff3ff5 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -61,6 +61,11 @@
 
 #define CGROUP_FILE_NAME_MAX		(MAX_CGROUP_TYPE_NAMELEN +	\
 					 MAX_CFTYPE_NAME + 2)
+/*
+ * Reserved cgroup directory name for resource domain controllers. Users
+ * are not allowed to create a child cgroup of that name.
+ */
+#define CGROUP_SELF	"cgroup.self"
 
 /*
  * cgroup_mutex is the master lock.  Any modification to cgroup or its
@@ -165,6 +170,12 @@ struct cgroup_subsys *cgroup_subsys[] = {
 /* some controllers can be threaded on the default hierarchy */
 static u16 cgrp_dfl_threaded_ss_mask;
 
+/*
+ * Some controllers need separate resource domain on thread root of the
+ * default hierarchy
+ */
+static u16 cgrp_dfl_rdomain_ss_mask;
+
 /* The list of hierarchy roots */
 LIST_HEAD(cgroup_roots);
 static int cgroup_root_count;
@@ -337,7 +348,9 @@ static u16 cgroup_control(struct cgroup *cgrp)
 	if (parent) {
 		u16 ss_mask = parent->subtree_control;
 
-		if (cgroup_is_threaded(cgrp))
+		if (cgrp->flags & CGRP_RESOURCE_DOMAIN)
+			ss_mask &= cgrp_dfl_rdomain_ss_mask;
+		else if (cgroup_is_threaded(cgrp))
 			ss_mask &= cgrp_dfl_threaded_ss_mask;
 		return ss_mask;
 	}
@@ -356,7 +369,9 @@ static u16 cgroup_ss_mask(struct cgroup *cgrp)
 	if (parent) {
 		u16 ss_mask = parent->subtree_ss_mask;
 
-		if (cgroup_is_threaded(cgrp))
+		if (cgrp->flags & CGRP_RESOURCE_DOMAIN)
+			ss_mask &= cgrp_dfl_rdomain_ss_mask;
+		else if (cgroup_is_threaded(cgrp))
 			ss_mask &= cgrp_dfl_threaded_ss_mask;
 		return ss_mask;
 	}
@@ -413,6 +428,18 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
 			return NULL;
 	}
 
+	/*
+	 * On a thread root with a resource domain, use the css in the
+	 * resource domain, if enabled.
+	 */
+	if (cgrp->resource_domain &&
+	   (cgroup_ss_mask(cgrp->resource_domain) & (1 << ss->id))) {
+		struct cgroup_subsys_state *css;
+
+		css = cgroup_css(cgrp->resource_domain, ss);
+		if (css)
+			return css;
+	}
 	return cgroup_css(cgrp, ss);
 }
 
@@ -3039,8 +3066,21 @@ static int cgroup_enable_threaded(struct cgroup *cgrp)
 		goto setup_child;
 
 	/*
+	 * Create a resource domain child cgroup, if necessary.
+	 * Update the css association if controllers are enabled in
+	 * the resource domain child cgroup.
+	 */
+	if (cgrp->root->subsys_mask & cgrp_dfl_rdomain_ss_mask) {
+		cgroup_mkdir(cgrp->kn, NULL, 0755);
+		if (cgrp->resource_domain &&
+		    cgroup_ss_mask(cgrp->resource_domain))
+			cgroup_update_dfl_csses(cgrp);
+	}
+
+	/*
 	 * For the parent cgroup, we need to find all csets which need
-	 * ->proc_cset updated
+	 * ->proc_cset updated. The updated csets will also pick up the
+	 * new resource domain css'es along the way.
 	 */
 	spin_lock_irq(&css_set_lock);
 	list_for_each_entry(link, &cgrp->cset_links, cset_link) {
@@ -3132,6 +3172,7 @@ static int cgroup_disable_threaded(struct cgroup *cgrp)
 {
 	struct cgrp_cset_link *link;
 	struct cgroup *parent = cgroup_parent(cgrp);
+	struct cgroup *rdomain = NULL;
 
 	lockdep_assert_held(&cgroup_mutex);
 
@@ -3182,6 +3223,8 @@ static int cgroup_disable_threaded(struct cgroup *cgrp)
 	/*
 	 * Check remaining threaded children count to see if the threaded
 	 * csets of the parent need to be removed and ->proc_cset reset.
+	 * If valid css'es are present in the resource domain cgroup, we
+	 * need to migrate the csets away from those css'es.
 	 */
 	spin_lock_irq(&css_set_lock);
 
@@ -3189,6 +3232,14 @@ static int cgroup_disable_threaded(struct cgroup *cgrp)
 		goto out_unlock;	/* still have threaded children left */
 
 	cgrp = parent;
+
+	/*
+	 * Prepare to remove the resource domain child cgroup.
+	 */
+	rdomain = cgrp->resource_domain;
+	if (rdomain)
+		cgrp->resource_domain = NULL;
+
 	list_for_each_entry(link, &cgrp->cset_links, cset_link) {
 		struct css_set *cset = link->cset;
 
@@ -3214,6 +3265,16 @@ static int cgroup_disable_threaded(struct cgroup *cgrp)
 out_unlock:
 	spin_unlock_irq(&css_set_lock);
 
+	if (rdomain) {
+		/*
+		 * Update the css association if controllers are enabled
+		 * in the resource domain child cgroup before destroying
+		 * that resource domain.
+		 */
+		if (cgroup_ss_mask(rdomain))
+			cgroup_update_dfl_csses(cgrp);
+		cgroup_destroy_locked(rdomain);
+	}
 	return 0;
 }
 
@@ -4660,21 +4721,41 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
 {
 	struct cgroup *parent, *cgrp;
 	struct kernfs_node *kn;
+	bool create_self = (name == NULL);
 	int ret;
 
-	/* do not accept '\n' to prevent making /proc/<pid>/cgroup unparsable */
-	if (strchr(name, '\n'))
-		return -EINVAL;
+	/*
+	 * Do not accept '\n' to prevent making /proc/<pid>/cgroup unparsable.
+	 * The reserved resource domain directory name cannot be used. A NULL
+	 * name parameter, however, is used internally to create that
+	 * resource domain directory. A sub-directory cannot be created
+	 * under a resource domain directory.
+	 */
+	if (create_self) {
+		name = CGROUP_SELF;
+		parent = parent_kn->priv;
+	} else {
+		if (strchr(name, '\n') || !strcmp(name, CGROUP_SELF))
+			return -EINVAL;
 
-	parent = cgroup_kn_lock_live(parent_kn, false);
-	if (!parent)
-		return -ENODEV;
+		parent = cgroup_kn_lock_live(parent_kn, false);
+		if (!parent)
+			return -ENODEV;
+		if (parent->flags & CGRP_RESOURCE_DOMAIN) {
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+	}
 
 	cgrp = cgroup_create(parent);
 	if (IS_ERR(cgrp)) {
 		ret = PTR_ERR(cgrp);
 		goto out_unlock;
 	}
+	if (create_self) {
+		parent->resource_domain = cgrp;
+		cgrp->flags |= CGRP_RESOURCE_DOMAIN;
+	}
 
 	/* create the directory */
 	kn = kernfs_create_dir(parent->kn, name, mode, cgrp);
@@ -4694,9 +4775,11 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
 	if (ret)
 		goto out_destroy;
 
-	ret = css_populate_dir(&cgrp->self);
-	if (ret)
-		goto out_destroy;
+	if (!create_self) {
+		ret = css_populate_dir(&cgrp->self);
+		if (ret)
+			goto out_destroy;
+	}
 
 	ret = cgroup_apply_control_enable(cgrp);
 	if (ret)
@@ -4713,7 +4796,8 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
 out_destroy:
 	cgroup_destroy_locked(cgrp);
 out_unlock:
-	cgroup_kn_unlock(parent_kn);
+	if (!create_self)
+		cgroup_kn_unlock(parent_kn);
 	return ret;
 }
 
@@ -4883,7 +4967,15 @@ int cgroup_rmdir(struct kernfs_node *kn)
 	if (!cgrp)
 		return 0;
 
-	ret = cgroup_destroy_locked(cgrp);
+	/*
+	 * A resource domain cgroup cannot be removed directly by users.
+	 * It can only be done internally when its parent directory is
+	 * no longer a thread root.
+	 */
+	if (cgrp->flags & CGRP_RESOURCE_DOMAIN)
+		ret = -EINVAL;
+	else
+		ret = cgroup_destroy_locked(cgrp);
 
 	if (!ret)
 		trace_cgroup_rmdir(cgrp);
@@ -5070,6 +5162,8 @@ int __init cgroup_init(void)
 
 		if (ss->threaded)
 			cgrp_dfl_threaded_ss_mask |= 1 << ss->id;
+		if (ss->sep_res_domain)
+			cgrp_dfl_rdomain_ss_mask |= 1 << ss->id;
 
 		if (ss->dfl_cftypes == ss->legacy_cftypes) {
 			WARN_ON(cgroup_add_cftypes(ss, ss->dfl_cftypes));
diff --git a/kernel/cgroup/debug.c b/kernel/cgroup/debug.c
index 4d74458..51ee2c9 100644
--- a/kernel/cgroup/debug.c
+++ b/kernel/cgroup/debug.c
@@ -269,10 +269,16 @@ static u64 releasable_read(struct cgroup_subsys_state *css, struct cftype *cft)
 	{ }	/* terminate */
 };
 
+/*
+ * Normally, threaded & sep_res_domain are mutually exclusive.
+ * Both are turned on here in the debug controller to allow better internal
+ * status tracking.
+ */
 struct cgroup_subsys debug_cgrp_subsys = {
 	.css_alloc	= debug_css_alloc,
 	.css_free	= debug_css_free,
 	.legacy_cftypes	= debug_files,
 	.dfl_cftypes	= debug_files,
 	.threaded	= true,
+	.sep_res_domain = true,
 };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f28ab8d..9682bbb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5292,6 +5292,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.dfl_cftypes = memory_files,
 	.legacy_cftypes = mem_cgroup_legacy_files,
 	.early_init = 0,
+	.sep_res_domain = true,
 };
 
 /**
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 00/14] cgroup: Implement cgroup v2 thread mode & CPU controller
  2017-04-21 14:03 [RFC PATCH 00/14] cgroup: Implement cgroup v2 thread mode & CPU controller Waiman Long
                   ` (13 preceding siblings ...)
  2017-04-21 14:04 ` [RFC PATCH 14/14] cgroup: Enable separate control knobs for thread root internal processes Waiman Long
@ 2017-04-26 16:05 ` Waiman Long
  2017-04-26 22:30   ` Tejun Heo
  14 siblings, 1 reply; 17+ messages in thread
From: Waiman Long @ 2017-04-26 16:05 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault

On 04/21/2017 10:03 AM, Waiman Long wrote:
> This patchset incorporates the following 2 patchsets from Tejun Heo:
>
>  1) cgroup v2 thread mode patchset (5 patches)
>     https://lkml.org/lkml/2017/2/2/592
>  2) CPU Controller on Control Group v2 (2 patches)
>     https://lkml.org/lkml/2016/8/5/368
>
> Additional patches are then layered on top to implement the following
> new features:
>
>  1) An enhanced v2 thread mode where a thread root (root of a threaded
>     subtree) can have non-threaded children with non-threaded
>     controllers enabled and no internal process constraint does
>     not apply.
>  2) An enhanced debug controller which dumps out more information
>     relevant to the debugging and testing of cgroup v2 in general.
>  3) Separate control knobs for resource domain controllers can be
>     enabled in a thread root to manage all the internal processes in
>     the threaded subtree.
>
> Patches 1-5 are Tejun's cgroup v2 thread mode patchset.
>
> Patch 6 fixes a task_struct reference counting bug introduced in
> patch 1.
>
> Patch 7 moves the debug cgroup out from cgroup_v1.c into its own
> file.
>
> Patch 8 keeps more accurate counts of the number of tasks associated
> with each css_set.
>
> Patch 9 enhances the debug controller to provide more information
> relevant to the cgroup v2 thread mode to ease debugging effort.
>
> Patch 10 implements the enhanced cgroup v2 thread mode with the
> following enhancements:
>
>  1) Thread roots are treated differently from threaded cgroups.
>  2) Thread root can now have non-threaded controllers enabled as well
>     as non-threaded children.
>
> Patches 11-12 are Tejun's CPU controller on control group v2 patchset.
>
> Patch 13 makes both cpu and cpuacct controllers threaded.
>
> Patch 14 enables the creation of a special "cgroup.self" directory
> under the thread root to hold resource control knobs for controllers
> that don't want resource competiton between internal processes and
> non-threaded child cgroups.
>
> Preliminary testing was done with the debug controller enabled. Things
> seemed to work fine so far. More rigorous testing will be needed, and
> any suggestions are welcome.
>
> Tejun Heo (7):
>   cgroup: reorganize cgroup.procs / task write path
>   cgroup: add @flags to css_task_iter_start() and implement
>     CSS_TASK_ITER_PROCS
>   cgroup: introduce cgroup->proc_cgrp and threaded css_set handling
>   cgroup: implement CSS_TASK_ITER_THREADED
>   cgroup: implement cgroup v2 thread support
>   sched: Misc preps for cgroup unified hierarchy interface
>   sched: Implement interface for cgroup unified hierarchy
>
> Waiman Long (7):
>   cgroup: Fix reference counting bug in cgroup_procs_write()
>   cgroup: Move debug cgroup to its own file
>   cgroup: Keep accurate count of tasks in each css_set
>   cgroup: Make debug cgroup support v2 and thread mode
>   cgroup: Implement new thread mode semantics
>   sched: Make cpu/cpuacct threaded controllers
>   cgroup: Enable separate control knobs for thread root internal
>     processes
>
>  Documentation/cgroup-v2.txt     | 114 +++++-
>  include/linux/cgroup-defs.h     |  56 +++
>  include/linux/cgroup.h          |  12 +-
>  kernel/cgroup/Makefile          |   1 +
>  kernel/cgroup/cgroup-internal.h |  18 +-
>  kernel/cgroup/cgroup-v1.c       | 217 +++-------
>  kernel/cgroup/cgroup.c          | 862 ++++++++++++++++++++++++++++++++++------
>  kernel/cgroup/cpuset.c          |   6 +-
>  kernel/cgroup/debug.c           | 284 +++++++++++++
>  kernel/cgroup/freezer.c         |   6 +-
>  kernel/cgroup/pids.c            |   1 +
>  kernel/events/core.c            |   1 +
>  kernel/sched/core.c             | 150 ++++++-
>  kernel/sched/cpuacct.c          |  55 ++-
>  kernel/sched/cpuacct.h          |   5 +
>  mm/memcontrol.c                 |   3 +-
>  net/core/netclassid_cgroup.c    |   2 +-
>  17 files changed, 1478 insertions(+), 315 deletions(-)
>  create mode 100644 kernel/cgroup/debug.c
>
Does anyone have time to take a look at these patches?

As the merge window is going to open up next week, I am not going to
bother you guys once it opens.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 00/14] cgroup: Implement cgroup v2 thread mode & CPU controller
  2017-04-26 16:05 ` [RFC PATCH 00/14] cgroup: Implement cgroup v2 thread mode & CPU controller Waiman Long
@ 2017-04-26 22:30   ` Tejun Heo
  0 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2017-04-26 22:30 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello, Waiman.

On Wed, Apr 26, 2017 at 12:05:27PM -0400, Waiman Long wrote:
> Does anyone has time to take a look at these patches?
> 
> As the merge window is going to open up next week, I am not going to
> bother you guys when the merge window opens.

Will get to it next week.  Sorry about the delay.  We're deploying
cgroup2 across the fleet and seeing a lot of interesting issues and I
was chasing down CPU controller performance issues for the last month
or so, which is now getting wrapped up.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-04-26 22:30 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-21 14:03 [RFC PATCH 00/14] cgroup: Implement cgroup v2 thread mode & CPU controller Waiman Long
2017-04-21 14:03 ` [RFC PATCH 01/14] cgroup: reorganize cgroup.procs / task write path Waiman Long
2017-04-21 14:04 ` [RFC PATCH 02/14] cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS Waiman Long
2017-04-21 14:04 ` [RFC PATCH 03/14] cgroup: introduce cgroup->proc_cgrp and threaded css_set handling Waiman Long
2017-04-21 14:04 ` [RFC PATCH 04/14] cgroup: implement CSS_TASK_ITER_THREADED Waiman Long
2017-04-21 14:04 ` [RFC PATCH 05/14] cgroup: implement cgroup v2 thread support Waiman Long
2017-04-21 14:04 ` [RFC PATCH 06/14] cgroup: Fix reference counting bug in cgroup_procs_write() Waiman Long
2017-04-21 14:04 ` [RFC PATCH 07/14] cgroup: Move debug cgroup to its own file Waiman Long
2017-04-21 14:04 ` [RFC PATCH 08/14] cgroup: Keep accurate count of tasks in each css_set Waiman Long
2017-04-21 14:04 ` [RFC PATCH 09/14] cgroup: Make debug cgroup support v2 and thread mode Waiman Long
2017-04-21 14:04 ` [RFC PATCH 10/14] cgroup: Implement new thread mode semantics Waiman Long
2017-04-21 14:04 ` [RFC PATCH 11/14] sched: Misc preps for cgroup unified hierarchy interface Waiman Long
2017-04-21 14:04 ` [RFC PATCH 12/14] sched: Implement interface for cgroup unified hierarchy Waiman Long
2017-04-21 14:04 ` [RFC PATCH 13/14] sched: Make cpu/cpuacct threaded controllers Waiman Long
2017-04-21 14:04 ` [RFC PATCH 14/14] cgroup: Enable separate control knobs for thread root internal processes Waiman Long
2017-04-26 16:05 ` [RFC PATCH 00/14] cgroup: Implement cgroup v2 thread mode & CPU controller Waiman Long
2017-04-26 22:30   ` Tejun Heo
