linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHSET for-4.13] cgroup: implement cgroup2 thread mode, v3
@ 2017-07-17  2:07 Tejun Heo
  2017-07-17  2:07 ` [PATCH 1/6] cgroup: reorganize cgroup.procs / task write path Tejun Heo
                   ` (6 more replies)
  0 siblings, 7 replies; 32+ messages in thread
From: Tejun Heo @ 2017-07-17  2:07 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, longman
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds, guro

Hello,

This is v3 of cgroup2 thread mode patchset.  The changes from v2[L]
are

* Switched to marking each cgroup threaded instead of doing it
  per-subtree as suggested by PeterZ.  This allows more flexibility
  and removes certain interface quirks.

* Dropped RFC tag and excluded cpu controller patches from this
  patchset as threaded mode behaviors can easily be verified with the
  pid controller.  Will follow up with cpu controller patchset later.

It is largely based on the discussions that we had at the plumbers
last year.  Here's the rough outline.

* Thread mode is explicitly enabled on a cgroup by writing "threaded"
  into "cgroup.type" file.  The cgroup shouldn't have any processes or
  child cgroups.  A threaded cgroup joins the the parent's resource
  domain and becomes a part of the threaded subtree anchored at the
  nearest domain ancestor, which is called the threaded domain cgroup
  of the subtree.

* Threads can be put anywhere in a threaded subtree by writing TIDs
  into "cgroup.threads" file.  Process granularity and
  no-internal-process constraint don't apply in a threaded subtree.

* To be used in a threaded subtree, controllers should explicitly
  declare thread mode support and should be able to handle internal
  competition in some way.

* The threaded domain cgroup of a threaded subtree serves as the
  resource domain for the whole subtree.  This is where all the
  controllers are guaranteed to have a common ground and resource
  consumptions in the threaded subtree which aren't tied to a specific
  thread are charged.  Non-threaded controllers never see beyond
  thread root and can assume that all controllers will follow the same
  rules upto that point.

* Unlike other cgroups, the system root cgroup can serve as parent to
  domain child cgroups and threaded domains to threaded subtrees.

This allows threaded controllers to implement thread granular resource
control without getting in the way of system level resource
partitioning.

For more details on the interface and behavior, please refer to 0005.

This patchset contains the following six patches.

 0001-cgroup-reorganize-cgroup.procs-task-write-path.patch
 0002-cgroup-add-flags-to-css_task_iter_start-and-implemen.patch
 0003-cgroup-introduce-cgroup-dom_cgrp-and-threaded-css_se.patch
 0004-cgroup-implement-CSS_TASK_ITER_THREADED.patch
 0005-cgroup-implement-cgroup-v2-thread-support.patch
 0006-cgroup-update-debug-controller-to-print-out-thread-m.patch

0001-0005 implement cgroup2 thread mode.  0006 enables debug
controller on it.

The patchset is based on the current cgroup/for-4.14 27f26753f8c0
("cgroup: replace css_set walking populated test with testing
cgrp->nr_populated_csets") and also available in the following git
branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup2-threads-v3

diffstat follows.

 Documentation/cgroup-v2.txt     |  181 +++++++++-
 include/linux/cgroup-defs.h     |   45 ++
 include/linux/cgroup.h          |   15 
 kernel/cgroup/cgroup-internal.h |   12 
 kernel/cgroup/cgroup-v1.c       |   69 +++
 kernel/cgroup/cgroup.c          |  712 +++++++++++++++++++++++++++++++---------
 kernel/cgroup/cpuset.c          |    6 
 kernel/cgroup/debug.c           |   58 ++-
 kernel/cgroup/freezer.c         |    6 
 kernel/cgroup/pids.c            |    1 
 kernel/events/core.c            |    1 
 mm/memcontrol.c                 |    2 
 net/core/netclassid_cgroup.c    |    2 
 13 files changed, 906 insertions(+), 204 deletions(-)

Thanks.

--
tejun

[L] http://lkml.kernel.org/r/20170610140351.10703-1-tj@kernel.org

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/6] cgroup: reorganize cgroup.procs / task write path
  2017-07-17  2:07 [PATCHSET for-4.13] cgroup: implement cgroup2 thread mode, v3 Tejun Heo
@ 2017-07-17  2:07 ` Tejun Heo
  2017-07-17  2:07 ` [PATCH 2/6] cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS Tejun Heo
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2017-07-17  2:07 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, longman
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds,
	guro, Tejun Heo

Currently, writes "cgroup.procs" and "cgroup.tasks" files are all
handled by __cgroup_procs_write() on both v1 and v2.  This patch
reoragnizes the write path so that there are common helper functions
that different write paths use.

While this somewhat increases LOC, the different paths are no longer
intertwined and each path has more flexibility to implement different
behaviors which will be necessary for the planned v2 thread support.

v3: - Restructured so that cgroup_procs_write_permission() takes
      @src_cgrp and @dst_cgrp.

v2: - Rolled in Waiman's task reference count fix.
    - Updated on top of nsdelegate changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cgroup-internal.h |   8 +-
 kernel/cgroup/cgroup-v1.c       |  58 ++++++++++--
 kernel/cgroup/cgroup.c          | 192 ++++++++++++++++++++--------------------
 3 files changed, 152 insertions(+), 106 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 793565c..0e81c61 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -180,10 +180,10 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 
 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 		       bool threadgroup);
-ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
-			     size_t nbytes, loff_t off, bool threadgroup);
-ssize_t cgroup_procs_write(struct kernfs_open_file *of, char *buf, size_t nbytes,
-			   loff_t off);
+struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup)
+	__acquires(&cgroup_threadgroup_rwsem);
+void cgroup_procs_write_finish(struct task_struct *task)
+	__releases(&cgroup_threadgroup_rwsem);
 
 void cgroup_lock_and_drain_offline(struct cgroup *cgrp);
 
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 7bf4b15..60f7247 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -510,10 +510,58 @@ static int cgroup_pidlist_show(struct seq_file *s, void *v)
 	return 0;
 }
 
-static ssize_t cgroup_tasks_write(struct kernfs_open_file *of,
-				  char *buf, size_t nbytes, loff_t off)
+static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of,
+				     char *buf, size_t nbytes, loff_t off,
+				     bool threadgroup)
 {
-	return __cgroup_procs_write(of, buf, nbytes, off, false);
+	struct cgroup *cgrp;
+	struct task_struct *task;
+	const struct cred *cred, *tcred;
+	ssize_t ret;
+
+	cgrp = cgroup_kn_lock_live(of->kn, false);
+	if (!cgrp)
+		return -ENODEV;
+
+	task = cgroup_procs_write_start(buf, threadgroup);
+	ret = PTR_ERR_OR_ZERO(task);
+	if (ret)
+		goto out_unlock;
+
+	/*
+	 * Even if we're attaching all tasks in the thread group, we only
+	 * need to check permissions on one of them.
+	 */
+	cred = current_cred();
+	tcred = get_task_cred(task);
+	if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
+	    !uid_eq(cred->euid, tcred->uid) &&
+	    !uid_eq(cred->euid, tcred->suid))
+		ret = -EACCES;
+	put_cred(tcred);
+	if (ret)
+		goto out_finish;
+
+	ret = cgroup_attach_task(cgrp, task, threadgroup);
+
+out_finish:
+	cgroup_procs_write_finish(task);
+out_unlock:
+	cgroup_kn_unlock(of->kn);
+
+	return ret ?: nbytes;
+}
+
+static ssize_t cgroup1_procs_write(struct kernfs_open_file *of,
+				   char *buf, size_t nbytes, loff_t off)
+{
+	return __cgroup1_procs_write(of, buf, nbytes, off, true);
+}
+
+static ssize_t cgroup1_tasks_write(struct kernfs_open_file *of,
+				   char *buf, size_t nbytes, loff_t off)
+{
+	return __cgroup1_procs_write(of, buf, nbytes, off, false);
 }
 
 static ssize_t cgroup_release_agent_write(struct kernfs_open_file *of,
@@ -592,7 +640,7 @@ struct cftype cgroup1_base_files[] = {
 		.seq_stop = cgroup_pidlist_stop,
 		.seq_show = cgroup_pidlist_show,
 		.private = CGROUP_FILE_PROCS,
-		.write = cgroup_procs_write,
+		.write = cgroup1_procs_write,
 	},
 	{
 		.name = "cgroup.clone_children",
@@ -611,7 +659,7 @@ struct cftype cgroup1_base_files[] = {
 		.seq_stop = cgroup_pidlist_stop,
 		.seq_show = cgroup_pidlist_show,
 		.private = CGROUP_FILE_TASKS,
-		.write = cgroup_tasks_write,
+		.write = cgroup1_tasks_write,
 	},
 	{
 		.name = "notify_on_release",
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index d5b6231..e3bda075 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2421,96 +2421,23 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 	return ret;
 }
 
-static int cgroup_procs_write_permission(struct task_struct *task,
-					 struct cgroup *dst_cgrp,
-					 struct kernfs_open_file *of)
-{
-	struct super_block *sb = of->file->f_path.dentry->d_sb;
-	struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
-	struct cgroup *root_cgrp = ns->root_cset->dfl_cgrp;
-	struct cgroup *src_cgrp, *com_cgrp;
-	struct inode *inode;
-	int ret;
-
-	if (!cgroup_on_dfl(dst_cgrp)) {
-		const struct cred *cred = current_cred();
-		const struct cred *tcred = get_task_cred(task);
-
-		/*
-		 * even if we're attaching all tasks in the thread group,
-		 * we only need to check permissions on one of them.
-		 */
-		if (uid_eq(cred->euid, GLOBAL_ROOT_UID) ||
-		    uid_eq(cred->euid, tcred->uid) ||
-		    uid_eq(cred->euid, tcred->suid))
-			ret = 0;
-		else
-			ret = -EACCES;
-
-		put_cred(tcred);
-		return ret;
-	}
-
-	/* find the source cgroup */
-	spin_lock_irq(&css_set_lock);
-	src_cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
-	spin_unlock_irq(&css_set_lock);
-
-	/* and the common ancestor */
-	com_cgrp = src_cgrp;
-	while (!cgroup_is_descendant(dst_cgrp, com_cgrp))
-		com_cgrp = cgroup_parent(com_cgrp);
-
-	/* %current should be authorized to migrate to the common ancestor */
-	inode = kernfs_get_inode(sb, com_cgrp->procs_file.kn);
-	if (!inode)
-		return -ENOMEM;
-
-	ret = inode_permission(inode, MAY_WRITE);
-	iput(inode);
-	if (ret)
-		return ret;
-
-	/*
-	 * If namespaces are delegation boundaries, %current must be able
-	 * to see both source and destination cgroups from its namespace.
-	 */
-	if ((cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE) &&
-	    (!cgroup_is_descendant(src_cgrp, root_cgrp) ||
-	     !cgroup_is_descendant(dst_cgrp, root_cgrp)))
-		return -ENOENT;
-
-	return 0;
-}
-
-/*
- * Find the task_struct of the task to attach by vpid and pass it along to the
- * function to attach either it or all tasks in its threadgroup. Will lock
- * cgroup_mutex and threadgroup.
- */
-ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
-			     size_t nbytes, loff_t off, bool threadgroup)
+struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup)
+	__acquires(&cgroup_threadgroup_rwsem)
 {
 	struct task_struct *tsk;
-	struct cgroup_subsys *ss;
-	struct cgroup *cgrp;
 	pid_t pid;
-	int ssid, ret;
 
 	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
-		return -EINVAL;
-
-	cgrp = cgroup_kn_lock_live(of->kn, false);
-	if (!cgrp)
-		return -ENODEV;
+		return ERR_PTR(-EINVAL);
 
 	percpu_down_write(&cgroup_threadgroup_rwsem);
+
 	rcu_read_lock();
 	if (pid) {
 		tsk = find_task_by_vpid(pid);
 		if (!tsk) {
-			ret = -ESRCH;
-			goto out_unlock_rcu;
+			tsk = ERR_PTR(-ESRCH);
+			goto out_unlock_threadgroup;
 		}
 	} else {
 		tsk = current;
@@ -2526,35 +2453,33 @@ ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 	 * cgroup with no rt_runtime allocated.  Just say no.
 	 */
 	if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
-		ret = -EINVAL;
-		goto out_unlock_rcu;
+		tsk = ERR_PTR(-EINVAL);
+		goto out_unlock_threadgroup;
 	}
 
 	get_task_struct(tsk);
+	goto out_unlock_rcu;
+
+out_unlock_threadgroup:
+	percpu_up_write(&cgroup_threadgroup_rwsem);
+out_unlock_rcu:
 	rcu_read_unlock();
+	return tsk;
+}
 
-	ret = cgroup_procs_write_permission(tsk, cgrp, of);
-	if (!ret)
-		ret = cgroup_attach_task(cgrp, tsk, threadgroup);
+void cgroup_procs_write_finish(struct task_struct *task)
+	__releases(&cgroup_threadgroup_rwsem)
+{
+	struct cgroup_subsys *ss;
+	int ssid;
 
-	put_task_struct(tsk);
-	goto out_unlock_threadgroup;
+	/* release reference from cgroup_procs_write_start() */
+	put_task_struct(task);
 
-out_unlock_rcu:
-	rcu_read_unlock();
-out_unlock_threadgroup:
 	percpu_up_write(&cgroup_threadgroup_rwsem);
 	for_each_subsys(ss, ssid)
 		if (ss->post_attach)
 			ss->post_attach();
-	cgroup_kn_unlock(of->kn);
-	return ret ?: nbytes;
-}
-
-ssize_t cgroup_procs_write(struct kernfs_open_file *of, char *buf, size_t nbytes,
-			   loff_t off)
-{
-	return __cgroup_procs_write(of, buf, nbytes, off, true);
 }
 
 static void cgroup_print_ss_mask(struct seq_file *seq, u16 ss_mask)
@@ -3870,6 +3795,79 @@ static int cgroup_procs_show(struct seq_file *s, void *v)
 	return 0;
 }
 
+static int cgroup_procs_write_permission(struct cgroup *src_cgrp,
+					 struct cgroup *dst_cgrp,
+					 struct super_block *sb)
+{
+	struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
+	struct cgroup *com_cgrp = src_cgrp;
+	struct inode *inode;
+	int ret;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	/* find the common ancestor */
+	while (!cgroup_is_descendant(dst_cgrp, com_cgrp))
+		com_cgrp = cgroup_parent(com_cgrp);
+
+	/* %current should be authorized to migrate to the common ancestor */
+	inode = kernfs_get_inode(sb, com_cgrp->procs_file.kn);
+	if (!inode)
+		return -ENOMEM;
+
+	ret = inode_permission(inode, MAY_WRITE);
+	iput(inode);
+	if (ret)
+		return ret;
+
+	/*
+	 * If namespaces are delegation boundaries, %current must be able
+	 * to see both source and destination cgroups from its namespace.
+	 */
+	if ((cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE) &&
+	    (!cgroup_is_descendant(src_cgrp, ns->root_cset->dfl_cgrp) ||
+	     !cgroup_is_descendant(dst_cgrp, ns->root_cset->dfl_cgrp)))
+		return -ENOENT;
+
+	return 0;
+}
+
+static ssize_t cgroup_procs_write(struct kernfs_open_file *of,
+				  char *buf, size_t nbytes, loff_t off)
+{
+	struct cgroup *src_cgrp, *dst_cgrp;
+	struct task_struct *task;
+	ssize_t ret;
+
+	dst_cgrp = cgroup_kn_lock_live(of->kn, false);
+	if (!dst_cgrp)
+		return -ENODEV;
+
+	task = cgroup_procs_write_start(buf, true);
+	ret = PTR_ERR_OR_ZERO(task);
+	if (ret)
+		goto out_unlock;
+
+	/* find the source cgroup */
+	spin_lock_irq(&css_set_lock);
+	src_cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
+	spin_unlock_irq(&css_set_lock);
+
+	ret = cgroup_procs_write_permission(src_cgrp, dst_cgrp,
+					    of->file->f_path.dentry->d_sb);
+	if (ret)
+		goto out_finish;
+
+	ret = cgroup_attach_task(dst_cgrp, task, true);
+
+out_finish:
+	cgroup_procs_write_finish(task);
+out_unlock:
+	cgroup_kn_unlock(of->kn);
+
+	return ret ?: nbytes;
+}
+
 /* cgroup core interface files for the default hierarchy */
 static struct cftype cgroup_base_files[] = {
 	{
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 2/6] cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS
  2017-07-17  2:07 [PATCHSET for-4.13] cgroup: implement cgroup2 thread mode, v3 Tejun Heo
  2017-07-17  2:07 ` [PATCH 1/6] cgroup: reorganize cgroup.procs / task write path Tejun Heo
@ 2017-07-17  2:07 ` Tejun Heo
  2017-07-17  2:07 ` [PATCH 3/6] cgroup: introduce cgroup->dom_cgrp and threaded css_set handling Tejun Heo
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2017-07-17  2:07 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, longman
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds,
	guro, Tejun Heo

css_task_iter currently always walks all tasks.  With the scheduled
cgroup v2 thread support, the iterator would need to handle multiple
types of iteration.  As a preparation, add @flags to
css_task_iter_start() and implement CSS_TASK_ITER_PROCS.  If the flag
is not specified, it walks all tasks as before.  When asserted, the
iterator only walks the group leaders.

For now, the only user of the flag is cgroup v2 "cgroup.procs" file
which no longer needs to skip non-leader tasks in cgroup_procs_next().
Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
cgroup" but "list all thread group id's with any threads in the
cgroup".

While at it, update cgroup_procs_show() to use task_pid_vnr() instead
of task_tgid_vnr().  As the iteration guarantees that the function
only sees group leaders, this doesn't change the output and will allow
sharing the function for thread iteration.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup.h       |  6 +++++-
 kernel/cgroup/cgroup-v1.c    |  6 +++---
 kernel/cgroup/cgroup.c       | 24 ++++++++++++++----------
 kernel/cgroup/cpuset.c       |  6 +++---
 kernel/cgroup/freezer.c      |  6 +++---
 mm/memcontrol.c              |  2 +-
 net/core/netclassid_cgroup.c |  2 +-
 7 files changed, 30 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 308b107..cae5831 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -36,9 +36,13 @@
 #define CGROUP_WEIGHT_DFL		100
 #define CGROUP_WEIGHT_MAX		10000
 
+/* walk only threadgroup leaders */
+#define CSS_TASK_ITER_PROCS		(1U << 0)
+
 /* a css_task_iter should be treated as an opaque object */
 struct css_task_iter {
 	struct cgroup_subsys		*ss;
+	unsigned int			flags;
 
 	struct list_head		*cset_pos;
 	struct list_head		*cset_head;
@@ -129,7 +133,7 @@ struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset,
 struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset,
 					struct cgroup_subsys_state **dst_cssp);
 
-void css_task_iter_start(struct cgroup_subsys_state *css,
+void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
 			 struct css_task_iter *it);
 struct task_struct *css_task_iter_next(struct css_task_iter *it);
 void css_task_iter_end(struct css_task_iter *it);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 60f7247..167aaab 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -121,7 +121,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 	 * ->can_attach() fails.
 	 */
 	do {
-		css_task_iter_start(&from->self, &it);
+		css_task_iter_start(&from->self, 0, &it);
 		task = css_task_iter_next(&it);
 		if (task)
 			get_task_struct(task);
@@ -373,7 +373,7 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
 	if (!array)
 		return -ENOMEM;
 	/* now, populate the array */
-	css_task_iter_start(&cgrp->self, &it);
+	css_task_iter_start(&cgrp->self, 0, &it);
 	while ((tsk = css_task_iter_next(&it))) {
 		if (unlikely(n == length))
 			break;
@@ -749,7 +749,7 @@ int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry)
 	}
 	rcu_read_unlock();
 
-	css_task_iter_start(&cgrp->self, &it);
+	css_task_iter_start(&cgrp->self, 0, &it);
 	while ((tsk = css_task_iter_next(&it))) {
 		switch (tsk->state) {
 		case TASK_RUNNING:
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index e3bda075..3c5a37a 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3643,6 +3643,7 @@ static void css_task_iter_advance(struct css_task_iter *it)
 	lockdep_assert_held(&css_set_lock);
 	WARN_ON_ONCE(!l);
 
+repeat:
 	/*
 	 * Advance iterator to find next entry.  cset->tasks is consumed
 	 * first and then ->mg_tasks.  After ->mg_tasks, we move onto the
@@ -3657,11 +3658,18 @@ static void css_task_iter_advance(struct css_task_iter *it)
 		css_task_iter_advance_css_set(it);
 	else
 		it->task_pos = l;
+
+	/* if PROCS, skip over tasks which aren't group leaders */
+	if ((it->flags & CSS_TASK_ITER_PROCS) && it->task_pos &&
+	    !thread_group_leader(list_entry(it->task_pos, struct task_struct,
+					    cg_list)))
+		goto repeat;
 }
 
 /**
  * css_task_iter_start - initiate task iteration
  * @css: the css to walk tasks of
+ * @flags: CSS_TASK_ITER_* flags
  * @it: the task iterator to use
  *
  * Initiate iteration through the tasks of @css.  The caller can call
@@ -3669,7 +3677,7 @@ static void css_task_iter_advance(struct css_task_iter *it)
  * returns NULL.  On completion of iteration, css_task_iter_end() must be
  * called.
  */
-void css_task_iter_start(struct cgroup_subsys_state *css,
+void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
 			 struct css_task_iter *it)
 {
 	/* no one should try to iterate before mounting cgroups */
@@ -3680,6 +3688,7 @@ void css_task_iter_start(struct cgroup_subsys_state *css,
 	spin_lock_irq(&css_set_lock);
 
 	it->ss = css->ss;
+	it->flags = flags;
 
 	if (it->ss)
 		it->cset_pos = &css->cgroup->e_csets[css->ss->id];
@@ -3753,13 +3762,8 @@ static void *cgroup_procs_next(struct seq_file *s, void *v, loff_t *pos)
 {
 	struct kernfs_open_file *of = s->private;
 	struct css_task_iter *it = of->priv;
-	struct task_struct *task;
-
-	do {
-		task = css_task_iter_next(it);
-	} while (task && !thread_group_leader(task));
 
-	return task;
+	return css_task_iter_next(it);
 }
 
 static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
@@ -3780,10 +3784,10 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 		if (!it)
 			return ERR_PTR(-ENOMEM);
 		of->priv = it;
-		css_task_iter_start(&cgrp->self, it);
+		css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS, it);
 	} else if (!(*pos)++) {
 		css_task_iter_end(it);
-		css_task_iter_start(&cgrp->self, it);
+		css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS, it);
 	}
 
 	return cgroup_procs_next(s, NULL, NULL);
@@ -3791,7 +3795,7 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 
 static int cgroup_procs_show(struct seq_file *s, void *v)
 {
-	seq_printf(s, "%d\n", task_tgid_vnr(v));
+	seq_printf(s, "%d\n", task_pid_vnr(v));
 	return 0;
 }
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index ca8376e..252d70c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -861,7 +861,7 @@ static void update_tasks_cpumask(struct cpuset *cs)
 	struct css_task_iter it;
 	struct task_struct *task;
 
-	css_task_iter_start(&cs->css, &it);
+	css_task_iter_start(&cs->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
 		set_cpus_allowed_ptr(task, cs->effective_cpus);
 	css_task_iter_end(&it);
@@ -1091,7 +1091,7 @@ static void update_tasks_nodemask(struct cpuset *cs)
 	 * It's ok if we rebind the same mm twice; mpol_rebind_mm()
 	 * is idempotent.  Also migrate pages in each mm to new nodes.
 	 */
-	css_task_iter_start(&cs->css, &it);
+	css_task_iter_start(&cs->css, 0, &it);
 	while ((task = css_task_iter_next(&it))) {
 		struct mm_struct *mm;
 		bool migrate;
@@ -1284,7 +1284,7 @@ static void update_tasks_flags(struct cpuset *cs)
 	struct css_task_iter it;
 	struct task_struct *task;
 
-	css_task_iter_start(&cs->css, &it);
+	css_task_iter_start(&cs->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
 		cpuset_update_task_spread_flag(cs, task);
 	css_task_iter_end(&it);
diff --git a/kernel/cgroup/freezer.c b/kernel/cgroup/freezer.c
index 1b72d56..0823679 100644
--- a/kernel/cgroup/freezer.c
+++ b/kernel/cgroup/freezer.c
@@ -268,7 +268,7 @@ static void update_if_frozen(struct cgroup_subsys_state *css)
 	rcu_read_unlock();
 
 	/* are all tasks frozen? */
-	css_task_iter_start(css, &it);
+	css_task_iter_start(css, 0, &it);
 
 	while ((task = css_task_iter_next(&it))) {
 		if (freezing(task)) {
@@ -320,7 +320,7 @@ static void freeze_cgroup(struct freezer *freezer)
 	struct css_task_iter it;
 	struct task_struct *task;
 
-	css_task_iter_start(&freezer->css, &it);
+	css_task_iter_start(&freezer->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
 		freeze_task(task);
 	css_task_iter_end(&it);
@@ -331,7 +331,7 @@ static void unfreeze_cgroup(struct freezer *freezer)
 	struct css_task_iter it;
 	struct task_struct *task;
 
-	css_task_iter_start(&freezer->css, &it);
+	css_task_iter_start(&freezer->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
 		__thaw_task(task);
 	css_task_iter_end(&it);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3df3c04..2b2f071 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -917,7 +917,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 		struct css_task_iter it;
 		struct task_struct *task;
 
-		css_task_iter_start(&iter->css, &it);
+		css_task_iter_start(&iter->css, 0, &it);
 		while (!ret && (task = css_task_iter_next(&it)))
 			ret = fn(task, arg);
 		css_task_iter_end(&it);
diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index 029a61a..5e4f040 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -100,7 +100,7 @@ static int write_classid(struct cgroup_subsys_state *css, struct cftype *cft,
 
 	cs->classid = (u32)value;
 
-	css_task_iter_start(css, &it);
+	css_task_iter_start(css, 0, &it);
 	while ((p = css_task_iter_next(&it))) {
 		task_lock(p);
 		iterate_fd(p->files, 0, update_classid_sock,
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 3/6] cgroup: introduce cgroup->dom_cgrp and threaded css_set handling
  2017-07-17  2:07 [PATCHSET for-4.13] cgroup: implement cgroup2 thread mode, v3 Tejun Heo
  2017-07-17  2:07 ` [PATCH 1/6] cgroup: reorganize cgroup.procs / task write path Tejun Heo
  2017-07-17  2:07 ` [PATCH 2/6] cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS Tejun Heo
@ 2017-07-17  2:07 ` Tejun Heo
  2017-07-17  2:07 ` [PATCH 4/6] cgroup: implement CSS_TASK_ITER_THREADED Tejun Heo
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2017-07-17  2:07 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, longman
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds,
	guro, Tejun Heo

cgroup v2 is in the process of growing thread granularity support.  A
threaded subtree is composed of a thread root and threaded cgroups
which are proper members of the subtree.

The root cgroup of the subtree serves as the domain cgroup to which
the processes (as opposed to threads / tasks) of the subtree
conceptually belong and domain-level resource consumptions not tied to
any specific task are charged.  Inside the subtree, threads won't be
subject to process granularity or no-internal-task constraint and can
be distributed arbitrarily across the subtree.

This patch introduces cgroup->dom_cgrp along with threaded css_set
handling.

* cgroup->dom_cgrp points to self for normal and thread roots.  For
  proper thread subtree members, points to the dom_cgrp (the thread
  root).

* css_set->dom_cset points to self if for normal and thread roots.  If
  threaded, points to the css_set which belongs to the cgrp->dom_cgrp.
  The dom_cgrp serves as the resource domain and keeps the matching
  csses available.  The dom_cset holds those csses and makes them
  easily accessible.

* All threaded csets are linked on their dom_csets to enable iteration
  of all threaded tasks.

* cgroup->nr_threaded_children keeps track of the number of threaded
  children.

This patch adds the above but doesn't actually use them yet.  The
following patches will build on top.

v4: ->nr_threaded_children added.

v3: ->proc_cgrp/cset renamed to ->dom_cgrp/cset.  Updated for the new
    enable-threaded-per-cgroup behavior.

v2: Added cgroup_is_threaded() helper.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup-defs.h | 33 +++++++++++++++++++---
 include/linux/cgroup.h      |  3 +-
 kernel/cgroup/cgroup.c      | 69 +++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 97 insertions(+), 8 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ae7bc1e..651c436 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -172,6 +172,14 @@ struct css_set {
 	/* reference count */
 	refcount_t refcount;
 
+	/*
+	 * For a domain cgroup, the following points to self.  If threaded,
+	 * to the matching cset of the nearest domain ancestor.  The
+	 * dom_cset provides access to the domain cgroup and its csses to
+	 * which domain level resource consumptions should be charged.
+	 */
+	struct css_set *dom_cset;
+
 	/* the default cgroup associated with this css_set */
 	struct cgroup *dfl_cgrp;
 
@@ -200,6 +208,10 @@ struct css_set {
 	 */
 	struct list_head e_cset_node[CGROUP_SUBSYS_COUNT];
 
+	/* all threaded csets whose ->dom_cset points to this cset */
+	struct list_head threaded_csets;
+	struct list_head threaded_csets_node;
+
 	/*
 	 * List running through all cgroup groups in the same hash
 	 * slot. Protected by css_set_lock
@@ -267,12 +279,16 @@ struct cgroup {
 	 * doesn't have any tasks.
 	 *
 	 * All children which have non-zero nr_populated_csets and/or
-	 * nr_populated_children of their own contribute one to
-	 * nr_populated_children.  The counter is zero iff this cgroup's
-	 * subtree proper doesn't have any tasks.
+	 * nr_populated_children of their own contribute one to either
+	 * nr_populated_domain_children or nr_populated_threaded_children
+	 * depending on their type.  Each counter is zero iff all cgroups
+	 * of the type in the subtree proper don't have any tasks.
 	 */
 	int nr_populated_csets;
-	int nr_populated_children;
+	int nr_populated_domain_children;
+	int nr_populated_threaded_children;
+
+	int nr_threaded_children;	/* # of live threaded child cgroups */
 
 	struct kernfs_node *kn;		/* cgroup kernfs entry */
 	struct cgroup_file procs_file;	/* handle for "cgroup.procs" */
@@ -311,6 +327,15 @@ struct cgroup {
 	struct list_head e_csets[CGROUP_SUBSYS_COUNT];
 
 	/*
+	 * If !threaded, self.  If threaded, it points to the nearest
+	 * domain ancestor.  Inside a threaded subtree, cgroups are exempt
+	 * from process granularity and no-internal-task constraint.
+	 * Domain level resource consumptions which aren't tied to a
+	 * specific task are charged to the dom_cgrp.
+	 */
+	struct cgroup *dom_cgrp;
+
+	/*
 	 * list of pidlists, up to two for each namespace (one for procs, one
 	 * for tasks); created on demand.
 	 */
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index cae5831..b7dd230 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -541,7 +541,8 @@ static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
 /* no synchronization, the result can only be used as a hint */
 static inline bool cgroup_is_populated(struct cgroup *cgrp)
 {
-	return cgrp->nr_populated_csets + cgrp->nr_populated_children;
+	return cgrp->nr_populated_csets + cgrp->nr_populated_domain_children +
+		cgrp->nr_populated_threaded_children;
 }
 
 /* returns ino associated with a cgroup */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 3c5a37a..c7e1c24 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -330,6 +330,11 @@ static bool cgroup_has_tasks(struct cgroup *cgrp)
 	return cgrp->nr_populated_csets;
 }
 
+static bool cgroup_is_threaded(struct cgroup *cgrp)
+{
+	return cgrp->dom_cgrp != cgrp;
+}
+
 /* subsystems visibly enabled on a cgroup */
 static u16 cgroup_control(struct cgroup *cgrp)
 {
@@ -565,9 +570,11 @@ EXPORT_SYMBOL_GPL(of_css);
  */
 struct css_set init_css_set = {
 	.refcount		= REFCOUNT_INIT(1),
+	.dom_cset		= &init_css_set,
 	.tasks			= LIST_HEAD_INIT(init_css_set.tasks),
 	.mg_tasks		= LIST_HEAD_INIT(init_css_set.mg_tasks),
 	.task_iters		= LIST_HEAD_INIT(init_css_set.task_iters),
+	.threaded_csets		= LIST_HEAD_INIT(init_css_set.threaded_csets),
 	.cgrp_links		= LIST_HEAD_INIT(init_css_set.cgrp_links),
 	.mg_preload_node	= LIST_HEAD_INIT(init_css_set.mg_preload_node),
 	.mg_node		= LIST_HEAD_INIT(init_css_set.mg_node),
@@ -575,6 +582,11 @@ struct css_set init_css_set = {
 
 static int css_set_count	= 1;	/* 1 for init_css_set */
 
+static bool css_set_threaded(struct css_set *cset)
+{
+	return cset->dom_cset != cset;
+}
+
 /**
  * css_set_populated - does a css_set contain any tasks?
  * @cset: target css_set
@@ -618,10 +630,14 @@ static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
 	do {
 		bool was_populated = cgroup_is_populated(cgrp);
 
-		if (!child)
+		if (!child) {
 			cgrp->nr_populated_csets += adj;
-		else
-			cgrp->nr_populated_children += adj;
+		} else {
+			if (cgroup_is_threaded(child))
+				cgrp->nr_populated_threaded_children += adj;
+			else
+				cgrp->nr_populated_domain_children += adj;
+		}
 
 		if (was_populated == cgroup_is_populated(cgrp))
 			break;
@@ -747,6 +763,8 @@ void put_css_set_locked(struct css_set *cset)
 	if (!refcount_dec_and_test(&cset->refcount))
 		return;
 
+	WARN_ON_ONCE(!list_empty(&cset->threaded_csets));
+
 	/* This css_set is dead. unlink it and release cgroup and css refs */
 	for_each_subsys(ss, ssid) {
 		list_del(&cset->e_cset_node[ssid]);
@@ -763,6 +781,11 @@ void put_css_set_locked(struct css_set *cset)
 		kfree(link);
 	}
 
+	if (css_set_threaded(cset)) {
+		list_del(&cset->threaded_csets_node);
+		put_css_set_locked(cset->dom_cset);
+	}
+
 	kfree_rcu(cset, rcu_head);
 }
 
@@ -781,6 +804,7 @@ static bool compare_css_sets(struct css_set *cset,
 			     struct cgroup *new_cgrp,
 			     struct cgroup_subsys_state *template[])
 {
+	struct cgroup *new_dfl_cgrp;
 	struct list_head *l1, *l2;
 
 	/*
@@ -791,6 +815,16 @@ static bool compare_css_sets(struct css_set *cset,
 	if (memcmp(template, cset->subsys, sizeof(cset->subsys)))
 		return false;
 
+
+	/* @cset's domain should match the default cgroup's */
+	if (cgroup_on_dfl(new_cgrp))
+		new_dfl_cgrp = new_cgrp;
+	else
+		new_dfl_cgrp = old_cset->dfl_cgrp;
+
+	if (new_dfl_cgrp->dom_cgrp != cset->dom_cset->dfl_cgrp)
+		return false;
+
 	/*
 	 * Compare cgroup pointers in order to distinguish between
 	 * different cgroups in hierarchies.  As different cgroups may
@@ -998,9 +1032,11 @@ static struct css_set *find_css_set(struct css_set *old_cset,
 	}
 
 	refcount_set(&cset->refcount, 1);
+	cset->dom_cset = cset;
 	INIT_LIST_HEAD(&cset->tasks);
 	INIT_LIST_HEAD(&cset->mg_tasks);
 	INIT_LIST_HEAD(&cset->task_iters);
+	INIT_LIST_HEAD(&cset->threaded_csets);
 	INIT_HLIST_NODE(&cset->hlist);
 	INIT_LIST_HEAD(&cset->cgrp_links);
 	INIT_LIST_HEAD(&cset->mg_preload_node);
@@ -1038,6 +1074,28 @@ static struct css_set *find_css_set(struct css_set *old_cset,
 
 	spin_unlock_irq(&css_set_lock);
 
+	/*
+	 * If @cset should be threaded, look up the matching dom_cset and
+	 * link them up.  We first fully initialize @cset then look for the
+	 * dom_cset.  It's simpler this way and safe as @cset is guaranteed
+	 * to stay empty until we return.
+	 */
+	if (cgroup_is_threaded(cset->dfl_cgrp)) {
+		struct css_set *dcset;
+
+		dcset = find_css_set(cset, cset->dfl_cgrp->dom_cgrp);
+		if (!dcset) {
+			put_css_set(cset);
+			return NULL;
+		}
+
+		spin_lock_irq(&css_set_lock);
+		cset->dom_cset = dcset;
+		list_add_tail(&cset->threaded_csets_node,
+			      &dcset->threaded_csets);
+		spin_unlock_irq(&css_set_lock);
+	}
+
 	return cset;
 }
 
@@ -1680,6 +1738,7 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 	mutex_init(&cgrp->pidlist_mutex);
 	cgrp->self.cgroup = cgrp;
 	cgrp->self.flags |= CSS_ONLINE;
+	cgrp->dom_cgrp = cgrp;
 
 	for_each_subsys(ss, ssid)
 		INIT_LIST_HEAD(&cgrp->e_csets[ssid]);
@@ -4408,6 +4467,7 @@ static void kill_css(struct cgroup_subsys_state *css)
 static int cgroup_destroy_locked(struct cgroup *cgrp)
 	__releases(&cgroup_mutex) __acquires(&cgroup_mutex)
 {
+	struct cgroup *parent = cgroup_parent(cgrp);
 	struct cgroup_subsys_state *css;
 	struct cgrp_cset_link *link;
 	int ssid;
@@ -4452,6 +4512,9 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 	 */
 	kernfs_remove(cgrp->kn);
 
+	if (parent && cgroup_is_threaded(cgrp))
+		parent->nr_threaded_children--;
+
 	cgroup1_check_for_release(cgroup_parent(cgrp));
 
 	/* put the base reference */
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 4/6] cgroup: implement CSS_TASK_ITER_THREADED
  2017-07-17  2:07 [PATCHSET for-4.13] cgroup: implement cgroup2 thread mode, v3 Tejun Heo
                   ` (2 preceding siblings ...)
  2017-07-17  2:07 ` [PATCH 3/6] cgroup: introduce cgroup->dom_cgrp and threaded css_set handling Tejun Heo
@ 2017-07-17  2:07 ` Tejun Heo
  2017-07-17  2:07 ` [PATCH 5/6] cgroup: implement cgroup v2 thread support Tejun Heo
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2017-07-17  2:07 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, longman
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds,
	guro, Tejun Heo

cgroup v2 is in the process of growing thread granularity support.
Once thread mode is enabled, the root cgroup of the subtree serves as
the dom_cgrp to which the processes of the subtree conceptually belong
and domain-level resource consumptions not tied to any specific task
are charged.  In the subtree, threads won't be subject to process
granularity or no-internal-task constraint and can be distributed
arbitrarily across the subtree.

This patch implements a new task iterator flag CSS_TASK_ITER_THREADED,
which, when used on a dom_cgrp, makes the iteration include the tasks
on all the associated threaded css_sets.  "cgroup.procs" read path is
updated to use it so that reading the file on a proc_cgrp lists all
processes.  This will also be used by controller implementations which
need to walk processes or tasks at the resource domain level.

Task iteration is implemented nested in css_set iteration.  If
CSS_TASK_ITER_THREADED is specified, after walking tasks of each
!threaded css_set, all the associated threaded css_sets are visited
before moving onto the next !threaded css_set.

v2: ->cur_pcset renamed to ->cur_dcset.  Updated for the new
    enable-threaded-per-cgroup behavior.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup.h |  6 ++++
 kernel/cgroup/cgroup.c | 77 +++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 66 insertions(+), 17 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b7dd230..79faa64 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -38,6 +38,8 @@
 
 /* walk only threadgroup leaders */
 #define CSS_TASK_ITER_PROCS		(1U << 0)
+/* walk all threaded css_sets in the domain */
+#define CSS_TASK_ITER_THREADED		(1U << 1)
 
 /* a css_task_iter should be treated as an opaque object */
 struct css_task_iter {
@@ -47,11 +49,15 @@ struct css_task_iter {
 	struct list_head		*cset_pos;
 	struct list_head		*cset_head;
 
+	struct list_head		*tcset_pos;
+	struct list_head		*tcset_head;
+
 	struct list_head		*task_pos;
 	struct list_head		*tasks_head;
 	struct list_head		*mg_tasks_head;
 
 	struct css_set			*cur_cset;
+	struct css_set			*cur_dcset;
 	struct task_struct		*cur_task;
 	struct list_head		iters_node;	/* css_set->task_iters */
 };
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c7e1c24..a1d59af 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3629,6 +3629,58 @@ bool css_has_online_children(struct cgroup_subsys_state *css)
 	return ret;
 }
 
+static struct css_set *css_task_iter_next_css_set(struct css_task_iter *it)
+{
+	struct list_head *l;
+	struct cgrp_cset_link *link;
+	struct css_set *cset;
+
+	lockdep_assert_held(&css_set_lock);
+
+	/* find the next threaded cset */
+	if (it->tcset_pos) {
+		l = it->tcset_pos->next;
+
+		if (l != it->tcset_head) {
+			it->tcset_pos = l;
+			return container_of(l, struct css_set,
+					    threaded_csets_node);
+		}
+
+		it->tcset_pos = NULL;
+	}
+
+	/* find the next cset */
+	l = it->cset_pos;
+	l = l->next;
+	if (l == it->cset_head) {
+		it->cset_pos = NULL;
+		return NULL;
+	}
+
+	if (it->ss) {
+		cset = container_of(l, struct css_set, e_cset_node[it->ss->id]);
+	} else {
+		link = list_entry(l, struct cgrp_cset_link, cset_link);
+		cset = link->cset;
+	}
+
+	it->cset_pos = l;
+
+	/* initialize threaded css_set walking */
+	if (it->flags & CSS_TASK_ITER_THREADED) {
+		if (it->cur_dcset)
+			put_css_set_locked(it->cur_dcset);
+		it->cur_dcset = cset;
+		get_css_set(cset);
+
+		it->tcset_head = &cset->threaded_csets;
+		it->tcset_pos = &cset->threaded_csets;
+	}
+
+	return cset;
+}
+
 /**
  * css_task_iter_advance_css_set - advance a task itererator to the next css_set
  * @it: the iterator to advance
@@ -3637,32 +3689,19 @@ bool css_has_online_children(struct cgroup_subsys_state *css)
  */
 static void css_task_iter_advance_css_set(struct css_task_iter *it)
 {
-	struct list_head *l = it->cset_pos;
-	struct cgrp_cset_link *link;
 	struct css_set *cset;
 
 	lockdep_assert_held(&css_set_lock);
 
 	/* Advance to the next non-empty css_set */
 	do {
-		l = l->next;
-		if (l == it->cset_head) {
-			it->cset_pos = NULL;
+		cset = css_task_iter_next_css_set(it);
+		if (!cset) {
 			it->task_pos = NULL;
 			return;
 		}
-
-		if (it->ss) {
-			cset = container_of(l, struct css_set,
-					    e_cset_node[it->ss->id]);
-		} else {
-			link = list_entry(l, struct cgrp_cset_link, cset_link);
-			cset = link->cset;
-		}
 	} while (!css_set_populated(cset));
 
-	it->cset_pos = l;
-
 	if (!list_empty(&cset->tasks))
 		it->task_pos = cset->tasks.next;
 	else
@@ -3805,6 +3844,9 @@ void css_task_iter_end(struct css_task_iter *it)
 		spin_unlock_irq(&css_set_lock);
 	}
 
+	if (it->cur_dcset)
+		put_css_set(it->cur_dcset);
+
 	if (it->cur_task)
 		put_task_struct(it->cur_task);
 }
@@ -3830,6 +3872,7 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 	struct kernfs_open_file *of = s->private;
 	struct cgroup *cgrp = seq_css(s)->cgroup;
 	struct css_task_iter *it = of->priv;
+	unsigned iter_flags = CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED;
 
 	/*
 	 * When a seq_file is seeked, it's always traversed sequentially
@@ -3843,10 +3886,10 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 		if (!it)
 			return ERR_PTR(-ENOMEM);
 		of->priv = it;
-		css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS, it);
+		css_task_iter_start(&cgrp->self, iter_flags, it);
 	} else if (!(*pos)++) {
 		css_task_iter_end(it);
-		css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS, it);
+		css_task_iter_start(&cgrp->self, iter_flags, it);
 	}
 
 	return cgroup_procs_next(s, NULL, NULL);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 5/6] cgroup: implement cgroup v2 thread support
  2017-07-17  2:07 [PATCHSET for-4.13] cgroup: implement cgroup2 thread mode, v3 Tejun Heo
                   ` (3 preceding siblings ...)
  2017-07-17  2:07 ` [PATCH 4/6] cgroup: implement CSS_TASK_ITER_THREADED Tejun Heo
@ 2017-07-17  2:07 ` Tejun Heo
  2017-07-17 14:14   ` Peter Zijlstra
  2017-07-17 21:12   ` Waiman Long
  2017-07-17  2:07 ` [PATCH 6/6] cgroup: update debug controller to print out thread mode information Tejun Heo
  2017-07-17 14:48 ` [PATCHSET for-4.13] cgroup: implement cgroup2 thread mode, v3 Waiman Long
  6 siblings, 2 replies; 32+ messages in thread
From: Tejun Heo @ 2017-07-17  2:07 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, longman
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds,
	guro, Tejun Heo

This patch implements cgroup v2 thread support.  The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.

A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file.  When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.

The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint.  Note that the threads aren't allowed
to escape to a different threaded subtree.  To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.

The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree.  This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted.  This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.

As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.

Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.

This patch enables threaded mode for the pids and perf_events
controllers.  Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.

For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.

v4: - Updated to marking each cgroup threaded as suggested by PeterZ.

v3: - Dropped "join" and always make mixed children join the parent's
      threaded subtree.

v2: - After discussions with Waiman, support for mixed thread mode is
      added.  This should address the issue that Peter pointed out
      where any nesting should be avoided for thread subtrees while
      coexisting with other domain cgroups.

    - Enabling / disabling thread mode now piggy backs on the existing
          control mask update mechanism.

    - Bug fixes and cleanup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 Documentation/cgroup-v2.txt     | 181 ++++++++++++++++++--
 include/linux/cgroup-defs.h     |  12 ++
 kernel/cgroup/cgroup-internal.h |   2 +-
 kernel/cgroup/cgroup-v1.c       |   5 +-
 kernel/cgroup/cgroup.c          | 356 +++++++++++++++++++++++++++++++++++++---
 kernel/cgroup/pids.c            |   1 +
 kernel/events/core.c            |   1 +
 7 files changed, 518 insertions(+), 40 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index f01f831..10d550c 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -18,7 +18,9 @@ v1 is available under Documentation/cgroup-v1/.
      1-2. What is cgroup?
    2. Basic Operations
      2-1. Mounting
-     2-2. Organizing Processes
+     2-2. Organizing Processes and Threads
+       2-2-1. Processes
+       2-2-2. Threads
      2-3. [Un]populated Notification
      2-4. Controlling Controllers
        2-4-1. Enabling and Disabling
@@ -167,8 +169,11 @@ cgroup v2 currently supports the following mount options.
 	Delegation section for details.
 
 
-Organizing Processes
---------------------
+Organizing Processes and Threads
+--------------------------------
+
+Processes
+~~~~~~~~~
 
 Initially, only the root cgroup exists to which all processes belong.
 A child cgroup can be created by creating a sub-directory::
@@ -219,6 +224,104 @@ If the process becomes a zombie and the cgroup it was associated with
   0::/test-cgroup/test-cgroup-nested (deleted)
 
 
+Threads
+~~~~~~~
+
+cgroup v2 supports thread granularity for a subset of controllers to
+support use cases requiring hierarchical resource distribution across
+the threads of a group of processes.  By default, all threads of a
+process belong to the same cgroup, which also serves as the resource
+domain to host resource consumptions which are not specific to a
+process or thread.  The thread mode allows threads to be spread across
+a subtree while still maintaining the common resource domain for them.
+
+Controllers which support thread mode are called threaded controllers.
+The ones which don't are called domain controllers.
+
+Marking a cgroup threaded makes it join the resource domain of its
+parent as a threaded cgroup.  The parent may be another threaded
+cgroup whose resource domain is further up in the hierarchy.  The root
+of a threaded subtree, that is, the nearest ancestor which is not
+threaded, is called threaded domain and serves as the resource domain
+for the entire subtree.
+
+Inside a threaded subtree, threads of a process can be put in
+different cgroups and are not subject to the no internal process
+constraint - threaded controllers can be enabled on non-leaf cgroups
+whether they have threads in them or not.
+
+As the threaded domain cgroup hosts all the domain resource
+consumptions of the subtree, it is considered to have internal
+resource consumptions whether there are processes in it or not and
+can't have populated child cgroups which aren't threaded.  Because the
+root cgroup is not subject to no internal process constraint, it can
+serve both as a threaded domain and a parent to domain cgroups.
+
+The current operation mode or type of the cgroup is shown in the
+"cgroup.type" file which indicates whether the cgroup is a normal
+domain, a domain which is serving as the domain of a threaded subtree,
+or a threaded cgroup.
+
+On creation, a cgroup is always a domain cgroup and can be made
+threaded by writing "threaded" to the "cgroup.type" file.  The
+operation is single direction::
+
+  # echo threaded > cgroup.type
+
+Once threaded, the cgroup can't be made a domain again.  To enable the
+thread mode, the following conditions must be met.
+
+- As the cgroup will join the parent's resource domain.  The parent
+  must either be a valid (threaded) domain or a threaded cgroup.
+
+- The cgroup must be empty.  No enabled controllers, child cgroups or
+  processes.
+
+Topology-wise, a cgroup can be in an invalid state.  Please consider
+the following toplogy::
+
+  A (threaded domain) - B (threaded) - C (domain, just created)
+
+C is created as a domain but isn't connected to a parent which can
+host child domains.  C can't be used until it is turned into a
+threaded cgroup.  "cgroup.type" file will report "domain (invalid)" in
+these cases.  Operations which fail due to invalid topology use
+EOPNOTSUPP as the errno.
+
+A domain cgroup is turned into a threaded domain when one of its child
+cgroup becomes threaded or threaded controllers are enabled in the
+"cgroup.subtree_control" file while there are processes in the cgroup.
+A threaded domain reverts to a normal domain when the conditions
+clear.
+
+When read, "cgroup.threads" contains the list of the thread IDs of all
+threads in the cgroup.  Except that the operations are per-thread
+instead of per-process, "cgroup.threads" has the same format and
+behaves the same way as "cgroup.procs".  While "cgroup.threads" can be
+written to in any cgroup, as it can only move threads inside the same
+threaded domain, its operations are confined inside each threaded
+subtree.
+
+The threaded domain cgroup serves as the resource domain for the whole
+subtree, and, while the threads can be scattered across the subtree,
+all the processes are considered to be in the threaded domain cgroup.
+"cgroup.procs" in a threaded domain cgroup contains the PIDs of all
+processes in the subtree and is not readable in the subtree proper.
+However, "cgroup.procs" can be written to from anywhere in the subtree
+to migrate all threads of the matching process to the cgroup.
+
+Only threaded controllers can be enabled in a threaded subtree.  When
+a threaded controller is enabled inside a threaded subtree, it only
+accounts for and controls resource consumptions associated with the
+threads in the cgroup and its descendants.  All consumptions which
+aren't tied to a specific thread belong to the threaded domain cgroup.
+
+Because a threaded subtree is exempt from no internal process
+constraint, a threaded controller must be able to handle competition
+between threads in a non-leaf cgroup and its child cgroups.  Each
+threaded controller defines how such competitions are handled.
+
+
 [Un]populated Notification
 --------------------------
 
@@ -302,15 +405,15 @@ disabled if one or more children have it enabled.
 No Internal Process Constraint
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Non-root cgroups can only distribute resources to their children when
-they don't have any processes of their own.  In other words, only
-cgroups which don't contain any processes can have controllers enabled
-in their "cgroup.subtree_control" files.
+Non-root cgroups can distribute domain resources to their children
+only when they don't have any processes of their own.  In other words,
+only domain cgroups which don't contain any processes can have domain
+controllers enabled in their "cgroup.subtree_control" files.
 
-This guarantees that, when a controller is looking at the part of the
-hierarchy which has it enabled, processes are always only on the
-leaves.  This rules out situations where child cgroups compete against
-internal processes of the parent.
+This guarantees that, when a domain controller is looking at the part
+of the hierarchy which has it enabled, processes are always only on
+the leaves.  This rules out situations where child cgroups compete
+against internal processes of the parent.
 
 The root cgroup is exempt from this restriction.  Root contains
 processes and anonymous resource consumption which can't be associated
@@ -334,10 +437,10 @@ Model of Delegation
 ~~~~~~~~~~~~~~~~~~~
 
 A cgroup can be delegated in two ways.  First, to a less privileged
-user by granting write access of the directory and its "cgroup.procs"
-and "cgroup.subtree_control" files to the user.  Second, if the
-"nsdelegate" mount option is set, automatically to a cgroup namespace
-on namespace creation.
+user by granting write access of the directory and its "cgroup.procs",
+"cgroup.threads" and "cgroup.subtree_control" files to the user.
+Second, if the "nsdelegate" mount option is set, automatically to a
+cgroup namespace on namespace creation.
 
 Because the resource control interface files in a given directory
 control the distribution of the parent's resources, the delegatee
@@ -644,6 +747,29 @@ Core Interface Files
 
 All cgroup core files are prefixed with "cgroup."
 
+  cgroup.type
+
+	A read-write single value file which exists on non-root
+	cgroups.
+
+	When read, it indicates the current type of the cgroup, which
+	can be one of the following values.
+
+	- "domain" : A normal valid domain cgroup.
+
+	- "domain (threaded)" : A threaded domain cgroup which is
+          serving as the root of a threaded subtree.
+
+	- "domain (invalid)" : A cgroup which is in an invalid state.
+	  It can't be populated or have controllers enabled.  It may
+	  be allowed to become a threaded cgroup.
+
+	- "threaded" : A threaded cgroup which is a member of a
+          threaded subtree.
+
+	A cgroup can be turned into a threaded cgroup by writing
+	"threaded" to this file.
+
   cgroup.procs
 	A read-write new-line separated values file which exists on
 	all cgroups.
@@ -666,6 +792,31 @@ All cgroup core files are prefixed with "cgroup."
 	When delegating a sub-hierarchy, write access to this file
 	should be granted along with the containing directory.
 
+  cgroup.threads
+	A read-write new-line separated values file which exists on
+	all cgroups.
+
+	When read, it lists the TIDs of all threads which belong to
+	the cgroup one-per-line.  The TIDs are not ordered and the
+	same TID may show up more than once if the thread got moved to
+	another cgroup and then back or the TID got recycled while
+	reading.
+
+	A TID can be written to migrate the thread associated with the
+	TID to the cgroup.  The writer should match all of the
+	following conditions.
+
+	- It must have write access to the "cgroup.threads" file.
+
+	- The cgroup that the thread is currently in must be in the
+          same resource domain as the destination cgroup.
+
+	- It must have write access to the "cgroup.procs" file of the
+	  common ancestor of the source and destination cgroups.
+
+	When delegating a sub-hierarchy, write access to this file
+	should be granted along with the containing directory.
+
   cgroup.controllers
 	A read-only space separated values file which exists on all
 	cgroups.
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 651c436..9d74195 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -522,6 +522,18 @@ struct cgroup_subsys {
 	bool implicit_on_dfl:1;
 
 	/*
+	 * If %true, the controller, supports threaded mode on the default
+	 * hierarchy.  In a threaded subtree, both process granularity and
+	 * no-internal-process constraint are ignored and a threaded
+	 * controllers should be able to handle that.
+	 *
+	 * Note that as an implicit controller is automatically enabled on
+	 * all cgroups on the default hierarchy, it should also be
+	 * threaded.  implicit && !threaded is not supported.
+	 */
+	bool threaded:1;
+
+	/*
 	 * If %false, this subsystem is properly hierarchical -
 	 * configuration, resource accounting and restriction on a parent
 	 * cgroup cover those of its children.  If %true, hierarchy support
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 0e81c61..f10eb19 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -170,7 +170,7 @@ struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
 			       struct cgroup_root *root, unsigned long magic,
 			       struct cgroup_namespace *ns);
 
-bool cgroup_may_migrate_to(struct cgroup *dst_cgrp);
+int cgroup_migrate_vet_dst(struct cgroup *dst_cgrp);
 void cgroup_migrate_finish(struct cgroup_mgctx *mgctx);
 void cgroup_migrate_add_src(struct css_set *src_cset, struct cgroup *dst_cgrp,
 			    struct cgroup_mgctx *mgctx);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 167aaab..f0e8601 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -99,8 +99,9 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 	if (cgroup_on_dfl(to))
 		return -EINVAL;
 
-	if (!cgroup_may_migrate_to(to))
-		return -EBUSY;
+	ret = cgroup_migrate_vet_dst(to);
+	if (ret)
+		return ret;
 
 	mutex_lock(&cgroup_mutex);
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a1d59af..7097ce4 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -162,6 +162,9 @@ static u16 cgrp_dfl_inhibit_ss_mask;
 /* some controllers are implicitly enabled on the default hierarchy */
 static u16 cgrp_dfl_implicit_ss_mask;
 
+/* some controllers can be threaded on the default hierarchy */
+static u16 cgrp_dfl_threaded_ss_mask;
+
 /* The list of hierarchy roots */
 LIST_HEAD(cgroup_roots);
 static int cgroup_root_count;
@@ -335,14 +338,93 @@ static bool cgroup_is_threaded(struct cgroup *cgrp)
 	return cgrp->dom_cgrp != cgrp;
 }
 
+/* can @cgrp host both domain and threaded children? */
+static bool cgroup_is_mixable(struct cgroup *cgrp)
+{
+	/*
+	 * Root isn't under domain level resource control exempting it from
+	 * the no-internal-process constraint, so it can serve as a thread
+	 * root and a parent of resource domains at the same time.
+	 */
+	return !cgroup_parent(cgrp);
+}
+
+/* can @cgrp become a thread root? should always be true for a thread root */
+static bool cgroup_can_be_thread_root(struct cgroup *cgrp)
+{
+	/* mixables don't care */
+	if (cgroup_is_mixable(cgrp))
+		return true;
+
+	/* domain roots can't be nested under threaded */
+	if (cgroup_is_threaded(cgrp))
+		return false;
+
+	/* can only have either domain or threaded children */
+	if (cgrp->nr_populated_domain_children)
+		return false;
+
+	/* and no domain controllers can be enabled */
+	if (cgrp->subtree_control & ~cgrp_dfl_threaded_ss_mask)
+		return false;
+
+	return true;
+}
+
+/* is @cgrp root of a threaded subtree? */
+static bool cgroup_is_thread_root(struct cgroup *cgrp)
+{
+	/* thread root should be a domain */
+	if (cgroup_is_threaded(cgrp))
+		return false;
+
+	/* a domain w/ threaded children is a thread root */
+	if (cgrp->nr_threaded_children)
+		return true;
+
+	/*
+	 * A domain which has tasks and explicit threaded controllers
+	 * enabled is a thread root.
+	 */
+	if (cgroup_has_tasks(cgrp) &&
+	    (cgrp->subtree_control & cgrp_dfl_threaded_ss_mask))
+		return true;
+
+	return false;
+}
+
+/* a domain which isn't connected to the root w/o brekage can't be used */
+static bool cgroup_is_valid_domain(struct cgroup *cgrp)
+{
+	/* the cgroup itself can be a thread root */
+	if (cgroup_is_threaded(cgrp))
+		return false;
+
+	/* but the ancestors can't be unless mixable */
+	while ((cgrp = cgroup_parent(cgrp))) {
+		if (!cgroup_is_mixable(cgrp) && cgroup_is_thread_root(cgrp))
+			return false;
+		if (cgroup_is_threaded(cgrp))
+			return false;
+	}
+
+	return true;
+}
+
 /* subsystems visibly enabled on a cgroup */
 static u16 cgroup_control(struct cgroup *cgrp)
 {
 	struct cgroup *parent = cgroup_parent(cgrp);
 	u16 root_ss_mask = cgrp->root->subsys_mask;
 
-	if (parent)
-		return parent->subtree_control;
+	if (parent) {
+		u16 ss_mask = parent->subtree_control;
+
+		/* threaded cgroups can only have threaded controllers */
+		if (cgroup_is_threaded(cgrp))
+			ss_mask &= cgrp_dfl_threaded_ss_mask;
+		return ss_mask;
+	}
 
 	if (cgroup_on_dfl(cgrp))
 		root_ss_mask &= ~(cgrp_dfl_inhibit_ss_mask |
@@ -355,8 +437,14 @@ static u16 cgroup_ss_mask(struct cgroup *cgrp)
 {
 	struct cgroup *parent = cgroup_parent(cgrp);
 
-	if (parent)
-		return parent->subtree_ss_mask;
+	if (parent) {
+		u16 ss_mask = parent->subtree_ss_mask;
+
+		/* threaded cgroups can only have threaded controllers */
+		if (cgroup_is_threaded(cgrp))
+			ss_mask &= cgrp_dfl_threaded_ss_mask;
+		return ss_mask;
+	}
 
 	return cgrp->root->subsys_mask;
 }
@@ -2237,17 +2325,40 @@ static int cgroup_migrate_execute(struct cgroup_mgctx *mgctx)
 }
 
 /**
- * cgroup_may_migrate_to - verify whether a cgroup can be migration destination
+ * cgroup_migrate_vet_dst - verify whether a cgroup can be migration destination
  * @dst_cgrp: destination cgroup to test
  *
- * On the default hierarchy, except for the root, subtree_control must be
- * zero for migration destination cgroups with tasks so that child cgroups
- * don't compete against tasks.
+ * On the default hierarchy, except for the mixable, (possible) thread root
+ * and threaded cgroups, subtree_control must be zero for migration
+ * destination cgroups with tasks so that child cgroups don't compete
+ * against tasks.
  */
-bool cgroup_may_migrate_to(struct cgroup *dst_cgrp)
+int cgroup_migrate_vet_dst(struct cgroup *dst_cgrp)
 {
-	return !cgroup_on_dfl(dst_cgrp) || !cgroup_parent(dst_cgrp) ||
-		!dst_cgrp->subtree_control;
+	/* v1 doesn't have any restriction */
+	if (!cgroup_on_dfl(dst_cgrp))
+		return 0;
+
+	/* verify @dst_cgrp can host resources */
+	if (!cgroup_is_valid_domain(dst_cgrp->dom_cgrp))
+		return -EOPNOTSUPP;
+
+	/* mixables don't care */
+	if (cgroup_is_mixable(dst_cgrp))
+		return 0;
+
+	/*
+	 * If @dst_cgrp is already or can become a thread root or is
+	 * threaded, it doesn't matter.
+	 */
+	if (cgroup_can_be_thread_root(dst_cgrp) || cgroup_is_threaded(dst_cgrp))
+		return 0;
+
+	/* apply no-internal-process constraint */
+	if (dst_cgrp->subtree_control)
+		return -EBUSY;
+
+	return 0;
 }
 
 /**
@@ -2452,8 +2563,9 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 	struct task_struct *task;
 	int ret;
 
-	if (!cgroup_may_migrate_to(dst_cgrp))
-		return -EBUSY;
+	ret = cgroup_migrate_vet_dst(dst_cgrp);
+	if (ret)
+		return ret;
 
 	/* look up all src csets */
 	spin_lock_irq(&css_set_lock);
@@ -2881,6 +2993,46 @@ static void cgroup_finalize_control(struct cgroup *cgrp, int ret)
 	cgroup_apply_control_disable(cgrp);
 }
 
+static int cgroup_vet_subtree_control_enable(struct cgroup *cgrp, u16 enable)
+{
+	u16 domain_enable = enable & ~cgrp_dfl_threaded_ss_mask;
+
+	/* if nothing is getting enabled, nothing to worry about */
+	if (!enable)
+		return 0;
+
+	/* can @cgrp host any resources? */
+	if (!cgroup_is_valid_domain(cgrp->dom_cgrp))
+		return -EOPNOTSUPP;
+
+	/* mixables don't care */
+	if (cgroup_is_mixable(cgrp))
+		return 0;
+
+	if (domain_enable) {
+		/* can't enable domain controllers inside a thread subtree */
+		if (cgroup_is_thread_root(cgrp) || cgroup_is_threaded(cgrp))
+			return -EOPNOTSUPP;
+	} else {
+		/*
+		 * Threaded controllers can handle internal competitions
+		 * and are always allowed inside a (prospective) thread
+		 * subtree.
+		 */
+		if (cgroup_can_be_thread_root(cgrp) || cgroup_is_threaded(cgrp))
+			return 0;
+	}
+
+	/*
+	 * Controllers can't be enabled for a cgroup with tasks to avoid
+	 * child cgroups competing against tasks.
+	 */
+	if (cgroup_has_tasks(cgrp))
+		return -EBUSY;
+
+	return 0;
+}
+
 /* change the enabled child controllers for a cgroup in the default hierarchy */
 static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 					    char *buf, size_t nbytes,
@@ -2956,14 +3108,9 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 		goto out_unlock;
 	}
 
-	/*
-	 * Except for the root, subtree_control must be zero for a cgroup
-	 * with tasks so that child cgroups don't compete against tasks.
-	 */
-	if (enable && cgroup_parent(cgrp) && cgroup_has_tasks(cgrp)) {
-		ret = -EBUSY;
+	ret = cgroup_vet_subtree_control_enable(cgrp, enable);
+	if (ret)
 		goto out_unlock;
-	}
 
 	/* save and update control masks and prepare csses */
 	cgroup_save_control(cgrp);
@@ -2982,6 +3129,84 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
+static int cgroup_enable_threaded(struct cgroup *cgrp)
+{
+	struct cgroup *parent = cgroup_parent(cgrp);
+	struct cgroup *dom_cgrp = parent->dom_cgrp;
+	int ret;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	/* noop if already threaded */
+	if (cgroup_is_threaded(cgrp))
+		return 0;
+
+	/* we're joining the parent's domain, ensure its validity */
+	if (!cgroup_is_valid_domain(dom_cgrp) ||
+	    !cgroup_can_be_thread_root(dom_cgrp))
+		return -EOPNOTSUPP;
+
+	/*
+	 * Allow enabling thread mode only on empty cgroups to avoid
+	 * implicit migrations and recursive operations.
+	 */
+	if (cgroup_has_tasks(cgrp) || css_has_online_children(&cgrp->self))
+		return -EBUSY;
+
+	/*
+	 * The following shouldn't cause actual migrations and should
+	 * always succeed.
+	 */
+	cgroup_save_control(cgrp);
+
+	cgrp->dom_cgrp = dom_cgrp;
+	ret = cgroup_apply_control(cgrp);
+	if (!ret)
+		parent->nr_threaded_children++;
+	else
+		cgrp->dom_cgrp = cgrp;
+
+	cgroup_finalize_control(cgrp, ret);
+	return ret;
+}
+
+static int cgroup_type_show(struct seq_file *seq, void *v)
+{
+	struct cgroup *cgrp = seq_css(seq)->cgroup;
+
+	if (cgroup_is_threaded(cgrp))
+		seq_puts(seq, "threaded\n");
+	else if (!cgroup_is_valid_domain(cgrp))
+		seq_puts(seq, "domain (invalid)\n");
+	else if (cgroup_is_thread_root(cgrp))
+		seq_puts(seq, "domain (threaded)\n");
+	else
+		seq_puts(seq, "domain\n");
+
+	return 0;
+}
+
+static ssize_t cgroup_type_write(struct kernfs_open_file *of, char *buf,
+				 size_t nbytes, loff_t off)
+{
+	struct cgroup *cgrp;
+	int ret;
+
+	/* only switching to threaded mode is supported */
+	if (strcmp(strstrip(buf), "threaded"))
+		return -EINVAL;
+
+	cgrp = cgroup_kn_lock_live(of->kn, false);
+	if (!cgrp)
+		return -ENOENT;
+
+	/* threaded can only be enabled */
+	ret = cgroup_enable_threaded(cgrp);
+
+	cgroup_kn_unlock(of->kn);
+	return ret ?: nbytes;
+}
+
 static int cgroup_events_show(struct seq_file *seq, void *v)
 {
 	seq_printf(seq, "populated %d\n",
@@ -3867,12 +4092,12 @@ static void *cgroup_procs_next(struct seq_file *s, void *v, loff_t *pos)
 	return css_task_iter_next(it);
 }
 
-static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
+static void *__cgroup_procs_start(struct seq_file *s, loff_t *pos,
+				  unsigned int iter_flags)
 {
 	struct kernfs_open_file *of = s->private;
 	struct cgroup *cgrp = seq_css(s)->cgroup;
 	struct css_task_iter *it = of->priv;
-	unsigned iter_flags = CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED;
 
 	/*
 	 * When a seq_file is seeked, it's always traversed sequentially
@@ -3895,6 +4120,23 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 	return cgroup_procs_next(s, NULL, NULL);
 }
 
+static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
+{
+	struct cgroup *cgrp = seq_css(s)->cgroup;
+
+	/*
+	 * All processes of a threaded subtree belong to the domain cgroup
+	 * of the subtree.  Only threads can be distributed across the
+	 * subtree.  Reject reads on cgroup.procs in the subtree proper.
+	 * They're always empty anyway.
+	 */
+	if (cgroup_is_threaded(cgrp))
+		return ERR_PTR(-EINVAL);
+
+	return __cgroup_procs_start(s, pos, CSS_TASK_ITER_PROCS |
+					    CSS_TASK_ITER_THREADED);
+}
+
 static int cgroup_procs_show(struct seq_file *s, void *v)
 {
 	seq_printf(s, "%d\n", task_pid_vnr(v));
@@ -3974,9 +4216,64 @@ static ssize_t cgroup_procs_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
+static void *cgroup_threads_start(struct seq_file *s, loff_t *pos)
+{
+	return __cgroup_procs_start(s, pos, 0);
+}
+
+static ssize_t cgroup_threads_write(struct kernfs_open_file *of,
+				    char *buf, size_t nbytes, loff_t off)
+{
+	struct cgroup *src_cgrp, *dst_cgrp;
+	struct task_struct *task;
+	ssize_t ret;
+
+	buf = strstrip(buf);
+
+	dst_cgrp = cgroup_kn_lock_live(of->kn, false);
+	if (!dst_cgrp)
+		return -ENODEV;
+
+	task = cgroup_procs_write_start(buf, false);
+	ret = PTR_ERR_OR_ZERO(task);
+	if (ret)
+		goto out_unlock;
+
+	/* find the source cgroup */
+	spin_lock_irq(&css_set_lock);
+	src_cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
+	spin_unlock_irq(&css_set_lock);
+
+	/* thread migrations follow the cgroup.procs delegation rule */
+	ret = cgroup_procs_write_permission(src_cgrp, dst_cgrp,
+					    of->file->f_path.dentry->d_sb);
+	if (ret)
+		goto out_finish;
+
+	/* and must be contained in the same domain */
+	ret = -EOPNOTSUPP;
+	if (src_cgrp->dom_cgrp != dst_cgrp->dom_cgrp)
+		goto out_finish;
+
+	ret = cgroup_attach_task(dst_cgrp, task, false);
+
+out_finish:
+	cgroup_procs_write_finish(task);
+out_unlock:
+	cgroup_kn_unlock(of->kn);
+
+	return ret ?: nbytes;
+}
+
 /* cgroup core interface files for the default hierarchy */
 static struct cftype cgroup_base_files[] = {
 	{
+		.name = "cgroup.type",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cgroup_type_show,
+		.write = cgroup_type_write,
+	},
+	{
 		.name = "cgroup.procs",
 		.flags = CFTYPE_NS_DELEGATABLE,
 		.file_offset = offsetof(struct cgroup, procs_file),
@@ -3987,6 +4284,14 @@ static struct cftype cgroup_base_files[] = {
 		.write = cgroup_procs_write,
 	},
 	{
+		.name = "cgroup.threads",
+		.release = cgroup_procs_release,
+		.seq_start = cgroup_threads_start,
+		.seq_next = cgroup_procs_next,
+		.seq_show = cgroup_procs_show,
+		.write = cgroup_threads_write,
+	},
+	{
 		.name = "cgroup.controllers",
 		.seq_show = cgroup_controllers_show,
 	},
@@ -4301,6 +4606,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 	cgrp->self.parent = &parent->self;
 	cgrp->root = root;
 	cgrp->level = level;
+	cgrp->dom_cgrp = cgrp->dom_cgrp;
 
 	for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp))
 		cgrp->ancestor_ids[tcgrp->level] = tcgrp->id;
@@ -4753,11 +5059,17 @@ int __init cgroup_init(void)
 
 		cgrp_dfl_root.subsys_mask |= 1 << ss->id;
 
+		/* implicit controllers must be threaded too */
+		WARN_ON(ss->implicit_on_dfl && !ss->threaded);
+
 		if (ss->implicit_on_dfl)
 			cgrp_dfl_implicit_ss_mask |= 1 << ss->id;
 		else if (!ss->dfl_cftypes)
 			cgrp_dfl_inhibit_ss_mask |= 1 << ss->id;
 
+		if (ss->threaded)
+			cgrp_dfl_threaded_ss_mask |= 1 << ss->id;
+
 		if (ss->dfl_cftypes == ss->legacy_cftypes) {
 			WARN_ON(cgroup_add_cftypes(ss, ss->dfl_cftypes));
 		} else {
diff --git a/kernel/cgroup/pids.c b/kernel/cgroup/pids.c
index 2237201..9829c67 100644
--- a/kernel/cgroup/pids.c
+++ b/kernel/cgroup/pids.c
@@ -345,4 +345,5 @@ struct cgroup_subsys pids_cgrp_subsys = {
 	.free		= pids_free,
 	.legacy_cftypes	= pids_files,
 	.dfl_cftypes	= pids_files,
+	.threaded	= true,
 };
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1538df9..ec78247 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11210,5 +11210,6 @@ struct cgroup_subsys perf_event_cgrp_subsys = {
 	 * controller is not mounted on a legacy hierarchy.
 	 */
 	.implicit_on_dfl = true,
+	.threaded	= true,
 };
 #endif /* CONFIG_CGROUP_PERF */
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 6/6] cgroup: update debug controller to print out thread mode information
  2017-07-17  2:07 [PATCHSET for-4.13] cgroup: implement cgroup2 thread mode, v3 Tejun Heo
                   ` (4 preceding siblings ...)
  2017-07-17  2:07 ` [PATCH 5/6] cgroup: implement cgroup v2 thread support Tejun Heo
@ 2017-07-17  2:07 ` Tejun Heo
  2017-07-17 21:19   ` Waiman Long
  2017-07-17 14:48 ` [PATCHSET for-4.13] cgroup: implement cgroup2 thread mode, v3 Waiman Long
  6 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2017-07-17  2:07 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, longman
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds,
	guro, Tejun Heo

From: Waiman Long <longman@redhat.com>

Update debug controller so that it prints out debug info about thread
mode.

 1) The relationship between proc_cset and threaded_csets are displayed.
 2) The status of being a thread root or threaded cgroup is displayed.

This patch is extracted from Waiman's larger patch.

Patch-originally-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup/cgroup-internal.h |  2 ++
 kernel/cgroup/cgroup.c          |  4 +--
 kernel/cgroup/debug.c           | 58 +++++++++++++++++++++++++++++++----------
 3 files changed, 48 insertions(+), 16 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index f10eb19..c167a40 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -153,6 +153,8 @@ static inline void get_css_set(struct css_set *cset)
 
 bool cgroup_ssid_enabled(int ssid);
 bool cgroup_on_dfl(const struct cgroup *cgrp);
+bool cgroup_is_thread_root(struct cgroup *cgrp);
+bool cgroup_is_threaded(struct cgroup *cgrp);
 
 struct cgroup_root *cgroup_root_from_kf(struct kernfs_root *kf_root);
 struct cgroup *task_cgroup_from_root(struct task_struct *task,
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 7097ce4..2f34020 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -333,7 +333,7 @@ static bool cgroup_has_tasks(struct cgroup *cgrp)
 	return cgrp->nr_populated_csets;
 }
 
-static bool cgroup_is_threaded(struct cgroup *cgrp)
+bool cgroup_is_threaded(struct cgroup *cgrp)
 {
 	return cgrp->dom_cgrp != cgrp;
 }
@@ -372,7 +372,7 @@ static bool cgroup_can_be_thread_root(struct cgroup *cgrp)
 }
 
 /* is @cgrp root of a threaded subtree? */
-static bool cgroup_is_thread_root(struct cgroup *cgrp)
+bool cgroup_is_thread_root(struct cgroup *cgrp)
 {
 	/* thread root should be a domain */
 	if (cgroup_is_threaded(cgrp))
diff --git a/kernel/cgroup/debug.c b/kernel/cgroup/debug.c
index dac46af..062904f 100644
--- a/kernel/cgroup/debug.c
+++ b/kernel/cgroup/debug.c
@@ -114,27 +114,54 @@ static int cgroup_css_links_read(struct seq_file *seq, void *v)
 {
 	struct cgroup_subsys_state *css = seq_css(seq);
 	struct cgrp_cset_link *link;
-	int dead_cnt = 0, extra_refs = 0;
+	int dead_cnt = 0, extra_refs = 0, threaded_csets = 0;
 
 	spin_lock_irq(&css_set_lock);
+
+	if (cgroup_is_thread_root(css->cgroup))
+		seq_puts(seq, "[thread root]\n");
+	else if (cgroup_is_threaded(css->cgroup))
+		seq_puts(seq, "[threaded]\n");
+
 	list_for_each_entry(link, &css->cgroup->cset_links, cset_link) {
 		struct css_set *cset = link->cset;
 		struct task_struct *task;
 		int count = 0;
 		int refcnt = refcount_read(&cset->refcount);
 
-		seq_printf(seq, " %d", refcnt);
-		if (refcnt - cset->nr_tasks > 0) {
-			int extra = refcnt - cset->nr_tasks;
-
-			seq_printf(seq, " +%d", extra);
-			/*
-			 * Take out the one additional reference in
-			 * init_css_set.
-			 */
-			if (cset == &init_css_set)
-				extra--;
-			extra_refs += extra;
+		/*
+		 * Print out the proc_cset and threaded_cset relationship
+		 * and highlight difference between refcount and task_count.
+		 */
+		seq_printf(seq, "css_set %pK", cset);
+		if (rcu_dereference_protected(cset->dom_cset, 1) != cset) {
+			threaded_csets++;
+			seq_printf(seq, "=>%pK", cset->dom_cset);
+		}
+		if (!list_empty(&cset->threaded_csets)) {
+			struct css_set *tcset;
+			int idx = 0;
+
+			list_for_each_entry(tcset, &cset->threaded_csets,
+					    threaded_csets_node) {
+				seq_puts(seq, idx ? "," : "<=");
+				seq_printf(seq, "%pK", tcset);
+				idx++;
+			}
+		} else {
+			seq_printf(seq, " %d", refcnt);
+			if (refcnt - cset->nr_tasks > 0) {
+				int extra = refcnt - cset->nr_tasks;
+
+				seq_printf(seq, " +%d", extra);
+				/*
+				 * Take out the one additional reference in
+				 * init_css_set.
+				 */
+				if (cset == &init_css_set)
+					extra--;
+				extra_refs += extra;
+			}
 		}
 		seq_puts(seq, "\n");
 
@@ -163,10 +190,12 @@ static int cgroup_css_links_read(struct seq_file *seq, void *v)
 	}
 	spin_unlock_irq(&css_set_lock);
 
-	if (!dead_cnt && !extra_refs)
+	if (!dead_cnt && !extra_refs && !threaded_csets)
 		return 0;
 
 	seq_puts(seq, "\n");
+	if (threaded_csets)
+		seq_printf(seq, "threaded css_sets = %d\n", threaded_csets);
 	if (extra_refs)
 		seq_printf(seq, "extra references = %d\n", extra_refs);
 	if (dead_cnt)
@@ -342,6 +371,7 @@ struct cgroup_subsys debug_cgrp_subsys = {
 	.css_alloc	= debug_css_alloc,
 	.css_free	= debug_css_free,
 	.legacy_cftypes	= debug_legacy_files,
+	.threaded	= true,
 };
 
 /*
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] cgroup: implement cgroup v2 thread support
  2017-07-17  2:07 ` [PATCH 5/6] cgroup: implement cgroup v2 thread support Tejun Heo
@ 2017-07-17 14:14   ` Peter Zijlstra
  2017-07-17 14:26     ` Tejun Heo
  2017-07-17 20:56     ` Waiman Long
  2017-07-17 21:12   ` Waiman Long
  1 sibling, 2 replies; 32+ messages in thread
From: Peter Zijlstra @ 2017-07-17 14:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

On Sun, Jul 16, 2017 at 10:07:20PM -0400, Tejun Heo wrote:
> 
> v4: - Updated to marking each cgroup threaded as suggested by PeterZ.
> 

> +On creation, a cgroup is always a domain cgroup and can be made
> +threaded by writing "threaded" to the "cgroup.type" file.  The
> +operation is single direction::
> +
> +  # echo threaded > cgroup.type
> +
> +Once threaded, the cgroup can't be made a domain again.  To enable the
> +thread mode, the following conditions must be met.
> +
> +- As the cgroup will join the parent's resource domain.  The parent
> +  must either be a valid (threaded) domain or a threaded cgroup.
> +
> +- The cgroup must be empty.  No enabled controllers, child cgroups or
> +  processes.
> +
> +Topology-wise, a cgroup can be in an invalid state.  Please consider
> +the following toplogy::
> +
> +  A (threaded domain) - B (threaded) - C (domain, just created)
> +
> +C is created as a domain but isn't connected to a parent which can
> +host child domains.  C can't be used until it is turned into a
> +threaded cgroup.  "cgroup.type" file will report "domain (invalid)" in
> +these cases.  Operations which fail due to invalid topology use
> +EOPNOTSUPP as the errno.
> +
> +A domain cgroup is turned into a threaded domain when one of its child
> +cgroup becomes threaded or threaded controllers are enabled in the
> +"cgroup.subtree_control" file while there are processes in the cgroup.
> +A threaded domain reverts to a normal domain when the conditions
> +clear.

AFAICT this is not in fact what I suggested... :/

My proposal did not have that invalid state. It would simply refuse to
change the type from thread to domain in the case where the parent is
not a domain.

Also, my proposal maintained the normal property inheritance rules. A
child cgroup's creation 'type' would be that of its parent and not
always be 'domain'.

Let me read more (and more careful) to see if there's other things.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] cgroup: implement cgroup v2 thread support
  2017-07-17 14:14   ` Peter Zijlstra
@ 2017-07-17 14:26     ` Tejun Heo
  2017-07-18 17:28       ` Peter Zijlstra
  2017-07-17 20:56     ` Waiman Long
  1 sibling, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2017-07-17 14:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

Hello, Peter.

On Mon, Jul 17, 2017 at 04:14:09PM +0200, Peter Zijlstra wrote:
> AFAICT this is not in fact what I suggested... :/

Heh, sorry about misattributing that.  I was mostly referring to the
overall idea of marking each cgroup domain or threaded rather than
subtree.

> My proposal did not have that invalid state. It would simply refuse to
> change the type from thread to domain in the case where the parent is
> not a domain.
> 
> Also, my proposal maintained the normal property inheritance rules. A
> child cgroup's creation 'type' would be that of its parent and not
> always be 'domain'.

But aren't both of the above get weird when the parent can host both
domain and threaded children?

	 R
       / 
      A(D)

If you create another child B under R, it's naturally gonna be a
domain.  Let's say you turn that to threaded.

	 R
       /   \
     A(D) B(T)

And now try to create another child C, should that be a domain or
threaded?

If we only inherit from the second level on, which is in itself
already confusing, that still leads to invalid configs for non-root
thread roots.

I don't think whether we fail the transition or put the cgroup in an
invalid state is all that material.  The simpler the better.

> Let me read more (and more careful) to see if there's other things.

Sure thing.

Thanks!

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCHSET for-4.13] cgroup: implement cgroup2 thread mode, v3
  2017-07-17  2:07 [PATCHSET for-4.13] cgroup: implement cgroup2 thread mode, v3 Tejun Heo
                   ` (5 preceding siblings ...)
  2017-07-17  2:07 ` [PATCH 6/6] cgroup: update debug controller to print out thread mode information Tejun Heo
@ 2017-07-17 14:48 ` Waiman Long
  2017-07-17 14:51   ` Tejun Heo
  6 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2017-07-17 14:48 UTC (permalink / raw)
  To: Tejun Heo, lizefan, hannes, peterz, mingo
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds, guro

On 07/16/2017 10:07 PM, Tejun Heo wrote:
> Hello,
>
> This is v3 of cgroup2 thread mode patchset.  The changes from v2[L]
> are
>
> * Switched to marking each cgroup threaded instead of doing it
>   per-subtree as suggested by PeterZ.  This allows more flexibility
>   and removes certain interface quirks.
>
> * Dropped RFC tag and excluded cpu controller patches from this
>   patchset as threaded mode behaviors can easily be verified with the
>   pid controller.  Will follow up with cpu controller patchset later.
>
> It is largely based on the discussions that we had at the plumbers
> last year.  Here's the rough outline.
>
> * Thread mode is explicitly enabled on a cgroup by writing "threaded"
>   into "cgroup.type" file.  The cgroup shouldn't have any processes or
>   child cgroups.  A threaded cgroup joins the the parent's resource
>   domain and becomes a part of the threaded subtree anchored at the
>   nearest domain ancestor, which is called the threaded domain cgroup
>   of the subtree.
>
> * Threads can be put anywhere in a threaded subtree by writing TIDs
>   into "cgroup.threads" file.  Process granularity and
>   no-internal-process constraint don't apply in a threaded subtree.
>
> * To be used in a threaded subtree, controllers should explicitly
>   declare thread mode support and should be able to handle internal
>   competition in some way.
>
> * The threaded domain cgroup of a threaded subtree serves as the
>   resource domain for the whole subtree.  This is where all the
>   controllers are guaranteed to have a common ground and resource
>   consumptions in the threaded subtree which aren't tied to a specific
>   thread are charged.  Non-threaded controllers never see beyond
>   thread root and can assume that all controllers will follow the same
>   rules upto that point.
>
> * Unlike other cgroups, the system root cgroup can serve as parent to
>   domain child cgroups and threaded domains to threaded subtrees.
>
> This allows threaded controllers to implement thread granular resource
> control without getting in the way of system level resource
> partitioning.
>
> For more details on the interface and behavior, please refer to 0005.
>
> This patchset contains the following six patches.
>
>  0001-cgroup-reorganize-cgroup.procs-task-write-path.patch
>  0002-cgroup-add-flags-to-css_task_iter_start-and-implemen.patch
>  0003-cgroup-introduce-cgroup-dom_cgrp-and-threaded-css_se.patch
>  0004-cgroup-implement-CSS_TASK_ITER_THREADED.patch
>  0005-cgroup-implement-cgroup-v2-thread-support.patch
>  0006-cgroup-update-debug-controller-to-print-out-thread-m.patch
>
> 0001-0005 implement cgroup2 thread mode.  0006 enables debug
> controller on it.
>
> The patchset is based on the current cgroup/for-4.14 27f26753f8c0
> ("cgroup: replace css_set walking populated test with testing
> cgrp->nr_populated_csets") and also available in the following git
> branch.
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup2-threads-v3

Your new patches don't seem to be pushed to your git tree yet. I
couldn't find them there.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCHSET for-4.13] cgroup: implement cgroup2 thread mode, v3
  2017-07-17 14:48 ` [PATCHSET for-4.13] cgroup: implement cgroup2 thread mode, v3 Waiman Long
@ 2017-07-17 14:51   ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2017-07-17 14:51 UTC (permalink / raw)
  To: Waiman Long
  Cc: lizefan, hannes, peterz, mingo, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

On Mon, Jul 17, 2017 at 10:48:44AM -0400, Waiman Long wrote:
> >  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup2-threads-v3
> 
> Your new patches don't seem to be pushed to your git tree yet. I
> couldn't find them there.

Oops, pushed out now.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] cgroup: implement cgroup v2 thread support
  2017-07-17 14:14   ` Peter Zijlstra
  2017-07-17 14:26     ` Tejun Heo
@ 2017-07-17 20:56     ` Waiman Long
  2017-07-18 14:37       ` Waiman Long
  1 sibling, 1 reply; 32+ messages in thread
From: Waiman Long @ 2017-07-17 20:56 UTC (permalink / raw)
  To: Peter Zijlstra, Tejun Heo
  Cc: lizefan, hannes, mingo, cgroups, linux-kernel, kernel-team, pjt,
	luto, efault, torvalds, guro

On 07/17/2017 10:14 AM, Peter Zijlstra wrote:
> On Sun, Jul 16, 2017 at 10:07:20PM -0400, Tejun Heo wrote:
>> v4: - Updated to marking each cgroup threaded as suggested by PeterZ.
>>
>> +On creation, a cgroup is always a domain cgroup and can be made
>> +threaded by writing "threaded" to the "cgroup.type" file.  The
>> +operation is single direction::
>> +
>> +  # echo threaded > cgroup.type
>> +
>> +Once threaded, the cgroup can't be made a domain again.  To enable the
>> +thread mode, the following conditions must be met.
>> +
>> +- As the cgroup will join the parent's resource domain.  The parent
>> +  must either be a valid (threaded) domain or a threaded cgroup.
>> +
>> +- The cgroup must be empty.  No enabled controllers, child cgroups or
>> +  processes.
>> +
>> +Topology-wise, a cgroup can be in an invalid state.  Please consider
>> +the following toplogy::
>> +
>> +  A (threaded domain) - B (threaded) - C (domain, just created)
>> +
>> +C is created as a domain but isn't connected to a parent which can
>> +host child domains.  C can't be used until it is turned into a
>> +threaded cgroup.  "cgroup.type" file will report "domain (invalid)" in
>> +these cases.  Operations which fail due to invalid topology use
>> +EOPNOTSUPP as the errno.
>> +
>> +A domain cgroup is turned into a threaded domain when one of its child
>> +cgroup becomes threaded or threaded controllers are enabled in the
>> +"cgroup.subtree_control" file while there are processes in the cgroup.
>> +A threaded domain reverts to a normal domain when the conditions
>> +clear.
> AFAICT this is not in fact what I suggested... :/
>
> My proposal did not have that invalid state. It would simply refuse to
> change the type from thread to domain in the case where the parent is
> not a domain.
>
> Also, my proposal maintained the normal property inheritance rules. A
> child cgroup's creation 'type' would be that of its parent and not
> always be 'domain'.

I agree with Peter on this. A new cgroup created under a threaded cgroup
should always be threaded. I don't see a need for an intermediate
invalid domain.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] cgroup: implement cgroup v2 thread support
  2017-07-17  2:07 ` [PATCH 5/6] cgroup: implement cgroup v2 thread support Tejun Heo
  2017-07-17 14:14   ` Peter Zijlstra
@ 2017-07-17 21:12   ` Waiman Long
  2017-07-19 15:40     ` Tejun Heo
  1 sibling, 1 reply; 32+ messages in thread
From: Waiman Long @ 2017-07-17 21:12 UTC (permalink / raw)
  To: Tejun Heo, lizefan, hannes, peterz, mingo
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds, guro

On 07/16/2017 10:07 PM, Tejun Heo wrote:
>  
> +Threads
> +~~~~~~~
> +
> +cgroup v2 supports thread granularity for a subset of controllers to
> +support use cases requiring hierarchical resource distribution across
> +the threads of a group of processes.  By default, all threads of a
> +process belong to the same cgroup, which also serves as the resource
> +domain to host resource consumptions which are not specific to a
> +process or thread.  The thread mode allows threads to be spread across
> +a subtree while still maintaining the common resource domain for them.
> +
> +Controllers which support thread mode are called threaded controllers.
> +The ones which don't are called domain controllers.
> +
> +Marking a cgroup threaded makes it join the resource domain of its
> +parent as a threaded cgroup.  The parent may be another threaded
> +cgroup whose resource domain is further up in the hierarchy.  The root
> +of a threaded subtree, that is, the nearest ancestor which is not
> +threaded, is called threaded domain and serves as the resource domain
> +for the entire subtree.

The cgroup code uses the term "thread root" in quite a number of places.
So a developer may be confused when comparing the code and the
documentation. I would recommend either introducing "thread root" as an
alias for threaded domain here in the documentation or documenting that
"threaded domain = thread root" in the code.

> +  cgroup.type
> +
> +	A read-write single value file which exists on non-root
> +	cgroups.
> +
> +	When read, it indicates the current type of the cgroup, which
> +	can be one of the following values.
> +
> +	- "domain" : A normal valid domain cgroup.
> +
> +	- "domain (threaded)" : A threaded domain cgroup which is
> +          serving as the root of a threaded subtree.
> +
> +	- "domain (invalid)" : A cgroup which is in an invalid state.
> +	  It can't be populated or have controllers enabled.  It may
> +	  be allowed to become a threaded cgroup.
> +
> +	- "threaded" : A threaded cgroup which is a member of a
> +          threaded subtree.
> +
> +	A cgroup can be turned into a threaded cgroup by writing
> +	"threaded" to this file.
> +
>    cgroup.procs
>  	A read-write new-line separated values file which exists on
>  	all cgroups.

Do we need to document that cgroup.procs isn't writable in a threaded
cgroup?

> @@ -4301,6 +4606,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
>  	cgrp->self.parent = &parent->self;
>  	cgrp->root = root;
>  	cgrp->level = level;
> +	cgrp->dom_cgrp = cgrp->dom_cgrp;

It is a no-op. I think it is better to modify it to

+    cgrp->dom_cgrp = cgroup_is_threaded(parent) ? parent->dom_cgrp : cgrp;

Then we won't have an invalid domain state.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/6] cgroup: update debug controller to print out thread mode information
  2017-07-17  2:07 ` [PATCH 6/6] cgroup: update debug controller to print out thread mode information Tejun Heo
@ 2017-07-17 21:19   ` Waiman Long
  2017-07-19 15:31     ` Tejun Heo
  0 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2017-07-17 21:19 UTC (permalink / raw)
  To: Tejun Heo, lizefan, hannes, peterz, mingo
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds, guro

On 07/16/2017 10:07 PM, Tejun Heo wrote:
> From: Waiman Long <longman@redhat.com>
>
> Update debug controller so that it prints out debug info about thread
> mode.
>
>  1) The relationship between proc_cset and threaded_csets are displayed.
>  2) The status of being a thread root or threaded cgroup is displayed.
>
> This patch is extracted from Waiman's larger patch.
>
> Patch-originally-by: Waiman Long <longman@redhat.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/cgroup/cgroup-internal.h |  2 ++
>  kernel/cgroup/cgroup.c          |  4 +--
>  kernel/cgroup/debug.c           | 58 +++++++++++++++++++++++++++++++----------
>  3 files changed, 48 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
> index f10eb19..c167a40 100644
> --- a/kernel/cgroup/cgroup-internal.h
> +++ b/kernel/cgroup/cgroup-internal.h
> @@ -153,6 +153,8 @@ static inline void get_css_set(struct css_set *cset)
>  
>  bool cgroup_ssid_enabled(int ssid);
>  bool cgroup_on_dfl(const struct cgroup *cgrp);
> +bool cgroup_is_thread_root(struct cgroup *cgrp);
> +bool cgroup_is_threaded(struct cgroup *cgrp);
>  
>  struct cgroup_root *cgroup_root_from_kf(struct kernfs_root *kf_root);
>  struct cgroup *task_cgroup_from_root(struct task_struct *task,
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 7097ce4..2f34020 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -333,7 +333,7 @@ static bool cgroup_has_tasks(struct cgroup *cgrp)
>  	return cgrp->nr_populated_csets;
>  }
>  
> -static bool cgroup_is_threaded(struct cgroup *cgrp)
> +bool cgroup_is_threaded(struct cgroup *cgrp)
>  {
>  	return cgrp->dom_cgrp != cgrp;
>  }
> @@ -372,7 +372,7 @@ static bool cgroup_can_be_thread_root(struct cgroup *cgrp)
>  }
>  
>  /* is @cgrp root of a threaded subtree? */
> -static bool cgroup_is_thread_root(struct cgroup *cgrp)
> +bool cgroup_is_thread_root(struct cgroup *cgrp)
>  {
>  	/* thread root should be a domain */
>  	if (cgroup_is_threaded(cgrp))
> diff --git a/kernel/cgroup/debug.c b/kernel/cgroup/debug.c
> index dac46af..062904f 100644
> --- a/kernel/cgroup/debug.c
> +++ b/kernel/cgroup/debug.c
> @@ -114,27 +114,54 @@ static int cgroup_css_links_read(struct seq_file *seq, void *v)
>  {
>  	struct cgroup_subsys_state *css = seq_css(seq);
>  	struct cgrp_cset_link *link;
> -	int dead_cnt = 0, extra_refs = 0;
> +	int dead_cnt = 0, extra_refs = 0, threaded_csets = 0;
>  
>  	spin_lock_irq(&css_set_lock);
> +
> +	if (cgroup_is_thread_root(css->cgroup))
> +		seq_puts(seq, "[thread root]\n");
> +	else if (cgroup_is_threaded(css->cgroup))
> +		seq_puts(seq, "[threaded]\n");

The cgroup status will not be needed anymore as you have introduced the
cgroup.type control file that returns the proper information.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] cgroup: implement cgroup v2 thread support
  2017-07-17 20:56     ` Waiman Long
@ 2017-07-18 14:37       ` Waiman Long
  2017-07-18 17:10         ` Tejun Heo
  0 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2017-07-18 14:37 UTC (permalink / raw)
  To: Peter Zijlstra, Tejun Heo
  Cc: lizefan, hannes, mingo, cgroups, linux-kernel, kernel-team, pjt,
	luto, efault, torvalds, guro

On 07/17/2017 04:56 PM, Waiman Long wrote:
> On 07/17/2017 10:14 AM, Peter Zijlstra wrote:
>> On Sun, Jul 16, 2017 at 10:07:20PM -0400, Tejun Heo wrote:
>>> v4: - Updated to marking each cgroup threaded as suggested by PeterZ.
>>>
>>> +On creation, a cgroup is always a domain cgroup and can be made
>>> +threaded by writing "threaded" to the "cgroup.type" file.  The
>>> +operation is single direction::
>>> +
>>> +  # echo threaded > cgroup.type
>>> +
>>> +Once threaded, the cgroup can't be made a domain again.  To enable the
>>> +thread mode, the following conditions must be met.
>>> +
>>> +- As the cgroup will join the parent's resource domain.  The parent
>>> +  must either be a valid (threaded) domain or a threaded cgroup.
>>> +
>>> +- The cgroup must be empty.  No enabled controllers, child cgroups or
>>> +  processes.
>>> +
>>> +Topology-wise, a cgroup can be in an invalid state.  Please consider
>>> +the following toplogy::
>>> +
>>> +  A (threaded domain) - B (threaded) - C (domain, just created)
>>> +

Thinking about it some more. There is a place for invalid domain. It is
not the child of a threaded cgroup. It is the siblings of a threaded
cgroup whose parent is not root.

       Root - A (domain) - B (domain)
                         \ C (domain)

With "echo threaded > B/cgroup.type":

       Root - A (threaded domain) - B (threaded)
                                  \ C (domain, invalid)

Any children of a threaded cgroup should be threaded.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] cgroup: implement cgroup v2 thread support
  2017-07-18 14:37       ` Waiman Long
@ 2017-07-18 17:10         ` Tejun Heo
  2017-07-18 17:23           ` Waiman Long
  0 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2017-07-18 17:10 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, lizefan, hannes, mingo, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

Hello, Waiman.

On Tue, Jul 18, 2017 at 10:37:41AM -0400, Waiman Long wrote:
> Thinking about it some more. There is a place for invalid domain. It is
> not the child of a threaded cgroup. It is the siblings of a threaded
> cgroup whose parent is not root.
> 
>        Root - A (domain) - B (domain)
>                          \ C (domain)
> 
> With "echo threaded > B/cgroup.type":
> 
>        Root - A (threaded domain) - B (threaded)
>                                   \ C (domain, invalid)

Yes, I noted that when I was replying to Peter.

> Any children of a threaded cgroup should be threaded.

It's really difficult to discuss if you just declare that something
should be a certain way without giving rationale for thinking so.

If we could get rid of the invalid state completely that way, I'd
completely agree with you but that isn't the case here as you noted
yourself, so the choice between the two isn't something trivially
clear.  Both choices come with their pros and cons.  We can absoultely
discuss them comparing the pros and cons.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] cgroup: implement cgroup v2 thread support
  2017-07-18 17:10         ` Tejun Heo
@ 2017-07-18 17:23           ` Waiman Long
  2017-07-19 16:29             ` Tejun Heo
  0 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2017-07-18 17:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, lizefan, hannes, mingo, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

On 07/18/2017 01:10 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Tue, Jul 18, 2017 at 10:37:41AM -0400, Waiman Long wrote:
>> Thinking about it some more. There is a place for invalid domain. It is
>> not the child of a threaded cgroup. It is the siblings of a threaded
>> cgroup whose parent is not root.
>>
>>        Root - A (domain) - B (domain)
>>                          \ C (domain)
>>
>> With "echo threaded > B/cgroup.type":
>>
>>        Root - A (threaded domain) - B (threaded)
>>                                   \ C (domain, invalid)
> Yes, I noted that when I was replying to Peter.
>
>> Any children of a threaded cgroup should be threaded.
> It's really difficult to discuss if you just declare that something
> should be a certain way without giving rationale for thinking so.
>
> If we could get rid of the invalid state completely that way, I'd
> completely agree with you but that isn't the case here as you noted
> yourself, so the choice between the two isn't something trivially
> clear.  Both choices come with their pros and cons.  We can absoultely
> discuss them comparing the pros and cons.
I am not advocating on removing the invalid state now as I note about
sibling cgroups. I am just saying that there is no point in not doing an
automatic conversion to threaded for newly created children of threaded
cgroups (not thread root). I don't see any cons in doing that.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] cgroup: implement cgroup v2 thread support
  2017-07-17 14:26     ` Tejun Heo
@ 2017-07-18 17:28       ` Peter Zijlstra
  2017-07-18 17:35         ` Waiman Long
  2017-07-18 17:54         ` Tejun Heo
  0 siblings, 2 replies; 32+ messages in thread
From: Peter Zijlstra @ 2017-07-18 17:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

On Mon, Jul 17, 2017 at 10:26:09AM -0400, Tejun Heo wrote:
> Hello, Peter.
> 
> On Mon, Jul 17, 2017 at 04:14:09PM +0200, Peter Zijlstra wrote:
> > AFAICT this is not in fact what I suggested... :/
> 
> Heh, sorry about misattributing that.  I was mostly referring to the
> overall idea of marking each cgroup domain or threaded rather than
> subtree.
> 
> > My proposal did not have that invalid state. It would simply refuse to
> > change the type from thread to domain in the case where the parent is
> > not a domain.
> > 
> > Also, my proposal maintained the normal property inheritance rules. A
> > child cgroup's creation 'type' would be that of its parent and not
> > always be 'domain'.
> 
> But aren't both of the above get weird when the parent can host both
> domain and threaded children?
> 
> 	  R
>        /
>       A(D)
> 
> If you create another child B under R, it's naturally gonna be a
> domain.  Let's say you turn that to threaded.
> 
> 	   R
>        /   \
>      A(D) B(T)
> 
> And now try to create another child C, should that be a domain or
> threaded?

Domain of course, as R must be a domain, and hence all its children
start out as such.

> If we only inherit from the second level on, which is in itself
> already confusing, that still leads to invalid configs for non-root
> thread roots.

I don't see how. I don't get the example Waiman gave, what is wrong
with:

	R (D)
	|
	A (D)
       / \
     C(D) B(T) 

? Afaict that's a perfectly valid configuration.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] cgroup: implement cgroup v2 thread support
  2017-07-18 17:28       ` Peter Zijlstra
@ 2017-07-18 17:35         ` Waiman Long
  2017-07-18 17:54         ` Tejun Heo
  1 sibling, 0 replies; 32+ messages in thread
From: Waiman Long @ 2017-07-18 17:35 UTC (permalink / raw)
  To: Peter Zijlstra, Tejun Heo
  Cc: lizefan, hannes, mingo, cgroups, linux-kernel, kernel-team, pjt,
	luto, efault, torvalds, guro

On 07/18/2017 01:28 PM, Peter Zijlstra wrote:
> On Mon, Jul 17, 2017 at 10:26:09AM -0400, Tejun Heo wrote:
>> Hello, Peter.
>>
>> On Mon, Jul 17, 2017 at 04:14:09PM +0200, Peter Zijlstra wrote:
>>> AFAICT this is not in fact what I suggested... :/
>> Heh, sorry about misattributing that.  I was mostly referring to the
>> overall idea of marking each cgroup domain or threaded rather than
>> subtree.
>>
>>> My proposal did not have that invalid state. It would simply refuse to
>>> change the type from thread to domain in the case where the parent is
>>> not a domain.
>>>
>>> Also, my proposal maintained the normal property inheritance rules. A
>>> child cgroup's creation 'type' would be that of its parent and not
>>> always be 'domain'.
>> But aren't both of the above get weird when the parent can host both
>> domain and threaded children?
>>
>> 	  R
>>        /
>>       A(D)
>>
>> If you create another child B under R, it's naturally gonna be a
>> domain.  Let's say you turn that to threaded.
>>
>> 	   R
>>        /   \
>>      A(D) B(T)
>>
>> And now try to create another child C, should that be a domain or
>> threaded?
> Domain of course, as R must be a domain, and hence all its children
> start out as such.
>
>> If we only inherit from the second level on, which is in itself
>> already confusing, that still leads to invalid configs for non-root
>> thread roots.
> I don't see how. I don't get the example Waiman gave, what is wrong
> with:
>
> 	R (D)
> 	|
> 	A (D)
>        / \
>      C(D) B(T) 
>
> ? Afaict that's a perfectly valid configuration.

>From what I understand, C is considered to be in an invalid state
because of the no internal process rule. A in this case is the thread
root, so it can have internal process. If C is a domain and a child of
A, we will have the case that internal processes in A is competing
against cgroup C.

I have been advocating (in one of my RFC patches) that we should relax
the rules to allow internal processes when only threaded controllers are
enabled as they are supposed to be able to handle internal processes.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] cgroup: implement cgroup v2 thread support
  2017-07-18 17:28       ` Peter Zijlstra
  2017-07-18 17:35         ` Waiman Long
@ 2017-07-18 17:54         ` Tejun Heo
  2017-07-18 18:41           ` Peter Zijlstra
  1 sibling, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2017-07-18 17:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

Hello, Peter.

On Tue, Jul 18, 2017 at 07:28:01PM +0200, Peter Zijlstra wrote:
> > And now try to create another child C, should that be a domain or
> > threaded?
> 
> Domain of course, as R must be a domain, and hence all its children
> start out as such.

I don't think it's a matter of course as R also is the root of a
threaded subtree, but this is more or less bikeshedding.

> > If we only inherit from the second level on, which is in itself
> > already confusing, that still leads to invalid configs for non-root
> > thread roots.
> 
> I don't see how. I don't get the example Waiman gave, what is wrong
> with:
> 
> 	R (D)
> 	|
> 	A (D)
>        / \
>      C(D) B(T) 
> 
> ? Afaict that's a perfectly valid configuration.

Okay, we're kinda off the rails now.  Just to verify that we're on the
same page, are you also saying that the following should be a valid
configuration?

	R (D)
        |
	A (D and has processes in it and controllers enabled)
        |
        C (D and has processes in it) 

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] cgroup: implement cgroup v2 thread support
  2017-07-18 17:54         ` Tejun Heo
@ 2017-07-18 18:41           ` Peter Zijlstra
  2017-07-18 18:47             ` Tejun Heo
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2017-07-18 18:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

On Tue, Jul 18, 2017 at 01:54:56PM -0400, Tejun Heo wrote:

> Okay, we're kinda off the rails now.  Just to verify that we're on the
> same page, are you also saying that the following should be a valid
> configuration?
> 
> 	R (D)
>         |
> 	A (D and has processes in it and controllers enabled)
>         |
>         C (D and has processes in it) 
> 

Argh, the no internal process thing again -- I completely forgot about
that :-(

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] cgroup: implement cgroup v2 thread support
  2017-07-18 18:41           ` Peter Zijlstra
@ 2017-07-18 18:47             ` Tejun Heo
  2017-07-19 14:07               ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2017-07-18 18:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

On Tue, Jul 18, 2017 at 08:41:35PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 18, 2017 at 01:54:56PM -0400, Tejun Heo wrote:
> 
> > Okay, we're kinda off the rails now.  Just to verify that we're on the
> > same page, are you also saying that the following should be a valid
> > configuration?
> > 
> > 	R (D)
> >         |
> > 	A (D and has processes in it and controllers enabled)
> >         |
> >         C (D and has processes in it) 
> > 
> 
> Argh, the no internal process thing again -- I completely forgot about
> that :-(

Heh, yeah, we wouldn't be talking about all these otherwise.  The
restriction is pain in the ass but at the same time useful for
full(er)-scope resource control.  Were there other things that caught
your eyes?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] cgroup: implement cgroup v2 thread support
  2017-07-18 18:47             ` Tejun Heo
@ 2017-07-19 14:07               ` Peter Zijlstra
  2017-07-19 16:34                 ` Tejun Heo
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2017-07-19 14:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

On Tue, Jul 18, 2017 at 02:47:14PM -0400, Tejun Heo wrote:
> Were there other things that caught your eyes?

I didn't immediately see the point of "domain (threaded" output, and I
think it might be useful to have a "threaded" column in /proc/cgroups.
But no, I've been over this a few times now and I think we're good.

Yes, I think we can work with this. Thanks!

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/6] cgroup: update debug controller to print out thread mode information
  2017-07-17 21:19   ` Waiman Long
@ 2017-07-19 15:31     ` Tejun Heo
  2017-07-19 15:41       ` Waiman Long
  0 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2017-07-19 15:31 UTC (permalink / raw)
  To: Waiman Long
  Cc: lizefan, hannes, peterz, mingo, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

Hello,

On Mon, Jul 17, 2017 at 05:19:16PM -0400, Waiman Long wrote:
> > +	if (cgroup_is_thread_root(css->cgroup))
> > +		seq_puts(seq, "[thread root]\n");
> > +	else if (cgroup_is_threaded(css->cgroup))
> > +		seq_puts(seq, "[threaded]\n");
> 
> The cgroup status will not be needed anymore as you have introduced the
> cgroup.type control file that returns the proper information.

Ah, right, will drop this file.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] cgroup: implement cgroup v2 thread support
  2017-07-17 21:12   ` Waiman Long
@ 2017-07-19 15:40     ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2017-07-19 15:40 UTC (permalink / raw)
  To: Waiman Long
  Cc: lizefan, hannes, peterz, mingo, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

Hello, Waiman.

On Mon, Jul 17, 2017 at 05:12:31PM -0400, Waiman Long wrote:
> > +Marking a cgroup threaded makes it join the resource domain of its
> > +parent as a threaded cgroup.  The parent may be another threaded
> > +cgroup whose resource domain is further up in the hierarchy.  The root
> > +of a threaded subtree, that is, the nearest ancestor which is not
> > +threaded, is called threaded domain and serves as the resource domain
> > +for the entire subtree.
> 
> The cgroup code uses the term "thread root" in quite a number of places.
> So a developer may be confused when comparing the code and the
> documentation. I would recommend either introducing "thread root" as an
> alias for threaded domain here in the documentation or documenting that
> "threaded domain = thread root" in the code.

Yeah, I was a bit hesitant to introduce an extra term for it, but both
terms make sense and thread root is less cumbersome.  I'll incorporate
it into the doc.

> >    cgroup.procs
> >  	A read-write new-line separated values file which exists on
> >  	all cgroups.
> 
> Do we need to document that cgroup.procs isn't writable in a threaded
> cgroup?

Yeah, will update.

> > @@ -4301,6 +4606,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
> >  	cgrp->self.parent = &parent->self;
> >  	cgrp->root = root;
> >  	cgrp->level = level;
> > +	cgrp->dom_cgrp = cgrp->dom_cgrp;
> 
> It is a no-op. I think it is better to modify it to
> 
> +    cgrp->dom_cgrp = cgroup_is_threaded(parent) ? parent->dom_cgrp : cgrp;
> 
> Then we won't have an invalid domain state.

I'll respond to this on the other sub-thread.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/6] cgroup: update debug controller to print out thread mode information
  2017-07-19 15:31     ` Tejun Heo
@ 2017-07-19 15:41       ` Waiman Long
  2017-07-19 15:44         ` Tejun Heo
  0 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2017-07-19 15:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, peterz, mingo, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

On 07/19/2017 11:31 AM, Tejun Heo wrote:
> Hello,
>
> On Mon, Jul 17, 2017 at 05:19:16PM -0400, Waiman Long wrote:
>>> +	if (cgroup_is_thread_root(css->cgroup))
>>> +		seq_puts(seq, "[thread root]\n");
>>> +	else if (cgroup_is_threaded(css->cgroup))
>>> +		seq_puts(seq, "[threaded]\n");
>> The cgroup status will not be needed anymore as you have introduced the
>> cgroup.type control file that returns the proper information.
> Ah, right, will drop this file.
>
> Thanks.
>
Oh, I am not saying that we should drop the whole patch. I just want to
drop the above 4 line of codes as they are not needed.

Thanks,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/6] cgroup: update debug controller to print out thread mode information
  2017-07-19 15:41       ` Waiman Long
@ 2017-07-19 15:44         ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2017-07-19 15:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: lizefan, hannes, peterz, mingo, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

On Wed, Jul 19, 2017 at 11:41:09AM -0400, Waiman Long wrote:
> On 07/19/2017 11:31 AM, Tejun Heo wrote:
> > Hello,
> >
> > On Mon, Jul 17, 2017 at 05:19:16PM -0400, Waiman Long wrote:
> >>> +	if (cgroup_is_thread_root(css->cgroup))
> >>> +		seq_puts(seq, "[thread root]\n");
> >>> +	else if (cgroup_is_threaded(css->cgroup))
> >>> +		seq_puts(seq, "[threaded]\n");
> >> The cgroup status will not be needed anymore as you have introduced the
> >> cgroup.type control file that returns the proper information.
> > Ah, right, will drop this file.
> >
> > Thanks.
> >
> Oh, I am not saying that we should drop the whole patch. I just want to
> drop the above 4 line of codes as they are not needed.

lol, yeah, that's me mis-writing.  I'm gonna drop only the duplicate
part.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] cgroup: implement cgroup v2 thread support
  2017-07-18 17:23           ` Waiman Long
@ 2017-07-19 16:29             ` Tejun Heo
  2017-07-19 17:09               ` Waiman Long
  0 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2017-07-19 16:29 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, lizefan, hannes, mingo, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

Hello,

On Tue, Jul 18, 2017 at 01:23:14PM -0400, Waiman Long wrote:
> > If we could get rid of the invalid state completely that way, I'd
> > completely agree with you but that isn't the case here as you noted
> > yourself, so the choice between the two isn't something trivially
> > clear.  Both choices come with their pros and cons.  We can absoultely
> > discuss them comparing the pros and cons.
>
> I am not advocating on removing the invalid state now as I note about

Yeah, removing invalid state would be great but we can't at least yet.

> sibling cgroups. I am just saying that there is no point in not doing an
> automatic conversion to threaded for newly created children of threaded
> cgroups (not thread root). I don't see any cons in doing that.

So, the cons I see is inconsistency, now and in the future.

This may seem less clear with system root because we can have both
domain and theraded children below it, which makes a newly created
cgroup being a domain seem natural.  More importantly, we can't do it
any other way because we'd break existing users otherwise - creating a
threaded cgroup would cause future first level cgroups to be threaded
which will be very unexpected.

Let's think about a non-root threaded domain.  At least for now, a
non-root threaded domain is terminal - they can't host valid domain
children.  As the alternative term "thread root" implies, the threaded
domain can be the root of a threaded subtree and nothing else, so it's
kinda weird to make a new child cgroup there start out as a domain
which can't be used, just like it'd be for the second level descendant
cgroup.

However, the alternative is even stranger.  Let's say we make the
first level child automatically threaded, but that is inconsistent
with when we first enable threaded mode.  We either would have to turn
all siblings at the same time or disallow enabling threaded mode if
there are domain siblings, which I fear would be unnecessarily
restrictive.

Another point is that what if we eventually make non-root threaded
roots able to host domain children?  Making children automatically
threaded wouldn't make any sense then, right?  I'll come back to this
later.

So, it looks like if we're gonna automatically turn on threaded mode
for new cgroups, the only thing we can do right now is what you're
suggesting; however, we didn't arrive there through some
straight-forward intuition or overall design.  It started as a simple
idea (I want it to be automatic) but the end result is a contorted
destination shaped by constraints and happenstance.

To me, behaving differently on the first-level threaded children than
on second+ level ones is too strange to be justified by the
convenience of not having to turn on threaded on new cgroups.

On top of that, what happens if we get to implement PeterZ's idea of
skipping over threaded internal cgroups to allow domains under
threaded subtrees?  That'd imply that we'd be able to host domains
under threaded domains too.  The end result would be completely
non-sensical.  We'd be defaulting to different modes for different
reasons where half of those reasons won't hold anymore.  This isn't
surprising given that there's nothing actually consistent about the
suggested default behavior.

So, that's why I think it'd be better to be simple here, even if that
adds a bit of hassle when creating threded children.  It is simple and
consistent and can stay that way even if we make the hierarchy more
flexible in the future.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] cgroup: implement cgroup v2 thread support
  2017-07-19 14:07               ` Peter Zijlstra
@ 2017-07-19 16:34                 ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2017-07-19 16:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lizefan, hannes, mingo, longman, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

Hello, Peter.

On Wed, Jul 19, 2017 at 04:07:28PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 18, 2017 at 02:47:14PM -0400, Tejun Heo wrote:
> > Were there other things that caught your eyes?
> 
> I didn't immediately see the point of "domain (threaded" output, and I

I'll probably drop the parens but I think it's meaningful to show the
state to userland as it affects what users can and can't do.

> think it might be useful to have a "threaded" column in /proc/cgroups.

Will add that.  We *might* want cgroup.controllers_threaded too but we
can think about that later.

> But no, I've been over this a few times now and I think we're good.
> 
> Yes, I think we can work with this. Thanks!

Awesome.  I'll update the patchset.

Thanks a lot for the inputs and patience!

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] cgroup: implement cgroup v2 thread support
  2017-07-19 16:29             ` Tejun Heo
@ 2017-07-19 17:09               ` Waiman Long
  2017-07-19 17:48                 ` Tejun Heo
  0 siblings, 1 reply; 32+ messages in thread
From: Waiman Long @ 2017-07-19 17:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, lizefan, hannes, mingo, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

On 07/19/2017 12:29 PM, Tejun Heo wrote:
> Hello,
>
> On Tue, Jul 18, 2017 at 01:23:14PM -0400, Waiman Long wrote:
>>> If we could get rid of the invalid state completely that way, I'd
>>> completely agree with you but that isn't the case here as you noted
>>> yourself, so the choice between the two isn't something trivially
>>> clear.  Both choices come with their pros and cons.  We can absoultely
>>> discuss them comparing the pros and cons.
>> I am not advocating on removing the invalid state now as I note about
> Yeah, removing invalid state would be great but we can't at least yet.
>
>> sibling cgroups. I am just saying that there is no point in not doing an
>> automatic conversion to threaded for newly created children of threaded
>> cgroups (not thread root). I don't see any cons in doing that.
> So, the cons I see is inconsistency, now and in the future.
>
> This may seem less clear with system root because we can have both
> domain and theraded children below it, which makes a newly created
> cgroup being a domain seem natural.  More importantly, we can't do it
> any other way because we'd break existing users otherwise - creating a
> threaded cgroup would cause future first level cgroups to be threaded
> which will be very unexpected.
>
> Let's think about a non-root threaded domain.  At least for now, a
> non-root threaded domain is terminal - they can't host valid domain
> children.  As the alternative term "thread root" implies, the threaded
> domain can be the root of a threaded subtree and nothing else, so it's
> kinda weird to make a new child cgroup there start out as a domain
> which can't be used, just like it'd be for the second level descendant
> cgroup.
>
> However, the alternative is even stranger.  Let's say we make the
> first level child automatically threaded, but that is inconsistent
> with when we first enable threaded mode.  We either would have to turn
> all siblings at the same time or disallow enabling threaded mode if
> there are domain siblings, which I fear would be unnecessarily
> restrictive.
>
> Another point is that what if we eventually make non-root threaded
> roots able to host domain children?  Making children automatically
> threaded wouldn't make any sense then, right?  I'll come back to this
> later.
>
> So, it looks like if we're gonna automatically turn on threaded mode
> for new cgroups, the only thing we can do right now is what you're
> suggesting; however, we didn't arrive there through some
> straight-forward intuition or overall design.  It started as a simple
> idea (I want it to be automatic) but the end result is a contorted
> destination shaped by constraints and happenstance.
>
> To me, behaving differently on the first-level threaded children than
> on second+ level ones is too strange to be justified by the
> convenience of not having to turn on threaded on new cgroups.

OK, I get your point of being inconsistent. However, I don't think that
is a big deal.

> On top of that, what happens if we get to implement PeterZ's idea of
> skipping over threaded internal cgroups to allow domains under
> threaded subtrees?  That'd imply that we'd be able to host domains
> under threaded domains too.  The end result would be completely
> non-sensical.  We'd be defaulting to different modes for different
> reasons where half of those reasons won't hold anymore.  This isn't
> surprising given that there's nothing actually consistent about the
> suggested default behavior.

For me, that is the only good reason why we should keep the current
behavior. So I am fine with that.

+ cgrp->dom_cgrp = cgrp->dom_cgrp;

However, I am still puzzled by above line of code, should it be just

  cgrp->dom_cgrp = cgrp;

Cheers,
Longman

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] cgroup: implement cgroup v2 thread support
  2017-07-19 17:09               ` Waiman Long
@ 2017-07-19 17:48                 ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2017-07-19 17:48 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, lizefan, hannes, mingo, cgroups, linux-kernel,
	kernel-team, pjt, luto, efault, torvalds, guro

Hello, Waiman.

On Wed, Jul 19, 2017 at 01:09:38PM -0400, Waiman Long wrote:
> For me, that is the only good reason why we should keep the current
> behavior. So I am fine with that.
> 
> + cgrp->dom_cgrp = cgrp->dom_cgrp;
> 
> However, I am still puzzled by above line of code, should it be just
> 
>   cgrp->dom_cgrp = cgrp;

Oh I see.  Yeah, that's just a silly (harmless) bug.  The field gets
properly initialized in init_cgroup_housekeeping().  I'll remove that
line.

Thanks!

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 3/6] cgroup: introduce cgroup->dom_cgrp and threaded css_set handling
  2017-07-19 19:44 [PATCHSET for-4.13] cgroup: implement cgroup2 thread mode, v4 Tejun Heo
@ 2017-07-19 19:44 ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2017-07-19 19:44 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, longman
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds,
	guro, Tejun Heo

cgroup v2 is in the process of growing thread granularity support.  A
threaded subtree is composed of a thread root and threaded cgroups
which are proper members of the subtree.

The root cgroup of the subtree serves as the domain cgroup to which
the processes (as opposed to threads / tasks) of the subtree
conceptually belong and domain-level resource consumptions not tied to
any specific task are charged.  Inside the subtree, threads won't be
subject to process granularity or no-internal-task constraint and can
be distributed arbitrarily across the subtree.

This patch introduces cgroup->dom_cgrp along with threaded css_set
handling.

* cgroup->dom_cgrp points to self for normal and thread roots.  For
  proper thread subtree members, points to the dom_cgrp (the thread
  root).

* css_set->dom_cset points to self if for normal and thread roots.  If
  threaded, points to the css_set which belongs to the cgrp->dom_cgrp.
  The dom_cgrp serves as the resource domain and keeps the matching
  csses available.  The dom_cset holds those csses and makes them
  easily accessible.

* All threaded csets are linked on their dom_csets to enable iteration
  of all threaded tasks.

* cgroup->nr_threaded_children keeps track of the number of threaded
  children.

This patch adds the above but doesn't actually use them yet.  The
following patches will build on top.

v4: ->nr_threaded_children added.

v3: ->proc_cgrp/cset renamed to ->dom_cgrp/cset.  Updated for the new
    enable-threaded-per-cgroup behavior.

v2: Added cgroup_is_threaded() helper.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup-defs.h | 33 +++++++++++++++++++---
 include/linux/cgroup.h      |  3 +-
 kernel/cgroup/cgroup.c      | 69 +++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 97 insertions(+), 8 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ae7bc1e..651c436 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -172,6 +172,14 @@ struct css_set {
 	/* reference count */
 	refcount_t refcount;
 
+	/*
+	 * For a domain cgroup, the following points to self.  If threaded,
+	 * to the matching cset of the nearest domain ancestor.  The
+	 * dom_cset provides access to the domain cgroup and its csses to
+	 * which domain level resource consumptions should be charged.
+	 */
+	struct css_set *dom_cset;
+
 	/* the default cgroup associated with this css_set */
 	struct cgroup *dfl_cgrp;
 
@@ -200,6 +208,10 @@ struct css_set {
 	 */
 	struct list_head e_cset_node[CGROUP_SUBSYS_COUNT];
 
+	/* all threaded csets whose ->dom_cset points to this cset */
+	struct list_head threaded_csets;
+	struct list_head threaded_csets_node;
+
 	/*
 	 * List running through all cgroup groups in the same hash
 	 * slot. Protected by css_set_lock
@@ -267,12 +279,16 @@ struct cgroup {
 	 * doesn't have any tasks.
 	 *
 	 * All children which have non-zero nr_populated_csets and/or
-	 * nr_populated_children of their own contribute one to
-	 * nr_populated_children.  The counter is zero iff this cgroup's
-	 * subtree proper doesn't have any tasks.
+	 * nr_populated_children of their own contribute one to either
+	 * nr_populated_domain_children or nr_populated_threaded_children
+	 * depending on their type.  Each counter is zero iff all cgroups
+	 * of the type in the subtree proper don't have any tasks.
 	 */
 	int nr_populated_csets;
-	int nr_populated_children;
+	int nr_populated_domain_children;
+	int nr_populated_threaded_children;
+
+	int nr_threaded_children;	/* # of live threaded child cgroups */
 
 	struct kernfs_node *kn;		/* cgroup kernfs entry */
 	struct cgroup_file procs_file;	/* handle for "cgroup.procs" */
@@ -311,6 +327,15 @@ struct cgroup {
 	struct list_head e_csets[CGROUP_SUBSYS_COUNT];
 
 	/*
+	 * If !threaded, self.  If threaded, it points to the nearest
+	 * domain ancestor.  Inside a threaded subtree, cgroups are exempt
+	 * from process granularity and no-internal-task constraint.
+	 * Domain level resource consumptions which aren't tied to a
+	 * specific task are charged to the dom_cgrp.
+	 */
+	struct cgroup *dom_cgrp;
+
+	/*
 	 * list of pidlists, up to two for each namespace (one for procs, one
 	 * for tasks); created on demand.
 	 */
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index cae5831..b7dd230 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -541,7 +541,8 @@ static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
 /* no synchronization, the result can only be used as a hint */
 static inline bool cgroup_is_populated(struct cgroup *cgrp)
 {
-	return cgrp->nr_populated_csets + cgrp->nr_populated_children;
+	return cgrp->nr_populated_csets + cgrp->nr_populated_domain_children +
+		cgrp->nr_populated_threaded_children;
 }
 
 /* returns ino associated with a cgroup */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 3c5a37a..c7e1c24 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -330,6 +330,11 @@ static bool cgroup_has_tasks(struct cgroup *cgrp)
 	return cgrp->nr_populated_csets;
 }
 
+static bool cgroup_is_threaded(struct cgroup *cgrp)
+{
+	return cgrp->dom_cgrp != cgrp;
+}
+
 /* subsystems visibly enabled on a cgroup */
 static u16 cgroup_control(struct cgroup *cgrp)
 {
@@ -565,9 +570,11 @@ EXPORT_SYMBOL_GPL(of_css);
  */
 struct css_set init_css_set = {
 	.refcount		= REFCOUNT_INIT(1),
+	.dom_cset		= &init_css_set,
 	.tasks			= LIST_HEAD_INIT(init_css_set.tasks),
 	.mg_tasks		= LIST_HEAD_INIT(init_css_set.mg_tasks),
 	.task_iters		= LIST_HEAD_INIT(init_css_set.task_iters),
+	.threaded_csets		= LIST_HEAD_INIT(init_css_set.threaded_csets),
 	.cgrp_links		= LIST_HEAD_INIT(init_css_set.cgrp_links),
 	.mg_preload_node	= LIST_HEAD_INIT(init_css_set.mg_preload_node),
 	.mg_node		= LIST_HEAD_INIT(init_css_set.mg_node),
@@ -575,6 +582,11 @@ struct css_set init_css_set = {
 
 static int css_set_count	= 1;	/* 1 for init_css_set */
 
+static bool css_set_threaded(struct css_set *cset)
+{
+	return cset->dom_cset != cset;
+}
+
 /**
  * css_set_populated - does a css_set contain any tasks?
  * @cset: target css_set
@@ -618,10 +630,14 @@ static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
 	do {
 		bool was_populated = cgroup_is_populated(cgrp);
 
-		if (!child)
+		if (!child) {
 			cgrp->nr_populated_csets += adj;
-		else
-			cgrp->nr_populated_children += adj;
+		} else {
+			if (cgroup_is_threaded(child))
+				cgrp->nr_populated_threaded_children += adj;
+			else
+				cgrp->nr_populated_domain_children += adj;
+		}
 
 		if (was_populated == cgroup_is_populated(cgrp))
 			break;
@@ -747,6 +763,8 @@ void put_css_set_locked(struct css_set *cset)
 	if (!refcount_dec_and_test(&cset->refcount))
 		return;
 
+	WARN_ON_ONCE(!list_empty(&cset->threaded_csets));
+
 	/* This css_set is dead. unlink it and release cgroup and css refs */
 	for_each_subsys(ss, ssid) {
 		list_del(&cset->e_cset_node[ssid]);
@@ -763,6 +781,11 @@ void put_css_set_locked(struct css_set *cset)
 		kfree(link);
 	}
 
+	if (css_set_threaded(cset)) {
+		list_del(&cset->threaded_csets_node);
+		put_css_set_locked(cset->dom_cset);
+	}
+
 	kfree_rcu(cset, rcu_head);
 }
 
@@ -781,6 +804,7 @@ static bool compare_css_sets(struct css_set *cset,
 			     struct cgroup *new_cgrp,
 			     struct cgroup_subsys_state *template[])
 {
+	struct cgroup *new_dfl_cgrp;
 	struct list_head *l1, *l2;
 
 	/*
@@ -791,6 +815,16 @@ static bool compare_css_sets(struct css_set *cset,
 	if (memcmp(template, cset->subsys, sizeof(cset->subsys)))
 		return false;
 
+
+	/* @cset's domain should match the default cgroup's */
+	if (cgroup_on_dfl(new_cgrp))
+		new_dfl_cgrp = new_cgrp;
+	else
+		new_dfl_cgrp = old_cset->dfl_cgrp;
+
+	if (new_dfl_cgrp->dom_cgrp != cset->dom_cset->dfl_cgrp)
+		return false;
+
 	/*
 	 * Compare cgroup pointers in order to distinguish between
 	 * different cgroups in hierarchies.  As different cgroups may
@@ -998,9 +1032,11 @@ static struct css_set *find_css_set(struct css_set *old_cset,
 	}
 
 	refcount_set(&cset->refcount, 1);
+	cset->dom_cset = cset;
 	INIT_LIST_HEAD(&cset->tasks);
 	INIT_LIST_HEAD(&cset->mg_tasks);
 	INIT_LIST_HEAD(&cset->task_iters);
+	INIT_LIST_HEAD(&cset->threaded_csets);
 	INIT_HLIST_NODE(&cset->hlist);
 	INIT_LIST_HEAD(&cset->cgrp_links);
 	INIT_LIST_HEAD(&cset->mg_preload_node);
@@ -1038,6 +1074,28 @@ static struct css_set *find_css_set(struct css_set *old_cset,
 
 	spin_unlock_irq(&css_set_lock);
 
+	/*
+	 * If @cset should be threaded, look up the matching dom_cset and
+	 * link them up.  We first fully initialize @cset then look for the
+	 * dom_cset.  It's simpler this way and safe as @cset is guaranteed
+	 * to stay empty until we return.
+	 */
+	if (cgroup_is_threaded(cset->dfl_cgrp)) {
+		struct css_set *dcset;
+
+		dcset = find_css_set(cset, cset->dfl_cgrp->dom_cgrp);
+		if (!dcset) {
+			put_css_set(cset);
+			return NULL;
+		}
+
+		spin_lock_irq(&css_set_lock);
+		cset->dom_cset = dcset;
+		list_add_tail(&cset->threaded_csets_node,
+			      &dcset->threaded_csets);
+		spin_unlock_irq(&css_set_lock);
+	}
+
 	return cset;
 }
 
@@ -1680,6 +1738,7 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 	mutex_init(&cgrp->pidlist_mutex);
 	cgrp->self.cgroup = cgrp;
 	cgrp->self.flags |= CSS_ONLINE;
+	cgrp->dom_cgrp = cgrp;
 
 	for_each_subsys(ss, ssid)
 		INIT_LIST_HEAD(&cgrp->e_csets[ssid]);
@@ -4408,6 +4467,7 @@ static void kill_css(struct cgroup_subsys_state *css)
 static int cgroup_destroy_locked(struct cgroup *cgrp)
 	__releases(&cgroup_mutex) __acquires(&cgroup_mutex)
 {
+	struct cgroup *parent = cgroup_parent(cgrp);
 	struct cgroup_subsys_state *css;
 	struct cgrp_cset_link *link;
 	int ssid;
@@ -4452,6 +4512,9 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 	 */
 	kernfs_remove(cgrp->kn);
 
+	if (parent && cgroup_is_threaded(cgrp))
+		parent->nr_threaded_children--;
+
 	cgroup1_check_for_release(cgroup_parent(cgrp));
 
 	/* put the base reference */
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2017-07-19 19:47 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-17  2:07 [PATCHSET for-4.13] cgroup: implement cgroup2 thread mode, v3 Tejun Heo
2017-07-17  2:07 ` [PATCH 1/6] cgroup: reorganize cgroup.procs / task write path Tejun Heo
2017-07-17  2:07 ` [PATCH 2/6] cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS Tejun Heo
2017-07-17  2:07 ` [PATCH 3/6] cgroup: introduce cgroup->dom_cgrp and threaded css_set handling Tejun Heo
2017-07-17  2:07 ` [PATCH 4/6] cgroup: implement CSS_TASK_ITER_THREADED Tejun Heo
2017-07-17  2:07 ` [PATCH 5/6] cgroup: implement cgroup v2 thread support Tejun Heo
2017-07-17 14:14   ` Peter Zijlstra
2017-07-17 14:26     ` Tejun Heo
2017-07-18 17:28       ` Peter Zijlstra
2017-07-18 17:35         ` Waiman Long
2017-07-18 17:54         ` Tejun Heo
2017-07-18 18:41           ` Peter Zijlstra
2017-07-18 18:47             ` Tejun Heo
2017-07-19 14:07               ` Peter Zijlstra
2017-07-19 16:34                 ` Tejun Heo
2017-07-17 20:56     ` Waiman Long
2017-07-18 14:37       ` Waiman Long
2017-07-18 17:10         ` Tejun Heo
2017-07-18 17:23           ` Waiman Long
2017-07-19 16:29             ` Tejun Heo
2017-07-19 17:09               ` Waiman Long
2017-07-19 17:48                 ` Tejun Heo
2017-07-17 21:12   ` Waiman Long
2017-07-19 15:40     ` Tejun Heo
2017-07-17  2:07 ` [PATCH 6/6] cgroup: update debug controller to print out thread mode information Tejun Heo
2017-07-17 21:19   ` Waiman Long
2017-07-19 15:31     ` Tejun Heo
2017-07-19 15:41       ` Waiman Long
2017-07-19 15:44         ` Tejun Heo
2017-07-17 14:48 ` [PATCHSET for-4.13] cgroup: implement cgroup2 thread mode, v3 Waiman Long
2017-07-17 14:51   ` Tejun Heo
2017-07-19 19:44 [PATCHSET for-4.13] cgroup: implement cgroup2 thread mode, v4 Tejun Heo
2017-07-19 19:44 ` [PATCH 3/6] cgroup: introduce cgroup->dom_cgrp and threaded css_set handling Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).