linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH v2 00/17] cgroup: Major changes to cgroup v2 core
@ 2017-05-15 13:33 Waiman Long
  2017-05-15 13:34 ` [RFC PATCH v2 01/17] cgroup: reorganize cgroup.procs / task write path Waiman Long
                   ` (16 more replies)
  0 siblings, 17 replies; 69+ messages in thread
From: Waiman Long @ 2017-05-15 13:33 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, longman

 v1->v2:
  - Add a new pass-through mode to allow each controller its own
    unique virtual hierarchy.
  - Add a new control file "cgroup.resource_control" to let users
    create separate control knobs for internal processes anywhere
    in the v2 hierarchy, instead of doing that automatically in the
    thread root only.
  - More functionality in the debug controller to dump out more
    internal states.
  - Ported to the 4.12 kernel.
  - Other miscellaneous bug fixes.

 v1: https://lwn.net/Articles/720651/

The existing cgroup v2 core has quite a number of limitations and
constraints that make it hard to migrate controllers from v1 to v2
without suffering performance and usability losses.

This patchset makes some major changes to the cgroup v2 core to
give controllers more freedom and flexibility, so that each can
have its own unique view of the virtual process hierarchy that
best suits its own use cases without suffering unneeded performance
problems. So "Live Free or Die".

On the other hand, the existing controller activation mechanism via
the cgroup.subtree_control file remains unchanged. So existing code 
that relies on the current cgroup v2 semantics should not be impacted.

The major changes are:
 1) Getting rid of the no internal process constraint by allowing
    controllers that don't like internal process competition to have
    separate sets of control knobs for internal processes as if they
    are in a child cgroup of their own.
 2) A thread mode for threaded controllers (e.g. cpu) that can
    have unthreaded child cgroups under a thread root.
 3) A pass-through mode that disables a controller in a cgroup,
    effectively collapsing the cgroup's processes into its parent
    from that controller's perspective, while still allowing child
    cgroups to re-enable the controller. This gives each controller
    a unique virtual hierarchy that can be quite different from
    those of other controllers.
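
To illustrate the pass-through semantics in (3), here is a minimal
userspace C sketch (all names and structures are hypothetical, not
kernel code): with a controller disabled in a cgroup, that controller
charges the cgroup's processes to the nearest ancestor where it is
still enabled, so each controller sees its own collapsed hierarchy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical userspace model, not kernel code: each cgroup records,
 * per controller, whether that controller is enabled on it.  With
 * pass-through, a disabled controller charges a cgroup's processes to
 * the nearest ancestor that still has it enabled, so every controller
 * effectively sees its own collapsed hierarchy.
 */
enum { CTRL_CPU, CTRL_MEMORY, NR_CTRL };

struct toy_cgroup {
	struct toy_cgroup *parent;	/* NULL for the root */
	bool enabled[NR_CTRL];		/* controller enabled here? */
};

/*
 * Nearest ancestor (or self) with @ctrl enabled; the root is the
 * final fallback since every controller is active at the root.
 */
static struct toy_cgroup *toy_effective_cgroup(struct toy_cgroup *cgrp,
					       int ctrl)
{
	while (cgrp->parent && !cgrp->enabled[ctrl])
		cgrp = cgrp->parent;
	return cgrp;
}
```

With cpu disabled in both a child and a grandchild but memory enabled
in the child, the grandchild's processes fall back to the root for cpu
purposes while memory sees a different, shallower hierarchy.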

This patchset incorporates the following 2 patchsets from Tejun Heo:

 1) cgroup v2 thread mode patchset (Patches 1-5)
    https://lkml.org/lkml/2017/2/2/592
 2) CPU Controller on Control Group v2 (Patches 15 & 16)
    https://lkml.org/lkml/2016/8/5/368

Patch 6 fixes a task_struct reference counting bug introduced in
patch 1.

Patch 7 fixes a problem that css_kill() may be called more than once.

Patch 8 moves the debug cgroup out from cgroup_v1.c into its own
file.

Patch 9 keeps more accurate counts of the number of tasks associated
with each css_set.

Patch 10 enhances the debug controller to provide more information
relevant to the cgroup v2 thread mode to ease debugging effort.

Patch 11 implements the enhanced cgroup v2 thread mode with the
following enhancements:

 1) Thread roots are treated differently from threaded cgroups.
 2) Thread root can now have non-threaded controllers enabled as well
    as non-threaded children.

Patch 12 gets rid of the no internal process constraint.

Patch 13 enables fine grained control of controllers including a new
pass-through mode.

Patch 14 enhances the debug controller to print out the virtual
hierarchies for each controller in cgroup v2.

Patch 17 makes both cpu and cpuacct controllers threaded.

Tejun Heo (7):
  cgroup: reorganize cgroup.procs / task write path
  cgroup: add @flags to css_task_iter_start() and implement
    CSS_TASK_ITER_PROCS
  cgroup: introduce cgroup->proc_cgrp and threaded css_set handling
  cgroup: implement CSS_TASK_ITER_THREADED
  cgroup: implement cgroup v2 thread support
  sched: Misc preps for cgroup unified hierarchy interface
  sched: Implement interface for cgroup unified hierarchy

Waiman Long (10):
  cgroup: Fix reference counting bug in cgroup_procs_write()
  cgroup: Prevent kill_css() from being called more than once
  cgroup: Move debug cgroup to its own file
  cgroup: Keep accurate count of tasks in each css_set
  cgroup: Make debug cgroup support v2 and thread mode
  cgroup: Implement new thread mode semantics
  cgroup: Remove cgroup v2 no internal process constraint
  cgroup: Allow fine-grained controllers control in cgroup v2
  cgroup: Enable printing of v2 controllers' cgroup hierarchy
  sched: Make cpu/cpuacct threaded controllers

 Documentation/cgroup-v2.txt     |  287 +++++++--
 include/linux/cgroup-defs.h     |   68 ++
 include/linux/cgroup.h          |   12 +-
 kernel/cgroup/Makefile          |    1 +
 kernel/cgroup/cgroup-internal.h |   19 +-
 kernel/cgroup/cgroup-v1.c       |  220 ++-----
 kernel/cgroup/cgroup.c          | 1317 ++++++++++++++++++++++++++++++++-------
 kernel/cgroup/cpuset.c          |    6 +-
 kernel/cgroup/debug.c           |  471 ++++++++++++++
 kernel/cgroup/freezer.c         |    6 +-
 kernel/cgroup/pids.c            |    1 +
 kernel/events/core.c            |    1 +
 kernel/sched/core.c             |  150 ++++-
 kernel/sched/cpuacct.c          |   55 +-
 kernel/sched/cpuacct.h          |    5 +
 mm/memcontrol.c                 |    2 +-
 net/core/netclassid_cgroup.c    |    2 +-
 17 files changed, 2148 insertions(+), 475 deletions(-)
 create mode 100644 kernel/cgroup/debug.c

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [RFC PATCH v2 01/17] cgroup: reorganize cgroup.procs / task write path
  2017-05-15 13:33 [RFC PATCH v2 00/17] cgroup: Major changes to cgroup v2 core Waiman Long
@ 2017-05-15 13:34 ` Waiman Long
  2017-05-15 13:34 ` [RFC PATCH v2 02/17] cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS Waiman Long
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 69+ messages in thread
From: Waiman Long @ 2017-05-15 13:34 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, longman

From: Tejun Heo <tj@kernel.org>

Currently, writes to the "cgroup.procs" and "cgroup.tasks" files are
all handled by __cgroup_procs_write() on both v1 and v2.  This patch
reorganizes the write path so that there are common helper functions
that the different write paths use.

While this somewhat increases LOC, the different paths are no longer
intertwined and each path has more flexibility to implement different
behaviors which will be necessary for the planned v2 thread support.
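
The new cgroup_procs_write_start() helper returns either a valid
task_struct pointer or an errno encoded with ERR_PTR(), which the
callers unpack with PTR_ERR_OR_ZERO().  As a rough userspace sketch of
that encoding (simplified from the kernel's include/linux/err.h, so
treat the details as illustrative):

```c
#include <assert.h>

/*
 * Userspace re-implementation of the kernel's pointer-error encoding:
 * small negative errno values live at the very top of the address
 * range, so a single return value can carry either a valid pointer or
 * an error code.  Simplified from include/linux/err.h.
 */
#define MAX_ERRNO	4095

static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* 0 for a valid pointer, the errno otherwise */
static inline long PTR_ERR_OR_ZERO(const void *ptr)
{
	return IS_ERR(ptr) ? PTR_ERR(ptr) : 0;
}
```

This is why a caller can write "ret = PTR_ERR_OR_ZERO(task); if (ret)
goto out_unlock;" and treat the two outcomes uniformly.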

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup/cgroup-internal.h |   8 +-
 kernel/cgroup/cgroup-v1.c       |  58 ++++++++++++--
 kernel/cgroup/cgroup.c          | 163 +++++++++++++++++++++-------------------
 3 files changed, 142 insertions(+), 87 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 00f4d6b..f0a0dba 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -180,10 +180,10 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 
 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 		       bool threadgroup);
-ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
-			     size_t nbytes, loff_t off, bool threadgroup);
-ssize_t cgroup_procs_write(struct kernfs_open_file *of, char *buf, size_t nbytes,
-			   loff_t off);
+struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup)
+	__acquires(&cgroup_threadgroup_rwsem);
+void cgroup_procs_write_finish(void)
+	__releases(&cgroup_threadgroup_rwsem);
 
 void cgroup_lock_and_drain_offline(struct cgroup *cgrp);
 
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 85d7515..f13ccab 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -514,10 +514,58 @@ static int cgroup_pidlist_show(struct seq_file *s, void *v)
 	return 0;
 }
 
-static ssize_t cgroup_tasks_write(struct kernfs_open_file *of,
-				  char *buf, size_t nbytes, loff_t off)
+static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of,
+				     char *buf, size_t nbytes, loff_t off,
+				     bool threadgroup)
 {
-	return __cgroup_procs_write(of, buf, nbytes, off, false);
+	struct cgroup *cgrp;
+	struct task_struct *task;
+	const struct cred *cred, *tcred;
+	ssize_t ret;
+
+	cgrp = cgroup_kn_lock_live(of->kn, false);
+	if (!cgrp)
+		return -ENODEV;
+
+	task = cgroup_procs_write_start(buf, threadgroup);
+	ret = PTR_ERR_OR_ZERO(task);
+	if (ret)
+		goto out_unlock;
+
+	/*
+	 * Even if we're attaching all tasks in the thread group, we only
+	 * need to check permissions on one of them.
+	 */
+	cred = current_cred();
+	tcred = get_task_cred(task);
+	if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
+	    !uid_eq(cred->euid, tcred->uid) &&
+	    !uid_eq(cred->euid, tcred->suid))
+		ret = -EACCES;
+	put_cred(tcred);
+	if (ret)
+		goto out_finish;
+
+	ret = cgroup_attach_task(cgrp, task, threadgroup);
+
+out_finish:
+	cgroup_procs_write_finish();
+out_unlock:
+	cgroup_kn_unlock(of->kn);
+
+	return ret ?: nbytes;
+}
+
+static ssize_t cgroup1_procs_write(struct kernfs_open_file *of,
+				   char *buf, size_t nbytes, loff_t off)
+{
+	return __cgroup1_procs_write(of, buf, nbytes, off, true);
+}
+
+static ssize_t cgroup1_tasks_write(struct kernfs_open_file *of,
+				   char *buf, size_t nbytes, loff_t off)
+{
+	return __cgroup1_procs_write(of, buf, nbytes, off, false);
 }
 
 static ssize_t cgroup_release_agent_write(struct kernfs_open_file *of,
@@ -596,7 +644,7 @@ struct cftype cgroup1_base_files[] = {
 		.seq_stop = cgroup_pidlist_stop,
 		.seq_show = cgroup_pidlist_show,
 		.private = CGROUP_FILE_PROCS,
-		.write = cgroup_procs_write,
+		.write = cgroup1_procs_write,
 	},
 	{
 		.name = "cgroup.clone_children",
@@ -615,7 +663,7 @@ struct cftype cgroup1_base_files[] = {
 		.seq_stop = cgroup_pidlist_stop,
 		.seq_show = cgroup_pidlist_show,
 		.private = CGROUP_FILE_TASKS,
-		.write = cgroup_tasks_write,
+		.write = cgroup1_tasks_write,
 	},
 	{
 		.name = "notify_on_release",
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c3c9a0e..1cf2409 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1919,6 +1919,23 @@ int task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
 }
 EXPORT_SYMBOL_GPL(task_cgroup_path);
 
+static struct cgroup *cgroup_migrate_common_ancestor(struct task_struct *task,
+						     struct cgroup *dst_cgrp)
+{
+	struct cgroup *cgrp;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	spin_lock_irq(&css_set_lock);
+	cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
+	spin_unlock_irq(&css_set_lock);
+
+	while (!cgroup_is_descendant(dst_cgrp, cgrp))
+		cgrp = cgroup_parent(cgrp);
+
+	return cgrp;
+}
+
 /**
  * cgroup_migrate_add_task - add a migration target task to a migration context
  * @task: target task
@@ -2351,76 +2368,23 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 	return ret;
 }
 
-static int cgroup_procs_write_permission(struct task_struct *task,
-					 struct cgroup *dst_cgrp,
-					 struct kernfs_open_file *of)
-{
-	int ret = 0;
-
-	if (cgroup_on_dfl(dst_cgrp)) {
-		struct super_block *sb = of->file->f_path.dentry->d_sb;
-		struct cgroup *cgrp;
-		struct inode *inode;
-
-		spin_lock_irq(&css_set_lock);
-		cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
-		spin_unlock_irq(&css_set_lock);
-
-		while (!cgroup_is_descendant(dst_cgrp, cgrp))
-			cgrp = cgroup_parent(cgrp);
-
-		ret = -ENOMEM;
-		inode = kernfs_get_inode(sb, cgrp->procs_file.kn);
-		if (inode) {
-			ret = inode_permission(inode, MAY_WRITE);
-			iput(inode);
-		}
-	} else {
-		const struct cred *cred = current_cred();
-		const struct cred *tcred = get_task_cred(task);
-
-		/*
-		 * even if we're attaching all tasks in the thread group,
-		 * we only need to check permissions on one of them.
-		 */
-		if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
-		    !uid_eq(cred->euid, tcred->uid) &&
-		    !uid_eq(cred->euid, tcred->suid))
-			ret = -EACCES;
-		put_cred(tcred);
-	}
-
-	return ret;
-}
-
-/*
- * Find the task_struct of the task to attach by vpid and pass it along to the
- * function to attach either it or all tasks in its threadgroup. Will lock
- * cgroup_mutex and threadgroup.
- */
-ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
-			     size_t nbytes, loff_t off, bool threadgroup)
+struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup)
+	__acquires(&cgroup_threadgroup_rwsem)
 {
 	struct task_struct *tsk;
-	struct cgroup_subsys *ss;
-	struct cgroup *cgrp;
 	pid_t pid;
-	int ssid, ret;
 
 	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
-		return -EINVAL;
-
-	cgrp = cgroup_kn_lock_live(of->kn, false);
-	if (!cgrp)
-		return -ENODEV;
+		return ERR_PTR(-EINVAL);
 
 	percpu_down_write(&cgroup_threadgroup_rwsem);
+
 	rcu_read_lock();
 	if (pid) {
 		tsk = find_task_by_vpid(pid);
 		if (!tsk) {
-			ret = -ESRCH;
-			goto out_unlock_rcu;
+			tsk = ERR_PTR(-ESRCH);
+			goto out_unlock_threadgroup;
 		}
 	} else {
 		tsk = current;
@@ -2436,35 +2400,30 @@ ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 	 * cgroup with no rt_runtime allocated.  Just say no.
 	 */
 	if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
-		ret = -EINVAL;
-		goto out_unlock_rcu;
+		tsk = ERR_PTR(-EINVAL);
+		goto out_unlock_threadgroup;
 	}
 
 	get_task_struct(tsk);
-	rcu_read_unlock();
-
-	ret = cgroup_procs_write_permission(tsk, cgrp, of);
-	if (!ret)
-		ret = cgroup_attach_task(cgrp, tsk, threadgroup);
-
-	put_task_struct(tsk);
-	goto out_unlock_threadgroup;
+	goto out_unlock_rcu;
 
+out_unlock_threadgroup:
+	percpu_up_write(&cgroup_threadgroup_rwsem);
 out_unlock_rcu:
 	rcu_read_unlock();
-out_unlock_threadgroup:
+	return tsk;
+}
+
+void cgroup_procs_write_finish(void)
+	__releases(&cgroup_threadgroup_rwsem)
+{
+	struct cgroup_subsys *ss;
+	int ssid;
+
 	percpu_up_write(&cgroup_threadgroup_rwsem);
 	for_each_subsys(ss, ssid)
 		if (ss->post_attach)
 			ss->post_attach();
-	cgroup_kn_unlock(of->kn);
-	return ret ?: nbytes;
-}
-
-ssize_t cgroup_procs_write(struct kernfs_open_file *of, char *buf, size_t nbytes,
-			   loff_t off)
-{
-	return __cgroup_procs_write(of, buf, nbytes, off, true);
 }
 
 static void cgroup_print_ss_mask(struct seq_file *seq, u16 ss_mask)
@@ -3788,6 +3747,54 @@ static int cgroup_procs_show(struct seq_file *s, void *v)
 	return 0;
 }
 
+static int cgroup_procs_write_permission(struct cgroup *cgrp,
+					 struct super_block *sb)
+{
+	struct inode *inode;
+	int ret;
+
+	inode = kernfs_get_inode(sb, cgrp->procs_file.kn);
+	if (!inode)
+		return -ENOMEM;
+
+	ret = inode_permission(inode, MAY_WRITE);
+	iput(inode);
+	return ret;
+}
+
+static ssize_t cgroup_procs_write(struct kernfs_open_file *of,
+				  char *buf, size_t nbytes, loff_t off)
+{
+	struct cgroup *cgrp, *common_ancestor;
+	struct task_struct *task;
+	ssize_t ret;
+
+	cgrp = cgroup_kn_lock_live(of->kn, false);
+	if (!cgrp)
+		return -ENODEV;
+
+	task = cgroup_procs_write_start(buf, true);
+	ret = PTR_ERR_OR_ZERO(task);
+	if (ret)
+		goto out_unlock;
+
+	common_ancestor = cgroup_migrate_common_ancestor(task, cgrp);
+
+	ret = cgroup_procs_write_permission(common_ancestor,
+					    of->file->f_path.dentry->d_sb);
+	if (ret)
+		goto out_finish;
+
+	ret = cgroup_attach_task(cgrp, task, true);
+
+out_finish:
+	cgroup_procs_write_finish();
+out_unlock:
+	cgroup_kn_unlock(of->kn);
+
+	return ret ?: nbytes;
+}
+
 /* cgroup core interface files for the default hierarchy */
 static struct cftype cgroup_base_files[] = {
 	{
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [RFC PATCH v2 02/17] cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS
  2017-05-15 13:33 [RFC PATCH v2 00/17] cgroup: Major changes to cgroup v2 core Waiman Long
  2017-05-15 13:34 ` [RFC PATCH v2 01/17] cgroup: reorganize cgroup.procs / task write path Waiman Long
@ 2017-05-15 13:34 ` Waiman Long
  2017-05-15 13:34 ` [RFC PATCH v2 03/17] cgroup: introduce cgroup->proc_cgrp and threaded css_set handling Waiman Long
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 69+ messages in thread
From: Waiman Long @ 2017-05-15 13:34 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, longman

From: Tejun Heo <tj@kernel.org>

css_task_iter currently always walks all tasks.  With the scheduled
cgroup v2 thread support, the iterator would need to handle multiple
types of iteration.  As a preparation, add @flags to
css_task_iter_start() and implement CSS_TASK_ITER_PROCS.  If the flag
is not specified, it walks all tasks as before.  When asserted, the
iterator only walks the group leaders.

For now, the only user of the flag is cgroup v2 "cgroup.procs" file
which no longer needs to skip non-leader tasks in cgroup_procs_next().
Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
cgroup" but "list all thread group id's with any threads in the
cgroup".

While at it, update cgroup_procs_show() to use task_pid_vnr() instead
of task_tgid_vnr().  As the iteration guarantees that the function
only sees group leaders, this doesn't change the output and will allow
sharing the function for thread iteration.
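
The skip-non-leaders behavior can be modeled in plain userspace C (a
toy array-backed iterator, not the kernel's css_set walker; all names
here are made up): when the PROCS flag is set, next() keeps advancing
past tasks that aren't thread-group leaders, mirroring the repeat loop
added to css_task_iter_advance().

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ITER_PROCS	(1U << 0)	/* walk only group leaders */

struct toy_task {
	int pid;
	int tgid;	/* a task is the leader iff pid == tgid */
};

struct toy_iter {
	const struct toy_task *pos, *end;
	unsigned int flags;
};

static void toy_iter_start(struct toy_iter *it, const struct toy_task *tasks,
			   size_t n, unsigned int flags)
{
	it->pos = tasks;
	it->end = tasks + n;
	it->flags = flags;
}

/* Return the next matching task, or NULL when iteration is done. */
static const struct toy_task *toy_iter_next(struct toy_iter *it)
{
	while (it->pos < it->end) {
		const struct toy_task *t = it->pos++;

		/* if PROCS, skip over tasks which aren't group leaders */
		if ((it->flags & TOY_ITER_PROCS) && t->pid != t->tgid)
			continue;
		return t;
	}
	return NULL;
}
```

With flags == 0 the iterator visits every task, matching the existing
behavior of css_task_iter_start() callers that pass 0.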

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup.h       |  6 +++++-
 kernel/cgroup/cgroup-v1.c    |  6 +++---
 kernel/cgroup/cgroup.c       | 24 ++++++++++++++----------
 kernel/cgroup/cpuset.c       |  6 +++---
 kernel/cgroup/freezer.c      |  6 +++---
 mm/memcontrol.c              |  2 +-
 net/core/netclassid_cgroup.c |  2 +-
 7 files changed, 30 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ed2573e..3568aa1 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -36,9 +36,13 @@
 #define CGROUP_WEIGHT_DFL		100
 #define CGROUP_WEIGHT_MAX		10000
 
+/* walk only threadgroup leaders */
+#define CSS_TASK_ITER_PROCS		(1U << 0)
+
 /* a css_task_iter should be treated as an opaque object */
 struct css_task_iter {
 	struct cgroup_subsys		*ss;
+	unsigned int			flags;
 
 	struct list_head		*cset_pos;
 	struct list_head		*cset_head;
@@ -129,7 +133,7 @@ struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset,
 struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset,
 					struct cgroup_subsys_state **dst_cssp);
 
-void css_task_iter_start(struct cgroup_subsys_state *css,
+void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
 			 struct css_task_iter *it);
 struct task_struct *css_task_iter_next(struct css_task_iter *it);
 void css_task_iter_end(struct css_task_iter *it);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index f13ccab..c212856 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -121,7 +121,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 	 * ->can_attach() fails.
 	 */
 	do {
-		css_task_iter_start(&from->self, &it);
+		css_task_iter_start(&from->self, 0, &it);
 		task = css_task_iter_next(&it);
 		if (task)
 			get_task_struct(task);
@@ -377,7 +377,7 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
 	if (!array)
 		return -ENOMEM;
 	/* now, populate the array */
-	css_task_iter_start(&cgrp->self, &it);
+	css_task_iter_start(&cgrp->self, 0, &it);
 	while ((tsk = css_task_iter_next(&it))) {
 		if (unlikely(n == length))
 			break;
@@ -753,7 +753,7 @@ int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry)
 	}
 	rcu_read_unlock();
 
-	css_task_iter_start(&cgrp->self, &it);
+	css_task_iter_start(&cgrp->self, 0, &it);
 	while ((tsk = css_task_iter_next(&it))) {
 		switch (tsk->state) {
 		case TASK_RUNNING:
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1cf2409..8e3a5c8 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3595,6 +3595,7 @@ static void css_task_iter_advance(struct css_task_iter *it)
 	lockdep_assert_held(&css_set_lock);
 	WARN_ON_ONCE(!l);
 
+repeat:
 	/*
 	 * Advance iterator to find next entry.  cset->tasks is consumed
 	 * first and then ->mg_tasks.  After ->mg_tasks, we move onto the
@@ -3609,11 +3610,18 @@ static void css_task_iter_advance(struct css_task_iter *it)
 		css_task_iter_advance_css_set(it);
 	else
 		it->task_pos = l;
+
+	/* if PROCS, skip over tasks which aren't group leaders */
+	if ((it->flags & CSS_TASK_ITER_PROCS) && it->task_pos &&
+	    !thread_group_leader(list_entry(it->task_pos, struct task_struct,
+					    cg_list)))
+		goto repeat;
 }
 
 /**
  * css_task_iter_start - initiate task iteration
  * @css: the css to walk tasks of
+ * @flags: CSS_TASK_ITER_* flags
  * @it: the task iterator to use
  *
  * Initiate iteration through the tasks of @css.  The caller can call
@@ -3621,7 +3629,7 @@ static void css_task_iter_advance(struct css_task_iter *it)
  * returns NULL.  On completion of iteration, css_task_iter_end() must be
  * called.
  */
-void css_task_iter_start(struct cgroup_subsys_state *css,
+void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
 			 struct css_task_iter *it)
 {
 	/* no one should try to iterate before mounting cgroups */
@@ -3632,6 +3640,7 @@ void css_task_iter_start(struct cgroup_subsys_state *css,
 	spin_lock_irq(&css_set_lock);
 
 	it->ss = css->ss;
+	it->flags = flags;
 
 	if (it->ss)
 		it->cset_pos = &css->cgroup->e_csets[css->ss->id];
@@ -3705,13 +3714,8 @@ static void *cgroup_procs_next(struct seq_file *s, void *v, loff_t *pos)
 {
 	struct kernfs_open_file *of = s->private;
 	struct css_task_iter *it = of->priv;
-	struct task_struct *task;
-
-	do {
-		task = css_task_iter_next(it);
-	} while (task && !thread_group_leader(task));
 
-	return task;
+	return css_task_iter_next(it);
 }
 
 static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
@@ -3732,10 +3736,10 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 		if (!it)
 			return ERR_PTR(-ENOMEM);
 		of->priv = it;
-		css_task_iter_start(&cgrp->self, it);
+		css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS, it);
 	} else if (!(*pos)++) {
 		css_task_iter_end(it);
-		css_task_iter_start(&cgrp->self, it);
+		css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS, it);
 	}
 
 	return cgroup_procs_next(s, NULL, NULL);
@@ -3743,7 +3747,7 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 
 static int cgroup_procs_show(struct seq_file *s, void *v)
 {
-	seq_printf(s, "%d\n", task_tgid_vnr(v));
+	seq_printf(s, "%d\n", task_pid_vnr(v));
 	return 0;
 }
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index f6501f4..204361a 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -861,7 +861,7 @@ static void update_tasks_cpumask(struct cpuset *cs)
 	struct css_task_iter it;
 	struct task_struct *task;
 
-	css_task_iter_start(&cs->css, &it);
+	css_task_iter_start(&cs->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
 		set_cpus_allowed_ptr(task, cs->effective_cpus);
 	css_task_iter_end(&it);
@@ -1106,7 +1106,7 @@ static void update_tasks_nodemask(struct cpuset *cs)
 	 * It's ok if we rebind the same mm twice; mpol_rebind_mm()
 	 * is idempotent.  Also migrate pages in each mm to new nodes.
 	 */
-	css_task_iter_start(&cs->css, &it);
+	css_task_iter_start(&cs->css, 0, &it);
 	while ((task = css_task_iter_next(&it))) {
 		struct mm_struct *mm;
 		bool migrate;
@@ -1299,7 +1299,7 @@ static void update_tasks_flags(struct cpuset *cs)
 	struct css_task_iter it;
 	struct task_struct *task;
 
-	css_task_iter_start(&cs->css, &it);
+	css_task_iter_start(&cs->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
 		cpuset_update_task_spread_flag(cs, task);
 	css_task_iter_end(&it);
diff --git a/kernel/cgroup/freezer.c b/kernel/cgroup/freezer.c
index 1b72d56..0823679 100644
--- a/kernel/cgroup/freezer.c
+++ b/kernel/cgroup/freezer.c
@@ -268,7 +268,7 @@ static void update_if_frozen(struct cgroup_subsys_state *css)
 	rcu_read_unlock();
 
 	/* are all tasks frozen? */
-	css_task_iter_start(css, &it);
+	css_task_iter_start(css, 0, &it);
 
 	while ((task = css_task_iter_next(&it))) {
 		if (freezing(task)) {
@@ -320,7 +320,7 @@ static void freeze_cgroup(struct freezer *freezer)
 	struct css_task_iter it;
 	struct task_struct *task;
 
-	css_task_iter_start(&freezer->css, &it);
+	css_task_iter_start(&freezer->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
 		freeze_task(task);
 	css_task_iter_end(&it);
@@ -331,7 +331,7 @@ static void unfreeze_cgroup(struct freezer *freezer)
 	struct css_task_iter it;
 	struct task_struct *task;
 
-	css_task_iter_start(&freezer->css, &it);
+	css_task_iter_start(&freezer->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
 		__thaw_task(task);
 	css_task_iter_end(&it);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d75b38b..fafcefa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -917,7 +917,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 		struct css_task_iter it;
 		struct task_struct *task;
 
-		css_task_iter_start(&iter->css, &it);
+		css_task_iter_start(&iter->css, 0, &it);
 		while (!ret && (task = css_task_iter_next(&it)))
 			ret = fn(task, arg);
 		css_task_iter_end(&it);
diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index 029a61a..5e4f040 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -100,7 +100,7 @@ static int write_classid(struct cgroup_subsys_state *css, struct cftype *cft,
 
 	cs->classid = (u32)value;
 
-	css_task_iter_start(css, &it);
+	css_task_iter_start(css, 0, &it);
 	while ((p = css_task_iter_next(&it))) {
 		task_lock(p);
 		iterate_fd(p->files, 0, update_classid_sock,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [RFC PATCH v2 03/17] cgroup: introduce cgroup->proc_cgrp and threaded css_set handling
  2017-05-15 13:33 [RFC PATCH v2 00/17] cgroup: Major changes to cgroup v2 core Waiman Long
  2017-05-15 13:34 ` [RFC PATCH v2 01/17] cgroup: reorganize cgroup.procs / task write path Waiman Long
  2017-05-15 13:34 ` [RFC PATCH v2 02/17] cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS Waiman Long
@ 2017-05-15 13:34 ` Waiman Long
  2017-05-15 13:34 ` [RFC PATCH v2 04/17] cgroup: implement CSS_TASK_ITER_THREADED Waiman Long
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 69+ messages in thread
From: Waiman Long @ 2017-05-15 13:34 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, longman

From: Tejun Heo <tj@kernel.org>

cgroup v2 is in the process of growing thread granularity support.
Once thread mode is enabled, the root cgroup of the subtree serves as
the proc_cgrp, to which the processes of the subtree conceptually
belong and to which domain-level resource consumptions not tied to any
specific task are charged.  Within the subtree, threads aren't subject
to the process granularity or no-internal-task constraints and can be
distributed arbitrarily across the subtree.

This patch introduces cgroup->proc_cgrp along with threaded css_set
handling.

* cgroup->proc_cgrp is NULL if !threaded.  If threaded, points to the
  proc_cgrp (root of the threaded subtree).

* css_set->proc_cset points to self if !threaded.  If threaded, points
  to the css_set which belongs to the cgrp->proc_cgrp.  The proc_cgrp
  serves as the resource domain and needs the matching csses readily
  available.  The proc_cset holds those csses and makes them easily
  accessible.

* All threaded csets are linked on their proc_csets to enable
  iteration of all threaded tasks.

This patch adds the above but doesn't actually use them yet.  The
following patches will build on top.
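
The self-pointer convention for ->proc_cset can be sketched in
userspace C (RCU, refcounting and locking omitted; the toy_* names are
illustrative): a css_set points at itself when it isn't threaded and
at the threaded subtree root's css_set otherwise, so "is this cset
threaded?" reduces to a self-pointer test, as in css_set_threaded().

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace sketch of the ->proc_cset convention.  In the kernel the
 * pointer is __rcu and protected by css_set_lock; none of that is
 * modeled here.
 */
struct toy_cset {
	struct toy_cset *proc_cset;
};

static void toy_cset_init(struct toy_cset *cset)
{
	cset->proc_cset = cset;		/* !threaded: points to self */
}

static void toy_cset_make_threaded(struct toy_cset *cset,
				   struct toy_cset *proc_cset)
{
	cset->proc_cset = proc_cset;	/* threaded: points to root's cset */
}

/* mirrors css_set_threaded() from this patch */
static bool toy_cset_threaded(const struct toy_cset *cset)
{
	return cset->proc_cset != cset;
}
```

The same trick is used at the cgroup level: cgroup->proc_cgrp is NULL
when not threaded and points to the top of the threaded subtree
otherwise.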

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup-defs.h | 22 ++++++++++++
 kernel/cgroup/cgroup.c      | 87 +++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 103 insertions(+), 6 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 2174594..3f3cfdd 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -162,6 +162,15 @@ struct css_set {
 	/* reference count */
 	refcount_t refcount;
 
+	/*
+	 * If not threaded, the following points to self.  If threaded, to
+	 * a cset which belongs to the top cgroup of the threaded subtree.
+	 * The proc_cset provides access to the process cgroup and its
+	 * csses to which domain level resource consumptions should be
+	 * charged.
+	 */
+	struct css_set __rcu *proc_cset;
+
 	/* the default cgroup associated with this css_set */
 	struct cgroup *dfl_cgrp;
 
@@ -187,6 +196,10 @@ struct css_set {
 	 */
 	struct list_head e_cset_node[CGROUP_SUBSYS_COUNT];
 
+	/* all csets whose ->proc_cset points to this cset */
+	struct list_head threaded_csets;
+	struct list_head threaded_csets_node;
+
 	/*
 	 * List running through all cgroup groups in the same hash
 	 * slot. Protected by css_set_lock
@@ -293,6 +306,15 @@ struct cgroup {
 	struct list_head e_csets[CGROUP_SUBSYS_COUNT];
 
 	/*
+	 * If !threaded, NULL.  If threaded, it points to the top cgroup of
+	 * the threaded subtree, on which it points to self.  Threaded
+	 * subtree is exempt from process granularity and no-internal-task
+	 * constraint.  Domain level resource consumptions which aren't
+	 * tied to a specific task should be charged to the proc_cgrp.
+	 */
+	struct cgroup *proc_cgrp;
+
+	/*
 	 * list of pidlists, up to two for each namespace (one for procs, one
 	 * for tasks); created on demand.
 	 */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 8e3a5c8..a9c3d640a 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -560,9 +560,11 @@ struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
  */
 struct css_set init_css_set = {
 	.refcount		= REFCOUNT_INIT(1),
+	.proc_cset		= RCU_INITIALIZER(&init_css_set),
 	.tasks			= LIST_HEAD_INIT(init_css_set.tasks),
 	.mg_tasks		= LIST_HEAD_INIT(init_css_set.mg_tasks),
 	.task_iters		= LIST_HEAD_INIT(init_css_set.task_iters),
+	.threaded_csets		= LIST_HEAD_INIT(init_css_set.threaded_csets),
 	.cgrp_links		= LIST_HEAD_INIT(init_css_set.cgrp_links),
 	.mg_preload_node	= LIST_HEAD_INIT(init_css_set.mg_preload_node),
 	.mg_node		= LIST_HEAD_INIT(init_css_set.mg_node),
@@ -581,6 +583,17 @@ static bool css_set_populated(struct css_set *cset)
 	return !list_empty(&cset->tasks) || !list_empty(&cset->mg_tasks);
 }
 
+static struct css_set *proc_css_set(struct css_set *cset)
+{
+	return rcu_dereference_protected(cset->proc_cset,
+					 lockdep_is_held(&css_set_lock));
+}
+
+static bool css_set_threaded(struct css_set *cset)
+{
+	return proc_css_set(cset) != cset;
+}
+
 /**
  * cgroup_update_populated - updated populated count of a cgroup
  * @cgrp: the target cgroup
@@ -732,6 +745,8 @@ void put_css_set_locked(struct css_set *cset)
 	if (!refcount_dec_and_test(&cset->refcount))
 		return;
 
+	WARN_ON_ONCE(!list_empty(&cset->threaded_csets));
+
 	/* This css_set is dead. unlink it and release cgroup and css refs */
 	for_each_subsys(ss, ssid) {
 		list_del(&cset->e_cset_node[ssid]);
@@ -748,6 +763,11 @@ void put_css_set_locked(struct css_set *cset)
 		kfree(link);
 	}
 
+	if (css_set_threaded(cset)) {
+		list_del(&cset->threaded_csets_node);
+		put_css_set_locked(proc_css_set(cset));
+	}
+
 	kfree_rcu(cset, rcu_head);
 }
 
@@ -757,6 +777,7 @@ void put_css_set_locked(struct css_set *cset)
  * @old_cset: existing css_set for a task
  * @new_cgrp: cgroup that's being entered by the task
  * @template: desired set of css pointers in css_set (pre-calculated)
+ * @for_pcset: the comparison is for a new proc_cset
  *
  * Returns true if "cset" matches "old_cset" except for the hierarchy
  * which "new_cgrp" belongs to, for which it should match "new_cgrp".
@@ -764,7 +785,8 @@ void put_css_set_locked(struct css_set *cset)
 static bool compare_css_sets(struct css_set *cset,
 			     struct css_set *old_cset,
 			     struct cgroup *new_cgrp,
-			     struct cgroup_subsys_state *template[])
+			     struct cgroup_subsys_state *template[],
+			     bool for_pcset)
 {
 	struct list_head *l1, *l2;
 
@@ -776,6 +798,32 @@ static bool compare_css_sets(struct css_set *cset,
 	if (memcmp(template, cset->subsys, sizeof(cset->subsys)))
 		return false;
 
+	if (for_pcset) {
+		/*
+		 * We're looking for the pcset of @old_cset.  As @old_cset
+		 * doesn't have its ->proc_cset pointer set yet (we're
+		 * trying to find out what to set it to), @old_cset itself
+		 * may seem like a match here.  Explicitly exclude identity
+		 * matching.
+		 */
+		if (css_set_threaded(cset) || cset == old_cset)
+			return false;
+	} else {
+		bool is_threaded;
+
+		/*
+		 * Otherwise, @cset's threaded state should match the
+		 * default cgroup's.
+		 */
+		if (cgroup_on_dfl(new_cgrp))
+			is_threaded = new_cgrp->proc_cgrp;
+		else
+			is_threaded = old_cset->dfl_cgrp->proc_cgrp;
+
+		if (is_threaded != css_set_threaded(cset))
+			return false;
+	}
+
 	/*
 	 * Compare cgroup pointers in order to distinguish between
 	 * different cgroups in hierarchies.  As different cgroups may
@@ -828,10 +876,12 @@ static bool compare_css_sets(struct css_set *cset,
  * @old_cset: the css_set that we're using before the cgroup transition
  * @cgrp: the cgroup that we're moving into
  * @template: out param for the new set of csses, should be clear on entry
+ * @for_pcset: looking for a new proc_cset
  */
 static struct css_set *find_existing_css_set(struct css_set *old_cset,
 					struct cgroup *cgrp,
-					struct cgroup_subsys_state *template[])
+					struct cgroup_subsys_state *template[],
+					bool for_pcset)
 {
 	struct cgroup_root *root = cgrp->root;
 	struct cgroup_subsys *ss;
@@ -862,7 +912,7 @@ static struct css_set *find_existing_css_set(struct css_set *old_cset,
 
 	key = css_set_hash(template);
 	hash_for_each_possible(css_set_table, cset, hlist, key) {
-		if (!compare_css_sets(cset, old_cset, cgrp, template))
+		if (!compare_css_sets(cset, old_cset, cgrp, template, for_pcset))
 			continue;
 
 		/* This css_set matches what we need */
@@ -944,12 +994,13 @@ static void link_css_set(struct list_head *tmp_links, struct css_set *cset,
  * find_css_set - return a new css_set with one cgroup updated
  * @old_cset: the baseline css_set
  * @cgrp: the cgroup to be updated
+ * @for_pcset: looking for a new proc_cset
  *
  * Return a new css_set that's equivalent to @old_cset, but with @cgrp
  * substituted into the appropriate hierarchy.
  */
 static struct css_set *find_css_set(struct css_set *old_cset,
-				    struct cgroup *cgrp)
+				    struct cgroup *cgrp, bool for_pcset)
 {
 	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT] = { };
 	struct css_set *cset;
@@ -964,7 +1015,7 @@ static struct css_set *find_css_set(struct css_set *old_cset,
 	/* First see if we already have a cgroup group that matches
 	 * the desired set */
 	spin_lock_irq(&css_set_lock);
-	cset = find_existing_css_set(old_cset, cgrp, template);
+	cset = find_existing_css_set(old_cset, cgrp, template, for_pcset);
 	if (cset)
 		get_css_set(cset);
 	spin_unlock_irq(&css_set_lock);
@@ -983,9 +1034,11 @@ static struct css_set *find_css_set(struct css_set *old_cset,
 	}
 
 	refcount_set(&cset->refcount, 1);
+	RCU_INIT_POINTER(cset->proc_cset, cset);
 	INIT_LIST_HEAD(&cset->tasks);
 	INIT_LIST_HEAD(&cset->mg_tasks);
 	INIT_LIST_HEAD(&cset->task_iters);
+	INIT_LIST_HEAD(&cset->threaded_csets);
 	INIT_HLIST_NODE(&cset->hlist);
 	INIT_LIST_HEAD(&cset->cgrp_links);
 	INIT_LIST_HEAD(&cset->mg_preload_node);
@@ -1023,6 +1076,28 @@ static struct css_set *find_css_set(struct css_set *old_cset,
 
 	spin_unlock_irq(&css_set_lock);
 
+	/*
+	 * If @cset should be threaded, look up the matching proc_cset and
+	 * link them up.  We first fully initialize @cset then look for the
+	 * pcset.  It's simpler this way and safe as @cset is guaranteed to
+	 * stay empty until we return.
+	 */
+	if (!for_pcset && cset->dfl_cgrp->proc_cgrp) {
+		struct css_set *pcset;
+
+		pcset = find_css_set(cset, cset->dfl_cgrp->proc_cgrp, true);
+		if (!pcset) {
+			put_css_set(cset);
+			return NULL;
+		}
+
+		spin_lock_irq(&css_set_lock);
+		rcu_assign_pointer(cset->proc_cset, pcset);
+		list_add_tail(&cset->threaded_csets_node,
+			      &pcset->threaded_csets);
+		spin_unlock_irq(&css_set_lock);
+	}
+
 	return cset;
 }
 
@@ -2244,7 +2319,7 @@ int cgroup_migrate_prepare_dst(struct cgroup_mgctx *mgctx)
 		struct cgroup_subsys *ss;
 		int ssid;
 
-		dst_cset = find_css_set(src_cset, src_cset->mg_dst_cgrp);
+		dst_cset = find_css_set(src_cset, src_cset->mg_dst_cgrp, false);
 		if (!dst_cset)
 			goto err;
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [RFC PATCH v2 04/17] cgroup: implement CSS_TASK_ITER_THREADED
  2017-05-15 13:33 [RFC PATCH v2 00/17] cgroup: Major changes to cgroup v2 core Waiman Long
                   ` (2 preceding siblings ...)
  2017-05-15 13:34 ` [RFC PATCH v2 03/17] cgroup: introduce cgroup->proc_cgrp and threaded css_set handling Waiman Long
@ 2017-05-15 13:34 ` Waiman Long
  2017-05-15 13:34 ` [RFC PATCH v2 05/17] cgroup: implement cgroup v2 thread support Waiman Long
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 69+ messages in thread
From: Waiman Long @ 2017-05-15 13:34 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, longman

From: Tejun Heo <tj@kernel.org>

cgroup v2 is in the process of growing thread granularity support.
Once thread mode is enabled, the root cgroup of the subtree serves as
the proc_cgrp to which the processes of the subtree conceptually
belong and to which domain-level resource consumptions not tied to any
specific task are charged.  In the subtree, threads aren't subject to
process granularity or the no-internal-process constraint and can be
distributed arbitrarily across the subtree.

This patch implements a new task iterator flag CSS_TASK_ITER_THREADED,
which, when used on a proc_cgrp, makes the iteration include the tasks
on all the associated threaded css_sets.  The "cgroup.procs" read path
is updated to use it so that reading the file on a proc_cgrp lists all
processes.  This will also be used by controller implementations which
need to walk processes or tasks at the resource domain level.

Task iteration is implemented nested in css_set iteration.  If
CSS_TASK_ITER_THREADED is specified, after walking the tasks of each
!threaded css_set, all the associated threaded css_sets are visited
before moving on to the next !threaded css_set.
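The nested walk can be sketched in a few lines (illustrative Python, not
the kernel code; the dict fields stand in for css_set->tasks and
css_set->threaded_csets):

```python
# Sketch of CSS_TASK_ITER_THREADED: the tasks of each non-threaded
# (domain) css_set are walked first, then the tasks of its threaded
# css_sets, before moving on to the next domain css_set.

def iter_tasks_threaded(domain_csets):
    for cset in domain_csets:
        yield from cset["tasks"]
        for tcset in cset["threaded_csets"]:
            yield from tcset["tasks"]

root = [
    {"tasks": ["p1"], "threaded_csets": [{"tasks": ["t1", "t2"]}]},
    {"tasks": ["p2"], "threaded_csets": []},
]
print(list(iter_tasks_threaded(root)))  # ['p1', 't1', 't2', 'p2']
```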

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup.h |  6 ++++
 kernel/cgroup/cgroup.c | 81 +++++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 73 insertions(+), 14 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 3568aa1..e2c0b23 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -38,6 +38,8 @@
 
 /* walk only threadgroup leaders */
 #define CSS_TASK_ITER_PROCS		(1U << 0)
+/* walk threaded css_sets as part of their proc_csets */
+#define CSS_TASK_ITER_THREADED		(1U << 1)
 
 /* a css_task_iter should be treated as an opaque object */
 struct css_task_iter {
@@ -47,11 +49,15 @@ struct css_task_iter {
 	struct list_head		*cset_pos;
 	struct list_head		*cset_head;
 
+	struct list_head		*tcset_pos;
+	struct list_head		*tcset_head;
+
 	struct list_head		*task_pos;
 	struct list_head		*tasks_head;
 	struct list_head		*mg_tasks_head;
 
 	struct css_set			*cur_cset;
+	struct css_set			*cur_pcset;
 	struct task_struct		*cur_task;
 	struct list_head		iters_node;	/* css_set->task_iters */
 };
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a9c3d640a..7efb5da 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3597,27 +3597,36 @@ bool css_has_online_children(struct cgroup_subsys_state *css)
 	return ret;
 }
 
-/**
- * css_task_iter_advance_css_set - advance a task itererator to the next css_set
- * @it: the iterator to advance
- *
- * Advance @it to the next css_set to walk.
- */
-static void css_task_iter_advance_css_set(struct css_task_iter *it)
+static struct css_set *css_task_iter_next_css_set(struct css_task_iter *it)
 {
-	struct list_head *l = it->cset_pos;
+	bool threaded = it->flags & CSS_TASK_ITER_THREADED;
+	struct list_head *l;
 	struct cgrp_cset_link *link;
 	struct css_set *cset;
 
 	lockdep_assert_held(&css_set_lock);
 
-	/* Advance to the next non-empty css_set */
+	/* find the next threaded cset */
+	if (it->tcset_pos) {
+		l = it->tcset_pos->next;
+
+		if (l != it->tcset_head) {
+			it->tcset_pos = l;
+			return container_of(l, struct css_set,
+					    threaded_csets_node);
+		}
+
+		it->tcset_pos = NULL;
+	}
+
+	/* find the next cset */
+	l = it->cset_pos;
+
 	do {
 		l = l->next;
 		if (l == it->cset_head) {
 			it->cset_pos = NULL;
-			it->task_pos = NULL;
-			return;
+			return NULL;
 		}
 
 		if (it->ss) {
@@ -3627,10 +3636,50 @@ static void css_task_iter_advance_css_set(struct css_task_iter *it)
 			link = list_entry(l, struct cgrp_cset_link, cset_link);
 			cset = link->cset;
 		}
-	} while (!css_set_populated(cset));
+
+		/*
+		 * For threaded iterations, threaded csets are walked
+		 * together with their proc_csets.  Skip here.
+		 */
+	} while (threaded && css_set_threaded(cset));
 
 	it->cset_pos = l;
 
+	/* initialize threaded cset walking */
+	if (threaded) {
+		if (it->cur_pcset)
+			put_css_set_locked(it->cur_pcset);
+		it->cur_pcset = cset;
+		get_css_set(cset);
+
+		it->tcset_head = &cset->threaded_csets;
+		it->tcset_pos = &cset->threaded_csets;
+	}
+
+	return cset;
+}
+
+/**
+ * css_task_iter_advance_css_set - advance a task iterator to the next css_set
+ * @it: the iterator to advance
+ *
+ * Advance @it to the next css_set to walk.
+ */
+static void css_task_iter_advance_css_set(struct css_task_iter *it)
+{
+	struct css_set *cset;
+
+	lockdep_assert_held(&css_set_lock);
+
+	/* Advance to the next non-empty css_set */
+	do {
+		cset = css_task_iter_next_css_set(it);
+		if (!cset) {
+			it->task_pos = NULL;
+			return;
+		}
+	} while (!css_set_populated(cset));
+
 	if (!list_empty(&cset->tasks))
 		it->task_pos = cset->tasks.next;
 	else
@@ -3773,6 +3822,9 @@ void css_task_iter_end(struct css_task_iter *it)
 		spin_unlock_irq(&css_set_lock);
 	}
 
+	if (it->cur_pcset)
+		put_css_set(it->cur_pcset);
+
 	if (it->cur_task)
 		put_task_struct(it->cur_task);
 }
@@ -3798,6 +3850,7 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 	struct kernfs_open_file *of = s->private;
 	struct cgroup *cgrp = seq_css(s)->cgroup;
 	struct css_task_iter *it = of->priv;
+	unsigned iter_flags = CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED;
 
 	/*
 	 * When a seq_file is seeked, it's always traversed sequentially
@@ -3811,10 +3864,10 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 		if (!it)
 			return ERR_PTR(-ENOMEM);
 		of->priv = it;
-		css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS, it);
+		css_task_iter_start(&cgrp->self, iter_flags, it);
 	} else if (!(*pos)++) {
 		css_task_iter_end(it);
-		css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS, it);
+		css_task_iter_start(&cgrp->self, iter_flags, it);
 	}
 
 	return cgroup_procs_next(s, NULL, NULL);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [RFC PATCH v2 05/17] cgroup: implement cgroup v2 thread support
  2017-05-15 13:33 [RFC PATCH v2 00/17] cgroup: Major changes to cgroup v2 core Waiman Long
                   ` (3 preceding siblings ...)
  2017-05-15 13:34 ` [RFC PATCH v2 04/17] cgroup: implement CSS_TASK_ITER_THREADED Waiman Long
@ 2017-05-15 13:34 ` Waiman Long
  2017-05-15 13:34 ` [RFC PATCH v2 06/17] cgroup: Fix reference counting bug in cgroup_procs_write() Waiman Long
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 69+ messages in thread
From: Waiman Long @ 2017-05-15 13:34 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, longman

From: Tejun Heo <tj@kernel.org>

This patch implements cgroup v2 thread support.  The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.

Once thread mode is enabled on a cgroup, the threads of the processes
which are in its subtree can be placed inside the subtree without
being restricted by process granularity or no-internal-process
constraint.  Note that the threads aren't allowed to escape to a
different threaded subtree.  To be used inside a threaded subtree, a
controller should explicitly support threaded mode and be able to
handle internal competition in the way which is appropriate for the
resource.

The root of a threaded subtree, where thread mode is enabled in the
first place, is called the thread root and serves as the resource
domain for the whole subtree.  This is the last cgroup where
non-threaded controllers are operational and where all the
domain-level resource consumptions in the subtree are accounted.  This
allows threaded controllers to operate at thread granularity when
requested while staying inside the scope of system-level resource
distribution.

Internally, in a threaded subtree, each css_set has its ->proc_cset
pointing to a matching css_set which belongs to the thread root.  This
ensures that thread root level cgroup_subsys_state for all threaded
controllers are readily accessible for domain-level operations.

This patch enables threaded mode for the pids and perf_events
controllers.  Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.

For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.  Note that the documentation update is not complete as
the rest of the documentation needs to be updated accordingly.
Rolling those updates into this patch would be confusing, so they will
be done in separate patches.
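The ->proc_cset relationship described above can be sketched as follows
(illustrative Python, not kernel code; the class and field names are
hypothetical stand-ins for struct css_set):

```python
# Sketch of the ->proc_cset invariant: a css_set outside any threaded
# subtree points to itself; a css_set inside a threaded subtree points
# to the matching css_set of the thread root, so thread-root-level
# state is always reachable for domain-level operations.

class CssSet:
    def __init__(self, cgroup, proc_cset=None):
        self.cgroup = cgroup
        self.proc_cset = proc_cset or self  # self-pointing by default

    def threaded(self):
        # mirrors css_set_threaded() in the patch
        return self.proc_cset is not self

root_cset = CssSet("thread-root")
leaf_cset = CssSet("thread-root/worker", proc_cset=root_cset)

assert not root_cset.threaded()
assert leaf_cset.threaded()
# domain-level state is reached through the thread root's css_set
assert leaf_cset.proc_cset.cgroup == "thread-root"
```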

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 Documentation/cgroup-v2.txt |  75 +++++++++++++-
 include/linux/cgroup-defs.h |  16 +++
 kernel/cgroup/cgroup.c      | 240 +++++++++++++++++++++++++++++++++++++++++++-
 kernel/cgroup/pids.c        |   1 +
 kernel/events/core.c        |   1 +
 5 files changed, 326 insertions(+), 7 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index dc5e2dc..1c6f5a9 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -16,7 +16,9 @@ CONTENTS
   1-2. What is cgroup?
 2. Basic Operations
   2-1. Mounting
-  2-2. Organizing Processes
+  2-2. Organizing Processes and Threads
+    2-2-1. Processes
+    2-2-2. Threads
   2-3. [Un]populated Notification
   2-4. Controlling Controllers
     2-4-1. Enabling and Disabling
@@ -150,7 +152,9 @@ and experimenting easier, the kernel parameter cgroup_no_v1= allows
 disabling controllers in v1 and make them always available in v2.
 
 
-2-2. Organizing Processes
+2-2. Organizing Processes and Threads
+
+2-2-1. Processes
 
 Initially, only the root cgroup exists to which all processes belong.
 A child cgroup can be created by creating a sub-directory.
@@ -201,6 +205,73 @@ is removed subsequently, " (deleted)" is appended to the path.
   0::/test-cgroup/test-cgroup-nested (deleted)
 
 
+2-2-2. Threads
+
+cgroup v2 supports thread granularity for a subset of controllers to
+support use cases requiring hierarchical resource distribution across
+the threads of a group of processes.  By default, all threads of a
+process belong to the same cgroup, which also serves as the resource
+domain to host resource consumptions which are not specific to a
+process or thread.  The thread mode allows threads to be spread across
+a subtree while still maintaining the common resource domain for them.
+
+Enabling thread mode on a subtree makes it threaded.  The root of a
+threaded subtree is called thread root and serves as the resource
+domain for the entire subtree.  In a threaded subtree, threads of a
+process can be put in different cgroups and are not subject to the
+no-internal-process constraint - threaded controllers can be enabled on
+non-leaf cgroups whether they have threads in them or not.
+
+To enable the thread mode, the following conditions must be met.
+
+- The thread root doesn't have any child cgroups.
+
+- The thread root doesn't have any controllers enabled.
+
+Thread mode can be enabled by writing "enable" to the "cgroup.threads"
+file.
+
+  # echo enable > cgroup.threads
+
+Inside a threaded subtree, "cgroup.threads" can be read and contains
+the list of the thread IDs of all threads in the cgroup.  Except that
+the operations are per-thread instead of per-process, "cgroup.threads"
+has the same format and behaves the same way as "cgroup.procs".
+
+The thread root serves as the resource domain for the whole subtree,
+and, while the threads can be scattered across the subtree, all the
+processes are considered to be in the thread root.  "cgroup.procs" in
+a thread root contains the PIDs of all processes in the subtree and is
+not readable in the subtree proper.  However, "cgroup.procs" can be
+written to from anywhere in the subtree to migrate all threads of the
+matching process to the cgroup.
+
+Only threaded controllers can be enabled in a threaded subtree.  When
+a threaded controller is enabled inside a threaded subtree, it only
+accounts for and controls resource consumptions associated with the
+threads in the cgroup and its descendants.  All consumptions which
+aren't tied to a specific thread belong to the thread root.
+
+Because a threaded subtree is exempt from the no-internal-process
+constraint, a threaded controller must be able to handle competition
+between threads in a non-leaf cgroup and its child cgroups.  Each
+threaded controller defines how such competitions are handled.
+
+To disable the thread mode, the following conditions must be met.
+
+- The cgroup is a thread root.  Thread mode can't be disabled
+  partially in the subtree.
+
+- The thread root doesn't have any child cgroups.
+
+- The thread root doesn't have any controllers enabled.
+
+Thread mode can be disabled by writing "disable" to the "cgroup.threads"
+file.
+
+  # echo disable > cgroup.threads
+
+
 2-3. [Un]populated Notification
 
 Each non-root cgroup has a "cgroup.events" file which contains
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 3f3cfdd..e8d0cfc 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -230,6 +230,10 @@ struct css_set {
 	struct cgroup *mg_dst_cgrp;
 	struct css_set *mg_dst_cset;
 
+	/* used while updating ->proc_cset to enable/disable threaded mode */
+	struct list_head pcset_preload_node;
+	struct css_set *pcset_preload;
+
 	/* dead and being drained, ignore for migration */
 	bool dead;
 
@@ -501,6 +505,18 @@ struct cgroup_subsys {
 	bool implicit_on_dfl:1;
 
 	/*
+	 * If %true, the controller supports threaded mode on the default
+	 * hierarchy.  In a threaded subtree, both process granularity and
+	 * no-internal-process constraint are ignored and a threaded
+	 * controller should be able to handle that.
+	 *
+	 * Note that as an implicit controller is automatically enabled on
+	 * all cgroups on the default hierarchy, it should also be
+	 * threaded.  implicit && !threaded is not supported.
+	 */
+	bool threaded:1;
+
+	/*
 	 * If %false, this subsystem is properly hierarchical -
 	 * configuration, resource accounting and restriction on a parent
 	 * cgroup cover those of its children.  If %true, hierarchy support
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 7efb5da..d7bab5e 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -162,6 +162,9 @@ struct cgroup_subsys *cgroup_subsys[] = {
 /* some controllers are implicitly enabled on the default hierarchy */
 static u16 cgrp_dfl_implicit_ss_mask;
 
+/* some controllers can be threaded on the default hierarchy */
+static u16 cgrp_dfl_threaded_ss_mask;
+
 /* The list of hierarchy roots */
 LIST_HEAD(cgroup_roots);
 static int cgroup_root_count;
@@ -2916,11 +2919,18 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 		goto out_unlock;
 	}
 
+	/* can't enable !threaded controllers on a threaded cgroup */
+	if (cgrp->proc_cgrp && (enable & ~cgrp_dfl_threaded_ss_mask)) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
 	/*
-	 * Except for the root, subtree_control must be zero for a cgroup
-	 * with tasks so that child cgroups don't compete against tasks.
+	 * Except for root and threaded cgroups, subtree_control must be
+	 * zero for a cgroup with tasks so that child cgroups don't compete
+	 * against tasks.
 	 */
-	if (enable && cgroup_parent(cgrp)) {
+	if (enable && cgroup_parent(cgrp) && !cgrp->proc_cgrp) {
 		struct cgrp_cset_link *link;
 
 		/*
@@ -2961,6 +2971,124 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
+static int cgroup_enable_threaded(struct cgroup *cgrp)
+{
+	LIST_HEAD(csets);
+	struct cgrp_cset_link *link;
+	struct css_set *cset, *cset_next;
+	int ret;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	/* noop if already threaded */
+	if (cgrp->proc_cgrp)
+		return 0;
+
+	/* allow only if there are neither children nor enabled controllers */
+	if (css_has_online_children(&cgrp->self) || cgrp->subtree_control)
+		return -EBUSY;
+
+	/* find all csets which need ->proc_cset updated */
+	spin_lock_irq(&css_set_lock);
+	list_for_each_entry(link, &cgrp->cset_links, cset_link) {
+		cset = link->cset;
+		if (css_set_populated(cset)) {
+			WARN_ON_ONCE(css_set_threaded(cset));
+			WARN_ON_ONCE(cset->pcset_preload);
+
+			list_add_tail(&cset->pcset_preload_node, &csets);
+			get_css_set(cset);
+		}
+	}
+	spin_unlock_irq(&css_set_lock);
+
+	/* find the proc_csets to associate */
+	list_for_each_entry(cset, &csets, pcset_preload_node) {
+		struct css_set *pcset = find_css_set(cset, cgrp, true);
+
+		WARN_ON_ONCE(cset == pcset);
+		if (!pcset) {
+			ret = -ENOMEM;
+			goto err_put_csets;
+		}
+		cset->pcset_preload = pcset;
+	}
+
+	/* install ->proc_cset */
+	spin_lock_irq(&css_set_lock);
+	list_for_each_entry_safe(cset, cset_next, &csets, pcset_preload_node) {
+		rcu_assign_pointer(cset->proc_cset, cset->pcset_preload);
+		list_add_tail(&cset->threaded_csets_node,
+			      &cset->pcset_preload->threaded_csets);
+
+		cset->pcset_preload = NULL;
+		list_del(&cset->pcset_preload_node);
+		put_css_set_locked(cset);
+	}
+	spin_unlock_irq(&css_set_lock);
+
+	/* mark it threaded */
+	cgrp->proc_cgrp = cgrp;
+
+	return 0;
+
+err_put_csets:
+	spin_lock_irq(&css_set_lock);
+	list_for_each_entry_safe(cset, cset_next, &csets, pcset_preload_node) {
+		if (cset->pcset_preload) {
+			put_css_set_locked(cset->pcset_preload);
+			cset->pcset_preload = NULL;
+		}
+		list_del(&cset->pcset_preload_node);
+		put_css_set_locked(cset);
+	}
+	spin_unlock_irq(&css_set_lock);
+	return ret;
+}
+
+static int cgroup_disable_threaded(struct cgroup *cgrp)
+{
+	struct cgrp_cset_link *link;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	/* noop if already !threaded */
+	if (!cgrp->proc_cgrp)
+		return 0;
+
+	/* partial disable isn't supported */
+	if (cgrp->proc_cgrp != cgrp)
+		return -EBUSY;
+
+	/* allow only if there are neither children nor enabled controllers */
+	if (css_has_online_children(&cgrp->self) || cgrp->subtree_control)
+		return -EBUSY;
+
+	/* walk all csets and reset ->proc_cset */
+	spin_lock_irq(&css_set_lock);
+	list_for_each_entry(link, &cgrp->cset_links, cset_link) {
+		struct css_set *cset = link->cset;
+
+		if (css_set_threaded(cset)) {
+			struct css_set *pcset = proc_css_set(cset);
+
+			WARN_ON_ONCE(pcset->dfl_cgrp != cgrp);
+			rcu_assign_pointer(cset->proc_cset, cset);
+			list_del(&cset->threaded_csets_node);
+
+			/*
+			 * @pcset is never @cset and safe to put during
+			 * iteration.
+			 */
+			put_css_set_locked(pcset);
+		}
+	}
+	cgrp->proc_cgrp = NULL;
+	spin_unlock_irq(&css_set_lock);
+
+	return 0;
+}
+
 static int cgroup_events_show(struct seq_file *seq, void *v)
 {
 	seq_printf(seq, "populated %d\n",
@@ -3845,12 +3973,12 @@ static void *cgroup_procs_next(struct seq_file *s, void *v, loff_t *pos)
 	return css_task_iter_next(it);
 }
 
-static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
+static void *__cgroup_procs_start(struct seq_file *s, loff_t *pos,
+				  unsigned int iter_flags)
 {
 	struct kernfs_open_file *of = s->private;
 	struct cgroup *cgrp = seq_css(s)->cgroup;
 	struct css_task_iter *it = of->priv;
-	unsigned iter_flags = CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED;
 
 	/*
 	 * When a seq_file is seeked, it's always traversed sequentially
@@ -3873,6 +4001,23 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 	return cgroup_procs_next(s, NULL, NULL);
 }
 
+static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
+{
+	struct cgroup *cgrp = seq_css(s)->cgroup;
+
+	/*
+	 * All processes of a threaded subtree are in the top threaded
+	 * cgroup.  Only threads can be distributed across the subtree.
+	 * Reject reads on cgroup.procs in the subtree proper.  They're
+	 * always empty anyway.
+	 */
+	if (cgrp->proc_cgrp && cgrp->proc_cgrp != cgrp)
+		return ERR_PTR(-EINVAL);
+
+	return __cgroup_procs_start(s, pos, CSS_TASK_ITER_PROCS |
+					    CSS_TASK_ITER_THREADED);
+}
+
 static int cgroup_procs_show(struct seq_file *s, void *v)
 {
 	seq_printf(s, "%d\n", task_pid_vnr(v));
@@ -3927,6 +4072,76 @@ static ssize_t cgroup_procs_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
+static void *cgroup_threads_start(struct seq_file *s, loff_t *pos)
+{
+	struct cgroup *cgrp = seq_css(s)->cgroup;
+
+	if (!cgrp->proc_cgrp)
+		return ERR_PTR(-EINVAL);
+
+	return __cgroup_procs_start(s, pos, 0);
+}
+
+static ssize_t cgroup_threads_write(struct kernfs_open_file *of,
+				    char *buf, size_t nbytes, loff_t off)
+{
+	struct super_block *sb = of->file->f_path.dentry->d_sb;
+	struct cgroup *cgrp, *common_ancestor;
+	struct task_struct *task;
+	ssize_t ret;
+
+	buf = strstrip(buf);
+
+	cgrp = cgroup_kn_lock_live(of->kn, false);
+	if (!cgrp)
+		return -ENODEV;
+
+	/* cgroup.procs determines delegation, require permission on it too */
+	ret = cgroup_procs_write_permission(cgrp, sb);
+	if (ret)
+		goto out_unlock;
+
+	/* enable or disable? */
+	if (!strcmp(buf, "enable")) {
+		ret = cgroup_enable_threaded(cgrp);
+		goto out_unlock;
+	} else if (!strcmp(buf, "disable")) {
+		ret = cgroup_disable_threaded(cgrp);
+		goto out_unlock;
+	}
+
+	/* thread migration */
+	ret = -EINVAL;
+	if (!cgrp->proc_cgrp)
+		goto out_unlock;
+
+	task = cgroup_procs_write_start(buf, false);
+	ret = PTR_ERR_OR_ZERO(task);
+	if (ret)
+		goto out_unlock;
+
+	common_ancestor = cgroup_migrate_common_ancestor(task, cgrp);
+
+	/* can't migrate across disjoint threaded subtrees */
+	ret = -EACCES;
+	if (common_ancestor->proc_cgrp != cgrp->proc_cgrp)
+		goto out_finish;
+
+	/* and follow the cgroup.procs delegation rule */
+	ret = cgroup_procs_write_permission(common_ancestor, sb);
+	if (ret)
+		goto out_finish;
+
+	ret = cgroup_attach_task(cgrp, task, false);
+
+out_finish:
+	cgroup_procs_write_finish();
+out_unlock:
+	cgroup_kn_unlock(of->kn);
+
+	return ret ?: nbytes;
+}
+
 /* cgroup core interface files for the default hierarchy */
 static struct cftype cgroup_base_files[] = {
 	{
@@ -3939,6 +4154,14 @@ static ssize_t cgroup_procs_write(struct kernfs_open_file *of,
 		.write = cgroup_procs_write,
 	},
 	{
+		.name = "cgroup.threads",
+		.release = cgroup_procs_release,
+		.seq_start = cgroup_threads_start,
+		.seq_next = cgroup_procs_next,
+		.seq_show = cgroup_procs_show,
+		.write = cgroup_threads_write,
+	},
+	{
 		.name = "cgroup.controllers",
 		.seq_show = cgroup_controllers_show,
 	},
@@ -4252,6 +4475,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 	cgrp->self.parent = &parent->self;
 	cgrp->root = root;
 	cgrp->level = level;
+	cgrp->proc_cgrp = parent->proc_cgrp;
 
 	for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp))
 		cgrp->ancestor_ids[tcgrp->level] = tcgrp->id;
@@ -4694,11 +4918,17 @@ int __init cgroup_init(void)
 
 		cgrp_dfl_root.subsys_mask |= 1 << ss->id;
 
+		/* implicit controllers must be threaded too */
+		WARN_ON(ss->implicit_on_dfl && !ss->threaded);
+
 		if (ss->implicit_on_dfl)
 			cgrp_dfl_implicit_ss_mask |= 1 << ss->id;
 		else if (!ss->dfl_cftypes)
 			cgrp_dfl_inhibit_ss_mask |= 1 << ss->id;
 
+		if (ss->threaded)
+			cgrp_dfl_threaded_ss_mask |= 1 << ss->id;
+
 		if (ss->dfl_cftypes == ss->legacy_cftypes) {
 			WARN_ON(cgroup_add_cftypes(ss, ss->dfl_cftypes));
 		} else {
diff --git a/kernel/cgroup/pids.c b/kernel/cgroup/pids.c
index 2237201..9829c67 100644
--- a/kernel/cgroup/pids.c
+++ b/kernel/cgroup/pids.c
@@ -345,4 +345,5 @@ struct cgroup_subsys pids_cgrp_subsys = {
 	.free		= pids_free,
 	.legacy_cftypes	= pids_files,
 	.dfl_cftypes	= pids_files,
+	.threaded	= true,
 };
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 13f5b94..6ba1d06 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11150,5 +11150,6 @@ struct cgroup_subsys perf_event_cgrp_subsys = {
 	 * controller is not mounted on a legacy hierarchy.
 	 */
 	.implicit_on_dfl = true,
+	.threaded	= true,
 };
 #endif /* CONFIG_CGROUP_PERF */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [RFC PATCH v2 06/17] cgroup: Fix reference counting bug in cgroup_procs_write()
  2017-05-15 13:33 [RFC PATCH v2 00/17] cgroup: Major changes to cgroup v2 core Waiman Long
                   ` (4 preceding siblings ...)
  2017-05-15 13:34 ` [RFC PATCH v2 05/17] cgroup: implement cgroup v2 thread support Waiman Long
@ 2017-05-15 13:34 ` Waiman Long
  2017-05-17 19:20   ` Tejun Heo
  2017-05-15 13:34 ` [RFC PATCH v2 07/17] cgroup: Prevent kill_css() from being called more than once Waiman Long
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 69+ messages in thread
From: Waiman Long @ 2017-05-15 13:34 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, longman

cgroup_procs_write_start() takes a reference to the task structure
which was not properly released by cgroup_procs_write() and its
callers.  Add a put_task_struct() call to cgroup_procs_write_finish()
to match the get_task_struct() in cgroup_procs_write_start(), fixing
this reference counting error.
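The fix restores the usual get/put pairing across helper boundaries; a
minimal sketch of that pattern (illustrative Python, not the kernel
API):

```python
# Minimal sketch of the get/put pairing the fix restores: the reference
# taken in the *_start() helper must be dropped in the *_finish() helper.

class Task:
    def __init__(self):
        self.refcount = 1

def write_start(task):
    task.refcount += 1      # get_task_struct()
    return task

def write_finish(task):
    task.refcount -= 1      # put_task_struct() added by this patch

t = Task()
write_finish(write_start(t))
assert t.refcount == 1      # balanced again after the fix
```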

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cgroup-internal.h |  2 +-
 kernel/cgroup/cgroup-v1.c       |  2 +-
 kernel/cgroup/cgroup.c          | 10 ++++++----
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index f0a0dba..2c8e3a9 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -182,7 +182,7 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 		       bool threadgroup);
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup)
 	__acquires(&cgroup_threadgroup_rwsem);
-void cgroup_procs_write_finish(void)
+void cgroup_procs_write_finish(struct task_struct *task)
 	__releases(&cgroup_threadgroup_rwsem);
 
 void cgroup_lock_and_drain_offline(struct cgroup *cgrp);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index c212856..1e101b9 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -549,7 +549,7 @@ static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of,
 	ret = cgroup_attach_task(cgrp, task, threadgroup);
 
 out_finish:
-	cgroup_procs_write_finish();
+	cgroup_procs_write_finish(task);
 out_unlock:
 	cgroup_kn_unlock(of->kn);
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index d7bab5e..f14deca 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2492,12 +2492,15 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup)
 	return tsk;
 }
 
-void cgroup_procs_write_finish(void)
+void cgroup_procs_write_finish(struct task_struct *task)
 	__releases(&cgroup_threadgroup_rwsem)
 {
 	struct cgroup_subsys *ss;
 	int ssid;
 
+	/* release reference from cgroup_procs_write_start() */
+	put_task_struct(task);
+
 	percpu_up_write(&cgroup_threadgroup_rwsem);
 	for_each_subsys(ss, ssid)
 		if (ss->post_attach)
@@ -3300,7 +3303,6 @@ static int cgroup_addrm_files(struct cgroup_subsys_state *css,
 
 static int cgroup_apply_cftypes(struct cftype *cfts, bool is_add)
 {
-	LIST_HEAD(pending);
 	struct cgroup_subsys *ss = cfts[0].ss;
 	struct cgroup *root = &ss->root->cgrp;
 	struct cgroup_subsys_state *css;
@@ -4065,7 +4067,7 @@ static ssize_t cgroup_procs_write(struct kernfs_open_file *of,
 	ret = cgroup_attach_task(cgrp, task, true);
 
 out_finish:
-	cgroup_procs_write_finish();
+	cgroup_procs_write_finish(task);
 out_unlock:
 	cgroup_kn_unlock(of->kn);
 
@@ -4135,7 +4137,7 @@ static ssize_t cgroup_threads_write(struct kernfs_open_file *of,
 	ret = cgroup_attach_task(cgrp, task, false);
 
 out_finish:
-	cgroup_procs_write_finish();
+	cgroup_procs_write_finish(task);
 out_unlock:
 	cgroup_kn_unlock(of->kn);
 
-- 
1.8.3.1


* [RFC PATCH v2 07/17] cgroup: Prevent kill_css() from being called more than once
  2017-05-15 13:33 [RFC PATCH v2 00/17] cgroup: Major changes to cgroup v2 core Waiman Long
                   ` (5 preceding siblings ...)
  2017-05-15 13:34 ` [RFC PATCH v2 06/17] cgroup: Fix reference counting bug in cgroup_procs_write() Waiman Long
@ 2017-05-15 13:34 ` Waiman Long
  2017-05-17 19:23   ` Tejun Heo
  2017-05-15 13:34 ` [RFC PATCH v2 08/17] cgroup: Move debug cgroup to its own file Waiman Long
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 69+ messages in thread
From: Waiman Long @ 2017-05-15 13:34 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, longman

The kill_css() function may be called more than once when a css has
been killed but not yet physically removed and the cgroup hosting it
is subsequently removed. This patch prevents any harm from being done
when that happens.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/cgroup-defs.h | 1 +
 kernel/cgroup/cgroup.c      | 5 +++++
 2 files changed, 6 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index e8d0cfc..b123afc 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -48,6 +48,7 @@ enum {
 	CSS_ONLINE	= (1 << 1), /* between ->css_online() and ->css_offline() */
 	CSS_RELEASED	= (1 << 2), /* refcnt reached zero, released */
 	CSS_VISIBLE	= (1 << 3), /* css is visible to userland */
+	CSS_DYING	= (1 << 4), /* css is dying */
 };
 
 /* bits in struct cgroup flags field */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index f14deca..7b085d5 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -4630,6 +4630,11 @@ static void kill_css(struct cgroup_subsys_state *css)
 {
 	lockdep_assert_held(&cgroup_mutex);
 
+	if (css->flags & CSS_DYING)
+		return;
+
+	css->flags |= CSS_DYING;
+
 	/*
 	 * This must happen before css is disassociated with its cgroup.
 	 * See seq_css() for details.
-- 
1.8.3.1


* [RFC PATCH v2 08/17] cgroup: Move debug cgroup to its own file
  2017-05-15 13:33 [RFC PATCH v2 00/17] cgroup: Major changes to cgroup v2 core Waiman Long
                   ` (6 preceding siblings ...)
  2017-05-15 13:34 ` [RFC PATCH v2 07/17] cgroup: Prevent kill_css() from being called more than once Waiman Long
@ 2017-05-15 13:34 ` Waiman Long
  2017-05-17 21:36   ` Tejun Heo
  2017-05-15 13:34 ` [RFC PATCH v2 09/17] cgroup: Keep accurate count of tasks in each css_set Waiman Long
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 69+ messages in thread
From: Waiman Long @ 2017-05-15 13:34 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, longman

The debug cgroup currently resides within cgroup-v1.c and is enabled
only for cgroup v1. To enable the debug cgroup for v2 as well, it
makes sense to move the code into its own file, as it will no longer
be v1 specific. The only change in this patch is the expansion of
cgroup_task_count() within the debug_taskcount_read() function.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/Makefile    |   1 +
 kernel/cgroup/cgroup-v1.c | 147 -----------------------------------------
 kernel/cgroup/debug.c     | 165 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 166 insertions(+), 147 deletions(-)
 create mode 100644 kernel/cgroup/debug.c

diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
index 387348a..ce693cc 100644
--- a/kernel/cgroup/Makefile
+++ b/kernel/cgroup/Makefile
@@ -4,3 +4,4 @@ obj-$(CONFIG_CGROUP_FREEZER) += freezer.o
 obj-$(CONFIG_CGROUP_PIDS) += pids.o
 obj-$(CONFIG_CGROUP_RDMA) += rdma.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
+obj-$(CONFIG_CGROUP_DEBUG) += debug.o
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 1e101b9..7ad6b17 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -1311,150 +1311,3 @@ static int __init cgroup_no_v1(char *str)
 	return 1;
 }
 __setup("cgroup_no_v1=", cgroup_no_v1);
-
-
-#ifdef CONFIG_CGROUP_DEBUG
-static struct cgroup_subsys_state *
-debug_css_alloc(struct cgroup_subsys_state *parent_css)
-{
-	struct cgroup_subsys_state *css = kzalloc(sizeof(*css), GFP_KERNEL);
-
-	if (!css)
-		return ERR_PTR(-ENOMEM);
-
-	return css;
-}
-
-static void debug_css_free(struct cgroup_subsys_state *css)
-{
-	kfree(css);
-}
-
-static u64 debug_taskcount_read(struct cgroup_subsys_state *css,
-				struct cftype *cft)
-{
-	return cgroup_task_count(css->cgroup);
-}
-
-static u64 current_css_set_read(struct cgroup_subsys_state *css,
-				struct cftype *cft)
-{
-	return (u64)(unsigned long)current->cgroups;
-}
-
-static u64 current_css_set_refcount_read(struct cgroup_subsys_state *css,
-					 struct cftype *cft)
-{
-	u64 count;
-
-	rcu_read_lock();
-	count = refcount_read(&task_css_set(current)->refcount);
-	rcu_read_unlock();
-	return count;
-}
-
-static int current_css_set_cg_links_read(struct seq_file *seq, void *v)
-{
-	struct cgrp_cset_link *link;
-	struct css_set *cset;
-	char *name_buf;
-
-	name_buf = kmalloc(NAME_MAX + 1, GFP_KERNEL);
-	if (!name_buf)
-		return -ENOMEM;
-
-	spin_lock_irq(&css_set_lock);
-	rcu_read_lock();
-	cset = rcu_dereference(current->cgroups);
-	list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
-		struct cgroup *c = link->cgrp;
-
-		cgroup_name(c, name_buf, NAME_MAX + 1);
-		seq_printf(seq, "Root %d group %s\n",
-			   c->root->hierarchy_id, name_buf);
-	}
-	rcu_read_unlock();
-	spin_unlock_irq(&css_set_lock);
-	kfree(name_buf);
-	return 0;
-}
-
-#define MAX_TASKS_SHOWN_PER_CSS 25
-static int cgroup_css_links_read(struct seq_file *seq, void *v)
-{
-	struct cgroup_subsys_state *css = seq_css(seq);
-	struct cgrp_cset_link *link;
-
-	spin_lock_irq(&css_set_lock);
-	list_for_each_entry(link, &css->cgroup->cset_links, cset_link) {
-		struct css_set *cset = link->cset;
-		struct task_struct *task;
-		int count = 0;
-
-		seq_printf(seq, "css_set %pK\n", cset);
-
-		list_for_each_entry(task, &cset->tasks, cg_list) {
-			if (count++ > MAX_TASKS_SHOWN_PER_CSS)
-				goto overflow;
-			seq_printf(seq, "  task %d\n", task_pid_vnr(task));
-		}
-
-		list_for_each_entry(task, &cset->mg_tasks, cg_list) {
-			if (count++ > MAX_TASKS_SHOWN_PER_CSS)
-				goto overflow;
-			seq_printf(seq, "  task %d\n", task_pid_vnr(task));
-		}
-		continue;
-	overflow:
-		seq_puts(seq, "  ...\n");
-	}
-	spin_unlock_irq(&css_set_lock);
-	return 0;
-}
-
-static u64 releasable_read(struct cgroup_subsys_state *css, struct cftype *cft)
-{
-	return (!cgroup_is_populated(css->cgroup) &&
-		!css_has_online_children(&css->cgroup->self));
-}
-
-static struct cftype debug_files[] =  {
-	{
-		.name = "taskcount",
-		.read_u64 = debug_taskcount_read,
-	},
-
-	{
-		.name = "current_css_set",
-		.read_u64 = current_css_set_read,
-	},
-
-	{
-		.name = "current_css_set_refcount",
-		.read_u64 = current_css_set_refcount_read,
-	},
-
-	{
-		.name = "current_css_set_cg_links",
-		.seq_show = current_css_set_cg_links_read,
-	},
-
-	{
-		.name = "cgroup_css_links",
-		.seq_show = cgroup_css_links_read,
-	},
-
-	{
-		.name = "releasable",
-		.read_u64 = releasable_read,
-	},
-
-	{ }	/* terminate */
-};
-
-struct cgroup_subsys debug_cgrp_subsys = {
-	.css_alloc = debug_css_alloc,
-	.css_free = debug_css_free,
-	.legacy_cftypes = debug_files,
-};
-#endif /* CONFIG_CGROUP_DEBUG */
diff --git a/kernel/cgroup/debug.c b/kernel/cgroup/debug.c
new file mode 100644
index 0000000..56e60a2
--- /dev/null
+++ b/kernel/cgroup/debug.c
@@ -0,0 +1,165 @@
+#include <linux/ctype.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+
+#include "cgroup-internal.h"
+
+static struct cgroup_subsys_state *
+debug_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+	struct cgroup_subsys_state *css = kzalloc(sizeof(*css), GFP_KERNEL);
+
+	if (!css)
+		return ERR_PTR(-ENOMEM);
+
+	return css;
+}
+
+static void debug_css_free(struct cgroup_subsys_state *css)
+{
+	kfree(css);
+}
+
+/*
+ * debug_taskcount_read - return the number of tasks in a cgroup.
+ * @cgrp: the cgroup in question
+ *
+ * Return the number of tasks in the cgroup.  The returned number can be
+ * higher than the actual number of tasks due to css_set references from
+ * namespace roots and temporary usages.
+ */
+static u64 debug_taskcount_read(struct cgroup_subsys_state *css,
+				struct cftype *cft)
+{
+	struct cgroup *cgrp = css->cgroup;
+	u64 count = 0;
+	struct cgrp_cset_link *link;
+
+	spin_lock_irq(&css_set_lock);
+	list_for_each_entry(link, &cgrp->cset_links, cset_link)
+		count += refcount_read(&link->cset->refcount);
+	spin_unlock_irq(&css_set_lock);
+	return count;
+}
+
+static u64 current_css_set_read(struct cgroup_subsys_state *css,
+				struct cftype *cft)
+{
+	return (u64)(unsigned long)current->cgroups;
+}
+
+static u64 current_css_set_refcount_read(struct cgroup_subsys_state *css,
+					 struct cftype *cft)
+{
+	u64 count;
+
+	rcu_read_lock();
+	count = refcount_read(&task_css_set(current)->refcount);
+	rcu_read_unlock();
+	return count;
+}
+
+static int current_css_set_cg_links_read(struct seq_file *seq, void *v)
+{
+	struct cgrp_cset_link *link;
+	struct css_set *cset;
+	char *name_buf;
+
+	name_buf = kmalloc(NAME_MAX + 1, GFP_KERNEL);
+	if (!name_buf)
+		return -ENOMEM;
+
+	spin_lock_irq(&css_set_lock);
+	rcu_read_lock();
+	cset = rcu_dereference(current->cgroups);
+	list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
+		struct cgroup *c = link->cgrp;
+
+		cgroup_name(c, name_buf, NAME_MAX + 1);
+		seq_printf(seq, "Root %d group %s\n",
+			   c->root->hierarchy_id, name_buf);
+	}
+	rcu_read_unlock();
+	spin_unlock_irq(&css_set_lock);
+	kfree(name_buf);
+	return 0;
+}
+
+#define MAX_TASKS_SHOWN_PER_CSS 25
+static int cgroup_css_links_read(struct seq_file *seq, void *v)
+{
+	struct cgroup_subsys_state *css = seq_css(seq);
+	struct cgrp_cset_link *link;
+
+	spin_lock_irq(&css_set_lock);
+	list_for_each_entry(link, &css->cgroup->cset_links, cset_link) {
+		struct css_set *cset = link->cset;
+		struct task_struct *task;
+		int count = 0;
+
+		seq_printf(seq, "css_set %pK\n", cset);
+
+		list_for_each_entry(task, &cset->tasks, cg_list) {
+			if (count++ > MAX_TASKS_SHOWN_PER_CSS)
+				goto overflow;
+			seq_printf(seq, "  task %d\n", task_pid_vnr(task));
+		}
+
+		list_for_each_entry(task, &cset->mg_tasks, cg_list) {
+			if (count++ > MAX_TASKS_SHOWN_PER_CSS)
+				goto overflow;
+			seq_printf(seq, "  task %d\n", task_pid_vnr(task));
+		}
+		continue;
+	overflow:
+		seq_puts(seq, "  ...\n");
+	}
+	spin_unlock_irq(&css_set_lock);
+	return 0;
+}
+
+static u64 releasable_read(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	return (!cgroup_is_populated(css->cgroup) &&
+		!css_has_online_children(&css->cgroup->self));
+}
+
+static struct cftype debug_files[] =  {
+	{
+		.name = "taskcount",
+		.read_u64 = debug_taskcount_read,
+	},
+
+	{
+		.name = "current_css_set",
+		.read_u64 = current_css_set_read,
+	},
+
+	{
+		.name = "current_css_set_refcount",
+		.read_u64 = current_css_set_refcount_read,
+	},
+
+	{
+		.name = "current_css_set_cg_links",
+		.seq_show = current_css_set_cg_links_read,
+	},
+
+	{
+		.name = "cgroup_css_links",
+		.seq_show = cgroup_css_links_read,
+	},
+
+	{
+		.name = "releasable",
+		.read_u64 = releasable_read,
+	},
+
+	{ }	/* terminate */
+};
+
+struct cgroup_subsys debug_cgrp_subsys = {
+	.css_alloc = debug_css_alloc,
+	.css_free = debug_css_free,
+	.legacy_cftypes = debug_files,
+};
-- 
1.8.3.1


* [RFC PATCH v2 09/17] cgroup: Keep accurate count of tasks in each css_set
  2017-05-15 13:33 [RFC PATCH v2 00/17] cgroup: Major changes to cgroup v2 core Waiman Long
                   ` (7 preceding siblings ...)
  2017-05-15 13:34 ` [RFC PATCH v2 08/17] cgroup: Move debug cgroup to its own file Waiman Long
@ 2017-05-15 13:34 ` Waiman Long
  2017-05-17 21:40   ` Tejun Heo
  2017-05-15 13:34 ` [RFC PATCH v2 10/17] cgroup: Make debug cgroup support v2 and thread mode Waiman Long
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 69+ messages in thread
From: Waiman Long @ 2017-05-15 13:34 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, longman

The reference count in the css_set data structure was used as a proxy
for the number of tasks attached to that css_set. However, that count
is not an accurate measure, especially with thread mode support. A
new task_count variable, protected by css_set_lock, is therefore
added to the css_set to track the actual number of attached tasks.
Functions that require the actual task count are updated to use the
new variable.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/cgroup-defs.h | 3 +++
 kernel/cgroup/cgroup-v1.c   | 6 +-----
 kernel/cgroup/cgroup.c      | 5 +++++
 kernel/cgroup/debug.c       | 6 +-----
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index b123afc..104be73 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -163,6 +163,9 @@ struct css_set {
 	/* reference count */
 	refcount_t refcount;
 
+	/* internal task count, protected by css_set_lock */
+	int task_count;
+
 	/*
 	 * If not threaded, the following points to self.  If threaded, to
 	 * a cset which belongs to the top cgroup of the threaded subtree.
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 7ad6b17..302b3b8 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -334,10 +334,6 @@ static struct cgroup_pidlist *cgroup_pidlist_find_create(struct cgroup *cgrp,
 /**
  * cgroup_task_count - count the number of tasks in a cgroup.
  * @cgrp: the cgroup in question
- *
- * Return the number of tasks in the cgroup.  The returned number can be
- * higher than the actual number of tasks due to css_set references from
- * namespace roots and temporary usages.
  */
 static int cgroup_task_count(const struct cgroup *cgrp)
 {
@@ -346,7 +342,7 @@ static int cgroup_task_count(const struct cgroup *cgrp)
 
 	spin_lock_irq(&css_set_lock);
 	list_for_each_entry(link, &cgrp->cset_links, cset_link)
-		count += refcount_read(&link->cset->refcount);
+		count += link->cset->task_count;
 	spin_unlock_irq(&css_set_lock);
 	return count;
 }
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 7b085d5..7e3ddfb 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1676,6 +1676,7 @@ static void cgroup_enable_task_cg_lists(void)
 				css_set_update_populated(cset, true);
 			list_add_tail(&p->cg_list, &cset->tasks);
 			get_css_set(cset);
+			cset->task_count++;
 		}
 		spin_unlock(&p->sighand->siglock);
 	} while_each_thread(g, p);
@@ -2159,8 +2160,10 @@ static int cgroup_migrate_execute(struct cgroup_mgctx *mgctx)
 			struct css_set *to_cset = cset->mg_dst_cset;
 
 			get_css_set(to_cset);
+			to_cset->task_count++;
 			css_set_move_task(task, from_cset, to_cset, true);
 			put_css_set_locked(from_cset);
+			from_cset->task_count--;
 		}
 	}
 	spin_unlock_irq(&css_set_lock);
@@ -5160,6 +5163,7 @@ void cgroup_post_fork(struct task_struct *child)
 		cset = task_css_set(current);
 		if (list_empty(&child->cg_list)) {
 			get_css_set(cset);
+			cset->task_count++;
 			css_set_move_task(child, NULL, cset, false);
 		}
 		spin_unlock_irq(&css_set_lock);
@@ -5209,6 +5213,7 @@ void cgroup_exit(struct task_struct *tsk)
 	if (!list_empty(&tsk->cg_list)) {
 		spin_lock_irq(&css_set_lock);
 		css_set_move_task(tsk, cset, NULL, false);
+		cset->task_count--;
 		spin_unlock_irq(&css_set_lock);
 	} else {
 		get_css_set(cset);
diff --git a/kernel/cgroup/debug.c b/kernel/cgroup/debug.c
index 56e60a2..ada53e6 100644
--- a/kernel/cgroup/debug.c
+++ b/kernel/cgroup/debug.c
@@ -23,10 +23,6 @@ static void debug_css_free(struct cgroup_subsys_state *css)
 /*
  * debug_taskcount_read - return the number of tasks in a cgroup.
  * @cgrp: the cgroup in question
- *
- * Return the number of tasks in the cgroup.  The returned number can be
- * higher than the actual number of tasks due to css_set references from
- * namespace roots and temporary usages.
  */
 static u64 debug_taskcount_read(struct cgroup_subsys_state *css,
 				struct cftype *cft)
@@ -37,7 +33,7 @@ static u64 debug_taskcount_read(struct cgroup_subsys_state *css,
 
 	spin_lock_irq(&css_set_lock);
 	list_for_each_entry(link, &cgrp->cset_links, cset_link)
-		count += refcount_read(&link->cset->refcount);
+		count += link->cset->task_count;
 	spin_unlock_irq(&css_set_lock);
 	return count;
 }
-- 
1.8.3.1


* [RFC PATCH v2 10/17] cgroup: Make debug cgroup support v2 and thread mode
  2017-05-15 13:33 [RFC PATCH v2 00/17] cgroup: Major changes to cgroup v2 core Waiman Long
                   ` (8 preceding siblings ...)
  2017-05-15 13:34 ` [RFC PATCH v2 09/17] cgroup: Keep accurate count of tasks in each css_set Waiman Long
@ 2017-05-15 13:34 ` Waiman Long
  2017-05-17 21:43   ` Tejun Heo
  2017-05-15 13:34 ` [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics Waiman Long
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 69+ messages in thread
From: Waiman Long @ 2017-05-15 13:34 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, longman

Besides supporting cgroup v2 and thread mode, the following changes
are also made:
 1) The current_* cgroup files now reside only at the root, as we
    don't need duplicate files of the same function all over the
    cgroup hierarchy.
 2) The cgroup_css_links_read() function is modified to report the
    number of tasks that are skipped because of overflow.
 3) The relationship between proc_cset and threaded_csets is displayed.
 4) The number of extra unaccounted references is displayed.
 6) The current_css_set_read() function now prints out the addresses of
    the css'es associated with the current css_set.
 7) A new cgroup_subsys_states file is added to display the css objects
    associated with a cgroup.
 8) A new cgroup_masks file is added to display the various controller
    bit masks in the cgroup.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/debug.c | 196 +++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 179 insertions(+), 17 deletions(-)

diff --git a/kernel/cgroup/debug.c b/kernel/cgroup/debug.c
index ada53e6..3121811 100644
--- a/kernel/cgroup/debug.c
+++ b/kernel/cgroup/debug.c
@@ -38,10 +38,37 @@ static u64 debug_taskcount_read(struct cgroup_subsys_state *css,
 	return count;
 }
 
-static u64 current_css_set_read(struct cgroup_subsys_state *css,
-				struct cftype *cft)
+static int current_css_set_read(struct seq_file *seq, void *v)
 {
-	return (u64)(unsigned long)current->cgroups;
+	struct css_set *cset;
+	struct cgroup_subsys *ss;
+	struct cgroup_subsys_state *css;
+	int i, refcnt;
+
+	mutex_lock(&cgroup_mutex);
+	spin_lock_irq(&css_set_lock);
+	rcu_read_lock();
+	cset = rcu_dereference(current->cgroups);
+	refcnt = refcount_read(&cset->refcount);
+	seq_printf(seq, "css_set %pK %d", cset, refcnt);
+	if (refcnt > cset->task_count)
+		seq_printf(seq, " +%d", refcnt - cset->task_count);
+	seq_puts(seq, "\n");
+
+	/*
+	 * Print the css'es stored in the current css_set.
+	 */
+	for_each_subsys(ss, i) {
+		css = cset->subsys[ss->id];
+		if (!css)
+			continue;
+		seq_printf(seq, "%2d: %-4s\t- %lx[%d]\n", ss->id, ss->name,
+			  (unsigned long)css, css->id);
+	}
+	rcu_read_unlock();
+	spin_unlock_irq(&css_set_lock);
+	mutex_unlock(&cgroup_mutex);
+	return 0;
 }
 
 static u64 current_css_set_refcount_read(struct cgroup_subsys_state *css,
@@ -86,31 +113,151 @@ static int cgroup_css_links_read(struct seq_file *seq, void *v)
 {
 	struct cgroup_subsys_state *css = seq_css(seq);
 	struct cgrp_cset_link *link;
+	int dead_cnt = 0, extra_refs = 0, threaded_csets = 0;
 
 	spin_lock_irq(&css_set_lock);
+	if (css->cgroup->proc_cgrp)
+		seq_puts(seq, (css->cgroup->proc_cgrp == css->cgroup)
+			      ? "[thread root]\n" : "[threaded]\n");
+
 	list_for_each_entry(link, &css->cgroup->cset_links, cset_link) {
 		struct css_set *cset = link->cset;
 		struct task_struct *task;
 		int count = 0;
+		int refcnt = refcount_read(&cset->refcount);
+
+		/*
+		 * Print out the proc_cset and threaded_cset relationship
+		 * and highlight difference between refcount and task_count.
+		 */
+		seq_printf(seq, "css_set %pK", cset);
+		if (rcu_dereference_protected(cset->proc_cset, 1) != cset) {
+			threaded_csets++;
+			seq_printf(seq, "=>%pK", cset->proc_cset);
+		}
+		if (!list_empty(&cset->threaded_csets)) {
+			struct css_set *tcset;
+			int idx = 0;
 
-		seq_printf(seq, "css_set %pK\n", cset);
+			list_for_each_entry(tcset, &cset->threaded_csets,
+					    threaded_csets_node) {
+				seq_puts(seq, idx ? "," : "<=");
+				seq_printf(seq, "%pK", tcset);
+				idx++;
+			}
+		} else {
+			seq_printf(seq, " %d", refcnt);
+			if (refcnt - cset->task_count > 0) {
+				int extra = refcnt - cset->task_count;
+
+				seq_printf(seq, " +%d", extra);
+				/*
+				 * Take out the one additional reference in
+				 * init_css_set.
+				 */
+				if (cset == &init_css_set)
+					extra--;
+				extra_refs += extra;
+			}
+		}
+		seq_puts(seq, "\n");
 
 		list_for_each_entry(task, &cset->tasks, cg_list) {
-			if (count++ > MAX_TASKS_SHOWN_PER_CSS)
-				goto overflow;
-			seq_printf(seq, "  task %d\n", task_pid_vnr(task));
+			if (count++ <= MAX_TASKS_SHOWN_PER_CSS)
+				seq_printf(seq, "  task %d\n",
+					   task_pid_vnr(task));
 		}
 
 		list_for_each_entry(task, &cset->mg_tasks, cg_list) {
-			if (count++ > MAX_TASKS_SHOWN_PER_CSS)
-				goto overflow;
-			seq_printf(seq, "  task %d\n", task_pid_vnr(task));
+			if (count++ <= MAX_TASKS_SHOWN_PER_CSS)
+				seq_printf(seq, "  task %d\n",
+					   task_pid_vnr(task));
 		}
-		continue;
-	overflow:
-		seq_puts(seq, "  ...\n");
+		/* show # of overflowed tasks */
+		if (count > MAX_TASKS_SHOWN_PER_CSS)
+			seq_printf(seq, "  ... (%d)\n",
+				   count - MAX_TASKS_SHOWN_PER_CSS);
+
+		if (cset->dead) {
+			seq_puts(seq, "    [dead]\n");
+			dead_cnt++;
+		}
+
+		WARN_ON(count != cset->task_count);
 	}
 	spin_unlock_irq(&css_set_lock);
+
+	if (!dead_cnt && !extra_refs && !threaded_csets)
+		return 0;
+
+	seq_puts(seq, "\n");
+	if (threaded_csets)
+		seq_printf(seq, "threaded css_sets = %d\n", threaded_csets);
+	if (extra_refs)
+		seq_printf(seq, "extra references = %d\n", extra_refs);
+	if (dead_cnt)
+		seq_printf(seq, "dead css_sets = %d\n", dead_cnt);
+
+	return 0;
+}
+
+static int cgroup_subsys_states_read(struct seq_file *seq, void *v)
+{
+	struct cgroup *cgrp = seq_css(seq)->cgroup;
+	struct cgroup_subsys *ss;
+	struct cgroup_subsys_state *css;
+	char pbuf[16];
+	int i;
+
+	mutex_lock(&cgroup_mutex);
+	for_each_subsys(ss, i) {
+		css = rcu_dereference_check(cgrp->subsys[ss->id], true);
+		if (!css)
+			continue;
+		pbuf[0] = '\0';
+
+		/* Show the parent CSS if applicable */
+		if (css->parent)
+			snprintf(pbuf, sizeof(pbuf) - 1, " P=%d",
+				 css->parent->id);
+		seq_printf(seq, "%2d: %-4s\t- %lx[%d] %d%s\n", ss->id, ss->name,
+			  (unsigned long)css, css->id,
+			  atomic_read(&css->online_cnt), pbuf);
+	}
+	mutex_unlock(&cgroup_mutex);
+	return 0;
+}
+
+static int cgroup_masks_read(struct seq_file *seq, void *v)
+{
+	struct cgroup *cgrp = seq_css(seq)->cgroup;
+	struct cgroup_subsys *ss;
+	int i, j;
+	struct {
+		u16  *mask;
+		char *name;
+	} mask_list[] = {
+		{ &cgrp->subtree_control, "subtree_control" },
+		{ &cgrp->subtree_ss_mask, "subtree_ss_mask" },
+	};
+
+	mutex_lock(&cgroup_mutex);
+	for (i = 0; i < ARRAY_SIZE(mask_list); i++) {
+		u16 mask = *mask_list[i].mask;
+		bool first = true;
+
+		seq_printf(seq, "%-15s: ", mask_list[i].name);
+		for_each_subsys(ss, j) {
+			if (!(mask & (1 << ss->id)))
+				continue;
+			if (!first)
+				seq_puts(seq, ", ");
+			seq_puts(seq, ss->name);
+			first = false;
+		}
+		seq_putc(seq, '\n');
+	}
+	mutex_unlock(&cgroup_mutex);
 	return 0;
 }
 
@@ -128,17 +275,20 @@ static u64 releasable_read(struct cgroup_subsys_state *css, struct cftype *cft)
 
 	{
 		.name = "current_css_set",
-		.read_u64 = current_css_set_read,
+		.seq_show = current_css_set_read,
+		.flags = CFTYPE_ONLY_ON_ROOT,
 	},
 
 	{
 		.name = "current_css_set_refcount",
 		.read_u64 = current_css_set_refcount_read,
+		.flags = CFTYPE_ONLY_ON_ROOT,
 	},
 
 	{
 		.name = "current_css_set_cg_links",
 		.seq_show = current_css_set_cg_links_read,
+		.flags = CFTYPE_ONLY_ON_ROOT,
 	},
 
 	{
@@ -147,6 +297,16 @@ static u64 releasable_read(struct cgroup_subsys_state *css, struct cftype *cft)
 	},
 
 	{
+		.name = "cgroup_subsys_states",
+		.seq_show = cgroup_subsys_states_read,
+	},
+
+	{
+		.name = "cgroup_masks",
+		.seq_show = cgroup_masks_read,
+	},
+
+	{
 		.name = "releasable",
 		.read_u64 = releasable_read,
 	},
@@ -155,7 +315,9 @@ static u64 releasable_read(struct cgroup_subsys_state *css, struct cftype *cft)
 };
 
 struct cgroup_subsys debug_cgrp_subsys = {
-	.css_alloc = debug_css_alloc,
-	.css_free = debug_css_free,
-	.legacy_cftypes = debug_files,
+	.css_alloc	= debug_css_alloc,
+	.css_free	= debug_css_free,
+	.legacy_cftypes	= debug_files,
+	.dfl_cftypes	= debug_files,
+	.threaded	= true,
 };
-- 
1.8.3.1


* [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-05-15 13:33 [RFC PATCH v2 00/17] cgroup: Major changes to cgroup v2 core Waiman Long
                   ` (9 preceding siblings ...)
  2017-05-15 13:34 ` [RFC PATCH v2 10/17] cgroup: Make debug cgroup support v2 and thread mode Waiman Long
@ 2017-05-15 13:34 ` Waiman Long
  2017-05-17 21:47   ` Tejun Heo
  2017-05-19 20:26   ` Tejun Heo
  2017-05-15 13:34 ` [RFC PATCH v2 12/17] cgroup: Remove cgroup v2 no internal process constraint Waiman Long
                   ` (5 subsequent siblings)
  16 siblings, 2 replies; 69+ messages in thread
From: Waiman Long @ 2017-05-15 13:34 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, longman

The current thread mode semantics aren't sufficient to fully support
threaded controllers like cpu. The main problem is that when thread
mode is enabled at the root (mainly for performance reasons), none of
the non-threaded controllers can be supported at all.

To alleviate this problem, the roles of thread root and threaded
cgroups are now further separated. Now thread mode can only be enabled
on a non-root leaf cgroup whose parent will then become the thread
root. All the descendants of a threaded cgroup will still need to be
threaded. All the non-threaded resources will be accounted for in the
thread root. Unlike the previous thread mode, however, a thread root
can have non-threaded children where system resources like memory
can be further split down the hierarchy.

Now we could have something like

	R -- A -- B
	 \
	  T1 -- T2

where R is the thread root, A and B are non-threaded cgroups, T1 and
T2 are threaded cgroups. The cgroups R, T1, T2 form a threaded subtree
where all the non-threaded resources are accounted for in R.  The no
internal process constraint does not apply in the threaded subtree.
Non-threaded controllers need to properly handle the competition
between internal processes and child cgroups at the thread root.

This model will be flexible enough to support the need of the threaded
controllers.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/cgroup-v2.txt     |  51 +++++++----
 kernel/cgroup/cgroup-internal.h |  10 +++
 kernel/cgroup/cgroup.c          | 186 +++++++++++++++++++++++++++++++++++-----
 3 files changed, 209 insertions(+), 38 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 1c6f5a9..3ae7e9c 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -222,21 +222,32 @@ process can be put in different cgroups and are not subject to the no
 internal process constraint - threaded controllers can be enabled on
 non-leaf cgroups whether they have threads in them or not.
 
-To enable the thread mode, the following conditions must be met.
+To enable the thread mode on a cgroup, the following conditions must
+be met.
 
-- The thread root doesn't have any child cgroups.
+- The cgroup doesn't have any child cgroups.
 
-- The thread root doesn't have any controllers enabled.
+- The cgroup doesn't have any non-threaded controllers enabled.
+
+- The cgroup doesn't have any processes attached to it.
 
 Thread mode can be enabled by writing "enable" to "cgroup.threads"
 file.
 
   # echo enable > cgroup.threads
 
-Inside a threaded subtree, "cgroup.threads" can be read and contains
-the list of the thread IDs of all threads in the cgroup.  Except that
-the operations are per-thread instead of per-process, "cgroup.threads"
-has the same format and behaves the same way as "cgroup.procs".
+The parent of the threaded cgroup will become the thread root, if
+it hasn't been a thread root yet. In other words, thread mode cannot
+be enabled on the root cgroup as it doesn't have a parent cgroup. A
+thread root can have child cgroups and controllers enabled before
+becoming one.
+
+A threaded subtree includes the thread root and all the threaded child
+cgroups as well as their descendants, all of which are threaded cgroups.
+"cgroup.threads" can be read and contains the list of the thread
+IDs of all threads in the cgroup.  Except that the operations are
+per-thread instead of per-process, "cgroup.threads" has the same
+format and behaves the same way as "cgroup.procs".
 
 The thread root serves as the resource domain for the whole subtree,
 and, while the threads can be scattered across the subtree, all the
@@ -246,25 +257,30 @@ not readable in the subtree proper.  However, "cgroup.procs" can be
 written to from anywhere in the subtree to migrate all threads of the
 matching process to the cgroup.
 
-Only threaded controllers can be enabled in a threaded subtree.  When
-a threaded controller is enabled inside a threaded subtree, it only
-accounts for and controls resource consumptions associated with the
-threads in the cgroup and its descendants.  All consumptions which
-aren't tied to a specific thread belong to the thread root.
+Only threaded controllers can be enabled in a non-root threaded cgroup.
+When a threaded controller is enabled inside a threaded subtree,
+it only accounts for and controls resource consumptions associated
+with the threads in the cgroup and its descendants.  All consumptions
+which aren't tied to a specific thread belong to the thread root.
 
 Because a threaded subtree is exempt from no internal process
 constraint, a threaded controller must be able to handle competition
 between threads in a non-leaf cgroup and its child cgroups.  Each
 threaded controller defines how such competitions are handled.
 
+A new child cgroup created under a thread root will not be threaded.
+Thread mode has to be explicitly enabled on each of the thread root's
+children.  Descendants of a threaded cgroup, however, will always be
+threaded and that mode cannot be disabled.
+
 To disable the thread mode, the following conditions must be met.
 
-- The cgroup is a thread root.  Thread mode can't be disabled
-  partially in the subtree.
+- The cgroup is a child of a thread root.  Thread mode can't be
+  disabled partially further down the hierarchy.
 
-- The thread root doesn't have any child cgroups.
+- The cgroup doesn't have any child cgroups.
 
-- The thread root doesn't have any controllers enabled.
+- The cgroup doesn't have any threads attached to it.
 
 Thread mode can be disabled by writing "disable" to "cgroup.threads"
 file.
@@ -366,6 +382,9 @@ with any other cgroups and requires special treatment from most
 controllers.  How resource consumption in the root cgroup is governed
 is up to each controller.
 
+The threaded cgroups and the thread roots are also exempt from this
+restriction.
+
 Note that the restriction doesn't get in the way if there is no
 enabled controller in the cgroup's "cgroup.subtree_control".  This is
 important as otherwise it wouldn't be possible to create children of a
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 2c8e3a9..15abaa0 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -124,6 +124,16 @@ static inline bool notify_on_release(const struct cgroup *cgrp)
 	return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
 }
 
+static inline bool cgroup_is_threaded(const struct cgroup *cgrp)
+{
+	return cgrp->proc_cgrp && (cgrp->proc_cgrp != cgrp);
+}
+
+static inline bool cgroup_is_thread_root(const struct cgroup *cgrp)
+{
+	return cgrp->proc_cgrp == cgrp;
+}
+
 void put_css_set_locked(struct css_set *cset);
 
 static inline void put_css_set(struct css_set *cset)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 7e3ddfb..11cb091 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -334,8 +334,13 @@ static u16 cgroup_control(struct cgroup *cgrp)
 	struct cgroup *parent = cgroup_parent(cgrp);
 	u16 root_ss_mask = cgrp->root->subsys_mask;
 
-	if (parent)
-		return parent->subtree_control;
+	if (parent) {
+		u16 ss_mask = parent->subtree_control;
+
+		if (cgroup_is_threaded(cgrp))
+			ss_mask &= cgrp_dfl_threaded_ss_mask;
+		return ss_mask;
+	}
 
 	if (cgroup_on_dfl(cgrp))
 		root_ss_mask &= ~(cgrp_dfl_inhibit_ss_mask |
@@ -348,8 +353,13 @@ static u16 cgroup_ss_mask(struct cgroup *cgrp)
 {
 	struct cgroup *parent = cgroup_parent(cgrp);
 
-	if (parent)
-		return parent->subtree_ss_mask;
+	if (parent) {
+		u16 ss_mask = parent->subtree_ss_mask;
+
+		if (cgroup_is_threaded(cgrp))
+			ss_mask &= cgrp_dfl_threaded_ss_mask;
+		return ss_mask;
+	}
 
 	return cgrp->root->subsys_mask;
 }
@@ -598,6 +608,24 @@ static bool css_set_threaded(struct css_set *cset)
 }
 
 /**
+ * threaded_children_count - returns # of threaded children
+ * @cgrp: cgroup to be tested
+ *
+ * cgroup_mutex must be held by the caller.
+ */
+static int threaded_children_count(struct cgroup *cgrp)
+{
+	struct cgroup *child;
+	int count = 0;
+
+	lockdep_assert_held(&cgroup_mutex);
+	cgroup_for_each_live_child(child, cgrp)
+		if (cgroup_is_threaded(child))
+			count++;
+	return count;
+}
+
+/**
  * cgroup_update_populated - updated populated count of a cgroup
  * @cgrp: the target cgroup
  * @populated: inc or dec populated count
@@ -2926,15 +2954,15 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 	}
 
 	/* can't enable !threaded controllers on a threaded cgroup */
-	if (cgrp->proc_cgrp && (enable & ~cgrp_dfl_threaded_ss_mask)) {
+	if (cgroup_is_threaded(cgrp) && (enable & ~cgrp_dfl_threaded_ss_mask)) {
 		ret = -EBUSY;
 		goto out_unlock;
 	}
 
 	/*
-	 * Except for root and threaded cgroups, subtree_control must be
-	 * zero for a cgroup with tasks so that child cgroups don't compete
-	 * against tasks.
+	 * Except for root, thread roots and threaded cgroups, subtree_control
+	 * must be zero for a cgroup with tasks so that child cgroups don't
+	 * compete against tasks.
 	 */
 	if (enable && cgroup_parent(cgrp) && !cgrp->proc_cgrp) {
 		struct cgrp_cset_link *link;
@@ -2982,22 +3010,48 @@ static int cgroup_enable_threaded(struct cgroup *cgrp)
 	LIST_HEAD(csets);
 	struct cgrp_cset_link *link;
 	struct css_set *cset, *cset_next;
+	struct cgroup *child;
 	int ret;
+	u16 ss_mask;
 
 	lockdep_assert_held(&cgroup_mutex);
 
 	/* noop if already threaded */
-	if (cgrp->proc_cgrp)
+	if (cgroup_is_threaded(cgrp))
 		return 0;
 
-	/* allow only if there are neither children or enabled controllers */
-	if (css_has_online_children(&cgrp->self) || cgrp->subtree_control)
+	/*
+	 * Allow only if it is not the root and there are:
+	 * 1) no children,
+	 * 2) no non-threaded controllers enabled, and
+	 * 3) no attached tasks.
+	 *
+	 * With no attached tasks, it is assumed that no css_sets will be
+	 * linked to the current cgroup. This may not be true if some dead
+	 * css_sets linger around due to task_struct leakage, for example.
+	 */
+	if (css_has_online_children(&cgrp->self) ||
+	   (cgroup_control(cgrp) & ~cgrp_dfl_threaded_ss_mask) ||
+	   !cgroup_parent(cgrp) || cgroup_is_populated(cgrp))
 		return -EBUSY;
 
-	/* find all csets which need ->proc_cset updated */
+	/* make the parent cgroup a thread root */
+	child = cgrp;
+	cgrp = cgroup_parent(child);
+
+	/* noop for parent if parent has already been threaded */
+	if (cgrp->proc_cgrp)
+		goto setup_child;
+
+	/*
+	 * For the parent cgroup, we need to find all csets which need
+	 * ->proc_cset updated
+	 */
 	spin_lock_irq(&css_set_lock);
 	list_for_each_entry(link, &cgrp->cset_links, cset_link) {
 		cset = link->cset;
+		if (cset->dead)
+			continue;
 		if (css_set_populated(cset)) {
 			WARN_ON_ONCE(css_set_threaded(cset));
 			WARN_ON_ONCE(cset->pcset_preload);
@@ -3036,7 +3090,34 @@ static int cgroup_enable_threaded(struct cgroup *cgrp)
 	/* mark it threaded */
 	cgrp->proc_cgrp = cgrp;
 
-	return 0;
+setup_child:
+	ss_mask = cgroup_ss_mask(child);
+	/*
+	 * If some non-threaded controllers are enabled, they have to be
+	 * disabled.
+	 */
+	if (ss_mask & ~cgrp_dfl_threaded_ss_mask) {
+		cgroup_save_control(child);
+		child->proc_cgrp = cgrp;
+		ret = cgroup_apply_control(child);
+		cgroup_finalize_control(child, ret);
+		kernfs_activate(child->kn);
+
+		/*
+		 * If an error happens (it shouldn't), the thread mode
+		 * enablement fails, but the parent will remain as thread
+		 * root. That shouldn't be a problem as a thread root
+		 * without threaded children is not much different from
+		 * a non-threaded cgroup.
+		 */
+		WARN_ON_ONCE(ret);
+		if (ret)
+			child->proc_cgrp = NULL;
+	} else {
+		child->proc_cgrp = cgrp;
+		ret = 0;
+	}
+	return ret;
 
 err_put_csets:
 	spin_lock_irq(&css_set_lock);
@@ -3055,26 +3136,71 @@ static int cgroup_enable_threaded(struct cgroup *cgrp)
 static int cgroup_disable_threaded(struct cgroup *cgrp)
 {
 	struct cgrp_cset_link *link;
+	struct cgroup *parent = cgroup_parent(cgrp);
 
 	lockdep_assert_held(&cgroup_mutex);
 
-	/* noop if already !threaded */
-	if (!cgrp->proc_cgrp)
-		return 0;
-
 	/* partial disable isn't supported */
-	if (cgrp->proc_cgrp != cgrp)
+	if (cgrp->proc_cgrp != parent)
 		return -EBUSY;
 
-	/* allow only if there are neither children or enabled controllers */
-	if (css_has_online_children(&cgrp->self) || cgrp->subtree_control)
+	/* noop if not a threaded cgroup */
+	if (!cgroup_is_threaded(cgrp))
+		return 0;
+
+	/*
+	 * Allow only if there are
+	 * 1) no children, and
+	 * 2) no attached tasks.
+	 *
+	 * With no attached tasks, it is assumed that no css_sets will be
+	 * linked to the current cgroup. This may not be true if some dead
+	 * css_sets linger around due to task_struct leakage, for example.
+	 */
+	if (css_has_online_children(&cgrp->self) || cgroup_is_populated(cgrp))
 		return -EBUSY;
 
-	/* walk all csets and reset ->proc_cset */
+	/*
+	 * If the cgroup has some non-threaded controllers enabled at the
+	 * subtree_control level of the parent, we need to re-enable those
+	 * controllers.
+	 */
+	cgrp->proc_cgrp = NULL;
+	if (cgroup_ss_mask(cgrp) & ~cgrp_dfl_threaded_ss_mask) {
+		int ret;
+
+		cgrp->proc_cgrp = parent;
+		cgroup_save_control(cgrp);
+		cgrp->proc_cgrp = NULL;
+		ret = cgroup_apply_control(cgrp);
+		cgroup_finalize_control(cgrp, ret);
+		kernfs_activate(cgrp->kn);
+
+		/*
+		 * If an error happens, we abandon the update to the thread
+		 * root and return the error.
+		 */
+		if (ret)
+			return ret;
+	}
+
+	/*
+	 * Check remaining threaded children count to see if the threaded
+	 * csets of the parent need to be removed and ->proc_cset reset.
+	 */
 	spin_lock_irq(&css_set_lock);
+
+	if (threaded_children_count(parent))
+		goto out_unlock;	/* still have threaded children left */
+
+	cgrp = parent;
 	list_for_each_entry(link, &cgrp->cset_links, cset_link) {
 		struct css_set *cset = link->cset;
 
+		/* skip dead css_set */
+		if (cset->dead)
+			continue;
+
 		if (css_set_threaded(cset)) {
 			struct css_set *pcset = proc_css_set(cset);
 
@@ -3090,6 +3216,7 @@ static int cgroup_disable_threaded(struct cgroup *cgrp)
 		}
 	}
 	cgrp->proc_cgrp = NULL;
+out_unlock:
 	spin_unlock_irq(&css_set_lock);
 
 	return 0;
@@ -4480,7 +4607,16 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 	cgrp->self.parent = &parent->self;
 	cgrp->root = root;
 	cgrp->level = level;
-	cgrp->proc_cgrp = parent->proc_cgrp;
+
+	/*
+	 * A child cgroup created directly under a thread root will not
+	 * be threaded. Thread mode has to be explicitly enabled for it.
+	 * The child cgroup will be threaded if its parent is threaded.
+	 */
+	if (cgroup_is_thread_root(parent))
+		cgrp->proc_cgrp = NULL;
+	else
+		cgrp->proc_cgrp = parent->proc_cgrp;
 
 	for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp))
 		cgrp->ancestor_ids[tcgrp->level] = tcgrp->id;
@@ -4712,6 +4848,12 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 		return -EBUSY;
 
 	/*
+	 * Do an implicit thread mode disable if on default hierarchy.
+	 */
+	if (cgroup_on_dfl(cgrp))
+		cgroup_disable_threaded(cgrp);
+
+	/*
 	 * Mark @cgrp and the associated csets dead.  The former prevents
 	 * further task migration and child creation by disabling
 	 * cgroup_lock_live_group().  The latter makes the csets ignored by
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [RFC PATCH v2 12/17] cgroup: Remove cgroup v2 no internal process constraint
  2017-05-15 13:33 [RFC PATCH v2 00/17] cgroup: Major changes to cgroup v2 core Waiman Long
                   ` (10 preceding siblings ...)
  2017-05-15 13:34 ` [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics Waiman Long
@ 2017-05-15 13:34 ` Waiman Long
  2017-05-19 20:38   ` Tejun Heo
  2017-05-15 13:34 ` [RFC PATCH v2 13/17] cgroup: Allow fine-grained controllers control in cgroup v2 Waiman Long
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 69+ messages in thread
From: Waiman Long @ 2017-05-15 13:34 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, longman

The rationale behind the cgroup v2 no internal process constraint is
to avoid resource competition between internal processes and child
cgroups. However, not all controllers have a problem with internal
process competition. Enforcing this rule may lead to unnatural process
hierarchies and unneeded levels for those controllers.

This patch removes the no internal process constraint by enabling those
controllers that don't like internal process competition to have a
separate set of control knobs just for internal processes in a cgroup.

A new control file "cgroup.resource_control" is added. Enabling a
controller with a "+" prefix will create a separate set of control
knobs for that controller in the special "cgroup.resource_domain"
sub-directory for all the internal processes. The existing control
knobs in the cgroup will then be used to manage resource distribution
between internal processes as a group and other child cgroups.
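A sketch of the intended usage, assuming a cgroup2 mount and a kernel with this patch applied (the cgroup name "mygroup" is hypothetical and the transcript is illustrative, not runnable on an unpatched kernel):

```shell
# Illustrative only; requires this patch and a cgroup2 mount.
cd /sys/fs/cgroup/mygroup

# Give internal processes their own memory knobs; this creates the
# reserved "cgroup.resource_domain" child directory.
echo "+memory" > cgroup.resource_control
ls cgroup.resource_domain/

# The cgroup's own memory.* files now distribute resources between
# the internal processes as a group and the other child cgroups.

# Disabling the last enabled controller removes the domain again.
echo "-memory" > cgroup.resource_control
```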

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/cgroup-v2.txt     |  76 ++++++-----
 include/linux/cgroup-defs.h     |  15 +++
 kernel/cgroup/cgroup-internal.h |   1 -
 kernel/cgroup/cgroup-v1.c       |   3 -
 kernel/cgroup/cgroup.c          | 275 ++++++++++++++++++++++++++++------------
 kernel/cgroup/debug.c           |   7 +-
 6 files changed, 260 insertions(+), 117 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 3ae7e9c..0f41282 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -23,7 +23,7 @@ CONTENTS
   2-4. Controlling Controllers
     2-4-1. Enabling and Disabling
     2-4-2. Top-down Constraint
-    2-4-3. No Internal Process Constraint
+    2-4-3. Managing Internal Process Competition
   2-5. Delegation
     2-5-1. Model of Delegation
     2-5-2. Delegation Containment
@@ -218,9 +218,7 @@ a subtree while still maintaining the common resource domain for them.
 Enabling thread mode on a subtree makes it threaded.  The root of a
 threaded subtree is called thread root and serves as the resource
 domain for the entire subtree.  In a threaded subtree, threads of a
-process can be put in different cgroups and are not subject to the no
-internal process constraint - threaded controllers can be enabled on
-non-leaf cgroups whether they have threads in them or not.
+process can be put in different cgroups.
 
 To enable the thread mode on a cgroup, the following conditions must
 be met.
@@ -263,11 +261,6 @@ it only accounts for and controls resource consumptions associated
 with the threads in the cgroup and its descendants.  All consumptions
 which aren't tied to a specific thread belong to the thread root.
 
-Because a threaded subtree is exempt from no internal process
-constraint, a threaded controller must be able to handle competition
-between threads in a non-leaf cgroup and its child cgroups.  Each
-threaded controller defines how such competitions are handled.
-
 A new child cgroup created under a thread root will not be threaded.
 Thread mode has to be explicitly enabled on each of the thread root's
 children.  Descendants of a threaded cgroup, however, will always be
@@ -364,35 +357,38 @@ the parent has the controller enabled and a controller can't be
 disabled if one or more children have it enabled.
 
 
-2-4-3. No Internal Process Constraint
+2-4-3. Managing Internal Process Competition
 
-Non-root cgroups can only distribute resources to their children when
-they don't have any processes of their own.  In other words, only
-cgroups which don't contain any processes can have controllers enabled
-in their "cgroup.subtree_control" files.
+Some controllers manage resources that don't work well when the
+internal processes in a non-leaf cgroup have to compete against the
+resource requirements of the other child cgroups. Other controllers
+work perfectly fine with internal process competition.
 
-This guarantees that, when a controller is looking at the part of the
-hierarchy which has it enabled, processes are always only on the
-leaves.  This rules out situations where child cgroups compete against
-internal processes of the parent.
+Internal processes are allowed in a non-leaf cgroup. Controllers
+that don't like internal process competition can use
+the "cgroup.resource_control" file to create a special
+"cgroup.resource_domain" child cgroup that holds the control knobs
+for all the internal processes in the cgroup.
 
-The root cgroup is exempt from this restriction.  Root contains
-processes and anonymous resource consumption which can't be associated
-with any other cgroups and requires special treatment from most
-controllers.  How resource consumption in the root cgroup is governed
-is up to each controller.
+  # echo "+memory -pids" > cgroup.resource_control
 
-The threaded cgroups and the thread roots are also exempt from this
-restriction.
+Here, the control files for the memory controller are activated in the
+"cgroup.resource_domain" directory while those of the pids controller
+are removed. All the internal processes in the cgroup will use the
+memory control files in the "cgroup.resource_domain" directory to
+manage their memory. The memory control files in the cgroup itself
+can then be used to manage resource distribution between internal
+processes as a group and other child cgroups.
 
-Note that the restriction doesn't get in the way if there is no
-enabled controller in the cgroup's "cgroup.subtree_control".  This is
-important as otherwise it wouldn't be possible to create children of a
-populated cgroup.  To control resource distribution of a cgroup, the
-cgroup must create children and transfer all its processes to the
-children before enabling controllers in its "cgroup.subtree_control"
-file.
+Only controllers that are enabled in the "cgroup.controllers" file
+can be enabled in the "cgroup.resource_control" file. Once enabled,
+the parent cgroup cannot take away the controller until it has been
+disabled in the "cgroup.resource_control" file.
 
+The directory name "cgroup.resource_domain" is reserved. It cannot
+be created or deleted directly and no child cgroups can be created
+underneath it. All the "cgroup." control files are missing and so
+the users cannot move processes into it.
 
 2-5. Delegation
 
@@ -730,6 +726,22 @@ All cgroup core files are prefixed with "cgroup."
 	the last one is effective.  When multiple enable and disable
 	operations are specified, either all succeed or all fail.
 
+  cgroup.resource_control
+
+	A read-write space separated values file which exists on all
+	cgroups.  Starts out empty.
+
+	When read, it shows space separated list of the controllers
+	which are enabled to have separate control files in the
+	"cgroup.resource_domain" directory for internal processes.
+
+	Space separated list of controllers prefixed with '+' or '-'
+	can be written to enable or disable controllers.  A controller
+	name prefixed with '+' enables the controller and '-'
+	disables.  If a controller appears more than once on the list,
+	the last one is effective.  When multiple enable and disable
+	operations are specified, either all succeed or all fail.
+
   cgroup.events
 
 	A read-only flat-keyed file which exists on non-root cgroups.
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 104be73..67ab326 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -61,6 +61,9 @@ enum {
 	 * specified at mount time and thus is implemented here.
 	 */
 	CGRP_CPUSET_CLONE_CHILDREN,
+
+	/* Special child resource domain cgroup */
+	CGRP_RESOURCE_DOMAIN,
 };
 
 /* cgroup_root->flags */
@@ -293,11 +296,23 @@ struct cgroup {
 	u16 old_subtree_control;
 	u16 old_subtree_ss_mask;
 
+	/*
+	 * The bitmask of subsystems that have separate sets of control
+	 * knobs in a special child resource cgroup to control internal
+	 * processes within the current cgroup so that they won't compete
+	 * directly with other regular child cgroups. This is for the
+	 * default hierarchy only.
+	 */
+	u16 resource_control;
+
 	/* Private pointers for each registered subsystem */
 	struct cgroup_subsys_state __rcu *subsys[CGROUP_SUBSYS_COUNT];
 
 	struct cgroup_root *root;
 
+	/* Pointer to the special resource child cgroup */
+	struct cgroup *resource_domain;
+
 	/*
 	 * List of cgrp_cset_links pointing at css_sets with tasks in this
 	 * cgroup.  Protected by css_set_lock.
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 15abaa0..fc877e0 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -180,7 +180,6 @@ struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
 			       struct cgroup_root *root, unsigned long magic,
 			       struct cgroup_namespace *ns);
 
-bool cgroup_may_migrate_to(struct cgroup *dst_cgrp);
 void cgroup_migrate_finish(struct cgroup_mgctx *mgctx);
 void cgroup_migrate_add_src(struct css_set *src_cset, struct cgroup *dst_cgrp,
 			    struct cgroup_mgctx *mgctx);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 302b3b8..ef578b6 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -99,9 +99,6 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 	if (cgroup_on_dfl(to))
 		return -EINVAL;
 
-	if (!cgroup_may_migrate_to(to))
-		return -EBUSY;
-
 	mutex_lock(&cgroup_mutex);
 
 	percpu_down_write(&cgroup_threadgroup_rwsem);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 11cb091..c3be7e2 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -63,6 +63,12 @@
 					 MAX_CFTYPE_NAME + 2)
 
 /*
+ * Reserved cgroup name for the special resource domain child cgroup of
+ * the default hierarchy.
+ */
+#define CGROUP_RESOURCE_DOMAIN	"cgroup.resource_domain"
+
+/*
  * cgroup_mutex is the master lock.  Any modification to cgroup or its
  * hierarchy must be performed while holding it.
  *
@@ -337,6 +343,9 @@ static u16 cgroup_control(struct cgroup *cgrp)
 	if (parent) {
 		u16 ss_mask = parent->subtree_control;
 
+		if (test_bit(CGRP_RESOURCE_DOMAIN, &cgrp->flags))
+			return parent->resource_control;
+
 		if (cgroup_is_threaded(cgrp))
 			ss_mask &= cgrp_dfl_threaded_ss_mask;
 		return ss_mask;
@@ -356,6 +365,9 @@ static u16 cgroup_ss_mask(struct cgroup *cgrp)
 	if (parent) {
 		u16 ss_mask = parent->subtree_ss_mask;
 
+		if (test_bit(CGRP_RESOURCE_DOMAIN, &cgrp->flags))
+			return parent->resource_control;
+
 		if (cgroup_is_threaded(cgrp))
 			ss_mask &= cgrp_dfl_threaded_ss_mask;
 		return ss_mask;
@@ -413,6 +425,11 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
 			return NULL;
 	}
 
+	if (cgrp->resource_control & (1 << ss->id)) {
+		WARN_ON(!cgrp->resource_domain);
+		if (cgrp->resource_domain)
+			return cgroup_css(cgrp->resource_domain, ss);
+	}
 	return cgroup_css(cgrp, ss);
 }
 
@@ -435,8 +452,10 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgrp,
 	rcu_read_lock();
 
 	do {
-		css = cgroup_css(cgrp, ss);
-
+		if (cgrp->resource_control & (1 << ss->id))
+			css = cgroup_css(cgrp->resource_domain, ss);
+		else
+			css = cgroup_css(cgrp, ss);
 		if (css && css_tryget_online(css))
 			goto out_unlock;
 		cgrp = cgroup_parent(cgrp);
@@ -2234,20 +2253,6 @@ static int cgroup_migrate_execute(struct cgroup_mgctx *mgctx)
 }
 
 /**
- * cgroup_may_migrate_to - verify whether a cgroup can be migration destination
- * @dst_cgrp: destination cgroup to test
- *
- * On the default hierarchy, except for the root, subtree_control must be
- * zero for migration destination cgroups with tasks so that child cgroups
- * don't compete against tasks.
- */
-bool cgroup_may_migrate_to(struct cgroup *dst_cgrp)
-{
-	return !cgroup_on_dfl(dst_cgrp) || !cgroup_parent(dst_cgrp) ||
-		!dst_cgrp->subtree_control;
-}
-
-/**
  * cgroup_migrate_finish - cleanup after attach
  * @mgctx: migration context
  *
@@ -2449,9 +2454,6 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 	struct task_struct *task;
 	int ret;
 
-	if (!cgroup_may_migrate_to(dst_cgrp))
-		return -EBUSY;
-
 	/* look up all src csets */
 	spin_lock_irq(&css_set_lock);
 	rcu_read_lock();
@@ -2572,6 +2574,15 @@ static int cgroup_subtree_control_show(struct seq_file *seq, void *v)
 	return 0;
 }
 
+/* show controllers that have resource control knobs in resource_domain */
+static int cgroup_resource_control_show(struct seq_file *seq, void *v)
+{
+	struct cgroup *cgrp = seq_css(seq)->cgroup;
+
+	cgroup_print_ss_mask(seq, cgrp->resource_control);
+	return 0;
+}
+
 /**
  * cgroup_update_dfl_csses - update css assoc of a subtree in default hierarchy
  * @cgrp: root of the subtree to update csses for
@@ -2921,33 +2932,30 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 	if (!cgrp)
 		return -ENODEV;
 
-	for_each_subsys(ss, ssid) {
-		if (enable & (1 << ssid)) {
-			if (cgrp->subtree_control & (1 << ssid)) {
-				enable &= ~(1 << ssid);
-				continue;
-			}
-
-			if (!(cgroup_control(cgrp) & (1 << ssid))) {
-				ret = -ENOENT;
-				goto out_unlock;
-			}
-		} else if (disable & (1 << ssid)) {
-			if (!(cgrp->subtree_control & (1 << ssid))) {
-				disable &= ~(1 << ssid);
-				continue;
-			}
-
-			/* a child has it enabled? */
-			cgroup_for_each_live_child(child, cgrp) {
-				if (child->subtree_control & (1 << ssid)) {
-					ret = -EBUSY;
-					goto out_unlock;
-				}
-			}
+	/*
+	 * We cannot disable controllers that are enabled in a child
+	 * cgroup.
+	 */
+	if (disable) {
+		u16 child_enable = cgrp->resource_control;
+
+		cgroup_for_each_live_child(child, cgrp)
+			child_enable |= child->subtree_control|
+					child->resource_control;
+		if (disable & child_enable) {
+			ret = -EBUSY;
+			goto out_unlock;
 		}
 	}
 
+	if (enable & ~cgroup_control(cgrp)) {
+		ret = -ENOENT;
+		goto out_unlock;
+	}
+
+	enable  &= ~cgrp->subtree_control;
+	disable &= cgrp->subtree_control;
+
 	if (!enable && !disable) {
 		ret = 0;
 		goto out_unlock;
@@ -2959,45 +2967,116 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 		goto out_unlock;
 	}
 
+	/* save and update control masks and prepare csses */
+	cgroup_save_control(cgrp);
+
+	cgrp->subtree_control |= enable;
+	cgrp->subtree_control &= ~disable;
+
+	ret = cgroup_apply_control(cgrp);
+
+	cgroup_finalize_control(cgrp, ret);
+
+	kernfs_activate(cgrp->kn);
+	ret = 0;
+out_unlock:
+	cgroup_kn_unlock(of->kn);
+	return ret ?: nbytes;
+}
+
+/*
+ * Change the list of resource domain controllers for a cgroup in the
+ * default hierarchy
+ */
+static ssize_t cgroup_resource_control_write(struct kernfs_open_file *of,
+					     char *buf, size_t nbytes,
+					     loff_t off)
+{
+	u16 enable = 0, disable = 0;
+	struct cgroup *cgrp;
+	struct cgroup_subsys *ss;
+	char *tok;
+	int ssid, ret;
+
 	/*
-	 * Except for root, thread roots and threaded cgroups, subtree_control
-	 * must be zero for a cgroup with tasks so that child cgroups don't
-	 * compete against tasks.
+	 * Parse input - space separated list of subsystem names prefixed
+	 * with either + or -.
 	 */
-	if (enable && cgroup_parent(cgrp) && !cgrp->proc_cgrp) {
-		struct cgrp_cset_link *link;
-
-		/*
-		 * Because namespaces pin csets too, @cgrp->cset_links
-		 * might not be empty even when @cgrp is empty.  Walk and
-		 * verify each cset.
-		 */
-		spin_lock_irq(&css_set_lock);
+	buf = strstrip(buf);
+	while ((tok = strsep(&buf, " "))) {
+		if (tok[0] == '\0')
+			continue;
+		do_each_subsys_mask(ss, ssid, ~cgrp_dfl_inhibit_ss_mask) {
+			if (!cgroup_ssid_enabled(ssid) ||
+			    strcmp(tok + 1, ss->name))
+				continue;
 
-		ret = 0;
-		list_for_each_entry(link, &cgrp->cset_links, cset_link) {
-			if (css_set_populated(link->cset)) {
-				ret = -EBUSY;
-				break;
+			if (*tok == '+') {
+				enable |= 1 << ssid;
+				disable &= ~(1 << ssid);
+			} else if (*tok == '-') {
+				disable |= 1 << ssid;
+				enable &= ~(1 << ssid);
+			} else {
+				return -EINVAL;
 			}
-		}
+			break;
+		} while_each_subsys_mask();
+		if (ssid == CGROUP_SUBSYS_COUNT)
+			return -EINVAL;
+	}
 
-		spin_unlock_irq(&css_set_lock);
+	cgrp = cgroup_kn_lock_live(of->kn, true);
+	if (!cgrp)
+		return -ENODEV;
 
-		if (ret)
-			goto out_unlock;
+	/*
+	 * All the enabled or disabled controllers must have been enabled
+	 * in the current cgroup.
+	 */
+	if ((cgroup_control(cgrp) & (enable|disable)) != (enable|disable)) {
+		ret = -ENOENT;
+		goto out_unlock;
 	}
 
+	/*
+	 * Clear bits that are currently enabled and disabled in
+	 * resource_control.
+	 */
+	enable  &= ~cgrp->resource_control;
+	disable &=  cgrp->resource_control;
+
+	if (!enable && !disable) {
+		ret = 0;
+		goto out_unlock;
+	}
+
+	/*
+	 * Create a new child resource domain cgroup if necessary.
+	 */
+	if (!cgrp->resource_domain && enable)
+		cgroup_mkdir(cgrp->kn, NULL, 0755);
+
+	cgrp->resource_control &= ~disable;
+	cgrp->resource_control |= enable;
+
 	/* save and update control masks and prepare csses */
 	cgroup_save_control(cgrp);
 
-	cgrp->subtree_control |= enable;
-	cgrp->subtree_control &= ~disable;
-
 	ret = cgroup_apply_control(cgrp);
 
 	cgroup_finalize_control(cgrp, ret);
 
+	/*
+	 * Destroy the child resource domain cgroup if no controllers are
+	 * enabled in the resource_control.
+	 */
+	if (!cgrp->resource_control) {
+		struct cgroup *rdomain = cgrp->resource_domain;
+
+		cgrp->resource_domain = NULL;
+		cgroup_destroy_locked(rdomain);
+	}
 	kernfs_activate(cgrp->kn);
 	ret = 0;
 out_unlock:
@@ -4303,6 +4382,11 @@ static ssize_t cgroup_threads_write(struct kernfs_open_file *of,
 		.write = cgroup_subtree_control_write,
 	},
 	{
+		.name = "cgroup.resource_control",
+		.seq_show = cgroup_resource_control_show,
+		.write = cgroup_resource_control_write,
+	},
+	{
 		.name = "cgroup.events",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.file_offset = offsetof(struct cgroup, events_file),
@@ -4661,25 +4745,49 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 	return ERR_PTR(ret);
 }
 
+/*
+ * The name parameter will be NULL if called internally for creating the
+ * special resource domain cgroup. In this case, the cgroup_mutex will be
+ * held and there is no need to acquire or release it.
+ */
 int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
 {
 	struct cgroup *parent, *cgrp;
 	struct kernfs_node *kn;
+	bool create_rd = (name == NULL);
 	int ret;
 
-	/* do not accept '\n' to prevent making /proc/<pid>/cgroup unparsable */
-	if (strchr(name, '\n'))
-		return -EINVAL;
+	/*
+	 * Do not accept '\n' to prevent making /proc/<pid>/cgroup unparsable.
+	 * The reserved resource domain directory name cannot be used. A
+	 * sub-directory cannot be created under a resource domain directory.
+	 */
+	if (create_rd) {
+		lockdep_assert_held(&cgroup_mutex);
+		name = CGROUP_RESOURCE_DOMAIN;
+		parent = parent_kn->priv;
+	} else {
+		if (strchr(name, '\n') || !strcmp(name, CGROUP_RESOURCE_DOMAIN))
+			return -EINVAL;
 
-	parent = cgroup_kn_lock_live(parent_kn, false);
-	if (!parent)
-		return -ENODEV;
+		parent = cgroup_kn_lock_live(parent_kn, false);
+		if (!parent)
+			return -ENODEV;
+		if (test_bit(CGRP_RESOURCE_DOMAIN, &parent->flags)) {
+			ret = -EINVAL;
+			goto out_unlock;
+		}
+	}
 
 	cgrp = cgroup_create(parent);
 	if (IS_ERR(cgrp)) {
 		ret = PTR_ERR(cgrp);
 		goto out_unlock;
 	}
+	if (create_rd) {
+		parent->resource_domain = cgrp;
+		set_bit(CGRP_RESOURCE_DOMAIN, &cgrp->flags);
+	}
 
 	/* create the directory */
 	kn = kernfs_create_dir(parent->kn, name, mode, cgrp);
@@ -4699,9 +4807,11 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
 	if (ret)
 		goto out_destroy;
 
-	ret = css_populate_dir(&cgrp->self);
-	if (ret)
-		goto out_destroy;
+	if (!create_rd) {
+		ret = css_populate_dir(&cgrp->self);
+		if (ret)
+			goto out_destroy;
+	}
 
 	ret = cgroup_apply_control_enable(cgrp);
 	if (ret)
@@ -4718,7 +4828,8 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
 out_destroy:
 	cgroup_destroy_locked(cgrp);
 out_unlock:
-	cgroup_kn_unlock(parent_kn);
+	if (!create_rd)
+		cgroup_kn_unlock(parent_kn);
 	return ret;
 }
 
@@ -4893,7 +5004,15 @@ int cgroup_rmdir(struct kernfs_node *kn)
 	if (!cgrp)
 		return 0;
 
-	ret = cgroup_destroy_locked(cgrp);
+	/*
+	 * A resource domain cgroup cannot be removed directly by users.
+	 * It can only be done indirectly by writing to the
+	 * "cgroup.resource_control" file.
+	 */
+	if (test_bit(CGRP_RESOURCE_DOMAIN, &cgrp->flags))
+		ret = -EINVAL;
+	else
+		ret = cgroup_destroy_locked(cgrp);
 
 	if (!ret)
 		trace_cgroup_rmdir(cgrp);
diff --git a/kernel/cgroup/debug.c b/kernel/cgroup/debug.c
index 3121811..b565951 100644
--- a/kernel/cgroup/debug.c
+++ b/kernel/cgroup/debug.c
@@ -237,8 +237,9 @@ static int cgroup_masks_read(struct seq_file *seq, void *v)
 		u16  *mask;
 		char *name;
 	} mask_list[] = {
-		{ &cgrp->subtree_control, "subtree_control" },
-		{ &cgrp->subtree_ss_mask, "subtree_ss_mask" },
+		{ &cgrp->subtree_control,  "subtree_control"  },
+		{ &cgrp->subtree_ss_mask,  "subtree_ss_mask"  },
+		{ &cgrp->resource_control, "resource_control" },
 	};
 
 	mutex_lock(&cgroup_mutex);
@@ -246,7 +247,7 @@ static int cgroup_masks_read(struct seq_file *seq, void *v)
 		u16 mask = *mask_list[i].mask;
 		bool first = true;
 
-		seq_printf(seq, "%-15s: ", mask_list[i].name);
+		seq_printf(seq, "%-16s: ", mask_list[i].name);
 		for_each_subsys(ss, j) {
 			if (!(mask & (1 << ss->id)))
 				continue;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [RFC PATCH v2 13/17] cgroup: Allow fine-grained controllers control in cgroup v2
  2017-05-15 13:33 [RFC PATCH v2 00/17] cgroup: Major changes to cgroup v2 core Waiman Long
                   ` (11 preceding siblings ...)
  2017-05-15 13:34 ` [RFC PATCH v2 12/17] cgroup: Remove cgroup v2 no internal process constraint Waiman Long
@ 2017-05-15 13:34 ` Waiman Long
  2017-05-19 20:55   ` Tejun Heo
  2017-05-15 13:34 ` [RFC PATCH v2 14/17] cgroup: Enable printing of v2 controllers' cgroup hierarchy Waiman Long
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 69+ messages in thread
From: Waiman Long @ 2017-05-15 13:34 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, longman

For cgroup v1, different controllers can be bound to different cgroup
hierarchies optimized for their own use cases. That is not currently
the case for cgroup v2, where combining all these controllers into
the same hierarchy will likely require more levels than any individual
controller needs.

By not enabling a controller in a cgroup and its descendants, we can
effectively trim the hierarchy as seen by a controller from the leaves
up. However, there is currently no way to compress the hierarchy in
the intermediate levels.

This patch implements a fine-grained mechanism to allow a controller to
skip some intermediate levels in a hierarchy and effectively flatten
the hierarchy as seen by that controller.

Controllers can now be directly enabled or disabled in a cgroup
by writing to the "cgroup.controllers" file.  The special prefix
'#' with the controller name is used to set that controller in
pass-through mode.  In that mode, the controller is disabled for that
cgroup but it allows its children to have that controller enabled or
in pass-through mode again.

With this change, each controller can now have a unique view of its
virtual process hierarchy that can be quite different from those of
other controllers.  We now have the freedom and flexibility to create
the right hierarchy for each controller to suit its own needs without
performance loss when compared with cgroup v1.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/cgroup-v2.txt | 125 ++++++++++++++++++---
 include/linux/cgroup-defs.h |  11 ++
 kernel/cgroup/cgroup.c      | 263 ++++++++++++++++++++++++++++++++++++++------
 kernel/cgroup/debug.c       |   8 +-
 4 files changed, 359 insertions(+), 48 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 0f41282..bb27491 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -308,25 +308,28 @@ both cgroups.
 2-4-1. Enabling and Disabling
 
 Each cgroup has a "cgroup.controllers" file which lists all
-controllers available for the cgroup to enable.
+controllers available for the cgroup to enable for its children.
 
   # cat cgroup.controllers
   cpu io memory
 
-No controller is enabled by default.  Controllers can be enabled and
-disabled by writing to the "cgroup.subtree_control" file.
+No controller is enabled by default.  Controllers can be
+enabled and disabled on the child cgroups by writing to the
+"cgroup.subtree_control" file. A '+' prefix enables the controller,
+and a '-' prefix disables it.
 
   # echo "+cpu +memory -io" > cgroup.subtree_control
 
-Only controllers which are listed in "cgroup.controllers" can be
-enabled.  When multiple operations are specified as above, either they
-all succeed or fail.  If multiple operations on the same controller
-are specified, the last one is effective.
+Only controllers which are listed in "cgroup.controllers" can
+be enabled in the "cgroup.subtree_control" file.  When multiple
+operations are specified as above, either they all succeed or all fail.
+If multiple operations on the same controller are specified, the last
+one is effective.
 
 Enabling a controller in a cgroup indicates that the distribution of
 the target resource across its immediate children will be controlled.
-Consider the following sub-hierarchy.  The enabled controllers are
-listed in parentheses.
+Consider the following sub-hierarchy.  The enabled controllers in the
+"cgroup.subtree_control" file are listed in parentheses.
 
   A(cpu,memory) - B(memory) - C()
                             \ D()
@@ -336,6 +339,17 @@ of CPU cycles and memory to its children, in this case, B.  As B has
 "memory" enabled but not "CPU", C and D will compete freely on CPU
 cycles but their division of memory available to B will be controlled.
 
+By not enabling a controller in a cgroup and its descendants, we can
+effectively trim the hierarchy as seen by a controller from the leaves
+up. From the perspective of the cpu controller, the hierarchy is:
+
+  A - B|C|D
+
+From the perspective of the memory controller, the hierarchy becomes:
+
+  A - B - C
+        \ D
+
 As a controller regulates the distribution of the target resource to
 the cgroup's children, enabling it creates the controller's interface
 files in the child cgroups.  In the above example, enabling "cpu" on B
@@ -343,7 +357,81 @@ would create the "cpu." prefixed controller interface files in C and
 D.  Likewise, disabling "memory" from B would remove the "memory."
 prefixed controller interface files from C and D.  This means that the
 controller interface files - anything which doesn't start with
-"cgroup." are owned by the parent rather than the cgroup itself.
+"cgroup." can be considered to be owned by the parent under this
+control scheme.
+
+Enabling controllers via the "cgroup.subtree_control" file is
+relatively coarse-grained.  Fine-grained control of the controllers in
+a non-root cgroup can be done by writing to its "cgroup.controllers"
+file directly. A '+' prefix enables a controller as long as that
+controller is also enabled on its parent. Similarly, the '-' prefix
+disables a controller as long as that controller isn't enabled in its
+parent's subtree_control file.
+
+The special prefix '#' is used to mark a controller in pass-through
+mode. In this mode, the controller is disabled in the cgroup,
+effectively collapsing it into its parent from the perspective of
+that controller. However, it allows its child cgroups to enable the
+controller or have it in pass-through mode again. For example,
+
+   +   #   #   #   +
+   A - B - C - D - E
+         \ F
+	   +
+In this case, the effective hierarchy is:
+
+	A|B|C|D - E
+	        \ F
+
+Under this control scheme, the interface files can be considered to be
+owned by the cgroup itself. The use of the special '#' prefix allows
+the users to trim away layers in the middle of the hierarchy, thus
+flattening the tree from the perspective of that particular controller.
+As a result, different controllers can have quite different views of
+their virtual process hierarchy that can best fit their own needs.
+
+In the diagram below, the controller names in parentheses represent
+controllers enabled by writing to the "cgroup.controllers" file.
+
+  A(cpu,memory) - B(cpu,#memory) - C()
+                                 \ D(memory)
+
+From the memory controller's perspective, the hierarchy looks like:
+
+   A|B|C - D
+
+For the CPU controller, the hierarchy is:
+
+   A - B|C|D
+
+Both control schemes can be used together with some limitations
+as shown in the following table about the interaction between
+subtree_control file of the parent of a cgroup and its controllers
+file.
+
+  ++: enable a controller in parent's subtree_control
+  --: disable a controller in parent's subtree_control
+   +: enable a controller in controllers
+   -: disable a controller in controllers
+   #: skip a controller in controllers
+
+  Old State  New Desired State  Result
+  ---------  -----------------  ------
+    ++               +          Ignored
+    ++               #          Accepted
+    ++               -          Rejected*
+    --             + or #       Accepted
+    --               -          Ignored
+     +               ++         Accepted
+     +               --         Rejected
+     -             ++ or --     Accepted
+     #             ++ or --     Rejected
+
+In the special case that the cgroup is in both '++' & '#' states
+('++' followed by '#'), the '-' prefix can be used to turn off the
+'#', leading to an effective '++' state.  A cgroup in the '+' or '#'
+state cannot be changed back to '-', nor switched between the two,
+as long as its children have that controller in a non-'-' state.
 
 
 2-4-2. Top-down Constraint
@@ -353,8 +441,8 @@ a resource only if the resource has been distributed to it from the
 parent.  This means that all non-root "cgroup.subtree_control" files
 can only contain controllers which are enabled in the parent's
 "cgroup.subtree_control" file.  A controller can be enabled only if
-the parent has the controller enabled and a controller can't be
-disabled if one or more children have it enabled.
+the parent has the controller enabled ('+' or '#') and a controller
+can't be disabled if one or more children have it enabled.
 
 
 2-4-3. Managing Internal Process Competition
@@ -704,11 +792,18 @@ All cgroup core files are prefixed with "cgroup."
 
   cgroup.controllers
 
-	A read-only space separated values file which exists on all
+	A read-write space separated values file which exists on all
 	cgroups.
 
-	It shows space separated list of all controllers available to
-	the cgroup.  The controllers are not ordered.
+	When read, it shows a space separated list of all controllers
+	available to the cgroup.  The controllers are not ordered.
+
+	A space separated list of controllers prefixed with '+', '-' or
+	'#' can be written to enable, disable or set the controllers
+	in pass-through mode. If a controller appears more than once
+	on the list, the last one is effective.  When multiple enable
+	and disable operations are specified, either all succeed or
+	all fail.
 
   cgroup.subtree_control
 
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 67ab326..5d30182 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -305,6 +305,17 @@ struct cgroup {
 	 */
 	u16 resource_control;
 
+	/*
+	 * The bitmasks of subsystems enabled and in pass-through mode in
+	 * the current cgroup. The parent's subtree_ss_mask has priority.
+	 * A bit set in subtree_ss_mask will suppress the setting of the
+	 * corresponding bit in enable_ss_mask and passthru_ss_mask.
+	 */
+	u16 enable_ss_mask;
+	u16 passthru_ss_mask;
+	u16 old_enable_ss_mask;
+	u16 old_passthru_ss_mask;
+
 	/* Private pointers for each registered subsystem */
 	struct cgroup_subsys_state __rcu *subsys[CGROUP_SUBSYS_COUNT];
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c3be7e2..6e77ebe 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -341,7 +341,7 @@ static u16 cgroup_control(struct cgroup *cgrp)
 	u16 root_ss_mask = cgrp->root->subsys_mask;
 
 	if (parent) {
-		u16 ss_mask = parent->subtree_control;
+		u16 ss_mask = parent->subtree_control|cgrp->enable_ss_mask;
 
 		if (test_bit(CGRP_RESOURCE_DOMAIN, &cgrp->flags))
 			return parent->resource_control;
@@ -363,7 +363,7 @@ static u16 cgroup_ss_mask(struct cgroup *cgrp)
 	struct cgroup *parent = cgroup_parent(cgrp);
 
 	if (parent) {
-		u16 ss_mask = parent->subtree_ss_mask;
+		u16 ss_mask = parent->subtree_ss_mask|cgrp->enable_ss_mask;
 
 		if (test_bit(CGRP_RESOURCE_DOMAIN, &cgrp->flags))
 			return parent->resource_control;
@@ -2540,15 +2540,18 @@ void cgroup_procs_write_finish(struct task_struct *task)
 			ss->post_attach();
 }
 
-static void cgroup_print_ss_mask(struct seq_file *seq, u16 ss_mask)
+static void cgroup_print_ss_mask(struct seq_file *seq, u16 ss_mask,
+				 u16 passthru_mask)
 {
 	struct cgroup_subsys *ss;
 	bool printed = false;
 	int ssid;
 
-	do_each_subsys_mask(ss, ssid, ss_mask) {
+	do_each_subsys_mask(ss, ssid, ss_mask|passthru_mask) {
 		if (printed)
 			seq_putc(seq, ' ');
+		if (passthru_mask & (1 << ssid))
+			seq_putc(seq, '#');
 		seq_printf(seq, "%s", ss->name);
 		printed = true;
 	} while_each_subsys_mask();
@@ -2561,7 +2564,7 @@ static int cgroup_controllers_show(struct seq_file *seq, void *v)
 {
 	struct cgroup *cgrp = seq_css(seq)->cgroup;
 
-	cgroup_print_ss_mask(seq, cgroup_control(cgrp));
+	cgroup_print_ss_mask(seq, cgroup_control(cgrp), cgrp->passthru_ss_mask);
 	return 0;
 }
 
@@ -2570,7 +2573,7 @@ static int cgroup_subtree_control_show(struct seq_file *seq, void *v)
 {
 	struct cgroup *cgrp = seq_css(seq)->cgroup;
 
-	cgroup_print_ss_mask(seq, cgrp->subtree_control);
+	cgroup_print_ss_mask(seq, cgrp->subtree_control, 0);
 	return 0;
 }
 
@@ -2579,7 +2582,7 @@ static int cgroup_resource_control_show(struct seq_file *seq, void *v)
 {
 	struct cgroup *cgrp = seq_css(seq)->cgroup;
 
-	cgroup_print_ss_mask(seq, cgrp->resource_control);
+	cgroup_print_ss_mask(seq, cgrp->resource_control, 0);
 	return 0;
 }
 
@@ -2692,6 +2695,8 @@ static void cgroup_save_control(struct cgroup *cgrp)
 	cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
 		dsct->old_subtree_control = dsct->subtree_control;
 		dsct->old_subtree_ss_mask = dsct->subtree_ss_mask;
+		dsct->old_enable_ss_mask = dsct->enable_ss_mask;
+		dsct->old_passthru_ss_mask = dsct->passthru_ss_mask;
 	}
 }
 
@@ -2709,10 +2714,11 @@ static void cgroup_propagate_control(struct cgroup *cgrp)
 	struct cgroup_subsys_state *d_css;
 
 	cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
-		dsct->subtree_control &= cgroup_control(dsct);
+		dsct->subtree_control &= cgroup_control(dsct)|
+					 dsct->passthru_ss_mask;
 		dsct->subtree_ss_mask =
 			cgroup_calc_subtree_ss_mask(dsct->subtree_control,
-						    cgroup_ss_mask(dsct));
+				cgroup_ss_mask(dsct)|dsct->passthru_ss_mask);
 	}
 }
 
@@ -2731,6 +2737,8 @@ static void cgroup_restore_control(struct cgroup *cgrp)
 	cgroup_for_each_live_descendant_post(dsct, d_css, cgrp) {
 		dsct->subtree_control = dsct->old_subtree_control;
 		dsct->subtree_ss_mask = dsct->old_subtree_ss_mask;
+		dsct->enable_ss_mask = dsct->old_enable_ss_mask;
+		dsct->passthru_ss_mask = dsct->old_passthru_ss_mask;
 	}
 }
 
@@ -2772,7 +2780,8 @@ static int cgroup_apply_control_enable(struct cgroup *cgrp)
 
 			WARN_ON_ONCE(css && percpu_ref_is_dying(&css->refcnt));
 
-			if (!(cgroup_ss_mask(dsct) & (1 << ss->id)))
+			if (!(cgroup_ss_mask(dsct) & (1 << ss->id)) ||
+			    (dsct->passthru_ss_mask & (1 << ss->id)))
 				continue;
 
 			if (!css) {
@@ -2822,7 +2831,8 @@ static void cgroup_apply_control_disable(struct cgroup *cgrp)
 				continue;
 
 			if (css->parent &&
-			    !(cgroup_ss_mask(dsct) & (1 << ss->id))) {
+			    (!(cgroup_ss_mask(dsct) & (1 << ss->id)) ||
+			    (dsct->passthru_ss_mask & (1 << ss->id)))) {
 				kill_css(css);
 			} else if (!css_visible(css)) {
 				css_clear_dir(css);
@@ -2895,7 +2905,8 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 					    loff_t off)
 {
 	u16 enable = 0, disable = 0;
-	struct cgroup *cgrp, *child;
+	u16 child_enable, child_passthru = 0;
+	struct cgroup *cgrp, *child, *grandchild;
 	struct cgroup_subsys *ss;
 	char *tok;
 	int ssid, ret;
@@ -2933,22 +2944,36 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 		return -ENODEV;
 
 	/*
-	 * We cannot disable controllers that are enabled in a child
-	 * cgroup.
+	 * Because a controller can be enabled on a grandchild if it is
+	 * enabled in subtree_control, we need to look at all the children
+	 * and grandchildren for what are enabled.
 	 */
-	if (disable) {
-		u16 child_enable = cgrp->resource_control;
+	child_enable = cgrp->resource_control;
+	cgroup_for_each_live_child(child, cgrp) {
+		child_enable |= child->subtree_control|
+				child->resource_control|
+				child->enable_ss_mask;
+		child_passthru |= child->passthru_ss_mask;
+
+		cgroup_for_each_live_child(grandchild, child)
+			child_enable |= grandchild->subtree_control|
+					grandchild->resource_control|
+					grandchild->enable_ss_mask|
+					grandchild->passthru_ss_mask;
+	}
 
-		cgroup_for_each_live_child(child, cgrp)
-			child_enable |= child->subtree_control|
-					child->resource_control;
-		if (disable & child_enable) {
-			ret = -EBUSY;
-			goto out_unlock;
-		}
+	/*
+	 * We cannot disable controllers that are enabled or in pass-through
+	 * mode in a child or grandchild cgroup. We also cannot enable
+	 * controllers that are in pass-through mode in a child cgroup.
+	 */
+	if ((disable & (child_enable|child_passthru)) ||
+	    (enable  & child_passthru)) {
+		ret = -EBUSY;
+		goto out_unlock;
 	}
 
-	if (enable & ~cgroup_control(cgrp)) {
+	if (enable & ~(cgroup_control(cgrp)|cgrp->passthru_ss_mask)) {
 		ret = -ENOENT;
 		goto out_unlock;
 	}
@@ -2963,7 +2988,7 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 
 	/* can't enable !threaded controllers on a threaded cgroup */
 	if (cgroup_is_threaded(cgrp) && (enable & ~cgrp_dfl_threaded_ss_mask)) {
-		ret = -EBUSY;
+		ret = -EINVAL;
 		goto out_unlock;
 	}
 
@@ -2973,6 +2998,164 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 	cgrp->subtree_control |= enable;
 	cgrp->subtree_control &= ~disable;
 
+	/*
+	 * Clear the child's enable_ss_mask for those bits that are enabled
+	 * in subtree_control.
+	 */
+	if (child_enable & enable) {
+		cgroup_for_each_live_child(child, cgrp)
+			child->enable_ss_mask &= ~enable;
+	}
+
+	ret = cgroup_apply_control(cgrp);
+
+	cgroup_finalize_control(cgrp, ret);
+
+	kernfs_activate(cgrp->kn);
+	ret = 0;
+out_unlock:
+	cgroup_kn_unlock(of->kn);
+	return ret ?: nbytes;
+}
+
+/*
+ * Change the enabled and pass-through controllers for a cgroup in the
+ * default hierarchy
+ */
+static ssize_t cgroup_controllers_write(struct kernfs_open_file *of,
+					char *buf, size_t nbytes,
+					loff_t off)
+{
+	u16 enable = 0, disable = 0, passthru = 0;
+	u16 child_enable, parent_subtree;
+	struct cgroup *cgrp, *child, *parent;
+	struct cgroup_subsys *ss;
+	char *tok;
+	int ssid, ret;
+
+	/*
+	 * Parse input - space separated list of subsystem names prefixed
+	 * with either +, - or #.
+	 */
+	buf = strstrip(buf);
+	while ((tok = strsep(&buf, " "))) {
+		if (tok[0] == '\0')
+			continue;
+		do_each_subsys_mask(ss, ssid, ~cgrp_dfl_inhibit_ss_mask) {
+			if (!cgroup_ssid_enabled(ssid) ||
+			    strcmp(tok + 1, ss->name))
+				continue;
+
+			if (*tok == '+') {
+				enable |= 1 << ssid;
+				disable &= ~(1 << ssid);
+				passthru &= ~(1 << ssid);
+			} else if (*tok == '-') {
+				disable |= 1 << ssid;
+				enable &= ~(1 << ssid);
+				passthru &= ~(1 << ssid);
+			} else if (*tok == '#') {
+				passthru |= 1 << ssid;
+				enable &= ~(1 << ssid);
+				disable &= ~(1 << ssid);
+			} else {
+				return -EINVAL;
+			}
+			break;
+		} while_each_subsys_mask();
+		if (ssid == CGROUP_SUBSYS_COUNT)
+			return -EINVAL;
+	}
+
+	cgrp = cgroup_kn_lock_live(of->kn, true);
+	if (!cgrp)
+		return -ENODEV;
+
+	/*
+	 * Writing to the root cgroup's controllers file is not allowed.
+	 */
+	parent = cgroup_parent(cgrp);
+	if (!parent) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	/*
+	 * We only look at the parent's subtree_control bits that are
+	 * not in passthru_ss_mask.
+	 */
+	parent_subtree = parent->subtree_control & ~cgrp->passthru_ss_mask;
+
+	/*
+	 * Reject disable bits that are in parent's subtree_control except
+	 * when they are also in passthru_ss_mask.
+	 */
+	if (disable & parent_subtree) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	child_enable = cgrp->resource_control|cgrp->subtree_control;
+	cgroup_for_each_live_child(child, cgrp)
+		child_enable |= child->subtree_control|child->resource_control|
+				child->enable_ss_mask|child->passthru_ss_mask;
+
+	/*
+	 * Mask off bits that have been set as well as enable bits set
+	 * in parent's subtree_control, but not in passthru_ss_mask.
+	 */
+	passthru &= ~cgrp->passthru_ss_mask;
+	enable   &= ~(cgrp->enable_ss_mask|parent_subtree);
+
+	/*
+	 * We cannot enable, disable or pass-through controllers that
+	 * are enabled in children's passthru_ss_mask, enable_ss_mask,
+	 * resource_control or subtree_control as well as its own
+	 * resource_control and subtree_control.
+	 */
+	if ((disable|passthru|enable) & child_enable) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	/*
+	 * We also cannot enable or pass through controllers that are
+	 * neither enabled in the parent nor in its passthru_ss_mask.
+	 */
+	if (((enable|passthru) & (parent->passthru_ss_mask|
+				  cgroup_control(parent)))
+				  != (enable|passthru)) {
+		ret = -ENOENT;
+		goto out_unlock;
+	}
+
+	disable &= cgrp->enable_ss_mask|cgrp->passthru_ss_mask;
+	if (!enable && !disable && !passthru) {
+		ret = 0;
+		goto out_unlock;
+	}
+
+	/*
+	 * Can't enable or pass through !threaded controllers on a
+	 * threaded cgroup
+	 */
+	if (cgroup_is_threaded(cgrp) &&
+	   ((enable|passthru) & ~cgrp_dfl_threaded_ss_mask)) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	/* Save and update control masks and prepare csses */
+	cgroup_save_control(cgrp);
+
+	cgrp->passthru_ss_mask |= passthru;
+	cgrp->passthru_ss_mask &= ~(disable|enable);
+
+	/* Mask off enable bits set in parent's subtree_control */
+	enable &= ~parent->subtree_control;
+	cgrp->enable_ss_mask |= enable;
+	cgrp->enable_ss_mask &= ~(disable|passthru);
+
 	ret = cgroup_apply_control(cgrp);
 
 	cgroup_finalize_control(cgrp, ret);
@@ -3102,7 +3285,8 @@ static int cgroup_enable_threaded(struct cgroup *cgrp)
 	/*
 	 * Allow only if it is not the root and there are:
 	 * 1) no children,
-	 * 2) no non-threaded controllers are enabled, and
+	 * 2) no non-threaded controllers are enabled or in pass-through
+	 *    mode, and
 	 * 3) no attached tasks.
 	 *
 	 * With no attached tasks, it is assumed that no css_sets will be
@@ -3110,7 +3294,8 @@ static int cgroup_enable_threaded(struct cgroup *cgrp)
 	 * css_sets linger around due to task_struct leakage, for example.
 	 */
 	if (css_has_online_children(&cgrp->self) ||
-	   (cgroup_control(cgrp) & ~cgrp_dfl_threaded_ss_mask) ||
+	   ((cgroup_control(cgrp)|cgrp->passthru_ss_mask)
+		& ~cgrp_dfl_threaded_ss_mask) ||
 	   !cgroup_parent(cgrp) || cgroup_is_populated(cgrp))
 		return -EBUSY;
 
@@ -4375,6 +4560,7 @@ static ssize_t cgroup_threads_write(struct kernfs_open_file *of,
 	{
 		.name = "cgroup.controllers",
 		.seq_show = cgroup_controllers_show,
+		.write = cgroup_controllers_write,
 	},
 	{
 		.name = "cgroup.subtree_control",
@@ -4526,7 +4712,8 @@ static void css_release(struct percpu_ref *ref)
 }
 
 static void init_and_link_css(struct cgroup_subsys_state *css,
-			      struct cgroup_subsys *ss, struct cgroup *cgrp)
+			      struct cgroup_subsys *ss, struct cgroup *cgrp,
+			      struct cgroup_subsys_state *parent_css)
 {
 	lockdep_assert_held(&cgroup_mutex);
 
@@ -4542,7 +4729,7 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
 	atomic_set(&css->online_cnt, 0);
 
 	if (cgroup_parent(cgrp)) {
-		css->parent = cgroup_css(cgroup_parent(cgrp), ss);
+		css->parent = parent_css;
 		css_get(css->parent);
 	}
 
@@ -4605,19 +4792,31 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
 					      struct cgroup_subsys *ss)
 {
 	struct cgroup *parent = cgroup_parent(cgrp);
-	struct cgroup_subsys_state *parent_css = cgroup_css(parent, ss);
+	struct cgroup_subsys_state *parent_css;
 	struct cgroup_subsys_state *css;
 	int err;
 
 	lockdep_assert_held(&cgroup_mutex);
 
+	/*
+	 * Need to skip over ancestor cgroups with the pass-through bit set.
+	 */
+	while (parent && (parent->passthru_ss_mask & (1 << ss->id)))
+		parent = cgroup_parent(parent);
+
+	if (!parent) {
+		WARN_ON_ONCE(1);
+		return ERR_PTR(-EINVAL);
+	}
+	parent_css = cgroup_css(parent, ss);
+
 	css = ss->css_alloc(parent_css);
 	if (!css)
 		css = ERR_PTR(-ENOMEM);
 	if (IS_ERR(css))
 		return css;
 
-	init_and_link_css(css, ss, cgrp);
+	init_and_link_css(css, ss, cgrp, parent_css);
 
 	err = percpu_ref_init(&css->refcnt, css_release, 0, GFP_KERNEL);
 	if (err)
@@ -5044,7 +5243,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
 	css = ss->css_alloc(cgroup_css(&cgrp_dfl_root.cgrp, ss));
 	/* We don't handle early failures gracefully */
 	BUG_ON(IS_ERR(css));
-	init_and_link_css(css, ss, &cgrp_dfl_root.cgrp);
+	init_and_link_css(css, ss, &cgrp_dfl_root.cgrp, NULL);
 
 	/*
 	 * Root csses are never destroyed and we can't initialize
diff --git a/kernel/cgroup/debug.c b/kernel/cgroup/debug.c
index b565951..a2dbf77 100644
--- a/kernel/cgroup/debug.c
+++ b/kernel/cgroup/debug.c
@@ -212,8 +212,12 @@ static int cgroup_subsys_states_read(struct seq_file *seq, void *v)
 	mutex_lock(&cgroup_mutex);
 	for_each_subsys(ss, i) {
 		css = rcu_dereference_check(cgrp->subsys[ss->id], true);
-		if (!css)
+		if (!css) {
+			if (cgrp->passthru_ss_mask & (1 << ss->id))
+				seq_printf(seq, "%2d: %-4s\t- [Pass-through]\n",
+					   ss->id, ss->name);
 			continue;
+		}
 		pbuf[0] = '\0';
 
 		/* Show the parent CSS if applicable*/
@@ -240,6 +244,8 @@ static int cgroup_masks_read(struct seq_file *seq, void *v)
 		{ &cgrp->subtree_control,  "subtree_control"  },
 		{ &cgrp->subtree_ss_mask,  "subtree_ss_mask"  },
 		{ &cgrp->resource_control, "resource_control" },
+		{ &cgrp->enable_ss_mask,   "enable_ss_mask"   },
+		{ &cgrp->passthru_ss_mask, "passthru_ss_mask" },
 	};
 
 	mutex_lock(&cgroup_mutex);
-- 
1.8.3.1


* [RFC PATCH v2 14/17] cgroup: Enable printing of v2 controllers' cgroup hierarchy
  2017-05-15 13:33 [RFC PATCH v2 00/17] cgroup: Major changes to cgroup v2 core Waiman Long
                   ` (12 preceding siblings ...)
  2017-05-15 13:34 ` [RFC PATCH v2 13/17] cgroup: Allow fine-grained controllers control in cgroup v2 Waiman Long
@ 2017-05-15 13:34 ` Waiman Long
  2017-05-15 13:34 ` [RFC PATCH v2 15/17] sched: Misc preps for cgroup unified hierarchy interface Waiman Long
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 69+ messages in thread
From: Waiman Long @ 2017-05-15 13:34 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, longman

This patch adds a new debug control file to the cgroup v2 root
directory to print out the cgroup hierarchy for each of the v2 controllers.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/debug.c | 141 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 141 insertions(+)

diff --git a/kernel/cgroup/debug.c b/kernel/cgroup/debug.c
index a2dbf77..3adb26a 100644
--- a/kernel/cgroup/debug.c
+++ b/kernel/cgroup/debug.c
@@ -268,6 +268,141 @@ static int cgroup_masks_read(struct seq_file *seq, void *v)
 	return 0;
 }
 
+/*
+ * Print out all the child cgroup names that don't have a css for the
+ * corresponding cgroup_subsys. If a child cgroup has a css, put that into
+ * the given cglist to be processed in the next iteration.
+ */
+#define CGLIST_MAX	16
+static void print_hierarchy(struct seq_file *seq,
+			    struct cgroup *cgrp,
+			    struct cgroup_subsys *ss,
+			    struct cgroup_subsys_state *css,
+			    struct cgroup **cglist,
+			    int *cgcnt)
+{
+	struct cgroup *child;
+	struct cgroup_subsys_state *child_css;
+	char cgname[64];
+
+	cgname[sizeof(cgname) - 1] = '\0';
+	/*
+	 * Iterate all live children of the given cgroup.
+	 */
+	list_for_each_entry(child, &cgrp->self.children, self.sibling) {
+		if (cgroup_is_dead(child))
+			continue;
+
+		child_css = rcu_dereference_check(child->subsys[ss->id], true);
+		if (child_css) {
+			WARN_ON(child_css->parent != css);
+			if (*cgcnt < CGLIST_MAX) {
+				cglist[*cgcnt] = child;
+				(*cgcnt)++;
+			}
+			continue;
+		}
+
+		/*
+		 * Skip resource domain cgroup
+		 */
+		if (test_bit(CGRP_RESOURCE_DOMAIN, &child->flags))
+			continue;
+
+		cgroup_name(child, cgname, sizeof(cgname)-1);
+		seq_putc(seq, ',');
+		seq_puts(seq, cgname);
+		print_hierarchy(seq, child, ss, css, cglist, cgcnt);
+	}
+}
+
+/*
+ * Print the hierarchies with respect to each controller for the default
+ * hierarchy.
+ *
+ * Each child level is printed on a separate line. Sets of cgroups that
+ * have the same css will be grouped together and separated by commas.
+ * Processes in those cgroups will be in the same node (css) in the
+ * controller's hierarchy. The exception is that for a resource
+ * domain cgroup, the processes associated with its parent and its
+ * affiliates will be mapped to the css of that resource domain cgroup
+ * instead.
+ *
+ * If there are more than CGLIST_MAX sets of cgroups in each level,
+ * the extra ones will be skipped.
+ */
+static int controller_hierachies_read(struct seq_file *seq, void *v)
+{
+	struct cgroup *root = seq_css(seq)->cgroup;
+	struct cgroup_subsys *ss;
+	struct cgroup_subsys_state *css;
+	struct cgroup *cgrp;
+	struct cgroup *cglist[CGLIST_MAX];
+	struct cgroup *cg2list[CGLIST_MAX];
+	int i, idx, cgnum, cg2num;
+	char cgname[64];
+
+	cgname[sizeof(cgname) - 1] = '\0';
+	mutex_lock(&cgroup_mutex);
+	for_each_subsys(ss, i) {
+		if (!(root->root->subsys_mask & (1 << ss->id)))
+			continue;
+		seq_puts(seq, ss->name);
+		seq_puts(seq, ":\n");
+
+		cgnum = 1;
+		cg2num = 0;
+		cglist[0] = root;
+		idx = 0;
+		while (cgnum) {
+			if (idx)
+				seq_putc(seq, ' ');
+			cgrp = cglist[idx];
+			if (test_bit(CGRP_RESOURCE_DOMAIN, &cgrp->flags)) {
+				struct cgroup *parent;
+
+				parent = container_of(cgrp->self.parent,
+						      struct cgroup, self);
+				cgroup_name(parent, cgname, sizeof(cgname)-1);
+				seq_printf(seq, "%s.rd", cgname);
+			} else {
+				cgroup_name(cgrp, cgname, sizeof(cgname)-1);
+				seq_puts(seq, cgname);
+			}
+			css = rcu_dereference_check(cgrp->subsys[ss->id], true);
+			WARN_ON(!css);
+
+			if (cgrp == root)
+				seq_printf(seq, "[%d]", css->id);
+			else
+				seq_printf(seq, "[%d:P=%d]", css->id,
+					   css->parent->id);
+
+			/*
+			 * List all the cgroups that use the current
+			 * css.
+			 */
+			print_hierarchy(seq, cgrp, ss, css, cg2list, &cg2num);
+
+			if (++idx < cgnum)
+				continue;
+
+			/*
+			 * Move cg2list to cglist.
+			 */
+			cgnum = cg2num;
+			idx = cg2num = 0;
+			if (cgnum)
+				memcpy(cglist, cg2list,
+				       cgnum * sizeof(cglist[0]));
+			seq_putc(seq, '\n');
+		}
+		seq_putc(seq, '\n');
+	}
+	mutex_unlock(&cgroup_mutex);
+	return 0;
+}
+
 static u64 releasable_read(struct cgroup_subsys_state *css, struct cftype *cft)
 {
 	return (!cgroup_is_populated(css->cgroup) &&
@@ -314,6 +449,12 @@ static u64 releasable_read(struct cgroup_subsys_state *css, struct cftype *cft)
 	},
 
 	{
+		.name = "controller_hierachies",
+		.seq_show = controller_hierachies_read,
+		.flags = CFTYPE_ONLY_ON_ROOT|__CFTYPE_ONLY_ON_DFL,
+	},
+
+	{
 		.name = "releasable",
 		.read_u64 = releasable_read,
 	},
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [RFC PATCH v2 15/17] sched: Misc preps for cgroup unified hierarchy interface
  2017-05-15 13:33 [RFC PATCH v2 00/17] cgroup: Major changes to cgroup v2 core Waiman Long
                   ` (13 preceding siblings ...)
  2017-05-15 13:34 ` [RFC PATCH v2 14/17] cgroup: Enable printing of v2 controllers' cgroup hierarchy Waiman Long
@ 2017-05-15 13:34 ` Waiman Long
  2017-05-15 13:34 ` [RFC PATCH v2 16/17] sched: Implement interface for cgroup unified hierarchy Waiman Long
  2017-05-15 13:34 ` [RFC PATCH v2 17/17] sched: Make cpu/cpuacct threaded controllers Waiman Long
  16 siblings, 0 replies; 69+ messages in thread
From: Waiman Long @ 2017-05-15 13:34 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, longman

From: Tejun Heo <tj@kernel.org>

Make the following changes in preparation for the cpu controller
interface implementation for the unified hierarchy.  This patch
doesn't cause any functional differences.

* s/cpu_stats_show()/cpu_cfs_stats_show()/

* s/cpu_files/cpu_legacy_files/

* Separate out cpuacct_stats_read() from cpuacct_stats_show().  While
  at it, make the @val array u64 for consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/core.c    |  8 ++++----
 kernel/sched/cpuacct.c | 29 ++++++++++++++++++-----------
 2 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c888bd3..be2527b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7230,7 +7230,7 @@ static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
 	return ret;
 }
 
-static int cpu_stats_show(struct seq_file *sf, void *v)
+static int cpu_cfs_stats_show(struct seq_file *sf, void *v)
 {
 	struct task_group *tg = css_tg(seq_css(sf));
 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
@@ -7270,7 +7270,7 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
-static struct cftype cpu_files[] = {
+static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
 		.name = "shares",
@@ -7291,7 +7291,7 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 	},
 	{
 		.name = "stat",
-		.seq_show = cpu_stats_show,
+		.seq_show = cpu_cfs_stats_show,
 	},
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -7317,7 +7317,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,
-	.legacy_cftypes	= cpu_files,
+	.legacy_cftypes	= cpu_legacy_files,
 	.early_init	= true,
 };
 
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index f95ab29..6151c23 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -276,26 +276,33 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V)
 	return 0;
 }
 
-static int cpuacct_stats_show(struct seq_file *sf, void *v)
+static void cpuacct_stats_read(struct cpuacct *ca,
+			       u64 (*val)[CPUACCT_STAT_NSTATS])
 {
-	struct cpuacct *ca = css_ca(seq_css(sf));
-	s64 val[CPUACCT_STAT_NSTATS];
 	int cpu;
-	int stat;
 
-	memset(val, 0, sizeof(val));
+	memset(val, 0, sizeof(*val));
+
 	for_each_possible_cpu(cpu) {
 		u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
 
-		val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_USER];
-		val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_NICE];
-		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SYSTEM];
-		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_IRQ];
-		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SOFTIRQ];
+		(*val)[CPUACCT_STAT_USER]   += cpustat[CPUTIME_USER];
+		(*val)[CPUACCT_STAT_USER]   += cpustat[CPUTIME_NICE];
+		(*val)[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SYSTEM];
+		(*val)[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_IRQ];
+		(*val)[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SOFTIRQ];
 	}
+}
+
+static int cpuacct_stats_show(struct seq_file *sf, void *v)
+{
+	u64 val[CPUACCT_STAT_NSTATS];
+	int stat;
+
+	cpuacct_stats_read(css_ca(seq_css(sf)), &val);
 
 	for (stat = 0; stat < CPUACCT_STAT_NSTATS; stat++) {
-		seq_printf(sf, "%s %lld\n",
+		seq_printf(sf, "%s %llu\n",
 			   cpuacct_stat_desc[stat],
 			   (long long)nsec_to_clock_t(val[stat]));
 	}
-- 
1.8.3.1

* [RFC PATCH v2 16/17] sched: Implement interface for cgroup unified hierarchy
  2017-05-15 13:33 [RFC PATCH v2 00/17] cgroup: Major changes to cgroup v2 core Waiman Long
                   ` (14 preceding siblings ...)
  2017-05-15 13:34 ` [RFC PATCH v2 15/17] sched: Misc preps for cgroup unified hierarchy interface Waiman Long
@ 2017-05-15 13:34 ` Waiman Long
  2017-05-15 13:34 ` [RFC PATCH v2 17/17] sched: Make cpu/cpuacct threaded controllers Waiman Long
  16 siblings, 0 replies; 69+ messages in thread
From: Waiman Long @ 2017-05-15 13:34 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, longman

From: Tejun Heo <tj@kernel.org>

While the cpu controller doesn't have any functional problems, there
are a couple of interface issues which can be addressed in the v2
interface.

* cpuacct being a separate controller.  This separation is artificial
  and rather pointless as demonstrated by most use cases co-mounting
  the two controllers.  It also forces certain information to be
  accounted twice.

* Use of different time units.  Writable control knobs use
  microseconds, some stat fields use nanoseconds while other cpuacct
  stat fields use centiseconds.

* Control knobs which can't be used in the root cgroup still show up
  in the root.

* Control knob names and semantics aren't consistent with other
  controllers.

This patch implements the cpu controller's interface on the unified
hierarchy, which adheres to the controller file conventions described in
Documentation/cgroup-v2.txt.  Overall, the following changes are made.

* cpuacct is implicitly enabled and disabled by cpu and its information
  is reported through "cpu.stat" which now uses microseconds for all
  time durations.  All time duration fields now have "_usec" appended
  to them for clarity.  While this doesn't solve the double accounting
  immediately, once the majority of users switch to v2, cpu can directly
  account and report the relevant stats and cpuacct can be disabled on
  the unified hierarchy.

  Note that cpuacct.usage_percpu is currently not included in
  "cpu.stat".  If this information is actually called for, it can be
  added later.

* "cpu.shares" is replaced with "cpu.weight" and operates on the
  standard scale defined by CGROUP_WEIGHT_MIN/DFL/MAX (1, 100, 10000).
  The weight is scaled to scheduler weight so that 100 maps to 1024
  and the ratio relationship is preserved - if the weight is W and its
  scaled value is S, then W / 100 == S / 1024.  While the mapped range
  is a bit smaller than the original scheduler weight range, the dead
  zones on both sides are relatively small and the range is still wider
  than the nice value mappings.  This file doesn't make sense in the
  root cgroup and isn't created on root.

* "cpu.cfs_quota_us" and "cpu.cfs_period_us" are replaced by "cpu.max"
  which contains both quota and period.

* "cpu.rt_runtime_us" and "cpu.rt_period_us" are replaced by
  "cpu.rt.max" which contains both runtime and period.

v2: cpu_stats_show() was incorrectly using CONFIG_FAIR_GROUP_SCHED for
    CFS bandwidth stats and also using raw division for u64.  Use
    CONFIG_CFS_BANDWIDTH and do_div() instead.

    The semantics of "cpu.rt.max" are not fully decided yet.  Dropped
    for now.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/core.c    | 141 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/cpuacct.c |  25 +++++++++
 kernel/sched/cpuacct.h |   5 ++
 3 files changed, 171 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index be2527b..b041081 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7309,6 +7309,139 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 	{ }	/* Terminate */
 };
 
+static int cpu_stats_show(struct seq_file *sf, void *v)
+{
+	cpuacct_cpu_stats_show(sf);
+
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		struct task_group *tg = css_tg(seq_css(sf));
+		struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+		u64 throttled_usec;
+
+		throttled_usec = cfs_b->throttled_time;
+		do_div(throttled_usec, NSEC_PER_USEC);
+
+		seq_printf(sf, "nr_periods %d\n"
+			   "nr_throttled %d\n"
+			   "throttled_usec %llu\n",
+			   cfs_b->nr_periods, cfs_b->nr_throttled,
+			   throttled_usec);
+	}
+#endif
+	return 0;
+}
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css,
+			       struct cftype *cft)
+{
+	struct task_group *tg = css_tg(css);
+	u64 weight = scale_load_down(tg->shares);
+
+	return DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024);
+}
+
+static int cpu_weight_write_u64(struct cgroup_subsys_state *css,
+				struct cftype *cftype, u64 weight)
+{
+	/*
+	 * cgroup weight knobs should use the common MIN, DFL and MAX
+	 * values which are 1, 100 and 10000 respectively.  While it loses
+	 * a bit of range on both ends, it maps pretty well onto the shares
+	 * value used by scheduler and the round-trip conversions preserve
+	 * the original value over the entire range.
+	 */
+	if (weight < CGROUP_WEIGHT_MIN || weight > CGROUP_WEIGHT_MAX)
+		return -ERANGE;
+
+	weight = DIV_ROUND_CLOSEST_ULL(weight * 1024, CGROUP_WEIGHT_DFL);
+
+	return sched_group_set_shares(css_tg(css), scale_load(weight));
+}
+#endif
+
+static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
+						  long period, long quota)
+{
+	if (quota < 0)
+		seq_puts(sf, "max");
+	else
+		seq_printf(sf, "%ld", quota);
+
+	seq_printf(sf, " %ld\n", period);
+}
+
+/* caller should put the current value in *@periodp before calling */
+static int __maybe_unused cpu_period_quota_parse(char *buf,
+						 u64 *periodp, u64 *quotap)
+{
+	char tok[21];	/* U64_MAX */
+
+	if (!sscanf(buf, "%s %llu", tok, periodp))
+		return -EINVAL;
+
+	*periodp *= NSEC_PER_USEC;
+
+	if (sscanf(tok, "%llu", quotap))
+		*quotap *= NSEC_PER_USEC;
+	else if (!strcmp(tok, "max"))
+		*quotap = RUNTIME_INF;
+	else
+		return -EINVAL;
+
+	return 0;
+}
+
+#ifdef CONFIG_CFS_BANDWIDTH
+static int cpu_max_show(struct seq_file *sf, void *v)
+{
+	struct task_group *tg = css_tg(seq_css(sf));
+
+	cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
+	return 0;
+}
+
+static ssize_t cpu_max_write(struct kernfs_open_file *of,
+			     char *buf, size_t nbytes, loff_t off)
+{
+	struct task_group *tg = css_tg(of_css(of));
+	u64 period = tg_get_cfs_period(tg);
+	u64 quota;
+	int ret;
+
+	ret = cpu_period_quota_parse(buf, &period, &quota);
+	if (!ret)
+		ret = tg_set_cfs_bandwidth(tg, period, quota);
+	return ret ?: nbytes;
+}
+#endif
+
+static struct cftype cpu_files[] = {
+	{
+		.name = "stat",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_stats_show,
+	},
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	{
+		.name = "weight",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_weight_read_u64,
+		.write_u64 = cpu_weight_write_u64,
+	},
+#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.name = "max",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_max_show,
+		.write = cpu_max_write,
+	},
+#endif
+	{ }	/* terminate */
+};
+
 struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
 	.css_online	= cpu_cgroup_css_online,
@@ -7318,7 +7451,15 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,
 	.legacy_cftypes	= cpu_legacy_files,
+	.dfl_cftypes	= cpu_files,
 	.early_init	= true,
+#ifdef CONFIG_CGROUP_CPUACCT
+	/*
+	 * cpuacct is enabled together with cpu on the unified hierarchy
+	 * and its stats are reported through "cpu.stat".
+	 */
+	.depends_on	= 1 << cpuacct_cgrp_id,
+#endif
 };
 
 #endif	/* CONFIG_CGROUP_SCHED */
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 6151c23..fc1cf13 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -347,6 +347,31 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v)
 	{ }	/* terminate */
 };
 
+/* used to print cpuacct stats in cpu.stat on the unified hierarchy */
+void cpuacct_cpu_stats_show(struct seq_file *sf)
+{
+	struct cgroup_subsys_state *css;
+	u64 usage, val[CPUACCT_STAT_NSTATS];
+
+	css = cgroup_get_e_css(seq_css(sf)->cgroup, &cpuacct_cgrp_subsys);
+
+	usage = cpuusage_read(css, seq_cft(sf));
+	cpuacct_stats_read(css_ca(css), &val);
+
+	val[CPUACCT_STAT_USER] *= TICK_NSEC;
+	val[CPUACCT_STAT_SYSTEM] *= TICK_NSEC;
+	do_div(usage, NSEC_PER_USEC);
+	do_div(val[CPUACCT_STAT_USER], NSEC_PER_USEC);
+	do_div(val[CPUACCT_STAT_SYSTEM], NSEC_PER_USEC);
+
+	seq_printf(sf, "usage_usec %llu\n"
+		   "user_usec %llu\n"
+		   "system_usec %llu\n",
+		   usage, val[CPUACCT_STAT_USER], val[CPUACCT_STAT_SYSTEM]);
+
+	css_put(css);
+}
+
 /*
  * charge this task's execution time to its accounting group.
  *
diff --git a/kernel/sched/cpuacct.h b/kernel/sched/cpuacct.h
index ba72807..ddf7af4 100644
--- a/kernel/sched/cpuacct.h
+++ b/kernel/sched/cpuacct.h
@@ -2,6 +2,7 @@
 
 extern void cpuacct_charge(struct task_struct *tsk, u64 cputime);
 extern void cpuacct_account_field(struct task_struct *tsk, int index, u64 val);
+extern void cpuacct_cpu_stats_show(struct seq_file *sf);
 
 #else
 
@@ -14,4 +15,8 @@ static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime)
 {
 }
 
+static inline void cpuacct_cpu_stats_show(struct seq_file *sf)
+{
+}
+
 #endif
-- 
1.8.3.1

* [RFC PATCH v2 17/17] sched: Make cpu/cpuacct threaded controllers
  2017-05-15 13:33 [RFC PATCH v2 00/17] cgroup: Major changes to cgroup v2 core Waiman Long
                   ` (15 preceding siblings ...)
  2017-05-15 13:34 ` [RFC PATCH v2 16/17] sched: Implement interface for cgroup unified hierarchy Waiman Long
@ 2017-05-15 13:34 ` Waiman Long
  16 siblings, 0 replies; 69+ messages in thread
From: Waiman Long @ 2017-05-15 13:34 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, linux-mm, kernel-team, pjt,
	luto, efault, longman

Make cpu and cpuacct cgroup controllers usable within a threaded cgroup.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/core.c    | 1 +
 kernel/sched/cpuacct.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b041081..479f69e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7453,6 +7453,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.legacy_cftypes	= cpu_legacy_files,
 	.dfl_cftypes	= cpu_files,
 	.early_init	= true,
+	.threaded	= true,
 #ifdef CONFIG_CGROUP_CPUACCT
 	/*
 	 * cpuacct is enabled together with cpu on the unified hierarchy
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index fc1cf13..853d18a 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -414,4 +414,5 @@ struct cgroup_subsys cpuacct_cgrp_subsys = {
 	.css_free	= cpuacct_css_free,
 	.legacy_cftypes	= files,
 	.early_init	= true,
+	.threaded	= true,
 };
-- 
1.8.3.1

* Re: [RFC PATCH v2 06/17] cgroup: Fix reference counting bug in cgroup_procs_write()
  2017-05-15 13:34 ` [RFC PATCH v2 06/17] cgroup: Fix reference counting bug in cgroup_procs_write() Waiman Long
@ 2017-05-17 19:20   ` Tejun Heo
  0 siblings, 0 replies; 69+ messages in thread
From: Tejun Heo @ 2017-05-17 19:20 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

On Mon, May 15, 2017 at 09:34:05AM -0400, Waiman Long wrote:
> The cgroup_procs_write_start() took a reference to the task structure
> which was not properly released within cgroup_procs_write() and so
> on. So a put_task_struct() call is added to cgroup_procs_write_finish()
> to match the get_task_struct() in cgroup_procs_write_start() to fix
> this reference counting error.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks!

-- 
tejun

* Re: [RFC PATCH v2 07/17] cgroup: Prevent kill_css() from being called more than once
  2017-05-15 13:34 ` [RFC PATCH v2 07/17] cgroup: Prevent kill_css() from being called more than once Waiman Long
@ 2017-05-17 19:23   ` Tejun Heo
  2017-05-17 20:24     ` Waiman Long
  0 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-05-17 19:23 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello,

On Mon, May 15, 2017 at 09:34:06AM -0400, Waiman Long wrote:
> The kill_css() function may be called more than once under the condition
> that the css was killed but not physically removed yet followed by the
> removal of the cgroup that is hosting the css. This patch prevents any
> harm from being done when that happens.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

So, this is a bug fix which isn't really related to this patchset.
I'm applying it to cgroup/for-4.12-fixes w/ stable cc'd.

Thanks.

-- 
tejun

* Re: [RFC PATCH v2 07/17] cgroup: Prevent kill_css() from being called more than once
  2017-05-17 19:23   ` Tejun Heo
@ 2017-05-17 20:24     ` Waiman Long
  2017-05-17 21:34       ` Tejun Heo
  0 siblings, 1 reply; 69+ messages in thread
From: Waiman Long @ 2017-05-17 20:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

On 05/17/2017 03:23 PM, Tejun Heo wrote:
> Hello,
>
> On Mon, May 15, 2017 at 09:34:06AM -0400, Waiman Long wrote:
>> The kill_css() function may be called more than once under the condition
>> that the css was killed but not physically removed yet followed by the
>> removal of the cgroup that is hosting the css. This patch prevents any
>> harm from being done when that happens.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
> So, this is a bug fix which isn't really related to this patchset.
> I'm applying it to cgroup/for-4.12-fixes w/ stable cc'd.
>
> Thanks.
>
Actually, this bug can be easily triggered with the resource domain
patch later in the series. I guess it can also happen in the current
code base, but I don't have a test that can reproduce it.

Regards,
Longman

* Re: [RFC PATCH v2 07/17] cgroup: Prevent kill_css() from being called more than once
  2017-05-17 20:24     ` Waiman Long
@ 2017-05-17 21:34       ` Tejun Heo
  0 siblings, 0 replies; 69+ messages in thread
From: Tejun Heo @ 2017-05-17 21:34 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

On Wed, May 17, 2017 at 04:24:32PM -0400, Waiman Long wrote:
> On 05/17/2017 03:23 PM, Tejun Heo wrote:
> > Hello,
> >
> > On Mon, May 15, 2017 at 09:34:06AM -0400, Waiman Long wrote:
> >> The kill_css() function may be called more than once under the condition
> >> that the css was killed but not physically removed yet followed by the
> >> removal of the cgroup that is hosting the css. This patch prevents any
> >> harm from being done when that happens.
> >>
> >> Signed-off-by: Waiman Long <longman@redhat.com>
> > So, this is a bug fix which isn't really related to this patchset.
> > I'm applying it to cgroup/for-4.12-fixes w/ stable cc'd.
> >
> > Thanks.
> >
> Actually, this bug can be easily triggered with the resource domain
> patch later in the series. I guess it can also happen in the current
> code base, but I don't have a test that can reproduce it.

I can reproduce it easily.

[test /sys/fs/cgroup/asdf]# while true; do mkdir asdf; echo +memory > cgroup.subtree_control; echo -memory
[   66.159258] percpu_ref_kill_and_confirm called more than once on css_release!
[   66.159293] ------------[ cut here ]------------
[   66.160966] WARNING: CPU: 1 PID: 1802 at lib/percpu-refcount.c:334 percpu_ref_kill_and_confirm+0x190/0x1a0
[   66.162406] Modules linked in:
[   66.162686] CPU: 1 PID: 1802 Comm: rmdir Not tainted 4.12.0-rc1-work+ #42
[   66.163279] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
[   66.164043] task: ffff880018240040 task.stack: ffffc90000478000
[   66.164571] RIP: 0010:percpu_ref_kill_and_confirm+0x190/0x1a0
[   66.165106] RSP: 0018:ffffc9000047bde8 EFLAGS: 00010092
[   66.165664] RAX: 0000000000000041 RBX: ffff88001a0fc148 RCX: 0000000000000002
[   66.166443] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff810acc2e
[   66.167083] RBP: ffffc9000047be00 R08: 0000000000000000 R09: 0000000000000001
[   66.167696] R10: ffffc9000047bd50 R11: 0000000000000000 R12: 0000000000000286
[   66.168293] R13: ffffffff810e7c50 R14: ffff88001a106bb8 R15: 0000000000000000
[   66.168897] FS:  00007fba87594700(0000) GS:ffff88001fc80000(0000) knlGS:0000000000000000
[   66.169613] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   66.170204] CR2: 000056167ee4e008 CR3: 000000001820e000 CR4: 00000000000006a0
[   66.170861] Call Trace:
[   66.171075]  kill_css+0x3e/0x170
[   66.171352]  cgroup_destroy_locked+0xac/0x170
[   66.171732]  cgroup_rmdir+0x2c/0x150
[   66.172037]  kernfs_iop_rmdir+0x48/0x70
[   66.172377]  vfs_rmdir+0x73/0x150
[   66.172679]  do_rmdir+0x16d/0x1c0
[   66.172962]  SyS_rmdir+0x16/0x20
[   66.173244]  entry_SYSCALL_64_fastpath+0x18/0xad
[   66.173682] RIP: 0033:0x7fba870c1487
[   66.173985] RSP: 002b:00007ffec57e43a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000054
[   66.174719] RAX: ffffffffffffffda RBX: 00007fba87388b38 RCX: 00007fba870c1487
[   66.175439] RDX: 00007fba8738ae80 RSI: 0000000000000000 RDI: 00007ffec57e5751
[   66.176281] RBP: 00007fba87388ae0 R08: 0000000000000000 R09: 0000000000000000
[   66.177046] R10: 000056167ee4e010 R11: 0000000000000246 R12: 00007fba87388b38
[   66.177821] R13: 0000000000000030 R14: 00007fba87388b58 R15: 0000000000002710
[   66.178574] Code: 80 3d f5 d3 89 00 00 0f 85 b8 fe ff ff 48 8b 53 10 48 c7 c6 e0 c0 83 81 48 c7 c7 68 e0 9c 81 c6 05 d6 d3 89 00 01 e8 09 0a d4 ff <0f> ff 48 8b 43 08 e9 8f fe ff ff 0f 1f 44 00 00 55 ba ff ffff
[   66.181059] ---[ end trace 50ce5cd95cda7a2c ]---

-- 
tejun

* Re: [RFC PATCH v2 08/17] cgroup: Move debug cgroup to its own file
  2017-05-15 13:34 ` [RFC PATCH v2 08/17] cgroup: Move debug cgroup to its own file Waiman Long
@ 2017-05-17 21:36   ` Tejun Heo
  2017-05-18 15:29     ` Waiman Long
  2017-05-18 15:52     ` Waiman Long
  0 siblings, 2 replies; 69+ messages in thread
From: Tejun Heo @ 2017-05-17 21:36 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello, Waiman.

On Mon, May 15, 2017 at 09:34:07AM -0400, Waiman Long wrote:
> The debug cgroup currently resides within cgroup-v1.c and is enabled
> only for v1 cgroup. To enable the debug cgroup also for v2, it
> makes sense to put the code into its own file as it will no longer
> be v1 specific. The only change in this patch is the expansion of
> cgroup_task_count() within the debug_taskcount_read() function.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

I don't mind enabling the debug controller for v2 but let's please
hide it behind an unwieldy boot param / controller name so that it's
clear that its interface isn't expected to be stable.

Thanks.

-- 
tejun

* Re: [RFC PATCH v2 09/17] cgroup: Keep accurate count of tasks in each css_set
  2017-05-15 13:34 ` [RFC PATCH v2 09/17] cgroup: Keep accurate count of tasks in each css_set Waiman Long
@ 2017-05-17 21:40   ` Tejun Heo
  2017-05-18 15:56     ` Waiman Long
  0 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-05-17 21:40 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello,

On Mon, May 15, 2017 at 09:34:08AM -0400, Waiman Long wrote:
> The reference count in the css_set data structure was used as a
> proxy for the number of tasks attached to that css_set. However, that
> count is actually not an accurate measure, especially with thread mode
> support. So a new variable task_count is added to the css_set to keep
> track of the actual task count. This new variable is protected by
> the css_set_lock. Functions that require the actual task count are
> updated to use the new variable.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

Looks good.  We probably should replace css_set_populated() to use
this too.

Thanks.

-- 
tejun

* Re: [RFC PATCH v2 10/17] cgroup: Make debug cgroup support v2 and thread mode
  2017-05-15 13:34 ` [RFC PATCH v2 10/17] cgroup: Make debug cgroup support v2 and thread mode Waiman Long
@ 2017-05-17 21:43   ` Tejun Heo
  2017-05-18 15:58     ` Waiman Long
  0 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-05-17 21:43 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello,

On Mon, May 15, 2017 at 09:34:09AM -0400, Waiman Long wrote:
> Besides supporting cgroup v2 and thread mode, the following changes
> are also made:
>  1) current_* cgroup files now reside only at the root as we don't
>     need duplicated files of the same function all over the cgroup
>     hierarchy.
>  2) The cgroup_css_links_read() function is modified to report
>     the number of tasks that are skipped because of overflow.
>  3) The relationship between proc_cset and threaded_csets are displayed.
>  4) The number of extra unaccounted references are displayed.
>  5) The status of being a thread root or threaded cgroup is displayed.
>  6) The current_css_set_read() function now prints out the addresses of
>     the css'es associated with the current css_set.
>  7) A new cgroup_subsys_states file is added to display the css objects
>     associated with a cgroup.
>  8) A new cgroup_masks file is added to display the various controller
>     bit masks in the cgroup.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

As noted before, please make it clear that this is a debug feature and
not expected to be stable.  Also, I don't see why this and the
previous two patches are in this series.  Can you please split these
out to a separate patchset?

Thanks.

-- 
tejun

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-05-15 13:34 ` [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics Waiman Long
@ 2017-05-17 21:47   ` Tejun Heo
  2017-05-18 17:21     ` Waiman Long
  2017-05-19 20:26   ` Tejun Heo
  1 sibling, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-05-17 21:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello, Waiman.

On Mon, May 15, 2017 at 09:34:10AM -0400, Waiman Long wrote:
> The current thread mode semantics aren't sufficient to fully support
> threaded controllers like cpu. The main problem is that when thread
> mode is enabled at root (mainly for performance reasons), all the
> non-threaded controllers cannot be supported at all.
> 
> To alleviate this problem, the roles of thread root and threaded
> cgroups are now further separated. Now thread mode can only be enabled
> on a non-root leaf cgroup whose parent will then become the thread
> root. All the descendants of a threaded cgroup will still need to be
> threaded. All the non-threaded resources will be accounted for in the
> thread root. Unlike the previous thread mode, however, a thread root
> can have non-threaded children where system resources like memory
> can be further split down the hierarchy.
> 
> Now we could have something like
> 
> 	R -- A -- B
> 	 \
> 	  T1 -- T2
> 
> where R is the thread root, A and B are non-threaded cgroups, T1 and
> T2 are threaded cgroups. The cgroups R, T1, T2 form a threaded subtree
> where all the non-threaded resources are accounted for in R.  The no
> internal process constraint does not apply in the threaded subtree.
> Non-threaded controllers need to properly handle the competition
> between internal processes and child cgroups at the thread root.
> 
> This model will be flexible enough to support the need of the threaded
> controllers.

I do like the approach and it does address the issue with requiring at
least one level of nesting for the thread mode to be used with other
controllers.  I need to think a bit more about it and mull over what
Peterz was suggesting in the old thread.  I'll get back to you soon
but I'd really prefer this and the earlier related patches to be in
its own patchset so that we aren't dealing with different things at
the same time.

Thanks.

-- 
tejun

* Re: [RFC PATCH v2 08/17] cgroup: Move debug cgroup to its own file
  2017-05-17 21:36   ` Tejun Heo
@ 2017-05-18 15:29     ` Waiman Long
  2017-05-18 15:52     ` Waiman Long
  1 sibling, 0 replies; 69+ messages in thread
From: Waiman Long @ 2017-05-18 15:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

On 05/17/2017 05:36 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Mon, May 15, 2017 at 09:34:07AM -0400, Waiman Long wrote:
>> The debug cgroup currently resides within cgroup-v1.c and is enabled
>> only for v1 cgroup. To enable the debug cgroup also for v2, it
>> makes sense to put the code into its own file as it will no longer
>> be v1 specific. The only change in this patch is the expansion of
>> cgroup_task_count() within the debug_taskcount_read() function.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
> I don't mind enabling the debug controller for v2 but let's please
> hide it behind an unwieldy boot param / controller name so that it's
> clear that its interface isn't expected to be stable.
>
> Thanks.
>
The controller name is "debug", so it is pretty obvious what it is for.
However, the config prompt of "Example controller" is indeed a bit
vague, so I think we can make the prompt more descriptive here. As for
the boot param, are you saying that something like "cgroup_debug" would
have to be specified on the command line for this controller to be
activated even if the CGROUP_DEBUG config option is set? I am fine with
that if you think it is necessary.

Regards,
Longman

* Re: [RFC PATCH v2 08/17] cgroup: Move debug cgroup to its own file
  2017-05-17 21:36   ` Tejun Heo
  2017-05-18 15:29     ` Waiman Long
@ 2017-05-18 15:52     ` Waiman Long
  2017-05-19 19:21       ` Tejun Heo
  1 sibling, 1 reply; 69+ messages in thread
From: Waiman Long @ 2017-05-18 15:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

On 05/17/2017 05:36 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Mon, May 15, 2017 at 09:34:07AM -0400, Waiman Long wrote:
>> The debug cgroup currently resides within cgroup-v1.c and is enabled
>> only for v1 cgroup. To enable the debug cgroup also for v2, it
>> makes sense to put the code into its own file as it will no longer
>> be v1 specific. The only change in this patch is the expansion of
>> cgroup_task_count() within the debug_taskcount_read() function.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
> I don't mind enabling the debug controller for v2 but let's please
> hide it behind an unwieldy boot param / controller name so that it's
> clear that its interface isn't expected to be stable.
>
> Thanks.
>
The controller name is "debug" and so it is obvious what this controller
is for. However, the config prompt "Example controller" is indeed vague
in meaning, so we can make the prompt more descriptive here. As for the
boot param, are you saying that something like "cgroup_debug" has to be
specified on the command line for the debug controller to be enabled,
even if the CGROUP_DEBUG config option is set? I am fine with that if
you think it is necessary.

Regards,
Longman

* Re: [RFC PATCH v2 09/17] cgroup: Keep accurate count of tasks in each css_set
  2017-05-17 21:40   ` Tejun Heo
@ 2017-05-18 15:56     ` Waiman Long
  0 siblings, 0 replies; 69+ messages in thread
From: Waiman Long @ 2017-05-18 15:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

On 05/17/2017 05:40 PM, Tejun Heo wrote:
> Hello,
>
> On Mon, May 15, 2017 at 09:34:08AM -0400, Waiman Long wrote:
>> The reference count in the css_set data structure was used as a
>> proxy of the number of tasks attached to that css_set. However, that
>> count is actually not an accurate measure especially with thread mode
>> support. So a new variable task_count is added to the css_set to keep
>> track of the actual task count. This new variable is protected by
>> the css_set_lock. Functions that require the actual task count are
>> updated to use the new variable.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
> Looks good.  We probably should replace css_set_populated() to use
> this too.
>
> Thanks.
>
Yes, you are right. css_set_populated() can be replaced with a check on
the task_count.

Regards,
Longman

* Re: [RFC PATCH v2 10/17] cgroup: Make debug cgroup support v2 and thread mode
  2017-05-17 21:43   ` Tejun Heo
@ 2017-05-18 15:58     ` Waiman Long
  0 siblings, 0 replies; 69+ messages in thread
From: Waiman Long @ 2017-05-18 15:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

On 05/17/2017 05:43 PM, Tejun Heo wrote:
> Hello,
>
> On Mon, May 15, 2017 at 09:34:09AM -0400, Waiman Long wrote:
>> Besides supporting cgroup v2 and thread mode, the following changes
>> are also made:
>>  1) current_* cgroup files now resides only at the root as we don't
>>     need duplicated files of the same function all over the cgroup
>>     hierarchy.
>>  2) The cgroup_css_links_read() function is modified to report
>>     the number of tasks that are skipped because of overflow.
>>  3) The relationship between proc_cset and threaded_csets are displayed.
>>  4) The number of extra unaccounted references are displayed.
>>  5) The status of being a thread root or threaded cgroup is displayed.
>>  6) The current_css_set_read() function now prints out the addresses of
>>     the css'es associated with the current css_set.
>>  7) A new cgroup_subsys_states file is added to display the css objects
>>     associated with a cgroup.
>>  8) A new cgroup_masks file is added to display the various controller
>>     bit masks in the cgroup.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
> As noted before, please make it clear that this is a debug feature and
> not expected to be stable.  Also, I don't see why this and the
> previous two patches are in this series.  Can you please split these
> out to a separate patchset?
>
> Thanks.
>
Sure, I can separate out the debug code into its own patchset. It is
just easier to manage as a single patchset than as two separate ones.

Regards,
Longman

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-05-17 21:47   ` Tejun Heo
@ 2017-05-18 17:21     ` Waiman Long
  0 siblings, 0 replies; 69+ messages in thread
From: Waiman Long @ 2017-05-18 17:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

On 05/17/2017 05:47 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Mon, May 15, 2017 at 09:34:10AM -0400, Waiman Long wrote:
>> The current thread mode semantics aren't sufficient to fully support
>> threaded controllers like cpu. The main problem is that when thread
>> mode is enabled at root (mainly for performance reason), all the
>> non-threaded controllers cannot be supported at all.
>>
>> To alleviate this problem, the roles of thread root and threaded
>> cgroups are now further separated. Now thread mode can only be enabled
>> on a non-root leaf cgroup whose parent will then become the thread
>> root. All the descendants of a threaded cgroup will still need to be
>> threaded. All the non-threaded resources will be accounted for in the
>> thread root. Unlike the previous thread mode, however, a thread root
>> can have non-threaded children where system resources like memory
>> can be further split down the hierarchy.
>>
>> Now we could have something like
>>
>> 	R -- A -- B
>> 	 \
>> 	  T1 -- T2
>>
>> where R is the thread root, A and B are non-threaded cgroups, T1 and
>> T2 are threaded cgroups. The cgroups R, T1, T2 form a threaded subtree
>> where all the non-threaded resources are accounted for in R.  The no
>> internal process constraint does not apply in the threaded subtree.
>> Non-threaded controllers need to properly handle the competition
>> between internal processes and child cgroups at the thread root.
>>
>> This model will be flexible enough to support the need of the threaded
>> controllers.
> I do like the approach and it does address the issue with requiring at
> least one level of nesting for the thread mode to be used with other
> controllers.  I need to think a bit more about it and mull over what
> Peterz was suggesting in the old thread.  I'll get back to you soon
> but I'd really prefer this and the earlier related patches to be in
> its own patchset so that we aren't dealing with different things at
> the same time.
>
> Thanks.
>
I have studied the email exchanges on your original thread mode
patchset. This patchset aims to address the concerns that Peterz
raised. The enhanced thread mode should address a big part of them,
but I am not sure if this patch, by itself, is enough to address them
all. That is why I also include two other major changes in the next
two patches. My goal is to eventually allow all controllers to be
enabled for v2. We are not there yet, but I hope this patchset moves
things forward meaningfully.

Regards,
Longman

* Re: [RFC PATCH v2 08/17] cgroup: Move debug cgroup to its own file
  2017-05-18 15:52     ` Waiman Long
@ 2017-05-19 19:21       ` Tejun Heo
  2017-05-19 19:33         ` Waiman Long
  0 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-05-19 19:21 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello, Waiman.

On Thu, May 18, 2017 at 11:52:18AM -0400, Waiman Long wrote:
> The controller name is "debug" and so it is obvious what this controller
> is for. However, the config prompt "Example controller" is indeed vague

Yeah but it also shows up as an integral part of stable interface
rather than e.g. /sys/kernel/debug.  This isn't of any interest to
people who aren't developing cgroup core code.  There is no reason to
risk growing dependencies on it.

> in meaning. So we can make the prompt more descriptive here. As for the
> boot param, are you saying something like "cgroup_debug" has to be
> specified in the command line even if CGROUP_DEBUG config is there for
> the debug controller to be enabled? I am fine with that if you think it
> is necessary.

Yeah, I think that'd be a good idea.  cgroup_debug should do.  While
at it, can you also please make CGROUP_DEBUG depend on DEBUG_KERNEL?

Thanks.

-- 
tejun

* Re: [RFC PATCH v2 08/17] cgroup: Move debug cgroup to its own file
  2017-05-19 19:21       ` Tejun Heo
@ 2017-05-19 19:33         ` Waiman Long
  2017-05-19 20:28           ` Tejun Heo
  0 siblings, 1 reply; 69+ messages in thread
From: Waiman Long @ 2017-05-19 19:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

On 05/19/2017 03:21 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Thu, May 18, 2017 at 11:52:18AM -0400, Waiman Long wrote:
>> The controller name is "debug" and so it is obvious what this controller
>> is for. However, the config prompt "Example controller" is indeed vague
> Yeah but it also shows up as an integral part of stable interface
> rather than e.g. /sys/kernel/debug.  This isn't of any interest to
> people who aren't developing cgroup core code.  There is no reason to
> risk growing dependencies on it.

The debug controller is used to show information relevant to the cgroup
its css'es are attached to. So it will be very hard to use if we
relocate it to /sys/kernel/debug, for example. Currently, nothing in the
debug controller other than debug_cgrp_subsys are exported. I don't see
any risk of having dependency on that controller from other parts of the
kernel.

>> in meaning. So we can make the prompt more descriptive here. As for the
>> boot param, are you saying something like "cgroup_debug" has to be
>> specified in the command line even if CGROUP_DEBUG config is there for
>> the debug controller to be enabled? I am fine with that if you think it
>> is necessary.
> Yeah, I think that'd be a good idea.  cgroup_debug should do.  While
> at it, can you also please make CGROUP_DEBUG depend on DEBUG_KERNEL?
>
> Thanks.
>
Sure. I will do that.

Cheers,
Longman

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-05-15 13:34 ` [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics Waiman Long
  2017-05-17 21:47   ` Tejun Heo
@ 2017-05-19 20:26   ` Tejun Heo
  2017-05-19 20:58     ` Tejun Heo
  2017-05-22 17:13     ` Waiman Long
  1 sibling, 2 replies; 69+ messages in thread
From: Tejun Heo @ 2017-05-19 20:26 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello, Waiman.

On Mon, May 15, 2017 at 09:34:10AM -0400, Waiman Long wrote:
> Now we could have something like
> 
> 	R -- A -- B
> 	 \
> 	  T1 -- T2
> 
> where R is the thread root, A and B are non-threaded cgroups, T1 and
> T2 are threaded cgroups. The cgroups R, T1, T2 form a threaded subtree
> where all the non-threaded resources are accounted for in R.  The no
> internal process constraint does not apply in the threaded subtree.
> Non-threaded controllers need to properly handle the competition
> between internal processes and child cgroups at the thread root.
> 
> This model will be flexible enough to support the need of the threaded
> controllers.

Maybe I'm misunderstanding the design, but this seems to push the
processes which belong to the threaded subtree to the parent which is
part of the usual resource domain hierarchy thus breaking the no
internal competition constraint.  I'm not sure this is something we'd
want.  Given that the limitation of the original threaded mode was the
required nesting below root and that we treat root special anyway
(exactly in the way necessary), I wonder whether it'd be better to
simply allow root to be both domain and thread root.

Specific review points below but we'd probably want to discuss the
overall design first.

> +static inline bool cgroup_is_threaded(const struct cgroup *cgrp)
> +{
> +	return cgrp->proc_cgrp && (cgrp->proc_cgrp != cgrp);
> +}
> +
> +static inline bool cgroup_is_thread_root(const struct cgroup *cgrp)
> +{
> +	return cgrp->proc_cgrp == cgrp;
> +}

Maybe add a bit of comments explaining what's going on with
->proc_cgrp?

>  /**
> + * threaded_children_count - returns # of threaded children
> + * @cgrp: cgroup to be tested
> + *
> + * cgroup_mutex must be held by the caller.
> + */
> +static int threaded_children_count(struct cgroup *cgrp)
> +{
> +	struct cgroup *child;
> +	int count = 0;
> +
> +	lockdep_assert_held(&cgroup_mutex);
> +	cgroup_for_each_live_child(child, cgrp)
> +		if (cgroup_is_threaded(child))
> +			count++;
> +	return count;
> +}

It probably would be a good idea to keep track of the count so that we
don't have to count them each time.  There are cases where people end
up creating a very high number of cgroups and we've already been
bitten a couple times with silly complexity issues.

> @@ -2982,22 +3010,48 @@ static int cgroup_enable_threaded(struct cgroup *cgrp)
>  	LIST_HEAD(csets);
>  	struct cgrp_cset_link *link;
>  	struct css_set *cset, *cset_next;
> +	struct cgroup *child;
>  	int ret;
> +	u16 ss_mask;
>  
>  	lockdep_assert_held(&cgroup_mutex);
>  
>  	/* noop if already threaded */
> -	if (cgrp->proc_cgrp)
> +	if (cgroup_is_threaded(cgrp))
>  		return 0;
>  
> -	/* allow only if there are neither children or enabled controllers */
> -	if (css_has_online_children(&cgrp->self) || cgrp->subtree_control)
> +	/*
> +	 * Allow only if it is not the root and there are:
> +	 * 1) no children,
> +	 * 2) no non-threaded controllers are enabled, and
> +	 * 3) no attached tasks.
> +	 *
> +	 * With no attached tasks, it is assumed that no css_sets will be
> +	 * linked to the current cgroup. This may not be true if some dead
> +	 * css_sets linger around due to task_struct leakage, for example.
> +	 */

It doesn't look like the code is actually making this (incorrect)
assumption.  I suppose the comment is from before
cgroup_is_populated() was added?

>  	spin_lock_irq(&css_set_lock);
>  	list_for_each_entry(link, &cgrp->cset_links, cset_link) {
>  		cset = link->cset;
> +		if (cset->dead)
> +			continue;

Hmm... is this a bug fix which is necessary regardless of whether we
change the threadroot semantics or not?

Thanks.

-- 
tejun

* Re: [RFC PATCH v2 08/17] cgroup: Move debug cgroup to its own file
  2017-05-19 19:33         ` Waiman Long
@ 2017-05-19 20:28           ` Tejun Heo
  0 siblings, 0 replies; 69+ messages in thread
From: Tejun Heo @ 2017-05-19 20:28 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello,

On Fri, May 19, 2017 at 03:33:14PM -0400, Waiman Long wrote:
> On 05/19/2017 03:21 PM, Tejun Heo wrote:
> > Yeah but it also shows up as an integral part of stable interface
> > rather than e.g. /sys/kernel/debug.  This isn't of any interest to
> > people who aren't developing cgroup core code.  There is no reason to
> > risk growing dependencies on it.
> 
> The debug controller is used to show information relevant to the cgroup
> its css'es are attached to. So it will be very hard to use if we
> relocate it to /sys/kernel/debug, for example. Currently, nothing in the
> debug controller other than debug_cgrp_subsys are exported. I don't see
> any risk of having dependency on that controller from other parts of the
> kernel.

Oh, sure, I wasn't suggesting moving it under /sys/kernel/debug but
that we'd want to take extra precautions precisely because we can't.

Thanks.

-- 
tejun

* Re: [RFC PATCH v2 12/17] cgroup: Remove cgroup v2 no internal process constraint
  2017-05-15 13:34 ` [RFC PATCH v2 12/17] cgroup: Remove cgroup v2 no internal process constraint Waiman Long
@ 2017-05-19 20:38   ` Tejun Heo
  2017-05-20  2:10     ` Mike Galbraith
  2017-05-22 16:56     ` Waiman Long
  0 siblings, 2 replies; 69+ messages in thread
From: Tejun Heo @ 2017-05-19 20:38 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello, Waiman.

On Mon, May 15, 2017 at 09:34:11AM -0400, Waiman Long wrote:
> The rationale behind the cgroup v2 no internal process constraint is
> to avoid resource competition between internal processes and child
> cgroups. However, not all controllers have problems with internal
> process competition. Enforcing this rule may lead to unnatural process
> hierarchies and unneeded levels for those controllers.

This isn't necessarily something we can determine by looking at the
current state of controllers.  It's true that some controllers - pid
and perf - inherently only care about membership of each task but at
the same time neither really suffers from the constraint either.  CPU
which is the problematic one here and currently only cares about tasks
actually distributes resources which have parts which are specific to
domain rather than threads and we don't want to declare that CPU isn't
domain aware resource because it inherently is.

> This patch removes the no internal process constraint by enabling those
> controllers that don't like internal process competition to have a
> separate set of control knobs just for internal processes in a cgroup.
> 
> A new control file "cgroup.resource_control" is added. Enabling a
> controller with a "+" prefix will create a separate set of control
> knobs for that controller in the special "cgroup.resource_domain"
> sub-directory for all the internal processes. The existing control
> knobs in the cgroup will then be used to manage resource distribution
> between internal processes as a group and other child cgroups.

We would need to declare all major resource controllers to be needing
that special sub-directory.  That'd work around the
no-internal-process constraint but I don't think it is solving any
real problems.  It's just the kernel doing something that userland can
do with ease and more context.

Thanks.

-- 
tejun

* Re: [RFC PATCH v2 13/17] cgroup: Allow fine-grained controllers control in cgroup v2
  2017-05-15 13:34 ` [RFC PATCH v2 13/17] cgroup: Allow fine-grained controllers control in cgroup v2 Waiman Long
@ 2017-05-19 20:55   ` Tejun Heo
  2017-05-19 21:20     ` Waiman Long
  0 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-05-19 20:55 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello, Waiman.

On Mon, May 15, 2017 at 09:34:12AM -0400, Waiman Long wrote:
> For cgroup v1, different controllers can be bound to different cgroup
> hierarchies optimized for their own use cases. That is not currently
> the case for cgroup v2 where combining all these controllers into
> the same hierarchy will probably require more levels than is needed
> by each individual controller.
> 
> By not enabling a controller in a cgroup and its descendants, we can
> effectively trim the hierarchy as seen by a controller from the leafs
> up. However, there is currently no way to compress the hierarchy in
> the intermediate levels.
> 
> This patch implements a fine-grained mechanism to allow a controller to
> skip some intermediate levels in a hierarchy and effectively flatten
> the hierarchy as seen by that controller.
> 
> Controllers can now be directly enabled or disabled in a cgroup
> by writing to the "cgroup.controllers" file.  The special prefix
> '#' with the controller name is used to set that controller in
> pass-through mode.  In that mode, the controller is disabled for that
> cgroup but it allows its children to have that controller enabled or
> in pass-through mode again.
> 
> With this change, each controller can now have a unique view of their
> virtual process hierarchy that can be quite different from other
> controllers.  We now have the freedom and flexibility to create the
> right hierarchy for each controller to suit their own needs without
> performance loss when compared with cgroup v1.

I can see the appeal but this needs at least more refinements.

This breaks the invariant that in a cgroup its resource control knobs
control distribution of resources from its parent.  IOW, the resource
control knobs of a cgroup always belong to the parent.  This is also
reflected in how delegation is done.  The delegatee assumes ownership
of the cgroup itself and the ability to manage sub-cgroups but doesn't
get the ownership of the resource control knobs as otherwise the
parent would lose control over how it distributes its resources.

Another aspect is that most controllers aren't that sensitive to
nesting several levels.  Expensive operations can be and already are
aggregated and the performance overhead of several levels of nesting
barely shows up.  Skipping levels can be an interesting optimization
approach and we can definitely support from the core side; however,
it'd be a lot nicer if we could do that optimization transparently
(e.g. CPU can skip multi level queueing if there usually is only one
item at some levels).

Hmm... that said, if we can fix the delegation issue in a not-too-ugly
way, why not?  I wonder whether we can still keep the resource control
knobs attached to the parent and skip in the middle.  Topology-wise,
that'd make more sense too.

Thanks.

-- 
tejun

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-05-19 20:26   ` Tejun Heo
@ 2017-05-19 20:58     ` Tejun Heo
  2017-05-22 17:13     ` Waiman Long
  1 sibling, 0 replies; 69+ messages in thread
From: Tejun Heo @ 2017-05-19 20:58 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello,

On Fri, May 19, 2017 at 04:26:24PM -0400, Tejun Heo wrote:
> (exactly in the way necessary), I wonder whether it'd be better to
> simply allow root to be both domain and thread root.

I'll give this approach a shot early next week.

Thanks.

-- 
tejun

* Re: [RFC PATCH v2 13/17] cgroup: Allow fine-grained controllers control in cgroup v2
  2017-05-19 20:55   ` Tejun Heo
@ 2017-05-19 21:20     ` Waiman Long
  2017-05-24 17:31       ` Tejun Heo
  0 siblings, 1 reply; 69+ messages in thread
From: Waiman Long @ 2017-05-19 21:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

On 05/19/2017 04:55 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Mon, May 15, 2017 at 09:34:12AM -0400, Waiman Long wrote:
>> For cgroup v1, different controllers can be bound to different cgroup
>> hierarchies optimized for their own use cases. That is not currently
>> the case for cgroup v2 where combining all these controllers into
>> the same hierarchy will probably require more levels than is needed
>> by each individual controller.
>>
>> By not enabling a controller in a cgroup and its descendants, we can
>> effectively trim the hierarchy as seen by a controller from the leafs
>> up. However, there is currently no way to compress the hierarchy in
>> the intermediate levels.
>>
>> This patch implements a fine-grained mechanism to allow a controller to
>> skip some intermediate levels in a hierarchy and effectively flatten
>> the hierarchy as seen by that controller.
>>
>> Controllers can now be directly enabled or disabled in a cgroup
>> by writing to the "cgroup.controllers" file.  The special prefix
>> '#' with the controller name is used to set that controller in
>> pass-through mode.  In that mode, the controller is disabled for that
>> cgroup but it allows its children to have that controller enabled or
>> in pass-through mode again.
>>
>> With this change, each controller can now have a unique view of their
>> virtual process hierarchy that can be quite different from other
>> controllers.  We now have the freedom and flexibility to create the
>> right hierarchy for each controller to suit their own needs without
>> performance loss when compared with cgroup v1.
> I can see the appeal but this needs at least more refinements.
>
> This breaks the invariant that in a cgroup its resource control knobs
> control distribution of resources from its parent.  IOW, the resource
> control knobs of a cgroup always belong to the parent.  This is also
> reflected in how delegation is done.  The delegatee assumes ownership
> of the cgroup itself and the ability to manage sub-cgroups but doesn't
> get the ownership of the resource control knobs as otherwise the
> parent would lose control over how it distributes its resources.

One twist that I am thinking of is to have a controller enabled by the
parent in subtree_control, but then allow the child to either disable it
or set it in pass-through mode by writing to controllers file. IOW, a
child cannot enable a controller without parent's permission. Once a
child has permission, it can do whatever it wants. A parent cannot force
a child to have a controller enabled.

> Another aspect is that most controllers aren't that sensitive to
> nesting several levels.  Expensive operations can be and already are
> aggregated and the performance overhead of several levels of nesting
> barely shows up.  Skipping levels can be an interesting optimization
> approach and we can definitely support from the core side; however,
> it'd be a lot nicer if we could do that optimization transparently
> (e.g. CPU can skip multi level queueing if there usually is only one
> item at some levels).

The trend that I am seeing is that the total number of controllers is
going to grow over time. New controllers may be sensitive to the level
of nesting like the cpu controller. I am also thinking about how systemd
is using the cgroup filesystem for task classification purpose without
any controller attached to it. With this scheme, we can accommodate all
the different needs without using different cgroup filesystems.

> Hmm... that said, if we can fix the delegation issue in a not-too-ugly
> way, why not?  I wonder whether we can still keep the resource control
> knobs attached to the parent and skip in the middle.  Topology-wise,
> that'd make more sense too.

Let me know what you think about my proposal above.

Cheers,
Longman

* Re: [RFC PATCH v2 12/17] cgroup: Remove cgroup v2 no internal process constraint
  2017-05-19 20:38   ` Tejun Heo
@ 2017-05-20  2:10     ` Mike Galbraith
  2017-05-24 17:01       ` Tejun Heo
  2017-05-22 16:56     ` Waiman Long
  1 sibling, 1 reply; 69+ messages in thread
From: Mike Galbraith @ 2017-05-20  2:10 UTC (permalink / raw)
  To: Tejun Heo, Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto

On Fri, 2017-05-19 at 16:38 -0400, Tejun Heo wrote:
> Hello, Waiman.
> 
> On Mon, May 15, 2017 at 09:34:11AM -0400, Waiman Long wrote:
> > The rationale behind the cgroup v2 no internal process constraint is
> > to avoid resource competition between internal processes and child
> > cgroups. However, not all controllers have problems with internal
> > process competition. Enforcing this rule may lead to unnatural process
> > hierarchies and unneeded levels for those controllers.
> 
> This isn't necessarily something we can determine by looking at the
> current state of controllers.  It's true that some controllers - pid
> and perf - inherently only care about membership of each task but at
> the same time neither really suffers from the constraint either.  CPU
> which is the problematic one here...

(+ cpuacct + cpuset)

* Re: [RFC PATCH v2 12/17] cgroup: Remove cgroup v2 no internal process constraint
  2017-05-19 20:38   ` Tejun Heo
  2017-05-20  2:10     ` Mike Galbraith
@ 2017-05-22 16:56     ` Waiman Long
  2017-05-24 17:05       ` Tejun Heo
  1 sibling, 1 reply; 69+ messages in thread
From: Waiman Long @ 2017-05-22 16:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

On 05/19/2017 04:38 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Mon, May 15, 2017 at 09:34:11AM -0400, Waiman Long wrote:
>> The rationale behind the cgroup v2 no internal process constraint is
>> to avoid resource competition between internal processes and child
>> cgroups. However, not all controllers have problems with internal
>> process competition. Enforcing this rule may lead to unnatural process
>> hierarchies and unneeded levels for those controllers.
> This isn't necessarily something we can determine by looking at the
> current state of controllers.  It's true that some controllers - pid
> and perf - inherently only care about membership of each task but at
> the same time neither really suffers from the constraint either.  CPU
> which is the problematic one here and currently only cares about tasks
> actually distributes resources which have parts which are specific to
> domain rather than threads and we don't want to declare that CPU isn't
> domain aware resource because it inherently is.

I agree that it is hard to decide which controller should be regarded as
domain aware and which should not be. That is why I don't attempt to do
that in the v2 patchset.

Unlike my v1 patch, where each controller had to be specifically marked
as being a resource domain and hence had a special directory for internal
process resource control knobs, the v2 patch leaves the decision up to
the userland. Depending on the context, any controller can now have
special resource control knobs for internal processes in the
cgroup.resource_domain directory by writing the controller name to the
cgroup.resource_control file. So even the CPU controller can be regarded
as domain aware, if necessary. This is all part of my move to give as
much freedom and flexibility as possible to the userland.

>> This patch removes the no internal process constraint by enabling those
>> controllers that don't like internal process competition to have a
>> separate set of control knobs just for internal processes in a cgroup.
>>
>> A new control file "cgroup.resource_control" is added. Enabling a
>> controller with a "+" prefix will create a separate set of control
>> knobs for that controller in the special "cgroup.resource_domain"
>> sub-directory for all the internal processes. The existing control
>> knobs in the cgroup will then be used to manage resource distribution
>> between internal processes as a group and other child cgroups.
> We would need to declare all major resource controllers to be needing
> that special sub-directory.  That'd work around the
> no-internal-process constraint but I don't think it is solving any
> real problems.  It's just the kernel doing something that userland can
> do with ease and more context.

All controllers can use the special sub-directory if userland chooses to
do so. The problem that I am trying to address in this patch is to allow
a more natural hierarchy that reflects a certain purpose, like the task
classification done by systemd. Restricting tasks only to leaf nodes
makes the hierarchy unnatural and probably difficult to manage.
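
As a rough sketch of the intended userland-visible behavior (a toy
user-space model only, not the kernel implementation; the in-memory
representation below is invented, and only the file names come from the
patch description):

```python
class Cgroup:
    """Toy model of the proposed cgroup.resource_control interface.

    Writing "+<ctrl>" is meant to create internal-process control knobs
    for that controller under a special cgroup.resource_domain child.
    """

    def __init__(self, controllers):
        self.controllers = set(controllers)   # enabled via subtree_control
        self.resource_domain = None           # knobs for internal processes

    def write_resource_control(self, line):
        op, ctrl = line[0], line[1:]
        if ctrl not in self.controllers:
            raise ValueError(f"{ctrl} not enabled in this cgroup")
        if self.resource_domain is None:
            self.resource_domain = set()
        if op == "+":
            # knobs for internal processes appear for this controller
            self.resource_domain.add(ctrl)
        elif op == "-":
            self.resource_domain.discard(ctrl)
```

With this model, the existing knobs in the cgroup keep managing the
distribution between the internal-process group and the child cgroups,
while the resource_domain set represents the extra per-controller knobs.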

Regards,
Longman

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-05-19 20:26   ` Tejun Heo
  2017-05-19 20:58     ` Tejun Heo
@ 2017-05-22 17:13     ` Waiman Long
  2017-05-22 17:32       ` Waiman Long
  2017-05-24 20:36       ` Tejun Heo
  1 sibling, 2 replies; 69+ messages in thread
From: Waiman Long @ 2017-05-22 17:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

On 05/19/2017 04:26 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Mon, May 15, 2017 at 09:34:10AM -0400, Waiman Long wrote:
>> Now we could have something like
>>
>> 	R -- A -- B
>> 	 \
>> 	  T1 -- T2
>>
>> where R is the thread root, A and B are non-threaded cgroups, T1 and
>> T2 are threaded cgroups. The cgroups R, T1, T2 form a threaded subtree
>> where all the non-threaded resources are accounted for in R.  The no
>> internal process constraint does not apply in the threaded subtree.
>> Non-threaded controllers need to properly handle the competition
>> between internal processes and child cgroups at the thread root.
>>
>> This model will be flexible enough to support the need of the threaded
>> controllers.
> Maybe I'm misunderstanding the design, but this seems to push the
> processes which belong to the threaded subtree to the parent which is
> part of the usual resource domain hierarchy thus breaking the no
> internal competition constraint.  I'm not sure this is something we'd
> want.  Given that the limitation of the original threaded mode was the
> required nesting below root and that we treat root special anyway
> (exactly in the way necessary), I wonder whether it'd be better to
> simply allow root to be both domain and thread root.

Yes, root can be both domain and thread root. I haven't placed any
restriction on that.

>
> Specific review points below but we'd probably want to discuss the
> overall design first.
>
>> +static inline bool cgroup_is_threaded(const struct cgroup *cgrp)
>> +{
>> +	return cgrp->proc_cgrp && (cgrp->proc_cgrp != cgrp);
>> +}
>> +
>> +static inline bool cgroup_is_thread_root(const struct cgroup *cgrp)
>> +{
>> +	return cgrp->proc_cgrp == cgrp;
>> +}
> Maybe add a bit of comments explaining what's going on with
> ->proc_cgrp?

Sure, will do that.

>>  /**
>> + * threaded_children_count - returns # of threaded children
>> + * @cgrp: cgroup to be tested
>> + *
>> + * cgroup_mutex must be held by the caller.
>> + */
>> +static int threaded_children_count(struct cgroup *cgrp)
>> +{
>> +	struct cgroup *child;
>> +	int count = 0;
>> +
>> +	lockdep_assert_held(&cgroup_mutex);
>> +	cgroup_for_each_live_child(child, cgrp)
>> +		if (cgroup_is_threaded(child))
>> +			count++;
>> +	return count;
>> +}
> It probably would be a good idea to keep track of the count so that we
> don't have to count them each time.  There are cases where people end
> up creating a very high number of cgroups and we've already been
> bitten a couple times with silly complexity issues.

Thanks for the suggestion. I can keep a count in the cgroup structure to
avoid doing that repetitively.
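
For illustration, the cached-count idea can be sketched in a toy
user-space model (the field name `nr_threaded_children` and the class
layout are invented here; the real change would keep the counter in
`struct cgroup` and update it under cgroup_mutex):

```python
class Cgroup:
    """Toy cgroup that caches the number of threaded children."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.threaded = False
        self.nr_threaded_children = 0   # cached, updated on mode changes
        if parent:
            parent.children.append(self)

    def set_threaded(self, on):
        # Update the parent's cached counter on each mode change instead
        # of rescanning all children on every query (O(1) per lookup
        # instead of O(n), which matters with very large cgroup counts).
        if on != self.threaded and self.parent:
            self.parent.nr_threaded_children += 1 if on else -1
        self.threaded = on


def threaded_children_count(cgrp):
    # The O(n) walk that the cached counter would replace.
    return sum(1 for c in cgrp.children if c.threaded)
```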

>
>> @@ -2982,22 +3010,48 @@ static int cgroup_enable_threaded(struct cgroup *cgrp)
>>  	LIST_HEAD(csets);
>>  	struct cgrp_cset_link *link;
>>  	struct css_set *cset, *cset_next;
>> +	struct cgroup *child;
>>  	int ret;
>> +	u16 ss_mask;
>>  
>>  	lockdep_assert_held(&cgroup_mutex);
>>  
>>  	/* noop if already threaded */
>> -	if (cgrp->proc_cgrp)
>> +	if (cgroup_is_threaded(cgrp))
>>  		return 0;
>>  
>> -	/* allow only if there are neither children or enabled controllers */
>> -	if (css_has_online_children(&cgrp->self) || cgrp->subtree_control)
>> +	/*
>> +	 * Allow only if it is not the root and there are:
>> +	 * 1) no children,
>> +	 * 2) no non-threaded controllers are enabled, and
>> +	 * 3) no attached tasks.
>> +	 *
>> +	 * With no attached tasks, it is assumed that no css_sets will be
>> +	 * linked to the current cgroup. This may not be true if some dead
>> +	 * css_sets linger around due to task_struct leakage, for example.
>> +	 */
> It doesn't look like the code is actually making this (incorrect)
> assumption.  I suppose the comment is from before
> cgroup_is_populated() was added?

Yes, it is a bug. I should have checked the tasks_count instead of using
cgroup_is_populated. Thanks for catching that.

>
>>  	spin_lock_irq(&css_set_lock);
>>  	list_for_each_entry(link, &cgrp->cset_links, cset_link) {
>>  		cset = link->cset;
>> +		if (cset->dead)
>> +			continue;
> Hmm... is this a bug fix which is necessary regardless of whether we
> change the threadroot semantics or not?

That is true. I put it there because the reference counting bug
fixed in patch 6 caused a lot of dead csets to hang around before the
fix. I can pull this out as a separate patch.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-05-22 17:13     ` Waiman Long
@ 2017-05-22 17:32       ` Waiman Long
  2017-05-24 20:36       ` Tejun Heo
  1 sibling, 0 replies; 69+ messages in thread
From: Waiman Long @ 2017-05-22 17:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

On 05/22/2017 01:13 PM, Waiman Long wrote:
> On 05/19/2017 04:26 PM, Tejun Heo wrote:
>>> @@ -2982,22 +3010,48 @@ static int cgroup_enable_threaded(struct cgroup *cgrp)
>>>  	LIST_HEAD(csets);
>>>  	struct cgrp_cset_link *link;
>>>  	struct css_set *cset, *cset_next;
>>> +	struct cgroup *child;
>>>  	int ret;
>>> +	u16 ss_mask;
>>>  
>>>  	lockdep_assert_held(&cgroup_mutex);
>>>  
>>>  	/* noop if already threaded */
>>> -	if (cgrp->proc_cgrp)
>>> +	if (cgroup_is_threaded(cgrp))
>>>  		return 0;
>>>  
>>> -	/* allow only if there are neither children or enabled controllers */
>>> -	if (css_has_online_children(&cgrp->self) || cgrp->subtree_control)
>>> +	/*
>>> +	 * Allow only if it is not the root and there are:
>>> +	 * 1) no children,
>>> +	 * 2) no non-threaded controllers are enabled, and
>>> +	 * 3) no attached tasks.
>>> +	 *
>>> +	 * With no attached tasks, it is assumed that no css_sets will be
>>> +	 * linked to the current cgroup. This may not be true if some dead
>>> +	 * css_sets linger around due to task_struct leakage, for example.
>>> +	 */
>> It doesn't look like the code is actually making this (incorrect)
>> assumption.  I suppose the comment is from before
>> cgroup_is_populated() was added?
> Yes, it is a bug. I should have checked the tasks_count instead of using
> cgroup_is_populated. Thanks for catching that.

Sorry, I would like to take that back. I think cgroup_is_populated()
will return true if there is any task attached to the cgroup. So I
think it is doing the right thing with regard to (3).

Cheers,
Longman

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 12/17] cgroup: Remove cgroup v2 no internal process constraint
  2017-05-20  2:10     ` Mike Galbraith
@ 2017-05-24 17:01       ` Tejun Heo
  0 siblings, 0 replies; 69+ messages in thread
From: Tejun Heo @ 2017-05-24 17:01 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Waiman Long, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, linux-doc, linux-mm,
	kernel-team, pjt, luto

Hello, Mike.

On Sat, May 20, 2017 at 04:10:07AM +0200, Mike Galbraith wrote:
> On Fri, 2017-05-19 at 16:38 -0400, Tejun Heo wrote:
> > Hello, Waiman.
> > 
> > On Mon, May 15, 2017 at 09:34:11AM -0400, Waiman Long wrote:
> > > The rationale behind the cgroup v2 no internal process constraint is
> > > to avoid resource competition between internal processes and child
> > > cgroups. However, not all controllers have problems with internal
> > > process competition. Enforcing this rule may lead to an unnatural
> > > process hierarchy and unneeded levels for those controllers.
> > 
> > This isn't necessarily something we can determine by looking at the
> > current state of controllers.  It's true that some controllers - pid
> > and perf - inherently only care about membership of each task but at
> > the same time neither really suffers from the constraint either.  CPU
> > which is the problematic one here...
> 
> (+ cpuacct + cpuset)

Yeah, cpuacct and cpuset are in the same boat as perf.  cpuset is
completely so and we can move the tree walk to the reader side or
aggregate propagation for cpuacct as necessary.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 12/17] cgroup: Remove cgroup v2 no internal process constraint
  2017-05-22 16:56     ` Waiman Long
@ 2017-05-24 17:05       ` Tejun Heo
  2017-05-24 18:19         ` Waiman Long
  0 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-05-24 17:05 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello,

On Mon, May 22, 2017 at 12:56:08PM -0400, Waiman Long wrote:
> All controllers can use the special sub-directory if userland chooses to
> do so. The problem that I am trying to address in this patch is to allow
> a more natural hierarchy that reflects a certain purpose, like the task
> classification done by systemd. Restricting tasks only to leaf nodes
> makes the hierarchy unnatural and probably difficult to manage.

I see but how is this different from userland just creating the leaf
cgroup?  I'm not sure what this actually enables in terms of what can
be achieved with cgroup.  I suppose we can argue that this is more
convenient but I'd like to keep the interface orthogonal as much as
reasonably possible.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 13/17] cgroup: Allow fine-grained controllers control in cgroup v2
  2017-05-19 21:20     ` Waiman Long
@ 2017-05-24 17:31       ` Tejun Heo
  2017-05-24 17:49         ` Waiman Long
  0 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-05-24 17:31 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello, Waiman.

On Fri, May 19, 2017 at 05:20:01PM -0400, Waiman Long wrote:
> > This breaks the invariant that in a cgroup its resource control knobs
> > control distribution of resources from its parent.  IOW, the resource
> > control knobs of a cgroup always belong to the parent.  This is also
> > reflected in how delegation is done.  The delegatee assumes ownership
> > of the cgroup itself and the ability to manage sub-cgroups but doesn't
> > get the ownership of the resource control knobs as otherwise the
> > parent would lose control over how it distributes its resources.
> 
> One twist that I am thinking is to have a controller enabled by the
> parent in subtree_control, but then allow the child to either disable it
> or set it in pass-through mode by writing to controllers file. IOW, a
> child cannot enable a controller without parent's permission. Once a
> child has permission, it can do whatever it wants. A parent cannot force
> a child to have a controller enabled.

Heh, I think I need more details to follow your proposal.  Anyways,
what we need to guarantee is that a descendant is never allowed to
pull in more resources than its ancestors want it to.

> > Another aspect is that most controllers aren't that sensitive to
> > nesting several levels.  Expensive operations can be and already are
> > aggregated and the performance overhead of several levels of nesting
> > barely shows up.  Skipping levels can be an interesting optimization
> > approach and we can definitely support from the core side; however,
> > it'd be a lot nicer if we could do that optimization transparently
> > (e.g. CPU can skip multi level queueing if there usually is only one
> > item at some levels).
> 
> The trend that I am seeing is that the total number of controllers is
> going to grow over time. New controllers may be sensitive to the level
> of nesting like the cpu controller. I am also thinking about how systemd
> is using the cgroup filesystem for task classification purpose without
> any controller attached to it. With this scheme, we can accommodate all
> the different needs without using different cgroup filesystems.

I'm not sure about that.  It's true that cgroup hierarchy is being
used more but there are only so many hard / complex resources that we
deal with - cpu, memory and io.  Beyond those, other uses are usually
about identifying membership (perf, net) or propagating and
restricting attributes (cpuset).  pids can be considered an exception
but we have it only because pids can globally run out a lot sooner
than can be controlled through memory.  Even then, it's trivial.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 13/17] cgroup: Allow fine-grained controllers control in cgroup v2
  2017-05-24 17:31       ` Tejun Heo
@ 2017-05-24 17:49         ` Waiman Long
  2017-05-24 17:56           ` Tejun Heo
  0 siblings, 1 reply; 69+ messages in thread
From: Waiman Long @ 2017-05-24 17:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

On 05/24/2017 01:31 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Fri, May 19, 2017 at 05:20:01PM -0400, Waiman Long wrote:
>>> This breaks the invariant that in a cgroup its resource control knobs
>>> control distribution of resources from its parent.  IOW, the resource
>>> control knobs of a cgroup always belong to the parent.  This is also
>>> reflected in how delegation is done.  The delegatee assumes ownership
>>> of the cgroup itself and the ability to manage sub-cgroups but doesn't
>>> get the ownership of the resource control knobs as otherwise the
>>> parent would lose control over how it distributes its resources.
>> One twist that I am thinking is to have a controller enabled by the
>> parent in subtree_control, but then allow the child to either disable it
>> or set it in pass-through mode by writing to controllers file. IOW, a
>> child cannot enable a controller without parent's permission. Once a
>> child has permission, it can do whatever it wants. A parent cannot force
>> a child to have a controller enabled.
> Heh, I think I need more details to follow your proposal.  Anyways,
> what we need to guarantee is that a descendant is never allowed to
> pull in more resources than its ancestors want it to.

What I am saying is as follows:
    / A
P - B
   \ C

# echo +memory > P/cgroup.subtree_control
# echo -memory > P/A/cgroup.controllers
# echo "#memory" > P/B/cgroup.controllers

The parent grants the memory controller to its children - A, B and C.
Child A has the memory controller explicitly disabled. Child B has the
memory controller in pass-through mode, while child C has the memory
controller enabled by default. "echo +memory > cgroup.controllers" is
not allowed. There are 2 possible choices with regard to the '-' or '#'
prefixes. We can allow them before the grant from the parent or only
after that. In the former case, the state remains dormant until after
the grant from the parent.
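
The proposed semantics can be modeled as a small state machine (a
sketch of the proposal only; the `Child` class and its attributes are
invented for illustration and correspond to nothing in the kernel):

```python
class Child:
    """Toy model of a child cgroup's view of one controller.

    The parent grants the controller via subtree_control; the child may
    then keep the default ('+'), opt out ('-'), or go pass-through ('#')
    in its own cgroup.controllers file. Without a grant, the child-side
    setting stays dormant.
    """

    def __init__(self):
        self.granted = False   # parent wrote "+ctrl" to subtree_control
        self.local = "+"       # child's own setting: '+', '-' or '#'

    def write_controllers(self, prefix):
        if prefix not in ("+", "-", "#"):
            raise ValueError(prefix)
        if prefix == "+" and not self.granted:
            # "echo +memory > cgroup.controllers" without a prior grant
            # from the parent is rejected.
            raise PermissionError("no grant from parent")
        self.local = prefix

    def effective(self):
        if not self.granted:
            return "off"       # dormant until the parent grants
        return {"+": "enabled", "-": "disabled",
                "#": "pass-through"}[self.local]
```

This corresponds to the "allow the prefixes before the grant" choice:
a pre-grant '-' or '#' is recorded but only takes effect once the
parent grants the controller.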

Cheers,
Longman

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 13/17] cgroup: Allow fine-grained controllers control in cgroup v2
  2017-05-24 17:49         ` Waiman Long
@ 2017-05-24 17:56           ` Tejun Heo
  2017-05-24 18:17             ` Waiman Long
  0 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-05-24 17:56 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello,

On Wed, May 24, 2017 at 01:49:46PM -0400, Waiman Long wrote:
> What I am saying is as follows:
>     / A
> P - B
>    \ C
> 
> # echo +memory > P/cgroup.subtree_control
> # echo -memory > P/A/cgroup.controllers
> # echo "#memory" > P/B/cgroup.controllers
> 
> The parent grants the memory controller to its children - A, B and C.
> Child A has the memory controller explicitly disabled. Child B has the
> memory controller in pass-through mode, while child C has the memory
> controller enabled by default. "echo +memory > cgroup.controllers" is
> not allowed. There are 2 possible choices with regard to the '-' or '#'
> prefixes. We can allow them before the grant from the parent or only
> after that. In the former case, the state remains dormant until after
> the grant from the parent.

Ah, I see, you want cgroup.controllers to be able to mask available
controllers by the parent.  Can you expand your example with further
nesting and how #memory on cgroup.controllers would affect the nested
descendant?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 13/17] cgroup: Allow fine-grained controllers control in cgroup v2
  2017-05-24 17:56           ` Tejun Heo
@ 2017-05-24 18:17             ` Waiman Long
  0 siblings, 0 replies; 69+ messages in thread
From: Waiman Long @ 2017-05-24 18:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

On 05/24/2017 01:56 PM, Tejun Heo wrote:
> Hello,
>
> On Wed, May 24, 2017 at 01:49:46PM -0400, Waiman Long wrote:
>> What I am saying is as follows:
>>     / A
>> P - B
>>    \ C
>>
>> # echo +memory > P/cgroup.subtree_control
>> # echo -memory > P/A/cgroup.controllers
>> # echo "#memory" > P/B/cgroup.controllers
>>
>> The parent grants the memory controller to its children - A, B and C.
>> Child A has the memory controller explicitly disabled. Child B has the
>> memory controller in pass-through mode, while child C has the memory
>> controller enabled by default. "echo +memory > cgroup.controllers" is
>> not allowed. There are 2 possible choices with regard to the '-' or '#'
>> prefixes. We can allow them before the grant from the parent or only
>> after that. In the former case, the state remains dormant until after
>> the grant from the parent.
> Ah, I see, you want cgroup.controllers to be able to mask available
> controllers by the parent.  Can you expand your example with further
> nesting and how #memory on cgroup.controllers would affect the nested
> descendant?
>
> Thanks.
>
I would allow enabling the controller in subtree_control if it is granted
by the parent and not explicitly disabled. IOW, both B and C can "echo
+memory" to their subtree_control to grant the memory controller to their
children, but not A. A has to re-enable the memory controller or set it
to pass-through mode before it can enable it in subtree_control. I need
to clarify that "echo +memory > cgroup.controllers" is allowed to
re-enable it, but not without a grant from its parent.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 12/17] cgroup: Remove cgroup v2 no internal process constraint
  2017-05-24 17:05       ` Tejun Heo
@ 2017-05-24 18:19         ` Waiman Long
  0 siblings, 0 replies; 69+ messages in thread
From: Waiman Long @ 2017-05-24 18:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

On 05/24/2017 01:05 PM, Tejun Heo wrote:
> Hello,
>
> On Mon, May 22, 2017 at 12:56:08PM -0400, Waiman Long wrote:
>> All controllers can use the special sub-directory if userland chooses to
>> do so. The problem that I am trying to address in this patch is to allow
>> a more natural hierarchy that reflects a certain purpose, like the task
>> classification done by systemd. Restricting tasks only to leaf nodes
>> makes the hierarchy unnatural and probably difficult to manage.
> I see but how is this different from userland just creating the leaf
> cgroup?  I'm not sure what this actually enables in terms of what can
> be achieved with cgroup.  I suppose we can argue that this is more
> convenient but I'd like to keep the interface orthogonal as much as
> reasonably possible.
>
> Thanks.
>
I am just thinking that it is a bit more natural with the concept of the
special resource domain sub-directory. You are right that the same
effect can be achieved by proper placement of tasks and enabling of
controllers.

A (cpu,memory) [T1] - B (cpu,memory) [T2]
                       \ cgroup.resource_domain (memory)

A (cpu,memory) - B (cpu,memory) [T2]
                  \ C (memory) [T1]

With respect to the tasks T1 and T2, the above 2 configurations are the
same.

I am OK to drop this patch. However, I still think the current
no-internal process constraint is too restricting. I will suggest either

 1. Allow internal processes and document how userland can avoid
    internal process competition, as shown above, or
 2. Mark only certain controllers as not allowing internal processes
    when they are enabled.

What do you think about this?

Cheers,
Longman

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-05-22 17:13     ` Waiman Long
  2017-05-22 17:32       ` Waiman Long
@ 2017-05-24 20:36       ` Tejun Heo
  2017-05-24 21:17         ` Waiman Long
  1 sibling, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-05-24 20:36 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello, Waiman.

On Mon, May 22, 2017 at 01:13:16PM -0400, Waiman Long wrote:
> > Maybe I'm misunderstanding the design, but this seems to push the
> > processes which belong to the threaded subtree to the parent which is
> > part of the usual resource domain hierarchy thus breaking the no
> > internal competition constraint.  I'm not sure this is something we'd
> > want.  Given that the limitation of the original threaded mode was the
> > required nesting below root and that we treat root special anyway
> > (exactly in the way necessary), I wonder whether it'd be better to
> > simply allow root to be both domain and thread root.
> 
> Yes, root can be both domain and thread root. I haven't placed any
> restriction on that.

I've been playing with the proposed "make the parent resource domain".
Unfortunately, the parent - child relationship becomes weird.

The parent becomes the thread root, which means that its
cgroup.threads file becomes writable and threads can be put in there.
It's really weird to write to a child's interface and have the
parent's behavior changed.  This becomes weirder with delegation.  If
a cgroup is delegated, its cgroup.threads should be delegated too but
if the child enables threaded mode, that makes the undelegated parent
thread root, which means that either 1. the delegatee can't migrate
threads to the thread root or 2. if the parent's cgroup.threads is
> writeable, the delegatee can mess with other descendants under it
which shouldn't be allowed.

I think the operation of making a cgroup a thread root should happen
on the cgroup where that's requested; otherwise, nesting becomes too
twisted.  This should be solvable.  Will think more about it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-05-24 20:36       ` Tejun Heo
@ 2017-05-24 21:17         ` Waiman Long
  2017-05-24 21:27           ` Tejun Heo
  0 siblings, 1 reply; 69+ messages in thread
From: Waiman Long @ 2017-05-24 21:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

On 05/24/2017 04:36 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Mon, May 22, 2017 at 01:13:16PM -0400, Waiman Long wrote:
>>> Maybe I'm misunderstanding the design, but this seems to push the
>>> processes which belong to the threaded subtree to the parent which is
>>> part of the usual resource domain hierarchy thus breaking the no
>>> internal competition constraint.  I'm not sure this is something we'd
>>> want.  Given that the limitation of the original threaded mode was the
>>> required nesting below root and that we treat root special anyway
>>> (exactly in the way necessary), I wonder whether it'd be better to
>>> simply allow root to be both domain and thread root.
>> Yes, root can be both domain and thread root. I haven't placed any
>> restriction on that.
> I've been playing with the proposed "make the parent resource domain".
> Unfortunately, the parent - child relationship becomes weird.
>
> The parent becomes the thread root, which means that its
> cgroup.threads file becomes writable and threads can be put in there.
> It's really weird to write to a child's interface and have the
> parent's behavior changed.  This becomes weirder with delegation.  If
> a cgroup is delegated, its cgroup.threads should be delegated too but
> if the child enables threaded mode, that makes the undelegated parent
> thread root, which means that either 1. the delegatee can't migrate
> threads to the thread root or 2. if the parent's cgroup.threads is
> > writeable, the delegatee can mess with other descendants under it
> which shouldn't be allowed.
>
> I think the operation of making a cgroup a thread root should happen
> on the cgroup where that's requested; otherwise, nesting becomes too
> twisted.  This should be solvable.  Will think more about it.
>
> Thanks.
>
An alternative is to have separate enabling for thread root. For example,

# echo root > cgroup.threads
# echo enable > child/cgroup.threads

The first statement makes the current cgroup the thread root. However,
setting it as a thread root doesn't make its children threaded. That
has to be explicitly done on each of the children. Once a child cgroup
is made threaded, all its descendants will be threaded. That will
have the same effect as the current patch.
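
That two-step sequence could look like this in a toy model (the class,
its `mode` attribute, and the checks are invented to illustrate the
proposal, not actual kernel code):

```python
class Cgroup:
    """Toy cgroup modeling explicit thread-root / threaded enabling."""

    def __init__(self, parent=None):
        self.parent = parent
        self.mode = "domain"   # "domain", "thread-root" or "threaded"

    def write_threads(self, cmd):
        if cmd == "root":
            # "echo root > cgroup.threads": mark this cgroup a thread
            # root; this does not make its children threaded by itself.
            self.mode = "thread-root"
        elif cmd == "enable":
            # "echo enable > child/cgroup.threads": a cgroup may only
            # become threaded under a thread root or under another
            # threaded cgroup, mirroring the explicit two-step scheme.
            if self.parent is None or self.parent.mode == "domain":
                raise PermissionError("parent is not a thread root")
            self.mode = "threaded"
```

A usage sketch: write "root" on the intended thread root first, then
"enable" on each child; descendants of a threaded cgroup can in turn
be enabled, so the whole subtree below the root ends up threaded.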

With delegation, do you mean the relationship between a container and
its host?

Cheers,
Longman

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-05-24 21:17         ` Waiman Long
@ 2017-05-24 21:27           ` Tejun Heo
  2017-06-01 14:50             ` Tejun Heo
  0 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-05-24 21:27 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello,

On Wed, May 24, 2017 at 05:17:13PM -0400, Waiman Long wrote:
> An alternative is to have separate enabling for thread root. For example,
> 
> # echo root > cgroup.threads
> # echo enable > child/cgroup.threads
> 
> The first statement makes the current cgroup the thread root. However,
> setting it as a thread root doesn't make its children threaded. That
> has to be explicitly done on each of the children. Once a child cgroup
> is made threaded, all its descendants will be threaded. That will
> have the same effect as the current patch.

Yeah, I'm toying with different ideas.  I'll get back to you once
things get more concrete.

> With delegation, do you mean the relationship between a container and
> its host?

It can be, but doesn't have to be.  For example, it can be delegation
to users without a namespace / container being involved.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-05-24 21:27           ` Tejun Heo
@ 2017-06-01 14:50             ` Tejun Heo
  2017-06-01 15:10               ` Peter Zijlstra
  2017-06-01 18:41               ` Waiman Long
  0 siblings, 2 replies; 69+ messages in thread
From: Tejun Heo @ 2017-06-01 14:50 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello, Waiman.

A short update.  I tried making root special while keeping the
existing threaded semantics but I didn't really like it because we
have to couple controller enables/disables with threaded
enables/disables.  I'm now trying a simpler, albeit a bit more
tedious, approach which should leave things mostly symmetrical.  I'm
hoping to be able to post mostly working patches this week.

Also, do you mind posting the debug patches as a separate series?
Let's get the bits which make sense independently into the tree.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-06-01 14:50             ` Tejun Heo
@ 2017-06-01 15:10               ` Peter Zijlstra
  2017-06-01 15:35                 ` Tejun Heo
                                   ` (2 more replies)
  2017-06-01 18:41               ` Waiman Long
  1 sibling, 3 replies; 69+ messages in thread
From: Peter Zijlstra @ 2017-06-01 15:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Waiman Long, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

On Thu, Jun 01, 2017 at 10:50:42AM -0400, Tejun Heo wrote:
> Hello, Waiman.
> 
> A short update.  I tried making root special while keeping the
> existing threaded semantics but I didn't really like it because we
> have to couple controller enables/disables with threaded
> enables/disables.  I'm now trying a simpler, albeit a bit more
> tedious, approach which should leave things mostly symmetrical.  I'm
> hoping to be able to post mostly working patches this week.

I've not had time to look at any of this. But the question I'm most
curious about is how cgroup-v2 preserves the container invariant.

That is, each container (namespace) should look like a 'real' machine.
So just like userns allows to have a uid-0 (aka root) for each container
and pidns allows a pid-1 for each container, cgroupns should provide a
root group for each container.

And cgroup-v2 has this 'exception' (aka wart) for the root group which
needs to be replicated for each namespace.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-06-01 15:10               ` Peter Zijlstra
@ 2017-06-01 15:35                 ` Tejun Heo
  2017-06-01 18:44                 ` Waiman Long
  2017-06-01 20:15                 ` Waiman Long
  2 siblings, 0 replies; 69+ messages in thread
From: Tejun Heo @ 2017-06-01 15:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello, Peter.

On Thu, Jun 01, 2017 at 05:10:45PM +0200, Peter Zijlstra wrote:
> I've not had time to look at any of this. But the question I'm most
> curious about is how cgroup-v2 preserves the container invariant.
> 
> That is, each container (namespace) should look like a 'real' machine.
> So just like userns allows to have a uid-0 (aka root) for each container
> and pidns allows a pid-1 for each container, cgroupns should provide a
> root group for each container.
> 
> And cgroup-v2 has this 'exception' (aka wart) for the root group which
> needs to be replicated for each namespace.

The goal has never been that a container must be indistinguishable
from a real machine.  For certain things, there simply are no
exact equivalents due to sharing (memory stats or journal writes for
example) and those things are exactly why people prefer containers
over VMs for certain use cases.  If one wants full replication, VM
would be the way to go.

The goal is allowing enough container invariance so that appropriate
workloads can be contained and co-exist in useful ways.  This also
means that the contained workload is usually either a bit illiterate
w.r.t. the system details (doesn't care) or makes some adjustments
for running inside a container (most quasi-full-system ones already
do).

System root is inherently different from all other nested roots.
Making some exceptions for the root isn't about taking away from other
roots but more reflecting the inherent differences - there are things
which are inherently system / bare-metal.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-06-01 14:50             ` Tejun Heo
  2017-06-01 15:10               ` Peter Zijlstra
@ 2017-06-01 18:41               ` Waiman Long
  1 sibling, 0 replies; 69+ messages in thread
From: Waiman Long @ 2017-06-01 18:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

On 06/01/2017 10:50 AM, Tejun Heo wrote:
> Hello, Waiman.
>
> A short update.  I tried making root special while keeping the
> existing threaded semantics but I didn't really like it because we
> have to couple controller enables/disables with threaded
> enables/disables.  I'm now trying a simpler, albeit a bit more
> tedious, approach which should leave things mostly symmetrical.  I'm
> hoping to be able to post mostly working patches this week.

I am looking forward to your patches.

> Also, do you mind posting the debug patches as a separate series?
> Let's get the bits which make sense indepdently in the tree.

I am going to do that. The debug patches, however, will have dependencies
on other cgroup patches and so will need to be posted after the core
patches.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-06-01 15:10               ` Peter Zijlstra
  2017-06-01 15:35                 ` Tejun Heo
@ 2017-06-01 18:44                 ` Waiman Long
  2017-06-01 18:47                   ` Tejun Heo
  2017-06-01 19:55                   ` Waiman Long
  2017-06-01 20:15                 ` Waiman Long
  2 siblings, 2 replies; 69+ messages in thread
From: Waiman Long @ 2017-06-01 18:44 UTC (permalink / raw)
  To: Peter Zijlstra, Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Ingo Molnar, cgroups, linux-kernel,
	linux-doc, linux-mm, kernel-team, pjt, luto, efault

On 06/01/2017 11:10 AM, Peter Zijlstra wrote:
> On Thu, Jun 01, 2017 at 10:50:42AM -0400, Tejun Heo wrote:
>> Hello, Waiman.
>>
>> A short update.  I tried making root special while keeping the
>> existing threaded semantics but I didn't really like it because we
>> have to couple controller enables/disables with threaded
>> enables/disables.  I'm now trying a simpler, albeit a bit more
>> tedious, approach which should leave things mostly symmetrical.  I'm
>> hoping to be able to post mostly working patches this week.
> I've not had time to look at any of this. But the question I'm most
> curious about is how cgroup-v2 preserves the container invariant.
>
> That is, each container (namespace) should look like a 'real' machine.
> So just like userns allows to have a uid-0 (aka root) for each container
> and pidns allows a pid-1 for each container, cgroupns should provide a
> root group for each container.
>
> And cgroup-v2 has this 'exception' (aka wart) for the root group which
> needs to be replicated for each namespace.

One of the changes that I proposed in my patches was to get rid of the
no internal process constraint. I think that will solve a big part of
the container invariant problem that we have with cgroup v2.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-06-01 18:44                 ` Waiman Long
@ 2017-06-01 18:47                   ` Tejun Heo
  2017-06-01 19:27                     ` Waiman Long
  2017-06-01 19:55                   ` Waiman Long
  1 sibling, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-06-01 18:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello, Waiman.

On Thu, Jun 01, 2017 at 02:44:48PM -0400, Waiman Long wrote:
> > And cgroup-v2 has this 'exception' (aka wart) for the root group which
> > needs to be replicated for each namespace.
> 
> One of the changes that I proposed in my patches was to get rid of the
> no internal process constraint. I think that will solve a big part of
> the container invariant problem that we have with cgroup v2.

I'm not sure.  It just masks it without actually solving it.  I mean,
the constraint is there for a reason.  "Solving" it would defeat one
of the main capabilities for resource domains and masking it from
kernel side doesn't make a whole lot of sense to me given that it's
something which can be easily done from userland.  If we take out that
part, for controllers which don't care about resource domains,
wouldn't thread mode be a sufficient solution?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-06-01 18:47                   ` Tejun Heo
@ 2017-06-01 19:27                     ` Waiman Long
  2017-06-01 20:38                       ` Tejun Heo
  0 siblings, 1 reply; 69+ messages in thread
From: Waiman Long @ 2017-06-01 19:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

On 06/01/2017 02:47 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Thu, Jun 01, 2017 at 02:44:48PM -0400, Waiman Long wrote:
>>> And cgroup-v2 has this 'exception' (aka wart) for the root group which
>>> needs to be replicated for each namespace.
>> One of the changes that I proposed in my patches was to get rid of the
>> no internal process constraint. I think that will solve a big part of
>> the container invariant problem that we have with cgroup v2.
> I'm not sure.  It just masks it without actually solving it.  I mean,
> the constraint is there for a reason.  "Solving" it would defeat one
> of the main capabilities for resource domains and masking it from
> kernel side doesn't make a whole lot of sense to me given that it's
> something which can be easily done from userland.  If we take out that
> part, for controllers which don't care about resource domains,
> wouldn't thread mode be a sufficient solution?

As said in an earlier email, I agreed that masking it on the kernel side
may not be the best solution. I offer 2 other alternatives:
1) Document on how to work around the resource domains issue by proper
setup of the cgroup hierarchy.
2) Mark those controllers that require the no internal process
competition constraint and disallow internal process only when those
controllers are active.

I prefer the first alternative, but I can go with the second if necessary.
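The second alternative can be illustrated with a toy model (ordinary
Python, not kernel code; which controllers count as resource domains is an
assumption here for illustration):

```python
# Toy model of alternative 2: reject internal processes only while a
# resource-domain controller is enabled for the children.  The controller
# classification below is an assumption, not the kernel's.
DOMAIN_CONTROLLERS = {"memory", "io"}

class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        self.subtree_control = set()  # controllers enabled for children
        if parent is not None:
            parent.children.append(self)

    def can_host_procs(self):
        # Internal processes are only a problem when a domain controller
        # is distributed to the children.
        return not (self.children and
                    self.subtree_control & DOMAIN_CONTROLLERS)

r = Cgroup("R")
a = Cgroup("A", r)
r.subtree_control = {"pids"}            # threaded-only controller enabled
assert r.can_host_procs()               # internal processes allowed
r.subtree_control = {"pids", "memory"}  # a resource domain is enabled
assert not r.can_host_procs()           # constraint kicks in
```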

The major rationale behind my enhanced thread mode patch was to allow
something like

     R -- A -- B
     \
      T1 -- T2

where you can have resource domain controllers enabled in the thread
root as well as some child cgroups of the thread root. As the no internal
process rule is currently not applicable to the thread root, this
creates the dilemma that we need to deal with internal process competition.

The container invariant that PeterZ talked about will also be a serious
issue here as I don't think we are going to set up a container root
cgroup that will have no process allowed in it because it has some child
cgroups. IMHO, I don't think cgroup v2 will get wide adoption without
getting rid of that no internal process constraint.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-06-01 18:44                 ` Waiman Long
  2017-06-01 18:47                   ` Tejun Heo
@ 2017-06-01 19:55                   ` Waiman Long
  1 sibling, 0 replies; 69+ messages in thread
From: Waiman Long @ 2017-06-01 19:55 UTC (permalink / raw)
  To: Peter Zijlstra, Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Ingo Molnar, cgroups, linux-kernel,
	linux-doc, linux-mm, kernel-team, pjt, luto, efault

On 06/01/2017 02:44 PM, Waiman Long wrote:
> On 06/01/2017 11:10 AM, Peter Zijlstra wrote:
>> On Thu, Jun 01, 2017 at 10:50:42AM -0400, Tejun Heo wrote:
>>> Hello, Waiman.
>>>
>>> A short update.  I tried making root special while keeping the
>>> existing threaded semantics but I didn't really like it because we
>>> have to couple controller enables/disables with threaded
>>> enables/disables.  I'm now trying a simpler, albeit a bit more
>>> tedious, approach which should leave things mostly symmetrical.  I'm
>>> hoping to be able to post mostly working patches this week.
>> I've not had time to look at any of this. But the question I'm most
>> curious about is how cgroup-v2 preserves the container invariant.
>>
>> That is, each container (namespace) should look like a 'real' machine.
>> So just like userns allows to have a uid-0 (aka root) for each container
>> and pidns allows a pid-1 for each container, cgroupns should provide a
>> root group for each container.
>>
>> And cgroup-v2 has this 'exception' (aka wart) for the root group which
>> needs to be replicated for each namespace.
> One of the changes that I proposed in my patches was to get rid of the
> no internal process constraint. I think that will solve a big part of
> the container invariant problem that we have with cgroup v2.
>
> Cheers,
> Longman

Another idea that I have to further solve this container invariant
problem is do a cgroup setup like

CP -- CR

CP - container parent, which belongs to the host
CR - container root

We can enable the pass-through mode at the subtree_control file of CP to
force all CR controllers into pass-through mode. In this case, those
controllers are not enabled in CR itself, just as they are not in the
root cgroup. However, the container can enable them in CR's child
cgroups, just as can be done under the root cgroup. By enabling those
controllers at the CP level, the host can control how much resource is
allowed into the container without the container being aware that its
resources are being controlled, as all the control knobs will show up
in CP, but not in CR.
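The idea can be sketched as a toy model (plain Python, not kernel code;
"pass-through" follows the proposal in this mail and does not exist in the
mainline interface):

```python
# Toy model of the proposed pass-through setup: a pass-through cgroup (CR)
# exposes no per-controller knobs itself, while its parent (CP, on the
# host side) and its children still do.  Names CP/CR follow the mail.
class Cgroup:
    def __init__(self, name, parent=None, passthrough=False):
        self.name = name
        self.passthrough = passthrough  # controllers skip this node
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def exposes_knobs(self):
        # A pass-through cgroup gets no control files of its own.
        return not self.passthrough

cp = Cgroup("CP")                        # container parent, on the host
cr = Cgroup("CR", cp, passthrough=True)  # container root
child = Cgroup("CR/child", cr)

assert cp.exposes_knobs()      # host keeps the knobs capping the container
assert not cr.exposes_knobs()  # container root looks like a bare root
assert child.exposes_knobs()   # container can still control its children
```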

Cheers,
Longman

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-06-01 15:10               ` Peter Zijlstra
  2017-06-01 15:35                 ` Tejun Heo
  2017-06-01 18:44                 ` Waiman Long
@ 2017-06-01 20:15                 ` Waiman Long
  2 siblings, 0 replies; 69+ messages in thread
From: Waiman Long @ 2017-06-01 20:15 UTC (permalink / raw)
  To: Peter Zijlstra, Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Ingo Molnar, cgroups, linux-kernel,
	linux-doc, linux-mm, kernel-team, pjt, luto, efault

On 06/01/2017 11:10 AM, Peter Zijlstra wrote:
> On Thu, Jun 01, 2017 at 10:50:42AM -0400, Tejun Heo wrote:
>> Hello, Waiman.
>>
>> A short update.  I tried making root special while keeping the
>> existing threaded semantics but I didn't really like it because we
>> have to couple controller enables/disables with threaded
>> enables/disables.  I'm now trying a simpler, albeit a bit more
>> tedious, approach which should leave things mostly symmetrical.  I'm
>> hoping to be able to post mostly working patches this week.
> I've not had time to look at any of this. But the question I'm most
> curious about is how cgroup-v2 preserves the container invariant.

If you don't have much time to look at the patches, I would suggest just
looking at the cover letter as well as changes to the cgroup-v2.txt
file. You will get a pretty good overview of what this patchset is about.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-06-01 19:27                     ` Waiman Long
@ 2017-06-01 20:38                       ` Tejun Heo
  2017-06-01 20:48                         ` Waiman Long
  0 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-06-01 20:38 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello,

On Thu, Jun 01, 2017 at 03:27:35PM -0400, Waiman Long wrote:
> As said in an earlier email, I agreed that masking it on the kernel side
> may not be the best solution. I offer 2 other alternatives:
> 1) Document on how to work around the resource domains issue by proper
> setup of the cgroup hierarchy.

We can definitely improve documentation.

> 2) Mark those controllers that require the no internal process
> competition constraint and disallow internal process only when those
> controllers are active.

We *can* do that but wouldn't this be equivalent to enabling thread
mode implicitly when only thread aware controllers are enabled?

> I prefer the first alternative, but I can go with the second if necessary.
> 
> The major rationale behind my enhanced thread mode patch was to allow
> something like
> 
>      R -- A -- B
>      \
>       T1 -- T2
> 
> where you can have resource domain controllers enabled in the thread
> root as well as some child cgroups of the thread root. As the no internal
> process rule is currently not applicable to the thread root, this
> creates the dilemma that we need to deal with internal process competition.
> 
> The container invariant that PeterZ talked about will also be a serious
> issue here as I don't think we are going to set up a container root
> cgroup that will have no process allowed in it because it has some child
> cgroups. IMHO, I don't think cgroup v2 will get wide adoption without
> getting rid of that no internal process constraint.

The only thing which is necessary from inside a container is putting
the management processes into their own cgroups so that they can be
controlled (ie. the same thing you did with your patch but doing that
explicitly from userland) and userland management sw can do the same
thing whether it's inside a container or on a bare system.  BTW,
systemd already does so and works completely fine in terms of
containerization on cgroup2.  It is arguable whether we should make
this more convenient from kernel side but using cgroup2 for resource
control already requires the userspace tools to be adapted to it, so
I'm not sure how much benefit we'd gain from adding that compared to
explicitly documenting it.
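The userland workaround described above can be sketched as the following
sequence of cgroupfs writes. It runs here against a scratch directory
standing in for a delegated subtree under /sys/fs/cgroup (on a real system
the writes need privileges and the kernel enforces the ordering), and the
pid and controller set are made up:

```python
import os
import tempfile

root = tempfile.mkdtemp()  # stand-in for the container's cgroup directory

# Explicit leaves: one for the management processes, one for the workload.
for child in ("mgmt", "workload"):
    os.makedirs(os.path.join(root, child))

def write(relpath, value):
    with open(os.path.join(root, relpath), "w") as f:
        f.write(value + "\n")

# 1. Move the would-be internal processes into their own leaf first,
#    so the parent cgroup no longer hosts any process directly.
write("mgmt/cgroup.procs", "1234")

# 2. Only then distribute domain controllers to the children; with the
#    parent empty of processes, the no-internal-process rule is satisfied.
write("cgroup.subtree_control", "+memory +io")
```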

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-06-01 20:38                       ` Tejun Heo
@ 2017-06-01 20:48                         ` Waiman Long
  2017-06-01 20:52                           ` Tejun Heo
  0 siblings, 1 reply; 69+ messages in thread
From: Waiman Long @ 2017-06-01 20:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

On 06/01/2017 04:38 PM, Tejun Heo wrote:
> Hello,
>
> On Thu, Jun 01, 2017 at 03:27:35PM -0400, Waiman Long wrote:
>> As said in an earlier email, I agreed that masking it on the kernel side
>> may not be the best solution. I offer 2 other alternatives:
>> 1) Document on how to work around the resource domains issue by proper
>> setup of the cgroup hierarchy.
> We can definitely improve documentation.
>
>> 2) Mark those controllers that require the no internal process
>> competition constraint and disallow internal process only when those
>> controllers are active.
> We *can* do that but wouldn't this be equivalent to enabling thread
> mode implicitly when only thread aware controllers are enabled?
>
>> I prefer the first alternative, but I can go with the second if necessary.
>>
>> The major rationale behind my enhanced thread mode patch was to allow
>> something like
>>
>>      R -- A -- B
>>      \
>>       T1 -- T2
>>
>> where you can have resource domain controllers enabled in the thread
>> root as well as some child cgroups of the thread root. As the no internal
>> process rule is currently not applicable to the thread root, this
>> creates the dilemma that we need to deal with internal process competition.
>>
>> The container invariant that PeterZ talked about will also be a serious
>> issue here as I don't think we are going to set up a container root
>> cgroup that will have no process allowed in it because it has some child
>> cgroups. IMHO, I don't think cgroup v2 will get wide adoption without
>> getting rid of that no internal process constraint.
> The only thing which is necessary from inside a container is putting
> the management processes into their own cgroups so that they can be
> controlled (ie. the same thing you did with your patch but doing that
> explicitly from userland) and userland management sw can do the same
> thing whether it's inside a container or on a bare system.  BTW,
> systemd already does so and works completely fine in terms of
> containerization on cgroup2.  It is arguable whether we should make
> this more convenient from kernel side but using cgroup2 for resource
> control already requires the userspace tools to be adapted to it, so
> I'm not sure how much benefit we'd gain from adding that compared to
> explicitly documenting it.

I think we are in agreement here. I think we should just document how
userland can work around the internal process competition issue by
setting up the cgroup hierarchy properly. Then we can remove the no
internal process constraint.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-06-01 20:48                         ` Waiman Long
@ 2017-06-01 20:52                           ` Tejun Heo
  2017-06-01 21:12                             ` Waiman Long
  0 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-06-01 20:52 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello,

On Thu, Jun 01, 2017 at 04:48:48PM -0400, Waiman Long wrote:
> I think we are in agreement here. I think we should just document how
> userland can work around the internal process competition issue by
> setting up the cgroup hierarchy properly. Then we can remove the no
> internal process constraint.

Heh, we agree on the immediate solution but not the final direction.
This requirement affects how controllers implement resource control in
significant ways.  It is a restriction which can be worked around in
userland relatively easily.  I'd much prefer to keep the invariant
intact.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-06-01 20:52                           ` Tejun Heo
@ 2017-06-01 21:12                             ` Waiman Long
  2017-06-01 21:18                               ` Tejun Heo
  0 siblings, 1 reply; 69+ messages in thread
From: Waiman Long @ 2017-06-01 21:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

On 06/01/2017 04:52 PM, Tejun Heo wrote:
> Hello,
>
> On Thu, Jun 01, 2017 at 04:48:48PM -0400, Waiman Long wrote:
>> I think we are in agreement here. I think we should just document how
>> userland can work around the internal process competition issue by
>> setting up the cgroup hierarchy properly. Then we can remove the no
>> internal process constraint.
> Heh, we agree on the immediate solution but not the final direction.
> This requirement affects how controllers implement resource control in
> significant ways.  It is a restriction which can be worked around in
> userland relatively easily.  I'd much prefer to keep the invariant
> intact.
>
> Thanks.
>
Are you referring to keeping the no internal process restriction and
document how to work around that instead? I would like to hear what
workarounds are currently being used.

Anyway, you currently allow internal process in thread mode, but not in
non-thread mode. I would prefer no such restriction in both thread and
non-thread mode.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-06-01 21:12                             ` Waiman Long
@ 2017-06-01 21:18                               ` Tejun Heo
  2017-06-02 20:36                                 ` Waiman Long
  0 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-06-01 21:18 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello,

On Thu, Jun 01, 2017 at 05:12:42PM -0400, Waiman Long wrote:
> Are you referring to keeping the no internal process restriction and
> document how to work around that instead? I would like to hear what
> workarounds are currently being used.

What we've been talking about all along - just creating explicit leaf
nodes.

> Anyway, you currently allow internal process in thread mode, but not in
> non-thread mode. I would prefer no such restriction in both thread and
> non-thread mode.

Heh, so, these aren't arbitrary.  The constraint is tied to
implementing resource domains and thread subtree doesn't have resource
domains in them, so they don't need the constraint.  I'm sorry about
the short replies but I'm kinda really tied up right now.  I'm gonna
do the thread mode so that it can be agnostic w.r.t. the internal
process constraint and I think it could be helpful to decouple these
discussions.  We've been having this discussion for a couple years now
and it looks like we're gonna go through it all over, which is fine,
but let's at least keep that separate.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-06-01 21:18                               ` Tejun Heo
@ 2017-06-02 20:36                                 ` Waiman Long
  2017-06-03 10:33                                   ` Tejun Heo
  0 siblings, 1 reply; 69+ messages in thread
From: Waiman Long @ 2017-06-02 20:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

On 06/01/2017 05:18 PM, Tejun Heo wrote:
> Hello,
>
> On Thu, Jun 01, 2017 at 05:12:42PM -0400, Waiman Long wrote:
>> Are you referring to keeping the no internal process restriction and
>> document how to work around that instead? I would like to hear what
>> workarounds are currently being used.
> What we've been talking about all along - just creating explicit leaf
> nodes.
>
>> Anyway, you currently allow internal process in thread mode, but not in
>> non-thread mode. I would prefer no such restriction in both thread and
>> non-thread mode.
> Heh, so, these aren't arbitrary.  The constraint is tied to
> implementing resource domains and thread subtree doesn't have resource
> domains in them, so they don't need the constraint.  I'm sorry about
> the short replies but I'm kinda really tied up right now.  I'm gonna
> do the thread mode so that it can be agnostic w.r.t. the internal
> process constraint and I think it could be helpful to decouple these
> discussions.  We've been having this discussion for a couple years now
> and it looks like we're gonna go through it all over, which is fine,
> but let's at least keep that separate.

I wouldn't argue further on that if you insist. However, I still want to
relax the constraint somewhat by abandoning the no internal process
constraint when only threaded controllers (non-resource domains) are
enabled, even when thread mode has not been explicitly enabled. It is a
modified version of my second alternative. Now the question is which
controllers are considered to be resource domains. I think memory and
blkio are in the list. What else do you think should be considered
resource domains?

Cheers,
Longman




^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
  2017-06-02 20:36                                 ` Waiman Long
@ 2017-06-03 10:33                                   ` Tejun Heo
  0 siblings, 0 replies; 69+ messages in thread
From: Tejun Heo @ 2017-06-03 10:33 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Li Zefan, Johannes Weiner, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto,
	efault

Hello,

On Fri, Jun 02, 2017 at 04:36:22PM -0400, Waiman Long wrote:
> I wouldn't argue further on that if you insist. However, I still want to

Oh, please don't get me wrong.  I'm not trying to shut down the
discussion or anything.  It's just that whole-scope discussions can
get very meandering and time-consuming when these two issues can be
decoupled from each other without compromising on either.  Let's
approach these issues separately.

> relax the constraint somewhat by abandoning the no internal process
> constraint  when only threaded controllers (non-resource domains) are
> enabled even when thread mode has not been explicitly enabled. It is a
> modified version my second alternative. Now the question is which
> controllers are considered to be resource domains. I think memory and
> blkio are in the list. What else do you think should be considered
> resource domains?

And we're now a bit into repeating ourselves, but for control of
any significant resources (mostly cpu, memory, io), there's got to be a
significant portion of resource consumption which isn't tied to
specific processes or threads that should be accounted for.  Both
memory and io already do this to a certain extent, but not completely.
cpu doesn't do it at all yet but we usually can't / shouldn't declare
a resource category to be domain-free.

There are exceptions - controllers which are only used for membership
identification (perf and the old net controllers), pids which is
explicitly tied to tasks (note that CPU cycles aren't), cpuset which
is an attribute propagating / restricting controller.

Out of those, the identification uses already aren't affected by the
constraint as they're now all either direct membership test against
the hierarchy or implicit controllers which aren't subject to the
constraint.  That leaves pids and cpuset.  We can exempt them from the
constraint but I'm not quite sure what that buys us given that neither
is affected by requiring explicit leaf nodes.  It'd just make the
rules more complicated without actual benefits.

That said, we can exempt those two.  I don't see much point in it but
we can definitely discuss the pros and cons, and it's likely that it's
not gonna make much difference whichever way we choose.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2017-06-03 10:33 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-15 13:33 [RFC PATCH v2 00/17] cgroup: Major changes to cgroup v2 core Waiman Long
2017-05-15 13:34 ` [RFC PATCH v2 01/17] cgroup: reorganize cgroup.procs / task write path Waiman Long
2017-05-15 13:34 ` [RFC PATCH v2 02/17] cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS Waiman Long
2017-05-15 13:34 ` [RFC PATCH v2 03/17] cgroup: introduce cgroup->proc_cgrp and threaded css_set handling Waiman Long
2017-05-15 13:34 ` [RFC PATCH v2 04/17] cgroup: implement CSS_TASK_ITER_THREADED Waiman Long
2017-05-15 13:34 ` [RFC PATCH v2 05/17] cgroup: implement cgroup v2 thread support Waiman Long
2017-05-15 13:34 ` [RFC PATCH v2 06/17] cgroup: Fix reference counting bug in cgroup_procs_write() Waiman Long
2017-05-17 19:20   ` Tejun Heo
2017-05-15 13:34 ` [RFC PATCH v2 07/17] cgroup: Prevent kill_css() from being called more than once Waiman Long
2017-05-17 19:23   ` Tejun Heo
2017-05-17 20:24     ` Waiman Long
2017-05-17 21:34       ` Tejun Heo
2017-05-15 13:34 ` [RFC PATCH v2 08/17] cgroup: Move debug cgroup to its own file Waiman Long
2017-05-17 21:36   ` Tejun Heo
2017-05-18 15:29     ` Waiman Long
2017-05-18 15:52     ` Waiman Long
2017-05-19 19:21       ` Tejun Heo
2017-05-19 19:33         ` Waiman Long
2017-05-19 20:28           ` Tejun Heo
2017-05-15 13:34 ` [RFC PATCH v2 09/17] cgroup: Keep accurate count of tasks in each css_set Waiman Long
2017-05-17 21:40   ` Tejun Heo
2017-05-18 15:56     ` Waiman Long
2017-05-15 13:34 ` [RFC PATCH v2 10/17] cgroup: Make debug cgroup support v2 and thread mode Waiman Long
2017-05-17 21:43   ` Tejun Heo
2017-05-18 15:58     ` Waiman Long
2017-05-15 13:34 ` [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics Waiman Long
2017-05-17 21:47   ` Tejun Heo
2017-05-18 17:21     ` Waiman Long
2017-05-19 20:26   ` Tejun Heo
2017-05-19 20:58     ` Tejun Heo
2017-05-22 17:13     ` Waiman Long
2017-05-22 17:32       ` Waiman Long
2017-05-24 20:36       ` Tejun Heo
2017-05-24 21:17         ` Waiman Long
2017-05-24 21:27           ` Tejun Heo
2017-06-01 14:50             ` Tejun Heo
2017-06-01 15:10               ` Peter Zijlstra
2017-06-01 15:35                 ` Tejun Heo
2017-06-01 18:44                 ` Waiman Long
2017-06-01 18:47                   ` Tejun Heo
2017-06-01 19:27                     ` Waiman Long
2017-06-01 20:38                       ` Tejun Heo
2017-06-01 20:48                         ` Waiman Long
2017-06-01 20:52                           ` Tejun Heo
2017-06-01 21:12                             ` Waiman Long
2017-06-01 21:18                               ` Tejun Heo
2017-06-02 20:36                                 ` Waiman Long
2017-06-03 10:33                                   ` Tejun Heo
2017-06-01 19:55                   ` Waiman Long
2017-06-01 20:15                 ` Waiman Long
2017-06-01 18:41               ` Waiman Long
2017-05-15 13:34 ` [RFC PATCH v2 12/17] cgroup: Remove cgroup v2 no internal process constraint Waiman Long
2017-05-19 20:38   ` Tejun Heo
2017-05-20  2:10     ` Mike Galbraith
2017-05-24 17:01       ` Tejun Heo
2017-05-22 16:56     ` Waiman Long
2017-05-24 17:05       ` Tejun Heo
2017-05-24 18:19         ` Waiman Long
2017-05-15 13:34 ` [RFC PATCH v2 13/17] cgroup: Allow fine-grained controllers control in cgroup v2 Waiman Long
2017-05-19 20:55   ` Tejun Heo
2017-05-19 21:20     ` Waiman Long
2017-05-24 17:31       ` Tejun Heo
2017-05-24 17:49         ` Waiman Long
2017-05-24 17:56           ` Tejun Heo
2017-05-24 18:17             ` Waiman Long
2017-05-15 13:34 ` [RFC PATCH v2 14/17] cgroup: Enable printing of v2 controllers' cgroup hierarchy Waiman Long
2017-05-15 13:34 ` [RFC PATCH v2 15/17] sched: Misc preps for cgroup unified hierarchy interface Waiman Long
2017-05-15 13:34 ` [RFC PATCH v2 16/17] sched: Implement interface for cgroup unified hierarchy Waiman Long
2017-05-15 13:34 ` [RFC PATCH v2 17/17] sched: Make cpu/cpuacct threaded controllers Waiman Long