linux-kernel.vger.kernel.org archive mirror
* [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
@ 2017-02-02 20:06 Tejun Heo
  2017-02-02 20:06 ` [PATCH 1/5] cgroup: reorganize cgroup.procs / task write path Tejun Heo
                   ` (7 more replies)
  0 siblings, 8 replies; 32+ messages in thread
From: Tejun Heo @ 2017-02-02 20:06 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, pjt, luto, efault
  Cc: cgroups, linux-kernel, kernel-team, lvenanci

Hello,

This patchset implements cgroup v2 thread mode.  It is largely based
on the discussions we had at Plumbers last year.  Here's a rough
outline.

* Thread mode is explicitly enabled on a cgroup by writing "enable"
  into "cgroup.threads" file.  The cgroup shouldn't have any child
  cgroups or enabled controllers.

* Once enabled, an arbitrary sub-hierarchy can be created and threads
  can be put anywhere in the subtree by writing TIDs into the
  "cgroup.threads" file.  Process granularity and the
  no-internal-process constraint don't apply in a threaded subtree.

* To be used in a threaded subtree, controllers should explicitly
  declare thread mode support and should be able to handle internal
  competition in some way.

* The root of a threaded subtree serves as the resource domain for the
  whole subtree.  This is where all the controllers are guaranteed to
  have a common ground and where resource consumptions in the threaded
  subtree which aren't tied to a specific thread are charged.
  Non-threaded controllers never see beyond the thread root and can
  assume that all controllers will follow the same rules up to that
  point.

This allows threaded controllers to implement thread granular resource
control without getting in the way of system level resource
partitioning.
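As a rough usage sketch of the interface described above (the paths,
the "grp"/"workers" cgroup names, and the $TID variable are
illustrative assumptions; the "cgroup.threads" semantics are those
proposed by this patchset, not a released kernel):

```shell
# Assume cgroup2 is mounted at /sys/fs/cgroup and "grp" is a leaf
# cgroup with no child cgroups and no enabled controllers.
cd /sys/fs/cgroup/grp

# Enable thread mode; this cgroup becomes the root of a threaded
# subtree and the resource domain for everything below it.
echo enable > cgroup.threads

# An arbitrary sub-hierarchy may now be created under the thread root.
mkdir workers

# Individual threads (TIDs) can be placed anywhere in the subtree;
# process granularity doesn't apply inside a threaded subtree.
echo "$TID" > workers/cgroup.threads
```

Inside the subtree, only controllers which declare thread mode support
would be usable; the others stop at the thread root.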

This patchset contains the following five patches.  For more details
on the interface and behavior, please refer to the last patch.

 0001-cgroup-reorganize-cgroup.procs-task-write-path.patch
 0002-cgroup-add-flags-to-css_task_iter_start-and-implemen.patch
 0003-cgroup-introduce-cgroup-proc_cgrp-and-threaded-css_s.patch
 0004-cgroup-implement-CSS_TASK_ITER_THREADED.patch
 0005-cgroup-implement-cgroup-v2-thread-support.patch

This patchset is on top of cgroup/for-4.11 63f1ca59453a ("Merge branch
'cgroup/for-4.11-rdmacg' into cgroup/for-4.11") and available in the
following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup2-threads

diffstat follows.  Thanks.

 Documentation/cgroup-v2.txt     |   75 ++++-
 include/linux/cgroup-defs.h     |   38 ++
 include/linux/cgroup.h          |   12 
 kernel/cgroup/cgroup-internal.h |    8 
 kernel/cgroup/cgroup-v1.c       |   64 +++-
 kernel/cgroup/cgroup.c          |  589 ++++++++++++++++++++++++++++++++--------
 kernel/cgroup/cpuset.c          |    6 
 kernel/cgroup/freezer.c         |    6 
 kernel/cgroup/pids.c            |    1 
 kernel/events/core.c            |    1 
 mm/memcontrol.c                 |    2 
 net/core/netclassid_cgroup.c    |    2 
 12 files changed, 671 insertions(+), 133 deletions(-)

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/5] cgroup: reorganize cgroup.procs / task write path
  2017-02-02 20:06 [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode Tejun Heo
@ 2017-02-02 20:06 ` Tejun Heo
  2017-02-02 20:06 ` [PATCH 2/5] cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS Tejun Heo
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2017-02-02 20:06 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, pjt, luto, efault
  Cc: cgroups, linux-kernel, kernel-team, lvenanci, Tejun Heo

Currently, writes to the "cgroup.procs" and "cgroup.tasks" files are
all handled by __cgroup_procs_write() on both v1 and v2.  This patch
reorganizes the write path so that there are common helper functions
that the different write paths use.

While this somewhat increases LOC, the different paths are no longer
intertwined and each path has more flexibility to implement different
behaviors which will be necessary for the planned v2 thread support.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup/cgroup-internal.h |   8 +-
 kernel/cgroup/cgroup-v1.c       |  58 ++++++++++++--
 kernel/cgroup/cgroup.c          | 163 +++++++++++++++++++++-------------------
 3 files changed, 142 insertions(+), 87 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 9203bfb..6ef662a 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -179,10 +179,10 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 
 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 		       bool threadgroup);
-ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
-			     size_t nbytes, loff_t off, bool threadgroup);
-ssize_t cgroup_procs_write(struct kernfs_open_file *of, char *buf, size_t nbytes,
-			   loff_t off);
+struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup)
+	__acquires(&cgroup_threadgroup_rwsem);
+void cgroup_procs_write_finish(void)
+	__releases(&cgroup_threadgroup_rwsem);
 
 void cgroup_lock_and_drain_offline(struct cgroup *cgrp);
 
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index fc34bcf..69146b3 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -511,10 +511,58 @@ static int cgroup_pidlist_show(struct seq_file *s, void *v)
 	return 0;
 }
 
-static ssize_t cgroup_tasks_write(struct kernfs_open_file *of,
-				  char *buf, size_t nbytes, loff_t off)
+static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of,
+				     char *buf, size_t nbytes, loff_t off,
+				     bool threadgroup)
 {
-	return __cgroup_procs_write(of, buf, nbytes, off, false);
+	struct cgroup *cgrp;
+	struct task_struct *task;
+	const struct cred *cred, *tcred;
+	ssize_t ret;
+
+	cgrp = cgroup_kn_lock_live(of->kn, false);
+	if (!cgrp)
+		return -ENODEV;
+
+	task = cgroup_procs_write_start(buf, threadgroup);
+	ret = PTR_ERR_OR_ZERO(task);
+	if (ret)
+		goto out_unlock;
+
+	/*
+	 * Even if we're attaching all tasks in the thread group, we only
+	 * need to check permissions on one of them.
+	 */
+	cred = current_cred();
+	tcred = get_task_cred(task);
+	if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
+	    !uid_eq(cred->euid, tcred->uid) &&
+	    !uid_eq(cred->euid, tcred->suid))
+		ret = -EACCES;
+	put_cred(tcred);
+	if (ret)
+		goto out_finish;
+
+	ret = cgroup_attach_task(cgrp, task, threadgroup);
+
+out_finish:
+	cgroup_procs_write_finish();
+out_unlock:
+	cgroup_kn_unlock(of->kn);
+
+	return ret ?: nbytes;
+}
+
+static ssize_t cgroup1_procs_write(struct kernfs_open_file *of,
+				   char *buf, size_t nbytes, loff_t off)
+{
+	return __cgroup1_procs_write(of, buf, nbytes, off, true);
+}
+
+static ssize_t cgroup1_tasks_write(struct kernfs_open_file *of,
+				   char *buf, size_t nbytes, loff_t off)
+{
+	return __cgroup1_procs_write(of, buf, nbytes, off, false);
 }
 
 static ssize_t cgroup_release_agent_write(struct kernfs_open_file *of,
@@ -593,7 +641,7 @@ struct cftype cgroup1_base_files[] = {
 		.seq_stop = cgroup_pidlist_stop,
 		.seq_show = cgroup_pidlist_show,
 		.private = CGROUP_FILE_PROCS,
-		.write = cgroup_procs_write,
+		.write = cgroup1_procs_write,
 	},
 	{
 		.name = "cgroup.clone_children",
@@ -612,7 +660,7 @@ struct cftype cgroup1_base_files[] = {
 		.seq_stop = cgroup_pidlist_stop,
 		.seq_show = cgroup_pidlist_show,
 		.private = CGROUP_FILE_TASKS,
-		.write = cgroup_tasks_write,
+		.write = cgroup1_tasks_write,
 	},
 	{
 		.name = "notify_on_release",
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index fe374f8..2c37b436 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1913,6 +1913,23 @@ int task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
 }
 EXPORT_SYMBOL_GPL(task_cgroup_path);
 
+static struct cgroup *cgroup_migrate_common_ancestor(struct task_struct *task,
+						     struct cgroup *dst_cgrp)
+{
+	struct cgroup *cgrp;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	spin_lock_irq(&css_set_lock);
+	cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
+	spin_unlock_irq(&css_set_lock);
+
+	while (!cgroup_is_descendant(dst_cgrp, cgrp))
+		cgrp = cgroup_parent(cgrp);
+
+	return cgrp;
+}
+
 /**
  * cgroup_migrate_add_task - add a migration target task to a migration context
  * @task: target task
@@ -2345,76 +2362,23 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 	return ret;
 }
 
-static int cgroup_procs_write_permission(struct task_struct *task,
-					 struct cgroup *dst_cgrp,
-					 struct kernfs_open_file *of)
-{
-	int ret = 0;
-
-	if (cgroup_on_dfl(dst_cgrp)) {
-		struct super_block *sb = of->file->f_path.dentry->d_sb;
-		struct cgroup *cgrp;
-		struct inode *inode;
-
-		spin_lock_irq(&css_set_lock);
-		cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
-		spin_unlock_irq(&css_set_lock);
-
-		while (!cgroup_is_descendant(dst_cgrp, cgrp))
-			cgrp = cgroup_parent(cgrp);
-
-		ret = -ENOMEM;
-		inode = kernfs_get_inode(sb, cgrp->procs_file.kn);
-		if (inode) {
-			ret = inode_permission(inode, MAY_WRITE);
-			iput(inode);
-		}
-	} else {
-		const struct cred *cred = current_cred();
-		const struct cred *tcred = get_task_cred(task);
-
-		/*
-		 * even if we're attaching all tasks in the thread group,
-		 * we only need to check permissions on one of them.
-		 */
-		if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
-		    !uid_eq(cred->euid, tcred->uid) &&
-		    !uid_eq(cred->euid, tcred->suid))
-			ret = -EACCES;
-		put_cred(tcred);
-	}
-
-	return ret;
-}
-
-/*
- * Find the task_struct of the task to attach by vpid and pass it along to the
- * function to attach either it or all tasks in its threadgroup. Will lock
- * cgroup_mutex and threadgroup.
- */
-ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
-			     size_t nbytes, loff_t off, bool threadgroup)
+struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup)
+	__acquires(&cgroup_threadgroup_rwsem)
 {
 	struct task_struct *tsk;
-	struct cgroup_subsys *ss;
-	struct cgroup *cgrp;
 	pid_t pid;
-	int ssid, ret;
 
 	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
-		return -EINVAL;
-
-	cgrp = cgroup_kn_lock_live(of->kn, false);
-	if (!cgrp)
-		return -ENODEV;
+		return ERR_PTR(-EINVAL);
 
 	percpu_down_write(&cgroup_threadgroup_rwsem);
+
 	rcu_read_lock();
 	if (pid) {
 		tsk = find_task_by_vpid(pid);
 		if (!tsk) {
-			ret = -ESRCH;
-			goto out_unlock_rcu;
+			tsk = ERR_PTR(-ESRCH);
+			goto out_unlock_threadgroup;
 		}
 	} else {
 		tsk = current;
@@ -2429,35 +2393,30 @@ ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 	 * with no rt_runtime allocated.  Just say no.
 	 */
 	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
-		ret = -EINVAL;
-		goto out_unlock_rcu;
+		tsk = ERR_PTR(-EINVAL);
+		goto out_unlock_threadgroup;
 	}
 
 	get_task_struct(tsk);
-	rcu_read_unlock();
-
-	ret = cgroup_procs_write_permission(tsk, cgrp, of);
-	if (!ret)
-		ret = cgroup_attach_task(cgrp, tsk, threadgroup);
-
-	put_task_struct(tsk);
-	goto out_unlock_threadgroup;
+	goto out_unlock_rcu;
 
+out_unlock_threadgroup:
+	percpu_up_write(&cgroup_threadgroup_rwsem);
 out_unlock_rcu:
 	rcu_read_unlock();
-out_unlock_threadgroup:
+	return tsk;
+}
+
+void cgroup_procs_write_finish(void)
+	__releases(&cgroup_threadgroup_rwsem)
+{
+	struct cgroup_subsys *ss;
+	int ssid;
+
 	percpu_up_write(&cgroup_threadgroup_rwsem);
 	for_each_subsys(ss, ssid)
 		if (ss->post_attach)
 			ss->post_attach();
-	cgroup_kn_unlock(of->kn);
-	return ret ?: nbytes;
-}
-
-ssize_t cgroup_procs_write(struct kernfs_open_file *of, char *buf, size_t nbytes,
-			   loff_t off)
-{
-	return __cgroup_procs_write(of, buf, nbytes, off, true);
 }
 
 static void cgroup_print_ss_mask(struct seq_file *seq, u16 ss_mask)
@@ -3781,6 +3740,54 @@ static int cgroup_procs_show(struct seq_file *s, void *v)
 	return 0;
 }
 
+static int cgroup_procs_write_permission(struct cgroup *cgrp,
+					 struct super_block *sb)
+{
+	struct inode *inode;
+	int ret;
+
+	inode = kernfs_get_inode(sb, cgrp->procs_file.kn);
+	if (!inode)
+		return -ENOMEM;
+
+	ret = inode_permission(inode, MAY_WRITE);
+	iput(inode);
+	return ret;
+}
+
+static ssize_t cgroup_procs_write(struct kernfs_open_file *of,
+				  char *buf, size_t nbytes, loff_t off)
+{
+	struct cgroup *cgrp, *common_ancestor;
+	struct task_struct *task;
+	ssize_t ret;
+
+	cgrp = cgroup_kn_lock_live(of->kn, false);
+	if (!cgrp)
+		return -ENODEV;
+
+	task = cgroup_procs_write_start(buf, true);
+	ret = PTR_ERR_OR_ZERO(task);
+	if (ret)
+		goto out_unlock;
+
+	common_ancestor = cgroup_migrate_common_ancestor(task, cgrp);
+
+	ret = cgroup_procs_write_permission(common_ancestor,
+					    of->file->f_path.dentry->d_sb);
+	if (ret)
+		goto out_finish;
+
+	ret = cgroup_attach_task(cgrp, task, true);
+
+out_finish:
+	cgroup_procs_write_finish();
+out_unlock:
+	cgroup_kn_unlock(of->kn);
+
+	return ret ?: nbytes;
+}
+
 /* cgroup core interface files for the default hierarchy */
 static struct cftype cgroup_base_files[] = {
 	{
-- 
2.9.3


* [PATCH 2/5] cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS
  2017-02-02 20:06 [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode Tejun Heo
  2017-02-02 20:06 ` [PATCH 1/5] cgroup: reorganize cgroup.procs / task write path Tejun Heo
@ 2017-02-02 20:06 ` Tejun Heo
  2017-02-02 20:06 ` [PATCH 3/5] cgroup: introduce cgroup->proc_cgrp and threaded css_set handling Tejun Heo
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2017-02-02 20:06 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, pjt, luto, efault
  Cc: cgroups, linux-kernel, kernel-team, lvenanci, Tejun Heo

css_task_iter currently always walks all tasks.  With the scheduled
cgroup v2 thread support, the iterator would need to handle multiple
types of iteration.  As a preparation, add @flags to
css_task_iter_start() and implement CSS_TASK_ITER_PROCS.  If the flag
is not specified, the iterator walks all tasks as before.  When the
flag is set, it walks only thread group leaders.

For now, the only user of the flag is cgroup v2 "cgroup.procs" file
which no longer needs to skip non-leader tasks in cgroup_procs_next().
Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
cgroup" but "list all thread group id's with any threads in the
cgroup".

While at it, update cgroup_procs_show() to use task_pid_vnr() instead
of task_tgid_vnr().  As the iteration guarantees that the function
only sees group leaders, this doesn't change the output and will allow
sharing the function for thread iteration.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup.h       |  6 +++++-
 kernel/cgroup/cgroup-v1.c    |  6 +++---
 kernel/cgroup/cgroup.c       | 24 ++++++++++++++----------
 kernel/cgroup/cpuset.c       |  6 +++---
 kernel/cgroup/freezer.c      |  6 +++---
 mm/memcontrol.c              |  2 +-
 net/core/netclassid_cgroup.c |  2 +-
 7 files changed, 30 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index f6b43fb..20f6057 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -36,9 +36,13 @@
 #define CGROUP_WEIGHT_DFL		100
 #define CGROUP_WEIGHT_MAX		10000
 
+/* walk only threadgroup leaders */
+#define CSS_TASK_ITER_PROCS		(1U << 0)
+
 /* a css_task_iter should be treated as an opaque object */
 struct css_task_iter {
 	struct cgroup_subsys		*ss;
+	unsigned int			flags;
 
 	struct list_head		*cset_pos;
 	struct list_head		*cset_head;
@@ -129,7 +133,7 @@ struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset,
 struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset,
 					struct cgroup_subsys_state **dst_cssp);
 
-void css_task_iter_start(struct cgroup_subsys_state *css,
+void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
 			 struct css_task_iter *it);
 struct task_struct *css_task_iter_next(struct css_task_iter *it);
 void css_task_iter_end(struct css_task_iter *it);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 69146b3..de4939d 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -118,7 +118,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 	 * ->can_attach() fails.
 	 */
 	do {
-		css_task_iter_start(&from->self, &it);
+		css_task_iter_start(&from->self, 0, &it);
 		task = css_task_iter_next(&it);
 		if (task)
 			get_task_struct(task);
@@ -374,7 +374,7 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
 	if (!array)
 		return -ENOMEM;
 	/* now, populate the array */
-	css_task_iter_start(&cgrp->self, &it);
+	css_task_iter_start(&cgrp->self, 0, &it);
 	while ((tsk = css_task_iter_next(&it))) {
 		if (unlikely(n == length))
 			break;
@@ -750,7 +750,7 @@ int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry)
 	}
 	rcu_read_unlock();
 
-	css_task_iter_start(&cgrp->self, &it);
+	css_task_iter_start(&cgrp->self, 0, &it);
 	while ((tsk = css_task_iter_next(&it))) {
 		switch (tsk->state) {
 		case TASK_RUNNING:
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 2c37b436..a9df46c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3588,6 +3588,7 @@ static void css_task_iter_advance(struct css_task_iter *it)
 	lockdep_assert_held(&css_set_lock);
 	WARN_ON_ONCE(!l);
 
+repeat:
 	/*
 	 * Advance iterator to find next entry.  cset->tasks is consumed
 	 * first and then ->mg_tasks.  After ->mg_tasks, we move onto the
@@ -3602,11 +3603,18 @@ static void css_task_iter_advance(struct css_task_iter *it)
 		css_task_iter_advance_css_set(it);
 	else
 		it->task_pos = l;
+
+	/* if PROCS, skip over tasks which aren't group leaders */
+	if ((it->flags & CSS_TASK_ITER_PROCS) && it->task_pos &&
+	    !thread_group_leader(list_entry(it->task_pos, struct task_struct,
+					    cg_list)))
+		goto repeat;
 }
 
 /**
  * css_task_iter_start - initiate task iteration
  * @css: the css to walk tasks of
+ * @flags: CSS_TASK_ITER_* flags
  * @it: the task iterator to use
  *
  * Initiate iteration through the tasks of @css.  The caller can call
@@ -3614,7 +3622,7 @@ static void css_task_iter_advance(struct css_task_iter *it)
  * returns NULL.  On completion of iteration, css_task_iter_end() must be
  * called.
  */
-void css_task_iter_start(struct cgroup_subsys_state *css,
+void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
 			 struct css_task_iter *it)
 {
 	/* no one should try to iterate before mounting cgroups */
@@ -3625,6 +3633,7 @@ void css_task_iter_start(struct cgroup_subsys_state *css,
 	spin_lock_irq(&css_set_lock);
 
 	it->ss = css->ss;
+	it->flags = flags;
 
 	if (it->ss)
 		it->cset_pos = &css->cgroup->e_csets[css->ss->id];
@@ -3698,13 +3707,8 @@ static void *cgroup_procs_next(struct seq_file *s, void *v, loff_t *pos)
 {
 	struct kernfs_open_file *of = s->private;
 	struct css_task_iter *it = of->priv;
-	struct task_struct *task;
-
-	do {
-		task = css_task_iter_next(it);
-	} while (task && !thread_group_leader(task));
 
-	return task;
+	return css_task_iter_next(it);
 }
 
 static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
@@ -3725,10 +3729,10 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 		if (!it)
 			return ERR_PTR(-ENOMEM);
 		of->priv = it;
-		css_task_iter_start(&cgrp->self, it);
+		css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS, it);
 	} else if (!(*pos)++) {
 		css_task_iter_end(it);
-		css_task_iter_start(&cgrp->self, it);
+		css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS, it);
 	}
 
 	return cgroup_procs_next(s, NULL, NULL);
@@ -3736,7 +3740,7 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 
 static int cgroup_procs_show(struct seq_file *s, void *v)
 {
-	seq_printf(s, "%d\n", task_tgid_vnr(v));
+	seq_printf(s, "%d\n", task_pid_vnr(v));
 	return 0;
 }
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b308888..35e7d43 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -859,7 +859,7 @@ static void update_tasks_cpumask(struct cpuset *cs)
 	struct css_task_iter it;
 	struct task_struct *task;
 
-	css_task_iter_start(&cs->css, &it);
+	css_task_iter_start(&cs->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
 		set_cpus_allowed_ptr(task, cs->effective_cpus);
 	css_task_iter_end(&it);
@@ -1104,7 +1104,7 @@ static void update_tasks_nodemask(struct cpuset *cs)
 	 * It's ok if we rebind the same mm twice; mpol_rebind_mm()
 	 * is idempotent.  Also migrate pages in each mm to new nodes.
 	 */
-	css_task_iter_start(&cs->css, &it);
+	css_task_iter_start(&cs->css, 0, &it);
 	while ((task = css_task_iter_next(&it))) {
 		struct mm_struct *mm;
 		bool migrate;
@@ -1297,7 +1297,7 @@ static void update_tasks_flags(struct cpuset *cs)
 	struct css_task_iter it;
 	struct task_struct *task;
 
-	css_task_iter_start(&cs->css, &it);
+	css_task_iter_start(&cs->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
 		cpuset_update_task_spread_flag(cs, task);
 	css_task_iter_end(&it);
diff --git a/kernel/cgroup/freezer.c b/kernel/cgroup/freezer.c
index 1b72d56..0823679 100644
--- a/kernel/cgroup/freezer.c
+++ b/kernel/cgroup/freezer.c
@@ -268,7 +268,7 @@ static void update_if_frozen(struct cgroup_subsys_state *css)
 	rcu_read_unlock();
 
 	/* are all tasks frozen? */
-	css_task_iter_start(css, &it);
+	css_task_iter_start(css, 0, &it);
 
 	while ((task = css_task_iter_next(&it))) {
 		if (freezing(task)) {
@@ -320,7 +320,7 @@ static void freeze_cgroup(struct freezer *freezer)
 	struct css_task_iter it;
 	struct task_struct *task;
 
-	css_task_iter_start(&freezer->css, &it);
+	css_task_iter_start(&freezer->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
 		freeze_task(task);
 	css_task_iter_end(&it);
@@ -331,7 +331,7 @@ static void unfreeze_cgroup(struct freezer *freezer)
 	struct css_task_iter it;
 	struct task_struct *task;
 
-	css_task_iter_start(&freezer->css, &it);
+	css_task_iter_start(&freezer->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
 		__thaw_task(task);
 	css_task_iter_end(&it);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4048897..9250128 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -945,7 +945,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 		struct css_task_iter it;
 		struct task_struct *task;
 
-		css_task_iter_start(&iter->css, &it);
+		css_task_iter_start(&iter->css, 0, &it);
 		while (!ret && (task = css_task_iter_next(&it)))
 			ret = fn(task, arg);
 		css_task_iter_end(&it);
diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index 11fce17..e24ea8c 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -74,7 +74,7 @@ static void update_classid(struct cgroup_subsys_state *css, void *v)
 	struct css_task_iter it;
 	struct task_struct *p;
 
-	css_task_iter_start(css, &it);
+	css_task_iter_start(css, 0, &it);
 	while ((p = css_task_iter_next(&it))) {
 		task_lock(p);
 		iterate_fd(p->files, 0, update_classid_sock, v);
-- 
2.9.3


* [PATCH 3/5] cgroup: introduce cgroup->proc_cgrp and threaded css_set handling
  2017-02-02 20:06 [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode Tejun Heo
  2017-02-02 20:06 ` [PATCH 1/5] cgroup: reorganize cgroup.procs / task write path Tejun Heo
  2017-02-02 20:06 ` [PATCH 2/5] cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS Tejun Heo
@ 2017-02-02 20:06 ` Tejun Heo
  2017-02-02 20:06 ` [PATCH 4/5] cgroup: implement CSS_TASK_ITER_THREADED Tejun Heo
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2017-02-02 20:06 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, pjt, luto, efault
  Cc: cgroups, linux-kernel, kernel-team, lvenanci, Tejun Heo

cgroup v2 is in the process of growing thread granularity support.
Once thread mode is enabled, the root cgroup of the subtree serves as
the proc_cgrp, to which the processes of the subtree conceptually
belong and to which domain-level resource consumptions not tied to any
specific task are charged.  In the subtree, threads aren't subject to
process granularity or the no-internal-task constraint and can be
distributed arbitrarily across the subtree.

This patch introduces cgroup->proc_cgrp along with threaded css_set
handling.

* cgroup->proc_cgrp is NULL if !threaded.  If threaded, points to the
  proc_cgrp (root of the threaded subtree).

* css_set->proc_cset points to self if !threaded.  If threaded, points
  to the css_set which belongs to the cgrp->proc_cgrp.  The proc_cgrp
  serves as the resource domain and needs the matching csses readily
  available.  The proc_cset holds those csses and makes them easily
  accessible.

* All threaded csets are linked on their proc_csets to enable
  iteration of all threaded tasks.

This patch adds the above but doesn't actually use them yet.  The
following patches will build on top.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup-defs.h | 22 ++++++++++++
 kernel/cgroup/cgroup.c      | 87 +++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 103 insertions(+), 6 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 3c02404..22e894c 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -158,6 +158,15 @@ struct css_set {
 	/* reference count */
 	atomic_t refcount;
 
+	/*
+	 * If not threaded, the following points to self.  If threaded, to
+	 * a cset which belongs to the top cgroup of the threaded subtree.
+	 * The proc_cset provides access to the process cgroup and its
+	 * csses to which domain level resource consumptions should be
+	 * charged.
+	 */
+	struct css_set __rcu *proc_cset;
+
 	/* the default cgroup associated with this css_set */
 	struct cgroup *dfl_cgrp;
 
@@ -183,6 +192,10 @@ struct css_set {
 	 */
 	struct list_head e_cset_node[CGROUP_SUBSYS_COUNT];
 
+	/* all csets whose ->proc_cset points to this cset */
+	struct list_head threaded_csets;
+	struct list_head threaded_csets_node;
+
 	/*
 	 * List running through all cgroup groups in the same hash
 	 * slot. Protected by css_set_lock
@@ -289,6 +302,15 @@ struct cgroup {
 	struct list_head e_csets[CGROUP_SUBSYS_COUNT];
 
 	/*
+	 * If !threaded, NULL.  If threaded, it points to the top cgroup of
+	 * the threaded subtree, on which it points to self.  Threaded
+	 * subtree is exempt from process granularity and no-internal-task
+	 * constraint.  Domain level resource consumptions which aren't
+	 * tied to a specific task should be charged to the proc_cgrp.
+	 */
+	struct cgroup *proc_cgrp;
+
+	/*
 	 * list of pidlists, up to two for each namespace (one for procs, one
 	 * for tasks); created on demand.
 	 */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a9df46c..6c5658a 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -554,9 +554,11 @@ EXPORT_SYMBOL_GPL(of_css);
  */
 struct css_set init_css_set = {
 	.refcount		= ATOMIC_INIT(1),
+	.proc_cset		= RCU_INITIALIZER(&init_css_set),
 	.tasks			= LIST_HEAD_INIT(init_css_set.tasks),
 	.mg_tasks		= LIST_HEAD_INIT(init_css_set.mg_tasks),
 	.task_iters		= LIST_HEAD_INIT(init_css_set.task_iters),
+	.threaded_csets		= LIST_HEAD_INIT(init_css_set.threaded_csets),
 	.cgrp_links		= LIST_HEAD_INIT(init_css_set.cgrp_links),
 	.mg_preload_node	= LIST_HEAD_INIT(init_css_set.mg_preload_node),
 	.mg_node		= LIST_HEAD_INIT(init_css_set.mg_node),
@@ -575,6 +577,17 @@ static bool css_set_populated(struct css_set *cset)
 	return !list_empty(&cset->tasks) || !list_empty(&cset->mg_tasks);
 }
 
+static struct css_set *proc_css_set(struct css_set *cset)
+{
+	return rcu_dereference_protected(cset->proc_cset,
+					 lockdep_is_held(&css_set_lock));
+}
+
+static bool css_set_threaded(struct css_set *cset)
+{
+	return proc_css_set(cset) != cset;
+}
+
 /**
  * cgroup_update_populated - updated populated count of a cgroup
  * @cgrp: the target cgroup
@@ -726,6 +739,8 @@ void put_css_set_locked(struct css_set *cset)
 	if (!atomic_dec_and_test(&cset->refcount))
 		return;
 
+	WARN_ON_ONCE(!list_empty(&cset->threaded_csets));
+
 	/* This css_set is dead. unlink it and release cgroup and css refs */
 	for_each_subsys(ss, ssid) {
 		list_del(&cset->e_cset_node[ssid]);
@@ -742,6 +757,11 @@ void put_css_set_locked(struct css_set *cset)
 		kfree(link);
 	}
 
+	if (css_set_threaded(cset)) {
+		list_del(&cset->threaded_csets_node);
+		put_css_set_locked(proc_css_set(cset));
+	}
+
 	kfree_rcu(cset, rcu_head);
 }
 
@@ -751,6 +771,7 @@ void put_css_set_locked(struct css_set *cset)
  * @old_cset: existing css_set for a task
  * @new_cgrp: cgroup that's being entered by the task
  * @template: desired set of css pointers in css_set (pre-calculated)
+ * @for_pcset: the comparison is for a new proc_cset
  *
  * Returns true if "cset" matches "old_cset" except for the hierarchy
  * which "new_cgrp" belongs to, for which it should match "new_cgrp".
@@ -758,7 +779,8 @@ void put_css_set_locked(struct css_set *cset)
 static bool compare_css_sets(struct css_set *cset,
 			     struct css_set *old_cset,
 			     struct cgroup *new_cgrp,
-			     struct cgroup_subsys_state *template[])
+			     struct cgroup_subsys_state *template[],
+			     bool for_pcset)
 {
 	struct list_head *l1, *l2;
 
@@ -770,6 +792,32 @@ static bool compare_css_sets(struct css_set *cset,
 	if (memcmp(template, cset->subsys, sizeof(cset->subsys)))
 		return false;
 
+	if (for_pcset) {
+		/*
+		 * We're looking for the pcset of @old_cset.  As @old_cset
+		 * doesn't have its ->proc_cset pointer set yet (we're
+		 * trying to find out what to set it to), @old_cset itself
+		 * may seem like a match here.  Explicitly exclude identity
+		 * matching.
+		 */
+		if (css_set_threaded(cset) || cset == old_cset)
+			return false;
+	} else {
+		bool is_threaded;
+
+		/*
+		 * Otherwise, @cset's threaded state should match the
+		 * default cgroup's.
+		 */
+		if (cgroup_on_dfl(new_cgrp))
+			is_threaded = new_cgrp->proc_cgrp;
+		else
+			is_threaded = old_cset->dfl_cgrp->proc_cgrp;
+
+		if (is_threaded != css_set_threaded(cset))
+			return false;
+	}
+
 	/*
 	 * Compare cgroup pointers in order to distinguish between
 	 * different cgroups in hierarchies.  As different cgroups may
@@ -822,10 +870,12 @@ static bool compare_css_sets(struct css_set *cset,
  * @old_cset: the css_set that we're using before the cgroup transition
  * @cgrp: the cgroup that we're moving into
  * @template: out param for the new set of csses, should be clear on entry
+ * @for_pcset: looking for a new proc_cset
  */
 static struct css_set *find_existing_css_set(struct css_set *old_cset,
 					struct cgroup *cgrp,
-					struct cgroup_subsys_state *template[])
+					struct cgroup_subsys_state *template[],
+					bool for_pcset)
 {
 	struct cgroup_root *root = cgrp->root;
 	struct cgroup_subsys *ss;
@@ -856,7 +906,7 @@ static struct css_set *find_existing_css_set(struct css_set *old_cset,
 
 	key = css_set_hash(template);
 	hash_for_each_possible(css_set_table, cset, hlist, key) {
-		if (!compare_css_sets(cset, old_cset, cgrp, template))
+		if (!compare_css_sets(cset, old_cset, cgrp, template, for_pcset))
 			continue;
 
 		/* This css_set matches what we need */
@@ -938,12 +988,13 @@ static void link_css_set(struct list_head *tmp_links, struct css_set *cset,
  * find_css_set - return a new css_set with one cgroup updated
  * @old_cset: the baseline css_set
  * @cgrp: the cgroup to be updated
+ * @for_pcset: looking for a new proc_cset
  *
  * Return a new css_set that's equivalent to @old_cset, but with @cgrp
  * substituted into the appropriate hierarchy.
  */
 static struct css_set *find_css_set(struct css_set *old_cset,
-				    struct cgroup *cgrp)
+				    struct cgroup *cgrp, bool for_pcset)
 {
 	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT] = { };
 	struct css_set *cset;
@@ -958,7 +1009,7 @@ static struct css_set *find_css_set(struct css_set *old_cset,
 	/* First see if we already have a cgroup group that matches
 	 * the desired set */
 	spin_lock_irq(&css_set_lock);
-	cset = find_existing_css_set(old_cset, cgrp, template);
+	cset = find_existing_css_set(old_cset, cgrp, template, for_pcset);
 	if (cset)
 		get_css_set(cset);
 	spin_unlock_irq(&css_set_lock);
@@ -977,9 +1028,11 @@ static struct css_set *find_css_set(struct css_set *old_cset,
 	}
 
 	atomic_set(&cset->refcount, 1);
+	RCU_INIT_POINTER(cset->proc_cset, cset);
 	INIT_LIST_HEAD(&cset->tasks);
 	INIT_LIST_HEAD(&cset->mg_tasks);
 	INIT_LIST_HEAD(&cset->task_iters);
+	INIT_LIST_HEAD(&cset->threaded_csets);
 	INIT_HLIST_NODE(&cset->hlist);
 	INIT_LIST_HEAD(&cset->cgrp_links);
 	INIT_LIST_HEAD(&cset->mg_preload_node);
@@ -1017,6 +1070,28 @@ static struct css_set *find_css_set(struct css_set *old_cset,
 
 	spin_unlock_irq(&css_set_lock);
 
+	/*
+	 * If @cset should be threaded, look up the matching proc_cset and
+	 * link them up.  We first fully initialize @cset then look for the
+	 * pcset.  It's simpler this way and safe as @cset is guaranteed to
+	 * stay empty until we return.
+	 */
+	if (!for_pcset && cset->dfl_cgrp->proc_cgrp) {
+		struct css_set *pcset;
+
+		pcset = find_css_set(cset, cset->dfl_cgrp->proc_cgrp, true);
+		if (!pcset) {
+			put_css_set(cset);
+			return NULL;
+		}
+
+		spin_lock_irq(&css_set_lock);
+		rcu_assign_pointer(cset->proc_cset, pcset);
+		list_add_tail(&cset->threaded_csets_node,
+			      &pcset->threaded_csets);
+		spin_unlock_irq(&css_set_lock);
+	}
+
 	return cset;
 }
 
@@ -2238,7 +2313,7 @@ int cgroup_migrate_prepare_dst(struct cgroup_mgctx *mgctx)
 		struct cgroup_subsys *ss;
 		int ssid;
 
-		dst_cset = find_css_set(src_cset, src_cset->mg_dst_cgrp);
+		dst_cset = find_css_set(src_cset, src_cset->mg_dst_cgrp, false);
 		if (!dst_cset)
 			goto err;
 
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 4/5] cgroup: implement CSS_TASK_ITER_THREADED
  2017-02-02 20:06 [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode Tejun Heo
                   ` (2 preceding siblings ...)
  2017-02-02 20:06 ` [PATCH 3/5] cgroup: introduce cgroup->proc_cgrp and threaded css_set handling Tejun Heo
@ 2017-02-02 20:06 ` Tejun Heo
  2017-02-02 20:06 ` [PATCH 5/5] cgroup: implement cgroup v2 thread support Tejun Heo
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2017-02-02 20:06 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, pjt, luto, efault
  Cc: cgroups, linux-kernel, kernel-team, lvenanci, Tejun Heo

cgroup v2 is in the process of growing thread granularity support.
Once thread mode is enabled, the root cgroup of the subtree serves as
the proc_cgrp to which the processes of the subtree conceptually
belong and to which domain-level resource consumptions not tied to any
specific task are charged.  In the subtree, threads won't be subject to
granularity or no-internal-task constraint and can be distributed
arbitrarily across the subtree.

This patch implements a new task iterator flag CSS_TASK_ITER_THREADED,
which, when used on a proc_cgrp, makes the iteration include the tasks
on all the associated threaded css_sets.  "cgroup.procs" read path is
updated to use it so that reading the file on a proc_cgrp lists all
processes.  This will also be used by controller implementations which
need to walk processes or tasks at the resource domain level.

Task iteration is implemented as nested css_set iteration.  If
CSS_TASK_ITER_THREADED is specified, after walking the tasks of each
!threaded css_set, all the associated threaded css_sets are visited
before moving on to the next !threaded css_set.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup.h |  6 ++++
 kernel/cgroup/cgroup.c | 81 +++++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 73 insertions(+), 14 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 20f6057..cd0be87 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -38,6 +38,8 @@
 
 /* walk only threadgroup leaders */
 #define CSS_TASK_ITER_PROCS		(1U << 0)
+/* walk threaded css_sets as part of their proc_csets */
+#define CSS_TASK_ITER_THREADED		(1U << 1)
 
 /* a css_task_iter should be treated as an opaque object */
 struct css_task_iter {
@@ -47,11 +49,15 @@ struct css_task_iter {
 	struct list_head		*cset_pos;
 	struct list_head		*cset_head;
 
+	struct list_head		*tcset_pos;
+	struct list_head		*tcset_head;
+
 	struct list_head		*task_pos;
 	struct list_head		*tasks_head;
 	struct list_head		*mg_tasks_head;
 
 	struct css_set			*cur_cset;
+	struct css_set			*cur_pcset;
 	struct task_struct		*cur_task;
 	struct list_head		iters_node;	/* css_set->task_iters */
 };
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 6c5658a..e73b3c3 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3590,27 +3590,36 @@ bool css_has_online_children(struct cgroup_subsys_state *css)
 	return ret;
 }
 
-/**
- * css_task_iter_advance_css_set - advance a task itererator to the next css_set
- * @it: the iterator to advance
- *
- * Advance @it to the next css_set to walk.
- */
-static void css_task_iter_advance_css_set(struct css_task_iter *it)
+static struct css_set *css_task_iter_next_css_set(struct css_task_iter *it)
 {
-	struct list_head *l = it->cset_pos;
+	bool threaded = it->flags & CSS_TASK_ITER_THREADED;
+	struct list_head *l;
 	struct cgrp_cset_link *link;
 	struct css_set *cset;
 
 	lockdep_assert_held(&css_set_lock);
 
-	/* Advance to the next non-empty css_set */
+	/* find the next threaded cset */
+	if (it->tcset_pos) {
+		l = it->tcset_pos->next;
+
+		if (l != it->tcset_head) {
+			it->tcset_pos = l;
+			return container_of(l, struct css_set,
+					    threaded_csets_node);
+		}
+
+		it->tcset_pos = NULL;
+	}
+
+	/* find the next cset */
+	l = it->cset_pos;
+
 	do {
 		l = l->next;
 		if (l == it->cset_head) {
 			it->cset_pos = NULL;
-			it->task_pos = NULL;
-			return;
+			return NULL;
 		}
 
 		if (it->ss) {
@@ -3620,10 +3629,50 @@ static void css_task_iter_advance_css_set(struct css_task_iter *it)
 			link = list_entry(l, struct cgrp_cset_link, cset_link);
 			cset = link->cset;
 		}
-	} while (!css_set_populated(cset));
+
+		/*
+		 * For threaded iterations, threaded csets are walked
+		 * together with their proc_csets.  Skip here.
+		 */
+	} while (threaded && css_set_threaded(cset));
 
 	it->cset_pos = l;
 
+	/* initialize threaded cset walking */
+	if (threaded) {
+		if (it->cur_pcset)
+			put_css_set_locked(it->cur_pcset);
+		it->cur_pcset = cset;
+		get_css_set(cset);
+
+		it->tcset_head = &cset->threaded_csets;
+		it->tcset_pos = &cset->threaded_csets;
+	}
+
+	return cset;
+}
+
+/**
+ * css_task_iter_advance_css_set - advance a task iterator to the next css_set
+ * @it: the iterator to advance
+ *
+ * Advance @it to the next css_set to walk.
+ */
+static void css_task_iter_advance_css_set(struct css_task_iter *it)
+{
+	struct css_set *cset;
+
+	lockdep_assert_held(&css_set_lock);
+
+	/* Advance to the next non-empty css_set */
+	do {
+		cset = css_task_iter_next_css_set(it);
+		if (!cset) {
+			it->task_pos = NULL;
+			return;
+		}
+	} while (!css_set_populated(cset));
+
 	if (!list_empty(&cset->tasks))
 		it->task_pos = cset->tasks.next;
 	else
@@ -3766,6 +3815,9 @@ void css_task_iter_end(struct css_task_iter *it)
 		spin_unlock_irq(&css_set_lock);
 	}
 
+	if (it->cur_pcset)
+		put_css_set(it->cur_pcset);
+
 	if (it->cur_task)
 		put_task_struct(it->cur_task);
 }
@@ -3791,6 +3843,7 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 	struct kernfs_open_file *of = s->private;
 	struct cgroup *cgrp = seq_css(s)->cgroup;
 	struct css_task_iter *it = of->priv;
+	unsigned iter_flags = CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED;
 
 	/*
 	 * When a seq_file is seeked, it's always traversed sequentially
@@ -3804,10 +3857,10 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 		if (!it)
 			return ERR_PTR(-ENOMEM);
 		of->priv = it;
-		css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS, it);
+		css_task_iter_start(&cgrp->self, iter_flags, it);
 	} else if (!(*pos)++) {
 		css_task_iter_end(it);
-		css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS, it);
+		css_task_iter_start(&cgrp->self, iter_flags, it);
 	}
 
 	return cgroup_procs_next(s, NULL, NULL);
-- 
2.9.3


* [PATCH 5/5] cgroup: implement cgroup v2 thread support
  2017-02-02 20:06 [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode Tejun Heo
                   ` (3 preceding siblings ...)
  2017-02-02 20:06 ` [PATCH 4/5] cgroup: implement CSS_TASK_ITER_THREADED Tejun Heo
@ 2017-02-02 20:06 ` Tejun Heo
  2017-02-02 21:32 ` [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode Andy Lutomirski
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2017-02-02 20:06 UTC (permalink / raw)
  To: lizefan, hannes, peterz, mingo, pjt, luto, efault
  Cc: cgroups, linux-kernel, kernel-team, lvenanci, Tejun Heo

This patch implements cgroup v2 thread support.  The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.

Once thread mode is enabled on a cgroup, the threads of the processes
which are in its subtree can be placed inside the subtree without
being restricted by process granularity or no-internal-process
constraint.  Note that the threads aren't allowed to escape to a
different threaded subtree.  To be used inside a threaded subtree, a
controller should explicitly support threaded mode and be able to
handle internal competition in the way which is appropriate for the
resource.

The root of a threaded subtree, where thread mode is enabled in the
first place, is called the thread root and serves as the resource
domain for the whole subtree.  This is the last cgroup where
non-threaded controllers are operational and where all the
domain-level resource consumptions in the subtree are accounted.  This
allows threaded controllers to operate at thread granularity when
requested while staying inside the scope of system-level resource
distribution.

Internally, in a threaded subtree, each css_set has its ->proc_cset
pointing to a matching css_set which belongs to the thread root.  This
ensures that the thread root level cgroup_subsys_states for all
threaded controllers are readily accessible for domain-level
operations.

This patch enables threaded mode for the pids and perf_events
controllers.  Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.

For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.  Note that the documentation update is not complete as
the rest of the documentation needs to be updated accordingly.
Rolling those updates into this patch would be confusing, so they will
be done in separate patches.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 Documentation/cgroup-v2.txt |  75 +++++++++++++-
 include/linux/cgroup-defs.h |  16 +++
 kernel/cgroup/cgroup.c      | 240 +++++++++++++++++++++++++++++++++++++++++++-
 kernel/cgroup/pids.c        |   1 +
 kernel/events/core.c        |   1 +
 5 files changed, 326 insertions(+), 7 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 3b8449f..6c255394 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -16,7 +16,9 @@ CONTENTS
   1-2. What is cgroup?
 2. Basic Operations
   2-1. Mounting
-  2-2. Organizing Processes
+  2-2. Organizing Processes and Threads
+    2-2-1. Processes
+    2-2-2. Threads
   2-3. [Un]populated Notification
   2-4. Controlling Controllers
     2-4-1. Enabling and Disabling
@@ -150,7 +152,9 @@ and experimenting easier, the kernel parameter cgroup_no_v1= allows
 disabling controllers in v1 and make them always available in v2.
 
 
-2-2. Organizing Processes
+2-2. Organizing Processes and Threads
+
+2-2-1. Processes
 
 Initially, only the root cgroup exists to which all processes belong.
 A child cgroup can be created by creating a sub-directory.
@@ -201,6 +205,73 @@ is removed subsequently, " (deleted)" is appended to the path.
   0::/test-cgroup/test-cgroup-nested (deleted)
 
 
+2-2-2. Threads
+
+cgroup v2 supports thread granularity for a subset of controllers to
+support use cases requiring hierarchical resource distribution across
+the threads of a group of processes.  By default, all threads of a
+process belong to the same cgroup, which also serves as the resource
+domain to host resource consumptions which are not specific to a
+process or thread.  The thread mode allows threads to be spread across
+a subtree while still maintaining the common resource domain for them.
+
+Enabling thread mode on a subtree makes it threaded.  The root of a
+threaded subtree is called thread root and serves as the resource
+domain for the entire subtree.  In a threaded subtree, threads of a
+process can be put in different cgroups and are not subject to the
+no-internal-process constraint - threaded controllers can be enabled on
+non-leaf cgroups whether they have threads in them or not.
+
+To enable the thread mode, the following conditions must be met.
+
+- The thread root doesn't have any child cgroups.
+
+- The thread root doesn't have any controllers enabled.
+
+Thread mode can be enabled by writing "enable" to "cgroup.threads"
+file.
+
+  # echo enable > cgroup.threads
+
+Inside a threaded subtree, "cgroup.threads" can be read and contains
+the list of the thread IDs of all threads in the cgroup.  Except that
+the operations are per-thread instead of per-process, "cgroup.threads"
+has the same format and behaves the same way as "cgroup.procs".
+
+The thread root serves as the resource domain for the whole subtree,
+and, while the threads can be scattered across the subtree, all the
+processes are considered to be in the thread root.  "cgroup.procs" in
+a thread root contains the PIDs of all processes in the subtree and is
+not readable in the subtree proper.  However, "cgroup.procs" can be
+written to from anywhere in the subtree to migrate all threads of the
+matching process to the cgroup.
+
+Only threaded controllers can be enabled in a threaded subtree.  When
+a threaded controller is enabled inside a threaded subtree, it only
+accounts for and controls resource consumptions associated with the
+threads in the cgroup and its descendants.  All consumptions which
+aren't tied to a specific thread belong to the thread root.
+
+Because a threaded subtree is exempt from the no-internal-process
+constraint, a threaded controller must be able to handle competition
+between threads in a non-leaf cgroup and its child cgroups.  Each
+threaded controller defines how such competitions are handled.
+
+To disable the thread mode, the following conditions must be met.
+
+- The cgroup is a thread root.  Thread mode can't be disabled
+  partially in the subtree.
+
+- The thread root doesn't have any child cgroups.
+
+- The thread root doesn't have any controllers enabled.
+
+Thread mode can be disabled by writing "disable" to "cgroup.threads"
+file.
+
+  # echo disable > cgroup.threads
+
+
 2-3. [Un]populated Notification
 
 Each non-root cgroup has a "cgroup.events" file which contains
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 22e894c..c4fc469 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -226,6 +226,10 @@ struct css_set {
 	struct cgroup *mg_dst_cgrp;
 	struct css_set *mg_dst_cset;
 
+	/* used while updating ->proc_cset to enable/disable threaded mode */
+	struct list_head pcset_preload_node;
+	struct css_set *pcset_preload;
+
 	/* dead and being drained, ignore for migration */
 	bool dead;
 
@@ -497,6 +501,18 @@ struct cgroup_subsys {
 	bool implicit_on_dfl:1;
 
 	/*
+	 * If %true, the controller supports threaded mode on the default
+	 * hierarchy.  In a threaded subtree, both process granularity and
+	 * the no-internal-process constraint are ignored and a threaded
+	 * controller should be able to handle that.
+	 *
+	 * Note that as an implicit controller is automatically enabled on
+	 * all cgroups on the default hierarchy, it should also be
+	 * threaded.  implicit && !threaded is not supported.
+	 */
+	bool threaded:1;
+
+	/*
 	 * If %false, this subsystem is properly hierarchical -
 	 * configuration, resource accounting and restriction on a parent
 	 * cgroup cover those of its children.  If %true, hierarchy support
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index e73b3c3..4e5b3b2 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -161,6 +161,9 @@ static u16 cgrp_dfl_inhibit_ss_mask;
 /* some controllers are implicitly enabled on the default hierarchy */
 static u16 cgrp_dfl_implicit_ss_mask;
 
+/* some controllers can be threaded on the default hierarchy */
+static u16 cgrp_dfl_threaded_ss_mask;
+
 /* The list of hierarchy roots */
 LIST_HEAD(cgroup_roots);
 static int cgroup_root_count;
@@ -2909,11 +2912,18 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 		goto out_unlock;
 	}
 
+	/* can't enable !threaded controllers on a threaded cgroup */
+	if (cgrp->proc_cgrp && (enable & ~cgrp_dfl_threaded_ss_mask)) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
 	/*
-	 * Except for the root, subtree_control must be zero for a cgroup
-	 * with tasks so that child cgroups don't compete against tasks.
+	 * Except for root and threaded cgroups, subtree_control must be
+	 * zero for a cgroup with tasks so that child cgroups don't compete
+	 * against tasks.
 	 */
-	if (enable && cgroup_parent(cgrp)) {
+	if (enable && cgroup_parent(cgrp) && !cgrp->proc_cgrp) {
 		struct cgrp_cset_link *link;
 
 		/*
@@ -2954,6 +2964,124 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
+static int cgroup_enable_threaded(struct cgroup *cgrp)
+{
+	LIST_HEAD(csets);
+	struct cgrp_cset_link *link;
+	struct css_set *cset, *cset_next;
+	int ret;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	/* noop if already threaded */
+	if (cgrp->proc_cgrp)
+		return 0;
+
+	/* allow only if there are neither children nor enabled controllers */
+	if (css_has_online_children(&cgrp->self) || cgrp->subtree_control)
+		return -EBUSY;
+
+	/* find all csets which need ->proc_cset updated */
+	spin_lock_irq(&css_set_lock);
+	list_for_each_entry(link, &cgrp->cset_links, cset_link) {
+		cset = link->cset;
+		if (css_set_populated(cset)) {
+			WARN_ON_ONCE(css_set_threaded(cset));
+			WARN_ON_ONCE(cset->pcset_preload);
+
+			list_add_tail(&cset->pcset_preload_node, &csets);
+			get_css_set(cset);
+		}
+	}
+	spin_unlock_irq(&css_set_lock);
+
+	/* find the proc_csets to associate */
+	list_for_each_entry(cset, &csets, pcset_preload_node) {
+		struct css_set *pcset = find_css_set(cset, cgrp, true);
+
+		WARN_ON_ONCE(cset == pcset);
+		if (!pcset) {
+			ret = -ENOMEM;
+			goto err_put_csets;
+		}
+		cset->pcset_preload = pcset;
+	}
+
+	/* install ->proc_cset */
+	spin_lock_irq(&css_set_lock);
+	list_for_each_entry_safe(cset, cset_next, &csets, pcset_preload_node) {
+		rcu_assign_pointer(cset->proc_cset, cset->pcset_preload);
+		list_add_tail(&cset->threaded_csets_node,
+			      &cset->pcset_preload->threaded_csets);
+
+		cset->pcset_preload = NULL;
+		list_del(&cset->pcset_preload_node);
+		put_css_set_locked(cset);
+	}
+	spin_unlock_irq(&css_set_lock);
+
+	/* mark it threaded */
+	cgrp->proc_cgrp = cgrp;
+
+	return 0;
+
+err_put_csets:
+	spin_lock_irq(&css_set_lock);
+	list_for_each_entry_safe(cset, cset_next, &csets, pcset_preload_node) {
+		if (cset->pcset_preload) {
+			put_css_set_locked(cset->pcset_preload);
+			cset->pcset_preload = NULL;
+		}
+		list_del(&cset->pcset_preload_node);
+		put_css_set_locked(cset);
+	}
+	spin_unlock_irq(&css_set_lock);
+	return ret;
+}
+
+static int cgroup_disable_threaded(struct cgroup *cgrp)
+{
+	struct cgrp_cset_link *link;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	/* noop if already !threaded */
+	if (!cgrp->proc_cgrp)
+		return 0;
+
+	/* partial disable isn't supported */
+	if (cgrp->proc_cgrp != cgrp)
+		return -EBUSY;
+
+	/* allow only if there are neither children nor enabled controllers */
+	if (css_has_online_children(&cgrp->self) || cgrp->subtree_control)
+		return -EBUSY;
+
+	/* walk all csets and reset ->proc_cset */
+	spin_lock_irq(&css_set_lock);
+	list_for_each_entry(link, &cgrp->cset_links, cset_link) {
+		struct css_set *cset = link->cset;
+
+		if (css_set_threaded(cset)) {
+			struct css_set *pcset = proc_css_set(cset);
+
+			WARN_ON_ONCE(pcset->dfl_cgrp != cgrp);
+			rcu_assign_pointer(cset->proc_cset, cset);
+			list_del(&cset->threaded_csets_node);
+
+			/*
+			 * @pcset is never @cset and safe to put during
+			 * iteration.
+			 */
+			put_css_set_locked(pcset);
+		}
+	}
+	cgrp->proc_cgrp = NULL;
+	spin_unlock_irq(&css_set_lock);
+
+	return 0;
+}
+
 static int cgroup_events_show(struct seq_file *seq, void *v)
 {
 	seq_printf(seq, "populated %d\n",
@@ -3838,12 +3966,12 @@ static void *cgroup_procs_next(struct seq_file *s, void *v, loff_t *pos)
 	return css_task_iter_next(it);
 }
 
-static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
+static void *__cgroup_procs_start(struct seq_file *s, loff_t *pos,
+				  unsigned int iter_flags)
 {
 	struct kernfs_open_file *of = s->private;
 	struct cgroup *cgrp = seq_css(s)->cgroup;
 	struct css_task_iter *it = of->priv;
-	unsigned iter_flags = CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED;
 
 	/*
 	 * When a seq_file is seeked, it's always traversed sequentially
@@ -3866,6 +3994,23 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 	return cgroup_procs_next(s, NULL, NULL);
 }
 
+static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
+{
+	struct cgroup *cgrp = seq_css(s)->cgroup;
+
+	/*
+	 * All processes of a threaded subtree are in the top threaded
+	 * cgroup.  Only threads can be distributed across the subtree.
+	 * Reject reads on cgroup.procs in the subtree proper.  They're
+	 * always empty anyway.
+	 */
+	if (cgrp->proc_cgrp && cgrp->proc_cgrp != cgrp)
+		return ERR_PTR(-EINVAL);
+
+	return __cgroup_procs_start(s, pos, CSS_TASK_ITER_PROCS |
+					    CSS_TASK_ITER_THREADED);
+}
+
 static int cgroup_procs_show(struct seq_file *s, void *v)
 {
 	seq_printf(s, "%d\n", task_pid_vnr(v));
@@ -3920,6 +4065,76 @@ static ssize_t cgroup_procs_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
+static void *cgroup_threads_start(struct seq_file *s, loff_t *pos)
+{
+	struct cgroup *cgrp = seq_css(s)->cgroup;
+
+	if (!cgrp->proc_cgrp)
+		return ERR_PTR(-EINVAL);
+
+	return __cgroup_procs_start(s, pos, 0);
+}
+
+static ssize_t cgroup_threads_write(struct kernfs_open_file *of,
+				    char *buf, size_t nbytes, loff_t off)
+{
+	struct super_block *sb = of->file->f_path.dentry->d_sb;
+	struct cgroup *cgrp, *common_ancestor;
+	struct task_struct *task;
+	ssize_t ret;
+
+	buf = strstrip(buf);
+
+	cgrp = cgroup_kn_lock_live(of->kn, false);
+	if (!cgrp)
+		return -ENODEV;
+
+	/* cgroup.procs determines delegation, require permission on it too */
+	ret = cgroup_procs_write_permission(cgrp, sb);
+	if (ret)
+		goto out_unlock;
+
+	/* enable or disable? */
+	if (!strcmp(buf, "enable")) {
+		ret = cgroup_enable_threaded(cgrp);
+		goto out_unlock;
+	} else if (!strcmp(buf, "disable")) {
+		ret = cgroup_disable_threaded(cgrp);
+		goto out_unlock;
+	}
+
+	/* thread migration */
+	ret = -EINVAL;
+	if (!cgrp->proc_cgrp)
+		goto out_unlock;
+
+	task = cgroup_procs_write_start(buf, false);
+	ret = PTR_ERR_OR_ZERO(task);
+	if (ret)
+		goto out_unlock;
+
+	common_ancestor = cgroup_migrate_common_ancestor(task, cgrp);
+
+	/* can't migrate across disjoint threaded subtrees */
+	ret = -EACCES;
+	if (common_ancestor->proc_cgrp != cgrp->proc_cgrp)
+		goto out_finish;
+
+	/* and follow the cgroup.procs delegation rule */
+	ret = cgroup_procs_write_permission(common_ancestor, sb);
+	if (ret)
+		goto out_finish;
+
+	ret = cgroup_attach_task(cgrp, task, false);
+
+out_finish:
+	cgroup_procs_write_finish();
+out_unlock:
+	cgroup_kn_unlock(of->kn);
+
+	return ret ?: nbytes;
+}
+
 /* cgroup core interface files for the default hierarchy */
 static struct cftype cgroup_base_files[] = {
 	{
@@ -3932,6 +4147,14 @@ static struct cftype cgroup_base_files[] = {
 		.write = cgroup_procs_write,
 	},
 	{
+		.name = "cgroup.threads",
+		.release = cgroup_procs_release,
+		.seq_start = cgroup_threads_start,
+		.seq_next = cgroup_procs_next,
+		.seq_show = cgroup_procs_show,
+		.write = cgroup_threads_write,
+	},
+	{
 		.name = "cgroup.controllers",
 		.seq_show = cgroup_controllers_show,
 	},
@@ -4245,6 +4468,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 	cgrp->self.parent = &parent->self;
 	cgrp->root = root;
 	cgrp->level = level;
+	cgrp->proc_cgrp = parent->proc_cgrp;
 
 	for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp))
 		cgrp->ancestor_ids[tcgrp->level] = tcgrp->id;
@@ -4687,11 +4911,17 @@ int __init cgroup_init(void)
 
 		cgrp_dfl_root.subsys_mask |= 1 << ss->id;
 
+		/* implicit controllers must be threaded too */
+		WARN_ON(ss->implicit_on_dfl && !ss->threaded);
+
 		if (ss->implicit_on_dfl)
 			cgrp_dfl_implicit_ss_mask |= 1 << ss->id;
 		else if (!ss->dfl_cftypes)
 			cgrp_dfl_inhibit_ss_mask |= 1 << ss->id;
 
+		if (ss->threaded)
+			cgrp_dfl_threaded_ss_mask |= 1 << ss->id;
+
 		if (ss->dfl_cftypes == ss->legacy_cftypes) {
 			WARN_ON(cgroup_add_cftypes(ss, ss->dfl_cftypes));
 		} else {
diff --git a/kernel/cgroup/pids.c b/kernel/cgroup/pids.c
index 2bd6737..49b5424 100644
--- a/kernel/cgroup/pids.c
+++ b/kernel/cgroup/pids.c
@@ -345,4 +345,5 @@ struct cgroup_subsys pids_cgrp_subsys = {
 	.free		= pids_free,
 	.legacy_cftypes	= pids_files,
 	.dfl_cftypes	= pids_files,
+	.threaded	= true,
 };
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d72128d..508c0d9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10798,5 +10798,6 @@ struct cgroup_subsys perf_event_cgrp_subsys = {
 	 * controller is not mounted on a legacy hierarchy.
 	 */
 	.implicit_on_dfl = true,
+	.threaded	= true,
 };
 #endif /* CONFIG_CGROUP_PERF */
-- 
2.9.3


* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-02-02 20:06 [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode Tejun Heo
                   ` (4 preceding siblings ...)
  2017-02-02 20:06 ` [PATCH 5/5] cgroup: implement cgroup v2 thread support Tejun Heo
@ 2017-02-02 21:32 ` Andy Lutomirski
  2017-02-02 21:52   ` Tejun Heo
  2017-02-03 20:20 ` Peter Zijlstra
  2017-02-09 13:07 ` Paul Turner
  7 siblings, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2017-02-02 21:32 UTC (permalink / raw)
  To: Tejun Heo, Linux API
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
	Paul Turner, Mike Galbraith, open list:CONTROL GROUP (CGROUP),
	linux-kernel, kernel-team, lvenanci

On Thu, Feb 2, 2017 at 12:06 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> This patchset implements cgroup v2 thread mode.  It is largely based
> on the discussions that we had at the plumbers last year.  Here's the
> rough outline.

I like this, but I have some design questions:

>
> * Thread mode is explicitly enabled on a cgroup by writing "enable"
>   into "cgroup.threads" file.  The cgroup shouldn't have any child
>   cgroups or enabled controllers.

Why do you need to manually turn it on?  That is, couldn't it be
automatic based on what controllers are enabled?

>
> * Once enabled, arbitrary sub-hierarchy can be created and threads can
>   be put anywhere in the subtree by writing TIDs into "cgroup.threads"
>   file.  Process granularity and no-internal-process constraint don't
>   apply in a threaded subtree.

I'm a bit worried that this conflates two different things.  There's
thread support, i.e. allowing individual threads to be placed into
cgroups.  There's also more flexible sub-hierarchy support, i.e.
relaxing no-internal-process constraints.  For the "cpuacct"
controller, for example, both of these make sense.  But what if
someone writes a controller (directio, for example, just to make
something up) for which thread granularity makes sense but relaxing
no-internal-process constraints does not?

--Andy


* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-02-02 21:32 ` [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode Andy Lutomirski
@ 2017-02-02 21:52   ` Tejun Heo
  2017-02-03 21:10     ` Andy Lutomirski
  2017-02-06  9:50     ` Peter Zijlstra
  0 siblings, 2 replies; 32+ messages in thread
From: Tejun Heo @ 2017-02-02 21:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Paul Turner, Mike Galbraith,
	open list:CONTROL GROUP (CGROUP),
	linux-kernel, kernel-team, lvenanci

Hello,

On Thu, Feb 02, 2017 at 01:32:19PM -0800, Andy Lutomirski wrote:
> > * Thread mode is explicitly enabled on a cgroup by writing "enable"
> >   into "cgroup.threads" file.  The cgroup shouldn't have any child
> >   cgroups or enabled controllers.
> 
> Why do you need to manually turn it on?  That is, couldn't it be
> automatic based on what controllers are enabled?

This came up already but it's not like some controllers are inherently
thread-only.  Consider CPU, all in-context CPU cycle consumptions are
tied to the thread; however, we also want to be able to account for
CPU cycles consumed for, for example, memory reclaim or encryption
during writeback.

I played with an interface where thread mode is enabled automatically
upto the common ancestor of the threads but not only was it
complicated to implement but also the eventual behavior was very
confusing as the resource domain can change without any active actions
from the user.  I think keeping things simple is the right choice
here.

> > * Once enabled, arbitrary sub-hierarchy can be created and threads can
> >   be put anywhere in the subtree by writing TIDs into "cgroup.threads"
> >   file.  Process granularity and no-internal-process constraint don't
> >   apply in a threaded subtree.
> 
> I'm a bit worried that this conflates two different things.  There's
> thread support, i.e. allowing individual threads to be placed into
> cgroups.  There's also more flexible sub-hierarchy support, i.e.
> relaxing no-internal-process constraints.  For the "cpuacct"
> controller, for example, both of these make sense.  But what if
> someone writes a controller (directio, for example, just to make
> something up) for which thread granularity makes sense but relaxing
> no-internal-process constraints does not?

If a controller can't possibly define how internal competition should
be handled, which is unlikely - the problem is being consistent and
sensible, defining something isn't difficult - the controller can
simply error out those cases either on configuration or migration.
Again, I'm very doubtful we'll need that but if we ever need that
denying specific configurations is the best we can do anyway.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-02-02 20:06 [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode Tejun Heo
                   ` (5 preceding siblings ...)
  2017-02-02 21:32 ` [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode Andy Lutomirski
@ 2017-02-03 20:20 ` Peter Zijlstra
  2017-02-03 20:59   ` Tejun Heo
  2017-02-09 13:07 ` Paul Turner
  7 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2017-02-03 20:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, mingo, pjt, luto, efault, cgroups, linux-kernel,
	kernel-team, lvenanci

On Thu, Feb 02, 2017 at 03:06:27PM -0500, Tejun Heo wrote:
> Hello,
> 
> This patchset implements cgroup v2 thread mode.  It is largely based
> on the discussions that we had at the plumbers last year.  Here's the
> rough outline.
> 
> * Thread mode is explicitly enabled on a cgroup by writing "enable"
>   into "cgroup.threads" file.  The cgroup shouldn't have any child
>   cgroups or enabled controllers.
> 
> * Once enabled, arbitrary sub-hierarchy can be created and threads can
>   be put anywhere in the subtree by writing TIDs into "cgroup.threads"
>   file.  Process granularity and no-internal-process constraint don't
>   apply in a threaded subtree.
> 
> * To be used in a threaded subtree, controllers should explicitly
>   declare thread mode support and should be able to handle internal
>   competition in some way.
> 
> * The root of a threaded subtree serves as the resource domain for the
>   whole subtree.  This is where all the controllers are guaranteed to
>   have a common ground and resource consumptions in the threaded
>   subtree which aren't tied to a specific thread are charged.
>   Non-threaded controllers never see beyond thread root and can assume
>   that all controllers will follow the same rules up to that point.
> 
> This allows threaded controllers to implement thread granular resource
> control without getting in the way of system level resource
> partitioning.


I'm still confused. So suppose I mark my root cgroup as such, because I
run RT tasks there. Does this then mean I cannot later start a VM and
have that containerized properly? That is, I think threaded controllers
very much get in the way of system level resource partitioning this way.


So my proposal was to do the inverse of what you propose here. Instead
of marking special 'thread' subtrees, explicitly mark resource domains
in the tree.

So always allow arbitrary hierarchies and allow threads to be assigned
to cgroups, as long as they're all in the same resource domain.

Controllers that do not support things, fallback to mapping everything
to the nearest resource domain.


* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-02-03 20:20 ` Peter Zijlstra
@ 2017-02-03 20:59   ` Tejun Heo
  2017-02-06 12:49     ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2017-02-03 20:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lizefan, hannes, mingo, pjt, luto, efault, cgroups, linux-kernel,
	kernel-team, lvenanci

Hello, Peter.

On Fri, Feb 03, 2017 at 09:20:48PM +0100, Peter Zijlstra wrote:
> So my proposal was to do the inverse of what you propose here. Instead
> of marking special 'thread' subtrees, explicitly mark resource domains
> in the tree.
> 
> So always allow arbitrary hierarchies and allow threads to be assigned
> to cgroups, as long as they're all in the same resource domain.
> 
> Controllers that do not support things, fallback to mapping everything
> to the nearest resource domain.

That sounds counter-intuitive as all controllers can do resource
domains and only a subset of them can do thread mode.  Also, thread
subtrees are necessarily a sub-hierarchy of a resource domain.  Also,
expanding resource domains from the root after the trees are populated
would make the behaviors surprising as the resource domains that these
subtrees belong to would change dynamically.

In practice, how would this work?  To enable memcg, the user has to
first create the subtree and then explicitly have to make that a
domain and then enable memcg?  If so, that would be a completely
unnecessary deviation from the current behavior while not achieving
any more functionalities, right?  It's just flipping the default value
the other way around and the default wouldn't be supported by many of
the controllers.  I can't see how that is a better option.

Thanks.

-- 
tejun


* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-02-02 21:52   ` Tejun Heo
@ 2017-02-03 21:10     ` Andy Lutomirski
  2017-02-03 21:56       ` Tejun Heo
  2017-02-06  9:50     ` Peter Zijlstra
  1 sibling, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2017-02-03 21:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linux API, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Paul Turner, Mike Galbraith,
	open list:CONTROL GROUP (CGROUP),
	linux-kernel, kernel-team, lvenanci

On Thu, Feb 2, 2017 at 1:52 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Thu, Feb 02, 2017 at 01:32:19PM -0800, Andy Lutomirski wrote:
>> > * Thread mode is explicitly enabled on a cgroup by writing "enable"
>> >   into "cgroup.threads" file.  The cgroup shouldn't have any child
>> >   cgroups or enabled controllers.
>>
>> Why do you need to manually turn it on?  That is, couldn't it be
>> automatic based on what controllers are enabled?
>
> This came up already but it's not like some controllers are inherently
> thread-only.  Consider CPU, all in-context CPU cycle consumptions are
> tied to the thread; however, we also want to be able to account for
> CPU cycles consumed for, for example, memory reclaim or encryption
> during writeback.
>

Is this flexible enough for real-world use cases?  For my use case
(if I actually ported over to this), it would mean that I'd have to
enable thread mode on the root.  What about letting a given process
(actually mm, perhaps) live in a cgroup but let the threads be in
different cgroups without any particular constraints.  Then
process-wide stuff would be accounted to the cgroup that owns the
process.

>
>> > * Once enabled, arbitrary sub-hierarchy can be created and threads can
>> >   be put anywhere in the subtree by writing TIDs into "cgroup.threads"
>> >   file.  Process granularity and no-internal-process constraint don't
>> >   apply in a threaded subtree.
>>
>> I'm a bit worried that this conflates two different things.  There's
>> thread support, i.e. allowing individual threads to be placed into
>> cgroups.  There's also more flexible sub-hierarchy support, i.e.
>> relaxing no-internal-process constraints.  For the "cpuacct"
>> controller, for example, both of these make sense.  But what if
>> someone writes a controller (directio, for example, just to make
>> something up) for which thread granularity makes sense but relaxing
>> no-internal-process constraints does not?
>
> If a controller can't possibly define how internal competition should
> be handled - which is unlikely, as the problem is being consistent and
> sensible, and defining something isn't difficult - the controller can
> simply error out those cases either on configuration or migration.
> Again, I'm very doubtful we'll need that, but if we ever do, denying
> specific configurations is the best we can do anyway.
>

I'm not sure I follow.

I'm suggesting something quite simple: let controllers that don't need
the no-internal-process constraints set a flag so that the constraints
don't apply if all enabled controllers have the flag set.
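The flag check Andy describes is small enough to sketch (hypothetical
Python; `flagged` stands in for the set of controllers that would set
such a flag — none of these names exist in the posted patches):

```python
def no_internal_process_applies(enabled, flagged):
    # the constraint is lifted only when every enabled controller
    # has opted out by setting the (hypothetical) flag
    return not all(c in flagged for c in enabled)
```

Note that with this rule, enabling one controller that lacks the flag
re-imposes the constraint on the whole cgroup.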

--Andy


* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-02-03 21:10     ` Andy Lutomirski
@ 2017-02-03 21:56       ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2017-02-03 21:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Li Zefan, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Paul Turner, Mike Galbraith,
	open list:CONTROL GROUP (CGROUP),
	linux-kernel, kernel-team, lvenanci

Hello,

On Fri, Feb 03, 2017 at 01:10:21PM -0800, Andy Lutomirski wrote:
> Is this flexible enough for real-world use cases?  For my use case

I can't think of a reason why it won't be.  Capability-wise, nothing
is being lost by the interface.

> (if I actually ported over to this), it would mean that I'd have to
> enable thread mode on the root.  What about letting a given process
> (actually mm, perhaps) live in a cgroup but let the threads be in
> different cgroups without any particular constraints.  Then
> process-wide stuff would be accounted to the cgroup that owns the
> process.

I don't know.  So, then, we basically have completely separate trees
for resource domains and threads.  That is exactly what mounting the
cpu controller separately does.  It doesn't make sense to put them on
the same hierarchy.  Why?

> > If a controller can't possibly define how internal competition should
> > be handled - which is unlikely, as the problem is being consistent and
> > sensible, and defining something isn't difficult - the controller can
> > simply error out those cases either on configuration or migration.
> > Again, I'm very doubtful we'll need that, but if we ever do, denying
> > specific configurations is the best we can do anyway.
> 
> I'm not sure I follow.
> 
> I'm suggesting something quite simple: let controllers that don't need
> the no-internal-process constraints set a flag so that the constraints
> don't apply if all enabled controllers have the flag set.

Firstly, I think it's better to have the rules as simple and
consistent as possible as long as we don't sacrifice critical
capabilities.

Secondly, all the major resource controllers including cpu would
eventually need resource domain, so there is no real practical upside
to doing so.

Thirdly, if we commit to something like "controller X is not subject
to the no-internal-process constraint", that commitment would prevent us from
ever adding domain level operations to that controller without
breaking userland visible interface.  All controllers do and have to
support process level operations.  Some controllers can do thread
level operations.  Keeping the latter opt-in doesn't block us from
adding thread mode later on.  Doing it the other way around blocks us
from adding domain level operations later on.

Thanks.

-- 
tejun


* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-02-02 21:52   ` Tejun Heo
  2017-02-03 21:10     ` Andy Lutomirski
@ 2017-02-06  9:50     ` Peter Zijlstra
  1 sibling, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2017-02-06  9:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andy Lutomirski, Linux API, Li Zefan, Johannes Weiner,
	Ingo Molnar, Paul Turner, Mike Galbraith,
	open list:CONTROL GROUP (CGROUP),
	linux-kernel, kernel-team, lvenanci

On Thu, Feb 02, 2017 at 04:52:29PM -0500, Tejun Heo wrote:

> > Why do you need to manually turn it on?  That is, couldn't it be
> > automatic based on what controllers are enabled?
> 
> This came up already but it's not like some controllers are inherently
> thread-only.  Consider CPU, all in-context CPU cycle consumptions are
> tied to the thread; however, we also want to be able to account for
> CPU cycles consumed for, for example, memory reclaim or encryption
> during writeback.
> 
> I played with an interface where thread mode is enabled automatically
> up to the common ancestor of the threads, but not only was it
> complicated to implement but also the eventual behavior was very
> confusing as the resource domain can change without any active actions
> from the user.  I think keeping things simple is the right choice
> here.

Note that the explicit marking of the resource domains gets you exactly
that. But let me reply in the other subthread.


* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-02-03 20:59   ` Tejun Heo
@ 2017-02-06 12:49     ` Peter Zijlstra
  2017-02-08 23:08       ` Tejun Heo
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2017-02-06 12:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, mingo, pjt, luto, efault, cgroups, linux-kernel,
	kernel-team, lvenanci

On Fri, Feb 03, 2017 at 03:59:55PM -0500, Tejun Heo wrote:
> Hello, Peter.
> 
> On Fri, Feb 03, 2017 at 09:20:48PM +0100, Peter Zijlstra wrote:
> > So my proposal was to do the inverse of what you propose here. Instead
> > of marking special 'thread' subtrees, explicitly mark resource domains
> > in the tree.
> > 
> > So always allow arbitrary hierarchies and allow threads to be assigned
> > to cgroups, as long as they're all in the same resource domain.
> > 
> > Controllers that do not support things, fallback to mapping everything
> > to the nearest resource domain.
> 
> That sounds counter-intuitive as all controllers can do resource
> domains and only a subset of them can do thread mode.

But to me the resource domain is your primary new construct; so it makes
more sense to explicitly mark that.

(FWIW note that your whole initial cgroup-v2 thing is counterintuitive
to me, as someone who has only ever dealt with thread capable
controllers.)

> Also, thread subtrees are necessarily a sub-hierarchy of a resource domain.

Sure, don't see how that is relevant though. Or rather, I don't see it
being an argument one way or the other.

> Also,
> expanding resource domains from the root after the trees are populated
> would make the behaviors surprising as the resource domains that these
> subtrees belong to would change dynamically.

Uh what? I cannot parse that.

My question was, if you have root.threads=1, can you then still create
(sub) resource domains? Can I create a child cgroup and clear "threads"
again?

(I'm assuming "threads" is inherited when creating new groups).

Now, _if_ we can do the above, then "threads" is not sufficient to
uniquely identify resource domains, which I think was your point in the
other email. Which argues against the interface. Because a group can be
a resource domain _and_ have threads sub-trees.

OTOH, if you can _not_ do this, then this proposal is
insufficient/inadequate.

> In practice, how would this work?  To enable memcg, the user has to
> first create the subtree and then explicitly have to make that a
> domain and then enable memcg?  If so, that would be a completely
> unnecessary deviation from the current behavior while not achieving
> any more functionalities, right?  It's just flipping the default value
> the other way around and the default wouldn't be supported by many of
> the controllers.  I can't see how that is a better option.

OK, so I'm entirely unaware of this enabling of controllers. What's that
about? I thought the whole point of cgroup-v2 was to have all
controllers enabled over the entire tree, this is not so?

In any case, yes, more or less like that, except of course, not at all
:-) If we make this flag inherited (which I think it should be), you
don't need to do anything different from today, because the root group
must be a resource domain, any new sub-group will automagically also be.

Only once you clear the flag somewhere do you get 'new' behaviour. Note
that the only extra constraint is that all threads of a process must
stay within the same resource domain, anything else goes.

Task-based controllers operate on the actual cgroup; resource-domain
controllers always map it back to the resource group. Finding a random
task's resource domain is trivial: simply walk up the hierarchy until
you find a group with the flag set.
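That walk can be sketched directly (illustrative Python; `Group` and
the `domain` flag are stand-ins for the proposed construct, not real
kernel structures):

```python
class Group:
    """Minimal stand-in for a cgroup node (illustrative only)."""
    def __init__(self, parent=None, domain=False):
        self.parent = parent
        # the root cgroup is a resource domain by definition
        self.domain = True if parent is None else domain

def resource_domain(group):
    # walk up until a group with the resource-domain flag is found;
    # this terminates because the root always has the flag set
    node = group
    while not node.domain:
        node = node.parent
    return node
```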



So, just to recap, my proposal is as follows:

 1) each cgroup will have a new flag, indicating if this is a resource
    domain.

    a) this flag will be inherited; when creating a new cgroup, the
       state of the flag will be set to the value of the parent cgroup.

    b) the root cgroup is a resource domain per definition, will
       have it set (cannot be cleared).

 2) all tasks of a process shall be part of the same resource domain

 3) controllers come in two types:

    a) task based controllers; these use the direct cgroup the task
       is assigned to.

    b) resource controllers; these use the effective resource group
       of the task, which is the first parent group with our new
       flag set.


With an optional:

 1c) this flag can only be changed on empty groups

to ease implementation.
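Rules 1a-1c can be captured in a toy model (illustrative Python, not
kernel code):

```python
class Group:
    """Toy model of rules 1a-1c (illustrative, not kernel code)."""
    def __init__(self, parent=None):
        self.parent = parent
        self.tasks = []
        # 1a) the flag is inherited from the parent;
        # 1b) the root is a resource domain by definition
        self.domain = parent.domain if parent else True

    def set_domain(self, value):
        if self.parent is None and not value:
            raise ValueError("1b: root must stay a resource domain")
        if self.tasks:
            raise ValueError("1c: flag can only change on empty groups")
        self.domain = value
```

Clearing the flag on a group turns it (and, by inheritance, its new
children) into thread groups whose resource domain is the nearest
flagged ancestor.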



From these rules it follows that:

- 1a and 1b together ensure no change in behaviour per default
  for cgroup-v2.

- 2 and 3a together ensure resource groups automagically work for task
  based controllers (under the assumption that the controller is
  strictly hierarchical).


For example, cpuacct does the accounting strictly hierarchical, it adds
the cpu usage to each parent group. Therefore the total cpu usage
accounted to the resource group is the same as if all tasks were part of
that group.
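The cpuacct argument can be checked with a toy charging loop
(illustrative Python): because every charge propagates to all
ancestors, the resource group's total is the same as if all tasks were
in it directly.

```python
class Group:
    def __init__(self, parent=None):
        self.parent = parent
        self.usage = 0

def charge(group, delta):
    # strictly hierarchical accounting: every ancestor sees the
    # charge, so a resource domain's total already includes all of
    # its thread subgroups
    node = group
    while node is not None:
        node.usage += delta
        node = node.parent
```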


* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-02-06 12:49     ` Peter Zijlstra
@ 2017-02-08 23:08       ` Tejun Heo
  2017-02-09 10:29         ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2017-02-08 23:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lizefan, hannes, mingo, pjt, luto, efault, cgroups, linux-kernel,
	kernel-team, lvenanci, Linus Torvalds, Andrew Morton

(cc'ing Linus and Andrew for visibility)

Hello, Peter.

On Mon, Feb 06, 2017 at 01:49:43PM +0100, Peter Zijlstra wrote:
> But to me the resource domain is your primary new construct; so it makes
> more sense to explicitly mark that.

Whether it's new or not isn't the point.  Resource domains weren't
added arbitrarily.  We were missing critical resource accounting and
control capabilities because cgroup v1's abstraction wasn't strong
enough to cover some of the major resource consumers and how different
resources can interact with each other.

Resource domains were added to address that.  Given that cgroup's
primary goal is providing resource accounting and control, it doesn't
make sense to make this optional.

> (FWIW note that your whole initial cgroup-v2 thing is counterintuitive
> to me, as someone who has only ever dealt with thread capable
> controllers.)

That is completely fine but you can't direct the overall design while
claiming to be not using, disinterested in, and unfamiliar with the
subject at hand.  I do understand that you have certain use cases that
you think should be covered.  Let's focus on that.

> My question was, if you have root.threads=1, can you then still create
> (sub) resource domains? Can I create a child cgroup and clear "threads"
> again?
>
> (I'm assuming "threads" is inherited when creating new groups).
> 
> Now, _if_ we can do the above, then "threads" is not sufficient to
> uniquely identify resource domains, which I think was your point in the
> other email. Which argues against the interface. Because a group can be
> a resource domain _and_ have threads sub-trees.
>
> OTOH, if you can _not_ do this, then this proposal is
> insufficient/inadequate.

No, you can't flip it back and I'm not convinced this matters.  More
on this below.

> > In practice, how would this work?  To enable memcg, the user has to
> > first create the subtree and then explicitly have to make that a
> > domain and then enable memcg?  If so, that would be a completely
> > unnecessary deviation from the current behavior while not achieving
> > any more functionalities, right?  It's just flipping the default value
> > the other way around and the default wouldn't be supported by many of
> > the controllers.  I can't see how that is a better option.
> 
> OK, so I'm entirely unaware of this enabling of controllers. What's that
> about? I thought the whole point of cgroup-v2 was to have all
> controllers enabled over the entire tree, this is not so?

This is one of the most basic aspects of cgroup v2.  In short, while
the controllers share the hierarchy, each doesn't have to be enabled
all the way down to the leaf.  Different controllers can thus see
different subsets of the hierarchy spreading out from the root.
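That visibility rule can be sketched as follows (illustrative Python,
loosely modeled on cgroup v2's `cgroup.subtree_control` semantics; it
ignores the other v2 constraints such as the no-internal-process rule):

```python
class Group:
    def __init__(self, parent=None):
        self.parent = parent
        # controllers this cgroup distributes to its children
        self.subtree_control = set()

def controller_sees(controller, group):
    # a controller reaches a cgroup iff every ancestor on the path
    # from the root distributes it
    node = group.parent
    while node is not None:
        if controller not in node.subtree_control:
            return False
        node = node.parent
    return True
```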

> In any case, yes, more or less like that, except of course, not at all
> :-) If we make this flag inherited (which I think it should be), you
> don't need to do anything different from today, because the root group
> must be a resource domain, any new sub-group will automagically also be.
> 
> Only once you clear the flag somewhere do you get 'new' behaviour. Note
> that the only extra constraint is that all threads of a process must
> stay within the same resource domain, anything else goes.
> 
> Task-based controllers operate on the actual cgroup; resource-domain
> controllers always map it back to the resource group. Finding a random
> task's resource domain is trivial: simply walk up the hierarchy until
> you find a group with the flag set.
> 
> So, just to recap, my proposal is as follows:
> 
>  1) each cgroup will have a new flag, indicating if this is a resource
>     domain.
> 
>     a) this flag will be inherited; when creating a new cgroup, the
>        state of the flag will be set to the value of the parent cgroup.
> 
>     b) the root cgroup is a resource domain per definition, will
>        have it set (cannot be cleared).
> 
>  2) all tasks of a process shall be part of the same resource domain
> 
>  3) controllers come in two types:
> 
>     a) task based controllers; these use the direct cgroup the task
>        is assigned to.
> 
>     b) resource controllers; these use the effective resource group
>        of the task, which is the first parent group with our new
>        flag set.
> 
> 
> With an optional:
> 
>  1c) this flag can only be changed on empty groups
> 
> to ease implementation.
> 
> From these rules it follows that:
> 
> - 1a and 1b together ensure no change in behaviour per default
>   for cgroup-v2.
> 
> - 2 and 3a together ensure resource groups automagically work for task
>   based controllers (under the assumption that the controller is
>   strictly hierarchical).
> 
> For example, cpuacct does the accounting strictly hierarchical, it adds
> the cpu usage to each parent group. Therefore the total cpu usage
> accounted to the resource group is the same as if all tasks were part of
> that group.

So, what you're proposing isn't that different from what the posted
patches implement except that what's implemented doesn't allow
flipping a part of a threaded subtree back to domain mode.

Your proposal is more complicated while seemingly not adding much to
what can be achieved.  The original proposal is very simple.  It allows
declaring a subtree to be threaded (or task based) and that's it.  A
threaded subtree can't have resource domains under it.

The only addition that your proposal has is the ability to mark
portions of such subtree to be domains again.  This would require
making domain controllers and thread controllers to see different
hierarchies, which brings in something completely new to the basic
hierarchy.

Different controllers seeing differing levels of the same hierarchy is
part of the basic behaviors and making subtrees threaded is a
straight-forward extension of that - threaded controllers just see
further into the hierarchy.  Adding threaded sub-sections in the
middle is more complex and frankly confusing.

Let's say we can make that work but what are the use cases which would
require such setup where we have to alternate between thread and
domain modes through out the resource hierarchy?  This will be a
considerable departure and added complexity from the existing
behaviors and code.  We gotta be achieving something significant if
we're doing that.  Why would we want this?

And here's another aspect.  The currently proposed interface doesn't
preclude adding the behavior you're describing in the future.  Once
thread mode is enabled on a subtree, it isn't allowed to be disabled
in its proper subtree; however, if there actually are use cases which
require flipping it back, we can later implement the behavior and lift
that restriction.  I think it makes sense to start with a simple
model.

Thanks.

-- 
tejun


* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-02-08 23:08       ` Tejun Heo
@ 2017-02-09 10:29         ` Peter Zijlstra
  2017-02-10 15:45           ` Tejun Heo
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2017-02-09 10:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, mingo, pjt, luto, efault, cgroups, linux-kernel,
	kernel-team, lvenanci, Linus Torvalds, Andrew Morton

On Wed, Feb 08, 2017 at 06:08:19PM -0500, Tejun Heo wrote:
> (cc'ing Linus and Andrew for visibility)
> 
> Hello, Peter.
> 
> On Mon, Feb 06, 2017 at 01:49:43PM +0100, Peter Zijlstra wrote:
> > But to me the resource domain is your primary new construct; so it makes
> > more sense to explicitly mark that.
> 
> Whether it's new or not isn't the point.  Resource domains weren't
> added arbitrarily.  We were missing critical resource accounting and
> control capabilities because cgroup v1's abstraction wasn't strong
> enough to cover some of the major resource consumers and how different
> resources can interact with each other.
> 
> Resource domains were added to address that.  Given that cgroup's
> primary goal is providing resource accounting and control, it doesn't
> make sense to make this optional.

I'm not sure what you're saying here. Are you agreeing that 'resource
domains' are the primary new construct or not?

> > My question was, if you have root.threads=1, can you then still create
> > (sub) resource domains? Can I create a child cgroup and clear "threads"
> > again?
> >
> > (I'm assuming "threads" is inherited when creating new groups).
> > 
> > Now, _if_ we can do the above, then "threads" is not sufficient to
> > uniquely identify resource domains, which I think was your point in the
> > other email. Which argues against the interface. Because a group can be
> > a resource domain _and_ have threads sub-trees.
> >
> > OTOH, if you can _not_ do this, then this proposal is
> > insufficient/inadequate.
> 
> No, you can't flip it back and I'm not convinced this matters.  More
> on this below.

Then I shall preliminarily NAK your proposal right here, but I shall
continue to read on.

> > So, just to recap, my proposal is as follows:
> > 
> >  1) each cgroup will have a new flag, indicating if this is a resource
> >     domain.
> > 
> >     a) this flag will be inherited; when creating a new cgroup, the
> >        state of the flag will be set to the value of the parent cgroup.
> > 
> >     b) the root cgroup is a resource domain per definition, will
> >        have it set (cannot be cleared).
> > 
> >  2) all tasks of a process shall be part of the same resource domain
> > 
> >  3) controllers come in two types:
> > 
> >     a) task based controllers; these use the direct cgroup the task
> >        is assigned to.
> > 
> >     b) resource controllers; these use the effective resource group
> >        of the task, which is the first parent group with our new
> >        flag set.
> > 
> > 
> > With an optional:
> > 
> >  1c) this flag can only be changed on empty groups
> > 
> > to ease implementation.
> > 
> > From these rules it follows that:
> > 
> > - 1a and 1b together ensure no change in behaviour per default
> >   for cgroup-v2.
> > 
> > - 2 and 3a together ensure resource groups automagically work for task
> >   based controllers (under the assumption that the controller is
> >   strictly hierarchical).
> > 
> > For example, cpuacct does the accounting strictly hierarchical, it adds
> > the cpu usage to each parent group. Therefore the total cpu usage
> > accounted to the resource group is the same as if all tasks were part of
> > that group.
> 
> So, what you're proposing isn't that different from what the posted
> patches implement except that what's implemented doesn't allow
> flipping a part of a threaded subtree back to domain mode.
> 
> Your proposal is more complicated while seemingly not adding much to
> what can be achieved.  The original proposal is very simple.  It allows
> declaring a subtree to be threaded (or task based) and that's it.  A
> threaded subtree can't have resource domains under it.
> 
> The only addition that your proposal has is the ability to mark
> portions of such subtree to be domains again.  This would require
> making domain controllers and thread controllers to see different
> hierarchies,

Uhm, no. They would see the exact same hierarchy, seeing how there is
only one tree. They would have different view of it maybe, but I don't
see how that matters, nor do you explain.

> which brings in something completely new to the basic hierarchy.

I'm failing to see what.

> Different controllers seeing differing levels of the same hierarchy is
> part of the basic behaviors

I have no idea what you mean there.

> and making subtrees threaded is a
> straight-forward extension of that - threaded controllers just see
> further into the hierarchy.  Adding threaded sub-sections in the
> middle is more complex and frankly confusing.

I disagree, as I completely fail to see any confusion. The rules are
simple and straightforward.

I also don't see why you would want to impose this artificial
restriction. It doesn't get you anything. Why are you so keen on designs
with these artificial limits?

> Let's say we can make that work but what are the use cases which would
> require such setup where we have to alternate between thread and
> domain modes through out the resource hierarchy?

I would very much like to run my main workload in the root resource
group. This means I need to have threaded subtrees at the root level.

Your design would then mean I cannot run a VM (which uses all this
cgroup muck and needs its own resource domain) for some less
critical/isolated workload.

Now, you'll argue I should set up a subtree for the main workload; but
why would I do that? Why would you force me into making this choice;
which has performance penalties associated (because the root resource
domain is special cased in a bunch of places; and because the shallower
the cgroup tree the less overhead etc.).

> This will be a
> considerable departure and added complexity from the existing
> behaviors and code.  We gotta be achieving something significant if
> we're doing that.  Why would we want this?

How is this a departure? I do not understand.

Why would we not want to do this? Why would we want to impose artificial
limitations. What specifically is hard about what I propose?

You have no actual arguments on why what I propose would be hard to
implement. As far as I can tell it should be fairly similar in
complexity to what you already proposed.

> And here's another aspect.  The currently proposed interface doesn't
> preclude adding the behavior you're describing in the future.  Once
> thread mode is enabled on a subtree, it isn't allowed to be disabled
> in its proper subtree; however, if there actually are use cases which
> require flipping it back, we can later implement the behavior and lift
> that restriction.  I think it makes sense to start with a simple
> model.

Your choice of flag makes it impossible to tell what is a resource
domain and what is not in that situation.

Suppose I set the root group threaded and I create subgroups (which will
also all have threaded set). Suppose I clear the threaded bit somewhere
in the subtree to create a new resource group, but then immediately set
the threaded bit again to allow that resource group to have thread
subgroups as well. Now the entire hierarchy will have the threaded flag
set and it becomes impossible to find the resource domains.
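
To make the ambiguity concrete, here is a toy Python model of the
sequence above (illustrative only, not kernel code; the class and the
flag-inheritance behavior are invented for the sketch):

```python
# Toy model (not kernel code) of a per-cgroup boolean "threaded" flag
# going through the set/clear/set sequence described above.

class Cgroup:
    def __init__(self, name, parent=None, threaded=False):
        self.name = name
        self.parent = parent
        self.threaded = threaded
        self.children = []
        if parent is not None:
            parent.children.append(self)

root = Cgroup("root", threaded=True)           # root made threaded
a = Cgroup("A", root, threaded=root.threaded)  # subgroups inherit the flag
b = Cgroup("B", a, threaded=a.threaded)

b.threaded = False  # clear the bit: B becomes a new resource group
b.threaded = True   # set it again so B can host thread subgroups

# Every node now carries the same flag; nothing marks B as a domain.
flags = {cg.name: cg.threaded for cg in (root, a, b)}
print(flags)  # {'root': True, 'A': True, 'B': True}
```

Once the sequence completes, walking the flags alone cannot recover
that B was meant to be a resource domain, which is the point being
made above.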

This is all a direct consequence of your flag not denoting the primary
construct; i.e. resource domains.

IOW; you've completely failed to convince me and my NAK stands.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-02-02 20:06 [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode Tejun Heo
                   ` (6 preceding siblings ...)
  2017-02-03 20:20 ` Peter Zijlstra
@ 2017-02-09 13:07 ` Paul Turner
  2017-02-09 14:47   ` Peter Zijlstra
  2017-02-10 15:46   ` Tejun Heo
  7 siblings, 2 replies; 32+ messages in thread
From: Paul Turner @ 2017-02-09 13:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
	Andy Lutomirski, Mike Galbraith, cgroups, LKML, kernel-team,
	lvenanci, Linus Torvalds, Andrew Morton

On Thu, Feb 2, 2017 at 12:06 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> This patchset implements cgroup v2 thread mode.  It is largely based
> on the discussions that we had at the plumbers last year.  Here's the
> rough outline.
>
> * Thread mode is explicitly enabled on a cgroup by writing "enable"
>   into "cgroup.threads" file.  The cgroup shouldn't have any child
>   cgroups or enabled controllers.
>
> * Once enabled, arbitrary sub-hierarchy can be created and threads can
>   be put anywhere in the subtree by writing TIDs into "cgroup.threads"
>   file.  Process granularity and no-internal-process constraint don't
>   apply in a threaded subtree.
>
> * To be used in a threaded subtree, controllers should explicitly
>   declare thread mode support and should be able to handle internal
>   competition in some way.
>
> * The root of a threaded subtree serves as the resource domain for the
>   whole subtree.  This is where all the controllers are guaranteed to
>   have a common ground and resource consumptions in the threaded
>   subtree which aren't tied to a specific thread are charged.
>   Non-threaded controllers never see beyond thread root and can assume
>   that all controllers will follow the same rules up to that point.
>
> This allows threaded controllers to implement thread granular resource
> control without getting in the way of system level resource
> partitioning.
>

I think that this is definitely a step in the right direction versus
previous proposals.
However, as proposed it feels like the API is conflating the
process/thread distinction with the core process hierarchy.  While
this does allow previous use-cases to be re-enabled, it seems to do so
at an unnecessary API cost.

As proposed, the cgroup.threads file means that threads are always
anchored in the tree by their process parent.  They may never move
past it.  E.g., if I have cgroups root/A/B, with B allowing sub-thread
moves and the parent belonging to A or B, it is clear that the child
cannot be moved beyond the parent.

Now this, in itself, is a natural restriction.  However, with this in
hand, it means that we are effectively co-mingling two hierarchies
onto the same tree: one that applies to processes, and per-process
sub-trees.

This introduces the following costs/restrictions:

1) We lose the ability to reasonably move a process.  This puts us
back to the existing challenge of the V1 API in which a thread is the
unit we can move atomically.  Hierarchies must be externally managed
and synchronized.

2) This retains all of the problems of the existing V1 API for a
process which wants to use these sub-groups to coordinate its threads.
It must coordinate its operations on these groups with the global
hierarchy (which is not consistently mounted) as well as potential
migration -- (1) above.

With the split as proposed, I fundamentally do not see the advantage
of exposing these as the same hierarchy.  By definition these .threads
files are essentially introducing independent, process-level
sub-hierarchies.

It seems greatly preferable to expose the sub-process level
hierarchies via separate path, e.g.:
  /proc/{pid, self}/thread_cgroups/

Any controllers enabled on the hierarchy that the process belongs to,
and which support thread-level operations, would appear within.  This
fully addresses (1) and (2) while allowing us to keep the unified
process-granular v2-cgroup mounts as is.

The only case that this does not support vs ".threads" would be some
hybrid where we co-mingle threads from different processes (with the
processes belonging to the same node in the hierarchy).  I'm not aware
of any usage that looks like this.

What are the motivations that you see for forcing this all onto one
mount-point via .threads sub-tree tags?


> This patchset contains the following five patches.  For more details
> on the interface and behavior, please refer to the last patch.
>
>  0001-cgroup-reorganize-cgroup.procs-task-write-path.patch
>  0002-cgroup-add-flags-to-css_task_iter_start-and-implemen.patch
>  0003-cgroup-introduce-cgroup-proc_cgrp-and-threaded-css_s.patch
>  0004-cgroup-implement-CSS_TASK_ITER_THREADED.patch
>  0005-cgroup-implement-cgroup-v2-thread-support.patch
>
> This patchset is on top of cgroup/for-4.11 63f1ca59453a ("Merge branch
> 'cgroup/for-4.11-rdmacg' into cgroup/for-4.11") and available in the
> following git branch.
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup2-threads
>
> diffstat follows.  Thanks.
>
>  Documentation/cgroup-v2.txt     |   75 ++++-
>  include/linux/cgroup-defs.h     |   38 ++
>  include/linux/cgroup.h          |   12
>  kernel/cgroup/cgroup-internal.h |    8
>  kernel/cgroup/cgroup-v1.c       |   64 +++-
>  kernel/cgroup/cgroup.c          |  589 ++++++++++++++++++++++++++++++++--------
>  kernel/cgroup/cpuset.c          |    6
>  kernel/cgroup/freezer.c         |    6
>  kernel/cgroup/pids.c            |    1
>  kernel/events/core.c            |    1
>  mm/memcontrol.c                 |    2
>  net/core/netclassid_cgroup.c    |    2
>  12 files changed, 671 insertions(+), 133 deletions(-)
>
> --
> tejun


* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-02-09 13:07 ` Paul Turner
@ 2017-02-09 14:47   ` Peter Zijlstra
  2017-02-09 15:08     ` Mike Galbraith
       [not found]     ` <CAPM31RJaJjFwenC36Abij+EdzO3KBm-DEjQ_crSmzrtrrn2N2A@mail.gmail.com>
  2017-02-10 15:46   ` Tejun Heo
  1 sibling, 2 replies; 32+ messages in thread
From: Peter Zijlstra @ 2017-02-09 14:47 UTC (permalink / raw)
  To: Paul Turner
  Cc: Tejun Heo, lizefan, Johannes Weiner, Ingo Molnar,
	Andy Lutomirski, Mike Galbraith, cgroups, LKML, kernel-team,
	lvenanci, Linus Torvalds, Andrew Morton

On Thu, Feb 09, 2017 at 05:07:16AM -0800, Paul Turner wrote:
> The only case that this does not support vs ".threads" would be some
> hybrid where we co-mingle threads from different processes (with the
> processes belonging to the same node in the hierarchy).  I'm not aware
> of any usage that looks like this.

If I understand you right; this is a fairly common thing with RT where
we would stuff all the !rt threads of the various processes in a 'misc'
bucket.

Similarly, it happens that we stuff the various rt threads of processes
in a specific (shared) 'rt' bucket.

So I would certainly not like to exclude that setup.


* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-02-09 14:47   ` Peter Zijlstra
@ 2017-02-09 15:08     ` Mike Galbraith
       [not found]     ` <CAPM31RJaJjFwenC36Abij+EdzO3KBm-DEjQ_crSmzrtrrn2N2A@mail.gmail.com>
  1 sibling, 0 replies; 32+ messages in thread
From: Mike Galbraith @ 2017-02-09 15:08 UTC (permalink / raw)
  To: Peter Zijlstra, Paul Turner
  Cc: Tejun Heo, lizefan, Johannes Weiner, Ingo Molnar,
	Andy Lutomirski, cgroups, LKML, kernel-team, lvenanci,
	Linus Torvalds, Andrew Morton

On Thu, 2017-02-09 at 15:47 +0100, Peter Zijlstra wrote:
> On Thu, Feb 09, 2017 at 05:07:16AM -0800, Paul Turner wrote:
> > The only case that this does not support vs ".threads" would be some
> > hybrid where we co-mingle threads from different processes (with the
> > processes belonging to the same node in the hierarchy).  I'm not aware
> > of any usage that looks like this.
> 
> If I understand you right; this is a fairly common thing with RT where
> we would stuff all the !rt threads of the various processes in a 'misc'
> bucket.
> 
> Similarly, it happens that we stuff the various rt threads of processes
> in a specific (shared) 'rt' bucket.
> 
> So I would certainly not like to exclude that setup.

Absolutely, you just described my daily bread performance setup.

	-Mike


* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-02-09 10:29         ` Peter Zijlstra
@ 2017-02-10 15:45           ` Tejun Heo
  2017-02-10 17:51             ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2017-02-10 15:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lizefan, hannes, mingo, pjt, luto, efault, cgroups, linux-kernel,
	kernel-team, lvenanci, Linus Torvalds, Andrew Morton

Hello,

On Thu, Feb 09, 2017 at 11:29:09AM +0100, Peter Zijlstra wrote:
> Uhm, no. They would see the exact same hierarchy, seeing how there is
> only one tree. They would have different view of it maybe, but I don't
> see how that matters, nor do you explain.

Sure, the base hierarchy is the same but different controllers would
need to see different subsets (or views) of the hierarchy.  As I wrote
before, cgroup v2 already does this to a certain extent by controllers
ignoring the hierarchy beyond certain points.  You're proposing to add
a new "view" of the hierarchy.  I'll explain why it matters below.

> > which brings in something completely new to the basic hierarchy.
> 
> I'm failing to see what.
> 
> > Different controllers seeing differing levels of the same hierarchy is
> > part of the basic behaviors
> 
> I have no idea what you mean there.

It's explained in Documentation/cgroup-v2.txt but for example, if the
whole hierarchy is,

	A - B - C
	  \ D

One controller might only see

	A - B
	  \ D

while another sees the whole thing.
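
To make this concrete, here is a toy Python model (illustrative only;
the tree and the per-controller enable sets are invented, not the
kernel's actual data structures) where a controller's view is simply
the subtree in which it is enabled:

```python
# Toy model: the base tree is shared, but each controller only "sees"
# the part of it in which that controller is enabled.

tree = {"A": ["B", "D"], "B": ["C"], "C": [], "D": []}
enabled = {                           # controller -> cgroups it is enabled in
    "memory": {"A", "B", "C", "D"},   # sees the whole thing
    "io":     {"A", "B", "D"},        # not enabled below B
}

def view(controller, node="A"):
    # return the subtree of `node` visible to `controller`, or None
    if node not in enabled[controller]:
        return None
    kids = [view(controller, child) for child in tree[node]]
    return {node: [k for k in kids if k is not None]}

print(view("memory"))  # {'A': [{'B': [{'C': []}]}, {'D': []}]}
print(view("io"))      # {'A': [{'B': []}, {'D': []}]}
```

The same base hierarchy yields a different effective tree per
controller; threaded subtrees extend this existing mechanism.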

> > and making subtrees threaded is a
> > straight-forward extension of that - threaded controllers just see
> > further into the hierarchy.  Adding threaded sub-sections in the
> > middle is more complex and frankly confusing.
> 
> I disagree, as I completely fail to see any confusion. The rules are
> simple and straight forward.
> 
> I also don't see why you would want to impose this artificial
> restriction. It doesn't get you anything. Why are you so keen on designs
> with these artificial limits on?

Because I actually understand and use this thing day in and day out?

Let's go back to the no-internal-process constraint.  The main reason
behind that is avoiding resource competition between child cgroups and
processes.  The reason why we need this is because for some resources
the terminal consumer (be that a process or task or anonymous) and the
resource domain that it belongs to (be that the system itself or a
cgroup) aren't equivalent.

If you make a memcg, put some processes in it and then create some
child cgroups, how resource should be distributed between those
processes and child cgroups is not clearly defined and can't be
controlled from userspace.  The resource control knobs in a child
cgroup governs how the resource is distributed from the parent.  For
child processes, we don't have those knobs.

There are multiple ways to deal with the problem.  We can add a
separate set of control knobs to govern resource consumption
from internal processes.  This effectively adds an implicit leaf node
to each cgroup so that internal processes or tasks always are in its
own leaf resource domain.  This however adds a lot of cruft to the
interface, the implementation gets nasty and the presented resource
hierarchy can be misleading to users.

Another option would be just letting each controller do whatever,
which is pretty much what we did in v1.  This got really bad because
the behaviors were widely inconsistent across controllers and often
implementation dependent without any way for the user to configure or
monitor what's going on.  Who gets how much becomes a matter of
accidents and people optimize for whatever arbitrary behaviors that
the kernel they're using is showing.

The no-internal-process rule establishes that resource domains are
always terminal in the resource graph for a given controller, such that
every competition along the resource hierarchy always is clearly
defined and configurable.  Only the terminal resource domains actually
host resource consumptions and they can behave analogously to a system
which doesn't have any cgroups at all.  Establishing resource domains this
way isn't the only approach to solve the problem; however, it is a
valid, simple and effective one.

Now, back to not allowing switching back and forth between resource
domains and thread subtrees.  Let's say we allow that and compose a
hierarchy as follows.  Let's say A and B are resource domains and T's
are subtrees of threads.

	A - T1 - B - T2

The resource domain controllers would see the following hierarchy.

	A - B

A will contain processes from T1 and B those from T2.  Both A and B would have
internal consumptions from the processes and the no-internal-process
constraint and thus resource domain abstraction are broken.  If we
want to support a hierarchy like that, we'll internally have to do
something like

	A - B
	 \
	  A'

Where cgroup A' contains processes from T1 and B those from T2.  Now, this is
exactly the same problem as having internal processes and can be
solved in the same ways.  The only realistic way to handle this in a
generic and consistent manner is creating a leaf cgroup to contain the
processes.  We sure can try to hide this from userspace and convolute
the interface but it can be solved *far* more elegantly by simply
requiring thread subtrees to be leaf subtrees.

And here's another point, currently, all controllers are enabled
consecutively from root.  If we have leaf thread subtrees, this still
works fine.  Resource domain controllers won't be enabled into thread
subtrees.  If we allow switching back and forth, what do we do in the
middle while we're in the thread part?  No matter what we do, it's
gonna be more confusing and we lose basic invariants like "parent
always has superset of control knobs that its child has".

If we're gonna override the above points, we gotta gain something
really substantial.

> > Let's say we can make that work but what are the use cases which would
> > require such setup where we have to alternate between thread and
> > domain modes throughout the resource hierarchy?
> 
> I would very much like to run my main workload in the root resource
> group. This means I need to have threaded subtrees at the root level.

But this is just a whim.  It isn't even a functional requirement.

> Your design would then mean I cannot run a VM (which uses all this
> cgroup muck and needs its own resource domain) for some less
> critical/isolated workload.
> 
> Now, you'll argue I should set up a subtree for the main workload; but
> why would I do that? Why would you force me into making this choice;
> which has performance penalties associated (because the root resource
> domain is special cased in a bunch of places; and because the shallower
> the cgroup tree the less overhead etc.).

Because what you want costs a lot of complexity and significantly
worsens the interface.  "I just want to do it in the root" isn't a
valid justification.  As for the runtime overhead, if you get affected
by adding a top-level cgroup in any measurable way, we need to fix
that.  That's not a valid argument for messing up the interface.

> > This will be a
> > considerable departure and added complexity from the existing
> > behaviors and code.  We gotta be achieving something significant if
> > we're doing that.  Why would we want this?
> 
> How is this a departure? I do not understand.
> 
> Why would we not want to do this? Why would we want to impose artificial
> limitations? What specifically is hard about what I propose?
> 
> You have no actual arguments on why what I propose would be hard to
> implement. As far as I can tell it should be fairly similar in
> complexity to what you already proposed.

I hope it's explained now.

> > And here's another aspect.  The currently proposed interface doesn't
> > preclude adding the behavior you're describing in the future.  Once
> > thread mode is enabled on a subtree, it isn't allowed to be disabled
> > in its proper subtree; however, if there actually are use cases which
> > require flipping it back, we can later implement the behavior and lift
> > that restriction.  I think it makes sense to start with a simple
> > model.
> 
> Your choice of flag makes it impossible to tell what is a resource
> domain and what is not in that situation.
> 
> Suppose I set the root group threaded and I create subgroups (which will
> also all have threaded set). Suppose I clear the threaded bit somewhere
> in the subtree to create a new resource group, but then immediately set
> the threaded bit again to allow that resource group to have thread
> subgroups as well. Now the entire hierarchy will have the threaded flag
> set and it becomes impossible to find the resource domains.
> 
> This is all a direct consequence of your flag not denoting the primary
> construct; eg. resource domains.

Even if we allow switching back and forth, we can't make the same
cgroup both resource domain && thread root.  Not in a sane way at
least.

> IOW; you've completely failed to convince me and my NAK stands.

You have a narrow view from a single component and have been openly
claiming and demonstrating that you are not using, are disinterested
in, and are uninformed about cgroup.  It's unfortunate and bullshit
that the whole thing is blocked on your NAK, especially when the part
you're holding hostage is something a lot of users want and won't
change no matter what we do about threads.

-- 
tejun


* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-02-09 13:07 ` Paul Turner
  2017-02-09 14:47   ` Peter Zijlstra
@ 2017-02-10 15:46   ` Tejun Heo
  1 sibling, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2017-02-10 15:46 UTC (permalink / raw)
  To: Paul Turner
  Cc: lizefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
	Andy Lutomirski, Mike Galbraith, cgroups, LKML, kernel-team,
	lvenanci, Linus Torvalds, Andrew Morton

On Thu, Feb 09, 2017 at 05:07:16AM -0800, Paul Turner wrote:
> What are the motivations that you see for forcing this all onto one
> mount-point via .threads sub-tree tags?

So, you wanted rgroup but with a /proc interface?  I'm afraid it's too
late for that.

Thanks.

-- 
tejun


* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-02-10 15:45           ` Tejun Heo
@ 2017-02-10 17:51             ` Peter Zijlstra
  2017-02-12  5:05               ` Tejun Heo
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2017-02-10 17:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, mingo, pjt, luto, efault, cgroups, linux-kernel,
	kernel-team, lvenanci, Linus Torvalds, Andrew Morton

On Fri, Feb 10, 2017 at 10:45:08AM -0500, Tejun Heo wrote:

> > > and making subtrees threaded is a
> > > straight-forward extension of that - threaded controllers just see
> > > further into the hierarchy.  Adding threaded sub-sections in the
> > > middle is more complex and frankly confusing.
> > 
> > I disagree, as I completely fail to see any confusion. The rules are
> > simple and straight forward.
> > 
> > I also don't see why you would want to impose this artificial
> > restriction. It doesn't get you anything. Why are you so keen on designs
> > with these artificial limits on?
> 
> Because I actually understand and use this thing day in and day out?

Just because you don't have the use-cases doesn't mean they're invalid.

Also, the above is effectively: "because I say so", which isn't much of
an argument.

> Let's go back to the no-internal-process constraint.  The main reason
> behind that is avoiding resource competition between child cgroups and
> processes.  The reason why we need this is because for some resources
> the terminal consumer (be that a process or task or anonymous) and the
> resource domain that it belongs to (be that the system itself or a
> cgroup) aren't equivalent.

Sure, we're past that. This isn't about what memcg can or cannot do.
Previous discussions established that controllers come in two shapes:

 - task based controllers; these are built on per-task properties and
   groups are aggregates over sets of tasks.  Since by definition
   inter-task competition is already defined on individual tasks, it's
   fairly trivial to extend the same rules to sets of tasks etc..

   Examples: cpu, cpuset, cpuacct, perf, pid, (freezer)

 - system controllers; instead of building from tasks upwards, they
   split what previously would be machine wide / global state. For these
   there is no natural competition rule vs tasks, and hence your
   no-internal-task rule.

   Examples: memcg, io, hugetlb

(I have no idea where devices, net_cls, net_prio and debug fall in this
classification, nor is that really relevant.)

Now, cgroup-v2 is entirely built around the use-case of
containerization, where you want a single hierarchy describing the
containers and their resources. Now, because of that single hierarchy
and single use-case, you let the system controllers dominate and dictate
the rules.

By doing that you've completely killed off a whole bunch of use-cases
that were possible with pure task controllers. And you seem to have a
very hard time accepting that this is a problem.

Furthermore, the argument that people who need that can continue to use
v1 doesn't work. Because v2 and v1 are mutually exclusive and do not
respect the namespace/container invariant. That is, if a controller is
used in v2, a sub-container is forced to also use v2.

Therefore it is important to fix v2 if possible or do v3 if not, such
that all use-cases can be met in a single setup that respects the
container invariant.

> Now, back to not allowing switching back and forth between resource
> domains and thread subtrees.  Let's say we allow that and compose a
> hierarchy as follows.  Let's say A and B are resource domains and T's
> are subtrees of threads.
> 
> 	A - T1 - B - T2
> 
> The resource domain controllers would see the following hierarchy.
> 
> 	A - B
> 
> A will contain processes from T1 and B those from T2.  Both A and B would have
> internal consumptions from the processes and the no-internal-process
> constraint and thus resource domain abstraction are broken.

> If we want to support a hierarchy like that, we'll internally have to
> something like
> 
> 	A - B
> 	 \
> 	  A'

Because, and it took me a little while to get here, this:

		A
              /   \
	     T1	  t1
	    / \
	   t2  B
	      / \
	     t3  T2
	         /\
		t4 t5


Ends up being this from a resource-domain PoV (because the task
controllers are hierarchical, their effective contribution collapses
onto the resource domain):

		A
	      /   \
	     B	   t1, t2
	     |
	     t3,t4,t5
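
The collapse can be sketched as a toy model (illustrative Python, not
kernel code; node and task names are taken from the diagrams above):
every task is accounted to its nearest enclosing resource domain:

```python
# Nearest-enclosing-resource-domain collapse for the tree above:
# A and B are resource domains, T1 and T2 are thread nodes.

parent = {"T1": "A", "B": "T1", "T2": "B"}
is_domain = {"A": True, "T1": False, "B": True, "T2": False}
tasks = {"A": ["t1"], "T1": ["t2"], "B": ["t3"], "T2": ["t4", "t5"]}

def domain_of(node):
    # walk up until a resource domain is reached
    while not is_domain[node]:
        node = parent[node]
    return node

collapsed = {}
for node, ts in tasks.items():
    collapsed.setdefault(domain_of(node), []).extend(ts)

print(collapsed)  # {'A': ['t1', 't2'], 'B': ['t3', 't4', 't5']}
```

Both A and B end up with tasks charged directly to them, which is how
the internal-process problem reappears.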


> Now, this is exactly the same problem as having internal processes

Indeed, bugger.

> And here's another point, currently, all controllers are enabled
> consecutively from root.  If we have leaf thread subtrees, this still
> works fine.  Resource domain controllers won't be enabled into thread
> subtrees.  If we allow switching back and forth, what do we do in the
> middle while we're in the thread part?

From what I understand you cannot re-enable a controller once it's been
disabled, right? If you disable it, it's dead for the entire subtree.

I think it would work naturally if you only allow disabling system
controllers at the resource domain levels (thread controllers can be
disabled at any point).

That means that thread nodes will have the exact same system controllers
enabled as their resource domain, which makes perfect sense, since all
tasks in the thread nodes are effectively mapped into the resource
domain for system controllers.

That is:

	A (cpu, memory)
        |
	T (memory)

is a perfectly valid setup, since all tasks under T will still use the
memory setup of A.
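
A sketch of that proposed invariant (illustrative Python; the split of
controllers into "system" and "thread" classes follows the
classification earlier in this mail, and the helper function is
invented):

```python
# Illustrative rule check: system controllers may only be disabled at
# resource-domain boundaries, so a thread node must carry exactly the
# same system controllers as its resource domain.  Thread (task-based)
# controllers such as cpu may be disabled anywhere.

SYSTEM = {"memory", "io"}  # system controllers, per the classification above

def valid_thread_node(node_ctrls, domain_ctrls):
    # the node's system-controller set must match its domain's
    return node_ctrls & SYSTEM == domain_ctrls & SYSTEM

# A (cpu, memory) with thread node T (memory): valid, as in the example.
print(valid_thread_node({"memory"}, {"cpu", "memory"}))  # True
# Dropping the system controller "memory" inside a thread subtree: invalid.
print(valid_thread_node({"cpu"}, {"cpu", "memory"}))     # False
```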

> No matter what we do, it's
> gonna be more confusing and we lose basic invariants like "parent
> always has superset of control knobs that its child has".

No, exactly that. I don't think I ever proposed something different.

The "resource domain" flag I proposed violates the no-internal-processes
thing, but it doesn't violate that rule afaict.

> > > Let's say we can make that work but what are the use cases which would
> > > require such setup where we have to alternate between thread and
> > domain modes throughout the resource hierarchy?
> > 
> > I would very much like to run my main workload in the root resource
> > group. This means I need to have threaded subtrees at the root level.
> 
> But this is just a whim.  It isn't even a functional requirement.

You're always so very quick to dismiss use-cases :/ Or do I read this
as saying that performance is not a functional requirement?

(don't bite, I know you don't mean that)

Sure, I had not seen that I violated the no internal processes rule in
the resource domain view; equally you had not made it very clear either.

> As for the runtime overhead, if you get affected by adding a top-level
> cgroup in any measureable way, we need to fix that.  That's not a
> valid argument for messing up the interface.

I think cgroup tree depth is a more significant issue; because of
hierarchy we often do tree walks (uo-to-root or down-to-task).

So creating elaborate trees is something I try not to do.

> > You have no actual arguments on why what I propose would be hard to
> > implement. As far as I can tell it should be fairly similar in
> > complexity to what you already proposed.
> 
> I hope it's explained now.

I think I got there..

> > > And here's another aspect.  The currently proposed interface doesn't
> > > preclude adding the behavior you're describing in the future.  Once
> > > thread mode is enabled on a subtree, it isn't allowed to be disabled
> > > in its proper subtree; however, if there actually are use cases which
> > > require flipping it back, we can later implemnt the behavior and lift
> > > that restriction.  I think it makes sense to start with a simple
> > > model.
> > 
> > Your choice of flag makes it impossible to tell what is a resource
> > domain and what is not in that situation.
> > 
> > Suppose I set the root group threaded and I create subgroups (which will
> > also all have threaded set). Suppose I clear the threaded bit somewhere
> > in the subtree to create a new resource group, but then immediately set
> > the threaded bit again to allow that resource group to have thread
> > subgroups as well. Now the entire hierarchy will have the threaded flag
> > set and it becomes impossible to find the resource domains.
> > 
> > This is all a direct consequence of your flag not denoting the primary
> > construct; eg. resource domains.
> 
> Even if we allow switching back and forth, we can't make the same
> cgroup both resource domain && thread root.  Not in a sane way at
> least.

The back and forth thing yes, but even with a single level, the one
resource domain you tag will be both resource domain and thread root.

> > IOW; you've completely failed to convince me and my NAK stands.
> 
> You have a narrow view from a single component and have been openly
> claiming and demonstrating that you are not using, are disinterested
> in, and are uninformed about cgroup.

I use a narrow set of cgroup-v1 capabilities not present in v2. You've
been very aggressively dismissing and ignoring those for a long time.
Given that, why should I be interested in v2?

> It's unfortunate and bullshit that the whole thing is blocked on your
> NAK, especially when the part you're holding hostage is something a
> lot of users want and won't change no matter what we do about threads.

I understand your frustration, I have plenty of my own, see the
paragraph above. Then again, I'm glad you're now more open to
discussing these things.

I don't see it as a given that things will not change until the threads
situation is solved. Call me a pessimist if you will, but I want to see
a full picture first.

In any case, let me ponder these new insights for a bit.


* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-02-10 17:51             ` Peter Zijlstra
@ 2017-02-12  5:05               ` Tejun Heo
  2017-02-12  6:59                 ` Mike Galbraith
  2017-02-14 10:35                 ` Peter Zijlstra
  0 siblings, 2 replies; 32+ messages in thread
From: Tejun Heo @ 2017-02-12  5:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lizefan, hannes, mingo, pjt, luto, efault, cgroups, linux-kernel,
	kernel-team, lvenanci, Linus Torvalds, Andrew Morton

Hello,

On Fri, Feb 10, 2017 at 06:51:45PM +0100, Peter Zijlstra wrote:
> Sure, we're past that. This isn't about what memcg can or cannot do.
> Previous discussions established that controllers come in two shapes:
> 
>  - task based controllers; these are build on per task properties and
>    groups are aggregates over sets of tasks. Since per definition inter
>    task competition is already defined on individual tasks, its fairly
>    trivial to extend the same rules to sets of tasks etc..
> 
>    Examples: cpu, cpuset, cpuacct, perf, pid, (freezer)
>
>  - system controllers; instead of building from tasks upwards, they
>    split what previously would be machine wide / global state. For these
>    there is no natural competition rule vs tasks, and hence your
>    no-internal-task rule.
> 
>    Examples: memcg, io, hugetlb

This is a bit of a digression but as I wrote before, at least cpu (and
accordingly cpuacct) won't stay purely task-based as we should account
for resource consumptions which aren't tied to specific tasks to the
matching domain (e.g. CPU consumption during writeback, disk
encryption or CPU cycles spent to receive packets).

> > And here's another point, currently, all controllers are enabled
> > consecutively from root.  If we have leaf thread subtrees, this still
> > works fine.  Resource domain controllers won't be enabled into thread
> > subtrees.  If we allow switching back and forth, what do we do in the
> > middle while we're in the thread part?
> 
> From what I understand you cannot re-enable a controller once its been
> disabled, right? If you disable it, its dead for the entire subtree.

cgroups on creation don't enable controllers by default and users can
enable and disable controllers dynamically as long as the conditions
are met.  So, they can be disabled and re-enabled.

> > No matter what we do, it's
> > gonna be more confusing and we lose basic invariants like "parent
> > always has superset of control knobs that its child has".
> 
> No, exactly that. I don't think I ever proposed something different.
>
> The "resource domain" flag I proposed violates the no-internal-processes
> thing, but it doesn't violate that rule afaict.

If we go to thread mode and back to domain mode, the control knobs for
domain controllers don't make sense on the thread part of the tree and
they won't have cgroup_subsys_state to correspond to either.  For
example,

 A - T - B

B's memcg knobs would control memory distribution from A and cgroups
in T can't have memcg knobs.  It'd be weird to indicate that memcg is
enabled in those cgroups too.

We can make it work somehow.  It's just a weird-ass interface.

> > As for the runtime overhead, if you get affected by adding a top-level
> > cgroup in any measurable way, we need to fix that.  That's not a
> > valid argument for messing up the interface.
> 
> I think cgroup tree depth is a more significant issue; because of
> hierarchy we often do tree walks (up-to-root or down-to-task).
> 
> So creating elaborate trees is something I try not to do.

So, as long as the depth stays reasonable (single digit or lower),
what we try to do is keep tree traversal operations aggregated or
located on slow paths.  There still are places that this overhead
shows up (e.g. the block controllers aren't too optimized) but it
isn't particularly difficult to make a handful of layers not matter at
all.  memcg batches the charging operations and it's impossible to
measure the overhead of several levels of hierarchy.
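A toy version of the batching memcg does (the real code uses per-cpu "stock" caches and page counters; the names and batch size here are illustrative):

```python
# A naive charge walks every ancestor, but a per-task "stock" reserves
# a whole batch from the hierarchy at once and satisfies small charges
# locally, so tree depth is only paid on refill.
BATCH = 64

class Counter:
    def __init__(self, parent=None):
        self.parent = parent
        self.usage = 0
        self.walks = 0                      # times a charge reached us

    def charge_hierarchy(self, n):
        node = self
        while node is not None:             # the expensive up-to-root walk
            node.usage += n
            node.walks += 1
            node = node.parent

class Stock:
    def __init__(self, counter):
        self.counter = counter
        self.cached = 0

    def charge(self, n):
        if self.cached < n:
            self.counter.charge_hierarchy(BATCH)   # one walk per batch
            self.cached += BATCH
        self.cached -= n

root = Counter()
leaf = Counter(Counter(Counter(root)))      # four levels deep
stock = Stock(leaf)
for _ in range(100):
    stock.charge(1)                         # 100 charges...
print(root.walks)                           # ...only 2 hierarchy walks
```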

In general, I think it's important to ensure that this is the case so
that users can use logical layouts matching the actual
resource hierarchy rather than having to twist the layout for
optimization.

> > Even if we allow switching back and forth, we can't make the same
> > cgroup both resource domain && thread root.  Not in a sane way at
> > least.
> 
> The back and forth thing yes, but even with a single level, the one
> resource domain you tag will be both resource domain and thread root.

Ah, you're right.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-02-12  5:05               ` Tejun Heo
@ 2017-02-12  6:59                 ` Mike Galbraith
  2017-02-13  5:45                   ` Mike Galbraith
  2017-02-14 10:35                 ` Peter Zijlstra
  1 sibling, 1 reply; 32+ messages in thread
From: Mike Galbraith @ 2017-02-12  6:59 UTC (permalink / raw)
  To: Tejun Heo, Peter Zijlstra
  Cc: lizefan, hannes, mingo, pjt, luto, cgroups, linux-kernel,
	kernel-team, lvenanci, Linus Torvalds, Andrew Morton

On Sun, 2017-02-12 at 14:05 +0900, Tejun Heo wrote:

> > I think cgroup tree depth is a more significant issue; because of
> > hierarchy we often do tree walks (up-to-root or down-to-task).
> > 
> > So creating elaborate trees is something I try not to do.
> 
> So, as long as the depth stays reasonable (single digit or lower),
> > what we try to do is keep tree traversal operations aggregated or
> located on slow paths.  There still are places that this overhead
> shows up (e.g. the block controllers aren't too optimized) but it
> isn't particularly difficult to make a handful of layers not matter at
> all.

A handful of cpu bean counting layers stings considerably.

homer:/abuild # pipe-test 1                          
2.010057 usecs/loop -- avg 2.010057 995.0 KHz
2.006630 usecs/loop -- avg 2.009714 995.2 KHz
2.127118 usecs/loop -- avg 2.021455 989.4 KHz
2.256244 usecs/loop -- avg 2.044934 978.0 KHz
1.993693 usecs/loop -- avg 2.039810 980.5 KHz
^C
homer:/abuild # cgexec -g cpu:hurt pipe-test 1
2.771641 usecs/loop -- avg 2.771641 721.6 KHz
2.432333 usecs/loop -- avg 2.737710 730.5 KHz
2.750493 usecs/loop -- avg 2.738988 730.2 KHz
2.663203 usecs/loop -- avg 2.731410 732.2 KHz
2.762564 usecs/loop -- avg 2.734525 731.4 KHz
^C
homer:/abuild # cgexec -g cpu:hurt/pain pipe-test 1
2.967201 usecs/loop -- avg 2.967201 674.0 KHz
3.049012 usecs/loop -- avg 2.975382 672.2 KHz
3.031226 usecs/loop -- avg 2.980966 670.9 KHz
2.954259 usecs/loop -- avg 2.978296 671.5 KHz
2.933432 usecs/loop -- avg 2.973809 672.5 KHz
^C
...
homer:/abuild # cgexec -g cpu:hurt/pain/ouch/moan/groan pipe-test 1
4.417044 usecs/loop -- avg 4.417044 452.8 KHz
4.494913 usecs/loop -- avg 4.424831 452.0 KHz
4.253861 usecs/loop -- avg 4.407734 453.7 KHz
4.378059 usecs/loop -- avg 4.404766 454.1 KHz
4.179895 usecs/loop -- avg 4.382279 456.4 KHz
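Reading the "avg" columns off the runs above gives a rough per-layer cost (a back-of-envelope calculation on the final averages, not a rigorous fit):

```python
# Final "avg" figures from the pipe-test runs above, keyed by cgroup depth.
avg_usecs = {0: 2.039810, 1: 2.734525, 2: 2.973809, 5: 4.382279}

per_layer = (avg_usecs[5] - avg_usecs[0]) / 5
slowdown = avg_usecs[5] / avg_usecs[0]
print(round(per_layer, 3))   # ~0.468 usecs added per layer
print(round(slowdown, 2))    # ~2.15x slower at five layers
```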


* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
       [not found]     ` <CAPM31RJaJjFwenC36Abij+EdzO3KBm-DEjQ_crSmzrtrrn2N2A@mail.gmail.com>
@ 2017-02-13  5:28       ` Mike Galbraith
  0 siblings, 0 replies; 32+ messages in thread
From: Mike Galbraith @ 2017-02-13  5:28 UTC (permalink / raw)
  To: Paul Turner, Peter Zijlstra
  Cc: Tejun Heo, lizefan, Johannes Weiner, Ingo Molnar,
	Andy Lutomirski, cgroups, LKML, kernel-team, lvenanci,
	Linus Torvalds, Andrew Morton

On Sun, 2017-02-12 at 13:16 -0800, Paul Turner wrote:
> 
> 
> On Thursday, February 9, 2017, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, Feb 09, 2017 at 05:07:16AM -0800, Paul Turner wrote:
> > > The only case that this does not support vs ".threads" would be some
> > > hybrid where we co-mingle threads from different processes (with the
> > > processes belonging to the same node in the hierarchy).  I'm not aware
> > > of any usage that looks like this.
> > 
> > If I understand you right; this is a fairly common thing with RT where
> > we would stuff all the !rt threads of the various processes in a 'misc'
> > bucket.
> > 
> > Similarly, it happens that we stuff the various rt threads of processes
> > in a specific (shared) 'rt' bucket.
> > 
> > So I would certainly not like to exclude that setup.
> > 
> 
> Unless you're using rt groups I'm not sure this one really changes anything.
> Whether the "misc" threads exist at the parent level or one below
> should not matter.

(with exclusive cpusets, a mask can exist at one and only one location)


* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-02-12  6:59                 ` Mike Galbraith
@ 2017-02-13  5:45                   ` Mike Galbraith
  2017-03-13 19:26                     ` Tejun Heo
  0 siblings, 1 reply; 32+ messages in thread
From: Mike Galbraith @ 2017-02-13  5:45 UTC (permalink / raw)
  To: Tejun Heo, Peter Zijlstra
  Cc: lizefan, hannes, mingo, pjt, luto, cgroups, linux-kernel,
	kernel-team, lvenanci, Linus Torvalds, Andrew Morton

On Sun, 2017-02-12 at 07:59 +0100, Mike Galbraith wrote:
> On Sun, 2017-02-12 at 14:05 +0900, Tejun Heo wrote:
> 
> > > I think cgroup tree depth is a more significant issue; because of
> > > hierarchy we often do tree walks (up-to-root or down-to-task).
> > > 
> > > So creating elaborate trees is something I try not to do.
> > 
> > So, as long as the depth stays reasonable (single digit or lower),
> > > what we try to do is keep tree traversal operations aggregated or
> > located on slow paths.  There still are places that this overhead
> > shows up (e.g. the block controllers aren't too optimized) but it
> > isn't particularly difficult to make a handful of layers not matter at
> > all.
> 
> A handful of cpu bean counting layers stings considerably.

BTW, that overhead is also why merging cpu/cpuacct is not really as
wonderful as it may seem on paper.  If you only want to account, you
may not have anything to gain from group scheduling (in fact it may
wreck performance), but you'll pay for it.
 
> homer:/abuild # pipe-test 1                          
> 2.010057 usecs/loop -- avg 2.010057 995.0 KHz
> 2.006630 usecs/loop -- avg 2.009714 995.2 KHz
> 2.127118 usecs/loop -- avg 2.021455 989.4 KHz
> 2.256244 usecs/loop -- avg 2.044934 978.0 KHz
> 1.993693 usecs/loop -- avg 2.039810 980.5 KHz
> ^C
> homer:/abuild # cgexec -g cpu:hurt pipe-test 1
> 2.771641 usecs/loop -- avg 2.771641 721.6 KHz
> 2.432333 usecs/loop -- avg 2.737710 730.5 KHz
> 2.750493 usecs/loop -- avg 2.738988 730.2 KHz
> 2.663203 usecs/loop -- avg 2.731410 732.2 KHz
> 2.762564 usecs/loop -- avg 2.734525 731.4 KHz
> ^C
> homer:/abuild # cgexec -g cpu:hurt/pain pipe-test 1
> 2.967201 usecs/loop -- avg 2.967201 674.0 KHz
> 3.049012 usecs/loop -- avg 2.975382 672.2 KHz
> 3.031226 usecs/loop -- avg 2.980966 670.9 KHz
> 2.954259 usecs/loop -- avg 2.978296 671.5 KHz
> 2.933432 usecs/loop -- avg 2.973809 672.5 KHz
> ^C
> ...
> homer:/abuild # cgexec -g cpu:hurt/pain/ouch/moan/groan pipe-test 1
> 4.417044 usecs/loop -- avg 4.417044 452.8 KHz
> 4.494913 usecs/loop -- avg 4.424831 452.0 KHz
> 4.253861 usecs/loop -- avg 4.407734 453.7 KHz
> 4.378059 usecs/loop -- avg 4.404766 454.1 KHz
> 4.179895 usecs/loop -- avg 4.382279 456.4 KHz


* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-02-12  5:05               ` Tejun Heo
  2017-02-12  6:59                 ` Mike Galbraith
@ 2017-02-14 10:35                 ` Peter Zijlstra
  2017-03-13 20:05                   ` Tejun Heo
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2017-02-14 10:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, mingo, pjt, luto, efault, cgroups, linux-kernel,
	kernel-team, lvenanci, Linus Torvalds, Andrew Morton

On Sun, Feb 12, 2017 at 02:05:44PM +0900, Tejun Heo wrote:
> Hello,
> 
> On Fri, Feb 10, 2017 at 06:51:45PM +0100, Peter Zijlstra wrote:
> > Sure, we're past that. This isn't about what memcg can or cannot do.
> > Previous discussions established that controllers come in two shapes:
> > 
> >  - task based controllers; these are built on per-task properties and
> >    groups are aggregates over sets of tasks. Since by definition
> >    inter-task competition is already defined on individual tasks, it's
> >    fairly trivial to extend the same rules to sets of tasks etc..
> > 
> >    Examples: cpu, cpuset, cpuacct, perf, pid, (freezer)
> >
> >  - system controllers; instead of building from tasks upwards, they
> >    split what previously would be machine wide / global state. For these
> >    there is no natural competition rule vs tasks, and hence your
> >    no-internal-task rule.
> > 
> >    Examples: memcg, io, hugetlb
> 
> This is a bit of delta but as I wrote before, at least cpu (and
> accordingly cpuacct) won't stay purely task-based as we should account
> for resource consumptions which aren't tied to specific tasks to the
> matching domain (e.g. CPU consumption during writeback, disk
> encryption or CPU cycles spent to receive packets).

We should probably do that in another thread, but I'd probably insist on
separate controllers that co-operate to get that done.

> > > And here's another point, currently, all controllers are enabled
> > > consecutively from root.  If we have leaf thread subtrees, this still
> > > works fine.  Resource domain controllers won't be enabled into thread
> > > subtrees.  If we allow switching back and forth, what do we do in the
> > > middle while we're in the thread part?
> > 
> > From what I understand you cannot re-enable a controller once it's been
> > disabled, right? If you disable it, it's dead for the entire subtree.
> 
> cgroups on creation don't enable controllers by default and users can
> enable and disable controllers dynamically as long as the conditions
> are met.  So, they can be disabled and re-enabled.

I was talking in a hierarchical sense; your section 2-4-2 (Top-Down
Constraint) seems to state similar things to what I meant.

Once you disable a controller it cannot be re-enabled in a subtree.

> > > No matter what we do, it's
> > > gonna be more confusing and we lose basic invariants like "parent
> > > always has superset of control knobs that its child has".
> > 
> > No, exactly that. I don't think I ever proposed something different.
> >
> > The "resource domain" flag I proposed violates the no-internal-processes
> > thing, but it doesn't violate that rule afaict.
> 
> If we go to thread mode and back to domain mode, the control knobs for
> domain controllers don't make sense on the thread part of the tree and
> they won't have cgroup_subsys_state to correspond to either.  For
> example,
> 
>  A - T - B
> 
> B's memcg knobs would control memory distribution from A and cgroups
> in T can't have memcg knobs.  It'd be weird to indicate that memcg is
> enabled in those cgroups too.

But memcg _is_ enabled for T. All the tasks are mapped onto A for
purpose of the system controller (memcg) and are subject to its
constraints.

> We can make it work somehow.  It's just a weird-ass interface.

You could make these control files (read-only?) symlinks back to A's
actual control files. To more explicitly show this.

> > > As for the runtime overhead, if you get affected by adding a top-level
> > > cgroup in any measurable way, we need to fix that.  That's not a
> > > valid argument for messing up the interface.
> > 
> > I think cgroup tree depth is a more significant issue; because of
> > hierarchy we often do tree walks (up-to-root or down-to-task).
> > 
> > So creating elaborate trees is something I try not to do.
> 
> So, as long as the depth stays reasonable (single digit or lower),
> what we try to do is keep tree traversal operations aggregated or
> located on slow paths. 

While at the same time you allowed that BPF cgroup thing to not be
hierarchical because iterating the tree was too expensive; or did I
misunderstand?

Also, I think Mike showed you the pain and hurt are quite visible for
even a few levels.

Batching is tricky, you need to somehow bound the error function in
order to not become too big a factor on behaviour. Esp. for cpu, cpuacct
obviously doesn't care much as it doesn't enforce anything.

> In general, I think it's important to ensure that this is the case so
> that users can use logical layouts matching the actual
> resource hierarchy rather than having to twist the layout for
> optimization.

One does what one can.. But it is important to understand the
constraints, nothing comes for free.

> > > Even if we allow switching back and forth, we can't make the same
> > > cgroup both resource domain && thread root.  Not in a sane way at
> > > least.
> > 
> > The back and forth thing yes, but even with a single level, the one
> > resource domain you tag will be both resource domain and thread root.
> 
> Ah, you're right.

Also, there is the one giant wart in v2 wrt no-internal-processes;
namely the root group is allowed to have them.

Now I understand why this is so; so don't feel compelled to explain that
again, but it does make the model very ugly and has a real problem, see
below. OTOH, since it is there, I would very much like to make use of
this 'feature' and allow a thread-group on the root group.

And since you then _can_ have nested thread groups, it again becomes
very important to be able to find the resource domains, which brings me
back to my proposal; albeit with an additional constraint.

Now on to the problem of the no-internal-processes wart; how does
cgroup-v2 currently implement the whole container invariant? Because by
that invariant, a container's 'root' group must also allow
internal-processes.


* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-02-13  5:45                   ` Mike Galbraith
@ 2017-03-13 19:26                     ` Tejun Heo
  2017-03-14 14:45                       ` Mike Galbraith
  0 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2017-03-13 19:26 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, lizefan, hannes, mingo, pjt, luto, cgroups,
	linux-kernel, kernel-team, lvenanci, Linus Torvalds,
	Andrew Morton

Hello, Mike.

Sorry about the long delay.

On Mon, Feb 13, 2017 at 06:45:07AM +0100, Mike Galbraith wrote:
> > > So, as long as the depth stays reasonable (single digit or lower),
> > > what we try to do is keep tree traversal operations aggregated or
> > > located on slow paths.  There still are places that this overhead
> > > shows up (e.g. the block controllers aren't too optimized) but it
> > > isn't particularly difficult to make a handful of layers not matter at
> > > all.
> > 
> > A handful of cpu bean counting layers stings considerably.

Hmm... yeah, I was trying to think about ways to avoid full scheduling
overhead at each layer (the scheduler does a lot at each layer of
scheduling) but I don't think it's possible to circumvent that without
introducing a whole lot of scheduling artifacts.

In a lot of workloads, the added overhead from several layers of CPU
controllers doesn't seem to get in the way too much (most threads do
something other than scheduling after all).  The only major issue that
we're seeing in the fleet is the cgroup iteration in idle rebalancing
code pushing up the scheduling latency too much but that's a different
issue.

Anyways, I understand that there are cases where people would want to
avoid any extra layers.  I'll continue on PeterZ's message.

> BTW, that overhead is also why merging cpu/cpuacct is not really as
> wonderful as it may seem on paper.  If you only want to account, you
> may not have anything to gain from group scheduling (in fact it may
> wreck performance), but you'll pay for it.

There's another reason why we would want accounting separate: the
weight-based controllers, cpu and io currently, can't be enabled
without affecting scheduling behavior.  Pure accounting, however, is
different in that all the heavy parts of the operation can be shifted
to the readers (we just need to do per-cpu updates from the hot
paths), so we might as well publish those stats by default on the v2
hierarchy.  We couldn't do the same in v1 because the number of
hierarchies wasn't limited.
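The reader-pays pattern referred to above might look like this in miniature (illustrative names, not the kernel's percpu API):

```python
# Hot paths do only a per-cpu add; a reader walks all per-cpu slots
# and sums them, so the aggregation cost lands on the slow read path.
NR_CPUS = 4

class PerCpuCounter:
    def __init__(self):
        self.slots = [0] * NR_CPUS

    def add(self, cpu, delta):      # hot path: one local write
        self.slots[cpu] += delta

    def read(self):                 # slow path: reader aggregates
        return sum(self.slots)

usage = PerCpuCounter()
usage.add(0, 100)
usage.add(3, 50)
usage.add(0, 25)
print(usage.read())   # prints 175
```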

Thanks.

-- 
tejun


* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-02-14 10:35                 ` Peter Zijlstra
@ 2017-03-13 20:05                   ` Tejun Heo
  2017-03-21 12:39                     ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2017-03-13 20:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lizefan, hannes, mingo, pjt, luto, efault, cgroups, linux-kernel,
	kernel-team, lvenanci, Linus Torvalds, Andrew Morton

Hey, Peter.  Sorry about the long delay.

On Tue, Feb 14, 2017 at 11:35:41AM +0100, Peter Zijlstra wrote:
> > This is a bit of delta but as I wrote before, at least cpu (and
> > accordingly cpuacct) won't stay purely task-based as we should account
> > for resource consumptions which aren't tied to specific tasks to the
> > matching domain (e.g. CPU consumption during writeback, disk
> > encryption or CPU cycles spent to receive packets).
> 
> We should probably do that in another thread, but I'd probably insist on
> separate controllers that co-operate to get that done.

Let's shelve this for now.

> > cgroups on creation don't enable controllers by default and users can
> > enable and disable controllers dynamically as long as the conditions
> > are met.  So, they can be disabled and re-enabled.
> 
> I was talking in a hierarchical sense; your section 2-4-2 (Top-Down
> Constraint) seems to state similar things to what I meant.
> 
> Once you disable a controller it cannot be re-enabled in a subtree.

Ah, yeah, you can't jump across parts of the tree.

> > If we go to thread mode and back to domain mode, the control knobs for
> > domain controllers don't make sense on the thread part of the tree and
> > they won't have cgroup_subsys_state to correspond to either.  For
> > example,
> > 
> >  A - T - B
> > 
> > B's memcg knobs would control memory distribution from A and cgroups
> > in T can't have memcg knobs.  It'd be weird to indicate that memcg is
> > enabled in those cgroups too.
> 
> But memcg _is_ enabled for T. All the tasks are mapped onto A for
> purpose of the system controller (memcg) and are subject to its
> constraints.

Sure, T is contained in A but think about the interface.  For memcg, T
belongs to A.  B is the first descendant when viewed from memcg, which
brings about two problems: memcg doesn't have control knobs to assign
throughout T, which is just weird, and there's no way to configure how
T competes against B.

> > We can make it work somehow.  It's just a weird-ass interface.
> 
> You could make these control files (read-only?) symlinks back to A's
> actual control files. To more explicitly show this.

But the knobs are supposed to control how much resource a child gets
from its parent.  Flipping that over while walking down the same tree
sounds horribly ugly and confusing to me.  Besides, that doesn't solve
the problem of lacking the ability to configure T's consumption
against B's.

> > So, as long as the depth stays reasonable (single digit or lower),
> > what we try to do is keep tree traversal operations aggregated or
> > located on slow paths. 
> 
> While at the same time you allowed that BPF cgroup thing to not be
> hierarchical because iterating the tree was too expensive; or did I
> misunderstand?

That was more because it was supposed to be part of bpf (network or
whatever) and just followed the model of table matching "is the target
under this hierarchy?".  That's where it came from after all.
Hierarchical walking can be added but it's more work (defining the
iteration direction and rules) and doesn't bring benefits without
working delegation.

If it were a cgroup controller, it should have been fully hierarchical
no matter what but that involves solving bpf delegation first.

> Also, I think Mike showed you the pain and hurt are quite visible for
> even a few levels.
> 
> Batching is tricky, you need to somehow bound the error function in
> order to not become too big a factor on behaviour. Esp. for cpu, cpuacct
> obviously doesn't care much as it doesn't enforce anything.

Yeah, I thought about this for quite a while but I couldn't think of
any easy way of circumventing overhead without introducing a lot of
scheduling artifacts (e.g. multiplying down the weights to practically
collapse multiple levels of the hierarchy), at least for the weight-based
control, which is what most people use.  It looks like the only way to
lower the overhead there is making generic scheduling cheaper but that
still means that multi-level will always be noticeably more expensive
in terms of raw scheduling performance.
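The "multiply down the weights" idea can be sketched as follows (illustrative only: a task's effective share is the product of weight fractions along its path, so the tree could in principle be collapsed into one flat level with those products as weights, at the cost of the scheduling artifacts mentioned above):

```python
def flatten(node, frac=1.0, out=None):
    """Collapse a nested weight tree into flat per-task shares.

    node is ("task", weight, name) or ("group", weight, children); a
    task's effective share is the product of weight fractions on the
    path from the root.
    """
    if out is None:
        out = {}
    kind, weight, rest = node
    if kind == "task":
        out[rest] = frac
        return out
    total = sum(child[1] for child in rest)
    for child in rest:
        flatten(child, frac * child[1] / total, out)
    return out

root = ("group", 1024, [
    ("group", 1024, [("task", 1024, "t1"), ("task", 1024, "t2")]),
    ("group", 1024, [("task", 1024, "t3")]),
])
print(flatten(root))   # prints {'t1': 0.25, 't2': 0.25, 't3': 0.5}
```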

Scheduling hackbench is an extreme case tho and in practice at least
we're not seeing noticeable issues with a few levels of nesting when
the workload actually spends cpu cycles doing things other than
scheduling.  However, we're seeing significant increase in scheduling
latency coming from how cgroups are handled from the rebalance path.
I'm still looking into it and will write about that in a separate
thread.

> > In general, I think it's important to ensure that this is the case so
> > that users can use logical layouts matching the actual
> > resource hierarchy rather than having to twist the layout for
> > optimization.
> 
> One does what one can.. But it is important to understand the
> constraints, nothing comes for free.

Yeah, for sure.

> Also, there is the one giant wart in v2 wrt no-internal-processes;
> namely the root group is allowed to have them.
> 
> Now I understand why this is so; so don't feel compelled to explain that
> again, but it does make the model very ugly and has a real problem, see
> below. OTOH, since it is there, I would very much like to make use of
> this 'feature' and allow a thread-group on the root group.
> 
> And since you then _can_ have nested thread groups, it again becomes
> very important to be able to find the resource domains, which brings me
> back to my proposal; albeit with an addition constraint.

I've thought quite a bit about ways to allow thread granularity from
the top while still presenting a consistent picture to resource domain
controllers.  That's what's missing from the CPU controller side:
given Mike's claim that there's unavoidable overhead in nesting the
CPU controller, requiring at least one level of nesting on cgroup v2
for thread granularity might not be acceptable.

Going back to why thread support on cgroup v2 was needed in the first
place, it was to allow thread level control while cooperating with
other controllers on v2.  IOW, allowing thread level control for CPU
while cooperating with resource domain type controllers.

Now, going back to allowing thread hierarchies from the root, given
that their resource domain can only be root, which is exactly what you
get when CPU is mounted on a separate hierarchy, it seems kinda moot.
The practical constraint with the current scheme is that in cases
where other resource domain controllers need to be used, the thread
hierarchies would have to be nested at least one level, but if you
don't want any resource domain things, that's the same as mounting the
controller separately.

> Now on to the problem of the no-internal-processes wart; how does
> cgroup-v2 currently implement the whole container invariant? Because by
> that invariant, a container's 'root' group must also allow
> internal-processes.

I'm not sure I follow the question here.  What's the "whole container
invariant"?

Thanks.

-- 
tejun


* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-03-13 19:26                     ` Tejun Heo
@ 2017-03-14 14:45                       ` Mike Galbraith
  0 siblings, 0 replies; 32+ messages in thread
From: Mike Galbraith @ 2017-03-14 14:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, lizefan, hannes, mingo, pjt, luto, cgroups,
	linux-kernel, kernel-team, lvenanci, Linus Torvalds,
	Andrew Morton

On Mon, 2017-03-13 at 15:26 -0400, Tejun Heo wrote:
> Hello, Mike.
> 
> Sorry about the long delay.
> 
> On Mon, Feb 13, 2017 at 06:45:07AM +0100, Mike Galbraith wrote:
> > > > So, as long as the depth stays reasonable (single digit or lower),
> > > > what we try to do is keep tree traversal operations aggregated or
> > > > located on slow paths.  There still are places that this overhead
> > > > shows up (e.g. the block controllers aren't too optimized) but it
> > > > isn't particularly difficult to make a handful of layers not matter at
> > > > all.
> > > 
> > > A handful of cpu bean counting layers stings considerably.
> 
> Hmm... yeah, I was trying to think about ways to avoid full scheduling
> overhead at each layer (the scheduler does a lot per each layer of
> scheduling) but don't think it's possible to circumvent that without
> introducing a whole lot of scheduling artifacts.

Yup.

> In a lot of workloads, the added overhead from several layers of CPU
> controllers doesn't seem to get in the way too much (most threads do
> something other than scheduling after all).

Sure, don't schedule a lot, it doesn't hurt much, but there are plenty
of loads that routinely do schedule a LOT, and there it matters a LOT..
which is why network benchmarks tend to be severely allergic to
scheduler lard.

>   The only major issue that
> we're seeing in the fleet is the cgroup iteration in idle rebalancing
> code pushing up the scheduling latency too much but that's a different
> issue.

Hm, I would suspect PELT to be the culprit there.  It helps smooth out
load balancing, but will stack "skinny looking" tasks.

	-Mike


* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-03-13 20:05                   ` Tejun Heo
@ 2017-03-21 12:39                     ` Peter Zijlstra
  2017-03-22 14:52                       ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2017-03-21 12:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, mingo, pjt, luto, efault, cgroups, linux-kernel,
	kernel-team, lvenanci, Linus Torvalds, Andrew Morton

On Mon, Mar 13, 2017 at 04:05:44PM -0400, Tejun Heo wrote:
> Hey, Peter.  Sorry about the long delay.

No worries; we're all (too) busy.


> > > If we go to thread mode and back to domain mode, the control knobs for
> > > domain controllers don't make sense on the thread part of the tree and
> > > they won't have cgroup_subsys_state to correspond to either.  For
> > > example,
> > > 
> > >  A - T - B
> > > 
> > > B's memcg knobs would control memory distribution from A and cgroups
> > > in T can't have memcg knobs.  It'd be weird to indicate that memcg is
> > > enabled in those cgroups too.
> > 
> > But memcg _is_ enabled for T. All the tasks are mapped onto A for
> > purpose of the system controller (memcg) and are subject to its
> > constraints.
> 
> Sure, T is contained in A but think about the interface.  For memcg, T
> belongs to A.  B is the first descendant when viewed from memcg, which
> brings about two problems - memcg doesn't have control knobs to assign
> throughout T which is just weird and there's no way to configure how T
> competes against B.
> 
> > > We can make it work somehow.  It's just a weird-ass interface.
> > 
> > You could make these control files (read-only?) symlinks back to A's
> > actual control files. To more explicitly show this.
> 
> But the knobs are supposed to control how much resource a child gets
> from its parent.  Flipping that over while walking down the same tree
> sounds horribly ugly and confusing to me.  Besides, that doesn't solve
> the problem of lacking the ability to configure T's consumption
> against B's.

So I'm not confused; and I suspect you're not either. But you're worried
about 'simple' people getting confused?

The rules really are fairly straightforward; but yes, it will be a
little more involved than without this. But note that this is an
optional thing, people don't have to make thread groups if they don't
want to. And they further don't have to use non-leaf thread groups.

And at some point, there's no helping stupid; and trying to do so will
only make you insane.

So the fundamental thing to realize (and explain) is that there are two
different types of controllers; and that they operate on different views
of the hierarchy.

I think our goal as a kernel API should be presenting the capabilities
in a concise and consistent manner; and I feel that the proposed
interface is that.

So the points you raise above, about system controller knobs in thread
groups and competition between thread and system groups as seen by
system controllers, stem from not considering those views.

And yes, having to consider views is new and a direct consequence of
this new optional feature. But I don't see how it's a problem.


> Scheduling hackbench is an extreme case tho and in practice at least
> we're not seeing noticeable issues with a few levels of nesting when
> the workload actually spends cpu cycles doing things other than
> scheduling.

Right; most workloads don't schedule _that_ much; and the overhead isn't
too painful.

> However, we're seeing significant increase in scheduling
> latency coming from how cgroups are handled from the rebalance path.
> I'm still looking into it and will write about that in a separate
> thread.

I have some vague memories of this being a pain point. IIRC it comes
down to the problem that latency is an absolute measure and the weight
is a relative thing.

I think we mucked about with it some many years ago; but haven't done so
recently.

> > Also, there is the one giant wart in v2 wrt no-internal-processes;
> > namely the root group is allowed to have them.
> > 
> > Now I understand why this is so; so don't feel compelled to explain that
> > again, but it does make the model very ugly and has a real problem, see
> > below. OTOH, since it is there, I would very much like to make use of
> > this 'feature' and allow a thread-group on the root group.
> > 
> > And since you then _can_ have nested thread groups, it again becomes
> > very important to be able to find the resource domains, which brings me
> > back to my proposal; albeit with an additional constraint.
> 
> I've thought quite a bit about ways to allow thread granularity from
> the top while still presenting a consistent picture to resource domain
> controllers.  That's what's missing from the CPU controller side given
> Mike's claim that there's unavoidable overhead in nesting CPU
> controller and requiring at least one level of nesting on cgroup v2
> for thread granularity might not be acceptable.
> 
> Going back to why thread support on cgroup v2 was needed in the first
> place, it was to allow thread level control while cooperating with
> other controllers on v2.  IOW, allowing thread level control for CPU
> while cooperating with resource domain type controllers.

Well, not only CPU, I can see the same being used for perf for example.

> Now, going back to allowing thread hierarchies from the root, given
> that their resource domain can only be root, which is exactly what you
> get when CPU is mounted on a separate hierarchy, it seems kinda moot.

Not quite; see below on the container thing.

> The practical constraint with the current scheme is that in cases
> where other resource domain controllers need to be used, the thread
> hierarchies would have to be nested at least one level, but if you
> don't want any resource domain things, that's the same as mounting the
> controller separately.
> 
> > Now on to the problem of the no-internal-processes wart; how does
> > cgroup-v2 currently implement the whole container invariant? Because by
> > that invariant, a container's 'root' group must also allow
> > internal-processes.
> 
> I'm not sure I follow the question here.  What's the "whole container
> invariant"?

The container invariant is that everything inside a container looks and
works *exactly* like a 'real' system.

Containers do this with namespace; so the PID namespace 'hides' all
processes not part of its namespace and has an independent PID number;
such that we can start at 1 again for our init; with that it also has a
new child reaper etc..
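As a quick illustration of the PID namespace side (a sketch using
util-linux's unshare(1); with user namespaces it can run unprivileged
on many configurations, otherwise it needs root):

```shell
# Create new user+PID namespaces and remount /proc; the first process
# in the new PID namespace sees itself as PID 1 and acts as the child
# reaper for everything inside.
unshare --user --map-root-user --pid --fork --mount-proc \
    sh -c 'echo "my pid: $$"; ps -o pid,comm'
```

Inside, $$ reports 1 and ps only lists processes living in the new
namespace.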

Similarly, the cgroup namespace should hide everything outside its
subtree; but it should also provide a full new root cgroup, which
_should_ include the no-internal-processes exception.
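Concretely, something like this (a sketch; "ctn1" is a made-up group
name, and this needs root on a cgroup v2 mount):

```shell
# Move the shell into the container's cgroup, then unshare the cgroup
# namespace: /proc/self/cgroup paths are shown relative to the new
# namespace root, so the container sees itself at "/".
echo $$ > /sys/fs/cgroup/ctn1/cgroup.procs
unshare --cgroup cat /proc/self/cgroup   # "0::/" rather than "0::/ctn1"
```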

Another constraint is the whole controller mounting nonsense; unless
you would allow containers to (re)mount cgroup controllers differently
(an unlikely case, I feel), containers are constrained to whatever
mount options the host kernel got dealt.

This effectively means that controller mount options are not a viable
configuration mechanism.

This is really important for things like Docker and related. They must
assume some standard setup of cgroups or otherwise cannot use it at all.

But even aside from that, the mount thing is a fairly static and
inflexible configuration. What if you have two workloads that require
a different setup on the same machine?
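For contrast, v2's answer to per-workload configuration is
per-subtree rather than per-mount: each parent picks which controllers
its children see via cgroup.subtree_control. A sketch with made-up
group names, needs root:

```shell
# Two workloads on one hierarchy, each with a different controller setup.
mkdir /sys/fs/cgroup/wl-a /sys/fs/cgroup/wl-b
# Enable controllers for the children of the root...
echo "+memory +io" > /sys/fs/cgroup/cgroup.subtree_control
# ...then give each workload's subtree its own subset.
echo "+memory"     > /sys/fs/cgroup/wl-a/cgroup.subtree_control
echo "+memory +io" > /sys/fs/cgroup/wl-b/cgroup.subtree_control
```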


* Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
  2017-03-21 12:39                     ` Peter Zijlstra
@ 2017-03-22 14:52                       ` Peter Zijlstra
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2017-03-22 14:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, hannes, mingo, pjt, luto, efault, cgroups, linux-kernel,
	kernel-team, lvenanci, Linus Torvalds, Andrew Morton

On Tue, Mar 21, 2017 at 01:39:58PM +0100, Peter Zijlstra wrote:
> 
> And yes, having to consider views is new and a direct consequence of
> this new optional feature. But I don't see how it's a problem.
> 

So aside from having (RO) links in thread groups for system controllers,
we could also have a ${controller}_parent link back to whatever group is
the actual parent for that specific controller's view.

So then your B's memcg_parent would point to A, not T.

But I feel this is all superfluous window dressing; still, if you want
to clarify the filesystem interface, it could be something to consider.
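For the record, a hypothetical layout under that scheme (none of this
exists today; "memcg_parent" is a made-up name):

```
/sys/fs/cgroup/A/                   <- resource domain (memcg etc.)
/sys/fs/cgroup/A/T/                 <- thread root
/sys/fs/cgroup/A/T/B/               <- thread group
/sys/fs/cgroup/A/T/B/memcg_parent   -> ../../  (RO link to A, B's memory domain)
```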


end of thread, other threads:[~2017-03-22 14:52 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-02 20:06 [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode Tejun Heo
2017-02-02 20:06 ` [PATCH 1/5] cgroup: reorganize cgroup.procs / task write path Tejun Heo
2017-02-02 20:06 ` [PATCH 2/5] cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS Tejun Heo
2017-02-02 20:06 ` [PATCH 3/5] cgroup: introduce cgroup->proc_cgrp and threaded css_set handling Tejun Heo
2017-02-02 20:06 ` [PATCH 4/5] cgroup: implement CSS_TASK_ITER_THREADED Tejun Heo
2017-02-02 20:06 ` [PATCH 5/5] cgroup: implement cgroup v2 thread support Tejun Heo
2017-02-02 21:32 ` [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode Andy Lutomirski
2017-02-02 21:52   ` Tejun Heo
2017-02-03 21:10     ` Andy Lutomirski
2017-02-03 21:56       ` Tejun Heo
2017-02-06  9:50     ` Peter Zijlstra
2017-02-03 20:20 ` Peter Zijlstra
2017-02-03 20:59   ` Tejun Heo
2017-02-06 12:49     ` Peter Zijlstra
2017-02-08 23:08       ` Tejun Heo
2017-02-09 10:29         ` Peter Zijlstra
2017-02-10 15:45           ` Tejun Heo
2017-02-10 17:51             ` Peter Zijlstra
2017-02-12  5:05               ` Tejun Heo
2017-02-12  6:59                 ` Mike Galbraith
2017-02-13  5:45                   ` Mike Galbraith
2017-03-13 19:26                     ` Tejun Heo
2017-03-14 14:45                       ` Mike Galbraith
2017-02-14 10:35                 ` Peter Zijlstra
2017-03-13 20:05                   ` Tejun Heo
2017-03-21 12:39                     ` Peter Zijlstra
2017-03-22 14:52                       ` Peter Zijlstra
2017-02-09 13:07 ` Paul Turner
2017-02-09 14:47   ` Peter Zijlstra
2017-02-09 15:08     ` Mike Galbraith
     [not found]     ` <CAPM31RJaJjFwenC36Abij+EdzO3KBm-DEjQ_crSmzrtrrn2N2A@mail.gmail.com>
2017-02-13  5:28       ` Mike Galbraith
2017-02-10 15:46   ` Tejun Heo
