* [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
From: Tejun Heo @ 2016-03-11 15:41 UTC
  To: torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt
  Cc: linux-kernel, cgroups, linux-api, kernel-team

Hello,

This patchset extends cgroup v2 to support rgroup (resource group)
for in-process hierarchical resource control and, on top of that,
implements PRIO_RGRP for setpriority(2) to allow seamless in-process
hierarchical control of CPU cycles.

cgroup v1 allowed putting threads of a process in different cgroups,
which enabled ad-hoc in-process control of some resources.
Unfortunately, this approach was fraught with problems such as
membership ambiguity with per-process resources and lack of isolation
between system management and in-process properties.  For a more
detailed discussion on the subject, please refer to the following
message.

 [1] [RFD] cgroup: thread granularity support for cpu controller

This patchset implements the mechanism outlined in the above message.
The new mechanism is named rgroup (resource group).  When a
non-rgroup cgroup needs to be designated explicitly, the term sgroup
(system group) is used.  rgroups have the following properties.

* A rgroup is a cgroup which is invisible on and transparent to the
  system-level cgroupfs interface.

* A rgroup can be created by specifying the CLONE_NEWRGRP flag, along
  with CLONE_THREAD, during clone(2); see the sketch after this list.
  A new rgroup is created under the parent thread's cgroup and the
  new thread is created in it.

* A rgroup is automatically destroyed when empty.

* A top-level rgroup of a process is a rgroup whose parent cgroup is a
  sgroup.  A process may have multiple top-level rgroups and thus
  multiple rgroup subtrees under the same parent sgroup.

* Unlike sgroups, rgroups are allowed to compete against peer
  threads.  Each rgroup behaves like a sibling task.

* rgroup subtrees are local to the process.  When the process forks or
  execs, its rgroup subtrees are collapsed.

* When a process is migrated to a different cgroup, its rgroup
  subtrees are preserved.

* A subset of the controllers available on the parent sgroup is
  available to rgroup subtrees.  Controller management on rgroups is
  automatic and implicit and doesn't interfere with system-level
  cgroup controller management.  If a controller is made unavailable
  on the parent sgroup, it's automatically disabled in child rgroup
  subtrees.
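
For illustration, the following is a minimal userspace sketch of
rgroup creation.  It is not part of this patchset: the CLONE_NEWRGRP
value is taken from patch 07, the glibc clone(3) wrapper is used for
brevity, and error handling is elided.

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdlib.h>

  #ifndef CLONE_NEWRGRP
  #define CLONE_NEWRGRP 0x00001000        /* from patch 07 */
  #endif

  #define STACK_SIZE (64 * 1024)

  static int worker(void *arg)
  {
          /* runs in a fresh rgroup nested under the parent
           * thread's cgroup; the rgroup is destroyed when it
           * becomes empty */
          return 0;
  }

  int main(void)
  {
          char *stack = malloc(STACK_SIZE);

          /* CLONE_THREAD requires CLONE_VM and CLONE_SIGHAND */
          clone(worker, stack + STACK_SIZE,
                CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
                CLONE_THREAD | CLONE_NEWRGRP, NULL);

          /* ... synchronize with and wait for the worker ... */
          return 0;
  }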

rgroup lays the foundation for other kernel mechanisms to make use of
resource controllers while providing proper isolation between system
management and in-process operations, removing the awkward and
layer-violating requirement for coordination between individual
applications and system management.  On top of the rgroup mechanism,
PRIO_RGRP is implemented for {set|get}priority(2).

* PRIO_RGRP can only be used if the target task is already in a
  rgroup.  If setpriority(2) is used and the cpu controller is
  available, the cpu controller is enabled on each group down to the
  target rgroup and the specified nice value is set as the weight of
  the rgroup.

* The specified nice value has the same meaning as for tasks.  For
  example, a rgroup and a task competing under the same parent would
  behave exactly the same as two tasks.

* For top-level rgroups, PRIO_RGRP follows the same rlimit
  restrictions as PRIO_PROCESS; however, as nested rgroups only
  distribute CPU cycles which are allocated to the process, no
  restriction is applied.

PRIO_RGRP allows in-process hierarchical control of CPU cycles in a
manner which is a straightforward and minimal extension of existing
task and priority management.
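
As a usage sketch (not from the patchset; the numeric value of
PRIO_RGRP and the interpretation of the @who argument as a thread ID
whose innermost rgroup is targeted are assumptions, pending the test
program in the follow-up email):

  #include <sys/resource.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #ifndef PRIO_RGRP
  #define PRIO_RGRP 3   /* assumed value; actually added by patch 10 */
  #endif

  /*
   * Set the weight of the calling thread's rgroup to nice 10.
   * Assumes @who is a TID and the innermost rgroup containing that
   * thread is targeted.
   */
  static void renice_my_rgroup(void)
  {
          setpriority(PRIO_RGRP, (id_t)syscall(SYS_gettid), 10);
  }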

There are still some missing pieces.

* Documentation updates.

* A mechanism that applications can use to publish certain rgroups so
  that external entities can determine which IDs to use to change
  rgroup settings.  I already have the interface and implementation
  design mostly pinned down.

* Userland updates such as integrating CLONE_NEWRGRP handling into
  pthread or updating renice(1) to handle resource groups.

I'll attach a test program which demonstrates PRIO_RGRP usage in a
follow-up email.

This patchset contains the following 10 patches.

 0001-cgroup-introduce-cgroup_-un-lock.patch
 0002-cgroup-un-inline-cgroup_path-and-friends.patch
 0003-cgroup-introduce-CGRP_MIGRATE_-flags.patch
 0004-signal-make-put_signal_struct-public.patch
 0005-cgroup-fork-add-new_rgrp_cset-p-and-clone_flags-to-c.patch
 0006-cgroup-fork-add-child-and-clone_flags-to-threadgroup.patch
 0007-cgroup-introduce-resource-group.patch
 0008-cgroup-implement-rgroup-control-mask-handling.patch
 0009-cgroup-implement-rgroup-subtree-migration.patch
 0010-cgroup-sched-implement-PRIO_RGRP-for-set-get-priorit.patch

0001-0006 are preparatory patches.
0007-0009 implement rgroup support.
0010 implements PRIO_RGRP.

This patchset is on top of

  cgroup/for-4.6 f6d635ad341d ("cgroup: implement cgroup_subsys->implicit_on_dfl")
+ [2] [PATCH 2/2] cgroup, perf_event: make perf_event controller work on cgroup2 hierarchy
+ [3] [PATCHSET REPOST] sched, cgroup: implement cgroup v2 interface for cpu controller

and available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup2-rgroup

diffstat follows.

 fs/exec.c                     |    8 
 include/linux/cgroup-defs.h   |   72 ++-
 include/linux/cgroup.h        |   60 +--
 include/linux/sched.h         |   31 +
 include/uapi/linux/resource.h |    1 
 include/uapi/linux/sched.h    |    1 
 kernel/cgroup.c               |  828 ++++++++++++++++++++++++++++++++++++++----
 kernel/fork.c                 |   27 -
 kernel/sched/core.c           |   32 +
 kernel/signal.c               |    6 
 kernel/sys.c                  |   11 
 11 files changed, 917 insertions(+), 160 deletions(-)

Thanks.

--
tejun

[1] http://lkml.kernel.org/g/20160105154503.GC5995@mtj.duckdns.org
[2] http://lkml.kernel.org/g/1456351975-1899-3-git-send-email-tj@kernel.org
[3] http://lkml.kernel.org/g/20160105164758.GD5995@mtj.duckdns.org

* [PATCH 01/10] cgroup: introduce cgroup_[un]lock()
From: Tejun Heo @ 2016-03-11 15:41 UTC
  To: torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt
  Cc: linux-kernel, cgroups, linux-api, kernel-team, Tejun Heo

Introduce thin wrappers which lock and unlock cgroup_mutex.  While
they don't introduce any functional difference now, they will later
be used to perform extra operations around locking.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup.c | 100 +++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 59 insertions(+), 41 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index e22df5d8..2297bf6 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -237,6 +237,24 @@ static int cgroup_addrm_files(struct cgroup_subsys_state *css,
 			      bool is_add);
 
 /**
+ * cgroup_lock - lock cgroup_mutex and perform related operations
+ */
+static void cgroup_lock(void)
+	__acquires(&cgroup_mutex)
+{
+	mutex_lock(&cgroup_mutex);
+}
+
+/**
+ * cgroup_unlock - unlock cgroup_mutex and perform related operations
+ */
+static void cgroup_unlock(void)
+	__releases(&cgroup_mutex)
+{
+	mutex_unlock(&cgroup_mutex);
+}
+
+/**
  * cgroup_ssid_enabled - cgroup subsys enabled test by subsys ID
  * @ssid: subsys ID of interest
  *
@@ -1194,7 +1212,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
 
 	cgroup_exit_root_id(root);
 
-	mutex_unlock(&cgroup_mutex);
+	cgroup_unlock();
 
 	kernfs_destroy_root(root->kf_root);
 	cgroup_free_root(root);
@@ -1373,7 +1391,7 @@ static void cgroup_kn_unlock(struct kernfs_node *kn)
 	else
 		cgrp = kn->parent->priv;
 
-	mutex_unlock(&cgroup_mutex);
+	cgroup_unlock();
 
 	kernfs_unbreak_active_protection(kn);
 	cgroup_put(cgrp);
@@ -1419,7 +1437,7 @@ static struct cgroup *cgroup_kn_lock_live(struct kernfs_node *kn,
 	if (drain_offline)
 		cgroup_lock_and_drain_offline(cgrp);
 	else
-		mutex_lock(&cgroup_mutex);
+		cgroup_lock();
 
 	if (!cgroup_is_dead(cgrp))
 		return cgrp;
@@ -1804,7 +1822,7 @@ static int cgroup_remount(struct kernfs_root *kf_root, int *flags, char *data)
  out_unlock:
 	kfree(opts.release_agent);
 	kfree(opts.name);
-	mutex_unlock(&cgroup_mutex);
+	cgroup_unlock();
 	return ret;
 }
 
@@ -2045,7 +2063,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 			continue;
 
 		if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
-			mutex_unlock(&cgroup_mutex);
+			cgroup_unlock();
 			msleep(10);
 			ret = restart_syscall();
 			goto out_free;
@@ -2100,7 +2118,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 		pinned_sb = kernfs_pin_sb(root->kf_root, NULL);
 		if (IS_ERR(pinned_sb) ||
 		    !percpu_ref_tryget_live(&root->cgrp.self.refcnt)) {
-			mutex_unlock(&cgroup_mutex);
+			cgroup_unlock();
 			if (!IS_ERR_OR_NULL(pinned_sb))
 				deactivate_super(pinned_sb);
 			msleep(10);
@@ -2135,7 +2153,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 		cgroup_free_root(root);
 
 out_unlock:
-	mutex_unlock(&cgroup_mutex);
+	cgroup_unlock();
 out_free:
 	kfree(opts.release_agent);
 	kfree(opts.name);
@@ -2214,7 +2232,7 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
 	int hierarchy_id = 1;
 	char *path = NULL;
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 	spin_lock_bh(&css_set_lock);
 
 	root = idr_get_next(&cgroup_hierarchy_idr, &hierarchy_id);
@@ -2229,7 +2247,7 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
 	}
 
 	spin_unlock_bh(&css_set_lock);
-	mutex_unlock(&cgroup_mutex);
+	cgroup_unlock();
 	return path;
 }
 EXPORT_SYMBOL_GPL(task_cgroup_path);
@@ -2790,7 +2808,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 	struct cgroup_root *root;
 	int retval = 0;
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 	for_each_root(root) {
 		struct cgroup *from_cgrp;
 
@@ -2805,7 +2823,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 		if (retval)
 			break;
 	}
-	mutex_unlock(&cgroup_mutex);
+	cgroup_unlock();
 
 	return retval;
 }
@@ -2968,7 +2986,7 @@ static void cgroup_lock_and_drain_offline(struct cgroup *cgrp)
 	int ssid;
 
 restart:
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 
 	cgroup_for_each_live_descendant_post(dsct, d_css, cgrp) {
 		for_each_subsys(ss, ssid) {
@@ -2982,7 +3000,7 @@ static void cgroup_lock_and_drain_offline(struct cgroup *cgrp)
 			prepare_to_wait(&dsct->offline_waitq, &wait,
 					TASK_UNINTERRUPTIBLE);
 
-			mutex_unlock(&cgroup_mutex);
+			cgroup_unlock();
 			schedule();
 			finish_wait(&dsct->offline_waitq, &wait);
 
@@ -3426,11 +3444,11 @@ static int cgroup_rename(struct kernfs_node *kn, struct kernfs_node *new_parent,
 	kernfs_break_active_protection(new_parent);
 	kernfs_break_active_protection(kn);
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 
 	ret = kernfs_rename(kn, new_parent, new_name_str);
 
-	mutex_unlock(&cgroup_mutex);
+	cgroup_unlock();
 
 	kernfs_unbreak_active_protection(kn);
 	kernfs_unbreak_active_protection(new_parent);
@@ -3637,9 +3655,9 @@ int cgroup_rm_cftypes(struct cftype *cfts)
 {
 	int ret;
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 	ret = cgroup_rm_cftypes_locked(cfts);
-	mutex_unlock(&cgroup_mutex);
+	cgroup_unlock();
 	return ret;
 }
 
@@ -3671,14 +3689,14 @@ static int cgroup_add_cftypes(struct cgroup_subsys *ss, struct cftype *cfts)
 	if (ret)
 		return ret;
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 
 	list_add_tail(&cfts->node, &ss->cfts);
 	ret = cgroup_apply_cftypes(cfts, true);
 	if (ret)
 		cgroup_rm_cftypes_locked(cfts);
 
-	mutex_unlock(&cgroup_mutex);
+	cgroup_unlock();
 	return ret;
 }
 
@@ -4170,7 +4188,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 	if (!cgroup_may_migrate_to(to))
 		return -EBUSY;
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 
 	/* all tasks in @from are being moved, all csets are source */
 	spin_lock_bh(&css_set_lock);
@@ -4200,7 +4218,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 	} while (task && !ret);
 out_err:
 	cgroup_migrate_finish(&preloaded_csets);
-	mutex_unlock(&cgroup_mutex);
+	cgroup_unlock();
 	return ret;
 }
 
@@ -4507,7 +4525,7 @@ int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry)
 	    kernfs_type(kn) != KERNFS_DIR)
 		return -EINVAL;
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 
 	/*
 	 * We aren't being called from kernfs and there's no guarantee on
@@ -4518,7 +4536,7 @@ int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry)
 	cgrp = rcu_dereference(kn->priv);
 	if (!cgrp || cgroup_is_dead(cgrp)) {
 		rcu_read_unlock();
-		mutex_unlock(&cgroup_mutex);
+		cgroup_unlock();
 		return -ENOENT;
 	}
 	rcu_read_unlock();
@@ -4546,7 +4564,7 @@ int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry)
 	}
 	css_task_iter_end(&it);
 
-	mutex_unlock(&cgroup_mutex);
+	cgroup_unlock();
 	return 0;
 }
 
@@ -4847,7 +4865,7 @@ static void css_release_work_fn(struct work_struct *work)
 	struct cgroup_subsys *ss = css->ss;
 	struct cgroup *cgrp = css->cgroup;
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 
 	css->flags |= CSS_RELEASED;
 	list_del_rcu(&css->sibling);
@@ -4874,7 +4892,7 @@ static void css_release_work_fn(struct work_struct *work)
 					 NULL);
 	}
 
-	mutex_unlock(&cgroup_mutex);
+	cgroup_unlock();
 
 	call_rcu(&css->rcu_head, css_free_rcu_fn);
 }
@@ -5168,7 +5186,7 @@ static void css_killed_work_fn(struct work_struct *work)
 	struct cgroup_subsys_state *css =
 		container_of(work, struct cgroup_subsys_state, destroy_work);
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 
 	do {
 		offline_css(css);
@@ -5177,7 +5195,7 @@ static void css_killed_work_fn(struct work_struct *work)
 		css = css->parent;
 	} while (css && atomic_dec_and_test(&css->online_cnt));
 
-	mutex_unlock(&cgroup_mutex);
+	cgroup_unlock();
 }
 
 /* css kill confirmation processing requires process context, bounce */
@@ -5330,7 +5348,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
 
 	pr_debug("Initializing cgroup subsys %s\n", ss->name);
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 
 	idr_init(&ss->css_idr);
 	INIT_LIST_HEAD(&ss->cfts);
@@ -5374,7 +5392,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
 
 	BUG_ON(online_css(css));
 
-	mutex_unlock(&cgroup_mutex);
+	cgroup_unlock();
 }
 
 /**
@@ -5431,7 +5449,7 @@ int __init cgroup_init(void)
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 
 	/*
 	 * Add init_css_set to the hash table so that dfl_root can link to
@@ -5442,7 +5460,7 @@ int __init cgroup_init(void)
 
 	BUG_ON(cgroup_setup_root(&cgrp_dfl_root, 0));
 
-	mutex_unlock(&cgroup_mutex);
+	cgroup_unlock();
 
 	for_each_subsys(ss, ssid) {
 		if (ss->early_init) {
@@ -5548,7 +5566,7 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 	if (!buf)
 		goto out;
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 	spin_lock_bh(&css_set_lock);
 
 	for_each_root(root) {
@@ -5602,7 +5620,7 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 	retval = 0;
 out_unlock:
 	spin_unlock_bh(&css_set_lock);
-	mutex_unlock(&cgroup_mutex);
+	cgroup_unlock();
 	kfree(buf);
 out:
 	return retval;
@@ -5620,7 +5638,7 @@ static int proc_cgroupstats_show(struct seq_file *m, void *v)
 	 * cgroup_mutex is also necessary to guarantee an atomic snapshot of
 	 * subsys/hierarchy state.
 	 */
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 
 	for_each_subsys(ss, i)
 		seq_printf(m, "%s\t%d\t%d\t%d\n",
@@ -5628,7 +5646,7 @@ static int proc_cgroupstats_show(struct seq_file *m, void *v)
 			   atomic_read(&ss->root->nr_cgrps),
 			   cgroup_ssid_enabled(i));
 
-	mutex_unlock(&cgroup_mutex);
+	cgroup_unlock();
 	return 0;
 }
 
@@ -5860,7 +5878,7 @@ static void cgroup_release_agent(struct work_struct *work)
 	char *pathbuf = NULL, *agentbuf = NULL, *path;
 	char *argv[3], *envp[3];
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 
 	pathbuf = kmalloc(PATH_MAX, GFP_KERNEL);
 	agentbuf = kstrdup(cgrp->root->release_agent_path, GFP_KERNEL);
@@ -5880,11 +5898,11 @@ static void cgroup_release_agent(struct work_struct *work)
 	envp[1] = "PATH=/sbin:/bin:/usr/sbin:/usr/bin";
 	envp[2] = NULL;
 
-	mutex_unlock(&cgroup_mutex);
+	cgroup_unlock();
 	call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC);
 	goto out_free;
 out:
-	mutex_unlock(&cgroup_mutex);
+	cgroup_unlock();
 out_free:
 	kfree(agentbuf);
 	kfree(pathbuf);
@@ -6006,7 +6024,7 @@ struct cgroup *cgroup_get_from_path(const char *path)
 	struct kernfs_node *kn;
 	struct cgroup *cgrp;
 
-	mutex_lock(&cgroup_mutex);
+	cgroup_lock();
 
 	kn = kernfs_walk_and_get(cgrp_dfl_root.cgrp.kn, path);
 	if (kn) {
@@ -6021,7 +6039,7 @@ struct cgroup *cgroup_get_from_path(const char *path)
 		cgrp = ERR_PTR(-ENOENT);
 	}
 
-	mutex_unlock(&cgroup_mutex);
+	cgroup_unlock();
 	return cgrp;
 }
 EXPORT_SYMBOL_GPL(cgroup_get_from_path);
-- 
2.5.0

* [PATCH 02/10] cgroup: un-inline cgroup_path() and friends
From: Tejun Heo @ 2016-03-11 15:41 UTC
  To: torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt
  Cc: linux-kernel, cgroups, linux-api, kernel-team, Tejun Heo

They're trivial wrappers now, but to support in-process resource
control, cgroup_path() and friends will need to access more of the
cgroup internals.  Un-inline them.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup.h | 31 +++++--------------------------
 kernel/cgroup.c        | 21 +++++++++++++++++++++
 2 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 2162dca..4717f43 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -74,6 +74,11 @@ extern struct css_set init_css_set;
 #define cgroup_subsys_on_dfl(ss)						\
 	static_branch_likely(&ss ## _on_dfl_key)
 
+int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen);
+char * __must_check cgroup_path(struct cgroup *cgrp, char *buf, size_t buflen);
+void pr_cont_cgroup_name(struct cgroup *cgrp);
+void pr_cont_cgroup_path(struct cgroup *cgrp);
+
 bool css_has_online_children(struct cgroup_subsys_state *css);
 struct cgroup_subsys_state *css_from_id(int id, struct cgroup_subsys *ss);
 struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgroup,
@@ -522,32 +527,6 @@ static inline struct cgroup_subsys_state *seq_css(struct seq_file *seq)
 	return of_css(seq->private);
 }
 
-/*
- * Name / path handling functions.  All are thin wrappers around the kernfs
- * counterparts and can be called under any context.
- */
-
-static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
-{
-	return kernfs_name(cgrp->kn, buf, buflen);
-}
-
-static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
-					      size_t buflen)
-{
-	return kernfs_path(cgrp->kn, buf, buflen);
-}
-
-static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
-{
-	pr_cont_kernfs_name(cgrp->kn);
-}
-
-static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
-{
-	pr_cont_kernfs_path(cgrp->kn);
-}
-
 #else /* !CONFIG_CGROUPS */
 
 struct cgroup_subsys_state;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2297bf6..4c3c43e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -618,6 +618,27 @@ static int notify_on_release(const struct cgroup *cgrp)
 static void cgroup_release_agent(struct work_struct *work);
 static void check_for_release(struct cgroup *cgrp);
 
+int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
+{
+	return kernfs_name(cgrp->kn, buf, buflen);
+}
+
+char * __must_check cgroup_path(struct cgroup *cgrp, char *buf, size_t buflen)
+{
+	return kernfs_path(cgrp->kn, buf, buflen);
+}
+EXPORT_SYMBOL_GPL(cgroup_path);
+
+void pr_cont_cgroup_name(struct cgroup *cgrp)
+{
+	pr_cont_kernfs_name(cgrp->kn);
+}
+
+void pr_cont_cgroup_path(struct cgroup *cgrp)
+{
+	pr_cont_kernfs_path(cgrp->kn);
+}
+
 /*
  * A cgroup can be associated with multiple css_sets as different tasks may
  * belong to different cgroups on different hierarchies.  In the other
-- 
2.5.0

* [PATCH 03/10] cgroup: introduce CGRP_MIGRATE_* flags
From: Tejun Heo @ 2016-03-11 15:41 UTC
  To: torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt
  Cc: linux-kernel, cgroups, linux-api, kernel-team, Tejun Heo

Currently, the only migration modifier is the boolean @threadgroup.
Make the functions pass around an unsigned flag mask instead and
replace @threadgroup with CGRP_MIGRATE_PROCESS.  This will allow the
addition of other modifiers.

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup.c | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 4c3c43e..8ecb8d6 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2403,6 +2403,10 @@ struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset,
 	return NULL;
 }
 
+enum {
+	CGRP_MIGRATE_PROCESS	= (1 << 0), /* migrate the whole process */
+};
+
 /**
  * cgroup_taskset_migrate - migrate a taskset
  * @tset: target taskset
@@ -2634,7 +2638,7 @@ static int cgroup_migrate_prepare_dst(struct list_head *preloaded_csets)
 /**
  * cgroup_migrate - migrate a process or task to a cgroup
  * @leader: the leader of the process or the task to migrate
- * @threadgroup: whether @leader points to the whole process or a single task
+ * @mflags: CGRP_MIGRATE_* flags
  * @root: cgroup root migration is taking place on
  *
  * Migrate a process or task denoted by @leader.  If migrating a process,
@@ -2649,7 +2653,7 @@ static int cgroup_migrate_prepare_dst(struct list_head *preloaded_csets)
  * decided for all targets by invoking cgroup_migrate_prepare_dst() before
  * actually starting migrating.
  */
-static int cgroup_migrate(struct task_struct *leader, bool threadgroup,
+static int cgroup_migrate(struct task_struct *leader, unsigned mflags,
 			  struct cgroup_root *root)
 {
 	struct cgroup_taskset tset = CGROUP_TASKSET_INIT(tset);
@@ -2665,7 +2669,7 @@ static int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 	task = leader;
 	do {
 		cgroup_taskset_add(task, &tset);
-		if (!threadgroup)
+		if (!(mflags & CGRP_MIGRATE_PROCESS))
 			break;
 	} while_each_thread(leader, task);
 	rcu_read_unlock();
@@ -2678,12 +2682,12 @@ static int cgroup_migrate(struct task_struct *leader, bool threadgroup,
  * cgroup_attach_task - attach a task or a whole threadgroup to a cgroup
  * @dst_cgrp: the cgroup to attach to
  * @leader: the task or the leader of the threadgroup to be attached
- * @threadgroup: attach the whole threadgroup?
+ * @mflags: CGRP_MIGRATE_* flags
  *
  * Call holding cgroup_mutex and cgroup_threadgroup_rwsem.
  */
 static int cgroup_attach_task(struct cgroup *dst_cgrp,
-			      struct task_struct *leader, bool threadgroup)
+			      struct task_struct *leader, unsigned mflags)
 {
 	LIST_HEAD(preloaded_csets);
 	struct task_struct *task;
@@ -2699,7 +2703,7 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
 	do {
 		cgroup_migrate_add_src(task_css_set(task), dst_cgrp,
 				       &preloaded_csets);
-		if (!threadgroup)
+		if (!(mflags & CGRP_MIGRATE_PROCESS))
 			break;
 	} while_each_thread(leader, task);
 	rcu_read_unlock();
@@ -2708,7 +2712,7 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
 	/* prepare dst csets and commit */
 	ret = cgroup_migrate_prepare_dst(&preloaded_csets);
 	if (!ret)
-		ret = cgroup_migrate(leader, threadgroup, dst_cgrp->root);
+		ret = cgroup_migrate(leader, mflags, dst_cgrp->root);
 
 	cgroup_migrate_finish(&preloaded_csets);
 	return ret;
@@ -2761,7 +2765,7 @@ static int cgroup_procs_write_permission(struct task_struct *task,
  * cgroup_mutex and threadgroup.
  */
 static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
-				    size_t nbytes, loff_t off, bool threadgroup)
+				    size_t nbytes, loff_t off, unsigned mflags)
 {
 	struct task_struct *tsk;
 	struct cgroup *cgrp;
@@ -2787,7 +2791,7 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 		tsk = current;
 	}
 
-	if (threadgroup)
+	if (mflags & CGRP_MIGRATE_PROCESS)
 		tsk = tsk->group_leader;
 
 	/*
@@ -2805,7 +2809,7 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 
 	ret = cgroup_procs_write_permission(tsk, cgrp, of);
 	if (!ret)
-		ret = cgroup_attach_task(cgrp, tsk, threadgroup);
+		ret = cgroup_attach_task(cgrp, tsk, mflags);
 
 	put_task_struct(tsk);
 	goto out_unlock_threadgroup;
@@ -2840,7 +2844,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 		from_cgrp = task_cgroup_from_root(from, root);
 		spin_unlock_bh(&css_set_lock);
 
-		retval = cgroup_attach_task(from_cgrp, tsk, false);
+		retval = cgroup_attach_task(from_cgrp, tsk, 0);
 		if (retval)
 			break;
 	}
@@ -2853,13 +2857,13 @@ EXPORT_SYMBOL_GPL(cgroup_attach_task_all);
 static ssize_t cgroup_tasks_write(struct kernfs_open_file *of,
 				  char *buf, size_t nbytes, loff_t off)
 {
-	return __cgroup_procs_write(of, buf, nbytes, off, false);
+	return __cgroup_procs_write(of, buf, nbytes, off, 0);
 }
 
 static ssize_t cgroup_procs_write(struct kernfs_open_file *of,
 				  char *buf, size_t nbytes, loff_t off)
 {
-	return __cgroup_procs_write(of, buf, nbytes, off, true);
+	return __cgroup_procs_write(of, buf, nbytes, off, CGRP_MIGRATE_PROCESS);
 }
 
 static ssize_t cgroup_release_agent_write(struct kernfs_open_file *of,
@@ -4233,7 +4237,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 		css_task_iter_end(&it);
 
 		if (task) {
-			ret = cgroup_migrate(task, false, to->root);
+			ret = cgroup_migrate(task, 0, to->root);
 			put_task_struct(task);
 		}
 	} while (task && !ret);
-- 
2.5.0

* [PATCH 04/10] signal: make put_signal_struct() public
From: Tejun Heo @ 2016-03-11 15:41 UTC
  To: torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt
  Cc: linux-kernel, cgroups, linux-api, kernel-team, Tejun Heo,
	Peter Zijlstra, Oleg Nesterov

This will be used from cgroup while implementing in-process resource
control.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/sched.h | 2 ++
 kernel/fork.c         | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f1e81e1..80d6ed1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2034,6 +2034,8 @@ static inline int is_global_init(struct task_struct *tsk)
 
 extern struct pid *cad_pid;
 
+extern void put_signal_struct(struct signal_struct *sig);
+
 extern void free_task(struct task_struct *tsk);
 #define get_task_struct(tsk) do { atomic_inc(&(tsk)->usage); } while(0)
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 2e391c7..fd826de 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -239,7 +239,7 @@ static inline void free_signal_struct(struct signal_struct *sig)
 	kmem_cache_free(signal_cachep, sig);
 }
 
-static inline void put_signal_struct(struct signal_struct *sig)
+void put_signal_struct(struct signal_struct *sig)
 {
 	if (atomic_dec_and_test(&sig->sigcnt))
 		free_signal_struct(sig);
-- 
2.5.0

* [PATCH 05/10] cgroup, fork: add @new_rgrp_cset[p] and @clone_flags to cgroup fork callbacks
From: Tejun Heo @ 2016-03-11 15:41 UTC
  To: torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt
  Cc: linux-kernel, cgroups, linux-api, kernel-team, Tejun Heo,
	Peter Zijlstra, Oleg Nesterov

Add two extra arguments to cgroup_{can|cancel|post}_fork().  These
will be used to implement in-process resource control.  The extra
arguments aren't used yet.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/cgroup-defs.h |  2 ++
 include/linux/cgroup.h      | 21 +++++++++++++++------
 kernel/cgroup.c             | 15 ++++++++++++---
 kernel/fork.c               |  7 ++++---
 4 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 34b42f0..d3d1f92 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -542,6 +542,8 @@ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
 
 #else	/* CONFIG_CGROUPS */
 
+struct css_set;
+
 #define CGROUP_SUBSYS_COUNT 0
 
 static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) {}
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4717f43..ebcf21f 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -102,9 +102,12 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 		     struct pid *pid, struct task_struct *tsk);
 
 void cgroup_fork(struct task_struct *p);
-extern int cgroup_can_fork(struct task_struct *p);
-extern void cgroup_cancel_fork(struct task_struct *p);
-extern void cgroup_post_fork(struct task_struct *p);
+extern int cgroup_can_fork(struct task_struct *p, unsigned long clone_flags,
+			   struct css_set **new_rgrp_csetp);
+extern void cgroup_cancel_fork(struct task_struct *p, unsigned long clone_flags,
+			       struct css_set *new_rgrp_cset);
+extern void cgroup_post_fork(struct task_struct *p, unsigned long clone_flags,
+			     struct css_set *new_rgrp_cset);
 void cgroup_exit(struct task_struct *p);
 void cgroup_free(struct task_struct *p);
 
@@ -538,9 +541,15 @@ static inline int cgroupstats_build(struct cgroupstats *stats,
 				    struct dentry *dentry) { return -EINVAL; }
 
 static inline void cgroup_fork(struct task_struct *p) {}
-static inline int cgroup_can_fork(struct task_struct *p) { return 0; }
-static inline void cgroup_cancel_fork(struct task_struct *p) {}
-static inline void cgroup_post_fork(struct task_struct *p) {}
+static inline int cgroup_can_fork(struct task_struct *p,
+				  unsigned long clone_flags,
+				  struct css_set **new_rgrp_csetp) { return 0; }
+static inline void cgroup_cancel_fork(struct task_struct *p,
+				      unsigned long clone_flags,
+				      struct css_set *new_rgrp_cset) {}
+static inline void cgroup_post_fork(struct task_struct *p,
+				    unsigned long clone_flags,
+				    struct css_set *new_rgrp_cset) {}
 static inline void cgroup_exit(struct task_struct *p) {}
 static inline void cgroup_free(struct task_struct *p) {}
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 8ecb8d6..ac207ae 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5704,12 +5704,15 @@ void cgroup_fork(struct task_struct *child)
 /**
  * cgroup_can_fork - called on a new task before the process is exposed
  * @child: the task in question.
+ * @clone_flags: CLONE_* flags for @child
+ * @new_rgrp_csetp: rgroup out parameter to be passed to post/cancel_fork
  *
  * This calls the subsystem can_fork() callbacks. If the can_fork() callback
  * returns an error, the fork aborts with that error code. This allows for
  * a cgroup subsystem to conditionally allow or deny new forks.
  */
-int cgroup_can_fork(struct task_struct *child)
+int cgroup_can_fork(struct task_struct *child, unsigned long clone_flags,
+		    struct css_set **new_rgrp_csetp)
 {
 	struct cgroup_subsys *ss;
 	int i, j, ret;
@@ -5736,11 +5739,14 @@ int cgroup_can_fork(struct task_struct *child)
 /**
  * cgroup_cancel_fork - called if a fork failed after cgroup_can_fork()
  * @child: the task in question
+ * @clone_flags: CLONE_* flags for @child
+ * @new_rgrp_cset: *@new_rgrp_csetp from cgroup_can_fork()
  *
  * This calls the cancel_fork() callbacks if a fork failed *after*
  * cgroup_can_fork() succeeded.
  */
-void cgroup_cancel_fork(struct task_struct *child)
+void cgroup_cancel_fork(struct task_struct *child, unsigned long clone_flags,
+			struct css_set *new_rgrp_cset)
 {
 	struct cgroup_subsys *ss;
 	int i;
@@ -5753,6 +5759,8 @@ void cgroup_cancel_fork(struct task_struct *child)
 /**
  * cgroup_post_fork - called on a new task after adding it to the task list
  * @child: the task in question
+ * @clone_flags: CLONE_* flags for @child
+ * @new_rgrp_cset: *@new_rgrp_csetp from cgroup_can_fork()
  *
  * Adds the task to the list running through its css_set if necessary and
  * call the subsystem fork() callbacks.  Has to be after the task is
@@ -5760,7 +5768,8 @@ void cgroup_cancel_fork(struct task_struct *child)
  * cgroup_task_iter_start() - to guarantee that the new task ends up on its
  * list.
  */
-void cgroup_post_fork(struct task_struct *child)
+void cgroup_post_fork(struct task_struct *child, unsigned long clone_flags,
+		      struct css_set *new_rgrp_cset)
 {
 	struct cgroup_subsys *ss;
 	int i;
diff --git a/kernel/fork.c b/kernel/fork.c
index fd826de..812d477 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1249,6 +1249,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 {
 	int retval;
 	struct task_struct *p;
+	struct css_set *new_rgrp_cset;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1525,7 +1526,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	 * between here and cgroup_post_fork() if an organisation operation is in
 	 * progress.
 	 */
-	retval = cgroup_can_fork(p);
+	retval = cgroup_can_fork(p, clone_flags, &new_rgrp_cset);
 	if (retval)
 		goto bad_fork_free_pid;
 
@@ -1607,7 +1608,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	write_unlock_irq(&tasklist_lock);
 
 	proc_fork_connector(p);
-	cgroup_post_fork(p);
+	cgroup_post_fork(p, clone_flags, new_rgrp_cset);
 	threadgroup_change_end(current);
 	perf_event_fork(p);
 
@@ -1617,7 +1618,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	return p;
 
 bad_fork_cancel_cgroup:
-	cgroup_cancel_fork(p);
+	cgroup_cancel_fork(p, clone_flags, new_rgrp_cset);
 bad_fork_free_pid:
 	if (pid != &init_struct_pid)
 		free_pid(pid);
-- 
2.5.0

* [PATCH 06/10] cgroup, fork: add @child and @clone_flags to threadgroup_change_begin/end()
From: Tejun Heo @ 2016-03-11 15:41 UTC
  To: torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt
  Cc: linux-kernel, cgroups, linux-api, kernel-team, Tejun Heo,
	Peter Zijlstra, Oleg Nesterov

When threadgroup_change_begin/end() are called from the fork path,
pass in @child and @clone_flags so that the fork path can be
distinguished and fork-related information is available.

While at it, un-inline cgroup_threadgroup_change_begin/end() and fold
cgroup_fork() into cgroup_threadgroup_change_begin().  These changes
will be used to implement in-process resource control.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 fs/exec.c                   |  6 +++---
 include/linux/cgroup-defs.h | 39 ++++++++++++-------------------------
 include/linux/cgroup.h      |  2 --
 include/linux/sched.h       | 16 +++++++++++----
 kernel/cgroup.c             | 47 ++++++++++++++++++++++++++++++++++++---------
 kernel/fork.c               |  7 +++----
 kernel/signal.c             |  6 +++---
 7 files changed, 71 insertions(+), 52 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 828ec5f..5b81bbb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -936,7 +936,7 @@ static int de_thread(struct task_struct *tsk)
 		struct task_struct *leader = tsk->group_leader;
 
 		for (;;) {
-			threadgroup_change_begin(tsk);
+			threadgroup_change_begin(tsk, NULL, 0);
 			write_lock_irq(&tasklist_lock);
 			/*
 			 * Do this under tasklist_lock to ensure that
@@ -947,7 +947,7 @@ static int de_thread(struct task_struct *tsk)
 				break;
 			__set_current_state(TASK_KILLABLE);
 			write_unlock_irq(&tasklist_lock);
-			threadgroup_change_end(tsk);
+			threadgroup_change_end(tsk, NULL, 0);
 			schedule();
 			if (unlikely(__fatal_signal_pending(tsk)))
 				goto killed;
@@ -1005,7 +1005,7 @@ static int de_thread(struct task_struct *tsk)
 		if (unlikely(leader->ptrace))
 			__wake_up_parent(leader, leader->parent);
 		write_unlock_irq(&tasklist_lock);
-		threadgroup_change_end(tsk);
+		threadgroup_change_end(tsk, NULL, 0);
 
 		release_task(leader);
 	}
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index d3d1f92..3c4a75b 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -514,31 +514,12 @@ struct cgroup_subsys {
 	unsigned int depends_on;
 };
 
-extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
-
-/**
- * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups
- * @tsk: target task
- *
- * Called from threadgroup_change_begin() and allows cgroup operations to
- * synchronize against threadgroup changes using a percpu_rw_semaphore.
- */
-static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
-{
-	percpu_down_read(&cgroup_threadgroup_rwsem);
-}
-
-/**
- * cgroup_threadgroup_change_end - threadgroup exclusion for cgroups
- * @tsk: target task
- *
- * Called from threadgroup_change_end().  Counterpart of
- * cgroup_threadcgroup_change_begin().
- */
-static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
-{
-	percpu_up_read(&cgroup_threadgroup_rwsem);
-}
+void cgroup_threadgroup_change_begin(struct task_struct *tsk,
+				     struct task_struct *child,
+				     unsigned long clone_flags);
+void cgroup_threadgroup_change_end(struct task_struct *tsk,
+				   struct task_struct *child,
+				   unsigned long clone_flags);
 
 #else	/* CONFIG_CGROUPS */
 
@@ -546,8 +527,12 @@ struct css_set;
 
 #define CGROUP_SUBSYS_COUNT 0
 
-static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) {}
-static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) {}
+static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk,
+						   struct task_struct *child,
+						   unsigned long clone_flags) {}
+static inline void cgroup_threadgroup_change_end(struct task_struct *tsk,
+						 struct task_struct *child,
+						 unsigned long clone_flags) {}
 
 #endif	/* CONFIG_CGROUPS */
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ebcf21f..1e00fc0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -101,7 +101,6 @@ int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry);
 int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 		     struct pid *pid, struct task_struct *tsk);
 
-void cgroup_fork(struct task_struct *p);
 extern int cgroup_can_fork(struct task_struct *p, unsigned long clone_flags,
 			   struct css_set **new_rgrp_csetp);
 extern void cgroup_cancel_fork(struct task_struct *p, unsigned long clone_flags,
@@ -540,7 +539,6 @@ static inline int cgroup_attach_task_all(struct task_struct *from,
 static inline int cgroupstats_build(struct cgroupstats *stats,
 				    struct dentry *dentry) { return -EINVAL; }
 
-static inline void cgroup_fork(struct task_struct *p) {}
 static inline int cgroup_can_fork(struct task_struct *p,
 				  unsigned long clone_flags,
 				  struct css_set **new_rgrp_csetp) { return 0; }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 80d6ed1..d4ae795 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2783,6 +2783,8 @@ static inline void unlock_task_sighand(struct task_struct *tsk,
 /**
  * threadgroup_change_begin - mark the beginning of changes to a threadgroup
  * @tsk: task causing the changes
+ * @child: child task if forking, NULL otherwise
+ * @clone_flags: clone flags if forking
  *
  * All operations which modify a threadgroup - a new thread joining the
  * group, death of a member thread (the assertion of PF_EXITING) and
@@ -2791,21 +2793,27 @@ static inline void unlock_task_sighand(struct task_struct *tsk,
  * subsystems needing threadgroup stability can hook into for
  * synchronization.
  */
-static inline void threadgroup_change_begin(struct task_struct *tsk)
+static inline void threadgroup_change_begin(struct task_struct *tsk,
+					    struct task_struct *child,
+					    unsigned long clone_flags)
 {
 	might_sleep();
-	cgroup_threadgroup_change_begin(tsk);
+	cgroup_threadgroup_change_begin(tsk, child, clone_flags);
 }
 
 /**
  * threadgroup_change_end - mark the end of changes to a threadgroup
  * @tsk: task causing the changes
+ * @child: child task if forking, NULL otherwise
+ * @clone_flags: clone flags if forking
  *
  * See threadgroup_change_begin().
  */
-static inline void threadgroup_change_end(struct task_struct *tsk)
+static inline void threadgroup_change_end(struct task_struct *tsk,
+					  struct task_struct *child,
+					  unsigned long clone_flags)
 {
-	cgroup_threadgroup_change_end(tsk);
+	cgroup_threadgroup_change_end(tsk, child, clone_flags);
 }
 
 #ifndef __HAVE_THREAD_FUNCTIONS
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index ac207ae..70f9985 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -110,7 +110,7 @@ static DEFINE_SPINLOCK(cgroup_file_kn_lock);
  */
 static DEFINE_SPINLOCK(release_agent_path_lock);
 
-struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
+static struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
 
 #define cgroup_assert_mutex_or_rcu_locked()				\
 	RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&			\
@@ -5688,17 +5688,46 @@ static const struct file_operations proc_cgroupstats_operations = {
 };
 
 /**
- * cgroup_fork - initialize cgroup related fields during copy_process()
- * @child: pointer to task_struct of forking parent process.
+ * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups
+ * @tsk: target task
+ * @child: child task if forking, NULL otherwise
+ * @clone_flags: clone flags if forking
  *
- * A task is associated with the init_css_set until cgroup_post_fork()
- * attaches it to the parent's css_set.  Empty cg_list indicates that
- * @child isn't holding reference to its css_set.
+ * Called from threadgroup_change_begin() and allows cgroup operations to
+ * synchronize against threadgroup changes using a percpu_rw_semaphore.
  */
-void cgroup_fork(struct task_struct *child)
+void cgroup_threadgroup_change_begin(struct task_struct *tsk,
+				     struct task_struct *child,
+				     unsigned long clone_flags)
 {
-	RCU_INIT_POINTER(child->cgroups, &init_css_set);
-	INIT_LIST_HEAD(&child->cg_list);
+	if (child) {
+		/*
+		 * A task is associated with the init_css_set until
+		 * cgroup_post_fork() attaches it to the parent's css_set.
+		 * Empty cg_list indicates that @child isn't holding
+		 * reference to its css_set.
+		 */
+		RCU_INIT_POINTER(child->cgroups, &init_css_set);
+		INIT_LIST_HEAD(&child->cg_list);
+	}
+
+	percpu_down_read(&cgroup_threadgroup_rwsem);
+}
+
+/**
+ * cgroup_threadgroup_change_end - threadgroup exclusion for cgroups
+ * @tsk: target task
+ * @child: child task if forking, NULL otherwise
+ * @clone_flags: clone flags if forking
+ *
+ * Called from threadgroup_change_end().  Counterpart of
+ * cgroup_threadgroup_change_begin().
+ */
+void cgroup_threadgroup_change_end(struct task_struct *tsk,
+				   struct task_struct *child,
+				   unsigned long clone_flags)
+{
+	percpu_up_read(&cgroup_threadgroup_rwsem);
 }
 
 /**
diff --git a/kernel/fork.c b/kernel/fork.c
index 812d477..840b662 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1368,8 +1368,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->real_start_time = ktime_get_boot_ns();
 	p->io_context = NULL;
 	p->audit_context = NULL;
-	threadgroup_change_begin(current);
-	cgroup_fork(p);
+	threadgroup_change_begin(current, p, clone_flags);
 #ifdef CONFIG_NUMA
 	p->mempolicy = mpol_dup(p->mempolicy);
 	if (IS_ERR(p->mempolicy)) {
@@ -1609,7 +1608,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 
 	proc_fork_connector(p);
 	cgroup_post_fork(p, clone_flags, new_rgrp_cset);
-	threadgroup_change_end(current);
+	threadgroup_change_end(current, p, clone_flags);
 	perf_event_fork(p);
 
 	trace_task_newtask(p, clone_flags);
@@ -1650,7 +1649,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	mpol_put(p->mempolicy);
 bad_fork_cleanup_threadgroup_lock:
 #endif
-	threadgroup_change_end(current);
+	threadgroup_change_end(current, p, clone_flags);
 	delayacct_tsk_free(p);
 bad_fork_cleanup_count:
 	atomic_dec(&p->cred->user->processes);
diff --git a/kernel/signal.c b/kernel/signal.c
index f3f1f7a..1679c02 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2389,11 +2389,11 @@ void exit_signals(struct task_struct *tsk)
 	 * @tsk is about to have PF_EXITING set - lock out users which
 	 * expect stable threadgroup.
 	 */
-	threadgroup_change_begin(tsk);
+	threadgroup_change_begin(tsk, NULL, 0);
 
 	if (thread_group_empty(tsk) || signal_group_exit(tsk->signal)) {
 		tsk->flags |= PF_EXITING;
-		threadgroup_change_end(tsk);
+		threadgroup_change_end(tsk, NULL, 0);
 		return;
 	}
 
@@ -2404,7 +2404,7 @@ void exit_signals(struct task_struct *tsk)
 	 */
 	tsk->flags |= PF_EXITING;
 
-	threadgroup_change_end(tsk);
+	threadgroup_change_end(tsk, NULL, 0);
 
 	if (!signal_pending(tsk))
 		goto out;
-- 
2.5.0

* [PATCH 07/10] cgroup: introduce resource group
From: Tejun Heo @ 2016-03-11 15:41 UTC
  To: torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt
  Cc: linux-kernel, cgroups, linux-api, kernel-team, Tejun Heo,
	Peter Zijlstra, Oleg Nesterov

cgroup v1 allowed tasks of a process to be put in different cgroups,
thus allowing control of resource distribution inside a process;
however, controlling in-process properties through a filesystem
interface is highly unusual and has various issues around delegation,
ownership, and lack of integration with process-altering operations.

rgroup (resource group) is a type of v2 cgroup which can be created by
setting CLONE_NEWRGRP during clone(2).  A newly created rgroup always
nests below the cgroup of the parent task, whether that is a sgroup
(system group) or rgroup.  rgroups are wholly owned by the associated
process and not visible through cgroupfs.

This patch implements the basic support for rgroups.

* New rgroup can be created through CLONE_NEWRGRP.  Top level rgroups
  are linked on the owning process's signal struct and all such signal
  structs are linked on the parent sgroup.

* A rgroup is destroyed automatically when it becomes depopulated.

* When a new process is forked, it is spawned in the nearest sgroup.

* When a task execs, it is moved to the nearest sgroup.

This patch doesn't yet implement actual resource control or
sub-hierarchy migration and all controllers are suppressed in rgroups.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Turner <pjt@google.com>
---
 fs/exec.c                   |   2 +-
 include/linux/cgroup-defs.h |  26 +++++
 include/linux/cgroup.h      |   2 +
 include/linux/sched.h       |   4 +
 include/uapi/linux/sched.h  |   1 +
 kernel/cgroup.c             | 229 ++++++++++++++++++++++++++++++++++++++++++--
 kernel/fork.c               |  11 +++
 7 files changed, 266 insertions(+), 9 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 5b81bbb..286141e 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1044,7 +1044,7 @@ static int de_thread(struct task_struct *tsk)
 	}
 
 	BUG_ON(!thread_group_leader(tsk));
-	return 0;
+	return cgroup_exec();
 
 killed:
 	/* protects against exit_notify() and __exit_signal() */
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 3c4a75b..f1ee756 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -201,6 +201,14 @@ struct css_set {
 	struct css_set *mg_dst_cset;
 
 	/*
+	 * If this cset points to a rgroup, the following is a cset which
+	 * is equivalent except that it points to the nearest sgroup.  This
+	 * allows tasks to be escaped to the nearest sgroup without
+	 * introducing deeply nested error cases.
+	 */
+	struct css_set *sgrp_cset;
+
+	/*
 	 * On the default hierarchy, ->subsys[ssid] may point to a css
 	 * attached to an ancestor instead of the cgroup this css_set is
 	 * associated with.  The following node is anchored at
@@ -285,6 +293,24 @@ struct cgroup {
 	struct list_head e_csets[CGROUP_SUBSYS_COUNT];
 
 	/*
+	 * If not NULL, the cgroup is a rgroup (resource group) of the
+	 * process associated with the following signal struct.  A rgroup
+	 * is used for in-process resource control.  rgroups are created by
+	 * specifying CLONE_NEWRGRP during clone(2), tied to the associated
+	 * process, and invisible and transparent to cgroupfs.
+	 *
+	 * The term "sgroup" (system group) is used for a cgroup which is
+	 * explicitly not a rgroup.
+	 */
+	struct signal_struct *rgrp_sig;
+
+	/* top-level rgroups linked on rgrp_sig->rgrps */
+	struct list_head rgrp_node;
+
+	/* signal structs with rgroups below this cgroup */
+	struct list_head rgrp_child_sigs;
+
+	/*
 	 * list of pidlists, up to two for each namespace (one for procs, one
 	 * for tasks); created on demand.
 	 */
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 1e00fc0..ca1ec50 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -107,6 +107,7 @@ extern void cgroup_cancel_fork(struct task_struct *p, unsigned long clone_flags,
 			       struct css_set *new_rgrp_cset);
 extern void cgroup_post_fork(struct task_struct *p, unsigned long clone_flags,
 			     struct css_set *new_rgrp_cset);
+int cgroup_exec(void);
 void cgroup_exit(struct task_struct *p);
 void cgroup_free(struct task_struct *p);
 
@@ -548,6 +549,7 @@ static inline void cgroup_cancel_fork(struct task_struct *p,
 static inline void cgroup_post_fork(struct task_struct *p,
 				    unsigned long clone_flags,
 				    struct css_set *new_rgrp_cset) {}
+static inline int cgroup_exec(void) { return 0; }
 static inline void cgroup_exit(struct task_struct *p) {}
 static inline void cgroup_free(struct task_struct *p) {}
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d4ae795..7886919 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -778,6 +778,10 @@ struct signal_struct {
 	unsigned audit_tty_log_passwd;
 	struct tty_audit_buf *tty_audit_buf;
 #endif
+#ifdef CONFIG_CGROUPS
+	struct list_head rgrps;		/* top-level rgroups under this sig */
+	struct list_head rgrp_node;	/* parent_sgrp->child_rgrp_sigs list */
+#endif
 
 	oom_flags_t oom_flags;
 	short oom_score_adj;		/* OOM kill score adjustment */
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index cc89dde..ac6cec9 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -9,6 +9,7 @@
 #define CLONE_FS	0x00000200	/* set if fs info shared between processes */
 #define CLONE_FILES	0x00000400	/* set if open files shared between processes */
 #define CLONE_SIGHAND	0x00000800	/* set if signal handlers and blocked signals shared */
+#define CLONE_NEWRGRP	0x00001000	/* New resource group */
 #define CLONE_PTRACE	0x00002000	/* set if we want to let tracing continue on the child too */
 #define CLONE_VFORK	0x00004000	/* set if the parent wants the child to wake it up on mm_release */
 #define CLONE_PARENT	0x00008000	/* set if we want to have the same parent as the cloner */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 70f9985..53f479c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -126,6 +126,13 @@ static struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
 static struct workqueue_struct *cgroup_destroy_wq;
 
 /*
+ * rgroups are automatically destroyed when they become unpopulated.
+ * Destructions are bounced through the following workqueue which is
+ * ordered to avoid trying to destroy a parent before its children.
+ */
+static struct workqueue_struct *rgroup_destroy_wq;
+
+/*
  * pidlist destructions need to be flushed on cgroup destruction.  Use a
  * separate workqueue as flush domain.
  */
@@ -228,6 +235,7 @@ static int cgroup_apply_control(struct cgroup *cgrp);
 static void cgroup_finalize_control(struct cgroup *cgrp, int ret);
 static void css_task_iter_advance(struct css_task_iter *it);
 static int cgroup_destroy_locked(struct cgroup *cgrp);
+static void rgroup_destroy_schedule(struct cgroup *rgrp);
 static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
 					      struct cgroup_subsys *ss);
 static void css_release(struct percpu_ref *ref);
@@ -242,6 +250,16 @@ static int cgroup_addrm_files(struct cgroup_subsys_state *css,
 static void cgroup_lock(void)
 	__acquires(&cgroup_mutex)
 {
+	/*
+	 * In-flight rgroup destructions can interfere with subsequent
+	 * operations.  For example, rmdir of the nearest sgroup would fail
+	 * while rgroup destructions are in flight.  rgroup destructions
+	 * don't involve any time-consuming operations and the following
+	 * flush shouldn't be noticeable.
+	 */
+	if (rgroup_destroy_wq)
+		flush_workqueue(rgroup_destroy_wq);
+
 	mutex_lock(&cgroup_mutex);
 }
 
@@ -330,6 +348,11 @@ static bool cgroup_on_dfl(const struct cgroup *cgrp)
 	return cgrp->root == &cgrp_dfl_root;
 }
 
+static bool is_rgroup(struct cgroup *cgrp)
+{
+	return cgrp->rgrp_sig;
+}
+
 /* IDR wrappers which synchronize using cgroup_idr_lock */
 static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
 			    gfp_t gfp_mask)
@@ -370,12 +393,29 @@ static struct cgroup *cgroup_parent(struct cgroup *cgrp)
 	return NULL;
 }
 
+/**
+ * nearest_sgroup - find the nearest system group
+ * @cgrp: cgroup in question
+ *
+ * Find the closest sgroup ancestor.  If @cgrp is not a rgroup, @cgrp is
+ * returned.  A rgroup subtree is always nested under a sgroup.
+ */
+static struct cgroup *nearest_sgroup(struct cgroup *cgrp)
+{
+	while (is_rgroup(cgrp))
+		cgrp = cgroup_parent(cgrp);
+	return cgrp;
+}
+
 /* subsystems visibly enabled on a cgroup */
 static u16 cgroup_control(struct cgroup *cgrp)
 {
 	struct cgroup *parent = cgroup_parent(cgrp);
 	u16 root_ss_mask = cgrp->root->subsys_mask;
 
+	if (is_rgroup(cgrp))
+		return 0;
+
 	if (parent)
 		return parent->subtree_control;
 
@@ -390,6 +430,9 @@ static u16 cgroup_ss_mask(struct cgroup *cgrp)
 {
 	struct cgroup *parent = cgroup_parent(cgrp);
 
+	if (is_rgroup(cgrp))
+		return 0;
+
 	if (parent)
 		return parent->subtree_ss_mask;
 
@@ -620,22 +663,26 @@ static void check_for_release(struct cgroup *cgrp);
 
 int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
 {
+	cgrp = nearest_sgroup(cgrp);
 	return kernfs_name(cgrp->kn, buf, buflen);
 }
 
 char * __must_check cgroup_path(struct cgroup *cgrp, char *buf, size_t buflen)
 {
+	cgrp = nearest_sgroup(cgrp);
 	return kernfs_path(cgrp->kn, buf, buflen);
 }
 EXPORT_SYMBOL_GPL(cgroup_path);
 
 void pr_cont_cgroup_name(struct cgroup *cgrp)
 {
+	cgrp = nearest_sgroup(cgrp);
 	pr_cont_kernfs_name(cgrp->kn);
 }
 
 void pr_cont_cgroup_path(struct cgroup *cgrp)
 {
+	cgrp = nearest_sgroup(cgrp);
 	pr_cont_kernfs_path(cgrp->kn);
 }
 
@@ -720,8 +767,14 @@ static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
 		if (!trigger)
 			break;
 
-		check_for_release(cgrp);
-		cgroup_file_notify(&cgrp->events_file);
+		/* rgroups are automatically destroyed when empty */
+		if (is_rgroup(cgrp)) {
+			if (!cgrp->populated_cnt)
+				rgroup_destroy_schedule(cgrp);
+		} else {
+			check_for_release(cgrp);
+			cgroup_file_notify(&cgrp->events_file);
+		}
 
 		cgrp = cgroup_parent(cgrp);
 	} while (cgrp);
@@ -856,6 +909,9 @@ static void put_css_set_locked(struct css_set *cset)
 		kfree(link);
 	}
 
+	if (cset->sgrp_cset)
+		put_css_set_locked(cset->sgrp_cset);
+
 	kfree_rcu(cset, rcu_head);
 }
 
@@ -1154,6 +1210,16 @@ static struct css_set *find_css_set(struct css_set *old_cset,
 
 	spin_unlock_bh(&css_set_lock);
 
+	if (is_rgroup(cset->dfl_cgrp)) {
+		struct cgroup *c = nearest_sgroup(cset->dfl_cgrp);
+
+		cset->sgrp_cset = find_css_set(cset, c);
+		if (!cset->sgrp_cset) {
+			put_css_set(cset);
+			return NULL;
+		}
+	}
+
 	return cset;
 }
 
@@ -1909,6 +1975,8 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 	INIT_LIST_HEAD(&cgrp->self.sibling);
 	INIT_LIST_HEAD(&cgrp->self.children);
 	INIT_LIST_HEAD(&cgrp->cset_links);
+	INIT_LIST_HEAD(&cgrp->rgrp_child_sigs);
+	INIT_LIST_HEAD(&cgrp->rgrp_node);
 	INIT_LIST_HEAD(&cgrp->pidlists);
 	mutex_init(&cgrp->pidlist_mutex);
 	cgrp->self.cgroup = cgrp;
@@ -3307,9 +3375,10 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 				continue;
 			}
 
-			/* a child has it enabled? */
+			/* a child sgroup has it enabled? */
 			cgroup_for_each_live_child(child, cgrp) {
-				if (child->subtree_control & (1 << ssid)) {
+				if (!is_rgroup(child) &&
+				    child->subtree_control & (1 << ssid)) {
 					ret = -EBUSY;
 					goto out_unlock;
 				}
@@ -5060,7 +5129,8 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
 	return ERR_PTR(err);
 }
 
-static struct cgroup *cgroup_create(struct cgroup *parent)
+static struct cgroup *cgroup_create(struct cgroup *parent,
+				    struct signal_struct *rgrp_sig)
 {
 	struct cgroup_root *root = parent->root;
 	struct cgroup *cgrp, *tcgrp;
@@ -5103,6 +5173,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 		set_bit(CGRP_CPUSET_CLONE_CHILDREN, &cgrp->flags);
 
 	cgrp->self.serial_nr = css_serial_nr_next++;
+	cgrp->rgrp_sig = rgrp_sig;
 
 	/* allocation complete, commit to creation */
 	list_add_tail_rcu(&cgrp->self.sibling, &cgroup_parent(cgrp)->self.children);
@@ -5156,7 +5227,7 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
 	if (!parent)
 		return -ENODEV;
 
-	cgrp = cgroup_create(parent);
+	cgrp = cgroup_create(parent, NULL);
 	if (IS_ERR(cgrp)) {
 		ret = PTR_ERR(cgrp);
 		goto out_unlock;
@@ -5201,6 +5272,75 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
 	return ret;
 }
 
+static void rgroup_destroy_work_fn(struct work_struct *work)
+{
+	struct cgroup *rgrp = container_of(work, struct cgroup,
+					   self.destroy_work);
+	struct signal_struct *sig = rgrp->rgrp_sig;
+
+	/*
+	 * cgroup_lock() flushes rgroup_destroy_wq and using it here would
+	 * lead to deadlock.  Grab cgroup_mutex directly.
+	 */
+	mutex_lock(&cgroup_mutex);
+
+	if (WARN_ON_ONCE(cgroup_destroy_locked(rgrp))) {
+		mutex_unlock(&cgroup_mutex);
+		return;
+	}
+
+	list_del(&rgrp->rgrp_node);
+
+	if (sig && list_empty(&sig->rgrps)) {
+		list_del(&sig->rgrp_node);
+		put_signal_struct(sig);
+	}
+
+	mutex_unlock(&cgroup_mutex);
+}
+
+/**
+ * rgroup_destroy_schedule - schedule destruction of a rgroup
+ * @rgrp: rgroup to be destroyed
+ *
+ * Schedule destruction of @rgrp.  Destructions are guaranteed to be
+ * performed in order and flushed on cgroup_lock().
+ */
+static void rgroup_destroy_schedule(struct cgroup *rgrp)
+{
+	INIT_WORK(&rgrp->self.destroy_work, rgroup_destroy_work_fn);
+	queue_work(rgroup_destroy_wq, &rgrp->self.destroy_work);
+}
+
+/**
+ * rgroup_create - create a rgroup
+ * @parent: parent cgroup (sgroup or rgroup)
+ * @sig: signal_struct of the target process
+ *
+ * Create a rgroup under @parent for the process associated with @sig.
+ */
+static struct cgroup *rgroup_create(struct cgroup *parent,
+				    struct signal_struct *sig)
+{
+	struct cgroup *rgrp;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	rgrp = cgroup_create(parent, sig);
+	if (IS_ERR(rgrp))
+		return rgrp;
+
+	if (!is_rgroup(parent))
+		list_add_tail(&rgrp->rgrp_node, &sig->rgrps);
+
+	if (list_empty(&sig->rgrp_node)) {
+		atomic_inc(&sig->sigcnt);
+		list_add_tail(&sig->rgrp_node, &parent->rgrp_child_sigs);
+	}
+
+	return rgrp;
+}
+
 /*
  * This is called when the refcnt of a css is confirmed to be killed.
  * css_tryget_online() is now guaranteed to fail.  Tell the subsystem to
@@ -5562,6 +5702,9 @@ static int __init cgroup_wq_init(void)
 	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
 	BUG_ON(!cgroup_destroy_wq);
 
+	rgroup_destroy_wq = alloc_ordered_workqueue("rgroup_destroy", 0);
+	BUG_ON(!rgroup_destroy_wq);
+
 	/*
 	 * Used to destroy pidlists and separate to serve as flush domain.
 	 * Cap @max_active to 1 too.
@@ -5694,7 +5837,8 @@ static const struct file_operations proc_cgroupstats_operations = {
  * @clone_flags: clone flags if forking
  *
  * Called from threadgroup_change_begin() and allows cgroup operations to
- * synchronize against threadgroup changes using a percpu_rw_semaphore.
+ * synchronize against threadgroup changes using a percpu_rw_semaphore.  If
+ * clone(2) is requesting a new rgroup, also grab cgroup_mutex.
  */
 void cgroup_threadgroup_change_begin(struct task_struct *tsk,
 				     struct task_struct *child,
@@ -5709,6 +5853,9 @@ void cgroup_threadgroup_change_begin(struct task_struct *tsk,
 		 */
 		RCU_INIT_POINTER(child->cgroups, &init_css_set);
 		INIT_LIST_HEAD(&child->cg_list);
+
+		if (clone_flags & CLONE_NEWRGRP)
+			cgroup_lock();
 	}
 
 	percpu_down_read(&cgroup_threadgroup_rwsem);
@@ -5728,6 +5875,9 @@ void cgroup_threadgroup_change_end(struct task_struct *tsk,
 				   unsigned long clone_flags)
 {
 	percpu_up_read(&cgroup_threadgroup_rwsem);
+
+	if (child && (clone_flags & CLONE_NEWRGRP))
+		cgroup_unlock();
 }
 
 /**
@@ -5746,6 +5896,23 @@ int cgroup_can_fork(struct task_struct *child, unsigned long clone_flags,
 	struct cgroup_subsys *ss;
 	int i, j, ret;
 
+	if (clone_flags & CLONE_NEWRGRP) {
+		struct css_set *cset = task_css_set(current);
+		struct cgroup *rgrp;
+
+		rgrp = rgroup_create(cset->dfl_cgrp, current->signal);
+		if (IS_ERR(rgrp))
+			return PTR_ERR(rgrp);
+
+		*new_rgrp_csetp = find_css_set(cset, rgrp);
+		if (IS_ERR(*new_rgrp_csetp)) {
+			rgroup_destroy_schedule(rgrp);
+			return PTR_ERR(*new_rgrp_csetp);
+		}
+	} else {
+		*new_rgrp_csetp = NULL;
+	}
+
 	do_each_subsys_mask(ss, i, have_canfork_callback) {
 		ret = ss->can_fork(child);
 		if (ret)
@@ -5780,6 +5947,11 @@ void cgroup_cancel_fork(struct task_struct *child, unsigned long clone_flags,
 	struct cgroup_subsys *ss;
 	int i;
 
+	if (new_rgrp_cset) {
+		rgroup_destroy_schedule(new_rgrp_cset->dfl_cgrp);
+		put_css_set(new_rgrp_cset);
+	}
+
 	for_each_subsys(ss, i)
 		if (ss->cancel_fork)
 			ss->cancel_fork(child);
@@ -5828,11 +6000,29 @@ void cgroup_post_fork(struct task_struct *child, unsigned long clone_flags,
 		struct css_set *cset;
 
 		spin_lock_bh(&css_set_lock);
-		cset = task_css_set(current);
+
+		/*
+		 * If @new_rgrp_cset is set, it contains the requested new
+		 * rgroup created by cgroup_can_fork().
+		 */
+		if (new_rgrp_cset) {
+			cset = new_rgrp_cset;
+		} else {
+			cset = task_css_set(current);
+			/*
+			 * If a new process is being created, it shouldn't
+			 * be put in this process's rgroup.  Escape it to
+			 * the nearest sgroup.
+			 */
+			if (!(clone_flags & CLONE_THREAD) && cset->sgrp_cset)
+				cset = cset->sgrp_cset;
+		}
+
 		if (list_empty(&child->cg_list)) {
 			get_css_set(cset);
 			css_set_move_task(child, NULL, cset, false);
 		}
+
 		spin_unlock_bh(&css_set_lock);
 	}
 
@@ -5846,6 +6036,29 @@ void cgroup_post_fork(struct task_struct *child, unsigned long clone_flags,
 	} while_each_subsys_mask();
 }
 
+int cgroup_exec(void)
+{
+	struct cgroup *cgrp;
+	bool is_rgrp;
+	int ret;
+
+	/* whether a task is in a sgroup or rgroup is immutable */
+	rcu_read_lock();
+	is_rgrp = is_rgroup(task_css_set(current)->dfl_cgrp);
+	rcu_read_unlock();
+
+	if (!is_rgrp)
+		return 0;
+
+	/* exec should reset rgroup, escape to the nearest sgroup */
+	cgroup_lock();
+	cgrp = nearest_sgroup(task_css_set(current)->dfl_cgrp);
+	ret = cgroup_attach_task(cgrp, current, CGRP_MIGRATE_PROCESS);
+	cgroup_unlock();
+
+	return ret;
+}
+
 /**
  * cgroup_exit - detach cgroup from exiting task
  * @tsk: pointer to task_struct of exiting process
diff --git a/kernel/fork.c b/kernel/fork.c
index 840b662..70903fc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -234,6 +234,9 @@ EXPORT_SYMBOL(free_task);
 
 static inline void free_signal_struct(struct signal_struct *sig)
 {
+#ifdef CONFIG_CGROUPS
+	WARN_ON_ONCE(!list_empty(&sig->rgrps));
+#endif
 	taskstats_tgid_free(sig);
 	sched_autogroup_exit(sig);
 	kmem_cache_free(signal_cachep, sig);
@@ -1159,6 +1162,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 
 	mutex_init(&sig->cred_guard_mutex);
 
+#ifdef CONFIG_CGROUPS
+	INIT_LIST_HEAD(&sig->rgrps);
+	INIT_LIST_HEAD(&sig->rgrp_node);
+#endif
 	return 0;
 }
 
@@ -1293,6 +1300,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 			return ERR_PTR(-EINVAL);
 	}
 
+	/* Only threads can be put in child resource groups. */
+	if (!(clone_flags & CLONE_THREAD) && (clone_flags & CLONE_NEWRGRP))
+		return ERR_PTR(-EINVAL);
+
 	retval = security_task_create(clone_flags);
 	if (retval)
 		goto fork_out;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 08/10] cgroup: implement rgroup control mask handling
  2016-03-11 15:41 [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Tejun Heo
                   ` (6 preceding siblings ...)
  2016-03-11 15:41 ` [PATCH 07/10] cgroup: introduce resource group Tejun Heo
@ 2016-03-11 15:41 ` Tejun Heo
  2016-03-11 15:41 ` [PATCH 09/10] cgroup: implement rgroup subtree migration Tejun Heo
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2016-03-11 15:41 UTC (permalink / raw)
  To: torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt
  Cc: linux-kernel, cgroups, linux-api, kernel-team, Tejun Heo,
	Peter Zijlstra, Oleg Nesterov

To support in-process resource control, the previous patch implemented
the basic rgroup infrastructure which allows creating rgroups by
specifying CLONE_NEWRGRP during clone(2).  This patch implements
control mask handling for rgroups so that controllers can be enabled
on them.

There can be multiple top-level rgroups per process, all nested right
below the sgroup that the process belongs to.  They should share the
same control masks which can be different from those of top-level
rgroups belonging to different processes or peer tasks.  These
top-level rgroup control masks are added to signal_struct and all
control mask handling functions are updated accordingly.

Note that a top-level rgroup should behave the same as a peer task
rather than as a sibling cgroup.  As such, the set of controllers
available to a rgroup should be the same as the set for peer tasks.
To satisfy this requirement, controller availability for a top-level
rgroup is constrained by cgroup_{control|ss_mask}(nearest_sgrp)
instead of nearest_sgrp->subtree_{control|ss_mask}.

After this change, csses can be created on rgroups.  The
css_{clear|populate}_dir() functions are updated to become no-ops for
csses on rgroups.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Turner <pjt@google.com>
---
 include/linux/sched.h |  4 ++++
 kernel/cgroup.c       | 49 +++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 47 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7886919..d3849ad 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -781,6 +781,10 @@ struct signal_struct {
 #ifdef CONFIG_CGROUPS
 	struct list_head rgrps;		/* top-level rgroups under this sig */
 	struct list_head rgrp_node;	/* parent_sgrp->child_rgrp_sigs list */
+	u16 rgrp_subtree_control;	/* control for top-level rgroups */
+	u16 rgrp_subtree_ss_mask;	/* ss_mask for top-level rgroups */
+	u16 rgrp_old_subtree_control;	/* used during control updates */
+	u16 rgrp_old_subtree_ss_mask;	/* used during control updates */
 #endif
 
 	oom_flags_t oom_flags;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 53f479c..283b7ed 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -413,8 +413,8 @@ static u16 cgroup_control(struct cgroup *cgrp)
 	struct cgroup *parent = cgroup_parent(cgrp);
 	u16 root_ss_mask = cgrp->root->subsys_mask;
 
-	if (is_rgroup(cgrp))
-		return 0;
+	if (is_rgroup(cgrp) && !is_rgroup(parent))
+		return cgrp->rgrp_sig->rgrp_subtree_control;
 
 	if (parent)
 		return parent->subtree_control;
@@ -430,8 +430,8 @@ static u16 cgroup_ss_mask(struct cgroup *cgrp)
 {
 	struct cgroup *parent = cgroup_parent(cgrp);
 
-	if (is_rgroup(cgrp))
-		return 0;
+	if (is_rgroup(cgrp) && !is_rgroup(parent))
+		return cgrp->rgrp_sig->rgrp_subtree_ss_mask;
 
 	if (parent)
 		return parent->subtree_ss_mask;
@@ -1565,6 +1565,9 @@ static void css_clear_dir(struct cgroup_subsys_state *css)
 
 	css->flags &= ~CSS_VISIBLE;
 
+	if (is_rgroup(cgrp))
+		return;
+
 	list_for_each_entry(cfts, &css->ss->cfts, node)
 		cgroup_addrm_files(css, cgrp, cfts, false);
 }
@@ -1584,6 +1587,9 @@ static int css_populate_dir(struct cgroup_subsys_state *css)
 	if ((css->flags & CSS_VISIBLE) || !cgrp->kn)
 		return 0;
 
+	if (is_rgroup(cgrp))
+		goto done;
+
 	if (!css->ss) {
 		if (cgroup_on_dfl(cgrp))
 			cfts = cgroup_dfl_base_files;
@@ -1600,9 +1606,8 @@ static int css_populate_dir(struct cgroup_subsys_state *css)
 			goto err;
 		}
 	}
-
+done:
 	css->flags |= CSS_VISIBLE;
-
 	return 0;
 err:
 	list_for_each_entry(cfts, &css->ss->cfts, node) {
@@ -3116,8 +3121,15 @@ static void cgroup_save_control(struct cgroup *cgrp)
 	struct cgroup_subsys_state *d_css;
 
 	cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
+		struct signal_struct *sig;
+
 		dsct->old_subtree_control = dsct->subtree_control;
 		dsct->old_subtree_ss_mask = dsct->subtree_ss_mask;
+
+		list_for_each_entry(sig, &dsct->rgrp_child_sigs, rgrp_node) {
+			sig->rgrp_old_subtree_control = sig->rgrp_subtree_control;
+			sig->rgrp_old_subtree_ss_mask = sig->rgrp_subtree_ss_mask;
+		}
 	}
 }
 
@@ -3135,10 +3147,19 @@ static void cgroup_propagate_control(struct cgroup *cgrp)
 	struct cgroup_subsys_state *d_css;
 
 	cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
+		struct signal_struct *sig;
+
 		dsct->subtree_control &= cgroup_control(dsct);
 		dsct->subtree_ss_mask =
 			cgroup_calc_subtree_ss_mask(dsct->subtree_control,
 						    cgroup_ss_mask(dsct));
+
+		list_for_each_entry(sig, &dsct->rgrp_child_sigs, rgrp_node) {
+			sig->rgrp_subtree_control &= cgroup_control(dsct);
+			sig->rgrp_subtree_ss_mask =
+				cgroup_calc_subtree_ss_mask(sig->rgrp_subtree_control,
+							    cgroup_ss_mask(dsct));
+		}
 	}
 }
 
@@ -3155,6 +3176,13 @@ static void cgroup_restore_control(struct cgroup *cgrp)
 	struct cgroup_subsys_state *d_css;
 
 	cgroup_for_each_live_descendant_post(dsct, d_css, cgrp) {
+		struct signal_struct *sig;
+
+		list_for_each_entry(sig, &dsct->rgrp_child_sigs, rgrp_node) {
+			sig->rgrp_subtree_control = sig->rgrp_old_subtree_control;
+			sig->rgrp_subtree_ss_mask = sig->rgrp_old_subtree_ss_mask;
+		}
+
 		dsct->subtree_control = dsct->old_subtree_control;
 		dsct->subtree_ss_mask = dsct->old_subtree_ss_mask;
 	}
@@ -5326,6 +5354,15 @@ static struct cgroup *rgroup_create(struct cgroup *parent,
 
 	lockdep_assert_held(&cgroup_mutex);
 
+	if (list_empty(&sig->rgrps)) {
+		WARN_ON_ONCE(is_rgroup(parent));
+
+		sig->rgrp_subtree_control = 0;
+		sig->rgrp_subtree_ss_mask =
+			cgroup_calc_subtree_ss_mask(sig->rgrp_subtree_control,
+						    cgroup_ss_mask(parent));
+	}
+
 	rgrp = cgroup_create(parent, sig);
 	if (IS_ERR(rgrp))
 		return rgrp;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 09/10] cgroup: implement rgroup subtree migration
  2016-03-11 15:41 [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Tejun Heo
                   ` (7 preceding siblings ...)
  2016-03-11 15:41 ` [PATCH 08/10] cgroup: implement rgroup control mask handling Tejun Heo
@ 2016-03-11 15:41 ` Tejun Heo
  2016-03-11 15:41 ` [PATCH 10/10] cgroup, sched: implement PRIO_RGRP for {set|get}priority() Tejun Heo
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2016-03-11 15:41 UTC (permalink / raw)
  To: torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt
  Cc: linux-kernel, cgroups, linux-api, kernel-team, Tejun Heo,
	Peter Zijlstra, Oleg Nesterov

Currently, when a process with rgroups is migrated, rgroup subtrees
are not preserved and all threads are put directly under the migration
destination cgroup.  This patch implements rgroup subtree migration so
that rgroup subtrees are preserved across process migration.

Early during process migration, cgroup_migrate_copy_rgrps() duplicates
rgroup subtrees of a process under the destination cgroup and links
each source rgroup to its copy via the newly added
src_cgrp->rgrp_target.  Also, subsystems can implement the css_copy()
method to copy over settings and whatever state is necessary.  Once
copying is complete, the actual migration uses ->rgrp_target as the
destination.
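
As a sketch, a css_copy() implementation for a hypothetical controller
with a single knob only needs to transfer its own configuration
(css_foo() and foo_set_limit() are made-up names here; the cpu
controller's real implementation follows in the next patch):

	static int foo_css_copy(struct cgroup_subsys_state *to,
				struct cgroup_subsys_state *from)
	{
		/* transfer the controller's only setting to the copy */
		return foo_set_limit(css_foo(to), css_foo(from)->limit);
	}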

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Turner <pjt@google.com>
---
 include/linux/cgroup-defs.h |   5 ++
 kernel/cgroup.c             | 157 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 159 insertions(+), 3 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index f1ee756..9ffa2d8 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -310,6 +310,9 @@ struct cgroup {
 	/* signal structs with rgroups below this cgroup */
 	struct list_head rgrp_child_sigs;
 
+	/* target rgroup, used during rgroup subtree migration */
+	struct cgroup *rgrp_target;
+
 	/*
 	 * list of pidlists, up to two for each namespace (one for procs, one
 	 * for tasks); created on demand.
@@ -462,6 +465,8 @@ struct cgroup_subsys {
 	void (*css_offline)(struct cgroup_subsys_state *css);
 	void (*css_released)(struct cgroup_subsys_state *css);
 	void (*css_free)(struct cgroup_subsys_state *css);
+	int (*css_copy)(struct cgroup_subsys_state *to,
+			struct cgroup_subsys_state *from);
 	void (*css_reset)(struct cgroup_subsys_state *css);
 
 	int (*can_attach)(struct cgroup_taskset *tset);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 283b7ed..6107a1f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -229,8 +229,12 @@ static struct file_system_type cgroup2_fs_type;
 static struct cftype cgroup_dfl_base_files[];
 static struct cftype cgroup_legacy_base_files[];
 
+static struct cgroup *rgroup_create(struct cgroup *parent, struct signal_struct *sig);
 static int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask);
 static void cgroup_lock_and_drain_offline(struct cgroup *cgrp);
+static void cgroup_save_control(struct cgroup *cgrp);
+static void cgroup_propagate_control(struct cgroup *cgrp);
+static void cgroup_restore_control(struct cgroup *cgrp);
 static int cgroup_apply_control(struct cgroup *cgrp);
 static void cgroup_finalize_control(struct cgroup *cgrp, int ret);
 static void css_task_iter_advance(struct css_task_iter *it);
@@ -2478,6 +2482,7 @@ struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset,
 
 enum {
 	CGRP_MIGRATE_PROCESS	= (1 << 0), /* migrate the whole process */
+	CGRP_MIGRATE_COPY_RGRPS	= (1 << 1), /* copy rgroup subtree */
 };
 
 /**
@@ -2752,6 +2757,132 @@ static int cgroup_migrate(struct task_struct *leader, unsigned mflags,
 }
 
 /**
+ * cgroup_migrate_uncopy_rgrps - cancel in-flight rgroup subtree copying
+ * @dst_cgrp: migration target cgroup
+ * @leader: leader of process being migrated
+ *
+ * Undo cgroup_migrate_copy_rgrps().
+ */
+static void cgroup_migrate_uncopy_rgrps(struct cgroup *dst_cgrp,
+					struct task_struct *leader)
+{
+	struct signal_struct *sig = leader->signal;
+	struct cgroup *sgrp = nearest_sgroup(task_css_set(leader)->dfl_cgrp);
+	struct cgroup *rgrp, *dsct;
+	struct cgroup_subsys_state *d_css;
+
+	/* destroy rgroup copies */
+	list_for_each_entry(rgrp, &sig->rgrps, rgrp_node) {
+		cgroup_for_each_live_descendant_post(dsct, d_css, rgrp) {
+			if (dsct->rgrp_target) {
+				rgroup_destroy_schedule(dsct->rgrp_target);
+				dsct->rgrp_target = NULL;
+			}
+		}
+	}
+
+	/* move back @sig under the original nearest sgroup */
+	if (!list_empty(&sig->rgrp_node))
+		list_move_tail(&sig->rgrp_node, &sgrp->rgrp_child_sigs);
+
+	sgrp->rgrp_target = NULL;
+	cgroup_restore_control(sgrp);
+}
+
+/**
+ * cgroup_migrate_copy_rgrps - copy a process's rgroup subtrees
+ * @dst_cgrp: migration target cgroup
+ * @leader: leader of process being migrated
+ *
+ * @leader and its threads are being migrated under @dst_cgrp.  Copy the
+ * process's rgroup subtrees under @dst_cgrp and make the source rgroups
+ * and their nearest sgroup point to the counterpart in the copied subtrees
+ * via ->rgrp_target.
+ *
+ * Before process migration is complete, this operation can be canceled
+ * using cgroup_migrate_uncopy_rgrps().
+ */
+static int cgroup_migrate_copy_rgrps(struct cgroup *dst_cgrp,
+				     struct task_struct *leader)
+{
+	struct signal_struct *sig = leader->signal;
+	struct cgroup *sgrp = nearest_sgroup(task_css_set(leader)->dfl_cgrp);
+	struct cgroup *rgrp, *dsct;
+	struct cgroup_subsys_state *d_css;
+	int ret;
+
+	if (WARN_ON_ONCE(!cgroup_on_dfl(dst_cgrp)))
+		return -EINVAL;
+
+	/* save for uncopy */
+	cgroup_save_control(sgrp);
+	sgrp->rgrp_target = dst_cgrp;
+
+	/*
+	 * Move @sig under @dst_cgrp for correct control propagation and
+	 * update its control masks.
+	 */
+	if (!list_empty(&sig->rgrp_node))
+		list_move_tail(&sig->rgrp_node, &dst_cgrp->rgrp_child_sigs);
+
+	cgroup_propagate_control(dst_cgrp);
+
+	/*
+	 * Walk and copy each rgroup.  As top-level copies are appended to
+	 * &sig->rgrps, terminate on encountering one.
+	 */
+	list_for_each_entry(rgrp, &sig->rgrps, rgrp_node) {
+		if (cgroup_parent(rgrp) == dst_cgrp)
+			break;
+
+		cgroup_for_each_live_descendant_pre(dsct, d_css, rgrp) {
+			struct cgroup *parent = cgroup_parent(dsct);
+			struct cgroup *copy;
+			struct cgroup_subsys_state *copy_css;
+			int ssid;
+
+			if (WARN_ON_ONCE(!parent->rgrp_target) ||
+			    WARN_ON_ONCE(dsct->rgrp_target)) {
+				ret = -EINVAL;
+				goto out_uncopy;
+			}
+
+			/* create a copy and refresh its control masks */
+			copy = rgroup_create(parent->rgrp_target, sig);
+			if (IS_ERR(copy)) {
+				ret = PTR_ERR(copy);
+				goto out_uncopy;
+			}
+
+			copy->subtree_control = dsct->subtree_control;
+			cgroup_propagate_control(copy);
+
+			dsct->rgrp_target = copy;
+
+			/* copy subsystem states */
+			for_each_css(copy_css, ssid, copy) {
+				struct cgroup_subsys *ss = copy_css->ss;
+				struct cgroup_subsys_state *css =
+					cgroup_css(dsct, ss);
+
+				if (!ss->css_copy || !css)
+					continue;
+
+				ret = ss->css_copy(copy_css, css);
+				if (ret)
+					goto out_uncopy;
+			}
+		}
+	}
+
+	return 0;
+
+out_uncopy:
+	cgroup_migrate_uncopy_rgrps(dst_cgrp, leader);
+	return ret;
+}
+
+/**
  * cgroup_attach_task - attach a task or a whole threadgroup to a cgroup
  * @dst_cgrp: the cgroup to attach to
  * @leader: the task or the leader of the threadgroup to be attached
@@ -2762,6 +2893,7 @@ static int cgroup_migrate(struct task_struct *leader, unsigned mflags,
 static int cgroup_attach_task(struct cgroup *dst_cgrp,
 			      struct task_struct *leader, unsigned mflags)
 {
+	bool copy_rgrps = mflags & CGRP_MIGRATE_COPY_RGRPS;
 	LIST_HEAD(preloaded_csets);
 	struct task_struct *task;
 	int ret;
@@ -2769,13 +2901,25 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
 	if (!cgroup_may_migrate_to(dst_cgrp))
 		return -EBUSY;
 
+	if (copy_rgrps) {
+		ret = cgroup_migrate_copy_rgrps(dst_cgrp, leader);
+		if (ret)
+			return ret;
+	}
+
 	/* look up all src csets */
 	spin_lock_bh(&css_set_lock);
 	rcu_read_lock();
 	task = leader;
 	do {
-		cgroup_migrate_add_src(task_css_set(task), dst_cgrp,
-				       &preloaded_csets);
+		struct css_set *src_cset = task_css_set(task);
+		struct cgroup *dfl_cgrp = src_cset->dfl_cgrp;
+		struct cgroup *target_cgrp = dst_cgrp;
+
+		if (copy_rgrps)
+			target_cgrp = dfl_cgrp->rgrp_target;
+
+		cgroup_migrate_add_src(src_cset, target_cgrp, &preloaded_csets);
 		if (!(mflags & CGRP_MIGRATE_PROCESS))
 			break;
 	} while_each_thread(leader, task);
@@ -2788,6 +2932,10 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
 		ret = cgroup_migrate(leader, mflags, dst_cgrp->root);
 
 	cgroup_migrate_finish(&preloaded_csets);
+
+	if (copy_rgrps && ret)
+		cgroup_migrate_uncopy_rgrps(dst_cgrp, leader);
+
 	return ret;
 }
 
@@ -2864,8 +3012,11 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 		tsk = current;
 	}
 
-	if (mflags & CGRP_MIGRATE_PROCESS)
+	if (mflags & CGRP_MIGRATE_PROCESS) {
 		tsk = tsk->group_leader;
+		if (cgroup_on_dfl(cgrp))
+			mflags |= CGRP_MIGRATE_COPY_RGRPS;
+	}
 
 	/*
 	 * Workqueue threads may acquire PF_NO_SETAFFINITY and become
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 10/10] cgroup, sched: implement PRIO_RGRP for {set|get}priority()
  2016-03-11 15:41 [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Tejun Heo
                   ` (8 preceding siblings ...)
  2016-03-11 15:41 ` [PATCH 09/10] cgroup: implement rgroup subtree migration Tejun Heo
@ 2016-03-11 15:41 ` Tejun Heo
  2016-03-11 16:05 ` Example program for PRIO_RGRP Tejun Heo
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2016-03-11 15:41 UTC (permalink / raw)
  To: torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt
  Cc: linux-kernel, cgroups, linux-api, kernel-team, Tejun Heo,
	Peter Zijlstra, Oleg Nesterov

One of the missing features in cgroup v2 is the ability to control cpu
cycle distribution hierarchically among threads of a process.  With
rgroup infrastructure in place, this can be implemented as a natural
extension of setpriority().

This patch introduces a new @which selector PRIO_RGRP for
{set|get}priority() which can be used only when the calling thread is
in a rgroup and respectively sets and gets the nice priority of the
rgroup that the calling thread belongs to.  The nice values have
exactly the same meaning as for a single task and top-level rgroups
compete with peer tasks as if the entire subtree is a single task with
the specified nice value.

setpriority(PRIO_RGRP, nice) automatically enables the cpu controller
up to the rgroup of the thread.  The cpu controller is available iff
it's mounted on the default hierarchy and available on the nearest
sgroup (i.e. the parent of the nearest sgroup should have it enabled in its
subtree_control).  If the controller isn't available, setpriority()
fails with -ENODEV.

If the cpu controller is made unavailable either through clearing of
subtree_control or migration to a cgroup which doesn't have it
available, cpu controller is disabled for the affected rgroup
subtrees.
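
For example, a process can put a worker thread in its own rgroup and
weight that rgroup as a nice 5 task with something like the minimal
sketch below (error handling omitted; CLONE_NEWRGRP and PRIO_RGRP are
the new constants from this series and glibc has no wrappers for them,
so raw clone(2) is used, as in the test program attached later in the
thread):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/types.h>
	#include <sys/time.h>
	#include <sys/resource.h>

	#define CLONE_NEWRGRP	0x00001000	/* from this series */
	#define PRIO_RGRP	3		/* from this series */
	#define STACK_SIZE	(4 * 1024 * 1024)

	static int worker_fn(void *arg)
	{
		/* ... do the actual work ... */
		return 0;
	}

	int main(void)
	{
		char *stack = malloc(STACK_SIZE) + STACK_SIZE;
		pid_t tid;

		/* child starts in a fresh rgroup under this process's cgroup */
		tid = clone(worker_fn, stack,
			    CLONE_THREAD | CLONE_SIGHAND | CLONE_VM |
			    CLONE_FS | CLONE_FILES | CLONE_NEWRGRP, NULL);

		/* weight the new rgroup as if it were a nice 5 task */
		setpriority(PRIO_RGRP, tid, 5);

		pause();
		return 0;
	}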

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Turner <pjt@google.com>
---
 include/linux/cgroup.h        |   4 +
 include/linux/sched.h         |   5 ++
 include/uapi/linux/resource.h |   1 +
 kernel/cgroup.c               | 190 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c           |  32 +++++++
 kernel/sys.c                  |  11 ++-
 6 files changed, 241 insertions(+), 2 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ca1ec50..885c29e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -110,6 +110,8 @@ extern void cgroup_post_fork(struct task_struct *p, unsigned long clone_flags,
 int cgroup_exec(void);
 void cgroup_exit(struct task_struct *p);
 void cgroup_free(struct task_struct *p);
+int rgroup_setpriority(pid_t vpid, int nice);
+int rgroup_getpriority(pid_t vpid);
 
 int cgroup_init_early(void);
 int cgroup_init(void);
@@ -552,6 +554,8 @@ static inline void cgroup_post_fork(struct task_struct *p,
 static inline int cgroup_exec(void) { return 0; }
 static inline void cgroup_exit(struct task_struct *p) {}
 static inline void cgroup_free(struct task_struct *p) {}
+static inline int rgroup_setpriority(pid_t vpid, int nice) { return -ENODEV; }
+static inline int rgroup_getpriority(pid_t vpid) { return -ENODEV; }
 
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d3849ad..36fc5cb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2371,6 +2371,11 @@ extern u64 scheduler_tick_max_deferment(void);
 static inline bool sched_can_stop_tick(void) { return false; }
 #endif
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+extern int cpu_cgroup_setpriority(struct cgroup_subsys_state *css, int nice);
+extern int cpu_cgroup_getpriority(struct cgroup_subsys_state *css);
+#endif
+
 #ifdef CONFIG_SCHED_AUTOGROUP
 extern void sched_autogroup_create_attach(struct task_struct *p);
 extern void sched_autogroup_detach(struct task_struct *p);
diff --git a/include/uapi/linux/resource.h b/include/uapi/linux/resource.h
index 36fb3b5..da15cb1 100644
--- a/include/uapi/linux/resource.h
+++ b/include/uapi/linux/resource.h
@@ -57,6 +57,7 @@ struct rlimit64 {
 #define	PRIO_PROCESS	0
 #define	PRIO_PGRP	1
 #define	PRIO_USER	2
+#define	PRIO_RGRP	3
 
 /*
  * Limit the stack by to some sane default: root can always
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 6107a1f..92eb74d 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -6305,6 +6305,196 @@ void cgroup_free(struct task_struct *task)
 	put_css_set(cset);
 }
 
+/**
+ * task_rgroup_lock_and_drain_offline - lock a task's rgroup and drain
+ * @task: target task
+ *
+ * Look up @task's rgroup, lock, drain and return it.  If @task doesn't
+ * belong to a rgroup, ERR_PTR(-ENODEV) is returned.
+ */
+static struct cgroup *
+task_rgroup_lock_and_drain_offline(struct task_struct *task)
+{
+	struct cgroup *rgrp;
+
+retry:
+	rcu_read_lock();
+
+	do {
+		rgrp = task_css_set(task)->dfl_cgrp;
+		if (!is_rgroup(rgrp)) {
+			rcu_read_unlock();
+			return ERR_PTR(-ENODEV);
+		}
+
+		if (cgroup_tryget(rgrp))
+			break;
+
+		cpu_relax();
+	} while (true);
+
+	rcu_read_unlock();
+
+	cgroup_lock_and_drain_offline(rgrp);
+
+	/* did we race against migration? */
+	if (rgrp != task_css_set(task)->dfl_cgrp) {
+		cgroup_unlock();
+		goto retry;
+	}
+
+	/*
+	 * @task can't be moved to another cgroup while cgroup_mutex is
+	 * held.  No need to hold the extra reference.
+	 */
+	cgroup_put(rgrp);
+
+	return rgrp;
+}
+
+/**
+ * vpid_rgroup_lock_and_drain_offline - lock a vpid's rgroup and drain
+ * @vpid: target vpid
+ * @taskp: out parameter for the found task
+ *
+ * Look up the task for @vpid.  If @vpid is zero, %current is used.  If the
+ * task is found, look up its rgroup, lock, drain and return it.  On
+ * success, the task's refcnt is incremented and *@taskp points to it.
+ * An ERR_PTR() value is returned on failure.
+ */
+static struct cgroup *
+vpid_rgroup_lock_and_drain_offline(pid_t vpid, struct task_struct **taskp)
+{
+	struct task_struct *task;
+	struct cgroup *rgrp;
+
+	rcu_read_lock();
+	if (vpid) {
+		task = find_task_by_vpid(vpid);
+		if (!task) {
+			rcu_read_unlock();
+			return ERR_PTR(-ESRCH);
+		}
+	} else {
+		task = current;
+	}
+	get_task_struct(task);
+	rcu_read_unlock();
+
+	rgrp = task_rgroup_lock_and_drain_offline(task);
+	if (IS_ERR(rgrp))
+		put_task_struct(task);
+	else
+		*taskp = task;
+
+	return rgrp;
+}
+
+/**
+ * rgroup_enable_subsys - enable a subsystem on a rgroup
+ * @rgrp: target rgroup
+ * @sgrp: nearest sgroup of @rgrp
+ * @ss: subsystem to enable
+ *
+ * Try to enable @ss on @rgrp.  On success, 0 is returned and @ss is
+ * enabled on @rgrp; otherwise, -errno is returned.  The caller must always
+ * call cgroup_finalize_control() afterwards.
+ */
+static int __maybe_unused rgroup_enable_subsys(struct cgroup *rgrp,
+					       struct cgroup *sgrp,
+					       struct cgroup_subsys *ss)
+{
+	struct cgroup *pos;
+	int ret;
+
+	lockdep_assert_held(&cgroup_mutex);
+
+	cgroup_save_control(sgrp);
+
+	for (pos = rgrp; pos != sgrp; pos = cgroup_parent(pos)) {
+		struct cgroup *parent = cgroup_parent(pos);
+
+		if (parent == sgrp)
+			pos->rgrp_sig->rgrp_subtree_control |= 1 << ss->id;
+		else
+			parent->subtree_control |= 1 << ss->id;
+	}
+
+	ret = cgroup_apply_control(sgrp);
+	if (ret)
+		return ret;
+
+	/* did control propagation disable @ss? */
+	if (!cgroup_css(rgrp, ss))
+		return -ENODEV;
+
+	return 0;
+}
+
+int rgroup_setpriority(pid_t vpid, int nice)
+{
+	struct task_struct *task;
+	struct cgroup *rgrp;
+	struct cgroup *sgrp __maybe_unused;
+	int ret;
+
+	rgrp = vpid_rgroup_lock_and_drain_offline(vpid, &task);
+	if (IS_ERR(rgrp))
+		return PTR_ERR(rgrp);
+
+	/*
+	 * If @rgrp is top-level, it should be put under the same nice
+	 * level restriction as @task; otherwise, limits are already
+	 * applied higher up the hierarchy and there's no reason to
+	 * restrict nice levels.
+	 */
+	if (!is_rgroup(cgroup_parent(rgrp)) && !can_nice(task, nice)) {
+		ret = -EPERM;
+		goto out_unlock;
+	}
+
+	ret = -ENODEV;
+	/* do ifdef late to preserve the correct error response */
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	sgrp = nearest_sgroup(rgrp);
+
+	/* enable cpu and apply weight */
+	ret = rgroup_enable_subsys(rgrp, sgrp, &cpu_cgrp_subsys);
+	if (!ret)
+		ret = cpu_cgroup_setpriority(cgroup_css(rgrp, &cpu_cgrp_subsys),
+					     nice);
+	cgroup_finalize_control(sgrp, ret);
+#endif
+
+out_unlock:
+	cgroup_unlock();
+	put_task_struct(task);
+	return ret;
+}
+
+int rgroup_getpriority(pid_t vpid)
+{
+	struct task_struct *task;
+	struct cgroup *rgrp;
+	int ret;
+
+	rgrp = vpid_rgroup_lock_and_drain_offline(vpid, &task);
+	if (IS_ERR(rgrp))
+		return PTR_ERR(rgrp);
+
+	ret = -ENODEV;
+	/* do ifdef late to preserve the correct error response */
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	if (cgroup_css(rgrp, &cpu_cgrp_subsys)) {
+		ret = cpu_cgroup_getpriority(cgroup_css(rgrp, &cpu_cgrp_subsys));
+		ret = nice_to_rlimit(ret);
+	}
+#endif
+	cgroup_unlock();
+	put_task_struct(task);
+	return ret;
+}
+
 static void check_for_release(struct cgroup *cgrp)
 {
 	if (notify_on_release(cgrp) && !cgroup_is_populated(cgrp) &&
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 16ad92b..e22e0ce 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8747,6 +8747,35 @@ static int cpu_weight_write_u64(struct cgroup_subsys_state *css,
 
 	return sched_group_set_shares(css_tg(css), scale_load(weight));
 }
+
+static int cpu_cgroup_css_copy(struct cgroup_subsys_state *to,
+			       struct cgroup_subsys_state *from)
+{
+	struct task_group *to_tg = css_tg(to);
+	struct task_group *from_tg = css_tg(from);
+
+	return sched_group_set_shares(to_tg, from_tg->shares);
+}
+
+int cpu_cgroup_setpriority(struct cgroup_subsys_state *css, int nice)
+{
+	int prio = NICE_TO_PRIO(clamp_val(nice, MIN_NICE, MAX_NICE));
+	int weight = sched_prio_to_weight[prio - MAX_RT_PRIO];
+
+	return sched_group_set_shares(css_tg(css), scale_load(weight));
+}
+
+int cpu_cgroup_getpriority(struct cgroup_subsys_state *css)
+{
+	int weight = css_tg(css)->shares;
+	int idx;
+
+	for (idx = 0; idx < ARRAY_SIZE(sched_prio_to_weight) - 1; idx++)
+		if (weight >= sched_prio_to_weight[idx])
+			break;
+
+	return PRIO_TO_NICE(idx + MAX_RT_PRIO);
+}
 #endif
 
 static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
@@ -8835,6 +8864,9 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_free	= cpu_cgroup_css_free,
 	.css_online	= cpu_cgroup_css_online,
 	.css_offline	= cpu_cgroup_css_offline,
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	.css_copy	= cpu_cgroup_css_copy,
+#endif
 	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,
diff --git a/kernel/sys.c b/kernel/sys.c
index 78947de..923f66a 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -41,6 +41,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/cgroup.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -181,7 +182,7 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
 	struct pid *pgrp;
 	kuid_t uid;
 
-	if (which > PRIO_USER || which < PRIO_PROCESS)
+	if (which > PRIO_RGRP || which < PRIO_PROCESS)
 		goto out;
 
 	/* normalize: avoid signed division (rounding problems) */
@@ -191,6 +192,9 @@ SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
 	if (niceval > MAX_NICE)
 		niceval = MAX_NICE;
 
+	if (which == PRIO_RGRP)
+		return rgroup_setpriority(who, niceval);
+
 	rcu_read_lock();
 	read_lock(&tasklist_lock);
 	switch (which) {
@@ -251,9 +255,12 @@ SYSCALL_DEFINE2(getpriority, int, which, int, who)
 	struct pid *pgrp;
 	kuid_t uid;
 
-	if (which > PRIO_USER || which < PRIO_PROCESS)
+	if (which > PRIO_RGRP || which < PRIO_PROCESS)
 		return -EINVAL;
 
+	if (which == PRIO_RGRP)
+		return rgroup_getpriority(who);
+
 	rcu_read_lock();
 	read_lock(&tasklist_lock);
 	switch (which) {
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Example program for PRIO_RGRP
  2016-03-11 15:41 [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Tejun Heo
                   ` (9 preceding siblings ...)
  2016-03-11 15:41 ` [PATCH 10/10] cgroup, sched: implement PRIO_RGRP for {set|get}priority() Tejun Heo
@ 2016-03-11 16:05 ` Tejun Heo
  2016-03-12  6:26 ` [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Mike Galbraith
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2016-03-11 16:05 UTC (permalink / raw)
  To: torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt
  Cc: linux-kernel, cgroups, linux-api, kernel-team

[-- Attachment #1: Type: text/plain, Size: 1202 bytes --]

Hello,

The attached test-rgrp-burn.c is an example program making use of the
new PRIO_RGRP.  The test program creates the following rgroup
hierarchy:

 sgroup - main thread
        + [rgroup-0] burner-0
        + [rgroup-1] + [rgroup-2] burner-1
                     + [rgroup-3] burner-2

and takes up to 4 arguments, each specifying the nice level for the
corresponding rgroup.  Each burner thread executes CPU-burning loops and
periodically prints out how many loops it has completed.

* "./test-rgrp-burn"

If the program is run without any argument, on a kernel which doesn't
support rgroup, or from a cgroup where cpu controller is not
available, the three burner threads would run at about equivalent
speeds.

* "./test-rgrp-burn 0" from a cgroup w/ cpu controller

cpu controller is enabled across the top level, so rgroup-0 and
rgroup-1 compete on equal footing and burner-0 runs twice as
fast as burner-1 or burner-2.

* "./test-rgrp-burn 0 3 -1 2" from a cgroup w/ cpu controller

cpu controller is enabled at both levels.  A nice level difference of 3
is about a factor of two in weight (each nice step scales the weight by
~1.25, and 1.25^3 ~= 1.95), so the ratio would roughly be

  burner-0 : burner-1 : burner-2 ~= 3 : 2 : 1

Thanks.

-- 
tejun

[-- Attachment #2: test-rgrp-burn.c --]
[-- Type: text/plain, Size: 3315 bytes --]

/*
 * test-rgrp-burn - rgrp test program
 *
 * Creates the following rgrp hierarchy of three CPU cycle burner
 * threads.
 *
 * sgrp - main thread
 *      + [rgrp-0] burner thread
 *      + [rgrp-1] + [rgrp-2] nested burner thread
 *                 + [rgrp-3] nested burner thread
 *
 * Takes up to 4 arguments specifying the nice level of each rgrp.
 */
#define _GNU_SOURCE

#include <sys/types.h>
#include <sys/wait.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <limits.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <sys/resource.h>

#define CLONE_NEWRGRP		0x00001000	/* New resource group */
#define PRIO_RGRP		3
#define STACK_SIZE		(4 * 1024 * 1024)
#define CLONE_THREAD_FLAGS	(CLONE_THREAD | CLONE_SIGHAND | CLONE_VM |	\
				 CLONE_FS | CLONE_FILES)

static int nice_val[] = { [0 ... 3] = INT_MIN };

static pthread_mutex_t lprintf_mutex;

#define lprintf(fmt, args...)	do {						\
	pthread_mutex_lock(&lprintf_mutex);					\
	printf(fmt, ##args);							\
	pthread_mutex_unlock(&lprintf_mutex);					\
} while (0)

static int gettid(void)
{
	return syscall(SYS_gettid);
}

static int burner_fn(void *arg)
{
	unsigned long a = 37, cnt = 0;

	sleep(1);

	lprintf("burner : %d started\n", gettid());

	while (1) {
		*(volatile unsigned long *)&a = a * 37 / 13 + 53;
		*(volatile unsigned long *)&a = a * 37 / 13 + 53;
		*(volatile unsigned long *)&a = a * 37 / 13 + 53;
		*(volatile unsigned long *)&a = a * 37 / 13 + 53;
		*(volatile unsigned long *)&a = a * 37 / 13 + 53;
		*(volatile unsigned long *)&a = a * 37 / 13 + 53;
		*(volatile unsigned long *)&a = a * 37 / 13 + 53;
		*(volatile unsigned long *)&a = a * 37 / 13 + 53;

		if (!(++cnt % (1000000 * 100))) {
			int prio;

			errno = 0;
			prio = getpriority(PRIO_RGRP, 0);
			lprintf("burner : %d finished %lum loops (rgrp nice=%d errno=%d)\n",
				gettid(), cnt / 1000000, prio, errno);
		}
	}

	return 0;
}

static void rgrp_setprio(pid_t pid, int nice)
{
	if (nice == INT_MIN)
		return;

	lprintf("setprio: setting PRIO_RGRP to %d on %d\n", nice, pid);
	if (setpriority(PRIO_RGRP, pid, nice))
		perror("setpriority");
}

static int child_fn(void *arg)
{
	char *stack;
	pid_t pid;

	stack = malloc(STACK_SIZE) + STACK_SIZE;
	pid = clone(burner_fn, stack, CLONE_THREAD_FLAGS | CLONE_NEWRGRP, NULL);
	lprintf("child  : cloned nested burner %d\n", pid);
	rgrp_setprio(pid, nice_val[2]);

	stack = malloc(STACK_SIZE) + STACK_SIZE;
	pid = clone(burner_fn, stack, CLONE_THREAD_FLAGS | CLONE_NEWRGRP, NULL);
	lprintf("child  : cloned nested burner %d\n", pid);
	rgrp_setprio(pid, nice_val[3]);

	sleep(500);
	return 0;
}

int main(int argc, char **argv)
{
	char *stack;
	pid_t pid;
	int i;

	if (argc > 5)
		argc = 5;

	for (i = 1; i < argc; i++)
		nice_val[i - 1] = atoi(argv[i]);

	pthread_mutex_init(&lprintf_mutex, NULL);

	stack = malloc(STACK_SIZE) + STACK_SIZE;
	pid = clone(burner_fn, stack, CLONE_THREAD_FLAGS | CLONE_NEWRGRP, NULL);
	lprintf("main   : cloned burner %d\n", pid);
	rgrp_setprio(pid, nice_val[0]);

	stack = malloc(STACK_SIZE) + STACK_SIZE;
	pid = clone(child_fn, stack, CLONE_THREAD_FLAGS | CLONE_NEWRGRP, NULL);
	lprintf("main   : cloned child %d\n", pid);
	rgrp_setprio(pid, nice_val[1]);

	sleep(500);
	return 0;
}

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-03-11 15:41 [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Tejun Heo
                   ` (10 preceding siblings ...)
  2016-03-11 16:05 ` Example program for PRIO_RGRP Tejun Heo
@ 2016-03-12  6:26 ` Mike Galbraith
  2016-03-12 17:04   ` Mike Galbraith
  2016-03-13 15:00   ` Tejun Heo
  2016-03-14 11:30 ` Peter Zijlstra
  2016-03-15 17:21 ` Michal Hocko
  13 siblings, 2 replies; 50+ messages in thread
From: Mike Galbraith @ 2016-03-12  6:26 UTC (permalink / raw)
  To: Tejun Heo, torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt
  Cc: linux-kernel, cgroups, linux-api, kernel-team

On Fri, 2016-03-11 at 10:41 -0500, Tejun Heo wrote:
> Hello,
> 
> This patchset extends cgroup v2 to support rgroup (resource group) for
> in-process hierarchical resource control and implements PRIO_RGRP for
> setpriority(2) on top to allow in-process hierarchical CPU cycle
> control in a seamless way.
> 
> cgroup v1 allowed putting threads of a process in different cgroups
> which enabled ad-hoc in-process resource control of some resources.
> Unfortunately, this approach was fraught with problems such as
> membership ambiguity with per-process resources and lack of isolation
> between system management and in-process properties.  For a more
> detailed discussion on the subject, please refer to the following
> message.
> 
>  [1] [RFD] cgroup: thread granularity support for cpu controller
> 
> This patchset implements the mechanism outlined in the above message.
> The new mechanism is named rgroup (resource group).  When explicitly
> designating a non-rgroup cgroup, the term sgroup (system group) is
> used.  rgroup has the following properties.
> 
> * A rgroup is a cgroup which is invisible on and transparent to the
>   system-level cgroupfs interface.
> 
> * A rgroup can be created by specifying CLONE_NEWRGRP flag, along with
>   CLONE_THREAD, during clone(2).  A new rgroup is created under the
>   parent thread's cgroup and the new thread is created in it.
> 
> * A rgroup is automatically destroyed when empty.
> 
> * A top-level rgroup of a process is a rgroup whose parent cgroup is a
>   sgroup.  A process may have multiple top-level rgroups and thus
>   multiple rgroup subtrees under the same parent sgroup.
> 
> * Unlike sgroups, rgroups are allowed to compete against peer threads.
>   Each rgroup behaves equivalent to a sibling task.
> 
> * rgroup subtrees are local to the process.  When the process forks or
>   execs, its rgroup subtrees are collapsed.
> 
> * When a process is migrated to a different cgroup, its rgroup
>   subtrees are preserved.
> 
> * Subset of controllers available on the parent sgroup are available
>   to rgroup subtrees.  Controller management on rgroups is automatic
>   and implicit and doesn't interfere with system-level cgroup
>   controller management.  If a controller is made unavailable on the
>   parent sgroup, it's automatically disabled from child rgroup
>   subtrees.
> 
> rgroup lays the foundation for other kernel mechanisms to make use of
> resource controllers while providing proper isolation between system
> management and in-process operations removing the awkward and
> layer-violating requirement for coordination between individual
> applications and system management.  On top of the rgroup mechanism,
> PRIO_RGRP is implemented for {set|get}priority(2).
> 
> * PRIO_RGRP can only be used if the target task is already in a
>   rgroup.  If setpriority(2) is used and cpu controller is available,
>   cpu controller is enabled until the target rgroup is covered and the
>   specified nice value is set as the weight of the rgroup.
> 
> * The specified nice value has the same meaning as for tasks.  For
>   example, a rgroup and a task competing under the same parent would
>   behave exactly the same as two tasks.
> 
> * For top-level rgroups, PRIO_RGRP follows the same rlimit
>   restrictions as PRIO_PROCESS; however, as nested rgroups only
>   distribute CPU cycles which are allocated to the process, no
>   restriction is applied.
> 
> PRIO_RGRP allows in-process hierarchical control of CPU cycles in a
> manner which is a straight-forward and minimal extension of existing
> task and priority management.

Hrm.  You're showing that per-thread groups can coexist just fine,
which is good given need and usage exists today out in the wild.  Why
do such groups have to be invisible with a unique interface though?

Given the core has to deal with them whether they're visible or not,
and given they exist to fulfill a need, seems they should be first
class citizens, not some Quasimodo like creature sneaking into the
cathedral via a back door and slinking about in the shadows.

	-Mike

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-03-12  6:26 ` [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Mike Galbraith
@ 2016-03-12 17:04   ` Mike Galbraith
  2016-03-12 17:13     ` cgroup NAKs ignored? " Ingo Molnar
  2016-03-13 15:00   ` Tejun Heo
  1 sibling, 1 reply; 50+ messages in thread
From: Mike Galbraith @ 2016-03-12 17:04 UTC (permalink / raw)
  To: Tejun Heo, torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt
  Cc: linux-kernel, cgroups, linux-api, kernel-team

On Sat, 2016-03-12 at 07:26 +0100, Mike Galbraith wrote:
> On Fri, 2016-03-11 at 10:41 -0500, Tejun Heo wrote:
> > Hello,
> > 
> > This patchset extends cgroup v2 to support rgroup (resource group) for
> > in-process hierarchical resource control and implements PRIO_RGRP for
> > setpriority(2) on top to allow in-process hierarchical CPU cycle
> > control in a seamless way.
> > 
> > cgroup v1 allowed putting threads of a process in different cgroups
> > which enabled ad-hoc in-process resource control of some resources.

BTW, within the scheduler, "process" does not exist.  A high level
composite entity is what we currently aggregate from arbitrary
individual entities, a.k.a threads.  Whether an individual entity be an
un-threaded "process" bash, a thread of "process" oracle, or one of
"process!?!" kernel is irrelevant.  What entity aggregation has to do
with "process" eludes me completely.

What's ad-hoc or unusual about a thread pool servicing an arbitrary
number of customers using cgroup bean accounting?  Job arrives from
customer, worker is dispatched to customer workshop (cgroup), it does
whatever on behest of customer, sends bean count off to the billing
department, and returns to the break room.  What's so annoying about
using bean counters for.. counting beans that you want to forbid it?

	-Mike

^ permalink raw reply	[flat|nested] 50+ messages in thread

* cgroup NAKs ignored? Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-03-12 17:04   ` Mike Galbraith
@ 2016-03-12 17:13     ` Ingo Molnar
  2016-03-13 14:42       ` Tejun Heo
  0 siblings, 1 reply; 50+ messages in thread
From: Ingo Molnar @ 2016-03-12 17:13 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Tejun Heo, torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes,
	pjt, linux-kernel, cgroups, linux-api, kernel-team,
	Thomas Gleixner


* Mike Galbraith <umgwanakikbuti@gmail.com> wrote:

> On Sat, 2016-03-12 at 07:26 +0100, Mike Galbraith wrote:
> > On Fri, 2016-03-11 at 10:41 -0500, Tejun Heo wrote:
> > > Hello,
> > > 
> > > This patchset extends cgroup v2 to support rgroup (resource group) for
> > > in-process hierarchical resource control and implements PRIO_RGRP for
> > > setpriority(2) on top to allow in-process hierarchical CPU cycle
> > > control in a seamless way.
> > > 
> > > cgroup v1 allowed putting threads of a process in different cgroups
> > > which enabled ad-hoc in-process resource control of some resources.
> 
> BTW, within the scheduler, "process" does not exist. [...]

Yes, and that's very fundamental.

And I see that many bits of the broken 'v2' cgroups ABI already snuck into the 
upstream kernel in this merge window, without this detail having been agreed upon!
:-(

Tejun, this _REALLY_ sucks. We had pending NAKs over the design, still you moved 
ahead like nothing happened, why?!

> [...]  A high level composite entity is what we currently aggregate from 
> arbitrary individual entities, a.k.a threads.  Whether an individual entity be 
> an un-threaded "process" bash, a thread of "process" oracle, or one of 
> "process!?!" kernel is irrelevant.  What entity aggregation has to do with 
> "process" eludes me completely.
> 
> What's ad-hoc or unusual about a thread pool servicing an arbitrary number of 
> customers using cgroup bean accounting?  Job arrives from customer, worker is 
> dispatched to customer workshop (cgroup), it does whatever on behest of 
> customer, sends bean count off to the billing department, and returns to the 
> break room.  What's so annoying about using bean counters for.. counting beans 
> that you want to forbid it?

Agreed ... and many others expressed this concern as well. Why were these concerns 
ignored?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: cgroup NAKs ignored? Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-03-12 17:13     ` cgroup NAKs ignored? " Ingo Molnar
@ 2016-03-13 14:42       ` Tejun Heo
  0 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2016-03-13 14:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, torvalds, akpm, a.p.zijlstra, mingo, lizefan,
	hannes, pjt, linux-kernel, cgroups, linux-api, kernel-team,
	Thomas Gleixner

Hello, Ingo.

On Sat, Mar 12, 2016 at 06:13:18PM +0100, Ingo Molnar wrote:
> > BTW, within the scheduler, "process" does not exist. [...]
> 
> Yes, and that's very fundamental.

I'll go into this part later.

> And I see that many bits of the broken 'v2' cgroups ABI already snuck into the 
> upstream kernel in this merge dinwo, without this detail having been agreed upon!
> :-(
>
> Tejun, this _REALLY_ sucks. We had pending NAKs over the design, still you moved 
> ahead like nothing happened, why?!

Hmmmm?  The cpu controller is still in the review branch.  The thread
sprawled out, but the disagreement there was about missing the ability
to hierarchically distribute CPU cycles in-process, and the two
alternatives discussed throughout the thread were a per-process private
filesystem under /proc/PID and an extension of existing process resource
management mechanisms.

Going back to the per-process part, I described the rationales in
cgroup-v2 documentation and the RFD document but here are some
important bits.

1. Common resource domains

* When different resources get intermixed as do memory and io during
  writeback, without a common resource domain defined across the
  different resource types, it's impossible to perform resource
  control.  As a simplistic example, let's say there are four
  processes (1, 2, 3, 4), two memory cgroups (ma, mb) and two io
  cgroups (ia, ib) with the following membership.

   ma: 1, 2  mb: 3, 4
   ia: 1, 3  ib: 2, 4

  Writeback and dirty throttling are regulated by the proportion of
  dirty memory against available memory and by the writeback bandwidth
  of the target backing device.  When resource domains are orthogonal
  like the above, it's impossible to define a clear relationship.  This
  is one of the main reasons why writeback behavior has been so erratic
  with respect to cgroups.

* It is a lot more useful and less painful to have common resource
  domains defined across all resource types as it allows expressing
  things like "if this belongs to resource domain F, do XYZ".  A lot
  of use cases are already doing this by building identical
  hierarchies (to differing depths) across all controllers.


2. Per-process

* There is a relatively pronounced boundary between system management
  and internal operations of an application and one side-effect of
  allowing threads to be assigned arbitrarily across system cgroupfs
  hierarchy is that it mandates close coordination between individual
  applications and system management (whether that be a human being or
  system agent software).  This is userland suffering because the
  kernel fails to provide properly abstracted and isolated
  constructs.

  Decoupling system management and in-application operations makes
  hierarchical resource grouping and control easily accessible to
  individual applications without worrying about how the system is
  managed in larger scope.  Process is a fairly good approximation of
  this boundary.

* For some resources, going beyond process granularity doesn't make
  much sense.  While we can just let users do whatever they wanna do
  and declare certain configurations to yield undefined behaviors (io
  controller on v1 hierarchy actually does this), it is better to
  provide abstractions which match the actual characteristics.
  Combined with the above, it is natural to distinguish across-process
  and in-process operations.

> > [...]  A high level composite entity is what we currently aggregate from 
> > arbitrary individual entities, a.k.a threads.  Whether an individual entity be 
> > an un-threaded "process" bash, a thread of "process" oracle, or one of 
> > "process!?!" kernel is irrelevant.  What entity aggregation has to do with 
> > "process" eludes me completely.
> > 
> > What's ad-hoc or unusual about a thread pool servicing an arbitrary number of 
> > customers using cgroup bean accounting?  Job arrives from customer, worker is 
> > dispatched to customer workshop (cgroup), it does whatever at the behest of 
> > customer, sends bean count off to the billing department, and returns to the 
> > break room.  What's so annoying about using bean counters for.. counting beans 
> > that you want to forbid it?
> 
> Agreed ... and many others expressed this concern as well. Why were these concerns 
> ignored?

They weren't ignored.  The concern expressed was the loss of the
ability to hierarchically distribute resources in-process, and the RFD
document and this patchset are the attempts at resolving that specific
issue.

Going back to Mike's "why can't these be arbitrary bean counters?",
yes, they can be.  That's what one gets when the cpu controller is
mounted on its own hierarchy.  If that's what the use case at hand
calls for, that is the way to go and there's nothing preventing that.
In fact, with recent restructuring of cgroup core, stealing a
stateless controller to a new hierarchy can be made a lot easier for
such use cases.

However, as explained above, controlling a resource in an abstraction-
and restriction-free style also has its costs.  There's no way to tie
together different types of resources serving the same purpose, which
can be generally painful and makes some cross-resource operations
impossible.  It also entangles in-process operations with system
management, IOW, a process having to speak to the external
$SYSTEM_AGENT to manage its threadpools.

What the proposed solution tries to achieve is balancing flexibility
at the system management level with proper abstractions and isolation
so that hierarchical resource management is actually accessible to a
much wider set of applications and use-cases.

Given how cgroup is used in the wild, I'm pretty sure that the
structured approach will reach a much wider audience without getting in
the way of what they try to achieve.  That said, again, for specific
use cases where the benefits from the structured approach can or should
be ignored, using the cpu controller as arbitrary hierarchical bean
counters is completely fine and the right solution.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-03-12  6:26 ` [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Mike Galbraith
  2016-03-12 17:04   ` Mike Galbraith
@ 2016-03-13 15:00   ` Tejun Heo
  2016-03-13 17:40     ` Mike Galbraith
  2016-03-14  2:23     ` Mike Galbraith
  1 sibling, 2 replies; 50+ messages in thread
From: Tejun Heo @ 2016-03-13 15:00 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt,
	linux-kernel, cgroups, linux-api, kernel-team

Hello, Mike.

On Sat, Mar 12, 2016 at 07:26:59AM +0100, Mike Galbraith wrote:
> Hrm.  You're showing that per-thread groups can coexist just fine,
> which is good given need and usage exists today out in the wild.  Why
> do such groups have to be invisible with a unique interface though?

I tried to explain these in the aforementioned RFD document.  I'll give
a brief example here.

Let's say there is an application which wants to manage resource
distributions across its multiple threadpools in a hierarchical way.
With the cgroupfs interface as the only system-wide interface, it has
to coordinate with whoever or whatever is managing that interface.
Maybe it can
get a subtree delegated to it, maybe it has to ask the system thing to
create and place threads there, maybe it can just expose the pids and
let the system management do its thing (what if the threads in the
pools are dynamic tho?).  There is no reliable universal way of doing
this.  Each such application has to be ready to specifically
coordinate with the specific system management in use.

This is the kernel failing to provide proper abstraction and isolation
between different layers.  The "raw" feature is there but it's unsafe
to use and thus can't be used widely.

> Given the core has to deal with them whether they're visible or not,
> and given they exist to fulfill a need, seems they should be first
> class citizens, not some Quasimodo like creature sneaking into the
> cathedral via a back door and slinking about in the shadows.

In terms of programmability and accessibility for individual
applications, group resource management being available through
straightforward and incremental extension of existing mechanisms is
*way* more of a first-class citizen.  It is two seamless extensions to
clone(2) and setpriority(2) making hierarchical resource management
generally available to applications.
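
For illustration, a minimal sketch of that programming model, assuming
the CLONE_NEWRGRP and PRIO_RGRP extensions proposed in this patchset
(neither is upstream, so the constants below are placeholders rather
than real ABI values):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdlib.h>
    #include <sys/resource.h>
    #include <unistd.h>

    #ifndef CLONE_NEWRGRP
    #define CLONE_NEWRGRP 0x00001000    /* placeholder, RFC-only flag */
    #endif
    #ifndef PRIO_RGRP
    #define PRIO_RGRP 3                 /* placeholder, RFC-only 'which' */
    #endif

    static int worker(void *arg)
    {
        /* ... service requests ... */
        return 0;
    }

    int main(void)
    {
        char *stack = malloc(64 * 1024);
        pid_t tid;

        if (!stack)
            return 1;

        /* Create a worker in a fresh rgroup under this thread's cgroup. */
        tid = clone(worker, stack + 64 * 1024,
                    CLONE_VM | CLONE_FS | CLONE_FILES |
                    CLONE_SIGHAND | CLONE_THREAD | CLONE_NEWRGRP, NULL);
        if (tid < 0)
            return 1;

        /* Weight the worker's rgroup like a nice -5 sibling task. */
        setpriority(PRIO_RGRP, tid, -5);

        pause();    /* keep the thread group alive */
        return 0;
    }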

There can be use cases where building a cpu resource hierarchy which is
completely alien to how the rest of the system is organized is useful.
For those cases, the only thing which can be done is building a
separate hierarchy for the cpu controller and that capability isn't
going anywhere.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-03-13 15:00   ` Tejun Heo
@ 2016-03-13 17:40     ` Mike Galbraith
  2016-04-07  0:00       ` Tejun Heo
  2016-03-14  2:23     ` Mike Galbraith
  1 sibling, 1 reply; 50+ messages in thread
From: Mike Galbraith @ 2016-03-13 17:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt,
	linux-kernel, cgroups, linux-api, kernel-team

On Sun, 2016-03-13 at 11:00 -0400, Tejun Heo wrote:
> Hello, Mike.
> 
> On Sat, Mar 12, 2016 at 07:26:59AM +0100, Mike Galbraith wrote:
> > Hrm.  You're showing that per-thread groups can coexist just fine,
> > which is good given need and usage exists today out in the wild.  Why
> > do such groups have to be invisible with a unique interface though?
> 
> I tried to explain these in the aforementioned RFD document.  I'll give
> a brief example here.
> 
> Let's say there is an application which wants to manage resource
> distributions across its multiple threadpools in a hierarchical way.
> With the cgroupfs interface as the only system-wide interface, it has
> to coordinate with whoever or whatever is managing that interface.
> Maybe it can
> get a subtree delegated to it, maybe it has to ask the system thing to
> create and place threads there, maybe it can just expose the pids and
> let the system management do its thing (what if the threads in the
> pools are dynamic tho?).  There is no reliable universal way of doing
> this.  Each such application has to be ready to specifically
> coordinate with the specific system management in use.

The last thing I ever want to see on my boxen is random applications
either doing their own thing with my cgroups management interface, or
conspiring with "the system thing" behind my back to do things that I
did not specifically ask them to do.

"The system thing" started doing its own thing behind my back, and
oddly enough, its tentacles started falling off.  By golly, its eyes
seem to have fallen out as well.

That's what happens when control freak meets control freak, one of them
ends up in pieces.  There can be only one, and that one is me, the
administrator.  Applications don't coordinate spit, if I put on my
administrator hat and stuff 'em in a box, they better stay stuffed.

> This is the kernel failing to provide proper abstraction and isolation
> between different layers.  The "raw" feature is there but it's unsafe
> to use and thus can't be used widely.

Good, management of my boxen is my turf.  The raw feature works fine
today, and I'd like to see it keep on working tomorrow.  If tools
written for administration diddle, that's fine IFF I'm the guy on the
other end of the tool.  All other manipulators of my management
interfaces can go.. fish.

> > Given the core has to deal with them whether they're visible or not,
> > and given they exist to fulfill a need, seems they should be first
> > class citizens, not some Quasimodo like creature sneaking into the
> > cathedral via a back door and slinking about in the shadows.
> 
> In terms of programmability and accessibility for individual
> applications, group resource management being available through
> straightforward and incremental extension of existing mechanisms is
> *way* more of a first-class citizen.  It is two seamless extensions to
> clone(2) and setpriority(2) making hierarchical resource management
> generally available to applications.

To me, that sounds like chaos.

> There can be use cases where building a cpu resource hierarchy which is
> completely alien to how the rest of the system is organized is useful.
> For those cases, the only thing which can be done is building a
> separate hierarchy for the cpu controller and that capability isn't
> going anywhere.

As long as administrators can use the system interface to aggregate
what they see fit, I'm happy.  The scheduler schedules threads, ergo
the cpu controller must aggregate threads.  There is no process.

	-Mike

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-03-13 15:00   ` Tejun Heo
  2016-03-13 17:40     ` Mike Galbraith
@ 2016-03-14  2:23     ` Mike Galbraith
  1 sibling, 0 replies; 50+ messages in thread
From: Mike Galbraith @ 2016-03-14  2:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt,
	linux-kernel, cgroups, linux-api, kernel-team

On Sun, 2016-03-13 at 11:00 -0400, Tejun Heo wrote:

> There can be use cases where building cpu resource hierarchy which is
> completely alien to how the rest of the system is organized is useful.

Well, from my POV it's the "process" oriented thingy that's the alien, 
and one that wants to meet an alien-juice-resistant shovel ;-)

Another example is cpusets.  I use it, and I know beyond doubt that my
employer's customers use it for RT/HPC purposes out in the wild, and
that very much includes placing critical _threads_ in exclusive sets.

	-Mike

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-03-11 15:41 [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Tejun Heo
                   ` (11 preceding siblings ...)
  2016-03-12  6:26 ` [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Mike Galbraith
@ 2016-03-14 11:30 ` Peter Zijlstra
  2016-04-06 15:58   ` Tejun Heo
  2016-03-15 17:21 ` Michal Hocko
  13 siblings, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2016-03-14 11:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, akpm, mingo, lizefan, hannes, pjt, linux-kernel,
	cgroups, linux-api, kernel-team

On Fri, Mar 11, 2016 at 10:41:18AM -0500, Tejun Heo wrote:
> * A rgroup is a cgroup which is invisible on and transparent to the
>   system-level cgroupfs interface.
> 
> * A rgroup can be created by specifying CLONE_NEWRGRP flag, along with
>   CLONE_THREAD, during clone(2).  A new rgroup is created under the
>   parent thread's cgroup and the new thread is created in it.

This seems overly restrictive.  As you well know, there are people
moving threads about after creation.

Also, with this interface the whole thing cannot be used until your
libc's pthread_create() has been patched to allow use of this new flag.

> * A rgroup is automatically destroyed when empty.

Except for Zombies it appears..

> * A top-level rgroup of a process is a rgroup whose parent cgroup is a
>   sgroup.  A process may have multiple top-level rgroups and thus
>   multiple rgroup subtrees under the same parent sgroup.
> 
> * Unlike sgroups, rgroups are allowed to compete against peer threads.
>   Each rgroup behaves equivalent to a sibling task.
> 
> * rgroup subtrees are local to the process.  When the process forks or
>   execs, its rgroup subtrees are collapsed.
> 
> * When a process is migrated to a different cgroup, its rgroup
>   subtrees are preserved.

This all makes it impossible to, say, put a single thread outside of
the hierarchy forced upon it by the process, like putting an RT thread
in an isolated group on the side.

Which is a rather common thing to do.

> rgroup lays the foundation for other kernel mechanisms to make use of
> resource controllers while providing proper isolation between system
> management and in-process operations removing the awkward and
> layer-violating requirement for coordination between individual
> applications and system management.  On top of the rgroup mechanism,
> PRIO_RGRP is implemented for {set|get}priority(2).
> 
> * PRIO_RGRP can only be used if the target task is already in a
>   rgroup.  If setpriority(2) is used and cpu controller is available,
>   cpu controller is enabled until the target rgroup is covered and the
>   specified nice value is set as the weight of the rgroup.
> 
> * The specified nice value has the same meaning as for tasks.  For
>   example, a rgroup and a task competing under the same parent would
>   behave exactly the same as two tasks.
> 
> * For top-level rgroups, PRIO_RGRP follows the same rlimit
>   restrictions as PRIO_PROCESS; however, as nested rgroups only
>   distribute CPU cycles which are allocated to the process, no
>   restriction is applied.

While this appears neat, I doubt it will remain so in the face of this:

> * A mechanism that applications can use to publish certain rgroups so
>   that external entities can determine which IDs to use to change
>   rgroup settings.  I already have interface and implementation design
>   mostly pinned down.

So you need some newfangled way to set/query all the other possible
cgroup parameters supported, and then suddenly you have one that has
two possible interfaces.  That's way ugly.

While I appreciate the sentiment that having two entities poking at the
cgroup filesystem without coordination is a problem, I don't see this as
the solution. I would much rather just kill the system-wide thing;
that too solves the problem.

IOW, I'm unconvinced this approach will cater to current practices or
even allow similar functionality.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-03-11 15:41 [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Tejun Heo
                   ` (12 preceding siblings ...)
  2016-03-14 11:30 ` Peter Zijlstra
@ 2016-03-15 17:21 ` Michal Hocko
  2016-04-06 21:53   ` Tejun Heo
  13 siblings, 1 reply; 50+ messages in thread
From: Michal Hocko @ 2016-03-15 17:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt,
	linux-kernel, cgroups, linux-api, kernel-team

On Fri 11-03-16 10:41:18, Tejun Heo wrote:
> Hello,
> 
> This patchset extends cgroup v2 to support rgroup (resource group) for
> in-process hierarchical resource control and implements PRIO_RGRP for
> setpriority(2) on top to allow in-process hierarchical CPU cycle
> control in a seamless way.
> 
> cgroup v1 allowed putting threads of a process in different cgroups
> which enabled ad-hoc in-process resource control of some resources.
> Unfortunately, this approach was fraught with problems such as
> membership ambiguity with per-process resources

[Sorry if this has already been discussed, I haven't followed all the
 discussions regarding this topic]

While I agree that per-thread granularity is no fun for controllers
which operate on entities other than task_struct (like the memory
cgroup controller), I am afraid that all the complications will not go
away if we are strictly per-process anyway.

For example, the memcg controller is not strictly per-process either;
it operates on the mm_struct and that might be shared between different
_processes_.  So we still might end up in the same schizophrenic
situation where two different processes are living in different cgroups
while one of them is silently operating in a different memcg.  I really
hate this but this is what our clone(CLONE_VM) (without CLONE_THREAD)
allows one to do.
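
For reference, a minimal sketch of that case, i.e. a clone(2) call
which shares the address space without joining the thread group, so
the child is a separate process charging the same mm (plain C, using
only the existing glibc clone(2) wrapper):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int shared_vm_child(void *arg)
    {
        /* Separate pid and thread group, but the same mm as the parent. */
        printf("child: pid %d\n", getpid());
        return 0;
    }

    int main(void)
    {
        char *stack = malloc(64 * 1024);
        pid_t pid;

        if (!stack)
            return 1;

        /* CLONE_VM without CLONE_THREAD: a new process sharing our VM. */
        pid = clone(shared_vm_child, stack + 64 * 1024,
                    CLONE_VM | SIGCHLD, NULL);
        if (pid < 0)
            return 1;

        printf("parent: pid %d\n", getpid());
        waitpid(pid, NULL, 0);
        return 0;
    }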

I do not know about other controllers, maybe only memcg is so special,
but that would suggest that even a process-only restriction might turn out
to be a problem in the future and controllers would have to face the
same problem later on.

Now I have to admit I do not have great ideas on how to cover all the
possible cases, but wouldn't it make more sense to allow for more
flexibility and allow thread migration while the migration can be
vetoed by any controller should it cross into a different/incompatible
cgroup?

[...]

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-03-14 11:30 ` Peter Zijlstra
@ 2016-04-06 15:58   ` Tejun Heo
  2016-04-07  6:45     ` Peter Zijlstra
  0 siblings, 1 reply; 50+ messages in thread
From: Tejun Heo @ 2016-04-06 15:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, akpm, mingo, lizefan, hannes, pjt, linux-kernel,
	cgroups, linux-api, kernel-team

Hello, Peter.

Sorry about the delay.

On Mon, Mar 14, 2016 at 12:30:13PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 11, 2016 at 10:41:18AM -0500, Tejun Heo wrote:
> > * A rgroup is a cgroup which is invisible on and transparent to the
> >   system-level cgroupfs interface.
> > 
> > * A rgroup can be created by specifying CLONE_NEWRGRP flag, along with
> >   CLONE_THREAD, during clone(2).  A new rgroup is created under the
> >   parent thread's cgroup and the new thread is created in it.
> 
> This seems overly restrictive.  As you well know, there are people
> moving threads about after creation.

Will get to this later.

> Also, with this interface the whole thing cannot be used until your
> libc's pthread_create() has been patched to allow use of this new flag.

This isn't difficult to change but is this a problem in the long term?
Once added, this is gonna be a permanent part of the API and I think we
better get it right than quick.  If this is a concern, we can go for a
setsid(2)-style syscall so that users can have easier access to it.

> > * A rgroup is automatically destroyed when empty.
> 
> Except for Zombies it appears..

Zombies do hold onto their rgroup but at that point the rgroup is
draining refs and can't be populated again.  It's the same state as a
rmdir'd cgroup with zombies.

> > * A top-level rgroup of a process is a rgroup whose parent cgroup is a
> >   sgroup.  A process may have multiple top-level rgroups and thus
> >   multiple rgroup subtrees under the same parent sgroup.
> > 
> > * Unlike sgroups, rgroups are allowed to compete against peer threads.
> >   Each rgroup behaves equivalent to a sibling task.
> > 
> > * rgroup subtrees are local to the process.  When the process forks or
> >   execs, its rgroup subtrees are collapsed.
> > 
> > * When a process is migrated to a different cgroup, its rgroup
> >   subtrees are preserved.
> 
> This all makes it impossible to, say, put a single thread outside of
> the hierarchy forced upon it by the process, like putting an RT thread
> in an isolated group on the side.
> 
> Which is a rather common thing to do.

I don't think the mentioned RT case is problematic.  Depending on the
desired outcome,

1. If the admin doesn't want the program to be able to meddle with the
   cpu resource control at all, it can just disable the cpu controller
   in subtree_control (this is tied to the parent control now but will
   be moved to the associated cgroup itself).  The application will
   create rgroup hierarchy but won't be able to use CPU resource
   control and the admin would be able to treat all threads as if they
   don't have rgroups at all.

2. If the admin still wants to allow the application to retain CPU
   resource control, unless the said program is actively getting in
   the way, the admin can set the limits the way it wants along the
   hierarchy down to the specific thread.

Note that #1 can be done after the fact.  The admin can revoke CPU
controller access anytime.  For example, assuming the following
hierarchy (cX is a cgroup, rX is a rgroup, NNN are threads).

   cA - 234
      + r235 - 235
             + 236

If the process 234 configured CPU resource control in a specific way
and the admin wants to override, the admin can simply do "echo -cpu >
cA.subtree_control".  Afterwards, as far as CPU resource control is
concerned, all threads will behave as if there are no rgroups at all
and the admin can tweak the settings of individual threads using the
usual scheduler system calls.
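
As a sketch of that override flow (assuming cgroup2 mounted at
/sys/fs/cgroup with cA at its root, so the knob is the file
cA/cgroup.subtree_control; thread 236 is from the diagram above):

    #include <fcntl.h>
    #include <sys/resource.h>
    #include <unistd.h>

    int main(void)
    {
        /* Revoke the cpu controller from cA's rgroup subtrees... */
        int fd = open("/sys/fs/cgroup/cA/cgroup.subtree_control",
                      O_WRONLY);

        if (fd < 0)
            return 1;
        write(fd, "-cpu\n", 5);
        close(fd);

        /*
         * ...and tweak an individual thread directly.  On Linux,
         * setpriority(PRIO_PROCESS, tid, ...) applies to just that
         * thread.
         */
        return setpriority(PRIO_PROCESS, 236, 10);
    }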

> > rgroup lays the foundation for other kernel mechanisms to make use of
> > resource controllers while providing proper isolation between system
> > management and in-process operations removing the awkward and
> > layer-violating requirement for coordination between individual
> > applications and system management.  On top of the rgroup mechanism,
> > PRIO_RGRP is implemented for {set|get}priority(2).
> > 
> > * PRIO_RGRP can only be used if the target task is already in a
> >   rgroup.  If setpriority(2) is used and cpu controller is available,
> >   cpu controller is enabled until the target rgroup is covered and the
> >   specified nice value is set as the weight of the rgroup.
> > 
> > * The specified nice value has the same meaning as for tasks.  For
> >   example, a rgroup and a task competing under the same parent would
> >   behave exactly the same as two tasks.
> > 
> > * For top-level rgroups, PRIO_RGRP follows the same rlimit
> >   restrictions as PRIO_PROCESS; however, as nested rgroups only
> >   distribute CPU cycles which are allocated to the process, no
> >   restriction is applied.
> 
> While this appears neat, I doubt it will remain so in the face of this:
> 
> > * A mechanism that applications can use to publish certain rgroups so
> >   that external entities can determine which IDs to use to change
> >   rgroup settings.  I already have interface and implementation design
> >   mostly pinned down.
> 
> So you need some newfangled way to set/query all the other possible
> cgroup parameters supported, and then suddenly you have one that has
> two possible interfaces.  That's way ugly.

So, the above response is a bit confusing because publishing rgroups
doesn't require setting or querying all other possible cgroup
parameters.

Regarding the need to add a separate interface for each control knob
for rgroups,

1. There aren't many knobs which make sense for in-process control to
   begin with.

2. As shown in this patchset's modification to setpriority(2), for
   stuff which makes sense, we're likely to already have constructs
   which deal with the issue (it is a needed capability with or
   without cgroup).  The right way forward is seamlessly extending
   existing interfaces.

3. If there is no exactly matching interface, we want to add them for
   both groups and threads in a way which is consistent with other
   syscalls which deal with related issues, especially for the
   scheduler.

Just in case, here's a more concrete explanation about publishing
rgroups.  The only addition needed for external access is a way to
determine which ID maps to which rgroup - a proc file listing (rgroup
name, ID) pairs.

Let's say the program creates the following internal hierarchy.

 cgroup - service0 - highpri_workers
                   + lowpri_workers
        + service1 - highpri_workers
                   + lowpri_workers

The rgroups are published by a member thread performing, for example,
prctl(PR_SET_RGROUP_NAME, "service0.highpri_workers").  The only thing
it does is pin the pid so that it stays associated with the rgroup
and publish it in proc as follows.

 # cat /proc/234/rgroups
 service0.highpri_workers 240
 service0.lowpri_workers 241
 service1.highpri_workers 248
 service1.lowpri_workers 249

From the tooling side, renice(1) can be extended to understand rgroups
so that something like the following works.

 # renice -n -10 -r 234:service1.highpri_workers
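
A sketch of the publishing step, assuming the PR_SET_RGROUP_NAME prctl
outlined above (proposed in this thread only, so the constant below is
a placeholder):

    #include <sys/prctl.h>

    #ifndef PR_SET_RGROUP_NAME
    #define PR_SET_RGROUP_NAME 48    /* placeholder, not upstream */
    #endif

    /*
     * Called by a member thread of the rgroup; pins this tid to the
     * rgroup and lists the (name, ID) pair in /proc/PID/rgroups.
     */
    static int publish_rgroup(const char *name)
    {
        return prctl(PR_SET_RGROUP_NAME, (unsigned long)name, 0, 0, 0);
    }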

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-03-15 17:21 ` Michal Hocko
@ 2016-04-06 21:53   ` Tejun Heo
  2016-04-07  6:40     ` Peter Zijlstra
  0 siblings, 1 reply; 50+ messages in thread
From: Tejun Heo @ 2016-04-06 21:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt,
	linux-kernel, cgroups, linux-api, kernel-team

Hello, Michal.

Sorry about the delay.

On Tue, Mar 15, 2016 at 06:21:36PM +0100, Michal Hocko wrote:
> While I agree that per-thread granularity is no fun for controllers
> which operate on entities other than task_struct (like the memory
> cgroup controller), I am afraid that all the complications will not go
> away if we are strictly per-process anyway.
> 
> For example, the memcg controller is not strictly per-process either;
> it operates on the mm_struct and that might be shared between different
> _processes_.  So we still might end up in the same schizophrenic
> situation where two different processes are living in different cgroups
> while one of them is silently operating in a different memcg.  I really
> hate this but this is what our clone(CLONE_VM) (without CLONE_THREAD)
> allows one to do.

Can you list applications which make use of CLONE_VM without
CLONE_THREAD?  I searched using searchcode.com and the only non-kernel
code that I see is niche pthread implementations and some strace-type
audit tools.  The only reason those thread packages use CLONE_VM &&
!CLONE_THREAD is that that used to be how linuxthreads was done before
the Linux kernel grew proper threading support with CLONE_THREAD.

What you're pointing out is a historical vestige and if you can't
bring yourself to agree to the fact that processes and threads are the
primary abstractions that our userspace uses day in and day out, you
are not thinking straight.  Even the existing usages exist *to*
implement pthreads.

While the kernel can't assume CLONE_VM is always accompanied by
CLONE_THREAD and shouldn't be crashing when such conditions occur, we
also don't and shouldn't architect or optimize for them either.  In
fact, both memory and io pretty much declare that the specific
behaviors are undefined.

> I do not know about other controllers, maybe only memcg is so special,
> but that would suggest that even a process-only restriction might turn out
> to be a problem in the future and controllers would have to face the
> same problem later on.
>
> Now I have to admit I do not have great ideas on how to cover all the
> possible cases, but wouldn't it make more sense to allow for more
> flexibility and allow thread migration while the migration can be
> vetoed by any controller should it cross into a different/incompatible
> cgroup?

This is a non-issue and designing an interface is not about "covering
all the possible cases".  Different cases have differing levels of
importance.  It'd be absolutely crazy to put the same amount of
consideration towards the CLONE_VM && !CLONE_THREAD case when designing
*anything*.

Another factor to consider, which might not be immediately intuitive,
is that exposing everything comes at a cost, often a steep one.
cgroup has reliably proven to be a very good example of this.
Orthogonal hierarchies seem totally flexible on the surface but they
make it extremely awkward for different controllers to cooperate,
preventing something as fundamental as control over buffered writes.

This case is similar too.  While exposing every possible combination
to userland might seem to be a good idea on the surface, the end
result is the kernel failing to provide the necessary isolation between
operations internal to applications and system management, making
resource control essentially inaccessible outside of specialized
custom setups.  It's a failure, not a feature.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-03-13 17:40     ` Mike Galbraith
@ 2016-04-07  0:00       ` Tejun Heo
  2016-04-07  3:26         ` Mike Galbraith
  0 siblings, 1 reply; 50+ messages in thread
From: Tejun Heo @ 2016-04-07  0:00 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt,
	linux-kernel, cgroups, linux-api, kernel-team

Hello, Mike.

On Sun, Mar 13, 2016 at 06:40:35PM +0100, Mike Galbraith wrote:
> On Sun, 2016-03-13 at 11:00 -0400, Tejun Heo wrote:
> > Let's say there is an application which wants to manage resource
> > distributions across its multiple threadpools in a hierarchical way.
> > With the cgroupfs interface as the only system-wide interface, it has
> > to coordinate with whoever or whatever is managing that interface.
> > Maybe it can
> > get a subtree delegated to it, maybe it has to ask the system thing to
> > create and place threads there, maybe it can just expose the pids and
> > let the system management do its thing (what if the threads in the
> > pools are dynamic tho?).  There is no reliable universal way of doing
> > this.  Each such application has to be ready to specifically
> > coordinate with the specific system management in use.
> 
> The last thing I ever want to see on my boxen is random applications
> either doing their own thing with my cgroups management interface, or
> conspiring with "the system thing" behind my back to do things that I
> did not specifically ask them to do.

That isn't too different from saying that you don't want applications
to be calling setpriority(2) on their threads, which is a weird thing to
say, especially given that there are situations where applying control
from outside simply can't work - thread pools can be dynamic and there
is no reliable way of telling which threads are for which purposes
from outside.

This is not to say that admin override is unnecessary or unsupported.
In fact, rgroup and cgroup give the admin a lot more control.
Controller access can be revoked from applications in subtrees and the
entire controller can be detached from the hierarchy for full
override.

> "The system thing" started doing its own thing behind my back, and
> oddly enough, its tentacles started falling off.  By golly, its eyes
> seem to have fallen out as well.
> 
> That's what happens when control freak meets control freak, one of them
> ends up in pieces.  There can be only one, and that one is me, the
> administrator.  Applications don't coordinate spit, if I put on my
> administrator hat and stuff 'em in a box, they better stay stuffed.

Sure, you're a control freak.  Be that.  rgroup doesn't get in the way
of you doing that; however, you also have to realize that a single
person hand-configuring a specialized setup for oneself isn't the only
mode of usage.  Those are, in fact, vocal but clearly minority use
cases.  What's more common would be systematic management of resources
and applications configuring resource distribution across their
threads.  If you wanna assume full control, do so.  Nothing is
preventing that, and, at the same time, that shouldn't get in the way
of implementing mechanisms which are more widely useful.

> > > Given the core has to deal with them whether they're visible or not,
> > > and given they exist to fulfill a need, seems they should be first
> > > class citizens, not some Quasimodo like creature sneaking into the
> > > cathedral via a back door and slinking about in the shadows.
> > 
> > In terms of programmability and accessibility for individual
> > applications, group resource management being available through
> > straightforward and incremental extension of existing mechanisms is
> > *way* more of a first-class citizen.  It is two seamless extensions to
> > clone(2) and setpriority(2) making hierarchical resource management
> > generally available to applications.
> 
> To me, that sounds like chaos.

Care to elaborate on the rationale for the claim?

> > There can be use cases where building a cpu resource hierarchy which is
> > completely alien to how the rest of the system is organized is useful.
> > For those cases, the only thing which can be done is building a
> > separate hierarchy for the cpu controller and that capability isn't
> > going anywhere.
> 
> As long as administrators can use the system interface to aggregate
> what they see fit, I'm happy.  The scheduler schedules threads, ergo
> the cpu controller must aggregate threads.  There is no process.

Scheduler not knowing beyond threads is great but that doesn't make
the concept of process any less of a real thing when the kernel
interacts with userland.  Processes and threads are clearly the
primary constructs that our userland uses for execution contexts, and
process boundaries are frequently used to delimit various isolation
domains both in kernel API and programming conventions.

A capability can also become inaccessible when it's exposed in a way
which doesn't work in conjunction with the existing abstractions.  The
kernel not providing isolation at the expected layers is a failure
that can prevent the feature from being usable in a lot of cases.
What rgroup tries to do is exposing cgroup's capabilities in a way
which integrates with the existing programming constructs that
userland already depends upon to make these capabilities accessible to
them.

Again, if you wanna be a control freak, nothing stands in your way,
but control freak admins aren't the only consumers of the kernel.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-07  0:00       ` Tejun Heo
@ 2016-04-07  3:26         ` Mike Galbraith
  0 siblings, 0 replies; 50+ messages in thread
From: Mike Galbraith @ 2016-04-07  3:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, akpm, a.p.zijlstra, mingo, lizefan, hannes, pjt,
	linux-kernel, cgroups, linux-api, kernel-team

On Wed, 2016-04-06 at 20:00 -0400, Tejun Heo wrote:
> Hello, Mike.
> 
> On Sun, Mar 13, 2016 at 06:40:35PM +0100, Mike Galbraith wrote:
> > On Sun, 2016-03-13 at 11:00 -0400, Tejun Heo wrote:
> > > Let's say there is an application which wants to manage resource
> > > distributions across its multiple threadpools in a hierarchical way.
> > > With the cgroupfs interface as the only system-wide interface, it has
> > > to coordinate with whoever or whatever is managing that interface.
> > > Maybe it can
> > > get a subtree delegated to it, maybe it has to ask the system thing to
> > > create and place threads there, maybe it can just expose the pids and
> > > let the system management do its thing (what if the threads in the
> > > pools are dynamic tho?).  There is no reliable universal way of doing
> > > this.  Each such application has to be ready to specifically
> > > coordinate with the specific system management in use.
> > 
> > The last thing I ever want to see on my boxen is random applications
> > either doing their own thing with my cgroups management interface, or
> > conspiring with "the system thing" behind my back to do things that I
> > did not specifically ask them to do.
> 
> That isn't too different from saying that you don't want applications
> to be calling setpriority(2) on their threads, which is a weird thing to
> say, especially given that there are situations where applying control
> from outside simply can't work - thread pools can be dynamic and there
> is no reliable way of telling which threads are for which purposes
> from outside.
> 
> This is not to say that admin override is unnecessary or unsupported.
> In fact, rgroup and cgroup give the admin a lot more control.
> Controller access can be revoked from applications in subtrees and the
> entire controller can be detached from the hierarchy for full
> override.

My box lacks a hierarchy of, say, CPUs.  I fail to see how a gaggle of
apps, myself, and the system thing are all going to agree on who gets what
percentage of which CPUs.  The traditional thing is for apps to
communicate their needs and desires to the administrator.

> > "The system thing" started doing its own thing behind my back, and
> > oddly enough, its tentacles started falling off.  By golly, its eyes
> > seem to have fallen out as well.
> > 
> > That's what happens when control freak meets control freak, one of them
> > ends up in pieces.  There can be only one, and that one is me, the
> > administrator.  Applications don't coordinate spit, if I put on my
> > administrator hat and stuff 'em in a box, they better stay stuffed.
> 
> Sure, you're a control freak.  Be that.  rgroup doesn't get in the way
> of you doing that; however, you also have to realize that a single
> person hand-configuring a specialized setup for oneself isn't the only
> mode of usage.  Those are, in fact, vocal but clearly minority use
> cases.

Minority says you.  I'm not hearing the voices of this alleged
multitude, so I remain highly skeptical.  The only voices I've heard
are yours and the system thing authors'.  Perhaps time will prove my
concerns to be unfounded, meanwhile my machete stays at the ready.

>   What's more common would be systematic management of resources
> and applications configuring resource distribution across their
> threads.  If you wanna assume full control, do so.  Nothing is
> preventing that, and, at the same time, that shouldn't get in the way
> of implementing mechanisms which are more widely useful.

As long as the admin remains in the driver's seat, and overhead remains
restricted to those who are playing with cgroups (not the case atm), I
couldn't possibly care less.

> > > > Given the core has to deal with them whether they're visible or not,
> > > > and given they exist to fulfill a need, seems they should be first
> > > > class citizens, not some Quasimodo like creature sneaking into the
> > > > cathedral via a back door and slinking about in the shadows.
> > > 
> > > In terms of programmability and accessibility for individual
> > > applications, group resource management being available through
> > > straightforward and incremental extension of existing mechanisms is
> > > *way* more of a first-class citizen.  It is two seamless extensions to
> > > clone(2) and setpriority(2) making hierarchical resource management
> > > generally available to applications.
> > 
> > To me, that sounds like chaos.
> 
> Care to elaborate on the rationale for the claim?

I don't have a hierarchy of independent resources, I have one set of
resources.  If the system thing or anybody else starts, say, dorking
around with the CPU controller behind my back (the system thing did),
I pay for it
in performance loss.  Various remote individuals thinking cgroups are
so cool that _of course_ everybody will want and benefit from group
scheduling does not make it true.

> > > There can be use cases where building a cpu resource hierarchy which is
> > > completely alien to how the rest of the system is organized is useful.
> > > For those cases, the only thing which can be done is building a
> > > separate hierarchy for the cpu controller and that capability isn't
> > > going anywhere.
> > 
> > As long as administrators can use the system interface to aggregate
> > what they see fit, I'm happy.  The scheduler schedules threads, ergo
> > the cpu controller must aggregate threads.  There is no process.
> 
> Scheduler not knowing beyond threads is great but that doesn't make
> the concept of process any less of a real thing when the kernel
> interacts with userland.  Processes and threads are clearly the
> primary constructs that our userland uses for execution contexts, and
> process boundaries are frequently used to delimit various isolation
> domains both in kernel API and programming conventions.
> 
> A capability can also become inaccessible when it's exposed in a way
> which doesn't work in conjunction with the existing abstractions.  The
> kernel not providing isolation at the expected layers is a failure
> that can prevent the feature from being usable in a lot of cases.
> What rgroup tries to do is exposing cgroup's capabilities in a way
> which integrates with the existing programming constructs that
> userland already depends upon to make these capabilities accessible to
> them.
> 
> Again, if you wanna be a control freak, nothing stands in your way,
> but control freak admins aren't the only consumers of the kernel.

This control freak admin has however in fact bumped heads with control
freak system thing authors, and sees more of that in his future.  I did
not care until remote idiocy gave me a local reason to care.

If the admin remains the final authority, and remote individuals can't
do things that affect all and thus annoy him/her behind his/her back,
said admins, including this one, won't have any reason to care.

	-Mike

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-06 21:53   ` Tejun Heo
@ 2016-04-07  6:40     ` Peter Zijlstra
  0 siblings, 0 replies; 50+ messages in thread
From: Peter Zijlstra @ 2016-04-07  6:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, torvalds, akpm, mingo, lizefan, hannes, pjt,
	linux-kernel, cgroups, linux-api, kernel-team

On Wed, Apr 06, 2016 at 05:53:07PM -0400, Tejun Heo wrote:
> Can you list applications which make use of CLONE_VM without
> CLONE_THREAD? 

I ran into one two years ago or so; I forgot what it was, but it made
perf misbehave because we too had assumed this not to happen.

Eventually it turned out a newer version of that software did no longer
do this and we more or less left it at that.

But it does show that crap is out there..

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-06 15:58   ` Tejun Heo
@ 2016-04-07  6:45     ` Peter Zijlstra
  2016-04-07  7:35       ` Johannes Weiner
  0 siblings, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2016-04-07  6:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, akpm, mingo, lizefan, hannes, pjt, linux-kernel,
	cgroups, linux-api, kernel-team



So I recently got made aware of the fact that cgroupv2 doesn't allow
tasks to be associated with !leaf cgroups; this is yet another
capability of the cpu cgroup you've destroyed.

At this point I really don't see why I should spend another second
considering anything v2.

So full NAK and stop wasting my time.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-07  6:45     ` Peter Zijlstra
@ 2016-04-07  7:35       ` Johannes Weiner
  2016-04-07  8:05         ` Mike Galbraith
                           ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Johannes Weiner @ 2016-04-07  7:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, torvalds, akpm, mingo, lizefan, pjt, linux-kernel,
	cgroups, linux-api, kernel-team

On Thu, Apr 07, 2016 at 08:45:49AM +0200, Peter Zijlstra wrote:
> So I recently got made aware of the fact that cgroupv2 doesn't allow
> tasks to be associated with !leaf cgroups; this is yet another
> capability of the cpu cgroup you've destroyed.

May I ask how you are using that?

The behavior for tasks in !leaf groups was fairly inconsistent across
controllers because they all did different things, or didn't handle it
at all. For example, the block controller in v1 implements separate
weight knobs for the group as a subtree root as well as for the tasks
only inside the group itself. But it didn't do so for bandwidth limits.

The memory controller on the other hand only had a singular set of
controls that applied to both the local tasks and all subgroups. And I
know Google had a lot of trouble with that because they ended up with
basically uncontrollable leftover cache in the top-level group of some
subtree that would put pressure on the real workload leaf groups below.

There was a lot of back and forth whether we should add a second set
of knobs just to control the local tasks separately from the subtree,
but ended up concluding that the situation can be expressed more
clearly by creating dedicated leaf subgroups for stuff like management
software and launchers instead, so that their memory pools/LRUs are
clearly delineated from other groups and separately controllable. And
we couldn't think of any meaningful configuration that could not be
expressed in that scheme. I mean, it's the same thing, right? Only
that with tasks in !leaf groups the controller would have to emulate a
hidden leaf subgroup and provide additional interfacing, and without
it the leaf groups are explicit and a single set of knobs suffices.
I.e. it seems more of a convenience thing than actual functionality,
but one that forces ugly redundancy in the interface.

So it was a nice cleanup for the memory controller and I believe the
IO controller as well. I'd be curious how it'd be a problem for CPU?
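
For what it's worth, the reorganization described above is mechanically
simple.  A sketch against the cgroup2 filesystem (paths illustrative;
assumes cgroup2 mounted at /sys/fs/cgroup and a delegated "workload"
subtree):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int write_str(const char *path, const char *buf)
    {
        int fd = open(path, O_WRONLY);
        ssize_t ret;

        if (fd < 0)
            return -1;
        ret = write(fd, buf, strlen(buf));
        close(fd);
        return ret < 0 ? -1 : 0;
    }

    int main(void)
    {
        char buf[32];

        /* Give loose management tasks an explicit leaf of their own... */
        mkdir("/sys/fs/cgroup/workload/mgmt", 0755);
        snprintf(buf, sizeof(buf), "%d\n", (int)getpid());
        write_str("/sys/fs/cgroup/workload/mgmt/cgroup.procs", buf);

        /* ...so controllers can be enabled for the sibling leaf groups. */
        return write_str("/sys/fs/cgroup/workload/cgroup.subtree_control",
                         "+memory +io\n");
    }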

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-07  7:35       ` Johannes Weiner
@ 2016-04-07  8:05         ` Mike Galbraith
  2016-04-07  8:08         ` Peter Zijlstra
  2016-04-07  8:28         ` Peter Zijlstra
  2 siblings, 0 replies; 50+ messages in thread
From: Mike Galbraith @ 2016-04-07  8:05 UTC (permalink / raw)
  To: Johannes Weiner, Peter Zijlstra
  Cc: Tejun Heo, torvalds, akpm, mingo, lizefan, pjt, linux-kernel,
	cgroups, linux-api, kernel-team

On Thu, 2016-04-07 at 03:35 -0400, Johannes Weiner wrote:
> On Thu, Apr 07, 2016 at 08:45:49AM +0200, Peter Zijlstra wrote:
> > So I recently got made aware of the fact that cgroupv2 doesn't allow
> > tasks to be associated with !leaf cgroups; this is yet another
> > capability of the cpu cgroup you've destroyed.
> 
> May I ask how you are using that?

One real-world usage is the thread pool servicing customers on a cgroup
= account basis that I outlined.

	-Mike

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-07  7:35       ` Johannes Weiner
  2016-04-07  8:05         ` Mike Galbraith
@ 2016-04-07  8:08         ` Peter Zijlstra
  2016-04-07  9:28           ` Johannes Weiner
  2016-04-07 19:45           ` Tejun Heo
  2016-04-07  8:28         ` Peter Zijlstra
  2 siblings, 2 replies; 50+ messages in thread
From: Peter Zijlstra @ 2016-04-07  8:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, torvalds, akpm, mingo, lizefan, pjt, linux-kernel,
	cgroups, linux-api, kernel-team

On Thu, Apr 07, 2016 at 03:35:47AM -0400, Johannes Weiner wrote:
> On Thu, Apr 07, 2016 at 08:45:49AM +0200, Peter Zijlstra wrote:
> > So I recently got made aware of the fact that cgroupv2 doesn't allow
> > tasks to be associated with !leaf cgroups; this is yet another
> > capability of the cpu cgroup you've destroyed.
> 
> May I ask how you are using that?

_I_ use a kernel with CONFIG_CGROUPS=n (yes really).

But seriously? You have to ask?

The root cgroup is by definition not a leaf, and all tasks start life
there, and some cannot ever be moved out.

Therefore _everybody_ uses this.

> The behavior for tasks in !leaf groups was fairly inconsistent across
> controllers because they all did different things, or didn't handle it
> at all.

Then they're all bloody broken, because fully hierarchical was an early
requirement for cgroups; I know, because I had to throw away many days
of work and start over with cgroup support when they did that.

> So it was a nice cleanup for the memory controller and I believe the
> IO controller as well. I'd be curious how it'd be a problem for CPU?

The full hierarchy took years to make work and is fully ingrained in
how the thing works; changing it isn't going to be nice or easy.


So sure, go with a lowest common denominator, instead of fixing shit,
yay for progress :/

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-07  7:35       ` Johannes Weiner
  2016-04-07  8:05         ` Mike Galbraith
  2016-04-07  8:08         ` Peter Zijlstra
@ 2016-04-07  8:28         ` Peter Zijlstra
  2016-04-07 19:04           ` Johannes Weiner
  2 siblings, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2016-04-07  8:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, torvalds, akpm, mingo, lizefan, pjt, linux-kernel,
	cgroups, linux-api, kernel-team

On Thu, Apr 07, 2016 at 03:35:47AM -0400, Johannes Weiner wrote:
> There was a lot of back and forth whether we should add a second set
> of knobs just to control the local tasks separately from the subtree,
> but ended up concluding that the situation can be expressed more
> clearly by creating dedicated leaf subgroups for stuff like management
> software and launchers instead, so that their memory pools/LRUs are
> clearly delineated from other groups and separately controllable. And
> we couldn't think of any meaningful configuration that could not be
> expressed in that scheme. I mean, it's the same thing, right?


No, not the same.


	R
      / | \
     t1	t2 A
         /   \
        t3   t4


Is fundamentally different from:


             R
	   /   \
	 L       A
       /   \   /   \
      t1  t2  t3   t4


Because if in the first hierarchy you add a task (t5) to R, A as a
whole will run at 1/4th of the total bandwidth where before it had
1/3rd, whereas with the second example, if you add our t5 to L, A
doesn't get any less bandwidth.
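
Toy arithmetic behind the example, assuming the default weight of 1024
for every entity at a given level (each entity's share is its weight
over the sum of its siblings' weights):

    #include <stdio.h>

    static double share(int weight, int sibling_weight_sum)
    {
        return (double)weight / sibling_weight_sum;
    }

    int main(void)
    {
        /* Hierarchy 1: R's children are t1, t2, A (then t5 too). */
        printf("A before t5: %.3f\n", share(1024, 3 * 1024)); /* 1/3 */
        printf("A after t5:  %.3f\n", share(1024, 4 * 1024)); /* 1/4 */

        /* Hierarchy 2: R's children are L and A; t5 lands in L. */
        printf("A either way: %.3f\n", share(1024, 2 * 1024)); /* 1/2 */
        return 0;
    }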


Please pull your collective heads out of the systemd arse and start
thinking.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-07  8:08         ` Peter Zijlstra
@ 2016-04-07  9:28           ` Johannes Weiner
  2016-04-07 10:42             ` Peter Zijlstra
  2016-04-07 19:45           ` Tejun Heo
  1 sibling, 1 reply; 50+ messages in thread
From: Johannes Weiner @ 2016-04-07  9:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, torvalds, akpm, mingo, lizefan, pjt, linux-kernel,
	cgroups, linux-api, kernel-team

On Thu, Apr 07, 2016 at 10:08:33AM +0200, Peter Zijlstra wrote:
> On Thu, Apr 07, 2016 at 03:35:47AM -0400, Johannes Weiner wrote:
> > On Thu, Apr 07, 2016 at 08:45:49AM +0200, Peter Zijlstra wrote:
> > > So I recently got made aware of the fact that cgroupv2 doesn't allow
> > > tasks to be associated with !leaf cgroups; this is yet another
> > > capability of the cpu cgroup you've destroyed.
> > 
> > May I ask how you are using that?
> 
> _I_ use a kernel with CONFIG_CGROUPS=n (yes really).
> 
> But seriously? You have to ask?
> 
> The root cgroup is by definition not a leaf, and all tasks start life
> there, and some cannot ever be moved out.
> 
> Therefore _everybody_ uses this.

Hm? The root group can always contain tasks. That's not the only thing
the root is exempt from; it can't control any resources either:

sched_group_set_shares():

	/*
	 * We can't change the weight of the root cgroup.
	 */
	if (!tg->se[0])
		return -EINVAL;

tg_set_cfs_bandwidth():

	if (tg == &root_task_group)
		return -EINVAL;

etc.

and all the problems that led to this rule stem from resource control.

> > The behavior for tasks in !leaf groups was fairly inconsistent across
> > controllers because they all did different things, or didn't handle it
> > at all.
> 
> Then they're all bloody broken, because fully hierarchical was an early
> requirement for cgroups; I know, because I had to throw away many days
> of work and start over with cgroup support when they did that.

I think we're talking past each other.

They're all fully hierarchical in the sense of accounting and divvying
up resources along a tree structure, and configurable groups competing
with other configurable groups or subtrees. That all works perfectly
fine. It's the concept of loose unconfigurable tasks competing with
configured groups or subtrees that invites problems.

It's not a question of implementation, it's that the configurations
that people created with e.g. the memory controller repeatedly ended
up creating the same problems and the same stupid patches to add the
local-only knobs (which the cpu cgroup doesn't have either AFAICS).

This is not some gratuitous cutting away of convenience, it's hours
and hours of discussions both on the mailing lists and at conferences
about such lovely stuff as whether the memory lowlimit (softlimit)
should apply to only the local memory pool or hierarchically, because
that user happened to have memory pools in !leaf nodes which they had
to control somehow.

Swear to god.

[ And yes, the root group IS "loose unconfigurable tasks" that compete
  with configured subtrees. But that is very explicit in the interface
  and you move stuff that consumes significant resources and needs to
  be controlled out of the root group; it doesn't have the same issue. ]

If that happens once or twice I'm willing to write it off as PEBCAK,
but if otherwise competent users like Google repeatedly create
configurations that lead to these problems, and then end up pushing
and lobbying in this case for non-hierarchical knobs to work around
problems in the structural organization of the workloads, it's more
likely that the interface is shit.

So we added a rule that doesn't take away any functionality, but it
forces you to organize your workloads more explicitly to take away
that ambiguity.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-07  9:28           ` Johannes Weiner
@ 2016-04-07 10:42             ` Peter Zijlstra
  0 siblings, 0 replies; 50+ messages in thread
From: Peter Zijlstra @ 2016-04-07 10:42 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, torvalds, akpm, mingo, lizefan, pjt, linux-kernel,
	cgroups, linux-api, kernel-team

On Thu, Apr 07, 2016 at 05:28:24AM -0400, Johannes Weiner wrote:
> Hm? The root group can always contain tasks. That's not the only thing
> the root is exempt from; it can't control any resources either:

It does in fact control resources; the hierarchy directly affects the
proportional distribution of time.

> sched_group_set_shares():
> 
> 	/*
> 	 * We can't change the weight of the root cgroup.
> 	 */
> 	if (!tg->se[0])
> 		return -EINVAL;

The root has, by definition, no siblings, so setting a weight is
entirely pointless.

> tg_set_cfs_bandwidth():
> 
> 	if (tg == &root_task_group)
> 		return -EINVAL;
> 

We have had patches to implement this, but have so far held off because
they add a bunch of cycles to some really hot paths and we'd rather not
do that. It's not impossible, or unthinkable, to do this otherwise.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-07  8:28         ` Peter Zijlstra
@ 2016-04-07 19:04           ` Johannes Weiner
  2016-04-07 19:31             ` Peter Zijlstra
  0 siblings, 1 reply; 50+ messages in thread
From: Johannes Weiner @ 2016-04-07 19:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, torvalds, akpm, mingo, lizefan, pjt, linux-kernel,
	cgroups, linux-api, kernel-team

On Thu, Apr 07, 2016 at 10:28:10AM +0200, Peter Zijlstra wrote:
> On Thu, Apr 07, 2016 at 03:35:47AM -0400, Johannes Weiner wrote:
> > There was a lot of back and forth whether we should add a second set
> > of knobs just to control the local tasks separately from the subtree,
> > but ended up concluding that the situation can be expressed more
> > clearly by creating dedicated leaf subgroups for stuff like management
> > software and launchers instead, so that their memory pools/LRUs are
> > clearly delineated from other groups and separately controllable. And
> > we couldn't think of any meaningful configuration that could not be
> > expressed in that scheme. I mean, it's the same thing, right?
> 
> No, not the same.
> 
> 
> 	R
>       / | \
>      t1	t2 A
>          /   \
>         t3   t4
> 
> 
> Is fundamentally different from:
> 
> 
>              R
> 	   /   \
> 	 L       A
>        /   \   /   \
>       t1  t2  t3   t4
> 
> 
> Because if in the first hierarchy you add a task (t5) to R, A
> will run at 1/4th of total bandwidth where before it had 1/3rd, whereas
> with the second example, if you add our t5 to L, A doesn't get any less
> bandwidth.

I didn't mean the same exact configuration; I meant being able to
configure with the same outcome of resource distribution.

All this means here is that if you want to change the shares allocated
to the tasks in R (or then L) you have to be explicit about it and
update the weight configuration in L.
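
To make that concrete, here is a rough sketch of the share math for
the two layouts above, assuming every entity carries the scheduler's
default nice-0 weight of 1024 (illustrative userspace arithmetic, not
kernel code):

	#include <stdio.h>

	int main(void)
	{
		double w = 1024.0;

		/* Layout 1: t1, t2 and A compete directly under R. */
		printf("A before t5: %.3f\n", w / (3 * w));	/* 1/3 */
		printf("A after t5:  %.3f\n", w / (4 * w));	/* 1/4 */

		/* Layout 2: L{t1,t2} and A compete under R; adding
		 * t5 to L leaves A's share at the R level untouched. */
		printf("A with L:    %.3f\n", w / (2 * w));	/* 1/2 */

		/* For each task in L to keep parity with a lone
		 * sibling task, L's weight has to track its task
		 * count: weight(L) = nr_tasks * 1024. */
		printf("weight(L) for parity: %.0f\n", 3 * w);
		return 0;
	}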

Again, it's not gratuitous; it's based on the problems this interface
concept created in more comprehensive container deployments.

> Please pull your collective heads out of the systemd arse and start
> thinking.

I don't care about systemd here. In fact, in 5 years of rewriting the
memory controller, zero percent of it was driven by systemd and most
of it by Google's feedback at LSF and over email, since they had by far
the most experience and were pushing the frontier. And even though the
performance and overhead of the memory controller was absolutely
abysmal - routinely hitting double digits in page fault profiles - the
discussions *always* centered around the interface and configuration.

IMO, this thread is a little too focused on the reality of a single
resource controller, when in real setups it doesn't exist in a vacuum.
What these environments need is to robustly divide the machine up into
parcels to isolate thousands of jobs on X dimensions at the same time:
allocate CPU time, allocate memory, allocate IO. And then on top of
that implement higher concepts such as dirty page quotas and
writeback, accounting for kswapd's cpu time based on who owns the
memory it reclaims, accounting IO time for the stuff it swaps out
etc. That *needs* all three resources to be coordinated.

You disparagingly called it the lowest common denominator, but the
thing is that streamlining the controllers and coordinating them
around shared resource domains gives us much more powerful and robust
ways to allocate the *machines* as a whole, and allows the proper
tracking and accounting of cross-domain operations such as writeback
that wasn't even possible before. And all that in a way that doesn't
have the same usability pitfalls that v1 had when you actually push
this stuff beyond the "i want to limit the cpu cycles of this one
service" and move towards "this machine is an anonymous node in a data
center and I want it to host thousands of different workloads - some
sensitive to latency, some that only care about throughput - and they
better not step on each other's toes on *any* of the resource pools."

Those are my primary concerns when it comes to the v2 interface, and I
think focusing too much on what's theoretically possible with a single
controller is missing the bigger challenge of allocating machines.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-07 19:04           ` Johannes Weiner
@ 2016-04-07 19:31             ` Peter Zijlstra
  2016-04-07 20:23               ` Johannes Weiner
  0 siblings, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2016-04-07 19:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, torvalds, akpm, mingo, lizefan, pjt, linux-kernel,
	cgroups, linux-api, kernel-team

On Thu, Apr 07, 2016 at 03:04:24PM -0400, Johannes Weiner wrote:

> All this means here is that if you want to change the shares allocated
> to the tasks in R (or then L) you have to be explicit about it and
> update the weight configuration in L.

Updating the weight of L for every task spawned and killed is simply not
an option.

The fact that you're not willing to admit to this is troubling, but does
confirm I can stop spending time on anything cgroup v2. cpu-cgroup just
isn't going to move to this inferior interface.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-07  8:08         ` Peter Zijlstra
  2016-04-07  9:28           ` Johannes Weiner
@ 2016-04-07 19:45           ` Tejun Heo
  2016-04-07 20:25             ` Peter Zijlstra
  1 sibling, 1 reply; 50+ messages in thread
From: Tejun Heo @ 2016-04-07 19:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Johannes Weiner, torvalds, akpm, mingo, lizefan, pjt,
	linux-kernel, cgroups, linux-api, kernel-team

Hello, Peter.

On Thu, Apr 07, 2016 at 10:08:33AM +0200, Peter Zijlstra wrote:
> On Thu, Apr 07, 2016 at 03:35:47AM -0400, Johannes Weiner wrote:
> > So it was a nice cleanup for the memory controller and I believe the
> > IO controller as well. I'd be curious how it'd be a problem for CPU?
> 
> The full hierarchy took years to make work and is fully ingrained with
> how the thing works; changing it isn't going to be nice or easy.
> 
> So sure, go with a lowest common denominator, instead of fixing shit,
> yay for progress :/

It's easy to get fixated on what each subsystem can do and develop
each subsystem in its own silo.  That's what we've had for quite a
while in cgroup and, expectedly, it sent controllers off in different
directions.  Direct competition between tasks and child cgroups was
one of the main sources of balkanization.

The balkanization was no coincidence either.  Tasks and cgroups are
different types of entities and don't have the same control knobs or
follow the same lifetime rules.  For absolute limits, it isn't clear
how much of the parent's resources should be distributed to internal
children as opposed to child cgroups.  People end up depending on
specific implementation details and proposing one-off hacks and
interface additions.

Proportional weights aren't much better either.  CPU has an internal
mapping between nice values and shares and treats them equally, which
can get confusing as the configured weights behave differently
depending on how many threads are in the parent cgroup, which often is
opaque and can't be controlled from outside.  Widely diverging from
CPU's behavior, IO grouped all internal tasks into an internal leaf
node and used to assign a fixed weight to it.
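
To illustrate the opacity: a child cgroup configured with weight 1024
next to N threads in its parent gets 1/(N+1) of the parent's CPU, so
its effective share quietly shrinks as the thread count grows.  A
rough sketch, assuming the default nice-0 weight of 1024 for both the
cgroup and each thread (illustrative arithmetic only):

	#include <stdio.h>

	int main(void)
	{
		double cw = 1024.0, tw = 1024.0; /* cgroup, per-thread */
		int n;

		/* The cgroup's share next to n sibling threads. */
		for (n = 1; n <= 8; n *= 2)
			printf("%d threads -> cgroup share %.3f\n",
			       n, cw / (cw + n * tw));
		return 0;
	}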

Now, you might think that none of it matters and each subsystem
treating the cgroup hierarchy as arbitrary and orthogonal collections
of bean counters is fine; however, that makes it impossible to account
for and control operations which span different types of resources.
This prevented us from implementing resource control over frigging
buffered writes, making the whole IO control thing a joke.  While CPU
currently doesn't directly tie into it, that is only because CPU
cycles spent during writeback aren't yet properly accounted.

The structural constraints and resulting consistency don't just
subtract from the abilities of each controller.  They establish a
common base, the shared resource domains and consistent behaviors on
top of them, upon which further capabilities can be built, capabilities
as fundamental as comprehensive resource control over buffered
writeback.

It can be convenient to have subsystem-specific raw bean counters.  If
that's what the use case calls for, individual controllers can easily
be moved to a separate hierarchy, although they would naturally lose
the capabilities that come from cooperating over shared resource domains.
However, please understand that there are a lot of use cases where
comprehensive and consistent resource accounting and control over all
major resources is useful and necessary.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-07 19:31             ` Peter Zijlstra
@ 2016-04-07 20:23               ` Johannes Weiner
  2016-04-08  3:13                 ` Mike Galbraith
  0 siblings, 1 reply; 50+ messages in thread
From: Johannes Weiner @ 2016-04-07 20:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, torvalds, akpm, mingo, lizefan, pjt, linux-kernel,
	cgroups, linux-api, kernel-team

On Thu, Apr 07, 2016 at 09:31:27PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 07, 2016 at 03:04:24PM -0400, Johannes Weiner wrote:
> 
> > All this means here is that if you want to change the shares allocated
> > to the tasks in R (or then L) you have to be explicit about it and
> > update the weight configuration in L.
> 
> Updating the weight of L for every task spawned and killed is simply not
> an option.
> 
> The fact that you're not willing to admit to this is troubling, but does
> confirm I can stop spending time on anything cgroup v2. cpu-cgroup just
> isn't going to move to this inferior interface.

I guess I walked right into that one, didn't I ;-) It probably makes
more sense to discuss a real-life workload instead of a diagram.

Obviously, if the threadpool size is highly variable it's not
reasonable to ask the user to track every update and accordingly
reconfigure the controller. I fully agree with you there.

All I meant to point out is that the *implicit* behavior of the v1
interface did create real problems, to show you that this is not a
one-sided discussion and that there are real life concerns that played
into the decision of not letting loose tasks compete with groups.

If this is a real workload rather than a thought experiment, it will
need to be supported in v2 as well - hopefully, if we can help it,
without reverting to the tricky behavior of the v1 controller. One
possible solution I could imagine, for example, is adding the option
to configure a group's weight such that it's dynamically based on the
number of threads. But it depends on what the exact requirements are here.
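
To give a sense of that option, here is a minimal userspace
approximation against the v1 cpu.shares knob; the cgroup path, the
1024-per-thread rule and the idea of doing this from userspace at all
are assumptions for illustration, not a proposal for the interface:

	#include <stdio.h>

	int main(void)
	{
		const char *grp = "/sys/fs/cgroup/cpu/L"; /* hypothetical */
		char path[256];
		FILE *f;
		int c, n = 0;

		/* Count the group's threads: one tid per line. */
		snprintf(path, sizeof(path), "%s/tasks", grp);
		f = fopen(path, "r");
		if (!f)
			return 1;
		while ((c = fgetc(f)) != EOF)
			if (c == '\n')
				n++;
		fclose(f);

		/* Keep the group's weight at 1024 per thread so each
		 * thread stays equivalent to a lone sibling task. */
		snprintf(path, sizeof(path), "%s/cpu.shares", grp);
		f = fopen(path, "w");
		if (!f)
			return 1;
		fprintf(f, "%d\n", n > 0 ? n * 1024 : 1024);
		fclose(f);
		return 0;
	}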

Could you explain what this workload is so it's easier to reason about?

Mike, is that the one you referred to with one group per customer
account? If so, would you have a pointer to where you outline it?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-07 19:45           ` Tejun Heo
@ 2016-04-07 20:25             ` Peter Zijlstra
  2016-04-08 20:11               ` Tejun Heo
  0 siblings, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2016-04-07 20:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, torvalds, akpm, mingo, lizefan, pjt,
	linux-kernel, cgroups, linux-api, kernel-team

On Thu, Apr 07, 2016 at 03:45:55PM -0400, Tejun Heo wrote:
> Hello, Peter.
> 
> On Thu, Apr 07, 2016 at 10:08:33AM +0200, Peter Zijlstra wrote:
> > On Thu, Apr 07, 2016 at 03:35:47AM -0400, Johannes Weiner wrote:
> > > So it was a nice cleanup for the memory controller and I believe the
> > > IO controller as well. I'd be curious how it'd be a problem for CPU?
> > 
> > The full hierarchy took years to make work and is fully ingrained with
> > how the thing works; changing it isn't going to be nice or easy.
> > 
> > So sure, go with a lowest common denominator, instead of fixing shit,
> > yay for progress :/
> 
> It's easy to get fixated on what each subsystem can do and develop
> each subsystem in its own silo.  That's what we've had for quite a
> while in cgroup and, expectedly, it sent controllers off in different
> directions.  Direct competition between tasks and child cgroups was
> one of the main sources of balkanization.
> 
> The balkanization was no coincidence either.  Tasks and cgroups are
> different types of entities and don't have the same control knobs or
> follow the same lifetime rules.  For absolute limits, it isn't clear
> how much of the parent's resources should be distributed to internal
> children as opposed to child cgroups.  People end up depending on
> specific implementation details and proposing one-off hacks and
> interface additions.

Yes, I'm familiar with the problem; but simply mandating leaf only nodes
is not a solution, for the very simple fact that there are tasks in the
root cgroup that cannot ever be moved out, so we _must_ be able to deal
with !leaf nodes containing tasks.

A consistent interface for absolute controllers to divvy up the
resources between local tasks and child cgroups isn't _that_ hard.

And this leaf only business totally screwed over anything proportional.

This simply cannot work.

> Proportional weights aren't much better either.  CPU has an internal
> mapping between nice values and shares and treats them equally, which
> can get confusing as the configured weights behave differently
> depending on how many threads are in the parent cgroup, which often is
> opaque and can't be controlled from outside.

Huh what? There's nothing confusing there; the nice to weight mapping is
static and can easily be consulted. Alternatively we can make an
interface where you can set weight through nice values, for those people
that are afraid of numbers.

But the configured weights do _not_ behave differently depending on the
number of tasks; they behave exactly as specified in the proportional
weight based rate distribution. We've done the math...
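
For reference, the distribution at each level is simply
share_i = w_i / sum_j(w_j) over the siblings, and the static table
being consulted (prio_to_weight[] in the scheduler around this time;
nice 0 maps to 1024 and each nice step scales the weight by ~1.25):

	static const int prio_to_weight[40] = {
	 /* -20 */     88761,     71755,     56483,     46273,     36291,
	 /* -15 */     29154,     23254,     18705,     14949,     11916,
	 /* -10 */      9548,      7620,      6100,      4904,      3906,
	 /*  -5 */      3121,      2501,      1991,      1586,      1277,
	 /*   0 */      1024,       820,       655,       526,       423,
	 /*   5 */       335,       272,       215,       172,       137,
	 /*  10 */       110,        87,        70,        56,        45,
	 /*  15 */        36,        29,        23,        18,        15,
	};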

> Widely diverging from
> CPU's behavior, IO grouped all internal tasks into an internal leaf
> node and used to assign a fixed weight to it.

That's just plain broken... That is not how a proportional weight based
hierarchical controller works.

> Now, you might think that none of it matters and each subsystem
> treating the cgroup hierarchy as arbitrary and orthogonal collections
> of bean counters is fine; however, that makes it impossible to account
> for and control operations which span different types of resources.
> This prevented us from implementing resource control over frigging
> buffered writes, making the whole IO control thing a joke.  While CPU
> currently doesn't directly tie into it, that is only because CPU
> cycles spent during writeback aren't yet properly accounted.

CPU cycles spent in waitqueues aren't properly accounted to whoever
queued the job either, and there's a metric ton of async stuff that's
not properly accounted, so what?

> However, please understand that there are a lot of use cases where
> comprehensive and consistent resource accounting and control over all
> major resources is useful and necessary.

Maybe, but so far I've only heard people complain this v2 thing didn't
work for them, and as far as I can see the whole v2 model is internally
inconsistent and impossible to implement.

The suggestion by Johannes to adjust the leaf node weight depending on
the number of tasks in it is so ludicrous I don't even know where to start
enumerating the fail.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-07 20:23               ` Johannes Weiner
@ 2016-04-08  3:13                 ` Mike Galbraith
  0 siblings, 0 replies; 50+ messages in thread
From: Mike Galbraith @ 2016-04-08  3:13 UTC (permalink / raw)
  To: Johannes Weiner, Peter Zijlstra
  Cc: Tejun Heo, torvalds, akpm, mingo, lizefan, pjt, linux-kernel,
	cgroups, linux-api, kernel-team

On Thu, 2016-04-07 at 16:23 -0400, Johannes Weiner wrote:

> Mike, is that the one you referred to with one group per customer
> account? If so, would you have a pointer to where you outline it?

The usage I loosely outlined, I did in this thread.  All of the gory
details I do not have, do not want, and could not provide if I did.

	-Mike

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-07 20:25             ` Peter Zijlstra
@ 2016-04-08 20:11               ` Tejun Heo
  2016-04-09  6:16                 ` Mike Galbraith
                                   ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Tejun Heo @ 2016-04-08 20:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Johannes Weiner, torvalds, akpm, mingo, lizefan, pjt,
	linux-kernel, cgroups, linux-api, kernel-team

Hello, Peter.

On Thu, Apr 07, 2016 at 10:25:42PM +0200, Peter Zijlstra wrote:
> > The balkanization was no coincidence either.  Tasks and cgroups are
> > different types of entities and don't have the same control knobs or
> > follow the same lifetime rules.  For absolute limits, it isn't clear
> > how much of the parent's resources should be distributed to internal
> > children as opposed to child cgroups.  People end up depending on
> > specific implementation details and proposing one-off hacks and
> > interface additions.
> 
> Yes, I'm familiar with the problem; but simply mandating leaf only nodes
> is not a solution, for the very simple fact that there are tasks in the
> root cgroup that cannot ever be moved out, so we _must_ be able to deal
> with !leaf nodes containing tasks.

As Johannes already pointed out, the root cgroup has always been
special.  While pure practicality, performance implications and
implementation convenience do play important roles in the special
treatment, another contributing aspect is avoiding exposing
statistics and control knobs which are duplicates of and/or
conflicting with what's already available at the system level.  It's
never fun to have multiple sources of truth.

> A consistent interface for absolute controllers to divvy up the
> resources between local tasks and child cgroups isn't _that_ hard.

I've spent months thinking about it and didn't get too far.  If you
have a good solution, I'd be happy to be enlightened.  Also, please
note that the current solution is based on restricting certain
configurations.  If we can find a better solution, we can relax the
relevant constraints and move on to it without breaking compatibility.

> And this leaf only business totally screwed over anything proportional.
> 
> This simply cannot work.

Will get to this below.

> > Proportional weights aren't much better either.  CPU has an internal
> > mapping between nice values and shares and treats them equally, which
> > can get confusing as the configured weights behave differently
> > depending on how many threads are in the parent cgroup, which often is
> > opaque and can't be controlled from outside.
> 
> Huh what? There's nothing confusing there; the nice to weight mapping is
> static and can easily be consulted. Alternatively we can make an
> interface where you can set weight through nice values, for those people
> that are afraid of numbers.
>
> But the configured weights do _not_ behave differently depending on the
> number of tasks; they behave exactly as specified in the proportional
> weight based rate distribution. We've done the math...

Yes, once one understands what's going on, it isn't confusing.  It's
just not something users can intuitively understand from the presented
interface.  The confusion of course is worsened severely by different
controller behaviors.

> > Widely diverging from
> > CPU's behavior, IO grouped all internal tasks into an internal leaf
> > node and used to assign a fixed weight to it.
> 
> That's just plain broken... That is not how a proportional weight based
> hierarchical controller works.

That's a strong statement.  When the hierarchy is composed of
equivalent objects as in CPU, not distinguishing internal and leaf
nodes would be a more natural way to organize; however, it isn't
necessarily true in all cases.  For example, while a writeback IO
would be issued by some task, the task itself might not have done
anything to cause that IO and the IO would essentially be anonymous in
the resource domain.  Also, different controllers use different units
of organization - CPU sees threads, IO sees IO contexts which are
usually shared in a process.  The difference would lead to differing
scaling behaviors in proportional distribution.

While the separate buckets and entities model may not be as elegant as
a tree of uniform objects, it is far from uncommon and more robust when
dealing with different types of objects.
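
As a toy example of the differing scaling (weights of 1024 everywhere
are assumed; illustrative arithmetic, not controller code), take a
4-thread process P next to 4 single-threaded peers:

	#include <stdio.h>

	int main(void)
	{
		double w = 1024.0;
		int p_threads = 4, peers = 4;

		/* CPU weighs each thread individually. */
		printf("CPU share of P: %.2f\n",
		       p_threads * w / ((p_threads + peers) * w)); /* 0.50 */

		/* IO weighs P's single shared io_context against
		 * the peers' contexts. */
		printf("IO share of P:  %.2f\n",
		       w / ((1 + peers) * w));			/* 0.20 */
		return 0;
	}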

> > Now, you might think that none of it matters and each subsystem
> > treating the cgroup hierarchy as arbitrary and orthogonal collections
> > of bean counters is fine; however, that makes it impossible to account
> > for and control operations which span different types of resources.
> > This prevented us from implementing resource control over frigging
> > buffered writes, making the whole IO control thing a joke.  While CPU
> > currently doesn't directly tie into it, that is only because CPU
> > cycles spent during writeback aren't yet properly accounted.
> 
> CPU cycles spent in waitqueues aren't properly accounted to whoever
> queued the job either, and there's a metric ton of async stuff that's
> not properly accounted, so what?

The ultimate goal of cgroup resource control is accounting and
controlling all significant resource consumptions as configured.  Some
system operations are inherently global and others are simply too
cheap to justify the overhead; however, there still are significant
aggregate operations which are being missed, including almost
everything taking place in the writeback path.  So, yes, we eventually
want to be able to account for them, of course in a way which doesn't
get in the way of actual operation.

> > However, please understand that there are a lot of use cases where
> > comprehensive and consistent resource accounting and control over all
> > major resources is useful and necessary.
> 
> Maybe, but so far I've only heard people complain this v2 thing didn't
> work for them, and as far as I can see the whole v2 model is internally
> inconsistent and impossible to implement.

I suppose we live in different bubbles.  Can you please elaborate
which parts of the cgroup v2 model are internally inconsistent and
impossible to implement?  I'd be happy to rectify the situation.

> The suggestion by Johannes to adjust the leaf node weight depending on
> the number of tasks in it is so ludicrous I don't even know where to start
> enumerating the fail.

That sounds like a pretty uncharitable way to read his message.  I
think he was trying to find out the underlying requirements so that a
way forward can be discussed.  I do have the same question.  It's
difficult to have discussions about trade-offs without knowing where
the requirements are coming from.  Do you have something in mind for
cases where internal tasks have to compete with sibling cgroups?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-08 20:11               ` Tejun Heo
@ 2016-04-09  6:16                 ` Mike Galbraith
  2016-04-09 13:39                 ` Peter Zijlstra
  2016-04-09 16:02                 ` Peter Zijlstra
  2 siblings, 0 replies; 50+ messages in thread
From: Mike Galbraith @ 2016-04-09  6:16 UTC (permalink / raw)
  To: Tejun Heo, Peter Zijlstra
  Cc: Johannes Weiner, torvalds, akpm, mingo, lizefan, pjt,
	linux-kernel, cgroups, linux-api, kernel-team

On Fri, 2016-04-08 at 16:11 -0400, Tejun Heo wrote:

> > That's just plain broken... That is not how a proportional weight based
> > hierarchical controller works.
> 
> That's a strong statement.  When the hierarchy is composed of
> equivalent objects as in CPU, not distinguishing internal and leaf
> nodes would be a more natural way to organize; however...

You almost said it yourself: you want to make the natural organization
of the cpu, cpuacct and cpuset controllers a prohibited organization.
There is no "however..." that can possibly justify that.  It's akin to
mandating: henceforth "horse" shall be spelled "cow", and riders thereof
shall teach their "cow" the proper enunciation of "moo".  It's silly.

Like it or not, these controllers have thread encoded in their DNA,
it's an integral part of what they are, and how real users in the real
world use them.  No rationalization will change that cold hard fact.

Make an "Aunt Tilly" button for those incapable of comprehending the
complexities if you will, but please don't make cgroups so rigid and
idiot proof that only idiots (hi system thing) can use it.

	-Mike

 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-08 20:11               ` Tejun Heo
  2016-04-09  6:16                 ` Mike Galbraith
@ 2016-04-09 13:39                 ` Peter Zijlstra
  2016-04-12 22:29                   ` Tejun Heo
  2016-04-09 16:02                 ` Peter Zijlstra
  2 siblings, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2016-04-09 13:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, torvalds, akpm, mingo, lizefan, pjt,
	linux-kernel, cgroups, linux-api, kernel-team

On Fri, Apr 08, 2016 at 04:11:35PM -0400, Tejun Heo wrote:

> > > Widely diverging from
> > > CPU's behavior, IO grouped all internal tasks into an internal leaf
> > > node and used to assign a fixed weight to it.
> > 
> > That's just plain broken... That is not how a proportional weight based
> > hierarchical controller works.
> 
> That's a strong statement. 

No, it's plain fact.

If you modify a graph, it is not the same graph.

Even if you argue by merit of the function on this graph, and state that
only the result of this function is important, and that any modification
to the graph that leaves this result intact is good, i.e. a modification
invariant under the function, this fails.

Because for proportional controllers all that matters is the number and
weight of edges leaving a node.

The modification described above does clearly change the outcome and is
not invariant under the proportional weight distribution function.

> When the hierarchy is composed of
> equivalent objects as in CPU, not distinguishing internal and leaf
> nodes would be a more natural way to organize; however, it isn't
> necessarily true in all cases.  For example, while a writeback IO
> would be issued by some task, the task itself might not have done
> anything to cause that IO and the IO would essentially be anonymous in
> the resource domain.  Also, different controllers use different units
> of organization - CPU sees threads, IO sees IO contexts which are
> usually shared in a process.  The difference would lead to differing
> scaling behaviors in proportional distribution.
> 
> While the separate buckets and entities model may not be as elegant as
> a tree of uniform objects, it is far from uncommon and more robust when
> dealing with different types of objects.

The graph does not care about the type of objects the nodes represent,
and proportional weight distribution only cares about the edges.

With cpu-cgroup the nodes are not of uniform type either; they can be a
group or a task. You get runtime type identification and make it work.

There just isn't an excuse for crazy crap like this. It's wrong, no two
ways about it.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-08 20:11               ` Tejun Heo
  2016-04-09  6:16                 ` Mike Galbraith
  2016-04-09 13:39                 ` Peter Zijlstra
@ 2016-04-09 16:02                 ` Peter Zijlstra
  2 siblings, 0 replies; 50+ messages in thread
From: Peter Zijlstra @ 2016-04-09 16:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, torvalds, akpm, mingo, lizefan, pjt,
	linux-kernel, cgroups, linux-api, kernel-team

On Fri, Apr 08, 2016 at 04:11:35PM -0400, Tejun Heo wrote:
> > Yes, I'm familiar with the problem; but simply mandating leaf only nodes
> > is not a solution, for the very simple fact that there are tasks in the
> > root cgroup that cannot ever be moved out, so we _must_ be able to deal
> > with !leaf nodes containing tasks.
> 
> As Johannes already pointed out, the root cgroup has always been
> special.

The root of the tree isn't special except for 2 properties.

 - it _is_ a root; iow, it doesn't have any incoming edges.
   This also means it doesn't have a parent; nor can it have a weight,
   since that is an edge property, not a node property.

 - it always exists; for without a root there is no tree.

Making it _more_ special is silly.

> > Maybe, but so far I've only heard people complain this v2 thing didn't
> > work for them, and as far as I can see the whole v2 model is internally
> > inconsistent and impossible to implement.
> 
> I suppose we live in different bubbles.  Can you please elaborate
> which parts of cgroup v2 model are internally inconsistent and
> impossible to implement?  I'd be happy to rectify the situation.

The fact that we have to deal with tasks in the root cgroup while not
allowing tasks in any other node is internally inconsistent.

If I can deal with tasks in one node (root) I can equally deal with
tasks in any other node in exactly the same manner.

Making it different is actually _more_ code.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-09 13:39                 ` Peter Zijlstra
@ 2016-04-12 22:29                   ` Tejun Heo
  2016-04-13  7:43                     ` Mike Galbraith
  0 siblings, 1 reply; 50+ messages in thread
From: Tejun Heo @ 2016-04-12 22:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Johannes Weiner, torvalds, akpm, mingo, lizefan, pjt,
	linux-kernel, cgroups, linux-api, kernel-team

Hello, Peter.

On Sat, Apr 09, 2016 at 03:39:17PM +0200, Peter Zijlstra wrote:
> > While the separate buckets and entities model may not be as elegant as
> > a tree of uniform objects, it is far from uncommon and more robust when
> > dealing with different types of objects.
> 
> The graph does not care about the type of objects the nodes represent,
> and proportional weight distribution only cares about the edges.
> 
> With cpu-cgroup the nodes are not of uniform type either; they can be a
> group or a task. You get runtime type identification and make it work.
>
> There just isn't an excuse for crazy crap like this. It's wrong, no two
> ways about it.

Abstracting tasks and groups as equivalent objects works well for the
scheduler and that's great.  This is also because the domain lends
itself very well to such a simple and elegant approach.  The only
entities of interest are tasks, as you and Mike pointed out earlier in
the thread, and group priority can be easily mapped to task priority.
However, this isn't necessarily the case for other controllers.

There's also the issue of mapping the model to absolute controllers.
For the uniform model to work, there must be a way to treat internal
and leaf entities in the same way.  For memory, the leaf entities are
processes, and applying the same model would mean that the memory
controller would have to implement equivalent per-process control
knobs.  We don't have that.  In fact, we can't have that - a
significant part of memory consumption can't be attached to a single
process.  There is a fundamental distinction between internal and leaf
nodes in the memory resource graph.

We aren't designing a spherical cow in a vacuum, and, I believe,
should aspire to make pragmatic trade-offs among all involved factors.
If multiple controllers co-operating on the same resource domains is
beneficial and required, we should figure out a way to make different
controllers agree, and that will most likely require some
trade-offs from various controllers.

Given the currently known requirements and constraints, restricting
internal competition is a simple and straight-forward way to isolate
leaf node handling details of different controllers.

The cost is part aesthetic and part practical.  While less elegant
than a tree of uniform objects, it seems a stretch to call the
internal / leaf node distinction broken, especially given that the
model is natural to some controllers.

The practical cost is loss of the ability to let leaf entities compete
against groups.  However, we can't evaluate how important such a
capability is without actual use-cases.  If there are important ones,
please bring them up, so that we can examine the actual requirements
and try to find a good trade-off to support them.

I understand that the CPU controller getting constrained due to other
controllers can feel frustrating; however, the constraint is there to
solve practical problems which hopefully are being explained in this
conversation.  If there is a better trade-off, we can easily get rid
of it and move on, but such a decision can only be made considering
all the relevant factors.  If you can think of a better solution,
let's please discuss it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-12 22:29                   ` Tejun Heo
@ 2016-04-13  7:43                     ` Mike Galbraith
  2016-04-13 15:59                       ` Tejun Heo
  0 siblings, 1 reply; 50+ messages in thread
From: Mike Galbraith @ 2016-04-13  7:43 UTC (permalink / raw)
  To: Tejun Heo, Peter Zijlstra
  Cc: Johannes Weiner, torvalds, akpm, mingo, lizefan, pjt,
	linux-kernel, cgroups, linux-api, kernel-team

On Tue, 2016-04-12 at 18:29 -0400, Tejun Heo wrote:
> Hello, Peter.
> 
> On Sat, Apr 09, 2016 at 03:39:17PM +0200, Peter Zijlstra wrote:
> > > While the separate buckets and entities model may not be as elegant as
> > > a tree of uniform objects, it is far from uncommon and more robust when
> > > dealing with different types of objects.
> > 
> > The graph does not care about the type of objects the nodes represent,
> > and proportional weight distribution only cares about the edges.
> > 
> > With cpu-cgroup the nodes are not of uniform type either; they can be a
> > group or a task. You get runtime type identification and make it work.
> > 
> > There just isn't an excuse for crazy crap like this. It's wrong, no two
> > ways about it.
> 
> Abstracting tasks and groups as equivalent objects works well for the
> scheduler and that's great.  This is also because the domain lends
> itself very well to such a simple and elegant approach.  The only
> entities of interest are tasks, as you and Mike pointed out earlier in
> the thread, and group priority can be easily mapped to task priority.
> However, this isn't necessarily the case for other controllers.
> 
> There's also the issue of mapping the model to absolute controllers.
> For the uniform model to work, there must be a way to treat internal
> and leaf entities in the same way.  For memory, the leaf entities are
> processes, and applying the same model would mean that the memory
> controller would have to implement equivalent per-process control
> knobs.  We don't have that.  In fact, we can't have that - a
> significant part of memory consumption can't be attached to a single
> process.  There is a fundamental distinction between internal and leaf
> nodes in the memory resource graph.
> 
> We aren't designing a spherical cow in a vacuum, and, I believe,
> should aspire to make pragmatic trade-offs among all involved factors.
> If multiple controllers co-operating on the same resource domains is
> beneficial and required, we should figure out a way to make different
> controllers agree, and that will most likely require some
> trade-offs from various controllers.
> 
> Given the currently known requirements and constraints, restricting
> internal competition is a simple and straight-forward way to isolate
> leaf node handling details of different controllers.
> 
> The cost is part aesthetic and part practical.  While less elegant
> than a tree of uniform objects, it seems a stretch to call the
> internal / leaf node distinction broken, especially given that the
> model is natural to some controllers.

That justifies prohibiting proper usage of three controllers: cpu,
cpuacct and cpuset?
 
> The practical cost is loss of the ability to let leaf entities compete
> against groups.  However, we can't evaluate how important such a
> capability is without actual use-cases.  If there are important ones,
> please bring them up, so that we can examine the actual requirements
> and try to find a good trade-off to support them.

Hm, I thought Google did that, and I know I mentioned another
gigabuck-sized outfit.  Whatever, ob trade-off...

Another cpuset example is something I was asked to look into recently.
There are folks out in the real world who want to run RT guests.  Now
VIRTUAL REALtime tickles my funny-bone, but I piddled around with it
nonetheless to see what such a setup can deliver (not much).  System
thing and/or libvirt created a cpuset home for qemu, but with VCPUs
sharing CPU with other qemu threads and the rest of the world, RT
performance in the little virtual box was as pathetic as one would
expect.  What did I do about it?  Among other things, the obvious: I
created an exclusive cpuset, and distributed qemu contexts having
different requirements among context containment vessels having the
required properties.

I won't be doing any more of that particular scenario, but certainly
will want to distribute various contexts among various context
containment vessels in the future.  I soon enough won't care about
cgroups, but others will surely expect the cpu, cpuacct and cpuset
controllers to continue to function properly.

> I understand that the CPU controller getting constrained due to other
> controllers can feel frustrating; however, the constraint is there to
> solve practical problems which hopefully are being explained in this
> conversation.  If there is a better trade-off, we can easily get rid
> of it and move on, but such a decision can only be made considering
> all the relevant factors.  If you can think of a better solution,
> let's please discuss it.

None here.  Any artificial restriction placed on controllers will
render same broken in one way or another that will matter to someone
somewhere.  Making something less than it was will do that.

	-Mike

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-13  7:43                     ` Mike Galbraith
@ 2016-04-13 15:59                       ` Tejun Heo
  2016-04-13 19:15                         ` Mike Galbraith
  2016-04-14  6:07                         ` Mike Galbraith
  0 siblings, 2 replies; 50+ messages in thread
From: Tejun Heo @ 2016-04-13 15:59 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Johannes Weiner, torvalds, akpm, mingo, lizefan,
	pjt, linux-kernel, cgroups, linux-api, kernel-team

Hello, Mike.

On Wed, Apr 13, 2016 at 09:43:01AM +0200, Mike Galbraith wrote:
> > The cost is part aesthetic and part practical.  While less elegant
> > than a tree of uniform objects, it seems a stretch to call the
> > internal / leaf node distinction broken, especially given that the
> > model is natural to some controllers.
> 
> That justifies prohibiting proper usage of three controllers: cpu,
> cpuacct and cpuset?

Neither cpuacct nor cpuset loses any capability from the constraint, as
there is no difference between tasks being in an internal cgroup or a
leaf cgroup nested under it.  The only practical impact is that we
lose the ability to let internal tasks compete against sibling cgroups
for proportional control.

> > The practical cost is loss of the ability to let leaf entities compete
> > against groups.  However, we can't evaluate how important such a
> > capability is without actual use-cases.  If there are important ones,
> > please bring them up, so that we can examine the actual requirements
> > and try to find a good trade-off to support them.
> 
> Hm, I thought Google did that, and I know I mentioned another
> gigabuck-sized outfit.  Whatever, ob trade-off...

Are you saying that you're aware that google or another big outfit is
making active use of internal tasks competing against sibling cgroups
for proportional CPU distribution?  If so, can you please be more
specific?

> Another cpuset example is something I was asked to look into recently. 

First of all, as mentioned above, cpuset isn't affected at all in
practical terms.  Besides, for a very specialized cpuset setup, the
cpuset configuration might not have anything to do with the resource
domains other controllers use and it might make sense to keep cpuset
on a separate hierarchy.

> > I understand that the CPU controller getting constrained due to other
> > controllers can feel frustrating; however, the constraint is there to
> > solve practical problems which hopefully are being explained in this
> > conversation.  If there is a better trade-off, we can easily get rid
> > of it and move on, but such a decision can only be made considering
> > all the relevant factors.  If you can think of a better solution,
> > let's please discuss it.
> 
> None here.  Any artificial restriction placed on controllers will
> render same broken in one way or another that will matter to someone
> somewhere.  Making something less than it was will do that.

The specifics of gains and losses are what I've been trying to clarify
in this thread.  Hopefully, what we can gain from sharing common
resource domains is clear by now.  The practical cost is loss of the
capability to let internal tasks compete against sibling cgroups for
proportional control.  However, to determine the weight of this cost,
we have to know which use-cases call for it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-13 15:59                       ` Tejun Heo
@ 2016-04-13 19:15                         ` Mike Galbraith
  2016-04-14  6:07                         ` Mike Galbraith
  1 sibling, 0 replies; 50+ messages in thread
From: Mike Galbraith @ 2016-04-13 19:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Johannes Weiner, torvalds, akpm, mingo, lizefan,
	pjt, linux-kernel, cgroups, linux-api, kernel-team

On Wed, 2016-04-13 at 11:59 -0400, Tejun Heo wrote:

> Are you saying that you're aware that google or another big outfit is
> making active use of internal tasks competing against sibling cgroups
> for proportional CPU distribution?  If so, can you please be more
> specific?

What I'm aware of is a big outfit that moves thread pool workers in/out
of a large number of cpu/cpuacct cgroups.  What all a worker thread may
spawn in a cgroup, or find already there upon arrival and thus compete
with, I do not know.

I'm baffled by why anyone would care which entity competes with which
other entity.  An entity is an entity is an entity.

	-Mike

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-13 15:59                       ` Tejun Heo
  2016-04-13 19:15                         ` Mike Galbraith
@ 2016-04-14  6:07                         ` Mike Galbraith
  2016-04-14 19:57                           ` Tejun Heo
  1 sibling, 1 reply; 50+ messages in thread
From: Mike Galbraith @ 2016-04-14  6:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Johannes Weiner, torvalds, akpm, mingo, lizefan,
	pjt, linux-kernel, cgroups, linux-api, kernel-team

On Wed, 2016-04-13 at 11:59 -0400, Tejun Heo wrote:
> Hello, Mike.
> 
> On Wed, Apr 13, 2016 at 09:43:01AM +0200, Mike Galbraith wrote:
> > > The cost is part aesthetic and part practical.  While less elegant
> > > than a tree of uniform objects, it seems a stretch to call the
> > > internal / leaf node distinction broken, especially given that the
> > > model is natural to some controllers.
> > 
> > That justifies prohibiting proper usage of three controllers: cpu,
> > cpuacct and cpuset?
> 
> Neither cpuacct nor cpuset loses any capability from the constraint, as
> there is no difference between tasks being in an internal cgroup or a
> leaf cgroup nested under it.  The only practical impact is that we
> lose the ability to let internal tasks compete against sibling cgroups
> for proportional control.

I'm not getting it.

A.  entity = task[s] | cgroup[s]
B.  entity = task[s] ^ cgroup[s]

A I get, B I don't, but you seem to be saying B, else we get the task
competes with sibling cgroup business.

Let /foo be an exclusive cpuset containing exclusive subset bar.
How can any task acquire set foo affinity if B really really
applies?  My box calls me a dummy if I try to create a "proper" home
for tasks, one with both no snobby neighbors and proper affinity.

	-Mike

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-14  6:07                         ` Mike Galbraith
@ 2016-04-14 19:57                           ` Tejun Heo
  2016-04-15  2:42                             ` Mike Galbraith
  0 siblings, 1 reply; 50+ messages in thread
From: Tejun Heo @ 2016-04-14 19:57 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Johannes Weiner, torvalds, akpm, mingo, lizefan,
	pjt, linux-kernel, cgroups, linux-api, kernel-team

Hello, Mike.

On Thu, Apr 14, 2016 at 08:07:37AM +0200, Mike Galbraith wrote:
> Let /foo be an exclusive cpuset containing exclusive subset bar.
> How can any task acquire set foo affinity if B really really
> applies?  My box calls me a dummy if I try to create a "proper" home
> for tasks, one with both no snobby neighbors and proper affinity.

I'm not sure I quite understand what you're saying.  Are you referring
to the cpuset.{cpu|mem}_exclusive knobs?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
  2016-04-14 19:57                           ` Tejun Heo
@ 2016-04-15  2:42                             ` Mike Galbraith
  0 siblings, 0 replies; 50+ messages in thread
From: Mike Galbraith @ 2016-04-15  2:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Johannes Weiner, torvalds, akpm, mingo, lizefan,
	pjt, linux-kernel, cgroups, linux-api, kernel-team

On Thu, 2016-04-14 at 15:57 -0400, Tejun Heo wrote:
> Hello, Mike.
> 
> On Thu, Apr 14, 2016 at 08:07:37AM +0200, Mike Galbraith wrote:
> > Let /foo be an exclusive cpuset containing exclusive subset bar.
> > How can any task acquire set foo affinity if B really really
> > applies?  My box calls me a dummy if I try to create a "proper" home
> > for tasks, one with both no snobby neighbors and proper affinity.
> 
> I'm not sure I quite understand what you're saying.  Are you referring
> to the cpuset.{cpu|mem}_exclusive knobs?

Yes.

	-Mike

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2016-04-15  2:42 UTC | newest]

Thread overview: 50+ messages
2016-03-11 15:41 [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Tejun Heo
2016-03-11 15:41 ` [PATCH 01/10] cgroup: introduce cgroup_[un]lock() Tejun Heo
2016-03-11 15:41 ` [PATCH 02/10] cgroup: un-inline cgroup_path() and friends Tejun Heo
2016-03-11 15:41 ` [PATCH 03/10] cgroup: introduce CGRP_MIGRATE_* flags Tejun Heo
2016-03-11 15:41 ` [PATCH 04/10] signal: make put_signal_struct() public Tejun Heo
2016-03-11 15:41 ` [PATCH 05/10] cgroup, fork: add @new_rgrp_cset[p] and @clone_flags to cgroup fork callbacks Tejun Heo
2016-03-11 15:41 ` [PATCH 06/10] cgroup, fork: add @child and @clone_flags to threadgroup_change_begin/end() Tejun Heo
2016-03-11 15:41 ` [PATCH 07/10] cgroup: introduce resource group Tejun Heo
2016-03-11 15:41 ` [PATCH 08/10] cgroup: implement rgroup control mask handling Tejun Heo
2016-03-11 15:41 ` [PATCH 09/10] cgroup: implement rgroup subtree migration Tejun Heo
2016-03-11 15:41 ` [PATCH 10/10] cgroup, sched: implement PRIO_RGRP for {set|get}priority() Tejun Heo
2016-03-11 16:05 ` Example program for PRIO_RGRP Tejun Heo
2016-03-12  6:26 ` [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP Mike Galbraith
2016-03-12 17:04   ` Mike Galbraith
2016-03-12 17:13     ` cgroup NAKs ignored? " Ingo Molnar
2016-03-13 14:42       ` Tejun Heo
2016-03-13 15:00   ` Tejun Heo
2016-03-13 17:40     ` Mike Galbraith
2016-04-07  0:00       ` Tejun Heo
2016-04-07  3:26         ` Mike Galbraith
2016-03-14  2:23     ` Mike Galbraith
2016-03-14 11:30 ` Peter Zijlstra
2016-04-06 15:58   ` Tejun Heo
2016-04-07  6:45     ` Peter Zijlstra
2016-04-07  7:35       ` Johannes Weiner
2016-04-07  8:05         ` Mike Galbraith
2016-04-07  8:08         ` Peter Zijlstra
2016-04-07  9:28           ` Johannes Weiner
2016-04-07 10:42             ` Peter Zijlstra
2016-04-07 19:45           ` Tejun Heo
2016-04-07 20:25             ` Peter Zijlstra
2016-04-08 20:11               ` Tejun Heo
2016-04-09  6:16                 ` Mike Galbraith
2016-04-09 13:39                 ` Peter Zijlstra
2016-04-12 22:29                   ` Tejun Heo
2016-04-13  7:43                     ` Mike Galbraith
2016-04-13 15:59                       ` Tejun Heo
2016-04-13 19:15                         ` Mike Galbraith
2016-04-14  6:07                         ` Mike Galbraith
2016-04-14 19:57                           ` Tejun Heo
2016-04-15  2:42                             ` Mike Galbraith
2016-04-09 16:02                 ` Peter Zijlstra
2016-04-07  8:28         ` Peter Zijlstra
2016-04-07 19:04           ` Johannes Weiner
2016-04-07 19:31             ` Peter Zijlstra
2016-04-07 20:23               ` Johannes Weiner
2016-04-08  3:13                 ` Mike Galbraith
2016-03-15 17:21 ` Michal Hocko
2016-04-06 21:53   ` Tejun Heo
2016-04-07  6:40     ` Peter Zijlstra
