linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 0/5] cgroup: Introducing bypass mode
@ 2018-12-10 21:40 Waiman Long
  2018-12-10 21:40 ` [PATCH v5 1/5] cgroup: subtree_control bypass mode for bypassable controllers Waiman Long
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Waiman Long @ 2018-12-10 21:40 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet, Johannes Weiner
  Cc: cgroups, linux-kernel, linux-doc, linux-api, Roman Gushchin,
	Jens Axboe, Andrew Morton, Dennis Zhou, Shakeel Butt,
	Michael Kerrisk, Waiman Long

 v5:
  - Fix a bug in css_create() reported by Dan Carpenter.

 v4:
  - Rebased to the latest for-4.21 branch of cgroup tree.
  - Make each controller explicitly opt in to become bypassable by
    setting the bypassable cgroup_subsys flag. Currently, only cpu
    controller is made bypassable.
  - Break out the cgroup_v2.rst documentation update as separate patch.

 v3:
  - Remove invalid cgroup subdirectory creation patch.
  - Add use cases for the bypass mode and removing statements about
    control files ownership in cgroup-v2.txt.
  - Restrict bypass mode to non-domain (threaded) controllers only.

 v3 patch - https://lkml.org/lkml/2017/8/9/604
 v4 patch - https://lkml.org/lkml/2018/11/20/1361

This patchset introduces a new bypass mode to the cgroup v2 core to
give more freedom and flexibility to controllers which choose to become
bypassable the freedom to shape their own unique views of the virtual
cgroup hierarchies that can best suit thier own use cases.

Because of the inherent performance overhead in enabling cpu controller,
it is made bypassable so that the controller only needs to be enabled
at those cgroups that really need it instead of in every cgroups at a
given layer if at least one of them needs it.

The cpu controller performance problem is one of the major issues
in migrating from cgroup v1 to v2.

For example,

  R - A(+) - B(#) - C(+)
                  \ D(#)

where "+" means the controller is enabled and "#" means the controller
is bypassed. For this controller's perspective, the cgroups are
equivalent to:

  R - A|B|D - C

Underneath the root R, cgoups A, B and D are controlled by one set of
knobs and cgroup C is controlled by another set of knobs as a child of
cgroups A|B|C.

This patchset is layered on top of the "for-4.21" branch of Tejun's
cgroup git tree.

Patch 1 introduces a new bypass mode that allows a bypassable
controller to be disabled in a cgroup, but to be re-enabled again in its
children. This is enabled by writing the controller name prefixed with
'#' to the "cgroup.subtree_control" file. Then all its children will
have this controller in bypass mode.

Patch 2 extends the bypass mode mechanism to allow those child cgroups
that are put into the bypass mode for a particular bypassable controller
by their parent to be re-enabled again by writing the controller name
with the '+' prefix to the "cgroup.controllers" file.

Patch 3 extends the debug controller to expose additional controller
masks introduced by this patchset.

Patch 4 makes the cpu controller bypassable.

Patch 5 documents the new bypass mode in cgroup-v2.rst file.

Waiman Long (5):
  cgroup: subtree_control bypass mode for bypassable controllers
  cgroup: Allow reenabling of controller in bypass mode
  cgroup: Make debug controller report new controller masks
  sched/core: Make cpu cgroup controller bypassable
  cgroup: Document bypass mode

 Documentation/admin-guide/cgroup-v2.rst |  66 +++++---
 include/linux/cgroup-defs.h             |  26 +++-
 kernel/cgroup/cgroup.c                  | 258 +++++++++++++++++++++++++-------
 kernel/cgroup/debug.c                   |   2 +
 kernel/sched/core.c                     |   1 +
 5 files changed, 279 insertions(+), 74 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v5 1/5] cgroup: subtree_control bypass mode for bypassable controllers
  2018-12-10 21:40 [PATCH v5 0/5] cgroup: Introducing bypass mode Waiman Long
@ 2018-12-10 21:40 ` Waiman Long
  2018-12-10 21:40 ` [PATCH v5 2/5] cgroup: Allow reenabling of controller in bypass mode Waiman Long
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Waiman Long @ 2018-12-10 21:40 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet, Johannes Weiner
  Cc: cgroups, linux-kernel, linux-doc, linux-api, Roman Gushchin,
	Jens Axboe, Andrew Morton, Dennis Zhou, Shakeel Butt,
	Michael Kerrisk, Waiman Long

A controller in the default hierarchy is bypassable if the bypassable
flag is set in its cgroup_subsys structure.  The special prefix '#'
attached to a bypassable controller name can now be written into the
cgroup.subtree_control file to set that controller in bypass mode in
all the child cgroups. The controller will show up in the children's
cgroup.controllers file, but the corresponding control knobs will be
absent. However, that controller can be enabled or bypassed in its
children by writing to their respective subtree_control files.

In term of resource control, a bypassed cgroup was lumped into its
nearest non-bypassed ancestor. So all the tasks in that ancestor as
well as those in its bypassed direct descendants are controlled by the
same set of control knobs.

This mode is useful to those bypassable controllers where there are
costs to each additional layer of hierarchy. This mode will also allow
more freedom in how each controller can shape its effective hierarchy
independent of the others.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/cgroup-defs.h |  19 ++++--
 kernel/cgroup/cgroup.c      | 151 ++++++++++++++++++++++++++++++--------------
 2 files changed, 116 insertions(+), 54 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 8fcbae1..5bff798 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -375,16 +375,18 @@ struct cgroup {
 	struct cgroup_file events_file;	/* handle for "cgroup.events" */
 
 	/*
-	 * The bitmask of subsystems enabled on the child cgroups.
-	 * ->subtree_control is the one configured through
-	 * "cgroup.subtree_control" while ->child_ss_mask is the effective
-	 * one which may have more subsystems enabled.  Controller knobs
-	 * are made available iff it's enabled in ->subtree_control.
+	 * The bitmask of subsystems enabled or bypassed on the child cgroups.
+	 * ->subtree_control and ->subtree_bypass are the one configured
+	 * through "cgroup.subtree_control" while ->subtree_ss_mask is the
+	 * effective one which may have more subsystems enabled.  Controller
+	 * knobs are made available iff it's enabled in ->subtree_ss_mask.
 	 */
 	u16 subtree_control;
 	u16 subtree_ss_mask;
+	u16 subtree_bypass;
 	u16 old_subtree_control;
 	u16 old_subtree_ss_mask;
+	u16 old_subtree_bypass;
 
 	/* Private pointers for each registered subsystem */
 	struct cgroup_subsys_state __rcu *subsys[CGROUP_SUBSYS_COUNT];
@@ -647,6 +649,13 @@ struct cgroup_subsys {
 	bool broken_hierarchy:1;
 	bool warned_broken_hierarchy:1;
 
+	/*
+	 * If %true, the controller, on the default hierarchy, can be
+	 * bypassable. IOW, it can have a virtual hierarchy that is
+	 * different from the default hierarchy.
+	 */
+	bool bypassable:1;
+
 	/* the following two fields are initialized automtically during boot */
 	int id;
 	const char *name;
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index e06994f..8a729e4 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -175,6 +175,9 @@ struct cgroup_subsys *cgroup_subsys[] = {
 /* some controllers can be threaded on the default hierarchy */
 static u16 cgrp_dfl_threaded_ss_mask;
 
+/* some controllers can be bypassable on the default hierarchy */
+static u16 cgrp_dfl_bypass_ss_mask;
+
 /* The list of hierarchy roots */
 LIST_HEAD(cgroup_roots);
 static int cgroup_root_count;
@@ -366,7 +369,8 @@ static bool cgroup_can_be_thread_root(struct cgroup *cgrp)
 		return false;
 
 	/* and no domain controllers can be enabled */
-	if (cgrp->subtree_control & ~cgrp_dfl_threaded_ss_mask)
+	if ((cgrp->subtree_control|cgrp->subtree_bypass) &
+	    ~cgrp_dfl_threaded_ss_mask)
 		return false;
 
 	return true;
@@ -388,7 +392,8 @@ bool cgroup_is_thread_root(struct cgroup *cgrp)
 	 * enabled is a thread root.
 	 */
 	if (cgroup_has_tasks(cgrp) &&
-	    (cgrp->subtree_control & cgrp_dfl_threaded_ss_mask))
+	   ((cgrp->subtree_control|cgrp->subtree_bypass)
+	   & cgrp_dfl_threaded_ss_mask))
 		return true;
 
 	return false;
@@ -413,7 +418,7 @@ static bool cgroup_is_valid_domain(struct cgroup *cgrp)
 }
 
 /* subsystems visibly enabled on a cgroup */
-static u16 cgroup_control(struct cgroup *cgrp)
+static u16 cgroup_control(struct cgroup *cgrp, bool show_bypass)
 {
 	struct cgroup *parent = cgroup_parent(cgrp);
 	u16 root_ss_mask = cgrp->root->subsys_mask;
@@ -421,6 +426,9 @@ static u16 cgroup_control(struct cgroup *cgrp)
 	if (parent) {
 		u16 ss_mask = parent->subtree_control;
 
+		if (show_bypass)
+			ss_mask |= parent->subtree_bypass;
+
 		/* threaded cgroups can only have threaded controllers */
 		if (cgroup_is_threaded(cgrp))
 			ss_mask &= cgrp_dfl_threaded_ss_mask;
@@ -434,13 +442,17 @@ static u16 cgroup_control(struct cgroup *cgrp)
 }
 
 /* subsystems enabled on a cgroup */
-static u16 cgroup_ss_mask(struct cgroup *cgrp)
+static u16 cgroup_ss_mask(struct cgroup *cgrp, bool show_bypass)
 {
 	struct cgroup *parent = cgroup_parent(cgrp);
 
 	if (parent) {
 		u16 ss_mask = parent->subtree_ss_mask;
 
+
+		if (show_bypass)
+			ss_mask |= parent->subtree_bypass;
+
 		/* threaded cgroups can only have threaded controllers */
 		if (cgroup_is_threaded(cgrp))
 			ss_mask &= cgrp_dfl_threaded_ss_mask;
@@ -515,7 +527,7 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
 	 * This function is used while updating css associations and thus
 	 * can't test the csses directly.  Test ss_mask.
 	 */
-	while (!(cgroup_ss_mask(cgrp) & (1 << ss->id))) {
+	while (!(cgroup_ss_mask(cgrp, false) & (1 << ss->id))) {
 		cgrp = cgroup_parent(cgrp);
 		if (!cgrp)
 			return NULL;
@@ -2414,7 +2426,7 @@ int cgroup_migrate_vet_dst(struct cgroup *dst_cgrp)
 		return 0;
 
 	/* apply no-internal-process constraint */
-	if (dst_cgrp->subtree_control)
+	if (dst_cgrp->subtree_control|dst_cgrp->subtree_bypass)
 		return -EBUSY;
 
 	return 0;
@@ -2712,15 +2724,18 @@ void cgroup_procs_write_finish(struct task_struct *task)
 			ss->post_attach();
 }
 
-static void cgroup_print_ss_mask(struct seq_file *seq, u16 ss_mask)
+static void cgroup_print_ss_mask(struct seq_file *seq, u16 ss_mask,
+				 u16 bypass_mask)
 {
 	struct cgroup_subsys *ss;
 	bool printed = false;
 	int ssid;
 
-	do_each_subsys_mask(ss, ssid, ss_mask) {
+	do_each_subsys_mask(ss, ssid, ss_mask|bypass_mask) {
 		if (printed)
 			seq_putc(seq, ' ');
+		if (!(ss_mask & (1 << ssid)))
+			seq_putc(seq, '#');
 		seq_printf(seq, "%s", ss->name);
 		printed = true;
 	} while_each_subsys_mask();
@@ -2732,8 +2747,10 @@ static void cgroup_print_ss_mask(struct seq_file *seq, u16 ss_mask)
 static int cgroup_controllers_show(struct seq_file *seq, void *v)
 {
 	struct cgroup *cgrp = seq_css(seq)->cgroup;
+	struct cgroup *parent = cgroup_parent(cgrp);
+	u16 bypass = parent ? parent->subtree_bypass : 0;
 
-	cgroup_print_ss_mask(seq, cgroup_control(cgrp));
+	cgroup_print_ss_mask(seq, cgroup_control(cgrp, false), bypass);
 	return 0;
 }
 
@@ -2742,7 +2759,7 @@ static int cgroup_subtree_control_show(struct seq_file *seq, void *v)
 {
 	struct cgroup *cgrp = seq_css(seq)->cgroup;
 
-	cgroup_print_ss_mask(seq, cgrp->subtree_control);
+	cgroup_print_ss_mask(seq, cgrp->subtree_control, cgrp->subtree_bypass);
 	return 0;
 }
 
@@ -2856,6 +2873,7 @@ static void cgroup_save_control(struct cgroup *cgrp)
 	cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
 		dsct->old_subtree_control = dsct->subtree_control;
 		dsct->old_subtree_ss_mask = dsct->subtree_ss_mask;
+		dsct->old_subtree_bypass  = dsct->subtree_bypass;
 		dsct->old_dom_cgrp = dsct->dom_cgrp;
 	}
 }
@@ -2874,10 +2892,13 @@ static void cgroup_propagate_control(struct cgroup *cgrp)
 	struct cgroup_subsys_state *d_css;
 
 	cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
-		dsct->subtree_control &= cgroup_control(dsct);
+		u16 mask = cgroup_control(dsct, true);
+
+		dsct->subtree_control &= mask;
+		dsct->subtree_bypass  &= mask;
 		dsct->subtree_ss_mask =
 			cgroup_calc_subtree_ss_mask(dsct->subtree_control,
-						    cgroup_ss_mask(dsct));
+						cgroup_ss_mask(dsct, true));
 	}
 }
 
@@ -2897,6 +2918,7 @@ static void cgroup_restore_control(struct cgroup *cgrp)
 	cgroup_for_each_live_descendant_post(dsct, d_css, cgrp) {
 		dsct->subtree_control = dsct->old_subtree_control;
 		dsct->subtree_ss_mask = dsct->old_subtree_ss_mask;
+		dsct->subtree_bypass  = dsct->old_subtree_bypass;
 		dsct->dom_cgrp = dsct->old_dom_cgrp;
 	}
 }
@@ -2906,9 +2928,9 @@ static bool css_visible(struct cgroup_subsys_state *css)
 	struct cgroup_subsys *ss = css->ss;
 	struct cgroup *cgrp = css->cgroup;
 
-	if (cgroup_control(cgrp) & (1 << ss->id))
+	if (cgroup_control(cgrp, false) & (1 << ss->id))
 		return true;
-	if (!(cgroup_ss_mask(cgrp) & (1 << ss->id)))
+	if (!(cgroup_ss_mask(cgrp, false) & (1 << ss->id)))
 		return false;
 	return cgroup_on_dfl(cgrp) && ss->implicit_on_dfl;
 }
@@ -2939,7 +2961,7 @@ static int cgroup_apply_control_enable(struct cgroup *cgrp)
 
 			WARN_ON_ONCE(css && percpu_ref_is_dying(&css->refcnt));
 
-			if (!(cgroup_ss_mask(dsct) & (1 << ss->id)))
+			if (!(cgroup_ss_mask(dsct, false) & (1 << ss->id)))
 				continue;
 
 			if (!css) {
@@ -2989,7 +3011,7 @@ static void cgroup_apply_control_disable(struct cgroup *cgrp)
 				continue;
 
 			if (css->parent &&
-			    !(cgroup_ss_mask(dsct) & (1 << ss->id))) {
+			    !(cgroup_ss_mask(dsct, false) & (1 << ss->id))) {
 				kill_css(css);
 			} else if (!css_visible(css)) {
 				css_clear_dir(css);
@@ -3101,7 +3123,8 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 					    char *buf, size_t nbytes,
 					    loff_t off)
 {
-	u16 enable = 0, disable = 0;
+	u16 enable = 0, disable = 0, bypass = 0;
+	u16 child_enable = 0;
 	struct cgroup *cgrp, *child;
 	struct cgroup_subsys *ss;
 	char *tok;
@@ -3122,10 +3145,16 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 
 			if (*tok == '+') {
 				enable |= 1 << ssid;
+				bypass &= ~(1 << ssid);
 				disable &= ~(1 << ssid);
 			} else if (*tok == '-') {
 				disable |= 1 << ssid;
 				enable &= ~(1 << ssid);
+				bypass &= ~(1 << ssid);
+			} else if (*tok == '#') {
+				bypass |= 1 << ssid;
+				enable &= ~(1 << ssid);
+				disable &= ~(1 << ssid);
 			} else {
 				return -EINVAL;
 			}
@@ -3139,35 +3168,42 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 	if (!cgrp)
 		return -ENODEV;
 
-	for_each_subsys(ss, ssid) {
-		if (enable & (1 << ssid)) {
-			if (cgrp->subtree_control & (1 << ssid)) {
-				enable &= ~(1 << ssid);
-				continue;
-			}
+	/*
+	 * Cannot use controllers that aren't allowed.
+	 */
+	if (~cgroup_control(cgrp, true) & (enable|disable|bypass)) {
+		ret = -ENOENT;
+		goto out_unlock;
+	}
 
-			if (!(cgroup_control(cgrp) & (1 << ssid))) {
-				ret = -ENOENT;
-				goto out_unlock;
-			}
-		} else if (disable & (1 << ssid)) {
-			if (!(cgrp->subtree_control & (1 << ssid))) {
-				disable &= ~(1 << ssid);
-				continue;
-			}
+	/*
+	 * Strip out redundant bits.
+	 */
+	enable  &= ~cgrp->subtree_control;
+	bypass  &= ~cgrp->subtree_bypass;
+	disable &= (cgrp->subtree_control|cgrp->subtree_bypass);
 
-			/* a child has it enabled? */
-			cgroup_for_each_live_child(child, cgrp) {
-				if (child->subtree_control & (1 << ssid)) {
-					ret = -EBUSY;
-					goto out_unlock;
-				}
-			}
-		}
+	if (!(enable|bypass|disable)) {
+		ret = 0;
+		goto out_unlock;
 	}
 
-	if (!enable && !disable) {
-		ret = 0;
+	/*
+	 * Only bypassable controllers can be bypassed.
+	 */
+	if (bypass & ~cgrp_dfl_bypass_ss_mask) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	cgroup_for_each_live_child(child, cgrp)
+		child_enable |= child->subtree_control|child->subtree_bypass;
+
+	/*
+	 * Cannot change the state of a controller if enabled in children.
+	 */
+	if ((enable|bypass|disable) & child_enable) {
+		ret = -EBUSY;
 		goto out_unlock;
 	}
 
@@ -3179,7 +3215,9 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 	cgroup_save_control(cgrp);
 
 	cgrp->subtree_control |= enable;
-	cgrp->subtree_control &= ~disable;
+	cgrp->subtree_control &= ~(bypass|disable);
+	cgrp->subtree_bypass  |= bypass;
+	cgrp->subtree_bypass  &= ~(enable|disable);
 
 	ret = cgroup_apply_control(cgrp);
 	cgroup_finalize_control(cgrp, ret);
@@ -4727,7 +4765,8 @@ static void css_release(struct percpu_ref *ref)
 }
 
 static void init_and_link_css(struct cgroup_subsys_state *css,
-			      struct cgroup_subsys *ss, struct cgroup *cgrp)
+			      struct cgroup_subsys *ss, struct cgroup *cgrp,
+			      struct cgroup_subsys_state *parent_css)
 {
 	lockdep_assert_held(&cgroup_mutex);
 
@@ -4743,8 +4782,8 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
 	css->serial_nr = css_serial_nr_next++;
 	atomic_set(&css->online_cnt, 0);
 
-	if (cgroup_parent(cgrp)) {
-		css->parent = cgroup_css(cgroup_parent(cgrp), ss);
+	if (parent_css) {
+		css->parent = parent_css;
 		css_get(css->parent);
 	}
 
@@ -4813,13 +4852,24 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
 
 	lockdep_assert_held(&cgroup_mutex);
 
+	/*
+	 * If the cgroup subsystem is bypassble, we need to skip over
+	 * ancestor cgroups with NULL CSS to find the right parent CSS.
+	 */
+	if (ss->bypassable && !parent_css) {
+		struct cgroup *par = parent;
+
+		for (; par && !parent_css; par = cgroup_parent(par))
+			parent_css = cgroup_css(par, ss);
+	}
+
 	css = ss->css_alloc(parent_css);
 	if (!css)
 		css = ERR_PTR(-ENOMEM);
 	if (IS_ERR(css))
 		return css;
 
-	init_and_link_css(css, ss, cgrp);
+	init_and_link_css(css, ss, cgrp, parent_css);
 
 	err = percpu_ref_init(&css->refcnt, css_release, 0, GFP_KERNEL);
 	if (err)
@@ -4941,7 +4991,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 	 * subtree_control from the parent.  Each is configured manually.
 	 */
 	if (!cgroup_on_dfl(cgrp))
-		cgrp->subtree_control = cgroup_control(cgrp);
+		cgrp->subtree_control = cgroup_control(cgrp, false);
 
 	cgroup_propagate_control(cgrp);
 
@@ -5254,7 +5304,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
 	css = ss->css_alloc(cgroup_css(&cgrp_dfl_root.cgrp, ss));
 	/* We don't handle early failures gracefully */
 	BUG_ON(IS_ERR(css));
-	init_and_link_css(css, ss, &cgrp_dfl_root.cgrp);
+	init_and_link_css(css, ss, &cgrp_dfl_root.cgrp, NULL);
 
 	/*
 	 * Root csses are never destroyed and we can't initialize
@@ -5412,6 +5462,9 @@ int __init cgroup_init(void)
 		if (ss->threaded)
 			cgrp_dfl_threaded_ss_mask |= 1 << ss->id;
 
+		if (ss->bypassable)
+			cgrp_dfl_bypass_ss_mask |= 1 << ss->id;
+
 		if (ss->dfl_cftypes == ss->legacy_cftypes) {
 			WARN_ON(cgroup_add_cftypes(ss, ss->dfl_cftypes));
 		} else {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v5 2/5] cgroup: Allow reenabling of controller in bypass mode
  2018-12-10 21:40 [PATCH v5 0/5] cgroup: Introducing bypass mode Waiman Long
  2018-12-10 21:40 ` [PATCH v5 1/5] cgroup: subtree_control bypass mode for bypassable controllers Waiman Long
@ 2018-12-10 21:40 ` Waiman Long
  2018-12-10 21:40 ` [PATCH v5 3/5] cgroup: Make debug controller report new controller masks Waiman Long
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Waiman Long @ 2018-12-10 21:40 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet, Johannes Weiner
  Cc: cgroups, linux-kernel, linux-doc, linux-api, Roman Gushchin,
	Jens Axboe, Andrew Morton, Dennis Zhou, Shakeel Butt,
	Michael Kerrisk, Waiman Long

Bypassable controllers set to bypass mode in the parent's
"cgroup.subtree_control" can now be optionally enabled by writing the
controller name with the '+' prefix to "cgroup.controllers". Using the
'#' prefix will reset it back to the bypass state.

This capability allows a cgroup parent to individually enable bypassable
controllers in a subset of its children instead of either all or none
of them. This increases the flexibility each controller has in shaping
the effective cgroup hierarchy to best suit its need.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/cgroup-defs.h |   7 +++
 kernel/cgroup/cgroup.c      | 109 ++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 113 insertions(+), 3 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 5bff798..ab1b355 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -388,6 +388,13 @@ struct cgroup {
 	u16 old_subtree_ss_mask;
 	u16 old_subtree_bypass;
 
+	/*
+	 * The bitmask of subsystems that are set in its parent's
+	 * ->subtree_bypass and explicitly enabled in this cgroup.
+	 */
+	u16 enable_ss_mask;
+	u16 old_enable_ss_mask;
+
 	/* Private pointers for each registered subsystem */
 	struct cgroup_subsys_state __rcu *subsys[CGROUP_SUBSYS_COUNT];
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 8a729e4..7aa77b1 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -424,7 +424,7 @@ static u16 cgroup_control(struct cgroup *cgrp, bool show_bypass)
 	u16 root_ss_mask = cgrp->root->subsys_mask;
 
 	if (parent) {
-		u16 ss_mask = parent->subtree_control;
+		u16 ss_mask = parent->subtree_control|cgrp->enable_ss_mask;
 
 		if (show_bypass)
 			ss_mask |= parent->subtree_bypass;
@@ -447,7 +447,7 @@ static u16 cgroup_ss_mask(struct cgroup *cgrp, bool show_bypass)
 	struct cgroup *parent = cgroup_parent(cgrp);
 
 	if (parent) {
-		u16 ss_mask = parent->subtree_ss_mask;
+		u16 ss_mask = parent->subtree_ss_mask|cgrp->enable_ss_mask;
 
 
 		if (show_bypass)
@@ -2874,6 +2874,7 @@ static void cgroup_save_control(struct cgroup *cgrp)
 		dsct->old_subtree_control = dsct->subtree_control;
 		dsct->old_subtree_ss_mask = dsct->subtree_ss_mask;
 		dsct->old_subtree_bypass  = dsct->subtree_bypass;
+		dsct->old_enable_ss_mask  = dsct->enable_ss_mask;
 		dsct->old_dom_cgrp = dsct->dom_cgrp;
 	}
 }
@@ -2919,6 +2920,7 @@ static void cgroup_restore_control(struct cgroup *cgrp)
 		dsct->subtree_control = dsct->old_subtree_control;
 		dsct->subtree_ss_mask = dsct->old_subtree_ss_mask;
 		dsct->subtree_bypass  = dsct->old_subtree_bypass;
+		dsct->enable_ss_mask  = dsct->old_enable_ss_mask;
 		dsct->dom_cgrp = dsct->old_dom_cgrp;
 	}
 }
@@ -3197,7 +3199,8 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 	}
 
 	cgroup_for_each_live_child(child, cgrp)
-		child_enable |= child->subtree_control|child->subtree_bypass;
+		child_enable |= child->subtree_control|child->subtree_bypass|
+				child->enable_ss_mask;
 
 	/*
 	 * Cannot change the state of a controller if enabled in children.
@@ -3230,6 +3233,105 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
+/*
+ * Change bypass status of controllers for a cgroup in the default hierarchy.
+ */
+static ssize_t cgroup_controllers_write(struct kernfs_open_file *of,
+					char *buf, size_t nbytes,
+					loff_t off)
+{
+	u16 enable = 0, bypass = 0;
+	struct cgroup *cgrp, *parent;
+	struct cgroup_subsys *ss;
+	char *tok;
+	int ssid, ret;
+
+	/*
+	 * Parse input - space separated list of subsystem names prefixed
+	 * with either + or #.
+	 */
+	buf = strstrip(buf);
+	while ((tok = strsep(&buf, " "))) {
+		if (tok[0] == '\0')
+			continue;
+		do_each_subsys_mask(ss, ssid, ~cgrp_dfl_inhibit_ss_mask) {
+			if (!cgroup_ssid_enabled(ssid) ||
+			    strcmp(tok + 1, ss->name))
+				continue;
+
+			if (*tok == '+') {
+				enable |= 1 << ssid;
+				bypass &= ~(1 << ssid);
+			} else if (*tok == '#') {
+				bypass |= 1 << ssid;
+				enable &= ~(1 << ssid);
+			} else {
+				return -EINVAL;
+			}
+			break;
+		} while_each_subsys_mask();
+		if (ssid == CGROUP_SUBSYS_COUNT)
+			return -EINVAL;
+	}
+
+	cgrp = cgroup_kn_lock_live(of->kn, true);
+	if (!cgrp)
+		return -ENODEV;
+
+	/*
+	 * Write to root cgroup's controllers file is not allowed.
+	 */
+	parent = cgroup_parent(cgrp);
+	if (!parent) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	/*
+	 * Only controllers set into bypass mode in the parent cgroup
+	 * can be specified here.
+	 */
+	if (~parent->subtree_bypass & (enable|bypass)) {
+		ret = -ENOENT;
+		goto out_unlock;
+	}
+
+	/*
+	 * Mask off irrelevant bits.
+	 */
+	enable &= ~cgrp->enable_ss_mask;
+	bypass &=  cgrp->enable_ss_mask;
+
+	if (!(enable|bypass)) {
+		ret = 0;
+		goto out_unlock;
+	}
+
+	/*
+	 * We cannot change the bypass state of a controller that is enabled
+	 * in subtree_control.
+	 */
+	if ((cgrp->subtree_control|cgrp->subtree_bypass) & (enable|bypass)) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	/* Save and update control masks and prepare csses */
+	cgroup_save_control(cgrp);
+
+	cgrp->enable_ss_mask |= enable;
+	cgrp->enable_ss_mask &= ~bypass;
+
+	ret = cgroup_apply_control(cgrp);
+	cgroup_finalize_control(cgrp, ret);
+	kernfs_activate(cgrp->kn);
+	ret = 0;
+
+out_unlock:
+	cgroup_kn_unlock(of->kn);
+	return ret ?: nbytes;
+}
+
 /**
  * cgroup_enable_threaded - make @cgrp threaded
  * @cgrp: the target cgroup
@@ -4573,6 +4675,7 @@ static ssize_t cgroup_threads_write(struct kernfs_open_file *of,
 	{
 		.name = "cgroup.controllers",
 		.seq_show = cgroup_controllers_show,
+		.write = cgroup_controllers_write,
 	},
 	{
 		.name = "cgroup.subtree_control",
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v5 3/5] cgroup: Make debug controller report new controller masks
  2018-12-10 21:40 [PATCH v5 0/5] cgroup: Introducing bypass mode Waiman Long
  2018-12-10 21:40 ` [PATCH v5 1/5] cgroup: subtree_control bypass mode for bypassable controllers Waiman Long
  2018-12-10 21:40 ` [PATCH v5 2/5] cgroup: Allow reenabling of controller in bypass mode Waiman Long
@ 2018-12-10 21:40 ` Waiman Long
  2018-12-10 21:40 ` [PATCH v5 4/5] sched/core: Make cpu cgroup controller bypassable Waiman Long
  2018-12-10 21:40 ` [PATCH v5 5/5] cgroup: Document bypass mode Waiman Long
  4 siblings, 0 replies; 6+ messages in thread
From: Waiman Long @ 2018-12-10 21:40 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet, Johannes Weiner
  Cc: cgroups, linux-kernel, linux-doc, linux-api, Roman Gushchin,
	Jens Axboe, Andrew Morton, Dennis Zhou, Shakeel Butt,
	Michael Kerrisk, Waiman Long

The newly added cgroup controller masks (subtree_bypass and
enable_ss_mask) are now being reported in the debug.masks controller
file.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/debug.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/cgroup/debug.c b/kernel/cgroup/debug.c
index 5f1b873..176bb9e 100644
--- a/kernel/cgroup/debug.c
+++ b/kernel/cgroup/debug.c
@@ -263,6 +263,8 @@ static int cgroup_masks_read(struct seq_file *seq, void *v)
 
 	cgroup_masks_read_one(seq, "subtree_control", cgrp->subtree_control);
 	cgroup_masks_read_one(seq, "subtree_ss_mask", cgrp->subtree_ss_mask);
+	cgroup_masks_read_one(seq, "subtree_bypass",  cgrp->subtree_bypass);
+	cgroup_masks_read_one(seq, "enable_ss_mask",  cgrp->enable_ss_mask);
 
 	cgroup_kn_unlock(of->kn);
 	return 0;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v5 4/5] sched/core: Make cpu cgroup controller bypassable
  2018-12-10 21:40 [PATCH v5 0/5] cgroup: Introducing bypass mode Waiman Long
                   ` (2 preceding siblings ...)
  2018-12-10 21:40 ` [PATCH v5 3/5] cgroup: Make debug controller report new controller masks Waiman Long
@ 2018-12-10 21:40 ` Waiman Long
  2018-12-10 21:40 ` [PATCH v5 5/5] cgroup: Document bypass mode Waiman Long
  4 siblings, 0 replies; 6+ messages in thread
From: Waiman Long @ 2018-12-10 21:40 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet, Johannes Weiner
  Cc: cgroups, linux-kernel, linux-doc, linux-api, Roman Gushchin,
	Jens Axboe, Andrew Morton, Dennis Zhou, Shakeel Butt,
	Michael Kerrisk, Waiman Long

Make the cpu cgroup controller bypassable in the default hierarchy so
that cpu controller could be activated only in those cgroups that really
need it instead of all the cgroups down to the lowest level that need it.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sched/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f12225f..ba4550c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7000,6 +7000,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.dfl_cftypes	= cpu_files,
 	.early_init	= true,
 	.threaded	= true,
+	.bypassable	= true,
 };
 
 #endif	/* CONFIG_CGROUP_SCHED */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v5 5/5] cgroup: Document bypass mode
  2018-12-10 21:40 [PATCH v5 0/5] cgroup: Introducing bypass mode Waiman Long
                   ` (3 preceding siblings ...)
  2018-12-10 21:40 ` [PATCH v5 4/5] sched/core: Make cpu cgroup controller bypassable Waiman Long
@ 2018-12-10 21:40 ` Waiman Long
  4 siblings, 0 replies; 6+ messages in thread
From: Waiman Long @ 2018-12-10 21:40 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet, Johannes Weiner
  Cc: cgroups, linux-kernel, linux-doc, linux-api, Roman Gushchin,
	Jens Axboe, Andrew Morton, Dennis Zhou, Shakeel Butt,
	Michael Kerrisk, Waiman Long

The cgroup-v2.rst file is updated to document the new bypass mode.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 66 ++++++++++++++++++++++++---------
 1 file changed, 48 insertions(+), 18 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 07e0613..17e6584 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -372,9 +372,16 @@ disabled by writing to the "cgroup.subtree_control" file::
 
   # echo "+cpu +memory -io" > cgroup.subtree_control
 
-Only controllers which are listed in "cgroup.controllers" can be
-enabled.  When multiple operations are specified as above, either they
-all succeed or fail.  If multiple operations on the same controller
+The prefixes '+', '-' and '#' are used to enable, disable or put
+a controller in the bypass mode respectively.  In the bypass mode,
+a controller is disabled in a cgroup, but it can be enabled again in
+its child cgroups as it will still be listed in "cgroup.controllers".
+Bypass mode can only be used on bypassable controllers.  Currently,
+only the cpu controller is bypassable.
+
+Only controllers which are listed in "cgroup.controllers" can be enabled
+or bypassed.  When multiple operations are specified as above, either
+they all succeed or fail.  If multiple operations on the same controller
 are specified, the last one is effective.
 
 Enabling a controller in a cgroup indicates that the distribution of
@@ -399,6 +406,20 @@ prefixed controller interface files from C and D.  This means that the
 controller interface files - anything which doesn't start with
 "cgroup." are owned by the parent rather than the cgroup itself.
 
+Once a bypassable controller is put into bypass mode in
+"cgroup.subtree_control", that controller can optionally be enabled
+again in child cgroups by writing the controller name with the '+ prefix
+into "cgroup.controllers".  Writing the controller name with the '#'
+prefix into "cgroup.controllers" resets the state back to bypass mode.
+The state of a bypassable controller cannot be changed anymore if it
+is enabled or bypassed in its "cgroup.subtree_control".
+
+The use of bypass mode thus allows a cgroup parent to have the ability
+to selectively enable a bypassable controller in a subset of its child
+cgroups instead of in either all or none of them. In other words, a
+bypassable can be enabled only on the cgroups that actually need it,
+if desired.
+
 
 Top-down Constraint
 ~~~~~~~~~~~~~~~~~~~
@@ -406,10 +427,11 @@ Top-down Constraint
 Resources are distributed top-down and a cgroup can further distribute
 a resource only if the resource has been distributed to it from the
 parent.  This means that all non-root "cgroup.subtree_control" files
-can only contain controllers which are enabled in the parent's
-"cgroup.subtree_control" file.  A controller can be enabled only if
-the parent has the controller enabled and a controller can't be
-disabled if one or more children have it enabled.
+can only contain controllers which are enabled or bypassed in the
+parent's "cgroup.subtree_control" file.  A controller can be enabled
+or bypassed only if the parent has the controller enabled or bypassed
+and the state of a controller can't be changed if one or more children
+have it enabled or bypassed.
 
 
 No Internal Process Constraint
@@ -834,11 +856,18 @@ All cgroup core files are prefixed with "cgroup."
 	should be granted along with the containing directory.
 
   cgroup.controllers
-	A read-only space separated values file which exists on all
+	A read-write space separated values file which exists on all
 	cgroups.
 
 	It shows space separated list of all controllers available to
-	the cgroup.  The controllers are not ordered.
+	the cgroup.  Controller names with '#' prefix are in bypass mode.
+	The controllers are not ordered.
+
+	When a controller is set into bypass mode in its parent's
+	"cgroup.subtree_control", its name prefixed with '+' or '#'
+	can be written to enable it or reset it back to bypass mode
+	respectively.  Controllers not in bypass mode are not allowed
+	to be written.
 
   cgroup.subtree_control
 	A read-write space separated values file which exists on all
@@ -848,12 +877,12 @@ All cgroup core files are prefixed with "cgroup."
 	which are enabled to control resource distribution from the
 	cgroup to its children.
 
-	Space separated list of controllers prefixed with '+' or '-'
-	can be written to enable or disable controllers.  A controller
-	name prefixed with '+' enables the controller and '-'
-	disables.  If a controller appears more than once on the list,
-	the last one is effective.  When multiple enable and disable
-	operations are specified, either all succeed or all fail.
+	Space separated list of controllers prefixed with '+', '-' or
+	'#' can be written to enable, disable or bypass controllers
+	respectively.  If a controller appears more than once on
+	the list, the last one is effective.  When multiple enable,
+	disable or bypass operations are specified, either all succeed
+	or all fail.
 
   cgroup.events
 	A read-only flat-keyed file which exists on non-root cgroups.
@@ -904,17 +933,18 @@ Controllers
 CPU
 ---
 
-The "cpu" controllers regulates distribution of CPU cycles.  This
+The "cpu" controller regulates distribution of CPU cycles.  This
 controller implements weight and absolute bandwidth limit models for
 normal scheduling policy and absolute bandwidth allocation model for
-realtime scheduling policy.
+realtime scheduling policy.  The cpu controller is bypassable.
 
 WARNING: cgroup2 doesn't yet support control of realtime processes and
 the cpu controller can only be enabled when all RT processes are in
 the root cgroup.  Be aware that system management software may already
 have placed RT processes into nonroot cgroups during the system boot
 process, and these processes may need to be moved to the root cgroup
-before the cpu controller can be enabled.
+before the cpu controller can be enabled unless bypass mode is used
+in those non-root cgroups.
 
 
 CPU Interface Files
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2018-12-10 21:40 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-10 21:40 [PATCH v5 0/5] cgroup: Introducing bypass mode Waiman Long
2018-12-10 21:40 ` [PATCH v5 1/5] cgroup: subtree_control bypass mode for bypassable controllers Waiman Long
2018-12-10 21:40 ` [PATCH v5 2/5] cgroup: Allow reenabling of controller in bypass mode Waiman Long
2018-12-10 21:40 ` [PATCH v5 3/5] cgroup: Make debug controller report new controller masks Waiman Long
2018-12-10 21:40 ` [PATCH v5 4/5] sched/core: Make cpu cgroup controller bypassable Waiman Long
2018-12-10 21:40 ` [PATCH v5 5/5] cgroup: Document bypass mode Waiman Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).