linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH-cgroup 0/6] cgroup: bypass and subtree root modes
@ 2017-06-14 15:05 Waiman Long
  2017-06-14 15:05 ` [RFC PATCH-cgroup 1/6] cgroup: Relax the no internal process constraint Waiman Long
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Waiman Long @ 2017-06-14 15:05 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds,
	Waiman Long

This patchset introduces new capability to the cgroup v2 core to give
more freedom and flexibility to controllers so that they can have their
own unique views of the virtual process hierarchies that can best suit
thier own use cases without suffering unneeded performance problem.

This patchset is derived from my previous v2 RFC patchset on

  https://lkml.org/lkml/2017/5/15/440
  cgroup: Major changes to cgroup v2 core

It is on top of Tejun's "cgroup: implement cgroup2 thread mode, v2" and my
"cgroup: Make debug controller useful for debugging" patchsets:

  https://lkml.org/lkml/2017/6/10/93
  https://lkml.org/lkml/2017/6/13/862

Patch 1 relaxes the no internal process constraint to allow internal
processes if no non-threaded (resource domain) controllers are enabled.

Patch 2 introduces a new bypass mode (previously called pass-through
mode) that allows a controller to be disabled in the current cgroup,
but re-enabled again in its children. This is enabled by writing
the controller name prefixed with '#' to the "cgroup.controllers"
file. A controller can be bypassed only if it is enabled in the
parent's subtree_control file.

Patch 3 extends the bypass mode to the "cgroup.subtree_control"
file allowing a parent cgroup to enforce all its child cgroups to
have a controller in bypass mode.

Patch 4 introduces a new subtree root mode which restricts the number
of child cgroups to one and controllers are granted to its child in
bypass mode only. This mode is to be used by containers. The cgroup
with subtree root mode on will control the resources passed down to its
only child which will be the root of the container cgroup hierarchy.
Like a real cgroup v2 root, there is no control knobs to tune the
resources.  Instead, all the tuning will be done in its parent. The
container root can enable controllers in its child cgroups to further
restrict the distribution of resources if it so desires. Like a real
root, the no internal process constraint does not apply to the pseudo
root and it can be a mixable root of a mixed threaded tree.

Patch 5 fixes a problem discovered during testing that rapid enabling
and disabling of controllers can lead to undesirable errors.

Patch 6 extends the debug controller to expose additional information
introduced by this patchset.

Waiman Long (6):
  cgroup: Relax the no internal process constraint
  cgroup: Enable bypass mode in cgroup v2
  cgroup: Allow bypss mode in subtree_control
  cgroup: Introduce subtree root mode
  cgroup: Skip dying css in cgroup_apply_control_{enable,disable}
  cgroup: Make debug controller display bypass and subtree root modes
    info

 Documentation/cgroup-v2.txt     | 194 +++++++++++++++++----
 include/linux/cgroup-defs.h     |  24 +++
 kernel/cgroup/cgroup-internal.h |  12 ++
 kernel/cgroup/cgroup.c          | 372 ++++++++++++++++++++++++++++++++--------
 kernel/cgroup/debug.c           |  27 ++-
 5 files changed, 512 insertions(+), 117 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC PATCH-cgroup 1/6] cgroup: Relax the no internal process constraint
  2017-06-14 15:05 [RFC PATCH-cgroup 0/6] cgroup: bypass and subtree root modes Waiman Long
@ 2017-06-14 15:05 ` Waiman Long
  2017-06-21 20:40   ` Tejun Heo
  2017-06-14 15:05 ` [RFC PATCH-cgroup 2/6] cgroup: Enable bypass mode in cgroup v2 Waiman Long
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Waiman Long @ 2017-06-14 15:05 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds,
	Waiman Long

The no internal process contraint is rather limiting. The fact that
threaded cgroups are exempt from this rule means that the restriction
is actually not needed in some cases.

Rather than having threaded cgroups as exceptions, the no internal
process contraint is now relaxed to apply only when those resource
domain controllers, which are not allowed in a threaded cgroup,
are enabled. That will cover the case of threaded cgroups without an
explicit exception. This also makes it easier for those controllers
that can handle internal process competition to migrate to cgroup v2.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/cgroup-v2.txt | 44 ++++++++++++++++++++++++++++----------------
 kernel/cgroup/cgroup.c      | 24 +++++++++++++-----------
 2 files changed, 41 insertions(+), 27 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 96db840..98f92b1 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -374,15 +374,27 @@ disabled if one or more children have it enabled.
 
 2-4-3. No Internal Process Constraint
 
-Non-root cgroups can only distribute resources to their children when
-they don't have any processes of their own.  In other words, only
-cgroups which don't contain any processes can have controllers enabled
-in their "cgroup.subtree_control" files.
-
-This guarantees that, when a controller is looking at the part of the
-hierarchy which has it enabled, processes are always only on the
-leaves.  This rules out situations where child cgroups compete against
-internal processes of the parent.
+When a non-root cgroup distributes resources to their children while
+having processes of its own, its internal processes will then compete
+against its children in term of resource allocation.  For some resource
+types, that is not a problem and the controllers are able to handle
+them correctly.  For others, the controllers may not be able to handle
+internal process competition correctly.  This type of controllers are
+called resource domain controllers in this document.
+
+Internal processes are not allowed on non-root cgroups which has
+any one of those resource domain controllers enabled.  Currently all
+controllers that are allowed in a threaded cgroup will be considered
+as a non-resource domain controller and hence will not block internal
+processes.  In other words, only cgroups which don't contain any
+processes can have resource domain controllers enabled in their
+"cgroup.subtree_control" files.
+
+This guarantees that, when a controller is looking at the part of
+the hierarchy which has it enabled, processes are always only on the
+leaves when resource domain controllers are enabled.  This rules out
+situations where child cgroups compete against internal processes of
+the parent for those controllers that can't handle it properly.
 
 The root cgroup is exempt from this restriction.  Root contains
 processes and anonymous resource consumption which can't be associated
@@ -390,13 +402,13 @@ with any other cgroups and requires special treatment from most
 controllers.  How resource consumption in the root cgroup is governed
 is up to each controller.
 
-Note that the restriction doesn't get in the way if there is no
-enabled controller in the cgroup's "cgroup.subtree_control".  This is
-important as otherwise it wouldn't be possible to create children of a
-populated cgroup.  To control resource distribution of a cgroup, the
-cgroup must create children and transfer all its processes to the
-children before enabling controllers in its "cgroup.subtree_control"
-file.
+Note that the restriction doesn't get in the way if there is no resource
+domain controller enabled in the cgroup's "cgroup.subtree_control".
+This is important as otherwise it wouldn't be possible to create
+children of a populated cgroup.  To control resource distribution
+of a cgroup, the cgroup must create children and transfer all
+its processes to the children before enabling controllers in its
+"cgroup.subtree_control" file.
 
 
 2-5. Delegation
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 216657e..f72dce1 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2297,13 +2297,14 @@ static int cgroup_migrate_execute(struct cgroup_mgctx *mgctx)
  * @dst_cgrp: destination cgroup to test
  *
  * On the default hierarchy, except for the mixable and threaded cgroups,
- * subtree_control must be zero for migration destination cgroups with
- * tasks so that child cgroups don't compete against tasks.
+ * subtree_control must not have any !threaded controllers for migration
+ * cgroups with tasks so that child cgroups don't compete against tasks
+ * for those !threaded controllers.
  */
 bool cgroup_may_migrate_to(struct cgroup *dst_cgrp)
 {
 	return !cgroup_on_dfl(dst_cgrp) || cgroup_is_mixable(dst_cgrp) ||
-		cgroup_is_threaded(dst_cgrp) || !dst_cgrp->subtree_control;
+		!(dst_cgrp->subtree_control & ~cgrp_dfl_threaded_ss_mask);
 }
 
 /**
@@ -3020,12 +3021,12 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 	}
 
 	/*
-	 * Except for mixable and threaded cgroups, subtree_control must be
-	 * zero for a cgroup with tasks so that child cgroups don't compete
-	 * against tasks.
+	 * Except for mixable cgroups, subtree_control must not contain any
+	 * !threaded controller for a cgroup with tasks so that child cgroups
+	 * don't compete against tasks.
 	 */
-	if (enable && !cgroup_is_mixable(cgrp) && !cgroup_is_threaded(cgrp) &&
-	    cgroup_has_tasks(cgrp)) {
+	if ((enable & ~cgrp_dfl_threaded_ss_mask) &&
+	   !cgroup_is_mixable(cgrp) && cgroup_has_tasks(cgrp)) {
 		ret = -EBUSY;
 		goto out_unlock;
 	}
@@ -3086,10 +3087,11 @@ static int cgroup_vet_thread_mode_op(struct cgroup *cgrp, enum thread_mode_op op
 
 	/*
 	 * @cgrp is starting or ending a normal threaded subtree.  Make
-	 * sure the subtree is empty and avoid needing implicit domain
-	 * controller migrations.
+	 * sure the subtree has no !threaded controller enabled and avoid
+	 * needing implicit domain controller migrations.
 	 */
-	if (css_has_online_children(&cgrp->self) || cgrp->subtree_control)
+	if (css_has_online_children(&cgrp->self) ||
+	   (cgrp->subtree_control & ~cgrp_dfl_threaded_ss_mask))
 		return -EBUSY;
 
 	/* no partial disable */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH-cgroup 2/6] cgroup: Enable bypass mode in cgroup v2
  2017-06-14 15:05 [RFC PATCH-cgroup 0/6] cgroup: bypass and subtree root modes Waiman Long
  2017-06-14 15:05 ` [RFC PATCH-cgroup 1/6] cgroup: Relax the no internal process constraint Waiman Long
@ 2017-06-14 15:05 ` Waiman Long
  2017-06-21 21:17   ` Tejun Heo
  2017-06-14 15:05 ` [RFC PATCH-cgroup 3/6] cgroup: Allow bypss mode in subtree_control Waiman Long
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Waiman Long @ 2017-06-14 15:05 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds,
	Waiman Long

For cgroup v1, different controllers can be binded to different cgroup
hierarchies optimized for their own use cases. That is not currently
the case for cgroup v2 where combining all these controllers into
the same hierarchy will probably require more levels than is needed
by each individual controller.

By not enabling a controller in a cgroup and its descendants, we can
effectively trim the hierarchy as seen by a controller from the leafs
up. However, there is currently no way to compress the hierarchy in
the intermediate levels.

This patch implements a new bypass mechanism to allow a controller to
skip some intermediate levels in a hierarchy and effectively flatten
the hierarchy as seen by that controller.

Controllers enabled by the parent's "cgroup.subtree_control"
file can now be set into a special bypass mode by writing to the
"cgroup.controllers" file with the special '#' prefix attached to the
controller name.  In that mode, the controller is disabled for that
cgroup but it allows its children to have that controller enabled or in
bypass mode again. The bypass mode is removed by using the '+' prefix.

With this change, each controller can now have a unique view of their
virtual process hierarchy that can be quite different from other
controllers.  We now have the freedom and flexibility to create the
right hierarchy for each controller to suit their own needs without
performance loss when compared with cgroup v1.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/cgroup-v2.txt |  98 ++++++++++++++++----
 include/linux/cgroup-defs.h |   8 ++
 kernel/cgroup/cgroup.c      | 211 ++++++++++++++++++++++++++++++++++++--------
 3 files changed, 263 insertions(+), 54 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 98f92b1..0df06ba 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -323,25 +323,28 @@ both cgroups.
 2-4-1. Enabling and Disabling
 
 Each cgroup has a "cgroup.controllers" file which lists all
-controllers available for the cgroup to enable.
+controllers available for the cgroup to enable for its children.
 
   # cat cgroup.controllers
   cpu io memory
 
-No controller is enabled by default.  Controllers can be enabled and
-disabled by writing to the "cgroup.subtree_control" file.
+No controller is enabled by default.  Controllers can be
+enabled and disabled on the child cgroups by writing to the
+"cgroup.subtree_control" file.  A '+' prefix enables the controller,
+and a '-' prefix disables it.
 
   # echo "+cpu +memory -io" > cgroup.subtree_control
 
-Only controllers which are listed in "cgroup.controllers" can be
-enabled.  When multiple operations are specified as above, either they
-all succeed or fail.  If multiple operations on the same controller
-are specified, the last one is effective.
+Only controllers which are listed in "cgroup.controllers" can
+be enabled in the "cgroup.subtree_control" file.  When multiple
+operations are specified as above, either they all succeed or fail.
+If multiple operations on the same controller are specified, the last
+one is effective.
 
 Enabling a controller in a cgroup indicates that the distribution of
 the target resource across its immediate children will be controlled.
-Consider the following sub-hierarchy.  The enabled controllers are
-listed in parentheses.
+Consider the following sub-hierarchy.  The enabled controllers in the
+"cgroup.subtree_control" file are listed in parentheses.
 
   A(cpu,memory) - B(memory) - C()
                             \ D()
@@ -351,6 +354,17 @@ of CPU cycles and memory to its children, in this case, B.  As B has
 "memory" enabled but not "CPU", C and D will compete freely on CPU
 cycles but their division of memory available to B will be controlled.
 
+By not enabling a controller in a cgroup and its descendants, we can
+effectively trim the hierarchy as seen by a controller from the leafs
+up.  From the perspective of the cpu controller, the hierarchy is:
+
+  A - B|C|D
+
+From the perspective of the memory controller, the hierarchy becomes:
+
+  A - B - C
+        \ D
+
 As a controller regulates the distribution of the target resource to
 the cgroup's children, enabling it creates the controller's interface
 files in the child cgroups.  In the above example, enabling "cpu" on B
@@ -358,7 +372,55 @@ would create the "cpu." prefixed controller interface files in C and
 D.  Likewise, disabling "memory" from B would remove the "memory."
 prefixed controller interface files from C and D.  This means that the
 controller interface files - anything which doesn't start with
-"cgroup." are owned by the parent rather than the cgroup itself.
+"cgroup." can be considered to be owned by the parent under this
+control scheme.
+
+Enabling controllers via the "cgroup.subtree_control" file is
+relatively coarse-grained.  Finer-grained control of the controllers
+in a non-root cgroup can be done by writing a controller name with
+either a '#' or '+' prefix to its "cgroup.controllers" file directly.
+
+Writing the special prefix '#' with the controller name
+into "cgroup.controllers" is used to mark that controller in
+bypass mode.  Only controllers that are enabled at the parent's
+"cgroup.subtree_control" file can be used.  In this mode, the
+controller is disabled in the cgroup effectively collapsing it with
+its parent from the perspective of that controller.  However, it allows
+the enablement of that controller in the "cgroup.subtree_control"
+file and hence enabled in the child cgroups.  The bypass mode can be
+disabled by using the '+' prefix to re-enable the controller.
+
+In the example below, '+' corresponds to an enabled controller and
+corresponds to a bypassed controller.
+
+   +   #   #   #   +
+   A - B - C - D - E
+         \ F
+	   +
+In this case, the effective hiearchy is:
+
+	A|B|C|D - E
+	        \ F
+
+The use of the special '#' prefix allows the users to trim away layers
+in the middle of the hierarchy, thus flattening the tree from the
+perspective of that particular controller.  As a result, different
+controllers can have quite different views of their virtual process
+hierarchy that can best fit their own needs.
+
+In the diagram below, the controller name in the parenthesis represents
+controller enabled as shown in the "cgroup.controllers" file.
+
+  A(cpu,memory) - B(cpu,#memory) - C()
+                                 \ D(memory)
+
+From the memory controller's perspective, the hierarchy looks like:
+
+   A|B|C - D
+
+For the CPU controller, the hierarchy is:
+
+   A - B|C|D
 
 
 2-4-2. Top-down Constraint
@@ -368,8 +430,8 @@ a resource only if the resource has been distributed to it from the
 parent.  This means that all non-root "cgroup.subtree_control" files
 can only contain controllers which are enabled in the parent's
 "cgroup.subtree_control" file.  A controller can be enabled only if
-the parent has the controller enabled and a controller can't be
-disabled if one or more children have it enabled.
+the parent has the controller enabled ('+' or '#') and a controller
+can't be disabled if one or more children have it enabled.
 
 
 2-4-3. No Internal Process Constraint
@@ -725,11 +787,17 @@ All cgroup core files are prefixed with "cgroup."
 
   cgroup.controllers
 
-	A read-only space separated values file which exists on all
+	A read-write space separated values file which exists on all
 	cgroups.
 
-	It shows space separated list of all controllers available to
-	the cgroup.  The controllers are not ordered.
+	When read, it shows space separated list of all controllers
+	available to the cgroup.  The controllers are not ordered.
+
+	Space separated list of controllers prefixed with '+' or '#'
+	can be written to re-enable or set the controllers in bypass
+	mode.  If a controller appears more than once on the list,
+	the last one is effective.  When multiple re-enable and bypass
+	operations are specified, either all succeed or all fail.
 
   cgroup.subtree_control
 
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ea3218a..f5c1e36 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -289,6 +289,14 @@ struct cgroup {
 	u16 old_subtree_control;
 	u16 old_subtree_ss_mask;
 
+	/*
+	 * The bitmasks of subsystems in bypass mode on the current cgroup.
+	 * The bypass mode can only be set if a controller is enabled at
+	 * the parent subtree_control mask.
+	 */
+	u16 bypass_ss_mask;
+	u16 old_bypass_ss_mask;
+
 	/* Private pointers for each registered subsystem */
 	struct cgroup_subsys_state __rcu *subsys[CGROUP_SUBSYS_COUNT];
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index f72dce1..7d1326e 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2598,15 +2598,18 @@ void cgroup_procs_write_finish(struct task_struct *task)
 			ss->post_attach();
 }
 
-static void cgroup_print_ss_mask(struct seq_file *seq, u16 ss_mask)
+static void cgroup_print_ss_mask(struct seq_file *seq, u16 ss_mask,
+				 u16 bypass_mask)
 {
 	struct cgroup_subsys *ss;
 	bool printed = false;
 	int ssid;
 
-	do_each_subsys_mask(ss, ssid, ss_mask) {
+	do_each_subsys_mask(ss, ssid, ss_mask|bypass_mask) {
 		if (printed)
 			seq_putc(seq, ' ');
+		if (bypass_mask & (1 << ssid))
+			seq_putc(seq, '#');
 		seq_printf(seq, "%s", ss->name);
 		printed = true;
 	} while_each_subsys_mask();
@@ -2619,7 +2622,7 @@ static int cgroup_controllers_show(struct seq_file *seq, void *v)
 {
 	struct cgroup *cgrp = seq_css(seq)->cgroup;
 
-	cgroup_print_ss_mask(seq, cgroup_control(cgrp));
+	cgroup_print_ss_mask(seq, cgroup_control(cgrp), cgrp->bypass_ss_mask);
 	return 0;
 }
 
@@ -2628,7 +2631,7 @@ static int cgroup_subtree_control_show(struct seq_file *seq, void *v)
 {
 	struct cgroup *cgrp = seq_css(seq)->cgroup;
 
-	cgroup_print_ss_mask(seq, cgrp->subtree_control);
+	cgroup_print_ss_mask(seq, cgrp->subtree_control, 0);
 	return 0;
 }
 
@@ -2741,6 +2744,7 @@ static void cgroup_save_control(struct cgroup *cgrp)
 	cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
 		dsct->old_subtree_control = dsct->subtree_control;
 		dsct->old_subtree_ss_mask = dsct->subtree_ss_mask;
+		dsct->old_bypass_ss_mask = dsct->bypass_ss_mask;
 	}
 }
 
@@ -2758,10 +2762,11 @@ static void cgroup_propagate_control(struct cgroup *cgrp)
 	struct cgroup_subsys_state *d_css;
 
 	cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
-		dsct->subtree_control &= cgroup_control(dsct);
+		dsct->subtree_control &= cgroup_control(dsct)|
+					 dsct->bypass_ss_mask;
 		dsct->subtree_ss_mask =
 			cgroup_calc_subtree_ss_mask(dsct->subtree_control,
-						    cgroup_ss_mask(dsct));
+				cgroup_ss_mask(dsct)|dsct->bypass_ss_mask);
 	}
 }
 
@@ -2780,6 +2785,7 @@ static void cgroup_restore_control(struct cgroup *cgrp)
 	cgroup_for_each_live_descendant_post(dsct, d_css, cgrp) {
 		dsct->subtree_control = dsct->old_subtree_control;
 		dsct->subtree_ss_mask = dsct->old_subtree_ss_mask;
+		dsct->bypass_ss_mask = dsct->old_bypass_ss_mask;
 	}
 }
 
@@ -2821,7 +2827,8 @@ static int cgroup_apply_control_enable(struct cgroup *cgrp)
 
 			WARN_ON_ONCE(css && percpu_ref_is_dying(&css->refcnt));
 
-			if (!(cgroup_ss_mask(dsct) & (1 << ss->id)))
+			if (!(cgroup_ss_mask(dsct) & (1 << ss->id)) ||
+			    (dsct->bypass_ss_mask & (1 << ss->id)))
 				continue;
 
 			if (!css) {
@@ -2871,7 +2878,8 @@ static void cgroup_apply_control_disable(struct cgroup *cgrp)
 				continue;
 
 			if (css->parent &&
-			    !(cgroup_ss_mask(dsct) & (1 << ss->id))) {
+			    (!(cgroup_ss_mask(dsct) & (1 << ss->id)) ||
+			    (dsct->bypass_ss_mask & (1 << ss->id)))) {
 				kill_css(css);
 			} else if (!css_visible(css)) {
 				css_clear_dir(css);
@@ -2944,6 +2952,7 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 					    loff_t off)
 {
 	u16 enable = 0, disable = 0;
+	u16 child_enable = 0, child_bypass = 0;
 	struct cgroup *cgrp, *child;
 	struct cgroup_subsys *ss;
 	char *tok;
@@ -2981,31 +2990,31 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 	if (!cgrp)
 		return -ENODEV;
 
-	for_each_subsys(ss, ssid) {
-		if (enable & (1 << ssid)) {
-			if (cgrp->subtree_control & (1 << ssid)) {
-				enable &= ~(1 << ssid);
-				continue;
-			}
+	/*
+	 * We cannot use controllers that are not enabled.
+	 */
+	if (~cgroup_control(cgrp) & (enable|disable)) {
+		ret = -ENOENT;
+		goto out_unlock;
+	}
 
-			if (!(cgroup_control(cgrp) & (1 << ssid))) {
-				ret = -ENOENT;
-				goto out_unlock;
-			}
-		} else if (disable & (1 << ssid)) {
-			if (!(cgrp->subtree_control & (1 << ssid))) {
-				disable &= ~(1 << ssid);
-				continue;
-			}
+	cgroup_for_each_live_child(child, cgrp) {
+		child_enable |= child->subtree_control;
+		child_bypass |= child->bypass_ss_mask;
+	}
 
-			/* a child has it enabled? */
-			cgroup_for_each_live_child(child, cgrp) {
-				if (child->subtree_control & (1 << ssid)) {
-					ret = -EBUSY;
-					goto out_unlock;
-				}
-			}
-		}
+	/*
+	 * Strip out redundant bits.
+	 */
+	enable  &= ~cgrp->subtree_control;
+	disable &=  cgrp->subtree_control;
+
+	/*
+	 * We cannot disable controllers that are enabled in a child cgroup.
+	 */
+	if (disable & child_enable) {
+		ret = -EBUSY;
+		goto out_unlock;
 	}
 
 	if (!enable && !disable) {
@@ -3037,6 +3046,15 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 	cgrp->subtree_control |= enable;
 	cgrp->subtree_control &= ~disable;
 
+	/*
+	 * Clear the child's bypass_ss_mask for those bits that are disabled
+	 * in subtree_control.
+	 */
+	if (child_bypass & disable) {
+		cgroup_for_each_live_child(child, cgrp)
+			child->bypass_ss_mask &= ~disable;
+	}
+
 	ret = cgroup_apply_control(cgrp);
 
 	cgroup_finalize_control(cgrp, ret);
@@ -3054,6 +3072,104 @@ enum thread_mode_op {
 	THREAD_MODE_DISABLE,
 };
 
+/*
+ * Change the bypass controllers for a cgroup in the default hierarchy.
+ */
+static ssize_t cgroup_controllers_write(struct kernfs_open_file *of,
+					char *buf, size_t nbytes,
+					loff_t off)
+{
+	u16 reenable = 0, bypass = 0;
+	struct cgroup *cgrp, *parent;
+	struct cgroup_subsys *ss;
+	char *tok;
+	int ssid, ret;
+
+	/*
+	 * Parse input - space separated list of subsystem names prefixed
+	 * with either + or #.
+	 */
+	buf = strstrip(buf);
+	while ((tok = strsep(&buf, " "))) {
+		if (tok[0] == '\0')
+			continue;
+		do_each_subsys_mask(ss, ssid, ~cgrp_dfl_inhibit_ss_mask) {
+			if (!cgroup_ssid_enabled(ssid) ||
+			    strcmp(tok + 1, ss->name))
+				continue;
+
+			if (*tok == '+') {
+				reenable |= 1 << ssid;
+				bypass &= ~(1 << ssid);
+			} else if (*tok == '#') {
+				bypass |= 1 << ssid;
+				reenable &= ~(1 << ssid);
+			} else {
+				return -EINVAL;
+			}
+			break;
+		} while_each_subsys_mask();
+		if (ssid == CGROUP_SUBSYS_COUNT)
+			return -EINVAL;
+	}
+
+	cgrp = cgroup_kn_lock_live(of->kn, true);
+	if (!cgrp)
+		return -ENODEV;
+
+	/*
+	 * Write to root cgroup's controllers file is not allowed.
+	 */
+	parent = cgroup_parent(cgrp);
+	if (!parent) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	/*
+	 * Only controllers enabled by the parent can be specified here.
+	 */
+	if (~cgroup_control(cgrp) & (reenable|bypass)) {
+		ret = -ENOENT;
+		goto out_unlock;
+	}
+
+	/*
+	 * Mask off irrelevant bits.
+	 */
+	bypass   &= ~cgrp->bypass_ss_mask;
+	reenable &=  cgrp->bypass_ss_mask;
+
+	if (!bypass && !reenable) {
+		ret = 0;
+		goto out_unlock;
+	}
+
+	/*
+	 * We cannot change the bypass state of a controller that is enabled
+	 * in subtree_control.
+	 */
+	if (cgrp->subtree_control & (reenable|bypass)) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	/* Save and update control masks and prepare csses */
+	cgroup_save_control(cgrp);
+
+	cgrp->bypass_ss_mask |= bypass;
+	cgrp->bypass_ss_mask &= ~reenable;
+
+	ret = cgroup_apply_control(cgrp);
+	cgroup_finalize_control(cgrp, ret);
+	kernfs_activate(cgrp->kn);
+	ret = 0;
+
+out_unlock:
+	cgroup_kn_unlock(of->kn);
+	return ret ?: nbytes;
+}
+
 static int cgroup_vet_thread_mode_op(struct cgroup *cgrp, enum thread_mode_op op)
 {
 	/* verify join conditions first and convert it to ENABLE */
@@ -3087,11 +3203,12 @@ static int cgroup_vet_thread_mode_op(struct cgroup *cgrp, enum thread_mode_op op
 
 	/*
 	 * @cgrp is starting or ending a normal threaded subtree.  Make
-	 * sure the subtree has no !threaded controller enabled and avoid
-	 * needing implicit domain controller migrations.
+	 * sure the subtree has no !threaded controller enabled or bypassed
+	 * and avoid needing implicit domain controller migrations.
 	 */
 	if (css_has_online_children(&cgrp->self) ||
-	   (cgrp->subtree_control & ~cgrp_dfl_threaded_ss_mask))
+	   ((cgrp->subtree_control|cgrp->bypass_ss_mask) &
+			~cgrp_dfl_threaded_ss_mask))
 		return -EBUSY;
 
 	/* no partial disable */
@@ -4250,6 +4367,7 @@ static ssize_t cgroup_threads_write(struct kernfs_open_file *of,
 	{
 		.name = "cgroup.controllers",
 		.seq_show = cgroup_controllers_show,
+		.write = cgroup_controllers_write,
 	},
 	{
 		.name = "cgroup.subtree_control",
@@ -4396,7 +4514,8 @@ static void css_release(struct percpu_ref *ref)
 }
 
 static void init_and_link_css(struct cgroup_subsys_state *css,
-			      struct cgroup_subsys *ss, struct cgroup *cgrp)
+			      struct cgroup_subsys *ss, struct cgroup *cgrp,
+			      struct cgroup_subsys_state *parent_css)
 {
 	lockdep_assert_held(&cgroup_mutex);
 
@@ -4412,7 +4531,7 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
 	atomic_set(&css->online_cnt, 0);
 
 	if (cgroup_parent(cgrp)) {
-		css->parent = cgroup_css(cgroup_parent(cgrp), ss);
+		css->parent = parent_css;
 		css_get(css->parent);
 	}
 
@@ -4475,19 +4594,33 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
 					      struct cgroup_subsys *ss)
 {
 	struct cgroup *parent = cgroup_parent(cgrp);
-	struct cgroup_subsys_state *parent_css = cgroup_css(parent, ss);
+	struct cgroup_subsys_state *parent_css;
 	struct cgroup_subsys_state *css;
 	int err;
 
 	lockdep_assert_held(&cgroup_mutex);
 
+	/*
+	 * Need to skip over ancestor cgroups with NULL CSS.
+	 */
+	for (; parent; parent = cgroup_parent(parent)) {
+		parent_css = cgroup_css(parent, ss);
+		if (parent_css)
+			break;
+	}
+
+	if (!parent) {
+		WARN_ON_ONCE(1);
+		return ERR_PTR(-EINVAL);
+	}
+
 	css = ss->css_alloc(parent_css);
 	if (!css)
 		css = ERR_PTR(-ENOMEM);
 	if (IS_ERR(css))
 		return css;
 
-	init_and_link_css(css, ss, cgrp);
+	init_and_link_css(css, ss, cgrp, parent_css);
 
 	err = percpu_ref_init(&css->refcnt, css_release, 0, GFP_KERNEL);
 	if (err)
@@ -4866,7 +4999,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
 	css = ss->css_alloc(cgroup_css(&cgrp_dfl_root.cgrp, ss));
 	/* We don't handle early failures gracefully */
 	BUG_ON(IS_ERR(css));
-	init_and_link_css(css, ss, &cgrp_dfl_root.cgrp);
+	init_and_link_css(css, ss, &cgrp_dfl_root.cgrp, NULL);
 
 	/*
 	 * Root csses are never destroyed and we can't initialize
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH-cgroup 3/6] cgroup: Allow bypss mode in subtree_control
  2017-06-14 15:05 [RFC PATCH-cgroup 0/6] cgroup: bypass and subtree root modes Waiman Long
  2017-06-14 15:05 ` [RFC PATCH-cgroup 1/6] cgroup: Relax the no internal process constraint Waiman Long
  2017-06-14 15:05 ` [RFC PATCH-cgroup 2/6] cgroup: Enable bypass mode in cgroup v2 Waiman Long
@ 2017-06-14 15:05 ` Waiman Long
  2017-06-14 15:05 ` [RFC PATCH-cgroup 4/6] cgroup: Introduce subtree root mode Waiman Long
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2017-06-14 15:05 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds,
	Waiman Long

The special prefix '#' attached to a controller name can now be
written into the cgroup.subtree_control file to set that controller
in bypass mode in all the child cgroups. The controller will show
up in the children's cgroup.controllers file, but the corresponding
control knobs will be absent. The child cgroups cannot change the
controller states by writing to their cgroup.controllers file at all.

That can be useful for setting up a container where the container
root has a parent which enables controllers in bypass mode only. The
container root will then behave similar to a real root where controller
names show up in cgroup.controllers but the resource control files
are absent. If that parent has only one child which is a container
root, enabling those controllers in that parent will allow container
specific resources to be controlled there without being noticed by
the container itself.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/cgroup-v2.txt | 19 ++++++++----
 include/linux/cgroup-defs.h |  4 +++
 kernel/cgroup/cgroup.c      | 70 ++++++++++++++++++++++++++++-----------------
 3 files changed, 61 insertions(+), 32 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 0df06ba..55bee8a 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -335,6 +335,11 @@ and a '-' prefix disables it.
 
   # echo "+cpu +memory -io" > cgroup.subtree_control
 
+The special prefix '#' is used to enable bypass mode for that
+particular controller on the child cgroups. In the bypass mode, a
+controller is disabled in a cgroup, but it can be enabled again in
+its child cgroups.
+
 Only controllers which are listed in "cgroup.controllers" can
 be enabled in the "cgroup.subtree_control" file.  When multiple
 operations are specified as above, either they all succeed or fail.
@@ -808,12 +813,14 @@ All cgroup core files are prefixed with "cgroup."
 	which are enabled to control resource distribution from the
 	cgroup to its children.
 
-	Space separated list of controllers prefixed with '+' or '-'
-	can be written to enable or disable controllers.  A controller
-	name prefixed with '+' enables the controller and '-'
-	disables.  If a controller appears more than once on the list,
-	the last one is effective.  When multiple enable and disable
-	operations are specified, either all succeed or all fail.
+	Space separated list of controllers prefixed with '+', '-'
+	or '#' can be written to enable or disable controllers as
+	well as setting them into bypass mode.	A controller name
+	prefixed with '+' enables the controller and '-' disables.
+	The '#' prefix sets the controller into bypass mode.  If a
+	controller appears more than once on the list, the last
+	one is effective.  When multiple operations are specified,
+	either all succeed or all fail.
 
   cgroup.events
 
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index f5c1e36..14fdddb 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -283,10 +283,14 @@ struct cgroup {
 	 * "cgroup.subtree_control" while ->child_ss_mask is the effective
 	 * one which may have more subsystems enabled.  Controller knobs
 	 * are made available iff it's enabled in ->subtree_control.
+	 * ->subtree_bypass marks those controllers that are set into
+	 * the bypass mode in the child cgroups.
 	 */
 	u16 subtree_control;
+	u16 subtree_bypass;
 	u16 subtree_ss_mask;
 	u16 old_subtree_control;
+	u16 old_subtree_bypass;
 	u16 old_subtree_ss_mask;
 
 	/*
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 7d1326e..901314b 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -375,7 +375,7 @@ static bool cgroup_is_mixed_child(struct cgroup *cgrp)
 }
 
 /* subsystems visibly enabled on a cgroup */
-static u16 cgroup_control(struct cgroup *cgrp)
+static u16 cgroup_control(struct cgroup *cgrp, bool show_bypass)
 {
 	struct cgroup *parent = cgroup_parent(cgrp);
 	u16 root_ss_mask = cgrp->root->subsys_mask;
@@ -383,6 +383,9 @@ static u16 cgroup_control(struct cgroup *cgrp)
 	if (parent) {
 		u16 ss_mask = parent->subtree_control;
 
+		if (show_bypass)
+			ss_mask |= parent->subtree_bypass;
+
 		/* mixed child can only have threaded subset of controllers */
 		if (cgroup_is_mixed_child(cgrp))
 			ss_mask &= cgrp_dfl_threaded_ss_mask;
@@ -396,13 +399,16 @@ static u16 cgroup_control(struct cgroup *cgrp)
 }
 
 /* subsystems enabled on a cgroup */
-static u16 cgroup_ss_mask(struct cgroup *cgrp)
+static u16 cgroup_ss_mask(struct cgroup *cgrp, bool show_bypass)
 {
 	struct cgroup *parent = cgroup_parent(cgrp);
 
 	if (parent) {
 		u16 ss_mask = parent->subtree_ss_mask;
 
+		if (show_bypass)
+			ss_mask |= parent->subtree_bypass;
+
 		/* mixed child can only have threaded subset of controllers */
 		if (cgroup_is_mixed_child(cgrp))
 			ss_mask &= cgrp_dfl_threaded_ss_mask;
@@ -455,7 +461,7 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
 	 * This function is used while updating css associations and thus
 	 * can't test the csses directly.  Test ss_mask.
 	 */
-	while (!(cgroup_ss_mask(cgrp) & (1 << ss->id))) {
+	while (!(cgroup_ss_mask(cgrp, false) & (1 << ss->id))) {
 		cgrp = cgroup_parent(cgrp);
 		if (!cgrp)
 			return NULL;
@@ -2622,7 +2628,8 @@ static int cgroup_controllers_show(struct seq_file *seq, void *v)
 {
 	struct cgroup *cgrp = seq_css(seq)->cgroup;
 
-	cgroup_print_ss_mask(seq, cgroup_control(cgrp), cgrp->bypass_ss_mask);
+	cgroup_print_ss_mask(seq, cgroup_control(cgrp, true),
+			     cgrp->bypass_ss_mask);
 	return 0;
 }
 
@@ -2631,7 +2638,7 @@ static int cgroup_subtree_control_show(struct seq_file *seq, void *v)
 {
 	struct cgroup *cgrp = seq_css(seq)->cgroup;
 
-	cgroup_print_ss_mask(seq, cgrp->subtree_control, 0);
+	cgroup_print_ss_mask(seq, cgrp->subtree_control, cgrp->subtree_bypass);
 	return 0;
 }
 
@@ -2744,6 +2751,7 @@ static void cgroup_save_control(struct cgroup *cgrp)
 	cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
 		dsct->old_subtree_control = dsct->subtree_control;
 		dsct->old_subtree_ss_mask = dsct->subtree_ss_mask;
+		dsct->old_subtree_bypass = dsct->subtree_bypass;
 		dsct->old_bypass_ss_mask = dsct->bypass_ss_mask;
 	}
 }
@@ -2762,11 +2770,10 @@ static void cgroup_propagate_control(struct cgroup *cgrp)
 	struct cgroup_subsys_state *d_css;
 
 	cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
-		dsct->subtree_control &= cgroup_control(dsct)|
-					 dsct->bypass_ss_mask;
+		dsct->subtree_control &= cgroup_control(dsct, true);
 		dsct->subtree_ss_mask =
 			cgroup_calc_subtree_ss_mask(dsct->subtree_control,
-				cgroup_ss_mask(dsct)|dsct->bypass_ss_mask);
+						    cgroup_ss_mask(dsct, true));
 	}
 }
 
@@ -2786,6 +2793,7 @@ static void cgroup_restore_control(struct cgroup *cgrp)
 		dsct->subtree_control = dsct->old_subtree_control;
 		dsct->subtree_ss_mask = dsct->old_subtree_ss_mask;
 		dsct->bypass_ss_mask = dsct->old_bypass_ss_mask;
+		dsct->subtree_bypass = dsct->old_subtree_bypass;
 	}
 }
 
@@ -2794,9 +2802,9 @@ static bool css_visible(struct cgroup_subsys_state *css)
 	struct cgroup_subsys *ss = css->ss;
 	struct cgroup *cgrp = css->cgroup;
 
-	if (cgroup_control(cgrp) & (1 << ss->id))
+	if (cgroup_control(cgrp, false) & (1 << ss->id))
 		return true;
-	if (!(cgroup_ss_mask(cgrp) & (1 << ss->id)))
+	if (!(cgroup_ss_mask(cgrp, false) & (1 << ss->id)))
 		return false;
 	return cgroup_on_dfl(cgrp) && ss->implicit_on_dfl;
 }
@@ -2827,7 +2835,7 @@ static int cgroup_apply_control_enable(struct cgroup *cgrp)
 
 			WARN_ON_ONCE(css && percpu_ref_is_dying(&css->refcnt));
 
-			if (!(cgroup_ss_mask(dsct) & (1 << ss->id)) ||
+			if (!(cgroup_ss_mask(dsct, false) & (1 << ss->id)) ||
 			    (dsct->bypass_ss_mask & (1 << ss->id)))
 				continue;
 
@@ -2878,7 +2886,7 @@ static void cgroup_apply_control_disable(struct cgroup *cgrp)
 				continue;
 
 			if (css->parent &&
-			    (!(cgroup_ss_mask(dsct) & (1 << ss->id)) ||
+			    (!(cgroup_ss_mask(dsct, false) & (1 << ss->id)) ||
 			    (dsct->bypass_ss_mask & (1 << ss->id)))) {
 				kill_css(css);
 			} else if (!css_visible(css)) {
@@ -2951,7 +2959,7 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 					    char *buf, size_t nbytes,
 					    loff_t off)
 {
-	u16 enable = 0, disable = 0;
+	u16 enable = 0, disable = 0, bypass = 0;
 	u16 child_enable = 0, child_bypass = 0;
 	struct cgroup *cgrp, *child;
 	struct cgroup_subsys *ss;
@@ -2973,10 +2981,16 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 
 			if (*tok == '+') {
 				enable |= 1 << ssid;
+				bypass &= ~(1 << ssid);
 				disable &= ~(1 << ssid);
 			} else if (*tok == '-') {
 				disable |= 1 << ssid;
 				enable &= ~(1 << ssid);
+				bypass &= ~(1 << ssid);
+			} else if (*tok == '#') {
+				bypass |= 1 << ssid;
+				enable &= ~(1 << ssid);
+				disable &= ~(1 << ssid);
 			} else {
 				return -EINVAL;
 			}
@@ -2993,13 +3007,13 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 	/*
 	 * We cannot use controllers that are not enabled.
 	 */
-	if (~cgroup_control(cgrp) & (enable|disable)) {
+	if (~cgroup_control(cgrp, true) & (enable|bypass|disable)) {
 		ret = -ENOENT;
 		goto out_unlock;
 	}
 
 	cgroup_for_each_live_child(child, cgrp) {
-		child_enable |= child->subtree_control;
+		child_enable |= child->subtree_control|child->subtree_bypass;
 		child_bypass |= child->bypass_ss_mask;
 	}
 
@@ -3007,24 +3021,26 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 	 * Strip out redundant bits.
 	 */
 	enable  &= ~cgrp->subtree_control;
-	disable &=  cgrp->subtree_control;
+	bypass  &= ~cgrp->subtree_bypass;
+	disable &= (cgrp->subtree_control|cgrp->subtree_bypass);
 
 	/*
-	 * We cannot disable controllers that are enabled in a child cgroup.
+	 * We cannot disable controllers or change the bypass state of
+	 * controllers that are enabled in a child cgroup.
 	 */
-	if (disable & child_enable) {
+	if ((enable|bypass|disable) & child_enable) {
 		ret = -EBUSY;
 		goto out_unlock;
 	}
 
-	if (!enable && !disable) {
+	if (!(enable|bypass|disable)) {
 		ret = 0;
 		goto out_unlock;
 	}
 
 	/* can't enable !threaded controllers on a threaded cgroup */
 	if (cgroup_is_threaded(cgrp) && !cgroup_is_mixed_root(cgrp) &&
-	    (enable & ~cgrp_dfl_threaded_ss_mask)) {
+	    ((enable|bypass) & ~cgrp_dfl_threaded_ss_mask)) {
 		ret = -EBUSY;
 		goto out_unlock;
 	}
@@ -3044,15 +3060,17 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 	cgroup_save_control(cgrp);
 
 	cgrp->subtree_control |= enable;
-	cgrp->subtree_control &= ~disable;
+	cgrp->subtree_control &= ~(bypass|disable);
+	cgrp->subtree_bypass  |= bypass;
+	cgrp->subtree_bypass  &= ~(enable|disable);
 
 	/*
 	 * Clear the child's bypass_ss_mask for those bits that are disabled
-	 * in subtree_control.
+	 * are bypassed in subtree_control.
 	 */
-	if (child_bypass & disable) {
+	if (child_bypass & (disable|bypass)) {
 		cgroup_for_each_live_child(child, cgrp)
-			child->bypass_ss_mask &= ~disable;
+			child->bypass_ss_mask &= ~(disable|bypass);
 	}
 
 	ret = cgroup_apply_control(cgrp);
@@ -3129,7 +3147,7 @@ static ssize_t cgroup_controllers_write(struct kernfs_open_file *of,
 	/*
 	 * Only controllers enabled by the parent can be specified here.
 	 */
-	if (~cgroup_control(cgrp) & (reenable|bypass)) {
+	if (~cgroup_control(cgrp, true) & (reenable|bypass)) {
 		ret = -ENOENT;
 		goto out_unlock;
 	}
@@ -4725,7 +4743,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 	 * subtree_control from the parent.  Each is configured manually.
 	 */
 	if (!cgroup_on_dfl(cgrp))
-		cgrp->subtree_control = cgroup_control(cgrp);
+		cgrp->subtree_control = cgroup_control(cgrp, false);
 
 	if (parent)
 		cgroup_bpf_inherit(cgrp, parent);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH-cgroup 4/6] cgroup: Introduce subtree root mode
  2017-06-14 15:05 [RFC PATCH-cgroup 0/6] cgroup: bypass and subtree root modes Waiman Long
                   ` (2 preceding siblings ...)
  2017-06-14 15:05 ` [RFC PATCH-cgroup 3/6] cgroup: Allow bypss mode in subtree_control Waiman Long
@ 2017-06-14 15:05 ` Waiman Long
  2017-06-21 21:38   ` Tejun Heo
  2017-06-14 15:05 ` [RFC PATCH-cgroup 5/6] cgroup: Skip dying css in cgroup_apply_control_{enable,disable} Waiman Long
  2017-06-14 15:05 ` [RFC PATCH-cgroup 6/6] cgroup: Make debug controller display bypass and subtree root modes info Waiman Long
  5 siblings, 1 reply; 20+ messages in thread
From: Waiman Long @ 2017-06-14 15:05 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds,
	Waiman Long

Subtree root mode is a new cgroup mode which applies the following
restrictions when turned on:

 1) Controllers are only allowed to be passed to the children in
    bypass mode except those with the "enable_on_root" flag on.
 2) Only 1 child cgroup is allowed.

That lone child can be used as the pseudo root of a container cgroup
hierarchy.  All the resources, if controlled, are in the parent
cgroup. There will be no control knobs in the child. That makes it
look and feel like a root.

That pseudo root is also considered to be mixable and so can become
root of a mixed threaded subtree. The no internal process constraint
also does not apply.

The subtree root mode and thread mode are mutually exclusive.

The subtree root mode is enabled by doing:

  # echo root > cgroup.subtree_control

It is disabled by:

  # echo nonroot > cgroup.subtree_control

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/cgroup-v2.txt | 43 +++++++++++++++++++----
 include/linux/cgroup-defs.h | 12 +++++++
 kernel/cgroup/cgroup.c      | 86 +++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 129 insertions(+), 12 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 55bee8a..bc2913c 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -23,7 +23,8 @@ CONTENTS
   2-4. Controlling Controllers
     2-4-1. Enabling and Disabling
     2-4-2. Top-down Constraint
-    2-4-3. No Internal Process Constraint
+    2-4-3. Subtree root mode
+    2-4-4. No Internal Process Constraint
   2-5. Delegation
     2-5-1. Model of Delegation
     2-5-2. Delegation Containment
@@ -439,7 +440,33 @@ the parent has the controller enabled ('+' or '#') and a controller
 can't be disabled if one or more children have it enabled.
 
 
-2-4-3. No Internal Process Constraint
+2-4-3. Subtree root mode
+
+Subtree root mode is a special cgroup mode that restricts the passing
+of most controllers in bypass mode only.  Only controllers that
+have the special "enabled_on_root" flag on can be directly enabled.
+It also allows only one child cgroup to be created.  That child cgroup
+can be used as the pseudo root of a container cgroup hierarchy.
+
+This pseudo root will look and feel like a root cgroup as resources
+that are not controllable in a real root will not be controllable in
+the pseudo root.  Instead, those resources can be controlled in the
+parent of the pseudo root.
+
+The pseudo root can be the root of a mixed threaded subtree, and the
+no internal process contraint does not apply.  Subtree root mode and
+thread mode are mutually exclusive.
+
+Subtree root is enabled by writing "root" to "cgroup.subtree_control".
+
+  # echo root > cgroup.subtree_control
+
+It is disabled by writing "nonroot" to "cgroup.subtree_control".
+
+  # echo nonroot > cgroup.subtree_control
+
+
+2-4-4. No Internal Process Constraint
 
 When a non-root cgroup distributes resources to their children while
 having processes of its own, its internal processes will then compete
@@ -817,10 +844,14 @@ All cgroup core files are prefixed with "cgroup."
 	or '#' can be written to enable or disable controllers as
 	well as setting them into bypass mode.	A controller name
 	prefixed with '+' enables the controller and '-' disables.
-	The '#' prefix sets the controller into bypass mode.  If a
-	controller appears more than once on the list, the last
-	one is effective.  When multiple operations are specified,
-	either all succeed or all fail.
+	The '#' prefix sets the controller into bypass mode.
+
+	The special keywords "root" and "nonroot" can be written to
+	enable and disable the subtree root mode respectively.
+
+	If a controller or a keyword appears more than once on the
+	list, the last one is effective.  When multiple operations
+	are specified, either all succeed or all fail.
 
   cgroup.events
 
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 14fdddb..72d51ec 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -61,6 +61,12 @@ enum {
 	 * specified at mount time and thus is implemented here.
 	 */
 	CGRP_CPUSET_CLONE_CHILDREN,
+	/*
+	 * Enforce passing controllers in bypass mode and one child only.
+	 * This child becomes a pseudo root that can serve as the root of
+	 * a container.
+	 */
+	CGRP_SUBTREE_ROOT_MODE,
 };
 
 /* cgroup_root->flags */
@@ -529,6 +535,12 @@ struct cgroup_subsys {
 	bool threaded:1;
 
 	/*
+	 * If %true, the subsystem can be enabled on root or pseudo root on
+	 * the default heirarchy.
+	 */
+	bool enabled_on_root:1;
+
+	/*
 	 * If %false, this subsystem is properly hierarchical -
 	 * configuration, resource accounting and restriction on a parent
 	 * cgroup cover those of its children.  If %true, hierarchy support
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 901314b..f0bea32 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -165,6 +165,9 @@ struct cgroup_subsys *cgroup_subsys[] = {
 /* some controllers can be threaded on the default hierarchy */
 static u16 cgrp_dfl_threaded_ss_mask;
 
+/* some controllers can be enabled on pseudo root */
+static u16 cgrp_dfl_enabled_on_root;
+
 /* The list of hierarchy roots */
 LIST_HEAD(cgroup_roots);
 static int cgroup_root_count;
@@ -340,6 +343,14 @@ static bool cgroup_is_thread_root(struct cgroup *cgrp)
 	return cgrp->proc_cgrp == cgrp;
 }
 
+/* is @cgrp a pseudo root (i.e. parent in subtree root mode)? */
+static bool cgroup_is_pseudo_root(struct cgroup *cgrp)
+{
+	struct cgroup *parent = cgroup_parent(cgrp);
+
+	return parent && test_bit(CGRP_SUBTREE_ROOT_MODE, &parent->flags);
+}
+
 /* if threaded, would @cgrp become root of a mixed threaded subtree? */
 static bool cgroup_is_mixable(struct cgroup *cgrp)
 {
@@ -347,8 +358,10 @@ static bool cgroup_is_mixable(struct cgroup *cgrp)
 	 * Root isn't under domain level resource control exempting it from
 	 * the no-internal-process constraint, so it can serve as a thread
 	 * root and a parent of resource domains at the same time.
+	 *
+	 * A pseudo root is also considered to be mixable.
 	 */
-	return !cgroup_parent(cgrp);
+	return !cgroup_parent(cgrp) || cgroup_is_pseudo_root(cgrp);
 }
 
 /* is @cgrp root of a mixed threaded subtree */
@@ -2964,6 +2977,8 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 	struct cgroup *cgrp, *child;
 	struct cgroup_subsys *ss;
 	char *tok;
+	int subtree_root_mode = 0;
+	int nr_children = 0;
 	int ssid, ret;
 
 	/*
@@ -2974,6 +2989,14 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 	while ((tok = strsep(&buf, " "))) {
 		if (tok[0] == '\0')
 			continue;
+
+		if (!strcmp(tok, "root")) {
+			subtree_root_mode = 1;
+			continue;
+		} else if (!strcmp(tok, "nonroot")) {
+			subtree_root_mode = -1;
+			continue;
+		}
 		do_each_subsys_mask(ss, ssid, ~cgrp_dfl_inhibit_ss_mask) {
 			if (!cgroup_ssid_enabled(ssid) ||
 			    strcmp(tok + 1, ss->name))
@@ -3015,6 +3038,7 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 	cgroup_for_each_live_child(child, cgrp) {
 		child_enable |= child->subtree_control|child->subtree_bypass;
 		child_bypass |= child->bypass_ss_mask;
+		nr_children++;
 	}
 
 	/*
@@ -3025,16 +3049,33 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 	disable &= (cgrp->subtree_control|cgrp->subtree_bypass);
 
 	/*
-	 * We cannot disable controllers or change the bypass state of
-	 * controllers that are enabled in a child cgroup.
+	 * We cannot enable or disable subtree root mode if it is root,
+	 * there is any child cgroups or when thread mode is on.
 	 */
-	if ((enable|bypass|disable) & child_enable) {
+	if (subtree_root_mode &&
+	   (!cgroup_parent(cgrp) || nr_children || cgroup_is_threaded(cgrp))) {
 		ret = -EBUSY;
 		goto out_unlock;
 	}
 
-	if (!(enable|bypass|disable)) {
-		ret = 0;
+	/*
+	 * We can't have any controllers enabled directly when in subtree
+	 * root mode except those with the enabled_on_root flag on.
+	 */
+	if ((test_bit(CGRP_SUBTREE_ROOT_MODE, &cgrp->flags) ||
+	    (subtree_root_mode > 0)) &&
+	   ((enable|cgrp->subtree_control) & ~cgrp_dfl_enabled_on_root
+					   & ~disable)) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	/*
+	 * We cannot disable controllers or change the bypass state of
+	 * controllers that are enabled in a child cgroup.
+	 */
+	if ((enable|bypass|disable) & child_enable) {
+		ret = -EBUSY;
 		goto out_unlock;
 	}
 
@@ -3056,6 +3097,13 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 		goto out_unlock;
 	}
 
+	if (!(enable|bypass|disable)) {
+		ret = 0;
+		if (subtree_root_mode)
+			goto set_root_mode;
+		goto out_unlock;
+	}
+
 	/* save and update control masks and prepare csses */
 	cgroup_save_control(cgrp);
 
@@ -3077,6 +3125,14 @@ static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
 
 	cgroup_finalize_control(cgrp, ret);
 
+	if (!ret && subtree_root_mode) {
+set_root_mode:
+		if (subtree_root_mode > 0)
+			set_bit(CGRP_SUBTREE_ROOT_MODE, &cgrp->flags);
+		else
+			clear_bit(CGRP_SUBTREE_ROOT_MODE, &cgrp->flags);
+	}
+
 	kernfs_activate(cgrp->kn);
 	ret = 0;
 out_unlock:
@@ -3190,6 +3246,12 @@ static ssize_t cgroup_controllers_write(struct kernfs_open_file *of,
 
 static int cgroup_vet_thread_mode_op(struct cgroup *cgrp, enum thread_mode_op op)
 {
+	/*
+	 * Thread mode and subtree root mode are mutually exclusive.
+	 */
+	if (test_bit(CGRP_SUBTREE_ROOT_MODE, &cgrp->flags))
+		return -EINVAL;
+
 	/* verify join conditions first and convert it to ENABLE */
 	if (op == THREAD_MODE_JOIN) {
 		/* can't join if it isn't there */
@@ -4773,6 +4835,15 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
 	if (!parent)
 		return -ENODEV;
 
+	/*
+	 * A cgroup in subtree root mode cannot have more than one child.
+	 */
+	if (test_bit(CGRP_SUBTREE_ROOT_MODE, &parent->flags) &&
+	   !list_empty(&parent->self.children)) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
 	cgrp = cgroup_create(parent);
 	if (IS_ERR(cgrp)) {
 		ret = PTR_ERR(cgrp);
@@ -5173,6 +5244,9 @@ int __init cgroup_init(void)
 		if (ss->threaded)
 			cgrp_dfl_threaded_ss_mask |= 1 << ss->id;
 
+		if (ss->enabled_on_root)
+			cgrp_dfl_enabled_on_root |= 1 << ss->id;
+
 		if (ss->dfl_cftypes == ss->legacy_cftypes) {
 			WARN_ON(cgroup_add_cftypes(ss, ss->dfl_cftypes));
 		} else {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH-cgroup 5/6] cgroup: Skip dying css in cgroup_apply_control_{enable,disable}
  2017-06-14 15:05 [RFC PATCH-cgroup 0/6] cgroup: bypass and subtree root modes Waiman Long
                   ` (3 preceding siblings ...)
  2017-06-14 15:05 ` [RFC PATCH-cgroup 4/6] cgroup: Introduce subtree root mode Waiman Long
@ 2017-06-14 15:05 ` Waiman Long
  2017-06-21 21:42   ` Tejun Heo
  2017-06-14 15:05 ` [RFC PATCH-cgroup 6/6] cgroup: Make debug controller display bypass and subtree root modes info Waiman Long
  5 siblings, 1 reply; 20+ messages in thread
From: Waiman Long @ 2017-06-14 15:05 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds,
	Waiman Long

While constantly turning on and off controllers, it is possible to
trigger the dying CSS warnings in cgroup_apply_control_enable() and
cgroup_apply_control_disable(). The current code, however, proceeds
after the warning leading to other secondary warnings and maybe even
data corruption, like

  cgroup: cgroup_addrm_files: failed to add current, err=-17

To avoid the secondary errors, the dying CSS is now ignored or skipped
so as not to cause other problem.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cgroup.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index f0bea32..2a5bd49 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2846,12 +2846,24 @@ static int cgroup_apply_control_enable(struct cgroup *cgrp)
 		for_each_subsys(ss, ssid) {
 			struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
 
-			WARN_ON_ONCE(css && percpu_ref_is_dying(&css->refcnt));
-
 			if (!(cgroup_ss_mask(dsct, false) & (1 << ss->id)) ||
 			    (dsct->bypass_ss_mask & (1 << ss->id)))
 				continue;
 
+			/*
+			 * If the css is dying, we will just skip it after
+			 * warning.
+			 */
+			if (css && (css->flags & CSS_DYING)) {
+				char name[NAME_MAX+1];
+
+				cgroup_name(cgrp, name, NAME_MAX);
+				pr_warn("%s: %s css of cgroup %s is dying!\n",
+					__func__, ss->name, name);
+				WARN_ON_ONCE(1);
+				continue;
+			}
+
 			if (!css) {
 				css = css_create(dsct, ss);
 				if (IS_ERR(css))
@@ -2893,9 +2905,7 @@ static void cgroup_apply_control_disable(struct cgroup *cgrp)
 		for_each_subsys(ss, ssid) {
 			struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
 
-			WARN_ON_ONCE(css && percpu_ref_is_dying(&css->refcnt));
-
-			if (!css)
+			if (!css || (css->flags & CSS_DYING))
 				continue;
 
 			if (css->parent &&
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH-cgroup 6/6] cgroup: Make debug controller display bypass and subtree root modes info
  2017-06-14 15:05 [RFC PATCH-cgroup 0/6] cgroup: bypass and subtree root modes Waiman Long
                   ` (4 preceding siblings ...)
  2017-06-14 15:05 ` [RFC PATCH-cgroup 5/6] cgroup: Skip dying css in cgroup_apply_control_{enable,disable} Waiman Long
@ 2017-06-14 15:05 ` Waiman Long
  5 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2017-06-14 15:05 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, kernel-team, pjt, luto, efault, torvalds,
	Waiman Long

This patch enables the debug controller to display the new bypass
mode and subtree root mode information.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cgroup-internal.h | 12 ++++++++++++
 kernel/cgroup/cgroup.c          |  9 ---------
 kernel/cgroup/debug.c           | 27 ++++++++++++++++++++-------
 3 files changed, 32 insertions(+), 16 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 0e81c61..6bb0a6a 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -151,6 +151,18 @@ static inline void get_css_set(struct css_set *cset)
 	refcount_inc(&cset->refcount);
 }
 
+/*
+ * Return the cgroup parent.
+ */
+static inline struct cgroup *cgroup_parent(struct cgroup *cgrp)
+{
+	struct cgroup_subsys_state *parent_css = cgrp->self.parent;
+
+	if (parent_css)
+		return container_of(parent_css, struct cgroup, self);
+	return NULL;
+}
+
 bool cgroup_ssid_enabled(int ssid);
 bool cgroup_on_dfl(const struct cgroup *cgrp);
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 2a5bd49..6ffe900 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -322,15 +322,6 @@ static void cgroup_idr_remove(struct idr *idr, int id)
 	spin_unlock_bh(&cgroup_idr_lock);
 }
 
-static struct cgroup *cgroup_parent(struct cgroup *cgrp)
-{
-	struct cgroup_subsys_state *parent_css = cgrp->self.parent;
-
-	if (parent_css)
-		return container_of(parent_css, struct cgroup, self);
-	return NULL;
-}
-
 /* is @cgrp threaded? regardless of mixed / root / member? */
 static bool cgroup_is_threaded(struct cgroup *cgrp)
 {
diff --git a/kernel/cgroup/debug.c b/kernel/cgroup/debug.c
index be901c0..561539d 100644
--- a/kernel/cgroup/debug.c
+++ b/kernel/cgroup/debug.c
@@ -198,6 +198,7 @@ static int cgroup_css_links_read(struct seq_file *seq, void *v)
 static int cgroup_subsys_states_read(struct seq_file *seq, void *v)
 {
 	struct cgroup *cgrp = seq_css(seq)->cgroup;
+	struct cgroup *parent = cgroup_parent(cgrp);
 	struct cgroup_subsys *ss;
 	struct cgroup_subsys_state *css;
 	char pbuf[16];
@@ -206,9 +207,16 @@ static int cgroup_subsys_states_read(struct seq_file *seq, void *v)
 	mutex_lock(&cgroup_mutex);
 	for_each_subsys(ss, i) {
 		css = rcu_dereference_check(cgrp->subsys[ss->id], true);
-		if (!css)
+		if (!css) {
+			u16 bypass = cgrp->bypass_ss_mask;
+
+			if (parent)
+				bypass |= parent->subtree_bypass;
+			if (bypass & (1 << ss->id))
+				seq_printf(seq, "%2d: %-4s\t- [Bypass]\n",
+					   ss->id, ss->name);
 			continue;
-
+		}
 		pbuf[0] = '\0';
 
 		/* Show the parent CSS if applicable*/
@@ -234,6 +242,8 @@ static int cgroup_masks_read(struct seq_file *seq, void *v)
 	} mask_list[] = {
 		{ &cgrp->subtree_control,  "subtree_control"  },
 		{ &cgrp->subtree_ss_mask,  "subtree_ss_mask"  },
+		{ &cgrp->subtree_bypass,   "subtree_bypass"   },
+		{ &cgrp->bypass_ss_mask,   "bypass_ss_mask"   },
 	};
 
 	mutex_lock(&cgroup_mutex);
@@ -255,6 +265,8 @@ static int cgroup_masks_read(struct seq_file *seq, void *v)
 	if (cgrp->proc_cgrp)
 		seq_printf(seq, "thread mode      : %s\n",
 			  (cgrp->proc_cgrp == cgrp) ? "root" : "threaded");
+	if (test_bit(CGRP_SUBTREE_ROOT_MODE, &cgrp->flags))
+		seq_puts(seq,   "subtree root mode: on\n");
 
 	mutex_unlock(&cgroup_mutex);
 	return 0;
@@ -314,11 +326,12 @@ static u64 releasable_read(struct cgroup_subsys_state *css, struct cftype *cft)
 };
 
 struct cgroup_subsys debug_cgrp_subsys = {
-	.css_alloc	= debug_css_alloc,
-	.css_free	= debug_css_free,
-	.legacy_cftypes	= debug_files,
-	.dfl_cftypes	= debug_files,
-	.threaded	= true,
+	.css_alloc	 = debug_css_alloc,
+	.css_free	 = debug_css_free,
+	.legacy_cftypes	 = debug_files,
+	.dfl_cftypes	 = debug_files,
+	.threaded	 = true,
+	.enabled_on_root = true,
 };
 
 /*
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH-cgroup 1/6] cgroup: Relax the no internal process constraint
  2017-06-14 15:05 ` [RFC PATCH-cgroup 1/6] cgroup: Relax the no internal process constraint Waiman Long
@ 2017-06-21 20:40   ` Tejun Heo
  2017-06-21 21:37     ` Waiman Long
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2017-06-21 20:40 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, kernel-team, pjt, luto, efault, torvalds

Hello, Waiman.

On Wed, Jun 14, 2017 at 11:05:32AM -0400, Waiman Long wrote:
>  2-4-3. No Internal Process Constraint
>  
> +When a non-root cgroup distributes resources to their children while
> +having processes of its own, its internal processes will then compete
> +against its children in term of resource allocation.  For some resource
> +types, that is not a problem and the controllers are able to handle
> +them correctly.  For others, the controllers may not be able to handle
> +internal process competition correctly.  This type of controllers are
> +called resource domain controllers in this document.
> +
> +Internal processes are not allowed on non-root cgroups which has
> +any one of those resource domain controllers enabled.  Currently all
> +controllers that are allowed in a threaded cgroup will be considered
> +as a non-resource domain controller and hence will not block internal
> +processes.  In other words, only cgroups which don't contain any

This isn't on this patch but I'm not sure this is a good way to define
resource domain controllers.  We probably should first define resource
domains and walk our way in to the accompanying restrictions and then
the distinction between the controller types.

...
> +Note that the restriction doesn't get in the way if there is no resource
> +domain controller enabled in the cgroup's "cgroup.subtree_control".
> +This is important as otherwise it wouldn't be possible to create
> +children of a populated cgroup.  To control resource distribution
> +of a cgroup, the cgroup must create children and transfer all
> +its processes to the children before enabling controllers in its
> +"cgroup.subtree_control" file.

What happens when we add domain handling to CPU so that it is both a
domain and resource controller?  Even if that somehow can be resolved,
wouldn't that come with a rather surprising userland behavior changes?
Also, I'm not sure what we're achieving by doing this.  It doesn't
really relax the restriction.  It just turns it off implicitly when
certain conditions are met, which doesn't really allow any real
capabilities and at least to me the behaviors feel more subtle and
complicated than before.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH-cgroup 2/6] cgroup: Enable bypass mode in cgroup v2
  2017-06-14 15:05 ` [RFC PATCH-cgroup 2/6] cgroup: Enable bypass mode in cgroup v2 Waiman Long
@ 2017-06-21 21:17   ` Tejun Heo
  2017-06-22 20:07     ` Waiman Long
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2017-06-21 21:17 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, kernel-team, pjt, luto, efault, torvalds

Hello, Waiman.

Let's first talk about and make sense of high level semantics.

On Wed, Jun 14, 2017 at 11:05:33AM -0400, Waiman Long wrote:
> +In the example below, '+' corresponds to an enabled controller and
> +corresponds to a bypassed controller.
> +
> +   +   #   #   #   +
> +   A - B - C - D - E
> +         \ F
> +	   +
> +In this case, the effective hiearchy is:
> +
> +	A|B|C|D - E
> +	        \ F

I think that this definitely has potential.  While different
controllers may see differently abbreviated versions of the tree, they
can still be mapped to the same hierarchy and we can implement
cross-controller operations in a meaningful way, I think; however, it
does make some things really weird.

In the above example, how would A's resources be distributed.  Let's
say the resource knob in question is memory.high.  Because from memory
controller's point of view A|B|C|D are all bunched up and have E and F
as children, memory.high resource knobs on E and F would control how
A's memory gets distributed, right?

So, once a parent skips a controller with #, you can only determine
how its resources are actually distributed by scanning the entire
subtree to determine the span of '#' on the controller and any sort of
delegation - whether implicit or explicit - wouldn't be possible in
the middle, right?

Can you please think of / explain how this would work with delegation?
Making things clear with delegation is really helpful because it can
serve as the canary for the usual hierarchical operations.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH-cgroup 1/6] cgroup: Relax the no internal process constraint
  2017-06-21 20:40   ` Tejun Heo
@ 2017-06-21 21:37     ` Waiman Long
  2017-06-21 21:39       ` Tejun Heo
  0 siblings, 1 reply; 20+ messages in thread
From: Waiman Long @ 2017-06-21 21:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, kernel-team, pjt, luto, efault, torvalds

On 06/21/2017 04:40 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Wed, Jun 14, 2017 at 11:05:32AM -0400, Waiman Long wrote:
>>  2-4-3. No Internal Process Constraint
>>  
>> +When a non-root cgroup distributes resources to their children while
>> +having processes of its own, its internal processes will then compete
>> +against its children in term of resource allocation.  For some resource
>> +types, that is not a problem and the controllers are able to handle
>> +them correctly.  For others, the controllers may not be able to handle
>> +internal process competition correctly.  This type of controllers are
>> +called resource domain controllers in this document.
>> +
>> +Internal processes are not allowed on non-root cgroups which has
>> +any one of those resource domain controllers enabled.  Currently all
>> +controllers that are allowed in a threaded cgroup will be considered
>> +as a non-resource domain controller and hence will not block internal
>> +processes.  In other words, only cgroups which don't contain any
> This isn't on this patch but I'm not sure this is a good way to define
> resource domain controllers.  We probably should first define resource
> domains and walk our way in to the accompanying restrictions and then
> the distinction between the controller types.

This patch essentially allows internal process when non of the resource
domain controllers are enabled in the sense that they are activated by
the subtree_control file of the parent. I do agree that we should
probably define resource domain as a separate topic in this document. It
is true that a resource domain controller is not technically equivalent
to the !threaded controller, but this is the best that we have unless we
introduce another controller attribute for resource domain. I have tried
to do that in some of my past patches. I think PeterZ had suggested too.

The main purpose to not make thread mode special for internal process
handling. Instead, internal processes are allowed because no resource
domain controllers are enabled whether or not thread mode is enabled or
not. I think that is more consistent than treating thread mode as special.

> ...
>> +Note that the restriction doesn't get in the way if there is no resource
>> +domain controller enabled in the cgroup's "cgroup.subtree_control".
>> +This is important as otherwise it wouldn't be possible to create
>> +children of a populated cgroup.  To control resource distribution
>> +of a cgroup, the cgroup must create children and transfer all
>> +its processes to the children before enabling controllers in its
>> +"cgroup.subtree_control" file.
> What happens when we add domain handling to CPU so that it is both a
> domain and resource controller?  Even if that somehow can be resolved,
> wouldn't that come with a rather surprising userland behavior changes?
> Also, I'm not sure what we're achieving by doing this.  It doesn't
> really relax the restriction.  It just turns it off implicitly when
> certain conditions are met, which doesn't really allow any real
> capabilities and at least to me the behaviors feel more subtle and
> complicated than before.
>
> Thanks.

I think CPU isn't a good example for that.

Another alternative is to treat no internal process as a controller
attribute. Then we don't need to worry about this intricate question and
let the  controllers decide if they will allow internal processes.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH-cgroup 4/6] cgroup: Introduce subtree root mode
  2017-06-14 15:05 ` [RFC PATCH-cgroup 4/6] cgroup: Introduce subtree root mode Waiman Long
@ 2017-06-21 21:38   ` Tejun Heo
  2017-06-22 20:27     ` Waiman Long
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2017-06-21 21:38 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, kernel-team, pjt, luto, efault, torvalds

Hello, Waiman.

On Wed, Jun 14, 2017 at 11:05:35AM -0400, Waiman Long wrote:
> Subtree root mode is a new cgroup mode which applies the following
> restrictions when turned on:
> 
>  1) Controllers are only allowed to be passed to the children in
>     bypass mode except those with the "enable_on_root" flag on.
>  2) Only 1 child cgroup is allowed.
> 
> That lone child can be used as the pseudo root of a container cgroup
> hierarchy.  All the resources, if controlled, are in the parent
> cgroup. There will be no control knobs in the child. That makes it
> look and feel like a root.

I'm not sure not having the control files in the child makes that much
difference.

> That pseudo root is also considered to be mixable and so can become
> root of a mixed threaded subtree. The no internal process constraint
> also does not apply.

Heh, you can't just declare a non-root cgroup to be a mixed root but
if you're special casing this and making the kernel play a masquerade
with special node, you can make cgroup provide a mixed root while
hosting the internal processes in a dedicated leaf cgroup which isn't
visible to the nested root, right?

It's all a game of masquerading tho and doesn't actually enable
anything which isn't possible now.  This would definitely be useful
for testing.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH-cgroup 1/6] cgroup: Relax the no internal process constraint
  2017-06-21 21:37     ` Waiman Long
@ 2017-06-21 21:39       ` Tejun Heo
  2017-06-21 21:50         ` Waiman Long
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2017-06-21 21:39 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, kernel-team, pjt, luto, efault, torvalds

Hello,

On Wed, Jun 21, 2017 at 05:37:00PM -0400, Waiman Long wrote:
> > What happens when we add domain handling to CPU so that it is both a
> > domain and resource controller?  Even if that somehow can be resolved,
> > wouldn't that come with a rather surprising userland behavior changes?
> > Also, I'm not sure what we're achieving by doing this.  It doesn't
> > really relax the restriction.  It just turns it off implicitly when
> > certain conditions are met, which doesn't really allow any real
> > capabilities and at least to me the behaviors feel more subtle and
> > complicated than before.
> 
> I think CPU isn't a good example for that.

Can you please elaborate?

> Another alternative is to treat no internal process as a controller
> attribute. Then we don't need to worry about this intricate question and
> let the  controllers decide if they will allow internal processes.

Isn't that what "threaded" is?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH-cgroup 5/6] cgroup: Skip dying css in cgroup_apply_control_{enable,disable}
  2017-06-14 15:05 ` [RFC PATCH-cgroup 5/6] cgroup: Skip dying css in cgroup_apply_control_{enable,disable} Waiman Long
@ 2017-06-21 21:42   ` Tejun Heo
  2017-06-21 22:01     ` Waiman Long
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2017-06-21 21:42 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, kernel-team, pjt, luto, efault, torvalds

Hello,

On Wed, Jun 14, 2017 at 11:05:36AM -0400, Waiman Long wrote:
> While constantly turning on and off controllers, it is possible to
> trigger the dying CSS warnings in cgroup_apply_control_enable() and
> cgroup_apply_control_disable(). The current code, however, proceeds
> after the warning leading to other secondary warnings and maybe even
> data corruption, like
> 
>   cgroup: cgroup_addrm_files: failed to add current, err=-17
> 
> To avoid the secondary errors, the dying CSS is now ignored or skipped
> so as not to cause other problem.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/cgroup/cgroup.c | 20 +++++++++++++++-----
>  1 file changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index f0bea32..2a5bd49 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -2846,12 +2846,24 @@ static int cgroup_apply_control_enable(struct cgroup *cgrp)
>  		for_each_subsys(ss, ssid) {
>  			struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
>  
> -			WARN_ON_ONCE(css && percpu_ref_is_dying(&css->refcnt));
> -
>  			if (!(cgroup_ss_mask(dsct, false) & (1 << ss->id)) ||
>  			    (dsct->bypass_ss_mask & (1 << ss->id)))
>  				continue;
>  
> +			/*
> +			 * If the css is dying, we will just skip it after
> +			 * warning.
> +			 */
> +			if (css && (css->flags & CSS_DYING)) {
> +				char name[NAME_MAX+1];
> +
> +				cgroup_name(cgrp, name, NAME_MAX);
> +				pr_warn("%s: %s css of cgroup %s is dying!\n",
> +					__func__, ss->name, name);
> +				WARN_ON_ONCE(1);
> +				continue;
> +			}

Can you trigger this without your patches because this triggering
means that the code screwed up before it reached this point.  We
should be fixing that bug rather than masking it up here.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH-cgroup 1/6] cgroup: Relax the no internal process constraint
  2017-06-21 21:39       ` Tejun Heo
@ 2017-06-21 21:50         ` Waiman Long
  2017-06-21 22:02           ` Tejun Heo
  0 siblings, 1 reply; 20+ messages in thread
From: Waiman Long @ 2017-06-21 21:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, kernel-team, pjt, luto, efault, torvalds

On 06/21/2017 05:39 PM, Tejun Heo wrote:
> Hello,
>
> On Wed, Jun 21, 2017 at 05:37:00PM -0400, Waiman Long wrote:
>>> What happens when we add domain handling to CPU so that it is both a
>>> domain and resource controller?  Even if that somehow can be resolved,
>>> wouldn't that come with a rather surprising userland behavior changes?
>>> Also, I'm not sure what we're achieving by doing this.  It doesn't
>>> really relax the restriction.  It just turns it off implicitly when
>>> certain conditions are met, which doesn't really allow any real
>>> capabilities and at least to me the behaviors feel more subtle and
>>> complicated than before.
>> I think CPU isn't a good example for that.
> Can you please elaborate?

CPU is probably the most prominent controller where deep hierarchy has a
performance cost. So I can't envision that it will forbid internal
process competition.
 
>> Another alternative is to treat no internal process as a controller
>> attribute. Then we don't need to worry about this intricate question and
>> let the  controllers decide if they will allow internal processes.
> Isn't that what "threaded" is?
>

That is exactly what this patch intends to do. However, you raised
concern that threaded may not be equivalent to the need of allowing
internal process. That is why I propose that. If your concern is only
about the documentation change, we can certainly fix that.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH-cgroup 5/6] cgroup: Skip dying css in cgroup_apply_control_{enable,disable}
  2017-06-21 21:42   ` Tejun Heo
@ 2017-06-21 22:01     ` Waiman Long
  2017-06-21 22:04       ` Tejun Heo
  0 siblings, 1 reply; 20+ messages in thread
From: Waiman Long @ 2017-06-21 22:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, kernel-team, pjt, luto, efault, torvalds

On 06/21/2017 05:42 PM, Tejun Heo wrote:
> Hello,
>
> On Wed, Jun 14, 2017 at 11:05:36AM -0400, Waiman Long wrote:
>> While constantly turning on and off controllers, it is possible to
>> trigger the dying CSS warnings in cgroup_apply_control_enable() and
>> cgroup_apply_control_disable(). The current code, however, proceeds
>> after the warning leading to other secondary warnings and maybe even
>> data corruption, like
>>
>>   cgroup: cgroup_addrm_files: failed to add current, err=-17
>>
>> To avoid the secondary errors, the dying CSS is now ignored or skipped
>> so as not to cause other problem.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  kernel/cgroup/cgroup.c | 20 +++++++++++++++-----
>>  1 file changed, 15 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index f0bea32..2a5bd49 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -2846,12 +2846,24 @@ static int cgroup_apply_control_enable(struct cgroup *cgrp)
>>  		for_each_subsys(ss, ssid) {
>>  			struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
>>  
>> -			WARN_ON_ONCE(css && percpu_ref_is_dying(&css->refcnt));
>> -
>>  			if (!(cgroup_ss_mask(dsct, false) & (1 << ss->id)) ||
>>  			    (dsct->bypass_ss_mask & (1 << ss->id)))
>>  				continue;
>>  
>> +			/*
>> +			 * If the css is dying, we will just skip it after
>> +			 * warning.
>> +			 */
>> +			if (css && (css->flags & CSS_DYING)) {
>> +				char name[NAME_MAX+1];
>> +
>> +				cgroup_name(cgrp, name, NAME_MAX);
>> +				pr_warn("%s: %s css of cgroup %s is dying!\n",
>> +					__func__, ss->name, name);
>> +				WARN_ON_ONCE(1);
>> +				continue;
>> +			}
> Can you trigger this without your patches because this triggering
> means that the code screwed up before it reached this point.  We
> should be fixing that bug rather than masking it up here.
>
> Thanks.
>
I will try to reproduce it without my patch.

I do think that it can happen with existing code because CSS killing is
asynchronous, I think. So the command can complete before the CSS is
actually gone. If the next command to reactivate it happens fast enough,
we can trigger that. When I added more checking to my test script
essentially increasing the latency between successive tests, I couldn't
trigger it anymore.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH-cgroup 1/6] cgroup: Relax the no internal process constraint
  2017-06-21 21:50         ` Waiman Long
@ 2017-06-21 22:02           ` Tejun Heo
  0 siblings, 0 replies; 20+ messages in thread
From: Tejun Heo @ 2017-06-21 22:02 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, kernel-team, pjt, luto, efault, torvalds

Hey,

On Wed, Jun 21, 2017 at 05:50:09PM -0400, Waiman Long wrote:
> >> I think CPU isn't a good example for that.
> > Can you please elaborate?
> 
> CPU is probably the most prominent controller where deep hierarchy has a
> performance cost. So I can't envision that it will forbid internal
> process competition.

I think there's a fundamental misunderstanding here.  The internal
competion thing is about how to account for resource consumptions
which aren't tied to specific processes or tasks.  Thread mode allows
building sub-hierarchy beyond the domain point while still keeping the
domain at the root of the thread subtree.  It is true that as
currently implemented, CPU controller has performance issues for some
workloads even with a moderate level of nesting (and quite a bit of
other artifacts from nesting too); however, supporting control of
anonymous resources or not is an orthogonal issue.  People can enable
thread mode at the root if that's applicable to the workload at hand
but you can't change what the basic topology means because a
controller has performance overhead.

> >> Another alternative is to treat no internal process as a controller
> >> attribute. Then we don't need to worry about this intricate question and
> >> let the  controllers decide if they will allow internal processes.
> > Isn't that what "threaded" is?
> 
> That is exactly what this patch intends to do. However, you raised
> concern that threaded may not be equivalent to the need of allowing
> internal process. That is why I propose that. If your concern is only
> about the documentation change, we can certainly fix that.

I'm really lost on what this actually achieves.  Can we please first
talk about what you're trying to enable?  Let's talk about features
and capabilities first because it feels like most of the changes in
this patchset lack them and we seem to be talking past each other.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH-cgroup 5/6] cgroup: Skip dying css in cgroup_apply_control_{enable,disable}
  2017-06-21 22:01     ` Waiman Long
@ 2017-06-21 22:04       ` Tejun Heo
  2017-06-21 22:19         ` Waiman Long
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2017-06-21 22:04 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, kernel-team, pjt, luto, efault, torvalds

Hello,

On Wed, Jun 21, 2017 at 06:01:56PM -0400, Waiman Long wrote:
> I do think that it can happen with existing code because CSS killing is
> asynchronous, I think. So the command can complete before the CSS is
> actually gone. If the next command to reactivate it happens fast enough,
> we can trigger that. When I added more checking to my test script
> essentially increasing the latency between successive tests, I couldn't
> trigger it anymore.

While disabling is asynchronous, there's a flushing logic before
starting reenabling things, so the existing code shouldn't trigger
that condition.  But then there's should and the reality. :)

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH-cgroup 5/6] cgroup: Skip dying css in cgroup_apply_control_{enable,disable}
  2017-06-21 22:04       ` Tejun Heo
@ 2017-06-21 22:19         ` Waiman Long
  0 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2017-06-21 22:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, kernel-team, pjt, luto, efault, torvalds

On 06/21/2017 06:04 PM, Tejun Heo wrote:
> Hello,
>
> On Wed, Jun 21, 2017 at 06:01:56PM -0400, Waiman Long wrote:
>> I do think that it can happen with existing code because CSS killing is
>> asynchronous, I think. So the command can complete before the CSS is
>> actually gone. If the next command to reactivate it happens fast enough,
>> we can trigger that. When I added more checking to my test script
>> essentially increasing the latency between successive tests, I couldn't
>> trigger it anymore.
> While disabling is asynchronous, there's a flushing logic before
> starting reenabling things, so the existing code shouldn't trigger
> that condition.  But then there's should and the reality. :)
>
> Thanks.
>
I actually got errors from cgroup_addrm_files() complaining about
creating existing sysfs files because of the css_populate_dir() call
when the css was dying but not gone yet. I added code in the patch just
to sidestep that. Maybe I should just force the call to css_create()
when the old one is dying to see if that will work out.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH-cgroup 2/6] cgroup: Enable bypass mode in cgroup v2
  2017-06-21 21:17   ` Tejun Heo
@ 2017-06-22 20:07     ` Waiman Long
  0 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2017-06-22 20:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, kernel-team, pjt, luto, efault, torvalds

On 06/21/2017 05:17 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> Let's first talk about and make sense of high level semantics.
>
> On Wed, Jun 14, 2017 at 11:05:33AM -0400, Waiman Long wrote:
>> +In the example below, '+' corresponds to an enabled controller and
>> +corresponds to a bypassed controller.
>> +
>> +   +   #   #   #   +
>> +   A - B - C - D - E
>> +         \ F
>> +	   +
>> +In this case, the effective hiearchy is:
>> +
>> +	A|B|C|D - E
>> +	        \ F
> I think that this definitely has potential.  While different
> controllers may see differently abbreviated versions of the tree, they
> can still be mapped to the same hierarchy and we can implement
> cross-controller operations in a meaningful way, I think; however, it
> does make some things really weird.
>
> In the above example, how would A's resources be distributed.  Let's
> say the resource knob in question is memory.high.  Because from memory
> controller's point of view A|B|C|D are all bunched up and have E and F
> as children, memory.high resource knobs on E and F would control how
> A's memory gets distributed, right?

That is right.

> So, once a parent skips a controller with #, you can only determine
> how its resources are actually distributed by scanning the entire
> subtree to determine the span of '#' on the controller and any sort of
> delegation - whether implicit or explicit - wouldn't be possible in
> the middle, right?

That is right, too.

> Can you please think of / explain how this would work with delegation?
> Making things clear with delegation is really helpful because it can
> serve as the canary for the usual hierarchical operations.

+   #   +
A - B - C
        \ D +

In term of delegation story, I would say that for the above
configuration, the parent A can delegate 5 units of resources to B. B,
upon finding out it has 5 units of resources, may decide to take itself
out of the picture (bypass itself) and delegate, say, 2 units to C and 3
units to D assuming that B has no internal process.

Of course, B can decide to ignore the rules, bypass itself and add
internal process to compete with other children of A. So we can make
"cgroup.controllers" non-writable to B if we don't want any controllers
to be bypassed.

As a side note, I have another delegation story for enabling bypass mode
in subtree_control. A parent can activate a controller in bypass mode to
signal that it has delegated the authority to enable a controller to its
children. A child can then activate a controller by writing '+' to its
cgroup.controllers (not implemented yet). In this scenario, the child
own the control knobs, not the parent. That can be useful for
controllers that deal with ID or membership like devices, freezer,
perf_event or even cpusets. You may not want to have separate IDs for
all the nodes in the hierarchy, but for those who need a different ID,
they can choose to do that. In fact, I am thinking if it may be useful
to define a bypass_on_dfl attribute that work like implicit_on_dfl so
that we don't need to explicitly set those controllers in bypass mode.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH-cgroup 4/6] cgroup: Introduce subtree root mode
  2017-06-21 21:38   ` Tejun Heo
@ 2017-06-22 20:27     ` Waiman Long
  0 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2017-06-22 20:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, kernel-team, pjt, luto, efault, torvalds

On 06/21/2017 05:38 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Wed, Jun 14, 2017 at 11:05:35AM -0400, Waiman Long wrote:
>> Subtree root mode is a new cgroup mode which applies the following
>> restrictions when turned on:
>>
>>  1) Controllers are only allowed to be passed to the children in
>>     bypass mode except those with the "enable_on_root" flag on.
>>  2) Only 1 child cgroup is allowed.
>>
>> That lone child can be used as the pseudo root of a container cgroup
>> hierarchy.  All the resources, if controlled, are in the parent
>> cgroup. There will be no control knobs in the child. That makes it
>> look and feel like a root.
> I'm not sure not having the control files in the child makes that much
> difference.

It is more a look and feel thing than being useful. The sole purpose is
to make the container root looks like a real root as much as possible.

>> That pseudo root is also considered to be mixable and so can become
>> root of a mixed threaded subtree. The no internal process constraint
>> also does not apply.
> Heh, you can't just declare a non-root cgroup to be a mixed root but
> if you're special casing this and making the kernel play a masquerade
> with special node, you can make cgroup provide a mixed root while
> hosting the internal processes in a dedicated leaf cgroup which isn't
> visible to the nested root, right?

That is true.

> It's all a game of masquerading tho and doesn't actually enable
> anything which isn't possible now.  This would definitely be useful
> for testing.
You are right about that. As mentioned in another email, having bypass
mode in subtree_control is useful, I think. The concept of subtree root
mode, however, is more toward satisfying PeterZ's idea of container
invariant. I won't mind leaving it out if others have no objection to that.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2017-06-22 20:27 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-14 15:05 [RFC PATCH-cgroup 0/6] cgroup: bypass and subtree root modes Waiman Long
2017-06-14 15:05 ` [RFC PATCH-cgroup 1/6] cgroup: Relax the no internal process constraint Waiman Long
2017-06-21 20:40   ` Tejun Heo
2017-06-21 21:37     ` Waiman Long
2017-06-21 21:39       ` Tejun Heo
2017-06-21 21:50         ` Waiman Long
2017-06-21 22:02           ` Tejun Heo
2017-06-14 15:05 ` [RFC PATCH-cgroup 2/6] cgroup: Enable bypass mode in cgroup v2 Waiman Long
2017-06-21 21:17   ` Tejun Heo
2017-06-22 20:07     ` Waiman Long
2017-06-14 15:05 ` [RFC PATCH-cgroup 3/6] cgroup: Allow bypss mode in subtree_control Waiman Long
2017-06-14 15:05 ` [RFC PATCH-cgroup 4/6] cgroup: Introduce subtree root mode Waiman Long
2017-06-21 21:38   ` Tejun Heo
2017-06-22 20:27     ` Waiman Long
2017-06-14 15:05 ` [RFC PATCH-cgroup 5/6] cgroup: Skip dying css in cgroup_apply_control_{enable,disable} Waiman Long
2017-06-21 21:42   ` Tejun Heo
2017-06-21 22:01     ` Waiman Long
2017-06-21 22:04       ` Tejun Heo
2017-06-21 22:19         ` Waiman Long
2017-06-14 15:05 ` [RFC PATCH-cgroup 6/6] cgroup: Make debug controller display bypass and subtree root modes info Waiman Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).