From mboxrd@z Thu Jan  1 00:00:00 1970
From: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Subject: [PATCH cgroup/for-3.16] cgroup: add documentation about unified
	hierarchy
Date: Mon, 14 Apr 2014 18:09:17 -0400
Message-ID: <20140414220917.GD1863@htj.dyndns.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <containers-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>
Content-Disposition: inline
List-Unsubscribe: <https://lists.linuxfoundation.org/mailman/options/containers>,
	<mailto:containers-request-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org?subject=unsubscribe>
List-Archive: <http://lists.linuxfoundation.org/pipermail/containers/>
List-Post: <mailto:containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>
List-Help: <mailto:containers-request-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org?subject=help>
List-Subscribe: <https://lists.linuxfoundation.org/mailman/listinfo/containers>,
	<mailto:containers-request-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org?subject=subscribe>
Sender: containers-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
Errors-To: containers-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
To: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
Cc: Brandon Philips <brandon.philips-JW9irJGTvgXQT0dZR+AlfA@public.gmane.org>, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>, Kay Sievers <kay-tD+1rO4QERM@public.gmane.org>, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>, Daniel Borkmann <dborkman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>, Lennart Poettering <lennart-mdGvqq1h2p+GdvJs77BJ7Q@public.gmane.org>, Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>, Thomas Graf <tgraf-G/eBtMaohhA@public.gmane.org>, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-Id: containers.vger.kernel.org

Hello,

Unified hierarchy is finally out for review [1][2].  This patch adds
the documentation which describes the design and rationales.  If you
can think of more people to cc, please go ahead.

If you have any comments and/or questions, please don't hesitate.

Thanks.

[1] http://lkml.kernel.org/g/1397511430-2673-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
[2] http://lkml.kernel.org/g/1397511846-2904-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org

------ 8< ------
>From 68eb841c53bb26a7b49f8f244ebd68f2530d8d0b Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Date: Mon, 14 Apr 2014 17:29:39 -0400

Unified hierarchy will be the new version of the cgroup interface.
This patch adds Documentation/cgroups/unified-hierarchy.txt, which
describes the design and rationales of unified hierarchy.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 Documentation/cgroups/unified-hierarchy.txt | 359 ++++++++++++++++++++++++++++
 1 file changed, 359 insertions(+)
 create mode 100644 Documentation/cgroups/unified-hierarchy.txt

diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt
new file mode 100644
index 0000000..41386c3
--- /dev/null
+++ b/Documentation/cgroups/unified-hierarchy.txt
@@ -0,0 +1,359 @@
+
+Cgroup unified hierarchy
+
+April, 2014		Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
+
+This document describes the changes made by unified hierarchy and
+their rationales.  It will eventually be merged into the main cgroup
+documentation.
+
+CONTENTS
+
+1. Background
+2. Basic Operation
+  2-1. Mounting
+  2-2. cgroup.subtree_control
+  2-3. cgroup.controllers
+3. Structural Constraints
+  3-1. Top-down
+  3-2. No internal tasks
+4. Other Changes
+  4-1. [Un]populated Notification
+  4-2. Other Core Changes
+  4-3. Per-Controller Changes
+    4-3-1. blkio
+    4-3-2. cpuset
+    4-3-3. memory
+5. Planned Changes
+  5-1. CAP for resource control
+
+
+1. Background
+
+cgroup allows an arbitrary number of hierarchies and each hierarchy
+can host any number of controllers.  While this seems to provide a
+high level of flexibility, it isn't very useful in practice.
+
+For example, as there is only one instance of each controller, utility
+type controllers such as freezer, which can be useful in all
+hierarchies, can only be used in one.  The issue is exacerbated by the
+fact that controllers can't be moved around once hierarchies are
+populated.  Another issue is that all controllers bound to a hierarchy
+are forced to have exactly the same view of the hierarchy.  It isn't
+possible to vary the granularity depending on the specific controller.
+
+In practice, these issues heavily limit which controllers can be put
+on the same hierarchy and most configurations resort to putting each
+controller on its own hierarchy.  Only closely related ones, such as
+cpu and cpuacct, make sense to put on the same hierarchy.  This often
+means that userland ends up managing multiple similar hierarchies
+repeating the same steps on each hierarchy whenever a hierarchy
+management operation is necessary.
+
+Unfortunately, support for multiple hierarchies comes at a steep cost.
+The internal implementation in cgroup core proper is dazzlingly
+complicated but, more importantly, the support for multiple hierarchies
+restricts how cgroup is used in general and what controllers can do.
+
+There's no limit on how many hierarchies there may be, which means
+that a task's cgroup membership can't be described in finite length.
+The key may contain any number of entries and is unlimited in length,
+which makes it highly awkward to handle and leads to the addition of
+controllers which exist only to identify membership, which in turn
+exacerbates the original problem.
+
+Also, as a controller can't have any expectation regarding what shape
+of hierarchies other controllers would be on, each controller has to
+assume that all other controllers are operating on completely
+orthogonal hierarchies.  This makes it impossible, or at least very
+cumbersome, for controllers to cooperate with each other.
+
+In most use cases, putting controllers on hierarchies which are
+completely orthogonal to each other isn't necessary.  What usually is
+called for is the ability to have differing levels of granularity
+depending on the specific controller.  IOW, the hierarchy may be
+collapsed from leaf towards root when viewed from specific
+controllers.  For example, a given configuration might not care about
+how memory is distributed beyond a certain level while still wanting
+to control how cpu cycles are distributed.
+
+Unified hierarchy is the next version of the cgroup interface.  It
+aims to address the aforementioned issues by having more structure
+while retaining enough flexibility for most use cases.  Various other
+general and controller-specific interface issues are also addressed in
+the process.
+
+
+2. Basic Operation
+
+2-1. Mounting
+
+Currently, unified hierarchy can be mounted with the following mount
+command.  Note that this is still under development and scheduled to
+change soon.
+
+ mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT
+
+All controllers which are not bound to other hierarchies are
+automatically bound to unified hierarchy and show up at the root of
+it.  Controllers which are enabled only in the root of unified
+hierarchy can be bound to other hierarchies at any time.  This allows
+mixing unified hierarchy with the traditional multiple hierarchies in
+a fully backward compatible way.
+
+
+2-2. cgroup.subtree_control
+
+All cgroups on unified hierarchy have "cgroup.subtree_control" which
+governs which controllers are enabled on the children of the cgroup.
+Let's assume a hierarchy like the following.
+
+  root - A - B - C
+               \ D
+
+root's "cgroup.subtree_control" determines which controllers are
+enabled on A, A's on B, and B's on C and D.  This coincides with the
+fact that controllers on the immediate sub-level are used to
+distribute the resources of the parent.  In fact, it's natural to
+assume that resource control knobs of a child belong to its parent.
+Enabling a controller in "cgroup.subtree_control" declares that
+distribution of the respective resources of the cgroup will be
+controlled.  Note that this means that controller enable states are
+shared among siblings.
+
+When read, the file contains a space-separated list of currently
+enabled controllers.  A write to the file should contain a
+space-separated list of controllers, each prefixed with '+' or '-'
+(without the quotes).  Controllers prefixed with '+' are enabled and
+those prefixed with '-' are disabled.  If a controller is listed
+multiple times, the last entry wins.  The specified operations are
+executed atomically - either all succeed or all fail.
+
+
+2-3. cgroup.controllers
+
+Read-only "cgroup.controllers" contains a space-separated list of
+controllers which can be enabled in the cgroup's
+"cgroup.subtree_control".
+
+In the root cgroup, this lists controllers which are not bound to
+other hierarchies and the content changes as controllers are bound to
+and unbound from other hierarchies.
+
+In non-root cgroups, the content of this file equals that of the
+parent's "cgroup.subtree_control" as only controllers enabled from the
+parent can be used in its children.
+
+
+3. Structural Constraints
+
+3-1. Top-down
+
+As it doesn't make sense to nest control of an uncontrolled resource,
+all non-root "cgroup.subtree_control" can only contain controllers
+which are enabled in the parent's "cgroup.subtree_control".  A
+controller can be enabled only if the parent has the controller
+enabled and a controller can't be disabled if one or more children
+have it enabled.
+
+
+3-2. No internal tasks
+
+One long-standing issue that cgroup faces is the competition between
+tasks belonging to the parent cgroup and its child cgroups.  This
+is inherently nasty as two different types of entities compete and
+there is no agreed-upon obvious way to handle it.  Different
+controllers are doing different things.
+
+cpu considers tasks and cgroups as equivalents and maps nice level to
+cgroup weights.  This works for some cases but falls flat when
+children should be allocated specific ratios of cpu cycles and the
+number of internal tasks fluctuates - the ratios constantly change as
+the number of competing entities fluctuates.  There also are other
+issues.  The mapping from nice level to weight isn't obvious or
+universal, and there are various other knobs which simply aren't
+available for tasks.
+
+blkio implicitly creates a hidden leaf node for each cgroup to host
+the tasks.  The hidden leaf has its own copies of all the knobs with
+"leaf_" prefixed.  While this allows equivalent control over internal
+tasks, it comes with serious drawbacks.  It always adds an extra layer
+of nesting which may not be necessary, makes the interface messy and
+significantly complicates the implementation.
+
+memory currently doesn't have a way to control what happens between
+internal tasks and child cgroups and the behavior is not clearly
+defined.  There have been attempts to add ad-hoc behaviors and knobs
+to tailor the behavior to specific workloads.  Continuing this
+direction will lead to problems which will be extremely difficult to
+resolve in the long term.
+
+Multiple controllers struggle with internal tasks and have come up
+with different ways to deal with them; unfortunately, all the
+approaches in use now are severely flawed and, furthermore, the widely
+different behaviors make cgroup as a whole highly inconsistent.
+
+It is clear that this is something which needs to be addressed from
+cgroup core proper in a uniform way so that controllers don't need to
+worry about it and cgroup as a whole shows a consistent and logical
+behavior.  To achieve that, unified hierarchy enforces the following
+structural constraint.
+
+ Except for the root, only cgroups which don't contain any task may
+ have controllers enabled in "cgroup.subtree_control".
+
+Combined with other properties, this guarantees that, when a
+controller is looking at the part of the hierarchy which has it
+enabled, tasks are always only on the leaves.  This rules out
+situations where child cgroups compete against internal tasks of the
+parent.
+
+There are two things to note.  Firstly, the root cgroup is exempt from
+the restriction.  Root contains tasks and anonymous resource
+consumption which can't be associated with any other cgroup and
+requires special treatment from most controllers.  How resource
+consumption in the root cgroup is governed is up to each controller.
+
+Secondly, the restriction doesn't take effect if there is no enabled
+controller in the cgroup's "cgroup.subtree_control".  This is
+important as otherwise it wouldn't be possible to create children of a
+populated cgroup.  To control resource distribution of a cgroup, the
+cgroup must create children and transfer all its tasks to the children
+before enabling controllers in its "cgroup.subtree_control".
+
+
+4. Other Changes
+
+4-1. [Un]populated Notification
+
+cgroup users often need a way to determine when a cgroup's
+subhierarchy becomes empty so that it can be cleaned up.  cgroup
+currently provides release_agent for it; unfortunately, this mechanism
+is riddled with issues.
+
+- It delivers events by forking and execing a userland binary
+  specified as the release_agent.  This is a long deprecated method of
+  notification delivery.  It's extremely heavy, slow and cumbersome to
+  integrate with larger infrastructure.
+
+- There is a single monitoring point at the root.  There's no way to
+  delegate management of a subtree.
+
+- The event isn't recursive.  It triggers when a cgroup doesn't have
+  any tasks or child cgroups.  Events for internal nodes trigger only
+  after all children are removed.  This again makes it impossible to
+  delegate management of a subtree.
+
+- Events are filtered from the kernel side.  The "notify_on_release"
+  file is used to subscribe to or suppress release events.  This is
+  unnecessarily complicated and probably done this way because event
+  delivery itself was expensive.
+
+Unified hierarchy adds the interface file "cgroup.subtree_populated"
+which can be used to monitor whether the cgroup's subhierarchy has
+tasks in it or not.  Its value is 0 if there is no task in the cgroup
+and its descendants; otherwise, 1.  poll and [id]notify events are
+triggered when the value changes.
+
+This is significantly lighter and simpler and trivially allows
+delegating management of a subhierarchy - a subhierarchy manager can
+block further propagation simply by putting itself or another process
+in the root of the subhierarchy and can monitor the events that it's
+interested in from there without interfering with monitoring higher in
+the tree.
+
+In unified hierarchy, the release_agent mechanism is no longer
+supported and the interface files "release_agent" and
+"notify_on_release" do not exist.
+
+
+4-2. Other Core Changes
+
+- None of the mount options is allowed.
+
+- remount is disallowed.
+
+- rename(2) is disallowed.
+
+- "tasks" is removed.  Everything should be at process granularity.
+  Use "cgroup.procs" instead.
+
+- "cgroup.procs" is not sorted.  pids will be unique unless they got
+  recycled in-between reads.
+
+- "cgroup.clone_children" is removed.
+
+
+4-3. Per-Controller Changes
+
+4-3-1. blkio
+
+- blk-throttle becomes properly hierarchical.
+
+
+4-3-2. cpuset
+
+- Tasks are kept in empty cpusets after hotplug and take on the masks
+  of the nearest non-empty ancestor, instead of being moved to it.
+
+- A task can be moved into an empty cpuset, and again it takes on the
+  masks of the nearest non-empty ancestor.
+
+
+4-3-3. memory
+
+- use_hierarchy is on by default and the cgroup file for the flag is
+  not created.
+
+
+5. Planned Changes
+
+5-1. CAP for resource control
+
+Unified hierarchy will require one of the capabilities(7), which is
+yet to be decided, for all resource control related knobs.  Process
+organization operations - creation of sub-cgroups and migration of
+processes in sub-hierarchies - may be delegated by changing the
+ownership and/or permissions on the cgroup directory and
+"cgroup.procs" interface file; however, all operations which affect
+resource control - writes to "cgroup.subtree_control" or any
+controller-specific knobs - will require an explicit CAP privilege.
+
+This, in part, is to prevent the cgroup interface from being
+inadvertently promoted to a programmable API used by non-privileged
+binaries.  cgroup exposes various aspects of the system in ways which
+aren't properly abstracted for direct consumption by regular programs.
+This is an administration interface much closer to sysctl knobs than
+system calls.  Even the basic access model, being filesystem path
+based, isn't suitable for direct consumption.  There's no way to
+access "my cgroup" in a race-free way or make multiple operations
+atomic against migration to another cgroup.
+
+Another aspect is that, for better or for worse, the cgroup interface
+goes through far less scrutiny than regular interfaces for
+unprivileged userland.  The upside is that cgroup is able to expose
+useful features which may not be suitable for general consumption in a
+reasonable time frame.  It provides a relatively short path between
+internal details and userland-visible interface.  Of course, this
+shortcut comes with high risk.  We go through what we go through for
+general kernel APIs for good reasons.  It may end up leaking internal
+details in a way which can exert significant pain by locking the
+kernel into a contract that can't be maintained in a reasonable manner.
+
+Also, due to their specific nature, cgroup and its controllers don't
+tend to attract attention from a wide range of developers.  cgroup's
+short history is already fraught with severely mis-designed
+interfaces, unnecessary commitment to and exposure of internal
+details, and broken and dangerous implementations of various features.
+
+Keeping cgroup as an administration interface is both advantageous for
+its role and an imperative given its nature.  Some of the cgroup
+features may make sense for unprivileged access.  If deemed justified,
+those must be further abstracted and implemented as a different
+interface, be it a system call or process-private filesystem, and
+survive through the scrutiny that any interface for general
+consumption is required to go through.
+
+Requiring CAP is not a complete solution but should serve as a
+significant deterrent against spraying cgroup usages in non-privileged
+programs.
-- 
1.9.0
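
As an illustration of sections 2-2 and 3-2 of the document above, the
following shell sketch creates a child cgroup, moves the processes of
a populated cgroup into it and only then enables controllers with a
single write to "cgroup.subtree_control".  The function name, the
child name "leaf" and the example path are made up for illustration.
It assumes a unified hierarchy mounted as in 2-1 and sufficient
privileges, and it does not handle processes forked while the loop is
running.

```shell
#!/bin/sh
# Sketch only.  push_down_and_enable and the child name "leaf" are
# hypothetical; the interface files are the ones named in the document.

# usage: push_down_and_enable CGROUP_DIR CONTROLLER...
push_down_and_enable() {
    cg=$1
    shift
    [ $# -gt 0 ] || return 1

    # Children can be created while the cgroup is still populated because
    # nothing is enabled in its cgroup.subtree_control yet.
    mkdir -p "$cg/leaf" || return 1

    # Snapshot the pid list first, then migrate one pid per write.
    pids=$(cat "$cg/cgroup.procs") || return 1
    for pid in $pids; do
        echo "$pid" > "$cg/leaf/cgroup.procs" || return 1
    done

    # Build e.g. "+memory +cpu" and submit it as one write so that the
    # request is handled as a unit.
    request=
    for ctrl in "$@"; do
        request="$request${request:+ }+$ctrl"
    done
    echo "$request" > "$cg/cgroup.subtree_control"
}
```

A call might look like `push_down_and_enable "$MOUNT_POINT/A" memory
cpu`, provided both controllers are listed in A's
"cgroup.controllers".  Enabling after the move, not before, follows
from the constraint in 3-2.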

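Section 4-1 of the document can be illustrated in a similar way.  The
sketch below reads "cgroup.subtree_populated" and removes a
subhierarchy only when the file reports 0.  The helper names are made
up for illustration.  A real monitor would wait for poll or inotify
events on the file instead of being invoked by hand, which is not
shown here.

```shell
#!/bin/sh
# Sketch only.  Helper names are hypothetical.

# Returns 0 if the cgroup or any descendant has tasks, 1 if the whole
# subhierarchy is empty, 2 if the file is missing or unreadable.
subtree_populated() {
    state=$(cat "$1/cgroup.subtree_populated" 2>/dev/null) || return 2
    case $state in
        1) return 0 ;;
        0) return 1 ;;
        *) return 2 ;;
    esac
}

# Remove a subhierarchy, children before parents, but only when it is
# known to be unpopulated.  A task may still enter between the check
# and the rmdir; the rmdir is then expected to fail.
reap_if_unpopulated() {
    rc=0
    subtree_populated "$1" || rc=$?
    [ "$rc" -eq 1 ] || return 1
    find "$1" -depth -type d -exec rmdir {} \;
}
```

Treating a missing or malformed file as an error instead of as
"empty" keeps the cleanup from running against a directory that is
not a cgroup on a unified hierarchy.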
From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1756157AbaDNWJa (ORCPT <rfc822;w@1wt.eu>);
	Mon, 14 Apr 2014 18:09:30 -0400
Received: from mail-qg0-f53.google.com ([209.85.192.53]:41098 "EHLO
	mail-qg0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1755142AbaDNWJW (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Mon, 14 Apr 2014 18:09:22 -0400
X-Greylist: delayed 1916 seconds by postgrey-1.27 at vger.kernel.org; Mon, 14 Apr 2014 18:09:21 EDT
Date: Mon, 14 Apr 2014 18:09:17 -0400
From: Tejun Heo <tj@kernel.org>
To: Li Zefan <lizefan@huawei.com>
Cc: cgroups@vger.kernel.org, containers@lists.linux-foundation.org,
        linux-kernel@vger.kernel.org, Serge Hallyn <serge.hallyn@ubuntu.com>,
        Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@suse.cz>,
        Peter Zijlstra <peterz@infradead.org>,
        Aristeu Rozanski <arozansk@redhat.com>,
        Daniel Borkmann <dborkman@redhat.com>, Thomas Graf <tgraf@suug.ch>,
        Lennart Poettering <lennart@poettering.net>,
        Kay Sievers <kay@vrfy.org>, Rohit Jnagal <jnagal@google.com>,
        Brandon Philips <brandon.philips@coreos.com>
Subject: [PATCH cgroup/for-3.16] cgroup: add documentation about unified
 hierarchy
Message-ID: <20140414220917.GD1863@htj.dyndns.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Hello,

Unified hierarchy is finally out for review [1][2].  This patch adds
the documentation which describes the design and rationales.  If you
can think of more people to cc, please go ahead.

If you have any comments and/or questions, please don't hesitate.

Thanks.

[1] http://lkml.kernel.org/g/1397511430-2673-1-git-send-email-tj@kernel.org
[2] http://lkml.kernel.org/g/1397511846-2904-1-git-send-email-tj@kernel.org

------ 8< ------
>>From 68eb841c53bb26a7b49f8f244ebd68f2530d8d0b Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Mon, 14 Apr 2014 17:29:39 -0400

Unified hierarchy will be the new version of cgroup interface.  This
patch adds Documentation/cgroups/unified-hierarchy.txt which describes
the design and rationales of unified hierarchy.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 Documentation/cgroups/unified-hierarchy.txt | 359 ++++++++++++++++++++++++++++
 1 file changed, 359 insertions(+)
 create mode 100644 Documentation/cgroups/unified-hierarchy.txt

diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt
new file mode 100644
index 0000000..41386c3
--- /dev/null
+++ b/Documentation/cgroups/unified-hierarchy.txt
@@ -0,0 +1,359 @@
+
+Cgroup unified hierarchy
+
+April, 2014		Tejun Heo <tj@kernel.org>
+
+This document describes the changes made by unified hierarchy and
+their rationales.  It will eventually be merged into the main cgroup
+documentation.
+
+CONTENTS
+
+1. Background
+2. Basic Operation
+  2-1. Mounting
+  2-2. cgroup.subtree_control
+  2-3. cgroup.controllers
+3. Structural Constraints
+  3-1. Top-down
+  3-2. No internal tasks
+4. Other Changes
+  4-1. [Un]populated Notification
+  4-2. Other Core Changes
+  4-3. Per-Controller Changes
+    4-3-1. blkio
+    4-3-2. cpuset
+    4-3-3. memory
+5. Planned Changes
+  5-1. CAP for resource control
+
+
+1. Background
+
+cgroup allows arbitrary number of hierarchies and each hierarchy can
+host any number of controllers.  While this seems to provide high
+level of flexibility, it isn't quite useful in practice.
+
+For example, as there is only one instance of each controller, utility
+type controllers such as freezer which can be useful in all
+hierarchies can only be used in one.  The issue is exacerbated by the
+fact that controllers can't be moved around once hierarchies are
+populated.  Another issue is that all controllers bound to a hierarchy
+are forced to have exactly the same view of the hierarchy.  It isn't
+possible to vary the granularity depending on the specific controller.
+
+In practice, these issues heavily limit which controllers can be put
+on the same hierarchy and most configurations resort to putting each
+controller on its own hierarchy.  Only closely related ones, such as
+cpu and cpuacct, make sense to put on the same hierarchy.  This often
+means that userland ends up managing multiple similar hierarchies
+repeating the same steps on each hierarchy whenever a hierarchy
+management operation is necessary.
+
+Unfortunately, support for multiple hierarchies comes at a steep cost.
+Internal implementation in cgroup core proper is dazzlingly
+complicated but more importantly the support for multiple hierarchies
+restricts how cgroup is used in general and what controllers can do.
+
+There's no limit on how many hierarchies there may be, which means
+that a task's cgroup membership can't be described in finite length.
+The key may contain any varying number of entries and is unlimited in
+length, which makes it highly awkward to handle and leads to addition
+of controllers which exist only to identify membership, which in turn
+exacerbates the original problem.
+
+Also, as a controller can't have any expectation regarding what shape
+of hierarchies other controllers would be on, each controller has to
+assume that all other controllers are operating on completely
+orthogonal hierarchies.  This makes it impossible, or at least very
+cumbersome, for controllers to cooperate with each other.
+
+In most use cases, putting controllers on hierarchies which are
+completely orthogonal to each other isn't necessary.  What usually is
+called for is the ability to have differing levels of granularity
+depending on the specific controller.  IOW, hierarchy may be collapsed
+from leaf towards root when viewed from specific controllers.  For
+example, a given configuration might not care about how memory is
+distributed beyond certain level while still want to control how cpu
+cycles are distributed.
+
+Unified hierarchy is the next version of cgroup interface.  It aims to
+address the aforementioned issues by having more structure while
+retaining enough flexibility for most use cases.  Various other
+general and controller-specific interface issues are also addressed in
+the process.
+
+
+2. Basic Operation
+
+2-1. Mounting
+
+Currently, unified hierarchy can be mounted with the following mount
+command.  Note that this is still under development and scheduled to
+change soon.
+
+ mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT
+
+All controllers which are not bound to other hierarchies are
+automatically bound to unified hierarchy and show up at the root of
+it.  Controllers which are enabled only in the root of unified
+hierarchy can be bound to other hierarchies at any time.  This allows
+mixing unified hierarchy with the traditional multiple hierarchies in
+fully backward compatible way.
+
+
+2-2. cgroup.subtree_control
+
+All cgroups on unified hierarchy have "cgroup.subtree_control" which
+governs which controllers are enabled on the children of the cgroup.
+Let's assume a hierarchy like the following.
+
+  root - A - B - C
+               \ D
+
+root's "cgroup.subtree_control" determines which controllers are
+enabled on A.  A's on B.  B's on C and D.  This coincides with the
+fact that controllers on the immediate sub-level are used to
+distribute the resources of the parent.  In fact, it's natural to
+assume that resource control knobs of a child belong to its parent.
+Enabling a controller in "cgroup.subtree_control" declares that
+distribution of the respective resources of the cgroup will be
+controlled.  Note that this means that controller enable states are
+shared among siblings.
+
+When read, the file contains space-separated list of currently enabled
+controllers.  A write to the file should contain spaced-separated list
+of controllers with '+' or '-' prefixed (without the quotes).
+Controllers prefixed with '+' are enabled and '-' disabled.  If a
+controller is listed multiple times, the last entry wins.  The
+specific operations are executed atomically - either all succeed or
+fail.
+
+
+2-3. cgroup.controllers
+
+Read-only "cgroup.controllers" contains space-separated list of
+controllers which can be enabled in the cgroup's
+"cgroup.subtree_control".
+
+In the root cgroup, this lists controllers which are not bound to
+other hierarchies and the content changes as controllers are bound to
+and unbound from other hierarchies.
+
+In non-root cgroups, the content of this file equals that of the
+parent's "cgroup.subtree_control" as only controllers enabled from the
+parent can be used in its children.
+
+
+3. Structural Constraints
+
+3-1. Top-down
+
+As it doesn't make sense to nest control of an uncontrolled resource,
+all non-root "cgroup.subtree_control" can only contain controllers
+which are enabled in the parent's "cgroup.subtree_control".  A
+controller can be enabled only if the parent has the controller
+enabled and a controller can't be disabled if one or more children
+have it enabled.
+
+
+3-2. No internal tasks
+
+One long-standing issue that cgroup faces is the competition between
+tasks belonging to the parent cgroup and its children cgroups.  This
+is inherently nasty as two different types of entities compete and
+there is no agreed-upon obvious way to handle it.  Different
+controllers are doing different things.
+
+cpu considers tasks and cgroups as equivalents and maps nice level to
+cgroup weights.  This works for some cases but falls flat when
+children should be allocated specific ratios of cpu cycles and the
+number of internal tasks fluctuates - the ratios constantly change as
+the number of competing entities fluctuates.  There also are other
+issues.  The mapping from nice level to weight isn't obvious or
+universal, and there are various other knobs which simply aren't
+available for tasks.
+
+blkio implicitly creates a hidden leaf node for each cgroup to host
+the tasks.  The hidden leaf has its own copies of all the knobs with
+"leaf_" prefixed.  While this allows equivalent control over internal
+tasks, it's with serious drawbacks.  It always adds an extra layer of
+nesting which may not be necessary, makes the interface messy and
+significantly complicates the implementation.
+
+memory currently doesn't have a way to control what happens between
+internal tasks and child cgroups and the behavior is not clearly
+defined.  There have been attempts to add ad-hoc behaviors and knobs
+to tailor the behavior to specific workloads.  Continuing this
+direction will lead to problems which will be extremely difficult to
+resolve in the long term.
+
+Multiple controllers struggle with internal tasks and came up with
+different ways to deal with it; unfortunately, all the approaches in
+use now are severely flawed and, furthermore, the widely different
+behaviors make cgroup as whole highly inconsistent.
+
+It is clear that this is something which needs to be addressed from
+cgroup core proper in a uniform way so that controllers don't need to
+worry about it and cgroup as a whole shows a consistent and logical
+behavior.  To achieve that, unified hierarchy enforces the following
+structural constraint.
+
+ Except for the root, only cgroups which don't contain any task may
+ have controllers enabled in "cgroup.subtree_control".
+
+Combined with other properties, this guarantees that, when a
+controller is looking at the part of the hierarchy which has it
+enabled, tasks are always only on the leaves.  This rules out
+situations where child cgroups compete against internal tasks of the
+parent.
+
+There are two things to note.  Firstly, the root cgroup is exempt from
+the restriction.  Root contains tasks and anonymous resource
+consumption which can't be associated with any other cgroup and
+requires special treatment from most controllers.  How resource
+consumption in the root cgroup is governed is upto each controller.
+
+Secondly, the restriction doesn't take effect if there is no enabled
+controller in the cgroup's "cgroup.subtree_control".  This is
+important as otherwise it wouldn't be possible to create children of a
+populated cgroup.  To control resource distribution of a cgroup, the
+cgroup must create children and transfer all its tasks to the children
+before enabling controllers in its "cgroup.subtree_control".
+
+
+4. Other Changes
+
+4-1. [Un]populated Notification
+
+cgroup users often need a way to determine when a cgroup's
+subhierarchy becomes empty so that it can be cleaned up.  cgroup
+currently provides release_agent for it; unfortunately, this mechanism
+is riddled with issues.
+
+- It delivers events by forking and execing a userland binary
+  specified as the release_agent.  This is a long deprecated method of
+  notification delivery.  It's extremely heavy, slow and cumbersome to
+  integrate with larger infrastructure.
+
+- There is single monitoring point at the root.  There's no way to
+  delegate management of subtree.
+
+- The event isn't recursive.  It triggers when a cgroup doesn't have
+  any tasks or child cgroups.  Events for internal nodes trigger only
+  after all children are removed.  This again makes it impossible to
+  delegate management of a subtree.
+
+- Events are filtered from the kernel side.  The "notify_on_release"
+  file is used to subscribe to or suppress release events.  This is
+  unnecessarily complicated and probably done this way because event
+  delivery itself was expensive.
+
+Unified hierarchy implements an interface file,
+"cgroup.subtree_populated", which can be used to monitor whether the
+cgroup's subhierarchy has tasks in it or not.  Its value is 0 if there
+is no task in the cgroup and its descendants; otherwise, 1.  poll and
+[id]notify events are triggered when the value changes.
+
+This is significantly lighter and simpler and trivially allows
+delegating management of a subhierarchy - a subhierarchy manager can
+block further propagation simply by putting itself or another process
+in the root of the subhierarchy, and can monitor the events that it's
+interested in from there without interfering with monitoring higher
+in the tree.
+
+In unified hierarchy, the release_agent mechanism is no longer
+supported and the interface files "release_agent" and
+"notify_on_release" do not exist.
+
+
+4-2. Other Core Changes
+
+- None of the mount options is allowed.
+
+- remount is disallowed.
+
+- rename(2) is disallowed.
+
+- "tasks" is removed.  Everything should be at process granularity.
+  Use "cgroup.procs" instead.
+
+- "cgroup.procs" is not sorted.  pids will be unique unless they got
+  recycled in-between reads.
+
+- "cgroup.clone_children" is removed.
+
+
+4-3. Per-Controller Changes
+
+4-3-1. blkio
+
+- blk-throttle becomes properly hierarchical.
+
+
+4-3-2. cpuset
+
+- Tasks are kept in empty cpusets after hotplug and take on the masks
+  of the nearest non-empty ancestor, instead of being moved to it.
+
+- A task can be moved into an empty cpuset, and again it takes on the
+  masks of the nearest non-empty ancestor.
+
+
+4-3-3. memory
+
+- use_hierarchy is on by default and the cgroup file for the flag is
+  not created.
+
+
+5. Planned Changes
+
+5-1. CAP for resource control
+
+Unified hierarchy will require one of the capabilities(7), which is
+yet to be decided, for all resource control related knobs.  Process
+organization operations - creation of sub-cgroups and migration of
+processes in sub-hierarchies - may be delegated by changing the
+ownership and/or permissions on the cgroup directory and
+"cgroup.procs" interface file; however, all operations which affect
+resource control - writes to "cgroup.subtree_control" or any
+controller-specific knobs - will require an explicit CAP privilege.
+
+This, in part, is to prevent the cgroup interface from being
+inadvertently promoted to a programmable API used by non-privileged
+binaries.  cgroup
+exposes various aspects of the system in ways which aren't properly
+abstracted for direct consumption by regular programs.  This is an
+administration interface much closer to sysctl knobs than system
+calls.  Even the basic access model, being filesystem path based,
+isn't suitable for direct consumption.  There's no way to access "my
+cgroup" in a race-free way or make multiple operations atomic against
+migration to another cgroup.
+
+Another aspect is that, for better or for worse, the cgroup interface
+goes through far less scrutiny than regular interfaces for
+unprivileged userland.  The upside is that cgroup is able to expose
+useful features which may not be suitable for general consumption in
+a reasonable time frame.  It provides a relatively short path between
+internal details and the userland-visible interface.  Of course, this
+shortcut comes with
+high risk.  We go through what we go through for general kernel APIs
+for good reasons.  It may end up leaking internal details in a way
+which can exert significant pain by locking the kernel into a contract
+that can't be maintained in a reasonable manner.
+
+Also, due to their specific nature, cgroup and its controllers don't
+tend to attract attention from a wide range of developers.  cgroup's
+short history is already fraught with severely mis-designed
+interfaces, unnecessary commitment to and exposure of internal
+details, and broken and dangerous implementations of various features.
+
+Keeping cgroup as an administration interface is both advantageous for
+its role and an imperative given its nature.  Some of the cgroup
+features may make sense for unprivileged access.  If deemed justified,
+those must be further abstracted and implemented as a different
+interface, be it a system call or process-private filesystem, and
+survive through the scrutiny that any interface for general
+consumption is required to go through.
+
+Requiring CAP is not a complete solution but should serve as a
+significant deterrent against spraying cgroup usages in non-privileged
+programs.
-- 
1.9.0
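
Illustration for section 4-1 (not part of the patch): a minimal Python
sketch of how a manager might consume "cgroup.subtree_populated".  It
assumes that value changes are reported to poll(2) as POLLPRI/POLLERR,
which is how kernfs-notified files usually behave but is not stated in
the text above, and that cgroup_dir points into a mounted unified
hierarchy.  The function names are hypothetical.

```python
import os
import select


def read_populated(path):
    """Return True if the file holds "1", False if it holds "0"."""
    with open(path) as f:
        value = f.read().strip()
    if value not in ("0", "1"):
        raise ValueError("unexpected content: %r" % value)
    return value == "1"


def wait_until_empty(cgroup_dir, timeout_ms=None):
    """Block until the subhierarchy has no tasks.

    Returns True once the value reads 0, or False if timeout_ms
    elapses first.  timeout_ms=None waits indefinitely.
    """
    path = os.path.join(cgroup_dir, "cgroup.subtree_populated")
    fd = os.open(path, os.O_RDONLY)
    try:
        poller = select.poll()
        # Assumption: changes are signalled as POLLPRI | POLLERR.
        poller.register(fd, select.POLLPRI | select.POLLERR)
        while True:
            # Re-read from the start after every wakeup.
            os.lseek(fd, 0, os.SEEK_SET)
            if os.read(fd, 16).strip() == b"0":
                return True
            if not poller.poll(timeout_ms):
                return False
    finally:
        os.close(fd)
```

A delegated manager could call wait_until_empty() on the root of its
own subhierarchy and remove the cgroups once it returns True, without
involving whatever is monitoring higher in the tree.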