* [Documentation] State of CPU controller in cgroup v2
From: Tejun Heo @ 2016-08-05 17:07 UTC
  To: Linus Torvalds, Andrew Morton, Li Zefan, Johannes Weiner,
	Peter Zijlstra, Paul Turner, Mike Galbraith, Ingo Molnar
  Cc: linux-kernel, cgroups, linux-api, kernel-team

Hello,

There have been several discussions around CPU controller support.
Unfortunately, no consensus was reached and cgroup v2 is sorely
lacking CPU controller support.  This document includes a summary of the
situation and arguments along with an interim solution for parties who
want to use the out-of-tree patches for CPU controller cgroup v2
support.  I'll post the two patches as replies for reference.

Thanks.


CPU Controller on Control Group v2

August, 2016		Tejun Heo <tj@kernel.org>


While most controllers have support for cgroup v2 now, the CPU
controller support is not upstream yet due to objections from the
scheduler maintainers to the basic design of cgroup v2.  This
document explains the current situation as well as an interim
solution, and details the disagreements and arguments.  The latest
version of this document can be found at the following URL.

 https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/tree/Documentation/cgroup-v2-cpu.txt?h=cgroup-v2-cpu


CONTENTS

1. Current Situation and Interim Solution
2. Disagreements and Arguments
  2-1. Contentious Restrictions
    2-1-1. Process Granularity
    2-1-2. No Internal Process Constraint
  2-2. Impact on CPU Controller
    2-2-1. Impact of Process Granularity
    2-2-2. Impact of No Internal Process Constraint
  2-3. Arguments for cgroup v2
3. Way Forward
4. References


1. Current Situation and Interim Solution

All objections from the scheduler maintainers apply to cgroup v2 core
design, and there are no known objections to the specifics of the CPU
controller cgroup v2 interface.  The only blocked part is the set of
changes which expose the CPU controller interface on cgroup v2,
comprising the following two patches:

 [1] sched: Misc preps for cgroup unified hierarchy interface
 [2] sched: Implement interface for cgroup unified hierarchy

The necessary changes are superficial and implement the interface
files on cgroup v2.  The combined diffstat is as follows.

 kernel/sched/core.c    |  149 +++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/cpuacct.c |   57 ++++++++++++------
 kernel/sched/cpuacct.h |    5 +
 3 files changed, 189 insertions(+), 22 deletions(-)

The patches are easy to apply and forward-port.  The following git
branch will always carry the two patches on top of the latest release
of the upstream kernel.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/cgroup-v2-cpu

There are also versioned branches going back to v4.4.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/cgroup-v2-cpu-$KERNEL_VER
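
For example, a tree can pull the patches in as follows (a sketch; the
remote name is arbitrary and $KERNEL_VER is as above):

 $ git remote add tj-cgroup \
       git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
 $ git fetch tj-cgroup
 $ git merge tj-cgroup/cgroup-v2-cpu-$KERNEL_VER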

While it's difficult to tell whether the CPU controller support will
be merged, there are crucial resource control features in cgroup v2
that are only possible due to the design choices that are being
objected to, and every effort will be made to ease enabling the CPU
controller cgroup v2 support out-of-tree for parties which choose to
do so.


2. Disagreements and Arguments

There have been several lengthy discussion threads [3][4] on LKML
around the structural constraints of cgroup v2.  The two that affect
the CPU controller are process granularity and the no-internal-process
constraint.  Both arise primarily from the need for a common resource
domain definition across different resources.

The common resource domain is a powerful concept in cgroup v2 that
allows controllers to make basic assumptions about the structural
organization of processes and controllers inside the cgroup hierarchy,
and thus solve problems spanning multiple types of resources.  The
prime example for this is page cache writeback: dirty page cache is
regulated through throttling buffered writers based on memory
availability, and initiating batched write-outs to the disk based on
IO capacity.  Tracking and controlling writeback inside a cgroup thus
requires the direct cooperation of the memory and the IO controller.

This easily extends to other areas, such as CPU cycles consumed while
performing memory reclaim or IO encryption.
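
As a quick illustration of how a common resource domain is formed (a
sketch; the mount point and cgroup name are assumptions):

 # mount -t cgroup2 none /sys/fs/cgroup
 # echo "+memory +io" > /sys/fs/cgroup/cgroup.subtree_control
 # mkdir /sys/fs/cgroup/workload
 # echo $$ > /sys/fs/cgroup/workload/cgroup.procs

With both controllers enabled on the same hierarchy, dirty page cache
created by processes in "workload" is tracked and written back
according to that cgroup's memory and IO settings.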


2-1. Contentious Restrictions

For controllers of different resources to work together, they must
agree on a common organization.  This uniform model across controllers
imposes two contentious restrictions on the CPU controller: process
granularity and the no-internal-process constraint.


  2-1-1. Process Granularity

  For memory, because an address space is shared between all threads
  of a process, the terminal consumer is a process, not a thread.
  Separating the threads of a single process into different memory
  control domains doesn't make semantic sense.  cgroup v2 ensures
  that all controllers can agree on the same organization by requiring
  that threads of the same process belong to the same cgroup.

  There are other reasons to enforce process granularity.  One
  important one is isolating system-level management operations from
  in-process application operations.  The cgroup interface, being a
  virtual filesystem, is ill-suited to multiple independent operations
  taking place at the same time, as most operations have to be
  multi-step and there is no way to synchronize multiple accessors.
  See also [5] Documentation/cgroup-v2.txt, "R-2. Thread Granularity".
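
  Concretely, migration on cgroup v2 is per-process (a sketch; the PID
  is for illustration):

   # mkdir /sys/fs/cgroup/app
   # echo 1234 > /sys/fs/cgroup/app/cgroup.procs

  Writing the ID of any thread moves all threads of the containing
  process together; there is no per-thread "tasks" file as in cgroup
  v1.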


  2-1-2. No Internal Process Constraint

  cgroup v2 does not allow processes to belong to any cgroup which has
  child cgroups when resource controllers are enabled on it (the
  notable exception being the root cgroup itself).  This is because,
  for some resources, a resource domain (cgroup) is not directly
  comparable to the terminal consumer (process/task) of said resource,
  and so putting the two into a sibling relationship isn't meaningful.
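
  For example, enabling a controller for the children of a cgroup
  which still contains processes is rejected (a sketch; assumes memory
  shows up in the parent's cgroup.controllers and the exact error may
  vary):

   # mkdir -p /sys/fs/cgroup/parent/child
   # echo $$ > /sys/fs/cgroup/parent/cgroup.procs
   # echo "+memory" > /sys/fs/cgroup/parent/cgroup.subtree_control
   sh: write error: Device or resource busy

  The processes have to be moved out first, e.g. into "child".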

  - Differing Control Parameters and Capabilities

    A cgroup controller has different resource control parameters and
    capabilities from a terminal consumer, be that a task or process.
    There are a couple of cases where a cgroup control knob can be
    mapped to a per-task or per-process API, but they are exceptions
    and the mappings aren't obvious even in those cases.

    For example, task priorities (also known as nice values) set
    through setpriority(2) are mapped to the CPU controller
    "cpu.shares" values.  However, how exactly the two ranges map and
    even the fact that they map to each other at all are not obvious.

    The situation gets further muddled when considering other resource
    types and control knobs.  IO priorities set through ioprio_set(2)
    cannot be mapped to IO controller weights, and most cgroup
    resource control knobs, including the bandwidth control knobs of
    the CPU controller, don't have counterparts in the terminal
    consumers.

  - Anonymous Resource Consumption

    For CPU, every time slice consumed from inside a cgroup, which
    comprises most but not all of the CPU time consumed by the cgroup,
    can be clearly attributed to a specific task or process.  Because
    these two types of entities are directly comparable as consumers
    of CPU time, it's theoretically possible to mix tasks and cgroups
    on the same tree levels and let them directly compete for the time
    quota available to their common ancestor.

    However, the same can't be said for resource types like memory or
    IO: the memory consumed by the page cache, for example, can be
    tracked on a per-cgroup level, but due to mismatches in lifetimes
    of involved objects (page cache can persist long after processes
    are gone), shared usages and the implementation overhead of
    tracking persistent state, it can no longer be attributed to
    individual processes after instantiation.  Consequently, any IO
    incurred by page cache writeback can be attributed to a cgroup,
    but not to the individual consumers inside the cgroup.

  For memory and IO, this makes a resource domain (cgroup) an object
  of a fundamentally different type than a terminal consumer
  (process).  A process can't be a first class object in the resource
  distribution graph as its total resource consumption can't be
  described without the containing resource domain.

  Disallowing processes in internal cgroups avoids competition between
  cgroups and processes which cannot be meaningfully defined for these
  resources.  All resource control takes place among cgroups and a
  terminal consumer interacts with the containing cgroup the same way
  it would interact with the system in the absence of cgroups.

  The root cgroup is exempt from this constraint, which is in line
  with how the root cgroup is handled in general - it's excluded from
  cgroup resource accounting and control.
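
  On a booted system, for example, the root cgroup typically carries
  processes while controllers are enabled for its children (a sketch;
  output is illustrative):

   # cat /sys/fs/cgroup/cgroup.subtree_control
   memory io
   # head -2 /sys/fs/cgroup/cgroup.procs
   1
   2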


Enforcing process granularity and the no-internal-process constraint
puts all controllers on the same footing in terms of the resource
distribution hierarchy.


2-2. Impact on CPU Controller

As indicated earlier, the CPU controller's resource distribution graph
is the simplest.  Every schedulable resource consumption can be
attributed to a specific task.  In addition, for weight based control,
the per-task priority set through setpriority(2) can be translated to
and from a per-cgroup weight.  As such, the CPU controller can treat a
task and a cgroup symmetrically, allowing support for any tree layout
of cgroups and tasks.  Both process granularity and the no internal
process constraint restrict how the CPU controller can be used.
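
Patch [2] below, for example, maps the cgroup v2 weight range onto
scheduler shares with plain ratio arithmetic, so that the default
weight of 100 corresponds to 1024 shares (a sketch of the conversion,
rounded to the closest integer):

 $ weight=200     # v2 weight in [1, 10000], default 100
 $ echo $(( (weight * 1024 + 50) / 100 ))
 2048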


  2-2-1. Impact of Process Granularity

  Process granularity prevents tasks belonging to the same process
  from being assigned to different cgroups.  It was pointed out [6]
  that this excludes the valid use case of hierarchical CPU
  distribution within processes.

  To address this issue, the rgroup (resource group) [7][8][9]
  interface, an extension of the existing setpriority(2) API, was
  proposed, which is in line with other programmable priority
  mechanisms and eliminates the risk of in-application configuration
  and system configuration stepping on each other's toes.
  Unfortunately, the proposal quickly turned into discussions around
  cgroup v2 design decisions [4] and no consensus could be reached.


  2-2-2. Impact of No Internal Process Constraint

  The no internal process constraint disallows tasks from competing
  directly against cgroups.  Here is an excerpt from Peter Zijlstra
  pointing out the issue [10] - R, L and A are cgroups; t1, t2, t3 and
  t4 are tasks:


          R
        / | \
       t1 t2 A
           /   \
          t3   t4


    Is fundamentally different from:


               R
             /   \
           L       A
         /   \   /   \
        t1  t2  t3   t4


    Because if in the first hierarchy you add a task (t5) to R, all of
    its A will run at 1/4th of total bandwidth where before it had
    1/3rd, whereas with the second example, if you add our t5 to L, A
    doesn't get any less bandwidth.


  It is true that the trees are semantically different from each other
  and the symmetric handling of tasks and cgroups is aesthetically
  pleasing.  However, it isn't clear what the practical usefulness of
  a layout with direct competition between tasks and cgroups would be,
  considering that the number and behavior of tasks are controlled by
  each application while cgroups primarily deal with system-level
  resource distribution; changes in the number of active threads would
  directly impact resource distribution.  Real world use cases of such
  layouts could not be established during the discussions.


2-3. Arguments for cgroup v2

There are strong demands for comprehensive hierarchical resource
control across all major resources, and establishing a common resource
hierarchy is an essential step.  As with most engineering decisions,
common resource hierarchy definition comes with its trade-offs.  With
cgroup v2, the trade-offs are in the form of structural constraints
which, among others, restrict the CPU controller's space of possible
configurations.

However, even with the restrictions, cgroup v2, in combination with
rgroup, covers most of the identified real-world use cases while
enabling important new use cases of resource control across multiple
resource types that were fundamentally broken previously.

Furthermore, for resource control, treating resource domains as
objects of a different type from terminal consumers has important
advantages - it can account for resource consumption which is not
tied to any specific terminal consumer, be that a task or process, and
allows decoupling resource distribution controls from in-application
APIs.  Even the CPU controller may benefit from it as the kernel can
consume a significant amount of CPU cycles in interrupt context or in
tasks shared across multiple resource domains (e.g. softirq).

Finally, it's important to note that enabling cgroup v2 support for
the CPU controller doesn't block use cases which require features
that are not available on cgroup v2.  Unlikely, but should anybody
actually rely on the CPU controller's symmetric handling of tasks and
cgroups, backward compatibility is and will be maintained by being
able to disconnect the controller from the cgroup v2 hierarchy and use
it standalone.  This also holds for cpuset which is often used in
highly customized configurations which might be a poor fit for common
resource domains.

The required changes are minimal, the benefits for the target use
cases are critical and obvious, and use cases which have to use v1 can
continue to do so.


3. Way Forward

cgroup v2 primarily aims to solve the problem of comprehensive
hierarchical resource control across all major computing resources,
which is one of the core problems of modern server infrastructure
engineering.  The trade-offs that cgroup v2 made are the result of
pursuing that goal and gaining a better understanding of the nature of
resource control in the process.

I believe that real world usages will prove cgroup v2's model right,
considering the crucial pieces of comprehensive resource control that
cannot be implemented without common resource domains.  This is not to
say that cgroup v2 is fixed in stone and can't be updated; if there is
an approach which better serves both comprehensive resource control
and the CPU controller's flexibility, we will surely move towards
that.  It goes without saying that discussions around such an
approach should consider practical aspects of resource control as a
whole rather than focusing solely on a particular controller.

Until such consensus can be reached, the CPU controller cgroup v2
support will be maintained out of the mainline kernel in an easily
accessible form.  If there is anything cgroup developers can do to
ease the pain, please feel free to contact us on the cgroup mailing
list at cgroups@vger.kernel.org.


4. References

[1]  http://lkml.kernel.org/r/20160105164834.GE5995@mtj.duckdns.org
     [PATCH 1/2] sched: Misc preps for cgroup unified hierarchy interface
     Tejun Heo <tj@kernel.org>

[2]  http://lkml.kernel.org/r/20160105164852.GF5995@mtj.duckdns.org
     [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
     Tejun Heo <tj@kernel.org>

[3]  http://lkml.kernel.org/r/1438641689-14655-4-git-send-email-tj@kernel.org
     [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy
     Tejun Heo <tj@kernel.org>

[4]  http://lkml.kernel.org/r/20160407064549.GH3430@twins.programming.kicks-ass.net
     Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
     Peter Zijlstra <peterz@infradead.org>

[5]  https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/cgroup-v2.txt
     Control Group v2
     Tejun Heo <tj@kernel.org>

[6]  http://lkml.kernel.org/r/CAPM31RJNy3jgG=DYe6GO=wyL4BPPxwUm1f2S6YXacQmo7viFZA@mail.gmail.com
     Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy
     Paul Turner <pjt@google.com>

[7]  http://lkml.kernel.org/r/20160105154503.GC5995@mtj.duckdns.org
     [RFD] cgroup: thread granularity support for cpu controller
     Tejun Heo <tj@kernel.org>

[8]  http://lkml.kernel.org/r/1457710888-31182-1-git-send-email-tj@kernel.org
     [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
     Tejun Heo <tj@kernel.org>

[9]  http://lkml.kernel.org/r/20160311160522.GA24046@htj.duckdns.org
     Example program for PRIO_RGRP
     Tejun Heo <tj@kernel.org>

[10] http://lkml.kernel.org/r/20160407082810.GN3430@twins.programming.kicks-ass.net
     Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource
     Peter Zijlstra <peterz@infradead.org>


* [PATCH 1/2] sched: Misc preps for cgroup unified hierarchy interface
From: Tejun Heo @ 2016-08-05 17:09 UTC
  To: Linus Torvalds, Andrew Morton, Li Zefan, Johannes Weiner,
	Peter Zijlstra, Paul Turner, Mike Galbraith, Ingo Molnar
  Cc: linux-kernel, cgroups, linux-api, kernel-team

From 0d966df508ef4d6c0b1baae9e369f4fb0d3e10af Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Fri, 11 Mar 2016 07:31:23 -0500

Make the following changes in preparation for the cpu controller
interface implementation for the unified hierarchy.  This patch
doesn't cause any functional differences.

* s/cpu_stats_show()/cpu_cfs_stats_show()/

* s/cpu_files/cpu_legacy_files/

* Separate out cpuacct_stats_read() from cpuacct_stats_show().  While
  at it, remove pointless cpuacct_stat_desc[] array.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/core.c    |  8 ++++----
 kernel/sched/cpuacct.c | 33 +++++++++++++++------------------
 2 files changed, 19 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 97ee9ac..c148dfe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8482,7 +8482,7 @@ static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota)
 	return ret;
 }
 
-static int cpu_stats_show(struct seq_file *sf, void *v)
+static int cpu_cfs_stats_show(struct seq_file *sf, void *v)
 {
 	struct task_group *tg = css_tg(seq_css(sf));
 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
@@ -8522,7 +8522,7 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
-static struct cftype cpu_files[] = {
+static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
 		.name = "shares",
@@ -8543,7 +8543,7 @@ static struct cftype cpu_files[] = {
 	},
 	{
 		.name = "stat",
-		.seq_show = cpu_stats_show,
+		.seq_show = cpu_cfs_stats_show,
 	},
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -8568,7 +8568,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,
-	.legacy_cftypes	= cpu_files,
+	.legacy_cftypes	= cpu_legacy_files,
 	.early_init	= true,
 };
 
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 41f85c4..3eb9eda 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -242,36 +242,33 @@ static int cpuacct_percpu_seq_show(struct seq_file *m, void *V)
 	return __cpuacct_percpu_seq_show(m, CPUACCT_USAGE_NRUSAGE);
 }
 
-static const char * const cpuacct_stat_desc[] = {
-	[CPUACCT_STAT_USER] = "user",
-	[CPUACCT_STAT_SYSTEM] = "system",
-};
-
-static int cpuacct_stats_show(struct seq_file *sf, void *v)
+static void cpuacct_stats_read(struct cpuacct *ca, u64 *userp, u64 *sysp)
 {
-	struct cpuacct *ca = css_ca(seq_css(sf));
 	int cpu;
-	s64 val = 0;
 
+	*userp = 0;
 	for_each_possible_cpu(cpu) {
 		struct kernel_cpustat *kcpustat = per_cpu_ptr(ca->cpustat, cpu);
-		val += kcpustat->cpustat[CPUTIME_USER];
-		val += kcpustat->cpustat[CPUTIME_NICE];
+		*userp += kcpustat->cpustat[CPUTIME_USER];
+		*userp += kcpustat->cpustat[CPUTIME_NICE];
 	}
-	val = cputime64_to_clock_t(val);
-	seq_printf(sf, "%s %lld\n", cpuacct_stat_desc[CPUACCT_STAT_USER], val);
 
-	val = 0;
+	*sysp = 0;
 	for_each_possible_cpu(cpu) {
 		struct kernel_cpustat *kcpustat = per_cpu_ptr(ca->cpustat, cpu);
-		val += kcpustat->cpustat[CPUTIME_SYSTEM];
-		val += kcpustat->cpustat[CPUTIME_IRQ];
-		val += kcpustat->cpustat[CPUTIME_SOFTIRQ];
+		*sysp += kcpustat->cpustat[CPUTIME_SYSTEM];
+		*sysp += kcpustat->cpustat[CPUTIME_IRQ];
+		*sysp += kcpustat->cpustat[CPUTIME_SOFTIRQ];
 	}
+}
 
-	val = cputime64_to_clock_t(val);
-	seq_printf(sf, "%s %lld\n", cpuacct_stat_desc[CPUACCT_STAT_SYSTEM], val);
+static int cpuacct_stats_show(struct seq_file *sf, void *v)
+{
+	cputime64_t user, sys;
 
+	cpuacct_stats_read(css_ca(seq_css(sf)), &user, &sys);
+	seq_printf(sf, "user %lld\n", cputime64_to_clock_t(user));
+	seq_printf(sf, "system %lld\n", cputime64_to_clock_t(sys));
 	return 0;
 }
 
-- 
2.7.4


* [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
From: Tejun Heo @ 2016-08-05 17:09 UTC
  To: Linus Torvalds, Andrew Morton, Li Zefan, Johannes Weiner,
	Peter Zijlstra, Paul Turner, Mike Galbraith, Ingo Molnar
  Cc: linux-kernel, cgroups, linux-api, kernel-team

From ed6d93036ec930cb774da10b7c87f67905ce71f1 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Fri, 11 Mar 2016 07:31:23 -0500

While the cpu controller doesn't have any functional problems, there
are a couple of interface issues which can be addressed in the v2
interface.

* cpuacct being a separate controller.  This separation is artificial
  and rather pointless as demonstrated by most use cases co-mounting
  the two controllers.  It also forces certain information to be
  accounted twice.

* Use of different time units.  Writable control knobs use
  microseconds, some stat fields use nanoseconds while other cpuacct
  stat fields use centiseconds.

* Control knobs which can't be used in the root cgroup still show up
  in the root.

* Control knob names and semantics aren't consistent with other
  controllers.

This patchset implements the cpu controller's interface on the
unified hierarchy, which adheres to the controller file conventions
described in Documentation/cgroups/unified-hierarchy.txt.  Overall,
the following changes are made.

* cpuacct is implicitly enabled and disabled by cpu and its
  information is reported through "cpu.stat" which now uses
  microseconds for all time durations.  All time duration fields now
  have "_usec" appended to them for clarity.  While this doesn't solve
  the double accounting immediately, once the majority of users switch
  to v2, cpu can directly account and report the relevant stats and
  cpuacct can be disabled on the unified hierarchy.

  Note that cpuacct.usage_percpu is currently not included in
  "cpu.stat".  If this information is actually called for, it can be
  added later.

* "cpu.shares" is replaced with "cpu.weight" and operates on the
  standard scale defined by CGROUP_WEIGHT_MIN/DFL/MAX (1, 100, 10000).
  The weight is scaled to scheduler weight so that 100 maps to 1024
  and the ratio relationship is preserved - if weight is W and its
  scaled value is S, W / 100 == S / 1024.  While the mapped range is a
  bit smaller than the original scheduler weight range, the dead zones
  on both sides are relatively small and cover a wider range than the
  nice value mappings.  This file doesn't make sense in the root
  cgroup and isn't created on root.

* "cpu.cfs_quota_us" and "cpu.cfs_period_us" are replaced by "cpu.max"
  which contains both quota and period.

* "cpu.rt_runtime_us" and "cpu.rt_period_us" are replaced by
  "cpu.rt.max" which contains both runtime and period.

v2: cpu_stats_show() was incorrectly using CONFIG_FAIR_GROUP_SCHED for
    CFS bandwidth stats and also using raw division for u64.  Use
    CONFIG_CFS_BANDWIDTH and do_div() instead.

    The semantics of "cpu.rt.max" are not fully decided yet.  Dropped
    for now.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/core.c    | 141 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/cpuacct.c |  24 +++++++++
 kernel/sched/cpuacct.h |   5 ++
 3 files changed, 170 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c148dfe..7bba2c5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8561,6 +8561,139 @@ static struct cftype cpu_legacy_files[] = {
 	{ }	/* terminate */
 };
 
+static int cpu_stats_show(struct seq_file *sf, void *v)
+{
+	cpuacct_cpu_stats_show(sf);
+
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		struct task_group *tg = css_tg(seq_css(sf));
+		struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+		u64 throttled_usec;
+
+		throttled_usec = cfs_b->throttled_time;
+		do_div(throttled_usec, NSEC_PER_USEC);
+
+		seq_printf(sf, "nr_periods %d\n"
+			   "nr_throttled %d\n"
+			   "throttled_usec %llu\n",
+			   cfs_b->nr_periods, cfs_b->nr_throttled,
+			   throttled_usec);
+	}
+#endif
+	return 0;
+}
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css,
+			       struct cftype *cft)
+{
+	struct task_group *tg = css_tg(css);
+	u64 weight = scale_load_down(tg->shares);
+
+	return DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024);
+}
+
+static int cpu_weight_write_u64(struct cgroup_subsys_state *css,
+				struct cftype *cftype, u64 weight)
+{
+	/*
+	 * cgroup weight knobs should use the common MIN, DFL and MAX
+	 * values which are 1, 100 and 10000 respectively.  While it loses
+	 * a bit of range on both ends, it maps pretty well onto the shares
+	 * value used by scheduler and the round-trip conversions preserve
+	 * the original value over the entire range.
+	 */
+	if (weight < CGROUP_WEIGHT_MIN || weight > CGROUP_WEIGHT_MAX)
+		return -ERANGE;
+
+	weight = DIV_ROUND_CLOSEST_ULL(weight * 1024, CGROUP_WEIGHT_DFL);
+
+	return sched_group_set_shares(css_tg(css), scale_load(weight));
+}
+#endif
+
+static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
+						  long period, long quota)
+{
+	if (quota < 0)
+		seq_puts(sf, "max");
+	else
+		seq_printf(sf, "%ld", quota);
+
+	seq_printf(sf, " %ld\n", period);
+}
+
+/* caller should put the current value in *@periodp before calling */
+static int __maybe_unused cpu_period_quota_parse(char *buf,
+						 u64 *periodp, u64 *quotap)
+{
+	char tok[21];	/* U64_MAX */
+
+	if (!sscanf(buf, "%s %llu", tok, periodp))
+		return -EINVAL;
+
+	*periodp *= NSEC_PER_USEC;
+
+	if (sscanf(tok, "%llu", quotap))
+		*quotap *= NSEC_PER_USEC;
+	else if (!strcmp(tok, "max"))
+		*quotap = RUNTIME_INF;
+	else
+		return -EINVAL;
+
+	return 0;
+}
+
+#ifdef CONFIG_CFS_BANDWIDTH
+static int cpu_max_show(struct seq_file *sf, void *v)
+{
+	struct task_group *tg = css_tg(seq_css(sf));
+
+	cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
+	return 0;
+}
+
+static ssize_t cpu_max_write(struct kernfs_open_file *of,
+			     char *buf, size_t nbytes, loff_t off)
+{
+	struct task_group *tg = css_tg(of_css(of));
+	u64 period = tg_get_cfs_period(tg);
+	u64 quota;
+	int ret;
+
+	ret = cpu_period_quota_parse(buf, &period, &quota);
+	if (!ret)
+		ret = tg_set_cfs_bandwidth(tg, period, quota);
+	return ret ?: nbytes;
+}
+#endif
+
+static struct cftype cpu_files[] = {
+	{
+		.name = "stat",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_stats_show,
+	},
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	{
+		.name = "weight",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_weight_read_u64,
+		.write_u64 = cpu_weight_write_u64,
+	},
+#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.name = "max",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_max_show,
+		.write = cpu_max_write,
+	},
+#endif
+	{ }	/* terminate */
+};
+
 struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
 	.css_released	= cpu_cgroup_css_released,
@@ -8569,7 +8702,15 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,
 	.legacy_cftypes	= cpu_legacy_files,
+	.dfl_cftypes	= cpu_files,
 	.early_init	= true,
+#ifdef CONFIG_CGROUP_CPUACCT
+	/*
+	 * cpuacct is enabled together with cpu on the unified hierarchy
+	 * and its stats are reported through "cpu.stat".
+	 */
+	.depends_on	= 1 << cpuacct_cgrp_id,
+#endif
 };
 
 #endif	/* CONFIG_CGROUP_SCHED */
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 3eb9eda..7a02d26 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -305,6 +305,30 @@ static struct cftype files[] = {
 	{ }	/* terminate */
 };
 
+/* used to print cpuacct stats in cpu.stat on the unified hierarchy */
+void cpuacct_cpu_stats_show(struct seq_file *sf)
+{
+	struct cgroup_subsys_state *css;
+	u64 usage, user, sys;
+
+	css = cgroup_get_e_css(seq_css(sf)->cgroup, &cpuacct_cgrp_subsys);
+
+	usage = cpuusage_read(css, seq_cft(sf));
+	cpuacct_stats_read(css_ca(css), &user, &sys);
+
+	user *= TICK_NSEC;
+	sys *= TICK_NSEC;
+	do_div(usage, NSEC_PER_USEC);
+	do_div(user, NSEC_PER_USEC);
+	do_div(sys, NSEC_PER_USEC);
+
+	seq_printf(sf, "usage_usec %llu\n"
+		   "user_usec %llu\n"
+		   "system_usec %llu\n", usage, user, sys);
+
+	css_put(css);
+}
+
 /*
  * charge this task's execution time to its accounting group.
  *
diff --git a/kernel/sched/cpuacct.h b/kernel/sched/cpuacct.h
index ba72807..ddf7af4 100644
--- a/kernel/sched/cpuacct.h
+++ b/kernel/sched/cpuacct.h
@@ -2,6 +2,7 @@
 
 extern void cpuacct_charge(struct task_struct *tsk, u64 cputime);
 extern void cpuacct_account_field(struct task_struct *tsk, int index, u64 val);
+extern void cpuacct_cpu_stats_show(struct seq_file *sf);
 
 #else
 
@@ -14,4 +15,8 @@ cpuacct_account_field(struct task_struct *tsk, int index, u64 val)
 {
 }
 
+static inline void cpuacct_cpu_stats_show(struct seq_file *sf)
+{
+}
+
 #endif
-- 
2.7.4
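
For reference, a quick usage sketch of the interface files implemented
above (assumes a cgroup v2 mount at /sys/fs/cgroup with cpu enabled;
the cgroup name and stat values are illustrative):

 # cd /sys/fs/cgroup/test
 # echo 200 > cpu.weight           # twice the default weight of 100
 # echo "50000 100000" > cpu.max   # 50ms of quota per 100ms period
 # echo "max 100000" > cpu.max     # remove the bandwidth limit
 # cat cpu.stat
 usage_usec 125000
 user_usec 100000
 system_usec 25000
 nr_periods 0
 nr_throttled 0
 throttled_usec 0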

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
@ 2016-08-05 17:09   ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-08-05 17:09 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Li Zefan, Johannes Weiner,
	Peter Zijlstra, Paul Turner, Mike Galbraith, Ingo Molnar
  Cc: linux-kernel, cgroups, linux-api, kernel-team

From ed6d93036ec930cb774da10b7c87f67905ce71f1 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Fri, 11 Mar 2016 07:31:23 -0500

While the cpu controller doesn't have any functional problems, there
are a couple interface issues which can be addressed in the v2
interface.

* cpuacct being a separate controller.  This separation is artificial
  and rather pointless as demonstrated by most use cases co-mounting
  the two controllers.  It also forces certain information to be
  accounted twice.

* Use of different time units.  Writable control knobs use
  microseconds, some stat fields use nanoseconds while other cpuacct
  stat fields use centiseconds.

* Control knobs which can't be used in the root cgroup still show up
  in the root.

* Control knob names and semantics aren't consistent with other
  controllers.

This patchset implements cpu controller's interface on the unified
hierarchy which adheres to the controller file conventions described
in Documentation/cgroups/unified-hierarchy.txt.  Overall, the
following changes are made.

* cpuacct is implictly enabled and disabled by cpu and its information
  is reported through "cpu.stat" which now uses microseconds for all
  time durations.  All time duration fields now have "_usec" appended
  to them for clarity.  While this doesn't solve the double accounting
  immediately, once majority of users switch to v2, cpu can directly
  account and report the relevant stats and cpuacct can be disabled on
  the unified hierarchy.

  Note that cpuacct.usage_percpu is currently not included in
  "cpu.stat".  If this information is actually called for, it can be
  added later.

* "cpu.shares" is replaced with "cpu.weight" and operates on the
  standard scale defined by CGROUP_WEIGHT_MIN/DFL/MAX (1, 100, 10000).
  The weight is scaled to scheduler weight so that 100 maps to 1024
  and the ratio relationship is preserved - if weight is W and its
  scaled value is S, W / 100 == S / 1024.  While the mapped range is a
  bit smaller than the orignal scheduler weight range, the dead zones
  on both sides are relatively small and covers wider range than the
  nice value mappings.  This file doesn't make sense in the root
  cgroup and isn't create on root.

* "cpu.cfs_quota_us" and "cpu.cfs_period_us" are replaced by "cpu.max"
  which contains both quota and period.

* "cpu.rt_runtime_us" and "cpu.rt_period_us" are replaced by
  "cpu.rt.max" which contains both runtime and period.

v2: cpu_stats_show() was incorrectly using CONFIG_FAIR_GROUP_SCHED for
    CFS bandwidth stats and also using raw division for u64.  Use
    CONFIG_CFS_BANDWITH and do_div() instead.

    The semantics of "cpu.rt.max" is not fully decided yet.  Dropped
    for now.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/core.c    | 141 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/cpuacct.c |  24 +++++++++
 kernel/sched/cpuacct.h |   5 ++
 3 files changed, 170 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c148dfe..7bba2c5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8561,6 +8561,139 @@ static struct cftype cpu_legacy_files[] = {
 	{ }	/* terminate */
 };
 
+static int cpu_stats_show(struct seq_file *sf, void *v)
+{
+	cpuacct_cpu_stats_show(sf);
+
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		struct task_group *tg = css_tg(seq_css(sf));
+		struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+		u64 throttled_usec;
+
+		throttled_usec = cfs_b->throttled_time;
+		do_div(throttled_usec, NSEC_PER_USEC);
+
+		seq_printf(sf, "nr_periods %d\n"
+			   "nr_throttled %d\n"
+			   "throttled_usec %llu\n",
+			   cfs_b->nr_periods, cfs_b->nr_throttled,
+			   throttled_usec);
+	}
+#endif
+	return 0;
+}
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css,
+			       struct cftype *cft)
+{
+	struct task_group *tg = css_tg(css);
+	u64 weight = scale_load_down(tg->shares);
+
+	return DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024);
+}
+
+static int cpu_weight_write_u64(struct cgroup_subsys_state *css,
+				struct cftype *cftype, u64 weight)
+{
+	/*
+	 * cgroup weight knobs should use the common MIN, DFL and MAX
+	 * values which are 1, 100 and 10000 respectively.  While it loses
+	 * a bit of range on both ends, it maps pretty well onto the shares
+	 * value used by scheduler and the round-trip conversions preserve
+	 * the original value over the entire range.
+	 */
+	if (weight < CGROUP_WEIGHT_MIN || weight > CGROUP_WEIGHT_MAX)
+		return -ERANGE;
+
+	weight = DIV_ROUND_CLOSEST_ULL(weight * 1024, CGROUP_WEIGHT_DFL);
+
+	return sched_group_set_shares(css_tg(css), scale_load(weight));
+}
+#endif
+
+static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
+						  long period, long quota)
+{
+	if (quota < 0)
+		seq_puts(sf, "max");
+	else
+		seq_printf(sf, "%ld", quota);
+
+	seq_printf(sf, " %ld\n", period);
+}
+
+/* caller should put the current value in *@periodp before calling */
+static int __maybe_unused cpu_period_quota_parse(char *buf,
+						 u64 *periodp, u64 *quotap)
+{
+	char tok[21];	/* U64_MAX */
+
+	if (sscanf(buf, "%20s %llu", tok, periodp) < 1)
+		return -EINVAL;
+
+	*periodp *= NSEC_PER_USEC;
+
+	if (sscanf(tok, "%llu", quotap))
+		*quotap *= NSEC_PER_USEC;
+	else if (!strcmp(tok, "max"))
+		*quotap = RUNTIME_INF;
+	else
+		return -EINVAL;
+
+	return 0;
+}
+
+#ifdef CONFIG_CFS_BANDWIDTH
+static int cpu_max_show(struct seq_file *sf, void *v)
+{
+	struct task_group *tg = css_tg(seq_css(sf));
+
+	cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
+	return 0;
+}
+
+static ssize_t cpu_max_write(struct kernfs_open_file *of,
+			     char *buf, size_t nbytes, loff_t off)
+{
+	struct task_group *tg = css_tg(of_css(of));
+	u64 period = tg_get_cfs_period(tg);
+	u64 quota;
+	int ret;
+
+	ret = cpu_period_quota_parse(buf, &period, &quota);
+	if (!ret)
+		ret = tg_set_cfs_bandwidth(tg, period, quota);
+	return ret ?: nbytes;
+}
+#endif
+
+static struct cftype cpu_files[] = {
+	{
+		.name = "stat",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_stats_show,
+	},
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	{
+		.name = "weight",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_weight_read_u64,
+		.write_u64 = cpu_weight_write_u64,
+	},
+#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.name = "max",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_max_show,
+		.write = cpu_max_write,
+	},
+#endif
+	{ }	/* terminate */
+};
+
 struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
 	.css_released	= cpu_cgroup_css_released,
@@ -8569,7 +8702,15 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,
 	.legacy_cftypes	= cpu_legacy_files,
+	.dfl_cftypes	= cpu_files,
 	.early_init	= true,
+#ifdef CONFIG_CGROUP_CPUACCT
+	/*
+	 * cpuacct is enabled together with cpu on the unified hierarchy
+	 * and its stats are reported through "cpu.stat".
+	 */
+	.depends_on	= 1 << cpuacct_cgrp_id,
+#endif
 };
 
 #endif	/* CONFIG_CGROUP_SCHED */
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 3eb9eda..7a02d26 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -305,6 +305,30 @@ static struct cftype files[] = {
 	{ }	/* terminate */
 };
 
+/* used to print cpuacct stats in cpu.stat on the unified hierarchy */
+void cpuacct_cpu_stats_show(struct seq_file *sf)
+{
+	struct cgroup_subsys_state *css;
+	u64 usage, user, sys;
+
+	css = cgroup_get_e_css(seq_css(sf)->cgroup, &cpuacct_cgrp_subsys);
+
+	usage = cpuusage_read(css, seq_cft(sf));
+	cpuacct_stats_read(css_ca(css), &user, &sys);
+
+	user *= TICK_NSEC;
+	sys *= TICK_NSEC;
+	do_div(usage, NSEC_PER_USEC);
+	do_div(user, NSEC_PER_USEC);
+	do_div(sys, NSEC_PER_USEC);
+
+	seq_printf(sf, "usage_usec %llu\n"
+		   "user_usec %llu\n"
+		   "system_usec %llu\n", usage, user, sys);
+
+	css_put(css);
+}
+
 /*
  * charge this task's execution time to its accounting group.
  *
diff --git a/kernel/sched/cpuacct.h b/kernel/sched/cpuacct.h
index ba72807..ddf7af4 100644
--- a/kernel/sched/cpuacct.h
+++ b/kernel/sched/cpuacct.h
@@ -2,6 +2,7 @@
 
 extern void cpuacct_charge(struct task_struct *tsk, u64 cputime);
 extern void cpuacct_account_field(struct task_struct *tsk, int index, u64 val);
+extern void cpuacct_cpu_stats_show(struct seq_file *sf);
 
 #else
 
@@ -14,4 +15,8 @@ cpuacct_account_field(struct task_struct *tsk, int index, u64 val)
 {
 }
 
+static inline void cpuacct_cpu_stats_show(struct seq_file *sf)
+{
+}
+
 #endif
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-08-06  9:04   ` Mike Galbraith
  0 siblings, 0 replies; 87+ messages in thread
From: Mike Galbraith @ 2016-08-06  9:04 UTC (permalink / raw)
  To: Tejun Heo, Linus Torvalds, Andrew Morton, Li Zefan,
	Johannes Weiner, Peter Zijlstra, Paul Turner, Ingo Molnar
  Cc: linux-kernel, cgroups, linux-api, kernel-team

On Fri, 2016-08-05 at 13:07 -0400, Tejun Heo wrote:

> 2-2. Impact on CPU Controller
> 
> As indicated earlier, the CPU controller's resource distribution graph
> is the simplest.  Every schedulable resource consumption can be
> attributed to a specific task.  In addition, for weight based control,
> the per-task priority set through setpriority(2) can be translated to
> and from a per-cgroup weight.  As such, the CPU controller can treat a
> task and a cgroup symmetrically, allowing support for any tree layout
> of cgroups and tasks.  Both process granularity and the no internal
> process constraint restrict how the CPU controller can be used.

Not only the cpu controller, but also cpuacct and cpuset.

>   2-2-1. Impact of Process Granularity
> 
>   Process granularity prevents tasks belonging to the same process
>   from being assigned to different cgroups.  It was pointed out [6]
>   that this excludes the valid use case of hierarchical CPU
>   distribution within processes.

Does that not obsolete the rather useful/common concept "thread pool"?

>   2-2-2. Impact of No Internal Process Constraint
> 
>   The no internal process constraint disallows tasks from competing
>   directly against cgroups.  Here is an excerpt from Peter Zijlstra
>   pointing out the issue [10] - R, L and A are cgroups; t1, t2, t3 and
>   t4 are tasks:
> 
> 
>           R
>         / | \
>        t1 t2 A
>            /   \
>           t3   t4
> 
> 
>     Is fundamentally different from:
> 
> 
>                R
>              /   \
>            L       A
>          /   \   /   \
>         t1  t2  t3   t4
> 
> 
>     Because if in the first hierarchy you add a task (t5) to R, all of
>     its A will run at 1/4th of total bandwidth where before it had
>     1/3rd, whereas with the second example, if you add our t5 to L, A
>     doesn't get any less bandwidth.
> 
> 
>   It is true that the trees are semantically different from each other
>   and the symmetric handling of tasks and cgroups is aesthetically
>   pleasing.  However, it isn't clear what the practical usefulness of
>   a layout with direct competition between tasks and cgroups would be,
>   considering that number and behavior of tasks are controlled by each
>   application, and cgroups primarily deal with system level resource
>   distribution; changes in the number of active threads would directly
>   impact resource distribution.  Real world use cases of such layouts
>   could not be established during the discussions.

You apparently intend to ignore any real world usages that don't work
with these new constraints.  Priority and affinity are not process-wide
attributes, never have been, but you're insisting that they must become
so for the sake of progress.

I mentioned a real world case of a thread pool servicing customer
accounts by doing something quite sane: hop into an account (cgroup),
do work therein, send bean count off to the $$ department, wash, rinse
repeat.  That's real world users making real world cash registers go ka
-ching so real world people can pay their real world bills.

I also mentioned breakage to cpusets: given exclusive set A and
exclusive subset B therein, there is one and only one spot where
affinity A exists... at the to be forbidden junction of A and B.

As with the thread pool, process granularity makes it impossible for
any threaded application affinity to be managed via cpusets, such as
say stuffing realtime critical threads into a shielded cpuset, mundane
threads into another.  There are any number of affinity usages that
will break.

Try as I may, I can't see anything progressive about enforcing process
granularity of per thread attributes.  I do see regression potential
for users of these controllers, and no viable means to even report them
as being such.  It will likely be systemd flipping the V2 on switch,
not the kernel, not the user.  Regression reports would thus presumably
be deflected to... those who want this.  Sweet.

	-Mike

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-08-10 22:09     ` Johannes Weiner
  0 siblings, 0 replies; 87+ messages in thread
From: Johannes Weiner @ 2016-08-10 22:09 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Tejun Heo, Linus Torvalds, Andrew Morton, Li Zefan,
	Peter Zijlstra, Paul Turner, Ingo Molnar, linux-kernel, cgroups,
	linux-api, kernel-team

On Sat, Aug 06, 2016 at 11:04:51AM +0200, Mike Galbraith wrote:
> On Fri, 2016-08-05 at 13:07 -0400, Tejun Heo wrote:
> >   It is true that the trees are semantically different from each other
> >   and the symmetric handling of tasks and cgroups is aesthetically
> >   pleasing.  However, it isn't clear what the practical usefulness of
> >   a layout with direct competition between tasks and cgroups would be,
> >   considering that number and behavior of tasks are controlled by each
> >   application, and cgroups primarily deal with system level resource
> >   distribution; changes in the number of active threads would directly
> >   impact resource distribution.  Real world use cases of such layouts
> >   could not be established during the discussions.
> 
> You apparently intend to ignore any real world usages that don't work
> with these new constraints.

He didn't ignore these use cases. He offered alternatives like rgroup
to allow manipulating threads from within the application, only in a
way that does not interfere with cgroup2's common controller model.

The complete lack of cohesiveness between v1 controllers prevents us
from solving even the most fundamental resource control problems that
cloud fleets like Google's and Facebook's are facing, such as
controlling buffered IO; attributing CPU cycles spent receiving
packets, reclaiming memory in kswapd, encrypting the disk; attributing
swap IO etc. That's why cgroup2 runs a tighter ship when it comes to
the controllers: to make something much bigger work.

Agreeing on something - in this case a common controller model - is
necessarily going to take away some flexibility from how you approach
a problem. What matters is whether the problem can still be solved.

This argument that cgroup2 is not backward compatible is laughable. Of
course it's going to be different, otherwise we wouldn't have had to
version it. The question is not whether the exact same configurations
and existing application design can be used in v1 and v2 - that's a
strange onus to put on a versioned interface. The question is whether
you can translate a solution from v1 to v2. Yeah, it might be a hassle
depending on how specialized your setup is, but that's why we keep v1
around until the last user dies and allow you to freely mix and match
v1 and v2 controllers within a single system to ease the transition.

But this distinction between approach and application design, and the
application's actual purpose is crucial. Every time this discussion
came up, somebody says 'moving worker threads between different
resource domains'. That's not a goal, though, that's a very specific
means to an end, with no explanation of why it has to be done that
way. When comparing the cgroup v1 and v2 interface, we should be
discussing goals, not 'this is my favorite way to do it'. If you have
an actual real-world goal that can be accomplished in v1 but not in v2
+ rgroup, then that's what we should be talking about.

Lastly, again - and this was the whole point of this document - the
changes in cgroup2 are not gratuitous. They are driven by fundamental
resource control problems faced by more comprehensive applications of
cgroup. On the other hand, the opposition here mainly seems to be the
inconvenience of switching some specialized setups from a v1-oriented
way of solving a problem to a v2-oriented way.

[ That, and a disturbing number of emotional outbursts against
  systemd, which has nothing to do with any of this. ]

It's a really myopic line of argument.

That being said, let's go through your points:

> Priority and affinity are not process-wide attributes, never have
> been, but you're insisting that they must become so for the sake of
> progress.

Not really.

It's just questionable whether the cgroup interface is the best way to
manipulate these attributes, or whether existing interfaces like
setpriority() and sched_setaffinity() should be extended to manipulate
groups, like the rgroup proposal does. The problems of using the
cgroup interface for this are extensively documented, including in the
email you were replying to.

> I mentioned a real world case of a thread pool servicing customer
> accounts by doing something quite sane: hop into an account (cgroup),
> do work therein, send bean count off to the $$ department, wash, rinse
> repeat.  That's real world users making real world cash registers go ka
> -ching so real world people can pay their real world bills.

Sure, but you're implying that this is the only way to run this real
world cash register. I think it's entirely justified to re-evaluate
this, given the myriad of much more fundamental problems that cgroup2
is solving by building on a common controller model.

I'm not going down the rabbit hole again of arguing against an
incomplete case description. Scale matters. Number of workers
matter. Amount of work each thread does matters to evaluate
transaction overhead. Task migration is an expensive operation etc.

> I also mentioned breakage to cpusets: given exclusive set A and
> exclusive subset B therein, there is one and only one spot where
> affinity A exists... at the to be forbidden junction of A and B.

Again, a means to an end rather than a goal - and a particularly
suspicious one at that: why would a cgroup need to tell its *siblings*
which cpus/nodes it cannot use? In the hierarchical model, it's
clearly the task of the ancestor to allocate the resources downward.

More details would be needed to properly discuss what we are trying to
accomplish here.

> As with the thread pool, process granularity makes it impossible for
> any threaded application affinity to be managed via cpusets, such as
> say stuffing realtime critical threads into a shielded cpuset, mundane
> threads into another.  There are any number of affinity usages that
> will break.

Ditto. It's not obvious why this needs to be the cgroup interface and
couldn't instead be solved with extending sched_setaffinity() - again
weighing that against the power of the common controller model that
could be preserved this way.

> Try as I may, I can't see anything progressive about enforcing process
> granularity of per thread attributes.  I do see regression potential
> for users of these controllers,

I could understand not being entirely happy about the trade-offs if
you look at this from the perspective of a single controller in the
entire resource control subsystem.

But not seeing anything progressive in a common controller model? Have
you read anything we have been writing?

> and no viable means to even report them as being such.  It will
> likely be systemd flipping the V2 on switch, not the kernel, not the
> user.  Regression reports would thus presumably be deflected
> to... those who want this.  Sweet.

There it is...

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-08-11  6:25       ` Mike Galbraith
  0 siblings, 0 replies; 87+ messages in thread
From: Mike Galbraith @ 2016-08-11  6:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, Linus Torvalds, Andrew Morton, Li Zefan,
	Peter Zijlstra, Paul Turner, Ingo Molnar, linux-kernel, cgroups,
	linux-api, kernel-team

On Wed, 2016-08-10 at 18:09 -0400, Johannes Weiner wrote:
> The complete lack of cohesiveness between v1 controllers prevents us
> from solving even the most fundamental resource control problems that
> cloud fleets like Google's and Facebook's are facing, such as
> controlling buffered IO; attributing CPU cycles spent receiving
> packets, reclaiming memory in kswapd, encrypting the disk; attributing
> swap IO etc. That's why cgroup2 runs a tighter ship when it comes to
> the controllers: to make something much bigger work.

Where is the gun wielding thug forcing people to place tasks where v2
now explicitly forbids them?

> Agreeing on something - in this case a common controller model - is
> necessarily going to take away some flexibility from how you approach
> a problem. What matters is whether the problem can still be solved.

What annoys me about this more than the seemingly gratuitous breakage
is that the decision is passed to third parties who have nothing to
lose, and have done quite a bit of breaking lately.

> This argument that cgroup2 is not backward compatible is laughable.

Fine, you're entitled to your sense of humor.  I have one too, I find it
laughable that threaded applications can only sit there like a lump of
mud simply because they share more than applications written as a
gaggle of tasks.  "Threads are like.. so yesterday, the future belongs
to the process" tickles my funny-bone.  Whatever, to each his own.

...

> Lastly, again - and this was the whole point of this document - the
> changes in cgroup2 are not gratuitous. They are driven by fundamental
> resource control problems faced by more comprehensive applications of
> cgroup. On the other hand, the opposition here mainly seems to be the
> inconvenience of switching some specialized setups from a v1-oriented
> way of solving a problem to a v2-oriented way.
> 
> [ That, and a disturbing number of emotional outbursts against
>   systemd, which has nothing to do with any of this. ]
> 
> It's a really myopic line of argument.

And I think the myopia is on the other side of my monitor, whatever. 
 
> That being said, let's go through your points:
> 
> > Priority and affinity are not process wide attributes, never have
> > been, but you're insisting that so they must become for the sake of
> > progress.
> 
> Not really.
> 
> It's just questionable whether the cgroup interface is the best way to
> manipulate these attributes, or whether existing interfaces like
> setpriority() and sched_setaffinity() should be extended to manipulate
> groups, like the rgroup proposal does. The problems of using the
> cgroup interface for this are extensively documented, including in the
> email you were replying to.
> 
> > I mentioned a real world case of a thread pool servicing customer
> > accounts by doing something quite sane: hop into an account (cgroup),
> > do work therein, send bean count off to the $$ department, wash, rinse
> > repeat.  That's real world users making real world cash registers go ka
> > -ching so real world people can pay their real world bills.
> 
> Sure, but you're implying that this is the only way to run this real
> world cash register.

I implied no such thing.  Of course it can be done differently, all
they have to do is rip out these archaic thread thingies.

Apologies for dripping sarcasm all over your monitor, but this annoys
me far more than it should any casual user of cgroups.  Perhaps I
shouldn't care about the users (suse customers) who will step in this
eventually, but I do.

> I'm not going down the rabbit hole again of arguing against an
> incomplete case description. Scale matters. Number of workers
> matters. Amount of work each thread does matters to evaluate
> transaction overhead. Task migration is an expensive operation etc.
> 
> > I also mentioned breakage to cpusets: given exclusive set A and
> > exclusive subset B therein, there is one and only one spot where
> > affinity A exists... at the to be forbidden junction of A and B.
> 
> Again, a means to an end rather than a goal

I don't believe I described a means to an end, I believe I described
affinity bits going missing.

>  - and a particularly
> suspicious one at that: why would a cgroup need to tell its *siblings*
> which cpus/nodes it cannot use? In the hierarchical model, it's
> clearly the task of the ancestor to allocate the resources downward.
> 
> More details would be needed to properly discuss what we are trying to
> accomplish here.
> 
> > As with the thread pool, process granularity makes it impossible for
> > any threaded application affinity to be managed via cpusets, such as
> > say stuffing realtime critical threads into a shielded cpuset, mundane
> > threads into another.  There are any number of affinity usages that
> > will break.
> 
> Ditto. It's not obvious why this needs to be the cgroup interface and
> couldn't instead be solved with extending sched_setaffinity() - again
> weighing that against the power of the common controller model that
> could be preserved this way.

Wow.  Well sure, anything that becomes broken can be replaced by
something else.  Hell, people can just stop using cgroups entirely, and
the way issues become non-issues with the wave of a hand makes me
suspect that some users are going to be forced to do just that.

	-Mike

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-08-12 22:17         ` Johannes Weiner
  0 siblings, 0 replies; 87+ messages in thread
From: Johannes Weiner @ 2016-08-12 22:17 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Tejun Heo, Linus Torvalds, Andrew Morton, Li Zefan,
	Peter Zijlstra, Paul Turner, Ingo Molnar, linux-kernel, cgroups,
	linux-api, kernel-team

On Thu, Aug 11, 2016 at 08:25:06AM +0200, Mike Galbraith wrote:
> On Wed, 2016-08-10 at 18:09 -0400, Johannes Weiner wrote:
> > The complete lack of cohesiveness between v1 controllers prevents us
> > from solving even the most fundamental resource control problems that
> > cloud fleets like Google's and Facebook's are facing, such as
> > controlling buffered IO; attributing CPU cycles spent receiving
> > packets, reclaiming memory in kswapd, encrypting the disk; attributing
> > swap IO etc. That's why cgroup2 runs a tighter ship when it comes to
> > the controllers: to make something much bigger work.
> 
> Where is the gun wielding thug forcing people to place tasks where v2
> now explicitly forbids them?

The problems with supporting this are well-documented. Please see R-2
in Documentation/cgroup-v2.txt.

> > Agreeing on something - in this case a common controller model - is
> > necessarily going to take away some flexibility from how you approach
> > a problem. What matters is whether the problem can still be solved.
> 
> What annoys me about this more than the seemingly gratuitous breakage
> is that the decision is passed to third parties who have nothing to
> lose, and have done quite a bit of breaking lately.

Mike, there is no connection between what you are quoting and what you
are replying to here. We cannot have a technical discussion when you
enter it with your mind fully made up, repeat the same inflammatory
talking points over and over - some of them trivially false, some a
gross misrepresentation of what we have been trying to do - and are
completely unwilling to even entertain the idea that there might be
problems outside of the one-controller-scope you are looking at.

But to address your point: there is no 'breakage' here. Or in your
words: there is no gun wielding thug forcing people to upgrade to
v2. If v1 does everything your specific setup needs, nobody forces you
to upgrade. We are fairly confident that the majority of users *will*
upgrade, simply because v2 solves so many basic resource control
problems that v1 is inherently incapable of solving. There is a
positive incentive, but we are trying not to create negative ones.

And even if you run a systemd distribution, and systemd switches to
v2, it's trivially easy to pry the CPU controller from its hands and
maintain your setup exactly as-is using the current CPU controller.

This is really not a technical argument.

> > This argument that cgroup2 is not backward compatible is laughable.
> 
> Fine, you're entitled to your sense of humor.  I have one too, I find it
> laughable that threaded applications can only sit there like a lump of
> mud simply because they share more than applications written as a
> gaggle of tasks.  "Threads are like.. so yesterday, the future belongs
> to the process" tickles my funny-bone.  Whatever, to each his own.

Who are you quoting here? This is such a grotesque misrepresentation
of what we have been saying and implementing, it's not even funny.

In reality, the rgroup extension for setpriority() was directly based
on your and PeterZ's feedback regarding thread control. Except that,
unlike cgroup1's approach to threads, which might work in some setups
but suffers immensely from the global nature of the vfs interface once
you have to cooperate with other applications and system management*,
rgroup was proposed as a much more generic and robust interface to do
hierarchical resource control from inside the application.

* This doesn't have to be systemd, btw. We have used cgroups to
  isolate system services, maintenance jobs, cron jobs etc. from our
  applications way before systemd, and it's been a pita to coordinate
  the system managing applications and the applications managing their
  workers using the same globally scoped vfs interface.

> > > I mentioned a real world case of a thread pool servicing customer
> > > accounts by doing something quite sane: hop into an account (cgroup),
> > > do work therein, send bean count off to the $$ department, wash, rinse
> > > repeat.  That's real world users making real world cash registers go ka
> > > -ching so real world people can pay their real world bills.
> > 
> > Sure, but you're implying that this is the only way to run this real
> > world cash register.
> 
> I implied no such thing.  Of course it can be done differently, all
> they have to do is rip out these archaic thread thingies.
>
> Apologies for dripping sarcasm all over your monitor, but this annoys
> me far more than it should any casual user of cgroups.  Perhaps I
> shouldn't care about the users (suse customers) who will step in this
> eventually, but I do.

https://yourlogicalfallacyis.com/black-or-white
https://yourlogicalfallacyis.com/strawman
https://yourlogicalfallacyis.com/appeal-to-emotion

Can you please try to stay objective?

> > > As with the thread pool, process granularity makes it impossible for
> > > any threaded application affinity to be managed via cpusets, such as
> > > say stuffing realtime critical threads into a shielded cpuset, mundane
> > > threads into another.  There are any number of affinity usages that
> > > will break.
> > 
> > Ditto. It's not obvious why this needs to be the cgroup interface and
> > couldn't instead be solved with extending sched_setaffinity() - again
> > weighing that against the power of the common controller model that
> > could be preserved this way.
> 
> Wow.  Well sure, anything that becomes broken can be replaced by
> something else.  Hell, people can just stop using cgroups entirely, and
> the way issues become non-issues with the wave of a hand makes me
> suspect that some users are going to be forced to do just that.

We are not the ones doing the handwaving. We have reacted with code
and with repeated attempts to restart a grounded technical discussion
on this issue, and were met time and again with polemics, categorical
dismissal of the problems we are facing in the cloud, and a flat-out
refusal to even consider a different approach to resource control.

It's great that cgroup1 works for some of your customers, and they are
free to keep using it, but there is only so much you can build with a
handful of loose shoestrings, and we are badly hitting the design
limitations of that model. We have tried to work in your direction and
proposed interfaces/processes to support the different things people
are (ab)using cgroup1 for right now, but at some point you have to
acknowledge that cgroup2 is the result of problems we have run into
with cgroup1 and that, consequently, not everything from cgroup1 can
be retained as-is. Only when that happens can we properly discuss
cgroup2's current design choices and whether it could be done better.

Ignoring the real problems that cgroup2 is solving will not remove the
demand for it. It only squanders your chance to help shape it in the
interest of the particular group of users you feel most obligated to.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-08-13  5:08           ` Mike Galbraith
  0 siblings, 0 replies; 87+ messages in thread
From: Mike Galbraith @ 2016-08-13  5:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, Linus Torvalds, Andrew Morton, Li Zefan,
	Peter Zijlstra, Paul Turner, Ingo Molnar, linux-kernel, cgroups,
	linux-api, kernel-team

On Fri, 2016-08-12 at 18:17 -0400, Johannes Weiner wrote:

> > > This argument that cgroup2 is not backward compatible is laughable.
> > 
> > Fine, you're entitled to your sense of humor.  I have one too, I find it
> > laughable that threaded applications can only sit there like a lump of
> > mud simply because they share more than applications written as a
> > gaggle of tasks.  "Threads are like.. so yesterday, the future belongs
> > to the process" tickles my funny-bone.  Whatever, to each his own.
> 
> Who are you quoting here? This is such a grotesque misrepresentation
> of what we have been saying and implementing, it's not even funny.

Agreed, it's not funny to me either.  Excluding threaded applications
from doing.. anything.. implies to me that either someone thinks they
do not need resource management facilities due to some magical property
of threading itself, or someone doesn't realize that an application
thread is a task, i.e. one and the same thing which can be doing one
and the same job.  No matter how I turn it, what I see is nonsense.

> https://yourlogicalfallacyis.com/black-or-white
> https://yourlogicalfallacyis.com/strawman
> https://yourlogicalfallacyis.com/appeal-to-emotion

Nope, plain ole sarcasm, an expression of shock and awe.

> It's great that cgroup1 works for some of your customers, and they are
> free to keep using it.

If no third party can flush my customers' investment down the toilet, I
can cease to care.  Please don't CC me in future, you're unlikely to
convince me that v2 is remotely sane, nor do you need to.  Lucky you. 
 
	-Mike

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-08-16 14:07       ` Peter Zijlstra
  0 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2016-08-16 14:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mike Galbraith, Tejun Heo, Linus Torvalds, Andrew Morton,
	Li Zefan, Paul Turner, Ingo Molnar, linux-kernel, cgroups,
	linux-api, kernel-team

On Wed, Aug 10, 2016 at 06:09:44PM -0400, Johannes Weiner wrote:

> [ That, and a disturbing number of emotional outbursts against
>   systemd, which has nothing to do with any of this. ]

Oh, so I'm entirely dreaming this then:

  https://github.com/systemd/systemd/pull/3905

Completely unrelated.

Also, the argument there seems unfair at best, you don't need cpu-v2 for
buffered write control, you only need memcg and block co-mounted.
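
Roughly, that co-mount is just the following (a sketch; the mount point
is hypothetical, the target directory must already exist, and error
handling is elided):

  /* Co-mount the v1 memory and blkio controllers on one hierarchy. */
  #include <stdio.h>
  #include <sys/mount.h>

  int main(void)
  {
          if (mount("cgroup", "/sys/fs/cgroup/memblk", "cgroup", 0,
                    "memory,blkio"))
                  perror("mount");
          return 0;
  }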

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-08-16 14:58         ` Chris Mason
  0 siblings, 0 replies; 87+ messages in thread
From: Chris Mason @ 2016-08-16 14:58 UTC (permalink / raw)
  To: Peter Zijlstra, Johannes Weiner
  Cc: Mike Galbraith, Tejun Heo, Linus Torvalds, Andrew Morton,
	Li Zefan, Paul Turner, Ingo Molnar, linux-kernel, cgroups,
	linux-api, kernel-team



On 08/16/2016 10:07 AM, Peter Zijlstra wrote:
> On Wed, Aug 10, 2016 at 06:09:44PM -0400, Johannes Weiner wrote:
>
>> [ That, and a disturbing number of emotional outbursts against
>>   systemd, which has nothing to do with any of this. ]
>
> Oh, so I'm entirely dreaming this then:
>
>   https://github.com/systemd/systemd/pull/3905
>
> Completely unrelated.
>
> Also, the argument there seems unfair at best, you don't need cpu-v2 for
> buffered write control, you only need memcg and block co-mounted.
>

This isn't systemd dictating cgroups2 or systemd trying to get rid of 
v1.  But systemd is a common user of cgroups, and we do use it here in 
production.

We're just sending patches upstream for the tools we're using.  It's 
better than keeping them private, or reinventing a completely different 
tool that does almost the same thing.

-chris

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-08-16 16:30         ` Johannes Weiner
  0 siblings, 0 replies; 87+ messages in thread
From: Johannes Weiner @ 2016-08-16 16:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Tejun Heo, Linus Torvalds, Andrew Morton,
	Li Zefan, Paul Turner, Ingo Molnar, linux-kernel, cgroups,
	linux-api, kernel-team

On Tue, Aug 16, 2016 at 04:07:38PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 10, 2016 at 06:09:44PM -0400, Johannes Weiner wrote:
> 
> > [ That, and a disturbing number of emotional outbursts against
> >   systemd, which has nothing to do with any of this. ]
> 
> Oh, so I'm entirely dreaming this then:
> 
>   https://github.com/systemd/systemd/pull/3905
> 
> Completely unrelated.

Yes and no. We certainly do use systemd (kind of hard not to at this
point if you're using any major distribution), and we do feed back the
changes we make to it upstream. But this is updating systemd to work
with the resource control design choices we made in the kernel, not
the other way round.

As I wrote to Mike before, we have been running into these resource
control issues way before systemd, when we used a combination of
libcgroup and custom hacks to coordinate the jobs on the system. The
cgroup2 design choices fell out of experiences with those setups.

Neither the problem statement nor the proposed solutions depend on
systemd, which is why I had hoped we could focus these cgroup2 debates
around the broader resource control issues we are trying to address,
rather than get hung up on one contentious user of the interface.

> Also, the argument there seems unfair at best, you don't need cpu-v2 for
> buffered write control, you only need memcg and block co-mounted.

Yes, memcg and block agreeing is enough for that case. But I mentioned
a whole bunch of these examples, to make the broader case for a common
controller model.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-08-16 21:59         ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-08-16 21:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Johannes Weiner, Mike Galbraith, Linus Torvalds, Andrew Morton,
	Li Zefan, Paul Turner, Ingo Molnar, linux-kernel, cgroups,
	linux-api, kernel-team

Hello, Peter.

On Tue, Aug 16, 2016 at 04:07:38PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 10, 2016 at 06:09:44PM -0400, Johannes Weiner wrote:
> 
> > [ That, and a disturbing number of emotional outbursts against
> >   systemd, which has nothing to do with any of this. ]
> 
> Oh, so I'm entirely dreaming this then:
> 
>   https://github.com/systemd/systemd/pull/3905
> 
> Completely unrelated.

We use CentOS in the fleet and are trying to control resources in the
base system, which of course requires writeback control and thus
cgroup v2.
I'm working to solve the use cases people are facing and systemd is a
piece of the puzzle.  There is no big conspiracy.

As Johannes and Chris already pointed out, systemd is a user of cgroup
v2, a pretty important one at this point.  While I of course care
about it having proper support for cgroup v2, systemd is just
picking up the changes in cgroup v2.  cgroup v2 design wouldn't be
different without systemd.  We'll just have something else playing its
role in resource management.

> Also, the argument there seems unfair at best, you don't need cpu-v2 for
> buffered write control, you only need memcg and block co-mounted.

( Everything I'm gonna write below has already been extensively
  documented in the posted documentation.  I'm gonna repeat the points
  for completeness but if we're gonna start an actually technical
  discussion, let's please start from the documentation instead of
  jumping off of a one-liner and trying to rebuild the entire
  argument each time.

  I'm not sure what exactly you meant by the above sentence; I'm
  assuming you're saying that there are no new capabilities gained by
  the cpu controller being on the v2 hierarchy and thus that the cpu
  controller doesn't need to be on cgroup v2.  If I'm mistaken, please
  let me know. )

Just co-mounting isn't enough as it still leaves the problems with
anonymous consumption, different handling of threads belonging to
different cgroups, and whether it's acceptable to always require blkio
to use the memory controller.  cgroup v2 is what we got after working
through all these issues.
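
For contrast with the v1 co-mount above, a sketch of the v2 side,
assuming cgroup2 is mounted at /sys/fs/cgroup and eliding error
handling: enabling both controllers in the single hierarchy is all the
"agreement" that's needed.

  /* Enable the memory and io controllers together on cgroup v2;
   * writeback control then follows from the shared hierarchy. */
  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          int fd = open("/sys/fs/cgroup/cgroup.subtree_control", O_WRONLY);

          if (fd < 0)
                  return 1;
          write(fd, "+memory +io", strlen("+memory +io"));
          close(fd);
          return 0;
  }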

While it is true that the cpu controller doesn't need to be on cgroup
v2 for writeback control to work, that observation misses the larger
design issues identified during the writeback control work, which
apply just as well to the cpu controller - e.g. accounting cpu cycles
spent on packet reception, memory reclaim, IO encryption and so on.

In addition, it is an unnecessary inconvenience for users who want
writeback control to require the complication of mixed v1 and v2
hierarchies when their requirements can be easily served by v2,
especially considering that the only blocked part is trivial changes
to expose the cpu controller interface on v2 and that enabling it on v2
doesn't preclude it from being used on a v1 hierarchy if necessary.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
  2016-08-16 16:30         ` Johannes Weiner
  (?)
@ 2016-08-17  9:33         ` Mike Galbraith
  -1 siblings, 0 replies; 87+ messages in thread
From: Mike Galbraith @ 2016-08-17  9:33 UTC (permalink / raw)
  To: Johannes Weiner, Peter Zijlstra
  Cc: Tejun Heo, Linus Torvalds, Andrew Morton, Li Zefan, Paul Turner,
	Ingo Molnar, linux-kernel, cgroups, linux-api, kernel-team

On Tue, 2016-08-16 at 12:30 -0400, Johannes Weiner wrote:
> On Tue, Aug 16, 2016 at 04:07:38PM +0200, Peter Zijlstra wrote:

> > Also, the argument there seems unfair at best, you don't need cpu-v2 for
> > buffered write control, you only need memcg and block co-mounted.
> 
> Yes, memcg and block agreeing is enough for that case. But I mentioned
> a whole bunch of these examples, to make the broader case for a common
> controller model.

The core issue I have with that model is that it defines context=mm,
and declares context=task to be invalid, while in reality, both views
are perfectly valid, useful, and in use.  That redefinition of context
is demonstrably harmful when applied to scheduler-related controllers,
rendering a substantial portion of the to-be-managed objects completely
unmanageable.  You (collectively) know that full well.

AFAICT, there is only one viable option, and that is to continue to
allow both.  Whether you like the duality or not (who would), it's
deeply embedded in what's under the controllers, and won't go away.

I'll now go try a little harder while you ponder (or pop) this thought
bubble, see if I can set a new personal best at the art of ignoring. 

(CC did not help btw, your bad if you don't like bubble content)

	-Mike

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
  2016-08-05 17:07 ` Tejun Heo
                   ` (3 preceding siblings ...)
  (?)
@ 2016-08-17 20:18 ` Andy Lutomirski
  2016-08-20 15:56     ` Tejun Heo
  2016-08-21  5:34     ` James Bottomley
  -1 siblings, 2 replies; 87+ messages in thread
From: Andy Lutomirski @ 2016-08-17 20:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Mike Galbraith, linux-kernel, kernel-team,
	open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
	Johannes Weiner, Linus Torvalds

On Aug 5, 2016 7:07 PM, "Tejun Heo" <tj@kernel.org> wrote:
>
> Hello,
>
> There have been several discussions around CPU controller support.
> Unfortunately, no consensus was reached and cgroup v2 is sorely
> lacking CPU controller support.  This document includes summary of the
> situation and arguments along with an interim solution for parties who
> want to use the out-of-tree patches for CPU controller cgroup v2
> support.  I'll post the two patches as replies for reference.
>
> Thanks.
>
>
> CPU Controller on Control Group v2
>
> August, 2016            Tejun Heo <tj@kernel.org>
>
>
> While most controllers have support for cgroup v2 now, the CPU
> controller support is not upstream yet due to objections from the
> scheduler maintainers on the basic designs of cgroup v2.  This
> document explains the current situation as well as an interim
> solution, and details the disagreements and arguments.  The latest
> version of this document can be found at the following URL.
>
>  https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/tree/Documentation/cgroup-v2-cpu.txt?h=cgroup-v2-cpu
>
>
> CONTENTS
>
> 1. Current Situation and Interim Solution
> 2. Disagreements and Arguments
>   2-1. Contentious Restrictions
>     2-1-1. Process Granularity
>     2-1-2. No Internal Process Constraint
>   2-2. Impact on CPU Controller
>     2-2-1. Impact of Process Granularity
>     2-2-2. Impact of No Internal Process Constraint
>   2-3. Arguments for cgroup v2
> 3. Way Forward
> 4. References
>
>
> 1. Current Situation and Interim Solution
>
> All objections from the scheduler maintainers apply to cgroup v2 core
> design, and there are no known objections to the specifics of the CPU
> controller cgroup v2 interface.  The only blocked part is changes to
> expose the CPU controller interface on cgroup v2, which comprises the
> following two patches:
>
>  [1] sched: Misc preps for cgroup unified hierarchy interface
>  [2] sched: Implement interface for cgroup unified hierarchy
>
> The necessary changes are superficial and implement the interface
> files on cgroup v2.  The combined diffstat is as follows.
>
>  kernel/sched/core.c    |  149 +++++++++++++++++++++++++++++++++++++++++++++++--
>  kernel/sched/cpuacct.c |   57 ++++++++++++------
>  kernel/sched/cpuacct.h |    5 +
>  3 files changed, 189 insertions(+), 22 deletions(-)
>
> The patches are easy to apply and forward-port.  The following git
> branch will always carry the two patches on top of the latest release
> of the upstream kernel.
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/cgroup-v2-cpu
>
> There also are versioned branches going back to v4.4.
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/cgroup-v2-cpu-$KERNEL_VER
>
> While it's difficult to tell whether the CPU controller support will
> be merged, there are crucial resource control features in cgroup v2
> that are only possible due to the design choices that are being
> objected to, and every effort will be made to ease enabling the CPU
> controller cgroup v2 support out-of-tree for parties which choose to.
>
>
> 2. Disagreements and Arguments
>
> There have been several lengthy discussion threads [3][4] on LKML
> around the structural constraints of cgroup v2.  The two that affect
> the CPU controller are process granularity and no internal process
> constraint.  Both arise primarily from the need for common resource
> domain definition across different resources.
>
> The common resource domain is a powerful concept in cgroup v2 that
> allows controllers to make basic assumptions about the structural
> organization of processes and controllers inside the cgroup hierarchy,
> and thus solve problems spanning multiple types of resources.  The
> prime example for this is page cache writeback: dirty page cache is
> regulated through throttling buffered writers based on memory
> availability, and initiating batched write outs to the disk based on
> IO capacity.  Tracking and controlling writeback inside a cgroup thus
> requires the direct cooperation of the memory and the IO controller.
>
> This easily extends to other areas, such as CPU cycles consumed while
> performing memory reclaim or IO encryption.
>
>
> 2-1. Contentious Restrictions
>
> For controllers of different resources to work together, they must
> agree on a common organization.  This uniform model across controllers
> imposes two contentious restrictions on the CPU controller: process
> granularity and the no-internal-process constraint.
>
>
>   2-1-1. Process Granularity
>
>   For memory, because an address space is shared between all threads
>   of a process, the terminal consumer is a process, not a thread.
>   Separating the threads of a single process into different memory
>   control domains doesn't make semantical sense.  cgroup v2 ensures
>   that all controller can agree on the same organization by requiring
>   that threads of the same process belong to the same cgroup.

I haven't followed all of the history here, but it seems to me that
this argument is less accurate than it appears.  Linux, for better or
for worse, has somewhat orthogonal concepts of thread groups
(processes), mms, and file tables.  An mm has VMAs in it, and VMAs can
reference things (files, etc) that hold resources.  (Two mms can share
resources by mapping the same thing or using fork().)  File tables
hold files, and files can use resources.  Both of these are, at best,
moderately good approximations of what actually holds resources.
Meanwhile, threads (tasks) do syscalls, take page faults, *allocate*
resources, etc.

So I think it's not really true to say that the "terminal consumer" of
anything is a process, not a thread.

While it's certainly easier to think about assigning processes to
cgroups, and I certainly agree that, in the common case, it's the
right thing to do, I don't see why requiring it is a good idea.  Can
we turn this around: what actually goes wrong if cgroup v2 were to
allow assigning individual threads if a user specifically requests it?

>
>   There are other reasons to enforce process granularity.  One
>   important one is isolating system-level management operations from
>   in-process application operations.  The cgroup interface, being a
>   virtual filesystem, is very unfit for multiple independent
>   operations taking place at the same time as most operations have to
>   be multi-step and there is no way to synchronize multiple accessors.
>   See also [5] Documentation/cgroup-v2.txt, "R-2. Thread Granularity"

I don't buy this argument at all.  System-level code is likely to
assign single process *trees*, which are a different beast entirely.
I.e. you fork, move the child into a cgroup, and that child and its
children stay in that cgroup.  I don't see how the thread/process
distinction matters.

On the contrary: with cgroup namespaces, one could easily create a
cgroup namespace, shove a process in it, and let that process delegate
its threads to child cgroups however it likes.  (Well, children of the
namespace root.)
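
(A sketch of that setup; CLONE_NEWCGROUP is Linux 4.6+, the mount point
and privileges are assumptions, and per-thread delegation is of course
the hypothetical capability under discussion:)

  /* Enter a cgroup namespace so the current cgroup becomes the
   * apparent root, then remount to see the namespaced view. */
  #define _GNU_SOURCE
  #include <sched.h>
  #include <sys/mount.h>
  #include <sys/stat.h>

  int main(void)
  {
          if (unshare(CLONE_NEWCGROUP | CLONE_NEWNS))
                  return 1;
          /* A fresh cgroup2 mount reflects the namespace root; the
           * mount point is hypothetical. */
          if (mount("cgroup2", "/mnt", "cgroup2", 0, NULL))
                  return 1;
          mkdir("/mnt/worker", 0755);  /* child of the apparent root */
          return 0;
  }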

>
>
>   2-1-2. No Internal Process Constraint
>
>   cgroup v2 does not allow processes to belong to any cgroup which has
>   child cgroups when resource controllers are enabled on it (the
>   notable exception being the root cgroup itself).

Can you elaborate on this exception?  How do you get any of the
supposed benefits of not having processes and cgroups exist as
siblings when you make an exception for the root?  Similarly, if you
make an exception for the root, what do you do about cgroup namespaces
where the apparent root isn't the global root?

--Andy

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-08-20 15:56     ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-08-20 15:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Mike Galbraith, linux-kernel, kernel-team,
	open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
	Johannes Weiner, Linus Torvalds

Hello, Andy.

On Wed, Aug 17, 2016 at 01:18:24PM -0700, Andy Lutomirski wrote:
> >   2-1-1. Process Granularity
> >
> >   For memory, because an address space is shared between all threads
> >   of a process, the terminal consumer is a process, not a thread.
> >   Separating the threads of a single process into different memory
> >   control domains doesn't make semantical sense.  cgroup v2 ensures
> >   that all controller can agree on the same organization by requiring
> >   that threads of the same process belong to the same cgroup.
> 
> I haven't followed all of the history here, but it seems to me that
> this argument is less accurate than it appears.  Linux, for better or
> for worse, has somewhat orthogonal concepts of thread groups
> (processes), mms, and file tables.  An mm has VMAs in it, and VMAs can
> reference things (files, etc) that hold resources.  (Two mms can share
> resources by mapping the same thing or using fork().)  File tables
> hold files, and files can use resources.  Both of these are, at best,
> moderately good approximations of what actually holds resources.
> Meanwhile, threads (tasks) do syscalls, take page faults, *allocate*
> resources, etc.
> 
> So I think it's not really true to say that the "terminal consumer" of
> anything is a process, not a thread.

The terminal consumer is actually the mm context.  A task may be the
allocating entity but not always for itself.

This becomes clear whenever an entity is allocating memory on behalf
of someone else - get_user_pages(), khugepaged, swapoff and so on (and
likely userfaultfd too).  When a task is trying to add a page to a
VMA, the task might not have any relationship with the VMA other than
that it's operating on it for someone else.  The page has to be
charged to whoever is responsible for the VMA and the only ownership
which can be established is the containing mm_struct.

While a mm_struct technically may not map to a process, it is a very
close approximation which is hardly ever broken in practice.
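
A userspace sketch of the same point (the function and its arguments
are illustrative; ptrace permission on the target is required):

  /* process_vm_readv() faults pages into the *target's* mm (via
   * get_user_pages() in the kernel), so the caller is merely the
   * allocating entity acting on someone else's behalf. */
  #define _GNU_SOURCE
  #include <sys/types.h>
  #include <sys/uio.h>

  long peek(pid_t pid, void *remote_addr)
  {
          char buf[4096];
          struct iovec local  = { .iov_base = buf,         .iov_len = sizeof(buf) };
          struct iovec remote = { .iov_base = remote_addr, .iov_len = sizeof(buf) };

          return process_vm_readv(pid, &local, 1, &remote, 1, 0);
  }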

> While it's certainly easier to think about assigning processes to
> cgroups, and I certainly agree that, in the common case, it's the
> right thing to do, I don't see why requiring it is a good idea.  Can
> we turn this around: what actually goes wrong if cgroup v2 were to
> allow assigning individual threads if a user specifically requests it?

Consider the scenario where you have somebody faulting on behalf of a
foreign VMA, but the thread who created and is actively using that VMA
is in a different cgroup than the process leader.  Who are we going to
charge?  All possible answers seem erratic.

Please note that I agree that thread granularity can be useful for
some resources; however, my points are 1. it should be scoped so that
the resource distribution tree as a whole can be shared across
different resources, and, 2. the cgroup filesystem interface isn't a good
interface for the purpose.  I'll continue the second point below.

> >   there are other reasons to enforce process granularity.  One
> >   important one is isolating system-level management operations from
> >   in-process application operations.  The cgroup interface, being a
> >   virtual filesystem, is very unfit for multiple independent
> >   operations taking place at the same time as most operations have to
> >   be multi-step and there is no way to synchronize multiple accessors.
> >   See also [5] Documentation/cgroup-v2.txt, "R-2. Thread Granularity"
> 
> I don't buy this argument at all.  System-level code is likely to
> assign single process *trees*, which are a different beast entirely.
> I.e. you fork, move the child into a cgroup, and that child and its
> children stay in that cgroup.  I don't see how the thread/process
> distinction matters.

Good point on the multi-process issue; this is something which nagged
me a bit while working on rgroup, although I have to point out that
the issue here is one of not going far enough rather than the approach
being wrong.  There are limitations to scoping it to individual
processes but that doesn't negate the underlying problem or the
usefulness of in-process control.

For system-level and process-level operations to not step on each
other's toes, they need to agree on the granularity boundary -
system-level should be able to treat an application hierarchy as a
single unit.  A possible solution is allowing rgroup hierarchies to
span across process boundaries and implementing cgroup migration
operations which treat such hierarchies as a single unit.  I'm not yet
sure whether the boundary should be at program groups or rgroups.

> On the contrary: with cgroup namespaces, one could easily create a
> cgroup namespace, shove a process in it, and let that process delegate
> its threads to child cgroups however it likes.  (Well, children of the
> namespace root.)

cgroup namespace solves just one piece of the whole problem and not in
a very robust way.  It's okay for containers but not so for individual
applications.

* Using a namespace is neither trivial nor dependable.  It requires
  explicit mount setups, and, more importantly, an application can't
  rely on a specific namespace setup being there, unlike a
  setpriority() extension.  This affects application designs in the
  first place and severely hampers the accessibility and thus
  usefulness of in-application resource control.

* While it makes the names consistent from inside, it doesn't solve
  the atomicity issues when system and application operate on the
  subtree concurrently.

  Imagine a system-level operation trying to relocate the namespace.
  The symbolic names can be made to stay the same before and after,
  but that's about it.  During migration, depending on how migration
  is implemented, some may see paths linking back to the old or the
  new location.  Even the open files for the filesystem knobs wouldn't
  work after such a migration.

* It's just a bad interface if one has to use setpriority(2) to set a
  thread priority but must resort to opening a file, parsing a path,
  opening another file, and writing a number string in a completely
  different value range to do the same for thread groups (sketched
  below).
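
A sketch of that asymmetry (the cgroup path and weight value are
illustrative; error handling is elided):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <string.h>
  #include <sys/resource.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  int main(void)
  {
          /* Per-thread: a single syscall, nice values -20..19. */
          setpriority(PRIO_PROCESS, (id_t)syscall(SYS_gettid), -10);

          /* Per thread-group: discover a path, open a knob file and
           * write a weight in the unrelated range 1..10000. */
          int fd = open("/sys/fs/cgroup/app/cpu.weight", O_WRONLY);
          if (fd >= 0) {
                  write(fd, "5000", 4);
                  close(fd);
          }
          return 0;
  }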

> >   2-1-2. No Internal Process Constraint
> >
> >   cgroup v2 does not allow processes to belong to any cgroup which has
> >   child cgroups when resource controllers are enabled on it (the
> >   notable exception being the root cgroup itself).
> 
> Can you elaborate on this exception?  How do you get any of the
> supposed benefits of not having processes and cgroups exist as
> siblings when you make an exception for the root?  Similarly, if you
> make an exception for the root, what do you do about cgroup namespaces
> where the apparent root isn't the global root?

Having a special case doesn't necessarily get in the way of benefiting
from a set of general rules.  The root cgroup is inherently special as
it has to be the catch-all scope for entities and resource
consumptions which can't be tied to any specific consumer - irq
handling, packet rx, journal writes, memory reclaim from global memory
pressure and so on.  None of the sub-cgroups have to worry about them.

These base-system operations are special regardless of cgroup and we
already have sometimes crude ways to affect their behaviors where
necessary through sysctl knobs, priorities on specific kernel threads
and so on.  cgroup doesn't change the situation all that much.  What
gets left in the root cgroup usually are the base-system operations
which are outside the scope of cgroup resource control in the first
place, and the cgroup resource graph can treat the root as an opaque
point.

There can be other ways to deal with the issue; however, treating root
cgroup this way has the big advantage of minimizing the gap between
configurations without and with cgroups both in terms of mental model
and implementation.

Hopefully, the case of a namespace root is clear now.  If it's gonna
have a sub-hierarchy, it itself can't contain processes, but the system
root just contains base-system entities and resources which a
namespace root doesn't have to worry about.  Ignoring base-system
stuff, a namespace root is topologically in the same position as the
system root in the cgroup resource graph.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
  2016-08-20 15:56     ` Tejun Heo
  (?)
@ 2016-08-20 18:45     ` Andy Lutomirski
  2016-08-29 22:20         ` Tejun Heo
  -1 siblings, 1 reply; 87+ messages in thread
From: Andy Lutomirski @ 2016-08-20 18:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Mike Galbraith, linux-kernel, kernel-team,
	open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
	Johannes Weiner, Linus Torvalds

On Sat, Aug 20, 2016 at 8:56 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Andy.
>
> On Wed, Aug 17, 2016 at 01:18:24PM -0700, Andy Lutomirski wrote:
>> >   2-1-1. Process Granularity
>> >
>> >   For memory, because an address space is shared between all threads
>> >   of a process, the terminal consumer is a process, not a thread.
>> >   Separating the threads of a single process into different memory
>> >   control domains doesn't make semantical sense.  cgroup v2 ensures
>> >   that all controller can agree on the same organization by requiring
>> >   that threads of the same process belong to the same cgroup.
>>
>> I haven't followed all of the history here, but it seems to me that
>> this argument is less accurate than it appears.  Linux, for better or
>> for worse, has somewhat orthogonal concepts of thread groups
>> (processes), mms, and file tables.  An mm has VMAs in it, and VMAs can
>> reference things (files, etc) that hold resources.  (Two mms can share
>> resources by mapping the same thing or using fork().)  File tables
>> hold files, and files can use resources.  Both of these are, at best,
>> moderately good approximations of what actually holds resources.
>> Meanwhile, threads (tasks) do syscalls, take page faults, *allocate*
>> resources, etc.
>>
>> So I think it's not really true to say that the "terminal consumer" of
>> anything is a process, not a thread.
>
> The terminal consumer is actually the mm context.  A task may be the
> allocating entity but not always for itself.
>
> This becomes clear whenever an entity is allocating memory on behalf
> of someone else - get_user_pages(), khugepaged, swapoff and so on (and
> likely userfaultfd too).  When a task is trying to add a page to a
> VMA, the task might not have any relationship with the VMA other than
> that it's operating on it for someone else.  The page has to be
> charged to whoever is responsible for the VMA and the only ownership
> which can be established is the containing mm_struct.

This surprises me a bit.  If I do access_process_vm(), then I would
have expected the charge to go to the caller, not the mm being accessed.

What happens if a program calls read(2), though?  A page may be
inserted into page cache on behalf of an address_space without any
particular mm being involved.  There will usually be a calling task,
though.

But this is all very memcg-specific.  What about other cgroups?  I/O
is per-task, right?  Scheduling is definitely per-task.

>
> While a mm_struct technically may not map to a process, it is a very
> close approxmiation which is hardly ever broken in practice.
>
>> While it's certainly easier to think about assigning processes to
>> cgroups, and I certainly agree that, in the common case, it's the
>> right thing to do, I don't see why requiring it is a good idea.  Can
>> we turn this around: what actually goes wrong if cgroup v2 were to
>> allow assigning individual threads if a user specifically requests it?
>
> Consider the scenario where you have somebody faulting on behalf of a
> foreign VMA, but the thread who created and is actively using that VMA
> is in a different cgroup than the process leader.  Who are we going to
> charge?  All possible answers seem erratic.
>

Indeed, and this problem is probably not solvable in practice unless
you charge all involved cgroups.  But the caller's *mm* is entirely
irrelevant here, so I don't see how this implies that cgroups need to
keep tasks in the same process together.  The relevant entities are
the calling *task* and the target mm, and you're going to be
hard-pressed to ensure that they belong to the same cgroup, so I think
you need to be able to handle weird cases in which there isn't an
obviously correct cgroup to charge.

>> >   there are other reasons to enforce process granularity.  One
>> >   important one is isolating system-level management operations from
>> >   in-process application operations.  The cgroup interface, being a
>> >   virtual filesystem, is very unfit for multiple independent
>> >   operations taking place at the same time as most operations have to
>> >   be multi-step and there is no way to synchronize multiple accessors.
>> >   See also [5] Documentation/cgroup-v2.txt, "R-2. Thread Granularity"
>>
>> I don't buy this argument at all.  System-level code is likely to
>> assign single process *trees*, which are a different beast entirely.
>> I.e. you fork, move the child into a cgroup, and that child and its
>> children stay in that cgroup.  I don't see how the thread/process
>> distinction matters.
>
> Good point on the multi-process issue, this is something which nagged
> me a bit while working on rgroup, although I have to point out that
> the issue here is one of not going far enough rather than the approach
> being wrong.  There are limitations to scoping it to individual
> processes but that doesn't negate the underlying problem or the
> usefulness of in-process control.
>
> For system-level and process-level operations to not step on each
> other's toes, they need to agree on the granularity boundary -
> system-level should be able to treat an application hierarchy as a
> single unit.  A possible solution is allowing rgroup hirearchies to
> span across process boundaries and implementing cgroup migration
> operations which treat such hierarchies as a single unit.  I'm not yet
> sure whether the boundary should be at program groups or rgroups.

I think that, if the system cgroup manager is moving processes around
after starting them and execing the final binary, there will be races
and confusion, and no amount of granularity fiddling will fix that.

I know nothing about rgroups.  Are they upstream?


>
>> >   2-1-2. No Internal Process Constraint
>> >
>> >   cgroup v2 does not allow processes to belong to any cgroup which has
>> >   child cgroups when resource controllers are enabled on it (the
>> >   notable exception being the root cgroup itself).
>>
>> Can you elaborate on this exception?  How do you get any of the
>> supposed benefits of not having processes and cgroups exist as
>> siblings when you make an exception for the root?  Similarly, if you
>> make an exception for the root, what do you do about cgroup namespaces
>> where the apparent root isn't the global root?
>
> Having a special case doesn't necessarily get in the way of benefiting
> from a set of general rules.  The root cgroup is inherently special as
> it has to be the catch-all scope for entities and resource
> consumptions which can't be tied to any specific consumer - irq
> handling, packet rx, journal writes, memory reclaim from global memory
> pressure and so on.  None of sub-cgroups have to worry about them.
>
> These base-system operations are special regardless of cgroup and we
> already have sometimes crude ways to affect their behaviors where
> necessary through sysctl knobs, priorities on specific kernel threads
> and so on.  cgroup doesn't change the situation all that much.  What
> gets left in the root cgroup usually are the base-system operations
> which are outside the scope of cgroup resource control in the first
> place and cgroup resource graph can treat the root as an opaque anchor
> point.

This seems to explain why the controllers need to be able to handle
things being charged to the root cgroup (or to an unidentifiable
cgroup, anyway).  That isn't quite the same thing as allowing, from an
ABI point of view, the root cgroup to contain processes and cgroups
but not allowing other cgroups to do the same thing.  Consider:
suppose that systemd (or some competing cgroup manager) is designed to
run in the root cgroup namespace.  It presumably expects *itself* to
be in the root cgroup.  Now try to run it using cgroups v2 in a
non-root namespace.  I don't see how it can possibly work if the
hierarchy constraints don't permit it to create sub-cgroups while it's
still in the root.  In fact, this seems impossible to fix even with
user code changes.  The manager would need to simultaneously create a
new child cgroup to contain itself and assign itself to that child
cgroup, because the intermediate state is illegal.
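
(A sketch of those two steps; the paths are hypothetical:)

  /* Create a child cgroup, then move ourselves into it.  Between
   * the two calls the manager is a process sitting beside a child
   * cgroup - exactly the state the no-internal-process rule
   * forbids. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(void)
  {
          char pid[16];
          int fd;

          mkdir("/sys/fs/cgroup/manager", 0755);          /* step 1 */

          fd = open("/sys/fs/cgroup/manager/cgroup.procs", O_WRONLY);
          if (fd < 0)
                  return 1;
          snprintf(pid, sizeof(pid), "%d", (int)getpid());
          write(fd, pid, strlen(pid));                    /* step 2 */
          close(fd);
          return 0;
  }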

I really, really think that cgroup v2 should supply the same
*interface* inside and outside of a non-root namespace.  If this is
impossible due to ABI compatibility, then you could, in the worst
case, introduce cgroup v3, fix it there, and remove cgroup v2, since
apparently cgroup v2 isn't in use right now in mainline kernels.  (To
be clear, I think either decision -- allowing tasks and cgroups to be
siblings or disallowing it -- is okay, but I think that the interface
should apply the same constraint at all levels.)

--Andy

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-08-21  5:34     ` James Bottomley
  0 siblings, 0 replies; 87+ messages in thread
From: James Bottomley @ 2016-08-21  5:34 UTC (permalink / raw)
  To: Andy Lutomirski, Tejun Heo
  Cc: Ingo Molnar, Mike Galbraith, linux-kernel, kernel-team,
	open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
	Johannes Weiner, Linus Torvalds

On Wed, 2016-08-17 at 13:18 -0700, Andy Lutomirski wrote:
> On Aug 5, 2016 7:07 PM, "Tejun Heo" <tj@kernel.org> wrote:
[...]
> > 2. Disagreements and Arguments
> > 
> > There have been several lengthy discussion threads [3][4] on LKML
> > around the structural constraints of cgroup v2.  The two that 
> > affect the CPU controller are process granularity and no internal 
> > process constraint.  Both arise primarily from the need for common 
> > resource domain definition across different resources.
> > 
> > The common resource domain is a powerful concept in cgroup v2 that
> > allows controllers to make basic assumptions about the structural
> > organization of processes and controllers inside the cgroup 
> > hierarchy, and thus solve problems spanning multiple types of 
> > resources.  The prime example for this is page cache writeback: 
> > dirty page cache is regulated through throttling buffered writers 
> > based on memory availability, and initiating batched write outs to 
> > the disk based on IO capacity.  Tracking and controlling writeback 
> > inside a cgroup thus requires the direct cooperation of the memory 
> > and the IO controller.
> > 
> > This easily extends to other areas, such as CPU cycles consumed 
> > while performing memory reclaim or IO encryption.
> > 
> > 
> > 2-1. Contentious Restrictions
> > 
> > For controllers of different resources to work together, they must
> > agree on a common organization.  This uniform model across 
> > controllers imposes two contentious restrictions on the CPU 
> > controller: process granularity and the no-internal-process
> > constraint.
> > 
> > 
> >   2-1-1. Process Granularity
> > 
> >   For memory, because an address space is shared between all
> > threads
> >   of a process, the terminal consumer is a process, not a thread.
> >   Separating the threads of a single process into different memory
> >   control domains doesn't make semantical sense.  cgroup v2 ensures
> >   that all controller can agree on the same organization by
> > requiring
> >   that threads of the same process belong to the same cgroup.
> 
> I haven't followed all of the history here, but it seems to me that
> this argument is less accurate than it appears.  Linux, for better or
> for worse, has somewhat orthogonal concepts of thread groups
> (processes), mms, and file tables.  An mm has VMAs in it, and VMAs 
> can reference things (files, etc) that hold resources.  (Two mms can
> share resources by mapping the same thing or using fork().)  File 
> tables hold files, and files can use resources.  Both of these are, 
> at best, moderately good approximations of what actually holds 
> resources. Meanwhile, threads (tasks) do syscalls, take page faults, 
> *allocate* resources, etc.
> 
> So I think it's not really true to say that the "terminal consumer" 
> of anything is a process, not a thread.
> 
> While it's certainly easier to think about assigning processes to
> cgroups, and I certainly agree that, in the common case, it's the
> right thing to do, I don't see why requiring it is a good idea.  Can
> we turn this around: what actually goes wrong if cgroup v2 were to
> allow assigning individual threads if a user specifically requests
> it?

A similar point from a different consumer: from the unprivileged
containers point of view, I'm interested in a thread-based interface as
well.  The principal utility of unprivileged containers is to allow
applications that wish to use container properties to do so
(effectively to become self-containerising).  Some that use the
producer/consumer model do use process pools (apache springs to mind
instantly) but some use thread pools.  It is useful to the latter to
preserve the concept of a thread as the entity inhabiting the cgroup
(but only where the granularity of the cgroup permits threads to
participate) so we can easily modify them to be self-containerising
without forcing them to switch back from a thread-pool model to a
process-pool model.
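
For concreteness, here's a minimal sketch of the per-thread placement
v1 permits today (the controller mount point and the worker TID
variable are illustrative):

  # cgroup v1: the per-controller 'tasks' file takes thread IDs, so a
  # thread-pool application can place individual workers by class.
  mkdir -p /sys/fs/cgroup/cpu/myapp/batch-workers
  echo "$WORKER_TID" > /sys/fs/cgroup/cpu/myapp/batch-workers/tasks

  # cgroup v2: cgroup.procs takes PIDs, and writing any TID of a
  # process migrates the whole thread group, so the above has no
  # equivalent.
  echo "$WORKER_TID" > /cgroup/myapp/cgroup.procs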

I can see that process-based is conceptually easier in v2 because you
begin with a process tree, but it would really be a pity to lose the
thread-based controls we have now and permanently lose the ability to
create more as we find uses for them.  I can't really see how improving
the "common resource domain" is a good tradeoff for this.

James

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-08-22 10:12       ` Mike Galbraith
  0 siblings, 0 replies; 87+ messages in thread
From: Mike Galbraith @ 2016-08-22 10:12 UTC (permalink / raw)
  To: Tejun Heo, Andy Lutomirski
  Cc: Ingo Molnar, linux-kernel, kernel-team,
	open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
	Johannes Weiner, Linus Torvalds

On Sat, 2016-08-20 at 11:56 -0400, Tejun Heo wrote:

> > >   there are other reasons to enforce process granularity.  One
> > >   important one is isolating system-level management operations from
> > >   in-process application operations.  The cgroup interface, being a
> > >   virtual filesystem, is very unfit for multiple independent
> > >   operations taking place at the same time as most operations have to
> > >   be multi-step and there is no way to synchronize multiple accessors.
> > >   See also [5] Documentation/cgroup-v2.txt, "R-2. Thread Granularity"
> > 
> > I don't buy this argument at all.  System-level code is likely to
> > assign single process *trees*, which are a different beast entirely.
> > I.e. you fork, move the child into a cgroup, and that child and its
> > children stay in that cgroup.  I don't see how the thread/process
> > distinction matters.
> 
> Good point on the multi-process issue, this is something which nagged
> me a bit while working on rgroup, although I have to point out that
> the issue here is one of not going far enough rather than the approach
> being wrong.  There are limitations to scoping it to individual
> processes but that doesn't negate the underlying problem or the
> usefulness of in-process control.
> 
> For system-level and process-level operations to not step on each
> other's toes, they need to agree on the granularity boundary -
> system-level should be able to treat an application hierarchy as a
> single unit.  A possible solution is allowing rgroup hierarchies to
> span across process boundaries and implementing cgroup migration
> operations which treat such hierarchies as a single unit.  I'm not yet
> sure whether the boundary should be at program groups or rgroups.

Why is it not viable to predicate the contentious
lowest-common-denominator restrictions upon the set of enabled
controllers?  If only thread-granularity controllers are enabled, from
that point onward, the v2 restrictions cease to make any sense and
thus could be lifted, leaving nobody cast adrift in a leaky v1
lifeboat when v2 sets sail.  Or?

	-Mike

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-08-29 22:20         ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-08-29 22:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Mike Galbraith, linux-kernel, kernel-team,
	open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
	Johannes Weiner, Linus Torvalds

Hello, Andy.

Sorry about the delay.  Was kinda overwhelmed with other things.

On Sat, Aug 20, 2016 at 11:45:55AM -0700, Andy Lutomirski wrote:
> > This becomes clear whenever an entity is allocating memory on behalf
> > of someone else - get_user_pages(), khugepaged, swapoff and so on (and
> > likely userfaultfd too).  When a task is trying to add a page to a
> > VMA, the task might not have any relationship with the VMA other than
> > that it's operating on it for someone else.  The page has to be
> > charged to whoever is responsible for the VMA and the only ownership
> > which can be established is the containing mm_struct.
> 
> This surprises me a bit.  If I do access_process_vm(), then I would
> have expected the charge to go to the caller, not the mm being
> accessed.

It does and should go to the target mm.  Who faults in a page
shouldn't be the final determinant of the ownership; otherwise, we end
up in situations where the ownership changes due to, for example,
fluctuations in the page fault pattern.  It doesn't make semantic
sense either.  If a kthread is doing PIO for a process, why would it
get charged for the memory it's faulting in?

> What happens if a program calls read(2), though?  A page may be
> inserted into page cache on behalf of an address_space without any
> particular mm being involved.  There will usually be a calling task,
> though.

Most faults are synchronous and the faulting thread is a member of the
mm to be charged, so this usually isn't an issue.  I don't think there
are places where we populate an address_space without knowing who it
is for (as opposed to, or in addition to, who the operator is).

> But this is all very memcg-specific.  What about other cgroups?  I/O
> is per-task, right?  Scheduling is definitely per-task.

They aren't separate.  Think about IOs to write out page cache, or CPU
cycles spent reclaiming memory or encrypting writeback IOs.  It's fine
to get more granular with specific resources but the semantics get
messy for cross-resource accounting and control without proper
scoping.
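
As a concrete sketch of the common-domain setup this depends on (the
mount point, cgroup name and 8:0 device number are illustrative):

  # Enable memory and IO together on one subtree so that writeback of
  # a workload's dirty page cache is throttled against the same
  # cgroup's IO limit.
  mount -t cgroup2 none /cgroup
  echo "+memory +io" > /cgroup/cgroup.subtree_control
  mkdir /cgroup/workload
  echo "$$" > /cgroup/workload/cgroup.procs
  echo 512M > /cgroup/workload/memory.max
  echo "8:0 wbps=10485760" > /cgroup/workload/io.max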

> > Consider the scenario where you have somebody faulting on behalf of a
> > foreign VMA, but the thread who created and is actively using that VMA
> > is in a different cgroup than the process leader.  Who are we going to
> > charge?  All possible answers seem erratic.
> 
> Indeed, and this problem is probably not solvable in practice unless
> you charge all involved cgroups.  But the caller's *mm* is entirely
> irrelevant here, so I don't see how this implies that cgroups need to
> keep tasks in the same process together.  The relevant entities are
> the calling *task* and the target mm, and you're going to be
> hard-pressed to ensure that they belong to the same cgroup, so I think
> you need to be able to handle weird cases in which there isn't an
> obviously correct cgroup to charge.

It is an erratic case which is caused by the userland interface
allowing a nonsensical configuration.  We can accept it as a necessary
trade-off given big enough benefits or unavoidable constraints but it
isn't something to do willy-nilly.

> > For system-level and process-level operations to not step on each
> > other's toes, they need to agree on the granularity boundary -
> > system-level should be able to treat an application hierarchy as a
> > single unit.  A possible solution is allowing rgroup hierarchies to
> > span across process boundaries and implementing cgroup migration
> > operations which treat such hierarchies as a single unit.  I'm not yet
> > sure whether the boundary should be at program groups or rgroups.
> 
> I think that, if the system cgroup manager is moving processes around
> after starting them and execing the final binary, there will be races
> and confusion, and no amount of granularity fiddling will fix that.

I don't see how that statement is true.  For example, if you confine
the hierarchy to in-process, there is proper isolation and whether the
system agent migrates the process or not doesn't make any difference
to the internal hierarchy.

> I know nothing about rgroups.  Are they upstream?

It was linked from the original message.

[7]  http://lkml.kernel.org/r/20160105154503.GC5995@mtj.duckdns.org
     [RFD] cgroup: thread granularity support for cpu controller
     Tejun Heo <tj@kernel.org>

[8]  http://lkml.kernel.org/r/1457710888-31182-1-git-send-email-tj@kernel.org
     [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
     Tejun Heo <tj@kernel.org>

[9]  http://lkml.kernel.org/r/20160311160522.GA24046@htj.duckdns.org
     Example program for PRIO_RGRP
     Tejun Heo <tj@kernel.org>

> > These base-system operations are special regardless of cgroup and we
> > already have sometimes crude ways to affect their behaviors where
> > necessary through sysctl knobs, priorities on specific kernel threads
> > and so on.  cgroup doesn't change the situation all that much.  What
> > gets left in the root cgroup usually are the base-system operations
> > which are outside the scope of cgroup resource control in the first
> > place and cgroup resource graph can treat the root as an opaque anchor
> > point.
> 
> This seems to explain why the controllers need to be able to handle
> things being charged to the root cgroup (or to an unidentifiable
> cgroup, anyway).  That isn't quite the same thing as allowing, from an
> ABI point of view, the root cgroup to contain processes and cgroups
> but not allowing other cgroups to do the same thing.  Consider:

The points are: 1. we need the root to be a special container anyway;
2. allowing it to be special and contain system-wide consumptions
doesn't make the resource graph inconsistent once all non-system-wide
consumptions are put in non-root cgroups; and 3. this is the most
natural way to handle the situation both from implementation and
interface standpoints as it makes non-cgroup configuration a natural
degenerate case of cgroup configuration.

> suppose that systemd (or some competing cgroup manager) is designed to
> run in the root cgroup namespace.  It presumably expects *itself* to
> be in the root cgroup.  Now try to run it using cgroups v2 in a
> non-root namespace.  I don't see how it can possibly work if the
> hierarchy constraints don't permit it to create sub-cgroups while it's
> still in the root.  In fact, this seems impossible to fix even with
> user code changes.  The manager would need to simultaneously create a
> new child cgroup to contain itself and assign itself to that child
> cgroup, because the intermediate state is illegal.

Please re-read the constraint.  It doesn't prevent any organizational
operations before resource control is enabled.
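
A sketch of the intended sequence (mount point and names illustrative)
- organize first, enable control second - which works the same at the
system root and at a namespace root:

  mkdir /cgroup/manager /cgroup/workers
  echo "$$" > /cgroup/manager/cgroup.procs      # move self out of the root first
  echo "+io" > /cgroup/cgroup.subtree_control   # now succeeds
  mkdir /cgroup/workers/a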

> I really, really think that cgroup v2 should supply the same
> *interface* inside and outside of a non-root namespace.  If this is

It *does*.  That's what I tried to explain, that it's exactly
isomorphic once you discount the system-wide consumptions.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-08-29 22:35       ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-08-29 22:35 UTC (permalink / raw)
  To: James Bottomley
  Cc: Andy Lutomirski, Ingo Molnar, Mike Galbraith, linux-kernel,
	kernel-team, open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
	Johannes Weiner, Linus Torvalds

Hello, James.

On Sat, Aug 20, 2016 at 10:34:14PM -0700, James Bottomley wrote:
> I can see that process-based is conceptually easier in v2 because you
> begin with a process tree, but it would really be a pity to lose the
> thread-based controls we have now and permanently lose the ability to
> create more as we find uses for them.  I can't really see how improving
> the "common resource domain" is a good tradeoff for this.

Thread-based control for a namespace is not a different problem from
thread-based control for individual applications, right?  And the
problems with using cgroupfs directly for in-process control still
apply the same whether it's system-wide or inside a namespace.

One argument could be that inside a namespace, as the cgroupfs is
already scoped, cgroup path headaches are less of an issue, which is
true; however, that isn't applicable to applications which aren't
scoped in their own namespaces, and we can't scope every binary on the
system.  More importantly, a given application can't rely on being
scoped in a certain way.  You can craft a custom config for a specific
setup but that's a horrible way to solve the problem of in-application
hierarchical resource distribution, and that's what rgroup was all
about.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-08-31  3:42           ` Andy Lutomirski
  0 siblings, 0 replies; 87+ messages in thread
From: Andy Lutomirski @ 2016-08-31  3:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Mike Galbraith, linux-kernel, kernel-team,
	open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
	Johannes Weiner, Linus Torvalds

On Mon, Aug 29, 2016 at 3:20 PM, Tejun Heo <tj@kernel.org> wrote:
>> > These base-system operations are special regardless of cgroup and we
>> > already have sometimes crude ways to affect their behaviors where
>> > necessary through sysctl knobs, priorities on specific kernel threads
>> > and so on.  cgroup doesn't change the situation all that much.  What
>> > gets left in the root cgroup usually are the base-system operations
>> > which are outside the scope of cgroup resource control in the first
>> > place and cgroup resource graph can treat the root as an opaque anchor
>> > point.
>>
>> This seems to explain why the controllers need to be able to handle
>> things being charged to the root cgroup (or to an unidentifiable
>> cgroup, anyway).  That isn't quite the same thing as allowing, from an
>> ABI point of view, the root cgroup to contain processes and cgroups
>> but not allowing other cgroups to do the same thing.  Consider:
>
> The points are: 1. we need the root to be a special container anyway;

But you don't need to let userspace see that.

> 2. allowing it to be special and contain system-wide consumptions
> doesn't make the resource graph inconsistent once all non-system-wide
> consumptions are put in non-root cgroups; and 3. this is the most
> natural way to handle the situation both from implementation and
> interface standpoints as it makes non-cgroup configuration a natural
> degenerate case of cgroup configuration.
>
>> suppose that systemd (or some competing cgroup manager) is designed to
>> run in the root cgroup namespace.  It presumably expects *itself* to
>> be in the root cgroup.  Now try to run it using cgroups v2 in a
>> non-root namespace.  I don't see how it can possibly work if the
>> hierarchy constraints don't permit it to create sub-cgroups while it's
>> still in the root.  In fact, this seems impossible to fix even with
>> user code changes.  The manager would need to simultaneously create a
>> new child cgroup to contain itself and assign itself to that child
>> cgroup, because the intermediate state is illegal.
>
> Please re-read the constraint.  It doesn't prevent any organizational
> operations before resource control is enabled.
>
>> I really, really think that cgroup v2 should supply the same
>> *interface* inside and outside of a non-root namespace.  If this is
>
> It *does*.  That's what I tried to explain, that it's exactly
> isomorphic once you discount the system-wide consumptions.
>

I don't think I agree.

Suppose I wrote an init program or a cgroup manager.  I can expect
that init program to be started in the root cgroup.  The program can
be lazy and write +io to /cgroup/cgroup.subtree_control and then
create some new cgroup /cgroup/a and it will work (I just tried it).

Now I run that program in a namespace.  It will not work because it'll
get -EBUSY when it tries to write to cgroup.subtree_control.  (I just
tried this, too, only using cd instead of a namespace.)  So it's *not*
isomorphic.
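
A sketch of the failing sequence (using a plain sub-cgroup to stand in
for the namespace root; paths illustrative):

  mkdir /cgroup/ns-root
  echo "$$" > /cgroup/ns-root/cgroup.procs             # emulate the namespace root
  echo "+io" > /cgroup/ns-root/cgroup.subtree_control  # -EBUSY: ns-root has a process

Against / itself the +io write succeeds even with processes present,
because the root is exempt from the no-internal-process constraint -
hence the asymmetry.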

It *also* won't work (I think) if subtree control is enabled on the
root, but I don't think this is a problem in practice because subtree
control won't be enabled on the namespace root by a sensible cgroup
manager.

--Andy

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
  2016-08-31  3:42           ` Andy Lutomirski
@ 2016-08-31 17:32           ` Tejun Heo
  2016-08-31 19:11               ` Andy Lutomirski
  -1 siblings, 1 reply; 87+ messages in thread
From: Tejun Heo @ 2016-08-31 17:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Mike Galbraith, linux-kernel, kernel-team,
	open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
	Johannes Weiner, Linus Torvalds

Hello, Andy.

On Tue, Aug 30, 2016 at 08:42:20PM -0700, Andy Lutomirski wrote:
> On Mon, Aug 29, 2016 at 3:20 PM, Tejun Heo <tj@kernel.org> wrote:
> >> This seems to explain why the controllers need to be able to handle
> >> things being charged to the root cgroup (or to an unidentifiable
> >> cgroup, anyway).  That isn't quite the same thing as allowing, from an
> >> ABI point of view, the root cgroup to contain processes and cgroups
> >> but not allowing other cgroups to do the same thing.  Consider:
> >
> > The points are: 1. we need the root to be a special container anyway;
> 
> But you don't need to let userspace see that.

I'm not saying that what cgroup v2 implements is the only solution.
There of course can be other approaches which don't expose this
particular detail to userland.  I was highlighting that there is an
underlying condition to be dealt with and that what cgroup v2
implements is one working solution for it.

It's fine to have, say, aesthetic disagreements on the specifics of
the chosen approach, and, while a bit late, we can still talk about
pros and cons of different possible approaches and make improvements
where it makes sense.  However, this isn't in any way a make-or-break
issue as you implied before.

> >> I really, really think that cgroup v2 should supply the same
> >> *interface* inside and outside of a non-root namespace.  If this is
> >
> > It *does*.  That's what I tried to explain, that it's exactly
> > isomorphic once you discount the system-wide consumptions.
> 
> I don't think I agree.
> 
> Suppose I wrote an init program or a cgroup manager.  I can expect
> that init program to be started in the root cgroup.  The program can
> be lazy and write +io to /cgroup/cgroup.subtree_control and then
> create some new cgroup /cgroup/a and it will work (I just tried it).
> 
> Now I run that program in a namespace.  It will not work because it'll
> get -EBUSY when it tries to write to cgroup.subtree_control.  (I just
> tried this, too, only using cd instead of a namespace.)  So it's *not*
> isomorphic.

Yeah, it is possible to shoot yourself in the foot but both
system-scope and namespace-scope can implement exactly the same
behavior - move yourself out of the root before enabling resource
controls and get the same expected outcome, which BTW is how systemd
behaves already.

You can say that allowing the possibility of deviation isn't a good
design choice but it is a design choice with other implications - on
how we deal with configurations without cgroup at all, transitioning
from v1, bootstrapping a system and avoiding surprising
userland-visible behaviors (e.g. creating magic preset cgroups and
silently migrating processes there on certain events).

> It *also* won't work (I think) if subtree control is enabled on the
> root, but I don't think this is a problem in practice because subtree
> control won't be enabled on the namespace root by a sensible cgroup
> manager.

Exactly the same thing.  You can shoot yourself in the foot but it's
easy not to.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-08-31 19:11               ` Andy Lutomirski
  0 siblings, 0 replies; 87+ messages in thread
From: Andy Lutomirski @ 2016-08-31 19:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Mike Galbraith, linux-kernel, kernel-team,
	open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
	Johannes Weiner, Linus Torvalds

On Wed, Aug 31, 2016 at 10:32 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Andy.
>

>
>> >> I really, really think that cgroup v2 should supply the same
>> >> *interface* inside and outside of a non-root namespace.  If this is
>> >
>> > It *does*.  That's what I tried to explain, that it's exactly
>> > isomorphic once you discount the system-wide consumptions.
>>
>> I don't think I agree.
>>
>> Suppose I wrote an init program or a cgroup manager.  I can expect
>> that init program to be started in the root cgroup.  The program can
>> be lazy and write +io to /cgroup/cgroup.subtree_control and then
>> create some new cgroup /cgroup/a and it will work (I just tried it).
>>
>> Now I run that program in a namespace.  It will not work because it'll
>> get -EBUSY when it tries to write to cgroup.subtree_control.  (I just
>> tried this, too, only using cd instead of a namespace.)  So it's *not*
>> isomorphic.
>
> Yeah, it is possible to shoot yourself in the foot but both
> system-scope and namespace-scope can implement exactly the same
> behavior - move yourself out of the root before enabling resource
> controls and get the same expected outcome, which BTW is how systemd
> behaves already.
>
> You can say that allowing the possibility of deviation isn't a good
> design choice but it is a design choice with other implications - on
> how we deal with configurations without cgroup at all, transitioning
> from v1, bootstrapping a system and avoiding surprising
> userland-visible behaviors (e.g. creating magic preset cgroups and
> silently migrating processes there on certain events).

Are there existing userspace programs that use cgroup2 and enable
subtree control on / when there are processes in /?  If the answer is
no, then I think you should change cgroup2 to just disallow it.  If
the answer is yes, then I think there's a problem and maybe you should
consider a breaking change.  Given that cgroup2 hasn't really launched
on a large scale, it seems worthwhile to get it right.

I don't understand what you're talking about wrt silently migrating
processes.  Are you thinking about usermodehelper?  If so, maybe it
really does make sense to allow (or require?) the cgroup manager to
specify which cgroup these processes end up in.

But, given that all the controllers need to support the current magic
root exception (for genuinely unaccountable things if nothing else),
can you explain what would actually go wrong if you just removed the
restriction entirely?

Also, here's an idea to maybe make PeterZ happier: relax the
restriction a bit per-controller.  Currently (except for /), if you
have subtree control enabled you can't have any processes in the
cgroup.  Could you change this so it only applies to certain
controllers?  If the cpu controller is entirely happy to have
processes and cgroups as siblings, then maybe a cgroup with only cpu
subtree control enabled could allow processes to exist.
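
A sketch of what that could look like (hypothetical semantics, not
what the kernel does today; names illustrative):

  # Hypothetical: with only the cpu controller enabled below 'app',
  # the no-internal-process rule would be waived for it.
  echo "+cpu" > /cgroup/app/cgroup.subtree_control
  mkdir /cgroup/app/helper
  echo "$PID" > /cgroup/app/cgroup.procs  # rejected under the current
                                          # blanket rule, allowed under
                                          # the proposal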

>
>> It *also* won't work (I think) if subtree control is enabled on the
>> root, but I don't think this is a problem in practice because subtree
>> control won't be enabled on the namespace root by a sensible cgroup
>> manager.
>
> Exactly the same thing.  You can shoot yourself in the foot but it's
> easy not to.
>

Somewhat off-topic: this appears to be either a bug or a misfeature:

bash-4.3# mkdir foo
bash-4.3# ls foo
cgroup.controllers  cgroup.events  cgroup.procs  cgroup.subtree_control
bash-4.3# mkdir foo/io.max  <-- IMO this shouldn't have worked
bash-4.3# echo +io >cgroup.subtree_control
[   40.470712] cgroup: cgroup_addrm_files: failed to add max, err=-17

Shouldn't cgroups with names that potentially conflict with
kernel-provided dentries be disallowed?

--Andy

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-08-31 19:57           ` Andy Lutomirski
  0 siblings, 0 replies; 87+ messages in thread
From: Andy Lutomirski @ 2016-08-31 19:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Mike Galbraith, linux-kernel, kernel-team,
	open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
	Johannes Weiner, Linus Torvalds

I'm replying separately to keep the two issues in separate emails.

On Mon, Aug 29, 2016 at 3:20 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Andy.
>
> Sorry about the delay.  Was kinda overwhelmed with other things.
>
> On Sat, Aug 20, 2016 at 11:45:55AM -0700, Andy Lutomirski wrote:
>> > This becomes clear whenever an entity is allocating memory on behalf
>> > of someone else - get_user_pages(), khugepaged, swapoff and so on (and
>> > likely userfaultfd too).  When a task is trying to add a page to a
>> > VMA, the task might not have any relationship with the VMA other than
>> > that it's operating on it for someone else.  The page has to be
>> > charged to whoever is responsible for the VMA and the only ownership
>> > which can be established is the containing mm_struct.
>>
>> This surprises me a bit.  If I do access_process_vm(), then I would
>> have expected the charge to go to the caller, not the mm being
>> accessed.
>
> It does and should go to the target mm.  Who faults in a page
> shouldn't be the final determinant of the ownership; otherwise, we end
> up in situations where the ownership changes due to, for example,
> fluctuations in the page fault pattern.  It doesn't make semantic
> sense either.  If a kthread is doing PIO for a process, why would it
> get charged for the memory it's faulting in?

OK, that makes sense.  Although, given that cgroup1 allows tasks in
the same process to be split up, how does this work in cgroup1?  Do
you just pick the mm associated with the thread group leader?  If so,
why can't cgroup2 do the same thing?

But even this is at best a vague approximation.  If you have
MAP_SHARED mappings (libc.so, for example), then the cgroup you charge
it to is more or less arbitrary.

>
>> What happens if a program calls read(2), though?  A page may be
>> inserted into page cache on behalf of an address_space without any
>> particular mm being involved.  There will usually be a calling task,
>> though.
>
> Most faults are synchronous and the faulting thread is a member of the
> mm to be charged, so this usually isn't an issue.  I don't think there
> are places where we populate an address_space without knowing who it
> is for (as opposed to, or in addition to, who the operator is).

True, but there's no *mm* involved in any fundamental sense.  You can
look at the task and find the task's mm (or actually the task's thread
group leader, since cgroup2 doesn't literally map mms to cgroups), but
that seems to me to be a pretty poor reason to argue that tasks should
have to be kept together.

>
>> But this is all very memcg-specific.  What about other cgroups?  I/O
>> is per-task, right?  Scheduling is definitely per-task.
>
> They aren't separate.  Think about IOs to write out page cache, or CPU
> cycles spent reclaiming memory or encrypting writeback IOs.  It's fine
> to get more granular with specific resources but the semantics get
> messy for cross-resource accounting and control without proper
> scoping.

Page cache doesn't belong to a specific mm.  Memory reclaim only has
an mm associated if the memory being reclaimed belongs cleanly to an
mm.  Encrypting writeback (I assume you mean the cpu usage) is just
like page cache writeback IO -- there's no specific mm involved in
general.

>
>> > Consider the scenario where you have somebody faulting on behalf of a
>> > foreign VMA, but the thread who created and is actively using that VMA
>> > is in a different cgroup than the process leader.  Who are we going to
>> > charge?  All possible answers seem erratic.
>>
>> Indeed, and this problem is probably not solvable in practice unless
>> you charge all involved cgroups.  But the caller's *mm* is entirely
>> irrelevant here, so I don't see how this implies that cgroups need to
>> keep tasks in the same process together.  The relevant entities are
>> the calling *task* and the target mm, and you're going to be
>> hard-pressed to ensure that they belong to the same cgroup, so I think
>> you need to be able to handle weird cases in which there isn't an
>> obviously correct cgroup to charge.
>
> It is an erratic case which is caused by userland interface allowing
> non-sensical configuration.  We can accept it as a necessary trade-off
> given big enough benefits or unavoidable constraints but it isn't
> something to do willy-nilly.
>
>> > For system-level and process-level operations to not step on each
>> > other's toes, they need to agree on the granularity boundary -
>> > system-level should be able to treat an application hierarchy as a
>> > single unit.  A possible solution is allowing rgroup hierarchies to
>> > span across process boundaries and implementing cgroup migration
>> > operations which treat such hierarchies as a single unit.  I'm not yet
>> > sure whether the boundary should be at program groups or rgroups.
>>
>> I think that, if the system cgroup manager is moving processes around
>> after starting them and execing the final binary, there will be races
>> and confusion, and no amount of granularity fiddling will fix that.
>
> I don't see how that statement is true.  For example, if you confine
> the hierarchy to in-process, there is proper isolation and whether the
> system agent migrates the process or not doesn't make any difference
> to the internal hierarchy.

But hierarchy isn't always per process.  Some real-world services have
threads and subprocesses.

>
>> I know nothing about rgroups.  Are they upstream?
>
> It was linked from the original message.
>
> [7]  http://lkml.kernel.org/r/20160105154503.GC5995@mtj.duckdns.org
>      [RFD] cgroup: thread granularity support for cpu controller
>      Tejun Heo <tj@kernel.org>

I can see two issues here:

1. You're allowing groups and tasks to be siblings.  If you're okay
allowing that for rgroups, why not allow it for cgroup2 on the same
set of controllers?

2. It looks impossible to fork and keep a child in the same group as
one of your non-leader threads.

I think I'm starting to agree with PeterZ here.  Why not just make
cgroup2 more flexible?

--Andy

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-08-31 21:07                 ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-08-31 21:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Mike Galbraith, linux-kernel, kernel-team,
	open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
	Johannes Weiner, Linus Torvalds

Hello,

On Wed, Aug 31, 2016 at 12:11:58PM -0700, Andy Lutomirski wrote:
> > You can say that allowing the possibility of deviation isn't a good
> > design choice but it is a design choice with other implications - on
> > how we deal with configurations without cgroup at all, transitioning
> > from v1, bootstrapping a system and avoiding surprising
> > userland-visible behaviors (e.g. like creating magic preset cgroups
> > and silently migrating process there on certain events).
> 
> Are there existing userspace programs that use cgroup2 and enable
> subtree control on / when there are processes in /?  If the answer is
> no, then I think you should change cgroup2 to just disallow it.  If
> the answer is yes, then I think there's a problem and maybe you should
> consider a breaking change.  Given that cgroup2 hasn't really launched
> on a large scale, it seems worthwhile to get it right.

Adding the restriction isn't difficult from an implementation point of
view, and implementing it wouldn't be difficult for a system agent
which controls the boot process either; but I can't see what the
actual benefits of the extra restriction would be, and there are
tangible downsides to doing so.

Consider a use case where the user isn't interested in fully
accounting and dividing up system resources but wants to just cap
resource usage from a subset of workloads.  There is no reason to
require such usages to fully contain all processes in non-root
cgroups.  Furthermore, it's not trivial to migrate all processes out
of root to a sub-cgroup unless the agent is in full control of the
boot process.
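
For illustration, a minimal sketch of that partial-control usage in C
(the /sys/fs/cgroup mount point, the "capped" name, the 8:0 device
and the PID are all assumptions):

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/stat.h>
  #include <unistd.h>

  /* Write a string to a cgroupfs file, reporting failures. */
  static int put(const char *path, const char *val)
  {
      int fd = open(path, O_WRONLY);

      if (fd < 0 || write(fd, val, strlen(val)) < 0) {
          perror(path);
          return -1;
      }
      return close(fd);
  }

  int main(void)
  {
      /* Enable the io controller for children of the root only. */
      put("/sys/fs/cgroup/cgroup.subtree_control", "+io");
      mkdir("/sys/fs/cgroup/capped", 0755);
      /* Cap writes to device 8:0 at 10MB/s for this one workload. */
      put("/sys/fs/cgroup/capped/io.max", "8:0 wbps=10485760");
      put("/sys/fs/cgroup/capped/cgroup.procs", "12345");
      return 0;
  }

Every other process simply stays in the root, uncontained.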

At least up to this point in the discussion, I can't see actual
benefits to adding this restriction, and the only reasons for pushing
it seem to be the initial misunderstanding and purism.

> I don't understand what you're talking about wrt silently migrating
> processes.  Are you thinking about usermodehelper?  If so, maybe it
> really does make sense to allow (or require?) the cgroup manager to
> specify which cgroup these processes end up in.

That was from one of the ideas I was considering a while back, where
enabling resource control in an intermediate node automatically moves
internal processes to a preset cgroup, whether visible or hidden,
which would be another way of addressing the problem.

None of these affects what cgroup v2 can do at all and the only thing
the userland is asked to do under the current scheme is "if you wanna
keep the whole system divided up and use the same mode of operations
across system-scope and namespace-scope, move out of root while setting
yourself up, which also happens to be what you have to do inside
namespaces anyway."

> But, given that all the controllers need to support the current magic
> root exception (for genuinely unaccountable things if nothing else),
> can you explain what would actually go wrong if you just removed the
> restriction entirely?

I have, multiple times.  Can you please read 2-1-2 of the document in
the original post and take the discussion from there?

> Also, here's an idea to maybe make PeterZ happier: relax the
> restriction a bit per-controller.  Currently (except for /), if you
> have subtree control enabled you can't have any processes in the
> cgroup.  Could you change this so it only applies to certain
> controllers?  If the cpu controller is entirely happy to have
> processes and cgroups as siblings, then maybe a cgroup with only cpu
> subtree control enabled could allow processes to exist.

The document lists several reasons for not doing this and also notes
that there is no known real-world use case for such a configuration.

Please also note that the behavior that you're describing is actually
what rgroup implements.  It makes a lot more sense there because
threads and groups share the same configuration mechanism and it only
has to worry about competition among threads (anonymous consumption is
out of scope for rgroup).

> >> It *also* won't work (I think) if subtree control is enabled on the
> >> root, but I don't think this is a problem in practice because subtree
> >> control won't be enabled on the namespace root by a sensible cgroup
> >> manager.
> >
> > Exactly the same thing.  You can shoot yourself in the foot but it's
> > easy not to.
> 
> Somewhat off-topic: this appears to be either a bug or a misfeature:
> 
> bash-4.3# mkdir foo
> bash-4.3# ls foo
> cgroup.controllers  cgroup.events  cgroup.procs  cgroup.subtree_control
> bash-4.3# mkdir foo/io.max  <-- IMO this shouldn't have worked
> bash-4.3# echo +io >cgroup.subtree_control
> [   40.470712] cgroup: cgroup_addrm_files: failed to add max, err=-17
> 
> Shouldn't cgroups with names that potentially conflict with
> kernel-provided dentries be disallowed?

Yeap, the name collisions suck.  I thought about disallowing all
sub-cgroups which start with "KNOWN_SUBSYS." but that has a
non-trivial chance of breaking users who were previously happy when a
new controller gets added.  But, yeah, we at least should disallow the
known filenames.  Will think more about it.
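
For illustration, the filename check being considered could look
something like this sketch (userspace-style C; the static list is
just a stand-in for consulting the actually registered interface
files):

  #include <stdbool.h>
  #include <string.h>

  /* Sketch only: a real implementation would walk the registered
   * interface files rather than a hard-coded list. */
  static const char * const known_files[] = {
      "cgroup.procs", "cgroup.controllers",
      "cgroup.subtree_control", "cgroup.events",
      "io.max", /* ... */
  };

  static bool cgroup_name_collides(const char *name)
  {
      size_t i;

      for (i = 0; i < sizeof(known_files) / sizeof(known_files[0]); i++)
          if (!strcmp(name, known_files[i]))
              return true;
      return false;
  }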

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
  2016-08-31 21:07                 ` Tejun Heo
  (?)
@ 2016-08-31 21:46                 ` Andy Lutomirski
  2016-09-03 22:05                     ` Tejun Heo
  -1 siblings, 1 reply; 87+ messages in thread
From: Andy Lutomirski @ 2016-08-31 21:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Mike Galbraith, linux-kernel, kernel-team,
	open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
	Johannes Weiner, Linus Torvalds

On Wed, Aug 31, 2016 at 2:07 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Wed, Aug 31, 2016 at 12:11:58PM -0700, Andy Lutomirski wrote:
>> > You can say that allowing the possibility of deviation isn't a good
>> > design choice but it is a design choice with other implications - on
>> > how we deal with configurations without cgroup at all, transitioning
>> > from v1, bootstrapping a system and avoiding surprising
>> > userland-visible behaviors (e.g. like creating magic preset cgroups
>> > and silently migrating process there on certain events).
>>
>> Are there existing userspace programs that use cgroup2 and enable
>> subtree control on / when there are processes in /?  If the answer is
>> no, then I think you should change cgroup2 to just disallow it.  If
>> the answer is yes, then I think there's a problem and maybe you should
>> consider a breaking change.  Given that cgroup2 hasn't really launched
>> on a large scale, it seems worthwhile to get it right.
>
> Adding the restriction isn't difficult from an implementation point of
> view, and implementing it wouldn't be difficult for a system agent
> which controls the boot process either; but I can't see what the
> actual benefits of the extra restriction would be, and there are
> tangible downsides to doing so.
>
> Consider a use case where the user isn't interested in fully
> accounting and dividing up system resources but wants to just cap
> resource usage from a subset of workloads.  There is no reason to
> require such usages to fully contain all processes in non-root
> cgroups.  Furthermore, it's not trivial to migrate all processes out
> of root to a sub-cgroup unless the agent is in full control of the
> boot process.

Then please also consider exactly the same use case while running in a
container.

I'm a bit frustrated that you're saying that my example failure modes
consist of shooting oneself in the foot and then you go on to come up
with your own examples that have precisely the same problem.

>
>> I don't understand what you're talking about wrt silently migrating
>> processes.  Are you thinking about usermodehelper?  If so, maybe it
>> really does make sense to allow (or require?) the cgroup manager to
>> specify which cgroup these processes end up in.
>
> That was from one of the ideas that I was considering way back where
> enabling resource control in an intermediate node automatically moves
> internal processes to a preset cgroup whether visible or hidden, which
> would be another way of addressing the problem.
>
> None of these affects what cgroup v2 can do at all and the only thing
> the userland is asked to do under the current scheme is "if you wanna
> keep the whole system divided up and use the same mode of operations
> across system-scope and namespace-scope move out of root while setting
> yourself up, which also happens to be what you have to do inside
> namespaces anyway."
>
>> But, given that all the controllers need to support the current magic
>> root exception (for genuinely unaccountable things if nothing else),
>> can you explain what would actually go wrong if you just removed the
>> restriction entirely?
>
> I have, multiple times.  Can you please read 2-1-2 of the document in
> the original post and take the discussion from there?

I've read it multiple times, and I don't see any explanation that's
consistent with the fact that you are exempting the root cgroup from
this constraint.  If the constraint were really critical to everything
working, then I would expect the root cgroup to have exactly the same
problem.  This makes me think that either something nasty is being
fudged for the root cgroup or that the constraint isn't actually so
important after all.  The only thing on point I can find is:

> Root cgroup is exempt from this constraint, which is in line with
> how root cgroup is handled in general - it's excluded from cgroup
> resource accounting and control.

and that's not very helpful.

>
>> Also, here's an idea to maybe make PeterZ happier: relax the
>> restriction a bit per-controller.  Currently (except for /), if you
>> have subtree control enabled you can't have any processes in the
>> cgroup.  Could you change this so it only applies to certain
>> controllers?  If the cpu controller is entirely happy to have
>> processes and cgroups as siblings, then maybe a cgroup with only cpu
>> subtree control enabled could allow processes to exist.
>
> The document lists several reasons for not doing this and also that
> there is no known real world use case for such configuration.

My company's production workload would map quite nicely to this
relaxed model.  I have quite a few processes each with several
threads.  Some of those threads get some CPUs, some get other CPUs,
and they vary in what shares of what CPUs they get.  To be clear,
there is not a hierarchy of resource usage that's compatible with the
process hierarchy.  Multiple processes have threads that should be
grouped in a different place in the hierarchy than other threads.
Concretely, I have processes A and B with threads A1, A2, B1, and B2.
(And many more, but this is enough to get the point across.)  The
natural grouping is:

Group 1: A1 and B1
Group 2: A2
Group 3: B2
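
For concreteness, here is roughly how cgroup1 expresses that grouping
today, assuming a v1 cpu hierarchy mounted at /sys/fs/cgroup/cpu with
group1..group3 already created (the TIDs are placeholders):

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* In v1, writing a TID to "tasks" moves that single thread, so
   * threads of the same process can sit in different groups. */
  static int move_tid(const char *group, pid_t tid)
  {
      char path[128], buf[32];
      int fd;

      snprintf(path, sizeof(path), "/sys/fs/cgroup/cpu/%s/tasks", group);
      fd = open(path, O_WRONLY);
      if (fd < 0)
          return -1;
      snprintf(buf, sizeof(buf), "%d", (int)tid);
      write(fd, buf, strlen(buf));
      return close(fd);
  }

  int main(void)
  {
      move_tid("group1", 1001);    /* A1 */
      move_tid("group1", 2001);    /* B1 */
      move_tid("group2", 1002);    /* A2 */
      move_tid("group3", 2002);    /* B2 */
      return 0;
  }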

This cannot be expressed with rgroup or with cgroup2.  cgroup1 has no
problem with it.  If I were using memcg, I would want to have a memcg
hierarchy that was incompatible with the hierarchy above, so I
actually find the cgroup2 insistence on a unified hierarchy to be a
bit annoying, but I at least understand the motivation behind the
unified hierarchy.

And I don't care that the system controller can't atomically move this
whole mess around.  I'm currently running without systemd, so I don't
*have* a system controller.  If I end up migrating to systemd, I'll
probably put this whole pile into its own slice and manage it
manually.

>
>> >> It *also* won't work (I think) if subtree control is enabled on the
>> >> root, but I don't think this is a problem in practice because subtree
>> >> control won't be enabled on the namespace root by a sensible cgroup
>> >> manager.
>> >
>> > Exactly the same thing.  You can shoot yourself in the foot but it's
>> > easy not to.
>>
>> Somewhat off-topic: this appears to be either a bug or a misfeature:
>>
>> bash-4.3# mkdir foo
>> bash-4.3# ls foo
>> cgroup.controllers  cgroup.events  cgroup.procs  cgroup.subtree_control
>> bash-4.3# mkdir foo/io.max  <-- IMO this shouldn't have worked
>> bash-4.3# echo +io >cgroup.subtree_control
>> [   40.470712] cgroup: cgroup_addrm_files: failed to add max, err=-17
>>
>> Shouldn't cgroups with names that potentially conflict with
>> kernel-provided dentries be disallowed?
>
> Yeap, the name collisions suck.  I thought about disallowing all
> sub-cgroups which start with "KNOWN_SUBSYS." but that has a
> non-trivial chance of breaking users who were previously happy when a
> new controller gets added.  But, yeah, we at least should disallow the
> known filenames.  Will think more about it.

How about disallowing names that contain a '.'?

--Andy

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-09-03 22:05                     ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-09-03 22:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Mike Galbraith, linux-kernel, kernel-team,
	open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
	Johannes Weiner, Linus Torvalds

Hello, Andy.

On Wed, Aug 31, 2016 at 02:46:20PM -0700, Andy Lutomirski wrote:
> > Consider a use case where the user isn't interested in fully
> > accounting and dividing up system resources but wants to just cap
> > resource usage from a subset of workloads.  There is no reason to
> > require such usages to fully contain all processes in non-root
> > cgroups.  Furthermore, it's not trivial to migrate all processes out
> > of root to a sub-cgroup unless the agent is in full control of the
> > boot process.
> 
> Then please also consider exactly the same use case while running in a
> container.
> 
> I'm a bit frustrated that you're saying that my example failure modes
> consist of shooting oneself in the foot and then you go on to come up
> with your own examples that have precisely the same problem.

You have a point, which is

  The system-root and namespace-roots are not symmetric.

and that's a valid concern.  Here's why the system-root is special.

* A system has entities and resource consumptions which can only be
  attributed to the "system".  The system-root is the natural place to
  put them.  The system-root has stuff no other cgroups, not even
  namespace-roots, have.  It's a unique situation.

* The need to bypass most cgroup-related overhead when not in use.
  The system-root is there whether cgroup is actually in use or not and
  thus cannot impose noticeable overhead.  It has to make sense for
  both resource-controlled systems as well as ones that aren't.
  Again, no other group has these requirements.

  Note that this means that all controllers should be able to and
  already allow uncontained consumptions in the system-root.  I'll
  come back to this later.

Now, due to the various issues with direct competition between
processes and cgroups, cgroup v2 disallows resource control across
them (the no-internal-tasks restriction); however, cgroup v2 currently
doesn't apply the restriction to the system-root.  Here are the
reasons.

* It doesn't bring any practical benefits in terms of implementation.
  As noted above, all controllers already have to allow uncontained
  consumptions in the system-root and that's the only attribute
  required for the exemption.

* It doesn't bring any practical benefits in terms of capability.
  Userland can trivially handle the system-root and namespace-roots in
  a symmetrical manner.

* It's an unnecessary inconvenience, especially for cases where the
  cgroup agent isn't in control of boot, for partial usage cases, or
  just for playing with it.

You say that I'm ignoring the same use case for namespace-scope but
namespace-roots don't have the same hybrid function for partial and
uncontrolled systems, so it's not clear why there even NEEDS to be
strict symmetry.

On this subject, your only actual point is that there is an asymmetry
and that's bothersome.  I've been trying to explain why the special
case doesn't actually get in the way in terms of implementation or
capability and is actually beneficial.  Instead of engaging in the
actual discussion, you're constantly coming up with different ways of
saying "it's not symmetric".

The system-root and namespace-roots aren't equivalent.  There are a
lot of parallels between system-root and namespace-root but they
aren't the same thing (e.g. bootstrapping a namespace is a less
complicated and more malleable process).  The system-root is not even
a fully qualified node of the resource graph.

It's easy and understandable to get hangups on asymmetries or
exemptions like this, but they also often are acceptable trade-offs.
It's really frustrating to see you first getting hung up on "this must
be wrong" and even after explanations repeating the same thing just in
different ways.

If there is something fundamentally wrong with it, sure, let's fix it,
but what's actually broken?

> > I have, multiple times.  Can you please read 2-1-2 of the document in
> > the original post and take the discussion from there?
> 
> I've read it multiple times, and I don't see any explanation that's
> consistent with the fact that you are exempting the root cgroup from
> this constraint.  If the constraint were really critical to everything
> working, then I would expect the root cgroup to have exactly the same
> problem.  This makes me think that either something nasty is being
> fudged for the root cgroup or that the constraint isn't actually so
> important after all.  The only thing on point I can find is:
> 
> > Root cgroup is exempt from this constraint, which is in line with
> > how root cgroup is handled in general - it's excluded from cgroup
> > resource accounting and control.
> 
> and that's not very helpful.

My apologies.  I somehow thought that was part of the documentation.
Will update it later, but here's an excerpt from my earlier response.

  Having a special case doesn't necessarily get in the way of
  benefiting from a set of general rules.  The root cgroup is
  inherently special as it has to be the catch-all scope for entities
  and resource consumptions which can't be tied to any specific
  consumer - irq handling, packet rx, journal writes, memory reclaim
  from global memory pressure and so on.  None of the sub-cgroups have
  worry about them.

  These base-system operations are special regardless of cgroup and we
  already have sometimes crude ways to affect their behaviors where
  necessary through sysctl knobs, priorities on specific kernel
  threads and so on.  cgroup doesn't change the situation all that
  much.  What gets left in the root cgroup usually are the base-system
  operations which are outside the scope of cgroup resource control in
  the first place, and the cgroup resource graph can treat the root as an
  opaque anchor point.

  There can be other ways to deal with the issue; however, treating
  root cgroup this way has the big advantage of minimizing the gap
  between configurations without and with cgroups both in terms of
  mental model and implementation.

  Hopefully, the case of a namespace root is clear now.  If it's gonna
  have a sub-hierarchy, it itself can't contain processes but the
  system root just contains base-system entities and resources which a
  namespace root doesn't have to worry about.  Ignoring base-system
  stuff, a namespace root is topologically in the same position as the
  system root in the cgroup resource graph.

Maybe this wasn't as clear as I thought it was.  I hope the earlier
part of this message is enough of a clarification.

> >> Also, here's an idea to maybe make PeterZ happier: relax the
> >> restriction a bit per-controller.  Currently (except for /), if you
> >> have subtree control enabled you can't have any processes in the
> >> cgroup.  Could you change this so it only applies to certain
> >> controllers?  If the cpu controller is entirely happy to have
> >> processes and cgroups as siblings, then maybe a cgroup with only cpu
> >> subtree control enabled could allow processes to exist.
> >
> > The document lists several reasons for not doing this and also notes
> > that there is no known real-world use case for such a configuration.

So, up until this point, we were talking about no-internal-tasks
constraint.

> My company's production workload would map quite nicely to this
> relaxed model.  I have quite a few processes each with several
> threads.  Some of those threads get some CPUs, some get other CPUs,
> and they vary in what shares of what CPUs they get.  To be clear,
> there is not a hierarchy of resource usage that's compatible with the
> process hierarchy.  Multiple processes have threads that should be
> grouped in a different place in the hierarchy than other threads.
> Concretely, I have processes A and B with threads A1, A2, B1, and B2.
> (And many more, but this is enough to get the point across.)  The
> natural grouping is:
> 
> Group 1: A1 and B1
> Group 2: A2
> Group 3: B2

And now you're talking about process granularity.

> This cannot be expressed with rgroup or with cgroup2.  cgroup1 has no
> problem with it.  If I were using memcg, I would want to have a memcg
> hierarchy that was incompatible with the hierarchy above, so I
> actually find the cgroup2 insistence on a unified hierarchy to be a
> bit annoying, but I at least understand the motivation behind the
> unified hierarchy.
> 
> And I don't care that the system controller can't atomically move this
> whole mess around.  I'm currently running without systemd, so I don't

I do.  It's a horrible userland API to expose to individual
applications if the organization that a given application expects can
be disturbed by system operations.  Imagine how this would be
documented - "if this operation races with system operation, it may
return -ENOENT.  Repeating the path lookup might make the operation
succeed again."

> *have* a system controller.  If I end up migrating to systemd, I'll
> probably put this whole pile into its own slice and manage it
> manually.

Yeah, systemd has a delegation feature for cases like that, which we
depend on too.

As for your example, who performs the cgroup setup and configuration,
the application itself or an external entity?  If an external entity,
how does it know which thread is what?

And, as for rgroup not covering it, would extending rgroup to cover
multi-process cases be enough or are there more fundamental issues?

> > Yeap, the name collisions suck.  I thought about disallowing all
> > sub-cgroups which start with "KNOWN_SUBSYS." but that has a
> > non-trivial chance of breaking users who were previously happy when a
> > new controller gets added.  But, yeah, we at least should disallow the
> > known filenames.  Will think more about it.
> 
> How about disallowing names that contain a '.'?

That's guaranteed to break things left and right and, given how far
it departs from what has been allowed all along, including in v1, it'd
be a gratuitously painful change.  While name collision is a nasty
possibility, it is seldom a practical problem, as most users pick
naming schemes which are unlikely to actually collide.  Even
"$SUBSYS." is likely too broad.  Most cures seem worse than the
disease here.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
  2016-09-03 22:05                     ` Tejun Heo
  (?)
@ 2016-09-05 17:37                     ` Andy Lutomirski
  2016-09-06 10:29                         ` Peter Zijlstra
  2016-09-09 22:57                       ` Tejun Heo
  -1 siblings, 2 replies; 87+ messages in thread
From: Andy Lutomirski @ 2016-09-05 17:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Mike Galbraith, linux-kernel, kernel-team,
	open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
	Johannes Weiner, Linus Torvalds

On Sat, Sep 3, 2016 at 3:05 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Andy.
>
> On Wed, Aug 31, 2016 at 02:46:20PM -0700, Andy Lutomirski wrote:
>> > Consider a use case where the user isn't interested in fully
>> > accounting and dividing up system resources but wants to just cap
>> > resource usage from a subset of workloads.  There is no reason to
>> > require such usages to fully contain all processes in non-root
>> > cgroups.  Furthermore, it's not trivial to migrate all processes out
>> > of root to a sub-cgroup unless the agent is in full control of the
>> > boot process.
>>
>> Then please also consider exactly the same use case while running in a
>> container.
>>
>> I'm a bit frustrated that you're saying that my example failure modes
>> consist of shooting oneself in the foot and then you go on to come up
>> with your own examples that have precisely the same problem.
>
> You have a point, which is
>
>   The system-root and namespace-roots are not symmetric.
>
> and that's a valid concern.  Here's why the system-root is special.
>

[...]

>
> Now, due to the various issues with direct competition between
> processes and cgroups, cgroup v2 disallows resource control across
> them (the no-internal-tasks restriction); however, cgroup v2 currently
> doesn't apply the restriction to the system-root.  Here are the
> reasons.
>
> * It doesn't bring any practical benefits in terms of implementation.
>   As noted above, all controllers already have to allow uncontained
>   consumptions in the system-root and that's the only attribute
>   required for the exemption.
>
> * It doesn't bring any practical benefits in terms of capability.
>   Userland can trivially handle the system-root and namespace-roots in
>   a symmetrical manner.

Your idea of "trivially" doesn't match mine.  You gave a use case in
which userspace might take advantage of root being special.  If
userspace does that, then that userspace cannot be run in a container.
This could be a problem for real users.  Sure, "don't do that" is a
*valid* answer, but it's not a very helpful answer.

>
> * It's an unnecessary inconvenience, especially for cases where the
>   cgroup agent isn't in control of boot, for partial usage cases, or
>   just for playing with it.
>
> You say that I'm ignoring the same use case for namespace-scope but
> namespace-roots don't have the same hybrid function for partial and
> uncontrolled systems, so it's not clear why there even NEEDS to be
> strict symmetry.

I think their functions are much closer than you think they are.  I
want a whole Linux distro to be able to run in a container.  This
means that useful things people do in a distro or initramfs or
whatever should just work if containerized.

>
> It's easy and understandable to get hangups on asymmetries or
> exemptions like this, but they also often are acceptable trade-offs.
> It's really frustrating to see you first getting hung up on "this must
> be wrong" and even after explanations repeating the same thing just in
> different ways.
>
> If there is something fundamentally wrong with it, sure, let's fix it,
> but what's actually broken?

I'm not saying it's fundamentally wrong.  I'm saying it's a design
that has a big wart, and that wart is unfortunate, and after thinking
a bit, I'm starting to agree with PeterZ that this is problematic.  It
also seems fixable: the constraint could be relaxed.

>> >> Also, here's an idea to maybe make PeterZ happier: relax the
>> >> restriction a bit per-controller.  Currently (except for /), if you
>> >> have subtree control enabled you can't have any processes in the
>> >> cgroup.  Could you change this so it only applies to certain
>> >> controllers?  If the cpu controller is entirely happy to have
>> >> processes and cgroups as siblings, then maybe a cgroup with only cpu
>> >> subtree control enabled could allow processes to exist.
>> >
>> > The document lists several reasons for not doing this and also notes
>> > that there is no known real-world use case for such a configuration.
>
> So, up until this point, we were talking about no-internal-tasks
> constraint.

Isn't this the same thing?  IIUC the constraint in question is that,
if a non-root cgroup has subtree control on, then it can't have
processes in it.  This is the no-internal-tasks constraint, right?

And I still think that, at least for cpu, nothing at all goes wrong if
you allow processes to exist in cgroups that have cpu set in
subtree-control.
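
To make the restriction concrete, here is a sketch of what currently
gets refused (the cgroup2 mount point and names are assumptions, and
memory stands in for cpu since the cpu controller isn't upstream in
v2):

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/stat.h>
  #include <unistd.h>

  static int put(const char *path, const char *val)
  {
      int fd = open(path, O_WRONLY);
      int ret = 0;

      if (fd < 0 || write(fd, val, strlen(val)) < 0)
          ret = -1;
      if (fd >= 0)
          close(fd);
      return ret;
  }

  int main(void)
  {
      char pid[32];

      snprintf(pid, sizeof(pid), "%d", (int)getpid());
      mkdir("/sys/fs/cgroup/mid", 0755);
      mkdir("/sys/fs/cgroup/mid/leaf", 0755);
      put("/sys/fs/cgroup/cgroup.subtree_control", "+memory");
      put("/sys/fs/cgroup/mid/cgroup.subtree_control", "+memory");

      /* "mid" now has a controller enabled for its children, so
       * attaching a process to "mid" itself is rejected... */
      if (put("/sys/fs/cgroup/mid/cgroup.procs", pid) < 0)
          fprintf(stderr, "attach to internal node refused\n");
      /* ...while the leaf accepts it. */
      put("/sys/fs/cgroup/mid/leaf/cgroup.procs", pid);
      return 0;
  }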

----- begin talking about process granularity -----

>
>> My company's production workload would map quite nicely to this
>> relaxed model.  I have quite a few processes each with several
>> threads.  Some of those threads get some CPUs, some get other CPUs,
>> and they vary in what shares of what CPUs they get.  To be clear,
>> there is not a hierarchy of resource usage that's compatible with the
>> process hierarchy.  Multiple processes have threads that should be
>> grouped in a different place in the hierarchy than other threads.
>> Concretely, I have processes A and B with threads A1, A2, B1, and B2.
>> (And many more, but this is enough to get the point across.)  The
>> natural grouping is:
>>
>> Group 1: A1 and B1
>> Group 2: A2
>> Group 3: B2
>
> And now you're talking about process granularity.

Yes.

>
>> This cannot be expressed with rgroup or with cgroup2.  cgroup1 has no
>> problem with it.  If I were using memcg, I would want to have a memcg
>> hierarchy that was incompatible with the hierarchy above, so I
>> actually find the cgroup2 insistence on a unified hierarchy to be a
>> bit annoying, but I at least understand the motivation behind the
>> unified hierarchy.
>>
>> And I don't care that the system controller can't atomically move this
>> whole mess around.  I'm currently running without systemd, so I don't
>
> I do.  It's a horrible userland API to expose to individual
> applications if the organization that a given application expects can
> be disturbed by system operations.  Imagine how this would be
> documented - "if this operation races with system operation, it may
> return -ENOENT.  Repeating the path lookup might make the operation
> succeed again."

It could be made to work without races, though, with minimal (or even
no) ABI change.  The managed program could grab an fd pointing to its
cgroup.  Then it would use openat, etc for all operations.  As long as
'mv /cgroup/a/b /cgroup/c/" didn't cause that fd to stop working,
we're fine.
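
A sketch of that fd-relative pattern, using only syscalls that exist
today (the cgroup path is an assumption):

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      char buf[32];
      int dfd, fd;

      /* Pin our own cgroup directory once, at startup. */
      dfd = open("/sys/fs/cgroup/myapp", O_DIRECTORY | O_CLOEXEC);
      if (dfd < 0)
          return 1;

      /* Later operations resolve relative to the pinned fd; a
       * concurrent rename of the cgroup by a system agent doesn't
       * invalidate the fd, so no path re-walk is needed. */
      fd = openat(dfd, "cgroup.procs", O_WRONLY);
      if (fd < 0)
          return 1;
      snprintf(buf, sizeof(buf), "%d", (int)getpid());
      write(fd, buf, strlen(buf));
      close(fd);
      return 0;
  }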

Note that this pretty much has to work if cgroup namespaces are to
allow rearrangement of the hierarchy -- '/cgroup/' from inside the
namespace has to remain valid at all times.

Obviously this only works if the cgroup in question doesn't itself get
destroyed, but having an internal hierarchy is a bit nonsensical if
the application shares a cgroup with another application, so that
shouldn't be a problem in practice.

In fact, ISTM that allowing applications to manage cgroup
sub-hierarchies has almost exactly the same set of constraints as
allowing namespaced cgroup managers to work.  In a container, the
outer manager manages where the container lives and the container
manages its own hierarchy.  Why can't fancy cgroup-aware applications
work exactly the same way?

>
>> *have* a system controller.  If I end up migrating to systemd, I'll
>> probably put this whole pile into its own slice and manage it
>> manually.
>
> Yeah, systemd has delegation feature for cases like that which we
> depend on too.
>
> As for your example, who performs the cgroup setup and configuration,
> the application itself or an external entity?  If an external entity,
> how does it know which thread is what?

In my case, it would be a little script that reads a config file that
knows all kinds of internal information about the application and its
threads.

>
> And, as for rgroup not covering it, would extending rgroup to cover
> multi-process cases be enough or are there more fundamental issues?

Maybe, as long as the configuration could actually be created -- IIUC
the current rgroup proposal requires that the hierarchy of groups
matches the hierarchy implied by clone(), which isn't going to happen
in my case.

But, given that this fancy-cgroup-aware-multiprocess-application case
looks so much like cgroup-using container, ISTM you could solve the
problem completely by just allowing tasks to be split out by users who
want to do it.  (Obviously those users will get funny results if they
try to do this to memcg.  "Don't do that" seems fine here.)  I don't
expect the race condition issues you're worried about to happen in
practice.  Certainly not in my case, since I control the entire
system.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-09-06 10:29                         ` Peter Zijlstra
  0 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2016-09-06 10:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Tejun Heo, Ingo Molnar, Mike Galbraith, linux-kernel,
	kernel-team, open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Johannes Weiner,
	Linus Torvalds

On Mon, Sep 05, 2016 at 10:37:55AM -0700, Andy Lutomirski wrote:
> And I still think that, at least for cpu, nothing at all goes wrong if
> you allow processes to exist in cgroups that have cpu set in
> subtree-control.

cpu, cpuset, perf, cpuacct (although we all agree that really should be
part of cpu), pid, and possibly freezer (but I think we all agree
freezer is 'broken').

That's roughly half the controllers out there.

They all work on tasks, and should therefore have no problem
whatsoever allowing the full hierarchy without silly exceptions and
constraints.
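
To make those "exceptions and constraints" concrete, here is a sketch
-- the cgroup "a" and the mount point are assumptions, and I'm using
memory because cpu isn't in v2 yet.  Enabling a controller for the
subtree of a populated non-root cgroup is rejected (-EBUSY in the
current code):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* /sys/fs/cgroup/a is assumed to already contain a process. */
        int fd = open("/sys/fs/cgroup/a/cgroup.subtree_control", O_WRONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* Internal processes and enabled child controllers are mutually
         * exclusive below the root, so this write fails. */
        if (write(fd, "+memory", 7) < 0)
                fprintf(stderr, "subtree_control: %s\n", strerror(errno));
        close(fd);
        return 0;
}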



The fundamental problem is that we have two different types of
controllers.  On the one hand there are the controllers above, which
work on tasks, form groups of them and build up from that.  Let's
call them task-controllers.

On the other hand we have controllers like memcg which take the
'system' as a whole and shrink it down into smaller bits.  Let's call
these system-controllers.


The two types are fundamentally at odds in capability, simply because
of the granularity they can work on.

Merging the two into a common hierarchy is a useful concept for
containerization, no argument on that, esp. when also coupled with
namespaces and the like.


However, what I object to _most_ strongly is having this one use
dominate and destroy the capabilities (which are in use) of the
task-controllers.


> > I do.  It's a horrible userland API to expose to individual
> > applications if the organization that a given application expects can
> > be disturbed by system operations.  Imagine how this would be
> > documented - "if this operation races with system operation, it may
> > return -ENOENT.  Repeating the path lookup might make the operation
> > succeed again."
> 
> It could be made to work without races, though, with minimal (or even
> no) ABI change.  The managed program could grab an fd pointing to its
> cgroup.  Then it would use openat, etc for all operations.  As long as
> 'mv /cgroup/a/b /cgroup/c/' didn't cause that fd to stop working,
> we're fine.

I've mentioned openat() and related APIs several times, but so far never
got good reasons why that wouldn't work.



Also note that in order to partition the cpus with cpusets, you're
required to generate a disjoint hierarchy (that is, one where the
(common) parent is 'disabled' and the children have no overlap).

This is rather fundamental to partitioning, that by its very nature
requires separation.

The result is that if you want to place your RT threads (consider an
application that consists of RT and !RT parts) in a different partition
there is no common parent you can place the process in.
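
Concretely, with v1 cpuset (a sketch; the mount point, CPU ranges and
the TID are assumptions: cpuset mounted at /sys/fs/cgroup/cpuset, four
CPUs, the RT threads meant to run on CPUs 2-3):

#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_str(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);
        if (fd >= 0) {
                write(fd, val, strlen(val));
                close(fd);
        }
}

int main(void)
{
        /* Two exclusive, disjoint siblings; the (common) parent keeps
         * no CPUs for itself. */
        mkdir("/sys/fs/cgroup/cpuset/normal", 0755);
        mkdir("/sys/fs/cgroup/cpuset/rt", 0755);

        write_str("/sys/fs/cgroup/cpuset/normal/cpuset.mems", "0");
        write_str("/sys/fs/cgroup/cpuset/normal/cpuset.cpus", "0-1");
        write_str("/sys/fs/cgroup/cpuset/normal/cpuset.cpu_exclusive", "1");

        write_str("/sys/fs/cgroup/cpuset/rt/cpuset.mems", "0");
        write_str("/sys/fs/cgroup/cpuset/rt/cpuset.cpus", "2-3");
        write_str("/sys/fs/cgroup/cpuset/rt/cpuset.cpu_exclusive", "1");

        /* v1 is task granular: individual TIDs go into 'tasks'.  1234
         * is a hypothetical RT thread; its !RT siblings would go into
         * normal/tasks, so the process as a whole lives nowhere. */
        write_str("/sys/fs/cgroup/cpuset/rt/tasks", "1234");
        return 0;
}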


cgroup-v2, by placing the system-style controllers first and foremost,
completely renders that scenario impossible. Note also that any proposed
rgroup would not work for this, since that, per design, is a subtree,
and therefore not disjoint.


So my objection to the whole cgroup-v2 model and implementation stems
from the fact that it purports to be a 'better' and 'improved' system,
while in actuality it neuters and destroys a lot of useful use cases.

It completely disregards all task-controllers and labels their use
cases as irrelevant.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
  2016-09-05 17:37                     ` Andy Lutomirski
  2016-09-06 10:29                         ` Peter Zijlstra
@ 2016-09-09 22:57                       ` Tejun Heo
  2016-09-10  8:54                           ` Mike Galbraith
                                           ` (3 more replies)
  1 sibling, 4 replies; 87+ messages in thread
From: Tejun Heo @ 2016-09-09 22:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Mike Galbraith, linux-kernel, kernel-team,
	open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
	Johannes Weiner, Linus Torvalds

Hello, again.

On Mon, Sep 05, 2016 at 10:37:55AM -0700, Andy Lutomirski wrote:
> > * It doesn't bring any practical benefits in terms of capability.
> >   Userland can trivially handle the system-root and namespace-roots in
> >   a symmetrical manner.
> 
> Your idea of "trivially" doesn't match mine.  You gave a use case in

I suppose I wasn't clear enough.  It is trivial in the sense that if
the userland implements something which works for namespace-root, it
would work the same in system-root without further modifications.

> which userspace might take advantage of root being special.  If

I was emphasizing the cases where userspace would have to deal with
the inherent differences, and, when they don't, they can behave
exactly the same way.

> userspace does that, then that userspace cannot be run in a container.
> This could be a problem for real users.  Sure, "don't do that" is a
> *valid* answer, but it's not a very helpful answer.

Great, now we agree that what's currently implemented is valid.  I
think you're still failing to recognize the inherent specialness of
the system-root and how much unnecessary pain the removal of the
exemption would cause at virtually no practical gain.  I won't repeat
the same backing points here.

> > * It's an unnecessary inconvenience, especially for cases where the
> >   cgroup agent isn't in control of boot, for partial usage cases, or
> >   just for playing with it.
> >
> > You say that I'm ignoring the same use case for namespace-scope but
> > namespace-roots don't have the same hybrid function for partial and
> > uncontrolled systems, so it's not clear why there even NEEDS to be
> > strict symmetry.
> 
> I think their functions are much closer than you think they are.  I
> want a whole Linux distro to be able to run in a container.  This
> means that useful things people do in a distro or initramfs or
> whatever should just work if containerized.

There isn't much which is getting in the way of doing that.  Again,
something which follows the no-internal-tasks rule would behave the same no
matter where it is.  The system-root is different in that it is exempt
from the rule and thus is more flexible but that difference is serving
the purpose of handling the inherent specialness of the system-root.
AFAICS, it is the solution which causes the least amount of contortion
and unnecessary inconvenience to userland.

> > It's easy and understandable to get hangups on asymmetries or
> > exemptions like this, but they also often are acceptable trade-offs.
> > It's really frustrating to see you first getting hung up on "this must
> > be wrong" and even after explanations repeating the same thing just in
> > different ways.
> >
> > If there is something fundamentally wrong with it, sure, let's fix it,
> > but what's actually broken?
> 
> I'm not saying it's fundamentally wrong.  I'm saying it's a design

You were.

> that has a big wart, and that wart is unfortunate, and after thinking
> a bit, I'm starting to agree with PeterZ that this is problematic.  It
> also seems fixable: the constraint could be relaxed.

You've been pushing for enforcing the restriction on the system-root
too and now are jumping to the opposite end.  It's really frustrating
that this is such a whack-a-mole game where you throw ideas without
really thinking through them and only concede the bare minimum when
all other logical avenues are closed off.  Here, again, you seem to be
stating a strong opinion when you haven't fully thought about it or
tried to understand the reasons behind it.

But, whatever, let's go there: Given the arguments that I laid out for
the no-internal-tasks rule, how does the problem seem fixable through
relaxing the constraint?

> >> >> Also, here's an idea to maybe make PeterZ happier: relax the
> >> >> restriction a bit per-controller.  Currently (except for /), if you
> >> >> have subtree control enabled you can't have any processes in the
> >> >> cgroup.  Could you change this so it only applies to certain
> >> >> controllers?  If the cpu controller is entirely happy to have
> >> >> processes and cgroups as siblings, then maybe a cgroup with only cpu
> >> >> subtree control enabled could allow processes to exist.
> >> >
> >> > The document lists several reasons for not doing this and also that
> >> > there is no known real world use case for such configuration.
> >
> > So, up until this point, we were talking about no-internal-tasks
> > constraint.
> 
> Isn't this the same thing?  IIUC the constraint in question is that,
> if a non-root cgroup has subtree control on, then it can't have
> processes in it.  This is the no-internal-tasks constraint, right?

Yes, that is what the no-internal-tasks rule is, but I don't understand
how that is the same thing as process granularity.  Am I completely
misunderstanding what you are trying to say here?

> And I still think that, at least for cpu, nothing at all goes wrong if
> you allow processes to exist in cgroups that have cpu set in
> subtree-control.

If you confine it to the cpu controller and ignore anonymous
consumption, the rather ugly mapping between nice and weight values,
and the fact that nobody could come up with a practical use for such a
setup, yes.  My point was never that the cpu controller can't do
it but that we should find a better way of coordinating it with other
controllers and exposing it to individual applications.
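
For reference, the kernel maps nice levels to weights through a
precomputed table, sched_prio_to_weight[] in kernel/sched/core.c, with
nice 0 at weight 1024 and roughly a 1.25x step per nice level; the
sketch below reconstructs it only approximately, not the exact table:

/* build: cc nice2weight.c -lm */
#include <math.h>
#include <stdio.h>

int main(void)
{
        for (int nice = -20; nice <= 19; nice++)
                printf("nice %3d -> weight ~%.0f\n",
                       nice, 1024.0 * pow(1.25, -nice));
        return 0;
}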

> ----- begin talking about process granularity -----
...
> > I do.  It's a horrible userland API to expose to individual
> > applications if the organization that a given application expects can
> > be disturbed by system operations.  Imagine how this would be
> > documented - "if this operation races with system operation, it may
> > return -ENOENT.  Repeating the path lookup might make the operation
> > succeed again."
> 
> It could be made to work without races, though, with minimal (or even
> no) ABI change.  The managed program could grab an fd pointing to its
> cgroup.  Then it would use openat, etc for all operations.  As long as
> 'mv /cgroup/a/b /cgroup/c/' didn't cause that fd to stop working,
> we're fine.

After a migration, the cgroup and its interface knobs are a different
directory and files.  Semantically, during migration, we aren't moving
the directory or files and it'd be bizarre to overlay the semantics
you're describing on top of the existing cgroupfs.  We will have to
break away from the very basic vfs rules such as a fd, once opened,
always corresponding to the same file.  The only thing openat(2) does
is abstract away prefix handling, and that is only a small part of
the problem.

A more acceptable way could be implementing, say, a per-task filesystem
which always appears at a fixed location and proxies the operations;
however, even this wouldn't be able to handle issues stemming from
lack of actual atomicity.  Think about two tasks accessing the same
interface file.  If they race against outside agent migrating them
one-by-one, they may or may not be accessing the same file.  If they
perform operations with side effects such as config changes, creation
of sub-cgroups and migrations, what would be the end result?

In addition, a per-task filesystem is a far worse interface to
program against than a system-call based API, especially when the same
API which is used to do the exact same operations on threads can be
reused for resource groups.

> Note that this pretty much has to work if cgroup namespaces are to
> allow rearrangement of the hierarchy -- '/cgroup/' from inside the
> namespace has to remain valid at all times.

If I'm not mistaken, namespaces don't allow this type of dynamic
migration.

> Obviously this only works if the cgroup in question doesn't itself get
> destroyed, but having an internal hierarchy is a bit nonsensical if
> the application shares a cgroup with another application, so that
> shouldn't be a problem in practice.
>
> In fact, ISTM that allowing applications to manage cgroup
> sub-hierarchies has almost exactly the same set of constraints as
> allowing namespaced cgroup managers to work.  In a container, the
> outer manager manages where the container lives and the container
> manages its own hierarchy.  Why can't fancy cgroup-aware applications
> work exactly the same way?

System agents and individual applications are different.  This is the
same argument that you brought up earlier in this thread where you
said that userland can just set up namespaces for individual
applications.  In purely mathematical terms, they can be mapped to
each other but that grossly ignores practical differences between
them.

Most applications should and want to keep their assumptions
conservative, robust and portable, and not dependent on some crazy
fragile and custom-built namespace setup that nobody in the stack is
really responsible for.  How many would ever program against something
like that?

A system agent has a large part of the system configuration under its
control (it's the system agent after all) and thus is way more
flexible in what assumptions it can dictate and depend on.

> > Yeah, systemd has delegation feature for cases like that which we
> > depend on too.
> >
> > As for your example, who performs the cgroup setup and configuration,
> > the application itself or an external entity?  If an external entity,
> > how does it know which thread is what?
> 
> In my case, it would be a little script that reads a config file that
> knows all kinds of internal information about the application and its
> threads.

I see.  One-of-a-kind custom setup.  This is a completely valid usage;
however, please also recognize that it's an extremely specific one
which is niche by definition.  If we're going to support
in-application hierarchical resource control, I think it's very
important to make sure that it's something which is easy to use and
widely accessible so that any lay application can make use of it.
I'll come back to this point later.

> > And, as for rgroup not covering it, would extending rgroup to cover
> > multi-process cases be enough or are there more fundamental issues?
> 
> Maybe, as long as the configuration could actually be created -- IIUC
> the current rgroup proposal requires that the hierarchy of groups
> matches the hierarchy implied by clone(), which isn't going to happen
> in my case.

We can make that dynamic as long as the subtree is properly scoped;
however, there is an important design decision to make here.  If we
open up full-on dynamic migrations to individual applications, we
commit ourselves to supporting arbitrarily high frequency migration
operations, which we've never supported before and will restrict what
we can do in terms of optimizing hot paths over migration.

We haven't had to face this decision because cgroup has never properly
supported delegating to applications and the in-use setups where this
happens are custom configurations where there is no boundary between
system and applications, and ad-hoc trial-and-error is a good enough
way to find a working solution.  That wiggle room goes away once we
officially open this up to individual applications.

So, if we decide to open up dynamic assignment, we need to weigh what
we gain in terms of capabilities against reduction of implementation
maneuvering room.  I guess there can be a middleground where, for
example, only initial assignment is allowed.

It is really difficult to understand your position without
understanding where the requirements are coming from.  Can you please
elaborate more on the workload?  Why is the specific configuration
useful?  What is it trying to achieve?

> But, given that this fancy-cgroup-aware-multiprocess-application case
> looks so much like a cgroup-using container, ISTM you could solve the
> problem completely by just allowing tasks to be split out by users who
> want to do it.  (Obviously those users will get funny results if they
> try to do this to memcg.  "Don't do that" seems fine here.)  I don't
> expect the race condition issues you're worried about to happen in
> practice.  Certainly not in my case, since I control the entire
> system.

What people do now with cgroup inside an application is extremely
limited.  Because there is no proper support for it, each use case has
to craft up a dedicated custom setup which is all but guaranteed to be
incompatible with what someone else would come up with for another
application.  Everybody is in "this is mine, I control the entire
system" mindset, which is fine for those specific setups but
detrimental to making it widely available and useful.

Accepting some measured restrictions and building a common ground for
everyone can make in-application cgroup usages vastly more accessible
and useful than now.  Certain things would need to be done differently
and maybe some scenarios won't be supported as well but those are
trade-offs that we'd need to weigh against what we gain.  Another
point is that, for very specific use cases where none of these generic
concerns matter, continuing to use cgroup v1 is fine.  The lack of common
resource domains has never been an issue for those use cases anyway.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-09-10  8:54                           ` Mike Galbraith
  0 siblings, 0 replies; 87+ messages in thread
From: Mike Galbraith @ 2016-09-10  8:54 UTC (permalink / raw)
  To: Tejun Heo, Andy Lutomirski
  Cc: Ingo Molnar, linux-kernel, kernel-team,
	open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
	Johannes Weiner, Linus Torvalds

On Fri, 2016-09-09 at 18:57 -0400, Tejun Heo wrote:

> But, whatever, let's go there: Given the arguments that I laid out for
> the no-internal-tasks rule, how does the problem seem fixable through
> relaxing the constraint?

Well, for one thing, cpusets would cease to leak CPUs.  With the
no-internal-tasks constraint, no task can acquire the affinity of
exclusive set A if set B is an exclusive subset thereof, as there is
one and only one spot where the affinity of set A exists: in the
forbidden set A.

Relaxing no-internal-tasks would fix that, but without also relaxing
the process-only rule, cpusets would remain useless for the purpose
for which they were created.  After all, it doesn't do much good to use the
one and only dynamic partitioning tool to partition a box if you cannot
subsequently place your tasks/threads properly therein.

> What people do now with cgroup inside an application is extremely
> limited.  Because there is no proper support for it, each use case has
> to craft up a dedicated custom setup which is all but guaranteed to be
> incompatible with what someone else would come up with for another
> application.  Everybody is in "this is mine, I control the entire
> system" mindset, which is fine for those specific setups but
> detrimental to making it widely available and useful.

IMO, the problem with that "making it available to the huddled masses"
bit is that it is a completely unrealistic fantasy.  Can hordes of
programs really autonomously carve up a single set of resources?  I do
not believe they can.  The system agent cannot autonomously do so
either.  Intimate knowledge of local requirements is not optional; it
is a prerequisite to sound decision-making.  You have to have a
well-defined need before it makes any sense to turn these things on;
they are not free, and the impact is global.

	-Mike

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-09-10 10:08                           ` Mike Galbraith
  0 siblings, 0 replies; 87+ messages in thread
From: Mike Galbraith @ 2016-09-10 10:08 UTC (permalink / raw)
  To: Tejun Heo, Andy Lutomirski
  Cc: Ingo Molnar, linux-kernel, kernel-team,
	open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
	Johannes Weiner, Linus Torvalds

On Fri, 2016-09-09 at 18:57 -0400, Tejun Heo wrote:

> > > As for your example, who performs the cgroup setup and configuration,
> > > the application itself or an external entity?  If an external entity,
> > > how does it know which thread is what?
> > 
> > In my case, it would be a little script that reads a config file that
> > knows all kinds of internal information about the application and its
> > threads.
> 
> I see.  One-of-a-kind custom setup.  This is a completely valid usage;
> however, please also recognize that it's an extremely specific one
> which is niche by definition.

This is the same pigeonhole you placed Google into.  So Google, my
(also decidedly non-petite) users, and now Andy are all sharing the
one-of-a-kind, extremely specific niche... it's becoming a tad crowded.

	-Mike

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-09-12 15:20                           ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 87+ messages in thread
From: Austin S. Hemmelgarn @ 2016-09-12 15:20 UTC (permalink / raw)
  To: Tejun Heo, Andy Lutomirski
  Cc: Ingo Molnar, Mike Galbraith, linux-kernel, kernel-team,
	open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
	Johannes Weiner, Linus Torvalds

On 2016-09-09 18:57, Tejun Heo wrote:
> Hello, again.
>
> On Mon, Sep 05, 2016 at 10:37:55AM -0700, Andy Lutomirski wrote:
>>> * It doesn't bring any practical benefits in terms of capability.
>>>   Userland can trivially handle the system-root and namespace-roots in
>>>   a symmetrical manner.
>>
>> Your idea of "trivially" doesn't match mine.  You gave a use case in
>
> I suppose I wasn't clear enough.  It is trivial in the sense that if
> the userland implements something which works for namespace-root, it
> would work the same in system-root without further modifications.
>
>> which userspace might take advantage of root being special.  If
>
> I was emphasizing the cases where userspace would have to deal with
> the inherent differences, and, when they don't, they can behave
> exactly the same way.
>
>> userspace does that, then that userspace cannot be run in a container.
>> This could be a problem for real users.  Sure, "don't do that" is a
>> *valid* answer, but it's not a very helpful answer.
>
> Great, now we agree that what's currently implemented is valid.  I
> think you're still failing to recognize the inherent specialness of
> the system-root and how much unnecessary pain the removal of the
> exemption would cause at virtually no practical gain.  I won't repeat
> the same backing points here.
>
>>> * It's an unnecessary inconvenience, especially for cases where the
>>>   cgroup agent isn't in control of boot, for partial usage cases, or
>>>   just for playing with it.
>>>
>>> You say that I'm ignoring the same use case for namespace-scope but
>>> namespace-roots don't have the same hybrid function for partial and
>>> uncontrolled systems, so it's not clear why there even NEEDS to be
>>> strict symmetry.
>>
>> I think their functions are much closer than you think they are.  I
>> want a whole Linux distro to be able to run in a container.  This
>> means that useful things people do in a distro or initramfs or
>> whatever should just work if containerized.
>
> There isn't much which is getting in the way of doing that.  Again,
> something which follows the no-internal-tasks rule would behave the same no
> matter where it is.  The system-root is different in that it is exempt
> from the rule and thus is more flexible but that difference is serving
> the purpose of handling the inherent specialness of the system-root.
> AFAICS, it is the solution which causes the least amount of contortion
> and unnecessary inconvenience to userland.
>
>>> It's easy and understandable to get hangups on asymmetries or
>>> exemptions like this, but they also often are acceptable trade-offs.
>>> It's really frustrating to see you first getting hung up on "this must
>>> be wrong" and even after explanations repeating the same thing just in
>>> different ways.
>>>
>>> If there is something fundamentally wrong with it, sure, let's fix it,
>>> but what's actually broken?
>>
>> I'm not saying it's fundamentally wrong.  I'm saying it's a design
>
> You were.
>
>> that has a big wart, and that wart is unfortunate, and after thinking
>> a bit, I'm starting to agree with PeterZ that this is problematic.  It
>> also seems fixable: the constraint could be relaxed.
>
> You've been pushing for enforcing the restriction on the system-root
> too and now are jumping to the opposite end.  It's really frustrating
> that this is such a whack-a-mole game where you throw ideas without
> really thinking through them and only concede the bare minimum when
> all other logical avenues are closed off.  Here, again, you seem to be
> stating a strong opinion when you haven't fully thought about it or
> tried to understand the reasons behind it.
>
> But, whatever, let's go there: Given the arguments that I laid out for
> the no-internal-tasks rule, how does the problem seem fixable through
> relaxing the constraint?
>
>>>>>> Also, here's an idea to maybe make PeterZ happier: relax the
>>>>>> restriction a bit per-controller.  Currently (except for /), if you
>>>>>> have subtree control enabled you can't have any processes in the
>>>>>> cgroup.  Could you change this so it only applies to certain
>>>>>> controllers?  If the cpu controller is entirely happy to have
>>>>>> processes and cgroups as siblings, then maybe a cgroup with only cpu
>>>>>> subtree control enabled could allow processes to exist.
>>>>>
>>>>> The document lists several reasons for not doing this and also that
>>>>> there is no known real world use case for such configuration.
>>>
>>> So, up until this point, we were talking about no-internal-tasks
>>> constraint.
>>
>> Isn't this the same thing?  IIUC the constraint in question is that,
>> if a non-root cgroup has subtree control on, then it can't have
>> processes in it.  This is the no-internal-tasks constraint, right?
>
> Yes, that is what the no-internal-tasks rule is, but I don't understand
> how that is the same thing as process granularity.  Am I completely
> misunderstanding what you are trying to say here?
>
>> And I still think that, at least for cpu, nothing at all goes wrong if
>> you allow processes to exist in cgroups that have cpu set in
>> subtree-control.
>
> If you confine it to the cpu controller and ignore anonymous
> consumption, the rather ugly mapping between nice and weight values,
> and the fact that nobody could come up with a practical use for such a
> setup, yes.  My point was never that the cpu controller can't do
> it but that we should find a better way of coordinating it with other
> controllers and exposing it to individual applications.
So, having a container where not everything in the container is split 
further into subgroups is not a practically useful situation?  Because 
that's exactly what both systemd and every other cgroup management
tool expect to work as things stand right now.  The root cgroup within
a cgroup namespace has to function exactly like the system-root, 
otherwise nothing can depend on the special cases for the system root, 
because they might get run in a cgroup namespace and such assumptions 
will be invalid.  This in turn means that no current distro can run 
unmodified in a cgroup namespace under a v2 hierarchy, which is a Very 
Bad Thing.
>
>> ----- begin talking about process granularity -----
> ...
>>> I do.  It's a horrible userland API to expose to individual
>>> applications if the organization that a given application expects can
>>> be disturbed by system operations.  Imagine how this would be
>>> documented - "if this operation races with system operation, it may
>>> return -ENOENT.  Repeating the path lookup might make the operation
>>> succeed again."
>>
>> It could be made to work without races, though, with minimal (or even
>> no) ABI change.  The managed program could grab an fd pointing to its
>> cgroup.  Then it would use openat, etc for all operations.  As long as
>> 'mv /cgroup/a/b /cgroup/c/' didn't cause that fd to stop working,
>> we're fine.
>
> After a migration, the cgroup and its interface knobs are a different
> directory and files.  Semantically, during migration, we aren't moving
> the directory or files and it'd be bizarre to overlay the semantics
> you're describing on top of the existing cgroupfs.  We will have to
> break away from the very basic vfs rules such as a fd, once opened,
> always corresponding to the same file.  The only thing openat(2) does
> is abstract away prefix handling, and that is only a small part of
> the problem.
>
> A more acceptable way could be implementing, say, a per-task filesystem
> which always appears at a fixed location and proxies the operations;
> however, even this wouldn't be able to handle issues stemming from
> lack of actual atomicity.  Think about two tasks accessing the same
> interface file.  If they race against outside agent migrating them
> one-by-one, they may or may not be accessing the same file.  If they
> perform operations with side effects such as config changes, creation
> of sub-cgroups and migrations, what would be the end result?
>
> In addition, a per-task filesystem is a far worse interface to
> program against than a system-call based API, especially when the same
> API which is used to do the exact same operations on threads can be
> reused for resource groups.
>
>> Note that this pretty much has to work if cgroup namespaces are to
>> allow rearrangement of the hierarchy -- '/cgroup/' from inside the
>> namespace has to remain valid at all times.
>
> If I'm not mistaken, namespaces don't allow this type of dynamic
> migration.
>
>> Obviously this only works if the cgroup in question doesn't itself get
>> destroyed, but having an internal hierarchy is a bit nonsensical if
>> the application shares a cgroup with another application, so that
>> shouldn't be a problem in practice.
>>
>> In fact, ISTM that allowing applications to manage cgroup
>> sub-hierarchies has almost exactly the same set of constraints as
>> allowing namespaced cgroup managers to work.  In a container, the
>> outer manager manages where the container lives and the container
>> manages its own hierarchy.  Why can't fancy cgroup-aware applications
>> work exactly the same way?
>
> System agents and individual applications are different.  This is the
> same argument that you brought up earlier in this thread where you
> said that userland can just set up namespaces for individual
> applications.  In purely mathematical terms, they can be mapped to
> each other but that grossly ignores practical differences between
> them.
>
> Most applications should and want to keep their assumptions
> conservative, robust and portable, and not dependent on some crazy
> fragile and custom-built namespace setup that nobody in the stack is
> really responsible for.  How many would ever program against something
> like that?
>
> A system agent has a large part of the system configuration under its
> control (it's the system agent after all) and thus is way more
> flexible in what assumptions it can dictate and depend on.
>
>>> Yeah, systemd has delegation feature for cases like that which we
>>> depend on too.
>>>
>>> As for your example, who performs the cgroup setup and configuration,
>>> the application itself or an external entity?  If an external entity,
>>> how does it know which thread is what?
>>
>> In my case, it would be a little script that reads a config file that
>> knows all kinds of internal information about the application and its
>> threads.
>
> I see.  One-of-a-kind custom setup.  This is a completely valid usage;
> however, please also recognize that it's an extremely specific one
> which is niche by definition.  If we're going to support
> in-application hierarchical resource control, I think it's very
> important to make sure that it's something which is easy to use and
> widely accessible so that any lay application can make use of it.
> I'll come back to this point later.
>
>>> And, as for rgroup not covering it, would extending rgroup to cover
>>> multi-process cases be enough or are there more fundamental issues?
>>
>> Maybe, as long as the configuration could actually be created -- IIUC
>> the current rgroup proposal requires that the hierarchy of groups
>> matches the hierarchy implied by clone(), which isn't going to happen
>> in my case.
>
> We can make that dynamic as long as the subtree is properly scoped;
> however, there is an important design decision to make here.  If we
> open up full-on dynamic migrations to individual applications, we
> commit ourselves to supporting arbitrarily high frequency migration
> operations, which we've never supported before and will restrict what
> we can do in terms of optimizing hot paths over migration.
>
> We haven't had to face this decision because cgroup has never properly
> supported delegating to applications and the in-use setups where this
> happens are custom configurations where there is no boundary between
> system and applications, and ad-hoc trial-and-error is a good enough
> way to find a working solution.  That wiggle room goes away once we
> officially open this up to individual applications.
>
> So, if we decide to open up dynamic assignment, we need to weigh what
> we gain in terms of capabilities against reduction of implementation
> maneuvering room.  I guess there can be a middleground where, for
> example, only initial assignment is allowed.
>
> It is really difficult to understand your position without
> understanding where the requirements are coming from.  Can you please
> elaborate more on the workload?  Why is the specific configuration
> useful?  What is it trying to achieve?
>
>> But, given that this fancy-cgroup-aware-multiprocess-application case
>> looks so much like a cgroup-using container, ISTM you could solve the
>> problem completely by just allowing tasks to be split out by users who
>> want to do it.  (Obviously those users will get funny results if they
>> try to do this to memcg.  "Don't do that" seems fine here.)  I don't
>> expect the race condition issues you're worried about to happen in
>> practice.  Certainly not in my case, since I control the entire
>> system.
>
> What people do now with cgroup inside an application is extremely
> limited.  Because there is no proper support for it, each use case has
> to craft up a dedicated custom setup which is all but guaranteed to be
> incompatible with what someone else would come up with for another
> application.  Everybody is in "this is mine, I control the entire
> system" mindset, which is fine for those specific setups but
> detrimental to making it widely available and useful.
>
> Accepting some measured restrictions and building a common ground for
> everyone can make in-application cgroup usages vastly more accessible
> and useful than now.  Certain things would need to be done differently
> and maybe some scenarios won't be supported as well but those are
> trade-offs that we'd need to weigh against what we gain.  Another
> point is that, for very specific use cases where none of these generic
> concerns matter, continuing to use cgroup v1 is fine.  The lack of common
> resource domains has never been an issue for those use cases anyway.
>
> Thanks.
>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-09-12 15:20                           ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 87+ messages in thread
From: Austin S. Hemmelgarn @ 2016-09-12 15:20 UTC (permalink / raw)
  To: Tejun Heo, Andy Lutomirski
  Cc: Ingo Molnar, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-team-b10kYP2dOMg, open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
	Johannes Weiner, Linus Torvalds

On 2016-09-09 18:57, Tejun Heo wrote:
> Hello, again.
>
> On Mon, Sep 05, 2016 at 10:37:55AM -0700, Andy Lutomirski wrote:
>>> * It doesn't bring any practical benefits in terms of capability.
>>>   Userland can trivially handle the system-root and namespace-roots in
>>>   a symmetrical manner.
>>
>> Your idea of "trivially" doesn't match mine.  You gave a use case in
>
> I suppose I wasn't clear enough.  It is trivial in the sense that if
> the userland implements something which works for namespace-root, it
> would work the same in system-root without further modifications.
>
>> which userspace might take advantage of root being special.  If
>
> I was emphasizing the cases where userspace would have to deal with
> the inherent differences, and, when they don't, they can behave
> exactly the same way.
>
>> userspace does that, then that userspace cannot be run in a container.
>> This could be a problem for real users.  Sure, "don't do that" is a
>> *valid* answer, but it's not a very helpful answer.
>
> Great, now we agree that what's currently implemented is valid.  I
> think you're still failing to recognize the inherent specialness of
> the system-root and how much unnecessary pain the removal of the
> exemption would cause at virtually no practical gain.  I won't repeat
> the same backing points here.
>
>>> * It's an unncessary inconvenience, especially for cases where the
>>>   cgroup agent isn't in control of boot, for partial usage cases, or
>>>   just for playing with it.
>>>
>>> You say that I'm ignoring the same use case for namespace-scope but
>>> namespace-roots don't have the same hybrid function for partial and
>>> uncontrolled systems, so it's not clear why there even NEEDS to be
>>> strict symmetry.
>>
>> I think their functions are much closer than you think they are.  I
>> want a whole Linux distro to be able to run in a container.  This
>> means that useful things people do in a distro or initramfs or
>> whatever should just work if containerized.
>
> There isn't much which is getting in the way of doing that.  Again,
> something which follows no-internal-task rule would behave the same no
> matter where it is.  The system-root is different in that it is exempt
> from the rule and thus is more flexible but that difference is serving
> the purpose of handling the inherent specialness of the system-root.
> AFAICS, it is the solution which causes the least amount of contortion
> and unnecessary inconvenience to userland.
>
>>> It's easy and understandable to get hangups on asymmetries or
>>> exemptions like this, but they also often are acceptable trade-offs.
>>> It's really frustrating to see you first getting hung up on "this must
>>> be wrong" and even after explanations repeating the same thing just in
>>> different ways.
>>>
>>> If there is something fundamentally wrong with it, sure, let's fix it,
>>> but what's actually broken?
>>
>> I'm not saying it's fundamentally wrong.  I'm saying it's a design
>
> You were.
>
>> that has a big wart, and that wart is unfortunate, and after thinking
>> a bit, I'm starting to agree with PeterZ that this is problematic.  It
>> also seems fixable: the constraint could be relaxed.
>
> You've been pushing for enforcing the restriction on the system-root
> too and now are jumping to the opposite end.  It's really frustrating
> that this is such a whack-a-mole game where you throw ideas without
> really thinking through them and only concede the bare minimum when
> all other logical avenues are closed off.  Here, again, you seem to be
> stating a strong opinion when you haven't fully thought about it or
> tried to understand the reasons behind it.
>
> But, whatever, let's go there: Given the arguments that I laid out for
> the no-internal-tasks rule, how does the problem seem fixable through
> relaxing the constraint?
>
>>>>>> Also, here's an idea to maybe make PeterZ happier: relax the
>>>>>> restriction a bit per-controller.  Currently (except for /), if you
>>>>>> have subtree control enabled you can't have any processes in the
>>>>>> cgroup.  Could you change this so it only applies to certain
>>>>>> controllers?  If the cpu controller is entirely happy to have
>>>>>> processes and cgroups as siblings, then maybe a cgroup with only cpu
>>>>>> subtree control enabled could allow processes to exist.
>>>>>
>>>>> The document lists several reasons for not doing this and also that
>>>>> there is no known real world use case for such configuration.
>>>
>>> So, up until this point, we were talking about no-internal-tasks
>>> constraint.
>>
>> Isn't this the same thing?  IIUC the constraint in question is that,
>> if a non-root cgroup has subtree control on, then it can't have
>> processes in it.  This is the no-internal-tasks constraint, right?
>
> Yes, that is what no-internal-tasks rule is but I don't understand how
> that is the same thing as process granularity.  Am I completely
> misunderstanding what you are trying to say here?
>
>> And I still think that, at least for cpu, nothing at all goes wrong if
>> you allow processes to exist in cgroups that have cpu set in
>> subtree-control.
>
> If you confine it to the cpu controller, ignore anonymous
> consumptions, the rather ugly mapping between nice and weight values
> and the fact that nobody could come up with a practical usefulness for
> such setup, yes.  My point was never that the cpu controller can't do
> it but that we should find a better way of coordinating it with other
> controllers and exposing it to individual applications.
So, having a container where not everything in the container is split 
further into subgroups is not a practically useful situation?  Because 
that's exactly what both systemd and every other cgroup management tool 
expects to have work as things stand right now.  The root cgroup within 
a cgroup namespace has to function exactly like the system-root, 
otherwise nothing can depend on the special cases for the system root, 
because they might get run in a cgroup namespace and such assumptions 
will be invalid.  This in turn means that no current distro can run 
unmodified in a cgroup namespace under a v2 hierarchy, which is a Very 
Bad Thing.
>
>> ----- begin talking about process granularity -----
> ...
>>> I do.  It's a horrible userland API to expose to individual
>>> applications if the organization that a given application expects can
>>> be disturbed by system operations.  Imagine how this would be
>>> documented - "if this operation races with system operation, it may
>>> return -ENOENT.  Repeating the path lookup might make the operation
>>> succeed again."
>>
>> It could be made to work without races, though, with minimal (or even
>> no) ABI change.  The managed program could grab an fd pointing to its
>> cgroup.  Then it would use openat, etc for all operations.  As long as
>> 'mv /cgroup/a/b /cgroup/c/" didn't cause that fd to stop working,
>> we're fine.
>
> After a migration, the cgroup and its interface knobs are a different
> directory and files.  Semantically, during migration, we aren't moving
> the directory or files and it'd be bizarre to overlay the semantics
> you're describing on top of the existing cgroupfs.  We will have to
> break away from the very basic vfs rules such as a fd, once opened,
> always corresponding to the same file.  The only thing openat(2) does
> is abstracting away prefix handling and that is only a small part of
> the problem.
>
> A more acceptable way could be implementing, say, per-task filesystem
> which always appears at the fixed location and proxies the operations;
> however, even this wouldn't be able to handle issues stemming from
> lack of actual atomicity.  Think about two tasks accessing the same
> interface file.  If they race against outside agent migrating them
> one-by-one, they may or may not be accessing the same file.  If they
> perform operations with side effects such as config changes, creation
> of sub-cgroups and migrations, what would be the end result?
>
> In addition, a per-task filesystem is an a lot worse interface to
> program against than a system-call based API, especially when the same
> API which is used to do the exact same operations on threads can be
> reused for resource groups.
>
>> Note that this pretty much has to work if cgroup namespaces are to
>> allow rearrangement of the hierarchy -- '/cgroup/' from inside the
>> namespace has to remain valid at all times
>
> If I'm not mistaken, namespaces don't allow this type of dynamic
> migrations.
>
>> Obviously this only works if the cgroup in question doesn't itself get
>> destroyed, but having an internal hierarchy is a bit nonsensical if
>> the application shares a cgroup with another application, so that
>> shouldn't be a problem in practice.
>>
>> In fact, ISTM that allowing applications to manage cgroup
>> sub-hierarchies has almost exactly the same set of constraints as
>> allowing namespaced cgroup managers to work.  In a container, the
>> outer manager manages where the container lives and the container
>> manages its own hierarchy.  Why can't fancy cgroup-aware applications
>> work exactly the same way?
>
> System agents and individual applications are different.  This is the
> same argument that you brought up earlier in this thread where you
> said that userland can just set up namespaces for individual
> applications.  In purely mathematical terms, they can be mapped to
> each other but that grossly ignores practical differences between
> them.
>
> Most applications should and want to keep their assumptions
> conservative, robust and portable, and not dependent on some crazy
> fragile and custom-built namespace setup that nobody in the stack is
> really responsible for.  How many would ever program against something
> like that?
>
> A system agent has a large part of the system configuration under its
> control (it's the system agent after all) and thus is way more
> flexible in what assumptions it can dictate and depend on.
>
>>> Yeah, systemd has delegation feature for cases like that which we
>>> depend on too.
>>>
>>> As for your example, who performs the cgroup setup and configuration,
>>> the application itself or an external entity?  If an external entity,
>>> how does it know which thread is what?
>>
>> In my case, it would be a little script that reads a config file that
>> knows all kinds of internal information about the application and its
>> threads.
>
> I see.  One-of-a-kind custom setup.  This is a completely valid usage;
> however, please also recognize that it's an extremely specific one
> which is niche by definition.  If we're going to support
> in-application hierarchical resource control, I think it's very
> important to make sure that it's something which is easy to use and
> widely accessible so that any lay application can make use of it.
> I'll come back to this point later.
>
>>> And, as for rgroup not covering it, would extending rgroup to cover
>>> multi-process cases be enough or are there more fundamental issues?
>>
>> Maybe, as long as the configuration could actually be created -- IIUC
>> the current rgroup proposal requires that the hierarchy of groups
>> matches the hierarchy implied by clone(), which isn't going to happen
>> in my case.
>
> We can make that dynamic as long as the subtree is properly scoped;
> however, there is an important design decision to make here.  If we
> open up full-on dynamic migrations to individual applications, we
> commit ourselves to supporting arbitrarily high frequency migration
> operations, which we've never supported before and will restrict what
> we can do in terms of optimizing hot paths over migration.
>
> We haven't had to face this decision because cgroup has never properly
> supported delegating to applications and the in-use setups where this
> happens are custom configurations where there is no boundary between
> system and applications and adhoc trial-and-error is good enough a way
> to find a working solution.  That wiggle room goes away once we
> officially open this up to individual applications.
>
> So, if we decide to open up dynamic assignment, we need to weigh what
> we gain in terms of capabilities against reduction of implementation
> maneuvering room.  I guess there can be a middleground where, for
> example, only initial asssignment is allowed.
>
> It is really difficult to understand your position without
> understanding where the requirements are coming from.  Can you please
> elaborate more on the workload?  Why is the specific configuration
> useful?  What is it trying to achieve?
>
>> But, given that this fancy-cgroup-aware-multiprocess-application case
>> looks so much like cgroup-using container, ISTM you could solve the
>> problem completely by just allowing tasks to be split out by users who
>> want to do it.  (Obviously those users will get funny results if they
>> try to do this to memcg.  "Don't do that" seems fine here.)  I don't
>> expect the race condition issues you're worried about to happen in
>> practice.  Certainly not in my case, since I control the entire
>> system.
>
> What people do now with cgroup inside an application is extremely
> limited.  Because there is no proper support for it, each use case has
> to craft up a dedicated custom setup which is all but guaranteed to be
> incompatible with what someone else would come up for another
> application.  Everybody is in "this is mine, I control the entire
> system" mindset, which is fine for those specific setups but
> deterimental to making it widely available and useful.
>
> Accepting some measured restrictions and building a common ground for
> everyone can make in-application cgroup usages vastly more accessible
> and useful than now.  Certain things would need to be done differently
> and maybe some scenarios won't be supported as well but those are
> trade-offs that we'd need to weigh against what we gain.  Another
> point is that, for very specific use cases where none of these generic
> concerns matter, continuing to use cgroup v1 is fine.  The lack of common
> resource domains has never been an issue for those use cases anyway.
>
> Thanks.
>

* Re: [Documentation] State of CPU controller in cgroup v2
       [not found]                         ` <CALCETrUhpPQdyZ-6WRjdB+iLbpGBduRZMWXQtCuS+R7Cq7rygg@mail.gmail.com>
@ 2016-09-14 20:00                           ` Tejun Heo
  2016-09-15 20:08                               ` Andy Lutomirski
  0 siblings, 1 reply; 87+ messages in thread
From: Tejun Heo @ 2016-09-14 20:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Mike Galbraith, Andrew Morton, kernel-team,
	open list:CONTROL GROUP (CGROUP),
	Linux API, Li Zefan, Paul Turner, linux-kernel, Linus Torvalds,
	Johannes Weiner, Peter Zijlstra

Hello,

On Mon, Sep 12, 2016 at 10:39:04AM -0700, Andy Lutomirski wrote:
> > > Your idea of "trivially" doesn't match mine.  You gave a use case in
> >
> > I suppose I wasn't clear enough.  It is trivial in the sense that if
> > the userland implements something which works for namespace-root, it
> > would work the same in system-root without further modifications.
> 
> So I guess userspace can trivially get it right and can just as trivially
> get it wrong.

I wasn't trying to play a word game.  What I was trying to say is that
a configuration which works for namespace-roots works for the
system-root too, in terms of cgroup hierarchy, without any
modifications.

> > Great, now we agree that what's currently implemented is valid.  I
> > think you're still failing to recognize the inherent specialness of
> > the system-root and how much unnecessary pain the removal of the
> > exemption would cause at virtually no practical gain.  I won't repeat
> > the same backing points here.
> 
> I'm starting to think that you could extend the exemption with considerably
> less difficulty.

Can you please elaborate?  It feels like you're repeating the same
opinions without really describing them in detail or backing them up
in the last couple replies.  Having differing opinions is fine but to
actually hash them out, the opinions and their rationales need to be
laid out in detail.

> > There isn't much which is getting in the way of doing that.  Again,
> > something which follows no-internal-task rule would behave the same no
> > matter where it is.  The system-root is different in that it is exempt
> > from the rule and thus is more flexible but that difference is serving
> > the purpose of handling the inherent specialness of the system-root.
> 
> From *userspace's* POV, I still don't think there's any specialness except
> from an accounting POV.  After all, userspace has no control over the
> special stuff anyway.  And accounting doesn't matter: a namespace could
> just see zeros in any special root accounting slots.

The disagreement here isn't really consequential.  The only reason
this part became important is that you felt that something must be
broken, which you now don't think is the case.

I agree that there can be other ways to handle this but what's your
proposal here?  And how would that be practically and substantively
better than what is implemented now?

> > You've been pushing for enforcing the restriction on the system-root
> > too and now are jumping to the opposite end.  It's really frustrating
> > that this is such a whack-a-mole game where you throw ideas without
> > really thinking through them and only concede the bare minimum when
> > all other logical avenues are closed off.  Here, again, you seem to be
> > stating a strong opinion when you haven't fully thought about it or
> > tried to understand the reasons behind it.
> 
> I think you should make it work the same way in namespace roots as it does
> in the system root.  I acknowledge that there are pros and cons of each.  I
> think the current middle ground is worse than either of the consistent
> options.

Again, the only thing you're doing is restating the same opinion.  I
understand that you have an impression that this can be done better
but how exactly?

> > But, whatever, let's go there: Given the arguments that I laid out for
> > the no-internal-tasks rule, how does the problem seem fixable through
> > relaxing the constraint?
> 
> By deciding that, despite the arguments you laid out, it's still worth
> relaxing the constraint.  Or by deciding to add the constraint to the root.

You're not really saying anything of substance in the above paragraph.

> > > Isn't this the same thing?  IIUC the constraint in question is that,
> > > if a non-root cgroup has subtree control on, then it can't have
> > > processes in it.  This is the no-internal-tasks constraint, right?
> >
> > Yes, that is what no-internal-tasks rule is but I don't understand how
> > that is the same thing as process granularity.  Am I completely
> > misunderstanding what you are trying to say here?
> 
> Yes.  I'm saying that no-internal-tasks could be relaxed per controller.

I was asking whether you meant that the no-internal-tasks rule and
process granularity are the same thing, and, if that's not the case,
what the previous sentence meant.  I can't make out what you're
responding to.

> > If you confine it to the cpu controller and ignore anonymous
> > consumption, the rather ugly mapping between nice and weight values
> > and the fact that nobody could come up with a practical use for
> > such a setup, yes.  My point was never that the cpu controller can't do
> > it but that we should find a better way of coordinating it with other
> > controllers and exposing it to individual applications.
> 
> I'm not sure what the nice-vs-weight thing has to do with internal
> processes, but all of this is a question for Peter.

That part comes from the cgroup CPU controller weight being mapped to
task nice values, because the priorities of the two have to be
comparable somehow.  It's not a critical issue, just awkward.

> > After a migration, the cgroup and its interface knobs are a different
> > directory and files.  Semantically, during migration, we aren't moving
> > the directory or files and it'd be bizarre to overlay the semantics
> > you're describing on top of the existing cgroupfs.  We will have to
> > break away from the very basic vfs rules such as a fd, once opened,
> > always corresponding to the same file.
> 
> What kind of migration do you mean?  Having fds follow rename(2) around is
> the normal vfs behavior, so I don't really know what you mean.

Process or task migration by writing pid to cgroup.procs or tasks
file.  cgroup never supported directory / cgroup level migrations.
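
A minimal sketch of the kind of migration being described (the cgroup
v2 mount point and group names here are illustrative assumptions):

  # move the whole of process 1234 into "A" by writing its PID; only
  # the membership changes -- no directory or file is moved
  echo 1234 > /sys/fs/cgroup/A/cgroup.procs

  # a later write migrates it again, dynamically
  echo 1234 > /sys/fs/cgroup/B/cgroup.procs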

> > If I'm not mistaken, namespaces don't allow this type of dynamic
> > migrations.
> 
> I don't see why they couldn't allow exactly this.  If you rename(2) a
> cgroup, any namespace with that cgroup as root should keep it as root,
> completely atomically.  If this doesn't work, I'd argue that it's a bug.

I hope this part is clear now.

> > A system agent has a large part of the system configuration under its
> > control (it's the system agent after all) and thus is way more
> > flexible in what assumptions it can dictate and depend on.
> 
> Can you give an example of any use case for which a system agent would
> fork, exec a daemon that isn't written by the same developers as the system
> agent, and then walk that daemon's process tree and move the processes
> around in the cgroup hierarchy one by one?  I think this is what you're
> describing, and I don't see why doing so is sensible.  Certainly if a
> system agent gives the daemon write access to cgroupfs, it should not start
> moving that daemon's children around individually.

That's the only way anything can be moved across cgroups.  In terms of
resource control, I can't think of scenarios which would *require*
this behavior but it's still a behavior cgroup has to allow as there's
no "spawn this process in that cgroup" call and all migrations are
dynamic.
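
Lacking such a call, a sketch of the usual userspace workaround
(bash-specific $BASHPID; the path and binary are illustrative
assumptions):

  # fork first, migrate the child, then exec -- the closest userspace
  # can get to "spawn this process in that cgroup"
  ( echo $BASHPID > /sys/fs/cgroup/job/cgroup.procs && exec ./daemon )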

We can proclaim that once an application is started outer scope
shouldn't meddle with it.  That would be another restriction whose
violation would actually break applications, though.  And it doesn't
address other downsides - making in-application controls less
approachable as it requires specific setup and cooperation from the
system agent, and the interface being awkward.

> > We can make that dynamic as long as the subtree is properly scoped;
> > however, there is an important design decision to make here.  If we
> > open up full-on dynamic migrations to individual applications, we
> > commit ourselves to supporting arbitrarily high frequency migration
> > operations, which we've never supported before and will restrict what
> > we can do in terms of optimizing hot paths over migration.
> 
> I haven't (yet?) seen use cases where changing cgroups *quickly* is
> important.

Android does something along this line - creating preset cgroups and
migrating processes according to their current states.  The problem is
that once we generally open up the API to individual applications,
there is no good way of policing the usages and there certainly are
multiple ways to make use of frequent cgroup membership changes
especially for stateless controllers like CPU.

We can easily end up in situations where having several of these
usages on the same machine bogs down the whole system.  One way to
avoid this is building the API so that changing cgroup membership is
naturally unattractive - e.g. membership can be assigned only on
creation of a new thread or process, or migration can only be towards
a deeper level in the tree, so that migrations can be used to organize
the threads and processes as necessary but not used as the primary
method of adjusting configurations dynamically.
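
A sketch of the preset-group pattern referred to above (the names and
weights are illustrative; cpu.weight assumes the v2 CPU interface
carried by the patches this thread discusses):

  mkdir /sys/fs/cgroup/fg /sys/fs/cgroup/bg
  echo 512 > /sys/fs/cgroup/fg/cpu.weight
  echo 32  > /sys/fs/cgroup/bg/cpu.weight

  # on every app state change the manager re-migrates the process --
  # exactly the high-frequency pattern being worried about here
  echo $APP_PID > /sys/fs/cgroup/bg/cgroup.procs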

> > It is really difficult to understand your position without
> > understanding where the requirements are coming from.  Can you please
> > elaborate more on the workload?  Why is the specific configuration
> > useful?  What is it trying to achieve?
> 
> Multiple cooperating RT processes, most of which have non-RT helper
> threads.  For scheduling purposes, I lump the non-RT threads together.

I see.  Can you please share how the cgroups are actually configured
(ie. how the weights are assigned and so on)?

> Will you (Tejun), PeterZ, and maybe some of the other interested parties be
> at KS?  Maybe this is worth hashing out in person.

Yeap, it'd be nice to talk in person.  However, I'm not sure talking
offline is the best way to hash out technical details.  The discussion
has been painful but we're actually addressing technical
misunderstandings and pinning down where the actual disagreements lie.  We really
need to agree on what we disagree on and why first.

Thanks.

-- 
tejun

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-09-15 20:08                               ` Andy Lutomirski
  0 siblings, 0 replies; 87+ messages in thread
From: Andy Lutomirski @ 2016-09-15 20:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Mike Galbraith, Andrew Morton, kernel-team,
	open list:CONTROL GROUP (CGROUP),
	Linux API, Li Zefan, Paul Turner, linux-kernel, Linus Torvalds,
	Johannes Weiner, Peter Zijlstra

On Wed, Sep 14, 2016 at 1:00 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>

With regard to no-internal-tasks, I see (at least) three options:

1. Keep the cgroup2 status quo.  Lots of distros and such are likely
to have their cgroup management fail if run in a container.  I really,
really dislike this option.

2. Enforce no-internal-tasks for the root cgroup.  Un-cgroupable
things will still get accounted to the root cgroup even if subtree
control is on, but no tasks can be in the root cgroup if the root
cgroup has subtree control on.  (If some controllers removed the
no-internal-tasks restriction, this would apply to the root as well.)
I think this may annoy certain users.  If so, and if those users are
doing something valid, then I think that either those users should be
strongly encouraged or even forced to change so that namespacing works
for them, or that we should do (3) instead.  (A sketch of the rule
itself follows this list.)

3. Remove the no-internal-tasks restriction entirely.  I can see this
resulting in a lot of configuration awkwardness, but I think it will
*work*, especially since all of the controllers already need to do
something vaguely intelligent when subtree control is on in the root
and there are tasks in the root.
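
A sketch of the no-internal-tasks rule being debated, assuming a
cgroup v2 mount at /sys/fs/cgroup with the memory controller available
(the group name is illustrative):

  mkdir /sys/fs/cgroup/test
  echo $$ > /sys/fs/cgroup/test/cgroup.procs

  # a populated non-root cgroup cannot enable controllers for its
  # children; this write fails with EBUSY
  echo +memory > /sys/fs/cgroup/test/cgroup.subtree_control

  # the root cgroup is exempt: the same write succeeds there even
  # though tasks always live in the root
  echo +memory > /sys/fs/cgroup/cgroup.subtree_control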

What I'm trying to say is that I think that option (1) is sufficiently
bad that cgroup2 should do (2) or (3) instead.  If option (2) is
preferred and if it would break userspace, then I think we can work
around it by entirely deprecating cgroup2, renaming it to cgroup3, and
doing option (2) there.  You've given reasons you don't like options
(2) and (3).  I mostly agree with those reasons, but I don't think
they're strong enough to overcome the problems with (1).

BTW, Mike keeps mentioning exclusive cgroups as problematic with the
no-internal-tasks constraints.  Do exclusive cgroups still exist in
cgroup2?  Could we perhaps just remove that capability entirely?  I've
never understood what problem exclusive cpusets and such solve that
can't be more comprehensibly solved by just assigning the cpusets the
normal inclusive way.

>> > After a migration, the cgroup and its interface knobs are a different
>> > directory and files.  Semantically, during migration, we aren't moving
>> > the directory or files and it'd be bizarre to overlay the semantics
>> > you're describing on top of the existing cgroupfs.  We will have to
>> > break away from the very basic vfs rules such as a fd, once opened,
>> > always corresponding to the same file.
>>
>> What kind of migration do you mean?  Having fds follow rename(2) around is
>> the normal vfs behavior, so I don't really know what you mean.
>
> Process or task migration by writing pid to cgroup.procs or tasks
> file.  cgroup never supported directory / cgroup level migrations.
>

Ugh.  Perhaps cgroup2 should start supporting this.  I think that
making rename(2) work is simpler than adding a whole new API for
rgroups, and I think it could solve a lot of the same problems that
rgroups are trying to solve.

--Andy

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-09-16  7:51                                 ` Peter Zijlstra
  0 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2016-09-16  7:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Tejun Heo, Ingo Molnar, Mike Galbraith, Andrew Morton,
	kernel-team, open list:CONTROL GROUP (CGROUP),
	Linux API, Li Zefan, Paul Turner, linux-kernel, Linus Torvalds,
	Johannes Weiner

On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote:
> BTW, Mike keeps mentioning exclusive cgroups as problematic with the
> no-internal-tasks constraints.  Do exclusive cgroups still exist in
> cgroup2?  Could we perhaps just remove that capability entirely?  I've
> never understood what problem exclusive cpusets and such solve that
> can't be more comprehensibly solved by just assigning the cpusets the
> normal inclusive way.

Without exclusive sets we cannot split the sched_domain structure.
Which leads to not being able to actually partition things. That would
break DL for one.

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-09-16 15:12                                   ` Andy Lutomirski
  0 siblings, 0 replies; 87+ messages in thread
From: Andy Lutomirski @ 2016-09-16 15:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, kernel-team, Andrew Morton,
	open list:CONTROL GROUP (CGROUP),
	Paul Turner, Li Zefan, Linux API, linux-kernel, Tejun Heo,
	Johannes Weiner, Linus Torvalds

On Sep 16, 2016 12:51 AM, "Peter Zijlstra" <peterz@infradead.org> wrote:
>
> On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote:
> > BTW, Mike keeps mentioning exclusive cgroups as problematic with the
> > no-internal-tasks constraints.  Do exclusive cgroups still exist in
> > cgroup2?  Could we perhaps just remove that capability entirely?  I've
> > never understood what problem exclusive cpusets and such solve that
> > can't be more comprehensibly solved by just assigning the cpusets the
> > normal inclusive way.
>
> Without exclusive sets we cannot split the sched_domain structure.
> Which leads to not being able to actually partition things. That would
> break DL for one.

Can you sketch out a toy example?  And what's DL?

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-09-16 16:19                                     ` Peter Zijlstra
  0 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2016-09-16 16:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Mike Galbraith, kernel-team, Andrew Morton,
	open list:CONTROL GROUP (CGROUP),
	Paul Turner, Li Zefan, Linux API, linux-kernel, Tejun Heo,
	Johannes Weiner, Linus Torvalds

On Fri, Sep 16, 2016 at 08:12:58AM -0700, Andy Lutomirski wrote:
> On Sep 16, 2016 12:51 AM, "Peter Zijlstra" <peterz@infradead.org> wrote:
> >
> > On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote:
> > > BTW, Mike keeps mentioning exclusive cgroups as problematic with the
> > > no-internal-tasks constraints.  Do exclusive cgroups still exist in
> > > cgroup2?  Could we perhaps just remove that capability entirely?  I've
> > > never understood what problem exclusive cpusets and such solve that
> > > can't be more comprehensibly solved by just assigning the cpusets the
> > > normal inclusive way.
> >
> > Without exclusive sets we cannot split the sched_domain structure.
> > Which leads to not being able to actually partition things. That would
> > break DL for one.
> 
> Can you sketch out a toy example? 

[ Also see Documentation/cgroup-v1/cpusets.txt section 1.7 ]


  mkdir /cpuset

  mount -t cgroup -o cpuset none /cpuset

  mkdir /cpuset/A
  mkdir /cpuset/B

  cat /sys/devices/system/node/node0/cpulist > /cpuset/A/cpuset.cpus
  echo 0 > /cpuset/A/cpuset.mems

  cat /sys/devices/system/node/node1/cpulist > /cpuset/B/cpuset.cpus
  echo 1 > /cpuset/B/cpuset.mems

  # move all movable tasks into A
  cat /cpuset/tasks | while read task; do echo $task > /cpuset/A/tasks ; done

  # kill machine wide load-balancing
  echo 0 > /cpuset/cpuset.sched_load_balance

  # now place 'special' tasks in B


This partitions the scheduler into two, one for each node.

Hereafter no task will be moved from one node to another.  The
load-balancer is split in two: one balances in A, one balances in B,
and nothing crosses.  (It is important that A.cpus and B.cpus do not
intersect.)

Ideally no task would remain in the root group; back in the day we could
actually do this (with the exception of the CPU-bound kernel threads),
but this has significantly regressed :-(
(I still hate the workqueue affinity interface.)

As is, tasks that are left in the root group get balanced within
whatever domain they ended up in.

> And what's DL?

SCHED_DEADLINE; it's a 'Global'-EDF-like scheduler that doesn't support
CPU affinities (because that doesn't make sense).  The only way to
restrict it is to partition.

'Global' because you can partition it. If you reduce your system to
single CPU partitions you'll reduce to P-EDF.

(The same is true of SCHED_FIFO: that's a 'Global'-FIFO on the same
partition scheme.  It does, however, support sched_affinity, but using
it gives 'interesting' schedulability results -- call it a historic
accident.)
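
For concreteness, a sketch of how a task enters SCHED_DEADLINE,
assuming a util-linux recent enough that chrt(1) knows --deadline (the
numbers are illustrative nanoseconds):

  # 10ms of runtime guaranteed every 100ms period, deadline == period
  chrt --deadline --sched-runtime  10000000 \
                  --sched-deadline 100000000 \
                  --sched-period  100000000 0 ./worker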


Note that, related but different, we have the isolcpus boot parameter,
which creates single-CPU partitions for all listed CPUs and gives the
rest to the root cpuset.  Ideally we'd kill this option given it's a
boot-time setting (for something which is trivial to do at runtime).

But this cannot be done, because that would mean we'd have to start with
a !0 cpuset layout:

		'/'
		load_balance=0
            /              \
	'system'	'isolated'
	cpus=~isolcpus	cpus=isolcpus
			load_balance=0

And start with _everything_ in the /system group (including default IRQ
affinities).

Of course, that will break everything cgroup :-(

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-09-16 16:29                                       ` Andy Lutomirski
  0 siblings, 0 replies; 87+ messages in thread
From: Andy Lutomirski @ 2016-09-16 16:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, kernel-team, Andrew Morton,
	open list:CONTROL GROUP (CGROUP),
	Paul Turner, Li Zefan, Linux API, linux-kernel, Tejun Heo,
	Johannes Weiner, Linus Torvalds

On Fri, Sep 16, 2016 at 9:19 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Sep 16, 2016 at 08:12:58AM -0700, Andy Lutomirski wrote:
>> On Sep 16, 2016 12:51 AM, "Peter Zijlstra" <peterz@infradead.org> wrote:
>> >
>> > On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote:
>> > > BTW, Mike keeps mentioning exclusive cgroups as problematic with the
>> > > no-internal-tasks constraints.  Do exclusive cgroups still exist in
>> > > cgroup2?  Could we perhaps just remove that capability entirely?  I've
>> > > never understood what problem exclusive cpusets and such solve that
>> > > can't be more comprehensibly solved by just assigning the cpusets the
>> > > normal inclusive way.
>> >
>> > Without exclusive sets we cannot split the sched_domain structure.
>> > Which leads to not being able to actually partition things. That would
>> > break DL for one.
>>
>> Can you sketch out a toy example?
>
> [ Also see Documentation/cgroup-v1/cpusets.txt section 1.7 ]
>
>
>   mkdir /cpuset
>
>   mount -t cgroup -o cpuset none /cpuset
>
>   mkdir /cpuset/A
>   mkdir /cpuset/B
>
>   cat /sys/devices/system/node/node0/cpulist > /cpuset/A/cpuset.cpus
>   echo 0 > /cpuset/A/cpuset.mems
>
>   cat /sys/devices/system/node/node1/cpulist > /cpuset/B/cpuset.cpus
>   echo 1 > /cpuset/B/cpuset.mems
>
>   # move all movable tasks into A
>   cat /cpuset/tasks | while read task; do echo $task > /cpuset/A/tasks ; done
>
>   # kill machine wide load-balancing
>   echo 0 > /cpuset/cpuset.sched_load_balance
>
>   # now place 'special' tasks in B
>
>
> This partitions the scheduler into two, one for each node.
>
> Hereafter no task will be moved from one node to another. The
> load-balancer is split in two, one balances in A one balances in B
> nothing crosses. (It is important that A.cpus and B.cpus do not
> intersect.)
>
> Ideally no task would remain in the root group, back in the day we could
> actually do this (with exception of the cpu bound kernel threads), but
> this has significantly regressed :-(
> (still hate the workqueue affinity interface)

I wonder if we could address this by creating (automatically at boot
or when the cpuset controller is enabled or whatever) a
/cpuset/random_kernel_shit cgroup and have all of the unmoveable tasks
land there?

>
> As is, tasks that are left in the root group get balanced within
> whatever domain they ended up in.
>
>> And what's DL?
>
> SCHED_DEADLINE, its a 'Global'-EDF like scheduler that doesn't support
> CPU affinities (because that doesn't make sense). The only way to
> restrict it is to partition.
>
> 'Global' because you can partition it. If you reduce your system to
> single CPU partitions you'll reduce to P-EDF.
>
> (The same is true of SCHED_FIFO, that's a 'Global'-FIFO on the same
> partition scheme, it however does support sched_affinity, but using it
> gives 'interesting' schedulability results -- call it a historic
> accident).

Hmm, I didn't realize that the deadline scheduler was global.  But
ISTM requiring the use of "exclusive" to get this working is
unfortunate.  What if a user wants two separate partitions, one using
CPUs 1 and 2 and the other using CPUs 3 and 4 (with 5 reserved for
non-RT stuff)?  Shouldn't we be able to have a cgroup for each of the
DL partitions and do something to tell the deadline scheduler "here is
your domain"?

>
>
> Note that related, but differently, we have the isolcpus boot parameter
> which creates single CPU partitions for all listed CPUs and gives the
>> > rest to the root cpuset. Ideally we'd kill this option given it's a boot
>> > time setting (for something which is trivial to do at runtime).
>
> But this cannot be done, because that would mean we'd have to start with
> a !0 cpuset layout:
>
>                 '/'
>                 load_balance=0
>             /              \
>         'system'        'isolated'
>         cpus=~isolcpus  cpus=isolcpus
>                         load_balance=0
>
>> > And start with _everything_ in the /system group (including default IRQ
> affinities).
>
> Of course, that will break everything cgroup :-(
>

I would actually *much* prefer this over the status quo.  I'm tired of
my crappy, partially-working script that sits there and creates
exactly this configuration (minus the isolcpus part because I actually
want migration to work) on boot.  (Actually, it could have two
automatic cgroups: /kernel and /init -- init and UMH would go in init
and kernel threads and such would go in /kernel.  Userspace would be
able to request that a different cgroup be used for newly-created
kernel threads.)

Heck, even systemd would probably prefer this.  Then it could cleanly
expose a "slice" or whatever it's called for random kernel shit and at
least you could configure it meaningfully.

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-09-16 16:50                                         ` Peter Zijlstra
  0 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2016-09-16 16:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Mike Galbraith, kernel-team, Andrew Morton,
	open list:CONTROL GROUP (CGROUP),
	Paul Turner, Li Zefan, Linux API, linux-kernel, Tejun Heo,
	Johannes Weiner, Linus Torvalds

On Fri, Sep 16, 2016 at 09:29:06AM -0700, Andy Lutomirski wrote:

> > SCHED_DEADLINE, its a 'Global'-EDF like scheduler that doesn't support
> > CPU affinities (because that doesn't make sense). The only way to
> > restrict it is to partition.
> >
> > 'Global' because you can partition it. If you reduce your system to
> > single CPU partitions you'll reduce to P-EDF.
> >
> > (The same is true of SCHED_FIFO, that's a 'Global'-FIFO on the same
> > partition scheme, it however does support sched_affinity, but using it
> > gives 'interesting' schedulability results -- call it a historic
> > accident).
> 
> Hmm, I didn't realize that the deadline scheduler was global.  But
> ISTM requiring the use of "exclusive" to get this working is
> unfortunate.  What if a user wants two separate partitions, one using
> CPUs 1 and 2 and the other using CPUs 3 and 4 (with 5 reserved for
> non-RT stuff)? 

{1,2} {3,4} {5} seem exclusive; did I miss something?  (Other than that
5-CPU parts are 'rare'.)

> Shouldn't we be able to have a cgroup for each of the
> DL partitions and do something to tell the deadline scheduler "here is
> your domain"?

Somewhat confused: by doing the non-overlapping domains, you do exactly
that, no?

You end up with 2 (or more) independent deadline schedulers, but if
you're not running deadline tasks (like in the /system partition) you
don't care that it's there.
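
Following the earlier cpuset example, a sketch of that {1,2} {3,4} {5}
layout (group names are illustrative):

  mkdir /cpuset/rt-a /cpuset/rt-b /cpuset/misc
  echo 1-2 > /cpuset/rt-a/cpuset.cpus
  echo 3-4 > /cpuset/rt-b/cpuset.cpus
  echo 5   > /cpuset/misc/cpuset.cpus
  # (set cpuset.mems for each and move the tasks over, as before)

  # kill machine-wide balancing; each child becomes its own sched
  # domain, i.e. three independent schedulers
  echo 0 > /cpuset/cpuset.sched_load_balance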

> > Note that related, but differently, we have the isolcpus boot parameter
> > which creates single CPU partitions for all listed CPUs and gives the
> > rest to the root cpuset. Ideally we'd kill this option given it's a boot
> > time setting (for something which is trivial to do at runtime).
> >
> > But this cannot be done, because that would mean we'd have to start with
> > a !0 cpuset layout:
> >
> >                 '/'
> >                 load_balance=0
> >             /              \
> >         'system'        'isolated'
> >         cpus=~isolcpus  cpus=isolcpus
> >                         load_balance=0
> >
> > And start with _everything_ in the /system group (including default IRQ
> > affinities).
> >
> > Of course, that will break everything cgroup :-(
> >
> 
> I would actually *much* prefer this over the status quo.  I'm tired of
> my crappy, partially-working script that sits there and creates
> exactly this configuration (minus the isolcpus part because I actually
> want migration to work) on boot.  (Actually, it could have two
> automatic cgroups: /kernel and /init -- init and UMH would go in init
> and kernel threads and such would go in /kernel.  Userspace would be
> able to request that a different cgroup be used for newly-created
> kernel threads.)

So there's a problem with sticking kernel threads (and esp. kthreadd)
into !root groups. For example if you place it in a cpuset that doesn't
have all cpus, then binding your shiny new kthread to a cpu will fail.

You can fix that of course, and we used to do exactly that, but we kept
running into 'fun' cases like that.

The unbound workqueue stuff is totally arbitrary borkage though; that
can be made to work just fine.  TJ didn't like it for some reason which
I really cannot remember.

Also, UMH?

> Heck, even systemd would probably prefer this.  Then it could cleanly
> expose a "slice" or whatever it's called for random kernel shit and at
> least you could configure it meaningfully.

No clue about systemd, I'm still on systems without that virus.

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-09-16 18:19                                           ` Andy Lutomirski
  0 siblings, 0 replies; 87+ messages in thread
From: Andy Lutomirski @ 2016-09-16 18:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, kernel-team, Andrew Morton,
	open list:CONTROL GROUP (CGROUP),
	Paul Turner, Li Zefan, Linux API, linux-kernel, Tejun Heo,
	Johannes Weiner, Linus Torvalds

On Fri, Sep 16, 2016 at 9:50 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Sep 16, 2016 at 09:29:06AM -0700, Andy Lutomirski wrote:
>
>> > SCHED_DEADLINE, its a 'Global'-EDF like scheduler that doesn't support
>> > CPU affinities (because that doesn't make sense). The only way to
>> > restrict it is to partition.
>> >
>> > 'Global' because you can partition it. If you reduce your system to
>> > single CPU partitions you'll reduce to P-EDF.
>> >
>> > (The same is true of SCHED_FIFO, that's a 'Global'-FIFO on the same
>> > partition scheme, it however does support sched_affinity, but using it
>> > gives 'interesting' schedulability results -- call it a historic
>> > accident).
>>
>> Hmm, I didn't realize that the deadline scheduler was global.  But
>> ISTM requiring the use of "exclusive" to get this working is
>> unfortunate.  What if a user wants two separate partitions, one using
>> CPUs 1 and 2 and the other using CPUs 3 and 4 (with 5 reserved for
>> non-RT stuff)?
>
> {1,2} {3,4} {5} seem exclusive, did I miss something? (other than that 5
> cpu parts are 'rare').

There's no overlap, so they're logically exclusive, but it avoids
needing the "cpu_exclusive" parameter.  It always seemed confusing to
me that a setting on a child cgroup would strictly remove a resource
from the parent.  (To be clear: I don't have any particularly strong
objection to cpu_exclusive.  It just always seemed like a bit of a
hack that mostly duplicated what you could get by just setting the
cpusets appropriately throughout the hierarchy.)

>> > Note that related, but differently, we have the isolcpus boot parameter
>> > which creates single CPU partitions for all listed CPUs and gives the
>> > rest to the root cpuset. Ideally we'd kill this option given it's a boot
>> > time setting (for something which is trivial to do at runtime).
>> >
>> > But this cannot be done, because that would mean we'd have to start with
>> > a !0 cpuset layout:
>> >
>> >                 '/'
>> >                 load_balance=0
>> >             /              \
>> >         'system'        'isolated'
>> >         cpus=~isolcpus  cpus=isolcpus
>> >                         load_balance=0
>> >
>> > And start with _everything_ in the /system group (including default IRQ
>> > affinities).
>> >
>> > Of course, that will break everything cgroup :-(
>> >
>>
>> I would actually *much* prefer this over the status quo.  I'm tired of
>> my crappy, partially-working script that sits there and creates
>> exactly this configuration (minus the isolcpus part because I actually
>> want migration to work) on boot.  (Actually, it could have two
>> automatic cgroups: /kernel and /init -- init and UMH would go in init
>> and kernel threads and such would go in /kernel.  Userspace would be
>> able to request that a different cgroup be used for newly-created
>> kernel threads.)
>
> So there's a problem with sticking kernel threads (and esp. kthreadd)
> into !root groups. For example if you place it in a cpuset that doesn't
> have all cpus, then binding your shiny new kthread to a cpu will fail.
>
> You can fix that of course, and we used to do exactly that, but we kept
> running into 'fun' cases like that.

Blech.  But maybe this *should* have that effect.  I'm sick of random
kernel crap being scheduled on my RT CPUs and on the CPUs that I
intend to be kept forcibly idle.

>
> The unbound workqueue stuff is totally arbitrary borkage though, that
> can be made to work just fine, TJ didn't like it for some reason which I
> really cannot remember.
>
> Also, UMH?

User mode helper.  Fortunately most users are gone now, but it still exists.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-09-16 18:19                                           ` Andy Lutomirski
  0 siblings, 0 replies; 87+ messages in thread
From: Andy Lutomirski @ 2016-09-16 18:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, kernel-team-b10kYP2dOMg,
	Andrew Morton, open list:CONTROL GROUP (CGROUP),
	Paul Turner, Li Zefan, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Johannes Weiner,
	Linus Torvalds

On Fri, Sep 16, 2016 at 9:50 AM, Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote:
> On Fri, Sep 16, 2016 at 09:29:06AM -0700, Andy Lutomirski wrote:
>
>> > SCHED_DEADLINE, its a 'Global'-EDF like scheduler that doesn't support
>> > CPU affinities (because that doesn't make sense). The only way to
>> > restrict it is to partition.
>> >
>> > 'Global' because you can partition it. If you reduce your system to
>> > single CPU partitions you'll reduce to P-EDF.
>> >
>> > (The same is true of SCHED_FIFO, that's a 'Global'-FIFO on the same
>> > partition scheme, it however does support sched_affinity, but using it
>> > gives 'interesting' schedulability results -- call it a historic
>> > accident).
>>
>> Hmm, I didn't realize that the deadline scheduler was global.  But
>> ISTM requiring the use of "exclusive" to get this working is
>> unfortunate.  What if a user wants two separate partitions, one using
>> CPUs 1 and 2 and the other using CPUs 3 and 4 (with 5 reserved for
>> non-RT stuff)?
>
> {1,2} {3,4} {5} seem exclusive, did I miss something? (other than that 5
> cpu parts are 'rare').

There's no overlap, so they're logically exclusive, but it avoids
needing the "cpu_exclusive" parameter.  It always seemed confusing to
me that a setting on a child cgroup would strictly remove a resource
from the parent.  (To be clear: I don't have any particularly strong
objection to cpu_exclusive.  It just always seemed like a bit of a
hack that mostly duplicated what you could get by just setting the
cpusets appropriately throughout the hierarchy.)

>> > Note that related, but differently, we have the isolcpus boot parameter
>> > which creates single CPU partitions for all listed CPUs and gives the
>> > rest to the root cpuset. Ideally we'd kill this option given its a boot
>> > time setting (for something which is trivially to do at runtime).
>> >
>> > But this cannot be done, because that would mean we'd have to start with
>> > a !0 cpuset layout:
>> >
>> >                 '/'
>> >                 load_balance=0
>> >             /              \
>> >         'system'        'isolated'
>> >         cpus=~isolcpus  cpus=isolcpus
>> >                         load_balance=0
>> >
>> > And start with _everything_ in the /system group (inclding default IRQ
>> > affinities).
>> >
>> > Of course, that will break everything cgroup :-(
>> >
>>
>> I would actually *much* prefer this over the status quo.  I'm tired of
>> my crappy, partially-working script that sits there and creates
>> exactly this configuration (minus the isolcpus part because I actually
>> want migration to work) on boot.  (Actually, it could have two
>> automatic cgroups: /kernel and /init -- init and UMH would go in init
>> and kernel threads and such would go in /kernel.  Userspace would be
>> able to request that a different cgroup be used for newly-created
>> kernel threads.)
>
> So there's a problem with sticking kernel threads (and esp. kthreadd)
> into !root groups. For example if you place it in a cpuset that doesn't
> have all cpus, then binding your shiny new kthread to a cpu will fail.
>
> You can fix that of course, and we used to do exactly that, but we kept
> running into 'fun' cases like that.

Blech.  But may this *should* have that effect.  I'm sick of random
kernel crap being scheduled on my RT CPUs and on the CPUs that I
intend to be kept forcibly idle.

>
> The unbound workqueue stuff is totally arbitrary borkage though, that
> can be made to work just fine, TJ didn't like it for some reason which I
> really cannot remember.
>
> Also, UMH?

User mode helper.  Fortunately most users are gone now, but it still exists.

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-09-17  1:47                                             ` Peter Zijlstra
  0 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2016-09-17  1:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Mike Galbraith, kernel-team, Andrew Morton,
	open list:CONTROL GROUP (CGROUP),
	Paul Turner, Li Zefan, Linux API, linux-kernel, Tejun Heo,
	Johannes Weiner, Linus Torvalds

On Fri, Sep 16, 2016 at 11:19:38AM -0700, Andy Lutomirski wrote:
> On Fri, Sep 16, 2016 at 9:50 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> > {1,2} {3,4} {5} seem exclusive, did I miss something? (other than that 5
> > cpu parts are 'rare').
> 
> There's no overlap, so they're logically exclusive, but doing it that
> way avoids needing the "cpu_exclusive" parameter.

I'd need to double check, but I don't think you _need_ that. That's more
for ensuring that nobody else steals your CPUs and 'accidentally' creates
overlaps. But if you configure it right, non-overlap should be enough.

That is, generate_sched_domains() only uses cpusets_overlap(), which is
cpumask_intersects(). Then again, it is almost 4am, so who knows.

> > So there's a problem with sticking kernel threads (and esp. kthreadd)
> > into !root groups. For example if you place it in a cpuset that doesn't
> > have all cpus, then binding your shiny new kthread to a cpu will fail.
> >
> > You can fix that of course, and we used to do exactly that, but we kept
> > running into 'fun' cases like that.
> 
> Blech.  But maybe this *should* have that effect.  I'm sick of random
> kernel crap being scheduled on my RT CPUs and on the CPUs that I
> intend to be kept forcibly idle.

Hehe, so ideally those threads don't do anything unless the tasks
running on those CPUs explicitly ask for it. If you find any of the
CPU-bound kernel tasks doing work that is unrelated to the tasks running on
that CPU, we should certainly look into it.

Personally I'm not much bothered by idle threads sitting about.

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-09-19 21:34                             ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-09-19 21:34 UTC (permalink / raw)
  To: Austin S. Hemmelgarn
  Cc: Andy Lutomirski, Ingo Molnar, Mike Galbraith, linux-kernel,
	kernel-team, open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
	Johannes Weiner, Linus Torvalds

Hello, Austin.

On Mon, Sep 12, 2016 at 11:20:03AM -0400, Austin S. Hemmelgarn wrote:
> > If you confine it to the cpu controller, ignore anonymous
> > consumption, the rather ugly mapping between nice and weight values
> > and the fact that nobody could come up with a practical use for
> > such a setup, yes.  My point was never that the cpu controller can't do
> > it but that we should find a better way of coordinating it with other
> > controllers and exposing it to individual applications.
>
> So, having a container where not everything in the container is split
> further into subgroups is not a practically useful situation?  Because
> that's exactly what both systemd and every other cgroup management tool
expect to have work as things stand right now.  The root cgroup within a

Not true.

 $ cat /proc/1/cgroup
 11:hugetlb:/
 10:pids:/init.scope
 9:blkio:/
 8:cpuset:/
 7:memory:/
 6:freezer:/
 5:perf_event:/
 4:net_cls,net_prio:/
 3:cpu,cpuacct:/
 2:devices:/init.scope
 1:name=systemd:/init.scope
 $ systemctl --version
 systemd 229
 +PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN

> cgroup namespace has to function exactly like the system-root, otherwise
> nothing can depend on the special cases for the system root, because they
> might get run in a cgroup namespace and such assumptions will be invalid.

systemd already behaves exactly the same whether it's inside a
namespace or not.

> This in turn means that no current distro can run unmodified in a cgroup
> namespace under a v2 hierarchy, which is a Very Bad Thing.

cgroup v1 hierarchies can be mounted the same inside a namespace
whether the system itself is on cgroup v1 or v2.  Obviously, a given
controller can only be attached to one hierarchy, so a controller
can't be used at the same time on both v1 and v2 hierarchies; however,
that is true with different v1 hierarchies too, and, given that
delegation doesn't work properly on v1, shouldn't be that much of an
issue.

I'm not just claiming it.  systemd-nspawn can already be on either v1
or v2 hierarchies regardless of what the outer systemd uses.
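
For example, from inside a container's own mount and cgroup namespaces,
something like the following works regardless of whether the host is on
v1 or v2 (a sketch; it assumes the named controllers are not already
claimed by the host's v2 hierarchy, in which case the mount fails):

 mount -t tmpfs tmpfs /sys/fs/cgroup
 mkdir -p /sys/fs/cgroup/cpu /sys/fs/cgroup/memory
 mount -t cgroup -o cpu,cpuacct cgroup /sys/fs/cgroup/cpu
 mount -t cgroup -o memory cgroup /sys/fs/cgroup/memory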

Out of the claims that you made, the only one which holds up is that
existing software can't make use of cgroup v2 without modifications,
which is true but at the same time doesn't mean much of anything.

Thanks.

-- 
tejun

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-09-19 21:53                                 ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-09-19 21:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Mike Galbraith, Andrew Morton, kernel-team,
	open list:CONTROL GROUP (CGROUP),
	Linux API, Li Zefan, Paul Turner, linux-kernel, Linus Torvalds,
	Johannes Weiner, Peter Zijlstra

Hello,

On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote:
> With regard to no-internal-tasks, I see (at least) three options:
> 
> 1. Keep the cgroup2 status quo.  Lots of distros and such are likely
> to have their cgroup management fail if run in a container.  I really,

I don't know where you're getting this.  The no-internal-tasks rule has
*NOTHING* to do with whether or how cgroup v1 hierarchies can be used
inside a namespace.  I suppose this is coming from the same
misunderstanding that Austin has.  Please see my reply there for more
details.

> really dislike this option.

Up until this point, you haven't supplied any valid technical reasons
for your objection.  Repeating "really" doesn't add to the discussion
at all.  If you're indicating that you don't like it on aesthetic
grounds, please just say so.

> 2. Enforce no-internal-tasks for the root cgroup.  Un-cgroupable
> things will still get accounted to the root cgroup even if subtree
> control is on, but no tasks can be in the root cgroup if the root
> cgroup has subtree control on.  (If some controllers removed the
> no-internal-tasks restriction, this would apply to the root as well.)
> I think this may annoy certain users.  If so, and if those users are
> doing something valid, then I think that either those users should be
> strongly encouraged or even forced to change so that namespacing
> works for them or that we should do (3) instead.

Theoretically, we can do that but what are the upsides and are they
enough to justify the added inconveniences?  Up until now, the only
argument you provided is that people may do certain things in
system-root which might not work in namespace-root but that isn't a
critical problem.  No real functionality is lost by implementing
the same behaviors both inside and outside namespaces.

> 3. Remove the no-internal-tasks restriction entirely.  I can see this
> resulting in a lot of configuration awkwardness, but I think it will
> *work*, especially since all of the controllers already need to do
> something vaguely intelligent when subtree control is on in the root
> and there are tasks in the root.

The reasons for the no-internal-tasks restriction have been explained
multiple times in the documentation and throughout this thread, and
we also discussed how and why system-root is special and allowing
system-root's special treatment doesn't break things.
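
For reference, this is how the restriction behaves on a v2 mount (a
sketch; on kernels of this era the rejected write fails with EBUSY):

 cd /sys/fs/cgroup                      # a pure cgroup2 mount assumed
 mkdir -p parent/child
 echo $$ > parent/cgroup.procs          # parent now has a member process
 echo +memory > parent/cgroup.subtree_control   # rejected: internal process
 echo $$ > parent/child/cgroup.procs    # push the process down first
 echo +memory > parent/cgroup.subtree_control   # now accepted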

> What I'm trying to say is that I think that option (1) is sufficiently
> bad that cgroup2 should do (2) or (3) instead.  If option (2) is
> preferred and if it would break userspace, then I think we can work
> around it by entirely deprecating cgroup2, renaming it to cgroup3, and
> doing option (2) there.  You've given reasons you don't like options
> (2) and (3).  I mostly agree with those reasons, but I don't think
> they're strong enough to overcome the problems with (1).

And you keep suggesting very drastic measures for an issue which isn't
critical without providing any substantial technical reasons why such
drastic measures would be necessary.  This part of the discussion started
with your misunderstanding of the implications of the system-root
being special, and the only reason you presented in the previous
message is still a different misunderstanding.

The only thing which isn't changing here is your opinions on how it
should be.  It is a baffling situation because your opinions don't
seem to be affected at all by the validity of reasons for thinking so.

> BTW, Mike keeps mentioning exclusive cgroups as problematic with the
> no-internal-tasks constraints.  Do exclusive cgroups still exist in
> cgroup2?  Could we perhaps just remove that capability entirely?  I've
> never understood what problem exclusive cpusets and such solve that
> can't be more comprehensibly solved by just assigning the cpusets the
> normal inclusive way.

This was explained before during the discussion.  Maybe it wasn't
clear enough.  The knob is a config protector which protects one's
configuration from being changed out from under oneself.  It doesn't
really belong in the kernel.  My guess is that it was added because the
delegation model wasn't properly established and people tried to
delegate resource control knobs along with the cgroups and then wanted
to prevent those knobs from being changed in certain ways.
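
For reference, the knob's effect under cpuset v1 (a sketch; the group
names are illustrative):

 cd /sys/fs/cgroup/cpuset
 mkdir -p rt other
 echo 1-2 > rt/cpuset.cpus
 echo 1 > rt/cpuset.cpu_exclusive       # claim CPUs 1-2 exclusively
 echo 2-3 > other/cpuset.cpus           # overlaps an exclusive sibling;
                                        # the write fails with EINVAL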

> >> What kind of migration do you mean?  Having fds follow rename(2) around is
> >> the normal vfs behavior, so I don't really know what you mean.
> >
> > Process or task migration by writing pid to cgroup.procs or tasks
> > file.  cgroup never supported directory / cgroup level migrations.
> 
> Ugh.  Perhaps cgroup2 should start supporting this.  I think that
> making rename(2) work is simpler than adding a whole new API for
> rgroups, and I think it could solve a lot of the same problems that
> rgroups are trying to solve.

We haven't needed that yet and supporting rename(2) doesn't
necessarily make the API safe in terms of migration atomicity.  Also,
as pointed out in my previous reply (and the rgroup documentation),
atomicity is only one part of the rationale for rgroup.

Thanks.

-- 
tejun

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-09-30  9:06                             ` Tejun Heo
  0 siblings, 0 replies; 87+ messages in thread
From: Tejun Heo @ 2016-09-30  9:06 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Andy Lutomirski, Ingo Molnar, linux-kernel, kernel-team,
	open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
	Johannes Weiner, Linus Torvalds

Hello, Mike.

On Sat, Sep 10, 2016 at 12:08:57PM +0200, Mike Galbraith wrote:
> On Fri, 2016-09-09 at 18:57 -0400, Tejun Heo wrote:
> > > > As for your example, who performs the cgroup setup and configuration,
> > > > the application itself or an external entity?  If an external entity,
> > > > how does it know which thread is what?
> > > 
> > > In my case, it would be a little script that reads a config file that
> > > knows all kinds of internal information about the application and its
> > > threads.
> > 
> > I see.  One-of-a-kind custom setup.  This is a completely valid usage;
> > however, please also recognize that it's an extremely specific one
> > which is niche by definition.
> 
> This is the same pigeon hole you placed Google into.  So Google, my
> (also decidedly non-petite) users, and now Andy are all sharing the one
> of a kind extremely specific niche.. it's becoming a tad crowded.

I wasn't trying to say that these use cases are small in numbers when
added up, but that they're all isolated in their own small silos.
Facebook has a lot of these usages too but they're almost all mutually
exclusive.  Making workloads share machines or even adding resource
control for base system operations afterwards is extremely difficult.
There are cases where these ad-hoc approaches make sense, but insisting
that this is all there is to resource control is short-sighted.

Thanks.

-- 
tejun

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-09-30 14:53                               ` Mike Galbraith
  0 siblings, 0 replies; 87+ messages in thread
From: Mike Galbraith @ 2016-09-30 14:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andy Lutomirski, Ingo Molnar, linux-kernel, kernel-team,
	open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Peter Zijlstra,
	Johannes Weiner, Linus Torvalds

On Fri, 2016-09-30 at 11:06 +0200, Tejun Heo wrote:
> Hello, Mike.
> 
> On Sat, Sep 10, 2016 at 12:08:57PM +0200, Mike Galbraith wrote:
> > On Fri, 2016-09-09 at 18:57 -0400, Tejun Heo wrote:
> > > > > As for your example, who performs the cgroup setup and configuration,
> > > > > the application itself or an external entity?  If an external entity,
> > > > > how does it know which thread is what?
> > > > 
> > > > In my case, it would be a little script that reads a config file that
> > > > knows all kinds of internal information about the application and its
> > > > threads.
> > > 
> > > I see.  One-of-a-kind custom setup.  This is a completely valid usage;
> > > however, please also recognize that it's an extremely specific one
> > > which is niche by definition.
> > 
> > This is the same pigeon hole you placed Google into.  So Google, my
> > (also decidedly non-petite) users, and now Andy are all sharing the one
> > of a kind extremely specific niche.. it's becoming a tad crowded.
> 
> I wasn't trying to say that these use cases are small in numbers when
> added up, but that they're all isolated in their own small silos.

These use cases exist, and are perfectly valid use cases.  That is the
sum and total of what is relevant.

> Facebook has a lot of these usages too but they're almost all mutually
> exclusive.  Making workloads share machines or even adding resource
> control for base system operations afterwards is extremely difficult.

The cases I have in mind are not difficult to deal with, as you don't
have to worry about collisions.

> There are cases where these ad-hoc approaches make sense, but insisting
> that this is all there is to resource control is short-sighted.

1. I never insisted any such thing.
2. Please stop pigeon-holing.

The usage cases in question are no more ad hoc than any other usage;
they are all "for this", none are globally applicable.  What they are
is power users utilizing the intimate knowledge that is both required
and in the possession of power users who are in fact using controllers
precisely as said controllers were designed to be used.

No, these usages do not belong in an "ad hoc" (aka disposable refuse)
pigeon-hole.  I choose to ignore the one you stuffed me into.
	-Mike

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-10-04 14:47                         ` Tejun Heo
  -1 siblings, 1 reply; 87+ messages in thread
From: Tejun Heo @ 2016-10-04 14:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Ingo Molnar, Mike Galbraith, linux-kernel,
	kernel-team, open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Johannes Weiner,
	Linus Torvalds

Hello, Peter.

On Tue, Sep 06, 2016 at 12:29:50PM +0200, Peter Zijlstra wrote:
> The fundamental problem is that we have 2 different types of
> controllers, on the one hand these controllers above, that work on tasks
> and form groups of them and build up from that. Lets call them
> task-controllers.
> 
> On the other hand we have controllers like memcg which take the 'system'
> as a whole and shrink it down into smaller bits. Let's call these
> system-controllers.
>
> They are fundamentally at odds with capabilities, simply because of the
> granularity they can work on.

As pointed out multiple times, the picture is not that simple.  For
example, eventually, we want to be able to account for cpu cycles
spent during memory reclaim or processing IOs (e.g. encryption), which
can only be tied to the resource domain, not a specific task.

There surely are things that can only be done by task-level
controllers, but there are two different aspects here.  One is the
actual capabilities (e.g. hierarchical proportional cpu cycle
distribution) and the other is how such capabilities are exposed.
I'll continue below.

> Merging the two into a common hierarchy is a useful concept for
> containerization, no argument on that, esp. when also coupled with
> namespaces and the like.

Great, we now agree that comprehensive system resource control is
useful.

> However, where I object _most_ strongly is having this one use dominate
> and destroy the capabilities (which are in use) of the task-controllers.

The objection isn't necessarily just about loss of capabilities but
also about not being able to do them in the same way as v1.  The
reason I proposed rgroup instead of scoped task-granularity is because
I think that a properly insulated programmable interface which is
inline with other widely used APIs is a better solution in the long
run.

If we go the cgroupfs route for thread granularity, we pretty much lose
the possibility, or at least make it very difficult, to make
hierarchical resource control widely available to individual
applications.

How important such use cases are is debatable.  I don't find it too
difficult to imagine scenarios where individual applications like
apache or torrent clients make use of it.  Probably more importantly,
rgroup, or something like it, gives an application an officially
supported way to build and expose their resource hierarchies, which
can then be used by both the application itself and outside to monitor
and manipulate resource distribution.

The decision between cgroupfs thread granularity and something like
rgroup isn't an obvious one.  Choosing the former is the path of lower
resistance but it is so at the cost of certain long-term benefits.

> > It could be made to work without races, though, with minimal (or even
> > no) ABI change.  The managed program could grab an fd pointing to its
> > cgroup.  Then it would use openat, etc for all operations.  As long as
> > 'mv /cgroup/a/b /cgroup/c/" didn't cause that fd to stop working,
> > we're fine.
> 
> I've mentioned openat() and related APIs several times, but so far never
> got good reasons why that wouldn't work.

Hopefully, this part was addressed in my reply to Andy.

> cgroup-v2, by placing the system style controllers first and foremost,
> completely renders that scenario impossible. Note also that any proposed
> rgroup would not work for this, since that, per design, is a subtree,
> and therefore not disjoint.

If a use case absolutely requires disjoint resource hierarchies, the
only solution is to keep using multiple v1 hierarchies, which
necessarily excludes the possibility of doing anything across different
resource types.

> So my objection to the whole cgroup-v2 model and implementation stems
> from the fact that it purports to be a 'better' and 'improved' system,
> while in actuality it neuters and destroys a lot of useful usecases.
> 
> It completely disregards all task-controllers and labels their use-cases
> as irrelevant.

Your objection then doesn't have much to do with the specifics of the
cgroup v2 model or implementation.  It's an objection against
establishing common resource domains as that excludes building
orthogonal multiple hierarchies.  That, necessarily, can only be
achieved by having multiple hierarchies for different resource types
and thus giving up the benefits of common resource domains.

Assuming that, I don't think your position is against cgroup v2 but
more toward keeping v1 around.  We're talking about two quite
different mutually exclusive classes of use cases.  You need unified
for one and disjoint for the other.  v1 is gonna be there and can
easily be used alongside v2 for different controller types, which
would in most cases be cpu and cpuset.
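
Such a mixed setup is just a few mounts (a sketch; the mount points are
illustrative):

 mount -t tmpfs tmpfs /sys/fs/cgroup
 mkdir -p /sys/fs/cgroup/unified /sys/fs/cgroup/cpu /sys/fs/cgroup/cpuset
 mount -t cgroup2 none /sys/fs/cgroup/unified
 mount -t cgroup -o cpu,cpuacct cgroup /sys/fs/cgroup/cpu
 mount -t cgroup -o cpuset cgroup /sys/fs/cgroup/cpuset
 # controllers not bound to a v1 hierarchy show up in
 # /sys/fs/cgroup/unified/cgroup.controllers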

I can't see a reason why this would need to block properly supporting
containerization use cases.

Thanks.

-- 
tejun

* Re: [Documentation] State of CPU controller in cgroup v2
@ 2016-10-05  8:07                             ` Peter Zijlstra
  0 siblings, 0 replies; 87+ messages in thread
From: Peter Zijlstra @ 2016-10-05  8:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andy Lutomirski, Ingo Molnar, Mike Galbraith, linux-kernel,
	kernel-team, open list:CONTROL GROUP (CGROUP),
	Andrew Morton, Paul Turner, Li Zefan, Linux API, Johannes Weiner,
	Linus Torvalds

On Tue, Oct 04, 2016 at 10:47:17AM -0400, Tejun Heo wrote:
> > cgroup-v2, by placing the system style controllers first and foremost,
> > completely renders that scenario impossible. Note also that any proposed
> > rgroup would not work for this, since that, per design, is a subtree,
> > and therefore not disjoint.
> 
> If a use case absolutely requires disjoint resource hierarchies, the
> only solution is to keep using multiple v1 hierarchies, which
> necessarily excludes the possibility of doing anything across different
> resource types.
> 
> > So my objection to the whole cgroup-v2 model and implementation stems
> > from the fact that it purports to be a 'better' and 'improved' system,
> > while in actuality it neuters and destroys a lot of useful usecases.
> > 
> > It completely disregards all task-controllers and labels their use-cases
> > as irrelevant.
> 
> Your objection then doesn't have much to do with the specifics of the
> cgroup v2 model or implementation.

It is too; I've stated multiple times that the no-internal-tasks thing
is bad and that the root exception is an inexcusable wart that makes the
whole thing internally inconsistent.

But talking to you guys is pointless. You'll just keep moving air until
the other party tires and gives up.

My NAK on v2 stands.

> It's an objection against
> establishing common resource domains as that excludes building
> orthogonal multiple hierarchies.  That, necessarily, can only be
> achieved by having multiple hierarchies for different resource types
> and thus giving up the benefits of common resource domains.

Yes, v2 not allowing that rules it out as a valid model.

> Assuming that, I don't think your position is against cgroup v2 but
> more toward keeping v1 around.  We're talking about two quite
> different mutually exclusive classes of use cases.  You need unified
> for one and disjoint for the other.  v1 is gonna be there and can
> easily be used alongside v2 for different controller types, which
> would in most cases be cpu and cpuset.
> 
> I can't see a reason why this would need to block properly supporting
> containerization use cases.

I don't block that use-case, I block cgroup-v2, it's shit.

The fact is, the naming "v2" suggests it's a replacement and will
deprecate "v1".  Also the implementation is mutually exclusive with v1;
you have to pick one and the other becomes inaccessible.

You cannot even pick another one inside a container, breaking the
container invariant.
