From: Tejun Heo <tj@kernel.org>
To: Ingo Molnar <mingo@kernel.org>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>,
	torvalds@linux-foundation.org, akpm@linux-foundation.org,
	a.p.zijlstra@chello.nl, mingo@redhat.com, lizefan@huawei.com,
	hannes@cmpxchg.org, pjt@google.com, linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org, linux-api@vger.kernel.org,
	kernel-team@fb.com, Thomas Gleixner <tglx@linutronix.de>
Subject: Re: cgroup NAKs ignored? Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
Date: Sun, 13 Mar 2016 10:42:57 -0400	[thread overview]
Message-ID: <20160313144257.GA13405@htj.duckdns.org> (raw)
In-Reply-To: <20160312171318.GD1108@gmail.com>

Hello, Ingo.

On Sat, Mar 12, 2016 at 06:13:18PM +0100, Ingo Molnar wrote:
> > BTW, within the scheduler, "process" does not exist. [...]
> 
> Yes, and that's very fundamental.

I'll go into this part later.

> And I see that many bits of the broken 'v2' cgroups ABI already snuck into the 
> upstream kernel in this merge window, without this detail having been agreed upon!
> :-(
>
> Tejun, this _REALLY_ sucks. We had pending NAKs over the design, still you moved 
> ahead like nothing happened, why?!

Hmmmm?  The cpu controller is still in the review branch.  The thread
sprawled out, but the disagreement there was about the missing ability
to hierarchically distribute CPU cycles within a process, and the two
alternatives discussed throughout the thread were a per-process
private filesystem under /proc/PID and an extension of existing
process resource management mechanisms.

Going back to the per-process part, I described the rationales in the
cgroup-v2 documentation and the RFD document, but here are some of the
important bits.

1. Common resource domains

* When different resources get intermixed as do memory and io during
  writeback, without a common resource domain defined across the
  different resource types, it's impossible to perform resource
  control.  As a simplistic example, let's say there are four
  processes (1, 2, 3, 4), two memory cgroups (ma, mb) and two io
  cgroups (ia, ib) with the following membership.

   ma: 1, 2  mb: 3, 4
   ia: 1, 3  ib: 2, 4

  Writeback and dirty throttling are regulated by the proportion of
  dirty memory against available memory and the writeback bandwidth of
  the target backing device.  When resource domains are orthogonal
  like the above, it's impossible to define a clear relationship.
  This is one of the main reasons why writeback behavior has been so
  erratic with respect to cgroups.
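  The mismatch can be made concrete with a short sketch.  This is an
  illustrative model, not kernel code: it just computes, for each
  memory cgroup in the example above, the set of io cgroups that its
  member processes belong to.

```python
# Illustrative sketch (not kernel code): with the membership from the
# example above, every memory cgroup's processes span both io cgroups,
# so writeback originating from one memory cgroup cannot be attributed
# to any single io cgroup.

# process -> cgroup membership, taken from the example
memory_cgroup = {1: "ma", 2: "ma", 3: "mb", 4: "mb"}
io_cgroup = {1: "ia", 2: "ib", 3: "ia", 4: "ib"}

def io_domains_for(mcg):
    """Set of io cgroups containing members of memory cgroup `mcg`."""
    return {io_cgroup[pid] for pid, m in memory_cgroup.items() if m == mcg}

for mcg in ("ma", "mb"):
    print(mcg, "->", sorted(io_domains_for(mcg)))
# ma -> ['ia', 'ib']
# mb -> ['ia', 'ib']
```

  Each memory cgroup maps onto both io cgroups, so there is no common
  domain against which dirty throttling could be regulated.  With
  common resource domains as in v2, each process's memory and io
  always land in the same cgroup.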

* It is a lot more useful and less painful to have common resource
  domains defined across all resource types as it allows expressing
  things like "if this belongs to resource domain F, do XYZ".  A lot
  of use cases are already doing this by building the identical
  hierarchies (to differing depths) across all controllers.


2. Per-process

* There is a relatively pronounced boundary between system management
  and the internal operations of an application, and one side-effect
  of allowing threads to be assigned arbitrarily across the system
  cgroupfs hierarchy is that it mandates close coordination between
  individual applications and system management (whether that be a
  human being or system agent software).  This is userland suffering
  because the kernel fails to provide properly abstracted and isolated
  constructs.

  Decoupling system management and in-application operations makes
  hierarchical resource grouping and control easily accessible to
  individual applications without worrying about how the system is
  managed in the larger scope.  A process is a fairly good
  approximation of this boundary.

* For some resources, going beyond process granularity doesn't make
  much sense.  While we can just let users do whatever they want to do
  and declare certain configurations to yield undefined behavior (the
  io controller on the v1 hierarchy actually does this), it is better
  to provide abstractions which match the actual characteristics.
  Combined with the above, it is natural to distinguish between
  across-process and in-process operations.

> > [...]  A high level composite entity is what we currently aggregate from 
> > arbitrary individual entities, a.k.a threads.  Whether an individual entity be 
> > an un-threaded "process" bash, a thread of "process" oracle, or one of 
> > "process!?!" kernel is irrelevant.  What entity aggregation has to do with 
> > "process" eludes me completely.
> > 
> > What's ad-hoc or unusual about a thread pool servicing an arbitrary number of 
> > customers using cgroup bean accounting?  Job arrives from customer, worker is 
> > dispatched to customer workshop (cgroup), it does whatever on behest of 
> > customer, sends bean count off to the billing department, and returns to the 
> > break room.  What's so annoying about using bean counters for.. counting beans 
> > that you want to forbid it?
> 
> Agreed ... and many others expressed this concern as well. Why were these concerns 
> ignored?

They weren't ignored.  The concern expressed was the loss of the
ability to hierarchically distribute resources within a process, and
the RFD document and this patchset are attempts at resolving that
specific issue.

Going back to Mike's "why can't these be arbitrary bean counters?",
yes, they can be.  That's what one gets when the cpu controller is
mounted on its own hierarchy.  If that's what the use case at hand
calls for, that is the way to go and there's nothing preventing that.
In fact, with the recent restructuring of cgroup core, splitting a
stateless controller off to a new hierarchy can be made a lot easier
for such use cases.

However, as explained above, controlling a resource in an
abstraction- and restriction-free style also has its costs.  There's
no way to tie together different types of resources serving the same
purpose, which can be generally painful and makes some cross-resource
operations impossible.  Another cost is entangling in-process
operations with system management, IOW, a process having to speak to
the external $SYSTEM_AGENT to manage its threadpools.

What the proposed solution tries to achieve is balancing flexibility
at the system management level with proper abstractions and isolation
so that hierarchical resource management is actually accessible to a
much wider set of applications and use cases.

Given how cgroup is used in the wild, I'm pretty sure that the
structured approach will reach a much wider audience without getting
in the way of what users are trying to achieve.  That said, again,
for specific use cases where the benefits of the structured approach
can or should be ignored, using the cpu controller as arbitrary
hierarchical bean counters is completely fine and the right solution.

Thanks.

-- 
tejun

