From: Johannes Weiner <hannes@cmpxchg.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>,
	torvalds@linux-foundation.org, akpm@linux-foundation.org,
	mingo@redhat.com, lizefan@huawei.com, pjt@google.com,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	linux-api@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
Date: Thu, 7 Apr 2016 15:04:24 -0400
Message-ID: <20160407190424.GA20407@cmpxchg.org>
In-Reply-To: <20160407082810.GN3430@twins.programming.kicks-ass.net>

On Thu, Apr 07, 2016 at 10:28:10AM +0200, Peter Zijlstra wrote:
> On Thu, Apr 07, 2016 at 03:35:47AM -0400, Johannes Weiner wrote:
> > There was a lot of back and forth whether we should add a second set
> > of knobs just to control the local tasks separately from the subtree,
> > but ended up concluding that the situation can be expressed more
> > clearly by creating dedicated leaf subgroups for stuff like management
> > software and launchers instead, so that their memory pools/LRUs are
> > clearly delineated from other groups and separately controllable. And
> > we couldn't think of any meaningful configuration that could not be
> > expressed in that scheme. I mean, it's the same thing, right?
> 
> No, not the same.
> 
> 
>         R
>       / | \
>     t1  t2  A
>            / \
>          t3   t4
> 
> 
> Is fundamentally different from:
> 
> 
>            R
>          /   \
>        L       A
>      /   \   /   \
>     t1   t2 t3   t4
> 
> 
> Because if in the first hierarchy you add a task (t5) to R, all of A
> will run at 1/4th of total bandwidth where before it had 1/3rd, whereas
> with the second example, if you add t5 to L, A doesn't get any less
> bandwidth.

I didn't mean the exact same configuration; I meant being able to
configure the same outcome in terms of resource distribution.

All this means here is that if you want to change the shares allocated
to the tasks in R (or, in the second layout, L), you have to be
explicit about it and update the weight configuration of L.
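
To make the arithmetic concrete, here is a minimal sketch (not kernel
code) of how proportional weights divide bandwidth in the two layouts
above. It assumes every task and group competes with its siblings
purely by weight, with a default weight of 1. The shares() and tasks()
helpers and the tree encoding are made up for illustration and do not
correspond to any cgroup interface.

```python
from fractions import Fraction


def shares(children, total=Fraction(1)):
    """Split `total` among siblings in proportion to their weights.

    `children` maps a name to (weight, subtree), where subtree is None
    for a task and another mapping for a group. Returns the share of
    every task and group as a fraction of the whole machine.
    """
    out = {}
    weight_sum = sum(weight for weight, _ in children.values())
    for name, (weight, sub) in children.items():
        part = total * Fraction(weight, weight_sum)
        out[name] = part
        if sub is not None:
            out.update(shares(sub, part))
    return out


def tasks(*names):
    """Hypothetical helper: tasks with the assumed default weight of 1."""
    return {name: (1, None) for name in names}


group_a = tasks("t3", "t4")

# First layout: t1 and t2 live in R and compete directly with group A.
flat = {**tasks("t1", "t2"), "A": (1, group_a)}
flat_t5 = {**tasks("t1", "t2", "t5"), "A": (1, group_a)}

# Second layout: t1 and t2 live in leaf group L. Giving L weight 2
# reproduces the distribution of the first layout.
nested = {"L": (2, tasks("t1", "t2")), "A": (1, group_a)}

# Adding t5 to L leaves A alone until L's weight is updated explicitly.
nested_t5 = {"L": (2, tasks("t1", "t2", "t5")), "A": (1, group_a)}
nested_t5_reweighted = {"L": (3, tasks("t1", "t2", "t5")), "A": (1, group_a)}

for label, tree in [
    ("flat", flat),
    ("flat + t5", flat_t5),
    ("nested, L=2", nested),
    ("nested + t5, L=2", nested_t5),
    ("nested + t5, L=3", nested_t5_reweighted),
]:
    print(f"{label:18} A gets {shares(tree)['A']}")
```

Under these assumptions, A's share in the first layout drops from 1/3
to 1/4 as soon as t5 joins R. In the second layout A stays at 1/3 when
t5 joins L, and only drops to 1/4 once L's weight is raised from 2 to 3.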

Again, it's not gratuitous; it's based on the problems this concept in
the interface created in more comprehensive container deployments.

> Please pull your collective heads out of the systemd arse and start
> thinking.

I don't care about systemd here. In fact, in 5 years of rewriting the
memory controller, zero percent of it was driven by systemd and most
of it by Google's feedback at LSF and over email, since they had by
far the most experience and were pushing the frontier. And even though
the performance and overhead of the memory controller were absolutely
abysmal - routinely hitting double digits in page fault profiles - the
discussions *always* centered around the interface and configuration.

IMO, this thread is a little too focused on the reality of a single
resource controller, when in real setups it doesn't exist in a vacuum.
What these environments need is to robustly divide the machine up into
parcels to isolate thousands of jobs on X dimensions at the same time:
allocate CPU time, allocate memory, allocate IO. And then on top of
that implement higher concepts such as dirty page quotas and
writeback, accounting for kswapd's CPU time based on who owns the
memory it reclaims, accounting IO time for the stuff it swaps out,
etc. That *needs* all three resources to be coordinated.

You disparagingly called it the lowest common denominator, but the
thing is that streamlining the controllers and coordinating them
around shared resource domains gives us much more powerful and robust
ways to allocate the *machines* as a whole, and allows the proper
tracking and accounting of cross-domain operations such as writeback
that wasn't even possible before. And all that in a way that doesn't
have the same usability pitfalls that v1 had when you actually push
this stuff beyond the "I want to limit the CPU cycles of this one
service" and move towards "this machine is an anonymous node in a data
center and I want it to host thousands of different workloads - some
sensitive to latency, some that only care about throughput - and they
better not step on each other's toes on *any* of the resource pools."

Those are my primary concerns when it comes to the v2 interface, and I
think focusing too much on what's theoretically possible with a single
controller is missing the bigger challenge of allocating machines.

