From: Johannes Weiner <hannes@cmpxchg.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>, torvalds@linux-foundation.org,
	akpm@linux-foundation.org, mingo@redhat.com, lizefan@huawei.com,
	pjt@google.com, linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org, linux-api@vger.kernel.org,
	kernel-team@fb.com
Subject: Re: [PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
Date: Thu, 7 Apr 2016 15:04:24 -0400
Message-ID: <20160407190424.GA20407@cmpxchg.org>
In-Reply-To: <20160407082810.GN3430@twins.programming.kicks-ass.net>

On Thu, Apr 07, 2016 at 10:28:10AM +0200, Peter Zijlstra wrote:
> On Thu, Apr 07, 2016 at 03:35:47AM -0400, Johannes Weiner wrote:
> > There was a lot of back and forth whether we should add a second set
> > of knobs just to control the local tasks separately from the subtree,
> > but ended up concluding that the situation can be expressed more
> > clearly by creating dedicated leaf subgroups for stuff like management
> > software and launchers instead, so that their memory pools/LRUs are
> > clearly delineated from other groups and separately controllable. And
> > we couldn't think of any meaningful configuration that could not be
> > expressed in that scheme. I mean, it's the same thing, right?
>
> No, not the same.
>
>        R
>      / | \
>    t1 t2  A
>          / \
>         t3 t4
>
> Is fundamentally different from:
>
>         R
>       /   \
>      L     A
>     / \   / \
>   t1  t2 t3 t4
>
> Because if in the first hierarchy you add a task (t5) to R, all of A
> will run at 1/4th of total bandwidth where before it had 1/3rd,
> whereas with the second example, if you add our t5 to L, A doesn't
> get any less bandwidth.

I didn't mean the exact same configuration, I meant being able to
configure the same outcome of resource distribution. All this means
here is that if you want to change the shares allocated to the tasks
in R (now moved into L) you have to be explicit about it and update
the weight configuration of L.
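[Editorial aside: the bandwidth arithmetic in the two hierarchies above
can be sketched with a toy weighted fair-share model. This is plain
Python for illustration only, not kernel code; the `shares()` helper and
the specific weights are hypothetical, chosen to mirror the example.]

```python
# Toy model of hierarchical weighted fair sharing (illustrative only).
# A node is either a task name (leaf) or a list of (weight, child) pairs.

def shares(node, total=1.0):
    """Return each task's fraction of `total`, split by weight per level."""
    if isinstance(node, str):
        return {node: total}
    weight_sum = sum(w for w, _ in node)
    out = {}
    for w, child in node:
        out.update(shares(child, total * w / weight_sum))
    return out

A = [(1, "t3"), (1, "t4")]

# Hierarchy 1: R holds t1, t2 and group A directly, all at equal weight.
# A's subtree gets 1/3; adding t5 as a fourth child dilutes it to 1/4.
h1 = shares([(1, "t1"), (1, "t2"), (1, A)])
h1_t5 = shares([(1, "t1"), (1, "t2"), (1, "t5"), (1, A)])

# Hierarchy 2: t1/t2 live in leaf group L. Adding t5 to L leaves A's
# half untouched; to reproduce hierarchy 1's per-task outcome you must
# explicitly raise L's weight (here 3 against A's 1).
L = [(1, "t1"), (1, "t2"), (1, "t5")]
h2 = shares([(3, L), (1, A)])  # t1/t2/t5: 1/4 each, t3/t4: 1/8 each
```

With weight 3 on L versus 1 on A, L's three tasks each get 1/4 and A's
subtree keeps 1/4 — the same split as adding t5 directly under R in the
first hierarchy. That is the "be explicit and update the weight
configuration" step the paragraph above refers to.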
Again, it's not gratuitous; it's based on the problems this concept in
the interface created in more comprehensive container deployments.

> Please pull your collective heads out of the systemd arse and start
> thinking.

I don't care about systemd here. In fact, in 5 years of rewriting the
memory controller, zero percent of it was driven by systemd; most of it
was driven by Google's feedback at LSF and over email, since they had
by far the most experience and were pushing the frontier. And even
though the performance and overhead of the memory controller were
absolutely abysmal - routinely hitting double digits in page fault
profiles - the discussions *always* centered on the interface and
configuration.

IMO, this thread is a little too focused on the reality of a single
resource controller, when in real setups it doesn't exist in a vacuum.
What these environments need is to robustly divide the machine up into
parcels that isolate thousands of jobs on several dimensions at the
same time: allocate CPU time, allocate memory, allocate IO. And then,
on top of that, implement higher-level concepts such as dirty page
quotas and writeback, accounting kswapd's CPU time to whoever owns the
memory it reclaims, accounting IO time for the stuff it swaps out, etc.
That *needs* all three resources to be coordinated.

You disparagingly called it the lowest common denominator, but the
thing is that streamlining the controllers and coordinating them
around shared resource domains gives us much more powerful and robust
ways to allocate the *machines* as a whole, and allows proper tracking
and accounting of cross-domain operations such as writeback, which
wasn't even possible before.
And all that in a way that doesn't have the same usability pitfalls
that v1 had once you push this stuff beyond "I want to limit the CPU
cycles of this one service" and toward "this machine is an anonymous
node in a data center and I want it to host thousands of different
workloads - some sensitive to latency, some that only care about
throughput - and they had better not step on each other's toes on *any*
of the resource pools."

Those are my primary concerns when it comes to the v2 interface, and I
think focusing too much on what's theoretically possible with a single
controller misses the bigger challenge of allocating whole machines.