From: Andy Lutomirski <luto@amacapital.net>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>, Mike Galbraith <umgwanakikbuti@gmail.com>,
	kernel-team@fb.com, Andrew Morton <akpm@linux-foundation.org>,
	"open list:CONTROL GROUP (CGROUP)" <cgroups@vger.kernel.org>,
	Paul Turner <pjt@google.com>, Li Zefan <lizefan@huawei.com>,
	Linux API <linux-api@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Tejun Heo <tj@kernel.org>, Johannes Weiner <hannes@cmpxchg.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [Documentation] State of CPU controller in cgroup v2
Date: Fri, 16 Sep 2016 09:29:06 -0700
Message-ID: <CALCETrXoTfhaDxZJ9_XcFknnniDvrYLY9SATVXj+tK1UdaWw4A@mail.gmail.com>
In-Reply-To: <20160916161951.GH5016@twins.programming.kicks-ass.net>

On Fri, Sep 16, 2016 at 9:19 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Sep 16, 2016 at 08:12:58AM -0700, Andy Lutomirski wrote:
>> On Sep 16, 2016 12:51 AM, "Peter Zijlstra" <peterz@infradead.org> wrote:
>> >
>> > On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote:
>> > > BTW, Mike keeps mentioning exclusive cgroups as problematic with the
>> > > no-internal-tasks constraints.  Do exclusive cgroups still exist in
>> > > cgroup2?  Could we perhaps just remove that capability entirely?  I've
>> > > never understood what problem exclusive cpusets and such solve that
>> > > can't be more comprehensibly solved by just assigning the cpusets the
>> > > normal inclusive way.
>> >
>> > Without exclusive sets we cannot split the sched_domain structure.
>> > Which leads to not being able to actually partition things.  That would
>> > break DL for one.
>>
>> Can you sketch out a toy example?
> [ Also see Documentation/cgroup-v1/cpusets.txt section 1.7 ]
>
>   mkdir /cpuset
>
>   mount -t cgroup -o cpuset none /cpuset
>
>   mkdir /cpuset/A
>   mkdir /cpuset/B
>
>   cat /sys/devices/system/node/node0/cpulist > /cpuset/A/cpuset.cpus
>   echo 0 > /cpuset/A/cpuset.mems
>
>   cat /sys/devices/system/node/node1/cpulist > /cpuset/B/cpuset.cpus
>   echo 1 > /cpuset/B/cpuset.mems
>
>   # move all movable tasks into A
>   cat /cpuset/tasks | while read task; do echo $task > /cpuset/A/tasks; done
>
>   # kill machine-wide load-balancing
>   echo 0 > /cpuset/cpuset.sched_load_balance
>
>   # now place 'special' tasks in B
>
> This partitions the scheduler into two, one for each node.
>
> Hereafter no task will be moved from one node to another.  The
> load-balancer is split in two: one balances in A, one balances in B,
> nothing crosses.  (It is important that A.cpus and B.cpus do not
> intersect.)
>
> Ideally no task would remain in the root group; back in the day we could
> actually do this (with the exception of the CPU-bound kernel threads),
> but this has significantly regressed :-(
> (still hate the workqueue affinity interface)

I wonder if we could address this by creating (automatically at boot or
when the cpuset controller is enabled or whatever) a
/cpuset/random_kernel_shit cgroup and have all of the unmoveable tasks
land there?

> As is, tasks that are left in the root group get balanced within
> whatever domain they ended up in.
>
>> And what's DL?
>
> SCHED_DEADLINE.  It's a 'Global'-EDF-like scheduler that doesn't support
> CPU affinities (because that doesn't make sense).  The only way to
> restrict it is to partition.
>
> 'Global' because you can partition it.  If you reduce your system to
> single-CPU partitions you'll reduce to P-EDF.
>
> (The same is true of SCHED_FIFO; that's a 'Global'-FIFO on the same
> partition scheme.  It does, however, support sched_affinity, but using
> it gives 'interesting' schedulability results -- call it a historic
> accident.)
Hmm, I didn't realize that the deadline scheduler was global.  But ISTM
that requiring the use of "exclusive" to get this working is unfortunate.
What if a user wants two separate partitions, one using CPUs 1 and 2 and
the other using CPUs 3 and 4 (with 5 reserved for non-RT stuff)?
Shouldn't we be able to have a cgroup for each of the DL partitions and
do something to tell the deadline scheduler "here is your domain"?

> Note that, related but different, we have the isolcpus boot parameter,
> which creates single-CPU partitions for all listed CPUs and gives the
> rest to the root cpuset.  Ideally we'd kill this option, given it's a
> boot-time setting (for something which is trivial to do at runtime).
>
> But this cannot be done, because that would mean we'd have to start with
> a !0 cpuset layout:
>
>                  '/'
>             load_balance=0
>            /              \
>      'system'          'isolated'
>   cpus=~isolcpus     cpus=isolcpus
>                     load_balance=0
>
> And start with _everything_ in the /system group (including default IRQ
> affinities).
>
> Of course, that will break everything cgroup :-(

I would actually *much* prefer this over the status quo.  I'm tired of my
crappy, partially-working script that sits there and creates exactly this
configuration (minus the isolcpus part, because I actually want migration
to work) on boot.

(Actually, it could have two automatic cgroups: /kernel and /init --
init and UMH would go in /init, and kernel threads and such would go in
/kernel.  Userspace would be able to request that a different cgroup be
used for newly-created kernel threads.)

Heck, even systemd would probably prefer this.  Then it could cleanly
expose a "slice" or whatever it's called for random kernel shit, and at
least you could configure it meaningfully.
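[Editorial note: the two-partition layout asked about above (CPUs 1 and 2
in one RT domain, 3 and 4 in another, 5 left for non-RT work) can be
approximated today with exclusive cpusets.  The following is an untested
sketch, assuming cgroup-v1 cpuset semantics per
Documentation/cgroup-v1/cpusets.txt and a util-linux chrt built with
SCHED_DEADLINE support; the partition names, the deadline parameters, and
./my_dl_task are made up for illustration.]

```shell
#!/bin/sh
# Sketch only: requires root, a cgroup-v1 cpuset mount, and kernel DL support.
set -e
mountpoint -q /cpuset || mount -t cgroup -o cpuset none /cpuset

mkdir -p /cpuset/rt_a /cpuset/rt_b
echo 1-2 > /cpuset/rt_a/cpuset.cpus
echo 3-4 > /cpuset/rt_b/cpuset.cpus
echo 0   > /cpuset/rt_a/cpuset.mems
echo 0   > /cpuset/rt_b/cpuset.mems

# Mark both sets exclusive so each can become its own sched_domain...
echo 1 > /cpuset/rt_a/cpuset.cpu_exclusive
echo 1 > /cpuset/rt_b/cpuset.cpu_exclusive

# ...then stop balancing across the root, which splits the domains.
# CPU 5 (and anything else left in the root set) stays outside both.
echo 0 > /cpuset/cpuset.sched_load_balance

# Launch a deadline task confined to the first partition.
# chrt -d takes runtime/deadline/period in nanoseconds; priority must be 0.
echo $$ > /cpuset/rt_a/tasks
chrt -d --sched-runtime 5000000 --sched-deadline 10000000 \
     --sched-period 16666666 0 ./my_dl_task
```

Whether DL admission control accepts the task in each partition's root
domain after the split is exactly the behavior being discussed above.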