From: Andy Lutomirski <luto@amacapital.net>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>,
	Mike Galbraith <umgwanakikbuti@gmail.com>,
	kernel-team@fb.com, Andrew Morton <akpm@linux-foundation.org>,
	"open list:CONTROL GROUP (CGROUP)" <cgroups@vger.kernel.org>,
	Paul Turner <pjt@google.com>, Li Zefan <lizefan@huawei.com>,
	Linux API <linux-api@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Tejun Heo <tj@kernel.org>, Johannes Weiner <hannes@cmpxchg.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [Documentation] State of CPU controller in cgroup v2
Date: Fri, 16 Sep 2016 09:29:06 -0700	[thread overview]
Message-ID: <CALCETrXoTfhaDxZJ9_XcFknnniDvrYLY9SATVXj+tK1UdaWw4A@mail.gmail.com> (raw)
In-Reply-To: <20160916161951.GH5016@twins.programming.kicks-ass.net>

On Fri, Sep 16, 2016 at 9:19 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Sep 16, 2016 at 08:12:58AM -0700, Andy Lutomirski wrote:
>> On Sep 16, 2016 12:51 AM, "Peter Zijlstra" <peterz@infradead.org> wrote:
>> >
>> > On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote:
>> > > BTW, Mike keeps mentioning exclusive cgroups as problematic with the
>> > > no-internal-tasks constraints.  Do exclusive cgroups still exist in
>> > > cgroup2?  Could we perhaps just remove that capability entirely?  I've
>> > > never understood what problem exclusive cpusets and such solve that
>> > > can't be more comprehensibly solved by just assigning the cpusets the
>> > > normal inclusive way.
>> >
>> > Without exclusive sets we cannot split the sched_domain structure.
>> > Which leads to not being able to actually partition things. That would
>> > break DL for one.
>>
>> Can you sketch out a toy example?
>
> [ Also see Documentation/cgroup-v1/cpusets.txt section 1.7 ]
>
>
>   mkdir /cpuset
>
>   mount -t cgroup -o cpuset none /cpuset
>
>   mkdir /cpuset/A
>   mkdir /cpuset/B
>
>   cat /sys/devices/system/node/node0/cpulist > /cpuset/A/cpuset.cpus
>   echo 0 > /cpuset/A/cpuset.mems
>
>   cat /sys/devices/system/node/node1/cpulist > /cpuset/B/cpuset.cpus
>   echo 1 > /cpuset/B/cpuset.mems
>
>   # move all movable tasks into A
>   cat /cpuset/tasks | while read task; do echo $task > /cpuset/A/tasks ; done
>
>   # kill machine wide load-balancing
>   echo 0 > /cpuset/cpuset.sched_load_balance
>
>   # now place 'special' tasks in B
>
>
> This partitions the scheduler into two, one for each node.
>
> Hereafter no task will be moved from one node to another. The
> load-balancer is split in two: one balances in A, one balances in B,
> and nothing crosses. (It is important that A.cpus and B.cpus do not
> intersect.)
>
> Ideally no task would remain in the root group; back in the day we could
> actually do this (with the exception of the CPU-bound kernel threads), but
> this has significantly regressed :-(
> (still hate the workqueue affinity interface)

I wonder if we could address this by creating (automatically at boot
or when the cpuset controller is enabled or whatever) a
/cpuset/random_kernel_shit cgroup and having all of the unmovable tasks
land there?
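
As a rough, untested sketch on top of your example above (the
cpus/mems values and the error handling are my assumptions; moves of
pinned kernel threads simply fail and get skipped):

  mkdir /cpuset/random_kernel_shit
  cat /sys/devices/system/cpu/online > /cpuset/random_kernel_shit/cpuset.cpus
  cat /sys/devices/system/node/online > /cpuset/random_kernel_shit/cpuset.mems

  # sweep whatever is still in the root group; pinned kernel threads
  # (PF_NO_SETAFFINITY) refuse the move, hence the 2>/dev/null
  while read task; do
    echo $task > /cpuset/random_kernel_shit/tasks 2>/dev/null
  done < /cpuset/tasks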

>
> As is, tasks that are left in the root group get balanced within
> whatever domain they ended up in.
>
>> And what's DL?
>
> SCHED_DEADLINE, it's a 'Global'-EDF-like scheduler that doesn't support
> CPU affinities (because that doesn't make sense). The only way to
> restrict it is to partition.
>
> 'Global' because you can partition it. If you reduce your system to
> single CPU partitions you'll reduce to P-EDF.
>
> (The same is true of SCHED_FIFO; that's a 'Global'-FIFO on the same
> partition scheme. It does, however, support sched_setaffinity(), but
> using it gives 'interesting' schedulability results -- call it a
> historic accident.)

Hmm, I didn't realize that the deadline scheduler was global.  But
ISTM requiring the use of "exclusive" to get this working is
unfortunate.  What if a user wants two separate partitions, one using
CPUs 1 and 2 and the other using CPUs 3 and 4 (with 5 reserved for
non-RT stuff)?  Shouldn't we be able to have a cgroup for each of the
DL partitions and do something to tell the deadline scheduler "here is
your domain"?
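
For concreteness, here's roughly what I mean, built with cpuset v1 as
it exists today (a sketch, untested; it assumes the /cpuset mount from
your example, a single memory node, a chrt from a recent util-linux
that knows about SCHED_DEADLINE, and ./my_dl_task as a placeholder):

  # rt1 gets CPUs 1-2, rt2 gets CPUs 3-4; 0 and 5 stay in the root
  mkdir /cpuset/rt1 /cpuset/rt2
  echo 1-2 > /cpuset/rt1/cpuset.cpus
  echo 3-4 > /cpuset/rt2/cpuset.cpus
  echo 0 > /cpuset/rt1/cpuset.mems
  echo 0 > /cpuset/rt2/cpuset.mems

  # kill machine-wide balancing so rt1 and rt2 become sched domains
  # of their own (their cpus must not intersect)
  echo 0 > /cpuset/cpuset.sched_load_balance

  # run a deadline task inside the first partition:
  # 10ms runtime per 100ms period
  echo $$ > /cpuset/rt1/tasks
  chrt -d --sched-runtime 10000000 --sched-deadline 100000000 \
       --sched-period 100000000 0 ./my_dl_task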

>
>
> Note that, relatedly but differently, we have the isolcpus boot parameter,
> which creates single-CPU partitions for all listed CPUs and gives the
> rest to the root cpuset. Ideally we'd kill this option given it's a
> boot-time setting (for something which is trivial to do at runtime).
>
> But this cannot be done, because that would mean we'd have to start with
> a !0 cpuset layout:
>
>                 '/'
>                 load_balance=0
>             /              \
>         'system'        'isolated'
>         cpus=~isolcpus  cpus=isolcpus
>                         load_balance=0
>
> And start with _everything_ in the /system group (including default IRQ
> affinities).
>
> Of course, that will break everything cgroup :-(
>

I would actually *much* prefer this over the status quo.  I'm tired of
my crappy, partially-working script that sits there and creates
exactly this configuration (minus the isolcpus part because I actually
want migration to work) on boot.  (Actually, it could have two
automatic cgroups: /kernel and /init -- init and UMH would go in /init,
and kernel threads and such would go in /kernel.  Userspace would be
able to request that a different cgroup be used for newly created
kernel threads.)
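
For the record, the /kernel + /init variant would boil down to
something like this (again a sketch with my assumptions baked in: the
v1 cpuset mount from above, and "empty /proc/$pid/cmdline" as the
kernel-thread test):

  mkdir /cpuset/kernel /cpuset/init
  for g in kernel init; do
    cat /sys/devices/system/cpu/online > /cpuset/$g/cpuset.cpus
    cat /sys/devices/system/node/online > /cpuset/$g/cpuset.mems
  done

  # kernel threads have an empty cmdline; route them to /kernel and
  # everything else (init, UMH) to /init.  Pinned kernel threads
  # refuse the move -- that's the regression discussed above.
  while read task; do
    if [ -n "$(tr -d '\0' < /proc/$task/cmdline)" ]; then
      echo $task > /cpuset/init/tasks 2>/dev/null
    else
      echo $task > /cpuset/kernel/tasks 2>/dev/null
    fi
  done < /cpuset/tasks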

Heck, even systemd would probably prefer this.  Then it could cleanly
expose a "slice" or whatever it's called for random kernel shit and at
least you could configure it meaningfully.
