From: Peter Zijlstra <peterz@infradead.org>
To: Andy Lutomirski <luto@amacapital.net>
Cc: Ingo Molnar <mingo@redhat.com>,
	Mike Galbraith <umgwanakikbuti@gmail.com>,
	kernel-team@fb.com, Andrew Morton <akpm@linux-foundation.org>,
	"open list:CONTROL GROUP (CGROUP)" <cgroups@vger.kernel.org>,
	Paul Turner <pjt@google.com>, Li Zefan <lizefan@huawei.com>,
	Linux API <linux-api@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Tejun Heo <tj@kernel.org>, Johannes Weiner <hannes@cmpxchg.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [Documentation] State of CPU controller in cgroup v2
Date: Fri, 16 Sep 2016 18:19:51 +0200
Message-ID: <20160916161951.GH5016@twins.programming.kicks-ass.net>
In-Reply-To: <CALCETrXzrXJmZoFVfAXS1Zf9uNZjibnHizEhwgqdmRvnbJEksw@mail.gmail.com>

On Fri, Sep 16, 2016 at 08:12:58AM -0700, Andy Lutomirski wrote:
> On Sep 16, 2016 12:51 AM, "Peter Zijlstra" <peterz@infradead.org> wrote:
> >
> > On Thu, Sep 15, 2016 at 01:08:07PM -0700, Andy Lutomirski wrote:
> > > BTW, Mike keeps mentioning exclusive cgroups as problematic with the
> > > no-internal-tasks constraints.  Do exclusive cgroups still exist in
> > > cgroup2?  Could we perhaps just remove that capability entirely?  I've
> > > never understood what problem exclusive cpusets and such solve that
> > > can't be more comprehensibly solved by just assigning the cpusets the
> > > normal inclusive way.
> >
> > Without exclusive sets we cannot split the sched_domain structure,
> > which leads to not being able to actually partition things. That
> > would break DL, for one.
> 
> Can you sketch out a toy example? 

[ Also see Documentation/cgroup-v1/cpusets.txt section 1.7 ]


  mkdir /cpuset

  mount -t cgroup -o cpuset none /cpuset

  mkdir /cpuset/A
  mkdir /cpuset/B

  # A gets node0's CPUs and memory; B gets node1's
  cat /sys/devices/system/node/node0/cpulist > /cpuset/A/cpuset.cpus
  echo 0 > /cpuset/A/cpuset.mems

  cat /sys/devices/system/node/node1/cpulist > /cpuset/B/cpuset.cpus
  echo 1 > /cpuset/B/cpuset.mems

  # move all movable tasks into A (the write fails harmlessly for
  # pinned kernel threads)
  cat /cpuset/tasks | while read task; do echo $task > /cpuset/A/tasks; done 2>/dev/null

  # kill machine-wide load-balancing; A and B now become two
  # separate scheduling partitions
  echo 0 > /cpuset/cpuset.sched_load_balance

  # now place 'special' tasks in B


This partitions the scheduler in two, one partition per node.

Hereafter no task will be moved from one node to the other. The
load-balancer is split in two as well: one instance balances within A,
one within B, and nothing crosses. (It is important that A.cpus and
B.cpus do not intersect.)
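
To convince yourself the split took, stick a task in B and poke at
/proc; a quick sanity check (reusing the /cpuset mount from above):

  echo $$ > /cpuset/B/tasks
  cat /proc/self/cpuset                      # should say /B
  grep Cpus_allowed_list /proc/self/status   # should be node1's CPUs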

Ideally no task would remain in the root group. Back in the day we could
actually achieve that (with the exception of the CPU-bound kernel
threads), but this has significantly regressed :-(
(I still hate the workqueue affinity interface.)

As is, tasks that are left in the root group get balanced within
whatever domain they ended up in.
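
Something like the below shows where such a task got stranded; its
affinity mask still spans everything, only the CPU it runs on tells you
which partition the balancer keeps it in:

  pid=$(head -1 /cpuset/tasks)
  grep Cpus_allowed_list /proc/$pid/status   # still spans all CPUs
  ps -o pid,psr,comm -p $pid                 # PSR: CPU it last ran on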

> And what's DL?

SCHED_DEADLINE. It's a 'Global'-EDF-like scheduler that doesn't support
per-task CPU affinities (because those make no sense for it). The only
way to restrict it is to partition.

'Global' is in quotes because you can partition it; if you reduce your
system to single-CPU partitions you reduce to P-EDF.
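
For illustration only -- assuming a util-linux new enough for chrt to
speak SCHED_DEADLINE (around v2.28), and ./my-dl-task being a
placeholder for whatever you want to run -- starting a DL task inside
the B partition from above looks something like:

  # 10ms of runtime every 100ms period (values are in nanoseconds);
  # the static priority argument must be 0 for SCHED_DEADLINE
  echo $$ > /cpuset/B/tasks
  chrt -d --sched-runtime 10000000 --sched-deadline 100000000 \
          --sched-period 100000000 0 ./my-dl-task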

(The same is true of SCHED_FIFO: that's a 'Global'-FIFO on the same
partitioning scheme. It does, however, support sched_setaffinity(), but
using it gives 'interesting' schedulability results -- call it a
historic accident.)
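
That is, nothing stops you from hand-pinning FIFO tasks like below
(./my-rt-task again a placeholder); you just shouldn't expect the usual
G-FIFO analysis to say anything useful about the result:

  # a valid thing to do, but the resulting arbitrary-affinity system
  # is where the 'interesting' schedulability results come from
  taskset -c 3 chrt -f 50 ./my-rt-task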


Note that, related but different, we have the isolcpus boot parameter,
which creates single-CPU partitions for all listed CPUs and gives the
rest to the root cpuset. Ideally we'd kill this option, given it's a
boot-time setting for something that is trivial to do at runtime.
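
For reference, the whole interface is a single kernel command-line
entry, e.g.:

  # CPUs 2 and 3 each become a single-CPU partition; CPUs 0 and 1
  # stay in the root cpuset and get load-balanced as usual
  isolcpus=2,3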

But killing it cannot be done, because that would mean we'd have to boot
with a non-trivial (!0) cpuset layout:

                '/'
                load_balance=0
               /            \
        'system'          'isolated'
        cpus=~isolcpus    cpus=isolcpus
                          load_balance=0

And we'd have to start with _everything_ in the /system group (including
default IRQ affinities).
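
Hand-rolled at runtime that would look something like the below -- a
sketch only, assuming 4 CPUs on one node with CPUs 2-3 playing the
isolcpus role, and reusing the /cpuset mount from the earlier example
(/proc/irq/default_smp_affinity takes a hex CPU bitmask):

  mkdir /cpuset/system /cpuset/isolated

  echo 0-1 > /cpuset/system/cpuset.cpus
  echo 0   > /cpuset/system/cpuset.mems
  echo 2-3 > /cpuset/isolated/cpuset.cpus
  echo 0   > /cpuset/isolated/cpuset.mems

  # everything movable goes into 'system'
  cat /cpuset/tasks | while read task; do echo $task > /cpuset/system/tasks; done 2>/dev/null

  # no balancing across the system/isolated boundary, and none inside
  # 'isolated' either: CPUs 2-3 end up as single-CPU partitions
  echo 0 > /cpuset/cpuset.sched_load_balance
  echo 0 > /cpuset/isolated/cpuset.sched_load_balance

  # steer default IRQ affinities at the 'system' CPUs (0x3 = CPUs 0-1)
  echo 3 > /proc/irq/default_smp_affinity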

Of course, that will break everything cgroup :-(
