linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Jan H. Schönherr" <jschoenh@amazon.de>
To: Rik van Riel <riel@surriel.com>,
	Frederic Weisbecker <frederic@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel@vger.kernel.org,
	Subhra Mazumdar <subhra.mazumdar@oracle.com>
Subject: Re: [RFC 00/60] Coscheduling for Linux
Date: Fri, 19 Oct 2018 21:07:45 +0200	[thread overview]
Message-ID: <37fe7d2a-5ec6-067b-e446-6c5a0fb10153@amazon.de> (raw)
In-Reply-To: <e34f720326a03ac07e3156abf77f0f6c22ce4289.camel@surriel.com>

On 19/10/2018 17.45, Rik van Riel wrote:
> On Fri, 2018-10-19 at 17:33 +0200, Frederic Weisbecker wrote:
>> On Fri, Oct 19, 2018 at 11:16:49AM -0400, Rik van Riel wrote:
>>> On Fri, 2018-10-19 at 13:40 +0200, Jan H. Schönherr wrote:
>>>> 
>>>> Now, it would be possible to "invent" relocatable cpusets to 
>>>> address that issue ("I want affinity restricted to a core, I don't
>>>> care which"), but then, the current way how cpuset affinity is
>>>> enforced doesn't scale for making use of it from within the
>>>> balancer. (The upcoming load balancing portion of the coscheduler
>>>> currently uses a file similar to cpu.scheduled to restrict
>>>> affinity to a load-balancer-controlled subset of the system.)
>>> 
>>> Oh boy, so the coscheduler is going to get its own load balancer?

Not "its own". The load balancer already aggregates statistics about
sched-groups. With the coscheduler as posted, there is now a runqueue per
scheduling group. The current "ad-hoc" gathering of data per scheduling
group is then basically replaced with looking up that data at the
corresponding runqueue, where it is kept up-to-date automatically.


>>> At that point, why bother integrating the coscheduler into CFS,
>>> instead of making it its own scheduling class?
>>> 
>>> CFS is already complicated enough that it borders on unmaintainable.
>>> I would really prefer to have the coscheduler code separate from
>>> CFS, unless there is a really compelling reason to do otherwise.
>> 
>> I guess he wants to reuse as much as possible from the CFS features 
>> and code present or to come (nice, fairness, load balancing, power
>> aware, NUMA aware, etc...).

Exactly. I want a user to be able to "switch on" coscheduling for those
parts of the workload that profit from it, without affecting the behavior
we are all used to. For both: scheduling behavior for tasks that are not
coscheduled, as well as scheduling behavior for tasks *within* the group of
coscheduled tasks.


> I wonder if things like nice levels, fairness, and balancing could be
> broken out into code that could be reused from both CFS and a new 
> co-scheduler scheduling class.
> 
> A bunch of the cgroup code is already broken out, but maybe some more
> could be broken out and shared, too?

Maybe.


>> OTOH you're right, the thing has specific enough requirements to 
>> consider a new sched policy.

The primary issue that I have with a new scheduling class, is that they are
strictly priority ordered. If there is a runnable task in a higher class,
it is executed, no matter the situation in lower classes. "Coscheduling"
would have to be higher in the class hierarchy than CFS. And then, all
kinds of issues appear from starvation of CFS threads and other unfairness,
to the necessity of (re-)defining a set of preemption rules, nice and other
things that are given with CFS.


> Some bits of functionality come to mind:
> 
> - track groups of tasks that should be co-scheduled (eg all the VCPUs of
> a virtual machine)

cgroups

> - track the subsets of those groups that are runnable (eg. the currently
> runnable VCPUs of a virtual machine)

runqueues

> - figure out time slots and CPU assignments to efficiently use CPU time
> for the co-scheduled tasks (while leaving some configurable(?) amount of
> CPU time available for other tasks)

CFS runqueues and associated rules for preemption/time slices/etc.

> - configuring some lower-level code on each affected CPU to "run task A
> in slot X", etc

There is no "slot" concept, as it does not fit my idea of interactive
usage. (As in "slot X will execute from time T to T+1.) It is purely
event-driven right now (eg, "group X just became runnable, it is considered
more important than the currently running group Y; all CPUs (in the
affected part of the system) switch to group X", or "group X ran long
enough, next group").

While some planning ahead seems possible (as demonstrated by the Tableau
scheduler that Peter already pointed me at), I currently cannot imagine
such an approach working for general purpose workloads. The absence of true
preemption being my primary concern.


> This really does not seem like something that could be shoehorned into
> CFS without making it unmaintainable.
> 
> Furthermore, it also seems like the thing that you could never really
> get into a highly efficient state as long as it is weighed down by the
> rest of CFS.

I still have this idealistic notion, that there is no "weighing down". I
see it more as profiting from all the hard work that went into CFS,
avoiding the same mistakes, being backwards compatible, etc.



If I were to do this "outside of CFS", I'd overhaul the scheduling class
concept as it exists today. Instead, I'd probably attempt to schedule
instantiations of scheduling classes. In its easiest setup, nothing would
change: one CFS instance, one RT instance, one DL instance, strictly
ordered by priority (on each CPU). The coscheduler as it is posted (and
task groups in general), are effectively some form of multiple CFS
instances being governed by a CFS instance.

This approach would allow, for example, multiple CFS instances that are
scheduled with explicit priorities; or some tasks that are scheduled with a
custom scheduling class, while the whole group of tasks competes for time
with other tasks via CFS rules.

I'd still keep the feature of "coscheduling" orthogonal to everything else,
though. Essentially, I'd just give the user/admin the possibility to choose
the set of rules that shall be applied to entities in a runqueue.



Your idea of further modularization seems to go in a similar direction, or
at least is not incompatible with that. If it helps keeping things
maintainable, I'm all for it. For example, some of the (upcoming) load
balancing changes are just generalizations, so that the functions don't
operate on *the* set of CFS runqueues, but just *a* set of CFS runqueues.
Similarly in the already posted code, where task picking now starts at
*some* top CFS runqueue, instead of *the* top CFS runqueue.

Regards
Jan

  reply	other threads:[~2018-10-19 19:08 UTC|newest]

Thread overview: 114+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-09-07 21:39 [RFC 00/60] Coscheduling for Linux Jan H. Schönherr
2018-09-07 21:39 ` [RFC 01/60] sched: Store task_group->se[] pointers as part of cfs_rq Jan H. Schönherr
2018-09-07 21:39 ` [RFC 02/60] sched: Introduce set_entity_cfs() to place a SE into a certain CFS runqueue Jan H. Schönherr
2018-09-07 21:39 ` [RFC 03/60] sched: Setup sched_domain_shared for all sched_domains Jan H. Schönherr
2018-09-07 21:39 ` [RFC 04/60] sched: Replace sd_numa_mask() hack with something sane Jan H. Schönherr
2018-09-07 21:39 ` [RFC 05/60] sched: Allow to retrieve the sched_domain_topology Jan H. Schönherr
2018-09-07 21:39 ` [RFC 06/60] sched: Add a lock-free variant of resched_cpu() Jan H. Schönherr
2018-09-07 21:39 ` [RFC 07/60] sched: Reduce dependencies of init_tg_cfs_entry() Jan H. Schönherr
2018-09-07 21:39 ` [RFC 08/60] sched: Move init_entity_runnable_average() into init_tg_cfs_entry() Jan H. Schönherr
2018-09-07 21:39 ` [RFC 09/60] sched: Do not require a CFS in init_tg_cfs_entry() Jan H. Schönherr
2018-09-07 21:39 ` [RFC 10/60] sched: Use parent_entity() in more places Jan H. Schönherr
2018-09-07 21:39 ` [RFC 11/60] locking/lockdep: Increase number of supported lockdep subclasses Jan H. Schönherr
2018-09-07 21:39 ` [RFC 12/60] locking/lockdep: Make cookie generator accessible Jan H. Schönherr
2018-09-07 21:40 ` [RFC 13/60] sched: Remove useless checks for root task-group Jan H. Schönherr
2018-09-07 21:40 ` [RFC 14/60] sched: Refactor sync_throttle() to accept a CFS runqueue as argument Jan H. Schönherr
2018-09-07 21:40 ` [RFC 15/60] sched: Introduce parent_cfs_rq() and use it Jan H. Schönherr
2018-09-07 21:40 ` [RFC 16/60] sched: Preparatory code movement Jan H. Schönherr
2018-09-07 21:40 ` [RFC 17/60] sched: Introduce and use generic task group CFS traversal functions Jan H. Schönherr
2018-09-07 21:40 ` [RFC 18/60] sched: Fix return value of SCHED_WARN_ON() Jan H. Schönherr
2018-09-07 21:40 ` [RFC 19/60] sched: Add entity variants of enqueue_task_fair() and dequeue_task_fair() Jan H. Schönherr
2018-09-07 21:40 ` [RFC 20/60] sched: Let {en,de}queue_entity_fair() work with a varying amount of tasks Jan H. Schönherr
2018-09-07 21:40 ` [RFC 21/60] sched: Add entity variants of put_prev_task_fair() and set_curr_task_fair() Jan H. Schönherr
2018-09-07 21:40 ` [RFC 22/60] cosched: Add config option for coscheduling support Jan H. Schönherr
2018-09-07 21:40 ` [RFC 23/60] cosched: Add core data structures for coscheduling Jan H. Schönherr
2018-09-07 21:40 ` [RFC 24/60] cosched: Do minimal pre-SMP coscheduler initialization Jan H. Schönherr
2018-09-07 21:40 ` [RFC 25/60] cosched: Prepare scheduling domain topology for coscheduling Jan H. Schönherr
2018-09-07 21:40 ` [RFC 26/60] cosched: Construct runqueue hierarchy Jan H. Schönherr
2018-09-07 21:40 ` [RFC 27/60] cosched: Add some small helper functions for later use Jan H. Schönherr
2018-09-07 21:40 ` [RFC 28/60] cosched: Add is_sd_se() to distinguish SD-SEs from TG-SEs Jan H. Schönherr
2018-09-07 21:40 ` [RFC 29/60] cosched: Adjust code reflecting on the total number of CFS tasks on a CPU Jan H. Schönherr
2018-09-07 21:40 ` [RFC 30/60] cosched: Disallow share modification on task groups for now Jan H. Schönherr
2018-09-07 21:40 ` [RFC 31/60] cosched: Don't disable idle tick " Jan H. Schönherr
2018-09-07 21:40 ` [RFC 32/60] cosched: Specialize parent_cfs_rq() for hierarchical runqueues Jan H. Schönherr
2018-09-07 21:40 ` [RFC 33/60] cosched: Allow resched_curr() to be called " Jan H. Schönherr
2018-09-07 21:40 ` [RFC 34/60] cosched: Add rq_of() variants for different use cases Jan H. Schönherr
2018-09-07 21:40 ` [RFC 35/60] cosched: Adjust rq_lock() functions to work with hierarchical runqueues Jan H. Schönherr
2018-09-07 21:40 ` [RFC 36/60] cosched: Use hrq_of() for rq_clock() and rq_clock_task() Jan H. Schönherr
2018-09-07 21:40 ` [RFC 37/60] cosched: Use hrq_of() for (indirect calls to) ___update_load_sum() Jan H. Schönherr
2018-09-07 21:40 ` [RFC 38/60] cosched: Skip updates on non-CPU runqueues in cfs_rq_util_change() Jan H. Schönherr
2018-09-07 21:40 ` [RFC 39/60] cosched: Adjust task group management for hierarchical runqueues Jan H. Schönherr
2018-09-07 21:40 ` [RFC 40/60] cosched: Keep track of task group hierarchy within each SD-RQ Jan H. Schönherr
2018-09-07 21:40 ` [RFC 41/60] cosched: Introduce locking for leader activities Jan H. Schönherr
2018-09-07 21:40 ` [RFC 42/60] cosched: Introduce locking for (mostly) enqueuing and dequeuing Jan H. Schönherr
2018-09-07 21:40 ` [RFC 43/60] cosched: Add for_each_sched_entity() variant for owned entities Jan H. Schönherr
2018-09-07 21:40 ` [RFC 44/60] cosched: Perform various rq_of() adjustments in scheduler code Jan H. Schönherr
2018-09-07 21:40 ` [RFC 45/60] cosched: Continue to account all load on per-CPU runqueues Jan H. Schönherr
2018-09-07 21:40 ` [RFC 46/60] cosched: Warn on throttling attempts of non-CPU runqueues Jan H. Schönherr
2018-09-07 21:40 ` [RFC 47/60] cosched: Adjust SE traversal and locking for common leader activities Jan H. Schönherr
2018-09-07 21:40 ` [RFC 48/60] cosched: Adjust SE traversal and locking for yielding and buddies Jan H. Schönherr
2018-09-07 21:40 ` [RFC 49/60] cosched: Adjust locking for enqueuing and dequeueing Jan H. Schönherr
2018-09-07 21:40 ` [RFC 50/60] cosched: Propagate load changes across hierarchy levels Jan H. Schönherr
2018-09-07 21:40 ` [RFC 51/60] cosched: Hacky work-around to avoid observing zero weight SD-SE Jan H. Schönherr
2018-09-07 21:40 ` [RFC 52/60] cosched: Support SD-SEs in enqueuing and dequeuing Jan H. Schönherr
2018-09-07 21:40 ` [RFC 53/60] cosched: Prevent balancing related functions from crossing hierarchy levels Jan H. Schönherr
2018-09-07 21:40 ` [RFC 54/60] cosched: Support idling in a coscheduled set Jan H. Schönherr
2018-09-07 21:40 ` [RFC 55/60] cosched: Adjust task selection for coscheduling Jan H. Schönherr
2018-09-07 21:40 ` [RFC 56/60] cosched: Adjust wakeup preemption rules " Jan H. Schönherr
2018-09-07 21:40 ` [RFC 57/60] cosched: Add sysfs interface to configure coscheduling on cgroups Jan H. Schönherr
2018-09-07 21:40 ` [RFC 58/60] cosched: Switch runqueues between regular scheduling and coscheduling Jan H. Schönherr
2018-09-07 21:40 ` [RFC 59/60] cosched: Handle non-atomicity during switches to and from coscheduling Jan H. Schönherr
2018-09-07 21:40 ` [RFC 60/60] cosched: Add command line argument to enable coscheduling Jan H. Schönherr
2018-09-10  2:50   ` Randy Dunlap
2018-09-12  0:24 ` [RFC 00/60] Coscheduling for Linux Nishanth Aravamudan
2018-09-12 19:34   ` Jan H. Schönherr
2018-09-12 23:15     ` Nishanth Aravamudan
2018-09-13 11:31       ` Jan H. Schönherr
2018-09-13 18:16         ` Nishanth Aravamudan
2018-09-12 23:18     ` Jan H. Schönherr
2018-09-13  3:05       ` Nishanth Aravamudan
2018-09-13 19:19 ` [RFC 61/60] cosched: Accumulated fixes and improvements Jan H. Schönherr
2018-09-26 17:25   ` Nishanth Aravamudan
2018-09-26 21:05     ` Nishanth Aravamudan
2018-10-01  9:13       ` Jan H. Schönherr
2018-09-14 11:12 ` [RFC 00/60] Coscheduling for Linux Peter Zijlstra
2018-09-14 16:25   ` Jan H. Schönherr
2018-09-15  8:48     ` Task group cleanups and optimizations (was: Re: [RFC 00/60] Coscheduling for Linux) Jan H. Schönherr
2018-09-17  9:48       ` Peter Zijlstra
2018-09-18 13:22         ` Jan H. Schönherr
2018-09-18 13:38           ` Peter Zijlstra
2018-09-18 13:54             ` Jan H. Schönherr
2018-09-18 13:42           ` Peter Zijlstra
2018-09-18 14:35           ` Rik van Riel
2018-09-19  9:23             ` Jan H. Schönherr
2018-11-23 16:51           ` Frederic Weisbecker
2018-12-04 13:23             ` Jan H. Schönherr
2018-09-17 11:33     ` [RFC 00/60] Coscheduling for Linux Peter Zijlstra
2018-11-02 22:13       ` Nishanth Aravamudan
2018-09-17 12:25     ` Peter Zijlstra
2018-09-26  9:58       ` Jan H. Schönherr
2018-09-27 18:36         ` Subhra Mazumdar
2018-11-23 16:29           ` Frederic Weisbecker
2018-09-17 13:37     ` Peter Zijlstra
2018-09-26  9:35       ` Jan H. Schönherr
2018-09-18 14:40     ` Rik van Riel
2018-09-24 15:23       ` Jan H. Schönherr
2018-09-24 18:01         ` Rik van Riel
2018-09-18  0:33 ` Subhra Mazumdar
2018-09-18 11:44   ` Jan H. Schönherr
2018-09-19 21:53     ` Subhra Mazumdar
2018-09-24 15:43       ` Jan H. Schönherr
2018-09-27 18:12         ` Subhra Mazumdar
2018-10-04 13:29 ` Jon Masters
2018-10-17  2:09 ` Frederic Weisbecker
2018-10-19 11:40   ` Jan H. Schönherr
2018-10-19 14:52     ` Frederic Weisbecker
2018-10-19 15:16     ` Rik van Riel
2018-10-19 15:33       ` Frederic Weisbecker
2018-10-19 15:45         ` Rik van Riel
2018-10-19 19:07           ` Jan H. Schönherr [this message]
2018-10-19  0:26 ` Subhra Mazumdar
2018-10-26 23:44   ` Jan H. Schönherr
2018-10-29 22:52     ` Subhra Mazumdar
2018-10-26 23:05 ` Subhra Mazumdar
2018-10-27  0:07   ` Jan H. Schönherr

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=37fe7d2a-5ec6-067b-e446-6c5a0fb10153@amazon.de \
    --to=jschoenh@amazon.de \
    --cc=frederic@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=riel@surriel.com \
    --cc=subhra.mazumdar@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).