linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Peter Zijlstra <peterz@infradead.org>
To: Aaron Lu <aaron.lwe@gmail.com>
Cc: "Vineeth Remanan Pillai" <vpillai@digitalocean.com>,
	"Nishanth Aravamudan" <naravamudan@digitalocean.com>,
	"Julien Desfossez" <jdesfossez@digitalocean.com>,
	"Tim Chen" <tim.c.chen@linux.intel.com>,
	"Ingo Molnar" <mingo@kernel.org>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Paul Turner" <pjt@google.com>,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"Aaron Lu" <aaron.lu@linux.alibaba.com>,
	"Linux List Kernel Mailing" <linux-kernel@vger.kernel.org>,
	"Frédéric Weisbecker" <fweisbec@gmail.com>,
	"Kees Cook" <keescook@chromium.org>,
	"Greg Kerr" <kerrnel@google.com>, "Phil Auld" <pauld@redhat.com>,
	"Aubrey Li" <aubrey.intel@gmail.com>,
	"Li, Aubrey" <aubrey.li@linux.intel.com>,
	"Valentin Schneider" <valentin.schneider@arm.com>,
	"Mel Gorman" <mgorman@techsingularity.net>,
	"Pawan Gupta" <pawan.kumar.gupta@linux.intel.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Joel Fernandes" <joelaf@google.com>,
	"Joel Fernandes" <joel@joelfernandes.org>
Subject: Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison
Date: Wed, 6 May 2020 16:35:06 +0200	[thread overview]
Message-ID: <20200506143506.GH5298@hirez.programming.kicks-ass.net> (raw)
In-Reply-To: <20200424142443.GA263207@aaronlu-desktop>


Sorry for being verbose; I've been procrastinating replying, and in
doing so the things I wanted to say kept growing.

On Fri, Apr 24, 2020 at 10:24:43PM +0800, Aaron Lu wrote:

> To make this work, the root level sched entities' vruntime of the two
> threads must be directly comparable. So one of the hyperthread's root
> cfs_rq's min_vruntime is chosen as the core wide one and all root level
> sched entities' vruntime is normalized against it.

> +/*
> + * This is called in stop machine context so no need to take the rq lock.
> + *
> + * Core scheduling is going to be enabled and the root level sched entities
> + * of both siblings will use cfs_rq->min_vruntime as the common cfs_rq
> + * min_vruntime, so it's necessary to normalize vruntime of existing root
> + * level sched entities in sibling_cfs_rq.
> + *
> + * Update of sibling_cfs_rq's min_vruntime isn't necessary as we will be
> + * only using cfs_rq->min_vruntime during the entire run of core scheduling.
> + */
> +void sched_core_normalize_se_vruntime(int cpu)
> +{
> +	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
> +	int i;
> +
> +	for_each_cpu(i, cpu_smt_mask(cpu)) {
> +		struct sched_entity *se, *next;
> +		struct cfs_rq *sibling_cfs_rq;
> +		s64 delta;
> +
> +		if (i == cpu)
> +			continue;
> +
> +		sibling_cfs_rq = &cpu_rq(i)->cfs;
> +		if (!sibling_cfs_rq->nr_running)
> +			continue;
> +
> +		delta = cfs_rq->min_vruntime - sibling_cfs_rq->min_vruntime;
> +		rbtree_postorder_for_each_entry_safe(se, next,
> +				&sibling_cfs_rq->tasks_timeline.rb_root,
> +				run_node) {
> +			se->vruntime += delta;
> +		}
> +	}
> +}

Aside from this being way to complicated for what it does -- you
could've saved the min_vruntime for each rq and compared them with
subtraction -- it is also terminally broken afaict.

Consider any infeasible weight scenario. Take for instance two tasks,
each bound to their respective sibling, one with weight 1 and one with
weight 2. Then the lower weight task will run ahead of the higher weight
task without bound.

This utterly destroys the concept of a shared time base.

Remember; all this is about a proportionally fair scheduling, where each
tasks receives:

             w_i
  dt_i = ---------- dt                                     (1)
	 \Sum_j w_j

which we do by tracking a virtual time, s_i:

         1
  s_i = --- d[t]_i                                         (2)
        w_i

Where d[t] is a delta of discrete time, while dt is an infinitesimal.
The immediate corrolary is that the ideal schedule S, where (2) to use
an infnitesimal delta, is:

          1
  S = ---------- dt                                        (3)
      \Sum_i w_i

From which we can define the lag, or deviation from the ideal, as:

  lag(i) = S - s_i                                         (4)

And since the one and only purpose is to approximate S, we get that:

  \Sum_i w_i lag(i) := 0                                   (5)

If this were not so, we no longer converge to S, and we can no longer
claim our scheduler has any of the properties we derive from S. This is
exactly what you did above, you broke it!


Let's continue for a while though; to see if there is anything useful to
be learned. We can combine (1)-(3) or (4)-(5) and express S in s_i:

      \Sum_i w_i s_i
  S = --------------                                       (6)
        \Sum_i w_i

Which gives us a way to compute S, given our s_i. Now, if you've read
our code, you know that we do not in fact do this, the reason for this
is two-fold. Firstly, computing S in that way requires a 64bit division
for every time we'd use it (see 12), and secondly, this only describes
the steady-state, it doesn't handle dynamics.

Anyway, in (6):  s_i -> x + (s_i - x), to get:

          \Sum_i w_i (s_i - x)
  S - x = --------------------                             (7)
               \Sum_i w_i

Which shows that S and s_i transform alike (which makes perfect sense
given that S is basically the (weighted) average of s_i).

Then:

  x -> s_min := min{s_i}                                   (8)

to obtain:

              \Sum_i w_i (s_i - s_min)
  S = s_min + ------------------------                     (9)
                    \Sum_i w_i

Which already looks familiar, and is the basis for our current
approximation:

  S ~= s_min                                              (10)

Now, obviously, (10) is absolute crap :-), but it sorta works.

So the thing to remember is that the above is strictly UP. It is
possible to generalize to multiple runqueues -- however it gets really
yuck when you have to add affinity support, as illustrated by our very
first counter-example.

XXX segue into the load-balance issues related to this:

  - how a negative lag task on a 'heavy' runqueue should not
    remain a negative lag task when migrated to a 'light' runqueue.

  - how we can compute and use the combined S in load-balancing to
    better handle infeasible weight scenarios.

Luckily I think we can avoid needing a full multi-queue variant for
core-scheduling (or load-balancing). The crucial observation is that we
only actually need this comparison in the presence of forced-idle; only
then do we need to tell if the stalled rq has higher priority over the
other.

[XXX assumes SMT2; better consider the more general case, I suspect
it'll work out because our comparison is always between 2 rqs and the
answer is only interesting if one of them is forced-idle]

And (under assumption of SMT2) when there is forced-idle, there is only
a single queue, so everything works like normal.

Let, for our runqueue 'k':

  T_k = \Sum_i w_i s_i
  W_k = \Sum_i w_i      ; for all i of k                  (11)

Then we can write (6) like:

        T_k
  S_k = ---                                               (12)
        W_k

From which immediately follows that:

          T_k + T_l
  S_k+l = ---------                                       (13)
          W_k + W_l

On which we can define a combined lag:

  lag_k+l(i) := S_k+l - s_i                               (14)

And that gives us the tools to compare tasks across a combined runqueue.


Combined this gives the following:

 a) when a runqueue enters force-idle, sync it against it's sibling rq(s)
    using (7); this only requires storing single 'time'-stamps.

 b) when comparing tasks between 2 runqueues of which one is forced-idle,
    compare the combined lag, per (14).

Now, of course cgroups (I so hate them) make this more interesting in
that a) seems to suggest we need to iterate all cgroup on a CPU at such
boundaries, but I think we can avoid that. The force-idle is for the
whole CPU, all it's rqs. So we can mark it in the root and lazily
propagate downward on demand.




  reply	other threads:[~2020-05-06 14:35 UTC|newest]

Thread overview: 110+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-04 16:59 [RFC PATCH 00/13] Core scheduling v5 vpillai
2020-03-04 16:59 ` [RFC PATCH 01/13] sched: Wrap rq::lock access vpillai
2020-03-04 16:59 ` [RFC PATCH 02/13] sched: Introduce sched_class::pick_task() vpillai
2020-03-04 16:59 ` [RFC PATCH 03/13] sched: Core-wide rq->lock vpillai
2020-04-01 11:42   ` [PATCH] sched/arm64: store cpu topology before notify_cpu_starting Cheng Jian
2020-04-01 13:23     ` Valentin Schneider
2020-04-06  8:00       ` chengjian (D)
2020-04-09  9:59       ` Sudeep Holla
2020-04-09 10:32         ` Valentin Schneider
2020-04-09 11:08           ` Sudeep Holla
2020-04-09 17:54     ` Joel Fernandes
2020-04-10 13:49       ` chengjian (D)
2020-04-14 11:36   ` [RFC PATCH 03/13] sched: Core-wide rq->lock Peter Zijlstra
2020-04-14 21:35     ` Vineeth Remanan Pillai
2020-04-15 10:55       ` Peter Zijlstra
2020-04-14 14:32   ` Peter Zijlstra
2020-03-04 16:59 ` [RFC PATCH 04/13] sched/fair: Add a few assertions vpillai
2020-03-04 16:59 ` [RFC PATCH 05/13] sched: Basic tracking of matching tasks vpillai
2020-03-04 16:59 ` [RFC PATCH 06/13] sched: Update core scheduler queue when taking cpu online/offline vpillai
2020-03-04 16:59 ` [RFC PATCH 07/13] sched: Add core wide task selection and scheduling vpillai
2020-04-14 13:35   ` Peter Zijlstra
2020-04-16 23:32     ` Tim Chen
2020-04-17 10:57       ` Peter Zijlstra
2020-04-16  3:39   ` Chen Yu
2020-04-16 19:59     ` Vineeth Remanan Pillai
2020-04-17 11:18     ` Peter Zijlstra
2020-04-19 15:31       ` Chen Yu
2020-05-21 23:14   ` Joel Fernandes
2020-05-21 23:16     ` Joel Fernandes
2020-05-22  2:35     ` Joel Fernandes
2020-05-22  3:44       ` Aaron Lu
2020-05-22 20:13         ` Joel Fernandes
2020-03-04 16:59 ` [RFC PATCH 08/13] sched/fair: wrapper for cfs_rq->min_vruntime vpillai
2020-03-04 16:59 ` [RFC PATCH 09/13] sched/fair: core wide vruntime comparison vpillai
2020-04-14 13:56   ` Peter Zijlstra
2020-04-15  3:34     ` Aaron Lu
2020-04-15  4:07       ` Aaron Lu
2020-04-15 21:24         ` Vineeth Remanan Pillai
2020-04-17  9:40           ` Aaron Lu
2020-04-20  8:07             ` [PATCH updated] sched/fair: core wide cfs task priority comparison Aaron Lu
2020-04-20 22:26               ` Vineeth Remanan Pillai
2020-04-21  2:51                 ` Aaron Lu
2020-04-24 14:24                   ` [PATCH updated v2] " Aaron Lu
2020-05-06 14:35                     ` Peter Zijlstra [this message]
2020-05-08  8:44                       ` Aaron Lu
2020-05-08  9:09                         ` Peter Zijlstra
2020-05-08 12:34                           ` Aaron Lu
2020-05-14 13:02                             ` Peter Zijlstra
2020-05-14 22:51                               ` Vineeth Remanan Pillai
2020-05-15 10:38                                 ` Peter Zijlstra
2020-05-15 10:43                                   ` Peter Zijlstra
2020-05-15 14:24                                   ` Vineeth Remanan Pillai
2020-05-16  3:42                               ` Aaron Lu
2020-05-22  9:40                                 ` Aaron Lu
2020-06-08  1:41                               ` Ning, Hongyu
2020-03-04 17:00 ` [RFC PATCH 10/13] sched: Trivial forced-newidle balancer vpillai
2020-03-04 17:00 ` [RFC PATCH 11/13] sched: migration changes for core scheduling vpillai
2020-06-12 13:21   ` Joel Fernandes
2020-06-12 21:32     ` Vineeth Remanan Pillai
2020-06-13  2:25       ` Joel Fernandes
2020-06-13 18:59         ` Vineeth Remanan Pillai
2020-06-15  2:05           ` Li, Aubrey
2020-03-04 17:00 ` [RFC PATCH 12/13] sched: cgroup tagging interface " vpillai
2020-06-26 15:06   ` Vineeth Remanan Pillai
2020-03-04 17:00 ` [RFC PATCH 13/13] sched: Debug bits vpillai
2020-03-04 17:36 ` [RFC PATCH 00/13] Core scheduling v5 Tim Chen
2020-03-04 17:42   ` Vineeth Remanan Pillai
2020-04-14 14:21 ` Peter Zijlstra
2020-04-15 16:32   ` Joel Fernandes
2020-04-17 11:12     ` Peter Zijlstra
2020-04-17 12:35       ` Alexander Graf
2020-04-17 13:08         ` Peter Zijlstra
2020-04-18  2:25       ` Joel Fernandes
2020-05-09 14:35   ` Dario Faggioli
     [not found] ` <38805656-2e2f-222a-c083-692f4b113313@linux.intel.com>
2020-05-09  3:39   ` Ning, Hongyu
2020-05-14 20:51     ` FW: " Gruza, Agata
2020-05-10 23:46 ` [PATCH RFC] Add support for core-wide protection of IRQ and softirq Joel Fernandes (Google)
2020-05-11 13:49   ` Peter Zijlstra
2020-05-11 14:54     ` Joel Fernandes
2020-05-20 22:26 ` [PATCH RFC] sched: Add a per-thread core scheduling interface Joel Fernandes (Google)
2020-05-21  4:09   ` [PATCH RFC] sched: Add a per-thread core scheduling interface(Internet mail) benbjiang(蒋彪)
2020-05-21 13:49     ` Joel Fernandes
2020-05-21  8:51   ` [PATCH RFC] sched: Add a per-thread core scheduling interface Peter Zijlstra
2020-05-21 13:47     ` Joel Fernandes
2020-05-21 20:20       ` Vineeth Remanan Pillai
2020-05-22 12:59       ` Peter Zijlstra
2020-05-22 21:35         ` Joel Fernandes
2020-05-24 14:00           ` Phil Auld
2020-05-28 14:51             ` Joel Fernandes
2020-05-28 17:01             ` Peter Zijlstra
2020-05-28 18:17               ` Phil Auld
2020-05-28 18:34                 ` Phil Auld
2020-05-28 18:23               ` Joel Fernandes
2020-05-21 18:31   ` Linus Torvalds
2020-05-21 20:40     ` Joel Fernandes
2020-05-21 21:58       ` Jesse Barnes
2020-05-22 16:33         ` Linus Torvalds
2020-05-20 22:37 ` [PATCH RFC v2] Add support for core-wide protection of IRQ and softirq Joel Fernandes (Google)
2020-05-20 22:48 ` [PATCH RFC] sched: Use sched-RCU in core-scheduling balancing logic Joel Fernandes (Google)
2020-05-21 22:52   ` Paul E. McKenney
2020-05-22  1:26     ` Joel Fernandes
2020-06-25 20:12 ` [RFC PATCH 00/13] Core scheduling v5 Vineeth Remanan Pillai
2020-06-26  1:47   ` Joel Fernandes
2020-06-26 14:36     ` Vineeth Remanan Pillai
2020-06-26 15:10       ` Joel Fernandes
2020-06-26 15:12         ` Joel Fernandes
2020-06-27 16:21         ` Joel Fernandes
2020-06-30 14:11         ` Phil Auld
2020-06-29 12:33   ` Li, Aubrey
2020-06-29 19:41     ` Vineeth Remanan Pillai

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200506143506.GH5298@hirez.programming.kicks-ass.net \
    --to=peterz@infradead.org \
    --cc=aaron.lu@linux.alibaba.com \
    --cc=aaron.lwe@gmail.com \
    --cc=aubrey.intel@gmail.com \
    --cc=aubrey.li@linux.intel.com \
    --cc=fweisbec@gmail.com \
    --cc=jdesfossez@digitalocean.com \
    --cc=joel@joelfernandes.org \
    --cc=joelaf@google.com \
    --cc=keescook@chromium.org \
    --cc=kerrnel@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@techsingularity.net \
    --cc=mingo@kernel.org \
    --cc=naravamudan@digitalocean.com \
    --cc=pauld@redhat.com \
    --cc=pawan.kumar.gupta@linux.intel.com \
    --cc=pbonzini@redhat.com \
    --cc=pjt@google.com \
    --cc=tglx@linutronix.de \
    --cc=tim.c.chen@linux.intel.com \
    --cc=torvalds@linux-foundation.org \
    --cc=valentin.schneider@arm.com \
    --cc=vpillai@digitalocean.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).