[updated,v2] sched/fair: core wide cfs task priority comparison
diff mbox series

Message ID 20200424142443.GA263207@aaronlu-desktop
State New
Headers show
Series
  • [updated,v2] sched/fair: core wide cfs task priority comparison
Related show

Commit Message

Aaron Lu April 24, 2020, 2:24 p.m. UTC
On Tue, Apr 21, 2020 at 10:51:31AM +0800, Aaron Lu wrote:
> On Mon, Apr 20, 2020 at 06:26:34PM -0400, Vineeth Remanan Pillai wrote:
> > On Mon, Apr 20, 2020 at 4:08 AM Aaron Lu <aaron.lwe@gmail.com> wrote:
> > >
> > > On Fri, Apr 17, 2020 at 05:40:45PM +0800, Aaron Lu wrote:
> > 
> > > The adjust is only needed when core scheduling is enabled while I
> > > mistakenly called it on both enable and disable. And I come to think
> > > normalize is a better name than adjust.
> > >
> > I guess we would also need to update the min_vruntime of the sibling
> > to match the rq->core->min_vruntime on coresched disable. Otherwise
> > a new enqueue on root cfs of the sibling would inherit the very old
> > min_vruntime before coresched enable and thus would starve all the
> > already queued tasks until the newly enqueued se's vruntime catches up.
> 
> Yes this is a concern but AFAICS, there is no problem. Consider:
> - when there is no queued task across the disable boundary, the stale
>   min_vruntime doesn't matter as you said;
> - when there are queued tasks across the disable boundary, the newly
>   queued task will normalize its vruntime against the sibling_cfs_rq's
>   min_vruntime, if the min_vruntime is stale and problem would occur.
>   But my reading of the code made me think this min_vruntime should
>   have already been updated by update_curr() in enqueue_entity() before
>   being used by this newly enqueued task and update_curr() would bring
>   the stale min_vruntime to the smallest vruntime of the queued ones so
>   again, no problem should occur.

After discussion with Vineeth, I now tend to add the syncing of
sibling_cfs_rq min_vruntime on core disable because analysing all the
code is time consuming and though I didn't find any problems now, I
might miss something and future code change may also break the
expectations so adding it seems a safe thing to do, also, it didn't
bring any performance downgrade as it is a one time disable stuff.

Vineeth also pointed out a problem of misusing cfs_rq->min_vruntime for
!CONFIG_64BIT kernel in migrate_task_rq_fair(), this is also fixed.

(only compile tested for !CONFIG_64BIT kernel)

From cda051ed33e6b88f28b44147cc7c894994c9d991 Mon Sep 17 00:00:00 2001
From: Aaron Lu <ziqian.lzq@antfin.com>
Date: Mon, 20 Apr 2020 10:27:17 +0800
Subject: [PATCH] sched/fair: core wide cfs task priority comparison

This patch provides a vruntime based way to compare two cfs task's
priority, be it on the same cpu or different threads of the same core.

When the two tasks are on the same CPU, we just need to find a common
cfs_rq both sched_entities are on and then do the comparison.

When the two tasks are on differen threads of the same core, each thread
will choose the next task to run the usual way and then the root level
sched entities which the two tasks belong to will be used to decide
which task runs next core wide.

An illustration for the cross CPU case:

   cpu0         cpu1
 /   |  \     /   |  \
se1 se2 se3  se4 se5 se6
    /  \            /   \
  se21 se22       se61  se62
  (A)                    /
                       se621
                        (B)

Assume CPU0 and CPU1 are smt siblings and cpu0 has decided task A to
run next and cpu1 has decided task B to run next. To compare priority
of task A and B, we compare priority of se2 and se6. Whose vruntime is
smaller, who wins.

To make this work, the root level sched entities' vruntime of the two
threads must be directly comparable. So one of the hyperthread's root
cfs_rq's min_vruntime is chosen as the core wide one and all root level
sched entities' vruntime is normalized against it.

All sub cfs_rqs and sched entities are not interesting in cross cpu
priority comparison as they will only participate in the usual cpu local
schedule decision so no need to normalize their vruntimes.

Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
---
 kernel/sched/core.c  |  28 +++++----
 kernel/sched/fair.c  | 135 +++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |   4 ++
 3 files changed, 148 insertions(+), 19 deletions(-)

Comments

Peter Zijlstra May 6, 2020, 2:35 p.m. UTC | #1
Sorry for being verbose; I've been procrastinating replying, and in
doing so the things I wanted to say kept growing.

On Fri, Apr 24, 2020 at 10:24:43PM +0800, Aaron Lu wrote:

> To make this work, the root level sched entities' vruntime of the two
> threads must be directly comparable. So one of the hyperthread's root
> cfs_rq's min_vruntime is chosen as the core wide one and all root level
> sched entities' vruntime is normalized against it.

> +/*
> + * This is called in stop machine context so no need to take the rq lock.
> + *
> + * Core scheduling is going to be enabled and the root level sched entities
> + * of both siblings will use cfs_rq->min_vruntime as the common cfs_rq
> + * min_vruntime, so it's necessary to normalize vruntime of existing root
> + * level sched entities in sibling_cfs_rq.
> + *
> + * Update of sibling_cfs_rq's min_vruntime isn't necessary as we will be
> + * only using cfs_rq->min_vruntime during the entire run of core scheduling.
> + */
> +void sched_core_normalize_se_vruntime(int cpu)
> +{
> +	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
> +	int i;
> +
> +	for_each_cpu(i, cpu_smt_mask(cpu)) {
> +		struct sched_entity *se, *next;
> +		struct cfs_rq *sibling_cfs_rq;
> +		s64 delta;
> +
> +		if (i == cpu)
> +			continue;
> +
> +		sibling_cfs_rq = &cpu_rq(i)->cfs;
> +		if (!sibling_cfs_rq->nr_running)
> +			continue;
> +
> +		delta = cfs_rq->min_vruntime - sibling_cfs_rq->min_vruntime;
> +		rbtree_postorder_for_each_entry_safe(se, next,
> +				&sibling_cfs_rq->tasks_timeline.rb_root,
> +				run_node) {
> +			se->vruntime += delta;
> +		}
> +	}
> +}

Aside from this being way to complicated for what it does -- you
could've saved the min_vruntime for each rq and compared them with
subtraction -- it is also terminally broken afaict.

Consider any infeasible weight scenario. Take for instance two tasks,
each bound to their respective sibling, one with weight 1 and one with
weight 2. Then the lower weight task will run ahead of the higher weight
task without bound.

This utterly destroys the concept of a shared time base.

Remember; all this is about a proportionally fair scheduling, where each
tasks receives:

             w_i
  dt_i = ---------- dt                                     (1)
	 \Sum_j w_j

which we do by tracking a virtual time, s_i:

         1
  s_i = --- d[t]_i                                         (2)
        w_i

Where d[t] is a delta of discrete time, while dt is an infinitesimal.
The immediate corrolary is that the ideal schedule S, where (2) to use
an infnitesimal delta, is:

          1
  S = ---------- dt                                        (3)
      \Sum_i w_i

From which we can define the lag, or deviation from the ideal, as:

  lag(i) = S - s_i                                         (4)

And since the one and only purpose is to approximate S, we get that:

  \Sum_i w_i lag(i) := 0                                   (5)

If this were not so, we no longer converge to S, and we can no longer
claim our scheduler has any of the properties we derive from S. This is
exactly what you did above, you broke it!


Let's continue for a while though; to see if there is anything useful to
be learned. We can combine (1)-(3) or (4)-(5) and express S in s_i:

      \Sum_i w_i s_i
  S = --------------                                       (6)
        \Sum_i w_i

Which gives us a way to compute S, given our s_i. Now, if you've read
our code, you know that we do not in fact do this, the reason for this
is two-fold. Firstly, computing S in that way requires a 64bit division
for every time we'd use it (see 12), and secondly, this only describes
the steady-state, it doesn't handle dynamics.

Anyway, in (6):  s_i -> x + (s_i - x), to get:

          \Sum_i w_i (s_i - x)
  S - x = --------------------                             (7)
               \Sum_i w_i

Which shows that S and s_i transform alike (which makes perfect sense
given that S is basically the (weighted) average of s_i).

Then:

  x -> s_min := min{s_i}                                   (8)

to obtain:

              \Sum_i w_i (s_i - s_min)
  S = s_min + ------------------------                     (9)
                    \Sum_i w_i

Which already looks familiar, and is the basis for our current
approximation:

  S ~= s_min                                              (10)

Now, obviously, (10) is absolute crap :-), but it sorta works.

So the thing to remember is that the above is strictly UP. It is
possible to generalize to multiple runqueues -- however it gets really
yuck when you have to add affinity support, as illustrated by our very
first counter-example.

XXX segue into the load-balance issues related to this:

  - how a negative lag task on a 'heavy' runqueue should not
    remain a negative lag task when migrated to a 'light' runqueue.

  - how we can compute and use the combined S in load-balancing to
    better handle infeasible weight scenarios.

Luckily I think we can avoid needing a full multi-queue variant for
core-scheduling (or load-balancing). The crucial observation is that we
only actually need this comparison in the presence of forced-idle; only
then do we need to tell if the stalled rq has higher priority over the
other.

[XXX assumes SMT2; better consider the more general case, I suspect
it'll work out because our comparison is always between 2 rqs and the
answer is only interesting if one of them is forced-idle]

And (under assumption of SMT2) when there is forced-idle, there is only
a single queue, so everything works like normal.

Let, for our runqueue 'k':

  T_k = \Sum_i w_i s_i
  W_k = \Sum_i w_i      ; for all i of k                  (11)

Then we can write (6) like:

        T_k
  S_k = ---                                               (12)
        W_k

From which immediately follows that:

          T_k + T_l
  S_k+l = ---------                                       (13)
          W_k + W_l

On which we can define a combined lag:

  lag_k+l(i) := S_k+l - s_i                               (14)

And that gives us the tools to compare tasks across a combined runqueue.


Combined this gives the following:

 a) when a runqueue enters force-idle, sync it against it's sibling rq(s)
    using (7); this only requires storing single 'time'-stamps.

 b) when comparing tasks between 2 runqueues of which one is forced-idle,
    compare the combined lag, per (14).

Now, of course cgroups (I so hate them) make this more interesting in
that a) seems to suggest we need to iterate all cgroup on a CPU at such
boundaries, but I think we can avoid that. The force-idle is for the
whole CPU, all it's rqs. So we can mark it in the root and lazily
propagate downward on demand.
Aaron Lu May 8, 2020, 8:44 a.m. UTC | #2
On Wed, May 06, 2020 at 04:35:06PM +0200, Peter Zijlstra wrote:
> 
> Sorry for being verbose; I've been procrastinating replying, and in
> doing so the things I wanted to say kept growing.
> 
> On Fri, Apr 24, 2020 at 10:24:43PM +0800, Aaron Lu wrote:
> 
> > To make this work, the root level sched entities' vruntime of the two
> > threads must be directly comparable. So one of the hyperthread's root
> > cfs_rq's min_vruntime is chosen as the core wide one and all root level
> > sched entities' vruntime is normalized against it.
> 
> > +/*
> > + * This is called in stop machine context so no need to take the rq lock.
> > + *
> > + * Core scheduling is going to be enabled and the root level sched entities
> > + * of both siblings will use cfs_rq->min_vruntime as the common cfs_rq
> > + * min_vruntime, so it's necessary to normalize vruntime of existing root
> > + * level sched entities in sibling_cfs_rq.
> > + *
> > + * Update of sibling_cfs_rq's min_vruntime isn't necessary as we will be
> > + * only using cfs_rq->min_vruntime during the entire run of core scheduling.
> > + */
> > +void sched_core_normalize_se_vruntime(int cpu)
> > +{
> > +	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
> > +	int i;
> > +
> > +	for_each_cpu(i, cpu_smt_mask(cpu)) {
> > +		struct sched_entity *se, *next;
> > +		struct cfs_rq *sibling_cfs_rq;
> > +		s64 delta;
> > +
> > +		if (i == cpu)
> > +			continue;
> > +
> > +		sibling_cfs_rq = &cpu_rq(i)->cfs;
> > +		if (!sibling_cfs_rq->nr_running)
> > +			continue;
> > +
> > +		delta = cfs_rq->min_vruntime - sibling_cfs_rq->min_vruntime;
> > +		rbtree_postorder_for_each_entry_safe(se, next,
> > +				&sibling_cfs_rq->tasks_timeline.rb_root,
> > +				run_node) {
> > +			se->vruntime += delta;
> > +		}
> > +	}
> > +}
> 
> Aside from this being way to complicated for what it does -- you
> could've saved the min_vruntime for each rq and compared them with
> subtraction -- it is also terminally broken afaict.
>
> Consider any infeasible weight scenario. Take for instance two tasks,
> each bound to their respective sibling, one with weight 1 and one with
> weight 2. Then the lower weight task will run ahead of the higher weight
> task without bound.

I don't follow how this could happen. Even the lower weight task runs
first, after some time, the higher weight task will get its turn and
from then on, the higher weight task will get more chance to run(due to
its higher weight and thus, slower accumulation of vruntime).

We used to have the following patch as a standalone one in v4:
sched/fair : Wake up forced idle siblings if needed
https://lore.kernel.org/lkml/cover.1572437285.git.vpillai@digitalocean.com/T/#md22d25d0e2932d059013e9b56600d8a847b02a13
Which originates from:
https://lore.kernel.org/lkml/20190725143344.GD992@aaronlu/

And in this series, it seems to be merged in:
[RFC PATCH 07/13] sched: Add core wide task selection and scheduling
https://lore.kernel.org/lkml/e942da7fd881977923463f19648085c1bfaa37f8.1583332765.git.vpillai@digitalocean.com/

My local test shows that when two cgroup's share are both set to 1024
and each bound to one sibling of a core, start a cpu intensive task in
each cgroup, then the cpu intensive task will each consume 50% cpu. When
one cgroup's share set to 512, it will consume about 33% while the other
consumes 67%, as expected.

I think the current patch works fine when 2 differently tagged tasks are
competing CPU, but when there are 3 tasks or more, things can get less
fair.
Peter Zijlstra May 8, 2020, 9:09 a.m. UTC | #3
On Fri, May 08, 2020 at 04:44:19PM +0800, Aaron Lu wrote:
> On Wed, May 06, 2020 at 04:35:06PM +0200, Peter Zijlstra wrote:

> > Aside from this being way to complicated for what it does -- you
> > could've saved the min_vruntime for each rq and compared them with
> > subtraction -- it is also terminally broken afaict.
> >
> > Consider any infeasible weight scenario. Take for instance two tasks,
> > each bound to their respective sibling, one with weight 1 and one with
> > weight 2. Then the lower weight task will run ahead of the higher weight
> > task without bound.
> 
> I don't follow how this could happen. Even the lower weight task runs
> first, after some time, the higher weight task will get its turn and
> from then on, the higher weight task will get more chance to run(due to
> its higher weight and thus, slower accumulation of vruntime).

That seems to assume they're mutually exclusive. In that case, as I
argued, we only have a single runqueue and then yes it works. But if
they're not exclusive, and can run concurrently, it comes apart.
Aaron Lu May 8, 2020, 12:34 p.m. UTC | #4
On Fri, May 08, 2020 at 11:09:25AM +0200, Peter Zijlstra wrote:
> On Fri, May 08, 2020 at 04:44:19PM +0800, Aaron Lu wrote:
> > On Wed, May 06, 2020 at 04:35:06PM +0200, Peter Zijlstra wrote:
> 
> > > Aside from this being way to complicated for what it does -- you
> > > could've saved the min_vruntime for each rq and compared them with
> > > subtraction -- it is also terminally broken afaict.
> > >
> > > Consider any infeasible weight scenario. Take for instance two tasks,
> > > each bound to their respective sibling, one with weight 1 and one with
> > > weight 2. Then the lower weight task will run ahead of the higher weight
> > > task without bound.
> > 
> > I don't follow how this could happen. Even the lower weight task runs
> > first, after some time, the higher weight task will get its turn and
> > from then on, the higher weight task will get more chance to run(due to
> > its higher weight and thus, slower accumulation of vruntime).
> 
> That seems to assume they're mutually exclusive. In that case, as I
> argued, we only have a single runqueue and then yes it works. But if
> they're not exclusive, and can run concurrently, it comes apart.

Ah right, now I see what you mean. Sorry for misunderstanding.

And yes, that 'utterly destroys the concept of a shared time base' and
then bad things can happen:
1) two same tagged tasks(t1 and t2) running on two siblings, with t1's
   weight lower than t2's;
2) both tasks are cpu intensive;
3) over time, the lower weight task(t1)'s vruntime becomes bigger and
   bigger than t2's vruntime and the core wide min_vruntime is the
   same as t1's vruntime per this patch;
4) a new task enqueued on the same sibling as t1, if the new task has
   an incompatible tag, it will be starved by t2 because t2's vruntime
   is way smaller than the core wide min_vruntime.

With this said, I realized a workaround for the issue described above:
when the core went from 'compatible mode'(step 1-3) to 'incompatible
mode'(step 4), reset all root level sched entities' vruntime to be the
same as the core wide min_vruntime. After all, the core is transforming
from two runqueue mode to single runqueue mode... I think this can solve
the issue to some extent but I may miss other scenarios.

I'll also re-read your last email about the 'lag' idea.
Peter Zijlstra May 14, 2020, 1:02 p.m. UTC | #5
On Fri, May 08, 2020 at 08:34:57PM +0800, Aaron Lu wrote:
> With this said, I realized a workaround for the issue described above:
> when the core went from 'compatible mode'(step 1-3) to 'incompatible
> mode'(step 4), reset all root level sched entities' vruntime to be the
> same as the core wide min_vruntime. After all, the core is transforming
> from two runqueue mode to single runqueue mode... I think this can solve
> the issue to some extent but I may miss other scenarios.

A little something like so, this syncs min_vruntime when we switch to
single queue mode. This is very much SMT2 only, I got my head in twist
when thikning about more siblings, I'll have to try again later.

This very much retains the horrible approximation of S we always do.

Also, it is _completely_ untested...

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -102,7 +102,6 @@ static inline int __task_prio(struct tas
 /* real prio, less is less */
 static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 {
-
 	int pa = __task_prio(a), pb = __task_prio(b);
 
 	if (-pa < -pb)
@@ -114,19 +113,8 @@ static inline bool prio_less(struct task
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-		u64 vruntime = b->se.vruntime;
-
-		/*
-		 * Normalize the vruntime if tasks are in different cpus.
-		 */
-		if (task_cpu(a) != task_cpu(b)) {
-			vruntime -= task_cfs_rq(b)->min_vruntime;
-			vruntime += task_cfs_rq(a)->min_vruntime;
-		}
-
-		return !((s64)(a->se.vruntime - vruntime) <= 0);
-	}
+	if (pa == MAX_RT_PRIO + MAX_NICE)
+		return cfs_prio_less(a, b);
 
 	return false;
 }
@@ -4293,10 +4281,11 @@ static struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct task_struct *next, *max = NULL;
+	int old_active = 0, new_active = 0;
 	const struct sched_class *class;
 	const struct cpumask *smt_mask;
-	int i, j, cpu;
 	bool need_sync = false;
+	int i, j, cpu;
 
 	cpu = cpu_of(rq);
 	if (cpu_is_offline(cpu))
@@ -4349,10 +4338,14 @@ pick_next_task(struct rq *rq, struct tas
 		rq_i->core_pick = NULL;
 
 		if (rq_i->core_forceidle) {
+			// XXX is_idle_task(rq_i->curr) && rq_i->nr_running ??
 			need_sync = true;
 			rq_i->core_forceidle = false;
 		}
 
+		if (!is_idle_task(rq_i->curr))
+			old_active++;
+
 		if (i != cpu)
 			update_rq_clock(rq_i);
 	}
@@ -4463,8 +4456,12 @@ next_class:;
 
 		WARN_ON_ONCE(!rq_i->core_pick);
 
-		if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
-			rq_i->core_forceidle = true;
+		if (is_idle_task(rq_i->core_pick)) {
+			if (rq_i->nr_running)
+				rq_i->core_forceidle = true;
+		} else {
+			new_active++;
+		}
 
 		if (i == cpu)
 			continue;
@@ -4476,6 +4473,16 @@ next_class:;
 		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
 	}
 
+	/* XXX SMT2 only */
+	if (new_active == 1 && old_active > 1) {
+		/*
+		 * We just dropped into single-rq mode, increment the sequence
+		 * count to trigger the vruntime sync.
+		 */
+		rq->core->core_sync_seq++;
+	}
+	rq->core->core_active = new_active;
+
 done:
 	set_next_task(rq, next);
 	return next;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -386,6 +386,12 @@ is_same_group(struct sched_entity *se, s
 	return NULL;
 }
 
+static inline bool
+is_same_tg(struct sched_entity *se, struct sched_entity *pse)
+{
+	return se->cfs_rq->tg == pse->cfs_rq->tg;
+}
+
 static inline struct sched_entity *parent_entity(struct sched_entity *se)
 {
 	return se->parent;
@@ -394,8 +400,6 @@ static inline struct sched_entity *paren
 static void
 find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 {
-	int se_depth, pse_depth;
-
 	/*
 	 * preemption test can be made between sibling entities who are in the
 	 * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
@@ -403,23 +407,16 @@ find_matching_se(struct sched_entity **s
 	 * parent.
 	 */
 
-	/* First walk up until both entities are at same depth */
-	se_depth = (*se)->depth;
-	pse_depth = (*pse)->depth;
-
-	while (se_depth > pse_depth) {
-		se_depth--;
-		*se = parent_entity(*se);
-	}
-
-	while (pse_depth > se_depth) {
-		pse_depth--;
-		*pse = parent_entity(*pse);
-	}
+	/* XXX we now have 3 of these loops, C stinks */
 
 	while (!is_same_group(*se, *pse)) {
-		*se = parent_entity(*se);
-		*pse = parent_entity(*pse);
+		int se_depth = (*se)->depth;
+		int pse_depth = (*pse)->depth;
+
+		if (se_depth <= pse_depth)
+			*pse = parent_entity(*pse);
+		if (se_depth >= pse_depth)
+			*se = parent_entity(*se);
 	}
 }
 
@@ -455,6 +452,12 @@ static inline struct sched_entity *paren
 	return NULL;
 }
 
+static inline bool
+is_same_tg(struct sched_entity *se, struct sched_entity *pse)
+{
+	return true;
+}
+
 static inline void
 find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 {
@@ -462,6 +465,31 @@ find_matching_se(struct sched_entity **s
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+	struct sched_entity *se_a = &a->se, *se_b = &b->se;
+	struct cfs_rq *cfs_rq_a, *cfa_rq_b;
+	u64 vruntime_a, vruntime_b;
+
+	while (!is_same_tg(se_a, se_b)) {
+		int se_a_depth = se_a->depth;
+		int se_b_depth = se_b->depth;
+
+		if (se_a_depth <= se_b_depth)
+			se_b = parent_entity(se_b);
+		if (se_a_depth >= se_b_depth)
+			se_a = parent_entity(se_a);
+	}
+
+	cfs_rq_a = cfs_rq_of(se_a);
+	cfs_rq_b = cfs_rq_of(se_b);
+
+	vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
+	vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
+
+	return !((s64)(vruntime_a - vruntime_b) <= 0);
+}
+
 static __always_inline
 void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
 
@@ -6891,6 +6919,18 @@ static void check_preempt_wakeup(struct
 		set_last_buddy(se);
 }
 
+static void core_sync_entity(struct rq *rq, struct cfs_rq *cfs_rq)
+{
+	if (!sched_core_enabled())
+		return;
+
+	if (rq->core->core_sync_seq == cfs_rq->core_sync_seq)
+		return;
+
+	cfs_rq->core_sync_seq = rq->core->core_sync_seq;
+	cfs_rq->core_vruntime = cfs_rq->min_vruntime;
+}
+
 static struct task_struct *pick_task_fair(struct rq *rq)
 {
 	struct cfs_rq *cfs_rq = &rq->cfs;
@@ -6902,6 +6942,14 @@ static struct task_struct *pick_task_fai
 	do {
 		struct sched_entity *curr = cfs_rq->curr;
 
+		/*
+		 * Propagate the sync state down to whatever cfs_rq we need,
+		 * the active cfs_rq's will have been done by
+		 * set_next_task_fair(), the rest is inactive and will not have
+		 * changed due to the current running task.
+		 */
+		core_sync_entity(rq, cfs_rq);
+
 		se = pick_next_entity(cfs_rq, NULL);
 
 		if (curr) {
@@ -10825,7 +10873,8 @@ static void switched_to_fair(struct rq *
 	}
 }
 
-/* Account for a task changing its policy or group.
+/*
+ * Account for a task changing its policy or group.
  *
  * This routine is mostly called to set cfs_rq->curr field when a task
  * migrates between groups/classes.
@@ -10847,6 +10896,9 @@ static void set_next_task_fair(struct rq
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
+		/* snapshot vruntime before using it */
+		core_sync_entity(rq, cfs_rq);
+
 		set_next_entity(cfs_rq, se);
 		/* ensure bandwidth has been allocated on our new cfs_rq */
 		account_cfs_rq_runtime(cfs_rq, 0);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -503,6 +503,10 @@ struct cfs_rq {
 	unsigned int		h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
 	unsigned int		idle_h_nr_running; /* SCHED_IDLE */
 
+#ifdef CONFIG_SCHED_CORE
+	unsigned int		core_sync_seq;
+	u64			core_vruntime;
+#endif
 	u64			exec_clock;
 	u64			min_vruntime;
 #ifndef CONFIG_64BIT
@@ -1035,12 +1039,15 @@ struct rq {
 	unsigned int		core_enabled;
 	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
-	bool			core_forceidle;
+	unsigned int		core_forceidle;
 
 	/* shared state */
 	unsigned int		core_task_seq;
 	unsigned int		core_pick_seq;
 	unsigned long		core_cookie;
+	unsigned int		core_sync_seq;
+	unsigned int		core_active;
+
 #endif
 };
 
@@ -2592,6 +2599,8 @@ static inline bool sched_energy_enabled(
 
 #endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
 
+extern bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
+
 #ifdef CONFIG_MEMBARRIER
 /*
  * The scheduler provides memory barriers required by membarrier between:
Vineeth Remanan Pillai May 14, 2020, 10:51 p.m. UTC | #6
Hi Peter,

On Thu, May 14, 2020 at 9:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> A little something like so, this syncs min_vruntime when we switch to
> single queue mode. This is very much SMT2 only, I got my head in twist
> when thikning about more siblings, I'll have to try again later.
>
Thanks for the quick patch! :-)

For SMT-n, would it work if sync vruntime if atleast one sibling is
forced idle? Since force_idle is for all the rqs, I think it would
be correct to sync the vruntime if atleast one cpu is forced idle.

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> -               if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> -                       rq_i->core_forceidle = true;
> +               if (is_idle_task(rq_i->core_pick)) {
> +                       if (rq_i->nr_running)
> +                               rq_i->core_forceidle = true;
> +               } else {
> +                       new_active++;
I think we need to reset new_active on restarting the selection.

> +               }
>
>                 if (i == cpu)
>                         continue;
> @@ -4476,6 +4473,16 @@ next_class:;
>                 WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
>         }
>
> +       /* XXX SMT2 only */
> +       if (new_active == 1 && old_active > 1) {
As I mentioned above, would it be correct to check if atleast one sibling is
forced_idle? Something like:
if (cpumask_weight(cpu_smt_mask(cpu)) == old_active && new_active < old_active)

> +               /*
> +                * We just dropped into single-rq mode, increment the sequence
> +                * count to trigger the vruntime sync.
> +                */
> +               rq->core->core_sync_seq++;
> +       }
> +       rq->core->core_active = new_active;
core_active seems to be unused.

> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> +{
> +       struct sched_entity *se_a = &a->se, *se_b = &b->se;
> +       struct cfs_rq *cfs_rq_a, *cfa_rq_b;
> +       u64 vruntime_a, vruntime_b;
> +
> +       while (!is_same_tg(se_a, se_b)) {
> +               int se_a_depth = se_a->depth;
> +               int se_b_depth = se_b->depth;
> +
> +               if (se_a_depth <= se_b_depth)
> +                       se_b = parent_entity(se_b);
> +               if (se_a_depth >= se_b_depth)
> +                       se_a = parent_entity(se_a);
> +       }
> +
> +       cfs_rq_a = cfs_rq_of(se_a);
> +       cfs_rq_b = cfs_rq_of(se_b);
> +
> +       vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
> +       vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
Should we be using core_vruntime conditionally? should it be min_vruntime for
default comparisons and core_vruntime during force_idle?

Thanks,
Vineeth
Peter Zijlstra May 15, 2020, 10:38 a.m. UTC | #7
On Thu, May 14, 2020 at 06:51:27PM -0400, Vineeth Remanan Pillai wrote:
> On Thu, May 14, 2020 at 9:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > A little something like so, this syncs min_vruntime when we switch to
> > single queue mode. This is very much SMT2 only, I got my head in twist
> > when thikning about more siblings, I'll have to try again later.
> >
> Thanks for the quick patch! :-)
> 
> For SMT-n, would it work if sync vruntime if atleast one sibling is
> forced idle? Since force_idle is for all the rqs, I think it would
> be correct to sync the vruntime if atleast one cpu is forced idle.

It's complicated ;-)

So this sync is basically a relative reset of S to 0.

So with 2 queues, when one goes idle, we drop them both to 0 and one
then increases due to not being idle, and the idle one builds up lag to
get re-elected. So far so simple, right?

When there's 3, we can have the situation where 2 run and one is idle,
we sync to 0 and let the idle one build up lag to get re-election. Now
suppose another one also drops idle. At this point dropping all to 0
again would destroy the built-up lag from the queue that was already
idle, not good.

So instead of syncing everything, we can:

  less := !((s64)(s_a - s_b) <= 0)

  (v_a - S_a) - (v_b - S_b) == v_a - v_b - S_a + S_b
                            == v_a - (v_b - S_a + S_b)

IOW, we can recast the (lag) comparison to a one-sided difference.
So if then, instead of syncing the whole queue, sync the idle queue
against the active queue with S_a + S_b at the point where we sync.

(XXX consider the implication of living in a cyclic group: N / 2^n N)

This gives us means of syncing single queues against the active queue,
and for already idle queues to preseve their build-up lag.

Of course, then we get the situation where there's 2 active and one
going idle, who do we pick to sync against? Theory would have us sync
against the combined S, but as we've already demonstated, there is no
such thing in infeasible weight scenarios.

One thing I've considered; and this is where that core_active rudiment
came from, is having active queues sync up between themselves after
every tick. This limits the observed divergence due to the work
conservance.

On top of that, we can improve upon things by moving away from our
horrible (10) hack and moving to (9) and employing (13) here.

Anyway, I got partway through that in the past days, but then my head
hurt. I'll consider it some more :-)

> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > -               if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> > -                       rq_i->core_forceidle = true;
> > +               if (is_idle_task(rq_i->core_pick)) {
> > +                       if (rq_i->nr_running)
> > +                               rq_i->core_forceidle = true;
> > +               } else {
> > +                       new_active++;
> I think we need to reset new_active on restarting the selection.

But this loop is after selection has been done; we don't modify
new_active during selection.

> > +               /*
> > +                * We just dropped into single-rq mode, increment the sequence
> > +                * count to trigger the vruntime sync.
> > +                */
> > +               rq->core->core_sync_seq++;
> > +       }
> > +       rq->core->core_active = new_active;
> core_active seems to be unused.

Correct; that's rudiments from an SMT-n attempt.

> > +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> > +{
> > +       struct sched_entity *se_a = &a->se, *se_b = &b->se;
> > +       struct cfs_rq *cfs_rq_a, *cfa_rq_b;
> > +       u64 vruntime_a, vruntime_b;
> > +
> > +       while (!is_same_tg(se_a, se_b)) {
> > +               int se_a_depth = se_a->depth;
> > +               int se_b_depth = se_b->depth;
> > +
> > +               if (se_a_depth <= se_b_depth)
> > +                       se_b = parent_entity(se_b);
> > +               if (se_a_depth >= se_b_depth)
> > +                       se_a = parent_entity(se_a);
> > +       }
> > +
> > +       cfs_rq_a = cfs_rq_of(se_a);
> > +       cfs_rq_b = cfs_rq_of(se_b);
> > +
> > +       vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
> > +       vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
> Should we be using core_vruntime conditionally? should it be min_vruntime for
> default comparisons and core_vruntime during force_idle?

At the very least it should be min_vruntime when cfs_rq_a == cfs_rq_b,
ie. when we're on the same CPU.

For the other case I was considering that tick based active sync, but
never got that finished and admittedly it all looks a bit weird. But I
figured I'd send it out so we can at least advance the discussion.

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -469,7 +469,7 @@ bool cfs_prio_less(struct task_struct *a
 {
 	struct sched_entity *se_a = &a->se, *se_b = &b->se;
 	struct cfs_rq *cfs_rq_a, *cfa_rq_b;
-	u64 vruntime_a, vruntime_b;
+	u64 s_a, s_b, S_a, S_b;
 
 	while (!is_same_tg(se_a, se_b)) {
 		int se_a_depth = se_a->depth;
@@ -484,10 +484,16 @@ bool cfs_prio_less(struct task_struct *a
 	cfs_rq_a = cfs_rq_of(se_a);
 	cfs_rq_b = cfs_rq_of(se_b);
 
-	vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
-	vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
+	S_a = cfs_rq_a->core_vruntime;
+	S_b = cfs_rq_b->core_vruntime;
 
-	return !((s64)(vruntime_a - vruntime_b) <= 0);
+	if (cfs_rq_a == cfs_rq_b)
+		S_a = S_b = cfs_rq_a->min_vruntime;
+
+	s_a = se_a->vruntime - S_a;
+	s_b = se_b->vruntime - S_b;
+
+	return !((s64)(s_a - s_b) <= 0);
 }
 
 static __always_inline
Peter Zijlstra May 15, 2020, 10:43 a.m. UTC | #8
On Fri, May 15, 2020 at 12:38:44PM +0200, Peter Zijlstra wrote:
>   less := !((s64)(s_a - s_b) <= 0)
> 
>   (v_a - S_a) - (v_b - S_b) == v_a - v_b - S_a + S_b
>                             == v_a - (v_b - S_a + S_b)
> 

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -469,7 +469,7 @@ bool cfs_prio_less(struct task_struct *a
>  {
>  	struct sched_entity *se_a = &a->se, *se_b = &b->se;
>  	struct cfs_rq *cfs_rq_a, *cfa_rq_b;
> -	u64 vruntime_a, vruntime_b;
> +	u64 s_a, s_b, S_a, S_b;
>  
>  	while (!is_same_tg(se_a, se_b)) {
>  		int se_a_depth = se_a->depth;
> @@ -484,10 +484,16 @@ bool cfs_prio_less(struct task_struct *a
>  	cfs_rq_a = cfs_rq_of(se_a);
>  	cfs_rq_b = cfs_rq_of(se_b);
>  
> -	vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
> -	vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
> +	S_a = cfs_rq_a->core_vruntime;
> +	S_b = cfs_rq_b->core_vruntime;
>  
> -	return !((s64)(vruntime_a - vruntime_b) <= 0);
> +	if (cfs_rq_a == cfs_rq_b)
> +		S_a = S_b = cfs_rq_a->min_vruntime;
> +
> +	s_a = se_a->vruntime - S_a;
> +	s_b = se_b->vruntime - S_b;
> +
> +	return !((s64)(s_a - s_b) <= 0);
>  }

Clearly I'm not awake yet; 's/s_/l_/g', 's/v_/s_/g', IOW:

  l_a = s_a - S_a
Vineeth Remanan Pillai May 15, 2020, 2:24 p.m. UTC | #9
On Fri, May 15, 2020 at 6:39 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> It's complicated ;-)
>
> So this sync is basically a relative reset of S to 0.
>
> So with 2 queues, when one goes idle, we drop them both to 0 and one
> then increases due to not being idle, and the idle one builds up lag to
> get re-elected. So far so simple, right?
>
> When there's 3, we can have the situation where 2 run and one is idle,
> we sync to 0 and let the idle one build up lag to get re-election. Now
> suppose another one also drops idle. At this point dropping all to 0
> again would destroy the built-up lag from the queue that was already
> idle, not good.
>
Thanks for the clarification :-).

I was suggesting an idea of corewide force_idle. We sync the core_vruntime
on first force_idle of a sibling in the core and start using core_vruntime
for priority comparison from then on. That way, we don't reset the lag on
every force_idle and the lag builds up from the first sibling that was
forced_idle. I think this would work with infeasible weights as well,
but needs to think more to see if it would break. A sample check to enter
this core wide force_idle state is:
(cpumask_weight(cpu_smt_mask(cpu)) == old_active && new_active < old_active)

And we exit the core wide force_idle state when the last sibling goes out
of force_idle and can start using min_vruntime for priority comparison
from then on.

When there is a cookie match on all siblings, we don't do priority comparison
now. But I think we need to do priority comparison for cookie matches
also, so that we update 'max' in the loop. And for this comparison during
a no forced_idle scenario, I hope it should be fine to use the min_vruntime.
Updating 'max' in the loop when cookie matches is not really needed for SMT2,
but would be needed for SMTn.

This is just a wild idea on top of your patches. Might not be accurate
in all cases and need to think more about the corner cases. I thought I
would think it loud here :-)

> So instead of syncing everything, we can:
>
>   less := !((s64)(s_a - s_b) <= 0)
>
>   (v_a - S_a) - (v_b - S_b) == v_a - v_b - S_a + S_b
>                             == v_a - (v_b - S_a + S_b)
>
> IOW, we can recast the (lag) comparison to a one-sided difference.
> So if then, instead of syncing the whole queue, sync the idle queue
> against the active queue with S_a + S_b at the point where we sync.
>
> (XXX consider the implication of living in a cyclic group: N / 2^n N)
>
> This gives us means of syncing single queues against the active queue,
> and for already idle queues to preseve their build-up lag.
>
> Of course, then we get the situation where there's 2 active and one
> going idle, who do we pick to sync against? Theory would have us sync
> against the combined S, but as we've already demonstated, there is no
> such thing in infeasible weight scenarios.
>
> One thing I've considered; and this is where that core_active rudiment
> came from, is having active queues sync up between themselves after
> every tick. This limits the observed divergence due to the work
> conservance.
>
> On top of that, we can improve upon things by moving away from our
> horrible (10) hack and moving to (9) and employing (13) here.
>
> Anyway, I got partway through that in the past days, but then my head
> hurt. I'll consider it some more :-)
This sounds much better and a more accurate approach then the one I
mentioned above. Please share the code when you have it in some form :-)

>
> > > +                       new_active++;
> > I think we need to reset new_active on restarting the selection.
>
> But this loop is after selection has been done; we don't modify
> new_active during selection.
My bad, sorry about this false alarm!

> > > +
> > > +       vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
> > > +       vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
> > Should we be using core_vruntime conditionally? should it be min_vruntime for
> > default comparisons and core_vruntime during force_idle?
>
> At the very least it should be min_vruntime when cfs_rq_a == cfs_rq_b,
> ie. when we're on the same CPU.
>
yes, this makes sense.

The issue that I was thinking about is, when there is no force_idle and
all siblings run compatible tasks for a while, min_vruntime progresses,
but core_vruntime lags behind. And when a new task gets enqueued, it gets
the min_vruntime. But now during comparison it might be treated unfairly.

Consider a small example of two rqs rq1 and rq2.
rq1->cfs->min_vruntime = 1000
rq2->cfs->min_vruntime = 2000

During a force_idle, core_vruntime gets synced and

rq1->cfs->core_vruntime = 1000
rq2->cfs->core_vruntime = 2000

Now, suppose the core is out of force_idle and runs two compatible tasks
for a while, where the task on rq1 has more weight. min_vruntime progresses
on both, but slowly on rq1. Say the progress looks like:
rq1->cfs->min_vruntime = 1200, se1->vruntime = 1200
rq2->cfs->min_vruntime = 2500, se2->vruntime = 2500

If a new incompatible task(se3) gets enqueued to rq2, it's vruntime would
be based on rq2's min_vruntime, say:
se3->vruntime = 2500

During our priority comparison, lag would be:
l_se1 = 200
l_se3 = 500

So se1, will get selected and run with se2 until its lag catches up with
se3's lag(even if se3 has more weight than se1).

This is a hypothetical situation, but can happen I think. And if we use
min_vruntime for comparison during no force_idle scenario, we could
avoid this. What do you think?

I didn't clearly understand the tick based active sync and probably would
better fix this problem I guess.

Thanks,
Vineeth
Aaron Lu May 16, 2020, 3:42 a.m. UTC | #10
On Thu, May 14, 2020 at 03:02:48PM +0200, Peter Zijlstra wrote:
> On Fri, May 08, 2020 at 08:34:57PM +0800, Aaron Lu wrote:
> > With this said, I realized a workaround for the issue described above:
> > when the core went from 'compatible mode'(step 1-3) to 'incompatible
> > mode'(step 4), reset all root level sched entities' vruntime to be the
> > same as the core wide min_vruntime. After all, the core is transforming
> > from two runqueue mode to single runqueue mode... I think this can solve
> > the issue to some extent but I may miss other scenarios.
> 
> A little something like so, this syncs min_vruntime when we switch to
> single queue mode. This is very much SMT2 only, I got my head in twist
> when thikning about more siblings, I'll have to try again later.

Thanks a lot for the patch, I now see that "there is no need to adjust
every se's vruntime". :-)

> This very much retains the horrible approximation of S we always do.
> 
> Also, it is _completely_ untested...

I've been testing it.

One problem below.

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4293,10 +4281,11 @@ static struct task_struct *
>  pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
>  	struct task_struct *next, *max = NULL;
> +	int old_active = 0, new_active = 0;
>  	const struct sched_class *class;
>  	const struct cpumask *smt_mask;
> -	int i, j, cpu;
>  	bool need_sync = false;
> +	int i, j, cpu;
>  
>  	cpu = cpu_of(rq);
>  	if (cpu_is_offline(cpu))
> @@ -4349,10 +4338,14 @@ pick_next_task(struct rq *rq, struct tas
>  		rq_i->core_pick = NULL;
>  
>  		if (rq_i->core_forceidle) {
> +			// XXX is_idle_task(rq_i->curr) && rq_i->nr_running ??
>  			need_sync = true;
>  			rq_i->core_forceidle = false;
>  		}
>  
> +		if (!is_idle_task(rq_i->curr))
> +			old_active++;
> +
>  		if (i != cpu)
>  			update_rq_clock(rq_i);
>  	}
> @@ -4463,8 +4456,12 @@ next_class:;
>  
>  		WARN_ON_ONCE(!rq_i->core_pick);
>  
> -		if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> -			rq_i->core_forceidle = true;
> +		if (is_idle_task(rq_i->core_pick)) {
> +			if (rq_i->nr_running)
> +				rq_i->core_forceidle = true;
> +		} else {
> +			new_active++;
> +		}
>  
>  		if (i == cpu)
>  			continue;
> @@ -4476,6 +4473,16 @@ next_class:;
>  		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
>  	}
>  
> +	/* XXX SMT2 only */
> +	if (new_active == 1 && old_active > 1) {

There is a case when incompatible task appears but we failed to 'drop
into single-rq mode' per the above condition check. The TLDR is: when
there is a task that sits on the sibling rq with the same cookie as
'max', new_active will be 2 instead of 1 and that would cause us missing
the chance to do a sync of core min_vruntime.

This is how it happens:
1) 2 tasks of the same cgroup with different weight running on 2 siblings,
   say cg0_A with weight 1024 bound at cpu0 and cg0_B with weight 2 bound
   at cpu1(assume cpu0 and cpu1 are siblings);
2) Since new_active == 2, we didn't trigger min_vruntime sync. For
   simplicity, let's assume both siblings' root cfs_rq's min_vruntime and
   core_vruntime are all at 0 now;
3) let the two tasks run a while;
4) a new task cg1_C of another cgroup gets queued on cpu1. Since cpu1's
   existing task has a very small weight, its cfs_rq's min_vruntime can
   be much larger than cpu0's cfs_rq min_vruntime. So cg1_C's vruntime is
   much larger than cg0_A's and the 'max' of the core wide task
   selection goes to cg0_A;
5) Now I suppose we should drop into single-rq mode and by doing a sync
   of core min_vruntime, cg1_C's turn shall come. But the problem is, our
   current selection logic prefer not to waste CPU time so after decides
   cg0_A as the 'max', the sibling will also do a cookie_pick() and
   get cg0_B to run. This is where problem asises: new_active is 2
   instead of the expected 1.
6) Due to we didn't do the sync of core min_vruntime, the newly queued
   cg1_C shall wait a long time before cg0_A's vruntime catches up.

One naive way of precisely determine when to drop into single-rq mode is
to track how many tasks of a particular tag exists and use that to
decide if the core is in compatible mode(all tasks belong to the same
cgroup, IOW, have the same core_cookie) or not and act accordingly,
except that: does this sound too complex and inefficient?...

> +		/*
> +		 * We just dropped into single-rq mode, increment the sequence
> +		 * count to trigger the vruntime sync.
> +		 */
> +		rq->core->core_sync_seq++;
> +	}
> +	rq->core->core_active = new_active;
> +
>  done:
>  	set_next_task(rq, next);
>  	return next;
Aaron Lu May 22, 2020, 9:40 a.m. UTC | #11
On Sat, May 16, 2020 at 11:42:30AM +0800, Aaron Lu wrote:
> On Thu, May 14, 2020 at 03:02:48PM +0200, Peter Zijlstra wrote:
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4476,6 +4473,16 @@ next_class:;
> >  		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
> >  	}
> >  
> > +	/* XXX SMT2 only */
> > +	if (new_active == 1 && old_active > 1) {
> 
> There is a case when incompatible task appears but we failed to 'drop
> into single-rq mode' per the above condition check. The TLDR is: when
> there is a task that sits on the sibling rq with the same cookie as
> 'max', new_active will be 2 instead of 1 and that would cause us missing
> the chance to do a sync of core min_vruntime.

FWIW: when I disable the feature of running cookie_pick task on sibling
and thus enforce a strict single-rq mode, Peter's patch works well for
the scenario described below.

> This is how it happens:
> 1) 2 tasks of the same cgroup with different weight running on 2 siblings,
>    say cg0_A with weight 1024 bound at cpu0 and cg0_B with weight 2 bound
>    at cpu1(assume cpu0 and cpu1 are siblings);
> 2) Since new_active == 2, we didn't trigger min_vruntime sync. For
>    simplicity, let's assume both siblings' root cfs_rq's min_vruntime and
>    core_vruntime are all at 0 now;
> 3) let the two tasks run a while;
> 4) a new task cg1_C of another cgroup gets queued on cpu1. Since cpu1's
>    existing task has a very small weight, its cfs_rq's min_vruntime can
>    be much larger than cpu0's cfs_rq min_vruntime. So cg1_C's vruntime is
>    much larger than cg0_A's and the 'max' of the core wide task
>    selection goes to cg0_A;
> 5) Now I suppose we should drop into single-rq mode and by doing a sync
>    of core min_vruntime, cg1_C's turn shall come. But the problem is, our
>    current selection logic prefer not to waste CPU time so after decides
>    cg0_A as the 'max', the sibling will also do a cookie_pick() and
>    get cg0_B to run. This is where problem asises: new_active is 2
>    instead of the expected 1.
> 6) Due to we didn't do the sync of core min_vruntime, the newly queued
>    cg1_C shall wait a long time before cg0_A's vruntime catches up.

P.S. this is what I did to enforce a strict single-rq mode:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1fa5b48b742a..0f5580bc7e96 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4411,7 +4411,7 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
 	    (!max || prio_less(max, class_pick)))
 		return class_pick;
 
-	return cookie_pick;
+	return NULL;
 }
 
 static struct task_struct *
Ning, Hongyu June 8, 2020, 1:41 a.m. UTC | #12
On 2020/5/14 21:02, Peter Zijlstra wrote:
> On Fri, May 08, 2020 at 08:34:57PM +0800, Aaron Lu wrote:
>> With this said, I realized a workaround for the issue described above:
>> when the core went from 'compatible mode'(step 1-3) to 'incompatible
>> mode'(step 4), reset all root level sched entities' vruntime to be the
>> same as the core wide min_vruntime. After all, the core is transforming
>> from two runqueue mode to single runqueue mode... I think this can solve
>> the issue to some extent but I may miss other scenarios.
> 
> A little something like so, this syncs min_vruntime when we switch to
> single queue mode. This is very much SMT2 only, I got my head in twist
> when thikning about more siblings, I'll have to try again later.
> 
> This very much retains the horrible approximation of S we always do.
> 
> Also, it is _completely_ untested...
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -102,7 +102,6 @@ static inline int __task_prio(struct tas
>  /* real prio, less is less */
>  static inline bool prio_less(struct task_struct *a, struct task_struct *b)
>  {
> -
>  	int pa = __task_prio(a), pb = __task_prio(b);
>  
>  	if (-pa < -pb)
> @@ -114,19 +113,8 @@ static inline bool prio_less(struct task
>  	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
>  		return !dl_time_before(a->dl.deadline, b->dl.deadline);
>  
> -	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
> -		u64 vruntime = b->se.vruntime;
> -
> -		/*
> -		 * Normalize the vruntime if tasks are in different cpus.
> -		 */
> -		if (task_cpu(a) != task_cpu(b)) {
> -			vruntime -= task_cfs_rq(b)->min_vruntime;
> -			vruntime += task_cfs_rq(a)->min_vruntime;
> -		}
> -
> -		return !((s64)(a->se.vruntime - vruntime) <= 0);
> -	}
> +	if (pa == MAX_RT_PRIO + MAX_NICE)
> +		return cfs_prio_less(a, b);
>  
>  	return false;
>  }
> @@ -4293,10 +4281,11 @@ static struct task_struct *
>  pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
>  	struct task_struct *next, *max = NULL;
> +	int old_active = 0, new_active = 0;
>  	const struct sched_class *class;
>  	const struct cpumask *smt_mask;
> -	int i, j, cpu;
>  	bool need_sync = false;
> +	int i, j, cpu;
>  
>  	cpu = cpu_of(rq);
>  	if (cpu_is_offline(cpu))
> @@ -4349,10 +4338,14 @@ pick_next_task(struct rq *rq, struct tas
>  		rq_i->core_pick = NULL;
>  
>  		if (rq_i->core_forceidle) {
> +			// XXX is_idle_task(rq_i->curr) && rq_i->nr_running ??
>  			need_sync = true;
>  			rq_i->core_forceidle = false;
>  		}
>  
> +		if (!is_idle_task(rq_i->curr))
> +			old_active++;
> +
>  		if (i != cpu)
>  			update_rq_clock(rq_i);
>  	}
> @@ -4463,8 +4456,12 @@ next_class:;
>  
>  		WARN_ON_ONCE(!rq_i->core_pick);
>  
> -		if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> -			rq_i->core_forceidle = true;
> +		if (is_idle_task(rq_i->core_pick)) {
> +			if (rq_i->nr_running)
> +				rq_i->core_forceidle = true;
> +		} else {
> +			new_active++;
> +		}
>  
>  		if (i == cpu)
>  			continue;
> @@ -4476,6 +4473,16 @@ next_class:;
>  		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
>  	}
>  
> +	/* XXX SMT2 only */
> +	if (new_active == 1 && old_active > 1) {
> +		/*
> +		 * We just dropped into single-rq mode, increment the sequence
> +		 * count to trigger the vruntime sync.
> +		 */
> +		rq->core->core_sync_seq++;
> +	}
> +	rq->core->core_active = new_active;
> +
>  done:
>  	set_next_task(rq, next);
>  	return next;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -386,6 +386,12 @@ is_same_group(struct sched_entity *se, s
>  	return NULL;
>  }
>  
> +static inline bool
> +is_same_tg(struct sched_entity *se, struct sched_entity *pse)
> +{
> +	return se->cfs_rq->tg == pse->cfs_rq->tg;
> +}
> +
>  static inline struct sched_entity *parent_entity(struct sched_entity *se)
>  {
>  	return se->parent;
> @@ -394,8 +400,6 @@ static inline struct sched_entity *paren
>  static void
>  find_matching_se(struct sched_entity **se, struct sched_entity **pse)
>  {
> -	int se_depth, pse_depth;
> -
>  	/*
>  	 * preemption test can be made between sibling entities who are in the
>  	 * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
> @@ -403,23 +407,16 @@ find_matching_se(struct sched_entity **s
>  	 * parent.
>  	 */
>  
> -	/* First walk up until both entities are at same depth */
> -	se_depth = (*se)->depth;
> -	pse_depth = (*pse)->depth;
> -
> -	while (se_depth > pse_depth) {
> -		se_depth--;
> -		*se = parent_entity(*se);
> -	}
> -
> -	while (pse_depth > se_depth) {
> -		pse_depth--;
> -		*pse = parent_entity(*pse);
> -	}
> +	/* XXX we now have 3 of these loops, C stinks */
>  
>  	while (!is_same_group(*se, *pse)) {
> -		*se = parent_entity(*se);
> -		*pse = parent_entity(*pse);
> +		int se_depth = (*se)->depth;
> +		int pse_depth = (*pse)->depth;
> +
> +		if (se_depth <= pse_depth)
> +			*pse = parent_entity(*pse);
> +		if (se_depth >= pse_depth)
> +			*se = parent_entity(*se);
>  	}
>  }
>  
> @@ -455,6 +452,12 @@ static inline struct sched_entity *paren
>  	return NULL;
>  }
>  
> +static inline bool
> +is_same_tg(struct sched_entity *se, struct sched_entity *pse)
> +{
> +	return true;
> +}
> +
>  static inline void
>  find_matching_se(struct sched_entity **se, struct sched_entity **pse)
>  {
> @@ -462,6 +465,31 @@ find_matching_se(struct sched_entity **s
>  
>  #endif	/* CONFIG_FAIR_GROUP_SCHED */
>  
> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> +{
> +	struct sched_entity *se_a = &a->se, *se_b = &b->se;
> +	struct cfs_rq *cfs_rq_a, *cfa_rq_b;
> +	u64 vruntime_a, vruntime_b;
> +
> +	while (!is_same_tg(se_a, se_b)) {
> +		int se_a_depth = se_a->depth;
> +		int se_b_depth = se_b->depth;
> +
> +		if (se_a_depth <= se_b_depth)
> +			se_b = parent_entity(se_b);
> +		if (se_a_depth >= se_b_depth)
> +			se_a = parent_entity(se_a);
> +	}
> +
> +	cfs_rq_a = cfs_rq_of(se_a);
> +	cfs_rq_b = cfs_rq_of(se_b);
> +
> +	vruntime_a = se_a->vruntime - cfs_rq_a->core_vruntime;
> +	vruntime_b = se_b->vruntime - cfs_rq_b->core_vruntime;
> +
> +	return !((s64)(vruntime_a - vruntime_b) <= 0);
> +}
> +
>  static __always_inline
>  void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
>  
> @@ -6891,6 +6919,18 @@ static void check_preempt_wakeup(struct
>  		set_last_buddy(se);
>  }
>  
> +static void core_sync_entity(struct rq *rq, struct cfs_rq *cfs_rq)
> +{
> +	if (!sched_core_enabled())
> +		return;
> +
> +	if (rq->core->core_sync_seq == cfs_rq->core_sync_seq)
> +		return;
> +
> +	cfs_rq->core_sync_seq = rq->core->core_sync_seq;
> +	cfs_rq->core_vruntime = cfs_rq->min_vruntime;
> +}
> +
>  static struct task_struct *pick_task_fair(struct rq *rq)
>  {
>  	struct cfs_rq *cfs_rq = &rq->cfs;
> @@ -6902,6 +6942,14 @@ static struct task_struct *pick_task_fai
>  	do {
>  		struct sched_entity *curr = cfs_rq->curr;
>  
> +		/*
> +		 * Propagate the sync state down to whatever cfs_rq we need,
> +		 * the active cfs_rq's will have been done by
> +		 * set_next_task_fair(), the rest is inactive and will not have
> +		 * changed due to the current running task.
> +		 */
> +		core_sync_entity(rq, cfs_rq);
> +
>  		se = pick_next_entity(cfs_rq, NULL);
>  
>  		if (curr) {
> @@ -10825,7 +10873,8 @@ static void switched_to_fair(struct rq *
>  	}
>  }
>  
> -/* Account for a task changing its policy or group.
> +/*
> + * Account for a task changing its policy or group.
>   *
>   * This routine is mostly called to set cfs_rq->curr field when a task
>   * migrates between groups/classes.
> @@ -10847,6 +10896,9 @@ static void set_next_task_fair(struct rq
>  	for_each_sched_entity(se) {
>  		struct cfs_rq *cfs_rq = cfs_rq_of(se);
>  
> +		/* snapshot vruntime before using it */
> +		core_sync_entity(rq, cfs_rq);
> +
>  		set_next_entity(cfs_rq, se);
>  		/* ensure bandwidth has been allocated on our new cfs_rq */
>  		account_cfs_rq_runtime(cfs_rq, 0);
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -503,6 +503,10 @@ struct cfs_rq {
>  	unsigned int		h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
>  	unsigned int		idle_h_nr_running; /* SCHED_IDLE */
>  
> +#ifdef CONFIG_SCHED_CORE
> +	unsigned int		core_sync_seq;
> +	u64			core_vruntime;
> +#endif
>  	u64			exec_clock;
>  	u64			min_vruntime;
>  #ifndef CONFIG_64BIT
> @@ -1035,12 +1039,15 @@ struct rq {
>  	unsigned int		core_enabled;
>  	unsigned int		core_sched_seq;
>  	struct rb_root		core_tree;
> -	bool			core_forceidle;
> +	unsigned int		core_forceidle;
>  
>  	/* shared state */
>  	unsigned int		core_task_seq;
>  	unsigned int		core_pick_seq;
>  	unsigned long		core_cookie;
> +	unsigned int		core_sync_seq;
> +	unsigned int		core_active;
> +
>  #endif
>  };
>  
> @@ -2592,6 +2599,8 @@ static inline bool sched_energy_enabled(
>  
>  #endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
>  
> +extern bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
> +
>  #ifdef CONFIG_MEMBARRIER
>  /*
>   * The scheduler provides memory barriers required by membarrier between:
> 

here is a quick test update based on Peter's fairness patch above:

- Kernel under test: 
A: Core scheduling v5 community base + Peter's fairness patch (by reverting Aaron's fairness patch)
https://github.com/digitalocean/linux-coresched/tree/coresched/v5-v5.5.y + Peter's patch above
B: Core scheduling v5 community base (with Aaron's fairness patchset)
https://github.com/digitalocean/linux-coresched/tree/coresched/v5-v5.5.y (with Aaron's fairness patch)

- Test results briefing:
OVERALL PERFORMANCE ARE THE SAME FOR FOLLOWING 3 TEST SETS, BETWEEN 2 KERNEL TEST BUILDS

- Test set based on sysbench 1.1.0-bd4b418:
1: sysbench cpu in cgroup cpu 0 + sysbench cpu in cgroup cpu 1 (192 workload tasks for each cgroup)
2: sysbench mysql in cgroup mysql 0 + sysbench mysql in cgroup mysql 1 (192 workload tasks for each cgroup)
3: sysbench cpu in cgroup cpu 0 + sysbench mysql in cgroup mysql 0 (192 workload tasks for each cgroup)

- Test environment:
Intel Xeon Server platform
CPU(s):              192
On-line CPU(s) list: 0-191
Thread(s) per core:  2
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        4

- Test results:

Note: 
1: test results in following tables are Tput normalized to default baseline
2: test setting in following tables:
2.1: default -> core scheduling disabled
2.2: coresched -> core scheduling enabled
3. default test results are reused between 2 kernel test builds


Test set 1:
+----------------------------------+-------+-----------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| setting                          | ***   | default   | default     | coresched   | coresched     | **   | default     | default     | coresched     | coresched     |
+==================================+=======+===========+=============+=============+===============+======+=============+=============+===============+===============+
| cgroups                          | ***   | cg cpu 0  | cg cpu 0    | cg cpu 0    | cg cpu 0      | **   | cg cpu 1    | cg cpu 1    | cg cpu 1      | cg cpu 1      |
+----------------------------------+-------+-----------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| sysbench workload                | ***   | cpu       | cpu         | cpu         | cpu           | **   | cpu         | cpu         | cpu           | cpu           |
+----------------------------------+-------+-----------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| record item                      | ***   | Tput_avg  | Tput_stdev% | Tput_avg    | Tput_stdev%   | **   | Tput_avg    | Tput_stdev% | Tput_avg      | Tput_stdev%   |
+----------------------------------+-------+-----------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| Kernel_A(Peter's fairness patch) | ***   |           |             | 0.96        | 3.45%         | **   |             |             | 1.03          | 3.60%         |
+----------------------------------+-------+ 1         + 1.14%       +-------------+---------------+------+ 1           + 1.20%       +---------------+---------------+
| Kernel_B(Aaron's fairness patch) | ***   |           |             | 0.98        | 1.75%         | **   |             |             | 1.01          | 1.83%         |
+----------------------------------+-------+-----------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+

Test set 2:
+----------------------------------+-------+------------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| setting                          | ***   | default    | default     | coresched   | coresched     | **   | default     | default     | coresched     | coresched     |
+==================================+=======+============+=============+=============+===============+======+=============+=============+===============+===============+
| cgroups                          | ***   | cg mysql 0 | cg mysql 0  | cg mysql 0  | cg mysql 0    | **   | cg mysql 1  | cg mysql 1  | cg mysql 1    | cg mysql 1    |
+----------------------------------+-------+------------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| sysbench workload                | ***   | mysql      | mysql       | mysql       | mysql         | **   | mysql       | mysql       | mysql         | mysql         |
+----------------------------------+-------+------------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| record item                      | ***   | Tput_avg   | Tput_stdev% | Tput_avg    | Tput_stdev%   | **   | Tput_avg    | Tput_stdev% | Tput_avg      | Tput_stdev%   |
+----------------------------------+-------+------------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| Kernel_A(Peter's fairness patch) | ***   |            |             | 0.98        | 2.00%         | **   |             |             | 0.98          | 1.98%         |
+----------------------------------+-------+ 1          + 1.85%       +-------------+---------------+------+ 1           + 1.84%       +---------------+---------------+
| Kernel_B(Aaron's fairness patch) | ***   |            |             | 1.01        | 1.61%         | **   |             |             | 1.01          | 1.59%         |
+----------------------------------+-------+------------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+

Test set 3:
+----------------------------------+-------+-----------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| setting                          | ***   | default   | default     | coresched   | coresched     | **   | default     | default     | coresched     | coresched     |
+==================================+=======+===========+=============+=============+===============+======+=============+=============+===============+===============+
| cgroups                          | ***   | cg cpu    | cg cpu      | cg cpu      | cg cpu        | **   | cg mysql    | cg mysql    | cg mysql      | cg mysql      |
+----------------------------------+-------+-----------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| sysbench workload                | ***   | cpu       | cpu         | cpu         | cpu           | **   | mysql       | mysql       | mysql         | mysql         |
+----------------------------------+-------+-----------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| record item                      | ***   | Tput_avg  | Tput_stdev% | Tput_avg    | Tput_stdev%   | **   | Tput_avg    | Tput_stdev% | Tput_avg      | Tput_stdev%   |
+----------------------------------+-------+-----------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+
| Kernel_A(Peter's fairness patch) | ***   |           |             | 1.01        | 4.67%         | **   |             |             | 0.84          | 25.89%        |
+----------------------------------+-------+ 1         + 1.56%       +-------------+---------------+------+ 1           + 3.17%       +---------------+---------------+
| Kernel_B(Aaron's fairness patch) | ***   |           |             | 0.99        | 4.17%         | **   |             |             | 0.89          | 16.44%        |
+----------------------------------+-------+-----------+-------------+-------------+---------------+------+-------------+-------------+---------------+---------------+

Patch
diff mbox series

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5f322922f5ae..d8bedddef6fb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -119,19 +119,8 @@  static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-		u64 vruntime = b->se.vruntime;
-
-		/*
-		 * Normalize the vruntime if tasks are in different cpus.
-		 */
-		if (task_cpu(a) != task_cpu(b)) {
-			vruntime -= task_cfs_rq(b)->min_vruntime;
-			vruntime += task_cfs_rq(a)->min_vruntime;
-		}
-
-		return !((s64)(a->se.vruntime - vruntime) <= 0);
-	}
+	if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+		return cfs_prio_less(a, b);
 
 	return false;
 }
@@ -291,8 +280,17 @@  static int __sched_core_stopper(void *data)
 	}
 
 	for_each_online_cpu(cpu) {
-		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2))
-			cpu_rq(cpu)->core_enabled = enabled;
+		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
+			struct rq *rq = cpu_rq(cpu);
+
+			rq->core_enabled = enabled;
+			if (rq->core == rq) {
+				if (enabled)
+					sched_core_normalize_se_vruntime(cpu);
+				else
+					sched_core_sync_cfs_vruntime(cpu);
+			}
+		}
 	}
 
 	return 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d99ea6ee7af2..a5774f495d97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -449,9 +449,133 @@  find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+static inline struct cfs_rq *root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return &rq_of(cfs_rq)->cfs;
+}
+
+static inline bool is_root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq == root_cfs_rq(cfs_rq);
+}
+
+static inline struct cfs_rq *core_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return &rq_of(cfs_rq)->core->cfs;
+}
+
 static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->min_vruntime;
+	if (!sched_core_enabled(rq_of(cfs_rq)) || !is_root_cfs_rq(cfs_rq))
+		return cfs_rq->min_vruntime;
+
+	return core_cfs_rq(cfs_rq)->min_vruntime;
+}
+
+#ifndef CONFIG_64BIT
+static inline u64 cfs_rq_min_vruntime_copy(struct cfs_rq *cfs_rq)
+{
+	if (!sched_core_enabled(rq_of(cfs_rq)) || !is_root_cfs_rq(cfs_rq))
+		return cfs_rq->min_vruntime_copy;
+
+	return core_cfs_rq(cfs_rq)->min_vruntime_copy;
+}
+#endif
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+	struct sched_entity *sea = &a->se;
+	struct sched_entity *seb = &b->se;
+	bool samecpu = task_cpu(a) == task_cpu(b);
+	s64 delta;
+
+	if (samecpu) {
+		/* vruntime is per cfs_rq */
+		while (!is_same_group(sea, seb)) {
+			int sea_depth = sea->depth;
+			int seb_depth = seb->depth;
+
+			if (sea_depth >= seb_depth)
+				sea = parent_entity(sea);
+			if (sea_depth <= seb_depth)
+				seb = parent_entity(seb);
+		}
+
+		delta = (s64)(sea->vruntime - seb->vruntime);
+		goto out;
+	}
+
+	/* crosscpu: compare root level se's vruntime to decide priority */
+	while (sea->parent)
+		sea = sea->parent;
+	while (seb->parent)
+		seb = seb->parent;
+	delta = (s64)(sea->vruntime - seb->vruntime);
+
+out:
+	return delta > 0;
+}
+
+/*
+ * This is called in stop machine context so no need to take the rq lock.
+ *
+ * Core scheduling is going to be enabled and the root level sched entities
+ * of both siblings will use cfs_rq->min_vruntime as the common cfs_rq
+ * min_vruntime, so it's necessary to normalize vruntime of existing root
+ * level sched entities in sibling_cfs_rq.
+ *
+ * Update of sibling_cfs_rq's min_vruntime isn't necessary as we will be
+ * only using cfs_rq->min_vruntime during the entire run of core scheduling.
+ */
+void sched_core_normalize_se_vruntime(int cpu)
+{
+	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
+	int i;
+
+	for_each_cpu(i, cpu_smt_mask(cpu)) {
+		struct sched_entity *se, *next;
+		struct cfs_rq *sibling_cfs_rq;
+		s64 delta;
+
+		if (i == cpu)
+			continue;
+
+		sibling_cfs_rq = &cpu_rq(i)->cfs;
+		if (!sibling_cfs_rq->nr_running)
+			continue;
+
+		delta = cfs_rq->min_vruntime - sibling_cfs_rq->min_vruntime;
+		rbtree_postorder_for_each_entry_safe(se, next,
+				&sibling_cfs_rq->tasks_timeline.rb_root,
+				run_node) {
+			se->vruntime += delta;
+		}
+	}
+}
+
+/*
+ * During the entire run of core scheduling, sibling_cfs_rq's min_vruntime
+ * is left unused and could lag far behind its still queued sched entities.
+ * Sync it to the up2date core wide one to avoid problems.
+ */
+void sched_core_sync_cfs_vruntime(int cpu)
+{
+	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
+	int i;
+
+	for_each_cpu(i, cpu_smt_mask(cpu)) {
+		struct cfs_rq *sibling_cfs_rq;
+
+		if (i == cpu)
+			continue;
+
+		sibling_cfs_rq = &cpu_rq(i)->cfs;
+		sibling_cfs_rq->min_vruntime = cfs_rq->min_vruntime;
+#ifndef CONFIG_64BIT
+		smp_wmb();
+		sibling_cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
+#endif
+	}
 }
 
 static __always_inline
@@ -509,8 +633,11 @@  static void update_min_vruntime(struct cfs_rq *cfs_rq)
 			vruntime = min_vruntime(vruntime, se->vruntime);
 	}
 
+	if (sched_core_enabled(rq_of(cfs_rq)) && is_root_cfs_rq(cfs_rq))
+		cfs_rq = core_cfs_rq(cfs_rq);
+
 	/* ensure we never gain time by being placed backwards. */
-	cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);
+	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
 #ifndef CONFIG_64BIT
 	smp_wmb();
 	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
@@ -6396,9 +6523,9 @@  static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 		u64 min_vruntime_copy;
 
 		do {
-			min_vruntime_copy = cfs_rq->min_vruntime_copy;
+			min_vruntime_copy = cfs_rq_min_vruntime_copy(cfs_rq);
 			smp_rmb();
-			min_vruntime = cfs_rq->min_vruntime;
+			min_vruntime = cfs_rq_min_vruntime(cfs_rq);
 		} while (min_vruntime != min_vruntime_copy);
 #else
 		min_vruntime = cfs_rq_min_vruntime(cfs_rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 50a5675e941a..5517ca92b5bd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2594,3 +2594,7 @@  static inline void membarrier_switch_mm(struct rq *rq,
 {
 }
 #endif
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
+void sched_core_normalize_se_vruntime(int cpu);
+void sched_core_sync_cfs_vruntime(int cpu);