linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Aaron Lu <aaron.lwe@gmail.com>
To: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Cc: "Peter Zijlstra" <peterz@infradead.org>,
	"Nishanth Aravamudan" <naravamudan@digitalocean.com>,
	"Julien Desfossez" <jdesfossez@digitalocean.com>,
	"Tim Chen" <tim.c.chen@linux.intel.com>,
	"Ingo Molnar" <mingo@kernel.org>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Paul Turner" <pjt@google.com>,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"Aaron Lu" <aaron.lu@linux.alibaba.com>,
	"Linux List Kernel Mailing" <linux-kernel@vger.kernel.org>,
	"Frédéric Weisbecker" <fweisbec@gmail.com>,
	"Kees Cook" <keescook@chromium.org>,
	"Greg Kerr" <kerrnel@google.com>, "Phil Auld" <pauld@redhat.com>,
	"Aubrey Li" <aubrey.intel@gmail.com>,
	"Li, Aubrey" <aubrey.li@linux.intel.com>,
	"Valentin Schneider" <valentin.schneider@arm.com>,
	"Mel Gorman" <mgorman@techsingularity.net>,
	"Pawan Gupta" <pawan.kumar.gupta@linux.intel.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Joel Fernandes" <joelaf@google.com>,
	"Joel Fernandes" <joel@joelfernandes.org>
Subject: Re: [RFC PATCH 09/13] sched/fair: core wide vruntime comparison
Date: Fri, 17 Apr 2020 17:40:45 +0800	[thread overview]
Message-ID: <20200417094045.GA197704@ziqianlu-desktop.localdomain> (raw)
In-Reply-To: <CANaguZAXsjD=X-bB7dvQZ3FSqLioiZU=YREHH_7JqiFsZCTxHA@mail.gmail.com>

On Wed, Apr 15, 2020 at 05:24:30PM -0400, Vineeth Remanan Pillai wrote:
> > > > You forgot the time complexity analysis.
> > >
> > > This is a mistake and the adjust should be needed only once when core
> > > scheduling is initially enabled. It is an initialization thing and there
> > > is no reason to do it in every invocation of coresched_adjust_vruntime().
> >
> > Correction...
> > I meant there is no need to call coresched_adjust_vruntime() in every
> > invocation of update_core_cfs_rq_min_vruntime().
> 
> Due to the checks in place, update_core_cfs_rq_min_vruntime should
> not be calling coresched_adjust_vruntime more than once between a
> coresched enable/disable. Once the min_vruntime is adjusted, we depend
> only on rq->core and the other sibling's min_vruntime will not grow
> until coresched disable.

OK, but I prefer to make it clear that this is an initialization only
stuff. Below is what I cooked, I also enhanced the changelog while at
it.

From e80121e61953da717da074ea2a097194f6d29ef4 Mon Sep 17 00:00:00 2001
From: Aaron Lu <ziqian.lzq@antfin.com>
Date: Thu, 25 Jul 2019 22:32:48 +0800
Subject: [PATCH] sched/fair: core vruntime comparison

This patch provides a vruntime based way to compare two cfs task's
priority, be it on the same cpu or different threads of the same core.

When the two tasks are on the same CPU, we just need to find a common
cfs_rq both sched_entities are on and then do the comparison.

When the two tasks are on differen threads of the same core, each thread
will choose the next task to run the usual way and then the root level
sched entities which the two tasks belong to will be used to decide
which task runs next core wide.

An illustration for the cross CPU case:

   cpu0         cpu1
 /   |  \     /   |  \
se1 se2 se3  se4 se5 se6
    /  \            /   \
  se21 se22       se61  se62
  (A)                    /
                       se621
                        (B)

Assume CPU0 and CPU1 are smt siblings and cpu0 has decided task A to
run next and cpu1 has decided task B to run next. To compare priority
of task A and B, we compare priority of se2 and se6. Whose vruntime is
smaller, who wins.

To make this work, the root level sched entities' vruntime of the two
threads must be directly comparable. So a new core wide cfs_rq
min_vruntime is introduced to serve the purpose of normalizing these
root level sched entities' vruntime.

All sub cfs_rqs and sched entities are not interesting in cross cpu
priority comparison as they will only participate in the usual cpu local
schedule decision so no need to normalize their vruntimes.

Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
---
 kernel/sched/core.c  |  24 ++++------
 kernel/sched/fair.c  | 101 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |   3 ++
 3 files changed, 111 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5f322922f5ae..d6c8c76cb07a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -119,19 +119,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-		u64 vruntime = b->se.vruntime;
-
-		/*
-		 * Normalize the vruntime if tasks are in different cpus.
-		 */
-		if (task_cpu(a) != task_cpu(b)) {
-			vruntime -= task_cfs_rq(b)->min_vruntime;
-			vruntime += task_cfs_rq(a)->min_vruntime;
-		}
-
-		return !((s64)(a->se.vruntime - vruntime) <= 0);
-	}
+	if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+		return cfs_prio_less(a, b);
 
 	return false;
 }
@@ -291,8 +280,13 @@ static int __sched_core_stopper(void *data)
 	}
 
 	for_each_online_cpu(cpu) {
-		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2))
-			cpu_rq(cpu)->core_enabled = enabled;
+		if (!enabled || (enabled && cpumask_weight(cpu_smt_mask(cpu)) >= 2)) {
+			struct rq *rq = cpu_rq(cpu);
+
+			rq->core_enabled = enabled;
+			if (rq->core == rq)
+				sched_core_adjust_se_vruntime(cpu);
+		}
 	}
 
 	return 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d99ea6ee7af2..7eecf590d6c0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -449,9 +449,103 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+static inline struct cfs_rq *root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return &rq_of(cfs_rq)->cfs;
+}
+
+static inline bool is_root_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq == root_cfs_rq(cfs_rq);
+}
+
+static inline struct cfs_rq *core_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	return &rq_of(cfs_rq)->core->cfs;
+}
+
 static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->min_vruntime;
+	if (!sched_core_enabled(rq_of(cfs_rq)) || !is_root_cfs_rq(cfs_rq))
+		return cfs_rq->min_vruntime;
+
+	return core_cfs_rq(cfs_rq)->min_vruntime;
+}
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
+{
+	struct sched_entity *sea = &a->se;
+	struct sched_entity *seb = &b->se;
+	bool samecpu = task_cpu(a) == task_cpu(b);
+	s64 delta;
+
+	if (samecpu) {
+		/* vruntime is per cfs_rq */
+		while (!is_same_group(sea, seb)) {
+			int sea_depth = sea->depth;
+			int seb_depth = seb->depth;
+
+			if (sea_depth >= seb_depth)
+				sea = parent_entity(sea);
+			if (sea_depth <= seb_depth)
+				seb = parent_entity(seb);
+		}
+
+		delta = (s64)(sea->vruntime - seb->vruntime);
+		goto out;
+	}
+
+	/* crosscpu: compare root level se's vruntime to decide priority */
+	while (sea->parent)
+		sea = sea->parent;
+	while (seb->parent)
+		seb = seb->parent;
+	delta = (s64)(sea->vruntime - seb->vruntime);
+
+out:
+	return delta > 0;
+}
+
+/*
+ * This is called in stop machine context so no need to take the rq lock.
+ *
+ * Core scheduling is going to be enabled and the root level sched entities
+ * of both siblings will use cfs_rq->min_vruntime as the common cfs_rq
+ * min_vruntime, so it's necessary to normalize vruntime of existing root
+ * level sched entities in sibling_cfs_rq.
+ *
+ * Update of sibling_cfs_rq's min_vruntime isn't necessary as we will be
+ * only using cfs_rq->min_vruntime during the entire run of core scheduling.
+ */
+void sched_core_adjust_se_vruntime(int cpu)
+{
+	int i;
+
+	for_each_cpu(i, cpu_smt_mask(cpu)) {
+		struct cfs_rq *cfs_rq, *sibling_cfs_rq;
+		struct sched_entity *se, *next;
+		s64 delta;
+
+		if (i == cpu)
+			continue;
+
+		sibling_cfs_rq = &cpu_rq(i)->cfs;
+		if (!sibling_cfs_rq->nr_running)
+			continue;
+
+		cfs_rq = &cpu_rq(cpu)->cfs;
+		delta = cfs_rq->min_vruntime - sibling_cfs_rq->min_vruntime;
+		/*
+		 * XXX Malicious user can create a ton of runnable tasks in root
+		 * sibling_cfs_rq and cause the below vruntime normalization
+		 * potentially taking a long time.
+		 */
+		rbtree_postorder_for_each_entry_safe(se, next,
+				&sibling_cfs_rq->tasks_timeline.rb_root,
+				run_node) {
+			se->vruntime += delta;
+		}
+	}
 }
 
 static __always_inline
@@ -509,8 +603,11 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
 			vruntime = min_vruntime(vruntime, se->vruntime);
 	}
 
+	if (sched_core_enabled(rq_of(cfs_rq)) && is_root_cfs_rq(cfs_rq))
+		cfs_rq = core_cfs_rq(cfs_rq);
+
 	/* ensure we never gain time by being placed backwards. */
-	cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);
+	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
 #ifndef CONFIG_64BIT
 	smp_wmb();
 	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 50a5675e941a..24bae760f764 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2594,3 +2594,6 @@ static inline void membarrier_switch_mm(struct rq *rq,
 {
 }
 #endif
+
+bool cfs_prio_less(struct task_struct *a, struct task_struct *b);
+void sched_core_adjust_se_vruntime(int cpu);
-- 
2.19.1.3.ge56e4f7


  reply	other threads:[~2020-04-17  9:40 UTC|newest]

Thread overview: 110+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-04 16:59 [RFC PATCH 00/13] Core scheduling v5 vpillai
2020-03-04 16:59 ` [RFC PATCH 01/13] sched: Wrap rq::lock access vpillai
2020-03-04 16:59 ` [RFC PATCH 02/13] sched: Introduce sched_class::pick_task() vpillai
2020-03-04 16:59 ` [RFC PATCH 03/13] sched: Core-wide rq->lock vpillai
2020-04-01 11:42   ` [PATCH] sched/arm64: store cpu topology before notify_cpu_starting Cheng Jian
2020-04-01 13:23     ` Valentin Schneider
2020-04-06  8:00       ` chengjian (D)
2020-04-09  9:59       ` Sudeep Holla
2020-04-09 10:32         ` Valentin Schneider
2020-04-09 11:08           ` Sudeep Holla
2020-04-09 17:54     ` Joel Fernandes
2020-04-10 13:49       ` chengjian (D)
2020-04-14 11:36   ` [RFC PATCH 03/13] sched: Core-wide rq->lock Peter Zijlstra
2020-04-14 21:35     ` Vineeth Remanan Pillai
2020-04-15 10:55       ` Peter Zijlstra
2020-04-14 14:32   ` Peter Zijlstra
2020-03-04 16:59 ` [RFC PATCH 04/13] sched/fair: Add a few assertions vpillai
2020-03-04 16:59 ` [RFC PATCH 05/13] sched: Basic tracking of matching tasks vpillai
2020-03-04 16:59 ` [RFC PATCH 06/13] sched: Update core scheduler queue when taking cpu online/offline vpillai
2020-03-04 16:59 ` [RFC PATCH 07/13] sched: Add core wide task selection and scheduling vpillai
2020-04-14 13:35   ` Peter Zijlstra
2020-04-16 23:32     ` Tim Chen
2020-04-17 10:57       ` Peter Zijlstra
2020-04-16  3:39   ` Chen Yu
2020-04-16 19:59     ` Vineeth Remanan Pillai
2020-04-17 11:18     ` Peter Zijlstra
2020-04-19 15:31       ` Chen Yu
2020-05-21 23:14   ` Joel Fernandes
2020-05-21 23:16     ` Joel Fernandes
2020-05-22  2:35     ` Joel Fernandes
2020-05-22  3:44       ` Aaron Lu
2020-05-22 20:13         ` Joel Fernandes
2020-03-04 16:59 ` [RFC PATCH 08/13] sched/fair: wrapper for cfs_rq->min_vruntime vpillai
2020-03-04 16:59 ` [RFC PATCH 09/13] sched/fair: core wide vruntime comparison vpillai
2020-04-14 13:56   ` Peter Zijlstra
2020-04-15  3:34     ` Aaron Lu
2020-04-15  4:07       ` Aaron Lu
2020-04-15 21:24         ` Vineeth Remanan Pillai
2020-04-17  9:40           ` Aaron Lu [this message]
2020-04-20  8:07             ` [PATCH updated] sched/fair: core wide cfs task priority comparison Aaron Lu
2020-04-20 22:26               ` Vineeth Remanan Pillai
2020-04-21  2:51                 ` Aaron Lu
2020-04-24 14:24                   ` [PATCH updated v2] " Aaron Lu
2020-05-06 14:35                     ` Peter Zijlstra
2020-05-08  8:44                       ` Aaron Lu
2020-05-08  9:09                         ` Peter Zijlstra
2020-05-08 12:34                           ` Aaron Lu
2020-05-14 13:02                             ` Peter Zijlstra
2020-05-14 22:51                               ` Vineeth Remanan Pillai
2020-05-15 10:38                                 ` Peter Zijlstra
2020-05-15 10:43                                   ` Peter Zijlstra
2020-05-15 14:24                                   ` Vineeth Remanan Pillai
2020-05-16  3:42                               ` Aaron Lu
2020-05-22  9:40                                 ` Aaron Lu
2020-06-08  1:41                               ` Ning, Hongyu
2020-03-04 17:00 ` [RFC PATCH 10/13] sched: Trivial forced-newidle balancer vpillai
2020-03-04 17:00 ` [RFC PATCH 11/13] sched: migration changes for core scheduling vpillai
2020-06-12 13:21   ` Joel Fernandes
2020-06-12 21:32     ` Vineeth Remanan Pillai
2020-06-13  2:25       ` Joel Fernandes
2020-06-13 18:59         ` Vineeth Remanan Pillai
2020-06-15  2:05           ` Li, Aubrey
2020-03-04 17:00 ` [RFC PATCH 12/13] sched: cgroup tagging interface " vpillai
2020-06-26 15:06   ` Vineeth Remanan Pillai
2020-03-04 17:00 ` [RFC PATCH 13/13] sched: Debug bits vpillai
2020-03-04 17:36 ` [RFC PATCH 00/13] Core scheduling v5 Tim Chen
2020-03-04 17:42   ` Vineeth Remanan Pillai
2020-04-14 14:21 ` Peter Zijlstra
2020-04-15 16:32   ` Joel Fernandes
2020-04-17 11:12     ` Peter Zijlstra
2020-04-17 12:35       ` Alexander Graf
2020-04-17 13:08         ` Peter Zijlstra
2020-04-18  2:25       ` Joel Fernandes
2020-05-09 14:35   ` Dario Faggioli
     [not found] ` <38805656-2e2f-222a-c083-692f4b113313@linux.intel.com>
2020-05-09  3:39   ` Ning, Hongyu
2020-05-14 20:51     ` FW: " Gruza, Agata
2020-05-10 23:46 ` [PATCH RFC] Add support for core-wide protection of IRQ and softirq Joel Fernandes (Google)
2020-05-11 13:49   ` Peter Zijlstra
2020-05-11 14:54     ` Joel Fernandes
2020-05-20 22:26 ` [PATCH RFC] sched: Add a per-thread core scheduling interface Joel Fernandes (Google)
2020-05-21  4:09   ` [PATCH RFC] sched: Add a per-thread core scheduling interface(Internet mail) benbjiang(蒋彪)
2020-05-21 13:49     ` Joel Fernandes
2020-05-21  8:51   ` [PATCH RFC] sched: Add a per-thread core scheduling interface Peter Zijlstra
2020-05-21 13:47     ` Joel Fernandes
2020-05-21 20:20       ` Vineeth Remanan Pillai
2020-05-22 12:59       ` Peter Zijlstra
2020-05-22 21:35         ` Joel Fernandes
2020-05-24 14:00           ` Phil Auld
2020-05-28 14:51             ` Joel Fernandes
2020-05-28 17:01             ` Peter Zijlstra
2020-05-28 18:17               ` Phil Auld
2020-05-28 18:34                 ` Phil Auld
2020-05-28 18:23               ` Joel Fernandes
2020-05-21 18:31   ` Linus Torvalds
2020-05-21 20:40     ` Joel Fernandes
2020-05-21 21:58       ` Jesse Barnes
2020-05-22 16:33         ` Linus Torvalds
2020-05-20 22:37 ` [PATCH RFC v2] Add support for core-wide protection of IRQ and softirq Joel Fernandes (Google)
2020-05-20 22:48 ` [PATCH RFC] sched: Use sched-RCU in core-scheduling balancing logic Joel Fernandes (Google)
2020-05-21 22:52   ` Paul E. McKenney
2020-05-22  1:26     ` Joel Fernandes
2020-06-25 20:12 ` [RFC PATCH 00/13] Core scheduling v5 Vineeth Remanan Pillai
2020-06-26  1:47   ` Joel Fernandes
2020-06-26 14:36     ` Vineeth Remanan Pillai
2020-06-26 15:10       ` Joel Fernandes
2020-06-26 15:12         ` Joel Fernandes
2020-06-27 16:21         ` Joel Fernandes
2020-06-30 14:11         ` Phil Auld
2020-06-29 12:33   ` Li, Aubrey
2020-06-29 19:41     ` Vineeth Remanan Pillai

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200417094045.GA197704@ziqianlu-desktop.localdomain \
    --to=aaron.lwe@gmail.com \
    --cc=aaron.lu@linux.alibaba.com \
    --cc=aubrey.intel@gmail.com \
    --cc=aubrey.li@linux.intel.com \
    --cc=fweisbec@gmail.com \
    --cc=jdesfossez@digitalocean.com \
    --cc=joel@joelfernandes.org \
    --cc=joelaf@google.com \
    --cc=keescook@chromium.org \
    --cc=kerrnel@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@techsingularity.net \
    --cc=mingo@kernel.org \
    --cc=naravamudan@digitalocean.com \
    --cc=pauld@redhat.com \
    --cc=pawan.kumar.gupta@linux.intel.com \
    --cc=pbonzini@redhat.com \
    --cc=peterz@infradead.org \
    --cc=pjt@google.com \
    --cc=tglx@linutronix.de \
    --cc=tim.c.chen@linux.intel.com \
    --cc=torvalds@linux-foundation.org \
    --cc=valentin.schneider@arm.com \
    --cc=vpillai@digitalocean.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).