linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling
@ 2020-12-08 15:34 Mel Gorman
  2020-12-08 15:34 ` [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU Mel Gorman
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: Mel Gorman @ 2020-12-08 15:34 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, LKML
  Cc: Aubrey Li, Barry Song, Juri Lelli, Vincent Guittot,
	Valentin Schneider, Linux-ARM, Mel Gorman

Changelog since v1
o Drop single-pass patch						(vincent)
o Scope variables used for SIS_AVG_CPU					(dietmar)
o Remove redundant assignment						(dietmar)

This reduces the amount of runqueue scanning in select_idle_sibling in
the worst case.

Patch 1 removes SIS_AVG_CPU because it's unused.

Patch 2 moves all SIS_PROP-related calculations under SIS_PROP

Patch 3 improves the hit rate of p->recent_used_cpu to reduce the amount
	of scanning. It should be relatively uncontroversial

Patch 4 returns an idle candidate if one is found while scanning for a
	free core.

-- 
2.26.2

Mel Gorman (4):
  sched/fair: Remove SIS_AVG_CPU
  sched/fair: Move avg_scan_cost calculations under SIS_PROP
  sched/fair: Do not replace recent_used_cpu with the new target
  sched/fair: Return an idle cpu if one is found after a failed search
    for an idle core

 kernel/sched/fair.c     | 51 ++++++++++++++++++++---------------------
 kernel/sched/features.h |  1 -
 2 files changed, 25 insertions(+), 27 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU
  2020-12-08 15:34 [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling Mel Gorman
@ 2020-12-08 15:34 ` Mel Gorman
  2020-12-08 16:13   ` Vincent Guittot
  2020-12-08 15:34 ` [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP Mel Gorman
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2020-12-08 15:34 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, LKML
  Cc: Aubrey Li, Barry Song, Juri Lelli, Vincent Guittot,
	Valentin Schneider, Linux-ARM, Mel Gorman

SIS_AVG_CPU was introduced as a means of avoiding a search when the
average search cost indicated that the search would likely fail. It
was a blunt instrument and disabled by 4c77b18cf8b7 ("sched/fair: Make
select_idle_cpu() more aggressive") and later replaced with a proportional
search depth by 1ad3aaf3fcd2 ("sched/core: Implement new approach to
scale select_idle_cpu()").

While there are corner cases where SIS_AVG_CPU is better, it has now been
disabled for almost three years. As the intent of SIS_PROP is to reduce
the time complexity of select_idle_cpu(), let's drop SIS_AVG_CPU and focus
on SIS_PROP as a throttling mechanism.
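
As a rough illustration of how the remaining SIS_PROP throttling works
(made-up numbers, not measurements): assume this_rq()->avg_idle is
60000ns, this_sd->avg_scan_cost is 1000ns and the LLC spans 40 CPUs.
Then

	avg_idle = 60000 / 512             ~= 117
	avg_cost = 1000 + 1                 = 1001
	span_avg = 40 * 117                 = 4680
	span_avg > 4 * avg_cost (4004), so nr = 4680 / 1001 ~= 4

and at most about four runqueues are scanned on that wakeup instead of
all 40.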

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c     | 20 +++++++++-----------
 kernel/sched/features.h |  1 -
 2 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 98075f9ea9a8..ac7b34e7372b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6145,7 +6145,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	struct sched_domain *this_sd;
-	u64 avg_cost, avg_idle;
 	u64 time;
 	int this = smp_processor_id();
 	int cpu, nr = INT_MAX;
@@ -6154,18 +6153,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	if (!this_sd)
 		return -1;
 
-	/*
-	 * Due to large variance we need a large fuzz factor; hackbench in
-	 * particularly is sensitive here.
-	 */
-	avg_idle = this_rq()->avg_idle / 512;
-	avg_cost = this_sd->avg_scan_cost + 1;
+	if (sched_feat(SIS_PROP)) {
+		u64 avg_cost, avg_idle, span_avg;
 
-	if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
-		return -1;
+		/*
+		 * Due to large variance we need a large fuzz factor;
+		 * hackbench in particularly is sensitive here.
+		 */
+		avg_idle = this_rq()->avg_idle / 512;
+		avg_cost = this_sd->avg_scan_cost + 1;
 
-	if (sched_feat(SIS_PROP)) {
-		u64 span_avg = sd->span_weight * avg_idle;
+		span_avg = sd->span_weight * avg_idle;
 		if (span_avg > 4*avg_cost)
 			nr = div_u64(span_avg, avg_cost);
 		else
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 68d369cba9e4..e875eabb6600 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -54,7 +54,6 @@ SCHED_FEAT(TTWU_QUEUE, true)
 /*
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
  */
-SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
 
 /*
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP
  2020-12-08 15:34 [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling Mel Gorman
  2020-12-08 15:34 ` [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU Mel Gorman
@ 2020-12-08 15:34 ` Mel Gorman
  2020-12-08 16:03   ` Vincent Guittot
  2020-12-10  5:18   ` Li, Aubrey
  2020-12-08 15:35 ` [PATCH 3/4] sched/fair: Do not replace recent_used_cpu with the new target Mel Gorman
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 26+ messages in thread
From: Mel Gorman @ 2020-12-08 15:34 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, LKML
  Cc: Aubrey Li, Barry Song, Juri Lelli, Vincent Guittot,
	Valentin Schneider, Linux-ARM, Mel Gorman

As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
check and while we are at it, exclude the cost of initialising the CPU
mask from the average scan cost.
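
For illustration, a trimmed-down sketch (not the full function) of where
the timed region ends up after this patch; only the scan itself feeds
avg_scan_cost:

	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); /* untimed */

	if (sched_feat(SIS_PROP)) {
		/* ... compute nr ... */
		time = cpu_clock(this);			/* timing starts */
	}

	for_each_cpu_wrap(cpu, cpus, target) {
		/* ... scan runqueues ... */
	}

	if (sched_feat(SIS_PROP)) {
		time = cpu_clock(this) - time;		/* timing ends */
		update_avg(&this_sd->avg_scan_cost, time);
	}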

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ac7b34e7372b..5c41875aec23 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	if (!this_sd)
 		return -1;
 
+	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+
 	if (sched_feat(SIS_PROP)) {
 		u64 avg_cost, avg_idle, span_avg;
 
@@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 			nr = div_u64(span_avg, avg_cost);
 		else
 			nr = 4;
-	}
-
-	time = cpu_clock(this);
 
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+		time = cpu_clock(this);
+	}
 
 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (!--nr)
@@ -6181,8 +6181,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 			break;
 	}
 
-	time = cpu_clock(this) - time;
-	update_avg(&this_sd->avg_scan_cost, time);
+	if (sched_feat(SIS_PROP)) {
+		time = cpu_clock(this) - time;
+		update_avg(&this_sd->avg_scan_cost, time);
+	}
 
 	return cpu;
 }
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/4] sched/fair: Do not replace recent_used_cpu with the new target
  2020-12-08 15:34 [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling Mel Gorman
  2020-12-08 15:34 ` [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU Mel Gorman
  2020-12-08 15:34 ` [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP Mel Gorman
@ 2020-12-08 15:35 ` Mel Gorman
  2020-12-08 16:14   ` Vincent Guittot
  2020-12-08 15:35 ` [PATCH 4/4] sched/fair: Return an idle cpu if one is found after a failed search for an idle core Mel Gorman
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2020-12-08 15:35 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, LKML
  Cc: Aubrey Li, Barry Song, Juri Lelli, Vincent Guittot,
	Valentin Schneider, Linux-ARM, Mel Gorman

After select_idle_sibling, p->recent_used_cpu is set to the new
target. However, on the next wakeup, prev will be the same as
recent_used_cpu unless the load balancer has moved the task since the
last wakeup. It still works, but it is less efficient than it could be
after all the changes that have gone in since then to reduce unnecessary
migrations, the load balancer changes, etc. This patch preserves
recent_used_cpu for longer.
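
To illustrate with hypothetical CPU numbers (not a trace from the
benchmarks below):

	wakeup N:   the task wakes with prev = 4, select_idle_sibling()
	            picks target = 6 and recent_used_cpu ends up as 6
	wakeup N+1: the task ran on 6, so prev = 6 == recent_used_cpu,
	            the "recent_used_cpu != prev" check fails and the
	            recent CPU never contributes a candidate

Keeping the older value in recent_used_cpu preserves a second, distinct
CPU that may still be idle on the next wakeup; the jump in the "Recent
Used Hit" schedstat below reflects this.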

With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled

                          5.10.0-rc6             5.10.0-rc6
                 	 baseline-v2           altrecent-v2
Hmean     1        508.39 (   0.00%)      502.05 *  -1.25%*
Hmean     2        986.70 (   0.00%)      983.65 *  -0.31%*
Hmean     4       1914.55 (   0.00%)     1920.24 *   0.30%*
Hmean     8       3702.37 (   0.00%)     3663.96 *  -1.04%*
Hmean     16      6573.11 (   0.00%)     6545.58 *  -0.42%*
Hmean     32     10142.57 (   0.00%)    10253.73 *   1.10%*
Hmean     64     14348.40 (   0.00%)    12506.31 * -12.84%*
Hmean     128    21842.59 (   0.00%)    21967.13 *   0.57%*
Hmean     256    20813.75 (   0.00%)    21534.52 *   3.46%*
Hmean     320    20684.33 (   0.00%)    21070.14 *   1.87%*

The difference was marginal except for 64 threads, where the baseline
result was very unstable whereas the patch was much more stable. This is
somewhat machine-specific, as a separate 80-CPU Broadwell machine
reported the following for the same test:

                          5.10.0-rc6             5.10.0-rc6
                 	 baseline-v2           altrecent-v2
Hmean     1        310.36 (   0.00%)      291.81 *  -5.98%*
Hmean     2        340.86 (   0.00%)      547.22 *  60.54%*
Hmean     4        912.29 (   0.00%)     1063.21 *  16.54%*
Hmean     8       2116.40 (   0.00%)     2103.60 *  -0.60%*
Hmean     16      4232.90 (   0.00%)     4362.92 *   3.07%*
Hmean     32      8442.03 (   0.00%)     8642.10 *   2.37%*
Hmean     64     11733.91 (   0.00%)    11473.66 *  -2.22%*
Hmean     128    17727.24 (   0.00%)    16784.23 *  -5.32%*
Hmean     256    16089.23 (   0.00%)    16110.79 *   0.13%*
Hmean     320    15992.60 (   0.00%)    16071.64 *   0.49%*

schedstats were not used in this series but, from an earlier debugging
effort, the schedstats after the test run were as follows:

Ops SIS Search               5653107942.00  5726545742.00
Ops SIS Domain Search        3365067916.00  3319768543.00
Ops SIS Scanned            112173512543.00 99194352541.00
Ops SIS Domain Scanned     109885472517.00 96787575342.00
Ops SIS Failures             2923185114.00  2950166441.00
Ops SIS Recent Used Hit           56547.00   118064916.00
Ops SIS Recent Used Miss     1590899250.00   354942791.00
Ops SIS Recent Attempts      1590955797.00   473007707.00
Ops SIS Search Efficiency             5.04           5.77
Ops SIS Domain Search Eff             3.06           3.43
Ops SIS Fast Success Rate            40.47          42.03
Ops SIS Success Rate                 48.29          48.48
Ops SIS Recent Success Rate           0.00          24.96

The first interesting point is the ridiculous number of times runqueues
are scanned -- almost 97 billion times over the course of 40 minutes.

With the patch, "Recent Used Hit" is over 2000 times more likely to
succeed. The failure rate also increases by quite a lot but the cost is
marginal even if the "Fast Success Rate" only increases by 2% overall. What
cannot be observed from these stats is where the biggest impact is, as
these stats cover everything from low utilisation to over-saturation.

If graphed over time, the graphs show that the sched domain is only
scanned at negligible rates until the machine is fully busy. With
low utilisation, the "Fast Success Rate" is almost 100% until the
machine is fully busy. For 320 clients, the success rate is close to
0% which is unsurprising.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c41875aec23..413d895bbbf8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6277,17 +6277,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 
 	/* Check a recently used CPU as a potential idle candidate: */
 	recent_used_cpu = p->recent_used_cpu;
+	p->recent_used_cpu = prev;
 	if (recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
 	    cpus_share_cache(recent_used_cpu, target) &&
 	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
 	    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
 	    asym_fits_capacity(task_util, recent_used_cpu)) {
-		/*
-		 * Replace recent_used_cpu with prev as it is a potential
-		 * candidate for the next wake:
-		 */
-		p->recent_used_cpu = prev;
 		return recent_used_cpu;
 	}
 
@@ -6768,9 +6764,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	} else if (wake_flags & WF_TTWU) { /* XXX always ? */
 		/* Fast path */
 		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
-
-		if (want_affine)
-			current->recent_used_cpu = cpu;
 	}
 	rcu_read_unlock();
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/4] sched/fair: Return an idle cpu if one is found after a failed search for an idle core
  2020-12-08 15:34 [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling Mel Gorman
                   ` (2 preceding siblings ...)
  2020-12-08 15:35 ` [PATCH 3/4] sched/fair: Do not replace recent_used_cpu with the new target Mel Gorman
@ 2020-12-08 15:35 ` Mel Gorman
  2020-12-08 16:15   ` Vincent Guittot
  2020-12-09 14:37 ` [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling Mel Gorman
       [not found] ` <20201211062542.3082-1-hdanton@sina.com>
  5 siblings, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2020-12-08 15:35 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, LKML
  Cc: Aubrey Li, Barry Song, Juri Lelli, Vincent Guittot,
	Valentin Schneider, Linux-ARM, Mel Gorman

select_idle_core is called when SMT is active and there is likely a free
core available. It may find idle CPUs but this information is simply
discarded and the scan starts over again with select_idle_cpu.

This patch caches information on idle CPUs found during the search for
a core and uses one if no core is found. This is a tradeoff. There may
be a slight impact when utilisation is low and an idle core can be
found quickly. It provides improvements as the number of busy CPUs
approaches 50% of the domain size when SMT is enabled.
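
As an illustration (hypothetical SMT2 topology, not one of the test
machines): take an LLC with sibling pairs (0,4), (1,5), (2,6), (3,7)
where CPU 0 is idle, CPU 4 is busy and no core is completely idle.

	before: the core scan sees CPU 0 idle, then CPU 4 busy, gives up
	        on that core and eventually returns -1, so the knowledge
	        that CPU 0 was idle is thrown away and must be
	        rediscovered by select_idle_cpu()
	after:  CPU 0 is remembered in idle_candidate and returned once
	        no fully idle core is found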

With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled

                          5.10.0-rc6             5.10.0-rc6
                           schedstat          idlecandidate
Hmean     1        500.06 (   0.00%)      505.67 *   1.12%*
Hmean     2        975.90 (   0.00%)      974.06 *  -0.19%*
Hmean     4       1902.95 (   0.00%)     1904.43 *   0.08%*
Hmean     8       3761.73 (   0.00%)     3721.02 *  -1.08%*
Hmean     16      6713.93 (   0.00%)     6769.17 *   0.82%*
Hmean     32     10435.31 (   0.00%)    10312.58 *  -1.18%*
Hmean     64     12325.51 (   0.00%)    13792.01 *  11.90%*
Hmean     128    21225.21 (   0.00%)    20963.44 *  -1.23%*
Hmean     256    20532.83 (   0.00%)    20335.62 *  -0.96%*
Hmean     320    20334.81 (   0.00%)    20147.25 *  -0.92%*

Note that there is a significant corner case. As the SMT scan may be
terminated early, not all CPUs have been visited and select_idle_cpu()
is still called for a full scan. This case is handled in the next
patch.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 413d895bbbf8..ed6f45832d97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6066,6 +6066,7 @@ void __update_idle_core(struct rq *rq)
  */
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
+	int idle_candidate = -1;
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int core, cpu;
 
@@ -6085,6 +6086,11 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 				idle = false;
 				break;
 			}
+
+			if (idle_candidate == -1 &&
+			    cpumask_test_cpu(cpu, p->cpus_ptr)) {
+				idle_candidate = cpu;
+			}
 		}
 
 		if (idle)
@@ -6098,7 +6104,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 	 */
 	set_idle_cores(target, 0);
 
-	return -1;
+	return idle_candidate;
 }
 
 /*
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP
  2020-12-08 15:34 ` [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP Mel Gorman
@ 2020-12-08 16:03   ` Vincent Guittot
  2020-12-08 16:30     ` Mel Gorman
  2020-12-09  5:28     ` Li, Aubrey
  2020-12-10  5:18   ` Li, Aubrey
  1 sibling, 2 replies; 26+ messages in thread
From: Vincent Guittot @ 2020-12-08 16:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Aubrey Li, Barry Song,
	Juri Lelli, Valentin Schneider, Linux-ARM

On Tue, 8 Dec 2020 at 16:35, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
> even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
> check and while we are at it, exclude the cost of initialising the CPU
> mask from the average scan cost.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  kernel/sched/fair.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ac7b34e7372b..5c41875aec23 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>         if (!this_sd)
>                 return -1;

Just noticed while reviewing the patch that the above related to
this_sd can also go under sched_feat(SIS_PROP)

>
> +       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +
>         if (sched_feat(SIS_PROP)) {
>                 u64 avg_cost, avg_idle, span_avg;
>
> @@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>                         nr = div_u64(span_avg, avg_cost);
>                 else
>                         nr = 4;
> -       }
> -
> -       time = cpu_clock(this);
>
> -       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +               time = cpu_clock(this);
> +       }
>
>         for_each_cpu_wrap(cpu, cpus, target) {
>                 if (!--nr)
> @@ -6181,8 +6181,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>                         break;
>         }
>
> -       time = cpu_clock(this) - time;
> -       update_avg(&this_sd->avg_scan_cost, time);
> +       if (sched_feat(SIS_PROP)) {
> +               time = cpu_clock(this) - time;
> +               update_avg(&this_sd->avg_scan_cost, time);
> +       }
>
>         return cpu;
>  }
> --
> 2.26.2
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU
  2020-12-08 15:34 ` [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU Mel Gorman
@ 2020-12-08 16:13   ` Vincent Guittot
  0 siblings, 0 replies; 26+ messages in thread
From: Vincent Guittot @ 2020-12-08 16:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Aubrey Li, Barry Song,
	Juri Lelli, Valentin Schneider, Linux-ARM

On Tue, 8 Dec 2020 at 16:35, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> SIS_AVG_CPU was introduced as a means of avoiding a search when the
> average search cost indicated that the search would likely fail. It
> was a blunt instrument and disabled by 4c77b18cf8b7 ("sched/fair: Make
> select_idle_cpu() more aggressive") and later replaced with a proportional
> search depth by 1ad3aaf3fcd2 ("sched/core: Implement new approach to
> scale select_idle_cpu()").
>
> While there are corner cases where SIS_AVG_CPU is better, it has now been
> disabled for almost three years. As the intent of SIS_PROP is to reduce
> the time complexity of select_idle_cpu(), lets drop SIS_AVG_CPU and focus
> on SIS_PROP as a throttling mechanism.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>  kernel/sched/fair.c     | 20 +++++++++-----------
>  kernel/sched/features.h |  1 -
>  2 files changed, 9 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 98075f9ea9a8..ac7b34e7372b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6145,7 +6145,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  {
>         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>         struct sched_domain *this_sd;
> -       u64 avg_cost, avg_idle;
>         u64 time;
>         int this = smp_processor_id();
>         int cpu, nr = INT_MAX;
> @@ -6154,18 +6153,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>         if (!this_sd)
>                 return -1;
>
> -       /*
> -        * Due to large variance we need a large fuzz factor; hackbench in
> -        * particularly is sensitive here.
> -        */
> -       avg_idle = this_rq()->avg_idle / 512;
> -       avg_cost = this_sd->avg_scan_cost + 1;
> +       if (sched_feat(SIS_PROP)) {
> +               u64 avg_cost, avg_idle, span_avg;
>
> -       if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
> -               return -1;
> +               /*
> +                * Due to large variance we need a large fuzz factor;
> +                * hackbench in particularly is sensitive here.
> +                */
> +               avg_idle = this_rq()->avg_idle / 512;
> +               avg_cost = this_sd->avg_scan_cost + 1;
>
> -       if (sched_feat(SIS_PROP)) {
> -               u64 span_avg = sd->span_weight * avg_idle;
> +               span_avg = sd->span_weight * avg_idle;
>                 if (span_avg > 4*avg_cost)
>                         nr = div_u64(span_avg, avg_cost);
>                 else
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 68d369cba9e4..e875eabb6600 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -54,7 +54,6 @@ SCHED_FEAT(TTWU_QUEUE, true)
>  /*
>   * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
>   */
> -SCHED_FEAT(SIS_AVG_CPU, false)
>  SCHED_FEAT(SIS_PROP, true)
>
>  /*
> --
> 2.26.2
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/4] sched/fair: Do not replace recent_used_cpu with the new target
  2020-12-08 15:35 ` [PATCH 3/4] sched/fair: Do not replace recent_used_cpu with the new target Mel Gorman
@ 2020-12-08 16:14   ` Vincent Guittot
  2020-12-10  9:40     ` Vincent Guittot
  0 siblings, 1 reply; 26+ messages in thread
From: Vincent Guittot @ 2020-12-08 16:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Aubrey Li, Barry Song,
	Juri Lelli, Valentin Schneider, Linux-ARM

On Tue, 8 Dec 2020 at 16:35, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> After select_idle_sibling, p->recent_used_cpu is set to the
> new target. However on the next wakeup, prev will be the same as
> recent_used_cpu unless the load balancer has moved the task since the last
> wakeup. It still works, but is less efficient than it can be after all
> the changes that went in since that reduce unnecessary migrations, load
> balancer changes etc.  This patch preserves recent_used_cpu for longer.
>
> With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled
>
>                           5.10.0-rc6             5.10.0-rc6
>                          baseline-v2           altrecent-v2
> Hmean     1        508.39 (   0.00%)      502.05 *  -1.25%*
> Hmean     2        986.70 (   0.00%)      983.65 *  -0.31%*
> Hmean     4       1914.55 (   0.00%)     1920.24 *   0.30%*
> Hmean     8       3702.37 (   0.00%)     3663.96 *  -1.04%*
> Hmean     16      6573.11 (   0.00%)     6545.58 *  -0.42%*
> Hmean     32     10142.57 (   0.00%)    10253.73 *   1.10%*
> Hmean     64     14348.40 (   0.00%)    12506.31 * -12.84%*
> Hmean     128    21842.59 (   0.00%)    21967.13 *   0.57%*
> Hmean     256    20813.75 (   0.00%)    21534.52 *   3.46%*
> Hmean     320    20684.33 (   0.00%)    21070.14 *   1.87%*
>
> The different was marginal except for 64 threads which showed in the
> baseline that the result was very unstable where as the patch was much
> more stable. This is somewhat machine specific as on a separate 80-cpu
> Broadwell machine the same test reported.
>
>                           5.10.0-rc6             5.10.0-rc6
>                          baseline-v2           altrecent-v2
> Hmean     1        310.36 (   0.00%)      291.81 *  -5.98%*
> Hmean     2        340.86 (   0.00%)      547.22 *  60.54%*
> Hmean     4        912.29 (   0.00%)     1063.21 *  16.54%*
> Hmean     8       2116.40 (   0.00%)     2103.60 *  -0.60%*
> Hmean     16      4232.90 (   0.00%)     4362.92 *   3.07%*
> Hmean     32      8442.03 (   0.00%)     8642.10 *   2.37%*
> Hmean     64     11733.91 (   0.00%)    11473.66 *  -2.22%*
> Hmean     128    17727.24 (   0.00%)    16784.23 *  -5.32%*
> Hmean     256    16089.23 (   0.00%)    16110.79 *   0.13%*
> Hmean     320    15992.60 (   0.00%)    16071.64 *   0.49%*
>
> schedstats were not used in this series but from an earlier debugging
> effort, the schedstats after the test run were as follows;
>
> Ops SIS Search               5653107942.00  5726545742.00
> Ops SIS Domain Search        3365067916.00  3319768543.00
> Ops SIS Scanned            112173512543.00 99194352541.00
> Ops SIS Domain Scanned     109885472517.00 96787575342.00
> Ops SIS Failures             2923185114.00  2950166441.00
> Ops SIS Recent Used Hit           56547.00   118064916.00
> Ops SIS Recent Used Miss     1590899250.00   354942791.00
> Ops SIS Recent Attempts      1590955797.00   473007707.00
> Ops SIS Search Efficiency             5.04           5.77
> Ops SIS Domain Search Eff             3.06           3.43
> Ops SIS Fast Success Rate            40.47          42.03
> Ops SIS Success Rate                 48.29          48.48
> Ops SIS Recent Success Rate           0.00          24.96
>
> First interesting point is the ridiculous number of times runqueues are
> enabled -- almost 97 billion times over the course of 40 minutes
>
> With the patch, "Recent Used Hit" is over 2000 times more likely to
> succeed. The failure rate also increases by quite a lot but the cost is
> marginal even if the "Fast Success Rate" only increases by 2% overall. What
> cannot be observed from these stats is where the biggest impact as these
> stats cover low utilisation to over saturation.
>
> If graphed over time, the graphs show that the sched domain is only
> scanned at negligible rates until the machine is fully busy. With
> low utilisation, the "Fast Success Rate" is almost 100% until the
> machine is fully busy. For 320 clients, the success rate is close to
> 0% which is unsurprising.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>  kernel/sched/fair.c | 9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5c41875aec23..413d895bbbf8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6277,17 +6277,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>
>         /* Check a recently used CPU as a potential idle candidate: */
>         recent_used_cpu = p->recent_used_cpu;
> +       p->recent_used_cpu = prev;
>         if (recent_used_cpu != prev &&
>             recent_used_cpu != target &&
>             cpus_share_cache(recent_used_cpu, target) &&
>             (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
>             cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
>             asym_fits_capacity(task_util, recent_used_cpu)) {
> -               /*
> -                * Replace recent_used_cpu with prev as it is a potential
> -                * candidate for the next wake:
> -                */
> -               p->recent_used_cpu = prev;
>                 return recent_used_cpu;
>         }
>
> @@ -6768,9 +6764,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>         } else if (wake_flags & WF_TTWU) { /* XXX always ? */
>                 /* Fast path */
>                 new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> -
> -               if (want_affine)
> -                       current->recent_used_cpu = cpu;
>         }
>         rcu_read_unlock();
>
> --
> 2.26.2
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] sched/fair: Return an idle cpu if one is found after a failed search for an idle core
  2020-12-08 15:35 ` [PATCH 4/4] sched/fair: Return an idle cpu if one is found after a failed search for an idle core Mel Gorman
@ 2020-12-08 16:15   ` Vincent Guittot
  0 siblings, 0 replies; 26+ messages in thread
From: Vincent Guittot @ 2020-12-08 16:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Aubrey Li, Barry Song,
	Juri Lelli, Valentin Schneider, Linux-ARM

On Tue, 8 Dec 2020 at 16:35, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> select_idle_core is called when SMT is active and there is likely a free
> core available. It may find idle CPUs but this information is simply
> discarded and the scan starts over again with select_idle_cpu.
>
> This patch caches information on idle CPUs found during the search for
> a core and uses one if no core is found. This is a tradeoff. There may
> be a slight impact when utilisation is low and an idle core can be
> found quickly. It provides improvements as the number of busy CPUs
> approaches 50% of the domain size when SMT is enabled.
>
> With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled
>
>                           5.10.0-rc6             5.10.0-rc6
>                            schedstat          idlecandidate
> Hmean     1        500.06 (   0.00%)      505.67 *   1.12%*
> Hmean     2        975.90 (   0.00%)      974.06 *  -0.19%*
> Hmean     4       1902.95 (   0.00%)     1904.43 *   0.08%*
> Hmean     8       3761.73 (   0.00%)     3721.02 *  -1.08%*
> Hmean     16      6713.93 (   0.00%)     6769.17 *   0.82%*
> Hmean     32     10435.31 (   0.00%)    10312.58 *  -1.18%*
> Hmean     64     12325.51 (   0.00%)    13792.01 *  11.90%*
> Hmean     128    21225.21 (   0.00%)    20963.44 *  -1.23%*
> Hmean     256    20532.83 (   0.00%)    20335.62 *  -0.96%*
> Hmean     320    20334.81 (   0.00%)    20147.25 *  -0.92%*
>
> Note that there is a significant corner case. As the SMT scan may be
> terminated early, not all CPUs have been visited and select_idle_cpu()
> is still called for a full scan. This case is handled in the next
> patch.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>  kernel/sched/fair.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 413d895bbbf8..ed6f45832d97 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6066,6 +6066,7 @@ void __update_idle_core(struct rq *rq)
>   */
>  static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
>  {
> +       int idle_candidate = -1;
>         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>         int core, cpu;
>
> @@ -6085,6 +6086,11 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>                                 idle = false;
>                                 break;
>                         }
> +
> +                       if (idle_candidate == -1 &&
> +                           cpumask_test_cpu(cpu, p->cpus_ptr)) {
> +                               idle_candidate = cpu;
> +                       }
>                 }
>
>                 if (idle)
> @@ -6098,7 +6104,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>          */
>         set_idle_cores(target, 0);
>
> -       return -1;
> +       return idle_candidate;
>  }
>
>  /*
> --
> 2.26.2
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP
  2020-12-08 16:03   ` Vincent Guittot
@ 2020-12-08 16:30     ` Mel Gorman
  2020-12-09  5:28     ` Li, Aubrey
  1 sibling, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2020-12-08 16:30 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Aubrey Li, Barry Song,
	Juri Lelli, Valentin Schneider, Linux-ARM

On Tue, Dec 08, 2020 at 05:03:21PM +0100, Vincent Guittot wrote:
> On Tue, 8 Dec 2020 at 16:35, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
> > even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
> > check and while we are at it, exclude the cost of initialising the CPU
> > mask from the average scan cost.
> >
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > ---
> >  kernel/sched/fair.c | 14 ++++++++------
> >  1 file changed, 8 insertions(+), 6 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ac7b34e7372b..5c41875aec23 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >         if (!this_sd)
> >                 return -1;
> 
> Just noticed while reviewing the patch that the above related to
> this_sd can also go under sched_feat(SIS_PROP)
> 

Technically yes but I also decided against it. It's a functional difference
depending on whether SIS_PROP is set in the highly unlikely case that
this_sd == NULL. I was also thinking in terms of what happens if SIS_PROP
was disabled and enabled while a search is in progress even if it's very
unlikely. In that case, this_sd would be uninitialised. That might be
impossible in practice depending on how static branching is implemented
but I don't think we should rely on the internals of jump labels and play
it safe. I can move it in if you feel strongly about it but I think the
disable/enable race is enough of a concern to leave it alone.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP
  2020-12-08 16:03   ` Vincent Guittot
  2020-12-08 16:30     ` Mel Gorman
@ 2020-12-09  5:28     ` Li, Aubrey
  2020-12-09  9:05       ` Mel Gorman
  1 sibling, 1 reply; 26+ messages in thread
From: Li, Aubrey @ 2020-12-09  5:28 UTC (permalink / raw)
  To: Vincent Guittot, Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Barry Song, Juri Lelli,
	Valentin Schneider, Linux-ARM

On 2020/12/9 0:03, Vincent Guittot wrote:
> On Tue, 8 Dec 2020 at 16:35, Mel Gorman <mgorman@techsingularity.net> wrote:
>>
>> As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
>> even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
>> check and while we are at it, exclude the cost of initialising the CPU
>> mask from the average scan cost.
>>
>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>> ---
>>  kernel/sched/fair.c | 14 ++++++++------
>>  1 file changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ac7b34e7372b..5c41875aec23 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>         if (!this_sd)
>>                 return -1;
> 
> Just noticed while reviewing the patch that the above related to
> this_sd can also go under sched_feat(SIS_PROP)
> 
>>
>> +       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>> +
>>         if (sched_feat(SIS_PROP)) {
>>                 u64 avg_cost, avg_idle, span_avg;
>>
>> @@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>                         nr = div_u64(span_avg, avg_cost);
>>                 else
>>                         nr = 4;
>> -       }
>> -
>> -       time = cpu_clock(this);
>>
>> -       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>> +               time = cpu_clock(this);
>> +       }
>>
>>         for_each_cpu_wrap(cpu, cpus, target) {
>>                 if (!--nr)

nr is the key to this throttling mechanism and needs to be placed under sched_feat(SIS_PROP) as well.

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP
  2020-12-09  5:28     ` Li, Aubrey
@ 2020-12-09  9:05       ` Mel Gorman
  2020-12-09 11:07         ` Li, Aubrey
  0 siblings, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2020-12-09  9:05 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Vincent Guittot, Peter Zijlstra, Ingo Molnar, LKML, Barry Song,
	Juri Lelli, Valentin Schneider, Linux-ARM

On Wed, Dec 09, 2020 at 01:28:11PM +0800, Li, Aubrey wrote:
> >>                         nr = div_u64(span_avg, avg_cost);
> >>                 else
> >>                         nr = 4;
> >> -       }
> >> -
> >> -       time = cpu_clock(this);
> >>
> >> -       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >> +               time = cpu_clock(this);
> >> +       }
> >>
> >>         for_each_cpu_wrap(cpu, cpus, target) {
> >>                 if (!--nr)
> 
> nr is the key of this throttling mechanism, need to be placed under sched_feat(SIS_PROP) as well.
> 

It isn't necessary as nr is initialised to INT_MAX if !SIS_PROP.
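
For reference, a trimmed-down view of the relevant lines in
select_idle_cpu() after patches 1 and 2 (paraphrased, not a complete
listing):

	int cpu, nr = INT_MAX;		/* only lowered under SIS_PROP */
	...
	for_each_cpu_wrap(cpu, cpus, target) {
		if (!--nr)		/* effectively never fires when nr
					   started out as INT_MAX */
			return -1;
		...
	}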

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP
  2020-12-09  9:05       ` Mel Gorman
@ 2020-12-09 11:07         ` Li, Aubrey
  2020-12-09 11:33           ` Mel Gorman
  0 siblings, 1 reply; 26+ messages in thread
From: Li, Aubrey @ 2020-12-09 11:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Peter Zijlstra, Ingo Molnar, LKML, Barry Song,
	Juri Lelli, Valentin Schneider, Linux-ARM

On 2020/12/9 17:05, Mel Gorman wrote:
> On Wed, Dec 09, 2020 at 01:28:11PM +0800, Li, Aubrey wrote:
>>>>                         nr = div_u64(span_avg, avg_cost);
>>>>                 else
>>>>                         nr = 4;
>>>> -       }
>>>> -
>>>> -       time = cpu_clock(this);
>>>>
>>>> -       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>>> +               time = cpu_clock(this);
>>>> +       }
>>>>
>>>>         for_each_cpu_wrap(cpu, cpus, target) {
>>>>                 if (!--nr)
>>
>> nr is the key of this throttling mechanism, need to be placed under sched_feat(SIS_PROP) as well.
>>
> 
> It isn't necessary as nr is initialised to INT_MAX if !SIS_PROP.

If !SIS_PROP, nr still needs to be decremented and then tested in the
loop, instead of testing directly. But with SIS_PROP, a test would need
to be added in the loop. Since SIS_PROP is true by default, I think it's
okay to keep the current way.

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP
  2020-12-09 11:07         ` Li, Aubrey
@ 2020-12-09 11:33           ` Mel Gorman
  0 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2020-12-09 11:33 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Vincent Guittot, Peter Zijlstra, Ingo Molnar, LKML, Barry Song,
	Juri Lelli, Valentin Schneider, Linux-ARM

On Wed, Dec 09, 2020 at 07:07:11PM +0800, Li, Aubrey wrote:
> On 2020/12/9 17:05, Mel Gorman wrote:
> > On Wed, Dec 09, 2020 at 01:28:11PM +0800, Li, Aubrey wrote:
> >>>>                         nr = div_u64(span_avg, avg_cost);
> >>>>                 else
> >>>>                         nr = 4;
> >>>> -       }
> >>>> -
> >>>> -       time = cpu_clock(this);
> >>>>
> >>>> -       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >>>> +               time = cpu_clock(this);
> >>>> +       }
> >>>>
> >>>>         for_each_cpu_wrap(cpu, cpus, target) {
> >>>>                 if (!--nr)
> >>
> >> nr is the key of this throttling mechanism, need to be placed under sched_feat(SIS_PROP) as well.
> >>
> > 
> > It isn't necessary as nr is initialised to INT_MAX if !SIS_PROP.
>
> If !SIS_PROP, nr still needs to be decremented and then tested in the
> loop, instead of testing directly. But with SIS_PROP, a test would need
> to be added in the loop. Since SIS_PROP is true by default, I think
> it's okay to keep the current way.
> 

It's because it's true by default and the cost is negligible that I'm
leaving it alone. The branch and nr accounting costs are negligible and
it avoids peppering select_idle_cpu() with too many SIS_PROP checks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling
  2020-12-08 15:34 [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling Mel Gorman
                   ` (3 preceding siblings ...)
  2020-12-08 15:35 ` [PATCH 4/4] sched/fair: Return an idle cpu if one is found after a failed search for an idle core Mel Gorman
@ 2020-12-09 14:37 ` Mel Gorman
  2020-12-10  8:00   ` Vincent Guittot
       [not found] ` <20201211062542.3082-1-hdanton@sina.com>
  5 siblings, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2020-12-09 14:37 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, LKML
  Cc: Aubrey Li, Barry Song, Juri Lelli, Vincent Guittot,
	Valentin Schneider, Linux-ARM

On Tue, Dec 08, 2020 at 03:34:57PM +0000, Mel Gorman wrote:
> Changelog since v1
> o Drop single-pass patch						(vincent)
> o Scope variables used for SIS_AVG_CPU					(dietmar)
> o Remove redundant assignment						(dietmar
> 
> This reduces the amount of runqueue scanning in select_idle_sibling in
> the worst case.
> 
> Patch 1 removes SIS_AVG_CPU because it's unused.
> 
> Patch 2 moves all SIS_PROP-related calculations under SIS_PROP
> 
> Patch 3 improves the hit rate of p->recent_used_cpu to reduce the amount
> 	of scanning. It should be relatively uncontroversial
> 
> Patch 4 returns an idle candidate if one is found while scanning for a
> 	free core.
> 

Any other objections to the series? Vincent marked 1, 3 and 4 as
reviewed. While patch 2 had some mild cosmetic concerns, I think the
version as it stands is fine in how it treats SIS_PROP: it keeps the
!SIS_PROP case functionally equivalent without adding too many SIS_PROP
checks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP
  2020-12-08 15:34 ` [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP Mel Gorman
  2020-12-08 16:03   ` Vincent Guittot
@ 2020-12-10  5:18   ` Li, Aubrey
  2020-12-10  9:32     ` Mel Gorman
  1 sibling, 1 reply; 26+ messages in thread
From: Li, Aubrey @ 2020-12-10  5:18 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra, Ingo Molnar, LKML
  Cc: Barry Song, Juri Lelli, Vincent Guittot, Valentin Schneider, Linux-ARM

On 2020/12/8 23:34, Mel Gorman wrote:
> As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
> even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
> check and while we are at it, exclude the cost of initialising the CPU
> mask from the average scan cost.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  kernel/sched/fair.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ac7b34e7372b..5c41875aec23 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  	if (!this_sd)
>  		return -1;
>  
> +	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +
>  	if (sched_feat(SIS_PROP)) {
>  		u64 avg_cost, avg_idle, span_avg;
>  
> @@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  			nr = div_u64(span_avg, avg_cost);
>  		else
>  			nr = 4;
> -	}
> -
> -	time = cpu_clock(this);
>  
> -	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +		time = cpu_clock(this);
> +	}
>  
>  	for_each_cpu_wrap(cpu, cpus, target) {
>  		if (!--nr)
>			return -1;

I thought about this again and this does not seem consistent:
- even if nr is reduced to 0, shouldn't avg_scan_cost be updated as well
  before returning -1? (see the sketch below)
- if avg_scan_cost is not updated because nr is throttled, the first
	time = cpu_clock(this);
  can be optimized, as nr is already calculated and we know which of the
  cpumask weight and nr is greater.
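
Paraphrasing the code after patch 2 to show what the first point above
refers to (trimmed, not a complete listing):

	for_each_cpu_wrap(cpu, cpus, target) {
		if (!--nr)
			return -1;	/* returns before update_avg() below */
		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
			break;
	}

	if (sched_feat(SIS_PROP)) {
		time = cpu_clock(this) - time;
		update_avg(&this_sd->avg_scan_cost, time);	/* skipped when throttled */
	}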

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling
  2020-12-09 14:37 ` [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling Mel Gorman
@ 2020-12-10  8:00   ` Vincent Guittot
  2020-12-10  9:38     ` Vincent Guittot
  0 siblings, 1 reply; 26+ messages in thread
From: Vincent Guittot @ 2020-12-10  8:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Aubrey Li, Barry Song,
	Juri Lelli, Valentin Schneider, Linux-ARM

On Wed, 9 Dec 2020 at 15:37, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Tue, Dec 08, 2020 at 03:34:57PM +0000, Mel Gorman wrote:
> > Changelog since v1
> > o Drop single-pass patch                                              (vincent)
> > o Scope variables used for SIS_AVG_CPU                                        (dietmar)
> > o Remove redundant assignment                                         (dietmar
> >
> > This reduces the amount of runqueue scanning in select_idle_sibling in
> > the worst case.
> >
> > Patch 1 removes SIS_AVG_CPU because it's unused.
> >
> > Patch 2 moves all SIS_PROP-related calculations under SIS_PROP
> >
> > Patch 3 improves the hit rate of p->recent_used_cpu to reduce the amount
> >       of scanning. It should be relatively uncontroversial
> >
> > Patch 4 returns an idle candidate if one is found while scanning for a
> >       free core.
> >
>
> Any other objections to the series? Vincent marked 1, 3 and 4 as
> reviewed. While patch 2 had some mild cosmetic concerns, I think the
> version and how it treats SIS_PROP is fine as it is to keep it
> functionally equivalent to !SIS_PROP and without adding too many
> SIS_PROP checks.

While testing your patchset and Aubrey's one on top of tip, I'm seeing
some perf regression on my arm64 NUMA system on hackbench and reaim.
The regression seems to come from your patchset but I don't know
which patch in particular yet.

hackbench -l 256000 -g 1

v5.10-rc7 + tip/sched/core 13.255 (+/- 3.22%)
with your patchset         15.368 (+/- 2.74%)  -15.9%

I'm also seeing a perf regression on reaim but this one needs more
investigation before confirming.

TBH, I was not expecting regressions. I'm running more tests to find
which patch is the culprit.


>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP
  2020-12-10  5:18   ` Li, Aubrey
@ 2020-12-10  9:32     ` Mel Gorman
  0 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2020-12-10  9:32 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Barry Song, Juri Lelli,
	Vincent Guittot, Valentin Schneider, Linux-ARM

On Thu, Dec 10, 2020 at 01:18:05PM +0800, Li, Aubrey wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ac7b34e7372b..5c41875aec23 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >  	if (!this_sd)
> >  		return -1;
> >  
> > +	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > +
> >  	if (sched_feat(SIS_PROP)) {
> >  		u64 avg_cost, avg_idle, span_avg;
> >  
> > @@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >  			nr = div_u64(span_avg, avg_cost);
> >  		else
> >  			nr = 4;
> > -	}
> > -
> > -	time = cpu_clock(this);
> >  
> > -	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > +		time = cpu_clock(this);
> > +	}
> >  
> >  	for_each_cpu_wrap(cpu, cpus, target) {
> >  		if (!--nr)
> >			return -1;
> 
> I thought about this again and here seems not to be consistent:
> - even if nr reduces to 0, shouldn't avg_scan_cost be updated as well before return -1?

You're right, but it's outside the scope
of this patch. I noted that this was a problem in
lore.kernel.org/r/20201203141124.7391-8-mgorman@techsingularity.net
It's neither a consistent win nor a loss to always account for it and so
it was dropped for this series to keep the number of controversial
patches to a minimum.

> - if avg_scan_cost is not updated because nr is throttled, the first 
> 	time = cpu_clock(this);
>   can be optimized. As nr is calculated and we already know which of the weight of cpumask and nr is greater.
> 

That is also outside the scope of this patch. To do that, cpumask_weight()
would have to be calculated but it's likely to be a net loss. Even under
light load, nr will be smaller than the domain weight, incurring both the
cost of cpumask_weight() and the clock read in the common case.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling
  2020-12-10  8:00   ` Vincent Guittot
@ 2020-12-10  9:38     ` Vincent Guittot
  2020-12-10 11:04       ` Mel Gorman
  0 siblings, 1 reply; 26+ messages in thread
From: Vincent Guittot @ 2020-12-10  9:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Aubrey Li, Barry Song,
	Juri Lelli, Valentin Schneider, Linux-ARM

On Thu, 10 Dec 2020 at 09:00, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Wed, 9 Dec 2020 at 15:37, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Tue, Dec 08, 2020 at 03:34:57PM +0000, Mel Gorman wrote:
> > > Changelog since v1
> > > o Drop single-pass patch                                              (vincent)
> > > o Scope variables used for SIS_AVG_CPU                                        (dietmar)
> > > o Remove redundant assignment                                         (dietmar
> > >
> > > This reduces the amount of runqueue scanning in select_idle_sibling in
> > > the worst case.
> > >
> > > Patch 1 removes SIS_AVG_CPU because it's unused.
> > >
> > > Patch 2 moves all SIS_PROP-related calculations under SIS_PROP
> > >
> > > Patch 3 improves the hit rate of p->recent_used_cpu to reduce the amount
> > >       of scanning. It should be relatively uncontroversial
> > >
> > > Patch 4 returns an idle candidate if one is found while scanning for a
> > >       free core.
> > >
> >
> > Any other objections to the series? Vincent marked 1, 3 and 4 as
> > reviewed. While patch 2 had some mild cosmetic concerns, I think the
> > version and how it treats SIS_PROP is fine as it is to keep it
> > functionally equivalent to !SIS_PROP and without adding too many
> > SIS_PROP checks.
>
> while testing your patchset and Aubrey one on top of tip, I'm facing
> some perf regression on my arm64 numa system on hackbench and reaim.
> The regression seems to comes from your patchset but i don't know
> which patch in particular yet
>
> hackbench -l 256000 -g 1
>
> v5.10-rc7 + tip/sched/core 13,255(+/- 3.22%)
> with your patchset         15.368(+/- 2.74)  -15.9%
>
> I'm also seeing perf regression on reaim but this one needs more
> investigation before confirming
>
> TBH, I was not expecting regressions. I'm running more test to find
> which patch is the culprit

The regression comes from patch 3: sched/fair: Do not replace
recent_used_cpu with the new target


>
>
> >
> > --
> > Mel Gorman
> > SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/4] sched/fair: Do not replace recent_used_cpu with the new target
  2020-12-08 16:14   ` Vincent Guittot
@ 2020-12-10  9:40     ` Vincent Guittot
  0 siblings, 0 replies; 26+ messages in thread
From: Vincent Guittot @ 2020-12-10  9:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Aubrey Li, Barry Song,
	Juri Lelli, Valentin Schneider, Linux-ARM

On Tue, 8 Dec 2020 at 17:14, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> On Tue, 8 Dec 2020 at 16:35, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > After select_idle_sibling, p->recent_used_cpu is set to the
> > new target. However on the next wakeup, prev will be the same as
> > recent_used_cpu unless the load balancer has moved the task since the last
> > wakeup. It still works, but is less efficient than it can be after all
> > the changes that went in since that reduce unnecessary migrations, load
> > balancer changes etc.  This patch preserves recent_used_cpu for longer.
> >
> > With tbench on a 2-socket CascadeLake machine, 80 logical CPUs, HT enabled
> >
> >                           5.10.0-rc6             5.10.0-rc6
> >                          baseline-v2           altrecent-v2
> > Hmean     1        508.39 (   0.00%)      502.05 *  -1.25%*
> > Hmean     2        986.70 (   0.00%)      983.65 *  -0.31%*
> > Hmean     4       1914.55 (   0.00%)     1920.24 *   0.30%*
> > Hmean     8       3702.37 (   0.00%)     3663.96 *  -1.04%*
> > Hmean     16      6573.11 (   0.00%)     6545.58 *  -0.42%*
> > Hmean     32     10142.57 (   0.00%)    10253.73 *   1.10%*
> > Hmean     64     14348.40 (   0.00%)    12506.31 * -12.84%*
> > Hmean     128    21842.59 (   0.00%)    21967.13 *   0.57%*
> > Hmean     256    20813.75 (   0.00%)    21534.52 *   3.46%*
> > Hmean     320    20684.33 (   0.00%)    21070.14 *   1.87%*
> >
> > The different was marginal except for 64 threads which showed in the
> > baseline that the result was very unstable where as the patch was much
> > more stable. This is somewhat machine specific as on a separate 80-cpu
> > Broadwell machine the same test reported.
> >
> >                           5.10.0-rc6             5.10.0-rc6
> >                          baseline-v2           altrecent-v2
> > Hmean     1        310.36 (   0.00%)      291.81 *  -5.98%*
> > Hmean     2        340.86 (   0.00%)      547.22 *  60.54%*
> > Hmean     4        912.29 (   0.00%)     1063.21 *  16.54%*
> > Hmean     8       2116.40 (   0.00%)     2103.60 *  -0.60%*
> > Hmean     16      4232.90 (   0.00%)     4362.92 *   3.07%*
> > Hmean     32      8442.03 (   0.00%)     8642.10 *   2.37%*
> > Hmean     64     11733.91 (   0.00%)    11473.66 *  -2.22%*
> > Hmean     128    17727.24 (   0.00%)    16784.23 *  -5.32%*
> > Hmean     256    16089.23 (   0.00%)    16110.79 *   0.13%*
> > Hmean     320    15992.60 (   0.00%)    16071.64 *   0.49%*
> >
> > schedstats were not used in this series but from an earlier debugging
> > effort, the schedstats after the test run were as follows;
> >
> > Ops SIS Search               5653107942.00  5726545742.00
> > Ops SIS Domain Search        3365067916.00  3319768543.00
> > Ops SIS Scanned            112173512543.00 99194352541.00
> > Ops SIS Domain Scanned     109885472517.00 96787575342.00
> > Ops SIS Failures             2923185114.00  2950166441.00
> > Ops SIS Recent Used Hit           56547.00   118064916.00
> > Ops SIS Recent Used Miss     1590899250.00   354942791.00
> > Ops SIS Recent Attempts      1590955797.00   473007707.00
> > Ops SIS Search Efficiency             5.04           5.77
> > Ops SIS Domain Search Eff             3.06           3.43
> > Ops SIS Fast Success Rate            40.47          42.03
> > Ops SIS Success Rate                 48.29          48.48
> > Ops SIS Recent Success Rate           0.00          24.96
> >
> > First interesting point is the ridiculous number of times runqueues are
> > enabled -- almost 97 billion times over the course of 40 minutes
> >
> > With the patch, "Recent Used Hit" is over 2000 times more likely to
> > succeed. The failure rate also increases by quite a lot but the cost is
> > marginal even if the "Fast Success Rate" only increases by 2% overall. What
> > cannot be observed from these stats is where the biggest impact is, as the
> > stats cover everything from low utilisation to over-saturation.
> >
> > Graphed over time, the results show that the sched domain is only
> > scanned at negligible rates until the machine is fully busy. At
> > low utilisation, the "Fast Success Rate" is almost 100% until the
> > machine is fully busy. For 320 clients, the success rate is close to
> > 0%, which is unsurprising.
> >
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

This patch is responsible for a performance regression on my thx2 with
hackbench. So although I reviewed it, it should not be applied, as the
change in behavior is far deeper than expected.

>
> > ---
> >  kernel/sched/fair.c | 9 +--------
> >  1 file changed, 1 insertion(+), 8 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 5c41875aec23..413d895bbbf8 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6277,17 +6277,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >
> >         /* Check a recently used CPU as a potential idle candidate: */
> >         recent_used_cpu = p->recent_used_cpu;
> > +       p->recent_used_cpu = prev;
> >         if (recent_used_cpu != prev &&
> >             recent_used_cpu != target &&
> >             cpus_share_cache(recent_used_cpu, target) &&
> >             (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
> >             cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
> >             asym_fits_capacity(task_util, recent_used_cpu)) {
> > -               /*
> > -                * Replace recent_used_cpu with prev as it is a potential
> > -                * candidate for the next wake:
> > -                */
> > -               p->recent_used_cpu = prev;
> >                 return recent_used_cpu;
> >         }
> >
> > @@ -6768,9 +6764,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> >         } else if (wake_flags & WF_TTWU) { /* XXX always ? */
> >                 /* Fast path */
> >                 new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> > -
> > -               if (want_affine)
> > -                       current->recent_used_cpu = cpu;
> >         }
> >         rcu_read_unlock();
> >
> > --
> > 2.26.2
> >

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling
  2020-12-10  9:38     ` Vincent Guittot
@ 2020-12-10 11:04       ` Mel Gorman
  2020-12-11  9:51         ` Vincent Guittot
  0 siblings, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2020-12-10 11:04 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Ziljstra, Ingo Molnar, LKML, Aubrey Li, Barry Song,
	Juri Lelli, Valentin Schneider, Linux-ARM

On Thu, Dec 10, 2020 at 10:38:37AM +0100, Vincent Guittot wrote:
> > While testing your patchset and Aubrey's one on top of tip, I'm facing
> > a perf regression on my arm64 NUMA system with hackbench and reaim.
> > The regression seems to come from your patchset but I don't know
> > which patch in particular yet.
> >
> > hackbench -l 256000 -g 1
> >
> > v5.10-rc7 + tip/sched/core 13.255 (+/- 3.22%)
> > with your patchset         15.368 (+/- 2.74%)  -15.9%
> >
> > I'm also seeing a perf regression on reaim but this one needs more
> > investigation before confirming.
> >
> > TBH, I was not expecting regressions. I'm running more tests to find
> > which patch is the culprit.
> 
> The regression comes from patch 3: sched/fair: Do not replace
> recent_used_cpu with the new target
> 

That's not entirely surprising. The intent of the patch is to increase the
hit rate of p->recent_used_cpu but it's not a guaranteed win due to two
corner cases. If multiple tasks have the same p->recent_used_cpu, they can
race to use that CPU and stack as a result instead of searching the domain.
If SMT is enabled then p->recent_used_cpu can point to an idle CPU that has
a busy sibling which the search would have avoided in select_idle_core().
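
To make those corner cases concrete, the shortcut boils down to
something like this (a simplified sketch, not the exact hunk from the
patch):

        recent_used_cpu = p->recent_used_cpu;
        if (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) {
                /*
                 * 1) Several wakeups can observe the same recent_used_cpu
                 *    as idle before any of the woken tasks is enqueued
                 *    there, so they stack on it instead of scanning the
                 *    domain.
                 * 2) With SMT, recent_used_cpu can be idle while its
                 *    sibling is busy; select_idle_core() would have
                 *    skipped that core.
                 */
                return recent_used_cpu;
        }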

I think you are using processes and sockets for hackbench but as you'll
see later, hackbench can be used both to show losses and gains.

I originally tested with 6 machines covering Broadwell (2 socket), Haswell
(2 socket), Skylake (1 socket), Cascadelake (2 socket), EPYC (2 socket)
and EPYC 2 (2 socket) with a range of workloads including hackbench. Of
those, just one reported a major problem with 1 group -- the EPYC 1 machine

EPYC hackbench process-sockets
                          5.10.0-rc6             5.10.0-rc6
                       baseline-v2r2         altrecent-v2r5
Amean     1        1.0607 (   0.00%)      1.1480 (  -8.23%)
Amean     4        1.3277 (   0.00%)      1.3117 (   1.21%)
Amean     7        1.6940 (   0.00%)      1.6950 (  -0.06%)
Amean     12       2.1600 (   0.00%)      2.1367 (   1.08%)
Amean     21       3.2450 (   0.00%)      3.5883 ( -10.58%)
Amean     30       4.1673 (   0.00%)      3.9653 (   4.85%)
Amean     48       4.9257 (   0.00%)      5.0000 (  -1.51%)
Amean     79       7.4950 (   0.00%)      7.4563 (   0.52%)
Amean     110     10.4233 (   0.00%)     10.4727 (  -0.47%)
Amean     141     13.4690 (   0.00%)     13.4563 (   0.09%)
Amean     172     16.6450 (   0.00%)     16.6033 (   0.25%)
Amean     203     19.4873 (   0.00%)     19.7893 *  -1.55%*
Amean     234     22.5507 (   0.00%)     22.8033 (  -1.12%)
Amean     265     25.3380 (   0.00%)     25.6490 (  -1.23%)
Amean     296     28.0070 (   0.00%)     28.1270 (  -0.43%)

That's showing an 8% loss for 1 group and also a problem with 21 groups.
Otherwise, it was more or less flat. EPYC 2 also showed a 2% loss for 1
group and 9% loss for 21 groups (probably related to the size of the LLC
domain as there are many LLCs per socket on EPYC*).
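
As an aside on reading these tables: the Amean rows are elapsed times so
lower is better, and the percentage is relative to the baseline, e.g.
(1.0607 - 1.1480) / 1.0607 is roughly the -8.23% reported for 1 group.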

For the *same* machine running hackbench using pipes instead of sockets
we get

EPYC hackbench process-pipes
Amean     1        0.9497 (   0.00%)      0.9517 (  -0.21%)
Amean     4        1.2253 (   0.00%)      1.1387 (   7.07%)
Amean     7        2.0677 (   0.00%)      1.7460 *  15.56%*
Amean     12       2.8717 (   0.00%)      2.4797 *  13.65%*
Amean     21       4.4053 (   0.00%)      3.7463 *  14.96%*
Amean     30       5.3983 (   0.00%)      4.1097 *  23.87%*
Amean     48       6.1050 (   0.00%)      4.6873 *  23.22%*
Amean     79       7.5640 (   0.00%)      6.8493 (   9.45%)
Amean     110     12.2627 (   0.00%)      9.4613 *  22.84%*
Amean     141     16.9980 (   0.00%)     13.8327 *  18.62%*
Amean     172     21.5280 (   0.00%)     17.3693 *  19.32%*
Amean     203     25.4480 (   0.00%)     20.9947 *  17.50%*
Amean     234     29.6570 (   0.00%)     24.9613 *  15.83%*
Amean     265     33.0713 (   0.00%)     28.1103 *  15.00%*
Amean     296     37.4443 (   0.00%)     31.8757 *  14.87%*

So even on the *same hardware*, hackbench can show very different results
depending on how it is run.

The rest of the machines were more or less neutral for this patch. Once
hackbench saturates the machine, the hit rate on recent_used_cpu is going
to be low

1-socket skylake
Amean     1       1.3183 (   0.00%)      1.2827 *   2.71%*
Amean     3       3.6750 (   0.00%)      3.6610 (   0.38%)
Amean     5       6.1003 (   0.00%)      6.0190 *   1.33%*
Amean     7       8.6063 (   0.00%)      8.6047 (   0.02%)
Amean     12     14.9480 (   0.00%)     15.0327 (  -0.57%)
Amean     18     22.3430 (   0.00%)     22.6680 (  -1.45%)
Amean     24     29.4970 (   0.00%)     29.6677 (  -0.58%)
Amean     30     36.7373 (   0.00%)     36.3687 (   1.00%)
Amean     32     39.0973 (   0.00%)     39.4417 (  -0.88%)

Shows a 2.71% gain for one group, otherwise more or less neutral

2-socket CascadeLake

Amean     1        0.3663 (   0.00%)      0.3657 (   0.18%)
Amean     4        0.7510 (   0.00%)      0.7793 (  -3.77%)
Amean     7        1.2650 (   0.00%)      1.2583 (   0.53%)
Amean     12       1.9510 (   0.00%)      1.9450 (   0.31%)
Amean     21       2.9677 (   0.00%)      3.0277 (  -2.02%)
Amean     30       4.2993 (   0.00%)      4.0237 *   6.41%*
Amean     48       6.5373 (   0.00%)      6.2987 *   3.65%*
Amean     79      10.5513 (   0.00%)     10.3280 (   2.12%)
Amean     110     15.8567 (   0.00%)     13.9817 (  11.82%)
Amean     141     17.4243 (   0.00%)     17.3177 (   0.61%)
Amean     172     21.0473 (   0.00%)     20.9760 (   0.34%)
Amean     203     25.1070 (   0.00%)     25.1150 (  -0.03%)
Amean     234     28.6753 (   0.00%)     28.9383 (  -0.92%)
Amean     265     32.7970 (   0.00%)     32.9663 (  -0.52%)
Amean     296     36.6510 (   0.00%)     36.6753 (  -0.07%)

Neutral for 1 group, small regression for 4 groups, few gains around the
middle, neutral when over-saturated.

select_idle_sibling is a curse because it's very rare that a change to
it is a universal win. On balance, I think it's better to avoid searching
the domain at all where possible even if there are cases where searching
can have a benefit such as finding an idle core instead of picking an
idle CPU with a busy sibling via p->recent_used_cpu.
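
For the SMT case, what the p->recent_used_cpu shortcut skips is roughly
this test (a simplified sketch of the per-core check in
select_idle_core(), not the exact code):

        for_each_cpu(cpu, cpu_smt_mask(core)) {
                if (!available_idle_cpu(cpu)) {
                        /* One busy sibling disqualifies the whole core. */
                        idle = false;
                        break;
                }
        }

The shortcut never makes that check, so it can return an idle CPU on a
core that select_idle_core() would have rejected.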

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/4] sched/fair: Do not replace recent_used_cpu with the new target
       [not found] ` <20201211062542.3082-1-hdanton@sina.com>
@ 2020-12-11  9:02   ` Mel Gorman
       [not found]   ` <20201211093443.3259-1-hdanton@sina.com>
  1 sibling, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2020-12-11  9:02 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Peter Ziljstra, Ingo Molnar, LKML, Aubrey Li, Barry Song,
	Juri Lelli, Vincent Guittot, Valentin Schneider, Linux-ARM

On Fri, Dec 11, 2020 at 02:25:42PM +0800, Hillf Danton wrote:
> On Tue,  8 Dec 2020 15:35:00 +0000 Mel Gorman wrote:
> > @@ -6277,17 +6277,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >  
> >  	/* Check a recently used CPU as a potential idle candidate: */
> >  	recent_used_cpu = p->recent_used_cpu;
> > +	p->recent_used_cpu = prev;
> >  	if (recent_used_cpu != prev &&
> >  	    recent_used_cpu != target &&
> >  	    cpus_share_cache(recent_used_cpu, target) &&
> >  	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
> >  	    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
> 
> Typo? Fix it in a respin if so.
> 

What typo?

> >  	    asym_fits_capacity(task_util, recent_used_cpu)) {
> > -		/*
> > -		 * Replace recent_used_cpu with prev as it is a potential
> > -		 * candidate for the next wake:
> > -		 */
> > -		p->recent_used_cpu = prev;
> >  		return recent_used_cpu;
> 
> I prefer to update the recent CPU after the llc check.
> 

That would prevent recent_used_cpu from ever leaving the LLC the task
first started on.
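
i.e. if the update were only done once cpus_share_cache() has passed,
something like this (a sketch of the suggestion as I read it, not a
real patch):

        if (recent_used_cpu != prev &&
            recent_used_cpu != target &&
            cpus_share_cache(recent_used_cpu, target) &&
            (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu))) {
                p->recent_used_cpu = prev;      /* updated only on a hit */
                return recent_used_cpu;
        }

then once the task migrates to another LLC, cpus_share_cache() keeps
failing against the stale recent_used_cpu and it is never refreshed.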

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/4] sched/fair: Do not replace recent_used_cpu with the new target
       [not found]   ` <20201211093443.3259-1-hdanton@sina.com>
@ 2020-12-11  9:45     ` Mel Gorman
  0 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2020-12-11  9:45 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Peter Ziljstra, Ingo Molnar, LKML, Aubrey Li, Barry Song,
	Juri Lelli, Vincent Guittot, Valentin Schneider, Linux-ARM

On Fri, Dec 11, 2020 at 05:34:43PM +0800, Hillf Danton wrote:
> On Fri, 11 Dec 2020 09:02:28 +0000 Mel Gorman wrote:
> >On Fri, Dec 11, 2020 at 02:25:42PM +0800, Hillf Danton wrote:
> >> On Tue,  8 Dec 2020 15:35:00 +0000 Mel Gorman wrote:
> >> > @@ -6277,17 +6277,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >> >  
> >> >  	/* Check a recently used CPU as a potential idle candidate: */
> >> >  	recent_used_cpu = p->recent_used_cpu;
> >> > +	p->recent_used_cpu = prev;
> >> >  	if (recent_used_cpu != prev &&
> >> >  	    recent_used_cpu != target &&
> >> >  	    cpus_share_cache(recent_used_cpu, target) &&
> >> >  	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
> >> >  	    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
> >> 
> >> Typo? Fix it in a respin if so.
> >> 
> >
> >What typo?
> 
> After your change it is prev that we check against p->cpus_ptr instead of
> the recent CPU. I wonder what the point of such a check is when returning
> the recent one.

Ah... yes, this is indeed wrong. It wouldn't affect Vincent's case
that showed a problem with a hackbench configuration (which I'm still
disappointed about as it's a trade-off depending on machine and workload)
but it allows a task to run on the wrong CPU if sched_setaffinity()
was called between wakeup events.
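
For completeness, the fixup on top of this patch is just to test the
local variable instead of the now-reassigned field, i.e. (untested):

-	    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
+	    cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) &&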

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling
  2020-12-10 11:04       ` Mel Gorman
@ 2020-12-11  9:51         ` Vincent Guittot
  2020-12-11 10:23           ` Mel Gorman
  0 siblings, 1 reply; 26+ messages in thread
From: Vincent Guittot @ 2020-12-11  9:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Ziljstra, Ingo Molnar, LKML, Aubrey Li, Barry Song,
	Juri Lelli, Valentin Schneider, Linux-ARM

On Thu, 10 Dec 2020 at 12:04, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Thu, Dec 10, 2020 at 10:38:37AM +0100, Vincent Guittot wrote:
> > > While testing your patchset and Aubrey's one on top of tip, I'm facing
> > > a perf regression on my arm64 NUMA system with hackbench and reaim.
> > > The regression seems to come from your patchset but I don't know
> > > which patch in particular yet.
> > >
> > > hackbench -l 256000 -g 1
> > >
> > > v5.10-rc7 + tip/sched/core 13.255 (+/- 3.22%)
> > > with your patchset         15.368 (+/- 2.74%)  -15.9%
> > >
> > > I'm also seeing a perf regression on reaim but this one needs more
> > > investigation before confirming.
> > >
> > > TBH, I was not expecting regressions. I'm running more tests to find
> > > which patch is the culprit.
> >
> > The regression comes from patch 3: sched/fair: Do not replace
> > recent_used_cpu with the new target
> >
>
> That's not entirely surprising. The intent of the patch is to increase the
> hit rate of p->recent_used_cpu but it's not a guaranteed win due to two
> corner cases. If multiple tasks have the same p->recent_used_cpu, they can
> race to use that CPU and stack as a result instead of searching the domain.
> If SMT is enabled then p->recent_used_cpu can point to an idle CPU that has
> a busy sibling which the search would have avoided in select_idle_core().
>
> I think you are using processes and sockets for hackbench but as you'll
> see later, hackbench can be used both to show losses and gains.

I ran more hackbench tests with pipe and socket and both show a
regression with patch 3, whereas there is a significant improvement with
the other patches and Aubrey's one.

>
> I originally tested with 6 machines covering Broadwell (2 socket), Haswell
> (2 socket), Skylake (1 socket), Cascadelake (2 socket), EPYC (2 socket)
> and EPYC 2 (2 socket) with a range of workloads including hackbench. Of
> those, just one reported a major problem with 1 group -- the EPYC 1 machine
>
> EPYC hackbench process-sockets
>                           5.10.0-rc6             5.10.0-rc6
>                        baseline-v2r2         altrecent-v2r5
> Amean     1        1.0607 (   0.00%)      1.1480 (  -8.23%)
> Amean     4        1.3277 (   0.00%)      1.3117 (   1.21%)
> Amean     7        1.6940 (   0.00%)      1.6950 (  -0.06%)
> Amean     12       2.1600 (   0.00%)      2.1367 (   1.08%)
> Amean     21       3.2450 (   0.00%)      3.5883 ( -10.58%)
> Amean     30       4.1673 (   0.00%)      3.9653 (   4.85%)
> Amean     48       4.9257 (   0.00%)      5.0000 (  -1.51%)
> Amean     79       7.4950 (   0.00%)      7.4563 (   0.52%)
> Amean     110     10.4233 (   0.00%)     10.4727 (  -0.47%)
> Amean     141     13.4690 (   0.00%)     13.4563 (   0.09%)
> Amean     172     16.6450 (   0.00%)     16.6033 (   0.25%)
> Amean     203     19.4873 (   0.00%)     19.7893 *  -1.55%*
> Amean     234     22.5507 (   0.00%)     22.8033 (  -1.12%)
> Amean     265     25.3380 (   0.00%)     25.6490 (  -1.23%)
> Amean     296     28.0070 (   0.00%)     28.1270 (  -0.43%)
>
> That's showing an 8% loss for 1 group and also a problem with 21 groups.
> Otherwise, it was more or less flat. EPYC 2 also showed a 2% loss for 1
> group and 9% loss for 21 groups (probably related to the size of the LLC
> domain as there are many LLCs per socket on EPYC*).
>
> For the *same* machine running hackbench using pipes instead of sockets
> we get
>
> EPYC hackbench process-pipes
> Amean     1        0.9497 (   0.00%)      0.9517 (  -0.21%)
> Amean     4        1.2253 (   0.00%)      1.1387 (   7.07%)
> Amean     7        2.0677 (   0.00%)      1.7460 *  15.56%*
> Amean     12       2.8717 (   0.00%)      2.4797 *  13.65%*
> Amean     21       4.4053 (   0.00%)      3.7463 *  14.96%*
> Amean     30       5.3983 (   0.00%)      4.1097 *  23.87%*
> Amean     48       6.1050 (   0.00%)      4.6873 *  23.22%*
> Amean     79       7.5640 (   0.00%)      6.8493 (   9.45%)
> Amean     110     12.2627 (   0.00%)      9.4613 *  22.84%*
> Amean     141     16.9980 (   0.00%)     13.8327 *  18.62%*
> Amean     172     21.5280 (   0.00%)     17.3693 *  19.32%*
> Amean     203     25.4480 (   0.00%)     20.9947 *  17.50%*
> Amean     234     29.6570 (   0.00%)     24.9613 *  15.83%*
> Amean     265     33.0713 (   0.00%)     28.1103 *  15.00%*
> Amean     296     37.4443 (   0.00%)     31.8757 *  14.87%*
>
> So even on the *same hardware*, hackbench can show very different results
> depending on how it is run.
>
> The rest of the machines were more or less neutral for this patch. Once
> hackbench saturates the machine, the hit rate on recent_used_cpu is going
> to be low
>
> 1-socket skylake
> Amean     1       1.3183 (   0.00%)      1.2827 *   2.71%*
> Amean     3       3.6750 (   0.00%)      3.6610 (   0.38%)
> Amean     5       6.1003 (   0.00%)      6.0190 *   1.33%*
> Amean     7       8.6063 (   0.00%)      8.6047 (   0.02%)
> Amean     12     14.9480 (   0.00%)     15.0327 (  -0.57%)
> Amean     18     22.3430 (   0.00%)     22.6680 (  -1.45%)
> Amean     24     29.4970 (   0.00%)     29.6677 (  -0.58%)
> Amean     30     36.7373 (   0.00%)     36.3687 (   1.00%)
> Amean     32     39.0973 (   0.00%)     39.4417 (  -0.88%)
>
> Shows a 2.71% gain for one group, otherwise more or less neutral
>
> 2-socket CascadeLake
>
> Amean     1        0.3663 (   0.00%)      0.3657 (   0.18%)
> Amean     4        0.7510 (   0.00%)      0.7793 (  -3.77%)
> Amean     7        1.2650 (   0.00%)      1.2583 (   0.53%)
> Amean     12       1.9510 (   0.00%)      1.9450 (   0.31%)
> Amean     21       2.9677 (   0.00%)      3.0277 (  -2.02%)
> Amean     30       4.2993 (   0.00%)      4.0237 *   6.41%*
> Amean     48       6.5373 (   0.00%)      6.2987 *   3.65%*
> Amean     79      10.5513 (   0.00%)     10.3280 (   2.12%)
> Amean     110     15.8567 (   0.00%)     13.9817 (  11.82%)
> Amean     141     17.4243 (   0.00%)     17.3177 (   0.61%)
> Amean     172     21.0473 (   0.00%)     20.9760 (   0.34%)
> Amean     203     25.1070 (   0.00%)     25.1150 (  -0.03%)
> Amean     234     28.6753 (   0.00%)     28.9383 (  -0.92%)
> Amean     265     32.7970 (   0.00%)     32.9663 (  -0.52%)
> Amean     296     36.6510 (   0.00%)     36.6753 (  -0.07%)
>
> Neutral for 1 group, small regression for 4 groups, few gains around the
> middle, neutral when over-saturated.
>
> select_idle_sibling is a curse because it's very rare that a change to
> it is a universal win. On balance, I think it's better to avoid searching
> the domain at all where possible even if there are cases where searching
> can have a benefit such as finding an idle core instead of picking an
> idle CPU with a busy sibling via p->recent_used_cpu.
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling
  2020-12-11  9:51         ` Vincent Guittot
@ 2020-12-11 10:23           ` Mel Gorman
  2020-12-12 10:02             ` Vincent Guittot
  0 siblings, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2020-12-11 10:23 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Ziljstra, Ingo Molnar, LKML, Aubrey Li, Barry Song,
	Juri Lelli, Valentin Schneider, Linux-ARM

On Fri, Dec 11, 2020 at 10:51:17AM +0100, Vincent Guittot wrote:
> On Thu, 10 Dec 2020 at 12:04, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Thu, Dec 10, 2020 at 10:38:37AM +0100, Vincent Guittot wrote:
> > > > While testing your patchset and Aubrey's one on top of tip, I'm facing
> > > > a perf regression on my arm64 NUMA system with hackbench and reaim.
> > > > The regression seems to come from your patchset but I don't know
> > > > which patch in particular yet.
> > > >
> > > > hackbench -l 256000 -g 1
> > > >
> > > > v5.10-rc7 + tip/sched/core 13.255 (+/- 3.22%)
> > > > with your patchset         15.368 (+/- 2.74%)  -15.9%
> > > >
> > > > I'm also seeing a perf regression on reaim but this one needs more
> > > > investigation before confirming.
> > > >
> > > > TBH, I was not expecting regressions. I'm running more tests to find
> > > > which patch is the culprit.
> > >
> > > The regression comes from patch 3: sched/fair: Do not replace
> > > recent_used_cpu with the new target
> > >
> >
> > That's not entirely surprising. The intent of the patch is to increase the
> > hit rate of p->recent_used_cpu but it's not a guaranteed win due to two
> > corner cases. If multiple tasks have the same p->recent_used_cpu, they can
> > race to use that CPU and stack as a result instead of searching the domain.
> > If SMT is enabled then p->recent_used_cpu can point to an idle CPU that has
> > a busy sibling which the search would have avoided in select_idle_core().
> >
> > I think you are using processes and sockets for hackbench but as you'll
> > see later, hackbench can be used both to show losses and gains.
> 
> I ran more hackbench tests with pipe and socket and both show a
> regression with patch 3, whereas there is a significant improvement with
> the other patches and Aubrey's one.
> 

Is SMT enabled on your test machine? If not, then patch 4 should make no
difference but if SMT is enabled, I wonder how this untested version of
patch 3 behaves for you. The main difference is that the recent used cpu
is used as a search target so that it would still check if it's an idle
core and if not, fall through so it's used as an idle CPU after checking
it's allowed by p->cpus_ptr.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c41875aec23..63980bcf6e70 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6275,21 +6275,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		return prev;
 	}
 
-	/* Check a recently used CPU as a potential idle candidate: */
+	/* Check a recently used CPU as a search target: */
 	recent_used_cpu = p->recent_used_cpu;
+	p->recent_used_cpu = prev;
 	if (recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
 	    cpus_share_cache(recent_used_cpu, target) &&
-	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
-	    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
-	    asym_fits_capacity(task_util, recent_used_cpu)) {
-		/*
-		 * Replace recent_used_cpu with prev as it is a potential
-		 * candidate for the next wake:
-		 */
-		p->recent_used_cpu = prev;
-		return recent_used_cpu;
-	}
+	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)))
+		target = recent_used_cpu;
 
 	/*
 	 * For asymmetric CPU capacity systems, our domain of interest is
@@ -6768,9 +6761,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	} else if (wake_flags & WF_TTWU) { /* XXX always ? */
 		/* Fast path */
 		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
-
-		if (want_affine)
-			current->recent_used_cpu = cpu;
 	}
 	rcu_read_unlock();
 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling
  2020-12-11 10:23           ` Mel Gorman
@ 2020-12-12 10:02             ` Vincent Guittot
  0 siblings, 0 replies; 26+ messages in thread
From: Vincent Guittot @ 2020-12-12 10:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Ziljstra, Ingo Molnar, LKML, Aubrey Li, Barry Song,
	Juri Lelli, Valentin Schneider, Linux-ARM

On Fri, 11 Dec 2020 at 11:23, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Fri, Dec 11, 2020 at 10:51:17AM +0100, Vincent Guittot wrote:
> > On Thu, 10 Dec 2020 at 12:04, Mel Gorman <mgorman@techsingularity.net> wrote:
> > >
> > > On Thu, Dec 10, 2020 at 10:38:37AM +0100, Vincent Guittot wrote:
> > > > > While testing your patchset and Aubrey's one on top of tip, I'm facing
> > > > > a perf regression on my arm64 NUMA system with hackbench and reaim.
> > > > > The regression seems to come from your patchset but I don't know
> > > > > which patch in particular yet.
> > > > >
> > > > > hackbench -l 256000 -g 1
> > > > >
> > > > > v5.10-rc7 + tip/sched/core 13.255 (+/- 3.22%)
> > > > > with your patchset         15.368 (+/- 2.74%)  -15.9%
> > > > >
> > > > > I'm also seeing a perf regression on reaim but this one needs more
> > > > > investigation before confirming.
> > > > >
> > > > > TBH, I was not expecting regressions. I'm running more tests to find
> > > > > which patch is the culprit.
> > > >
> > > > The regression comes from patch 3: sched/fair: Do not replace
> > > > recent_used_cpu with the new target
> > > >
> > >
> > > That's not entirely surprising. The intent of the patch is to increase the
> > > hit rate of p->recent_used_cpu but it's not a guaranteed win due to two
> > > corner cases. If multiple tasks have the same p->recent_used_cpu, they can
> > > race to use that CPU and stack as a result instead of searching the domain.
> > > If SMT is enabled then p->recent_used_cpu can point to an idle CPU that has
> > > a busy sibling which the search would have avoided in select_idle_core().
> > >
> > > I think you are using processes and sockets for hackbench but as you'll
> > > see later, hackbench can be used both to show losses and gains.
> >
> > I ran more hackbench tests with pipe and socket and both show a
> > regression with patch 3, whereas there is a significant improvement with
> > the other patches and Aubrey's one.
> >
>
> Is SMT enabled on your test machine? If not, then patch 4 should make no

Yes, I have SMT on my system: 2 nodes x 28 cores x 4 SMT threads.



> difference but if SMT is enabled, I wonder how this untested version of
> patch 3 behaves for you. The main difference is that the recent used cpu
> is used as a search target so that it would still check if it's an idle
> core and if not, fall through so it's used as an idle CPU after checking
> it's allowed by p->cpus_ptr.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5c41875aec23..63980bcf6e70 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6275,21 +6275,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>                 return prev;
>         }
>
> -       /* Check a recently used CPU as a potential idle candidate: */
> +       /* Check a recently used CPU as a search target: */
>         recent_used_cpu = p->recent_used_cpu;
> +       p->recent_used_cpu = prev;
>         if (recent_used_cpu != prev &&
>             recent_used_cpu != target &&
>             cpus_share_cache(recent_used_cpu, target) &&
> -           (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
> -           cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
> -           asym_fits_capacity(task_util, recent_used_cpu)) {
> -               /*
> -                * Replace recent_used_cpu with prev as it is a potential
> -                * candidate for the next wake:
> -                */
> -               p->recent_used_cpu = prev;
> -               return recent_used_cpu;
> -       }
> +           (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)))
> +               target = recent_used_cpu;
>
>         /*
>          * For asymmetric CPU capacity systems, our domain of interest is
> @@ -6768,9 +6761,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>         } else if (wake_flags & WF_TTWU) { /* XXX always ? */
>                 /* Fast path */
>                 new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> -
> -               if (want_affine)
> -                       current->recent_used_cpu = cpu;
>         }
>         rcu_read_unlock();
>
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2020-12-12 10:04 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-08 15:34 [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling Mel Gorman
2020-12-08 15:34 ` [PATCH 1/4] sched/fair: Remove SIS_AVG_CPU Mel Gorman
2020-12-08 16:13   ` Vincent Guittot
2020-12-08 15:34 ` [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP Mel Gorman
2020-12-08 16:03   ` Vincent Guittot
2020-12-08 16:30     ` Mel Gorman
2020-12-09  5:28     ` Li, Aubrey
2020-12-09  9:05       ` Mel Gorman
2020-12-09 11:07         ` Li, Aubrey
2020-12-09 11:33           ` Mel Gorman
2020-12-10  5:18   ` Li, Aubrey
2020-12-10  9:32     ` Mel Gorman
2020-12-08 15:35 ` [PATCH 3/4] sched/fair: Do not replace recent_used_cpu with the new target Mel Gorman
2020-12-08 16:14   ` Vincent Guittot
2020-12-10  9:40     ` Vincent Guittot
2020-12-08 15:35 ` [PATCH 4/4] sched/fair: Return an idle cpu if one is found after a failed search for an idle core Mel Gorman
2020-12-08 16:15   ` Vincent Guittot
2020-12-09 14:37 ` [PATCH 0/4] Reduce scanning of runqueues in select_idle_sibling Mel Gorman
2020-12-10  8:00   ` Vincent Guittot
2020-12-10  9:38     ` Vincent Guittot
2020-12-10 11:04       ` Mel Gorman
2020-12-11  9:51         ` Vincent Guittot
2020-12-11 10:23           ` Mel Gorman
2020-12-12 10:02             ` Vincent Guittot
     [not found] ` <20201211062542.3082-1-hdanton@sina.com>
2020-12-11  9:02   ` [PATCH 3/4] sched/fair: Do not replace recent_used_cpu with the new target Mel Gorman
     [not found]   ` <20201211093443.3259-1-hdanton@sina.com>
2020-12-11  9:45     ` Mel Gorman
