* [PATCH] sched/fair: Prefer idle CPU to cache affinity
@ 2021-02-26 16:40 Srikar Dronamraju
  2021-02-27 19:56 ` Rik van Riel
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2021-02-26 16:40 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Michael Ellerman, Michael Neuling,
	Gautham R Shenoy, Parth Shah

On POWER8 and POWER9, the last level cache (L2) has been at the level of
a group of 8 threads (SMT8 on POWER8, a big-core comprising a pair of
SMT4 cores on POWER9). However, on POWER10, the LLC domain is at the
level of a group of SMT4 threads within the SMT8 core. Due to the
shrinking of the LLC domain, the probability of finding an
idle CPU in the LLC domain of the target is lower on POWER10 compared
to previous generation processors.

With commit 9538abee18cc ("powerpc/smp: Add support detecting
thread-groups sharing L2 cache"), benchmarks such as Daytrader
(https://github.com/WASdev/sample.daytrader7) show a drop in throughput
in a configuration consisting of 1 JVM spanning across 6-8 big-cores on
POWER10.  Analysis showed that this was because more wakeups
were happening on busy CPUs when the utilization was 60-70%. This drop
in throughput also shows up as a drop in CPU utilization. However, most
other benchmarks benefit from detecting the thread-groups that share L2
cache.

Current order of preference to pick an LLC while waking a wake-affine
task:
1. Between the waker CPU and previous CPU, prefer the LLC of the CPU
   that is idle.

2. Between the waker CPU and previous CPU, prefer the LLC of the CPU
   that is less heavily loaded.

In the current situation where the waker and previous CPUs are busy, but
only one of their LLCs has an idle CPU, the scheduler may end up picking an
LLC with no idle CPUs. To mitigate this, add a new step between 1 and 2
where the scheduler compares the idle CPUs in the waker and previous LLCs
and picks the appropriate one.

The other alternative is to search for an idle CPU in the other LLC, if
the current select_idle_sibling is unable to find an idle CPU in the
preferred LLC. But that may increase the time to select a CPU.


                                     5.11-rc6      5.11-rc6+revert   5.11-rc6+patch
8CORE/1JVM  80USERS   throughput     6651.6        6716.3 (0.97%)    6940 (4.34%)
                      sys/user:time  59.75/23.86   61.77/24.55       60/24

8CORE/2JVM  80USERS   throughput     6425.4        6446.8 (0.33%)    6473.2 (0.74%)
                      sys/user:time  70.59/24.25   72.28/23.77       70/24

8CORE/4JVM  80USERS   throughput     5355.3        5551.2 (3.66%)    5586.6 (4.32%)
                      sys/user:time  76.74/21.79   76.54/22.73       76/22

8CORE/8JVM  80USERS   throughput     4420.6        4553.3 (3.00%)    4405.8 (-0.33%)
                      sys/user:time  79.13/20.32   78.76/21.01       79/20

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Co-developed-by: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Co-developed-by: Parth Shah <parth@linux.ibm.com>
Signed-off-by: Parth Shah <parth@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c     | 41 +++++++++++++++++++++++++++++++++++++++--
 kernel/sched/features.h |  2 ++
 2 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a8bd7b13634..d49bfcdc4a19 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5869,6 +5869,36 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 	return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
 }

+static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
+{
+	struct sched_domain_shared *tsds, *psds;
+	int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
+
+	tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
+	tnr_busy = atomic_read(&tsds->nr_busy_cpus);
+	tllc_size = per_cpu(sd_llc_size, this_cpu);
+
+	psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
+	pnr_busy = atomic_read(&psds->nr_busy_cpus);
+	pllc_size = per_cpu(sd_llc_size, prev_cpu);
+
+	/* No need to compare, if both LLCs are fully loaded */
+	if (pnr_busy == pllc_size && tnr_busy == pllc_size)
+		return nr_cpumask_bits;
+
+	if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
+		return this_cpu;
+
+	/* For better wakeup latency, prefer idler LLC to cache affinity */
+	diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
+	if (!diff)
+		return nr_cpumask_bits;
+	if (diff < 0)
+		return this_cpu;
+
+	return prev_cpu;
+}
+
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		       int this_cpu, int prev_cpu, int sync)
 {
@@ -5877,6 +5907,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	if (sched_feat(WA_IDLE))
 		target = wake_affine_idle(this_cpu, prev_cpu, sync);

+	if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits &&
+				!cpus_share_cache(this_cpu, prev_cpu))
+		target = prefer_idler_llc(this_cpu, prev_cpu, sync);
+
 	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
 		target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);

@@ -5884,8 +5918,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	if (target == nr_cpumask_bits)
 		return prev_cpu;

-	schedstat_inc(sd->ttwu_move_affine);
-	schedstat_inc(p->se.statistics.nr_wakeups_affine);
+	if (target == this_cpu) {
+		schedstat_inc(sd->ttwu_move_affine);
+		schedstat_inc(p->se.statistics.nr_wakeups_affine);
+	}
+
 	return target;
 }

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 1bc2b158fc51..e2de3ba8d5b1 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -83,6 +83,8 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)

 SCHED_FEAT(WA_IDLE, true)
 SCHED_FEAT(WA_WEIGHT, true)
+SCHED_FEAT(WA_IDLER_LLC, true)
+SCHED_FEAT(WA_WAKER, false)
 SCHED_FEAT(WA_BIAS, true)

 /*
--
2.18.4



* Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
  2021-02-26 16:40 [PATCH] sched/fair: Prefer idle CPU to cache affinity Srikar Dronamraju
@ 2021-02-27 19:56 ` Rik van Riel
  2021-03-01 13:37   ` Srikar Dronamraju
  2021-03-01 15:44   ` Peter Zijlstra
  2021-03-01 15:40 ` Peter Zijlstra
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 16+ messages in thread
From: Rik van Riel @ 2021-02-27 19:56 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Thomas Gleixner, Valentin Schneider,
	Vincent Guittot, Dietmar Eggemann, Michael Ellerman,
	Michael Neuling, Gautham R Shenoy, Parth Shah


On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:

> Current order of preference to pick a LLC while waking a wake-affine
> task:
> 1. Between the waker CPU and previous CPU, prefer the LLC of the CPU
>    that is idle.
> 
> 2. Between the waker CPU and previous CPU, prefer the LLC of the CPU
>    that is less lightly loaded.
> 
> In the current situation where waker and previous CPUs are busy, but
> only one of its LLC has an idle CPU, Scheduler may end up picking a
> LLC
> with no idle CPUs. To mitigate this, add a new step between 1 and 2
> where Scheduler compares idle CPUs in waker and previous LLCs and
> picks
> the appropriate one.

I like that idea a lot. That could also solve some of the
issues sometimes observed on multi-node x86 systems, and
probably on the newer AMD chips with several LLCs on chip.

> +	if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> +		return this_cpu;

I wonder if we need to use a slightly lower threshold on
very large LLCs, both to account for the fact that the
select_idle_cpu code may not find the single idle CPU
among a dozen busy ones, and because on a system with
hyperthreading we may often be better off picking another
LLC to avoid HT contention issues?

Maybe we could use "tnr_busy * 4 < tllc_size * 3" or something like that?

That way we will only try to find the last 5 idle CPUs in a 22 CPU LLC
if the other LLC also has fewer than 6 idle cores.

That might increase our chances of finding an idle CPU
with SIS_PROP enabled, and might allow WA_WAKER to be
true by default.
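
Expressed against the patch, the check above would then read something like
the following (just a sketch of the suggested threshold, untested):

	if (sched_feat(WA_WAKER) && tnr_busy * 4 < tllc_size * 3)
		return this_cpu;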

> +	/* For better wakeup latency, prefer idler LLC to cache
> affinity */
> +	diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
> +	if (!diff)
> +		return nr_cpumask_bits;
> +	if (diff < 0)
> +		return this_cpu;
> +
> +	return prev_cpu;
> +}

-- 
All Rights Reversed.



* Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
  2021-02-27 19:56 ` Rik van Riel
@ 2021-03-01 13:37   ` Srikar Dronamraju
  2021-03-01 15:44   ` Peter Zijlstra
  1 sibling, 0 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2021-03-01 13:37 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	Michael Ellerman, Michael Neuling, Gautham R Shenoy, Parth Shah

* Rik van Riel <riel@surriel.com> [2021-02-27 14:56:07]:

> > In the current situation where waker and previous CPUs are busy, but
> > only one of its LLC has an idle CPU, Scheduler may end up picking a
> > LLC
> > with no idle CPUs. To mitigate this, add a new step between 1 and 2
> > where Scheduler compares idle CPUs in waker and previous LLCs and
> > picks
> > the appropriate one.
> 
> I like that idea a lot. That could also solve some of the

Thanks.

> issues sometimes observed on multi-node x86 systems, and
> probably on the newer AMD chips with several LLCs on chip.
> 

Okay.

> > +	if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > +		return this_cpu;
> 
> I wonder if we need to use a slightly lower threshold on
> very large LLCs, both to account for the fact that the
> select_idle_cpu code may not find the single idle CPU
> among a dozen busy ones, or because on a system with
> hyperthreading we may often be better off picking another
> LLC for HT contention issues?

> 
> Maybe we could use "tnr_busy * 4 <
> tllc_size * 3" or
> something like that?
> 
> That way we will only try to find the last 5 idle
> CPUs
> in a 22 CPU LLC if the other LLC also has fewer than 6
> idle cores.
> 
> That might increase our chances of finding an idle CPU
> with SIS_PROP enabled, and might allow WA_WAKER to be
> true by default.

Agreed, we need to be conservative, especially if we want to make WA_WAKER on
by default. I would still like to hear from other people whether they think
it's OK to enable it by default. I wonder if enabling it by default can cause
some load imbalances leading to more active load balancing down the line. I
haven't benchmarked with WA_WAKER enabled.

Thanks Rik for your inputs.

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
  2021-02-26 16:40 [PATCH] sched/fair: Prefer idle CPU to cache affinity Srikar Dronamraju
  2021-02-27 19:56 ` Rik van Riel
@ 2021-03-01 15:40 ` Peter Zijlstra
  2021-03-01 17:08   ` Srikar Dronamraju
  2021-03-02  9:53 ` Dietmar Eggemann
  2021-03-08 13:52 ` Vincent Guittot
  3 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2021-03-01 15:40 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	Michael Ellerman, Michael Neuling, Gautham R Shenoy, Parth Shah

On Fri, Feb 26, 2021 at 10:10:29PM +0530, Srikar Dronamraju wrote:
> +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> +{
> +	struct sched_domain_shared *tsds, *psds;
> +	int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> +
> +	tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> +	tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> +	tllc_size = per_cpu(sd_llc_size, this_cpu);
> +
> +	psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> +	pnr_busy = atomic_read(&psds->nr_busy_cpus);
> +	pllc_size = per_cpu(sd_llc_size, prev_cpu);
> +

nr_busy_cpus is NO_HZ_COMMON, so this code that consumes it should be
too.
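
One possible shape for that, as a sketch only (the stub mirrors the
"no preference" return convention used by the other wake_affine_*() helpers;
the body is elided here):

	#ifdef CONFIG_NO_HZ_COMMON
	static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
	{
		/* ... body as in the patch ... */
	}
	#else
	static inline int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
	{
		return nr_cpumask_bits;
	}
	#endif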


* Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
  2021-02-27 19:56 ` Rik van Riel
  2021-03-01 13:37   ` Srikar Dronamraju
@ 2021-03-01 15:44   ` Peter Zijlstra
  2021-03-01 17:06     ` Srikar Dronamraju
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2021-03-01 15:44 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Srikar Dronamraju, Ingo Molnar, LKML, Mel Gorman,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Michael Ellerman, Michael Neuling,
	Gautham R Shenoy, Parth Shah


On Sat, Feb 27, 2021 at 02:56:07PM -0500, Rik van Riel wrote:
> On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:

> > +	if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > +		return this_cpu;
> 
> I wonder if we need to use a slightly lower threshold on
> very large LLCs, both to account for the fact that the
> select_idle_cpu code may not find the single idle CPU
> among a dozen busy ones, or because on a system with
> hyperthreading we may often be better off picking another
> LLC for HT contention issues?
> 
> Maybe we could use "tnr_busy * 4 <
> tllc_size * 3" or
> something like that?

How about:

	tnr_busy < tllc_size / topology_max_smt_threads()

?



* Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
  2021-03-01 15:44   ` Peter Zijlstra
@ 2021-03-01 17:06     ` Srikar Dronamraju
  2021-03-01 17:18       ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2021-03-01 17:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Ingo Molnar, LKML, Mel Gorman, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	Michael Ellerman, Michael Neuling, Gautham R Shenoy, Parth Shah

* Peter Zijlstra <peterz@infradead.org> [2021-03-01 16:44:42]:

> On Sat, Feb 27, 2021 at 02:56:07PM -0500, Rik van Riel wrote:
> > On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:
> 
> > > +	if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > > +		return this_cpu;
> > 
> > I wonder if we need to use a slightly lower threshold on
> > very large LLCs, both to account for the fact that the
> > select_idle_cpu code may not find the single idle CPU
> > among a dozen busy ones, or because on a system with
> > hyperthreading we may often be better off picking another
> > LLC for HT contention issues?
> > 
> > Maybe we could use "tnr_busy * 4 <
> > tllc_size * 3" or
> > something like that?
> 
> How about:
> 
> 	tnr_busy < tllc_size / topology_max_smt_threads()
> 
> ?

Isn't topology_max_smt_threads only for x86 as of today?
Or am I missing something?


-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
  2021-03-01 15:40 ` Peter Zijlstra
@ 2021-03-01 17:08   ` Srikar Dronamraju
  0 siblings, 0 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2021-03-01 17:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	Michael Ellerman, Michael Neuling, Gautham R Shenoy, Parth Shah

* Peter Zijlstra <peterz@infradead.org> [2021-03-01 16:40:33]:

> On Fri, Feb 26, 2021 at 10:10:29PM +0530, Srikar Dronamraju wrote:
> > +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> > +{
> > +	struct sched_domain_shared *tsds, *psds;
> > +	int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> > +
> > +	tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> > +	tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> > +	tllc_size = per_cpu(sd_llc_size, this_cpu);
> > +
> > +	psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> > +	pnr_busy = atomic_read(&psds->nr_busy_cpus);
> > +	pllc_size = per_cpu(sd_llc_size, prev_cpu);
> > +
> 
> nr_busy_cpus is NO_HZ_COMMON So this code that consumes it should be
> too.

Thanks Peter, will take care of this along with other changes including
calling within rcu_read_lock and checking for tsds and psds after
rcu_dereference.
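
A minimal sketch of those checks (illustrative only, not the final patch):

	rcu_read_lock();
	tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
	psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
	if (!tsds || !psds) {
		rcu_read_unlock();
		return nr_cpumask_bits;
	}

	tnr_busy = atomic_read(&tsds->nr_busy_cpus);
	pnr_busy = atomic_read(&psds->nr_busy_cpus);
	rcu_read_unlock();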

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
  2021-03-01 17:06     ` Srikar Dronamraju
@ 2021-03-01 17:18       ` Peter Zijlstra
  2021-03-02  7:39         ` Srikar Dronamraju
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2021-03-01 17:18 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Rik van Riel, Ingo Molnar, LKML, Mel Gorman, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	Michael Ellerman, Michael Neuling, Gautham R Shenoy, Parth Shah

On Mon, Mar 01, 2021 at 10:36:01PM +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2021-03-01 16:44:42]:
> 
> > On Sat, Feb 27, 2021 at 02:56:07PM -0500, Rik van Riel wrote:
> > > On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:
> > 
> > > > +	if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > > > +		return this_cpu;
> > > 
> > > I wonder if we need to use a slightly lower threshold on
> > > very large LLCs, both to account for the fact that the
> > > select_idle_cpu code may not find the single idle CPU
> > > among a dozen busy ones, or because on a system with
> > > hyperthreading we may often be better off picking another
> > > LLC for HT contention issues?
> > > 
> > > Maybe we could use "tnr_busy * 4 <
> > > tllc_size * 3" or
> > > something like that?
> > 
> > How about:
> > 
> > 	tnr_busy < tllc_size / topology_max_smt_threads()
> > 
> > ?
> 
> Isn't topology_max_smt_threads only for x86 as of today?
> Or Am I missing out?

Oh, could be, I didn't grep :/ We could have core code keep track of the
smt count I suppose.


* Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
  2021-03-01 17:18       ` Peter Zijlstra
@ 2021-03-02  7:39         ` Srikar Dronamraju
  2021-03-02  9:10           ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2021-03-02  7:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Ingo Molnar, LKML, Mel Gorman, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	Michael Ellerman, Michael Neuling, Gautham R Shenoy, Parth Shah

* Peter Zijlstra <peterz@infradead.org> [2021-03-01 18:18:28]:

> On Mon, Mar 01, 2021 at 10:36:01PM +0530, Srikar Dronamraju wrote:
> > * Peter Zijlstra <peterz@infradead.org> [2021-03-01 16:44:42]:
> > 
> > > On Sat, Feb 27, 2021 at 02:56:07PM -0500, Rik van Riel wrote:
> > > > On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:
> > > 
 > > > 
> > > > Maybe we could use "tnr_busy * 4 <
> > > > tllc_size * 3" or
> > > > something like that?
> > > 
> > > How about:
> > > 
> > > 	tnr_busy < tllc_size / topology_max_smt_threads()
> > > 
> > > ?
> > 
> > Isn't topology_max_smt_threads only for x86 as of today?
> > Or Am I missing out?
> 
> Oh, could be, I didn't grep :/ We could have core code keep track of the
> smt count I suppose.

Could we use cpumask_weight(cpu_smt_mask(this_cpu)) instead?

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
  2021-03-02  7:39         ` Srikar Dronamraju
@ 2021-03-02  9:10           ` Peter Zijlstra
  2021-03-02 10:05             ` Srikar Dronamraju
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2021-03-02  9:10 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Rik van Riel, Ingo Molnar, LKML, Mel Gorman, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	Michael Ellerman, Michael Neuling, Gautham R Shenoy, Parth Shah

On Tue, Mar 02, 2021 at 01:09:46PM +0530, Srikar Dronamraju wrote:
> > Oh, could be, I didn't grep :/ We could have core code keep track of the
> > smt count I suppose.
> 
> Could we use cpumask_weight(cpu_smt_mask(this_cpu)) instead?

cpumask_weight() is potentially super expensive. With CPUMASK_OFFSTACK
you get at least one more cache miss and then the bitmap might be really
long.

Best to compute the results once and use it later.
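
For instance (purely illustrative; sd_smt_size is a made-up name, not an
existing kernel symbol), the SMT weight could be cached per CPU when the
sched domains are (re)built, next to sd_llc_size:

	DEFINE_PER_CPU(int, sd_smt_size);	/* hypothetical per-cpu cache */

	static void update_top_cache_domain(int cpu)
	{
		/* ... existing sd_llc_* bookkeeping ... */
		/* assumes CONFIG_SCHED_SMT; computed once per domain rebuild */
		per_cpu(sd_smt_size, cpu) = cpumask_weight(cpu_smt_mask(cpu));
	}

and the wakeup path would then read per_cpu(sd_smt_size, this_cpu) instead
of calling cpumask_weight() there.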


* Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
  2021-02-26 16:40 [PATCH] sched/fair: Prefer idle CPU to cache affinity Srikar Dronamraju
  2021-02-27 19:56 ` Rik van Riel
  2021-03-01 15:40 ` Peter Zijlstra
@ 2021-03-02  9:53 ` Dietmar Eggemann
  2021-03-02 10:04   ` Srikar Dronamraju
  2021-03-08 13:52 ` Vincent Guittot
  3 siblings, 1 reply; 16+ messages in thread
From: Dietmar Eggemann @ 2021-03-02  9:53 UTC (permalink / raw)
  To: Srikar Dronamraju, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Michael Ellerman,
	Michael Neuling, Gautham R Shenoy, Parth Shah

On 26/02/2021 17:40, Srikar Dronamraju wrote:

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a8bd7b13634..d49bfcdc4a19 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5869,6 +5869,36 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
>  	return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
>  }
> 
> +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> +{
> +	struct sched_domain_shared *tsds, *psds;
> +	int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> +
> +	tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> +	tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> +	tllc_size = per_cpu(sd_llc_size, this_cpu);
> +
> +	psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> +	pnr_busy = atomic_read(&psds->nr_busy_cpus);
> +	pllc_size = per_cpu(sd_llc_size, prev_cpu);
> +
> +	/* No need to compare, if both LLCs are fully loaded */
> +	if (pnr_busy == pllc_size && tnr_busy == pllc_size)

                                                     ^
                                           shouldn't this be tllc_size ?

> +		return nr_cpumask_bits;
> +
> +	if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> +		return this_cpu;
> +
> +	/* For better wakeup latency, prefer idler LLC to cache affinity */
> +	diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
> +	if (!diff)
> +		return nr_cpumask_bits;
> +	if (diff < 0)
> +		return this_cpu;
> +
> +	return prev_cpu;
> +}
> +


* Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
  2021-03-02  9:53 ` Dietmar Eggemann
@ 2021-03-02 10:04   ` Srikar Dronamraju
  0 siblings, 0 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2021-03-02 10:04 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Michael Ellerman, Michael Neuling, Gautham R Shenoy, Parth Shah

* Dietmar Eggemann <dietmar.eggemann@arm.com> [2021-03-02 10:53:06]:

> On 26/02/2021 17:40, Srikar Dronamraju wrote:
> 
> [...]
> 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8a8bd7b13634..d49bfcdc4a19 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5869,6 +5869,36 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
> >  	return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
> >  }
> > 
> > +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> > +{
> > +	struct sched_domain_shared *tsds, *psds;
> > +	int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> > +
> > +	tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> > +	tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> > +	tllc_size = per_cpu(sd_llc_size, this_cpu);
> > +
> > +	psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> > +	pnr_busy = atomic_read(&psds->nr_busy_cpus);
> > +	pllc_size = per_cpu(sd_llc_size, prev_cpu);
> > +
> > +	/* No need to compare, if both LLCs are fully loaded */
> > +	if (pnr_busy == pllc_size && tnr_busy == pllc_size)
> 
>                                                      ^
>                                            shouldn't this be tllc_size ?

Yes, thanks for pointing out.

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
  2021-03-02  9:10           ` Peter Zijlstra
@ 2021-03-02 10:05             ` Srikar Dronamraju
  0 siblings, 0 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2021-03-02 10:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Ingo Molnar, LKML, Mel Gorman, Thomas Gleixner,
	Valentin Schneider, Vincent Guittot, Dietmar Eggemann,
	Michael Ellerman, Michael Neuling, Gautham R Shenoy, Parth Shah

* Peter Zijlstra <peterz@infradead.org> [2021-03-02 10:10:32]:

> On Tue, Mar 02, 2021 at 01:09:46PM +0530, Srikar Dronamraju wrote:
> > > Oh, could be, I didn't grep :/ We could have core code keep track of the
> > > smt count I suppose.
> > 
> > Could we use cpumask_weight(cpu_smt_mask(this_cpu)) instead?
> 
> cpumask_weight() is potentially super expensive. With CPUMASK_OFFSTACK
> you get at least one more cache miss and then the bitmap might be really
> long.
> 
> Best to compute the results once and use it later.

Oh okay .. Noted.

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
  2021-02-26 16:40 [PATCH] sched/fair: Prefer idle CPU to cache affinity Srikar Dronamraju
                   ` (2 preceding siblings ...)
  2021-03-02  9:53 ` Dietmar Eggemann
@ 2021-03-08 13:52 ` Vincent Guittot
  2021-03-10  5:52   ` Srikar Dronamraju
  3 siblings, 1 reply; 16+ messages in thread
From: Vincent Guittot @ 2021-03-08 13:52 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Dietmar Eggemann,
	Michael Ellerman, Michael Neuling, Gautham R Shenoy, Parth Shah

On Fri, 26 Feb 2021 at 17:41, Srikar Dronamraju
<srikar@linux.vnet.ibm.com> wrote:
>
> On POWER8 and POWER9, the last level cache (L2) has been at the level of
> a group of 8 threads (SMT8 on POWER8, a big-core comprising of a pair of
> SMT4 cores on POWER9). However, on POWER10, the LLC domain is at the
> level of a group of SMT4 threads within the SMT8 core. Due to the
> shrinking in the size of the LLC domain, the probability of finding an
> idle CPU in the LLC domain of the target is lesser on POWER10 compared
> to the previous generation processors.
>
> With commit 9538abee18cc ("powerpc/smp: Add support detecting
> thread-groups sharing L2 cache") benchmarks such as Daytrader
> (https://github.com/WASdev/sample.daytrader7) show a drop in throughput
> in a configuration consisting of 1 JVM spanning across 6-8 Bigcores on
> POWER10.  Analysis showed that this was because more number of wakeups
> were happening on busy CPUs when the utilization was 60-70%. This drop
> in throughput also shows up as a drop in CPU utilization. However most
> other benchmarks benefit with detecting the thread-groups that share L2
> cache.
>
> Current order of preference to pick a LLC while waking a wake-affine
> task:
> 1. Between the waker CPU and previous CPU, prefer the LLC of the CPU
>    that is idle.
>
> 2. Between the waker CPU and previous CPU, prefer the LLC of the CPU
>    that is less lightly loaded.
>
> In the current situation where waker and previous CPUs are busy, but
> only one of its LLC has an idle CPU, Scheduler may end up picking a LLC
> with no idle CPUs. To mitigate this, add a new step between 1 and 2
> where Scheduler compares idle CPUs in waker and previous LLCs and picks
> the appropriate one.
>
> The other alternative is to search for an idle CPU in the other LLC, if
> the current select_idle_sibling is unable to find an idle CPU in the
> preferred LLC. But that may increase the time to select a CPU.
>
>
>                                      5.11-rc6      5.11-rc6+revert   5.11-rc6+patch
> 8CORE/1JVM  80USERS   throughput     6651.6        6716.3 (0.97%)    6940 (4.34%)
>                       sys/user:time  59.75/23.86   61.77/24.55       60/24
>
> 8CORE/2JVM  80USERS   throughput     6425.4        6446.8 (0.33%)    6473.2 (0.74%)
>                       sys/user:time  70.59/24.25   72.28/23.77       70/24
>
> 8CORE/4JVM  80USERS   throughput     5355.3        5551.2 (3.66%)    5586.6 (4.32%)
>                       sys/user:time  76.74/21.79   76.54/22.73       76/22
>
> 8CORE/8JVM  80USERS   throughput     4420.6        4553.3 (3.00%)    4405.8 (-0.33%)
>                       sys/user:time  79.13/20.32   78.76/21.01       79/20
>
> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Parth Shah <parth@linux.ibm.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Co-developed-by: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Signed-off-by: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Co-developed-by: Parth Shah <parth@linux.ibm.com>
> Signed-off-by: Parth Shah <parth@linux.ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
>  kernel/sched/fair.c     | 41 +++++++++++++++++++++++++++++++++++++++--
>  kernel/sched/features.h |  2 ++
>  2 files changed, 41 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a8bd7b13634..d49bfcdc4a19 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5869,6 +5869,36 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
>         return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
>  }
>
> +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> +{
> +       struct sched_domain_shared *tsds, *psds;
> +       int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> +
> +       tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> +       tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> +       tllc_size = per_cpu(sd_llc_size, this_cpu);
> +
> +       psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> +       pnr_busy = atomic_read(&psds->nr_busy_cpus);
> +       pllc_size = per_cpu(sd_llc_size, prev_cpu);
> +
> +       /* No need to compare, if both LLCs are fully loaded */
> +       if (pnr_busy == pllc_size && tnr_busy == pllc_size)
> +               return nr_cpumask_bits;
> +
> +       if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> +               return this_cpu;

Why have you chosen to favor this_cpu instead of prev_cpu, unlike for wake_idle?

> +
> +       /* For better wakeup latency, prefer idler LLC to cache affinity */
> +       diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
> +       if (!diff)
> +               return nr_cpumask_bits;
> +       if (diff < 0)
> +               return this_cpu;
> +
> +       return prev_cpu;
> +}
> +
>  static int wake_affine(struct sched_domain *sd, struct task_struct *p,
>                        int this_cpu, int prev_cpu, int sync)
>  {
> @@ -5877,6 +5907,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
>         if (sched_feat(WA_IDLE))
>                 target = wake_affine_idle(this_cpu, prev_cpu, sync);
>
> +       if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits &&
> +                               !cpus_share_cache(this_cpu, prev_cpu))
> +               target = prefer_idler_llc(this_cpu, prev_cpu, sync);

Could you use the same naming convention as the other functions?
wake_affine_llc, for example.

> +
>         if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
>                 target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
>
> @@ -5884,8 +5918,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
>         if (target == nr_cpumask_bits)
>                 return prev_cpu;
>
> -       schedstat_inc(sd->ttwu_move_affine);
> -       schedstat_inc(p->se.statistics.nr_wakeups_affine);
> +       if (target == this_cpu) {

How is this condition related to $subject ?

> +               schedstat_inc(sd->ttwu_move_affine);
> +               schedstat_inc(p->se.statistics.nr_wakeups_affine);
> +       }
> +
>         return target;
>  }
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 1bc2b158fc51..e2de3ba8d5b1 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -83,6 +83,8 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)
>
>  SCHED_FEAT(WA_IDLE, true)
>  SCHED_FEAT(WA_WEIGHT, true)
> +SCHED_FEAT(WA_IDLER_LLC, true)
> +SCHED_FEAT(WA_WAKER, false)
>  SCHED_FEAT(WA_BIAS, true)
>
>  /*
> --
> 2.18.4
>


* Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
  2021-03-08 13:52 ` Vincent Guittot
@ 2021-03-10  5:52   ` Srikar Dronamraju
  2021-03-10 15:37     ` Vincent Guittot
  0 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2021-03-10  5:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Dietmar Eggemann,
	Michael Ellerman, Michael Neuling, Gautham R Shenoy, Parth Shah

* Vincent Guittot <vincent.guittot@linaro.org> [2021-03-08 14:52:39]:

> On Fri, 26 Feb 2021 at 17:41, Srikar Dronamraju
> <srikar@linux.vnet.ibm.com> wrote:
> >

Thanks Vincent for your review comments.

> > +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> > +{
> > +       struct sched_domain_shared *tsds, *psds;
> > +       int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> > +
> > +       tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> > +       tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> > +       tllc_size = per_cpu(sd_llc_size, this_cpu);
> > +
> > +       psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> > +       pnr_busy = atomic_read(&psds->nr_busy_cpus);
> > +       pllc_size = per_cpu(sd_llc_size, prev_cpu);
> > +
> > +       /* No need to compare, if both LLCs are fully loaded */
> > +       if (pnr_busy == pllc_size && tnr_busy == pllc_size)
> > +               return nr_cpumask_bits;
> > +
> > +       if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > +               return this_cpu;
> 
> Why have you chosen to favor this_cpu instead of prev_cpu unlike for wake_idle ?

At this point, we know that the waker running on this_cpu and the wakee,
which was running on prev_cpu, are affine to each other, and that this_cpu
and prev_cpu don't share cache. I chose to move them close to each other to
benefit from the cache sharing. Based on feedback from Peter and Rik, I made
the check more conservative, i.e. tnr_busy <= tllc_size/smt_weight (where
smt_weight is the cpumask weight of the SMT domain for this_cpu), i.e. if we
have a free core in this LLC domain, choose this_cpu. select_idle_sibling()
should pick an idle CPU/core/SMT within the LLC domain for this_cpu.
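
A sketch of that check, with smt_weight standing in for however the SMT size
ends up being tracked (see the other sub-thread; it is not an existing kernel
symbol):

	if (sched_feat(WA_WAKER) && tnr_busy <= tllc_size / smt_weight)
		return this_cpu;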

Do you feel this may not be the correct option?

We are also experimenting with another option, where we call prefer_idler_llc
after wa_weight, i.e.:
1. if wake_affine_weight chooses this_cpu but the LLC of prev_cpu has an idle
SMT/CPU while there is no idle SMT/CPU in this_cpu's LLC, then choose the
idle SMT/CPU in prev_cpu
2. if wake_affine_weight chooses nr_cpumask_bits (aka prev_cpu) but the LLC
of this_cpu has an idle SMT/CPU while there is no idle SMT/CPU in prev_cpu's
LLC, then choose the idle SMT/CPU in this_cpu


> > +
> > +       /* For better wakeup latency, prefer idler LLC to cache affinity */
> > +       diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
> > +       if (!diff)
> > +               return nr_cpumask_bits;
> > +       if (diff < 0)
> > +               return this_cpu;
> > +
> > +       return prev_cpu;
> > +}
> > +
> >  static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> >                        int this_cpu, int prev_cpu, int sync)
> >  {
> > @@ -5877,6 +5907,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> >         if (sched_feat(WA_IDLE))
> >                 target = wake_affine_idle(this_cpu, prev_cpu, sync);
> >
> > +       if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits &&
> > +                               !cpus_share_cache(this_cpu, prev_cpu))
> > +               target = prefer_idler_llc(this_cpu, prev_cpu, sync);
> 
> could you use the same naming convention as others function ?
> wake_affine_llc as an example

I guess you meant s/prefer_idler_llc/wake_affine_llc/
Sure. I can modify.

> 
> > +
> >         if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
> >                 target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
> >
> > @@ -5884,8 +5918,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> >         if (target == nr_cpumask_bits)
> >                 return prev_cpu;
> >
> > -       schedstat_inc(sd->ttwu_move_affine);
> > -       schedstat_inc(p->se.statistics.nr_wakeups_affine);
> > +       if (target == this_cpu) {
> 
> How is this condition related to $subject ?

Before this change, wake_affine_weight and wake_affine_idle would either
return this_cpu or nr_cpumask_bits. Just before this check, we check if
target is nr_cpumask_bits and return prev_cpu. So the stats were only
incremented when target was this_cpu.

However, with prefer_idler_llc, we may return this_cpu, prev_cpu or
nr_cpumask_bits. Now we only update the stats when we have chosen to migrate
the task to this_cpu. Hence I had this check.

If we use the slightly lazier approach, which is to check for wa_weight first
before wa_idler_llc, then we may not need this change at all.

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
  2021-03-10  5:52   ` Srikar Dronamraju
@ 2021-03-10 15:37     ` Vincent Guittot
  0 siblings, 0 replies; 16+ messages in thread
From: Vincent Guittot @ 2021-03-10 15:37 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel,
	Thomas Gleixner, Valentin Schneider, Dietmar Eggemann,
	Michael Ellerman, Michael Neuling, Gautham R Shenoy, Parth Shah

On Wed, 10 Mar 2021 at 06:53, Srikar Dronamraju
<srikar@linux.vnet.ibm.com> wrote:
>
> * Vincent Guittot <vincent.guittot@linaro.org> [2021-03-08 14:52:39]:
>
> > On Fri, 26 Feb 2021 at 17:41, Srikar Dronamraju
> > <srikar@linux.vnet.ibm.com> wrote:
> > >
>
> Thanks Vincent for your review comments.
>
> > > +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> > > +{
> > > +       struct sched_domain_shared *tsds, *psds;
> > > +       int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> > > +
> > > +       tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> > > +       tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> > > +       tllc_size = per_cpu(sd_llc_size, this_cpu);
> > > +
> > > +       psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> > > +       pnr_busy = atomic_read(&psds->nr_busy_cpus);
> > > +       pllc_size = per_cpu(sd_llc_size, prev_cpu);
> > > +
> > > +       /* No need to compare, if both LLCs are fully loaded */
> > > +       if (pnr_busy == pllc_size && tnr_busy == pllc_size)
> > > +               return nr_cpumask_bits;
> > > +
> > > +       if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > > +               return this_cpu;
> >
> > Why have you chosen to favor this_cpu instead of prev_cpu unlike for wake_idle ?
>
> At this point, we know the waker running on this_cpu and wakee which was
> running on prev_cpu are affine to each other and this_cpu and prev_cpu dont
> share cache. I chose to move them close to each other to benefit from the
> cache sharing. Based on feedback from Peter and Rik, I made the check more
> conservative i.e tnr_busy <= tllc_size/smt_weight (where smt_weight is the
> cpumask weight of smt domain for this_cpu) i.e if we have a free core in

yeah make sense

> this llc domain, chose this_cpu.  select_idle_sibling() should pick an idle
> cpu/core/smt within the llc domain for this_cpu.
>
> Do you feel, this may not be the correct option?

I was worried that we might end up pulling tasks into the same LLC, but the
condition above and wake_wide() should prevent such behavior.

>
> We are also experimenting with another option, were we call prefer_idler_cpu
> after wa_weight. I.e
> 1. if wake_affine_weight choses this_cpu but llc in prev_cpu has an idle
> smt/CPU but there are no idle smt/CPU in this_cpu, then chose idle smt/CPU
> in prev_cpu
> 2. if wake_affine_weight choses nr_cpumask(aka prev_cpu) but llc in this_cpu
> has an idle smt/CPU but there are no idle smt/CPU in prev_cpu, then chose
> idle smt/CPU in this_cpu
>
>
> > > +
> > > +       /* For better wakeup latency, prefer idler LLC to cache affinity */
> > > +       diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
> > > +       if (!diff)
> > > +               return nr_cpumask_bits;
> > > +       if (diff < 0)
> > > +               return this_cpu;
> > > +
> > > +       return prev_cpu;
> > > +}
> > > +
> > >  static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> > >                        int this_cpu, int prev_cpu, int sync)
> > >  {
> > > @@ -5877,6 +5907,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> > >         if (sched_feat(WA_IDLE))
> > >                 target = wake_affine_idle(this_cpu, prev_cpu, sync);
> > >
> > > +       if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits &&
> > > +                               !cpus_share_cache(this_cpu, prev_cpu))
> > > +               target = prefer_idler_llc(this_cpu, prev_cpu, sync);
> >
> > could you use the same naming convention as others function ?
> > wake_affine_llc as an example
>
> I guess you meant s/prefer_idler_llc/wake_affine_llc/

yes

> Sure. I can modify.
>
> >
> > > +
> > >         if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
> > >                 target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
> > >
> > > @@ -5884,8 +5918,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> > >         if (target == nr_cpumask_bits)
> > >                 return prev_cpu;
> > >
> > > -       schedstat_inc(sd->ttwu_move_affine);
> > > -       schedstat_inc(p->se.statistics.nr_wakeups_affine);
> > > +       if (target == this_cpu) {
> >
> > How is this condition related to $subject ?
>
> Before this change, wake_affine_weight and wake_affine_idle would either
> return this_cpu or nr_cpumask_bits. Just before this check, we check if
> target is nr_cpumask_bits and return prev_cpu. So the stats were only
> incremented when target was this_cpu.
>
> However with prefer_idler_llc, we may return this_cpu, prev_cpu or
> nr_cpumask_bits. Now we only to update stats when we have chosen to migrate
> the task to this_cpu. Hence I had this check.

ok got it.

Maybe return earlier in this case, like for the if (target ==
nr_cpumask_bits) check above.
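
Something like this, as a sketch of that early-return variant (untested):

	if (target == nr_cpumask_bits)
		return prev_cpu;

	if (target == prev_cpu)
		return target;

	schedstat_inc(sd->ttwu_move_affine);
	schedstat_inc(p->se.statistics.nr_wakeups_affine);
	return target;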

>
> If we use the slightly lazier approach which is check for wa_weight first
> before wa_idler_llc, then we may not need this change at all.
>
> --
> Thanks and Regards
> Srikar Dronamraju

