linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH] sched/fair: Introduce WAKEUP_BIAS_PREV_IDLE to reduce migrations
@ 2023-10-12 20:36 Mathieu Desnoyers
  2023-10-15 15:44 ` Chen Yu
  0 siblings, 1 reply; 3+ messages in thread
From: Mathieu Desnoyers @ 2023-10-12 20:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Mathieu Desnoyers, Ingo Molnar, Valentin Schneider,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Vincent Guittot, Juri Lelli,
	Swapnil Sapkal, Aaron Lu, Chen Yu, Tim Chen, K Prateek Nayak,
	Gautham R . Shenoy, x86

Introduce the WAKEUP_BIAS_PREV_IDLE scheduler feature to reduce the
task migration rate.

For scenarios where the system is under-utilized (CPUs are partly idle),
eliminate frequent task migrations from almost-idle CPUs to completely
idle CPUs by introducing a bias towards the previous CPU if it is idle
or almost idle in select_idle_sibling(). Use 1% of the previously used
CPU's capacity as the "almost idle" CPU utilization cutoff.

For scenarios where the system is fully or over-utilized (CPUs are
almost never idle), favor the previous CPU (rather than the target CPU)
when all CPUs are busy, to minimize migrations (suggested by Chen Yu).

The following benchmarks were performed on a v6.5.5 kernel with
mitigations=off.

This speeds up the following hackbench workload on a 2-socket AMD EPYC
9654 96-Core Processor system (192 cores total):

hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100

from 49s to 31s, a 37% reduction in runtime.

We can observe that the number of migrations is reduced significantly
(by more than 90%) with this patch, which may explain the speedup:

Baseline:      118M cpu-migrations  (9.286 K/sec)
With patch:      5M cpu-migrations  (0.580 K/sec)

As a consequence, backend stall cycles (stalled-cycles-backend) are reduced:

Baseline:     8.16% backend cycles idle
With patch:   6.85% backend cycles idle

Interestingly, the context-switch rate increases with the patch, but
this does not appear to hurt performance:

Baseline:     454M context-switches (35.677 K/sec)
With patch:   670M context-switches (70.805 K/sec)
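
These counters are standard perf events; a command along these lines
(illustrative only, not necessarily the exact invocation used for the
numbers above) collects them for the hackbench run:

perf stat -e cpu-migrations,context-switches,stalled-cycles-backend \
	hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100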

This was developed as part of the investigation into a weird regression
reported by AMD, where adding a raw spinlock to the scheduler context
switch accelerated hackbench. It turned out that replacing this raw
spinlock with a loop of 10000 cpu_relax() calls within do_idle() had
similar benefits.

This patch achieves a similar effect without the busy-waiting by
allowing select_task_rq() to favor an almost-idle previously used CPU
based on that CPU's utilization. The 1% cpu_util threshold for an
almost-idle CPU (about a util of 10 on the usual 1024 capacity scale)
has been identified empirically using the hackbench workload.

Feedback is welcome. I am especially interested to learn whether this
patch has positive or detrimental effects on the performance of other
workloads.

Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com
Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/f6dc1652-bc39-0b12-4b6b-29a2f9cd8484@amd.com/
Link: https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230823060832.454842-1-aaron.lu@intel.com/
Link: https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/cover.1695704179.git.yu.c.chen@intel.com/
Link: https://lore.kernel.org/lkml/20230929183350.239721-1-mathieu.desnoyers@efficios.com/
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Chen Yu <yu.c.chen@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Gautham R . Shenoy <gautham.shenoy@amd.com>
Cc: x86@kernel.org
---
 kernel/sched/fair.c     | 45 +++++++++++++++++++++++++++++++++++++++--
 kernel/sched/features.h |  6 ++++++
 2 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d9c2482c5a3..70bffe3b6bd7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7113,6 +7113,23 @@ static inline bool asym_fits_cpu(unsigned long util,
 	return true;
 }
 
+static unsigned long cpu_util_without(int cpu, struct task_struct *p);
+
+/*
+ * A runqueue is considered almost idle if:
+ *
+ *   cpu_util_without(cpu, p) / 1024 <= 1% * capacity_of(cpu)
+ *
+ * This inequality is transformed as follows to minimize arithmetic:
+ *
+ *   cpu_util_without(cpu, p) <= 10 * capacity_of(cpu)
+ */
+static bool
+almost_idle_cpu(int cpu, struct task_struct *p)
+{
+	return cpu_util_without(cpu, p) <= 10 * capacity_of(cpu);
+}
+
 /*
  * Try and locate an idle core/thread in the LLC cache domain.
  */
@@ -7139,18 +7156,33 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 */
 	lockdep_assert_irqs_disabled();
 
+	/*
+	 * With the WAKEUP_BIAS_PREV_IDLE feature, if the previous CPU
+	 * is cache affine and idle or almost idle, prefer the previous
+	 * CPU to the target CPU to inhibit costly task migration.
+	 */
+	if (sched_feat(WAKEUP_BIAS_PREV_IDLE) &&
+	    (prev == target || cpus_share_cache(prev, target)) &&
+	    (available_idle_cpu(prev) || sched_idle_cpu(prev) || almost_idle_cpu(prev, p)) &&
+	    asym_fits_cpu(task_util, util_min, util_max, prev))
+		return prev;
+
 	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
 	    asym_fits_cpu(task_util, util_min, util_max, target))
 		return target;
 
 	/*
-	 * If the previous CPU is cache affine and idle, don't be stupid:
+	 * Without the WAKEUP_BIAS_PREV_IDLE feature, fall back to the
+	 * previous CPU when it is cache affine and idle and the target
+	 * CPU is not idle.
 	 */
-	if (prev != target && cpus_share_cache(prev, target) &&
+	if (!sched_feat(WAKEUP_BIAS_PREV_IDLE) &&
+	    prev != target && cpus_share_cache(prev, target) &&
 	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
 	    asym_fits_cpu(task_util, util_min, util_max, prev))
 		return prev;
 
+
 	/*
 	 * Allow a per-cpu kthread to stack with the wakee if the
 	 * kworker thread and the tasks previous CPUs are the same.
@@ -7217,6 +7249,15 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
+	/*
+	 * With the WAKEUP_BIAS_PREV_IDLE feature, if the previous CPU
+	 * is cache affine, prefer the previous CPU when all CPUs are
+	 * busy to inhibit migration.
+	 */
+	if (sched_feat(WAKEUP_BIAS_PREV_IDLE) &&
+	    prev != target && cpus_share_cache(prev, target))
+		return prev;
+
 	return target;
 }
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..1ba67d177fe0 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -37,6 +37,12 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
  */
 SCHED_FEAT(WAKEUP_PREEMPTION, true)
 
+/*
+ * Bias runqueue selection towards the previous runqueue if it is almost
+ * idle or if all CPUs are busy.
+ */
+SCHED_FEAT(WAKEUP_BIAS_PREV_IDLE, true)
+
 SCHED_FEAT(HRTICK, false)
 SCHED_FEAT(HRTICK_DL, false)
 SCHED_FEAT(DOUBLE_TICK, false)
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [RFC PATCH] sched/fair: Introduce WAKEUP_BIAS_PREV_IDLE to reduce migrations
  2023-10-12 20:36 [RFC PATCH] sched/fair: Introduce WAKEUP_BIAS_PREV_IDLE to reduce migrations Mathieu Desnoyers
@ 2023-10-15 15:44 ` Chen Yu
  2023-10-16 19:24   ` Mathieu Desnoyers
  0 siblings, 1 reply; 3+ messages in thread
From: Chen Yu @ 2023-10-15 15:44 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Valentin Schneider,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Vincent Guittot, Juri Lelli,
	Swapnil Sapkal, Aaron Lu, Tim Chen, K Prateek Nayak,
	Gautham R . Shenoy, x86

On 2023-10-12 at 16:36:26 -0400, Mathieu Desnoyers wrote:
> Introduce the WAKEUP_BIAS_PREV_IDLE scheduler feature to reduce the
> task migration rate.
> 
> For scenarios where the system is under-utilized (CPUs are partly idle),
> eliminate frequent task migrations from almost-idle CPUs to completely
> idle CPUs by introducing a bias towards the previous CPU if it is idle
> or almost idle in select_idle_sibling(). Use 1% of the previously used
> CPU's capacity as the "almost idle" CPU utilization cutoff.
>
> +
> +/*
> + * A runqueue is considered almost idle if:
> + *
> + *   cpu_util_without(cpu, p) / 1024 <= 1% * capacity_of(cpu)

util_avg is in the range [0:1024], thus cpu_util_without(cpu, p) / 1024
is <= 1, and 1% * cap is 10, so 1 <= 10 is always true.
I suppose you want to compare:
 (cpu_util_without(cpu, p) / capacity_orig_of(cpu)) <= 1% ->
    cpu_util_without(cpu, p) * 100 <= capacity_orig_of(cpu) ?
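
In other words, something along these lines (just a sketch of the
corrected check, reusing the existing cpu_util_without() and
capacity_orig_of() helpers):

static bool
almost_idle_cpu(int cpu, struct task_struct *p)
{
	/* Almost idle: util without @p is at most 1% of original capacity. */
	return cpu_util_without(cpu, p) * 100 <= capacity_orig_of(cpu);
}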

thanks,
Chenyu
 


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [RFC PATCH] sched/fair: Introduce WAKEUP_BIAS_PREV_IDLE to reduce migrations
  2023-10-15 15:44 ` Chen Yu
@ 2023-10-16 19:24   ` Mathieu Desnoyers
  0 siblings, 0 replies; 3+ messages in thread
From: Mathieu Desnoyers @ 2023-10-16 19:24 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Valentin Schneider,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Vincent Guittot, Juri Lelli,
	Swapnil Sapkal, Aaron Lu, Tim Chen, K Prateek Nayak,
	Gautham R . Shenoy, x86

On 2023-10-15 11:44, Chen Yu wrote:
> On 2023-10-12 at 16:36:26 -0400, Mathieu Desnoyers wrote:
>> Introduce the WAKEUP_BIAS_PREV_IDLE scheduler feature to reduce the
>> task migration rate.
>>
>> For scenarios where the system is under-utilized (CPUs are partly idle),
>> eliminate frequent task migrations from almost-idle CPUs to completely
>> idle CPUs by introducing a bias towards the previous CPU if it is idle
>> or almost idle in select_idle_sibling(). Use 1% of the previously used
>> CPU's capacity as the "almost idle" CPU utilization cutoff.
>>
>> +
>> +/*
>> + * A runqueue is considered almost idle if:
>> + *
>> + *   cpu_util_without(cpu, p) / 1024 <= 1% * capacity_of(cpu)
> 
> util_avg is in the range [0:1024], thus cpu_util_without(cpu, p) / 1024
> is <= 1, and 1% * cap is 10, so 1 <= 10 is always true.
> I suppose you want to compare:
>   (cpu_util_without(cpu, p) / capacity_orig_of(cpu)) <= 1% ->
>      cpu_util_without(cpu, p) * 100 <= capacity_orig_of(cpu) ?

Good point!

Now that I have fixed this, I come back to a situation where:

- load_avg works, probably because it multiplies by the weight, and 
therefore when there are few tasks on the runqueue it reflects the fact 
that the runqueue is almost idle. Even though it happens to work, it 
does not appear to be an elegant solution.

- util_avg and runnable_avg do not work. Probably because they take into 
account both running/runnable and recently blocked tasks, so they cannot 
be used to provide a clear picture of the very-short-term idleness 
status for the purpose of selecting a prev rq.

I wonder if there are any rq stats I can use which do not include
recently blocked tasks?

Thanks,

Mathieu


> 
> thanks,
> Chenyu
>   
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 3+ messages in thread
